# ------------------------ **pytorch** ------------------------

# Assignment 2:  Deep N-grams

Welcome to the second assignment of course 3. In this assignment you will explore Recurrent Neural Networks `RNN`.
- You will be using the fundamentals of pytorch to implement any kind of deeplearning model.

By completing this assignment, you will learn how to implement models from scratch:
- How to convert a line of text into a tensor
- Create an iterator to feed data to the model
- Define a GRU model using `pytorch`
- Train the model using `pytorch`
- Compute the accuracy of your model using the perplexity
- Predict using your own model

## Table of Contents

- [Overview](#0)
- [1 - Importing the Data](#1)
    - [1.1 - Loading in the Data](#1-1)
    - [1.2 - Convert a Line to Tensor](#1-2)
        - [Exercise 1 - line_to_tensor (UNQ_C1)](#ex-1)
    - [1.3 - Batch Generator](#1-3)
        - [Exercise 2 - data_generator (UNQ_C2)](#ex-2)
    - [1.4 - Repeating Batch Generator](#1-4)        
- [2 - Defining the GRU Model](#2)
    - [Exercise 3 - GRULM (UNQ_C3)](#ex-3)
- [3 - Training](#3)
    - [3.1 - Training the Model](#3-1)
        - [Exercise 4 - train_model (UNQ_C4)](#ex-4)
- [4 - Evaluation](#4)
    - [4.1 - Evaluating using the Deep Nets](#4-1)
        - [Exercise 5 - test_model (UNQ_C5)](#ex-5)
- [5 - Generating the Language with your Own Model](#5)    
- [Summary](#6)


<a name='0'></a>
## Overview

Your task will be to predict the next set of characters using the previous characters.
- Although this task sounds simple, it is pretty useful.
- You will start by converting a line of text into a tensor
- Then you will create a generator to feed data into the model
- You will train a neural network in order to predict the new set of characters of defined length.
- You will use embeddings for each character and feed them as inputs to your model.
    - Many natural language tasks rely on using embeddings for predictions.
- Your model will convert each character to its embedding, run the embeddings through a Gated Recurrent Unit `GRU`, and run it through a linear layer to predict the next set of characters.

<img src = "images/model.png" style="width:600px;height:150px;"/>

The figure above gives you a summary of what you are about to implement.
- You will get the embeddings;
- Stack the embeddings on top of each other;
- Run them through two layers with a relu activation in the middle;
- Finally, you will compute the softmax.

To predict the next character:
- Use the softmax output and identify the word with the highest probability.
- The word with the highest probability is the prediction for the next word.

In [1]:
import os
import shutil



import pickle
import numpy as np
import random as rnd


# set random seed
rnd.seed(32)

import torch
from torch.nn import functional as F



<a name='1'></a>
## 1 - Importing the Data

<a name='1-1'></a>
### 1.1 - Loading in the Data

<img src = "images/shakespeare.png" style="width:250px;height:250px;"/>

Now import the dataset and do some processing.
- The dataset has one sentence per line.
- You will be doing character generation, so you have to process each sentence by converting each **character** (and not word) to a number.
- You will use the `ord` function to convert a unique character to a unique integer ID.
- Store each line in a list.
- Create a data generator that takes in the `batch_size` and the `max_length`.
    - The `max_length` corresponds to the maximum length of the sentence.

In [2]:
dirname = 'data/'
filename = 'shakespeare_data.txt'
lines = [] # storing all the lines in a variable.

counter = 0

with open(filename) as files:
    for line in files:
        # remove leading and trailing whitespace    
        pure_line = line.strip()

        # if pure_line is not the empty string,
        if pure_line:
            # append it to the list
            lines.append(pure_line)


In [3]:
n_lines = len(lines)
print(f"Number of lines: {n_lines}")
print(f"Sample line at position 0 {lines[0]}")
print(f"Sample line at position 999 {lines[999]}")

Number of lines: 125097
Sample line at position 0 A LOVER'S COMPLAINT
Sample line at position 999 With this night's revels and expire the term


Notice that the letters are both uppercase and lowercase.  In order to reduce the complexity of the task, we will convert all characters to lowercase.  This way, the model only needs to predict the likelihood that a letter is 'a' and not decide between uppercase 'A' and lowercase 'a'.

In [4]:
# go through each line
for i, line in enumerate(lines):
    # convert to all lowercase
    lines[i] = line.lower()        # new list nahi bnanai pri, vohi puranin ist hi uodate ho gai

print(f"Number of lines: {n_lines}")
print(f"Sample line at position 0 {lines[0]}")
print(f"Sample line at position 999 {lines[999]}")


Number of lines: 125097
Sample line at position 0 a lover's complaint
Sample line at position 999 with this night's revels and expire the term


In [5]:
eval_lines = lines[-1000:] # Create a holdout validation set
lines      = lines[:-1000] # Leave the rest for training

print(f"Number of lines for training: {len(lines)}")
print(f"Number of lines for validation: {len(eval_lines)}")


Number of lines for training: 124097
Number of lines for validation: 1000


<a name='1-2'></a>
### 1.2 - Convert a Line to Tensor

Now that you have your list of lines, you will convert each character in that list to a number. You can use Python's `ord` function to do it.

Given a string representation of one Unicode character, the `ord` function return an integer representing the Unicode code point of that character.



In [8]:
# View the unique unicode integer associated with each character
print(f"ord('a'): {ord('a')}")
print(f"ord('b'): {ord('b')}")
print(f"ord('c'): {ord('c')}")
print(f"ord(' '): {ord(' ')}")    # blank space ko bhi aik number associate kia hua hay
print(f"ord('x'): {ord('x')}")
print(f"ord('y'): {ord('y')}")
print(f"ord('z'): {ord('z')}")
print(f"ord('1'): {ord('1')}")
print(f"ord('2'): {ord('2')}")
print(f"ord('3'): {ord('3')}")


ord('a'): 97
ord('b'): 98
ord('c'): 99
ord(' '): 32
ord('x'): 120
ord('y'): 121
ord('z'): 122
ord('1'): 49
ord('2'): 50
ord('3'): 51


### Line_to_tensor

**Instructions:** Write a function that takes in a single line and transforms each character into its unicode integer.  **This returns a list of integers, which we'll refer to as a tensor**.
- Use a special integer to represent the end of the sentence (the end of the line).
- This will be the EOS_int (end of sentence integer) parameter of the function.
- Include the EOS_int as the last integer of the
- For this exercise, you will use the number `1` to represent the end of a sentence.

In [6]:
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: line_to_tensor
def line_to_tensor(line, EOS_int=1):
    """Turns a line of text into a tensor

    Args:
        line (str): A single line of text.
        EOS_int (int, optional): End-of-sentence integer. Defaults to 1.

    Returns:
        list: a list of integers (unicode values) for the characters in the `line`.
    """

    # Initialize the tensor as an empty list
    tensor = []

    ### START CODE HERE (Replace instances of 'None' with your code) ###
    # for each character:
    for c in line:

        # convert to unicode int
        c_int = ord(c)              # current integer

        # append the unicode integer to the tensor list
        tensor.append(c_int)        # abhi ye aik list hi hay

    # include the end-of-sentence integer
    tensor.append(EOS_int)          # sentence poray par loop chal gai, ab end main hum nay aik End Of Sequence integer append kar dia

    ### END CODE HERE ###

    return tensor

In [10]:
# Testing your output
line_to_tensor('abc xyz')  # white space k corresponding bhi aik integer hay


[97, 98, 99, 32, 120, 121, 122, 1]

##### Expected Output
```CPP
[97, 98, 99, 32, 120, 121, 122, 1]
```

# Converting Data to Tensors

In [12]:
lines[:5]

["a lover's complaint",
 'from off a hill whose concave womb reworded',
 'a plaintful story from a sistering vale,',
 'my spirits to attend this double voice accorded,',
 'and down i laid to list the sad-tuned tale;']

In [11]:
i=0
for line in lines:
  i+=1
  print(line)
  if i>=5:
    break

a lover's complaint
from off a hill whose concave womb reworded
a plaintful story from a sistering vale,
my spirits to attend this double voice accorded,
and down i laid to list the sad-tuned tale;


In [12]:
tensor_list = []
i=0
for line in lines:
    i+=1
    current_tensor = line_to_tensor(line)
    tensor_list.append(current_tensor)
    if i>=3:
        break
print(tensor_list)

[[97, 32, 108, 111, 118, 101, 114, 39, 115, 32, 99, 111, 109, 112, 108, 97, 105, 110, 116, 1], [102, 114, 111, 109, 32, 111, 102, 102, 32, 97, 32, 104, 105, 108, 108, 32, 119, 104, 111, 115, 101, 32, 99, 111, 110, 99, 97, 118, 101, 32, 119, 111, 109, 98, 32, 114, 101, 119, 111, 114, 100, 101, 100, 1], [97, 32, 112, 108, 97, 105, 110, 116, 102, 117, 108, 32, 115, 116, 111, 114, 121, 32, 102, 114, 111, 109, 32, 97, 32, 115, 105, 115, 116, 101, 114, 105, 110, 103, 32, 118, 97, 108, 101, 44, 1]]


In [15]:
print(tensor_list[0])

[97, 32, 108, 111, 118, 101, 114, 39, 115, 32, 99, 111, 109, 112, 108, 97, 105, 110, 116, 1]


In [16]:
print(line_to_tensor(lines[0]))

[97, 32, 108, 111, 118, 101, 114, 39, 115, 32, 99, 111, 109, 112, 108, 97, 105, 110, 116, 1]


## Tensor List
List of all tensorized string lines

In [13]:
max_len=64
tensor_list = []
# i=0
for line in lines:
    # i+=1
    current_tensor = line_to_tensor(line)
    if len(current_tensor) <= max_len:
        tensor_list.append(current_tensor)
    # if i>=3:
    #     break
print(len(tensor_list))

122689


In [14]:
len(lines) - len(tensor_list)


1408

## Tensor Array
tensor list converted into a numpy array

In [15]:
tensor_matrix = np.zeros((len(tensor_list), 64))
tensor_matrix


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [16]:
tensor_matrix.shape


(122689, 64)

In [17]:
i=0
for tensor in tensor_list:                   # for i, tensor in enumerate(tensor_list):
    tensor_matrix[i,:len(tensor)] = tensor
    i+=1
    if i>=3:
        break
tensor_matrix[:3,:]

array([[ 97.,  32., 108., 111., 118., 101., 114.,  39., 115.,  32.,  99.,
        111., 109., 112., 108.,  97., 105., 110., 116.,   1.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [102., 114., 111., 109.,  32., 111., 102., 102.,  32.,  97.,  32.,
        104., 105., 108., 108.,  32., 119., 104., 111., 115., 101.,  32.,
         99., 111., 110.,  99.,  97., 118., 101.,  32., 119., 111., 109.,
         98.,  32., 114., 101., 119., 111., 114., 100., 101., 100.,   1.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [ 97.,  32., 112., 108.,  97., 105., 110., 116., 102., 117., 108.,
         32., 115., 116., 111., 114., 121.,  32., 102., 114.

In [22]:
tensor_matrix[0,:]

array([ 97.,  32., 108., 111., 118., 101., 114.,  39., 115.,  32.,  99.,
       111., 109., 112., 108.,  97., 105., 110., 116.,   1.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.])

In [23]:
print(line_to_tensor(lines[0]))


[97, 32, 108, 111, 118, 101, 114, 39, 115, 32, 99, 111, 109, 112, 108, 97, 105, 110, 116, 1]


In [18]:
i=0
for tensor in tensor_list:
    tensor_matrix[i,:len(tensor)] = tensor
    i+=1
    # if i>=3:
    #     break
tensor_matrix[0,:]


array([ 97.,  32., 108., 111., 118., 101., 114.,  39., 115.,  32.,  99.,
       111., 109., 112., 108.,  97., 105., 110., 116.,   1.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.])

## Training and Evaluation Data Generator

In [21]:
inputs         = tensor_matrix.copy()
shifted_inputs = tensor_matrix.copy()
targets        = tensor_matrix.copy()


### shifting the inputs to the right by 1 index
necessary for current assignment

In [22]:

arr = np.array([1, 2, 3, 4, 5])

# Roll the array by 1 position to the right
rolled_arr = np.roll(arr, 1)
print("Rolled array:", rolled_arr)


Rolled array: [5 1 2 3 4]


In [23]:
shifted_inputs = np.roll(shifted_inputs, 1)
shifted_inputs[0,:]

array([  0.,  97.,  32., 108., 111., 118., 101., 114.,  39., 115.,  32.,
        99., 111., 109., 112., 108.,  97., 105., 110., 116.,   1.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.])

In [24]:
inputs[0,:]


array([ 97.,  32., 108., 111., 118., 101., 114.,  39., 115.,  32.,  99.,
       111., 109., 112., 108.,  97., 105., 110., 116.,   1.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.])

### Converting the numpy arrays into torch tensors

In [25]:
inputs         = torch.from_numpy(inputs).long()
shifted_inputs = torch.from_numpy(shifted_inputs).long()
targets        = torch.from_numpy(targets).long()


In [26]:
inputs[0,:]

tensor([ 97,  32, 108, 111, 118, 101, 114,  39, 115,  32,  99, 111, 109, 112,
        108,  97, 105, 110, 116,   1,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0])

## Training and Evaluation Data generator 

In [27]:
def shifted_inputs_and_NOT_shifted_targets_generator(shifted_inputs_, targets_, batch_size_):

  counter = 0
  second_counter = batch_size_
  while True:
    batch_of_shifted_inputs_ = shifted_inputs_[counter:second_counter]
    batch_of_targets_        = targets_[counter:second_counter]

    yield batch_of_shifted_inputs_, batch_of_targets_

    counter        = counter + batch_size_
    second_counter = second_counter + batch_size_
    if second_counter >= shifted_inputs_.shape[0]:
      counter=0
      second_counter = batch_size_


In [28]:
gen = shifted_inputs_and_NOT_shifted_targets_generator(shifted_inputs_=shifted_inputs, targets_=targets, batch_size_=2)

In [29]:
tmp_inputs, tmp_targets = next(gen)


In [30]:
tmp_inputs, tmp_targets


(tensor([[  0,  97,  32, 108, 111, 118, 101, 114,  39, 115,  32,  99, 111, 109,
          112, 108,  97, 105, 110, 116,   1,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0],
         [  0, 102, 114, 111, 109,  32, 111, 102, 102,  32,  97,  32, 104, 105,
          108, 108,  32, 119, 104, 111, 115, 101,  32,  99, 111, 110,  99,  97,
          118, 101,  32, 119, 111, 109,  98,  32, 114, 101, 119, 111, 114, 100,
          101, 100,   1,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0]]),
 tensor([[ 97,  32, 108, 111, 118, 101, 114,  39, 115,  32,  99, 111, 109, 112,
          108,  97, 105, 110, 116,   1,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,  

In [31]:
tmp_inputs.shape


torch.Size([2, 64])

In [32]:
tmp_inputs, tmp_targets = next(gen)
tmp_inputs, tmp_targets


(tensor([[  0,  97,  32, 112, 108,  97, 105, 110, 116, 102, 117, 108,  32, 115,
          116, 111, 114, 121,  32, 102, 114, 111, 109,  32,  97,  32, 115, 105,
          115, 116, 101, 114, 105, 110, 103,  32, 118,  97, 108, 101,  44,   1,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0],
         [  0, 109, 121,  32, 115, 112, 105, 114, 105, 116, 115,  32, 116, 111,
           32,  97, 116, 116, 101, 110, 100,  32, 116, 104, 105, 115,  32, 100,
          111, 117,  98, 108, 101,  32, 118, 111, 105,  99, 101,  32,  97,  99,
           99, 111, 114, 100, 101, 100,  44,   1,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0]]),
 tensor([[ 97,  32, 112, 108,  97, 105, 110, 116, 102, 117, 108,  32, 115, 116,
          111, 114, 121,  32, 102, 114, 111, 109,  32,  97,  32, 115, 105, 115,
          116, 101, 114, 105, 110, 103,  32, 118,  97, 108, 101,  44,   1,   0,
            0,  

In [33]:
tmp_inputs, tmp_targets = next(gen)
tmp_inputs, tmp_targets


(tensor([[  0,  97, 110, 100,  32, 100, 111, 119, 110,  32, 105,  32, 108,  97,
          105, 100,  32, 116, 111,  32, 108, 105, 115, 116,  32, 116, 104, 101,
           32, 115,  97, 100,  45, 116, 117, 110, 101, 100,  32, 116,  97, 108,
          101,  59,   1,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0],
         [  0, 101, 114, 101,  32, 108, 111, 110, 103,  32, 101, 115, 112, 105,
          101, 100,  32,  97,  32, 102, 105,  99, 107, 108, 101,  32, 109,  97,
          105, 100,  32, 102, 117, 108, 108,  32, 112,  97, 108, 101,  44,   1,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0]]),
 tensor([[ 97, 110, 100,  32, 100, 111, 119, 110,  32, 105,  32, 108,  97, 105,
          100,  32, 116, 111,  32, 108, 105, 115, 116,  32, 116, 104, 101,  32,
          115,  97, 100,  45, 116, 117, 110, 101, 100,  32, 116,  97, 108, 101,
           59,  

In [34]:
shifted_inputs[3,:]


tensor([  0, 109, 121,  32, 115, 112, 105, 114, 105, 116, 115,  32, 116, 111,
         32,  97, 116, 116, 101, 110, 100,  32, 116, 104, 105, 115,  32, 100,
        111, 117,  98, 108, 101,  32, 118, 111, 105,  99, 101,  32,  97,  99,
         99, 111, 114, 100, 101, 100,  44,   1,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0])

## Data Generator for calculating perplexity
It generates one hot encoded targets, which are necessary for the computing perplexity score

In [35]:
# import torch

# Example usage
categories = torch.tensor([[0,1,2],[1,2,3]])
num_classes = 3
one_hot_encoded = torch.nn.functional.one_hot(categories, num_classes=4)

print("One-hot encoded tensor:\n", one_hot_encoded)


One-hot encoded tensor:
 tensor([[[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0]],

        [[0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]])


In [36]:
def shifted_inputs_and_one_hot_encoded_targets_generator(shifted_inputs_, targets_, batch_size_):

  # num_of_epochs = shifted_inputs_.shape[0]/batch_size_
  counter = 0
  second_counter = batch_size_
  while True:
    batch_of_shifted_inputs_            = shifted_inputs_[counter:second_counter]
    batch_of_one_hot_encoded_targets_   = torch.nn.functional.one_hot(targets_[counter:second_counter], num_classes=256)

    yield batch_of_shifted_inputs_, batch_of_one_hot_encoded_targets_

    counter        = counter + batch_size_
    second_counter = second_counter + batch_size_
    if second_counter >= shifted_inputs_.shape[0]:
      counter=0
      second_counter = batch_size_



In [37]:
gen = shifted_inputs_and_one_hot_encoded_targets_generator(shifted_inputs_=shifted_inputs, targets_=targets, batch_size_=2)

In [38]:
tmp_inputs, tmp_targets = next(gen)


In [39]:
tmp_inputs, tmp_targets


(tensor([[  0,  97,  32, 108, 111, 118, 101, 114,  39, 115,  32,  99, 111, 109,
          112, 108,  97, 105, 110, 116,   1,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0],
         [  0, 102, 114, 111, 109,  32, 111, 102, 102,  32,  97,  32, 104, 105,
          108, 108,  32, 119, 104, 111, 115, 101,  32,  99, 111, 110,  99,  97,
          118, 101,  32, 119, 111, 109,  98,  32, 114, 101, 119, 111, 114, 100,
          101, 100,   1,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0]]),
 tensor([[[0, 0, 0,  ..., 0, 0, 0],
          [0, 0, 0,  ..., 0, 0, 0],
          [0, 0, 0,  ..., 0, 0, 0],
          ...,
          [1, 0, 0,  ..., 0, 0, 0],
          [1, 0, 0,  ..., 0, 0, 0],
          [1, 0, 0,  ..., 0, 0, 0]],
 
         [[0, 0, 0,  .

In [40]:
tmp_targets.shape


torch.Size([2, 64, 256])

In [43]:
print(tmp_targets[0,0,:])


tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


In [44]:
torch.nonzero(tmp_targets[0,0,:])
# perfect

tensor([[97]])

# I did'nt use the following data generator provided by course instructors

<a name='1-3'></a>
### 1.3 - Batch Generator

Most of the time in Natural Language Processing, and AI in general we use batches when training our data sets. Here, you will build a data generator that takes in a text and returns a batch of text lines (lines are sentences).
- The generator converts text lines (sentences) into numpy arrays of integers padded by zeros so that all arrays have the same length, which is **the length of the longest sentence in the ENTIRE data set** ???

Once you create the generator, you can iterate on it like this:

```
next(data_generator)
```

This generator returns the data in a format that you could directly use in your model when computing the feed-forward of your algorithm. This iterator returns a batch of lines and per token mask. The batch is a tuple of three parts: inputs, targets, mask. The inputs and targets are identical. The second column will be used to evaluate your predictions. Mask is 1 for non-padding tokens.

<a name='ex-2'></a>
### Exercise 2 - data_generator
**Instructions:** Implement the data generator below. Here are some things you will need.

- While True loop: this will yield one batch at a time.
- if index >= num_lines, set index to 0.
- The generator should return shuffled batches of data. To achieve this without modifying the actual lines a list containing the indexes of `data_lines` is created. This list can be shuffled and used to get random batches everytime the index is reset.
- if len(line) < max_length append line to cur_batch.
    - Note that a line that has length equal to max_length should not be appended to the batch.
    - This is because when converting the characters into a tensor of integers, an additional end of sentence token id will be added.  
    - So if max_length is 5, and a line has 4 characters, the tensor representing those 4 characters plus the end of sentence character will be of length 5, which is the max length.
- if len(cur_batch) == batch_size, go over every line, convert it to an int and store it.


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Use the line_to_tensor function above inside a list comprehension in order to pad lines with zeros.</li>
    <li>Keep in mind that the length of the tensor is always 1 + the length of the original line of characters.  Keep this in mind when setting the padding of zeros.</li>
</ul>
</p>

In [274]:
[*range(10)]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [275]:
# # UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# # GRADED FUNCTION: data_generator
# def data_generator(batch_size, max_length, data_lines, line_to_tensor=line_to_tensor, shuffle=True):
#     """Generator function that yields batches of data

#     Args:
#         batch_size (int): number of examples (in this case, sentences) per batch.
#         max_length (int): maximum length of the output tensor.                         # max length hum nay khud pass ki hay | function k andar, data say infer nahi kar rahay |
#         NOTE: max_length includes the end-of-sentence character that will be added
#                 to the tensor.
#                 Keep in mind that the length of the tensor is always 1 + the length
#                 of the original line of characters.
#         data_lines (list): list of the sentences to group into batches.
#         line_to_tensor (function, optional): function that converts line to tensor. Defaults to line_to_tensor.
#         shuffle (bool, optional): True if the generator should generate random batches of data. Defaults to True.

#     Yields:
#         tuple: two copies of the batch and mask.
#     """
#     # initialize the index that points to the current position in the lines index array
#     index = 0

#     # initialize the list that will contain the current batch
#     cur_batch = []

#     # count the number of lines in data_lines
#     num_lines = len(data_lines)

#     # create an array with the indexes of data_lines that can be shuffled
#     lines_index = [*range(num_lines)]

#     # shuffle line indexes if shuffle is set to True
#     if shuffle:
#         rnd.shuffle(lines_index)

#     ### START CODE HERE ###
#     while True:

#         # if the index is greater than or equal to the number of lines in data_lines | yani agar aik baar saaray ka sara data read kar lia hay tu phir next set of batches nikalnay say pehlay dobara shuffle kr do
#         if index >= num_lines:
#             # then reset the index to 0
#             index = 0
#             # shuffle line indexes if shuffle is set to True
#             if shuffle:
#                 rnd.shuffle(lines_index)

#         # get a line at the `lines_index[index]` position in data_lines
#         line = data_lines[lines_index[index]]

#         # if the length of the line is less than max_length
#         if len(line) < max_length:
#             # append the line to the current batch
#             cur_batch.append(line)                               # initially string sentences k hi batch bana rahay hain

#         # increment the index by one
#         index += 1

#         # if the current batch is now equal to the desired batch size
#         if len(cur_batch) == batch_size:

#             batch = []          # agar jitnay sentences aik batch main chahiye thay, otnay mil gaye hain, tu phir ab conversion-of-sentences-to-integers start krtay hain
#             mask  = []

#             # go through each line (li) in cur_batch
#             for li in cur_batch:
#                 # convert the line (li) to a tensor of integers
#                 tensor = line_to_tensor(li)

#                 # Create a list of zeros to represent the padding
#                 # so that the tensor plus padding will have length `max_length`
#                 pad = [0] * (max_length - len(tensor))

#                 # combine the tensor plus pad
#                 tensor_pad = tensor + pad

#                 # append the padded tensor to the batch
#                 batch.append(tensor_pad)

#                 # A mask for this tensor_pad is 1 whereever tensor_pad is not
#                 # 0 and 0 whereever tensor_pad is 0, i.e. if tensor_pad is
#                 # [1, 2, 3, 0, 0, 0] then example_mask should be
#                 # [1, 1, 1, 0, 0, 0]
#                 example_mask = [0 if t == 0 else 1 for t in tensor_pad]
#                 mask.append(example_mask) # @ KEEPTHIS                             # har sample ki creation k sath hi os ka mask bhi create kr lia | see nb 4 why mask is imp for perplexity score calcs

#             # convert the batch (data type list) to a numpy array
#             batch_np_arr = np.array(batch)
#             mask_np_arr  = np.array(mask)

#             ### END CODE HERE ##

#             # Yield two copies of the batch and mask.
#             yield batch_np_arr, batch_np_arr, mask_np_arr

#             # reset the current batch to an empty list | curr_batch string-form-of-sentences  ko store krnay k liye use hotay tha,
#             cur_batch = []                              # pehlay yeild krva lia, os k bad function continue raha, nice!


In [276]:
# # Try out your data generator
# tmp_lines = ['12345678901', #length 11
#              '123456789', # length 9
#              '234567890', # length 9
#              '345678901'] # length 9

# # Get a batch size of 2, max length 10
# tmp_data_gen = data_generator(batch_size = 2,
#                               max_length = 10,
#                               data_lines = tmp_lines,
#                               shuffle    = False)

# # get one batch
# tmp_batch = next(tmp_data_gen)

# # view the batch
# tmp_batch


##### Expected output

```CPP
(array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
              [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32),
 array([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
              [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32),
 array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32))
```

In [277]:
# type(tmp_batch)


In [278]:
# len(tmp_batch)


In [279]:
# tmp_batch[0]


In [280]:
# # Try out your data generator
# tmp_lines = ['98765432101', #length 11
#              '123456789', # length 9
#              '234567890', # length 9
#              '345'] # length 9

# # Get a batch size of 2, max length 10
# tmp_data_gen = data_generator(batch_size = 4,
#                               max_length = 10,                  # jis sequnce ki length 10 say ziada thi, osay is nay siray say hi uthaya hi nahi, ye nahi k trim kr dain, nahi, uthaya hi nahi osay
#                               data_lines = tmp_lines,
#                               shuffle    = False)

# # get one batch
# tmp_batch = next(tmp_data_gen)

# # view the batch
# tmp_batch


In [281]:
# # Test your function
# w2_unittest.test_data_generator(data_generator)

Now that you have your generator, you can just call them and they will return tensors which correspond to your lines in Shakespeare. The first column and the second column are identical. Now you can go ahead and start building your neural network.

<a name='1-4'></a>
### 1.4 - Repeating Batch Generator

The way the iterator is currently defined, it will keep providing batches forever.

Although it is not needed, we want to show you the `itertools.cycle` function which is really useful when the generator eventually stops

Notice that it is expected to use this function within the training function further below

Usually we want to cycle over the dataset multiple times during training (i.e. train for multiple *epochs*).

For small datasets we can use [`itertools.cycle`](https://docs.python.org/3.8/library/itertools.html#itertools.cycle) to achieve this easily.

In [282]:
# import itertools

# infinite_data_generator = itertools.cycle(
#     data_generator(batch_size=2, max_length=10, data_lines=tmp_lines))


You can see that we can get more than the 5 lines in tmp_lines using this.

In [283]:
# ten_lines = [next(infinite_data_generator) for _ in range(10)]
# print(len(ten_lines))


In [284]:
# ten_lines

<a name='2'></a>
## 2 - Defining the GRU Model

Now that you have the input and output tensors, you will go ahead and initialize your model. You will be implementing the `GRULM`, gated recurrent unit model. To implement this model, you will be using google's `pytorch`. You can use the following techniques and tools when constructing the model:


- `Embedding`: Initializes the embedding. In this case it is the size of the vocabulary by the dimension of the model.
    - `Embedding(vocab_size, d_feature)`.
    - `vocab_size` is the number of unique words in the given vocabulary.
    - `d_feature` is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).
___

- `GRU`: GRU layer.
    - `GRU(n_units)` Builds a traditional GRU of n_cells with dense internal transformations.
    - `GRU` paper: https://arxiv.org/abs/1412.3555
___

- `Dense`: A dense layer.
    - `Dense(n_units)`: The parameter `n_units` is the number of units chosen for this dense layer.

**Instructions:** Implement the `GRULM` class below. You should be using all the methods explained above.


In [45]:
len(tensor_list)/32
# ~ 3500

3834.03125

In [60]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the device (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Parameters
vocab_size         = 256  # Size of the vocabulary
emb_size           = 512  # Dimensionality of embedding vectors
hidden_size        = 512  # Number of features in the GRU hidden state
num_layers         = 2    # Number of GRU layers
linear_output_size = 256  # Number of output features for the linear layer
sequence_length    = 64   # Length of the sequences
batch_size         = 32   # Batch size

# Define the neural network
class GRULM(nn.Module):
    def __init__(self):
        super(GRULM, self).__init__()

        self.embedding = nn.Embedding(vocab_size, emb_size)

        self.gru = nn.GRU(emb_size, hidden_size, num_layers, batch_first=True)

        self.linear = nn.Linear(hidden_size, linear_output_size)

    def forward(self, x, hidden):

        x = self.embedding(x)

        gru_output, hidden = self.gru(x, hidden)

        linear_output = self.linear(gru_output)

        log_softmax_output = F.log_softmax(linear_output, dim=-1)
        return log_softmax_output, hidden

# Instantiate the network and move it to the device
model = GRULM().to(device)



In [61]:
# Define the loss function
criterion = nn.NLLLoss()

# Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# # Define the generator
# def shifted_inputs_and_NOT_shifted_targets_generator(shifted_inputs_, targets_, batch_size_):
#     counter = 0
#     second_counter = batch_size_
#     while True:
#         batch_of_shifted_inputs_ = shifted_inputs_[counter:second_counter].to(device)
#         batch_of_targets_ = targets_[counter:second_counter].to(device)
#         yield batch_of_shifted_inputs_, batch_of_targets_
#         counter += batch_size_
#         second_counter += batch_size_
#         if second_counter >= shifted_inputs_.shape[0]:
#             counter = 0
#             second_counter = batch_size_



In [63]:
# Prepare the data
inputs         = tensor_matrix.copy()
shifted_inputs = tensor_matrix.copy()
targets        = tensor_matrix.copy()

shifted_inputs = np.roll(shifted_inputs, 1)

inputs         = torch.from_numpy(inputs).long().to(device)
shifted_inputs = torch.from_numpy(shifted_inputs).long().to(device)
targets        = torch.from_numpy(targets).long().to(device)

# Initialize the hidden state
hidden_state = torch.zeros(num_layers, batch_size, hidden_size).to(device)


In [64]:
# Initialize the generator
training_gen = shifted_inputs_and_NOT_shifted_targets_generator(shifted_inputs_=shifted_inputs, targets_=targets, batch_size_=batch_size)

# Number of epochs and steps per epoch
num_epochs = 10
num_steps_per_epoch = len(tensor_list)/batch_size

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0

    # Reset hidden state for each epoch
    hidden_state = torch.zeros(num_layers, batch_size, hidden_size).to(device)

    i = 0

    for batch_of_shifted_inputs, batch_of_targets in training_gen:
        i += 1

        # Ensure tensors are on the correct device
        batch_of_shifted_inputs =  batch_of_shifted_inputs.to(device)
        batch_of_targets        =  batch_of_targets.to(device)

        optimizer.zero_grad()

        # Forward pass
        log_softmax_output, _ = model(batch_of_shifted_inputs, hidden_state)

        # Reshape the outputs and targets to match
        num_classes        = log_softmax_output.size(-1)
        log_softmax_output = log_softmax_output.view(-1, num_classes)
        batch_of_targets   = batch_of_targets.view(-1)

        # Compute the loss
        loss = criterion(log_softmax_output, batch_of_targets)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        if i % 10 == 0:
          print('i: ',i)
          print('loss: ', total_loss / (i + 1))
          print()

    avg_loss = total_loss / (i + 1)
    print(f"Epoch [{epoch + 1}/{num_epochs}], Average Loss: {avg_loss:.4f}")


i:  10
loss:  2.766151254827326

i:  20
loss:  2.350588736079988

i:  30
loss:  2.166026588409178

i:  40
loss:  2.028937802082155

i:  50
loss:  1.9379488954357071

i:  60
loss:  1.8714234457641352

i:  70
loss:  1.8208710777927453

i:  80
loss:  1.773002313978878

i:  90
loss:  1.7325746410495633

i:  100
loss:  1.6923185339068423

i:  110
loss:  1.661053245132034

i:  120
loss:  1.628834010155733

i:  130
loss:  1.6027340843477322

i:  140
loss:  1.5726939947047132

i:  150
loss:  1.549672634396332

i:  160
loss:  1.5341296906797042

i:  170
loss:  1.5164401837956836

i:  180
loss:  1.49820435771626

i:  190
loss:  1.4802117129270944

i:  200
loss:  1.4691077900763174

i:  210
loss:  1.4616154000092456

i:  220
loss:  1.4540658247956324

i:  230
loss:  1.4457016752395795

i:  240
loss:  1.435878803373867

i:  250
loss:  1.4241608293407941

i:  260
loss:  1.4117574751148736

i:  270
loss:  1.3978148657017528

i:  280
loss:  1.3863523538850804

i:  290
loss:  1.3801749154054832

i:  3

KeyboardInterrupt: 

In [65]:
#torch.save(model,'GRULM.pt') # pt: pytorch


# Looking at inputs and outputs of each layer

In [155]:
import torch
import torch.nn.functional as F

# Define dimensions
batch_size      = 32
sequence_length = 64
num_classes     = 256

# Generate example tensors
log_softmax_output = torch.randn(batch_size, sequence_length, num_classes)  # Random values
temp_targets = torch.randint(0, num_classes, (batch_size, sequence_length))  # Random class indices

print("Original log_softmax_output shape:", log_softmax_output.shape)
print("Original temp_targets shape:", temp_targets.shape)



Original log_softmax_output shape: torch.Size([32, 64, 256])
Original temp_targets shape: torch.Size([32, 64])


In [162]:
num_classes = log_softmax_output.size(-1)
num_classes

256

In [164]:
log_softmax_output = log_softmax_output.view(-1, num_classes)
log_softmax_output.shape


torch.Size([2048, 256])

In [165]:
temp_targets = temp_targets.view(-1)
temp_targets, temp_targets.shape

(tensor([129,  55,  65,  ..., 107, 161, 143]), torch.Size([2048]))

In [None]:
# class GRULM(nn.Module):
#     def __init__(self):
#         super(GRULM, self).__init__()

#         self.embedding = nn.Embedding(vocab_size, emb_size)

#         self.gru = nn.GRU(emb_size, hidden_size, num_layers, batch_first=True)

#         self.linear = nn.Linear(hidden_size, linear_output_size)

#     def forward(self, x, hidden):

#         x = self.embedding(x)

#         gru_output, hidden = self.gru(x, hidden)

#         linear_output = self.linear(gru_output)

#         log_softmax_output = F.log_softmax(linear_output, dim=-1)
#         return log_softmax_output, hidden


In [47]:
import torch
import torch.nn as nn

# Parameters
vocab_size      = 256   # Number of unique tokens in the vocabulary
emb_size        = 512   # Dimensionality of the embedding vectors
batch_size      = 32    # Batch size
sequence_length = 64    # Sequence length

# Define the embedding layer
embedding = nn.Embedding(vocab_size, emb_size)

# Set random seed for reproducibility
torch.manual_seed(0)

# Create a batch of input sequences (indices)
input_sequences = torch.randint(0, vocab_size, (batch_size, sequence_length))

# Get the embedding representation of the sequences
embedded_sequences = embedding(input_sequences)

# Display shapes
print("Input Sequences shape:\n", input_sequences.shape)
print("\nEmbedded Sequences shape:\n", embedded_sequences.shape)


Input Sequences shape:
 torch.Size([32, 64])

Embedded Sequences shape:
 torch.Size([32, 64, 512])


In [50]:
import torch
import torch.nn as nn

# Parameters
hidden_size     = 512   # Number of features in the hidden state
num_layers      = 2     # Number of recurrent layers
batch_size      = 32    # Batch size
sequence_length = 64    # Sequence length
emb_size        = 512   # Dimensionality of the embedding vectors

# Define the GRU layer with batch_first=True and 2 layers
gru = nn.GRU(emb_size, hidden_size, num_layers, batch_first=True)

# Initialize the hidden state (num_layers, batch_size, hidden_size)
initial_hidden_state = torch.zeros(num_layers, batch_size, hidden_size)

# Forward pass through the GRU using the previously computed embeddings
output, hidden_state = gru(embedded_sequences, initial_hidden_state)

# Display shapes
print("Embedded Sequences shape:\n", embedded_sequences.shape)
print("\nInitial Hidden State shape:\n", initial_hidden_state.shape)
print("\nGRU Output shape:\n", output.shape)
print("\nFinal Hidden State shape:\n", hidden_state.shape)


Embedded Sequences shape:
 torch.Size([32, 64, 512])

Initial Hidden State shape:
 torch.Size([2, 32, 512])

GRU Output shape:
 torch.Size([32, 64, 512])

Final Hidden State shape:
 torch.Size([2, 32, 512])


In [51]:
import torch
import torch.nn as nn

# Parameters
hidden_size        = 512  # Number of features in the hidden state (same as GRU output feature size)
linear_output_size = 256  # Number of output features for the linear layer

# Define the Linear layer
linear = nn.Linear(hidden_size, linear_output_size)

# Assume gru_output is available from the previous cell
# For demonstration, creating a dummy gru_output tensor with the same shape as expected
# If you are running this in sequence, remove this line
gru_output = torch.randn(32, 64, 512)

# Forward pass through the Linear layer
linear_output = linear(gru_output)

# Display shapes
print("GRU Output shape:\n", gru_output.shape)
print("\nLinear Output shape:\n", linear_output.shape)



GRU Output shape:
 torch.Size([32, 64, 512])

Linear Output shape:
 torch.Size([32, 64, 256])


In [52]:
linear_output_soft_maxed = F.softmax(linear_output, dim=-1)


In [53]:
linear_output_soft_maxed.shape

torch.Size([32, 64, 256])

In [54]:
linear_output_soft_maxed[0,0,:10]


tensor([0.0074, 0.0042, 0.0041, 0.0012, 0.0033, 0.0063, 0.0107, 0.0034, 0.0024,
        0.0017], grad_fn=<SliceBackward0>)

In [55]:
torch.sum(linear_output_soft_maxed[0,0,:])


tensor(1., grad_fn=<SumBackward0>)

In [56]:
log_linear_output_soft_maxed = torch.log(linear_output_soft_maxed)


In [57]:
log_linear_output_soft_maxed[0,0,:15]


tensor([-4.9077, -5.4785, -5.4887, -6.7558, -5.7147, -5.0720, -4.5364, -5.6715,
        -6.0319, -6.3949, -5.4049, -5.5059, -5.3138, -6.0489, -5.8283],
       grad_fn=<SliceBackward0>)

In [58]:
F.log_softmax(linear_output, dim=-1)[0,0,:15]


tensor([-4.9077, -5.4785, -5.4887, -6.7558, -5.7147, -5.0720, -4.5364, -5.6715,
        -6.0319, -6.3949, -5.4049, -5.5059, -5.3138, -6.0489, -5.8283],
       grad_fn=<SliceBackward0>)

In [66]:
batch_size=32
def n_used_lines(lines, max_length):
    '''
    Args:
    lines: all lines of text an array of lines
    max_length - max_length of a line in order to be considered an int
    output_dir - folder to save your file an int
    Return:
    number of efective examples
    '''

    n_lines = 0
    for l in lines:
        if len(l) <= max_length:
            n_lines += 1
    return n_lines


num_used_lines = n_used_lines(lines, 64)                          # onhoon nay max length 32 d
print('Number of used lines from the dataset:', num_used_lines)
print('Batch size (a power of 2):', int(batch_size))
steps_per_epoch = int(num_used_lines/batch_size)
print('Number of steps to cover one epoch:', steps_per_epoch)

Number of used lines from the dataset: 123034
Batch size (a power of 2): 32
Number of steps to cover one epoch: 3844


## Evaluation  

### Evaluating using the Deep Nets

Now that you have learned how to train a model, you will learn how to evaluate it. To evaluate language models, we usually use perplexity which is a measure of how well a probability model predicts a sample. Note that perplexity is defined as:

$$P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{n-1})}}$$

As an implementation hack, you would usually take the log of that formula (to enable us to use the log probabilities we get as output of our `RNN`, convert exponents to products, and products into sums which makes computations less complicated and computationally more efficient). You should also take care of the padding, since you do not want to include the padding when calculating the perplexity (because we do not want to have a perplexity measure artificially good).


$$\log P(W) = {\log\left(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{n-1})}}\right)}$$$$ = \log\left(\left(\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{n-1})}\right)^{\frac{1}{N}}\right)$$
$$ = \log\left(\left({\prod_{i=1}^{N}{P(w_i| w_1,...,w_{n-1})}}\right)^{-\frac{1}{N}}\right)$$$$ = -\frac{1}{N}{\log\left({\prod_{i=1}^{N}{P(w_i| w_1,...,w_{n-1})}}\right)} $$$$ = -\frac{1}{N}{{\sum_{i=1}^{N}{\log P(w_i| w_1,...,w_{n-1})}}} $$

<a name='ex-5'></a>
### test_model
**Instructions:** Write a program that will help evaluate your model. Implementation hack: your program takes in preds and target. Preds is a tensor of log probabilities. You can use `torch.nn.functional.one_hot(.....)` to transform the target into the same dimension. You then multiply them and sum.

You also have to create a mask to only get the non-padded probabilities. Good luck!

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>To convert the target into the same dimension as the predictions tensor use tl.one.hot with target and preds.shape[-1].</li>
    <li>You will also need the np.equal function in order to unpad the data and properly compute perplexity.</li>
    <li>Keep in mind while implementing the formula above that <em> w<sub>i</sub></em> represents a letter from our 256 letter alphabet.</li>
</ul>
</p>

In [None]:
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: test_model
def test_model(preds, target):
    """Function to test the model.

    Args:
        preds (jax.interpreters.xla.DeviceArray): Predictions of a list of batches of tensors corresponding to lines of text.
        target (jax.interpreters.xla.DeviceArray): Actual list of batches of tensors corresponding to lines of text.

    Returns:
        float: log_perplexity of the model.
    """
    ### START CODE HERE ###

    log_p = np.sum(preds * tl.one_hot(target, preds.shape[-1]), axis= -1) # HINT: tl.one_hot() should replace one of the Nones

    non_pad = 1.0 - np.equal(target, 0)          # You should check if the target equals 0
    log_p = log_p * non_pad                             # Get rid of the padding

    log_ppx = np.sum(log_p, axis=1) / np.sum(non_pad, axis=1) # Remember to set the axis properly when summing up
    log_ppx = np.mean(log_ppx) # Compute the mean of the previous expression


    ### END CODE HERE ###

    return -log_ppx

In [None]:
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# Testing
model = GRULM()
model.init_from_file('model.pkl.gz')
batch = next(data_generator(batch_size, max_length, lines, shuffle=False))
preds = model(batch[0])
log_ppx = test_model(preds, batch[1])
print('The log perplexity and perplexity of your model are respectively', log_ppx, np.exp(log_ppx))

The log perplexity and perplexity of your model are respectively 1.7646704 5.8396473


**Expected Output:** The log perplexity and perplexity of your model are respectively around 1.7 and 5.8.

In [None]:
# Test your function
pretrained_model = GRULM()
pretrained_model.init_from_file('model.pkl.gz')
w2_unittest.unittest_test_model(test_model, pretrained_model)
del pretrained_model

[92m All tests passed


## Generating the Language with your Own Model

We will now use your own language model to generate new sentences for that we need to make draws from a Gumbel distribution.

The Gumbel Probability Density Function (PDF) is defined as:

$$ f(z) = {1\over{\beta}}e^{(-z+e^{(-z)})} $$

where: $$ z = {(x - \mu)\over{\beta}}$$

The maximum value, which is what we choose as the prediction in the last step of a Recursive Neural Network `RNN` we are using for text generation, in a sample of a random variable following an exponential distribution approaches the Gumbel distribution when the sample increases asymptotically. For that reason, the Gumbel distribution is used to sample from a categorical distribution.

In [67]:
import numpy as np
import torch
import torch.nn.functional as F

# Define the device (GPU if available, otherwise CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Gumbel sampling function for PyTorch tensors
def gumbel_sample(log_probs, temperature=1.0):
    u = torch.empty_like(log_probs).uniform_(1e-6, 1.0 - 1e-6)
    g = -torch.log(-torch.log(u))
    return torch.argmax(log_probs + g * temperature, dim=-1)

# Prediction function
def predict(model, num_chars, prefix, device):
    model.eval()  # Set model to evaluation mode

    inp = [ord(c) for c in prefix]
    result = [c for c in prefix]
    max_len = len(prefix) + num_chars

    inp_tensor = torch.tensor(inp, dtype=torch.long).unsqueeze(0).to(device)  # Add batch dimension and move to device
    hidden_state = torch.zeros(num_layers, 1, hidden_size).to(device)  # Initial hidden state

    for _ in range(num_chars):
        # Prepare input tensor by padding to max_len
        cur_inp = torch.cat([inp_tensor, torch.zeros(1, max_len - inp_tensor.size(1), dtype=torch.long).to(device)], dim=1)

        # Forward pass through the model
        with torch.no_grad():
            log_softmax_output, hidden_state = model(cur_inp, hidden_state)

        # Get the output probabilities for the next character
        next_char_log_probs = log_softmax_output[0, len(inp)-1, :]  # Shape: (num_classes,)

        # Sample the next character using Gumbel sampling
        next_char = gumbel_sample(next_char_log_probs, temperature=1.0).item()

        # Append the next character to the input sequence
        inp.append(next_char)
        inp_tensor = torch.tensor(inp, dtype=torch.long).unsqueeze(0).to(device)

        # Break if EOS (end of sequence) token is encountered
        if next_char == 1:
            break

        # Append the next character to the result
        result.append(chr(next_char))

    return "".join(result)

# Example usage
print(predict(model, 32, "", device))


                                


In [88]:
# Example usage
print(predict(model, 32, " ", device))

 their cause may be,


In [75]:
print(predict(model, 32, "Hello wor", device))


Hello work


In [78]:
print(predict(model, 32, "what is your n", device))


what is your night


In [81]:
print(predict(model, 32, "do", device))


doubt, then i'll


In the generated text above, you can see that the model generates text that makes sense capturing dependencies between words and without any input. A simple n-gram model would have not been able to capture all of that in one sentence.

<a name='6'></a>
###  <span style="color:blue"> On statistical methods </span>

Using a statistical method like the one you implemented in course 2 will not give you results that are as good. Your model will not be able to encode information seen previously in the data set and as a result, the perplexity will increase. Remember from course 2 that the higher the perplexity, the worse your model is. Furthermore, statistical ngram models take up too much space and memory. As a result, it will be inefficient and too slow. Conversely, with deepnets, you can get a better perplexity. Note, learning about n-gram language models is still important and allows you to better understand deepnets.
