# IS319 - Deep Learning

## TP3 - Recurrent neural networks

Credits: Andrej Karpathy

The goal of this TP is to experiment with recurrent neural networks for a character-level language model to generate text that looks like training text data.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 1. Text data preprocessing

Several text datasets are provided, feel free to experiment with different ones throughout the TP. At the beginning, use a small subset of a given dataset (for example use only 10k characters).

In [2]:
!tar -xvf text-datasets.tgz

tar: text-datasets.tgz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now


In [3]:
text_data_fname = 'baudelaire.txt'  # ~0.1m characters (French)
# text_data_fname = 'proust.txt'      # ~7.3m characters (French)
# text_data_fname = 'shakespeare.txt' # ~0.1m characters (English)
# text_data_fname = 'lotr.txt'        # ~2.5m characters (English)
# text_data_fname = 'doom.c'          # ~1m characters (C Code)
# text_data_fname = 'linux.c'         # ~11.5m characters (C code)

text_data = open(text_data_fname, 'r',encoding="utf8").read()
text_data = text_data[:10000] # use a small subset
print(f'Dataset `{text_data_fname}` contains {len(text_data)} characters.')
print('Excerpt of the dataset:')
print(text_data[:2000])

Dataset `baudelaire.txt` contains 10000 characters.
Excerpt of the dataset:
LES FLEURS DU MAL

par

CHARLES BAUDELAIRE


AU LECTEUR


La sottise, l'erreur, le péché, la lésine,
Occupent nos esprits et travaillent nos corps,
Et nous alimentons nos aimables remords,
Comme les mendiants nourrissent leur vermine.

Nos péchés sont têtus, nos repentirs sont lâches,
Nous nous faisons payer grassement nos aveux,
Et nous rentrons gaîment dans le chemin bourbeux,
Croyant par de vils pleurs laver toutes nos taches.

Sur l'oreiller du mal c'est Satan Trismégiste
Qui berce longuement notre esprit enchanté,
Et le riche métal de notre volonté
Est tout vaporisé par ce savant chimiste.

C'est le Diable qui tient les fils qui nous remuent!
Aux objets répugnants nous trouvons des appas;
Chaque jour vers l'Enfer nous descendons d'un pas,
Sans horreur, à travers des ténèbres qui puent.

Ainsi qu'un débauché pauvre qui baise et mange
Le sein martyrisé d'une antique catin,
Nous volons au passage un plaisir c

**(Question)** Create a character-level vocabulary for your text data. Create two dictionaries: `ctoi` mapping each character to an index, and the reverse `itoc` mapping each index to its corresponding character. Implement the functions to convert text to tensor and tensor to text using these mappings. Apply these functions to some text data.

In [4]:
# Create the vocabulary and the two mapping dictionaries
# YOUR CODE HERE
import numpy as np

idx = 0
ctoi = {}
itoc = {}

voc = set(text_data)

for elt in voc:
    ctoi[elt] = idx
    itoc[idx] = elt
    idx += 1

print(ctoi)
print(itoc)
print(len(ctoi))
# Implement the function converting text to tensor
def text_to_tensor(text, ctoi):
    # YOUR CODE HERE
    return torch.LongTensor(np.array([ctoi[c] for c in text]))


# Implement the function converting tensor to text
def tensor_to_text(tensor, itoc):
    # YOUR CODE HERE
    return ''.join([itoc[elt.item()] for elt in tensor])#torch.argmax(tensor, dim=2)])

# Apply your functions to some text data
# YOUR CODE HERE
#raise NotImplementedError()
a = text_to_tensor(text_data[:10], ctoi)

print(a)
print(tensor_to_text(a, itoc))



{'î': 0, 'E': 1, 'v': 2, 'P': 3, 'r': 4, 'J': 5, 'S': 6, ' ': 7, 'e': 8, '!': 9, ':': 10, 'g': 11, "'": 12, 'a': 13, 'O': 14, 'p': 15, 's': 16, 'T': 17, 'F': 18, 'i': 19, 'z': 20, 'É': 21, 'ô': 22, 'L': 23, 'y': 24, 'x': 25, 'ç': 26, '»': 27, 'G': 28, 'f': 29, 'R': 30, 'C': 31, 'B': 32, 'd': 33, 'u': 34, '.': 35, 'U': 36, 'o': 37, ';': 38, 'V': 39, 'è': 40, 'q': 41, 'c': 42, 'k': 43, 'ù': 44, 't': 45, 'D': 46, ',': 47, 'l': 48, 'b': 49, 'é': 50, 'Q': 51, '?': 52, '-': 53, 'M': 54, 'm': 55, 'N': 56, 'I': 57, '_': 58, 'W': 59, 'à': 60, '«': 61, '\n': 62, 'h': 63, 'û': 64, 'â': 65, 'A': 66, 'n': 67, 'ê': 68, 'H': 69, 'j': 70}
{0: 'î', 1: 'E', 2: 'v', 3: 'P', 4: 'r', 5: 'J', 6: 'S', 7: ' ', 8: 'e', 9: '!', 10: ':', 11: 'g', 12: "'", 13: 'a', 14: 'O', 15: 'p', 16: 's', 17: 'T', 18: 'F', 19: 'i', 20: 'z', 21: 'É', 22: 'ô', 23: 'L', 24: 'y', 25: 'x', 26: 'ç', 27: '»', 28: 'G', 29: 'f', 30: 'R', 31: 'C', 32: 'B', 33: 'd', 34: 'u', 35: '.', 36: 'U', 37: 'o', 38: ';', 39: 'V', 40: 'è', 41: 'q', 

## 2. Setup a character-level recurrent neural network

**(Question)** Setup a simple embedding layer with `nn.Embedding` to project character indices to `embedding_dim` dimensional vectors. Explain precisely how this layer works and what are its outputs for a given input sequence.

In [5]:
# YOUR CODE HERE
embedding_dim = 5
vocab_size = len(ctoi)
embedding = nn.Embedding(vocab_size,embedding_dim)

emb = embedding(text_to_tensor(text_data[:8], ctoi))

The nn.Embedding layer is designed to embed categorical variables, such as characters or words within a vocabulary, into continuous vectors of a specified dimension (embedding_dim). In the context of an input sequence, the layer transforms the sequence of indices into a tensor of size (len(input_seq), embedding_dim). Each row of this tensor represents the embedding of the corresponding index in the input sequence, projecting it into an embedding_dim-dimensional space.

In the example above, the initial representation is randomized. However, when a learning model is applied, the weights of this layer are adjusted through training to capture textual relationships that cannot be represented using inddices in the vocabulary. As the model learns, the embedding vectors adapt to represent semantic connections between the inputs. For instance, characters or words that frequently co-occur in the training set will have proximate representations in the embedding space.

This dynamic adjustment of weights during training allows the nn.Embedding layer to create embeddings that encode meaningful relationships between categorical inputs, enhancing the model's ability to understand and generalize from the input data.

**(Question)** Setup a single-layer RNN with `nn.RNN` (without defining a custom class). Use `hidden_dim` size for hidden states. Explain precisely the outputs of this layer for a given input sequence.

In [6]:
# YOUR CODE HERE
hidden_dim = 20
rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=2)
output, hidden = rnn(emb)

print(output.size())
print(hidden.size())

torch.Size([8, 20])
torch.Size([2, 20])


For a given input sequence, the nn.RNN layer returns two tensors. The first tensor, named 'output', is of size (len(input), hidden_size). This tensor stores the output of the RNN layer at each time step in the input sequence. Each row of the 'output' tensor represents the hidden state of the RNN at a specific time step.

The second tensor, named 'hidden', is of size (num_layers, hidden_dim). This tensor stores the final hidden state of the RNN. In the context of 'hidden' contains the summarization of information from the entire input sequence in each layer of the RNN. It serves as a compact representation that captures the essential features learned by the RNN during the processing of the input sequence.

**(Question)** Create a simple RNN model with a custom `nn.Module` class. It should contain: an embedding layer, a single-layer RNN, and a dense output layer. For each character of the input sequence, the model should predict the probability of the next character. The forward method should return the probabilities for next characters and the corresponding hidden states.
After completing the class, create a model and apply the forward pass on some input text. Understand and explain the results.

*Note:* depending on how you implement the loss function later, it can be convenient to return logits instead of probabilities, i.e. raw values of the output layer before any activation function.

In [7]:
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=1):
        '''Initialize model parameters and layers.'''
        super().__init__()
        # YOUR CODE HERE
        self.embedding = nn.Embedding(vocab_size,embedding_dim)
        self.rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=num_layers)
        self.dense = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tensor_data, hidden_state=None):
        '''Apply the forward pass for some text data already converted to tensor.'''
        # YOUR CODE HERE
        embedding = self.embedding(tensor_data)
        output, hidden = self.rnn(embedding, hidden_state)
        logits = self.dense(output)
        return logits, hidden

# Initialize a model and apply the forward pass on some input text
# YOUR CODE HERE


vocab_size = len(ctoi)
embedding_dim = 200
hidden_dim = 100

charRNN = CharRNN(vocab_size,embedding_dim,hidden_dim)
logits, _ = charRNN.forward(text_to_tensor(text_data[:1],ctoi))

print(text_data[0])
print(logits.shape)

L
torch.Size([1, 71])


**(Question)** Implement a simple training loop to overfit on a small input sequence. The loss function should be a categorical cross entropy on the predicted characters. Monitor the loss function value over the iterations.

In [8]:
# Sample a small input sequence into tensor `input_seq` and store its corresponding expected sequence into tensor `target_seq`
# YOUR CODE HERE
vocab_size = len(ctoi)
embedding_dim = 20
hidden_dim = 100
input_seq = text_to_tensor(text_data[:10], ctoi)
input_seq, last_seq = input_seq[:-2], input_seq[-1:]
print(input_seq, last_seq)

target_seq = torch.cat([input_seq[1:], last_seq])

criterion = nn.CrossEntropyLoss()

print(input_seq.shape, target_seq.shape)

# Implement a training loop overfitting an input sequence and monitoring the loss function
def train_overfit(model, input_seq, target_seq, n_iters=200, learning_rate=0.02):
    # YOUR CODE HERE
    optimizer = torch.optim.SGD(model.parameters(), lr = learning_rate, weight_decay=5e-3, momentum=0.9)
    hidden = None

    for iter in range(1, n_iters + 1):
      logits, hidden = model.forward(input_seq, hidden)
      hidden = hidden.detach() #Once we update the hidden state we need to detach it, to not backpropagate through it in the next batch
      #output = F.softmax(logits, dim=1)
      loss = criterion(logits, target_seq)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      if iter % 10 == 0:
          print(f'Iteration {iter}/{n_iters}, Loss: {loss.item()}')



# Initialize a model and make it overfit the input sequence
# YOUR CODE HERE

charRNN = CharRNN(vocab_size, embedding_dim, hidden_dim)
train_overfit(charRNN, input_seq, target_seq)

tensor([23,  1,  6,  7, 18, 23,  1, 36]) tensor([6])
torch.Size([8]) torch.Size([8])
Iteration 10/200, Loss: 1.9554550647735596
Iteration 20/200, Loss: 0.19256284832954407
Iteration 30/200, Loss: 0.056652314960956573
Iteration 40/200, Loss: 0.0120839299634099
Iteration 50/200, Loss: 0.007024620659649372
Iteration 60/200, Loss: 0.005442512221634388
Iteration 70/200, Loss: 0.0047719902358949184
Iteration 80/200, Loss: 0.004523928742855787
Iteration 90/200, Loss: 0.00447916379198432
Iteration 100/200, Loss: 0.004535422660410404
Iteration 110/200, Loss: 0.0046462747268378735
Iteration 120/200, Loss: 0.004790342412889004
Iteration 130/200, Loss: 0.004957183264195919
Iteration 140/200, Loss: 0.00514143705368042
Iteration 150/200, Loss: 0.005339182913303375
Iteration 160/200, Loss: 0.0055473302491009235
Iteration 170/200, Loss: 0.0057634287513792515
Iteration 180/200, Loss: 0.005984853487461805
Iteration 190/200, Loss: 0.00620912853628397
Iteration 200/200, Loss: 0.006433810107409954


**(Question)** Implement a `predict_argmax` method for your `RNN` model. Then, verify your overfitting: use some characters of your input sequence as context to predict the remaining ones. Experiment with the current model and analyze the results.

In [9]:
class CharRNN(CharRNN):
    def predict_argmax(self, context_tensor, n_predictions):
        # Apply the forward pass for the context tensor
        # Then, store the last prediction and last hidden state
        # YOUR CODE HERE
        predictions = []
        logits, hidden = self.forward(context_tensor)
        output = F.softmax(logits, dim=1)[-1]
        last_pred = torch.argmax(output)
        last_pred = torch.LongTensor(last_pred).unsqueeze(-1)
        predictions.append(last_pred)
        # Use the last prediction and last hidden state as inputs to the next forward pass
        # Do this in a loop to predict the next `n_predictions` characters
        # YOUR CODE HERE
        for _ in range(n_predictions):
            logits, hidden = self.forward(last_pred, hidden)
            output = F.softmax(logits, dim=1)
            last_pred = torch.argmax(output).unsqueeze(-1)
            predictions.append(last_pred)
        return predictions

overfit_data = "hello world!"
target_overfit = "ello world! "
voc_ = set(overfit_data)

ctoi_ = {}
itoc_ = {}

idx_ = 0

for elt in voc_:
    ctoi_[elt] = idx_
    itoc_[idx_] = elt
    idx_ += 1
# Initialize a model and make it overfit as above
# Then, verify your overfitting by predicting characters given some context
# YOUR CODE HERE

vocab_size = len(ctoi_)
embedding_dim = 20
hidden_dim = 32


input_seq = text_to_tensor(overfit_data, ctoi_)
target_seq = text_to_tensor(target_overfit,ctoi_)

print(input_seq, target_seq)
charRNN = CharRNN(vocab_size, embedding_dim, hidden_dim)
train_overfit(charRNN, input_seq, target_seq)

print("hello" + tensor_to_text(charRNN.predict_argmax(text_to_tensor("hello",ctoi_),7),itoc_))


tensor([3, 6, 4, 4, 1, 5, 8, 1, 2, 4, 0, 7]) tensor([6, 4, 4, 1, 5, 8, 1, 2, 4, 0, 7, 5])
Iteration 10/200, Loss: 1.5233383178710938
Iteration 20/200, Loss: 0.783195972442627
Iteration 30/200, Loss: 0.32838568091392517
Iteration 40/200, Loss: 0.1534775346517563
Iteration 50/200, Loss: 0.0832783505320549
Iteration 60/200, Loss: 0.05514344573020935
Iteration 70/200, Loss: 0.04263845086097717
Iteration 80/200, Loss: 0.03613058105111122
Iteration 90/200, Loss: 0.03223803639411926
Iteration 100/200, Loss: 0.029654113575816154
Iteration 110/200, Loss: 0.027799280360341072
Iteration 120/200, Loss: 0.026394164189696312
Iteration 130/200, Loss: 0.025290435180068016
Iteration 140/200, Loss: 0.024401918053627014
Iteration 150/200, Loss: 0.02367417700588703
Iteration 160/200, Loss: 0.02307082526385784
Iteration 170/200, Loss: 0.022565973922610283
Iteration 180/200, Loss: 0.022140586748719215
Iteration 190/200, Loss: 0.0217799823731184
Iteration 200/200, Loss: 0.021473055705428123
hello world! 


The model trained on "hello world!" manages to overfit on this sequence, and when asked to generate text given "hello" as a context, the model manages to predict "world!"

Using the argmax function to predict the next character can yield a deterministic generator always predicting the same characters. Instead, it is common to predict the next character by sampling from the distribution of output predictions, adding some randomness into the generator.

**(Question)** Implement a `predict_proba` method for your `RNN` model. It should be very similar to `predict_argmax`, but instead of using argmax, it should randomly sample from the output predictions. To do that, you can use the `torch.distribution.Categorical` class and its `sample()` method. Verify that your method correctly added some randomness.

In [10]:
from torch.distributions import Categorical


tensor = torch.tensor([0.06,0.04,0.3,0.2,0.1,0.05,0.09,0.06,0.1])
distribution = Categorical(probs=tensor)
last_pred1 = distribution.sample()
last_pred2 = distribution.sample()
last_pred3 = distribution.sample()
argmax = tensor.argmax()


print("argmax : " ,argmax, tensor[argmax])
print("sample distribution : ", last_pred1, tensor[last_pred1])
print("sample distribution : ", last_pred2, tensor[last_pred2])
print("sample distribution : ", last_pred3, tensor[last_pred3])

argmax :  tensor(2) tensor(0.3000)
sample distribution :  tensor(3) tensor(0.2000)
sample distribution :  tensor(1) tensor(0.0400)
sample distribution :  tensor(2) tensor(0.3000)


In [52]:
class CharRNN(CharRNN):
    def predict_proba(self, input_context, n_predictions):
        # YOUR CODE HERE
        predictions = []
        logits, hidden = self.forward(input_context)
        output = F.softmax(logits, dim=1)[-1]
        from torch.distributions import Categorical
        distribution = Categorical(probs=output)
        last_pred = distribution.sample()
        #max_pred = torch.argmax(output)
        #print(last_pred, max_pred, max(output))
        last_pred = torch.LongTensor(last_pred).unsqueeze(-1)
        predictions.append(last_pred)
        # Use the last prediction and last hidden state as inputs to the next forward pass
        # Do this in a loop to predict the next `n_predictions` characters
        # YOUR CODE HERE
        for _ in range(n_predictions):
            logits, hidden = self.forward(last_pred, hidden)
            output = F.softmax(logits, dim=1)
            #last_pred = torch.argmax(output).unsqueeze(-1)     
            distribution = Categorical(logits=output)
            last_pred = distribution.sample()
            predictions.append(last_pred)
        return predictions
        

vocab_size = len(ctoi_)
embedding_dim = 5
hidden_dim = 20

print(input_seq, target_seq)
charRNN = CharRNN(vocab_size, embedding_dim, hidden_dim)
train_overfit(charRNN, input_seq, target_seq)

print("hello" + tensor_to_text(charRNN.predict_argmax(text_to_tensor(overfit_data,ctoi_),7),itoc_))


tensor([3, 6, 4, 4, 1, 5, 8, 1, 2, 4, 0, 7]) tensor([6, 4, 4, 1, 5, 8, 1, 2, 4, 0, 7, 5])
Iteration 10/200, Loss: 1.9770393371582031
Iteration 20/200, Loss: 1.5594006776809692
Iteration 30/200, Loss: 1.0353912115097046
Iteration 40/200, Loss: 0.5800424218177795
Iteration 50/200, Loss: 0.3222600519657135
Iteration 60/200, Loss: 0.20012183487415314
Iteration 70/200, Loss: 0.14171545207500458
Iteration 80/200, Loss: 0.10973448306322098
Iteration 90/200, Loss: 0.09056443721055984
Iteration 100/200, Loss: 0.07795416563749313
Iteration 110/200, Loss: 0.06901571899652481
Iteration 120/200, Loss: 0.06235348805785179
Iteration 130/200, Loss: 0.05720638111233711
Iteration 140/200, Loss: 0.05311970040202141
Iteration 150/200, Loss: 0.049804817885160446
Iteration 160/200, Loss: 0.04706993326544762
Iteration 170/200, Loss: 0.04478207603096962
Iteration 180/200, Loss: 0.04284609854221344
Iteration 190/200, Loss: 0.041191816329956055
Iteration 200/200, Loss: 0.03976643458008766
hello world! 


## 3. Train the RNN model on text data

**(Question)** Adapt your previous code to implement a proper training loop for a text dataset. To do so, we need to specify a sequence length `seq_len`, acting similarly to the batch size in classic neural networks. Then, you can either randomly sample sequences of length `seq_len` from the text dataset over `n_iters` iterations, or properly loop over the text dataset for `n_epochs` epochs (with a random starting point for each epoch to ensure different sequences), to make sure the whole dataset is seen by the model. Feel free to adjust training and model parameters empirically. Start with a small model and a small subset of the text dataset, then move on to larger experiments. Remember to use GPU if available.

In [26]:
# Create the text dataset, compute its mappings and convert it to tensor
# YOUR CODE HERE
seq_len = 10
dataset_size = int(len(text_data) / seq_len) + 1
text_dataset_input = torch.empty((dataset_size, seq_len)).long()
text_dataset_output = torch.empty((dataset_size, seq_len)).long()
for i in range(1,len(text_data), seq_len):
    target_input = text_to_tensor(text_data[i-1:i-1+seq_len], ctoi)
    target_output = text_to_tensor(text_data[i:i + seq_len], ctoi)
    if target_input.size(dim=0) != seq_len:
        pad = torch.cat([torch.tensor([len(ctoi) - 1]) for _ in range(seq_len - target_input.size(dim=0))])
        target_input = torch.cat((target_input,pad))
    if target_output.size(dim=0) != seq_len:
        pad = torch.cat([torch.tensor([len(ctoi) - 1]) for _ in range(seq_len - target_output.size(dim=0))])
        target_output = torch.cat((target_output,pad))
    text_dataset_input[int(i / seq_len)] = target_input
    text_dataset_output[int(i / seq_len)] = target_output
#print(text_to_tensor(text_data[i:i+seq_len],ctoi))
print(text_dataset_input[0])

# Initialize training parameters
# YOUR CODE HERE
vocab_size = len(ctoi)
embedding_dim = 32
hidden_dim = 256
n_epochs = 50
# Initialize a character-level RNN model
# YOUR CODE HERE
textRNN = CharRNN(vocab_size, embedding_dim, hidden_dim)

optimizer = torch.optim.SGD(textRNN.parameters(), lr = 0.01)#, weight_decay=5e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# Setup the training loop
# Regularly record the loss and sample from the model to monitor what is happening
# YOUR CODE HERE
def fit(model, text_dataset_input, text_dataset_output, dataset_size,n_epochs, optimizer):
  hidden = None
  for iter in range(1, n_epochs + 1):
    start_idx = torch.randint(0,int(seq_len / 2),(1,)).item() #randomly select a start_idx
    for i in range(dataset_size): 
      logits, hidden = model(text_dataset_input[i][start_idx:],hidden)
      hidden = hidden.detach() #Once we update the hidden state we need to detach it, to not backpropagate through it in the next batch
      loss = criterion(logits, text_dataset_output[i][start_idx:])
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      
    print(f'Epoch {iter}, Loss: {loss.item()}')
  return model

textRNN = fit(textRNN, text_dataset_input, text_dataset_output, dataset_size, n_epochs, optimizer)
      

tensor([23,  1,  6,  7, 18, 23,  1, 36, 30,  6])
Epoch 1, Loss: 5.885560512542725
Epoch 2, Loss: 5.094353199005127
Epoch 3, Loss: 4.323110580444336
Epoch 4, Loss: 3.7375683784484863
Epoch 5, Loss: 3.367299795150757
Epoch 6, Loss: 2.9117343425750732
Epoch 7, Loss: 2.5556092262268066
Epoch 8, Loss: 2.1915972232818604
Epoch 9, Loss: 1.919665813446045
Epoch 10, Loss: 1.5312891006469727
Epoch 11, Loss: 1.350659966468811
Epoch 12, Loss: 1.0119707584381104
Epoch 13, Loss: 0.9105053544044495
Epoch 14, Loss: 0.7413322925567627
Epoch 15, Loss: 0.705493152141571
Epoch 16, Loss: 0.582102358341217
Epoch 17, Loss: 0.5269046425819397
Epoch 18, Loss: 0.4788762331008911
Epoch 19, Loss: 0.4838051497936249
Epoch 20, Loss: 0.45311132073402405
Epoch 21, Loss: 0.39722150564193726
Epoch 22, Loss: 0.5340391397476196
Epoch 23, Loss: 0.3462653160095215
Epoch 24, Loss: 0.2952503561973572
Epoch 25, Loss: 0.29238584637641907
Epoch 26, Loss: 0.3000987470149994
Epoch 27, Loss: 0.2876540720462799
Epoch 28, Loss: 0.39

**(Question)** From your trained model, play around with its predictions: start with a custom input sequence and ask the model to predict the rest. Analyze and comment your results.

In [50]:
context_tensor = text_to_tensor(text_data[100:150],ctoi)
print(tensor_to_text(textRNN.predict_argmax(context_tensor,200),itoc))


 des mers,
Pour toutes des perce de chaque jeume de chaque jeume de chaque jeume de chaque jeume de chaque jeume de chaque jeume de chaque jeume de chaque jeume de chaque jeume de chaque jeume de chaqu


YOUR ANSWER HERE

## 4. Experiment with different RNN architectures

**(Question)** Experiment with different RNN architecures. Potential ideas are multi-layer RNNs, GRUs and LSTMs. All models can be extended to multi-layer using the `num_layers` parameter. Analyze and comment your results.

In [103]:
# YOUR CODE HERE
class GruNN(CharRNN):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=1):
        '''Initialize model parameters and layers.'''
        super().__init__(vocab_size, embedding_dim, hidden_dim, num_layers=1)
        # YOUR CODE HERE
        self.gru = nn.GRU(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=num_layers)

    def forward(self, tensor_data, hidden_state=None):
        '''Apply the forward pass for some text data already converted to tensor.'''
        # YOUR CODE HERE
        embedding = self.embedding(tensor_data)
        output, hidden = self.gru(embedding, hidden_state)
        logits = self.dense(output)
        return logits, hidden
        
    def predict_argmax(self, context_tensor, n_predictions):
        # Apply the forward pass for the context tensor
        # Then, store the last prediction and last hidden state
        # YOUR CODE HERE
        predictions = []
        logits, hidden = self.forward(context_tensor)
        output = F.softmax(logits, dim=1)[-1]
        last_pred = torch.argmax(output)
        last_pred = torch.LongTensor(last_pred).unsqueeze(-1)
        predictions.append(last_pred)
        # Use the last prediction and last hidden state as inputs to the next forward pass
        # Do this in a loop to predict the next `n_predictions` characters
        # YOUR CODE HERE
        for _ in range(n_predictions):
            logits, hidden = self.forward(last_pred, hidden)
            output = F.softmax(logits, dim=1)
            last_pred = torch.argmax(output).unsqueeze(-1)
            predictions.append(last_pred)
        return predictions

In [104]:
# YOUR CODE HERE
class LSTMNN(CharRNN):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=1):
        '''Initialize model parameters and layers.'''
        super().__init__(vocab_size, embedding_dim, hidden_dim)
        # YOUR CODE HERE
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=num_layers)

    def forward(self, tensor_data, hidden_state=None, c=None):
        '''Apply the forward pass for some text data already converted to tensor.'''
        # YOUR CODE HERE
        hidden, c = hidden_state, c
        embedding = self.embedding(tensor_data)
        if(c is None or hidden is None):
            output, (hidden, c) = self.lstm(embedding)
        else:
            output, (hidden, c) = self.lstm(embedding, (hidden_state, c))
        logits = self.dense(output)
        return logits, hidden, c

    def predict_argmax(self, context_tensor, n_predictions):
        # Apply the forward pass for the context tensor
        # Then, store the last prediction and last hidden state
        # YOUR CODE HERE
        predictions = []
        logits, hidden, c = self.forward(context_tensor)
        output = F.softmax(logits, dim=1)[-1]
        last_pred = torch.argmax(output)
        last_pred = torch.LongTensor(last_pred).unsqueeze(-1)
        predictions.append(last_pred)
        # Use the last prediction and last hidden state as inputs to the next forward pass
        # Do this in a loop to predict the next `n_predictions` characters
        # YOUR CODE HERE
        for _ in range(n_predictions):
            logits, hidden, c = self.forward(last_pred, hidden)
            output = F.softmax(logits, dim=1)
            last_pred = torch.argmax(output).unsqueeze(-1)
            predictions.append(last_pred)
        return predictions


In [105]:
vocab_size = len(ctoi)
embedding_dim = 32
hidden_dim = 256

n_epochs = 50

In [106]:
context_tensor = text_to_tensor(text_data[10:50], ctoi)

In [57]:
multilayer_rnn = CharRNN(vocab_size, embedding_dim, hidden_dim, num_layers=4)
optimizer = torch.optim.SGD(multilayer_rnn.parameters(), lr = 0.01)#, weight_decay=5e-3, momentum=0.9)

multilayer_rnn = fit(multilayer_rnn, text_dataset_input, text_dataset_output, dataset_size, n_epochs, optimizer)

Epoch 1, Loss: 6.422689914703369
Epoch 2, Loss: 6.2804951667785645
Epoch 3, Loss: 5.785146236419678
Epoch 4, Loss: 5.048469543457031
Epoch 5, Loss: 4.8103814125061035
Epoch 6, Loss: 4.400051116943359
Epoch 7, Loss: 4.671872615814209
Epoch 8, Loss: 4.374535083770752
Epoch 9, Loss: 3.7471237182617188
Epoch 10, Loss: 3.0568227767944336
Epoch 11, Loss: 2.755739450454712
Epoch 12, Loss: 2.3996903896331787
Epoch 13, Loss: 2.741488218307495
Epoch 14, Loss: 2.3396313190460205
Epoch 15, Loss: 1.7182997465133667
Epoch 16, Loss: 1.5921144485473633
Epoch 17, Loss: 1.4816151857376099
Epoch 18, Loss: 1.4913387298583984
Epoch 19, Loss: 1.1968955993652344
Epoch 20, Loss: 1.1413242816925049
Epoch 21, Loss: 1.11558997631073
Epoch 22, Loss: 1.139900803565979
Epoch 23, Loss: 1.0630971193313599
Epoch 24, Loss: 0.9689602851867676
Epoch 25, Loss: 0.9149562120437622
Epoch 26, Loss: 1.072037696838379
Epoch 27, Loss: 0.8277516961097717
Epoch 28, Loss: 0.7358985543251038
Epoch 29, Loss: 0.6720336675643921
Epoch 

In [58]:
print(tensor_to_text(multilayer_rnn.predict_argmax(context_tensor,200),itoc))

ChENENALE



L'ENNEER




LENENALE



LENENALE



LENENALE



LENENALE



LENENALE



LENENALE



LENENALE



LENENALE



LENENALE



LENENALE



LENENALE



LENENALE



LENENALE



LENENALE



LENENAL


In [59]:


multilayer_gru = GruNN(vocab_size, embedding_dim, hidden_dim, num_layers=2)
optimizer = torch.optim.SGD(multilayer_gru.parameters(), lr = 0.001)#, weight_decay=5e-3, momentum=0.9)
multilayer_gru = fit(multilayer_gru, text_dataset_input, text_dataset_output, dataset_size, n_epochs, optimizer)


Epoch 1, Loss: 4.153772830963135
Epoch 2, Loss: 4.206570148468018
Epoch 3, Loss: 4.404646873474121
Epoch 4, Loss: 4.801566123962402
Epoch 5, Loss: 5.138530254364014
Epoch 6, Loss: 5.401076316833496
Epoch 7, Loss: 5.583259105682373
Epoch 8, Loss: 5.684397220611572
Epoch 9, Loss: 5.806121349334717
Epoch 10, Loss: 5.855288505554199
Epoch 11, Loss: 5.929509162902832
Epoch 12, Loss: 5.958600997924805
Epoch 13, Loss: 5.997044563293457
Epoch 14, Loss: 5.990281105041504
Epoch 15, Loss: 6.029736042022705
Epoch 16, Loss: 6.007689476013184
Epoch 17, Loss: 6.011616230010986
Epoch 18, Loss: 6.039679050445557
Epoch 19, Loss: 6.033377647399902
Epoch 20, Loss: 6.0259552001953125
Epoch 21, Loss: 6.030393600463867
Epoch 22, Loss: 6.027670383453369
Epoch 23, Loss: 6.024017810821533
Epoch 24, Loss: 5.9795403480529785
Epoch 25, Loss: 5.9704694747924805
Epoch 26, Loss: 6.000383377075195
Epoch 27, Loss: 5.949653625488281
Epoch 28, Loss: 5.983710765838623
Epoch 29, Loss: 5.975042819976807
Epoch 30, Loss: 5.96

In [101]:
print(tensor_to_text(multilayer_gru.predict_argmax(context_tensor,200),itoc))

u le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le le l


In [98]:
def fit_lstm(model, text_dataset_input, text_dataset_output, dataset_size,n_epochs, optimizer):
  start_idx = torch.randint(0,int(seq_len / 2),(1,)).item() #randomly select a start_idx
  hidden = None
  for iter in range(1, n_epochs + 1):
        indices = torch.randperm(dataset_size) #randomly permute the indices
        for i in range(dataset_size): 
          logits, hidden, c = model(text_dataset_input[i][start_idx:], hidden)
          hidden[0] = hidden[0].detach()
          hidden[1] = hidden[1].detach() #Once we update the hidden state we need to detach it, to not backpropagate through it in the next batch
          loss = criterion(logits, text_dataset_output[i][start_idx:])
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()

        
        print(f'Epoch {iter}, Loss: {loss.item()}')
  return model

In [None]:
multilayer_lstm = LSTMNN(vocab_size, embedding_dim, hidden_dim, num_layers=4)
optimizer = torch.optim.SGD(multilayer_lstm.parameters(), lr = 0.01)#, weight_decay=5e-3, momentum=0.9)

multilayer_lstm = fit_lstm(multilayer_lstm, text_dataset_input, text_dataset_output, dataset_size, n_epochs, optimizer)

Epoch 1, Loss: 4.678082466125488
Epoch 2, Loss: 5.584222316741943
Epoch 3, Loss: 6.010735988616943
Epoch 4, Loss: 6.216280460357666
Epoch 5, Loss: 6.331568241119385
Epoch 6, Loss: 6.400148868560791
Epoch 7, Loss: 6.44180965423584
Epoch 8, Loss: 6.467369079589844
Epoch 9, Loss: 6.483094692230225
Epoch 10, Loss: 6.492684841156006
Epoch 11, Loss: 6.498358726501465
Epoch 12, Loss: 6.501486778259277
Epoch 13, Loss: 6.502933502197266
Epoch 14, Loss: 6.503262519836426
Epoch 15, Loss: 6.502846717834473
Epoch 16, Loss: 6.501940727233887


In [107]:
multilayer_lstm.predict_argmax(context_tensor,200)
print(tensor_to_text(multilayer_lstm.predict_argmax(context_tensor,200),itoc))

ValueError: too many values to unpack (expected 2)

YOUR ANSWER HERE