# IS319 - Deep Learning

## TP3 - Recurrent neural networks

Credits: Andrej Karpathy

The goal of this TP is to experiment with recurrent neural networks for a character-level language model to generate text that looks like training text data.

In [112]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 1. Text data preprocessing

Several text datasets are provided, feel free to experiment with different ones throughout the TP. At the beginning, use a small subset of a given dataset (for example use only 10k characters).

In [113]:
!tar -xvf text-datasets.tgz

tar: Error opening archive: Failed to open 'text-datasets.tgz'


In [114]:
text_data_fname = 'baudelaire.txt'  # ~0.1m characters (French)
# text_data_fname = 'proust.txt'      # ~7.3m characters (French)
# text_data_fname = 'shakespeare.txt' # ~0.1m characters (English)
# text_data_fname = 'lotr.txt'        # ~2.5m characters (English)
# text_data_fname = 'doom.c'          # ~1m characters (C Code)
# text_data_fname = 'linux.c'         # ~11.5m characters (C code)

text_data = open(text_data_fname, 'r',encoding="utf8").read()
text_data = text_data[:10000] # use a small subset
print(f'Dataset `{text_data_fname}` contains {len(text_data)} characters.')
print('Excerpt of the dataset:')
print(text_data[:2000])

Dataset `baudelaire.txt` contains 10000 characters.
Excerpt of the dataset:
LES FLEURS DU MAL

par

CHARLES BAUDELAIRE


AU LECTEUR


La sottise, l'erreur, le péché, la lésine,
Occupent nos esprits et travaillent nos corps,
Et nous alimentons nos aimables remords,
Comme les mendiants nourrissent leur vermine.

Nos péchés sont têtus, nos repentirs sont lâches,
Nous nous faisons payer grassement nos aveux,
Et nous rentrons gaîment dans le chemin bourbeux,
Croyant par de vils pleurs laver toutes nos taches.

Sur l'oreiller du mal c'est Satan Trismégiste
Qui berce longuement notre esprit enchanté,
Et le riche métal de notre volonté
Est tout vaporisé par ce savant chimiste.

C'est le Diable qui tient les fils qui nous remuent!
Aux objets répugnants nous trouvons des appas;
Chaque jour vers l'Enfer nous descendons d'un pas,
Sans horreur, à travers des ténèbres qui puent.

Ainsi qu'un débauché pauvre qui baise et mange
Le sein martyrisé d'une antique catin,
Nous volons au passage un plaisir c

**(Question)** Create a character-level vocabulary for your text data. Create two dictionaries: `ctoi` mapping each character to an index, and the reverse `itoc` mapping each index to its corresponding character. Implement the functions to convert text to tensor and tensor to text using these mappings. Apply these functions to some text data.

In [115]:
# Create the vocabulary and the two mapping dictionaries
# YOUR CODE HERE
import numpy as np

idx = 0
ctoi = {}
itoc = {}

voc = set(text_data)

for elt in voc:
    ctoi[elt] = idx
    itoc[idx] = elt
    idx += 1

print(ctoi)
print(itoc)
print(len(ctoi))
# Implement the function converting text to tensor
def text_to_tensor(text, ctoi):
    # YOUR CODE HERE
    return torch.LongTensor(np.array([ctoi[c] for c in text]))


# Implement the function converting tensor to text
def tensor_to_text(tensor, itoc):
    # YOUR CODE HERE
    return ''.join([itoc[elt.item()] for elt in tensor])#torch.argmax(tensor, dim=2)])

# Apply your functions to some text data
# YOUR CODE HERE
#raise NotImplementedError()
a = text_to_tensor(text_data[:10], ctoi)

print(a)
print(tensor_to_text(a, itoc))



{'z': 0, 'J': 1, '?': 2, 'H': 3, 'ç': 4, 'y': 5, 'N': 6, 'I': 7, 'c': 8, 'n': 9, 'L': 10, 'i': 11, 'b': 12, 'î': 13, 'k': 14, 'o': 15, 'E': 16, 'à': 17, 'D': 18, 'g': 19, 'R': 20, 'P': 21, 'l': 22, 'é': 23, ';': 24, 'û': 25, 'Q': 26, '!': 27, "'": 28, 'ù': 29, ' ': 30, 'â': 31, 'j': 32, 's': 33, '.': 34, 'è': 35, 'd': 36, 'É': 37, '»': 38, 'A': 39, 'v': 40, 'B': 41, 'O': 42, 'f': 43, ',': 44, 'q': 45, 'ê': 46, 't': 47, 'G': 48, '«': 49, 'W': 50, 'ô': 51, 'h': 52, 'e': 53, 'x': 54, '-': 55, 'S': 56, 'U': 57, 'F': 58, 'p': 59, 'a': 60, '\n': 61, 'r': 62, 'C': 63, 'M': 64, 'u': 65, 'm': 66, ':': 67, 'V': 68, '_': 69, 'T': 70}
{0: 'z', 1: 'J', 2: '?', 3: 'H', 4: 'ç', 5: 'y', 6: 'N', 7: 'I', 8: 'c', 9: 'n', 10: 'L', 11: 'i', 12: 'b', 13: 'î', 14: 'k', 15: 'o', 16: 'E', 17: 'à', 18: 'D', 19: 'g', 20: 'R', 21: 'P', 22: 'l', 23: 'é', 24: ';', 25: 'û', 26: 'Q', 27: '!', 28: "'", 29: 'ù', 30: ' ', 31: 'â', 32: 'j', 33: 's', 34: '.', 35: 'è', 36: 'd', 37: 'É', 38: '»', 39: 'A', 40: 'v', 41: 'B', 

## 2. Setup a character-level recurrent neural network

**(Question)** Setup a simple embedding layer with `nn.Embedding` to project character indices to `embedding_dim` dimensional vectors. Explain precisely how this layer works and what are its outputs for a given input sequence.

In [156]:
# YOUR CODE HERE
embedding_dim = 5
vocab_size = len(ctoi)
embedding = nn.Embedding(vocab_size,embedding_dim)

emb = embedding(text_to_tensor(text_data[:8], ctoi))

The nn.Embedding layer is designed to embed categorical variables, such as characters or words within a vocabulary, into continuous vectors of a specified dimension (embedding_dim). In the context of an input sequence, the layer transforms the sequence of indices into a tensor of size (len(input_seq), embedding_dim). Each row of this tensor represents the embedding of the corresponding index in the input sequence, projecting it into an embedding_dim-dimensional space.

In the example above, the initial representation is randomized. However, when a learning model is applied, the weights of this layer are adjusted through training to capture textual relationships that cannot be represented using inddices in the vocabulary. As the model learns, the embedding vectors adapt to represent semantic connections between the inputs. For instance, characters or words that frequently co-occur in the training set will have proximate representations in the embedding space.

This dynamic adjustment of weights during training allows the nn.Embedding layer to create embeddings that encode meaningful relationships between categorical inputs, enhancing the model's ability to understand and generalize from the input data.

**(Question)** Setup a single-layer RNN with `nn.RNN` (without defining a custom class). Use `hidden_dim` size for hidden states. Explain precisely the outputs of this layer for a given input sequence.

In [158]:
# YOUR CODE HERE
hidden_dim = 20
rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=2)
output, hidden = rnn(emb)

print(output.size())
print(hidden.size())

torch.Size([8, 20])
torch.Size([2, 20])


For a given input sequence, the nn.RNN layer returns two tensors. The first tensor, named 'output', is of size (len(input), hidden_size). This tensor stores the output of the RNN layer at each time step in the input sequence. Each row of the 'output' tensor represents the hidden state of the RNN at a specific time step.

The second tensor, named 'hidden', is of size (num_layers, hidden_dim). This tensor stores the final hidden state of the RNN. In the context of 'hidden' contains the summarization of information from the entire input sequence in each layer of the RNN. It serves as a compact representation that captures the essential features learned by the RNN during the processing of the input sequence.

**(Question)** Create a simple RNN model with a custom `nn.Module` class. It should contain: an embedding layer, a single-layer RNN, and a dense output layer. For each character of the input sequence, the model should predict the probability of the next character. The forward method should return the probabilities for next characters and the corresponding hidden states.
After completing the class, create a model and apply the forward pass on some input text. Understand and explain the results.

*Note:* depending on how you implement the loss function later, it can be convenient to return logits instead of probabilities, i.e. raw values of the output layer before any activation function.

In [140]:
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=1):
        '''Initialize model parameters and layers.'''
        super().__init__()
        # YOUR CODE HERE
        self.embedding = nn.Embedding(vocab_size,embedding_dim)
        self.rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=num_layers)
        self.dense = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tensor_data, hidden_state=None):
        '''Apply the forward pass for some text data already converted to tensor.'''
        # YOUR CODE HERE
        embedding = self.embedding(tensor_data)
        output, hidden = self.rnn(embedding, hidden_state)
        logits = self.dense(output)
        return logits, hidden

# Initialize a model and apply the forward pass on some input text
# YOUR CODE HERE


vocab_size = len(ctoi)
embedding_dim = 200
hidden_dim = 100

charRNN = CharRNN(vocab_size,embedding_dim,hidden_dim)
logits, _ = charRNN.forward(text_to_tensor(text_data[:1],ctoi))

print(text_data[0])
print(logits.shape)

L
torch.Size([1, 71])


**(Question)** Implement a simple training loop to overfit on a small input sequence. The loss function should be a categorical cross entropy on the predicted characters. Monitor the loss function value over the iterations.

In [119]:
# Sample a small input sequence into tensor `input_seq` and store its corresponding expected sequence into tensor `target_seq`
# YOUR CODE HERE
vocab_size = len(ctoi)
embedding_dim = 20
hidden_dim = 100
input_seq = text_to_tensor(text_data[:10], ctoi)
input_seq, last_seq = input_seq[:-2], input_seq[-1:]
print(input_seq, last_seq)

target_seq = torch.cat([input_seq[1:], last_seq])

criterion = nn.CrossEntropyLoss()

print(input_seq.shape, target_seq.shape)

# Implement a training loop overfitting an input sequence and monitoring the loss function
def train_overfit(model, input_seq, target_seq, n_iters=200, learning_rate=0.02):
    # YOUR CODE HERE
    optimizer = torch.optim.SGD(model.parameters(), lr = learning_rate, weight_decay=5e-3, momentum=0.9)
    hidden = None

    for iter in range(1, n_iters + 1):
      logits, hidden = model.forward(input_seq, hidden)
      hidden = hidden.detach() #Once we update the hidden state we need to detach it, to not backpropagate through it in the next batch
      #output = F.softmax(logits, dim=1)
      loss = criterion(logits, target_seq)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      if iter % 10 == 0:
          print(f'Iteration {iter}/{n_iters}, Loss: {loss.item()}')



# Initialize a model and make it overfit the input sequence
# YOUR CODE HERE

charRNN = CharRNN(vocab_size, embedding_dim, hidden_dim)
train_overfit(charRNN, input_seq, target_seq)

tensor([10, 16, 56, 30, 58, 10, 16, 57]) tensor([56])
torch.Size([8]) torch.Size([8])
Iteration 10/200, Loss: 2.2388200759887695
Iteration 20/200, Loss: 0.18988797068595886
Iteration 30/200, Loss: 0.04161001369357109
Iteration 40/200, Loss: 0.009962371550500393
Iteration 50/200, Loss: 0.004840867128223181
Iteration 60/200, Loss: 0.003572132671251893
Iteration 70/200, Loss: 0.003282556077465415
Iteration 80/200, Loss: 0.0032899973448365927
Iteration 90/200, Loss: 0.0034065735526382923
Iteration 100/200, Loss: 0.003574709640815854
Iteration 110/200, Loss: 0.003773884382098913
Iteration 120/200, Loss: 0.003994535654783249
Iteration 130/200, Loss: 0.004231789615005255
Iteration 140/200, Loss: 0.004481961950659752
Iteration 150/200, Loss: 0.0047417981550097466
Iteration 160/200, Loss: 0.005008433014154434
Iteration 170/200, Loss: 0.005278781056404114
Iteration 180/200, Loss: 0.00554983364418149
Iteration 190/200, Loss: 0.005818541627377272
Iteration 200/200, Loss: 0.006082095205783844


**(Question)** Implement a `predict_argmax` method for your `RNN` model. Then, verify your overfitting: use some characters of your input sequence as context to predict the remaining ones. Experiment with the current model and analyze the results.

In [143]:
class CharRNN(CharRNN):
    def predict_argmax(self, context_tensor, n_predictions):
        # Apply the forward pass for the context tensor
        # Then, store the last prediction and last hidden state
        # YOUR CODE HERE
        predictions = []
        logits, hidden = self.forward(context_tensor)
        output = F.softmax(logits, dim=1)[-1]
        last_pred = torch.argmax(output)
        last_pred = torch.LongTensor(last_pred).unsqueeze(-1)
        predictions.append(last_pred)
        # Use the last prediction and last hidden state as inputs to the next forward pass
        # Do this in a loop to predict the next `n_predictions` characters
        # YOUR CODE HERE
        for _ in range(n_predictions):
            logits, hidden = self.forward(last_pred, hidden)
            output = F.softmax(logits, dim=1)
            last_pred = torch.argmax(output).unsqueeze(-1)
            predictions.append(last_pred)
        return predictions

overfit_data = "hello world!"
target_overfit = "ello world! "
voc_ = set(overfit_data)

ctoi_ = {}
itoc_ = {}

idx_ = 0

for elt in voc_:
    ctoi_[elt] = idx_
    itoc_[idx_] = elt
    idx_ += 1
# Initialize a model and make it overfit as above
# Then, verify your overfitting by predicting characters given some context
# YOUR CODE HERE

vocab_size = len(ctoi_)
embedding_dim = 20
hidden_dim = 32


input_seq = text_to_tensor(overfit_data, ctoi_)
target_seq = text_to_tensor(target_overfit,ctoi_)

print(input_seq, target_seq)
charRNN = CharRNN(vocab_size, embedding_dim, hidden_dim)
train_overfit(charRNN, input_seq, target_seq)

print("hello" + tensor_to_text(charRNN.predict_argmax(text_to_tensor("hello",ctoi_),7),itoc_))


tensor([1, 2, 3, 3, 8, 7, 4, 8, 6, 3, 0, 5]) tensor([2, 3, 3, 8, 7, 4, 8, 6, 3, 0, 5, 7])
Iteration 10/200, Loss: 1.6787503957748413
Iteration 20/200, Loss: 0.9198446869850159
Iteration 30/200, Loss: 0.4029098451137543
Iteration 40/200, Loss: 0.18057680130004883
Iteration 50/200, Loss: 0.09902171045541763
Iteration 60/200, Loss: 0.06560871750116348
Iteration 70/200, Loss: 0.04994722083210945
Iteration 80/200, Loss: 0.04155581817030907
Iteration 90/200, Loss: 0.03648228570818901
Iteration 100/200, Loss: 0.03309526666998863
Iteration 110/200, Loss: 0.030664006248116493
Iteration 120/200, Loss: 0.028828440234065056
Iteration 130/200, Loss: 0.027393845841288567
Iteration 140/200, Loss: 0.026245824992656708
Iteration 150/200, Loss: 0.025311579927802086
Iteration 160/200, Loss: 0.024541625753045082
Iteration 170/200, Loss: 0.02390090562403202
Iteration 180/200, Loss: 0.023363424465060234
Iteration 190/200, Loss: 0.022909604012966156
Iteration 200/200, Loss: 0.02252424694597721
hello world! 


The model trained on "hello world!" manages to overfit on this sequence, and when asked to generate text given "hello" as a context, the model manages to predict "world!"

Using the argmax function to predict the next character can yield a deterministic generator always predicting the same characters. Instead, it is common to predict the next character by sampling from the distribution of output predictions, adding some randomness into the generator.

**(Question)** Implement a `predict_proba` method for your `RNN` model. It should be very similar to `predict_argmax`, but instead of using argmax, it should randomly sample from the output predictions. To do that, you can use the `torch.distribution.Categorical` class and its `sample()` method. Verify that your method correctly added some randomness.

In [121]:
from torch.distributions import Categorical


tensor = torch.tensor([0.06,0.04,0.3,0.2,0.1,0.05,0.09,0.06,0.1])
distribution = Categorical(probs=tensor)
last_pred1 = distribution.sample()
last_pred2 = distribution.sample()
last_pred3 = distribution.sample()
argmax = tensor.argmax()


print("argmax : " ,argmax, tensor[argmax])
print("sample distribution : ", last_pred1, tensor[last_pred1])
print("sample distribution : ", last_pred2, tensor[last_pred2])
print("sample distribution : ", last_pred3, tensor[last_pred3])

argmax :  tensor(2) tensor(0.3000)
sample distribution :  tensor(2) tensor(0.3000)
sample distribution :  tensor(3) tensor(0.2000)
sample distribution :  tensor(3) tensor(0.2000)


In [160]:
class CharRNN(CharRNN):
    def predict_proba(self, input_context, n_predictions):
        # YOUR CODE HERE
        predictions = []
        logits, hidden = self.forward(input_context)
        output = F.softmax(logits, dim=1)[-1]
        from torch.distributions import Categorical
        distribution = Categorical(probs=output)
        last_pred = distribution.sample()
        #max_pred = torch.argmax(output)
        #print(last_pred, max_pred, max(output))
        last_pred = torch.LongTensor(last_pred).unsqueeze(-1)
        predictions.append(last_pred)
        # Use the last prediction and last hidden state as inputs to the next forward pass
        # Do this in a loop to predict the next `n_predictions` characters
        # YOUR CODE HERE
        for _ in range(n_predictions):
            logits, hidden = self.forward(last_pred, hidden)
            output = F.softmax(logits, dim=1)
            #last_pred = torch.argmax(output).unsqueeze(-1)     
            distribution = Categorical(logits=output)
            last_pred = distribution.sample()
            predictions.append(last_pred)
        return predictions
        

vocab_size = len(ctoi_)
embedding_dim = 5
hidden_dim = 20

print(input_seq, target_seq)
charRNN = CharRNN(vocab_size, embedding_dim, hidden_dim)
train_overfit(charRNN, input_seq, target_seq)

print("hello" + tensor_to_text(charRNN.predict_argmax(text_to_tensor(overfit_data,ctoi_),7),itoc_))


tensor([1, 2, 3, 3, 8, 7, 4, 8, 6, 3, 0, 5]) tensor([2, 3, 3, 8, 7, 4, 8, 6, 3, 0, 5, 7])
Iteration 10/200, Loss: 1.8943047523498535
Iteration 20/200, Loss: 1.4104305505752563
Iteration 30/200, Loss: 0.9544656276702881
Iteration 40/200, Loss: 0.6130251288414001
Iteration 50/200, Loss: 0.3648529052734375
Iteration 60/200, Loss: 0.22753620147705078
Iteration 70/200, Loss: 0.15725648403167725
Iteration 80/200, Loss: 0.11885526031255722
Iteration 90/200, Loss: 0.09642663598060608
Iteration 100/200, Loss: 0.08206096291542053
Iteration 110/200, Loss: 0.07215886563062668
Iteration 120/200, Loss: 0.06494277715682983
Iteration 130/200, Loss: 0.05945123732089996
Iteration 140/200, Loss: 0.055133964866399765
Iteration 150/200, Loss: 0.05165635421872139
Iteration 160/200, Loss: 0.048802196979522705
Iteration 170/200, Loss: 0.04642455652356148
Iteration 180/200, Loss: 0.044419318437576294
Iteration 190/200, Loss: 0.0427105575799942
Iteration 200/200, Loss: 0.04124153032898903
hello world! 


## 3. Train the RNN model on text data

**(Question)** Adapt your previous code to implement a proper training loop for a text dataset. To do so, we need to specify a sequence length `seq_len`, acting similarly to the batch size in classic neural networks. Then, you can either randomly sample sequences of length `seq_len` from the text dataset over `n_iters` iterations, or properly loop over the text dataset for `n_epochs` epochs (with a random starting point for each epoch to ensure different sequences), to make sure the whole dataset is seen by the model. Feel free to adjust training and model parameters empirically. Start with a small model and a small subset of the text dataset, then move on to larger experiments. Remember to use GPU if available.

In [185]:
# Create the text dataset, compute its mappings and convert it to tensor
# YOUR CODE HERE
seq_len = 10
dataset_size = int(len(text_data) / seq_len) + 1
text_dataset_input = torch.empty((dataset_size, seq_len)).long()
text_dataset_output = torch.empty((dataset_size, seq_len)).long()
for i in range(1,len(text_data), seq_len):
    target_input = text_to_tensor(text_data[i-1:i-1+seq_len], ctoi)
    target_output = text_to_tensor(text_data[i:i + seq_len], ctoi)
    if target_input.size(dim=0) != seq_len:
        pad = torch.cat([torch.tensor([len(ctoi) - 1]) for _ in range(seq_len - target_input.size(dim=0))])
        target_input = torch.cat((target_input,pad))
    if target_output.size(dim=0) != seq_len:
        pad = torch.cat([torch.tensor([len(ctoi) - 1]) for _ in range(seq_len - target_output.size(dim=0))])
        target_output = torch.cat((target_output,pad))
    text_dataset_input[int(i / seq_len)] = target_input
    text_dataset_output[int(i / seq_len)] = target_output
#print(text_to_tensor(text_data[i:i+seq_len],ctoi))
print(text_dataset_input[0])

# Initialize training parameters
# YOUR CODE HERE
vocab_size = len(ctoi)
embedding_dim = 32
hidden_dim = 256
n_epochs = 100
# Initialize a character-level RNN model
# YOUR CODE HERE
textRNN = CharRNN(vocab_size, embedding_dim, hidden_dim)

optimizer = torch.optim.SGD(textRNN.parameters(), lr = 0.01)#, weight_decay=5e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# Setup the training loop
# Regularly record the loss and sample from the model to monitor what is happening
# YOUR CODE HERE
def fit(model, text_dataset_input, text_dataset_output, dataset_size,n_epochs, optimizer):
  hidden = None
  for iter in range(1, n_epochs + 1):
    start_idx = torch.randint(0,int(seq_len / 2),(1,)).item() #randomly select a start_idx
    for i in range(dataset_size): 
      logits, hidden = model(text_dataset_input[i][start_idx:],hidden)
      hidden = hidden.detach() #Once we update the hidden state we need to detach it, to not backpropagate through it in the next batch
      loss = criterion(logits, text_dataset_output[i][start_idx:])
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      
    print(f'Epoch {iter}, Loss: {loss.item()}')
  return model

textRNN = fit(textRNN, text_dataset_input, text_dataset_output, dataset_size, n_epochs, optimizer)
      

tensor([10, 16, 56, 30, 58, 10, 16, 57, 20, 56])
Epoch 1, Loss: 6.495410442352295
Epoch 2, Loss: 5.965396404266357
Epoch 3, Loss: 5.257569789886475
Epoch 4, Loss: 4.40095853805542
Epoch 5, Loss: 3.839010715484619
Epoch 6, Loss: 3.49481463432312
Epoch 7, Loss: 3.144225835800171
Epoch 8, Loss: 2.70739483833313
Epoch 9, Loss: 2.3990097045898438
Epoch 10, Loss: 1.7991336584091187
Epoch 11, Loss: 1.300616979598999
Epoch 12, Loss: 1.2830010652542114
Epoch 13, Loss: 0.8495723009109497
Epoch 14, Loss: 0.7639928460121155
Epoch 15, Loss: 0.6662375330924988
Epoch 16, Loss: 0.586346447467804
Epoch 17, Loss: 0.5872800946235657
Epoch 18, Loss: 0.3733738660812378
Epoch 19, Loss: 0.4009561240673065
Epoch 20, Loss: 0.29682981967926025
Epoch 21, Loss: 0.29772889614105225
Epoch 22, Loss: 0.25221002101898193
Epoch 23, Loss: 0.3406127393245697
Epoch 24, Loss: 0.2938058376312256
Epoch 25, Loss: 0.3220052421092987
Epoch 26, Loss: 0.22417937219142914
Epoch 27, Loss: 0.27561044692993164
Epoch 28, Loss: 0.20409

**(Question)** From your trained model, play around with its predictions: start with a custom input sequence and ask the model to predict the rest. Analyze and comment your results.

In [186]:
context_tensor = text_to_tensor(text_data[100:150],ctoi)
print(tensor_to_text(textRNN.predict_argmax(context_tensor,200),itoc))


 que toursasier la palerinations de la pares
Qui palais,
Ces palais,
Ces palais,
Ces palais,
Ces palais,
Ces palais,
Ces palais,
Ces palais,
Ces palais,
Ces palais,
Ces palais,
Ces palais,
Ces palais,



YOUR ANSWER HERE

## 4. Experiment with different RNN architectures

**(Question)** Experiment with different RNN architecures. Potential ideas are multi-layer RNNs, GRUs and LSTMs. All models can be extended to multi-layer using the `num_layers` parameter. Analyze and comment your results.

In [125]:
# YOUR CODE HERE
class GruNN(CharRNN):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=1):
        '''Initialize model parameters and layers.'''
        super().__init__(vocab_size, embedding_dim, hidden_dim, num_layers=1)
        # YOUR CODE HERE
        self.gru = nn.GRU(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=num_layers)

    def forward(self, tensor_data, hidden_state=None):
        '''Apply the forward pass for some text data already converted to tensor.'''
        # YOUR CODE HERE
        embedding = self.embedding(tensor_data)
        output, hidden = self.gru(embedding, hidden_state)
        logits = self.dense(output)
        return logits, hidden

In [176]:
# YOUR CODE HERE
class LSTMNN(CharRNN):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=1):
        '''Initialize model parameters and layers.'''
        super().__init__(vocab_size, embedding_dim, hidden_dim)
        # YOUR CODE HERE
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=num_layers)

    def forward(self, tensor_data, hidden_state=None, c=None):
        '''Apply the forward pass for some text data already converted to tensor.'''
        # YOUR CODE HERE
        embedding = self.embedding(tensor_data)
        output, (hidden, c) = self.lstm(embedding, hidden_state, c)
        logits = self.dense(output)
        return logits, hidden, c


In [169]:
vocab_size = len(ctoi)
embedding_dim = 32
hidden_dim = 256

n_epochs = 50

In [170]:
context_tensor = text_to_tensor(text_data[10:50], ctoi)

In [171]:
multilayer_rnn = CharRNN(vocab_size, embedding_dim, hidden_dim, num_layers=4)
optimizer = torch.optim.SGD(multilayer_rnn.parameters(), lr = 0.01)#, weight_decay=5e-3, momentum=0.9)

multilayer_rnn = fit(multilayer_rnn, text_dataset_input, text_dataset_output, dataset_size, n_epochs, optimizer)

Epoch 1, Loss: 5.897712707519531
Epoch 2, Loss: 4.4717888832092285
Epoch 3, Loss: 4.11949348449707
Epoch 4, Loss: 4.1267828941345215
Epoch 5, Loss: 3.3360939025878906
Epoch 6, Loss: 2.8215410709381104
Epoch 7, Loss: 3.0039632320404053
Epoch 8, Loss: 1.9431304931640625
Epoch 9, Loss: 1.7728513479232788
Epoch 10, Loss: 1.508933663368225
Epoch 11, Loss: 1.7067683935165405
Epoch 12, Loss: 0.9359719753265381
Epoch 13, Loss: 0.8309974670410156
Epoch 14, Loss: 1.137855052947998
Epoch 15, Loss: 0.6840835213661194
Epoch 16, Loss: 0.6545213460922241
Epoch 17, Loss: 0.5482220649719238
Epoch 18, Loss: 0.5534723401069641
Epoch 19, Loss: 0.5304295420646667
Epoch 20, Loss: 0.4010707437992096
Epoch 21, Loss: 0.5849392414093018
Epoch 22, Loss: 0.31918802857398987
Epoch 23, Loss: 0.4908319413661957
Epoch 24, Loss: 0.3148772716522217
Epoch 25, Loss: 0.3190081715583801
Epoch 26, Loss: 0.32696613669395447
Epoch 27, Loss: 0.3534165918827057
Epoch 28, Loss: 0.21998031437397003
Epoch 29, Loss: 0.2793371677398

In [172]:
print(tensor_to_text(multilayer_rnn.predict_argmax(context_tensor,200),itoc))

CEnnale violets?

Range
Rembragé par un boit et la palais,
Et de chont leurs par de vine lieu des puissants nous puissants nous puissants nous puissants nous puissants nous puissants nous puissants nou


In [173]:


multilayer_gru = GruNN(vocab_size, embedding_dim, hidden_dim, num_layers=2)
optimizer = torch.optim.SGD(multilayer_gru.parameters(), lr = 0.001)#, weight_decay=5e-3, momentum=0.9)
multilayer_gru = fit(multilayer_gru, text_dataset_input, text_dataset_output, dataset_size, n_epochs, optimizer)


Epoch 1, Loss: 4.27915096282959
Epoch 2, Loss: 4.315567970275879
Epoch 3, Loss: 4.445417881011963
Epoch 4, Loss: 4.556044101715088
Epoch 5, Loss: 4.79245138168335
Epoch 6, Loss: 4.742119789123535
Epoch 7, Loss: 4.783855438232422
Epoch 8, Loss: 4.859288215637207
Epoch 9, Loss: 5.052785873413086
Epoch 10, Loss: 4.726766109466553
Epoch 11, Loss: 4.696249485015869
Epoch 12, Loss: 5.026475429534912
Epoch 13, Loss: 4.8592209815979
Epoch 14, Loss: 4.640098571777344
Epoch 15, Loss: 4.775290489196777
Epoch 16, Loss: 4.472657203674316
Epoch 17, Loss: 4.677436351776123
Epoch 18, Loss: 4.522556304931641
Epoch 19, Loss: 4.389292240142822
Epoch 20, Loss: 4.524266719818115
Epoch 21, Loss: 4.369021892547607
Epoch 22, Loss: 4.424742698669434
Epoch 23, Loss: 4.375044345855713
Epoch 24, Loss: 4.145031452178955
Epoch 25, Loss: 4.1023783683776855
Epoch 26, Loss: 3.9952476024627686
Epoch 27, Loss: 4.200692176818848
Epoch 28, Loss: 4.059335708618164
Epoch 29, Loss: 3.9473016262054443
Epoch 30, Loss: 3.986573

In [174]:
print(tensor_to_text(multilayer_gru.predict_argmax(context_tensor,200),itoc))

 an de les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les les le


In [183]:
def fit_lstm(model, text_dataset_input, text_dataset_output, dataset_size,n_epochs, optimizer):
  hidden = None
  c = None
  for iter in range(1, n_epochs + 1):
    start_idx = torch.randint(0,int(seq_len / 2),(1,)).item() #randomly select a start_idx
    for i in range(dataset_size): 
      logits, hidden, c = model(text_dataset_input[i][start_idx:], hidden, c)
      print(hidden.size())
      hidden = hidden.detach()
      c = c.detach() #Once we update the hidden state we need to detach it, to not backpropagate through it in the next batch
      loss = criterion(logits, text_dataset_output[i][start_idx:])
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

        
    print(f'Epoch {iter}, Loss: {loss.item()}')
  return model

In [184]:
multilayer_lstm = LSTMNN(vocab_size, embedding_dim, hidden_dim, num_layers=4)
optimizer = torch.optim.SGD(multilayer_lstm.parameters(), lr = 0.01)#, weight_decay=5e-3, momentum=0.9)

multilayer_lstm = fit_lstm(multilayer_lstm, text_dataset_input, text_dataset_output, dataset_size, n_epochs, optimizer)

TypeError: LSTM.forward() takes from 2 to 3 positional arguments but 4 were given

In [134]:
print(tensor_to_text(multilayer_lstm.predict_argmax(context_tensor,200),itoc))

TypeError: LSTM.forward() takes from 2 to 3 positional arguments but 4 were given

YOUR ANSWER HERE