# IS319 - Deep Learning

## TP3 - Recurrent neural networks

Credits: Andrej Karpathy

The goal of this TP is to experiment with recurrent neural networks by building a character-level language model that generates text resembling its training data.

In [6]:
import torch
import torch.nn as nn
import torch.optim as optimizer
import numpy as np
import torch.nn.functional as F
import torch.distributions as distributions
import matplotlib.pyplot as plt
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available() # For macOS
    else "cpu"
)
print(f'Using {device}')

Using cuda


## 1. Text data preprocessing

Several text datasets are provided. Feel free to experiment with different ones throughout the TP. To begin with, use a small subset of a given dataset (for example, only the first 10k characters).

In [7]:
dir_datasets = "./"
# text_data_fname = 'baudelaire.txt'  # ~0.1m characters (French)
# text_data_fname = 'proust.txt'      # ~7.3m characters (French)
text_data_fname = 'shakespeare.txt' # ~0.1m characters (English)
# text_data_fname = 'lotr.txt'        # ~2.5m characters (English)
# text_data_fname = 'doom.c'          # ~1m characters (C Code)
# text_data_fname = 'linux.c'         # ~11.5m characters (C code)

# Read the whole file; the context manager closes it afterwards
with open(dir_datasets + text_data_fname, 'r') as f:
    text_data = f.read()
text_data = text_data[:10000] # use a small subset
print(f'Dataset `{text_data_fname}` contains {len(text_data)} characters.')
print('Excerpt of the dataset:')
print(text_data[:2000])

Dataset `shakespeare.txt` contains 10000 characters.
Excerpt of the dataset:
    SONNETS



TO THE ONLY BEGETTER OF
THESE INSUING SONNETS
MR. W. H. ALL HAPPINESS
AND THAT ETERNITY
PROMISED BY
OUR EVER-LIVING POET WISHETH
THE WELL-WISHING
ADVENTURER IN
SETTING FORTH
T. T.


I.

FROM fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light'st flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel.
Thou that art now the world's fresh ornament
And only herald to the gaudy spring,
Within thine own bud buriest thy content
And, tender churl, makest waste in niggarding.
  Pity the world, or else this glutton be,
  To eat the world's due, by the grave and thee.

II.

When forty winters shall beseige thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's p

**(Question)** Create a character-level vocabulary for your text data. Create two dictionaries: `ctoi` mapping each character to an index, and the reverse `itoc` mapping each index to its corresponding character. Implement the functions to convert text to tensor and tensor to text using these mappings. Apply these functions to some text data.

In [8]:
# Create the vocabulary and the two mapping dictionaries
def create_vocab(text):
    vocab, ctoi, itoc = [], {}, {}
    for character in text:
        if character not in ctoi:  # dictionary lookup instead of a list scan
            index = len(vocab)  # next free index, in order of first appearance
            vocab.append(character)
            ctoi[character] = index
            itoc[index] = character
    return vocab, ctoi, itoc


# Implement the function converting text to tensor
def text_to_tensor(text, ctoi):
    return torch.tensor([ctoi[char] for char in text], dtype=torch.long)


# Implement the function converting tensor to text
def tensor_to_text(tensor, itoc):
    # .tolist() yields plain Python ints, which are the keys of itoc
    return "".join(itoc[index] for index in tensor.cpu().tolist())

# Apply your functions to some text data

vocab, ctoi, itoc = create_vocab(text_data)

print(vocab)
print(ctoi)
print(itoc)

example_tensor = text_to_tensor(text_data, ctoi)
print(example_tensor)

example_text = tensor_to_text(example_tensor, itoc)
# verify integrity
assert example_text == text_data

[' ', 'S', 'O', 'N', 'E', 'T', '\n', 'H', 'L', 'Y', 'B', 'G', 'R', 'F', 'I', 'U', 'M', '.', 'W', 'A', 'P', 'D', 'V', '-', 'f', 'a', 'i', 'r', 'e', 's', 't', 'c', 'u', 'w', 'd', 'n', ',', 'h', 'b', 'y', "'", 'o', 'm', 'g', 'v', 'p', 'l', ':', 'k', 'z', 'x', '!', ';', '?', 'C', 'q', 'j', 'X']
{' ': 0, 'S': 1, 'O': 2, 'N': 3, 'E': 4, 'T': 5, '\n': 6, 'H': 7, 'L': 8, 'Y': 9, 'B': 10, 'G': 11, 'R': 12, 'F': 13, 'I': 14, 'U': 15, 'M': 16, '.': 17, 'W': 18, 'A': 19, 'P': 20, 'D': 21, 'V': 22, '-': 23, 'f': 24, 'a': 25, 'i': 26, 'r': 27, 'e': 28, 's': 29, 't': 30, 'c': 31, 'u': 32, 'w': 33, 'd': 34, 'n': 35, ',': 36, 'h': 37, 'b': 38, 'y': 39, "'": 40, 'o': 41, 'm': 42, 'g': 43, 'v': 44, 'p': 45, 'l': 46, ':': 47, 'k': 48, 'z': 49, 'x': 50, '!': 51, ';': 52, '?': 53, 'C': 54, 'q': 55, 'j': 56, 'X': 57}
{0: ' ', 1: 'S', 2: 'O', 3: 'N', 4: 'E', 5: 'T', 6: '\n', 7: 'H', 8: 'L', 9: 'Y', 10: 'B', 11: 'G', 12: 'R', 13: 'F', 14: 'I', 15: 'U', 16: 'M', 17: '.', 18: 'W', 19: 'A', 20: 'P', 21: 'D', 22: 

## 2. Setup a character-level recurrent neural network

**(Question)** Set up a simple embedding layer with `nn.Embedding` to project character indices to `embedding_dim`-dimensional vectors. Explain precisely how this layer works and what its outputs are for a given input sequence.

In [9]:
# n_vocab : the total number of unique indices that the embedding layer can handle.
# n_dim : the size of the vector space in which the indices will be embedded.
n_vocab, n_dim = len(vocab), 10

# initiate the Embedding layer
emb_layer = nn.Embedding(n_vocab, embedding_dim=n_dim)

# given the example tensor generate an embedding of each index (indirectly character) of the text
emb_data = emb_layer(example_tensor)

print(emb_data)

tensor([[-0.7388,  1.1008, -0.1606,  ...,  0.2317, -0.4473,  0.2614],
        [-0.7388,  1.1008, -0.1606,  ...,  0.2317, -0.4473,  0.2614],
        [-0.7388,  1.1008, -0.1606,  ...,  0.2317, -0.4473,  0.2614],
        ...,
        [-0.6019, -1.4626,  0.2987,  ..., -0.6729, -1.1158, -0.3574],
        [ 1.8359,  1.0572,  0.0029,  ...,  0.0080,  1.0674, -1.0970],
        [-0.6019, -1.4626,  0.2987,  ..., -0.6729, -1.1158, -0.3574]],
       grad_fn=<EmbeddingBackward0>)
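
One way to read the output above is that `nn.Embedding` behaves as a learnable lookup table: its weight matrix has one row per vocabulary entry, and the forward pass returns the row matching each input index. The sketch below checks this on toy sizes; the names `toy_emb`, `toy_indices` and the sizes 5 and 3 are illustrative assumptions, not values from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not the notebook's values)
toy_vocab_size, toy_dim = 5, 3
toy_emb = nn.Embedding(toy_vocab_size, toy_dim)

# A sequence of 4 character indices; index 1 appears twice
toy_indices = torch.tensor([1, 4, 1, 0])
toy_vectors = toy_emb(toy_indices)

# The weight matrix has one row per vocabulary entry
print(toy_emb.weight.shape)   # torch.Size([5, 3])
# The output has one row per input index
print(toy_vectors.shape)      # torch.Size([4, 3])

# The forward pass is plain row indexing into the weight matrix
assert torch.equal(toy_vectors, toy_emb.weight[toy_indices])
# Equal indices map to the same vector
assert torch.equal(toy_vectors[0], toy_vectors[2])
```

This would also explain why the first rows printed above are identical: the text starts with several spaces, which all share index 0 and therefore the same embedding row.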


**(Question)** Set up a single-layer RNN with `nn.RNN` (without defining a custom class). Use `hidden_dim` as the size of the hidden states. Explain precisely the outputs of this layer for a given input sequence.

In [20]:
input_size, hidden_size = 10, 16
rnn_layer = nn.RNN(input_size, hidden_size)

# Initialize the hidden state
hidden_state = torch.zeros(1, hidden_size)

# run the embedded data through the RNN
output_sequence, final_hidden_state = rnn_layer(emb_data, hidden_state)

print(output_sequence.shape, output_sequence)
print(final_hidden_state)

torch.Size([10000, 16]) tensor([[-0.3715,  0.1184, -0.0998,  ...,  0.2753, -0.5606,  0.1890],
        [ 0.2594,  0.0479, -0.2172,  ...,  0.2403, -0.2392,  0.0881],
        [ 0.0631, -0.2388, -0.1634,  ...,  0.2913, -0.1183,  0.1537],
        ...,
        [-0.2905, -0.2651, -0.2395,  ..., -0.4819, -0.0869,  0.5245],
        [ 0.1154,  0.1141,  0.6491,  ..., -0.7616,  0.1123,  0.1617],
        [-0.8379, -0.1121,  0.2051,  ..., -0.7690, -0.7159,  0.4521]],
       grad_fn=<SqueezeBackward1>)
tensor([[-8.3788e-01, -1.1205e-01,  2.0512e-01,  8.6108e-01, -8.4592e-04,
         -5.6566e-02,  8.1530e-01,  1.1157e-01,  3.4455e-01, -5.5600e-01,
         -7.3686e-01,  3.2595e-01, -5.0581e-01, -7.6900e-01, -7.1592e-01,
          4.5210e-01]], grad_fn=<SqueezeBackward1>)


## Answer :
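`output_sequence` holds the hidden state of the RNN layer at every time step of the input sequence. For an unbatched input of shape `(seq_len, input_size)` it has shape `(seq_len, hidden_size)`, here `(10000, 16)`. These are hidden states, not character predictions: a linear layer is still needed to map them to scores over the vocabulary. `final_hidden_state` is the hidden state after the last time step, with shape `(num_layers, hidden_size)`, here `(1, 16)`. For a single-layer RNN it equals the last row of `output_sequence`, as the printed values confirm.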
****
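
To make the relation between the two outputs concrete, the following sketch recomputes the recurrence h_t = tanh(W_ih x_t + b_ih + W_hh h_(t-1) + b_hh) by hand from the layer's own parameters and compares it with what `nn.RNN` returns. The sizes and the names `toy_rnn`, `toy_x`, `toy_h0` are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions): 5 time steps, 4 input features, 3 hidden units
toy_rnn = nn.RNN(input_size=4, hidden_size=3)
toy_x = torch.randn(5, 4)      # unbatched input: (seq_len, input_size)
toy_h0 = torch.zeros(1, 3)     # (num_layers, hidden_size)

toy_out, toy_hn = toy_rnn(toy_x, toy_h0)
print(toy_out.shape)   # torch.Size([5, 3]): one hidden state per time step
print(toy_hn.shape)    # torch.Size([1, 3]): last hidden state of the single layer

# Recompute the recurrence by hand from the layer's parameters
h = toy_h0[0]
manual_states = []
for x_t in toy_x:
    h = torch.tanh(toy_rnn.weight_ih_l0 @ x_t + toy_rnn.bias_ih_l0
                   + toy_rnn.weight_hh_l0 @ h + toy_rnn.bias_hh_l0)
    manual_states.append(h)
manual_out = torch.stack(manual_states)

# The layer output is the stack of hidden states ...
assert torch.allclose(manual_out, toy_out, atol=1e-5)
# ... and the final hidden state is its last row
assert torch.allclose(toy_out[-1], toy_hn[0])
```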

**(Question)** Implement a simple training loop to overfit on a small input sequence. The loss function should be a categorical cross entropy on the predicted characters. Monitor the loss function value over the iterations.

In [25]:
# Build a small toy input sequence `input_seq` (consecutive indices 0..39) and store its expected sequence in `target_seq` (each index shifted by one)
input_seq = torch.arange(0, 40).long()
target_seq = input_seq + 1

# Implement a training loop overfitting an input sequence and monitoring the loss function
# (the learning rate is set on the optimizer, so it is not a parameter here)
def train_overfit(model, input_seq, target_seq, optim, loss_function, n_iters=200, check_iter=50):

    for i in range(n_iters):
        # Forward pass
        output, hidden = model(input_seq.unsqueeze(0))  # Add batch dimension
        loss = loss_function(output.squeeze(0), target_seq)

        # Backward pass and optimization
        optim.zero_grad()
        loss.backward()
        optim.step()

        if (i + 1) % check_iter == 0:
            print(f"Iteration {i + 1}/{n_iters}, Loss: {loss.item()}")

# Initialize a model and make it overfit the input sequence
class simpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(simpleRNN, self).__init__()
        self.emb = nn.Embedding(input_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
        self.lin = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.emb(x)
        output, hidden = self.rnn(x)
        output = self.lin(output)
        return output, hidden

# Initialize the model, loss function, and optimizer (Adam)
input_size, hidden_size, output_size, learning_rate = 200, 16, 200, 0.002
model = simpleRNN(input_size, hidden_size, output_size)
optim = optimizer.Adam(model.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

# train the model
train_overfit(model, input_seq, target_seq, optim, loss_function, n_iters=900)

Iteration 50/900, Loss: 3.5212912559509277
Iteration 100/900, Loss: 1.862529993057251
Iteration 150/900, Loss: 0.8865076303482056
Iteration 200/900, Loss: 0.4528396725654602
Iteration 250/900, Loss: 0.2679155468940735
Iteration 300/900, Loss: 0.17986321449279785
Iteration 350/900, Loss: 0.13153553009033203
Iteration 400/900, Loss: 0.10171101987361908
Iteration 450/900, Loss: 0.08169753849506378
Iteration 500/900, Loss: 0.06744050979614258
Iteration 550/900, Loss: 0.056826673448085785
Iteration 600/900, Loss: 0.04865102469921112
Iteration 650/900, Loss: 0.042175233364105225
Iteration 700/900, Loss: 0.03692689910531044
Iteration 750/900, Loss: 0.03262072056531906
Iteration 800/900, Loss: 0.0290543045848608
Iteration 850/900, Loss: 0.026061970740556717
Iteration 900/900, Loss: 0.023522594943642616
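
As a sanity check on the loss values above, the sketch below shows how input and target sequences relate for real text (the target is the input shifted by one character) and what `nn.CrossEntropyLoss` computes on the logits. The text `"hello"` and the uniform logits are illustrative assumptions standing in for a dataset and an untrained model.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

toy_text = "hello"
toy_ctoi = {c: i for i, c in enumerate(sorted(set(toy_text)))}  # 4 distinct characters
toy_ids = torch.tensor([toy_ctoi[c] for c in toy_text])

# Next-character prediction: the target is the input shifted by one position
toy_input, toy_target = toy_ids[:-1], toy_ids[1:]   # "hell" -> "ello"

# Uniform logits stand in for an untrained model: (seq_len, vocab_size)
toy_logits = torch.zeros(len(toy_input), len(toy_ctoi))
toy_loss = nn.CrossEntropyLoss()(toy_logits, toy_target)

# CrossEntropyLoss is the mean negative log-probability of the correct class
log_probs = F.log_softmax(toy_logits, dim=-1)
manual = -log_probs[torch.arange(len(toy_target)), toy_target].mean()
assert torch.isclose(toy_loss, manual)

# With a uniform prediction, the loss equals ln(vocab_size)
assert math.isclose(toy_loss.item(), math.log(len(toy_ctoi)), rel_tol=1e-5)
```

By the same reasoning, a model with 200 output classes that predicts uniformly would start near ln(200) ≈ 5.3, so the values printed above indicate that the model has moved far from chance on this sequence.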


**(Question)** Implement a `predict_argmax` method for your `RNN` model. Then, verify your overfitting: use some characters of your input sequence as context to predict the remaining ones. Experiment with the current model and analyze the results.

In [26]:
class CharRNN(simpleRNN):
    @torch.no_grad()
    def predict_argmax(self, context_tensor, n_predictions):
        predictions = []

        # Run the whole context through the RNN once to build up the hidden state.
        # The layers are called directly because `forward` does not accept a hidden state.
        output, hidden = self.rnn(self.emb(context_tensor.unsqueeze(0)))
        logits = self.lin(output[:, -1])  # scores for the index following the context

        # Predict the next n_predictions indices
        for _ in range(n_predictions):
            # Get the predicted index using argmax
            predicted_index = logits.argmax(dim=-1)  # shape (1,)
            predictions.append(predicted_index.item())

            # Feed the prediction back in, carrying the hidden state forward
            output, hidden = self.rnn(self.emb(predicted_index.unsqueeze(0)), hidden)
            logits = self.lin(output[:, -1])

        return predictions

# Initialize a model and make it overfit as above
# Then, verify your overfitting by predicting characters given some context
model = CharRNN(input_size, hidden_size, output_size)
optim = optimizer.Adam(model.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

train_overfit(model, input_seq, target_seq, optim, loss_function, n_iters=900)

Iteration 50/900, Loss: 3.4821624755859375
Iteration 100/900, Loss: 1.969241738319397
Iteration 150/900, Loss: 1.0188318490982056
Iteration 200/900, Loss: 0.5720313191413879
Iteration 250/900, Loss: 0.36356136202812195
Iteration 300/900, Loss: 0.24904301762580872
Iteration 350/900, Loss: 0.18203362822532654
Iteration 400/900, Loss: 0.14000003039836884
Iteration 450/900, Loss: 0.1117682009935379
Iteration 500/900, Loss: 0.09171818196773529
Iteration 550/900, Loss: 0.0768367201089859
Iteration 600/900, Loss: 0.06539736688137054
Iteration 650/900, Loss: 0.05636948347091675
Iteration 700/900, Loss: 0.049108438193798065
Iteration 750/900, Loss: 0.04318055883049965
Iteration 800/900, Loss: 0.038270678371191025
Iteration 850/900, Loss: 0.034140296280384064
Iteration 900/900, Loss: 0.030614936724305153


In [27]:
# predict 3 indices
print(model.predict_argmax(input_seq[:30], n_predictions=3))

[30, 31, 32]


Using the argmax function to predict the next character can yield a deterministic generator always predicting the same characters. Instead, it is common to predict the next character by sampling from the distribution of output predictions, adding some randomness into the generator.

**(Question)** Implement a `predict_proba` method for your `RNN` model. It should be very similar to `predict_argmax`, but instead of using argmax, it should randomly sample from the output predictions. To do that, you can use the `torch.distributions.Categorical` class and its `sample()` method. Verify that your method actually adds some randomness.

In [35]:
class CharRNN(CharRNN):
    @torch.no_grad()
    def predict_proba(self, context_tensor, n_predictions):
        predictions = []

        # Run the whole context through the RNN once to build up the hidden state.
        # The layers are called directly because `forward` does not accept a hidden state.
        output, hidden = self.rnn(self.emb(context_tensor.unsqueeze(0)))
        logits = self.lin(output[:, -1])  # scores for the character following the context

        # Predict the next n_predictions characters by sampling from the distribution
        for _ in range(n_predictions):
            # Use a Categorical distribution to sample from the predicted probabilities
            categorical_dist = distributions.Categorical(logits=logits)
            predicted_index = categorical_dist.sample()  # shape (1,)
            predictions.append(predicted_index.item())

            # Feed the sample back in, carrying the hidden state forward
            output, hidden = self.rnn(self.emb(predicted_index.unsqueeze(0)), hidden)
            logits = self.lin(output[:, -1])

        return predictions

# Verify that your predictions are not deterministic anymore
model = CharRNN(input_size, hidden_size, output_size)
optim = optimizer.Adam(model.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

train_overfit(model, input_seq, target_seq, optim, loss_function, n_iters=900)

Iteration 50/900, Loss: 3.5964713096618652
Iteration 100/900, Loss: 1.9690039157867432
Iteration 150/900, Loss: 0.9671480059623718
Iteration 200/900, Loss: 0.5276398658752441
Iteration 250/900, Loss: 0.3305578827857971
Iteration 300/900, Loss: 0.2284894436597824
Iteration 350/900, Loss: 0.16795094311237335
Iteration 400/900, Loss: 0.12825140357017517
Iteration 450/900, Loss: 0.10072934627532959
Iteration 500/900, Loss: 0.081171914935112
Iteration 550/900, Loss: 0.06697218120098114
Iteration 600/900, Loss: 0.05636342242360115
Iteration 650/900, Loss: 0.04819870740175247
Iteration 700/900, Loss: 0.04175559803843498
Iteration 750/900, Loss: 0.036569222807884216
Iteration 800/900, Loss: 0.032329704612493515
Iteration 850/900, Loss: 0.028819208964705467
Iteration 900/900, Loss: 0.025877902284264565


In [36]:
# predict 3 indices
print(model.predict_proba(input_seq[:30], n_predictions=3))

[30, 31, 32]
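
The sampled output above matches the argmax output, which is likely a consequence of overfitting rather than a sign that sampling is broken: a mean cross-entropy around 0.026 corresponds to roughly 97% probability on the correct index, so most draws pick it. The sketch below contrasts sampling from a flat and from a sharply peaked distribution; both logit vectors and the helper `draw` are hypothetical.

```python
import torch
import torch.distributions as distributions

torch.manual_seed(0)

# Hypothetical logits over a 4-character vocabulary
flat_logits = torch.tensor([1.0, 1.2, 0.8, 1.1])     # the model is unsure
peaked_logits = torch.tensor([0.0, 12.0, 0.0, 0.0])  # overfit model: one class dominates

def draw(logits, n=1000):
    """Sample n indices from softmax(logits) and count how often each occurs."""
    samples = distributions.Categorical(logits=logits).sample((n,))
    return torch.bincount(samples, minlength=len(logits))

flat_counts = draw(flat_logits)
peaked_counts = draw(peaked_logits)
print(flat_counts)    # every index is drawn a substantial number of times
print(peaked_counts)  # almost all draws land on index 1

# argmax ignores the spread entirely and returns index 1 in both cases
assert flat_logits.argmax().item() == 1
assert peaked_logits.argmax().item() == 1
```

To observe the randomness on the notebook's model, one option is to call `predict_proba` several times on a model trained for fewer iterations, where the output distribution is less peaked.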


## 3. Train the RNN model on text data

**(Question)** Adapt your previous code to implement a proper training loop for a text dataset. To do so, we need to specify a sequence length `seq_len`, acting similarly to the batch size in classic neural networks. Then, you can either randomly sample sequences of length `seq_len` from the text dataset over `n_iters` iterations, or properly loop over the text dataset for `n_epochs` epochs (with a random starting point for each epoch to ensure different sequences), to make sure the whole dataset is seen by the model. Feel free to adjust training and model parameters empirically. Start with a small model and a small subset of the text dataset, then move on to larger experiments. Remember to use GPU if available.

In [None]:
# Create the text dataset, compute its mappings and convert it to tensor
vocab, ctoi, itoc = create_vocab(text_data)
data_tensor = text_to_tensor(text_data, ctoi).to(device)
seq_len = 30

# Initialize training parameters
# Input and output sizes both equal the vocabulary size
input_size, hidden_size, output_size, learning_rate = len(vocab), 16, len(vocab), 0.002

# Initialize a character-level RNN model and move it to the selected device
model = simpleRNN(input_size, hidden_size, output_size).to(device)
optim = optimizer.Adam(model.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

# Setup the training loop
# Regularly record the loss and sample from the model to monitor what is happening
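
The random-sampling variant described in the question could be sketched as follows. This is one possible layout under stated assumptions, not the expected solution: `ToyCharRNN`, `train_random_chunks`, the repetitive toy text and all hyperparameters are made up for illustration, and everything runs on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): a repetitive text and a small model
toy_text = "hello world. " * 40
toy_chars = sorted(set(toy_text))
toy_ctoi = {c: i for i, c in enumerate(toy_chars)}
toy_data = torch.tensor([toy_ctoi[c] for c in toy_text])

class ToyCharRNN(nn.Module):
    """Hypothetical model with an embedding -> RNN -> linear layout."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
        self.lin = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        output, hidden = self.rnn(self.emb(x))
        return self.lin(output), hidden

def train_random_chunks(model, data, seq_len, n_iters, lr):
    """Train on randomly positioned windows of length seq_len; return the loss history."""
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loss_function = nn.CrossEntropyLoss()
    losses = []
    for _ in range(n_iters):
        # Random start such that seq_len + 1 characters fit in the data
        start = torch.randint(0, len(data) - seq_len - 1, (1,)).item()
        chunk = data[start:start + seq_len + 1]
        input_seq, target_seq = chunk[:-1], chunk[1:]   # target = input shifted by one

        logits, _ = model(input_seq.unsqueeze(0))        # (1, seq_len, vocab_size)
        loss = loss_function(logits.squeeze(0), target_seq)

        optim.zero_grad()
        loss.backward()
        optim.step()
        losses.append(loss.item())
    return losses

toy_model = ToyCharRNN(len(toy_chars), hidden_size=32)
toy_losses = train_random_chunks(toy_model, toy_data, seq_len=20, n_iters=300, lr=0.01)
print(f"first loss: {toy_losses[0]:.3f}, last loss: {toy_losses[-1]:.3f}")
```

Each iteration here sees a single window, so `seq_len` controls how many characters contribute to one gradient step. Looping over the dataset in consecutive windows per epoch, as the question also suggests, would only change how `start` is chosen.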


**(Question)** Using your trained model, play around with its predictions: start with a custom input sequence and ask the model to predict the rest. Analyze and comment on your results.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Answer :
****

## 4. Experiment with different RNN architectures

**(Question)** Experiment with different RNN architectures. Potential ideas are multi-layer RNNs, GRUs and LSTMs. All of these models can be extended to multiple layers using the `num_layers` parameter. Analyze and comment on your results.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
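
As a starting point for these experiments, the sketch below instantiates `nn.RNN`, `nn.GRU` and `nn.LSTM` with the same sizes and compares their outputs and parameter counts. The sizes and the names `toy_rnn`, `toy_gru`, `toy_lstm` and `n_params` are illustrative assumptions; note that the LSTM returns its state as a pair, which matters when carrying state between calls.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions)
emb_dim, hidden, layers = 8, 16, 2
toy_x = torch.randn(1, 5, emb_dim)   # (batch, seq_len, emb_dim) with batch_first=True

toy_rnn = nn.RNN(emb_dim, hidden, num_layers=layers, batch_first=True)
toy_gru = nn.GRU(emb_dim, hidden, num_layers=layers, batch_first=True)
toy_lstm = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)

out_rnn, h_rnn = toy_rnn(toy_x)
out_gru, h_gru = toy_gru(toy_x)
out_lstm, (h_lstm, c_lstm) = toy_lstm(toy_x)   # LSTM state is a (hidden, cell) pair

# The output only exposes the top layer; the state has one entry per layer
print(out_lstm.shape)   # torch.Size([1, 5, 16])
print(h_lstm.shape)     # torch.Size([2, 1, 16])

def n_params(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

# Gates multiply the parameter count: a GRU has 3 weight blocks per layer, an LSTM has 4
print(n_params(toy_rnn), n_params(toy_gru), n_params(toy_lstm))
```

Since the output shape is the same for all three, swapping the recurrent layer inside a model of the kind used above mostly requires handling the LSTM's tuple state.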

## Answer :
****