# Text data processing with recurrent neural networks

In this series of labs (7, 8, and 9), we use recurrent neural networks (RNNs) to manipulate text data. RNNs are well suited to sequential data, where each point depends on the previous ones. This is the case, e.g., for time series (audio/speech signals, financial data...), but also for videos (time sequences of images) and text.

We will consider the problem of sequence to sequence (seq2seq) learning using RNNs. This task consists in producing one sequence of data from another, of possibly different lengths (it is one example of *many-to-many* RNNs you have studied during lectures).

More specifically, we will work with textual data for the *machine translation* task: the goal is to automatically translate a sentence from one language to another.

<center><a href="https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html">
    <img src="https://pytorch.org/tutorials/_images/seq2seq.png" width="500"></a></center>

In this lab, we study the global pipeline for preprocessing text data, the basics of RNNs, and we write the encoder for the seq2seq model.

**Note**: This notebook is based on [this tutorial](https://github.com/bentrevett/pytorch-seq2seq), which you are strongly encouraged to check as it goes into much more detail about seq2seq models.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
import math
import time

# We'll be using torchtext and spacy to do most of the pre-processing
import spacy
import torchtext
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

# Download specific language models (comment if already done)
!python3 -m spacy download en_core_web_sm
!python3 -m spacy download de_core_news_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting de-core-news-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.2.0/de_core_news_sm-3.2.0-py3-none-any.whl (19.1 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')


## Preprocessing text data

### Tokenizers

The first step is to define *tokenizers*, that is, how a string is transformed into a sequence of words (or *tokens*). For instance, "Welcome to the U.S.A.!" should be transformed into \["Welcome", "to", "the", "U.S.A.", "!"\]. These are called tokens in the general case because "!" is not a word.

We will also use language models, in order to preserve some specific rules for tokenization: with a naïve approach, "U.S.A." would be split into 6 tokens \["U", ".", "S", ".", "A", "."\], but we want to consider it as a single token.
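To make the difference concrete, here is a minimal library-free sketch of the two behaviours. The regular expressions are illustrative assumptions, not the rules spaCy actually applies; they only show why a special case for abbreviations changes the result.

```python
import re

def naive_tokenize(text):
    # Every run of letters is a token; every other non-space character is its own token
    return re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)

def abbreviation_aware_tokenize(text):
    # Try the abbreviation pattern first (e.g. "U.S.A."), then fall back to the naive rules
    return re.findall(r"(?:[A-Za-z]\.){2,}|[A-Za-z]+|[^\sA-Za-z]", text)

sentence = "Welcome to the U.S.A.!"
naive = naive_tokenize(sentence)
aware = abbreviation_aware_tokenize(sentence)
print(naive)
print(aware)
```

The naive version splits the abbreviation into six tokens, while the second version keeps it whole.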

In [2]:
# load the German and English specific pipelines
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

# define tokenizers
def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# check on an example
sentence = 'Welcome to the U.S.A.!'
print(sentence)
print(tokenize_en(sentence))

Welcome to the U.S.A.!
['Welcome', 'to', 'the', 'U.S.A.', '!']


In [3]:
# We also define 'Fields', which handle how the data will be processed. In addition to tokenizing, a Field can convert
# all characters to lower case and add extra tokens for the start and end of sentences.
SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)
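As a rough illustration of the net effect of these options on one sentence, the sketch below uses a hypothetical helper named `preprocess` and `str.split` as a placeholder tokenizer so that it runs without spaCy. It mimics the outcome only, not torchtext's internal implementation.

```python
def preprocess(text, tokenize, init_token="<sos>", eos_token="<eos>", lower=True):
    # Hypothetical stand-in: tokenize, lowercase, then add the boundary tokens
    tokens = tokenize(text)
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return [init_token] + tokens + [eos_token]

# str.split is a placeholder tokenizer (it only splits on whitespace)
processed = preprocess("Two young men are outside .", str.split)
print(processed)
```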

### Dataset

We use the Multi30k dataset, which contains sentences in German and English (as well as French, but it's not available using torchtext).

In [4]:
# Load the full dataset
train_data, valid_data, test_data = Multi30k.splits(root='data/', exts = ('.de', '.en'), fields = (SRC, TRG))

print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

# We take a subset of the full dataset for speed
train_data.examples = train_data.examples[:1000]
valid_data.examples = valid_data.examples[:100]
test_data.examples = test_data.examples[:100]

# Print one example
print(train_data.examples[0].src)
print(train_data.examples[0].trg)

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000
['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.']
['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']


### Vocabulary

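We now use the ```build_vocab``` method of the ```Field``` to create a vocabulary from the training data. A vocabulary maps each token to an integer index; the Field then uses this mapping to convert tokenized sentences into torch tensors of indices. With `min_freq=2`, tokens that appear only once are not added to the vocabulary and are mapped to the unknown token instead.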
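The mechanics can be sketched without torchtext. In the sketch below, `build_vocab` and `numericalize` are hypothetical helper functions written for illustration, and the alphabetical tie-breaking between equally frequent tokens is an assumption that may differ from torchtext's ordering.

```python
from collections import Counter

def build_vocab(sentences, min_freq=2, specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    counts = Counter(tok for sent in sentences for tok in sent)
    # Special tokens first, then tokens by decreasing frequency (ties broken alphabetically)
    frequent = sorted((tok for tok, c in counts.items() if c >= min_freq),
                      key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + frequent
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(sentence, stoi):
    # Tokens missing from the vocabulary are mapped to the '<unk>' index
    return [stoi.get(tok, stoi["<unk>"]) for tok in sentence]

corpus = [["a", "man", "sits", "."], ["a", "woman", "sits", "."], ["a", "dog", "runs", "."]]
itos, stoi = build_vocab(corpus, min_freq=2)
indices = numericalize(["a", "cat", "sits", "."], stoi)
print(itos)
print(indices)
```

Here "cat" never appears in the toy corpus, so it is mapped to index 0 (`<unk>`), exactly like the tokens that fall below `min_freq`.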

In [5]:
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 802
Unique tokens in target (en) vocabulary: 823


In [6]:
# We can use the 'vocab.itos' list to convert indices to the corresponding tokens
# The first tokens in the vocabulary are 'unknown word', 'padding', 'start of sentence' and 'end of sentence'.
# Then the tokens are ranked by frequency of appearance in the dataset.
for i in range(20):
    print(TRG.vocab.itos[i])

<unk>
<pad>
<sos>
<eos>
a
.
in
the
on
man
and
is
of
with
are
two
,
woman
at
people


### Dataloader

The final step is to create a dataloader to generate batches of data. Instead of using the classic ```DataLoader``` class, we use ```BucketIterator```, which creates batches by grouping sentences of the same or similar lengths. This reduces the amount of padding and therefore the wasted computation.
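The effect of grouping by length can be estimated with a small simulation. This is only a sketch on made-up sentence lengths: it compares batches taken in arbitrary order with batches taken from a fully length-sorted list, which is a simplification of what ```BucketIterator``` does (the helper `padding_needed` is hypothetical).

```python
import random

def padding_needed(lengths, batch_size):
    # Each batch is padded to its longest sentence; count the padded positions
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

random.seed(0)
lengths = [random.randint(5, 40) for _ in range(1000)]  # made-up sentence lengths

arbitrary_order = padding_needed(lengths, batch_size=128)
length_sorted = padding_needed(sorted(lengths), batch_size=128)
print(arbitrary_order, length_sorted)
```

The second number is much smaller than the first: sentences of similar lengths need far fewer `<pad>` tokens per batch.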

In [7]:
batch_size = 128
train_dataloader, _, test_dataloader = BucketIterator.splits((train_data, valid_data, test_data),
                                                           batch_size = batch_size)

# Fetch one batch as an example
example_batch = next(iter(train_dataloader))

# Each batch contains 'src' and 'trg' entries (source and target), corresponding to German and English sentences.
print(example_batch.src[:,1])

# The shape of this tensor should be [seq_length, batch_size] where seq_length is the maximum length of a sentence in this batch
print(example_batch.src.shape)

tensor([  2,   5,  13,  25,  10,   7, 474,   4,   3,   1,   1,   1,   1,   1,
          1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
          1,   1,   1,   1,   1])
torch.Size([33, 128])


In [8]:
# TO DO: write a function that takes a list of integers (such as a slice of 'example_batch' above) as input
# and returns the corresponding tokens (hint: use the 'vocab.itos' list)
# only keep the tokens after '<sos>' and before '<eos>'
def indx2tokens_list(token_list, token_dic):
    # Convert each index to its token (int() also accepts 0-dim tensors)
    tokens = [token_dic.vocab.itos[int(i)] for i in token_list]
    # Keep only what lies strictly between '<sos>' and '<eos>'
    start = tokens.index('<sos>') + 1 if '<sos>' in tokens else 0
    end = tokens.index('<eos>') if '<eos>' in tokens else len(tokens)
    return tokens[start:end]

In [9]:
# Apply this function to a source sentence and its corresponding target translation
indx_example = 1
print(indx2tokens_list(example_batch.src[:, indx_example], SRC))
print(indx2tokens_list(example_batch.trg[:, indx_example], TRG))

['ein', 'mann', 'sitzt', 'auf', 'einem', 'stein', '.']
['a', 'man', 'sits', 'on', 'a', 'rock', '.']


<span style="color:red">**Q1**</span> Put these sentences in your report.

## Recurrent networks basics

Now that the data is ready, let's see the basic operations used in RNNs: [embedding layers](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html?highlight=embedding#torch.nn.Embedding), [dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html), and [recurrent layers](https://pytorch.org/docs/stable/nn.html#recurrent-layers).

### Embedding layer

Sentences have been tokenized and tokens have been transformed into integers. We need to further transform these integers into word vectors: the idea is that two *similar* words should have similar word vectors. 


<center><a href="https://ruder.io/word-embeddings-1/">
    <img src="https://ruder.io/content/images/size/w2000/2016/04/word_embeddings_colah.png" width="500"></a></center>

This notion of similarity (and what word vectors exactly represent) is hard to define explicitly. Therefore, we use an embedding layer to produce these word vectors, and this layer is learned during training.
Many pre-trained word embeddings are available (e.g., word2vec), but here we will learn them from scratch along with the rest of the network.
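Conceptually, an embedding layer is a lookup table with one row per token index. The following library-free sketch illustrates this with random values standing in for learned weights; the table size and the token indices are made up for the example.

```python
import random

random.seed(0)
vocab_size, emb_size = 6, 4
# One row per token index; in a real layer these values are learned during training
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(emb_size)]
                   for _ in range(vocab_size)]

def embed(indices):
    # An embedding layer is a table lookup: index i selects row i
    return [embedding_table[i] for i in indices]

sentence = [2, 4, 5, 4, 3]   # made-up token indices
vectors = embed(sentence)
print(len(vectors), len(vectors[0]))   # sequence length, embedding size
print(vectors[1] == vectors[3])        # the same index always maps to the same vector
```

This is also why, when an embedding layer is applied to a batch in which every sentence starts with `<sos>`, all the word vectors at the first position are identical.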

In [10]:
# Create an embedding layer. We need to specify:
# - the input size, that is, how many words are in the vocabulary
# - the embedding size, that is, the dimension of the word-vector space
input_size = len(SRC.vocab)
emb_size = 32
src_emb_layer = nn.Embedding(input_size, emb_size)

# Apply it to the example batch and display it
embedded_batch = src_emb_layer(example_batch.src)
print(embedded_batch)

# The size of the word vectors for a batch should be [seq_length, batch_size, emb_size]
print(embedded_batch.shape)

tensor([[[ 1.9498, -0.2377,  0.3892,  ...,  0.6667, -1.0490,  1.5987],
         [ 1.9498, -0.2377,  0.3892,  ...,  0.6667, -1.0490,  1.5987],
         [ 1.9498, -0.2377,  0.3892,  ...,  0.6667, -1.0490,  1.5987],
         ...,
         [ 1.9498, -0.2377,  0.3892,  ...,  0.6667, -1.0490,  1.5987],
         [ 1.9498, -0.2377,  0.3892,  ...,  0.6667, -1.0490,  1.5987],
         [ 1.9498, -0.2377,  0.3892,  ...,  0.6667, -1.0490,  1.5987]],

        [[ 0.5344,  0.5631, -0.9296,  ...,  1.5760,  0.9768, -0.9856],
         [ 0.5344,  0.5631, -0.9296,  ...,  1.5760,  0.9768, -0.9856],
         [ 0.5344,  0.5631, -0.9296,  ...,  1.5760,  0.9768, -0.9856],
         ...,
         [ 1.3062, -1.5726,  0.8415,  ...,  0.1753, -0.3047, -0.8642],
         [ 0.5344,  0.5631, -0.9296,  ...,  1.5760,  0.9768, -0.9856],
         [ 0.2472,  0.4570, -1.9588,  ...,  1.0134,  1.1166, -0.2600]],

        [[-0.6605,  0.5439,  0.9969,  ...,  0.2231, -0.0096, -2.1519],
         [-1.6911,  1.0794, -0.5341,  ..., -0

### Dropout

The core idea of a dropout layer is to reduce the risk of overfitting by randomly setting some input values to 0 during training. Since the zeroed entries change from one forward pass to the next, the network cannot rely on any single input (or on the corresponding weights), which discourages batch-specific solutions and acts as a regularizer. Dropout can be used in any network (including MLPs or CNNs), but it is particularly useful for RNNs, which are more prone to overfitting.
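As a minimal sketch of the mechanism (a plain-Python function named `dropout`, written here for illustration and not the PyTorch implementation): each entry is zeroed with probability `p`, and the surviving entries are scaled by `1 / (1 - p)` so that the expected value is unchanged. With `p = 0.5` the survivors are doubled, which is the scaling `nn.Dropout` applies in training mode.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    # Inverted dropout: zero each entry with probability p, scale the survivors by 1/(1-p)
    if not training or p == 0.0:
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

random.seed(0)
x = [1.9498, -0.2377, 0.3892, 0.6667, -1.0490, 1.5987]
dropped = dropout(x, p=0.5)
print(dropped)
print(dropout(x, training=False) == x)  # evaluation mode leaves the input unchanged
```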

In [11]:
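# the probability of zeroing each value (between 0 and 1) is given as input
dropout_layer = nn.Dropout(0.5)
drop_batch = dropout_layer(embedded_batch)

# in this example, each entry is set to 0 with probability 0.5 (about half of them on average);
# in training mode, the surviving entries are scaled by 1 / (1 - 0.5) = 2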
print(drop_batch)

tensor([[[ 3.8997, -0.0000,  0.0000,  ...,  1.3334, -2.0979,  3.1975],
         [ 0.0000, -0.0000,  0.7785,  ...,  1.3334, -0.0000,  3.1975],
         [ 0.0000, -0.0000,  0.7785,  ...,  1.3334, -2.0979,  3.1975],
         ...,
         [ 3.8997, -0.0000,  0.0000,  ...,  0.0000, -0.0000,  0.0000],
         [ 3.8997, -0.0000,  0.7785,  ...,  0.0000, -2.0979,  0.0000],
         [ 0.0000, -0.0000,  0.7785,  ...,  0.0000, -0.0000,  3.1975]],

        [[ 1.0689,  0.0000, -1.8591,  ...,  0.0000,  1.9537, -0.0000],
         [ 1.0689,  0.0000, -0.0000,  ...,  3.1520,  0.0000, -1.9712],
         [ 0.0000,  0.0000, -0.0000,  ...,  0.0000,  0.0000, -0.0000],
         ...,
         [ 2.6125, -0.0000,  0.0000,  ...,  0.3506, -0.0000, -1.7283],
         [ 1.0689,  1.1262, -0.0000,  ...,  0.0000,  0.0000, -0.0000],
         [ 0.0000,  0.0000, -3.9176,  ...,  0.0000,  0.0000, -0.0000]],

        [[-1.3211,  0.0000,  0.0000,  ...,  0.4462, -0.0192, -0.0000],
         [-3.3822,  2.1588, -0.0000,  ..., -0

### Recurrent layers

<center><a href="https://www.researchgate.net/profile/Rezzy-Caraka/publication/346410173_Employing_Long_Short-Term_Memory_and_Facebook_Prophet_Model_in_Air_Temperature_Forecasting/links/60077104a6fdccdcb868957f/Employing-Long-Short-Term-Memory-and-Facebook-Prophet-Model-in-Air-Temperature-Forecasting.pdf">
    <img src="https://www.researchgate.net/profile/Rezzy-Caraka/publication/346410173/figure/fig2/AS:962598073823234@1606512673418/Network-Structure-of-RNN-LSTM-and-GRU.png" width="500"></a></center>

We now look at the 3 main recurrent layers (simple RNN, LSTM and GRU). We won't focus on the technical differences between these, but you can find more info online (e.g., [here](https://medium.com/analytics-vidhya/rnn-vs-gru-vs-lstm-863b0b7b1573)).

#### Simple RNN

First, let's see the basic RNN. We denote by $x_t$ the $t$-th element of the input to the RNN (in our case, the embedding after dropout). We have $h_t = \text{RNN}(x_t, h_{t-1})$, where $h_{t}$ is the hidden state. To define such an RNN in PyTorch (using ```nn.RNN```), you need to specify:

- the size of the input (here, it's the size of the embeddings)
- the size of the hidden space (`hidden_size`)
- the number of layers (`n_layers`)

By default, the RNN is uni-directional, uses bias, and uses tanh as activation function. If you use a multi-layer RNN, you can also add dropout in the intermediate layers. You can change these by playing with the parameters of the function (see the [doc](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html#torch.nn.RNN) for more info).

**Note**: for the first element of the sequence, we have $h_1 = \text{RNN}(x_1, h_{0})$, so we normally need to provide an initial hidden state $h_0$. In PyTorch, we can either provide it explicitly or omit it, in which case $h_0=0$ is used by default. This is what we do here, and it also applies to other recurrent units (LSTM and GRU).

In PyTorch, applying an RNN returns not one, but two outputs, usually called `out` and `hidden`, illustrated below:

- `out` is the whole sequence of hidden states of the last layer
- `hidden` is the hidden state of the last token for all layers

<center><img src="rnn_outputs.png" width="500"></center>
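The relation between `out` and `hidden` can be checked on a toy single-layer RNN written in plain Python. This is a sketch under simplifying assumptions: `rnn_forward` is a hypothetical function, it handles one sequence at a time (no batch dimension), and it uses a single bias vector where PyTorch keeps two.

```python
import math
import random

def rnn_forward(inputs, w_ih, w_hh, bias, h0=None):
    # inputs: list of input vectors x_1..x_T; returns (out, hidden) for a single layer
    hidden_size = len(bias)
    h = [0.0] * hidden_size if h0 is None else list(h0)  # default initial state is zero
    out = []
    for x in inputs:
        # h_t = tanh(W_ih x_t + W_hh h_{t-1} + b), computed from the previous h
        h = [math.tanh(sum(w * xi for w, xi in zip(w_ih[j], x))
                       + sum(w * hi for w, hi in zip(w_hh[j], h))
                       + bias[j])
             for j in range(hidden_size)]
        out.append(h)
    return out, h

random.seed(0)
input_size, hidden_size, seq_length = 3, 2, 5
w_ih = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(hidden_size)]
w_hh = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
bias = [random.uniform(-1, 1) for _ in range(hidden_size)]
inputs = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(seq_length)]

out, hidden = rnn_forward(inputs, w_ih, w_hh, bias)
print(len(out), len(out[0]))   # seq_length, hidden_size
print(out[-1] == hidden)       # the last entry of 'out' is the final hidden state
```

For a unidirectional network, the last time step of `out` therefore coincides with the entry of `hidden` for the last layer.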

In [12]:
# Define a basic RNN
hidden_size = 50
n_layers = 2
rnn = nn.RNN(emb_size, hidden_size, n_layers)

# Apply the RNN to the input (embeddings after dropout, called 'drop_batch')
rnn_out, rnn_hidden = rnn(drop_batch)

# Get the size of the 'rnn_out': it should be [seq_length, batch_size, hidden_size]
print(rnn_out.shape)

# Get the size of the 'rnn_hidden': it should be [n_layers, batch_size, hidden_size]
print(rnn_hidden.shape)

torch.Size([33, 128, 50])
torch.Size([2, 128, 50])


In [35]:
# TO DO:
# - create a 3-layer bidirectional RNN (check the doc!)
# - apply it to the same input 'embedded_batch'

# Print the size of rnn_out (it should be [seq_length, batch_size, 2*hidden_size])
# Print the size of the final hidden state (it should be [2*n_layers, batch_size, hidden_size], with n_layers = 3 here)
# (the factor '2' in the shapes comes from the fact that the network is bidirectional)
# (nn.RNN is used directly: wrapping a single layer in nn.Sequential is unnecessary)
rnn_2 = nn.RNN(emb_size, hidden_size, num_layers=3, bidirectional=True)

rnn_out, rnn_hidden = rnn_2(embedded_batch)

In [36]:
print(rnn_out.shape) 
print(rnn_hidden.shape)

torch.Size([33, 128, 100])
torch.Size([6, 128, 50])


<span style="color:red">**Q2**</span> Put the sizes of these outputs in your report.
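The shape rules used in the cells above can be summarised in a small helper. The function name `recurrent_output_shapes` is hypothetical and the sketch only encodes the bookkeeping for the default (sequence-first) layout; it does not run any network.

```python
def recurrent_output_shapes(seq_length, batch_size, hidden_size, n_layers, bidirectional=False):
    # A bidirectional network has two directions, which doubles some dimensions
    num_directions = 2 if bidirectional else 1
    out_shape = (seq_length, batch_size, num_directions * hidden_size)
    hidden_shape = (num_directions * n_layers, batch_size, hidden_size)
    return out_shape, hidden_shape

uni_shapes = recurrent_output_shapes(33, 128, 50, n_layers=2)
bi_shapes = recurrent_output_shapes(33, 128, 50, n_layers=3, bidirectional=True)
print(uni_shapes)
print(bi_shapes)
```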

#### LSTM

The basic RNN suffers from the vanishing gradient problem, so we will instead use a variant called *long short-term memory* (LSTM) networks. The key idea of the LSTM is an extra hidden feature called the *cell state*, which is updated additively through gates that decide which information to keep, forget, or output. This lets the network "remember" the useful parts of the input sequence and gives the gradient a more direct path through time, which mitigates (without fully eliminating) the vanishing gradient problem.

The formula for the [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM) is therefore: $(h_t, c_t) = \text{LSTM}(x_t, h_{t-1}, c_{t-1})$ where $c_t$ is this extra cell state.
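For reference, one common way to write out this update follows the convention of the PyTorch documentation. The weight and bias names below are notation introduced here, and the two bias vectors that PyTorch keeps per gate are merged into one for readability; $\sigma$ is the sigmoid function and $\odot$ the element-wise product.

$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{if} x_t + W_{hf} h_{t-1} + b_f) \\
g_t &= \tanh(W_{ig} x_t + W_{hg} h_{t-1} + b_g) \\
o_t &= \sigma(W_{io} x_t + W_{ho} h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The additive form of the $c_t$ update is what allows information (and gradients) to flow across many time steps when the forget gate $f_t$ stays close to 1.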

In [27]:
# Define an LSTM
hidden_size = 50
n_layers = 2
lstm = nn.LSTM(emb_size, hidden_size, n_layers)

# Apply the LSTM to the embedded batch
lstm_out, (lstm_hidden, lstm_cell) = lstm(embedded_batch)

# The shapes of the output and final hidden state are the same as before.
# The final cell state has the same size as the final hidden state.
print(lstm_out.shape)
print(lstm_hidden.shape)
print(lstm_cell.shape)

torch.Size([33, 128, 50])
torch.Size([2, 128, 50])
torch.Size([2, 128, 50])


#### GRU

The last main type of recurrent layer is the gated recurrent unit (GRU). It is a kind of simplified LSTM: it also has a memory mechanism to mitigate vanishing gradients, but it outputs only a single hidden state vector (there is no additional cell state as in the LSTM). It generally performs similarly to the LSTM (but this depends on the application). Writing a [GRU in PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html#torch.nn.GRU) is similar to writing a basic RNN.
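For comparison with the LSTM, the GRU update can be written as follows, again following the convention of the PyTorch documentation. The weight and bias names are notation introduced here; $r_t$ is the reset gate, $z_t$ the update gate, and $n_t$ the candidate state.

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh\big(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\big) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$

There are three sets of weights instead of four for the LSTM, and the hidden state $h_t$ plays the role of the memory.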

In [29]:
# TO DO: using the doc, write a GRU layer with a hidden size of 50 and 2 layers.
# Apply it to the embedded batch as before, and print the size of the output and final hidden state.
gru = nn.GRU(emb_size, hidden_size, n_layers)
gru_out, gru_hidden = gru(embedded_batch)
print(gru_out.shape)
print(gru_hidden.shape)

torch.Size([33, 128, 50])
torch.Size([2, 128, 50])


Finally, note that all recurrent layers can also apply dropout internally, through their `dropout` argument: it is applied to the outputs of each recurrent layer except the last one. This only makes sense if `n_layers` > 1, otherwise you'll get a warning.

## Building the translation model

We'll now build the machine translation model. This model is based on two parts:

- an *encoder*, which takes as input the source sentence (in German) and encodes it into a *context* vector. This context vector is sort of a summary of the whole input sentence.
- a *decoder*, which takes as input this context vector and sequentially generates a sentence in English. It always starts with the `<sos>` token and uses the context vector to generate the second token. Then, it recursively uses the last produced token and the hidden state to generate the next token.
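The control flow of the decoder described above can be sketched as a simple loop. Everything here is a hypothetical stand-in: `generate` and `toy_decoder_step` are names invented for illustration, and the "decoder" is a fixed lookup table that ignores its state, whereas a trained decoder would compute the next token from the state and the previous token.

```python
def generate(context, decoder_step, max_len=10, sos="<sos>", eos="<eos>"):
    # Start from '<sos>' and the context vector, then feed back the last produced token
    token, state = sos, context
    generated = []
    for _ in range(max_len):
        token, state = decoder_step(token, state)
        if token == eos:
            break
        generated.append(token)
    return generated

# Hypothetical stand-in for a trained decoder: a fixed table that ignores the state
toy_table = {"<sos>": "a", "a": "man", "man": "sits", "sits": "<eos>"}

def toy_decoder_step(token, state):
    return toy_table[token], state

translation = generate(context=None, decoder_step=toy_decoder_step)
print(translation)
```

The `max_len` bound is there because nothing guarantees that a model ever produces `<eos>`.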

<center><a href="https://github.com/bentrevett/pytorch-seq2seq">
    <img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013//assets/seq2seq1.png"></a></center>

In this picture, $h_t$ denotes the hidden states of the encoder and $s_t$ the hidden states of the decoder. For LSTMs, we also need to consider the cell states, but they're not displayed here for brevity. The yellow blocks represent the embedding and dropout, the purple blocks represent the linear classifier, and the green blocks are the recurrent units.

In this lab, we only write the encoder. It consists of:

- an embedding layer to transform token indices into word vectors.
- a dropout layer.
- a single-layer LSTM, to learn the context vector.

**Note**: We don't need to keep track of all the hidden states ($h_1$, $h_2$, $h_3$, etc.); we only need the final hidden state, called the context vector (and denoted $z$ in the picture). Therefore, we can simply apply our LSTM to the whole sequence instead of writing a loop explicitly (this will not be the case for the decoder).

In [38]:
class LSTMEncoder(nn.Module):
    def __init__(self, input_size, emb_size, hidden_size, n_layers, dropout_rate):
        super().__init__()

        # TO DO: initialize the network (remember to use 'self.' for all attributes / parameters / layers)
        # - store the input parameters as class attributes
        # - create the embedding layer (transform indices into word vectors)
        # - create the dropout layer
        # - create the LSTM layer
        self.input_size = input_size
        self.emb_size = emb_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.dropout_rate = dropout_rate

        self.embed_layer = nn.Embedding(input_size, emb_size)
        self.dropout = nn.Dropout(dropout_rate)
        self.lstm = nn.LSTM(emb_size, hidden_size, n_layers)

    def forward(self, src):

        # TO DO: write the forward pass
        # - compute the embeddings
        # - apply dropout to the word embeddings
        # - apply the LSTM layer
        # - return the final hidden and cell states
        embedded = self.dropout(self.embed_layer(src))
        # The full sequence of hidden states is not needed, only the final states
        _, (lstm_hidden, lstm_cell) = self.lstm(embedded)

        return lstm_hidden, lstm_cell

In [39]:
# Encoder parameters
input_size = len(SRC.vocab)
emb_size_enc = 32
hidden_size = 50
n_layers = 1
dropout_rate = 0.5

In [40]:
# TO DO:
# - Instantiate the encoder
# - print the number of trainable parameters
# - pass the example_batch to the encoder, and print the size of the two outputs
encoder = LSTMEncoder(input_size, emb_size_enc, hidden_size, n_layers, dropout_rate)
print('Total number of parameters:', sum(p.numel() for p in encoder.parameters() if p.requires_grad))

Total number of parameters: 42464


In [43]:
hidden, cell = encoder(example_batch.src)
print(hidden.shape)
print(cell.shape)

torch.Size([1, 128, 50])
torch.Size([1, 128, 50])


<span style="color:red">**Q3**</span> Put the number of parameters in the LSTM encoder in your report.
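The printed total can be cross-checked by hand. The sketch below assumes the PyTorch layout of a recurrent layer (per gate: input-to-hidden weights, hidden-to-hidden weights, and two bias vectors) and uses the sizes printed earlier in this notebook; `recurrent_layer_params` is a hypothetical helper name.

```python
def recurrent_layer_params(n_gates, input_size, hidden_size):
    # Per gate: input-to-hidden weights, hidden-to-hidden weights, and two bias vectors
    return n_gates * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

vocab_size, emb_size, hidden_size = 802, 32, 50   # values printed earlier in this notebook

embedding_params = vocab_size * emb_size                          # one vector per token
lstm_params = recurrent_layer_params(4, emb_size, hidden_size)    # LSTM: 4 sets of weights
total_params = embedding_params + lstm_params
print(embedding_params, lstm_params, total_params)

# Same layer sizes with the other recurrent units, for comparison
rnn_params = recurrent_layer_params(1, emb_size, hidden_size)     # simple RNN
gru_params = recurrent_layer_params(3, emb_size, hidden_size)     # GRU
print(rnn_params, gru_params)
```

Most of the parameters sit in the embedding table, so the count grows quickly with the vocabulary size.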