# Sequence to Sequence Learning in PyTorch

**|| Jonty Sinai ||** 28-04-2019

- **Paper:** [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)
- **Authors:** Ilya Sutskever, Oriol Vinyals, Quoc V. Le
- **Topic:** Machine translation, sequence modelling, neural network architectures
- **Year:** 2014

In this notebook I implement a sequence-to-sequence neural network based on the RNN-Encoder-Decoder framework introduced by two seminal papers in neural machine translation (NMT) and deep learning, one of which is listed above. Both papers were released within a few months of each other in 2014 (the paper above was presented at NeurIPS 2014), and both are credited with introducing the modern sequence-to-sequence framework. In contemporary deep learning, ideas from both papers have been combined and extended into more sophisticated frameworks.

I will implement a generic sequence-to-sequence model based on ideas from both papers and will train the model on a simplified English-French translation dataset based on the [official PyTorch sequence-to-sequence tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#sphx-glr-intermediate-seq2seq-translation-tutorial-py). Whereas in my [ResNet implementation](https://github.com/JontySinai/artificial_neural_networks/blob/master/notebooks/resnet.ipynb) the goal was to explore modular composability with PyTorch, the goal here is to explore sequence-to-sequence design patterns and engineering with PyTorch.

> **Note:** Compared to computer vision, feature selection and data preprocessing in natural language processing (NLP) require more care and attention. There is no single correct way of preprocessing text, although there are many incorrect ways. Choices must be made carefully and efficiently. As a result, a good portion of this notebook is spent preprocessing the dataset and converting it into the right format for sequence-to-sequence learning.

In [1]:
import os
import re
import time
import math
import string
import glob
import random
import unicodedata
from io import open

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler

import numpy as np
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker

HOME = os.environ['AI_HOME']
ROOT = os.path.join(HOME, 'artificial_neural_networks')
DATA = os.path.join(ROOT, 'data')
ENG_FR = os.path.join(DATA, 'english_french')

random.seed(1901)
np.random.seed(1901)
torch.manual_seed(1901)

<torch._C.Generator at 0x7fcb9817c6d0>

## Sequence-to-Sequence Model Architecture

A **sequence-to-sequence** architecture is a generic type of deep learning architecture consisting of two main components:

- An **encoder** which takes in an **input sequence** and encodes it into some **latent representation**, sometimes called the **context vector**.
- A **decoder** which takes in the latent representation and produces an **output sequence**.
    - The decoder may also take in an input sequence of its own. During training this is typically the target sequence. During prediction, this is the predicted sequence itself, but more on these details later.
    - During training the decoder can be thought of as a **generative model** which tries to produce the target sequence, given the context vector.

For example, in the French --> English translation task, a sequence-to-sequence model looks as follows:

<img src="assets/seq2seq.png">

Here the output sequence is fed back into the decoder for prediction. 

An **RNN-Encoder-Decoder** is a type of sequence-to-sequence architecture where both the encoder and decoder are RNNs. The two papers differ slightly in how information is passed from the encoder to the decoder. I will summarise both below:

>**Note:** We will implement both and compare their performance on the translation task. Because we use shallower LSTM/GRU networks, a smaller dataset and less compute than the papers, the training results here in no way indicate which method is better. Both are valid, and contemporary methods use ideas from both. In general it is best to experiment with different architectures for the task at hand.

#### Sequence to Sequence Learning with Neural Networks (Sutskever et al, 2014)

<img src="assets/sutskever-seq2seq.png" width="650px">

- The **encoder** loops through each timestep in the input sequence, $(x_1, x_2, \dots, x_n)$, updating its hidden state at each step. The final hidden state forms the context vector summarising the entire input sequence. In general, if $f_E$ is the encoder, then at each timestep $t$ the hidden state of the encoder is computed as follows:

\begin{align}
h_t = f_E(h_{t-1}, x_t )
\end{align}

- The **context vector** is set to the final hidden state of the input sequence:

$$c = h_n$$

- The **decoder** is started by passing the `<EOS>` token of the input sequence (this plays the same role as a dedicated `<SOS>` token) to produce an initial output, $\hat{y}_1$.
    - During training, the true target $y_1$ will be passed to the decoder at time $t=2$, and so on. This is known as **teacher forcing**.
    - During prediction, the estimate $\hat{y}_1$ is passed to the decoder at time $t=2$, and so on.
    - The process is repeated until the decoder predicts an `<EOS>` token, marking termination of the generative procedure. 
    <br/><br/>
    In general if $f_D$ is the decoder, then at each timestep $t$, the hidden state of the decoder is computed as
    
    \begin{align}
    h_t = f_D(h_{t-1}, \tilde{y}_t ) \ , \ \ h_0 = c
    \end{align} 
    <br/><br/>
    where $\tilde{y}_t = y_{t-1}$ during training and $\tilde{y}_t = \hat{y}_{t-1}$ during prediction, with $\tilde{y}_1 = \text{<SOS>}$.
    <br/>
    The output is then calculated using a linear layer:
    
    $$z_t = Wh_t$$

- The recurrent cell is an **LSTM** for both the encoder and decoder.
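
The recurrences above can be unrolled by hand. The following is a minimal sketch, not the notebook's implementation: it uses `nn.LSTMCell` so that each loop iteration corresponds to one application of $f_E$ or $f_D$. The toy sizes, the `SOS_token` index, the helper names `encode` and `decode`, and the zero-initialised cell states are all assumptions made for illustration. For brevity the greedy branch runs for a fixed `max_len` steps rather than stopping at `<EOS>`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes and token index, chosen only for illustration (assumptions)
vocab_size, hidden_size, batch_size = 10, 8, 2
SOS_token = 1

src_embedding = nn.Embedding(vocab_size, hidden_size)
tgt_embedding = nn.Embedding(vocab_size, hidden_size)
f_E = nn.LSTMCell(hidden_size, hidden_size)         # one encoder step
f_D = nn.LSTMCell(hidden_size, hidden_size)         # one decoder step
W = nn.Linear(hidden_size, vocab_size, bias=False)  # z_t = W h_t


def encode(x):
    """x: LongTensor of size seq_len x batch_size. Returns the context c = h_n."""
    h = torch.zeros(x.size(1), hidden_size)
    cell = torch.zeros(x.size(1), hidden_size)
    for x_t in x:  # h_t = f_E(h_{t-1}, x_t)
        h, cell = f_E(src_embedding(x_t), (h, cell))
    return h


def decode(context, targets=None, max_len=5):
    """Teacher forcing if targets are given, otherwise greedy prediction."""
    h = context                       # h_0 = c
    cell = torch.zeros_like(context)  # assumption: zero initial cell state
    y_tilde = torch.full((context.size(0),), SOS_token, dtype=torch.long)
    steps = targets.size(0) if targets is not None else max_len
    scores = []
    for t in range(steps):
        h, cell = f_D(tgt_embedding(y_tilde), (h, cell))
        z = W(h)
        scores.append(z)
        # Next input: the true token (training) or the model's own guess (prediction)
        y_tilde = targets[t] if targets is not None else z.argmax(dim=-1)
    return torch.stack(scores)  # steps x batch_size x vocab_size


x = torch.randint(0, vocab_size, (4, batch_size))  # input sequence, n = 4
y = torch.randint(0, vocab_size, (3, batch_size))  # target sequence, 3 steps

context = encode(x)
train_scores = decode(context, targets=y)  # teacher forcing
pred_scores = decode(context)              # greedy prediction

print(context.size())       # torch.Size([2, 8])
print(train_scores.size())  # torch.Size([3, 2, 10])
print(pred_scores.size())   # torch.Size([5, 2, 10])
```

The only difference between the two modes is the single line that chooses the next decoder input, which is why teacher forcing is cheap to toggle during training.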


In [2]:
class sutskeverEncoder(nn.Module):
    
    def __init__(self, input_vocab_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        self.embedding  = nn.Embedding(input_vocab_size, hidden_size, padding_idx=0)  # PAD_token = 0
        self.cell = nn.LSTM(hidden_size, hidden_size)
        
    def forward(self, input_, hidden_states):
        """
            input size : max_seq_len x batch_size
            hidden_states: tuple (h, c) corresponding to the hidden and cell states
                           respectively, each with size 1 x batch_size x hidden_size
        """
        
        embedded = self.embedding(input_)  # size: max_seq_len x batch_size x hidden_size
        output, hidden_states = self.cell(embedded, hidden_states)
        
        return output, hidden_states
    
    def init_hidden(self, batch_size):
        h0 = torch.randn(1, batch_size, self.hidden_size)  # first dim: num_layers * num_directions
        c0 = torch.randn(1, batch_size, self.hidden_size)
        return h0, c0

#### Test: Sutskever Encoder

Let's now see how a random tensor with the right dimensionality is transformed through the forward pass of the Encoder.

In [3]:
seq2seq_encoder = sutskeverEncoder(10, 8)  # input_vocab_size=10, hidden_size=8

print(seq2seq_encoder)

sutskeverEncoder(
  (embedding): Embedding(10, 8, padding_idx=0)
  (cell): LSTM(8, 8)
)


In [4]:
input_ = torch.randint(size=(3, 5), low=0, high=10) # max_seq_length = 3, batch_size = 5

In [5]:
embedded = seq2seq_encoder.embedding(input_)

print(embedded.size())

torch.Size([3, 5, 8])
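
A side note on `padding_idx=0`: the sketch below, which uses a fresh `nn.Embedding` and a hand-written toy batch (both assumptions for illustration), suggests what that argument does. Positions holding the PAD index are embedded as all-zero vectors, and the PAD row of the weight matrix receives no gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding = nn.Embedding(10, 8, padding_idx=0)  # PAD_token = 0

# Toy batch laid out as max_seq_len x batch_size; the second sequence is padded
batch = torch.tensor([[4, 7],
                      [2, 0],
                      [3, 0]])

embedded = embedding(batch)
print(embedded.size())  # torch.Size([3, 2, 8])

# Every PAD position maps to the zero vector
print(torch.all(embedded[batch == 0] == 0).item())  # True

# The PAD row is excluded from gradient updates
embedded.sum().backward()
print(torch.all(embedding.weight.grad[0] == 0).item())  # True
```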


In [6]:
hidden_states = seq2seq_encoder.init_hidden(5)  # batch_size = 5

print(hidden_states[0].size())
print(hidden_states[1].size())

torch.Size([1, 5, 8])
torch.Size([1, 5, 8])


In [7]:
output, hidden_states = seq2seq_encoder.cell(embedded, hidden_states)

print(output.size())

torch.Size([3, 5, 8])


In [8]:
output, hidden_states = seq2seq_encoder(input_, hidden_states)

encoder_hidden, _ = hidden_states
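
The hidden state kept here is what serves as the context vector $c = h_n$. As a quick sanity check, sketched on a fresh `nn.LSTM` with random inputs rather than on the notebook's encoder (the tensors and zero initial states below are assumptions), the last timestep of `output` coincides with the returned hidden state for a single-layer, unidirectional LSTM:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(8, 8)              # single layer, unidirectional
embedded = torch.randn(3, 5, 8)   # max_seq_len x batch_size x hidden_size
h0 = torch.zeros(1, 5, 8)
c0 = torch.zeros(1, 5, 8)

output, (h_n, c_n) = lstm(embedded, (h0, c0))

print(output.size())  # torch.Size([3, 5, 8])
print(h_n.size())     # torch.Size([1, 5, 8])

# output stacks the hidden state at every timestep, so its last slice is h_n
print(torch.allclose(output[-1], h_n[0]))  # True
```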

Now for the decoder:

In [9]:
class sutskeverDecoder(nn.Module):
    
    def __init__(self, output_vocab_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        self.embedding  = nn.Embedding(output_vocab_size, hidden_size, padding_idx=0)  # PAD_token = 0
        self.cell = nn.LSTM(hidden_size, hidden_size)
        
        self.linear = nn.Linear(hidden_size, output_vocab_size)
        
    def forward(self, input_, hidden_states):
        """
            input size : max_seq_len x batch_size
            hidden_states: tuple (h, c) corresponding to the hidden and cell states
                           respectively, each with size 1 x batch_size x hidden_size
        """
        
        embedded = self.embedding(input_)
        output, hidden_states = self.cell(embedded, hidden_states)
        # normalise over the vocabulary axis (last dim), not the batch axis
        output = F.softmax(self.linear(output), dim=-1)
        
        return output, hidden_states
    
    def init_hidden(self, context, batch_size):
        c0 = torch.randn(1, batch_size, self.hidden_size)  # first dim: num_layers * num_directions
        return context, c0

#### Test: Sutskever Decoder

In [10]:
seq2seq_decoder = sutskeverDecoder(12, 8)  # output_vocab_size=12, hidden_size=8

print(seq2seq_decoder)

sutskeverDecoder(
  (embedding): Embedding(12, 8, padding_idx=0)
  (cell): LSTM(8, 8)
  (linear): Linear(in_features=8, out_features=12, bias=True)
)


In [11]:
input_ = torch.randint(size=(3, 5), low=0, high=10) # max_seq_length = 3, batch_size = 5
hidden_states = seq2seq_decoder.init_hidden(context=encoder_hidden, batch_size=5)

output, hidden_states = seq2seq_decoder(input_, hidden_states)

print(output.size())

torch.Size([3, 5, 12])
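
With an output of size max_seq_len x batch_size x output_vocab_size, it matters which axis the softmax normalises over. The sketch below uses a random tensor of the same shape (an assumption standing in for the decoder's linear-layer scores) to show that `dim=-1` gives one probability distribution over the vocabulary per timestep and batch element, whereas `dim=1` normalises across the batch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the linear layer's scores: max_seq_len x batch_size x output_vocab_size
scores = torch.randn(3, 5, 12)

# Softmax over the vocabulary axis: each (timestep, batch) slice sums to 1
probs = F.softmax(scores, dim=-1)
print(probs.size())                                         # torch.Size([3, 5, 12])
print(torch.allclose(probs.sum(dim=-1), torch.ones(3, 5)))  # True

# Softmax over dim=1 instead sums to 1 across the batch for each vocabulary entry
over_batch = F.softmax(scores, dim=1)
print(torch.allclose(over_batch.sum(dim=1), torch.ones(3, 12)))  # True
```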
