# RNN-Encoder-Decoder in PyTorch

**|| Jonty Sinai ||** 28-04-2019

- **Paper:** [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078)
- **Authors:** Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio
- **Topic:** Machine translation, sequence modelling, Neural network architectures
- **Year:** 2014

In this notebook I implement a sequence-to-sequence neural network based on the RNN-Encoder-Decoder framework introduced by two seminal papers in neural machine translation (NMT) and deep learning: Cho et al. (above) and Sutskever et al., *Sequence to Sequence Learning with Neural Networks*. Both papers were released within a few months of each other in 2014 (Cho et al. at EMNLP 2014 and Sutskever et al. at NeurIPS 2014), and as such both are credited with introducing the modern sequence-to-sequence framework. In contemporary deep learning, ideas from both papers have been combined and indeed extended into more sophisticated frameworks.

I will implement a generic sequence-to-sequence model based on ideas from both papers and will train the model on a simplified English-French translation dataset based on the [official PyTorch sequence-to-sequence tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#sphx-glr-intermediate-seq2seq-translation-tutorial-py). Whereas in my [ResNet implementation](https://github.com/JontySinai/artificial_neural_networks/blob/master/notebooks/resnet.ipynb) the goal was to explore modular composability with PyTorch, the goal here is to explore sequence-to-sequence design patterns and engineering with PyTorch.

> **Note:** Compared to computer vision, feature selection and data preprocessing in natural language processing (NLP) require more care and attention. There is no single correct way of preprocessing text, although there are many incorrect ways. Choices must be made carefully and efficiently. As a result, a good portion of this notebook is spent preprocessing the dataset and converting it into the right format for sequence-to-sequence learning.

In [1]:
import os
import re
import time
import math
import string
import glob
import random
import unicodedata
from io import open

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler

import numpy as np
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker

HOME = os.environ['AI_HOME']
ROOT = os.path.join(HOME, 'artificial_neural_networks')
DATA = os.path.join(ROOT, 'data')
ENG_FR = os.path.join(DATA, 'english_french')

random.seed(1901)
np.random.seed(1901)
torch.manual_seed(1901)

<torch._C.Generator at 0x7f4e4c14c6d0>

## Sequence-to-Sequence Model Architecture

A **sequence-to-sequence** architecture is a generic type of deep learning architecture consisting of two main components:

- An **encoder** which takes in an **input sequence** and encodes it into some **latent representation**, sometimes called the **context vector**.
- A **decoder** which takes in the latent representation and produces an **output sequence**.
    - The decoder may also take in an input sequence of its own. During training this is typically the target sequence. During prediction, this is the predicted sequence itself, but more on these details later.
    - During training the decoder can be thought of as a **generative model** which tries to produce the target sequence, given the context vector.

For example, in the French --> English translation task, a sequence-to-sequence model looks as follows:

<img src="assets/seq2seq.png">

Here, during prediction, each output token is fed back into the decoder as its next input.
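The difference between the decoder's inputs during training and prediction can be sketched with a toy example. This is a minimal illustration, not the notebook's implementation: the token indices `SOS = 1` and `EOS = 2`, the helper names and the lookup-table "decoder" are all assumptions made for the example, and the hidden state is left out entirely.

```python
SOS, EOS = 1, 2  # hypothetical token indices (0 is reserved for padding)


def teacher_forcing_inputs(target):
    """Training: the decoder input is the target shifted right, starting with SOS."""
    return [SOS] + target[:-1]


def greedy_decode(step_fn, max_len=10):
    """Prediction: each predicted token is fed back in as the next decoder input."""
    token, output = SOS, []
    for _ in range(max_len):
        token = step_fn(token)
        output.append(token)
        if token == EOS:
            break
    return output


# A lookup table standing in for a trained decoder step (hypothetical).
toy_next_token = {SOS: 5, 5: 7, 7: 9, 9: EOS}

target = [5, 7, 9, EOS]
train_inputs = teacher_forcing_inputs(target)   # [1, 5, 7, 9]
predicted = greedy_decode(toy_next_token.get)   # [5, 7, 9, 2]
```

In training the decoder always sees the ground-truth previous token, whereas in prediction an early mistake propagates to every later step.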

An **RNN-Encoder-Decoder** is a type of sequence-to-sequence architecture where both the encoder and decoder are RNNs. The two papers differ slightly in how information is passed from the encoder to the decoder. I will summarise both below:

>**Note:** We will implement both and compare their performance on the translation task. Because we will be using shallower versions of the LSTM/GRU cells, a smaller dataset and less compute, the training results in no way indicate which method is better. In fact, both are valid, and contemporary methods use ideas from both. In general it is best to experiment with different architectures for the task at hand.

#### Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al, 2014)

<img src="assets/cho-rnn-encoder-decoder.png" width="400px">

- The **encoder** loops through each timestep in the input sequence, exactly as in the Sutskever et al. paper.

- This time the **context vector** is calculated by passing the final hidden state through a nonlinear layer $f_C$:

$$c = f_C(h_n)$$

- The **decoder** is kicked off with the `<SOS>` token as input, but now the initial hidden state is calculated from the context vector using a nonlinear layer, $f_H$:
$$h_0 = f_H(c)$$
    <br/>
    Then at each time $t$, the hidden state is calculated using the previous hidden state, the decoder input, $\tilde{y}_{t-1}$ (whose value differs depending on whether the model is being used for training or prediction), and the context vector:
    
    \begin{align}
    h_t = f_D(h_{t-1}, \tilde{y}_{t-1}, c)
    \end{align}
    <br/>
    Finally the output is calculated as a _linear combination_ of the hidden state, decoder input and context vector:
    
    $$z_t = O_{h}h_t + O_{y}\tilde{y}_{t-1} + O_{c}c$$

- The recurrent cell is a **GRU**, which was introduced in this paper, for both the encoder and decoder.
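The equations above can be sketched as a single decoder step. This is a rough sketch under stated assumptions, not the notebook's decoder: the class name `ChoDecoderSketch`, the use of `tanh` for $f_C$ and $f_H$, the choice to feed $c$ into the GRU by concatenating it with the embedded input, and all sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ChoDecoderSketch(nn.Module):
    """Hypothetical one-step decoder following the equations above."""

    def __init__(self, output_vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(output_vocab_size, hidden_size, padding_idx=0)
        self.f_H = nn.Linear(hidden_size, hidden_size)    # h_0 = tanh(f_H(c)), assumed form
        self.cell = nn.GRU(2 * hidden_size, hidden_size)  # f_D sees [y_{t-1}; c]
        # z_t = O_h h_t + O_y y_{t-1} + O_c c
        self.O_h = nn.Linear(hidden_size, output_vocab_size, bias=False)
        self.O_y = nn.Linear(hidden_size, output_vocab_size, bias=False)
        self.O_c = nn.Linear(hidden_size, output_vocab_size, bias=False)

    def init_hidden(self, context):
        # context: 1 x batch_size x hidden_size
        return torch.tanh(self.f_H(context))

    def forward(self, prev_token, hidden, context):
        # prev_token: 1 x batch_size (indices of y_{t-1})
        y = self.embedding(prev_token)                    # 1 x batch_size x hidden_size
        _, hidden = self.cell(torch.cat([y, context], dim=2), hidden)
        z = self.O_h(hidden) + self.O_y(y) + self.O_c(context)
        return z.squeeze(0), hidden                       # z: batch_size x vocab_size


# Usage with made-up sizes: vocabulary of 20, hidden size 8, batch of 3.
f_C = nn.Linear(8, 8)                      # context layer, assumed form c = tanh(f_C(h_n))
h_n = torch.randn(1, 3, 8)                 # stand-in for the encoder's final hidden state
c = torch.tanh(f_C(h_n))

decoder = ChoDecoderSketch(output_vocab_size=20, hidden_size=8)
h = decoder.init_hidden(c)
sos = torch.full((1, 3), 1, dtype=torch.long)  # hypothetical <SOS> index 1
z, h = decoder(sos, h, c)
```

Here `z` holds unnormalised scores over the output vocabulary; a softmax (or a cross-entropy loss on the raw scores) would typically follow.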


In [2]:
class choEncoder(nn.Module):
    """GRU encoder in the style of Cho et al. (2014)."""

    def __init__(self, input_vocab_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Index 0 is reserved for padding, so its embedding is a fixed zero vector.
        self.embedding = nn.Embedding(input_vocab_size, hidden_size, padding_idx=0)  # PAD_token = 0
        # The paper uses a GRU, which carries a single hidden state (no cell state).
        self.cell = nn.GRU(hidden_size, hidden_size)

    def forward(self, input_, hidden):
        """
            input_ size : max_seq_len x batch_size
            hidden size : 1 x batch_size x hidden_size
        """
        embedded = self.embedding(input_)  # size: max_seq_len x batch_size x hidden_size
        output, hidden = self.cell(embedded, hidden)

        return output, hidden

    def init_hidden(self, batch_size):
        # first dim: num_layers * num_directions
        return torch.randn(1, batch_size, self.hidden_size)
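The shape conventions used by the encoder can be checked in isolation. The snippet below is a small sketch with made-up sizes and token indices; it uses a bare `nn.Embedding` and `nn.GRU` as stand-ins rather than the encoder class itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_size, batch_size = 10, 4, 2  # hypothetical sizes

embedding = nn.Embedding(vocab_size, hidden_size, padding_idx=0)
rnn = nn.GRU(hidden_size, hidden_size)

# Two sequences laid out as max_seq_len x batch_size; the second is padded with 0.
batch = torch.tensor([[3, 4],
                      [5, 6],
                      [7, 2],
                      [8, 0],
                      [2, 0]])

embedded = embedding(batch)                      # max_seq_len x batch_size x hidden_size
h0 = torch.zeros(1, batch_size, hidden_size)     # num_layers * num_directions = 1
output, h_n = rnn(embedded, h0)

pad_is_zero = bool((embedded[3, 1] == 0).all())  # padding_idx rows embed to zeros
```

For a single-layer, unidirectional GRU, the last row of `output` equals `h_n`. Note that without packing (for example with `torch.nn.utils.rnn.pack_padded_sequence`) the recurrent layer still steps through the padded positions, so `h_n` for the shorter sequence includes those steps.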