# Practice: Sequence to Sequence for Neural Machne Translation.

_Reference: this notebook is based on [open-source implementation](https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb) of seq2seq NMT in PyTorch._

We are going to implement the model from the [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) paper. 

The model will be trained for German to English translations, but it can be applied to any problem that involves going from one sequence to another, such as summarization.


## Introduction

The most common sequence-to-sequence (seq2seq) models are *encoder-decoder* models, which often use a *recurrent neural network* (RNN) to *encode* the source (input) sentence into a single vector. In this notebook, we'll refer to this single vector as a *context vector*. You can think of the context vector as being an abstract representation of the entire input sentence. This vector is then *decoded* by a second RNN which learns to output the target (output) sentence by generating it one word at a time.

![](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/master/assets/seq2seq1.png)

The above image shows an example translation. The input/source sentence, "guten morgen", is input into the encoder (green) one word at a time. We also append a *start of sequence* (`<sos>`) and *end of sequence* (`<eos>`) token to the start and end of sentence, respectively. At each time-step, the input to the encoder RNN is both the current word, $x_t$, as well as the hidden state from the previous time-step, $h_{t-1}$, and the encoder RNN outputs a new hidden state $h_t$. You can think of the hidden state as a vector representation of the sentence so far. The RNN can be represented as a function of both of $x_t$ and $h_{t-1}$:

$$h_t = \text{EncoderRNN}(x_t, h_{t-1})$$

We're using the term RNN generally here, it could be any recurrent architecture, such as an *LSTM* (Long Short-Term Memory) or a *GRU* (Gated Recurrent Unit). 

Here, we have $X = \{x_1, x_2, ..., x_T\}$, where $x_1 = \text{<sos>}, x_2 = \text{guten}$, etc. The initial hidden state, $h_0$, is usually either initialized to zeros or a learned parameter.

Once the final word, $x_T$, has been passed into the RNN, we use the final hidden state, $h_T$, as the context vector, i.e. $h_T = z$. This is a vector representation of the entire source sentence.

Now we have our context vector, $z$, we can start decoding it to get the target sentence, "good morning". Again, we append start and end of sequence tokens to the target sentence. At each time-step, the input to the decoder RNN (blue) is the current word, $y_t$, as well as the hidden state from the previous time-step, $s_{t-1}$, where the initial decoder hidden state, $s_0$, is the context vector, $s_0 = z = h_T$, i.e. the initial decoder hidden state is the final encoder hidden state. Thus, similar to the encoder, we can represent the decoder as:

$$s_t = \text{DecoderRNN}(y_t, s_{t-1})$$

In the decoder, we need to go from the hidden state to an actual word, therefore at each time-step we use $s_t$ to predict (by passing it through a `Linear` layer, shown in purple) what we think is the next word in the sequence, $\hat{y}_t$. 

$$\hat{y}_t = f(s_t)$$

We always use `<sos>` for the first input to the decoder, $y_1$, but for subsequent inputs, $y_{t>1}$, we will sometimes use the actual, ground truth next word in the sequence, $y_t$ and sometimes use the word predicted by our decoder, $\hat{y}_{t-1}$. This is called *teacher forcing*, and you can read about it more [here](https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/).

When training/testing our model, we always know how many words are in our target sentence, so we stop generating words once we hit that many. During inference (i.e. real world usage) it is common to keep generating words until the model outputs an `<eos>` token or after a certain amount of words have been generated.

Once we have our predicted target sentence, $\hat{Y} = \{ \hat{y}_1, \hat{y}_2, ..., \hat{y}_T \}$, we compare it against our actual target sentence, $Y = \{ y_1, y_2, ..., y_T \}$, to calculate our loss. We then use this loss to update all of the parameters in our model.

## Preparing Data

We'll be using data provided by [torchtext](https://pytorch.org/text/stable/) and coding the models in PyTorch. We'll also be using [nltk](https://www.nltk.org) to assist with the tokenization.

First of all, let's load the data. We will be using the [Multi30k dataset](https://github.com/multi30k/dataset). This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence. We will train model to translate sentences from German into English.

In [2]:
!python -m pip install torchtext==0.4.0
!python -m pip install subword_nmt

[0m

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim

import torchtext
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator

import spacy

import random
import math
import time

import matplotlib
matplotlib.rcParams.update({'figure.figsize': (16, 12), 'font.size': 14})
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output

from nltk.tokenize import WordPunctTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

As we can see, dataset provides us with pairs of sentences. However, working with whole sentences is not convenient. For this reason we will use tokenizers. A tokenizer is used to turn a string containing a sentence into a list of individual tokens that make up that string, e.g. "good morning!" becomes ["good", "morning", "!"]. We'll start talking about the sentences being a sequence of tokens from now, instead of saying they're a sequence of words. What's the difference? Well, "good" and "morning" are both words and tokens, but "!" is a token, not a word.

Just like in previous practice we'll use the `WordPunctTokenizer` from `nltk` library.

Sentences in our dataset are slightly more complicated, but nothing nltk cannot handle. Before tokenization, however, it is important to lowercase the data. And get rid of a `\n` at the end of each sentence whilst we are at it. This yields the following data processing pipeline:

In [4]:
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
def tokenize(x, tokenizer=tokenizer):
    """Lower sentence and then return the list of tokens"""
    return tokenizer.tokenize(x.lower())

In [36]:
trg = Field(
            lower = True,
            init_token="<sos>",
            eos_token="<eos>",
            tokenize=tokenize,
            pad_token='<pad>',
            unk_token='<unk>',
           )

src = Field(
            lower = True,
            init_token="<sos>",
            eos_token="<eos>",
            tokenize=tokenize,
            pad_token='<pad>',
            unk_token='<unk>',
           )


train_data, valid_data, test_data = Multi30k.splits(
    exts=('.de', '.en'),
    fields = (src, trg)
)

trg.build_vocab(train_data, min_freq=3)
src.build_vocab(train_data, min_freq=3)


print(f"Number of training examples: {len(train_data)}")
print(vars(train_data.examples[0]))

Number of training examples: 29000
{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


In [None]:
cd .data\\val.de

In [6]:
trg.vocab.itos[:10]

['<unk>', '<pad>', '<sos>', '<eos>', 'a', '.', 'in', 'the', 'on', 'man']

In the paper we are implementing, they find it beneficial to reverse the order of the input which they believe "introduces many short term dependencies in the data that make the optimization problem much easier". This means that we need to slightly modify previous code:

Next, we need to build the *vocabulary* for the source and target languages. The vocabulary is used to associate each unique token with an index (an integer) and this is used to build a one-hot encoding for each token (a vector of all zeros except for the position represented by the index, which is 1). The vocabularies of the source and target languages are distinct. It is important to note that your vocabulary should only be built from the training set and not the validation/test set. This prevents "information leakage" into your model, giving you artifically inflated validation/test scores.

Using the `min_freq` argument, we only allow tokens that appear at least 2 times to appear in our vocabulary.

Tokens that appear only once (or do not appear in training data at all) should be converted into an `<unk>` (unknown) token. We can achieve this by adding it to our vocabularies and setting it to default.

Another special tokens we want to have in our vocabularies are `<sos>` (start of sequence), `<eos>` (end of sequence) and `<pad>` (padding) tokens.

In [7]:
sos_token, eos_token, pad_token = "<sos>", "<eos>", "<pad>"
for stop_token in (sos_token, eos_token, pad_token):
    assert stop_token in src.vocab.itos
    assert stop_token in trg.vocab.itos

Let's check the sizes of our vocabularies:

In [8]:
src_vocab = src.vocab.stoi
trg_vocab = trg.vocab.stoi
print(f"Unique tokens in source (de) vocabulary: {len(src_vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(trg_vocab)}")

Unique tokens in source (de) vocabulary: 5429
Unique tokens in target (en) vocabulary: 4566


Now we can encode our tokenized sentences (i.e. convert them into sequences of token indices) as follows:

In [9]:
print(vars(train_data.examples[0]))

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


In [10]:
test_src = vars(train_data.examples[0])['src']
test_trg = vars(train_data.examples[0])['trg']
print(f"test_src: {test_src}")
print(f"test_trg: {test_trg}")

test_src: ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.']
test_trg: ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']


In [11]:
# Tokenize sentence and add <sos> and <eos> special tokens.
test_tokenized = [sos_token] + test_trg + [eos_token]

# Transform tokens into indices using our vocab.
test_encoded = [trg_vocab[tok] for tok in test_tokenized]

[(tok, idx) for tok, idx in zip(test_tokenized, test_encoded)]

[('<sos>', 2),
 ('two', 16),
 ('young', 24),
 (',', 15),
 ('white', 25),
 ('males', 780),
 ('are', 17),
 ('outside', 57),
 ('near', 80),
 ('many', 203),
 ('bushes', 1312),
 ('.', 5),
 ('<eos>', 3)]

Just like before with `tokenize`, let's pack it into a neat little function:

In [37]:
trg_itos = trg.vocab.itos

In [12]:
def encode(sent, vocab):
    tokenized = [sos_token] + sent + [eos_token]
    return [vocab[tok] for tok in tokenized]


# Note that here we commited a little crime: after [::-1] the <sos> token in
# src sequence is gone to the end whilst the <eos> ended up the last token.
# However, it is not that big a problem: <sos> and <eos> tokens only have
# special meaning for us, for model they are just some tokens until it's been
# trained to use them. For this reason it doesn't care for the actual name of
# init and end tokens. And in trg sequence everything is fine, so we are not
# ending up starting translation with <eos> token or anything.
print(encode(test_src, src_vocab)[::-1])
print(encode(test_trg, trg_vocab))

[3, 4, 3223, 0, 115, 15, 7, 89, 20, 85, 31, 215, 27, 18, 2]
[2, 16, 24, 15, 25, 780, 17, 57, 80, 203, 1312, 5, 3]


Now we know how to preprocess our input and output sentences into a NN-readable format. The last thing we need to do is to create a `DataLoader` for our data, which will take our sentences and put them together to form a batch. Problem here lies in the fact that sentences can have different sizes and items in one batch absolutely cannot. For this we use padding (remember the `<pad>` token). Back in version `0.8` torchtext used to provide its own custom classes which handled tokenization, `<sos>`, `<eos>` and `<unk>` tokens and added padding. However, since that time, torchtext ditched this functionality in order to make their dataloading API consistent with PyTorch's one. This in turn means that padding (as well as everything else mentioned) ourselves. Luckily, PyTorch's `DataLoader` supports custom collate functions, which seems like a great place to do all our preprocessing, including padding.

> **Note:** you can read more about difference in data handling between versions `0.8` and `0.9` in this [migration tutorial](https://github.com/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb) from which I borrowed the original idea for `collate_batch` function. However, be careful, as the tutorial uses version `0.9` whilst the actual version for today is `0.11` and there have been slight differences introduced inbetween, but nothing as big as transition from `0.8` to `0.9`.

Collate function (`collate_fn` optional argument in `DataLoader.__init__`) takes only one argument which is current data batch, collated into a python list, and should output a collated tensor. `DataLoader` supports different data formats for `Dataset`. Most common are a tuple or a dictionary of values. `DataLoader` takes these containers and collates them into the same typed container with lists of items, e.g. dictionary with same keys, but each key has a list of items or a tuple of lists. We are working with `Multi30k` data, which means that our `train_data` contains tuples of strings (`src` and `trg`). This means that our `collate_batch` function will receive a tuple of 2 lists of strings. And we need to collate them into a single `torch.tensor`. In order to do this, we first need to encode each string into a list of token indices (don't forget to reverse the `src` sequence) and pad them to a same length before creating a tensor. For padding we will use the PyTorch's `pad_sequence` function, which takes a list of tensors of different shapes and a padding value and returns the padded tensor.

In [13]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def _len_sort_key(x):
    return len(x.src)

BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device,
    sort_key=_len_sort_key
)

In [14]:
for batch in train_iterator:
    (x, y), _ = batch
    break
print(x.shape)
print(y.shape)

torch.Size([31, 128])
torch.Size([30, 128])


In [15]:
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate_batch(batch):
    src_list, trg_list = [], []
    for src, trg in batch:
        # YOUR CODE HERE
        # Encode src and trg sentences, convert them into a tensor
        # and store them in src_list and trg_list respectively.
        encoded_src = torch.as_tensor(encode(src, src_vocab))[::-1]
        encoded_trg = torch.as_tensor(encode(trg, trg_vocab))
        src_list.append(encoded_src)
        trg_list.append(encoded_trg)
    # YOUR CODE HERE
    # Pad sequences with pad_sequence function.
    src_padded = pad_sequence(src_list, batch_first=True, padding_value=src_vocab[pad_token])
    trg_padded = pad_sequence(trg_list, batch_first=True, padding_value=trg_vocab[pad_token])

    return src_padded, trg_padded


# batch_size = 256
# train_dataloader = DataLoader(train_iterator, batch_size, shuffle=True, collate_fn=collate_batch)
# src_batch, trg_batch = next(iter(train_dataloader))
# src_batch.shape, trg_batch.shape

Cool! Now we can load our data and store it in batches. Whilst we are at it, let's create a dataloader for a validation, which we will use to evaluate out model during training.

One could mention, that the first dimention is now `seq_len`, not `batch_size` as used to be. It's because in PyTorch LSTM (and other recurrent units) await for input in format `(seq_len, batch_size, input_size)`. Be careful with that (especially in your homework assignment).

## Building the Seq2Seq Model

We'll be building our model in three parts. The encoder, the decoder and a seq2seq model that encapsulates the encoder and decoder and will provide a way to interface with each.

### Encoder

First, the encoder, a 2 layer LSTM. The paper we are implementing uses a 4-layer LSTM, but in the interest of training time we cut this down to 2-layers. The concept of multi-layer RNNs is easy to expand from 2 to 4 layers. 

For a multi-layer RNN, the input sentence, $X$, goes into the first (bottom) layer of the RNN and hidden states, $H=\{h_1, h_2, ..., h_T\}$, output by this layer are used as inputs to the RNN in the layer above. Thus, representing each layer with a superscript, the hidden states in the first layer are given by:

$$h_t^1 = \text{EncoderRNN}^1(x_t, h_{t-1}^1)$$

The hidden states in the second layer are given by:

$$h_t^2 = \text{EncoderRNN}^2(h_t^1, h_{t-1}^2)$$

Using a multi-layer RNN also means we'll also need an initial hidden state as input per layer, $h_0^l$, and we will also output a context vector per layer, $z^l$.

Without going into too much detail about LSTMs (see [this](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) blog post if you want to learn more about them), all we need to know is that they're a type of RNN which instead of just taking in a hidden state and returning a new hidden state per time-step, also take in and return a *cell state*, $c_t$, per time-step.

$$\begin{align*}
h_t &= \text{RNN}(x_t, h_{t-1})\\
(h_t, c_t) &= \text{LSTM}(x_t, (h_{t-1}, c_{t-1}))
\end{align*}$$


You can just think of $c_t$ as another type of hidden state. Similar to $h_0^l$, $c_0^l$ will be initialized to a tensor of all zeros. Also, our context vector will now be both the final hidden state and the final cell state, i.e. $z^l = (h_T^l, c_T^l)$.

Extending our multi-layer equations to LSTMs, we get:

$$\begin{align*}
(h_t^1, c_t^1) &= \text{EncoderLSTM}^1(x_t, (h_{t-1}^1, c_{t-1}^1))\\
(h_t^2, c_t^2) &= \text{EncoderLSTM}^2(h_t^1, (h_{t-1}^2, c_{t-1}^2))
\end{align*}$$

Note how only our hidden state from the first layer is passed as input to the second layer, and not the cell state.

So our encoder looks something like this: 

![](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/master/assets/seq2seq2.png)

We create this in code by making an `Encoder` module, which requires we inherit from `torch.nn.Module` and use the `super().__init__()` as some boilerplate code. The encoder takes the following arguments:
- `n_tokens` is the input (source) vocabulary size.
- `emb_dim` is the dimensionality of the embedding layer. This layer converts the one-hot vectors into dense vectors with `emb_dim` dimensions. 
- `hid_dim` is the dimensionality of the hidden and cell states.
- `n_layers` is the number of layers in the RNN.
- `dropout` is the amount of dropout to use. This is a regularization parameter to prevent overfitting. Check out [this](https://www.coursera.org/lecture/deep-neural-network/understanding-dropout-YaGbR) for more details about dropout.

To get more info about `nn.Embedding` one could refer to these articles: [1](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/), [2](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html), [3](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), [4](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/). 

The embedding layer is created using `nn.Embedding`, the LSTM with `nn.LSTM` and a dropout layer with `nn.Dropout`. Check the PyTorch [documentation](https://pytorch.org/docs/stable/nn.html) for more about these.

One thing to note is that the `dropout` argument to the LSTM is how much dropout to apply between the layers of a multi-layer RNN, i.e. between the hidden states output from layer $l$ and those same hidden states being used for the input of layer $l+1$.

In the `forward` method, we pass in the source sentence, $X$, which is converted into dense vectors using the `embedding` layer, and then dropout is applied. These embeddings are then passed into the RNN. As we pass a whole sequence to the RNN, it will automatically do the recurrent calculation of the hidden states over the whole sequence for us! You may notice that we do not pass an initial hidden or cell state to the RNN. This is because, as noted in the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM), that if no hidden/cell state is passed to the RNN, it will automatically create an initial hidden/cell state as a tensor of all zeros. 

The RNN returns: `outputs` (the top-layer hidden state for each time-step), `hidden` (the final hidden state for each layer, $h_T$, stacked on top of each other) and `cell` (the final cell state for each layer, $c_T$, stacked on top of each other).

As we only need the final hidden and cell states (to make our context vector), `forward` only returns `hidden` and `cell`. 

The sizes of each of the tensors is left as comments in the code. In this implementation `n_directions` will always be 1, because for now we are working only with one-direction LSTM.

In [16]:
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, n_tokens, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.n_tokens = n_tokens
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        # YOUR CODE HERE
        # Define embedding, dropout and LSTM layers.
        self.embedding = nn.Embedding(n_tokens, emb_dim)
        self.dropout = nn.Dropout(dropout)
        self.rnn = nn.LSTM(emb_dim, hid_dim, num_layers = n_layers, batch_first=False)

    def forward(self, src):
        # src has a shape of [seq_len, batch_size]

        # YOUR CODE HERE
        # Compute an embedding from src data and apply dropout.
        embedded = self.embedding(src)
        # embedded should have a shape of [seq_len, batch_size, emb_dim]

        # YOUR CODE HERE
        # Compute the RNN output values.
        all_hidden, (last_hidden, last_cell) = self.rnn(embedded)
        # When using LSTM, hidden should be a tuple of two tensors:
        # 1) hidden state
        # 2) cell state
        # both of shape [n_layers * n_directions, batch_size, hid_dim]

        return (last_hidden, last_cell)

In [16]:
enc = Encoder(len(src_vocab), emb_dim=256, hid_dim=512, n_layers=2, dropout=0.5)
enc_out = enc(x)
enc_out[0].shape, enc_out[1].shape

(torch.Size([2, 128, 512]), torch.Size([2, 128, 512]))

### Decoder

Next, we'll build our decoder, which will also be a 2-layer (4 in the paper) LSTM.

![](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/master/assets/seq2seq3.png)

The `Decoder` class does a single step of decoding. The first layer will receive a hidden and cell state from the previous time-step, $(s_{t-1}^1, c_{t-1}^1)$, and feed it through the LSTM with the current token, $y_t$, to produce a new hidden and cell state, $(s_t^1, c_t^1)$. The subsequent layers will use the hidden state from the layer below, $s_t^{l-1}$, and the previous hidden and cell states from their layer, $(s_{t-1}^l, c_{t-1}^l)$. This provides equations very similar to those in the encoder.

$$\begin{align*}
(s_t^1, c_t^1) = \text{DecoderLSTM}^1(y_t, (s_{t-1}^1, c_{t-1}^1))\\
(s_t^2, c_t^2) = \text{DecoderLSTM}^2(s_t^1, (s_{t-1}^2, c_{t-1}^2))
\end{align*}$$

Remember that the initial hidden and cell states to our decoder are our context vectors, which are the final hidden and cell states of our encoder from the same layer, i.e. $(s_0^l,c_0^l)=z^l=(h_T^l,c_T^l)$.

We then pass the hidden state from the top layer of the RNN, $s_t^L$, through a linear layer, $f$, to make a prediction of what the next token in the target (output) sequence should be, $\hat{y}_{t+1}$. 

$$\hat{y}_{t+1} = f(s_t^L)$$

The arguments and initialization are similar to the `Encoder` class, except now `n_tokens` is the size of the target vocabulary. There is also the addition of the `Linear` layer, used to make the predictions from the top layer hidden state.

Within the `forward` method, we accept a batch of input tokens, previous hidden states and previous cell states. We `unsqueeze` the input tokens to add a sentence length dimension of 1. Then, similar to the encoder, we pass through an embedding layer and apply dropout. This batch of embedded tokens is then passed into the RNN with the previous hidden and cell states. This produces an `output` (hidden state from the top layer of the RNN), a new `hidden` state (one for each layer, stacked on top of each other) and a new `cell` state (also one per layer, stacked on top of each other). We then pass the `output` (after getting rid of the sentence length dimension) through the linear layer to receive our `prediction`. We then return the `prediction`, the new `hidden` state and the new `cell` state.

In [17]:
class Decoder(nn.Module):
    def __init__(self, n_tokens, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.n_tokens = n_tokens
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        # YOUR CODE HERE
        # Define embedding, dropout and LSTM layers.
        self.embedding = nn.Embedding(n_tokens, emb_dim)
        self.dropout = nn.Dropout(dropout)
        self.rnn = nn.LSTM(emb_dim, hid_dim, num_layers = n_layers, batch_first = False)  
        # Additionaly, Decoder will need a linear layer to predict next token.
        self.fc = nn.Linear(hid_dim, n_tokens)
        
        
    def forward(self, input, hidden):
        # input has a shape of [batch_size]
        # hidden is a tuple of two tensors:
        # 1) hidden state
        # 2) cell state
        # both of shape [n_layers, batch_size, hid_dim]
        # (n_directions in the decoder shall always be 1)

        # YOUR CODE HERE
        # Compute an embedding from input data and apply dropout.
        # Remember, that LSTM layer expects input to have a shape of
        # [seq_len, batch_size, emb_dim], which means that we need
        # to somehow introduce the seq_len dimension into our input tensor.
        embedded = self.embedding(input.unsqueeze(0))
        embedded = self.dropout(embedded)

        # YOUR CODE HERE
        # Compute the RNN output values.
        output, hidden = self.rnn(embedded, hidden)

        # YOUR CODE HERE
        # output has a shape of [seq_len, batch_size, hid dim]
        # Compute logits for the next token probabilities from RNN output.
        pred = self.fc(output.squeeze(0))

        # should have a shape [batch_size, n_tokens]
        return pred, hidden

In [28]:
dec = Decoder(len(trg_vocab), emb_dim=256, hid_dim=512, n_layers=2, dropout=0.5)
dec_out = enc(x)
dec_out[0].shape, dec_out[1].shape

(torch.Size([2, 128, 512]), torch.Size([2, 128, 512]))

### Seq2Seq

For the final part of the implemenetation, we'll implement the seq2seq model. This will handle: 
- receiving the input/source sentence
- using the encoder to produce the context vectors 
- using the decoder to produce the predicted output/target sentence

Our full model will look like this:

![](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/master/assets/seq2seq4.png)

The `Seq2Seq` model takes in an `Encoder`, `Decoder`, and a `device` (used to place tensors on the GPU, if it exists).

For this implementation, we have to ensure that the number of layers and the hidden (and cell) dimensions are equal in the `Encoder` and `Decoder`. This is not always the case, you do not necessarily need the same number of layers or the same hidden dimension sizes in a sequence-to-sequence model. However, if you do something like having a different number of layers you will need to make decisions about how this is handled. For example, if your encoder has 2 layers and your decoder only has 1, how is this handled? Do you average the two context vectors output by the decoder? Do you pass both through a linear layer? Do you only use the context vector from the highest layer? Etc.

Our `forward` method takes the source sentence, target sentence and a teacher-forcing ratio. The teacher forcing ratio is used when training our model. When decoding, at each time-step we will predict what the next token in the target sequence will be from the previous tokens decoded, $\hat{y}_{t+1}=f(s_t^L)$. With probability equal to the teaching forcing ratio (`teacher_forcing_ratio`) we will use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. However, with probability `1 - teacher_forcing_ratio`, we will use the token that the model predicted as the next input to the model, even if it doesn't match the actual next token in the sequence.  

The first thing we do in the `forward` method is to create an `outputs` tensor that will store all of our predictions, $\hat{Y}$.

We then feed the input/source sentence, $X$/`src`, into the encoder and receive out final hidden and cell states.

The first input to the decoder is the start of sequence (`<sos>`) token. As our `trg` tensor already has the `<sos>` token appended (all the way back when we defined the `init_token` in our `TRG` field) we get our $y_1$ by slicing into it. We know how long our target sentences should be (`max_len`), so we loop that many times. During each iteration of the loop, we:
- pass the input, previous hidden and previous cell states ($y_t, s_{t-1}, c_{t-1}$) into the decoder
- receive a prediction, next hidden state and next cell state ($\hat{y}_{t+1}, s_{t}, c_{t}$) from the decoder
- place our prediction, $\hat{y}_{t+1}$/`output` in our tensor of predictions, $\hat{Y}$/`outputs`
- decide if we are going to "teacher force" or not
    - if we do, the next `input` is the ground-truth next token in the sequence, $y_{t+1}$/`trg[t]`
    - if we don't, the next `input` is the predicted next token in the sequence, $\hat{y}_{t+1}$/`top1`
    
Once we've made all of our predictions, we return our tensor full of predictions, $\hat{Y}$/`outputs`.

> **Note:** our decoder loop starts at 1, not 0. This means that `preds` will have `seq_len - 1` items. So our `trg` and `outputs` look something like:
> $$\begin{align*}
  \text{trg} = [<sos>, &y_1, y_2, y_3, <eos>]\\
  \text{preds} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
  \end{align*}$$
> Later on when we calculate the loss, we cut off the first element of `trg` tensor to get:
> $$\begin{align*}
  \text{trg} = [&y_1, y_2, y_3, <eos>]\\
  \text{preds} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
  \end{align*}$$

In [18]:
import random


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder

        assert encoder.hid_dim == decoder.hid_dim, "encoder and decoder must have same hidden dim"
        assert (
            encoder.n_layers == decoder.n_layers
        ), "encoder and decoder must have equal number of layers"

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src has a shape of [src_seq_len, batch_size]
        # trg has a shape of [trg_seq_len, batch_size]
        # teacher_forcing_ratio is probability to use teacher forcing, e.g. if
        # teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.n_tokens

        # tensor to store decoder predictions
        preds = []

        # Last hidden state of the encoder is used as
        # the initial hidden state of the decoder.
        hidden = self.encoder(src)

        # First input to the decoder is the  token.
        input = trg[0, :]

        for i in range(1, trg_len):
            #print(input.shape, hidden[0].shape, hidden[1].shape)
            pred, hidden = self.decoder(input, hidden)
            preds.append(pred)
            teacher_force = random.random() < teacher_forcing_ratio
            _, top_pred = pred.max(dim=1)
            input = trg[i, :] if teacher_force else top_pred

        return torch.stack(preds)

# Training the Seq2Seq Model

Now we have our model implemented, we can begin training it. 

First, we'll initialize our model. As mentioned before, the input and output dimensions are defined by the size of the vocabulary. The embedding dimesions and dropout for the encoder and decoder can be different, but the number of layers and the size of the hidden/cell states must be the same. 

We then define the encoder, decoder and then our `Seq2Seq` model, which we place on the `device`.

In [28]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
enc = Encoder(len(src_vocab), emb_dim=256, hid_dim=512, n_layers=2, dropout=0.5)
dec = Decoder(len(trg_vocab), emb_dim=256, hid_dim=512, n_layers=2, dropout=0.5)
model = Seq2Seq(enc, dec).to(device)

Next up is initializing the weights of our model. In the paper they state they initialize all weights from a uniform distribution between -0.08 and +0.08, i.e. $\mathcal{U}(-0.08, 0.08)$.

We initialize weights in PyTorch by creating a function which we `apply` to our model. When using `apply`, the `init_weights` function will be called on every module and sub-module within our model. For each module we loop through all of the parameters and sample them from a uniform distribution with `nn.init.uniform_`.

In [29]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param, -0.08, 0.08)

model.apply(init_weights);

We also define a function that will calculate the number of trainable parameters in the model.

In [30]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f"The model has {count_parameters(model):,} trainable parameters")

The model has 19,691,738 trainable parameters


We define our optimizer, which we use to update our parameters in the training loop. Check out [this](http://ruder.io/optimizing-gradient-descent/) post for information about different optimizers. Here, we'll use Adam.

In [31]:
optimizer = torch.optim.Adam(model.parameters())

Next, we define our loss function. The `CrossEntropyLoss` function calculates both the log softmax as well as the negative log-likelihood of our predictions. 

Our loss function calculates the average loss per token, however by passing the index of the `<pad>` token as the `ignore_index` argument we ignore the loss whenever the target token is a padding token. 

In [32]:
criterion = nn.CrossEntropyLoss(ignore_index=trg_vocab[pad_token])

Now we have everything we need to start training our model. However, there is one more detail to todays practice. Today, instead of plotting the training process with `matplotlib`, we will use a [TensorBoard](https://www.tensorflow.org/tensorboard) to track our training process online. TensorBoard is an application by Google which allows to visualize training logs. In order for this to work, however, you need to store your logs in a local directory and using special data format. Luckily, PyTorch has a [module](https://pytorch.org/docs/stable/tensorboard.html) which handles the data flow for us. We will see, how to work with this module in a minute, however, right now we need to start our tensorboard process, so that we can keep track of our progress on the go. Jupyter notebooks (and Google Colab) have a special extension to run tensorboard and display it in a notebook inplace. You can read more on that in this [guide](https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks). For now let's just copy-paste the required commands and move on.

**Note**: if you're running this notebook on your own machine, you would need to install the `tensorboard` package before running the following cell.

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

Finally, let's train our model. We will train it for 50 epochs, evaluating it after each epoch on a validation data. We will only track crossentropy for now, however, you're free to try and add accuracy or even BLEU score computation.

First, we'll set the model into "training mode" with `model.train()`. This will turn on dropout (and batch normalization, which we aren't using) and then iterate through our data iterator.

At each iteration:
- get the source and target sentences from the batch, $X$ and $Y$
- zero the gradients calculated from the last batch
- feed the source and target into the model to get the output, $\hat{Y}$
- as the loss function only works on 2d inputs with 1d targets we need to flatten each of them with `.view`
    - we also don't want to measure the loss of the `<sos>` token, hence we slice off the first column of the output and target tensors
- calculate the gradients with `loss.backward()`
- clip the gradients to prevent them from exploding (a common issue in RNNs)
- update the parameters of our model by doing an optimizer step
- sum the loss value to a running total

Finally, we average the loss over all batches and evaluate our model on validation. During evaluation it's important to remember to switch model to "evaluation mode" with `model.eval()` and use the `with torch.no_grad()` block to ensure no gradients are calculated within the block. This reduces memory consumption and speeds things up.

In [33]:
x.shape, y.shape

(torch.Size([31, 128]), torch.Size([30, 128]))

In [34]:
from torch.nn.utils import clip_grad_norm_
from torch.utils.tensorboard import SummaryWriter
from tqdm.auto import tqdm, trange



n_epochs = 50
clip = 1
global_step = 0  # for writer
for epoch in trange(n_epochs, desc="Epochs"):
    model.train()
    train_loss = 0
    for (src, trg), _ in tqdm(train_iterator, desc="Train", leave=False):
        # YOUR CODE HERE
        # Use model to get prediction and compute loss using criterion.
        # After you've computed loss, zero gradients, run backprop, clip
        # gradients and update model with optimizer.
        src, trg = src.to(device), trg.to(device)
        output = model(src, trg)

        output = output.view(-1, output.shape[-1])
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)
        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        train_loss += loss.item()
        writer.add_scalar("Training/loss", loss.item(), global_step)
        global_step += 1

    train_loss /= len(train_iterator)

    model.eval()
    val_loss = 0
    with torch.no_grad():
        for (src, trg), _ in tqdm(valid_iterator, desc="Val", leave=False):
            # YOUR CODE HERE
            # Once again compute model prediction and loss, but don't
            # try and update model parameters with it.
            # Just use it for model evaluation.
            src, trg = src.to(device), trg.to(device)
            output = model(src, trg)

            output = output.view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)

            val_loss += loss.item()

    val_loss /= len(valid_iterator)
    # writer.add_scalar("Evaluation/val_loss", val_loss, epoch)
    print(f"epoch: {epoch+1}, train_loss: {train_loss:.3f},  valid_loss: {val_loss:.3f}")

Epochs:   0%|          | 0/50 [00:00<?, ?it/s]

Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 1, train_loss: 5.017,  valid_loss: 4.573


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 2, train_loss: 4.368,  valid_loss: 4.105


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 3, train_loss: 4.070,  valid_loss: 3.982


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 4, train_loss: 3.869,  valid_loss: 3.744


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 5, train_loss: 3.754,  valid_loss: 3.574


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 6, train_loss: 3.594,  valid_loss: 3.540


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 7, train_loss: 3.460,  valid_loss: 3.436


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 8, train_loss: 3.327,  valid_loss: 3.271


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 9, train_loss: 3.186,  valid_loss: 3.245


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 10, train_loss: 3.068,  valid_loss: 3.242


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 11, train_loss: 2.949,  valid_loss: 3.024


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 12, train_loss: 2.828,  valid_loss: 3.034


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 13, train_loss: 2.714,  valid_loss: 2.840


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 14, train_loss: 2.608,  valid_loss: 2.781


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 15, train_loss: 2.492,  valid_loss: 2.903


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 16, train_loss: 2.426,  valid_loss: 2.771


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 17, train_loss: 2.338,  valid_loss: 2.758


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 18, train_loss: 2.235,  valid_loss: 2.808


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 19, train_loss: 2.137,  valid_loss: 2.825


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 20, train_loss: 2.047,  valid_loss: 2.879


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 21, train_loss: 1.973,  valid_loss: 2.834


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 22, train_loss: 1.871,  valid_loss: 2.659


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 23, train_loss: 1.772,  valid_loss: 2.892


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 24, train_loss: 1.686,  valid_loss: 2.797


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 25, train_loss: 1.590,  valid_loss: 2.771


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 26, train_loss: 1.547,  valid_loss: 2.819


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 27, train_loss: 1.443,  valid_loss: 2.909


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 28, train_loss: 1.366,  valid_loss: 2.880


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 29, train_loss: 1.277,  valid_loss: 3.013


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 30, train_loss: 1.215,  valid_loss: 2.967


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 31, train_loss: 1.132,  valid_loss: 3.101


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 32, train_loss: 1.038,  valid_loss: 3.168


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 33, train_loss: 0.966,  valid_loss: 3.171


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 34, train_loss: 0.897,  valid_loss: 3.126


Train:   0%|          | 0/227 [00:00<?, ?it/s]

Val:   0%|          | 0/8 [00:00<?, ?it/s]

epoch: 35, train_loss: 0.833,  valid_loss: 3.205


Train:   0%|          | 0/227 [00:00<?, ?it/s]

KeyboardInterrupt: 

Now that we've trained our model, let's see how good is it at actual translation. Let's translate first 10 examples in validation dataset.

In [41]:
src_itos

NameError: name 'src_itos' is not defined

In [38]:
trg_itos = trg.vocab.itos
model.eval()
max_len = 50
with torch.no_grad():
    vars(train_data.examples[0])
    for i in range(10):
        test_src = valid_data.examples[i].src
        test_trg = valid_data.examples[i].trg
        encoded = encode(test_src, src_vocab)[::-1]
        encoded = torch.tensor(encoded)[:, None].to(device)
        hidden = model.encoder(encoded)

        pred_tokens = [trg_vocab[sos_token]]
        for _ in range(max_len):
            decoder_input = torch.tensor([pred_tokens[-1]]).to(device)
            pred, hidden = model.decoder(decoder_input, hidden)
            _, pred_token = pred.max(dim=1)
            if pred_token == trg_vocab[eos_token]:
                # Don't add it to prediction for cleaner output.
                break

            pred_tokens.append(pred_token.item())
        test_src = ' '.join(test_src)
        test_trg = ' '.join(test_trg)
        print(f"src: {test_src}")
        print(f"trg: {test_trg}")
        print(f"pred: '{' '.join(trg_itos[i] for i in pred_tokens[1:])}'")
        print()

src: eine gruppe von männern lädt baumwolle auf einen lastwagen
trg: a group of men are loading cotton onto a truck
pred: 'a crowd of people is being <unk> , talking on a cart .'

src: ein mann schläft in einem grünen raum auf einem sofa .
trg: a man sleeping in a green room on a couch .
pred: 'a man sitting on the grass in a white checkered blanket .'

src: ein junge mit kopfhörern sitzt auf den schultern einer frau .
trg: a boy wearing headphones sits on a woman ' s shoulders .
pred: 'a woman is hugging a man sitting on the floor while a a woman is laughing .'

src: zwei männer bauen eine blaue eisfischerhütte auf einem zugefrorenen see auf
trg: two men setting up a blue ice fishing hut on an iced over lake
pred: 'on a , , a a on a on a two two of men on a sunny day .'

src: ein mann mit beginnender glatze , der eine rote rettungsweste trägt , sitzt in einem kleinen boot .
trg: a balding man wearing a red life jacket is sitting in a small boat .
pred: 'a group of people are along the