### Sequence to Sequence.
From the previous season we have been working with sentiment analyisis and word embedings to predict one or more classes. In this notebook we are hoing to understand the idea behind what we call `sequence` 2 `sequence` by working with a German English translation. The same process can be applied to models like text summarization

### Introduction
The most common ``sequence-to-sequence`` (seq2seq) models are encoder-decoder models, which commonly use a recurrent neural network (RNN) to encode the source (input) sentence into a single vector which we can call **context vector**. We can think of the context vector as being an abstract representation of the entire input sentence. This vector is then decoded by a second RNN which learns to output the target (output) sentence by generating it one word at a time.


<p align="center"><img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq1.png"/><p>

The above image shows an example translation. The input/source sentence, "guten morgen", is passed through the embedding layer (yellow) and then input into the encoder (green). We also append a start of sequence (<sos>) and end of sequence (<eos>) token to the start and end of sentence, respectively. At each time-step, the input to the encoder RNN is both the embedding, $e$, of the current word, $e(x_t)$, as well as the hidden state from the previous time-step, $h_{t-1}$, and the encoder RNN outputs a new hidden state $h_t$. We can think of the hidden state as a vector representation of the sentence so far. The RNN can be represented as a function of both of $e(x_t)$ and $h_{t-1}$:

$$h_t = \text{EncoderRNN}(e(x_t), h_{t-1})$$
We're using the term RNN generally here, it could be any recurrent architecture, such as an LSTM (Long Short-Term Memory) or a GRU (Gated Recurrent Unit).

Here, we have $X = \{x_1, x_2, ..., x_T\}$, where $x_1 = \text{&lt;sos&gt;}, x_2 = \text{guten}$, etc. The initial hidden state, $h_0$, is usually either initialized to zeros or a learned parameter.

Once the final word, $x_T$, has been passed into the RNN via the embedding layer, we use the final hidden state, $h_T$, as the context vector, i.e. $h_T = z$. This is a vector representation of the entire source sentence.

Now we have our context vector, $z$, we can start decoding it to get the output/target sentence, "good morning". Again, we append start and end of sequence tokens to the target sentence. At each time-step, the input to the decoder RNN (blue) is the embedding, $d$, of current word, $d(y_t)$, as well as the hidden state from the previous time-step, $s_{t-1}$, where the initial decoder hidden state, $s_0$, is the context vector, $s_0 = z = h_T$, i.e. the initial decoder hidden state is the final encoder hidden state. Thus, similar to the encoder, we can represent the decoder as:

$$s_t = \text{DecoderRNN}(d(y_t), s_{t-1})$$
Although the input/source embedding layer, $e$, and the output/target embedding layer, $d$, are both shown in yellow in the diagram they are two different embedding layers with their own parameters.

In the decoder, we need to go from the hidden state to an actual word, therefore at each time-step we use $s_t$ to predict (by passing it through a Linear layer, shown in purple) what we think is the next word in the sequence, $\hat{y}_t$.

$$\hat{y}_t = f(s_t)$$
The words in the decoder are always generated one after another, with one per time-step. We always use <sos> for the first input to the decoder, $y_1$, but for subsequent inputs, $y_{t&gt;1}$, we will sometimes use the actual, ground truth next word in the sequence, $y_t$ and sometimes use the word predicted by our decoder, $\hat{y}_{t-1}$. This is called teacher forcing, see a bit more info about it [here](https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/).

When training/testing our model, we always know how many words are in our target sentence, so we stop generating words once we hit that many. During inference it is common to keep generating words until the model outputs an <eos> token or after a certain amount of words have been generated.

Once we have our predicted target sentence, $\hat{Y} = \{ \hat{y}_1, \hat{y}_2, ..., \hat{y}_T \}$, we compare it against our actual target sentence, $Y = \{ y_1, y_2, ..., y_T \}$, to calculate our loss. We then use this loss to update all of the parameters in our model.

### Data preparation
We will be using torchtext and `spacy` for tokenization.

In [35]:
import torch
from torch import nn
from torch.nn  import functional as F
import spacy, math, random
import numpy as np

from torchtext.legacy import datasets, data

### We will be setting the Seed.

In [None]:
SEED = 42

torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deteministic = True

### Creating Tokens

spaCy has model for each language ``("de_core_news_sm" for German and "en_core_web_sm" for English)`` which need to be loaded so we can access the tokenizer of each model.

**Note:** the models must first be downloaded using the following on the command line:

````shell
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
````

When you are using google colab like me you download it as follows:

```python
import spacy.cli
spacy.cli.download("en_core_web_sm")
```

In [None]:
import spacy.cli
spacy.cli.download('de_core_news_sm')

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('de_core_news_sm')


In [None]:
import en_core_web_sm, de_core_news_sm

In [None]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

### Creating Tokenize functions

We will create tokenize functions for each language.

Researchers find it beneficial to reverse the order of the input which they believe "introduces many short term dependencies in the data that make the optimization problem much easier". We copy this by reversing the German sentence after it has been transformed into a list of tokens.

In [None]:
def tokenize_de(sent):
  """
  Tokenize german sentence and reverse it
  """
  return [tok.text for tok in spacy_de.tokenizer(sent)][::-1]

def tokenize_en(sent):
  return [tok.text for tok in spacy_en.tokenizer(sent)]

### Creating Fields
We set the tokenize argument to the correct tokenization function for each, with German being the SRC (source) field and English being the TRG (target) field. The field also appends the "start of sequence" and "end of sequence" tokens via the init_token and eos_token arguments, and converts all words to lowercase.

In [None]:
SRC = data.Field(
    tokenize = tokenize_de,
    lower= True,
    init_token = "<sos>",
     eos_token = "<eos>"
)
TRG = data.Field(
    tokenize = tokenize_en,
    lower= True,
    init_token = "<sos>",
     eos_token = "<eos>"
)

#### Loading the Dataset
The dataset we'll be using is the ``[Multi30k]`` [dataset](https://github.com/multi30k/dataset). This is a dataset with ``~30,000`` parallel English, German and French sentences, each with ~12 words per sentence.

``exts`` specifies which languages to use as the source and target **(source goes first)** and fields specifies which field to use for the source and target.

In [None]:
train_data, validation_data, test_data = datasets.Multi30k.splits(
    exts=('.de', '.en'),
    fields = (SRC, TRG)
)

### Checking if we have loaded the data.

We are going to use pretty table to visualize the this.

In [36]:
from prettytable import PrettyTable
def tabulate(column_names, data):
  table = PrettyTable(column_names)
  for row in data:
    table.add_row(row)
  print(table)

column_names = ["SUBSET", "EXAMPLE(s)"]
row_data = [
        ["training", len(train_data)],
        ['validation', len(validation_data)],
        ['test', len(test_data)]
]
tabulate(column_names, row_data)

+------------+------------+
|   SUBSET   | EXAMPLE(s) |
+------------+------------+
|  training  |   29000    |
| validation |    1014    |
|    test    |    1000    |
+------------+------------+


### Printing the first example of the SRC and make sure that it is reversed.

In [25]:
print(vars(train_data.examples[0]))


{'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


### Building Vocabulary

Next, we'll build the vocabulary for the ``source`` and ``target`` languages. The vocabulary is used to associate each unique token with an index (an integer). **The vocabularies of the source and target languages are distinct.**

Using the **``min_freq``** argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only **once** are converted into an ``<unk>`` (unknown) token.

**It is important to note that our vocabulary should only be built from the training set and not the validation/test set. This prevents "information leakage" into our model, giving us artifically inflated validation/test scores.**

In [29]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

### Checking the unique tokens for each language.

In [37]:
column_names = ["Language", "Number of tokens"]
row_data = [
        ["DE (source)", len(SRC.vocab)],
        ["EN (target)", len(TRG.vocab)]
]

tabulate(column_names=column_names, data=row_data)

+-------------+------------------+
|   Language  | Number of tokens |
+-------------+------------------+
| DE (source) |       7855       |
| EN (target) |       5893       |
+-------------+------------------+


### Data Preparation and building Iterators.

We are going to use the `BucketIterator` and push to the `device`.

When we get a batch of examples using an iterator we need to make sure that all of the **source sentences are padded to the same length, the same with the target sentences**. Luckily, torchText iterators handle this for us!

We use a ``BucketIterator`` instead of the standard Iterator as it creates batches in such a way that it minimizes the amount of padding in both the source and target sentences.

In [34]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [38]:
BATCH_SIZE = 128

train_iterator, validation_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, validation_data, test_data),
    device = device,
    batch_size= BATCH_SIZE
)

### Building a Sequence to Sequence Model.
We'll be building our model in three parts. The encoder, the decoder and a seq2seq model that encapsulates the encoder and decoder and will provide a way to interface with each.

Encoder
First, the encoder, a 2 layer LSTM. The concept of multi-layer RNNs is easy to expand from 2 to 4 layers.

For a multi-layer RNN, the input sentence, $X$, after being embedded goes into the first (bottom) layer of the RNN and hidden states, $H=\{h_1, h_2, ..., h_T\}$, output by this layer are used as inputs to the RNN in the layer above. Thus, representing each layer with a superscript, the hidden states in the first layer are given by:

$$h_t^1 = \text{EncoderRNN}^1(e(x_t), h_{t-1}^1)$$
The hidden states in the second layer are given by:

$$h_t^2 = \text{EncoderRNN}^2(h_t^1, h_{t-1}^2)$$
Using a multi-layer RNN also means we'll also need an initial hidden state as input per layer, $h_0^l$, and we will also output a context vector per layer, $z^l$.

Without going into too much detail about LSTMs (see [this](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)blog post to learn more about them), all we need to know is that they're a type of RNN which instead of just taking in a hidden state and returning a new hidden state per time-step, also take in and return a cell state, $c_t$, per time-step.

$$\begin{align*}
h_t &amp;= \text{RNN}(e(x_t), h_{t-1})\\
(h_t, c_t) &amp;= \text{LSTM}(e(x_t), h_{t-1}, c_{t-1})
\end{align*}$$
We can just think of $c_t$ as another type of hidden state. Similar to $h_0^l$, $c_0^l$ will be initialized to a tensor of all zeros. Also, our context vector will now be both the final hidden state and the final cell state, i.e. $z^l = (h_T^l, c_T^l)$.

Extending our multi-layer equations to LSTMs, we get:

$$\begin{align*}
(h_t^1, c_t^1) &amp;= \text{EncoderLSTM}^1(e(x_t), (h_{t-1}^1, c_{t-1}^1))\\
(h_t^2, c_t^2) &amp;= \text{EncoderLSTM}^2(h_t^1, (h_{t-1}^2, c_{t-1}^2))
\end{align*}$$
Note how only our hidden state from the first layer is passed as input to the second layer, and not the cell state.

So our encoder looks something like this:

<p align="center"><img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq2.png"/></p>



We create this in code by making an ``Encoder`` module, which requires we inherit from torch.nn.Module and use the ``super().__init__()`` as some boilerplate code. The encoder takes the following arguments:

* **input_dim** - is the size/dimensionality of the one-hot vectors that will be input to the encoder. This is equal to the input (source) vocabulary size.
* **emb_dim** - is the dimensionality of the embedding layer. This layer converts the one-hot vectors into dense vectors with emb_dim dimensions.
* **hid_dim** - is the dimensionality of the hidden and cell states.
* **n_layers** - is the number of layers in the RNN.
* **dropout** - is the amount of dropout to use. This is a regularization parameter to prevent overfitting. Check out this for more details about dropout.


The embedding layer is created using ``nn.Embedding``, the LSTM with ``nn.LSTM`` and a dropout layer with nn.Dropout. Check the PyTorch documentation for more about these.

One thing to note is that the dropout argument to the LSTM is how much dropout to apply between the layers of a multi-layer RNN, i.e. between the hidden states output from layer $l$ and those same hidden states being used for the input of layer $l+1$.

In the forward method, we pass in the source sentence, $X$, which is converted into dense vectors using the embedding layer, and then dropout is applied. These embeddings are then passed into the RNN. As we pass a whole sequence to the RNN, it will automatically do the recurrent calculation of the hidden states over the whole sequence for us! Notice that we do not pass an initial hidden or cell state to the RNN. This is because, as noted in the documentation, that if no hidden/cell state is passed to the RNN, it will automatically create an initial hidden/cell state as a tensor of all zeros.

The RNN returns: outputs (the top-layer hidden state for each time-step), hidden (the final hidden state for each layer, $h_T$, stacked on top of each other) and cell (the final cell state for each layer, $c_T$, stacked on top of each other).

As we only need the final hidden and cell states (to make our context vector), forward only returns hidden and cell.

The sizes of each of the tensors is left as comments in the code. In this implementation ``n_directions`` will always be 1, however note that bidirectional RNNs (covered in tutorial 3) will have ``n_directions`` as 2.

In [46]:
class Encoder(nn.Module):
  def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
    super(Encoder, self).__init__()
    self.hid_dim = hid_dim
    self.n_layers = n_layers
    self.embedding = nn.Embedding(input_dim, embedding_dim=emb_dim)
    self.rnn = nn.LSTM(emb_dim, hid_dim ,n_layers, dropout=dropout)
    self.dropout = nn.Dropout(dropout)

  def forward(self, src):
    # src = [src len, batch size]
    embedded = self.dropout(self.embedding(src))
    # embedded = [src len, batch size, emb dim]
    outputs, (h_0, c_0) = self.rnn(embedded)

    """
    outputs = [src len, batch size, hid dim * n directions]
    hidden (h_0) = [n layers * n directions, batch size, hid dim]
    cell (c_0) = [n layers * n directions, batch size, hid dim]
    ** outputs are always from the top hidden layer
    """
    return h_0, c_0
    

### Decoder
Next, we'll build our decoder, which will also be a 2-layer **LSTM**.

<p align="center"><img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq3.png"/></p>

The ``Decoder`` class does a single step of decoding, i.e. it ouputs single token per time-step. The first layer will receive a hidden and cell state from the previous time-step, $(s_{t-1}^1, c_{t-1}^1)$, and feeds it through the LSTM with the current embedded token, $y_t$, to produce a new hidden and cell state, $(s_t^1, c_t^1)$. The subsequent layers will use the hidden state from the layer below, $s_t^{l-1}$, and the previous hidden and cell states from their layer, $(s_{t-1}^l, c_{t-1}^l)$. This provides equations very similar to those in the encoder.

$$\begin{align*}
(s_t^1, c_t^1) = \text{DecoderLSTM}^1(d(y_t), (s_{t-1}^1, c_{t-1}^1))\\
(s_t^2, c_t^2) = \text{DecoderLSTM}^2(s_t^1, (s_{t-1}^2, c_{t-1}^2))
\end{align*}$$
Remember that the initial hidden and cell states to our decoder are our context vectors, which are the final hidden and cell states of our encoder from the same layer, i.e. $(s_0^l,c_0^l)=z^l=(h_T^l,c_T^l)$.

We then pass the hidden state from the top layer of the RNN, $s_t^L$, through a linear layer, $f$, to make a prediction of what the next token in the target (output) sequence should be, $\hat{y}_{t+1}$.

$$\hat{y}_{t+1} = f(s_t^L)$$
The arguments and initialization are similar to the Encoder class, except we now have an output_dim which is the size of the vocabulary for the output/target. There is also the addition of the Linear layer, used to make the predictions from the top layer hidden state.

Within the forward method, we accept a batch of input tokens, previous hidden states and previous cell states. As we are only decoding one token at a time, the input tokens will always have a sequence length of 1. We unsqueeze the input tokens to add a sentence length dimension of 1. Then, similar to the encoder, we pass through an embedding layer and apply dropout. This batch of embedded tokens is then passed into the RNN with the previous hidden and cell states. This produces an output (hidden state from the top layer of the RNN), a new hidden state (one for each layer, stacked on top of each other) and a new cell state (also one per layer, stacked on top of each other). We then pass the output (after getting rid of the sentence length dimension) through the linear layer to receive our prediction. We then return the prediction, the new hidden state and the new cell state.

**Note:** as we always have a sequence length of 1, we could use ``nn.LSTMCell``, instead of ``nn.LSTM``, as it is designed to handle a batch of inputs that aren't necessarily in a sequence. ``nn.LSTMCell`` is just a single cell and ``nn.LSTM`` is a wrapper around potentially multiple cells. Using the ``nn.LSTMCell`` in this case would mean we don't have to unsqueeze to add a fake sequence length dimension, but we would need one ``nn.LSTMCell`` per layer in the decoder and to ensure each ``nn.LSTMCell`` receives the correct initial hidden state from the encoder. All of this makes the code less concise - hence the decision to stick with the regular ``nn.LSTM``.

In [67]:
class Decoder(nn.Module):
  def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
    super(Decoder, self).__init__()
    self.output_dim = output_dim
    self.hid_dim = hid_dim
    self.n_layers = n_layers

    self.embedding = nn.Embedding(output_dim, emb_dim)
    self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
    self.fc = nn.Linear(hid_dim, output_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, input, hidden, cell):
    """
    input = [batch size]
    hidden = [n layers * n directions, batch size, hid dim]
    cell = [n layers * n directions, batch size, hid dim]
    **n directions in the decoder will both always be 1, therefore:**
      hidden = [n layers, batch size, hid dim]
      context = [n layers, batch size, hid dim]
    """
    input = input.unsqueeze(0)
    # input = [1, batch size]

    embedded = self.dropout(self.embedding(input))
    # embedded = [1, batch size, emb dim]

    output, (h_0, c_0) = self.rnn(embedded, (hidden, cell))
    """
    output = [seq len, batch size, hid dim * n directions]
    hidden = [n layers * n directions, batch size, hid dim]
    cell = [n layers * n directions, batch size, hid dim]

    **seq len and n directions will always be 1 in the decoder, therefore:**
      output = [1, batch size, hid dim]
      hidden (h_0) = [n layers, batch size, hid dim]
      cell (c_0) = [n layers, batch size, hid dim]
    """
    predictions = self.fc(output.squeeze(0))
    #prediction = [batch size, output dim]
    return predictions, h_0, c_0

### Seq2Seq (Sequence to Sequence)
For the final part of the implemenetation, we'll implement the seq2seq model. This will handle:

* receiving the input/source sentence
* using the encoder to produce the context vectors
* using the decoder to produce the predicted output/target sentence

Our full model will look like this:

<p align="center"><img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq4.png"/></p>

The ``Seq2Seq`` model takes in an Encoder, Decoder, and a device (used to place tensors on the GPU, if it exists).

For this implementation, we have to ensure that the number of layers and the hidden (and cell) dimensions are equal in the Encoder and Decoder. This is not always the case, we do not necessarily need the same number of layers or the same hidden dimension sizes in a sequence-to-sequence model. However, if we did something like having a different number of layers then we would need to make decisions about how this is handled. For example, if our encoder has 2 layers and our decoder only has 1, how is this handled? Do we average the two context vectors output by the decoder? Do we pass both through a linear layer? Do we only use the context vector from the highest layer? Etc.

Our forward method takes the source sentence, target sentence and a ``teacher-forcing ratio``. The teacher **forcing ratio** is used when training our model. When decoding, at each time-step we will predict what the next token in the target sequence will be from the previous tokens decoded, $\hat{y}_{t+1}=f(s_t^L)$. With probability equal to the teaching forcing ratio ``(teacher_forcing_ratio)`` we will use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. However, with probability ``1 - teacher_forcing_ratio``, we will use the token that the model predicted as the next input to the model, even if it doesn't match the actual next token in the sequence.

The first thing we do in the forward method is to create an outputs tensor that will store all of our predictions, $\hat{Y}$.

We then feed the input/source sentence, src, into the encoder and receive out final hidden and cell states.

The first input to the decoder is the start of sequence (<sos>) token. As our trg tensor already has the <sos> token appended (all the way back when we defined the init_token in our TRG field) we get our $y_1$ by slicing into it. We know how long our target sentences should be (max_len), so we loop that many times. The last token input into the decoder is the one before the <eos> token - the <eos> token is never input into the decoder.

During each iteration of the loop, we:

pass the input, previous hidden and previous cell states ($y_t, s_{t-1}, c_{t-1}$) into the decoder
receive a prediction, next hidden state and next cell state ($\hat{y}_{t+1}, s_{t}, c_{t}$) from the decoder
place our prediction, $\hat{y}_{t+1}$/output in our tensor of predictions, $\hat{Y}$/outputs
decide if we are going to "teacher force" or not
if we do, the next input is the ground-truth next token in the sequence, $y_{t+1}$/trg[t]
if we don't, the next input is the predicted next token in the sequence, $\hat{y}_{t+1}$/top1, which we get by doing an argmax over the output tensor
Once we've made all of our predictions, we return our tensor full of predictions, $\hat{Y}$/outputs.

Note: our decoder loop starts at 1, not 0. This means the 0th element of our outputs tensor remains all zeros. So our trg and outputs look something like:

$$\begin{align*}
\text{trg} = [&lt;sos&gt;, &amp;y_1, y_2, y_3, &lt;eos&gt;]\\
\text{outputs} = [0, &amp;\hat{y}_1, \hat{y}_2, \hat{y}_3, &lt;eos&gt;]
\end{align*}$$
Later on when we calculate the loss, we cut off the first element of each tensor to get:

$$\begin{align*}
\text{trg} = [&amp;y_1, y_2, y_3, &lt;eos&gt;]\\
\text{outputs} = [&amp;\hat{y}_1, \hat{y}_2, \hat{y}_3, &lt;eos&gt;]
\end{align*}$$

In [80]:
class Seq2Seq(nn.Module):
  def __init__(self, encoder, decoder, device):
    super(Seq2Seq, self).__init__()

    self.encoder = encoder
    self.decoder = decoder
    self.device = device

    assert encoder.hid_dim == decoder.hid_dim, "Hidden dimensions of encoder and decoder must be equal!"
    assert encoder.n_layers == decoder.n_layers, "Encoder and decoder must have equal number of layers!"

  def forward(self, src, trg, teacher_forcing_ratio=.5):
    """
    src = [src len, batch size]
    trg = [trg len, batch size]
    teacher_forcing_ratio is probability to use teacher forcing
    e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
    """
    trg_len, batch_size = trg.shape
    trg_vocab_size = self.decoder.output_dim

    # tensor to store decoder outputs
    outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
    # last hidden state of the encoder is used as the initial hidden state of the decoder
    hidden, cell = self.encoder(src)
    # first input to the decoder is the <sos> tokens
    input = trg[0,:]

    for t in range(1, trg_len):
      """
      insert input token embedding, previous hidden and previous cell states
      receive output tensor (predictions) and new hidden and cell states
      """
      output, hidden, cell = self.decoder(input, hidden, cell)
      # place predictions in a tensor holding predictions for each token
      outputs[t] = output
      # decide if we are going to use teacher forcing or not
      teacher_force = random.random() < teacher_forcing_ratio
      # get the highest predicted token from our predictions
      top1 = output.argmax(1) 
      """
      if teacher forcing, use actual next token as next input
      if not, use predicted token
      """
      input = trg[t] if teacher_force else top1
      return outputs

### Training the `Seq2Seq`
Now we have our model implemented, we can begin training it.

First, we'll initialize our model. As mentioned before, the input and output dimensions are defined by the size of the vocabulary. The embedding dimesions and dropout for the encoder and decoder can be different, but the number of layers and the size of the hidden/cell states must be the same.

We then define the encoder, decoder and then our Seq2Seq model, which we place on the device.

In [81]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Counting model parameters

In [82]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(model)
print(f"Total number of paramaters: {n_params:,}\nTotal tainable parameters: {trainable_params:,}")

Total number of paramaters: 13,899,013
Total tainable parameters: 13,899,013


### Initializing weights to the model


Next up is initializing the weights of our model.

We initialize weights in PyTorch by creating a function which we apply to our model. When using apply, the ``init_weights`` function will be called on every module and sub-module within our model. For each module we loop through all of the parameters and sample them from a uniform distribution with ``nn.init.uniform_``.

In [83]:
def init_weights(m):
  for name, param in m.named_parameters():
    nn.init.uniform_(param.data, -0.8, 0.8)

model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Counting parameters after initializing weights

In [84]:
n_params, trainable_params = count_trainable_params(model)
print(f"Total number of paramaters: {n_params:,}\nTotal tainable parameters: {trainable_params:,}")

Total number of paramaters: 13,899,013
Total tainable parameters: 13,899,013


### Optimizer

In [85]:
optimizer = torch.optim.Adam(model.parameters())

### Loss Function
Next, we define our loss function. The **``CrossEntropyLoss``** function calculates both the log softmax as well as the negative log-likelihood of our predictions.

Our loss function calculates the average loss per token, however by passing the index of the ``<pad> `` token as the ``ignore_index`` argument we ignore the loss whenever the target token is a padding token.

In [86]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

### Defining the training loop


First, we'll set the model into "training mode" with ``model.train()``. This will turn on dropout (and batch normalization, which we aren't using) and then iterate through our data iterator.

As stated before, our decoder loop starts at 1, not 0. This means the 0th element of our outputs tensor remains all zeros. So our ``trg`` and outputs look something like:

$$\begin{align*}
\text{trg} = [&lt;sos&gt;, &amp;y_1, y_2, y_3, &lt;eos&gt;]\\
\text{outputs} = [0, &amp;\hat{y}_1, \hat{y}_2, \hat{y}_3, &lt;eos&gt;]
\end{align*}$$

Here, when we calculate the loss, we cut off the first element of each tensor to get:

$$\begin{align*}
\text{trg} = [&amp;y_1, y_2, y_3, &lt;eos&gt;]\\
\text{outputs} = [&amp;\hat{y}_1, \hat{y}_2, \hat{y}_3, &lt;eos&gt;]
\end{align*}$$

At each iteration:
* get the source and target sentences from the batch, $X$ and $Y$
* zero the gradients calculated from the last batch
* feed the source and target into the model to get the output, $\hat{Y}$
* as the loss function only works on 2d inputs with 1d targets we need to flatten each of them with .view
 * we slice off the first column of the output and target tensors as mentioned above
* calculate the gradients with loss.backward()
* clip the gradients to prevent them from exploding (a common issue in RNNs)
* update the parameters of our model by doing an optimizer step
* sum the loss value to a running total

Finally, we return the loss that is averaged over all batches.

In [87]:
def train(model, iterator, optimizer, criterion, clip):
  model.train()
  epoch_loss = 0
  for i, batch in enumerate(iterator):
    src = batch.src
    trg = batch.trg
    optimizer.zero_grad()
    output = model(src, trg)
    """
    trg = [trg len, batch size]
    output = [trg len, batch size, output dim]
    """
    output_dim = output.shape[-1]
    output = output[1:].view(-1, output_dim)
    trg = trg[1:].view(-1)

    """
    trg = [(trg len - 1) * batch size]
    output = [(trg len - 1) * batch size, output dim]
    """
    loss = criterion(output, trg)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    epoch_loss += loss.item()
    return epoch_loss / len(iterator)

### Defining the Evaluation Loop

Our evaluation loop is similar to our training loop, however as we aren't updating any parameters we don't need to pass an optimizer or a clip value.

We must remember to set the model to evaluation mode with ``model.eval()``. This will turn off dropout (and batch normalization, if used).

We use the with ``torch.no_grad()`` block to ensure no gradients are calculated within the block. This reduces memory consumption and speeds things up.

The iteration loop is similar (without the parameter updates), however we must ensure we turn teacher forcing off for evaluation. This will cause the model to only use it's own predictions to make further predictions within a sentence, which mirrors how it would be used in deployment.

In [88]:
def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0
  with torch.no_grad():
    for i, batch in enumerate(iterator):
      src = batch.src
      trg = batch.trg
      output = model(src, trg, 0) #turn off teacher forcing
      """
      trg = [trg len, batch size]
      output = [trg len, batch size, output dim]
      """
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim)
      trg = trg[1:].view(-1)
      """
      trg = [(trg len - 1) * batch size]
      output = [(trg len - 1) * batch size, output dim]
      """
      loss = criterion(output, trg)
      epoch_loss += loss.item()
  return epoch_loss / len(iterator)

Next, we'll create a function that we'll use to tell us how long an epoch takes.

In [89]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

### Training the model.

At each epoch, we'll be checking if our model has achieved the best validation loss so far. If it has, we'll update our best validation loss and save the parameters of our model **(called ``state_dict`` in PyTorch)**. Then, when we come to test our model, we'll use the saved parameters used to achieve the best validation loss.

We'll be printing out both the loss and the perplexity at each epoch. It is easier to see a change in perplexity than a change in loss as the numbers are much bigger.

In [90]:
import time

In [93]:
N_EPOCHS = 10
CLIP = 1
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, validation_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 0s
	Train Loss: 0.041 | Train PPL:   1.041
	 Val. Loss: 9.165 |  Val. PPL: 9558.906
Epoch: 02 | Time: 0m 0s
	Train Loss: 0.041 | Train PPL:   1.041
	 Val. Loss: 9.180 |  Val. PPL: 9700.942
Epoch: 03 | Time: 0m 0s
	Train Loss: 0.040 | Train PPL:   1.041
	 Val. Loss: 9.151 |  Val. PPL: 9424.860
Epoch: 04 | Time: 0m 0s
	Train Loss: 0.040 | Train PPL:   1.041
	 Val. Loss: 9.156 |  Val. PPL: 9468.162
Epoch: 05 | Time: 0m 0s
	Train Loss: 0.040 | Train PPL:   1.041
	 Val. Loss: 9.170 |  Val. PPL: 9605.405
Epoch: 06 | Time: 0m 0s
	Train Loss: 0.040 | Train PPL:   1.041
	 Val. Loss: 9.155 |  Val. PPL: 9457.176
Epoch: 07 | Time: 0m 0s
	Train Loss: 0.041 | Train PPL:   1.041
	 Val. Loss: 9.151 |  Val. PPL: 9425.137
Epoch: 08 | Time: 0m 0s
	Train Loss: 0.041 | Train PPL:   1.042
	 Val. Loss: 9.153 |  Val. PPL: 9446.761
Epoch: 09 | Time: 0m 0s
	Train Loss: 0.040 | Train PPL:   1.041
	 Val. Loss: 9.147 |  Val. PPL: 9381.689
Epoch: 10 | Time: 0m 0s
	Train Loss: 0.040 | Train PPL:

### Evaluating the model.

In [92]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 9.175 | Test PPL: 9653.451 |


### What's Next?

In the following notebook we'll implement a model that achieves improved test perplexity, but only uses a single layer in the encoder and the decoder.


#### Credits
[bentrevett](https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb)