<a href="https://colab.research.google.com/github/SophieShin/NLP_22_Fall/blob/main/%5BSSH%5D_lab08_seq2seq_mt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 8: Seq2Seq for Machine Translation

**Please connect to GPU.**

Learning outcomes:
- Implement a seq2seq model that uses an encoder and a decoder for machine translation between German and English sentences
- Use the **spaCy** tokeniser and some of its functions
- Use deprecated features (`Field`, `BucketIterator`) via `torchtext.legacy`
- Add start-of-sequence `<sos>` and end-of-sequence `<eos>` tokens
- Include **dropouts**
- Initialise model weights
- Understand **teacher-forcing**
- Gradient clipping
- Save and load model parameters

Sources:
- Code source: [Ben Trevett](https://github.com/bentrevett/pytorch-seq2seq)
- The model is based on an implementation of the paper [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215), which uses multi-layer LSTMs.



### Check torchtext version

In [1]:
!pip3 show torchtext

Name: torchtext
Version: 0.13.1
Summary: Text utilities and datasets for PyTorch
Home-page: https://github.com/pytorch/text
Author: PyTorch core devs and James Bradbury
Author-email: jekbradbury@gmail.com
License: BSD
Location: /usr/local/lib/python3.7/dist-packages
Requires: tqdm, numpy, requests, torch
Required-by: 


### Downgrade Torchtext to version 0.10.0
- This is so that we can use some older functions for the purpose of this lab.
- Takes <font color='green'>~2 mins (Restart Runtime if required) </font>

In [2]:
# Q1. Install torchtext version 0.10.0
# CODE
!pip install torchtext==0.10.0


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.10.0
  Downloading torchtext-0.10.0-cp37-cp37m-manylinux1_x86_64.whl (7.6 MB)
Collecting torch==1.9.0
  Downloading torch-1.9.0-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 1.12.1+cu113
    Uninstalling torch-1.12.1+cu113:
      Successfully uninstalled torch-1.12.1+cu113
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.13.1
    Uninstalling torchtext-0.13.1:
      Successfully uninstalled torchtext-0.13.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.13.1+

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

We'll set the random seeds for deterministic results. A bit about reproducibility in PyTorch can be found [here](https://pytorch.org/docs/stable/notes/randomness.html). 


In [4]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Further reading (in Korean): "Properly controlling randomness for reproducible PyTorch"
https://hoya012.github.io/blog/reproducible_pytorch/


### Tokenisers
Next, we'll create the tokenisers. A tokeniser is used to turn a string containing a sentence into a list of individual tokens that make up that string, e.g. "good morning!" becomes ["good", "morning", "!"]. We'll start talking about the sentences being a sequence of tokens from now, instead of saying they're a sequence of words. What's the difference? Well, "good" and "morning" are both words and tokens, but "!" is a token, not a word.
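
To make the idea concrete without loading a spaCy model, here is a minimal sketch of a rule-based tokeniser. The function name `simple_tokenise` and its regular expression are illustrative assumptions; they are far cruder than spaCy's tokeniser, which also handles abbreviations, contractions and emoticons.

```python
import re

def simple_tokenise(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration only; this is not how spaCy tokenises.
    """
    # \w+ matches a run of letters/digits/underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenise("good morning!"))
```

The punctuation mark comes out as its own token, which is why we talk about tokens rather than words.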

[spaCy](https://spacy.io/) has a model for each language (e.g. `"de_core_news_sm"` for German and `"en_core_web_sm"` for English), which needs to be loaded so we can access its tokeniser.

Note: the models must first be downloaded using the following on the command line:

In [5]:
!python3 -m spacy download en_core_web_sm
!python3 -m spacy download de_core_news_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting de-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.4.0/de_core_news_sm-3.4.0-py3-none-any.whl (14.6 MB)
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')


Load the models
- The suffix of the model name denotes its size: `sm` (small), `md` (medium), `lg` (large).


In [6]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

### Timeout: Let's explore spaCy
Tokenise sentences and words

In [7]:
ger = spacy_de("Sprechen sie Deutsch? Ich spreche kein Deutsch.")
for tok in ger:
  print(tok)

Sprechen
sie
Deutsch
?
Ich
spreche
kein
Deutsch
.


In [8]:
# Sentence level tokenisation
for sentence in ger.sents:
  print(sentence)

Sprechen sie Deutsch?
Ich spreche kein Deutsch.


In [9]:
eng = spacy_en("Do you speak Korean, German, etc.? No I don't :-( Dr. Who!")
for sentence in eng.sents:
  print(sentence)

Do you speak Korean, German, etc.?
No I don't :-(
Dr. Who!


In [10]:
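# Q2. State a few things that this tokeniser does with regard to tokenising sentences.
# Feel free to test it out with other sentences to see how it behaves.
# It splits the text into sentences at sentence-final punctuation ("?", "!", "."),
# does not break after abbreviations such as "etc." or "Dr.",
# and treats the emoticon ":-(" as the end of a sentence rather than as stray punctuation.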

In [11]:
# Word-level tokenisation
for sentence in eng.sents:
  for tok in sentence:
    print(tok)

Do
you
speak
Korean
,
German
,
etc
.
?
No
I
do
n't
:-(
Dr.
Who
!


**Part-of-Speech Tagging**

spaCy can do more things than just tokenise. 
It can also do POS tagging. A POS tag (or part-of-speech tag) is a special label assigned to each token (word) in a text corpus to indicate the part of speech and often also other grammatical categories such as tense, number (plural/singular), case etc.

Each row in the output below describes one token, and its columns correspond to the following attributes:
- `text`: The original word text.
- `lemma`: The base form of the word.
- `pos`: The simple UPOS part-of-speech tag.
- `tag`: The detailed part-of-speech tag.
- `dep`: Syntactic dependency, i.e. the relation between tokens.
- `shape`: The word shape, i.e. capitalisation, punctuation, digits.
- `is_alpha`: Does the token consist of alphabetic characters?
- `is_stop`: Is the token part of a stop list, i.e. the most common words of the language?

In [12]:
doc = spacy_en("I will be going to the U.S. to spend $3 thousand.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

I I PRON PRP nsubj X True True
will will AUX MD aux xxxx True True
be be AUX VB aux xx True True
going go VERB VBG ROOT xxxx True False
to to ADP IN prep xx True True
the the DET DT det xxx True True
U.S. U.S. PROPN NNP pobj X.X. False False
to to PART TO aux xx True True
spend spend VERB VB advcl xxxx True False
$ $ SYM $ quantmod $ False False
3 3 NUM CD compound d False False
thousand thousand NUM CD dobj xxxx True False
. . PUNCT . punct . False False


**Named-entity Recognition (NER)**

Named-entity recognition (NER) seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Each row in the output below describes one entity, and its columns correspond to the following attributes:
- `text`: The original entity text.
- `start_char`: Character offset of the start of the entity in the Doc.
- `end_char`: Character offset of the end of the entity in the Doc.
- `label`: Entity label, i.e. type.



In [13]:
doc = spacy_en("Apple is looking at buying U.K. startup MagixAI for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
MagixAI 40 47 ORG
$1 billion 52 62 MONEY


In [14]:
#  Display the list of named entity recognition labels
ner_lst = spacy_en.pipe_labels['ner']

print(len(ner_lst))
print(ner_lst)

18
['CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']


In [18]:
# Q3. Try to change the sentence to contain interesting people, places or organisations and see how well it works.
# Put your comments here if it can correctly recognise some tricky entities.
ex = spacy_en("If you come to Italy, you should visit Firenze and have some Gelatos ")

for ent in ex.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Italy 15 20 GPE
Firenze 39 46 PERSON


There are many other functionalities, see the full list here:
https://spacy.io/usage/linguistic-features

Now let's get back to translation!

### Tokeniser Functions
These can be passed to torchtext and will take in the sentence as a string and return the sentence as a list of tokens.

In the paper, the authors found it beneficial to reverse the order of the input which they believe "introduces many short term dependencies in the data that make the optimisation problem much easier". We copy this by reversing the German sentence after it has been transformed into a list of tokens.

In [19]:
def tokeniser_de(text):
  # Tokenise German text and reverse the token order (source sentence), as in the paper
  return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokeniser_en(text):
  # Tokenise English text (target sentence); the token order is kept
  return [tok.text for tok in spacy_en.tokenizer(text)]


### Using `Field`
torchtext's now-deprecated `Field` class defines how data should be processed. All of the possible arguments are detailed [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L61). 

We set the `tokenize` argument to the correct tokenization function for each, with German being the `SRC` (source) field and English being the `TRG` (target) field. The field also appends the "start of sequence" and "end of sequence" tokens via the `init_token` and `eos_token` arguments, and converts all words to lowercase.
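
As a rough illustration of the net effect these settings have on a single tokenised sentence, consider the sketch below. The helper `preprocess` is hypothetical and is not part of torchtext; the order and timing of these steps inside torchtext are not reproduced here.

```python
def preprocess(tokens, init_token="<sos>", eos_token="<eos>", lower=True):
    """Lowercase the tokens and wrap them in start/end-of-sequence markers.

    Hypothetical helper that mimics the net effect of the Field arguments.
    """
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return [init_token] + tokens + [eos_token]

print(preprocess(["Good", "morning", "!"]))
```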

In [20]:
SRC = Field(tokenize = tokeniser_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = Field(tokenize = tokeniser_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

### Load the dataset
Next, we download and load the train, validation and test data. 

The dataset we'll be using is the [Multi30k dataset](https://github.com/multi30k/dataset). This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence. 

`exts` specifies which languages to use as the source and target (source goes first) and `fields` specifies which field to use for the source and target.

In [21]:
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))

downloading training.tar.gz


training.tar.gz: 100%|██████████| 1.21M/1.21M [00:02<00:00, 574kB/s] 


downloading validation.tar.gz


validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 175kB/s]


downloading mmt_task1_test2016.tar.gz


mmt_task1_test2016.tar.gz: 100%|██████████| 66.2k/66.2k [00:00<00:00, 167kB/s]


Check Train, Validation and Test Sizes

In [22]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


We can also print out an example, making sure the source sentence is reversed:

In [23]:
print(vars(train_data.examples[0]))

{'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


The period is at the beginning of the German (src) sentence, so it looks like the sentence has been correctly reversed.

Next, we'll build the *vocabulary* for the source and target languages. The vocabulary is used to associate each unique token with an index (an integer). The vocabularies of the source and target languages are distinct.

Using the `min_freq` argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only once are converted into an `<unk>` (unknown) token.

It is important to note that our vocabulary should only be built from the training set and not the validation/test set. This prevents "information leakage" into our model, which would otherwise give us artificially inflated validation/test scores.
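
The idea behind `min_freq` and the `<unk>` token can be sketched in plain Python. The helpers `build_vocab` and `numericalise`, the toy sentences and the order of the special tokens are assumptions made for illustration; torchtext's own implementation may differ in detail.

```python
from collections import Counter

def build_vocab(tokenised_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Return (itos, stoi) for tokens seen at least `min_freq` times."""
    counts = Counter(tok for sent in tokenised_sentences for tok in sent)
    # Most frequent first; ties are broken alphabetically so the result is deterministic.
    kept = sorted((tok for tok, c in counts.items() if c >= min_freq),
                  key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalise(tokens, stoi):
    """Map tokens to indexes; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

train = [["ein", "hund", "."], ["ein", "mann", "."]]   # toy training data
itos, stoi = build_vocab(train)
print(itos)
print(numericalise(["ein", "kind", "."], stoi))
```

Here "hund" and "mann" occur only once, so they are left out of the vocabulary, and the unseen token "kind" is mapped to the index of `<unk>`.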

In [24]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

In [26]:
# Q4. Print the number of unique tokens in the source and target vocabs
# CODE #

print(f'Unique tokens in the source: {len(SRC.vocab)}')
print(f'Unique tokens in the target: {len(TRG.vocab)}')

Unique tokens in the source: 7853
Unique tokens in the target: 5893


In [None]:
# Q5. Run SRC.vocab.itos[idx] by supplying index values to idx to see the first 10 tokens.
# Are there additional special tokens that build_vocab has created apart from <sos> and <eos>?
# CODE

for i in range(10):
  print(SRC.vocab.itos[i])

# It also created 'unknown token' and 'padding'.

<unk>
<pad>
<sos>
<eos>
.
ein
einem
in
eine
,


### Create the Iterators

The now-deprecated `BucketIterator` plays a role similar to applying a `DataLoader` to a PyTorch `Dataset`.

The train, validation and test iterators can be iterated on to return a batch of data which will have a `src` attribute (the PyTorch tensors containing a batch of numericalized source sentences) and a `trg` attribute (the PyTorch tensors containing a batch of numericalized target sentences). Numericalized is just a fancy way of saying they have been converted from a sequence of readable tokens to a sequence of corresponding indexes, using the vocabulary. 

We also need to define a `torch.device`. This is used to tell torchtext to put the tensors on the GPU or not. We use the `torch.cuda.is_available()` function, which will return `True` if a GPU is detected on our computer. We pass this `device` to the iterator.

When we get a batch of examples using an iterator we need to make sure that all of the source sentences are padded to the same length, the same with the target sentences. 
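
A minimal sketch of what padding a batch means, in plain Python. The helper `pad_batch` is hypothetical; the pad index of 1 mirrors the position of `<pad>` in the vocabulary printed earlier, and the `[seq len, batch size]` layout follows the shape comments used in the model code.

```python
def pad_batch(sentences, pad_idx=1):
    """Pad index sequences to equal length and return them as [seq len, batch size]."""
    max_len = max(len(s) for s in sentences)
    padded = [s + [pad_idx] * (max_len - len(s)) for s in sentences]
    # Transpose so that the first dimension is the position within the sentence.
    return [list(column) for column in zip(*padded)]

batch = pad_batch([[2, 5, 6, 3], [2, 7, 3]])
print(batch)
```

The shorter sentence receives one pad index at the end, so both columns have the same length.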


In [27]:
# Q6. Configure device to cuda if gpu is available. Print device.
# CODE

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [28]:
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

# Encoder-Decoder Models

The most common sequence-to-sequence (seq2seq) models are *encoder-decoder* models, which commonly use a *recurrent neural network* (RNN) to *encode* the source (input) sentence into a single vector. In this notebook, we'll refer to this single vector as a *context vector*. We can think of the context vector as being an abstract representation of the entire input sentence. This vector is then *decoded* by a second RNN which learns to output the target (output) sentence by generating it one word at a time.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq1.png?raw=1)

The above image shows an example translation. The input/source sentence, "guten morgen", is passed through the embedding layer (yellow) and then input into the encoder (green). We also append a *start of sequence* (`<sos>`) and *end of sequence* (`<eos>`) token to the start and end of sentence, respectively. At each time-step, the input to the encoder RNN is both the embedding, $emb$, of the current word, $emb(x_t)$, as well as the hidden state from the previous time-step, $h_{t-1}$, and the encoder RNN outputs a new hidden state $h_t$. We can think of the hidden state as a vector representation of the sentence so far. The RNN can be represented as a function of both of $emb(x_t)$ and $h_{t-1}$:

$$h_t = \text{EncoderRNN}(emb(x_t), h_{t-1})$$

We're using the term RNN generally here; it could be any recurrent architecture, such as an *LSTM* (Long Short-Term Memory) or a *GRU* (Gated Recurrent Unit). 

Here, we have $X = \{x_1, x_2, ..., x_T\}$, where $x_1 = \text{<sos>}, x_2 = \text{guten}$, etc. The initial hidden state, $h_0$, is usually either initialized to zeros or a learned parameter.

Once the final word, $x_T$, has been passed into the RNN via the embedding layer, we use the final hidden state, $h_T$, as the context vector, i.e. $h_T = z$. This is a vector representation of the entire source sentence.
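
The recurrence can be written out as an explicit loop. This is only a sketch: it uses a plain `nn.RNNCell` rather than the LSTM built later, a single sentence, and toy sizes and token indexes that are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, emb_dim, hid_dim = 10, 4, 6    # toy sizes, chosen arbitrarily
embedding = nn.Embedding(vocab_size, emb_dim)
cell = nn.RNNCell(emb_dim, hid_dim)

src = torch.tensor([2, 5, 7, 3])           # hypothetical token indexes for one sentence
h = torch.zeros(1, hid_dim)                # h_0 initialised to zeros (batch size 1)

for t in range(len(src)):
    x_t = src[t:t + 1]                     # shape [1]: a batch containing one token
    # h_t = EncoderRNN(emb(x_t), h_{t-1})
    h = cell(embedding(x_t), h)

z = h                                      # context vector: z = h_T
print(z.shape)
```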

Now we have our context vector, $z$, we can start decoding it to get the output/target sentence, "good morning". Again, we append start and end of sequence tokens to the target sentence. At each time-step, the input to the decoder RNN (blue) is the embedding, $d$, of current word, $d(y_t)$, as well as the hidden state from the previous time-step, $s_{t-1}$, where the **initial decoder hidden state, $s_0$, is the context vector, $s_0 = z = h_T$,** i.e. the initial decoder hidden state is the final encoder hidden state. Thus, similar to the encoder, we can represent the decoder as:

$$s_t = \text{DecoderRNN}(d(y_t), s_{t-1})$$

Although the input/source embedding layer, $emb$, and the output/target embedding layer, $d$, are both shown in yellow in the diagram they are two different embedding layers with their own parameters.

In the decoder, we need to go from the hidden state to an actual word, therefore at each time-step we use $s_t$ to predict (by passing it through a `Linear` layer, shown in purple) what we think is the next word in the sequence, $\hat{y}_t$. 

$$\hat{y}_t = f(s_t)$$

The words in the decoder are always generated one after another, with one per time-step. **We always use `<sos>` for the first input to the decoder, $y_1$,** but for subsequent inputs, $y_{t>1}$, we will sometimes use the actual, ground truth next word in the sequence, $y_t$ and sometimes use the word predicted by our decoder, $\hat{y}_{t-1}$. This is called *teacher forcing*, see a bit more info about it [here](https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/). 

When training/testing our model, we always know how many words are in our target sentence, so we stop generating words once we hit that many. During inference it is **common to keep generating words until the model outputs an `<eos>`** token or after a certain amount of words have been generated.
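
The inference-time stopping rule can be sketched as follows. `step_fn` is a hypothetical stand-in for one decoder step, and `toy_step` is an invented example that simply emits three tokens followed by `<eos>`; neither comes from the model built in this lab.

```python
def greedy_decode(step_fn, sos_idx, eos_idx, max_len=50):
    """Generate token indexes until <eos> is produced or max_len steps have run.

    `step_fn(prev_token, state)` must return `(next_token, new_state)`.
    """
    tokens, prev, state = [], sos_idx, None
    for _ in range(max_len):
        prev, state = step_fn(prev, state)
        if prev == eos_idx:
            break
        tokens.append(prev)
    return tokens

def toy_step(prev_token, state):
    """Toy decoder step: emits 10, 11, 12 and then <eos> (index 3)."""
    n = 0 if state is None else state
    return (3 if n == 3 else 10 + n), n + 1

print(greedy_decode(toy_step, sos_idx=2, eos_idx=3))
```

The `max_len` cap matters because nothing guarantees that a trained model ever outputs `<eos>`.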

Once we have our predicted target sentence, $\hat{Y} = \{ \hat{y}_1, \hat{y}_2, ..., \hat{y}_T \}$, we compare it against our actual target sentence, $Y = \{ y_1, y_2, ..., y_T \}$, to calculate our loss. We then use this loss to update all of the parameters in our model.


## Building the Seq2Seq Model

We'll be building our model in three parts. The encoder, the decoder and a seq2seq model that encapsulates the encoder and decoder and will provide a way to interface with each.

### Encoder

First, the encoder, a 2-layer LSTM. The paper uses a 4-layer LSTM, but in the interest of training time we cut this down to 2 layers. The concept of multi-layer RNNs is easy to expand from 2 to 4 layers. 

For a multi-layer RNN, the input sentence, $X$, after being embedded goes into the first (bottom) layer of the RNN and hidden states, $H=\{h_1, h_2, ..., h_T\}$, output by this layer are used as inputs to the RNN in the layer above. Thus, representing each layer with a superscript, the hidden states in the first layer are given by:

$$h_t^1 = \text{EncoderRNN}^1(emb(x_t), h_{t-1}^1)$$

The hidden states in the second layer are given by:

$$h_t^2 = \text{EncoderRNN}^2(h_t^1, h_{t-1}^2)$$

Using a multi-layer RNN also means we'll also need an initial hidden state as input per layer, $h_0^l$, and we will also output a context vector per layer, $z^l$.

An [LSTM](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) is a type of RNN which, instead of just taking in a hidden state and returning a new hidden state per time-step, also takes in and returns a *cell state*, $c_t$, per time-step.

$$\begin{align*}
h_t &= \text{RNN}(emb(x_t), h_{t-1})\\
(h_t, c_t) &= \text{LSTM}(emb(x_t), h_{t-1}, c_{t-1})
\end{align*}$$

We can just think of $c_t$ as another type of hidden state. Similar to $h_0^l$, $c_0^l$ will be initialized to a tensor of all zeros. Also, our context vector will now be both the final hidden state and the final cell state, i.e. $z^l = (h_T^l, c_T^l)$.

Extending our multi-layer equations to LSTMs, we get:

$$\begin{align*}
(h_t^1, c_t^1) &= \text{EncoderLSTM}^1(emb(x_t), (h_{t-1}^1, c_{t-1}^1))\\
(h_t^2, c_t^2) &= \text{EncoderLSTM}^2(h_t^1, (h_{t-1}^2, c_{t-1}^2))
\end{align*}$$

Note how only our hidden state from the first layer is passed as input to the second layer, and not the cell state.
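
The two equations can be unrolled by hand with two `nn.LSTMCell` modules. This is a sketch under assumed toy sizes, with random tensors standing in for the embedded source tokens; `nn.LSTM` with `num_layers=2` performs the same stacking internally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb_dim, hid_dim, batch_size, src_len = 4, 6, 3, 5   # toy sizes
layer1 = nn.LSTMCell(emb_dim, hid_dim)    # EncoderLSTM^1
layer2 = nn.LSTMCell(hid_dim, hid_dim)    # EncoderLSTM^2

embedded = torch.randn(src_len, batch_size, emb_dim)  # stand-in for emb(x_t)
h1 = c1 = torch.zeros(batch_size, hid_dim)            # h_0^1, c_0^1
h2 = c2 = torch.zeros(batch_size, hid_dim)            # h_0^2, c_0^2

for t in range(src_len):
    h1, c1 = layer1(embedded[t], (h1, c1))
    # Only the hidden state h1 is passed upwards; c1 stays inside layer 1.
    h2, c2 = layer2(h1, (h2, c2))

context = [(h1, c1), (h2, c2)]   # z^l = (h_T^l, c_T^l) for l = 1, 2
print(h2.shape, c2.shape)
```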

So our encoder looks something like this: 

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq2.png?raw=1)

We create this in code by making an `Encoder` module, which requires we inherit from `torch.nn.Module` and use the `super().__init__()` as some boilerplate code. The encoder takes the following arguments:
- `input_dim` is the size/dimensionality of the one-hot vectors that will be input to the encoder. This is equal to the input (source) vocabulary size.
- `emb_dim` is the dimensionality of the embedding layer. This layer converts the one-hot vectors into dense vectors with `emb_dim` dimensions. 
- `hid_dim` is the dimensionality of the hidden and cell states.
- `n_layers` is the number of layers in the RNN.
- `dropout` is the amount of dropout to use. This is a regularization parameter to prevent overfitting. Check out [this](https://www.coursera.org/lecture/deep-neural-network/understanding-dropout-YaGbR) for more details about dropout.

The embedding layer is created using `nn.Embedding`, the LSTM with `nn.LSTM` and a dropout layer with `nn.Dropout`. Check the PyTorch [documentation](https://pytorch.org/docs/stable/nn.html) for more about these.
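
To see why an embedding layer can be described as converting one-hot vectors into dense vectors, the sketch below compares an `nn.Embedding` lookup with an explicit one-hot matrix product. The sizes and token indexes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

input_dim, emb_dim = 8, 3                  # toy vocabulary and embedding sizes
embedding = nn.Embedding(input_dim, emb_dim)

tokens = torch.tensor([1, 4, 4, 7])        # hypothetical token indexes
one_hot = F.one_hot(tokens, num_classes=input_dim).float()

via_lookup = embedding(tokens)             # row lookup, as done by nn.Embedding
via_matmul = one_hot @ embedding.weight    # one-hot vectors times the weight matrix

print(torch.allclose(via_lookup, via_matmul))
```

In practice the lookup is used because it avoids materialising the one-hot vectors.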

One thing to note is that the `dropout` argument to the LSTM is how much dropout to apply between the layers of a multi-layer RNN, i.e. between the hidden states output from layer $l$ and those same hidden states being used for the input of layer $l+1$.

In the `forward` method, we pass in the source sentence, $X$, which is converted into dense vectors using the `embedding` layer, and then dropout is applied. These embeddings are then passed into the RNN. As we pass a whole sequence to the RNN, it will automatically do the recurrent calculation of the hidden states over the whole sequence for us! Notice that we do not pass an initial hidden or cell state to the RNN. This is because, as noted in the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM), if no hidden/cell state is passed to the RNN, it will automatically create an initial hidden/cell state as a tensor of all zeros. 

The RNN returns: `outputs` (the top-layer hidden state for each time-step), `hidden` (the final hidden state for each layer, $h_T$, stacked on top of each other) and `cell` (the final cell state for each layer, $c_T$, stacked on top of each other).
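
The three return values can be inspected on a small `nn.LSTM`. The sizes are toy values and the input is a random tensor standing in for the embedded source batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb_dim, hid_dim, n_layers = 4, 6, 2       # toy sizes
src_len, batch_size = 5, 3
rnn = nn.LSTM(emb_dim, hid_dim, n_layers)

embedded = torch.randn(src_len, batch_size, emb_dim)
outputs, (hidden, cell) = rnn(embedded)    # no initial state passed: zeros are used

print(outputs.shape)   # [src len, batch size, hid dim]
print(hidden.shape)    # [n layers, batch size, hid dim]
print(cell.shape)      # [n layers, batch size, hid dim]

# The last time-step of `outputs` is the final hidden state of the top layer.
print(torch.allclose(outputs[-1], hidden[-1]))
```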

As we only need the **final hidden and cell states to make our context vector,** `forward` only returns `hidden` and `cell`. 

The sizes of the tensors are left as comments in the code. In this implementation `n_directions` will always be 1; note, however, that bidirectional RNNs will have `n_directions` as 2.

In [29]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

In [None]:
# Q7. What happens to outputs, hidden and cell at time step t?
# At each time-step the LSTM takes the current embedded token together with the
# previous hidden and cell states and produces new hidden and cell states.
# The top-layer hidden state is stored in outputs at position t.
# The new hidden and cell states are fed into the next time-step, and this repeats until the end of the sequence.

### Decoder

Next, we'll build our decoder, which will also be a 2-layer (4 in the paper) LSTM.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq3.png?raw=1)

The `Decoder` class does a single step of decoding, i.e. it outputs a single token per time-step. The first layer will receive a hidden and cell state from the previous time-step, $(s_{t-1}^1, c_{t-1}^1)$, and feed it through the LSTM with the current embedded token, $y_t$, to produce a new hidden and cell state, $(s_t^1, c_t^1)$. The subsequent layers will use the hidden state from the layer below, $s_t^{l-1}$, and the previous hidden and cell states from their layer, $(s_{t-1}^l, c_{t-1}^l)$. This provides equations very similar to those in the encoder.

$$\begin{align*}
(s_t^1, c_t^1) &= \text{DecoderLSTM}^1(d(y_t), (s_{t-1}^1, c_{t-1}^1))\\
(s_t^2, c_t^2) &= \text{DecoderLSTM}^2(s_t^1, (s_{t-1}^2, c_{t-1}^2))
\end{align*}$$

**Remember that the initial hidden and cell states to our decoder are our context vectors, which are the final hidden and cell states of our encoder from the same layer**, i.e. $(s_0^l,c_0^l)=z^l=(h_T^l,c_T^l)$.

We then pass the hidden state from the top layer of the RNN, $s_t^L$, through a linear layer (purple), $f$, to make a prediction of what the next token in the target (output) sequence should be, $\hat{y}_{t+1}$. 

$$\hat{y}_{t+1} = f(s_t^L)$$
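
A minimal sketch of this prediction step, assuming toy sizes and a random tensor in place of the top-layer hidden state. Taking the `argmax` is one way to turn the scores into a token; it is shown here only as an illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hid_dim, output_dim, batch_size = 6, 10, 3   # toy sizes
fc_out = nn.Linear(hid_dim, output_dim)      # the function f

s_top = torch.randn(batch_size, hid_dim)     # stand-in for s_t^L
scores = fc_out(s_top)                       # one score per target-vocabulary entry
next_token = scores.argmax(dim=1)            # highest-scoring token for each sentence

print(scores.shape, next_token.shape)
```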

The arguments and initialization are similar to the `Encoder` class, except we now have an `output_dim` which is the size of the vocabulary for the output/target. There is also the addition of the `Linear` layer, used to make the predictions from the top layer hidden state.

Within the `forward` method, we accept a batch of input tokens, previous hidden states and previous cell states. As we are only decoding one token at a time, the **input tokens will always have a sequence length of 1**. We `unsqueeze` the input tokens to add a sentence length dimension of 1. Then, similar to the encoder, we pass through an embedding layer and apply dropout. This batch of embedded tokens is then passed into the RNN with the previous hidden and cell states. This produces an `output` (hidden state from the top layer of the RNN), a new `hidden` state (one for each layer, stacked on top of each other) and a new `cell` state (also one per layer, stacked on top of each other). We then pass the `output` (after getting rid of the sentence length dimension) through the linear layer to receive our `prediction`. We then return the `prediction`, the new `hidden` state and the new `cell` state.


In [30]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

In [None]:
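# Q8. How is the decoder different to the encoder?
# The encoder reads the whole source sequence and returns only its final hidden and cell states (the context vectors).
# The decoder starts from those context vectors, processes one token per call, and has an extra linear layer
# that turns the top-layer hidden state into a prediction over the target vocabulary.
# Its input at each step is either the ground-truth token or its own previous prediction (teacher forcing).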

### Seq2Seq

For the final part of the implementation, we'll implement the seq2seq model. This will handle: 
- receiving the input/source sentence
- using the encoder to produce the context vectors 
- using the decoder to produce the predicted output/target sentence

Our full model will look like this:

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq4.png?raw=1)

The `Seq2Seq` model takes in an `Encoder`, `Decoder`, and a `device` (used to place tensors on the GPU, if it exists).

For this implementation, we have to ensure that the number of layers and the hidden (and cell) dimensions are equal in the `Encoder` and `Decoder`. This is not a general requirement: a sequence-to-sequence model does not necessarily need the same number of layers or the same hidden dimension sizes. However, if we did something like having a different number of layers then we would need to make decisions about how this is handled. For example, if our encoder has 2 layers and our decoder only has 1, how is this handled? Do we average the two context vectors output by the encoder? Do we pass both through a linear layer? Do we only use the context vector from the highest layer? Etc.

Our `forward` method takes the source sentence, target sentence and a teacher-forcing ratio. The teacher forcing ratio is used when training our model. When decoding, at each time-step we will predict what the next token in the target sequence will be from the previous tokens decoded, $\hat{y}_{t+1}=f(s_t^L)$. With probability equal to the teacher forcing ratio (`teacher_forcing_ratio`) we will use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. However, with probability `1 - teacher_forcing_ratio`, we will use the token that the model predicted as the next input to the model, even if it doesn't match the actual next token in the sequence.  
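
The per-step decision can be sketched in a few lines. The helper `next_decoder_input` and the string placeholders are hypothetical; in the model the two candidates are tensors of token indexes.

```python
import random

def next_decoder_input(ground_truth, predicted, teacher_forcing_ratio, rng=random):
    """Return the ground-truth token with probability `teacher_forcing_ratio`,
    otherwise the model's own prediction."""
    teacher_force = rng.random() < teacher_forcing_ratio
    return ground_truth if teacher_force else predicted

rng = random.Random(1234)
choices = [next_decoder_input("truth", "guess", 0.5, rng) for _ in range(1000)]
# Fraction of steps that were teacher forced; it should be close to the ratio.
print(choices.count("truth") / len(choices))
```

A ratio of 1.0 always feeds the ground truth, and a ratio of 0.0 always feeds the model's own prediction.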

The first thing we do in the `forward` method is to create an `outputs` tensor that will store all of our predictions, $\hat{Y}$.

We then feed the input/source sentence, `src`, into the encoder and receive its final hidden and cell states.

The first input to the decoder is the start of sequence (`<sos>`) token. As our `trg` tensor already has the `<sos>` token prepended (all the way back when we defined the `init_token` in our `TRG` field) we get our $y_1$ by slicing into it. We know how long our target sentences should be (`trg_len`), so we loop that many times. The last token input into the decoder is the one **before** the `<eos>` token: the `<eos>` token is never input into the decoder.

During each iteration of the loop, we:
- pass the input, previous hidden and previous cell states ($y_t, s_{t-1}, c_{t-1}$) into the decoder
- receive a prediction, next hidden state and next cell state ($\hat{y}_{t+1}, s_{t}, c_{t}$) from the decoder
- place our prediction, $\hat{y}_{t+1}$/`output` in our tensor of predictions, $\hat{Y}$/`outputs`
- decide if we are going to "teacher force" or not
    - if we do, the next `input` is the ground-truth next token in the sequence, $y_{t+1}$/`trg[t]`
    - if we don't, the next `input` is the predicted next token in the sequence, $\hat{y}_{t+1}$/`top1`, which we get by doing an `argmax` over the output tensor
    
Once we've made all of our predictions, we return our tensor full of predictions, $\hat{Y}$/`outputs`.

**Note**: our decoder loop starts at 1, not 0. This means the 0th element of our `outputs` tensor remains all zeros. So our `trg` and `outputs` look something like:

$$\begin{align*}
\text{trg} = [<sos>, &y_1, y_2, y_3, <eos>]\\
\text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

Later on when we calculate the loss, we cut off the first element of each tensor to get:

$$\begin{align*}
\text{trg} = [&y_1, y_2, y_3, <eos>]\\
\text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

In [31]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs

In [None]:
# Q9. What will be the input to the decoder at time step t?
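# The decoder's input at time step t is the previous target-side token: the ground-truth token trg[t-1] when teacher forcing was chosen, otherwise the decoder's own previous prediction (top1).
# It is passed in together with the previous hidden and cell states.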

# Training the Seq2Seq Model

Now we have our model implemented, we can begin training it. 

First, we'll initialize our model. As mentioned before, the input and output dimensions are defined by the size of the vocabulary. The embedding dimensions and dropout for the encoder and decoder can be different, but the number of layers and the size of the hidden/cell states must be the same.

We then define the encoder, decoder and then our Seq2Seq model, which we place on the `device`.

In [32]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

In [44]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.3
DEC_DROPOUT = 0.3

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

In [53]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0
DEC_DROPOUT = 0

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

### Initialising Weights
Next up is initialising the weights of our model. In the paper they state they initialise all weights from a uniform distribution between -0.08 and +0.08, i.e. $\mathcal{U}(-0.08, 0.08)$.

We initialise weights in PyTorch by creating a function which we `apply` to our model. When using `apply`, the `init_weights` function will be called on every module and sub-module within our model. For each module we loop through all of the parameters and sample them from a uniform distribution with `nn.init.uniform_`.
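As a small check of how `apply` behaves, the sketch below runs a tracking variant of `init_weights` on a toy `nn.Sequential` (a stand-in, not the lab's model) and records which modules are visited. The `visited` list and the toy layer sizes are assumptions added for illustration.

```python
import torch.nn as nn

visited = []  # records the order in which apply() calls the function

def init_weights(m):
    visited.append(type(m).__name__)
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

# Toy stand-in model with hypothetical sizes
toy = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 6))
toy.apply(init_weights)

# Every parameter should now lie inside the sampling range
in_range = all(
    bool(((p >= -0.08) & (p <= 0.08)).all()) for p in toy.parameters()
)
print(visited, in_range)
```

Because `named_parameters()` is recursive, the parent module re-samples its children's parameters when it is visited last. This is harmless here since every draw comes from the same distribution.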

In [45]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7853, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.3)
    (dropout): Dropout(p=0.3, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.3)
    (fc_out): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.3, inplace=False)
  )
)

We also define a function that will calculate the number of trainable parameters in the model.

In [46]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 13,898,501 trainable parameters
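The reported total can be reproduced by hand from the layer sizes printed above. The sketch below assumes PyTorch's LSTM layout of four gates per layer, each with input-to-hidden and hidden-to-hidden weights plus two bias vectors; the helper names are made up for this example.

```python
def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with input-to-hidden and hidden-to-hidden weights,
    # plus two bias vectors of size 4 * hidden_size
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

def lstm_params(input_size, hidden_size, n_layers):
    total = lstm_layer_params(input_size, hidden_size)
    # layers after the first take the previous layer's hidden state as input
    total += (n_layers - 1) * lstm_layer_params(hidden_size, hidden_size)
    return total

src_vocab, trg_vocab = 7853, 5893
emb_dim, hid_dim, n_layers = 256, 512, 2

encoder = src_vocab * emb_dim + lstm_params(emb_dim, hid_dim, n_layers)
decoder = (trg_vocab * emb_dim + lstm_params(emb_dim, hid_dim, n_layers)
           + hid_dim * trg_vocab + trg_vocab)  # fc_out weights + bias
total = encoder + decoder
print(f"{total:,}")  # 13,898,501
```

Dropout layers contribute nothing to the count because they have no parameters.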


### Optimiser & Loss Function

The `CrossEntropyLoss` function calculates both the log softmax as well as the negative log-likelihood of our predictions. 

Our loss function calculates the average loss per token. However, by passing the index of the `<pad>` token as the `ignore_index` argument, we ignore the loss whenever the target token is a padding token.
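To make the two statements above concrete, here is a plain-Python sketch of the same computation: a log softmax per row, the negative log-likelihood of the target, and an average that skips padding targets. It is an illustration of the idea rather than PyTorch's implementation, and the padding index of 1 and the toy logits are assumptions.

```python
import math

def cross_entropy(logits, targets, ignore_index):
    """Average negative log-likelihood over targets that are not ignore_index.

    Assumes at least one target is not ignore_index.
    """
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing
        # log softmax: subtract the log-sum-exp of the row
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        total += -(row[target] - log_sum_exp)
        counted += 1
    return total / counted

PAD_IDX = 1  # assumed padding index for this toy example
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
targets = [0, 2, PAD_IDX]  # the last position is padding
loss = cross_entropy(logits, targets, ignore_index=PAD_IDX)

# Changing the logits at the padded position leaves the loss untouched
logits_changed = [logits[0], logits[1], [9.0, -9.0, 0.0]]
loss_changed = cross_entropy(logits_changed, targets, ignore_index=PAD_IDX)
```

Note that the average is taken over the two non-padding tokens only, not over all three positions.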

In [47]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

optimizer = optim.Adam(model.parameters())

### Training Loop

First, we'll set the model into "training mode" with `model.train()`. This turns on dropout (and batch normalization, which we aren't using). We then iterate through our data iterator.

As stated before, our decoder loop starts at 1, not 0. This means the 0th element of our `outputs` tensor remains all zeros. So our `trg` and `outputs` look something like:

$$\begin{align*}
\text{trg} = [<sos>, &y_1, y_2, y_3, <eos>]\\
\text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

Here, when we calculate the loss, we cut off the first element of each tensor to get:

$$\begin{align*}
\text{trg} = [&y_1, y_2, y_3, <eos>]\\
\text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$
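The slice above, together with the flattening that the training loop applies before computing the loss, can be sketched on toy tensors. The sizes are arbitrary assumptions and random values stand in for real decoder outputs.

```python
import torch

torch.manual_seed(0)
trg_len, batch_size, output_dim = 5, 2, 7  # toy sizes, assumed for illustration

outputs = torch.zeros(trg_len, batch_size, output_dim)
outputs[1:] = torch.randn(trg_len - 1, batch_size, output_dim)  # row 0 stays all zeros
trg = torch.randint(0, output_dim, (trg_len, batch_size))

# Drop the first time step, then flatten so the loss sees 2d inputs and 1d targets
flat_outputs = outputs[1:].view(-1, output_dim)
flat_trg = trg[1:].view(-1)

print(tuple(flat_outputs.shape), tuple(flat_trg.shape))
```

Each row of `flat_outputs` lines up with the entry of `flat_trg` at the same position, which is what `CrossEntropyLoss` expects.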

At each iteration:
- get the source and target sentences from the batch, $X$ and $Y$
- zero the gradients calculated from the last batch
- feed the source and target into the model to get the output, $\hat{Y}$
- as the loss function only works on 2d inputs with 1d targets we need to flatten each of them with `.view`
    - we slice off the first element (time step 0) of the output and target tensors as mentioned above
- calculate the loss with `criterion`
- calculate the gradients with `loss.backward()`
- clip the gradients to prevent them from exploding (a common issue in RNNs)
- update the parameters of our model by doing an optimizer step
- sum the loss value to a running total

Finally, we return the loss that is averaged over all batches.
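The gradient-clipping step in the list above rescales all gradients by a single factor whenever their combined norm exceeds the threshold. The plain-Python sketch below shows that idea for the L2 norm; it is not PyTorch's implementation, and the function name and the small `eps` term are assumptions for this example.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in grads], total_norm

grads = [[3.0, 4.0], [12.0]]  # combined norm: sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))

# Gradients that are already small enough are left unchanged
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size is limited.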

In [48]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

### Evaluation loop


In [49]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
# Q10. Looking at the functions for train() and evaluate(), state what are their differences and why.
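# When training, gradients are enabled because we need backpropagation; when evaluating we don't need them, so gradients are turned off with torch.no_grad().
# evaluate() also calls model.eval() (which disables dropout), sets the teacher forcing ratio to 0, and skips the optimizer step and gradient clipping.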

Next, we'll create a function that we'll use to tell us how long an epoch takes.

In [38]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

## Train the model!

At each epoch, we'll be checking if our model has achieved the best validation loss so far. If it has, we'll update our best validation loss and save the parameters of our model (called `state_dict` in PyTorch). Then, when we come to test our model, we'll use the saved parameters that achieved the best validation loss.
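A minimal sketch of the save and restore round trip, using a tiny `nn.Linear` as a hypothetical stand-in for the lab's model and an in-memory buffer instead of the `lab08-model.pt` file:

```python
import io
import torch
import torch.nn as nn

toy = nn.Linear(3, 2)  # hypothetical stand-in for the lab's model
state = toy.state_dict()

# state_dict maps parameter names to tensors
for name, tensor in state.items():
    print(name, tuple(tensor.shape))

# Save to an in-memory buffer (the lab saves to 'lab08-model.pt' instead)
buffer = io.BytesIO()
torch.save(state, buffer)
buffer.seek(0)

restored = nn.Linear(3, 2)  # freshly initialised, so its weights start out different
restored.load_state_dict(torch.load(buffer))
same = all(torch.equal(state[k], restored.state_dict()[k]) for k in state)
```

Only the tensors are stored, so the receiving model must be constructed with the same architecture before `load_state_dict` is called.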

We'll be printing out both the loss and the perplexity at each epoch. It is easier to see a change in perplexity than a change in loss as the numbers are much bigger.
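Perplexity here is simply the exponential of the average per-token loss, as in the `math.exp(...)` calls of the training loop. The short sketch below illustrates the relationship; the helper name `perplexity` is made up, and the two loss values are test losses recorded later in this notebook.

```python
import math

def perplexity(avg_loss):
    # avg_loss is the mean cross-entropy per token, in nats
    return math.exp(avg_loss)

# A model that guesses uniformly over V tokens has loss ln(V) and perplexity V
vocab_size = 5893
uniform_ppl = perplexity(math.log(vocab_size))

# A drop of 0.276 in loss (3.969 -> 3.693) shrinks perplexity by about a quarter
ratio = perplexity(3.693) / perplexity(3.969)
print(f"uniform PPL: {uniform_ppl:.0f}, ratio: {ratio:.3f}")
```

A fixed difference in loss always corresponds to the same multiplicative change in perplexity, which is why small loss improvements look larger on the perplexity scale.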

Training time: <font color="green">~6.25m on GPU</font>

In [39]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'lab08-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 36s
	Train Loss: 5.050 | Train PPL: 156.032
	 Val. Loss: 5.009 |  Val. PPL: 149.738
Epoch: 02 | Time: 0m 37s
	Train Loss: 4.500 | Train PPL:  90.013
	 Val. Loss: 4.871 |  Val. PPL: 130.418
Epoch: 03 | Time: 0m 36s
	Train Loss: 4.205 | Train PPL:  67.024
	 Val. Loss: 4.600 |  Val. PPL:  99.458
Epoch: 04 | Time: 0m 37s
	Train Loss: 4.004 | Train PPL:  54.837
	 Val. Loss: 4.545 |  Val. PPL:  94.184
Epoch: 05 | Time: 0m 36s
	Train Loss: 3.851 | Train PPL:  47.017
	 Val. Loss: 4.433 |  Val. PPL:  84.155
Epoch: 06 | Time: 0m 37s
	Train Loss: 3.726 | Train PPL:  41.510
	 Val. Loss: 4.310 |  Val. PPL:  74.439
Epoch: 07 | Time: 0m 36s
	Train Loss: 3.591 | Train PPL:  36.268
	 Val. Loss: 4.214 |  Val. PPL:  67.641
Epoch: 08 | Time: 0m 36s
	Train Loss: 3.443 | Train PPL:  31.288
	 Val. Loss: 4.115 |  Val. PPL:  61.230
Epoch: 09 | Time: 0m 36s
	Train Loss: 3.307 | Train PPL:  27.295
	 Val. Loss: 4.005 |  Val. PPL:  54.857
Epoch: 10 | Time: 0m 36s
	Train Loss: 3.213 | Train PPL

In [50]:
N_EPOCHS = 15
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'lab08-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 38s
	Train Loss: 5.035 | Train PPL: 153.668
	 Val. Loss: 4.917 |  Val. PPL: 136.555
Epoch: 02 | Time: 0m 36s
	Train Loss: 4.475 | Train PPL:  87.751
	 Val. Loss: 4.744 |  Val. PPL: 114.896
Epoch: 03 | Time: 0m 36s
	Train Loss: 4.170 | Train PPL:  64.734
	 Val. Loss: 4.531 |  Val. PPL:  92.822
Epoch: 04 | Time: 0m 36s
	Train Loss: 3.925 | Train PPL:  50.649
	 Val. Loss: 4.483 |  Val. PPL:  88.510
Epoch: 05 | Time: 0m 36s
	Train Loss: 3.748 | Train PPL:  42.445
	 Val. Loss: 4.293 |  Val. PPL:  73.160
Epoch: 06 | Time: 0m 36s
	Train Loss: 3.577 | Train PPL:  35.756
	 Val. Loss: 4.057 |  Val. PPL:  57.804
Epoch: 07 | Time: 0m 36s
	Train Loss: 3.384 | Train PPL:  29.485
	 Val. Loss: 4.053 |  Val. PPL:  57.553
Epoch: 08 | Time: 0m 36s
	Train Loss: 3.259 | Train PPL:  26.015
	 Val. Loss: 3.948 |  Val. PPL:  51.808
Epoch: 09 | Time: 0m 36s
	Train Loss: 3.094 | Train PPL:  22.064
	 Val. Loss: 3.871 |  Val. PPL:  47.997
Epoch: 10 | Time: 0m 36s
	Train Loss: 2.964 | Train PPL

In [54]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'lab08-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 35s
	Train Loss: 8.684 | Train PPL: 5906.560
	 Val. Loss: 8.686 |  Val. PPL: 5919.015
Epoch: 02 | Time: 0m 36s
	Train Loss: 8.684 | Train PPL: 5907.284
	 Val. Loss: 8.686 |  Val. PPL: 5919.015
Epoch: 03 | Time: 0m 35s
	Train Loss: 8.684 | Train PPL: 5907.253
	 Val. Loss: 8.686 |  Val. PPL: 5919.015
Epoch: 04 | Time: 0m 36s
	Train Loss: 8.684 | Train PPL: 5906.903
	 Val. Loss: 8.686 |  Val. PPL: 5919.015
Epoch: 05 | Time: 0m 36s
	Train Loss: 8.684 | Train PPL: 5906.696
	 Val. Loss: 8.686 |  Val. PPL: 5919.015
Epoch: 06 | Time: 0m 36s
	Train Loss: 8.684 | Train PPL: 5907.255
	 Val. Loss: 8.686 |  Val. PPL: 5919.015
Epoch: 07 | Time: 0m 36s
	Train Loss: 8.684 | Train PPL: 5906.474
	 Val. Loss: 8.686 |  Val. PPL: 5919.015
Epoch: 08 | Time: 0m 36s
	Train Loss: 8.684 | Train PPL: 5906.698
	 Val. Loss: 8.686 |  Val. PPL: 5919.015
Epoch: 09 | Time: 0m 36s
	Train Loss: 8.684 | Train PPL: 5907.004
	 Val. Loss: 8.686 |  Val. PPL: 5919.015
Epoch: 10 | Time: 0m 36s
	Train Loss:

In [None]:
# Q11. What is contained in model.state_dict()?


We'll load the parameters (`state_dict`) that gave our model the best validation loss and run the model on the test set.

In [55]:
model.load_state_dict(torch.load('lab08-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 8.686 | Test PPL: 5918.874 |


### PREDICTION

Let's see how the model translates a sample sentence.

In [41]:
# Take ground truth samples from data set
example_idx = 23
example = train_data.examples[example_idx]
print('source sentence: ', ' '.join(example.src[::-1])) # reverse the reversed sentence to get original sentence
print('target sentence: ', ' '.join(example.trg))

source sentence:  zwei kinder sitzen auf einer kleinen wippe im sand .
target sentence:  two children sit on a small seesaw in the sand .


In [51]:
# Get translation for example src sentence
src_tensor = SRC.process([example.src]).to(device)
trg_tensor = TRG.process([example.trg]).to(device)
print(trg_tensor.shape)

model.eval()
with torch.no_grad():
    outputs = model(src_tensor, trg_tensor, teacher_forcing_ratio=0)

outputs.shape

torch.Size([13, 1])


torch.Size([13, 1, 5893])

In [None]:
# Q12. Why is teacher forcing ratio set to 0 here?
# When we test with unseen data, teacher forcing should not be applied. 

# Q13. What does the shape of the output tell you about the outputs of the model?
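# The shape is [trg len, batch size, vocab size]: 13 time steps (including <sos> and <eos>), a batch of 1, and a score for each of the 5893 target-vocabulary tokens.
# Excluding <sos> and <eos>, the sentence itself is 11 tokens.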

In [43]:
# Convert translated indexes back to a sentence
output_idx = outputs[1:].squeeze(1).argmax(1)
' '.join([TRG.vocab.itos[idx] for idx in output_idx])

'two men are sitting on a field with a water . <eos>'

In [None]:
# Q14. Explain what this line of code is doing:
# output_idx = outputs[1:].squeeze(1).argmax(1)
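# It takes the outputs from the second time step onwards (dropping the all-zero first slot), removes the second dimension (the batch dimension of size 1), then picks the highest-scoring vocabulary index at each time step.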
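The three operations can be followed shape by shape on a random tensor. In this sketch the vocabulary is shrunk from 5893 to 20 and `itos` is a hypothetical index-to-string list standing in for `TRG.vocab.itos`.

```python
import torch

torch.manual_seed(0)
trg_len, batch_size, vocab_size = 13, 1, 20  # vocab shrunk from 5893 for the sketch

outputs = torch.randn(trg_len, batch_size, vocab_size)
outputs[0] = 0  # the model never writes to time step 0

step1 = outputs[1:]            # [12, 1, 20]: drop the unused first time step
step2 = step1.squeeze(1)       # [12, 20]: remove the batch dimension of size 1
output_idx = step2.argmax(1)   # [12]: best-scoring vocabulary index per time step

itos = [f"tok{i}" for i in range(vocab_size)]  # hypothetical index-to-string list
sentence = " ".join(itos[idx] for idx in output_idx.tolist())
print(tuple(step1.shape), tuple(step2.shape), tuple(output_idx.shape))
```

The result has one token per remaining time step, which matches the 12 tokens (including `<eos>`) in the translated sentence above.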

In [None]:
# Q15. Change the index (example_idx) and try out with different sample sentences.
# Tweak one or two hyperparameters to see if the model could be improved. Note that GPU allocation is limited on free Colab.
# Try removing dropout as well.
# result1 : | Test Loss: 3.969 | Test PPL:  52.948 |
# result2: | Test Loss: 3.693 | Test PPL:  40.161 |
# result3: | Test Loss: 8.686 | Test PPL: 5918.874 |
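# I changed the dropout rate to 0.3 and increased the number of epochs, and the result was better.
# With dropout removed the result was much worse, but the loss stayed flat at about 8.68 (close to ln(5893), a uniform guess over the target vocabulary) in every epoch.
# The model was re-created in In [53] while the optimizer from In [47] was not, so this run most likely never updated the new model and does not show the effect of dropout itself.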
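The validation loss in the result3 run is identical in every epoch, which suggests the model's parameters were never updated. One possible explanation, inferred from the cell execution order (model re-created in `In [53]`, optimizer last created in `In [47]`), is that the optimizer still held the old model's parameters. The toy sketch below reproduces that situation with `nn.Linear` and SGD as stand-ins; it is an assumption about the cause, not a confirmed diagnosis.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

old_model = nn.Linear(4, 2)
optimizer = optim.SGD(old_model.parameters(), lr=0.1)  # bound to old_model's tensors

new_model = nn.Linear(4, 2)  # model re-created; optimizer is NOT re-created
before = new_model.weight.detach().clone()

x = torch.randn(8, 4)
optimizer.zero_grad()
new_model(x).pow(2).mean().backward()
optimizer.step()  # steps old_model's parameters, which received no gradients
unchanged = torch.equal(before, new_model.weight.detach())

# Re-creating the optimizer for the new model restores training
fixed_optimizer = optim.SGD(new_model.parameters(), lr=0.1)
fixed_optimizer.zero_grad()
new_model(x).pow(2).mean().backward()
fixed_optimizer.step()
changed = not torch.equal(before, new_model.weight.detach())
print(unchanged, changed)
```

If this is what happened, re-running the weight initialisation and optimizer cells after rebuilding the model would be needed before drawing conclusions about dropout.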