### Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.

In this notebook we are going to implement the model from [this](https://arxiv.org/abs/1406.1078) paper.

In the previous notebook we covered the encoder-decoder model. As a reminder, here is the general architecture:


<p align="center"><img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq1.png"/></p>

In the previous notebook we used a multi-layer LSTM, which looked as follows:


<p align="center"><img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq4.png"/></p>

One downside of the previous model is that the decoder has to cram a lot of information into its hidden states. Whilst decoding, the hidden state needs to contain information about the whole of the source sequence, as well as all of the tokens that have been decoded so far. By alleviating some of this information compression, we can create a better model!

We'll also be using a GRU (Gated Recurrent Unit) instead of an LSTM (Long Short-Term Memory). Why? Mainly because that's what they did in the paper (this paper also introduced GRUs), and also because we used LSTMs last time. GRUs and LSTMs are close relatives: both add gating mechanisms that regular RNNs lack.

### Data Preparation.
The data preparation is the same as in the previous notebook, with very few changes. Where a code block does change, I will highlight the difference.



In [1]:
import torch
from torch import nn
from torch.nn  import functional as F
import spacy, math, random
import numpy as np

from torchtext.legacy import datasets, data

### Setting seeds

In [2]:
SEED = 42

np.random.seed(SEED)
torch.manual_seed(SEED)
random.seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
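
A quick way to confirm that seeding behaves as expected is to reseed and draw the same random numbers twice. The `set_seed` helper below is a hypothetical convenience wrapper and not part of the original notebook; it simply groups the same calls. Note that the cuDNN flag is spelled `deterministic`.

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Hypothetical helper: seed every RNG the notebook relies on."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True


set_seed(42)
first = torch.rand(3)
set_seed(42)
second = torch.rand(3)
print(torch.equal(first, second))  # the two draws should be identical
```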

### Creating Tokens

In [3]:
import spacy
import spacy.cli
spacy.cli.download('de_core_news_sm')

import de_core_news_sm, en_core_web_sm

spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')


Previously we reversed the source `(German)` sentence, however in the paper we are implementing they don't do this, so neither will we.

In [4]:
def tokenize_de(sent):
  return [tok.text for tok in spacy_de.tokenizer(sent)]

def tokenize_en(sent):
  return [tok.text for tok in spacy_en.tokenizer(sent)]

### Creating `Fields` that process data.

In [5]:
SRC = data.Field(
    tokenize = tokenize_de,
    lower = True,
    init_token = "<sos>",
    eos_token = "<eos>"
)
TRG = data.Field(
    tokenize = tokenize_en,
    lower = True,
    init_token = "<sos>",
    eos_token = "<eos>"
)

### Loading the `Multi30k` Dataset

As in the previous notebook, `exts` specifies which languages to use as the source and target **(source goes first)** and `fields` specifies which field to use for the source and target.

In [6]:
train_data, valid_data, test_data = datasets.Multi30k.splits(
    exts=('.de', '.en'),
    fields = (SRC, TRG)
)

downloading training.tar.gz


training.tar.gz: 100%|██████████| 1.21M/1.21M [00:00<00:00, 1.82MB/s]


downloading validation.tar.gz


validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 279kB/s]


downloading mmt_task1_test2016.tar.gz


mmt_task1_test2016.tar.gz: 100%|██████████| 66.2k/66.2k [00:00<00:00, 266kB/s]


### Checking if we have loaded the data correctly.

In [7]:
from prettytable import PrettyTable
def tabulate(column_names, data):
  table = PrettyTable(column_names)
  table.title= "VISUALIZING SETS EXAMPLES"
  table.align[column_names[0]] = 'l'
  table.align[column_names[1]] = 'r'
  for row in data:
    table.add_row(row)
  print(table)

column_names = ["SUBSET", "EXAMPLE(s)"]
row_data = [
        ["training", len(train_data)],
        ['validation', len(valid_data)],
        ['test', len(test_data)]
]
tabulate(column_names, row_data)

+-----------------------------+
|  VISUALIZING SETS EXAMPLES  |
+--------------+--------------+
| SUBSET       |   EXAMPLE(s) |
+--------------+--------------+
| training     |        29000 |
| validation   |         1014 |
| test         |         1000 |
+--------------+--------------+


### Checking a single example of the `SRC` and the `TRG`.

In [8]:
print(vars(train_data.examples[0]))

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


### Building the vocabulary.
As in the previous notebook, every token that appears fewer than 2 times is automatically converted to the unknown token, `<unk>`.

In [9]:
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
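
To illustrate what `min_freq=2` does, here is a minimal pure-Python sketch of a frequency-filtered vocabulary. The `build_vocab` and `numericalize` functions, the order of the special tokens and the toy corpus are all assumptions made for this example; torchtext's own implementation may differ in its details.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Toy vocabulary builder: tokens seen fewer than min_freq times are left out."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Most frequent first; ties broken alphabetically for a stable order.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(sentence, stoi):
    """Map tokens to indices, falling back to <unk> for out-of-vocabulary tokens."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in sentence]


corpus = [["two", "dogs", "run", "."], ["two", "cats", "sleep", "."]]
stoi = build_vocab(corpus, min_freq=2)
print(numericalize(["two", "birds", "."], stoi))  # [5, 0, 4]
```

Only `two` and `.` occur twice, so they are the only regular tokens kept; `birds` falls back to index 0, the `<unk>` token.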

### Device.

In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Creating Iterators.
As in the previous notebook, we are going to use the `BucketIterator` to create the train, validation and test iterators.

In [11]:
BATCH_SIZE = 128
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    device = device,
    batch_size = BATCH_SIZE
)

### Encoder Model.
The encoder is similar to the previous one, with the multi-layer LSTM swapped for a single-layer GRU. We also don't pass the dropout as an argument to the GRU, as that dropout is used between the layers of a multi-layered RNN. As we only have a single layer, PyTorch will display a warning if we try to pass a dropout value to it.

Another thing to note about the GRU is that it only requires and returns a hidden state, there is no cell state like in the LSTM.
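
The difference in return values can be checked directly. This is a small sketch with arbitrary toy sizes (not the notebook's hyperparameters) that runs the same random input through `nn.GRU` and `nn.LSTM`.

```python
import torch
from torch import nn

seq_len, batch_size, emb_dim, hid_dim = 5, 3, 8, 16  # toy sizes, chosen arbitrarily
x = torch.randn(seq_len, batch_size, emb_dim)

gru = nn.GRU(emb_dim, hid_dim)
lstm = nn.LSTM(emb_dim, hid_dim)

gru_out, gru_hidden = gru(x)                  # hidden state only
lstm_out, (lstm_hidden, lstm_cell) = lstm(x)  # hidden state and cell state

print(gru_out.shape, gru_hidden.shape)
print(lstm_out.shape, lstm_hidden.shape, lstm_cell.shape)
```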

<p align="center"> <img src="https://render.githubusercontent.com/render/math?math=%5Cbegin%7Balign%2A%7D%0Ah_t%20%26amp%3B%3D%20%5Ctext%7BGRU%7D%28e%28x_t%29%2C%20h_%7Bt-1%7D%29%5C%5C%0A%28h_t%2C%20c_t%29%20%26amp%3B%3D%20%5Ctext%7BLSTM%7D%28e%28x_t%29%2C%20h_%7Bt-1%7D%2C%20c_%7Bt-1%7D%29%5C%5C%0Ah_t%20%26amp%3B%3D%20%5Ctext%7BRNN%7D%28e%28x_t%29%2C%20h_%7Bt-1%7D%29%0A%5Cend%7Balign%2A%7D&mode=display"/></p>

From the equations above, it looks like the RNN and the GRU are identical. Inside the GRU, however, are a number of gating mechanisms that control the information flow into and out of the hidden state (similar to an LSTM).
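
To make the gating concrete, here is a minimal sketch that recomputes a single GRU step by hand and compares it with `nn.GRUCell`. It follows the gate layout PyTorch documents for its GRU (reset, update, candidate); the paper's own formulation may arrange the terms slightly differently. The toy sizes and the variable names `reset`, `update` and `candidate` are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
emb_dim, hid_dim, batch_size = 4, 6, 2  # toy sizes
cell = nn.GRUCell(emb_dim, hid_dim)
x = torch.randn(batch_size, emb_dim)
h_prev = torch.randn(batch_size, hid_dim)

# PyTorch stacks the reset, update and candidate parameters in that order.
w_ir, w_iz, w_in = cell.weight_ih.chunk(3, dim=0)
w_hr, w_hz, w_hn = cell.weight_hh.chunk(3, dim=0)
b_ir, b_iz, b_in = cell.bias_ih.chunk(3)
b_hr, b_hz, b_hn = cell.bias_hh.chunk(3)

with torch.no_grad():
    # Reset gate: how much of the previous state feeds the candidate.
    reset = torch.sigmoid(x @ w_ir.T + b_ir + h_prev @ w_hr.T + b_hr)
    # Update gate: how much of the previous state is kept.
    update = torch.sigmoid(x @ w_iz.T + b_iz + h_prev @ w_hz.T + b_hz)
    # Candidate state built from the input and the reset-scaled history.
    candidate = torch.tanh(x @ w_in.T + b_in + reset * (h_prev @ w_hn.T + b_hn))
    # New hidden state: a blend of the candidate and the previous state.
    h_manual = (1 - update) * candidate + update * h_prev
    h_torch = cell(x, h_prev)

print(torch.allclose(h_manual, h_torch, atol=1e-5))
```

From the outside, the cell still maps an input and a previous hidden state to a new hidden state, exactly as the plain RNN equation does; the gates only change how that mapping is computed.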

The rest of the encoder should be very familiar from the previous notebook: it takes in a sequence, $X = \{x_1, x_2, ... , x_T\}$, passes it through the embedding layer, recurrently calculates hidden states, $H = \{h_1, h_2, ..., h_T\}$, and returns a context vector **(the final hidden state)**, $z=h_T$.

$$h_t = \text{EncoderGRU}(e(x_t), h_{t-1})$$
This is identical to the encoder of the general **`seq2seq`** model, with all the "magic" happening inside the GRU (green).

<p align="center"><img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq5.png"/></p>



In [12]:
class Encoder(nn.Module):
  def __init__(self, input_dim, emb_dim, hid_dim, dropout):
    super(Encoder, self).__init__()
    self.hid_dim = hid_dim
    self.embedding = nn.Embedding(input_dim, embedding_dim=emb_dim)
    """
    No Dropout (GRU) since we have **one** layer.
    """
    self.gru = nn.GRU(emb_dim, hid_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, src):
    # src = [src len, batch size]
    embedded = self.dropout(self.embedding(src))
    # embedded = [src len, batch size, emb dim]
    outputs, h_0 = self.gru(embedded) # no cell state since it is a GRU not LSTM
    """
    outputs = [src len, batch size, hid dim * n directions]
    hidden (h_0) = [n_layers * n_directions, batch size, hid dim]
    ** outputs are always from the top hidden layer
    """
    return h_0
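
As a sanity check on the shapes in the comments above, the following sketch pushes a random batch through a standalone embedding layer and single-layer GRU. The sizes are toy values chosen for the example, and dropout is left out for brevity.

```python
import torch
from torch import nn

src_len, batch_size = 7, 4                # toy sizes
vocab_size, emb_dim, hid_dim = 50, 8, 16  # toy sizes

embedding = nn.Embedding(vocab_size, emb_dim)
gru = nn.GRU(emb_dim, hid_dim)

src = torch.randint(0, vocab_size, (src_len, batch_size))  # [src len, batch size]
embedded = embedding(src)                                   # [src len, batch size, emb dim]
outputs, hidden = gru(embedded)

print(outputs.shape)  # [src len, batch size, hid dim]
print(hidden.shape)   # [1, batch size, hid dim]
# For a single-layer, unidirectional GRU the context z = h_T is the last output step.
print(torch.allclose(outputs[-1], hidden[0]))
```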

### Decoder Model.
The decoder is where the implementation differs significantly from the previous model and we alleviate some of the information compression.

Instead of the GRU in the decoder taking just the embedded target token, $d(y_t)$ and the previous hidden state $s_{t-1}$ as inputs, it also takes the context vector $z$.

$$s_t = \text{DecoderGRU}(d(y_t), s_{t-1}, z)$$
Note how this context vector, $z$, does not have a $t$ subscript, meaning we re-use the same context vector returned by the encoder for every time-step in the decoder.

Before, we predicted the next token, $\hat{y}_{t+1}$, with the linear layer, $f$, only using the top-layer decoder hidden state at that time-step, $s_t$, as $\hat{y}_{t+1}=f(s_t^L)$. Now, we also pass the embedding of the current token, $d(y_t)$, and the context vector, $z$, to the linear layer.

$$\hat{y}_{t+1} = f(d(y_t), s_t, z)$$
Thus, our decoder now looks something like this:

<p align="center"><img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq6.png"/></p>


Note, the initial hidden state, $s_0$, is still the context vector, $z$, so when generating the first token we are actually inputting two identical context vectors into the GRU.

How do these two changes reduce the information compression? Well, hypothetically the decoder hidden states, $s_t$, no longer need to contain information about the source sequence as it is always available as an input. Thus, it only needs to contain information about what tokens it has generated so far. The addition of $y_t$ to the linear layer also means this layer can directly see what the token is, without having to get this information from the hidden state.

However, this is just a hypothesis; it is impossible to determine how the model actually uses the information provided to it (don't listen to anyone that says differently). Nevertheless, it is a solid intuition and the results seem to indicate that these modifications are a good idea!

Within the implementation, we will pass $d(y_t)$ and $z$ to the GRU by concatenating them together, so the input dimensions to the GRU are now `emb_dim + hid_dim` (as the context vector will be of size `hid_dim`). The linear layer will take $d(y_t)$, $s_t$ and $z$, also by concatenating them together, hence its input dimensions are now `emb_dim + hid_dim * 2`. We also don't pass a value of dropout to the GRU as it only uses a single layer.

`forward` now takes a `context` argument. Inside `forward`, we concatenate $d(y_t)$ and $z$ as `emb_con` before feeding it to the GRU, and we concatenate $d(y_t)$, $s_t$ and $z$ together as `output` before feeding it through the linear layer to receive our predictions, $\hat{y}_{t+1}$.

In [13]:
class Decoder(nn.Module):
  def __init__(self, output_dim, emb_dim, hid_dim, dropout):
    super(Decoder, self).__init__()
    self.hid_dim = hid_dim
    self.output_dim = output_dim

    self.embedding = nn.Embedding(output_dim, emb_dim)
    self.gru = nn.GRU(emb_dim + hid_dim, hid_dim)
    self.fc = nn.Linear(emb_dim + hid_dim * 2, output_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, input, hidden, context):
    """     
    input = [batch size]
    hidden, (h_0) = [n_layers * n_directions, batch_size, hid_dim]
    context = [n_layers * n_directions, batch_size, hid_dim]

    n_layers and n_directions in the decoder will both always be 1, therefore:
    hidden (h_0) = [1, batch_size, hid_dim]
    context = [1, batch_size, hid_dim]
    """
    input = input.unsqueeze(0)  # input = [1, batch size]
    embedded = self.dropout(self.embedding(input)) # embedded = [1, batch_size, emb_dim]
    emb_con = torch.cat((embedded, context), dim = 2) # emb_con = [1, batch size, emb dim + hid dim]

    output, h_0 = self.gru(emb_con, hidden)
    """      
    output = [seq_len, batch_size, hid dim * n_directions]
    hidden (h_0) = [n_layers * n_directions, batch_size, hid_dim]

    seq_len, n_layers and n_directions will always be 1 in the decoder, therefore:
    output = [1, batch_size, hid_dim]
    hidden (h_0) = [1, batch_size, hid_dim]
    """
    output = torch.cat((embedded.squeeze(0), h_0.squeeze(0), context.squeeze(0)), 
                           dim = 1) # output = [batch size, emb dim + hid dim * 2]
    prediction = self.fc(output) # prediction = [batch size, output dim]
    return prediction, h_0
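
The two concatenations are easy to verify in isolation. The sketch below uses random tensors as stand-ins for $d(y_t)$, $s_t$ and $z$, with toy sizes that are assumptions for this example only.

```python
import torch

batch_size, emb_dim, hid_dim = 4, 8, 16  # toy sizes

embedded = torch.randn(1, batch_size, emb_dim)  # stands in for d(y_t)
hidden = torch.randn(1, batch_size, hid_dim)    # stands in for s_t
context = torch.randn(1, batch_size, hid_dim)   # stands in for z

# GRU input: d(y_t) and z joined along the feature dimension.
emb_con = torch.cat((embedded, context), dim=2)

# Linear-layer input: d(y_t), s_t and z, with the length-1 sequence dimension removed.
fc_in = torch.cat(
    (embedded.squeeze(0), hidden.squeeze(0), context.squeeze(0)), dim=1
)

print(emb_con.shape)  # [1, batch size, emb dim + hid dim]
print(fc_in.shape)    # [batch size, emb dim + hid dim * 2]
```

With an embedding size of 256 and a hidden size of 512, the same arithmetic gives 768 input features for the GRU and 1280 for the linear layer.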

### `Seq2Seq` Model
Putting the encoder and decoder together, we get:

<p><img src="https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq7.png"/></p>

**Again, in this implementation we need to ensure the hidden dimensions in both the encoder and the decoder are the same.**

**Briefly going over all of the steps:**

* the outputs tensor is created to hold all predictions, $\hat{Y}$
* the source sequence, $X$, is fed into the encoder to receive a context vector
* the initial decoder hidden state is set to be the `context vector`, $s_0 = z = h_T$
* we use a batch of `<sos>` tokens as the first input, $y_1$
* we then decode within a loop:
  * inserting the input token $y_t$, previous hidden state, $s_{t-1}$, and the context vector, $z$, into the decoder
  * receiving a prediction, $\hat{y}_{t+1}$, and a new hidden state, $s_t$
  * we then decide if we are going to teacher force or not, setting the next input as appropriate (either the ground truth next token in the target sequence or the highest predicted next token)

In [18]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, "Hidden dimensions of encoder and decoder must be equal!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        """
        src = [src_len, batch_size]
        trg = [trg_len, batch_size]
        teacher_forcing_ratio is probability to use teacher forcing
        e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        """
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        # last hidden state of the encoder is the context
        context = self.encoder(src)
        
        # context also used as the initial hidden state of the decoder
        hidden = context
        
        # first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            # insert input token embedding, previous hidden state and the context state
            # receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, context)
            # place predictions in a tensor holding predictions for each token
            outputs[t] = output
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            # get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            input = trg[t] if teacher_force else top1
        return outputs
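
The teacher forcing decision is a single coin flip per decoding step, shared by the whole batch. This small simulation, which is an illustration rather than part of the model, shows how often the comparison used in the loop comes out true.

```python
import random

random.seed(0)
teacher_forcing_ratio = 0.5
n_steps = 10_000

# Same test as in the decoding loop: True means "feed the ground-truth token next".
decisions = [random.random() < teacher_forcing_ratio for _ in range(n_steps)]
fraction_forced = sum(decisions) / n_steps
print(f"teacher-forced steps: {fraction_forced:.3f}")

# random.random() returns values in [0, 1), so with a ratio of 0 the comparison
# is never true and the model always consumes its own predictions.
assert not any(random.random() < 0 for _ in range(n_steps))
```

The printed fraction should land close to the chosen ratio, and a ratio of 0 switches teacher forcing off entirely.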

### Training the `seq2seq` model.
The rest of the code remains the same. We first define the hyperparameters and then initialize the model.

We create instances of the `Encoder` and `Decoder` and pass them to the `Seq2Seq` class.

In [19]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_DROPOUT)
model = Seq2Seq(enc, dec, device).to(device)
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (gru): GRU(256, 512)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (gru): GRU(768, 512)
    (fc): Linear(in_features=1280, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

Next, we initialize our parameters. The paper states the parameters are initialized from a normal distribution with a mean of 0 and a standard deviation of 0.01, i.e. $\mathcal{N}(0, 0.01)$.

It also states we should initialize the recurrent parameters to a special initialization, however to keep things simple we'll also initialize them to $\mathcal{N}(0, 0.01)$.


In [20]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean=0, std=0.01)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (gru): GRU(256, 512)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (gru): GRU(768, 512)
    (fc): Linear(in_features=1280, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)
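
To see the effect of this initialization, the sketch below applies the same `nn.init.normal_` call to a standalone linear layer and prints the resulting statistics. The layer and its size are assumptions for the example; here 0.01 is the standard deviation, as in the function above.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(512, 512)  # toy layer, large enough for stable statistics

for name, param in layer.named_parameters():
    nn.init.normal_(param.data, mean=0, std=0.01)

weight_mean = layer.weight.mean().item()
weight_std = layer.weight.std().item()
print(f"mean: {weight_mean:.5f}")
print(f"std:  {weight_std:.5f}")
```

Because `apply` visits every submodule and `named_parameters` is recursive, some parameters are redrawn more than once when this is run on the full model. That is harmless, since every draw comes from the same distribution.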

### Counting trainable parameters.

We print out the number of parameters.

**Even though we only have a single layer RNN for our encoder and decoder we actually have more parameters than the last model. This is due to the increased size of the inputs to the GRU and the linear layer. However, it is not a significant amount of parameters and causes a minimal amount of increase in training time (~3 seconds per epoch extra).**

In [21]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")

Total number of parameters: 14,220,293
Total trainable parameters: 14,220,293
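
The total can be reproduced by hand from the layer sizes in the model printout. The helper functions below are hypothetical names introduced for this worked calculation; the GRU formula assumes PyTorch's layout of three gates, each with input weights, hidden weights and two bias vectors.

```python
def gru_params(input_size, hidden_size):
    """Parameters in a single-layer, unidirectional nn.GRU (three gates, two bias vectors)."""
    return 3 * hidden_size * (input_size + hidden_size) + 2 * 3 * hidden_size


def linear_params(in_features, out_features):
    """Weight matrix plus bias vector."""
    return in_features * out_features + out_features


def embedding_params(vocab_size, emb_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * emb_dim


# Sizes taken from the model printout earlier in the notebook.
src_vocab, trg_vocab, emb_dim, hid_dim = 7855, 5893, 256, 512

encoder = embedding_params(src_vocab, emb_dim) + gru_params(emb_dim, hid_dim)
decoder = (
    embedding_params(trg_vocab, emb_dim)
    + gru_params(emb_dim + hid_dim, hid_dim)
    + linear_params(emb_dim + hid_dim * 2, trg_vocab)
)
total = encoder + decoder
print(f"{total:,}")  # 14,220,293
```

More than half of the parameters (7,548,933) sit in the final linear layer, whose input grew from `hid_dim` to `emb_dim + hid_dim * 2`.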


### The optimizer.

In [22]:
optimizer = torch.optim.Adam(model.parameters())

### Loss Function.
Our loss function calculates the average loss per token, however by passing the index of the `<pad>` token as the `ignore_index` argument we ignore the loss whenever the target token is a padding token.



In [23]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
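
A toy example makes the effect of `ignore_index` visible. The padding index of 1 and the random logits below are assumptions for this sketch; the point is that the masked loss matches the average over the real tokens only.

```python
import torch
from torch import nn

torch.manual_seed(0)
pad_idx = 1     # assumed padding index for this toy example
vocab_size = 6

logits = torch.randn(4, vocab_size)               # four token positions
targets = torch.tensor([3, 5, pad_idx, pad_idx])  # the last two are padding

masked = nn.CrossEntropyLoss(ignore_index=pad_idx)(logits, targets)
# Same result as averaging over the two real tokens only.
manual = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(torch.allclose(masked, manual))
```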

### The train and evaluation function.

In [24]:
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg
        optimizer.zero_grad()
        output = model(src, trg)
        # trg = [trg len, batch size]
        # output = [trg len, batch size, output dim]
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        # trg = [(trg len - 1) * batch size]
        # output = [(trg len - 1) * batch size, output dim]
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg
            output = model(src, trg, 0) # turn off teacher forcing
            # trg = [trg len, batch size]
            # output = [trg len, batch size, output dim]
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            # trg = [(trg len - 1) * batch size]
            # output = [(trg len - 1) * batch size, output dim]
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)
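
Both functions flatten the predictions and targets before computing the loss. This sketch repeats that reshaping on random tensors with toy sizes, so the shapes in the comments can be checked independently of the model.

```python
import torch

trg_len, batch_size, output_dim = 5, 3, 7  # toy sizes
output = torch.randn(trg_len, batch_size, output_dim)
trg = torch.randint(0, output_dim, (trg_len, batch_size))

# Position 0 is skipped: outputs[0] is never written by the decoder loop and
# trg[0] is the <sos> token, which the model is not asked to predict.
flat_output = output[1:].view(-1, output_dim)
flat_trg = trg[1:].view(-1)

print(flat_output.shape)  # [(trg len - 1) * batch size, output dim]
print(flat_trg.shape)     # [(trg len - 1) * batch size]
```

Rows are laid out time step by time step, so row `batch_size` of the flattened output is the prediction for the first sentence at the second decoded position.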

We'll also define the function that calculates how long an epoch takes.

In [25]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

### The training loop.

In [26]:

N_EPOCHS = 10
CLIP = 1
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 35s
	Train Loss: 5.017 | Train PPL: 150.934
	 Val. Loss: 5.155 |  Val. PPL: 173.216
Epoch: 02 | Time: 0m 35s
	Train Loss: 4.393 | Train PPL:  80.915
	 Val. Loss: 5.215 |  Val. PPL: 184.101
Epoch: 03 | Time: 0m 36s
	Train Loss: 4.089 | Train PPL:  59.668
	 Val. Loss: 4.714 |  Val. PPL: 111.548
Epoch: 04 | Time: 0m 36s
	Train Loss: 3.758 | Train PPL:  42.854
	 Val. Loss: 4.365 |  Val. PPL:  78.663
Epoch: 05 | Time: 0m 36s
	Train Loss: 3.452 | Train PPL:  31.569
	 Val. Loss: 4.152 |  Val. PPL:  63.551
Epoch: 06 | Time: 0m 36s
	Train Loss: 3.147 | Train PPL:  23.270
	 Val. Loss: 4.017 |  Val. PPL:  55.516
Epoch: 07 | Time: 0m 36s
	Train Loss: 2.887 | Train PPL:  17.938
	 Val. Loss: 3.908 |  Val. PPL:  49.784
Epoch: 08 | Time: 0m 36s
	Train Loss: 2.652 | Train PPL:  14.181
	 Val. Loss: 3.716 |  Val. PPL:  41.092
Epoch: 09 | Time: 0m 36s
	Train Loss: 2.460 | Train PPL:  11.702
	 Val. Loss: 3.704 |  Val. PPL:  40.629
Epoch: 10 | Time: 0m 36s
	Train Loss: 2.263 | Train PPL

Finally, we test the model on the test set using these "best" parameters.

In [27]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.635 | Test PPL:  37.895 |
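
Perplexity here is simply the exponential of the average per-token cross-entropy loss. The sketch below recomputes it from the rounded test loss, so the result can differ from the printed value in the last digits. The uniform-guess baseline is added only as a reference point and assumes the target vocabulary size of 5,893 from the model printout.

```python
import math

test_loss = 3.635  # rounded value from the test output
perplexity = math.exp(test_loss)
print(f"{perplexity:.2f}")

# Reference point: a model guessing uniformly over the target vocabulary
# has a loss of log(V) and therefore a perplexity equal to V.
uniform_loss = math.log(5893)
uniform_perplexity = math.exp(uniform_loss)
print(f"{uniform_perplexity:.0f}")
```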



Just looking at the test loss, we get better performance than the previous model. This is a pretty good sign that this model architecture is doing something right! Relieving the information compression seems like the way forward, and in the next notebook we'll expand on this even further with attention.

### Credits

* [bentrevett](https://github.com/bentrevett?tab=repositories&q=&type=&language=&sort=)