### Neural Machine Translation by Jointly Learning to Align and Translate

In this notebook we will implement the model from [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) that will improve PPL (perplexity) as compared to the previous notebook.

> This notebook is nothing but a modified version of my previous notebooks on `seq2seq` machine translation model. The only difference is that in this notebook we are going to use the `torchtext` api instead of the `torchtext.lergacy`. The following notebooks are used as refences to this notebook. There you will find a deep understanding on how `Seq2Seq` models work.

1. [03_Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate.ipynb](https://github.com/CrispenGari/pytorch-python/blob/main/09_NLP/03_Sequence_To_Sequence/03_Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate.ipynb)


The model implemented in this notebook avoids this compression by allowing the decoder to look at the entire source sentence (via its hidden states) at each decoding step! How does it do this? It uses **attention**.

### Attention.
Attention works by first, calculating an attention vector, $a$ , that is the length of the source sentence. The attention vector has the property that each element is between `0` and `1`, and the entire vector sums to `1`. We then calculate a weighted sum of our source sentence hidden states, $H$ , to get a weighted source vector, $w$. You can read more about it in [this notebook]((https://github.com/CrispenGari/pytorch-python/blob/main/09_NLP/03_Sequence_To_Sequence/03_Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate.ipynb).


 
We calculate a new weighted source vector every time-step when decoding, using it as input to our decoder RNN as well as the linear layer to make a prediction.

### Installation of Packages
In the following code cell we are going to install the packages that we are going to use in this notebook which are `helperfns` and `torchdata`. The `helperfns` package allows us to get some machine learning helper function which we are also going to use in this notebook as well.

In [1]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [2]:
!pip install helperfns torchdata portalocker -q

  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for sklearn (setup.py) ... [?25l[?25hdone


### Imports

In the following code cell we are going to import the packages that are going to be used in this notebook.

In [3]:
from torch import nn
from torch.nn  import functional as F
from torchtext import data, datasets
from collections import Counter
from torchtext import vocab
from helperfns import tables, visualization, utils
from helperfns.torch.models import model_params

import spacy
import math
import random
import torch
import torchtext
import numpy as np
import time
import gc

torchtext.__version__, torch.__version__

('0.15.2+cpu', '2.0.1+cu118')

### Seed
In the following code cell we are going to set up the `SEED` for reproducivity in this notebook.

In [4]:
SEED = 42

torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deteministic = True

### Device

In the following code cell we are going to declare a `device` variable that will allow us to make use of `GPU` if available.

In [5]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Tokenizer Models

For the tokenizer we are going to make use of the `spacy` language models to tokenize sentences for each language. So first we will need to download the `en_core_web_sm` and the `de_core_news_sm`.

In [6]:
spacy.cli.download('de_core_news_sm')

[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('de_core_news_sm')


In [7]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

In the following code cell we are going to create our tokenizers functions that takes in a sentence in english and tokenize it.

> Previously we reversed the source (German) sentence, however in the paper we are implementing they don't do this, so neither will we.

In [8]:
def tokenize_de(sent:str)->list:
  return [tok.text for tok in spacy_de.tokenizer(sent)]

def tokenize_en(sent:str)->list:
  return [tok.text for tok in spacy_en.tokenizer(sent)]

Testing our tokenizer functions.

In [9]:
tokenize_de("Kannst du mir helfen?")

['Kannst', 'du', 'mir', 'helfen', '?']

In [10]:
tokenize_en("Can you help me?")

['Can', 'you', 'help', 'me', '?']

### Dataset

The dataset we'll be using is the [Multi30k](https://pytorch.org/text/stable/datasets.html#multi30k) dataset. This is a dataset with `~30,000` parallel English, German and French sentences, each with `~12` words per sentence.

In [11]:
train_iter, valid_iter, test_iter = datasets.Multi30k(
     root = '.data', split = ('train', 'valid', 'test'), 
    language_pair= ('de', 'en')
)

Checking as single train example.

In [12]:
next(iter(train_iter))

('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'Two young, White males are outside near many bushes.')

### Getting `src` and `trg`
Our `src` field will be the `de` sentence and the `trg` field will the the `en` sentence.

In [13]:
train_src = []
train_trg = []
valid_src = []
valid_trg = []
test_src = []
test_trg = []

for (src, trg) in train_iter:
  train_src.append(src)
  train_trg.append(trg)

for (src, trg) in test_iter:
  test_src.append(src)
  test_trg.append(trg)

for (src, trg) in valid_iter:
  valid_src.append(src)
  valid_trg.append(trg)



Checking if the `src` and the `trg` has the same length for all sets.

In [14]:
assert len(train_src) == len(train_trg), f"The src and trg must have the same length got {len(train_src)} and {len(train_trg)}."
assert len(valid_src) == len(valid_trg), f"The src and trg must have the same length got {len(valid_src)} and {len(valid_trg)}."
assert len(test_src) == len(test_trg), f"The src and trg must have the same length got {len(test_src)} and {len(test_trg)}."

### Counting examples

In the following code cell we are going to count examples in each language pair and visualize them using a table.

In [15]:
columns = ["set", "src", "trg"]
examples = [
    ['training', len(train_src), len(train_trg)],
    ['validation' , len(valid_src), len(valid_trg)],
    ['testing' , len(test_src), len(test_trg)],
]
tables.tabulate_data(columns, examples, "Examples")

+------------+-------+-------+
| set        |   src |   trg |
+------------+-------+-------+
| training   | 29001 | 29001 |
| validation |  1015 |  1015 |
| testing    |  1000 |  1000 |
+------------+-------+-------+


### Building Vocabulary
Next, we'll build the vocabulary for the `source` (src) and `target` (trg) languages. The vocabulary is used to associate each unique token with an index (an integer). The vocabularies of the `source` and `target` languages are distinct.

Using the `min_freq` argument, we only allow tokens that appear at least `2` times to appear in our vocabulary. Tokens that appear only once are converted into an `<unk>` (unknown) token.

> It is important to note that our vocabulary should only be built from the `training` set and not the `validation/test` set. This prevents `"information leakage"` into our model, giving us artifically inflated `validation/test` scores.

Also note that in this notebook we are not going to focus much about text cleaning. We are going to convert sentences to lower case only.

In [16]:
counter_src = Counter()
for line in train_src:
  counter_src.update(tokenize_de(line.lower()))
vocabulary_src = vocab.vocab(counter_src, min_freq=5, specials=('-unk-', '-sos-', '-eos-', '-pad-'))

counter_trg = Counter()
for line in train_trg:
  counter_trg.update(tokenize_de(line.lower()))
vocabulary_trg = vocab.vocab(counter_trg, min_freq=5, specials=('-unk-', '-sos-', '-eos-', '-pad-'))

In the following code cell we are going to get the `string-to-integer` representation of our `src` and `trg` fields.

In [17]:
stoi_src = vocabulary_src.get_stoi()
stoi_trg = vocabulary_trg.get_stoi()

SRC_VOCAB_SIZE = len(stoi_src)
TRG_VOCAB_SIZE = len(stoi_trg)

### SRC and Target Pipelines
After our text has been tokenized we need a way of converting those words into numbers because machine leaning models understand numbers not words. That's where we the `src_text_pipeline` and `trg_text_pipeline` functions into play. So these function takes in a sentence and tokenize it then converts each word to a number. Note that the word that does not exists in the vocabulay for either `src` or `trg` will be converted to an `unkown` (`-unk-`) token.

In [18]:
def src_text_pipeline(x: str)->list:
  values = list()
  tokens = tokenize_de(x.lower())
  for token in tokens:
    try:
      v = stoi_src[token]
    except KeyError as e:
      v = stoi_trg['-unk-']
    values.append(v)
  return values

def trg_text_pipeline(x: str)->list:
  values = list()
  tokens = tokenize_de(x.lower())
  for token in tokens:
    try:
      v = stoi_trg[token]
    except KeyError as e:
      v = stoi_trg['-unk-']
    values.append(v)
  return values

#### Translation Dataset

In the following code cell we are going to create a `TranslationDataset` class which will inherid from the `torch.utils.data.Dataset` class. This class will take in the `src` and `trg` values and pair them together.

In [19]:
class TranslationDataset(torch.utils.data.Dataset):
  def __init__(self, src, trg):
    super(TranslationDataset, self).__init__()
    self.src = src
    self.trg = trg
      
  def __getitem__(self, index):
    return self.src[index], self.trg[index]
  
  def __len__(self):
    return len(self.src)

### collate_fn of the `DataLoader`
Our collate function will be named `tokenize_batch` which takes in the batch of `src` and `trg` pairs then `tokenize`, `numericalize` and then `pad` the to a given `max_len`.

In [20]:
def tokenize_batch(batch, max_len=100, padding="pre"):
  assert padding=="pre" or padding=="post", "the padding can be either pre or post"
  src, trg = [], []
  for src_, trg_ in batch:
    _src = torch.zeros(max_len, dtype=torch.int32)
    _trg = torch.zeros(max_len, dtype=torch.int32)
    _src_processed_text = torch.tensor(src_text_pipeline(src_.lower()), dtype=torch.int32)
    _trg_processed_text = torch.tensor(trg_text_pipeline(trg_.lower()), dtype=torch.int32)
    _src_pos = min(max_len, len(_src_processed_text))
    _trg_pos = min(max_len, len(_trg_processed_text))
    if padding == "pre":
      _src[:_src_pos] = _src_processed_text[:_src_pos]
      _trg[:_trg_pos] = _trg_processed_text[:_trg_pos]
    else:
      _src[-_src_pos:] = _src_processed_text[-_src_pos:]
      _trg[-_trg_pos:] = _trg_processed_text[-_trg_pos:]
    trg.append(_trg.unsqueeze(dim=0))
    src.append(_src.unsqueeze(dim=0))
  #  the target values must be a LongTensor
  return torch.cat(src, dim=0), torch.cat(trg, dim=0).type(torch.LongTensor)

### Creating datasets
In the following code cell we are going to create datasets for all our `3` sets.

In [21]:
train_dataset = TranslationDataset(train_src, train_trg)
test_dataset = TranslationDataset(test_src, test_trg)
valid_dataset = TranslationDataset(valid_src, valid_trg)

Checking a single example in the `train` set.

In [22]:
train_dataset[0]

('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'Two young, White males are outside near many bushes.')

### DataLoaders

In the following code cell we are going to create  dataloaders for our three sets of data. We are going to pass the `tokenize_batch` as our `collate_fn`. We are going to shuffle example in the `train` set only.

In [23]:
BATCH_SIZE = 64
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=tokenize_batch)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=tokenize_batch)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=tokenize_batch)

#### Checking examples in the train_loader.

In [24]:
src, trg = next(iter(train_loader))

In [25]:
src[:2]

tensor([[ 198, 1220,  153,   20, 2445,  318, 1667, 1823,   15,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0],
        [  20,   26,   31,   27,  780,   90,  192,   45, 1700,   15,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0, 

In [26]:
trg[:2]

tensor([[ 111,    9,  211,   21,  574,  204, 2861,    0,   39,  572,   14,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0],
        [  21,   29,   34,   21,  850,  751,  233,   21, 1602,   14,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0, 

### Sequence To Sequence Model

We are going to build our model in `3` parts. The `econder`, `decoder` and `seq2seq`.

### Encoder

First, we'll build the encoder. Similar to the previous model, we only use a single layer GRU, however we now use a bidirectional RNN. With a bidirectional RNN, we have two RNNs in each layer. A forward RNN going over the embedded sentence from left to right (shown below in green), and a backward RNN going over the embedded sentence from right to left (teal). All we need to do in code is set bidirectional = True and then pass the embedded sentence to the RNN as before.

In [27]:
class Encoder(nn.Module):
  def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
    super(Encoder, self).__init__()

    self.embedding = nn.Embedding(input_dim, embedding_dim=emb_dim)
    self.gru = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
    self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, src): # src = [src len, batch size]
    embedded = self.dropout(self.embedding(src)) # embedded = [src len, batch size, emb dim]
    outputs, hidden = self.gru(embedded)
    """
    outputs = [src len, batch size, hid dim * num directions]
    hidden = [n layers * num directions, batch size, hid dim]

    hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
    outputs are always from the last layer

    hidden [-2, :, : ] is the last of the forwards RNN 
    hidden [-1, :, : ] is the last of the backwards RNN

    initial decoder hidden is final hidden state of the forwards and backwards 
    encoder RNNs fed through a linear layer
    """
    hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)))
    """
    outputs = [src len, batch size, enc hid dim * 2]
    hidden = [batch size, dec hid dim]
    """
    return outputs, hidden

### Attention Layer

You can read more about the implementation of this layer in [this notebook.](https://github.com/CrispenGari/pytorch-python/blob/main/09_NLP/03_Sequence_To_Sequence/03_Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate.ipynb)

In [28]:
class Attention(nn.Module):
  def __init__(self, enc_hid_dim, dec_hid_dim):
    super(Attention, self).__init__()
    self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
    self.v = nn.Linear(dec_hid_dim, 1, bias = False)

  def forward(self, hidden, encoder_outputs):
    """
    hidden = [batch size, dec hid dim]
    encoder_outputs = [src len, batch size, enc hid dim * 2]
    """
    batch_size = encoder_outputs.shape[1]
    src_len = encoder_outputs.shape[0]
    # repeat decoder hidden state src_len times
    hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
    encoder_outputs = encoder_outputs.permute(1, 0, 2)

    """
    hidden = [batch size, src len, dec hid dim]
    encoder_outputs = [batch size, src len, enc hid dim * 2]
    """
    energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) # energy = [batch size, src len, dec hid dim]
    attention = self.v(energy).squeeze(2) # attention= [batch size, src len]
    return F.softmax(attention, dim=1)

### Decoder

The decoder contains the attention layer, attention, which takes the previous hidden state. You can read more [here.](https://github.com/CrispenGari/pytorch-python/blob/main/09_NLP/03_Sequence_To_Sequence/03_Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate.ipynb)

In [29]:
class Decoder(nn.Module):
  def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
    super(Decoder, self).__init__()
    self.output_dim = output_dim
    self.attention = attention

    self.embedding = nn.Embedding(output_dim, emb_dim)
    self.gru = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
    self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
    self.dropout = nn.Dropout(dropout)
        
  def forward(self, input, hidden, encoder_outputs):
    """
    input = [batch size]
    hidden = [batch size, dec hid dim]
    encoder_outputs = [src len, batch size, enc hid dim * 2]
    """
    input = input.unsqueeze(0) # input = [1, batch size]
    embedded = self.dropout(self.embedding(input)) # embedded = [1, batch size, emb dim]
    a = self.attention(hidden, encoder_outputs)# a = [batch size, src len]
    a = a.unsqueeze(1) # a = [batch size, 1, src len]
    encoder_outputs = encoder_outputs.permute(1, 0, 2) # encoder_outputs = [batch size, src len, enc hid dim * 2]
    weighted = torch.bmm(a, encoder_outputs) # weighted = [batch size, 1, enc hid dim * 2]
    weighted = weighted.permute(1, 0, 2) # weighted = [1, batch size, enc hid dim * 2]
    rnn_input = torch.cat((embedded, weighted), dim = 2) # rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
    output, hidden = self.gru(rnn_input, hidden.unsqueeze(0))
    
    """
    output = [seq len, batch size, dec hid dim * n directions]
    hidden = [n layers * n directions, batch size, dec hid dim]
    
    seq len, n layers and n directions will always be 1 in this decoder, therefore:
    output = [1, batch size, dec hid dim]
    hidden = [1, batch size, dec hid dim]
    this also means that output == hidden
    """
    assert (output == hidden).all()
    embedded = embedded.squeeze(0)
    output = output.squeeze(0)
    weighted = weighted.squeeze(0)

    prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1)) # prediction = [batch size, output dim]
    return prediction, hidden.squeeze(0)

###Seq2Seq (Sequence to Sequence)

This is the first model where we don't have to have the encoder RNN and decoder RNN have the same hidden dimensions, however the encoder has to be bidirectional. This requirement can be removed by changing all occurences of enc_dim 2 to enc_dim 2 if encoder_is_bidirectional else enc_dim.

This seq2seq encapsulator is similar to the last two. The only difference is that the encoder returns both the final hidden state (which is the final hidden state from both the forward and backward encoder RNNs passed through a linear layer) to be used as the initial hidden state for the decoder, as well as every hidden state (which are the forward and backward hidden states stacked on top of each other). We also need to ensure that hidden and encoder_outputs are passed to the decoder.

In [30]:
class Seq2Seq(nn.Module):
  def __init__(self, encoder, decoder, device):
    super().__init__()
    self.encoder = encoder
    self.decoder = decoder
    self.device = device
        
  def forward(self, src, trg, teacher_forcing_ratio = 0.5):
    """
    src = [src len, batch size]
    trg = [trg len, batch size]
    teacher_forcing_ratio is probability to use teacher forcing
    e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
    """
    trg_len, batch_size = trg.shape
    trg_vocab_size = self.decoder.output_dim
        
    # tensor to store decoder outputs
    outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
    # encoder_outputs is all hidden states of the input sequence, back and forwards
    # hidden is the final forward and backward hidden states, passed through a linear layer
    encoder_outputs, hidden = self.encoder(src)     
    # first input to the decoder is the <sos> tokens
    input = trg[0,:]
    for t in range(1, trg_len):
      # insert input token embedding, previous hidden state and all encoder hidden states
      # receive output tensor (predictions) and new hidden state
      output, hidden = self.decoder(input, hidden, encoder_outputs)
      
      # place predictions in a tensor holding predictions for each token
      outputs[t] = output
      
      # decide if we are going to use teacher forcing or not
      teacher_force = random.random() < teacher_forcing_ratio
      
      # get the highest predicted token from our predictions
      top1 = output.argmax(1) 
      
      # if teacher forcing, use actual next token as next input
      # if not, use predicted token
      input = trg[t] if teacher_force else top1
    return outputs

### Training the `Seq2Seq`


In [31]:
INPUT_DIM = SRC_VOCAB_SIZE
OUTPUT_DIM = TRG_VOCAB_SIZE
ENC_EMB_DIM = DEC_EMB_DIM = 256
ENC_HID_DIM = DEC_HID_DIM = 512
ENC_DROPOUT = DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(3554, 256)
    (gru): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(3275, 256)
    (gru): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=3275, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Counting model parameters.
We are going to use the `model_params` function from `helperfns` to count model parameters.

In [32]:
model_params(model)

TOTAL MODEL PARAMETERS: 	14,053,579
TOTAL TRAINABLE PARAMETERS: 	14,053,579


### Initializing the weights
here, we will initialize all biases to zero and all weights from $N(0, 0.01)$.

In [33]:
def init_weights(m):
  for name, param in m.named_parameters():
    if 'weight' in name:
      nn.init.normal_(param.data, mean=0, std=0.01)
    else:
      nn.init.constant_(param.data, 0)    
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(3554, 256)
    (gru): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(3275, 256)
    (gru): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=3275, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Optimizer
For the optimizer we are going to use the `Adam` optimizer with default parameters.

In [34]:
optimizer = torch.optim.Adam(model.parameters())

### Criterion
Next, we define our loss function. The `CrossEntropyLoss` function calculates both the log softmax as well as the negative log-likelihood of our predictions.

Our loss function calculates the average loss per token, however by passing the index of the `<pad>` token as the `ignore_index` argument we ignore the loss whenever the target token is a padding token.

In [35]:
TRG_PAD_IDX = stoi_trg["-pad-"]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

### Train Loop


Lets's first create a function that collects all gabages and clears the gabage together with empting cuda cache.

In [36]:
def clear_gpu_memory():
  torch.cuda.empty_cache()
  variables = gc.collect()
  del variables

In [37]:
def train(model, iterator, optimizer, criterion, clip):
  model.train()
  epoch_loss = 0
  for i, (src, trg) in enumerate(iterator):
    src = src.to(device)
    trg = trg.to(device)
    optimizer.zero_grad()
    output = model(src, trg)
    # trg = [trg len, batch size]
    # output = [trg len, batch size, output dim]
    output_dim = output.shape[-1]
    output = output[1:].view(-1, output_dim)
    trg = trg[1:].view(-1)
    # trg = [(trg len - 1) * batch size]
    # output = [(trg len - 1) * batch size, output dim]
    loss = criterion(output, trg)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    epoch_loss += loss.item()
    clear_gpu_memory()
  return epoch_loss / len(iterator)

### Evaluation Loop


In [38]:
def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0
  with torch.no_grad():
    for i, (src, trg) in enumerate(iterator):
      src = src.to(device)
      trg = trg.to(device)
      output = model(src, trg, 0) # turn off teacher forcing
      # trg = [trg len, batch size]
      # output = [trg len, batch size, output dim]
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim)
      trg = trg[1:].view(-1)
      # trg = [(trg len - 1) * batch size]
      # output = [(trg len - 1) * batch size, output dim]
      loss = criterion(output, trg)
      epoch_loss += loss.item()
      clear_gpu_memory()
  return epoch_loss / len(iterator)

### Running the training loop.
During training we are going to visualize our training metrics in tabular form. We are going to save the best model if and only if the previous validation loss is less greater than the current epoch validation loss.

In [39]:
N_EPOCHS = 10
CLIP = 1
best_valid_loss = float('inf')
MODEL_NAME = 'best-model.pt'
for epoch in range(N_EPOCHS):
    start = time.time()
    train_loss = train(model, train_loader, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_loader, criterion)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
    end = time.time()
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), MODEL_NAME)
    data = [
       ["Training", f'{train_loss:.3f}', f'{math.exp(train_loss):7.3f}', f"{utils.hms_string(end - start)}" ],
       ["Validation", f'{valid_loss:.3f}', f'{math.exp(valid_loss):7.3f}', "" ],       
   ]
    columns = ["CATEGORY", "LOSS", "PPL", "ETA"]
    print(title)
    tables.tabulate_data(columns, data, title)

EPOCH: 01/10 saving best model...
+------------+-------+---------+------------+
| CATEGORY   |  LOSS |     PPL |        ETA |
+------------+-------+---------+------------+
| Training   | 0.840 |   2.315 | 0:09:29.16 |
| Validation | 0.718 |   2.050 |            |
+------------+-------+---------+------------+
EPOCH: 02/10 saving best model...
+------------+-------+---------+------------+
| CATEGORY   |  LOSS |     PPL |        ETA |
+------------+-------+---------+------------+
| Training   | 0.716 |   2.045 | 0:09:24.43 |
| Validation | 0.714 |   2.043 |            |
+------------+-------+---------+------------+
EPOCH: 03/10 saving best model...
+------------+-------+---------+------------+
| CATEGORY   |  LOSS |     PPL |        ETA |
+------------+-------+---------+------------+
| Training   | 0.711 |   2.037 | 0:09:22.23 |
| Validation | 0.709 |   2.032 |            |
+------------+-------+---------+------------+
EPOCH: 04/10 saving best model...
+------------+-------+---------+----

### Evaluating the Best Model

In the following code cell we are going to evaluate the best model.

In [40]:
column_names = ["Set", "Loss", "PPL", "ETA (time)"]
model.load_state_dict(torch.load(MODEL_NAME))
test_loss= evaluate(model, test_loader, criterion)
title = "Model Evaluation Summary"
data_rows = [["Test", f'{test_loss:.3f}', f'{math.exp(test_loss):7.3f}', ""]]

tables.tabulate_data(column_names, data_rows, title)

+------+-------+---------+------------+
| Set  |  Loss |     PPL | ETA (time) |
+------+-------+---------+------------+
| Test | 0.705 |   2.023 |            |
+------+-------+---------+------------+



We've improved on the previous model, but this came at the cost of doubling the training time.

In the next notebook, we'll be using the same architecture but using a few tricks that are applicable to all RNN architectures - **packed padded sequences** and **masking**. We'll also implement code which will allow us to look at what words in the input the RNN is paying attention to when decoding the output.