### Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.

In this notebook we are going to implement the model from [this](https://arxiv.org/abs/1406.1078) paper. In the previous notebook we created a `Seq2Seq` model.

One downside of the previous model is that the decoder is trying to cram lots of information into the hidden states. Whilst decoding, the hidden state will need to contain information about the whole of the source sequence, as well as all of the tokens have been decoded so far. By alleviating some of this information compression, we can create a better model!

We'll also be using a `GRU` (Gated Recurrent Unit) instead of an `LSTM` (Long Short-Term Memory). Why? Mainly because that's what they did in the paper (this paper also introduced `GRUs`) and also because we used `LSTMs` last time. Both `GRU` and `LSTM` are pretty much the same as they differ from regular `RNNs`.

> This notebook is nothing but a modified version of my previous notebooks on `seq2seq` machine translation model. The only difference is that in this notebook we are going to use the `torchtext` api instead of the `torchtext.lergacy`. The following notebooks are used as refences to this notebook. There you will find a deep understanding on how `Seq2Seq` models work.

1. [02__Learning_Phrase_Representations_using_RNN_Encoder_Decoder_for_Statistical_Machine_Translation.ipynb](https://github.com/CrispenGari/pytorch-python/blob/main/09_NLP/03_Sequence_To_Sequence/02__Learning_Phrase_Representations_using_RNN_Encoder_Decoder_for_Statistical_Machine_Translation.ipynb)


### Installation of Packages
In the following code cell we are going to install the packages that we are going to use in this notebook which are `helperfns` and `torchdata`. The `helperfns` package allows us to get some machine learning helper function which we are also going to use in this notebook as well.

In [1]:
!pip install helperfns torchdata -q

### Imports

In the following code cell we are going to import the packages that are going to be used in this notebook.

In [2]:

from torch import nn
from torch.nn  import functional as F
from torchtext import data, datasets
from collections import Counter
from torchtext import vocab
from helperfns import tables, visualization, utils
from helperfns.torch.models import model_params

import spacy
import math
import random
import torch
import torchtext
import numpy as np
import time

torchtext.__version__, torch.__version__

('0.13.1', '1.12.1+cu113')

### Seed
In the following code cell we are going to set up the `SEED` for reproducivity in this notebook.

In [3]:
SEED = 42

torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deteministic = True

### Device

In the following code cell we are going to declare a `device` variable that will allow us to make use of `GPU` if available.

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Tokenizer Models

For the tokenizer we are going to make use of the `spacy` language models to tokenize sentences for each language. So first we will need to download the `en_core_web_sm` and the `de_core_news_sm`.

In [5]:
spacy.cli.download('de_core_news_sm')

[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('de_core_news_sm')


In [6]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

In the following code cell we are going to create our tokenizers functions that takes in a sentence in english and tokenize it.

> Previously we reversed the source (German) sentence, however in the paper we are implementing they don't do this, so neither will we.

In [None]:
def tokenize_de(sent:str)->list:
  return [tok.text for tok in spacy_de.tokenizer(sent)]

def tokenize_en(sent:str)->list:
  return [tok.text for tok in spacy_en.tokenizer(sent)]

Testing our tokenizer functions.

In [43]:
tokenize_de("Kannst du mir helfen?")

['Kannst', 'du', 'mir', 'helfen', '?']

In [44]:
tokenize_en("Can you help me?")

['Can', 'you', 'help', 'me', '?']

### Dataset

The dataset we'll be using is the [Multi30k](https://pytorch.org/text/stable/datasets.html#multi30k) dataset. This is a dataset with `~30,000` parallel English, German and French sentences, each with `~12` words per sentence.

In [45]:
train_iter, valid_iter, test_iter = datasets.Multi30k(
     root = '.data', split = ('train', 'valid', 'test'), 
    language_pair= ('de', 'en')
)

Checking as single train example.

In [46]:
next(iter(train_iter))

('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'Two young, White males are outside near many bushes.')

### Getting `src` and `trg`
Our `src` field will be the `de` sentence and the `trg` field will the the `en` sentence.

In [47]:
train_src = []
train_trg = []
valid_src = []
valid_trg = []
test_src = []
test_trg = []

for (src, trg) in train_iter:
  train_src.append(src)
  train_trg.append(trg)

for (src, trg) in test_iter:
  test_src.append(src)
  test_trg.append(trg)

for (src, trg) in valid_iter:
  valid_src.append(src)
  valid_trg.append(trg)

Checking if the `src` and the `trg` has the same length for all sets.

In [48]:
assert len(train_src) == len(train_trg), f"The src and trg must have the same length got {len(train_src)} and {len(train_trg)}."
assert len(valid_src) == len(valid_trg), f"The src and trg must have the same length got {len(valid_src)} and {len(valid_trg)}."
assert len(test_src) == len(test_trg), f"The src and trg must have the same length got {len(test_src)} and {len(test_trg)}."

### Counting examples

In the following code cell we are going to count examples in each language pair and visualize them using a table.

In [49]:
columns = ["set", "src", "trg"]
examples = [
    ['training', len(train_src), len(train_trg)],
    ['validation' , len(valid_src), len(valid_trg)],
    ['testing' , len(test_src), len(test_trg)],
]
tables.tabulate_data(columns, examples, "Examples")

+----------------------------+
|          Examples          |
+------------+-------+-------+
|    set     |  src  |  trg  |
+------------+-------+-------+
| training   | 29001 | 29001 |
| validation |  1015 |  1015 |
| testing    |  1000 |  1000 |
+------------+-------+-------+


### Building Vocabulary
Next, we'll build the vocabulary for the `source` (src) and `target` (trg) languages. The vocabulary is used to associate each unique token with an index (an integer). The vocabularies of the `source` and `target` languages are distinct.

Using the `min_freq` argument, we only allow tokens that appear at least `2` times to appear in our vocabulary. Tokens that appear only once are converted into an `<unk>` (unknown) token.

> It is important to note that our vocabulary should only be built from the `training` set and not the `validation/test` set. This prevents `"information leakage"` into our model, giving us artifically inflated `validation/test` scores.

Also note that in this notebook we are not going to focus much about text cleaning. We are going to convert sentences to lower case only.

In [50]:
counter_src = Counter()
for line in train_src:
  counter_src.update(tokenize_de(line.lower()))
vocabulary_src = vocab.vocab(counter_src, min_freq=5, specials=('<unk>', '<sos>', '<eos>', '<pad>'))

counter_trg = Counter()
for line in train_trg:
  counter_trg.update(tokenize_de(line.lower()))
vocabulary_trg = vocab.vocab(counter_trg, min_freq=5, specials=('<unk>', '<sos>', '<eos>', '<pad>'))

In the following code cell we are going to get the `string-to-integer` representation of our `src` and `trg` fields.

In [51]:
stoi_src = vocabulary_src.get_stoi()
stoi_trg = vocabulary_trg.get_stoi()

SRC_VOCAB_SIZE = len(stoi_src)
TRG_VOCAB_SIZE = len(stoi_trg)

### SRC and Target Pipelines
After our text has been tokenized we need a way of converting those words into numbers because machine leaning models understand numbers not words. That's where we the `src_text_pipeline` and `trg_text_pipeline` functions into play. So these function takes in a sentence and tokenize it then converts each word to a number. Note that the word that does not exists in the vocabulay for either `src` or `trg` will be converted to an `unkown` (`<unk>`) token.

In [52]:
def src_text_pipeline(x: str)->list:
  values = list()
  tokens = tokenize_de(x.lower())
  for token in tokens:
    try:
      v = stoi_src[token]
    except KeyError as e:
      v = stoi_trg['<unk>']
    values.append(v)
  return values

def trg_text_pipeline(x: str)->list:
  values = list()
  tokens = tokenize_de(x.lower())
  for token in tokens:
    try:
      v = stoi_trg[token]
    except KeyError as e:
      v = stoi_trg['<unk>']
    values.append(v)
  return values

#### Translation Dataset

In the following code cell we are going to create a `TranslationDataset` class which will inherid from the `torch.utils.data.Dataset` class. This class will take in the `src` and `trg` values and pair them together.

In [53]:
class TranslationDataset(torch.utils.data.Dataset):
  def __init__(self, src, trg):
    super(TranslationDataset, self).__init__()
    self.src = src
    self.trg = trg
      
  def __getitem__(self, index):
    return self.src[index], self.trg[index]
  
  def __len__(self):
    return len(self.src)

### collate_fn of the `DataLoader`
Our collate function will be named `tokenize_batch` which takes in the batch of `src` and `trg` pairs then `tokenize`, `numericalize` and then `pad` the to a given `max_len`.

In [54]:
def tokenize_batch(batch, max_len=100, padding="pre"):
  assert padding=="pre" or padding=="post", "the padding can be either pre or post"
  src, trg = [], []
  for src_, trg_ in batch:
    _src = torch.zeros(max_len, dtype=torch.int32)
    _trg = torch.zeros(max_len, dtype=torch.int32)
    _src_processed_text = torch.tensor(src_text_pipeline(src_.lower()), dtype=torch.int32)
    _trg_processed_text = torch.tensor(trg_text_pipeline(trg_.lower()), dtype=torch.int32)
    _src_pos = min(max_len, len(_src_processed_text))
    _trg_pos = min(max_len, len(_trg_processed_text))
    if padding == "pre":
      _src[:_src_pos] = _src_processed_text[:_src_pos]
      _trg[:_trg_pos] = _trg_processed_text[:_trg_pos]
    else:
      _src[-_src_pos:] = _src_processed_text[-_src_pos:]
      _trg[-_trg_pos:] = _trg_processed_text[-_trg_pos:]
    trg.append(_trg.unsqueeze(dim=0))
    src.append(_src.unsqueeze(dim=0))
  #  the target values must be a LongTensor
  return torch.cat(src, dim=0), torch.cat(trg, dim=0).type(torch.LongTensor)

### Creating datasets
In the following code cell we are going to create datasets for all our `3` sets.

In [55]:
train_dataset = TranslationDataset(train_src, train_trg)
test_dataset = TranslationDataset(test_src, test_trg)
valid_dataset = TranslationDataset(valid_src, valid_trg)

Checking a single example in the `train` set.

In [56]:
train_dataset[0]

('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'Two young, White males are outside near many bushes.')

### DataLoaders

In the following code cell we are going to create  dataloaders for our three sets of data. We are going to pass the `tokenize_batch` as our `collate_fn`. We are going to shuffle example in the `train` set only.

In [57]:
BATCH_SIZE = 128
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=tokenize_batch)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=tokenize_batch)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=tokenize_batch)

#### Checking examples in the train_loader.

In [58]:
src, trg = iter(train_loader).next()

In [59]:
src[:2]

tensor([[  45,   61,   17, 2714,   29, 3011,  175,   45, 1138,   54,   47,   48,
           58,  209,   32,  243,   61,   99,   15,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0],
        [  20,    5, 1355,   20,    0,   47,   20,  314,  344,  175,    0,   11,
          332,  995,   15,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0, 

In [60]:
trg[:2]

tensor([[  21,   59,   80,   21,  389,   83,   32, 3181,   21, 1206,   46,  101,
           60,  286,   59,   14,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0],
        [  21,    5,  220, 1422,   21,  414,   70,    0,    6,   60,  341,    0,
           17,   49, 1039,   14,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0, 

### Sequence To Sequence Model

We are going to build our model in `3` parts. The `econder`, `decoder` and `seq2seq`. You can refer from [this notebook](https://github.com/CrispenGari/pytorch-python/blob/main/09_NLP/03_Sequence_To_Sequence/01_Sequence_To_Sequence_Introduction.ipynb).


### Encoder

The encoder is similar to the previous one, with the multi-layer `LSTM` swapped for a single-layer `GRU`. We also don't pass the dropout as an argument to the `GRU` as that dropout is used between each layer of a multi-layered `RNN`. As we only have a single layer, PyTorch will display a warning if we try and use pass a dropout value to it.

Another thing to note about the GRU is that it only requires and returns a hidden state, there is no cell state like in the `LSTM`.

In [61]:
class Encoder(nn.Module):
  def __init__(self, input_dim, emb_dim, hid_dim, dropout):
    super(Encoder, self).__init__()
    self.hid_dim = hid_dim
    self.embedding = nn.Embedding(input_dim, embedding_dim=emb_dim)
    """
    No Dropout (GRU) since we have **one** layer.
    """
    self.gru = nn.GRU(emb_dim, hid_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, src):
    # src = [src len, batch size]
    embedded = self.dropout(self.embedding(src))
    # embedded = [src len, batch size, emb dim]
    ouputs, h_0 = self.gru(embedded) # no cell state since it is a GRU not LSTM
    """
    outputs = [src len, batch size, hid dim * n directions]
    hidden (h_0) = [n_layers * n_directions, batch size, hid dim]
    ** outputs are always from the top hidden layer
    """
    return h_0

### Decoder

The decoder is where the implementation differs significantly from the previous model and we alleviate some of the information compression. You can read more [here.](https://github.com/CrispenGari/pytorch-python/blob/main/09_NLP/03_Sequence_To_Sequence/02__Learning_Phrase_Representations_using_RNN_Encoder_Decoder_for_Statistical_Machine_Translation.ipynb)

In [62]:
class Decoder(nn.Module):
  def __init__(self, output_dim, emb_dim, hid_dim, dropout):
    super(Decoder, self).__init__()
    self.hid_dim = hid_dim
    self.output_dim = output_dim

    self.embedding = nn.Embedding(output_dim, emb_dim)
    self.gru = nn.GRU(emb_dim + hid_dim, hid_dim)
    self.fc = nn.Linear(emb_dim + hid_dim * 2, output_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, input, hidden, context):
    """     
    input = [batch size]
    hidden, (h_0) = [n_layers * n_directions, batch_size, hid_dim]
    context = [n_layers * n_directions, batch_size, hid_dim]

    n_layers and n_directions in the decoder will both always be 1, therefore:
    hidden (h_0) = [1, batch_size, hid_dim]
    context = [1, batch_size, hid_dim]
    """
    input = input.unsqueeze(0)  # nput = [1, batch size]
    embedded = self.dropout(self.embedding(input)) # embedded = [1, batch_size, emb_dim]
    emb_con = torch.cat((embedded, context), dim = 2) # emb_con = [1, batch size, emb dim + hid dim]

    output, h_0 = self.gru(emb_con, hidden)
    """      
    output = [seq_len, batch_size, hid dim * n_directions]
    hidden (h_0) = [n_layers * n_directions, batch_size, hid_dim]

    seq_len, n_layers and n_directions will always be 1 in the decoder, therefore:
    output = [1, batch_size, hid_dim]
    hidden (h_0) = [1, batch_size, hid_dim]
    """
    output = torch.cat((embedded.squeeze(0), h_0.squeeze(0), context.squeeze(0)), 
                           dim = 1) # output = [batch size, emb dim + hid dim * 2]
    prediction = self.fc(output) # prediction = [batch size, output dim]
    return prediction, h_0

###Seq2Seq (Sequence to Sequence)
For the final part of the implemenetation, we'll implement the `seq2seq` model. This will handle:

* the outputs tensor is created to hold all predictions, 
* the source sequence, is fed into the encoder to receive a context vector
* the initial decoder hidden state is set to be the context vector, 
* we use a batch of `<sos> ` tokens as the first input, 
* we then decode within a loop:
  * inserting the input token, previous hidden state, , and the context vector, , into the decoder
* receiving a prediction, , and a new hidden state, 
* we then decide if we are going to teacher force or not, setting the next input as appropriate (either the ground truth next token in the target sequence or the highest predicted next token)

In [63]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        assert encoder.hid_dim == decoder.hid_dim, "Hidden dimensions of encoder and decoder must be equal!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        """
        src = [src_len, batch_size]
        trg = [trg_len, batch_size]
        teacher_forcing_ratio is probability to use teacher forcing
        e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        """
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        # last hidden state of the encoder is the context
        context = self.encoder(src)
        
        # context also used as the initial hidden state of the decoder
        hidden = context
        
        # first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            # insert input token embedding, previous hidden state and the context state
            # receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, context)
            # place predictions in a tensor holding predictions for each token
            outputs[t] = output
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            # get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            input = trg[t] if teacher_force else top1
        return outputs

### Training the `Seq2Seq`
Now we have our model implemented, we can begin training it.

First, we'll initialize our model. As mentioned before, the `input` and `output` dimensions are defined by the size of the vocabulary. The embedding dimesions and dropout for the encoder and decoder can be different, but the number of layers and the size of the hidden/cell states must be the same.

We then define the encoder, decoder and then our `Seq2Seq` model, which we place on the device.

In [64]:
INPUT_DIM = SRC_VOCAB_SIZE
OUTPUT_DIM = TRG_VOCAB_SIZE
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_DROPOUT)
model = Seq2Seq(enc, dec, device).to(device)
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(3554, 256)
    (gru): GRU(256, 512)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(3275, 256)
    (gru): GRU(768, 512)
    (fc): Linear(in_features=1280, out_features=3275, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Counting model parameters.
We are going to use the `model_params` function from `helperfns` to count model parameters.

In [65]:
model_params(model)

TOTAL MODEL PARAMETERS: 	9095371
TOTAL TRAINABLE PARAMETERS: 	9095371


### Initializing weights to the model

Next, we initialize our parameters. The paper states the parameters are initialized from a normal distribution with a mean of `0` and a standard deviation of `0.01`, i.e. $N(0, 0.01)$ .

It also states we should initialize the recurrent parameters to a special initialization, however to keep things simple we'll also initialize them to $N(0, 0.01)$.

In [66]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean=0, std=0.01)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(3554, 256)
    (gru): GRU(256, 512)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(3275, 256)
    (gru): GRU(768, 512)
    (fc): Linear(in_features=1280, out_features=3275, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

### Optimizer
For the optimizer we are going to use the `Adam` optimizer with default parameters.

In [67]:
optimizer = torch.optim.Adam(model.parameters())

### Criterion
Next, we define our loss function. The `CrossEntropyLoss` function calculates both the log softmax as well as the negative log-likelihood of our predictions.

Our loss function calculates the average loss per token, however by passing the index of the `<pad>` token as the `ignore_index` argument we ignore the loss whenever the target token is a padding token.

In [68]:
TRG_PAD_IDX = stoi_trg["<pad>"]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

### Train Loop

First, we'll set the model into "training mode" with `model.train()`. This will turn on dropout (and batch normalization, which we aren't using) and then iterate through our data iterator.

At each iteration:

* get the source and target sentences from the batch,  and put them to the `device` 
* zero the gradients calculated from the last batch
* feed the source and target into the model to get the output, 
* as the loss function only works on 2d inputs with 1d targets we need to flatten each of them with `.view`
  * we slice off the first column of the output and target tensors as mentioned above
* calculate the gradients with `loss.backward()`
* clip the gradients to prevent them from exploding (a common issue in RNNs)
* update the parameters of our model by doing an optimizer step
* sum the loss value to a running total
* Finally, we return the loss that is averaged over all batches.

In [69]:
def train(model, iterator, optimizer, criterion, clip):
  model.train()
  epoch_loss = 0
  for i, (src, trg) in enumerate(iterator):
    src = src.to(device)
    trg = trg.type(torch.LongTensor).to(device)
    optimizer.zero_grad()
    output = model(src, trg)
    """
    trg = [trg len, batch size]
    output = [trg len, batch size, output dim]
    """
    output_dim = output.shape[-1]
    output = output[1:].view(-1, output_dim)
    trg = trg[1:].view(-1)

    """
    trg = [(trg len - 1) * batch size]
    output = [(trg len - 1) * batch size, output dim]
    """
    loss = criterion(output, trg)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    epoch_loss += loss.item()
    return epoch_loss / len(iterator)

### Evaluation Loop
Our evaluation loop is similar to our training loop, however as we aren't updating any parameters we don't need to pass an `optimizer` or a `clip` value.

We must remember to set the model to evaluation mode with `model.eval()`. This will turn off dropout (and batch normalization, if used).

We use the with `torch.no_grad()` block to ensure no gradients are calculated within the block. This reduces memory consumption and speeds things up.

The iteration loop is similar (without the parameter updates), however we must ensure we turn teacher forcing off for evaluation. This will cause the model to only use it's own predictions to make further predictions within a sentence, which mirrors how it would be used in deployment.

In [70]:
def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0
  with torch.no_grad():
    for i, (src, trg) in enumerate(iterator):
      src = src.to(device)
      trg = trg.type(torch.LongTensor).to(device)
      output = model(src, trg, 0) #turn off teacher forcing
      """
      trg = [trg len, batch size]
      output = [trg len, batch size, output dim]
      """
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim)
      trg = trg[1:].view(-1)
      """
      trg = [(trg len - 1) * batch size]
      output = [(trg len - 1) * batch size, output dim]
      """
      loss = criterion(output, trg)
      epoch_loss += loss.item()
  return epoch_loss / len(iterator)

### Running the training loop.
During training we are going to visualize our training metrics in tabular form. We are going to save the best model if and only if the previous validation loss is less greater than the current epoch validation loss.

In [71]:
N_EPOCHS = 10
CLIP = 1
best_valid_loss = float('inf')
MODEL_NAME = 'best-model.pt'
for epoch in range(N_EPOCHS):
    start = time.time()
    train_loss = train(model, train_loader, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_loader, criterion)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
    end = time.time()
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), MODEL_NAME)
    data = [
       ["Training", f'{train_loss:.3f}', f'{math.exp(train_loss):7.3f}', f"{utils.hms_string(end - start)}" ],
       ["Validation", f'{valid_loss:.3f}', f'{math.exp(valid_loss):7.3f}', "" ],       
   ]
    columns = ["CATEGORY", "LOSS", "PPL", "ETA"]
    tables.tabulate_data(columns, data, title)

+--------------------------------------------+
|     EPOCH: 01/10 saving best model...      |
+------------+-------+----------+------------+
|  CATEGORY  |  LOSS |   PPL    |    ETA     |
+------------+-------+----------+------------+
| Training   | 0.036 |    1.036 | 0:00:01.95 |
| Validation | 8.021 | 3043.698 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/10 saving best model...      |
+------------+-------+----------+------------+
|  CATEGORY  |  LOSS |   PPL    |    ETA     |
+------------+-------+----------+------------+
| Training   | 0.035 |    1.036 | 0:00:01.83 |
| Validation | 7.901 | 2700.634 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 03/10 saving best model...      |
+------------+-------+----------+------------+
|  CATEGORY  |  LOSS |   PPL    |    ETA     |
+------------+-------+----------+------------+
| Training   

### Evaluating the Best Model

In the following code cell we are going to evaluate the best model.

In [72]:
column_names = ["Set", "Loss", "PPL", "ETA (time)"]
model.load_state_dict(torch.load(MODEL_NAME))
test_loss= evaluate(model, test_loader, criterion)
title = "Model Evaluation Summary"
data_rows = [["Test", f'{test_loss:.3f}', f'{math.exp(test_loss):7.3f}', ""]]

tables.tabulate_data(column_names, data_rows, title)

+-------------------------------------+
|       Model Evaluation Summary      |
+------+-------+---------+------------+
| Set  |  Loss |   PPL   | ETA (time) |
+------+-------+---------+------------+
| Test | 1.153 |   3.168 |            |
+------+-------+---------+------------+



Just looking at the test loss, we get better performance than the previous model. This is a pretty good sign that this model architecture is doing something right! Relieving the information compression seems like the way forard, and in the next notebook we'll expand on this even further with attention.
