## **Prerequisites**

### **Check what GPU you got**
If you didn't get the P100-PCIE GPU, click on the Runtime dropdown at the top of the page and Factory Reset Runtime 

In [1]:
!nvidia-smi

Sat Apr 25 17:11:55 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

### **Mount and Install Packages**
Mount the data and set up Spacy

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import os
os.chdir("/content/drive/My Drive/English-to-French-Translation")
!ls

!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

!pip3 install 'torchtext==0.5.0'

!pip3 install torch torchvision

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
data  embeddings  experiments  models  README.md  scripts  src
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
Collecting fr_core_news_sm==2.2.5
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.2.5/fr_core_news_sm-2.2.5.tar.gz (14.7MB)
[K     |████████████████████████████████| 14.7MB 82.6MB/s 
Building wheels for collected packages: fr-core-news-sm

### **Important: Reset Runtime**

Note: there is a slight bug with Google Colab. After installing Spacy, you need to restart the Jupyter Notebook runtime.

There are two ways:
1. Click on the Runtime dropdown, and select "Restart Runtime". Once that is done, proceed to the next step (no need to remount the drive).
2. Run the code below. It will kill the current process, effectively restarting the runtime.

In [0]:
import os
os.kill(os.getpid(), 9)

##Get the data

### **Get a list of vocabs:**
The list of vocabs are already stored in the Google Drive folder; thus, we just have to load it.

In [1]:
import sys
sys.path.append('/content/drive/My Drive/English-to-French-Translation/src/dataloader/')

import vocabs

models_dir = "/content/drive/My Drive/English-to-French-Translation/models/Multi30k/"
source_vocabs = vocabs.load_vocabs_from_file(models_dir + 'vocab.english.gz')
target_vocabs = vocabs.load_vocabs_from_file(models_dir + 'vocab.french.gz')

Loaded 9757 words
Loaded 11137 words


### **Tokenize and split the dataset**
We will have three types of datasets:
1. Training data: it is the data used to train our model
2. Validation (val) data: it is the data used to test our model at each step of the training process
3. Test data: it is the data used to test our model after all the training is done

How to get the three types of data?
* The test set is already in the `Test` folder
* The validation set is a piece of the data in the `Train`` folder

The code to get our three types of datasets is:

In [2]:
import sys
sys.path.append('/content/drive/My Drive/English-to-French-Translation/src/dataloader')

import os
from tqdm.notebook import tqdm

import torch
import numpy as np
import utils
import vocabs

class Seq2SeqDataset(torch.utils.data.Dataset):
  def __init__(
    self,
    dir_: str,
    source_vocabs: vocabs.VocabDataset,
    target_vocabs: vocabs.VocabDataset,
    source_lang: str,
    target_lang: str,
  ):

    """ Initialize the Hansard dataset from a directory of parallel texts
        Note:
            The parallel text in 'dir_' must contain source and target transcriptions
            with the file extension 'source_lang' and 'target_lang'

        Parameters
        ----------
        dir_ : str
            A path to the directory of parallel text
        source_vocabs : VocabDataset
            A vocab dataset of the source text
        target_vocabs : VocabDataset
            A vocab dataset of the target text
        source_lang : { 'en', 'fr' }
            The source language
        target_lang : { 'en', 'fr' }
            The target language
    """

    # Get the spacy instances
    source_spacy = utils.get_spacy_instance(source_lang)
    target_spacy = utils.get_spacy_instance(target_lang)

    # Get all the files with the text
    transcriptions = utils.get_parallel_text(dir_, [source_lang, target_lang])

    source_filepaths = [os.path.join(dir_, trans[0]) for trans in transcriptions]
    target_filepaths = [os.path.join(dir_, trans[1]) for trans in transcriptions]

    source_l = utils.read_transcription_files(source_filepaths, source_spacy)
    target_l = utils.read_transcription_files(target_filepaths, target_spacy)

    source_word2id = source_vocabs.get_word2id()
    target_word2id = target_vocabs.get_word2id()

    src_unk, src_pad = range(len(source_word2id), len(source_word2id) + 2)
    trg_unk, trg_sos, trg_eos, trg_pad = range(len(target_word2id), len(target_word2id) + 4)

    corpus_iterator = zip(target_l, source_l)
    corpus_size = utils.get_size_of_corpus(source_filepaths)

    pairs = []
    src_lens = []
    trg_lens = []

    for (trg, trg_filename, _), (src, src_filename, _) in tqdm(corpus_iterator, total=corpus_size):
      assert trg_filename[:-2] == src_filename[:-2]

      if not src or not trg:
        continue

      # Skip sentences > 50 words
      if len(src) > 50 or len(trg) > 50:
        continue

      # Skip sentences with no words
      if len(src) <= 0 or len(trg) <= 0:
        print("Found a sentence with no words in it!")
        continue

      src_tensor = torch.tensor([source_word2id.get(word, src_unk) for word in src])
      trg_tensor = torch.tensor(
        [trg_sos] + [target_word2id.get(word, trg_unk) for word in trg] + [trg_eos]
      )

      # Validate the contents of E and F
      if torch.any(src_tensor < 0) or torch.any(src_tensor > src_unk):
        print("src_unk:", src_unk)
        print("src_tensor:", src_tensor)
        raise ValueError("Contents of src_tensor should be <= src_unk and >= 0!")

      if torch.any(trg_tensor < 0) or torch.any(trg_tensor > trg_eos):
        print("trg_eos:", trg_eos)
        print("trg_tensor:", trg_tensor)
        raise ValueError("Contents of trg_tensor should be <= trg_eos and >= 0!")

      # Skip sentences that don't have any words in the vocab
      if torch.all(src_tensor == src_unk) and torch.all(trg_tensor[1:-1] == trg_unk):
        continue

      pairs.append((src_tensor, trg_tensor))
      src_lens.append(src_tensor.size()[0])
      trg_lens.append(trg_tensor.size()[0])

    print("Number of sentence pairs:", len(pairs))

    print("Avg. num words in source text:", np.mean(src_lens))
    print("Std. num words in source text:", np.std(src_lens))
    print("Max. num words in source text:", np.max(src_lens))
    print("Min. num words in source text:", np.min(src_lens))

    print("Avg. num words in target text:", np.mean(trg_lens))
    print("Std. num words in target text:", np.std(trg_lens))
    print("Max. num words in target text:", np.max(trg_lens))
    print("Min. num words in target text:", np.min(trg_lens))
    
    self.source_unk = src_unk
    self.source_pad_id = src_pad
    self.source_vocab_size = len(source_word2id) + 2  # pad id and unk

    self.target_unk = trg_unk
    self.target_sos = trg_sos
    self.target_eos = trg_eos
    self.target_pad_id = trg_pad
    self.target_vocab_size = len(target_word2id) + 4  # unk, sos, eos, and pad id
    
    self.dir_ = dir_
    self.pairs = pairs

  def __len__(self):
    """ Returns the number of parallel texts in this dataset """
    return len(self.pairs)

  def __getitem__(self, i):
    """ Returns the i-th parallel texts in this dataset """
    return self.pairs[i]


class Seq2SeqDataLoader(torch.utils.data.DataLoader):
  def __init__(self, dataset, source_pad_id, target_pad_id, batch_first=False, **kwargs):
    """ Loads the dataset for the model
        It can load the dataset in parallel by setting 'num_workers' param > 0

        Parameters
        ----------
        dataset : Seq2SeqDataset
            The parallel text dataset
        source_pad_id : int
            An ID used to pad the source text for batching
        target_pad_id : int
            An ID used to pad the target text for batching
    """
    super().__init__(dataset, collate_fn=self.collate, **kwargs)

    self.source_pad_id = source_pad_id
    self.target_pad_id = target_pad_id
    self.batch_first = batch_first

  def collate(self, batch):
    """ Given a batch of source and target texts, it will pad it
        Specifically, it pads F with self.source_pad_id and E with self.target_eos

        Parameters
        ----------
        batch : A set of sequences F and E where F is torch.tensor and E is torch.tensor

        Returns
        -------
        (F, F_lens, E, E_lens) : tuple
            F is a torch.tensor of size (S, N)
            E is a torch.tensor of size (S, N)
    """
    src_batch, trg_batch = zip(*batch)
    src_lens = torch.tensor([src_seq.size()[0] for src_seq in src_batch])
    trg_lens = torch.tensor([trg_seq.size()[0] for trg_seq in trg_batch])

    src = torch.nn.utils.rnn.pad_sequence(src_batch, batch_first=self.batch_first, padding_value=self.source_pad_id)
    trg = torch.nn.utils.rnn.pad_sequence(trg_batch, batch_first=self.batch_first, padding_value=self.target_pad_id)

    return src, src_lens, trg, trg_lens


train_dir = "/content/drive/My Drive/English-to-French-Translation/data/Multi30k/Training"
test_dir = "/content/drive/My Drive/English-to-French-Translation/data/Multi30k/Testing"

source_lang = "en"
target_lang = "fr"
train_val_ratio = 0.75
batch_size = 64

dataset = Seq2SeqDataset(
  train_dir,
  source_vocabs,
  target_vocabs,
  source_lang,
  target_lang
)

num_training_data = int(len(dataset) * train_val_ratio)
num_val_data = len(dataset) - num_training_data

train_dataset, val_dataset = torch.utils.data.random_split(
  dataset, [num_training_data, num_val_data]
)

train_dataloader = Seq2SeqDataLoader(
  train_dataset,
  dataset.source_pad_id,
  dataset.target_pad_id,
  batch_first=True,
  batch_size=batch_size,
  shuffle=True,
  pin_memory=True,
  num_workers=10
)
val_dataloader = Seq2SeqDataLoader(
  val_dataset,
  dataset.source_pad_id,
  dataset.target_pad_id,
  batch_first=True,
  batch_size=batch_size,
  shuffle=True,
  pin_memory=True,
  num_workers=10
)

test_dataset = Seq2SeqDataset(
  test_dir,
  source_vocabs,
  target_vocabs,
  source_lang,
  target_lang
)
test_dataloader = Seq2SeqDataLoader(
  test_dataset,
  test_dataset.source_pad_id,
  test_dataset.target_pad_id,
  batch_first=True,
  batch_size=batch_size,
  shuffle=True,
  pin_memory=True,
  num_workers=10
)


HBox(children=(IntProgress(value=0, max=29461), HTML(value='')))


Number of sentence pairs: 29460
Avg. num words in source text: 13.082416836388322
Std. num words in source text: 4.061960254692422
Max. num words in source text: 41
Min. num words in source text: 4
Avg. num words in target text: 16.268940936863544
Std. num words in target text: 4.730511458726377
Max. num words in target text: 49
Min. num words in target text: 6


HBox(children=(IntProgress(value=0, max=1014), HTML(value='')))


Number of sentence pairs: 1014
Avg. num words in source text: 13.241617357001973
Std. num words in source text: 4.056529028481095
Max. num words in source text: 32
Min. num words in source text: 4
Avg. num words in target text: 16.357988165680474
Std. num words in target text: 4.777396885525875
Max. num words in target text: 38
Min. num words in target text: 7


##**Our Model**

###**The Intuition behind Attention**

Idea behind attention:
* Suppose we have a sentence like "Mary gave roses to Susan"
* We want our ML model to focus on the fact that the **giving** is being done by Mary to Susan with a rose. 
* The model should not give much attention that Mary came before Susan.
* This is called attention
* The attention layer will allow the ML model to focus on certain aspects that are more important than others

Building our self-attention layer:
* Let $x_1, x_2, ..., x_t \in R^k$ be our sequence of inputs represented with word embeddings of dimension $k$. 
* Attention is basically:

  \begin{align*}
    y_i &= \sum_{j = 0}^{t} w_{i, j} x_j
  \end{align*}

  where:
  \begin{align*}
    w'_{i, j} &= x_i^T x_j
  \end{align*}

* But the dot product will give a value anywhere between $[-\infty, \infty]$
* Solution:
  * We apply softmax to map the values to $[0, 1]$:

    \begin{align}
      w_{i, j} = \frac{exp(w'_{i, j})}{\sum_{j}^{t} exp(w'_{i, j})}
    \end{align}

* That is the basics of self-attention.

The problem:
* There is no parameters in our attention formula.
* We need parameters so that it can attend to different areas of the inputs.

The solution:
* If we want to add weights, we just apply a linear transformation on each of the $x_i \in X$ that we use in our attention equation:

  \begin{align}
    q_i &= W_q x_i \text{ with } W_q \in R^{k x k} \\
    k_i &= W_k x_i \text{ with } W_k \in R^{k x k} \\
    v_i &= W_v x_i \text{ with } W_v \in R^{k x k} \\
  \end{align}


* We then use it on these formulas: \\

  \begin{align}
    w'_{i, j} &= q_i^T k_j \\
    w_{i, j} &= softmax(w'_{i, j}) \\
    y_i &= \sum_{j}^{t} w_{i, j} v_j \\
  \end{align}

Another problem:
* The softmax function is very sensitive to very large inputs
* It will squeeze all the inputs to 0 if there is an input that is very large.

The solution:
* We scale the dot product by $\sqrt{k}$. 
* Now, our formula is:

  \begin{align}
    w'_{i, j} = \frac{q_i^T k_j}{\sqrt{k}}
  \end{align}

Generalizing attention to Queries, Keys, and Values
* Recall our attention model:

  \begin{align}
    y_i &= \sum_{j}^{t} w_{i, j} v_j \\
    w_{i, j} &= softmax(w'_{i, j}) \\
    w'_{i, j} &= q_i^T k_j \\
    \text{where:} & \\
    q_i &= W_q x_i \text{ with } W_q \in R^{k x k} \\
    k_i &= W_k x_i \text{ with } W_k \in R^{k x k} \\
    v_i &= W_v x_i \text{ with } W_v \in R^{k x k} \\
  \end{align}

* We can generalize it to: \\

  \begin{align}
    y_i &= \sum_{j}^{t} w_{i, j} v_j \\
    w_{i, j} &= softmax(w'_{i, j}) \\
    w'_{i, j} &= q_i^T k_j \\
    \text{where:} & \\
    q_i &= W_q q'_i \text{ with } W_q \in R^{k x k} \\
    k_i &= W_k k'_i \text{ with } W_k \in R^{k x k} \\
    v_i &= W_v v'_i \text{ with } W_v \in R^{k x k} \\
  \end{align}

* We are going to use this in our Transformer model


###**The multi-headed attention layer**
Note that a word can mean different things to different neighbours.
* For instance, consider the example:

  F = [mary, gave, roses, to, susan]

* The word "gave" has different relations to different parts of the sentnece. Mary is the one whose giving, and Susan is the one who is receiving
* In a normal Seq2Seq model with RNNs, the attention layer was not able to determine which is which and would confuse Susan giving the roses to Mary instead
* The solution to this is to have multiple attention heads

Let `h` = number of heads

To give additional attention heads, we can do one of the following ways:
1. Split the hidden dimension of each word evenly by `h`
2. Create `h` different weight matrixes

**Split the hidden dimension of each word evenly by `h`**:

This is where we split the word embeddings by `h`:
* Let $x'_i = [x_i[0], x_i[1], ..., x_i[h]]$
* Now, apply each row in $x'_i$ to $W_k, W_q, W_v$, i.e,

  * $q_i[j] = W_q * x'_i[j]$
  * $k_i[j] = W_k * x'_i[j]$
  * $v_i[j] = W_v * x'_i[j]$

* Now, we combine it together:

  * $q_i = [q_i[0], q_i[1], ..., q_i[h]]$
  * $k_i = [k_i[0], k_i[1], ..., k_i[h]]$
  * $v_i = [v_i[0], v_i[1], ..., v_i[h]]$

In lin algebra, the stuff above is the same as the formulas below:

  * $q_i = W_q * x_i$
  * $k_i = W_k * x_i$
  * $v_i = W_v * x_i$

We will be using this ^

**Create `h` different weight matrixes**:

To give additional attention heads, we do:
* Make additional weight matrixes:

  $W^{\rho}_{q}, W^{\rho}_{k}, W^{\rho}_{v}$

  where ${\rho}$ is the attention head index

* For each input $x_i$, we pass it through each head attention weight matrixes, concatenate all of them (so our output becomes $k\rho \cdot k$), and we do one last linear transformation with another weight matrix to reduce the dimension back to k.

In CUDA, it is more efficient to concatenate the different heads of one type of weight matrix as one matrix.

So:

* $W_q = [W^0_q, W^1_q, W^2_q, ..., W^h_q] \in R^{kh \times k}$
* $W_k = [W^0_k, W^1_k, W^2_k, ..., W^h_k] \in R^{kh \times k}$
* $W_v = [W^0_v, W^1_v, W^2_v, ..., W^h_v] \in R^{kh \times k}$

So now, if we apply $W_q, W_k, W_v$ to $x_i$, we get:

* $q_i = W_q x_i \in R^{kh \times 1}$ 
* $k_i = W_k x_i \in R^{kh \times 1}$ 
* $v_i = W_v x_i \in R^{kh \times 1}$ 

In our code, we will use option (1), and our self-attention module is now:

In [0]:
import torch
from torch import nn
import torch.nn.functional as F

class MultiHeadedSelfAttention(nn.Module):
  def __init__(self, hidden_size, device, num_heads=10, dropout_value=0.1):
    ''' Constructs the MultiHeadedSelfAttention layer

        Parameters
        ----------
        hidden_size : int
          The hidden size
        device : torch.device
          The device running the model
        num_heads : int (optional)
          The number of heads to self-attend to
        dropout_value : float (optional)
          The dropout rate for the output layer
    '''
    super().__init__()

    self.hidden_size = hidden_size
    self.num_heads = num_heads
    self.head_dim = hidden_size // num_heads

    assert hidden_size % num_heads == 0

    # Our weight matrixes
    self.to_keys = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
    self.to_queries = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
    self.to_values = nn.Linear(self.hidden_size, self.hidden_size, bias=False)

    # The scale
    self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)

    # The final linear transformation
    self.unify_heads = nn.Linear(self.hidden_size, self.hidden_size)

    # The dropout
    self.dropout = nn.Dropout(dropout_value)

  def forward(self, value, key, query, mask=None):
    ''' Performs a forward prop of the multi-headed attention layer

        Parameters
        ----------
        value : torch.Tensor(b, v_t, d)
        key : torch.Tensor(b, k_t, d)
        query : torch.Tensor(b, q_t, d)
        mask : torch.Tensor(b, 1, 1, d)
    '''
    b, q_t, d = query.size()
    _, k_t, _ = key.size()
    _, v_t, _ = value.size()
    h = self.num_heads

    # Apply the weight matrix to x (Size: (b, t, h * k))
    queries = self.to_queries(query)
    keys = self.to_keys(key)
    values = self.to_values(value)

    assert queries.size() == query.size()
    assert keys.size() == key.size()
    assert values.size() == value.size()

    # Transform the matrix from (b, t, k) to (b, t, h, k // h)
    queries = queries.view(b, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
    keys = keys.view(b, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
    values = values.view(b, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)

    # assert queries.size() == (b, h, t, k // h)
    # assert keys.size() == (b, h, t, k // h)
    # assert values.size() == (b, h, t, k // h)

    # Perform the dot product (Size: (b, h, t, t))
    w_prime = torch.matmul(queries, keys.permute(0, 1, 3, 2)) / self.scale    
    # assert w_prime.size() == (b, h, t, t)

    # Perform a mask (if needed)
    if mask is not None:
      w_prime = w_prime.masked_fill(mask == 0, -1e10)

    # assert w_prime.size() == (b, h, t, t)

    # Perform the softmax
    w = torch.nn.functional.softmax(w_prime, dim=-1)
    # assert w.size() == (b, h, t, t)

    # Apply dropout to the attention
    dropped_w = self.dropout(w)

    # Perform the self-attention
    y = torch.matmul(dropped_w, values)
    # assert y.size() == (b, h, t, k // h)

    # Perform the last linear transformation from (b, t, h, k) to (b, t, k)
    y = y.permute(0, 2, 1, 3).contiguous().view(b, -1, self.hidden_size)
    unified_y = self.unify_heads(y)

    # assert unified_y.size() == (b, t, k)

    return unified_y
      


###**Our Translator Model**

Our model will look like this:

![alt text](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/e8209a7b0207cde55871be352819cac3dd5c05ce/assets/transformer1.png)

where:
* We will be applying the encoder and decoder layers N times.
* We apply a positional encoding to our inputs so that it can know the time sequence of things
* We apply normalization so that we normalize things. In Pytorch, it is the nn.LayerNorm






####**Our Encoder**

The input:
* Let $X = [x_0, x_1, x_2, ..., x_t] \in N^t$ be our inputs, where each number in $X$ represents a word.

The Encoder will do the following:

1. It will map each word ID in $X$ to word embeddings ($R^t \rightarrow R^{t \times k}$)

2. It will map the position (i.e, index) of each word to a position embedding ($R^{t} \rightarrow R^{t \times k}$)

3. It will multiply each element in the matrix from (1) by a scalar value, and add it with the matrix from (2) ( ($R^{t \times k}, R^{t \times k}) \rightarrow R^{t \times k}$)

4. It will apply the values from (3) multiple times through the various Encoder layers, passing the output of the previous Encoder layer as inputs to the next Encoder layer ($R^{t \times k} \rightarrow R^{t \times k}$)


In code:
* We want to handle batches of sequences. 
* So our inputs become $X \in N^{b \times t}$ where $b$ is the batch size.
* The code will look like:

In [0]:
import torch
from torch import nn
import torch.nn.functional as F

class Encoder(nn.Module):
  def __init__(self, 
               source_vocab_size : int, 
               word_embedding_size : int, 
               num_layers : int, 
               num_heads : int, 
               pf_dim : int, 
               dropout_value : float, 
               device : torch.device, 
               max_length : int = 100):
    ''' Constructs the Encoder

        Parameters
        ----------
        source_vocab_size : int
          The vocab size in the source language
        word_embedding_size : int
          The word embedding size
        num_layers : int
          The number of DecoderLayer layers
        num_heads : int
          The number of heads for each attention layer
        pf_dim : int
          The dimension for the Positional layer
        dropout_value : float
          The dropout value for the outputs
        device : torch.device
          The device to run the model on
        max_length : int (optional)
          The max. length of each target sequence

    '''
    
    super().__init__()
    self.device = device

    # Create our word and position embeddings
    self.word_embedding = nn.Embedding(source_vocab_size, word_embedding_size)
    self.pos_embedding = nn.Embedding(max_length, word_embedding_size)

    # Create our layers of Encoder layers
    self.encoder_layers = nn.ModuleList(
      [EncoderLayer(word_embedding_size, num_heads, pf_dim, dropout_value, device) for _ in range(num_layers)]
    )

    # Create our dropout and scaler
    self.dropout = nn.Dropout(dropout_value)
    self.scale = torch.sqrt(torch.FloatTensor([word_embedding_size])).to(device)

  def forward(self, src, src_mask):
    ''' Performs a forward propagation on the Encoder

        Parameters
        ----------
        src : torch.LongTensor(N, S)
          A batch of source sequences
        src_mask : torch.LongTensor(N, 1, 1, S)
          The masks of each source sequence in the current batch

        Returns
        -------
        encoded_src : torch.FloatTensor(N, S)
          A batch of encoded source sequences
    '''
    batch_size, seq_len = src.size()

    # Get the position embeddings for each src seq.
    src_pos = torch.arange(0, seq_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
    src_pos = self.pos_embedding(src_pos)

    # Get word embeddings for each src seq
    src = self.word_embedding(src)
    
    # Combine the word and position embeddings
    src = src * self.scale + src_pos
    
    # Apply dropout
    src = self.dropout(src)

    # Obtain the encoded src seq
    encoded_src = src
    for layer in self.encoder_layers:
      encoded_src = layer(encoded_src, src_mask)

    return encoded_src


####**Our Encoder Layer:**

The encoder layer is responsible for handling the gray part of the encoder:

![alt text](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/e8209a7b0207cde55871be352819cac3dd5c05ce/assets/transformer1.png)

Let $X = [x_0, x_1, x_2, ..., x_t] \in R^{t \times k}$ be the word embeddings of our input.

We then do the following:
1. Pass $X$ into the attention layer. It will output another matrix $X'$ with the same size as $X$

2. Apply a dropout to the attended $X$ matrix ($X'$) add it with the original word embeddings ($X$), and apply a normalization to the added matrix

3. Send the matrix from (2) into a Position-Wise Feed-Forward Neural Net that will output a matrix with the same size as its input.

4. Lastly, apply a dropout to the matrix from (3), add it with the matrix from (2), and apply the matrix from (2) + (3) with a normalization layer. It will output a matrix with the same size as its inputs.

In code, it looks like:



In [0]:
import torch
from torch import nn
import torch.nn.functional as F

class EncoderLayer(nn.Module):
  def __init__(self, 
               hidden_size : int, 
               num_heads : int, 
               pf_dim : int, 
               dropout_value : float, 
               device : torch.device):
    ''' Constructs the EncoderLayer

        Parameters
        ----------
        hidden_size : int
          The hidden size
        num_heads : int
          The number of heads in the self-attention layer
        pf_dim : int
          The dimension for the PositionwiseFeedforwardLayer
        dropout_value : float
          The dropout value
        device : torch.device
          The device to run the model on
    '''
    super().__init__()

    self.attention_layer = MultiHeadedSelfAttention(
      hidden_size, device, num_heads=num_heads, dropout_value=dropout_value
    )

    self.norm_layer = nn.LayerNorm(hidden_size)
    self.positionwise_feedforward = PositionwiseFeedforwardLayer(hidden_size, pf_dim, dropout_value)
    self.dropout = nn.Dropout(dropout_value)

  def forward(self, src, src_mask):
    ''' Performs a forward propagation of the Encoder layer

        Parameters
        ----------
        src : torch.Longtensor(N, S, H)
          A batch of source sequences
        src_mask : torch.LongTensor(N, 1, 1, S)
          The masks of each source sequence in the current batch

        Returns
        -------
        encoded_src : torch.Tensor(N, S, H)
          A batch of source sequences from this EncoderLayer
    '''

    # Apply attention
    attended_src = self.attention_layer(src, src, src, src_mask)

    # Apply dropout and normalization
    new_src = self.norm_layer(src + self.dropout(attended_src))

    # Apply positionalwise feedforward layer
    pos_src = self.positionwise_feedforward(new_src)

    # Apply dropout and layer normalization
    encoded_src = self.norm_layer(new_src + self.dropout(pos_src))

    assert encoded_src.size() == src.size()

    return encoded_src



####**Our Positionwise Feedforward Layer**
This is a three-layer neural net that takes an input of (t, k), expand it to (t, pf), and compress it back to (t, k).

In math, it will look like:

* $R^{t \times k} \rightarrow R^{t \times pf} \rightarrow R^{t \times k}$

In code, it will look like:


In [0]:
import torch
from torch import nn
import torch.nn.functional as F


class PositionwiseFeedforwardLayer(nn.Module):
  def __init__(self, hidden_size : int, pf_dim : int, dropout_value : float):
    ''' Constructs the PositionwiseFeedforwardLayer

        Parameters
        ----------
        hidden_size : int
          The hidden size
        pf_dim : int
          The dimension for the hidden layer
        dropout_value : float
          The dropout value for the output layer
    '''
    super().__init__()

    self.fc_1 = nn.Linear(hidden_size, pf_dim)
    self.fc_2 = nn.Linear(pf_dim, hidden_size)
    self.dropout = nn.Dropout(dropout_value)

  def forward(self, x):
    ''' Performs forward prop of the PositionwiseFeedforwardLayer

        Parameters
        ----------
        x : torch.FloatTensor(N, S, H)
          A batch of sequences with word embeddings

        Returns
        -------
        x : torch.FloatTensor(N, S, H)
          A batch of transformed sequences with word embeddings
    '''
    new_x = self.fc_1(x)
    new_x = torch.relu(new_x)
    new_x = self.dropout(new_x)
    new_x = self.fc_2(new_x)

    assert new_x.size() == x.size()

    return x


####**Our Decoder**
This will encapsulate a set of decoder layers, passing the outputs of the previous decoder layer as inputs to the next decoder layer

In code, it looks like:


In [0]:
import torch
from torch import nn
import torch.nn.functional as F


class Decoder(nn.Module):
  def __init__(self, 
               target_vocab_size : int, 
               word_embedding_size : int, 
               num_layers : int, 
               num_heads : int, 
               pf_dim : int, 
               dropout_value : float, 
               device : torch.device, 
               max_length : int = 100):
    ''' Constructs the Decoder

        Parameters
        ----------
        target_vocab_size : int
          The vocab size in the target language
        word_embedding_size : int
          The word embedding size
        num_layers : int
          The number of DecoderLayer layers
        num_heads : int
          The number of heads for each attention layer
        pf_dim : int
          The dimension for the Positional layer
        dropout_value : float
          The dropout value for the outputs
        device : torch.device
          The device to run the model on
        max_length : int (optional)
          The max. length of each target sequence
    '''
    super().__init__()
    self.device = device
    
    # Create the word and length embedding
    self.word_embedding = nn.Embedding(target_vocab_size, word_embedding_size)
    self.pos_embedding = nn.Embedding(max_length, word_embedding_size)

    # Create our list of decoder layers
    self.decoder_layers = nn.ModuleList(
      [ DecoderLayer(word_embedding_size, num_heads, pf_dim, dropout_value, device) for _ in range(num_layers) ]
    )

    self.scale = torch.sqrt(torch.FloatTensor([word_embedding_size])).to(device)

    # The last linear layer for the decoder
    self.fc_out = nn.Linear(word_embedding_size, target_vocab_size)
    self.dropout = nn.Dropout(dropout_value)

  def forward(self, trg, enc_src, trg_mask, src_mask):
    ''' Performs a forward prop. on the Decoder

        Parameters
        ----------
        trg : torch.LongTensor(N, S')
          A batch of target sequences
        enc_src : torch.LongTensor(N, S, H)
          A batch of source sequences that were encoded from the Encoder
        trg_mask : torch.LongTensor(N, S')
          The masks of each target sequence in the current batch
        src_mask : torch.LongTensor(N, S)
          The masks of each source sequence in the current batch

        Returns
        -------
        logits : torch.FloatTensor(N, target_vocab_size)
          The output probabilities that are not softmaxed
    '''
    batch_size, trg_seq_len = trg.size()
    _, _, hidden_dim = enc_src.size()

    pos = torch.arange(0, trg_seq_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
    assert pos.size() == (batch_size, trg_seq_len)

    trg = self.dropout((self.word_embedding(trg) / self.scale) + self.pos_embedding(pos))
    assert trg.size() == (batch_size, trg_seq_len, hidden_dim)

    for layer in self.decoder_layers:
      trg = layer(trg, enc_src, trg_mask, src_mask)

    logits = self.fc_out(trg)

    return logits


####**Our Decoder layer:**
Our decoder layer is going to look like:

![alt text](https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/e8209a7b0207cde55871be352819cac3dd5c05ce/assets/transformer-decoder.png)


In [0]:
import torch
from torch import nn
import torch.nn.functional as F


class DecoderLayer(nn.Module):
  def __init__(self, 
               hidden_size : int, 
               num_heads : int, 
               pf_dim : int, 
               dropout_value : float, 
               device : torch.device):
    ''' Creates a decoder layer. It should only be used in the Decoder class

        Parameter
        ---------
        hidden_size : int
          The hidden size
        num_heads : int
          The number of heads for the self-attention layers
        pf_dim : int
          The dimension for the PositionwiseFeedforward layer
        dropout_value : float
          The dropout value
        device : torch.Device
          The device to run the model
    '''
    super().__init__()

    self.norm_layer = nn.LayerNorm(hidden_size)

    self.attention_layer_1 = MultiHeadedSelfAttention(
      hidden_size, device, num_heads=num_heads, dropout_value=dropout_value
    )
    self.attention_layer_2 = MultiHeadedSelfAttention(
      hidden_size, device, num_heads=num_heads, dropout_value=dropout_value
    )

    self.positionwise_feedforward = PositionwiseFeedforwardLayer(hidden_size, pf_dim, dropout_value)

    self.dropout = nn.Dropout(dropout_value)

  def forward(self, trg, encoder_src, trg_mask, src_mask):
    ''' Performs a forward prop on the Decoder layer

        Paremeters
        ----------
        trg : torch.LongTensor(N, S', H)
          A batch of expected target sequences with embeddings
        encoder_src : torch.LongTensor(N, S, H)
          A batch of source sequence outputted from the Encoder
        trg_mask : torch.LongTensor(N, S')
          The masks for each target sequence in the current batch
        src_mask : torch.LongTensor(N, S)
          The masks for each source sequence in the current batch

        Returns
        -------
        decoded_trg : torch.tensor(N, S', H)
          A batch of decoded target sequences with embeddings
    '''
    assert trg.size()[0] == encoder_src.size()[0]
    assert trg.size()[2] == encoder_src.size()[2]

    # Apply attention
    attended_trg = self.attention_layer_1(trg, trg, trg, trg_mask)

    # Apply dropout and layer norm
    trg = self.norm_layer(trg + self.dropout(attended_trg))

    # Apply attention
    attended_trg = self.attention_layer_2(encoder_src, encoder_src, trg, src_mask)

    # Apply dropout and layer norm
    trg = self.norm_layer(trg + self.dropout(attended_trg))

    # Apply positionwise feedforward
    poswise_trg = self.positionwise_feedforward(trg)

    # Apply dropout and layer norm
    decoded_trg = self.norm_layer(trg + self.dropout(poswise_trg))

    return decoded_trg



####**Our final Seq2Seq model:**

Our final model will combine the Encoder and Decoder

The code looks like:


In [0]:
import torch
from torch import nn
import torch.nn.functional as F


class Seq2Seq(nn.Module):
  def __init__(self, 
               encoder : nn.Module, 
               decoder : nn.Module, 
               source_pad_idx : int, 
               target_sos : int, 
               target_eos : int, 
               target_pad_idx : int,
               device):
    ''' Constructs the Seq2Seq model

        Parameters
        ----------
        encoder : nn.Module
          The encoder
        decoder : nn.Module
          The decoder
        source_pad_idx : int
          The padding ID in the source language
        target_sos : int
          The SOS token in the target language
        target_eos : int
          The EOS token in the target language
        target_pad_idx : int
          The padding ID in the target language
        device: torch.device
          The device to run the model on
    '''
    super().__init__()

    self.encoder = encoder
    self.decoder = decoder

    self.source_pad_idx = source_pad_idx

    self.target_sos = target_sos
    self.target_eos = target_eos
    self.target_pad_idx = target_pad_idx

    self.device = device

  def get_source_padding_mask(self, src, src_lens):
    ''' Makes a mask for our source sequences, where src_mask[i, 0, 0, j] = 1 
        if src[i, j] is not a padding token; else 0

        Parameters
        ----------
        src : torch.Longtensor(N, S)
          A batch of source sequences
        src_lens : torch.LongTensor(N, )
          The lengths of each source sequence in the current batch

        Returns
        -------
        src_mask : torch.LongTensor(N, 1, 1, S)
          The masks of each source sequence in the current batch
    '''
    batch_size, seq_len = src.size()

    # Extend src_lens from (N, ) to (N, S)
    src_lens = src_lens.unsqueeze(1).repeat(1, seq_len)

    # Make a mask where src_mask[i, j] = 1 if src[i, j] == src padding token; else 0
    src_mask = torch.arange(seq_len).to(self.device)
    src_mask = src_mask.unsqueeze(0).repeat(batch_size, 1)
    src_mask = src_mask < src_lens

    # Extend src_mask from (N, S) to (N, 1, 1, S)
    src_mask = src_mask.unsqueeze(1).unsqueeze(2)
    
    batch_size, seq_len = src.size()
    assert src_mask.size() == (batch_size, 1, 1, seq_len)

    return src_mask

  def get_target_padding_mask(self, trg, trg_lens):
    ''' Makes a mask for our target sequences, where trg_mask[i, j] = 1 
        if a word exists in sequence i; else 0

        It relies on self.trg_pad_idx to see if the j-th spot of sequence i 
        in trg[] is a valid word or not

        Parameters
        ----------
        trg : torch.LongTensor(N, S')
          A batch of target sequences
        trg_lens : torch.LongTensor(N, )
          The lengths of each target sequence in the current batch

        Returns
        -------
        trg_mask : torch.LongTensor(N, 1, S', S')
          The mask of each target sequence in the current batch
    '''
    batch_size, seq_len = trg.size()

    # Extend src_lens from (N, ) to (N, S')
    trg_lens = trg_lens.unsqueeze(1).repeat(1, seq_len)

    # Make a mask where trg_pad_mask[i, j] = 1 if trg[i, j] == trg padding token; else 0
    trg_pad_mask = torch.arange(seq_len).to(self.device)
    trg_pad_mask = trg_pad_mask.unsqueeze(0).repeat(batch_size, 1)
    trg_pad_mask = trg_pad_mask < trg_lens

    # Extend the mask from (N, S') to (N, 1, S', 1)
    trg_pad_mask = trg_pad_mask.unsqueeze(1).unsqueeze(3)
    
    # Apply a sub mask on the lower triangular matrix (Size: (S', S'))    
    trg_sub_mask = torch.tril(torch.ones((seq_len, seq_len), device=self.device)).bool()
    
    assert trg_sub_mask.size() == (seq_len, seq_len)

    # Apply a mask on the inputs (Size: (N, 1, S', S'))      
    trg_mask = trg_pad_mask & trg_sub_mask

    assert trg_mask.size() == (batch_size, 1, seq_len, seq_len) 
    
    return trg_mask

  def forward(self, src, src_lens, trg=None, trg_lens=None):
    ''' Performs forward propogation of our model
        If trg == None, it will get the logits from the greedy search

        Parameters
        ----------
        src : torch.Longtensor(N, S)
          A batch of source sequences
        src_lens : torch.LongTensor(N, )
          The lengths of each source sequence in the current batch
        trg : torch.Longtensor(N, S')
          A batch of target sequences
        trg_lens : torch.LongTensor(N, )
          The lengths of each target sequence in the current batch

        Returns
        -------
        logits : torch.LongTensor(N, S', target_vocab_size)
          The set of logits for the target sequence
    '''
    if self.training and trg is None:
      raise ValueError("Expected target sequence (trg) must be set!")

    if trg is not None:
      return self.get_logits_from_teacher_forcing(src, src_lens, trg, trg_lens)
    
    else:
      return self.get_logits_from_greedy_search(src, src_lens)

  def get_logits_from_teacher_forcing(self, src, src_lens, trg, trg_lens):
    ''' Translates a batch of sequences in the source language to a target language
        with the help of its expected translated sequences.

        NOTE: This should only be used in training

        Parameters
        ----------
        src : torch.Longtensor(N, S)
          A batch of source sequences
        src_lens : torch.LongTensor(N, )
          The lengths of each source sequence in the current batch
        trg : torch.Longtensor(N, S')
          A batch of target sequences
        trg_lens : torch.LongTensor(N, )
          The lengths of each target sequence in the current batch

        Returns
        -------
        logits : torch.LongTensor(N, S', target_vocab_size)
          The set of logits for the predicted target sequence
    '''
    assert src.size()[0] == trg.size()[0]

    src_mask = self.get_source_padding_mask(src, src_lens)
    encoded_src = self.encoder(src, src_mask)
    
    trg_mask = self.get_target_padding_mask(trg, trg_lens)
    logits = self.decoder(trg, encoded_src, trg_mask, src_mask)
    
    return logits

  def get_logits_from_greedy_search(self, src, src_lens, max_len=50):
    ''' Translates a batch of sentences in a source language to a target language

        Parameters
        ----------
        src : torch.Longtensor(N, S)
          A batch of source sequences
        src_lens : torch.LongTensor(N, )
          The lengths of each source sequence in the current batch
        max_len : int (optional)
          The max length of the target sequence

        Returns
        -------
        logits : torch.LongTensor(N, S', target_vocab_size)
          The set of logits for the target sequence
    '''
    batch_size = src.size()[0]

    # All inputs to the decoder starts with SOS
    trg = torch.tensor([[self.target_sos] for _ in range(batch_size)]).to(self.device)
    logits = None

    # Encode the source sequences
    src_mask = self.get_source_padding_mask(src, src_lens)
    encoded_src = self.encoder(src, src_mask)

    # The lengths of each target sequence
    # where trg_lens[i] = the length of sequence i from index 0 to its first EOS inclusive
    trg_lens = torch.tensor([max_len + 100 for _ in range(batch_size)]).to(self.device)

    # Decode the target sequences
    cur_len = 1
    while cur_len < max_len and torch.any(trg_lens == max_len + 100):

      trg_mask = self.get_target_padding_mask(trg, trg_lens * 0 + cur_len)      
      cur_logits = self.decoder(trg, encoded_src, trg_mask, src_mask)     # (N, cur_len, self.target_vocab_size)
      last_logits = cur_logits[:, cur_len - 1, :]                         # (N, self.target_vocab_size)

      # Get the best tokens
      best_tokens = last_logits.argmax(1)                # (N, )

      # See which has newly ended and update trg_lens (Size: (N, ))
      new_trg_lens = (best_tokens == self.target_eos).long() * cur_len
      condition = (new_trg_lens < trg_lens) & (best_tokens == self.target_eos)
      trg_lens = torch.where(condition, new_trg_lens, trg_lens)

      trg = torch.cat([trg, best_tokens.unsqueeze(-1)], dim=1)
      logits = cur_logits
      cur_len += 1

      assert trg.size() == (batch_size, cur_len)

    return logits

  def get_logits_from_beam_search(self, src, src_lens, max_len=50, beam_width=3):
    ''' Translates a batch of sentences in a source language to a target language

        Parameters
        ----------
        src : torch.Longtensor(N, S)
          A batch of source sequences
        src_lens : torch.LongTensor(N, )
          The lengths of each source sequence in the current batch
        max_len : int (optional)
          The max length of the target sequence
        beam_width : int (optional)
          The beam width

        Returns
        -------
        logits : torch.LongTensor(N, beam_width, S', target_vocab_size)
          The set of logits for the target sequence per beam width
    '''
    raise NotImplementedError()


## **Training:**
We are going to train our model and see if for each epoch it improves its predictions on the validation set

How we train it:
* It uses teacher forcing:
  1. First, we have the source and target sentences
  2. Then, we feed the source sentence into the Encoder. The encoder returns the attended source sentence.
  3. Next, we feed the attended source sentence and the target sentence into the decoder
  4. We check if the output of the decoder is the same as the target sentence

How we make predictions:
* It is similar to RNNs, where we feed in the source input to the encoder, feed in an SOS in the decoder, and we take the outputs of the previous decoder as inputs to the next decoder
* In more detail:
  1. First, we have the source sentence
  2. Then, we feed the source sentence into the Encoder to get an attended version of the source sentence
  3. Next, we feed an SOS token and the attended source sentence in the Decoder as the first input to our decoder
  4. We get the outputs of the decoder and use it as the next token to feed as the second input to our decoder
  5. Repeat 3-4 until we get an EOS token



### **How to build our model:**

In [0]:
from torchtext.data.metrics import bleu_score
from tqdm.notebook import tqdm
import torch.nn.functional as F


def make_model(source_vocab_size : int, 
               target_vocab_size : int, 
               source_pad_id : int, 
               target_sos_id : int, 
               target_eos_id : int, 
               target_pad_id : int, 
               device : torch.device):
  ''' Builds our Transformer Seq2Seq model

      Parameters
      ----------
      source_vocab_size : int
        The vocab size in the source language
      target_vocab_size : int
        The vocab size in the target language
      source_pad_id : int
        The ID of a padding token in the source language
      target_sos_id : int
        The ID of a start-of-sequence token in the target language
      target_eos_id : int
        The ID of an end-of-sequence token in the target language
      target_pad_id : int
        The ID of a padding token in the target language
      device : torch.device
        The device to run the model on
        
      Returns
      -------
      model : Seq2Seq
        The model
  '''
  word_embedding_size = 256

  num_encoder_layers = 3
  num_encoder_heads = 8
  encoder_pf_dim = 512
  encoder_dropout = 0.1

  num_decoder_layers = 3
  num_decoder_heads = 8
  decoder_pf_dim = 512
  decoder_dropout = 0.1
  
  encoder = Encoder(source_vocab_size, 
                    word_embedding_size, 
                    num_encoder_layers, 
                    num_encoder_heads, 
                    encoder_pf_dim, 
                    encoder_dropout, 
                    device)
  
  decoder = Decoder(target_vocab_size, 
                    word_embedding_size, 
                    num_decoder_layers, 
                    num_decoder_heads, 
                    decoder_pf_dim, 
                    decoder_dropout, 
                    device)
  
  model = Seq2Seq(encoder, 
                  decoder,
                  source_pad_id, 
                  target_sos_id, 
                  target_eos_id, 
                  target_pad_id, 
                  device)
  model.to(device)

  return model

### **How to train our model for one epoch:**

In [0]:
from torchtext.data.metrics import bleu_score
from tqdm.notebook import tqdm
import torch.nn.functional as F


def train_for_one_epoch(model, loss_function, optimizer, train_dataloader, device):
  ''' Trains the model on the training set

      Parameters
      ----------
      model : Seq2Seq
        The model
      loss_function : torch.nn.LossFunction
        The loss function (ex: torch.nn.CrossEntropyLoss)
      optimizer : torch.nn.Optimizer
        The optimizer (ex: SGD, AdamOptimizer, etc)
      train_dataloader : Seq2SeqDataLoader
        The dataloader for the training set
      device : torch.device
        The device to run predictions on

      Returns
      -------
      loss : float
        The loss for the training set
  '''
  
  model.train()
  train_loss = 0.0

  for _, (src, src_lens, trg, trg_lens) in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):

    # Send the data to the specified device
    src = src.to(device)
    src_lens = src_lens.to(device)
    trg = trg.to(device)
    trg_lens = trg_lens.to(device)

    # Zeros out the model's previous gradient
    optimizer.zero_grad()

    # Get the logits
    logits = model(src, src_lens, trg=trg[:, : -1], trg_lens=trg_lens)

    # Flatten the logits so that it is (b * (t - 1), target_vocab_size)
    flattened_logits = logits.contiguous().view(-1, logits.shape[2])

    # Remove the SOS and flatten it so that it is (b * (t - 1))
    flattened_trg = trg[:, 1:].contiguous().view(-1)

    # Compute loss
    loss = loss_function(flattened_logits, flattened_trg)
    train_loss += loss.item()

    # Backward prop
    loss.backward()
    optimizer.step()

    del src, src_lens, trg, trg_lens, loss, logits

  return train_loss / len(train_dataloader)

### **How to evaluate our model:**

In [0]:
from torchtext.data.metrics import bleu_score
from tqdm.notebook import tqdm
import torch.nn.functional as F


def evaluate_model(model, 
                   loss_function, 
                   test_dataloader, 
                   device, 
                   target_sos, 
                   target_eos, 
                   target_pad_id, 
                   target_id2word):
  ''' Evaluates the model on the test set

      Parameters
      ----------
      model : Seq2Seq
        The model
      loss_function : torch.nn.LossFunction
        The loss function (ex: torch.nn.CrossEntropyLoss)
      test_dataloader : Seq2SeqDataLoader
        The dataloader for the test set
      device : torch.device
        The device to run predictions on
      target_sos : int
        The ID of a SOS token in the target language
      target_eos : int
        The ID of an EOS token in the target language
      target_pad_id : int
        The ID of a padding token in the target language
      target_id2word : { int : str }
        A mapping of word IDs in the target language to its string representative

      Returns
      -------
      loss, bleu_score : float, float
        The loss and bleu score for the test set
  '''
  
  model.eval() 
  bleu = 0
  eval_loss = 0

  with torch.no_grad():
    for _, (src, src_lens, trg, trg_lens) in tqdm(enumerate(test_dataloader), total=len(test_dataloader)):

      # Send the data to the specified device
      src = src.to(device)
      src_lens = src_lens.to(device)
      trg = trg.to(device)
      trg_lens = trg_lens.to(device)

      # Get the logits
      logits = model(src, src_lens, trg=trg[:, : -1], trg_lens=trg_lens)

      # Flatten the logits so that it is (b * (t - 1), model.target_vocab_size)
      flattened_logits = logits.contiguous().view(-1, logits.shape[2])

      # Remove the SOS and flatten it so that it is (b * (t - 1))
      flattened_trg = trg[:, 1:].contiguous().view(-1)

      # Compute loss
      loss = loss_function(flattened_logits, flattened_trg)
      eval_loss += loss.item()

      # Compute BLEU score
      bleu += compute_batch_bleu_score(model, src, src_lens, trg, target_id2word, target_sos, target_eos, target_pad_id, device)

      del src, src_lens, trg, trg_lens, loss, logits
      
  eval_loss /= len(test_dataloader)
  bleu /= len(test_dataloader)

  return eval_loss, bleu

def compute_batch_bleu_score(model, src, src_lens, trg, target_id2word, target_sos, target_eos, target_pad_id, device):
  ''' Computes the BLEU score on a batch of sequences

      Parameters
      ----------
      src : torch.tensor(N, S)
        A batch of source sequences
      src_lens : torch.LongTensor(N, )
        The lengths of each source sequence in the current batch
      trg : torch.tensor(N, S')
        A batch of expected target sequences
      target_id2word : { int : str }
        A mapping of word IDs in the target language to its string representative
      target_sos : int
        The ID of a SOS token in the target language
      target_eos : int
        The ID of an EOS token in the target language
      target_pad_id : int
        The ID of a padding token in the target language
      device : torch.device
        The device to run predictions on

      Returns
      -------
      bleu_score : float
        The bleu score for the batch of sequences
  '''
  # Get predicted output and add EOS to the end of each sequence (in case any seq doesn't have an EOS)
  logits = model(src, src_lens)
  predicted_trg = logits.argmax(2)

  # Remove SOS token
  expected_trg = trg[:, 1:]

  # Move to the CPU
  predicted_trg = predicted_trg.cpu().tolist()
  expected_trg = expected_trg.cpu().tolist()

  # Populate lst for bleu score
  predicted_seqs = []
  expected_seqs = []

  for i in range(len(predicted_trg)):

    predicted_seq = predicted_trg[i]
    expected_seq = expected_trg[i]

    # Remove the EOS
    if target_eos in predicted_seq:
      predicted_seq = predicted_seq[: predicted_seq.index(target_eos)]
    if target_eos in expected_seq:
      expected_seq = expected_seq[: expected_seq.index(target_eos)]

    # Convert IDs to words
    predicted_seq = [target_id2word.get(id_, "NAN") for id_ in predicted_seq]
    expected_seq = [target_id2word.get(id_, "NAN") for id_ in expected_seq]

    predicted_seqs.append(predicted_seq)
    expected_seqs.append([expected_seq])

  return bleu_score(predicted_seqs, expected_seqs)


### **How to train our model for many epochs:**

In [13]:
from tqdm.notebook import tqdm

def train():
  ''' The main function to train and test our model.
      It can perform early-stopping instead of hard-coding the number of epochs to run for

      Returns
      -------
      model : Seq2Seq
        The trained Seq2Seq model
  '''
  global source_vocabs, target_vocabs
  global dataset, train_dataset, val_dataset
  global train_dataloader, val_dataloader

  device = torch.device("cuda")

  model = make_model(dataset.source_vocab_size, 
                     dataset.target_vocab_size,
                     dataset.source_pad_id, 
                     dataset.target_sos,
                     dataset.target_eos,
                     dataset.target_pad_id,
                     device)
  
  patience = 3 #float("inf")
  num_epochs = float("inf")

  best_val_loss = float("inf")

  num_poor = 0
  epoch = 1
  
  optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)

  while epoch <= num_epochs and num_poor < patience:
      
    # Train
    loss_function = torch.nn.CrossEntropyLoss(ignore_index=dataset.target_pad_id)
    train_loss = train_for_one_epoch(model, 
                                     loss_function, 
                                     optimizer, 
                                     train_dataloader, 
                                     device)
    
    # Evaluate the model
    val_loss, val_bleu = evaluate_model(model, 
                                        loss_function, 
                                        val_dataloader, 
                                        device, 
                                        dataset.target_sos, 
                                        dataset.target_eos,
                                        dataset.target_pad_id,
                                        target_vocabs.get_id2word())

    print(f"Epoch {epoch}: Train loss={train_loss}, Val loss={val_loss}, Val BLEU={val_bleu}")

    if val_loss > best_val_loss:
      num_poor += 1

    else:
      num_poor = 0
      best_val_loss = val_loss

    epoch += 1

  return model

trained_model = train()



HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 1: Train loss=4.829797588331851, Val loss=4.019924899627423, Val BLEU=0.07824927062354849


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 2: Train loss=3.7660733261549404, Val loss=3.377837306466596, Val BLEU=0.13099991108494288


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 3: Train loss=3.243533132393236, Val loss=3.008571817957122, Val BLEU=0.16806063581843336


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 4: Train loss=2.872940674682573, Val loss=2.74376511984858, Val BLEU=0.20142291382050853


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 5: Train loss=2.5676805504484674, Val loss=2.5586785324688615, Val BLEU=0.2306284617083488


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 6: Train loss=2.310778110358067, Val loss=2.4027339466686906, Val BLEU=0.24768150766502692


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 7: Train loss=2.1018241240799083, Val loss=2.285192982903842, Val BLEU=0.27027300271563864


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 8: Train loss=1.9187282844085913, Val loss=2.1851467839602767, Val BLEU=0.29221272883658145


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 9: Train loss=1.7561732602946332, Val loss=2.1181903337610177, Val BLEU=0.30411030188321037


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 10: Train loss=1.621830485459697, Val loss=2.0600934090285468, Val BLEU=0.3158219502714136


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 11: Train loss=1.5022827094000888, Val loss=2.019174390825732, Val BLEU=0.32712729654684874


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 12: Train loss=1.407898038797985, Val loss=2.0036077838519524, Val BLEU=0.33240368932356235


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 13: Train loss=1.3210895230315325, Val loss=1.9749235079206269, Val BLEU=0.34271181671609086


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 14: Train loss=1.2502033064475637, Val loss=1.9576816291644656, Val BLEU=0.35236664130812395


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 15: Train loss=1.1891738300378611, Val loss=1.950673339695766, Val BLEU=0.3528871710238762


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 16: Train loss=1.1325603864785563, Val loss=1.9223674268558109, Val BLEU=0.36831577589328


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 17: Train loss=1.0868108026209595, Val loss=1.9321869560356797, Val BLEU=0.36434273806268097


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 18: Train loss=1.0412763961822311, Val loss=1.922632688078387, Val BLEU=0.3740773429587025


HBox(children=(IntProgress(value=0, max=346), HTML(value='')))




HBox(children=(IntProgress(value=0, max=116), HTML(value='')))


Epoch 19: Train loss=1.0011622173593224, Val loss=1.9348515395460457, Val BLEU=0.3740988789128009


## **Testing**
We are going to test our model on the test set

In [0]:
def test():
  ''' Used to test the model on the testing data'''
  
  global source_vocabs, target_vocabs
  global dataset, test_dataset, test_dataloader
  global trained_model

  device = torch.device("cuda")
    
  # The loss function
  loss_function = torch.nn.CrossEntropyLoss(ignore_index=test_dataset.target_pad_id)

  # Evaluate the model
  test_loss, test_bleu = evaluate_model(trained_model, 
                                        loss_function, 
                                        test_dataloader, 
                                        device, 
                                        test_dataset.target_sos, 
                                        test_dataset.target_eos,
                                        test_dataset.target_pad_id,
                                        target_vocabs.get_id2word())

  print(f"Test loss={test_loss}, Test Bleu={test_bleu}")

test()

HBox(children=(IntProgress(value=0, max=16), HTML(value='')))


Test loss=1.5746025443077087, Test Bleu=0.42946506908385707
