> **Copyright (c) 2020 Skymind Holdings Berhad**<br><br>
> **Copyright (c) 2021 Skymind Education Group Sdn. Bhd.**<br>
<br>
Licensed under the Apache License, Version 2.0 (the "License");
<br>you may not use this file except in compliance with the License.
<br>You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0/
<br>
<br>Unless required by applicable law or agreed to in writing, software
<br>distributed under the License is distributed on an "AS IS" BASIS,
<br>WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
<br>See the License for the specific language governing permissions and
<br>limitations under the License.
<br>
<br>
**SPDX-License-Identifier: Apache-2.0**
<br>

# Introduction

In this notebook, we are going to build a neural machine translation (NMT) model using the Transformer architecture in PyTorch. The model translates English to French.

Let's get started.

![Language Modelling](../../../images/NMT.gif)

# What will we accomplish?

Steps to implement neural machine translation using the Transformer with PyTorch:

> Step 1: Load and preprocess dataset

> Step 2: Build the transformer architecture

> Step 3: Train the transformer model

> Step 4: Test the trained model

# Notebook Content

* [Load Dataset](#Load-Dataset)


* [Tokenization](#Tokenization)


* [Preprocessing](#Preprocessing)

    * [Train-Test Split](#Train-Test-Split)
    * [TabularDataset](#TabularDataset)
    * [BucketIterator](#BucketIterator)
    * [Custom Iterator](#Custom-Iterator)


* [Dive Deep into Transformer](#Dive-Deep-into-Transformer)

    * [Embedding](#Embedding)
    * [Positional Encoding](#Positional-Encoding)
    * [Masking](#Masking)
        * [Input Masks](#Input-Masks)
        * [Target Sequence Masks](#Target-Sequence-Masks)
    * [Multi-Headed Attention](#Multi-Headed-Attention)
    * [Attention](#Attention)
    * [Feed-Forward Network](#Feed-Forward-Network)
    * [Normalisation](#Normalisation)
    
    
* [Building Transformer](#Building-Transformer)
    
    * [EncoderLayer](#EncoderLayer)
    * [DecoderLayer](#DecoderLayer)
    * [Encoder](#Encoder)
    * [Decoder](#Decoder)
    * [Transformer](#Transformer)
    
    
* [Training the Model](#Training-the-Model)


* [Testing the Model](#Testing-the-Model)

# Load Dataset

The dataset we use is the [French-English parallel corpus](http://www.statmt.org/europarl/v7/fr-en.tgz) from the [European Parliament Proceedings Parallel Corpus (1996–2011)](http://www.statmt.org/europarl/). It contains 15 years of write-ups from E.U. proceedings, weighing in at 2,007,724 sentences and 50,265,039 words. You should find the dataset in the `datasets` folder; otherwise you may download it [here](http://www.statmt.org/europarl/v7/fr-en.tgz). You will have the following files after extracting the downloaded archive:

1. europarl-v7.fr-en.en
2. europarl-v7.fr-en.fr

![](../../../images/fr-en.png)

Now we are going to load the dataset for preprocessing.

In [1]:
with open('../../../resources/day_11/fr-en/europarl-v7.fr-en.en', encoding='utf-8') as f:
    europarl_en = f.read().split('\n')
with open('../../../resources/day_11/fr-en/europarl-v7.fr-en.fr', encoding='utf-8') as f:
    europarl_fr = f.read().split('\n')

# Tokenization

The first job we need done is to **create a tokenizer for each language**. This is a function that will split the text into separate words and assign them unique numbers (indexes). This number will come into play later when we discuss embeddings.

![Tokenization](../../../images/tokenize.png)

Here we will tokenize the text using **Torchtext** and **Spacy** together. Spacy is a library that has been specifically built to take sentences in various languages and split them into different tokens (see [here](https://spacy.io/) for more information). Without Spacy, Torchtext defaults to a simple `.split(' ')` method for tokenization. This is much less nuanced than Spacy’s approach, which will also split words like “don’t” into “do” and “n’t”, and much more.
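
To make the idea of "split, then index" concrete, here is a minimal sketch in plain Python. It uses a whitespace split in place of Spacy, and `simple_tokenize` and `build_vocab` are hypothetical helpers written for this illustration; the special tokens and the ordering rule are assumptions, not Torchtext's exact behaviour.

```python
from collections import Counter

def simple_tokenize(sentence):
    # Stand-in for the Spacy tokenizer: a plain whitespace split
    return sentence.split(' ')

def build_vocab(sentences, specials=('<unk>', '<pad>')):
    # Count tokens, then assign indexes by descending frequency (ties broken alphabetically)
    counts = Counter(tok for s in sentences for tok in simple_tokenize(s))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

stoi, itos = build_vocab(["the cat sat", "the dog sat down"])

# Unknown words fall back to the '<unk>' index
encoded = [stoi.get(tok, stoi['<unk>']) for tok in simple_tokenize("the cat ran")]
print(encoded)
```

The two lookup tables play the same roles as the `stoi` and `itos` attributes used later in this notebook: one maps a token to its index, the other maps an index back to its token.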

In [2]:
import spacy
import torchtext
import torch
import numpy as np
from torchtext.legacy.data import Field, BucketIterator, TabularDataset

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [4]:
# !python -m spacy download fr_core_news_lg
# !python -m spacy download en_core_web_lg

In [5]:
en = spacy.load('en_core_web_lg')
fr = spacy.load('fr_core_news_lg')

def tokenize_en(sentence):
    return [tok.text for tok in en.tokenizer(sentence)]

def tokenize_fr(sentence):
    return [tok.text for tok in fr.tokenizer(sentence)]

EN_TEXT = Field(tokenize=tokenize_en)
FR_TEXT = Field(tokenize=tokenize_fr, init_token = "<sos>", eos_token = "<eos>")

# Preprocessing

The best way to work with Torchtext is to turn your data into **spreadsheet format**, no matter the original format of your data file. This is due to the incredible versatility of the **Torchtext TabularDataset** function, which creates datasets from spreadsheet formats. So first we turn our data into an appropriate CSV file.

In [6]:
import pandas as pd

raw_data = {'English' : [line for line in europarl_en], 'French': [line for line in europarl_fr]}
df = pd.DataFrame(raw_data, columns=["English", "French"])

In [7]:
# Remove very long sentences and sentences where translations are not of roughly equal length
df['eng_len'] = df['English'].str.count(' ')
df['fr_len'] = df['French'].str.count(' ')
df = df.query('fr_len < 80 & eng_len < 80')
df = df.query('fr_len < eng_len * 1.5 & fr_len * 1.5 > eng_len')
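
The filtering rule in the cell above can be checked on individual sentence pairs. The sketch below restates it as a plain Python function; `keep_pair` is a hypothetical helper and the example sentences are made up for illustration.

```python
def keep_pair(english, french, max_spaces=80, ratio=1.5):
    # Approximate sentence length by counting spaces, as the cell above does
    eng_len = english.count(' ')
    fr_len = french.count(' ')
    if eng_len >= max_spaces or fr_len >= max_spaces:
        return False
    # Keep the pair only if neither side is more than `ratio` times the other
    return fr_len < eng_len * ratio and fr_len * ratio > eng_len

pairs = [
    ("I agree with the report", "Je suis d'accord avec le rapport"),
    ("Thank you", "Je vous remercie beaucoup pour votre intervention très claire"),
]
kept = [keep_pair(en, fr) for en, fr in pairs]
print(kept)
```

One side effect worth knowing: because lengths are measured in spaces and the comparison is strict, a pair of single-word sentences (zero spaces on both sides) is also dropped.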

## Train-Test Split

Now we are going to split the data into train set and test set. Fortunately Sklearn and Torchtext together make this process incredibly easy.

In [8]:
from sklearn.model_selection import train_test_split

# Create train and test sets
train, test = train_test_split(df, test_size=0.1)
train.to_csv("../../../resources/day_11/train.csv", index=False)
test.to_csv("../../../resources/day_11/test.csv", index=False)

This creates a train and test csv each with two columns (English, French), where each row contains an English sentence in the 'English' column, and its French translation in the 'French' column.

## TabularDataset

Calling `TabularDataset.splits` then returns a train and test dataset with the respective data loaded into them, processed (tokenized) according to the fields we defined earlier.

In [9]:
# Associate the text in the 'English' column with the EN_TEXT field,
# and 'French' with FR_TEXT

data_fields = [('English', EN_TEXT), ('French', FR_TEXT)]

train, test = TabularDataset.splits(path='../../../resources/day_11', train='train.csv', validation='test.csv', 
                                    format='csv', fields=data_fields)

Processing a few million words can take a while so grab a cup of tea here…

In [10]:
FR_TEXT.build_vocab(train, test)
EN_TEXT.build_vocab(train, test)

To see what numbers the tokens have been assigned and vice versa in each field, we can use the field's `vocab.stoi` (string to index) and `vocab.itos` (index to string).

In [11]:
print(EN_TEXT.vocab.stoi['the'])

2


In [12]:
print(EN_TEXT.vocab.itos[11])

a


## BucketIterator

**BucketIterator** defines an iterator that batches examples of similar lengths together.

It minimizes the amount of padding needed while producing freshly shuffled batches for each new epoch. See Torchtext's `pool` function for the bucketing procedure used.

In [13]:
train_iter = BucketIterator(train, batch_size=20, sort_key=lambda x: len(x.French), shuffle=True)

The `sort_key` dictates how to form each batch. The lambda function tells the iterator to try and find sentences of the **same length** (meaning more of the matrix is filled with useful data and less with padding).
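
To see why grouping by length matters, the sketch below counts how many padding tokens are needed when sentences are batched in arrival order versus after sorting by length. The lengths are hypothetical and `padding_needed` is an illustrative helper, not part of Torchtext.

```python
def padding_needed(lengths, batch_size):
    # Total pad tokens when consecutive groups of `batch_size`
    # sentences are each padded to their longest member
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        total += sum(max(chunk) - n for n in chunk)
    return total

lengths = [3, 21, 5, 18, 4, 20, 6, 19]   # hypothetical sentence lengths in tokens

unsorted_pad = padding_needed(lengths, batch_size=4)
sorted_pad = padding_needed(sorted(lengths), batch_size=4)
print(unsorted_pad, sorted_pad)
```

With these numbers, sorting first cuts the padding from 68 tokens to 12, which is the effect the `sort_key` is meant to approximate.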

In [14]:
batch = next(iter(train_iter))

print(batch.English)

tensor([[ 315, 1296,  315,  ..., 3004,   68,   68],
        [  13,    3,  579,  ...,   84, 1225,  474],
        [ 152,   15, 1515,  ...,  813, 1024, 1407],
        ...,
        [   1,   85,    1,  ...,    2,    1,    1],
        [   1,    4,    1,  ...,  146,    1,    1],
        [   1,    1,    1,  ...,    4,    1,    1]])


In [15]:
print("Number of columns:", len(batch))

Number of columns: 20


In each batch, sentences have been transposed so they are descending vertically (important: we will need to transpose these again to work with transformer). **Each index represents a token (word)**, and **each column represents a sentence**. We have 20 columns, as 20 was the batch_size we specified.

You might notice all the ‘1’s and wonder which incredibly common word this index stands for. The ‘1’ is of course not a word, but purely **padding**. While Torchtext is brilliant, its `sort_key`-based batching leaves a little to be desired. Often the sentences aren’t of the same length at all, and you end up feeding a lot of padding into your network (as you can see with all the 1s in the output above). We will solve this by implementing our own iterator.

## Custom Iterator

The custom iterator is based on the code from http://nlp.seas.harvard.edu/2018/04/03/attention.html. Feel free to explore that page for a deeper understanding of the `MyIterator` class.

In [16]:
from torchtext.legacy import data

global max_src_in_batch, max_tgt_in_batch

def batch_size_fn(new, count, sofar):
    "Keep augmenting batch and calculate total number of tokens + padding."
    global max_src_in_batch, max_tgt_in_batch
    if count == 1:
        max_src_in_batch = 0
        max_tgt_in_batch = 0
    max_src_in_batch = max(max_src_in_batch,  len(new.English))
    max_tgt_in_batch = max(max_tgt_in_batch,  len(new.French) + 2)
    src_elements = count * max_src_in_batch
    tgt_elements = count * max_tgt_in_batch
    return max(src_elements, tgt_elements)

class MyIterator(data.Iterator):
    
    def create_batches(self):
        if self.train:
            def pool(d, random_shuffler):
                for p in data.batch(d, self.batch_size * 100):
                    p_batch = data.batch(
                        sorted(p, key=self.sort_key),
                        self.batch_size, self.batch_size_fn)
                    for b in random_shuffler(list(p_batch)):
                        yield b
            self.batches = pool(self.data(), self.random_shuffler)     
        else:
            self.batches = []
            for b in data.batch(self.data(), self.batch_size, self.batch_size_fn):
                self.batches.append(sorted(b, key=self.sort_key))

In [17]:
train_iter = MyIterator(train, batch_size= 64, device=device, repeat=False, 
                        sort_key= lambda x: (len(x.English), len(x.French)),
                        batch_size_fn=batch_size_fn, train=True, shuffle=True)
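
When a `batch_size_fn` is supplied, `batch_size` is measured in whatever that function returns, so in the cell above it appears to act as a budget of tokens (padding included) rather than a number of sentences. The sketch below illustrates the idea on plain lists of lengths; `token_budget_batches` is a simplified, hypothetical stand-in and not Torchtext's actual batching code.

```python
def token_budget_batches(lengths, max_tokens):
    # Group sentence lengths so that (number of sentences) x (longest sentence)
    # stays within max_tokens; a sentence that exceeds the budget gets its own batch
    batches, current = [], []
    for n in lengths:
        candidate = current + [n]
        if current and len(candidate) * max(candidate) > max_tokens:
            batches.append(current)
            current = [n]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches

batches = token_budget_batches([4, 5, 5, 6, 12, 13, 30], max_tokens=32)
print(batches)
```

Short sentences end up in batches with many rows and long sentences in batches with few rows, so each batch carries a similar amount of work.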

# Dive Deep into Transformer

![Transformer](../../../images/transformer.png)

The diagram above shows the overview of the Transformer model. The inputs to the encoder will be the **English** sentence, and the 'Outputs' from the decoder will be the **French** sentence.

## Embedding

Embedding words has become standard practice in NMT, feeding the network with far more information about words than a one-hot encoding would.

![Embedding Layer](../../../images/embeddings.gif)

In [18]:
from torch import nn

class Embedder(nn.Module):
    def __init__(self, vocab_size, embedding_dimension):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dimension)
        
    def forward(self, x):
        return self.embed(x)

When each word is fed into the network, this code will perform a look-up and retrieve its embedding vector. These vectors will then be learnt as parameters by the model, adjusted with each iteration of gradient descent.

## Positional Encoding

In order for the model to make sense of a sentence, it needs to know two things about each word: what does the **word mean**? And what is its **position** in the sentence?

The embedding vector for each word will **learn the meaning**, so now we need to input something that tells the network about the word’s position.

*Vaswani et al.* answered this problem by using these functions to create a constant of position-specific values:

![Position Encoding](../../../images/pos_encoding_1.png)

![Position Encoding](../../../images/pos_encoding_2.png)

This constant is a 2D matrix. Pos refers to the order in the sentence, and $i$ refers to the position along the embedding vector dimension. Each value in the pos/i matrix is then worked out using the equations above.

![Position Encoding](../../../images/pos_encoding_3.png)
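
As a small worked example, the sketch below builds the position matrix in plain Python following the sine/cosine formulation of the Attention Is All You Need paper, where each pair of neighbouring dimensions shares one frequency. The function name `positional_encoding` is chosen for this illustration, and an even `d_model` is assumed.

```python
import math

def positional_encoding(max_seq_len, d_model):
    # pe[pos][2k]     = sin(pos / 10000^(2k / d_model))
    # pe[pos][2k + 1] = cos(pos / 10000^(2k / d_model))
    pe = [[0.0] * d_model for _ in range(max_seq_len)]
    for pos in range(max_seq_len):
        for k in range(d_model // 2):
            angle = pos / (10000 ** (2 * k / d_model))
            pe[pos][2 * k] = math.sin(angle)
            pe[pos][2 * k + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_seq_len=4, d_model=6)
print([round(v, 3) for v in pe[1]])
```

At position 0 every sine entry is 0 and every cosine entry is 1; further along the sentence the low dimensions oscillate quickly and the high dimensions slowly, which gives each position a distinct pattern.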

In [19]:
import math

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_len = 200, dropout = 0.1):
        super().__init__()
        self.d_model = d_model
        self.dropout = nn.Dropout(dropout)
        # Create constant 'pe' matrix with values dependent on pos and i
        
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            # i steps over the even dimensions, so it already plays the role of 2i
            # in the paper's formula; each sin/cos pair shares the exponent i/d_model
            for i in range(0, d_model, 2):
                pe[pos, i] = \
                math.sin(pos / (10000 ** (i / d_model)))
                pe[pos, i + 1] = \
                math.cos(pos / (10000 ** (i / d_model)))
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
 
    
    def forward(self, x):
        # Make embeddings relatively larger
        x = x * math.sqrt(self.d_model)
        # Add constant to embedding
        
        seq_len = x.size(1)
        # 'pe' is a registered buffer: it carries no gradient and moves
        # with the model across devices, so it can be added directly
        pe = self.pe[:, :seq_len]
        x = x + pe
        
        return self.dropout(x)

`PositionalEncoder` lets us add the **positional encoding to the embedding vector**, providing information about structure to the model.

The reason we increase the embedding values before addition is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together.

## Masking

Masking plays an important role in the transformer. It serves two purposes:

* In the encoder and decoder: To zero attention outputs wherever there is just padding in the input sentences.


* In the decoder: To prevent the decoder ‘peeking’ ahead at the rest of the translated sentence when predicting the next word.

![Masking](../../../images/masking.gif)

### Input Masks

In [20]:
batch = next(iter(train_iter))
input_seq = batch.English.transpose(0,1)
input_pad = EN_TEXT.vocab.stoi['<pad>']

# creates mask with 0s wherever there is padding in the input
input_msk = (input_seq != input_pad).unsqueeze(1)

### Target Sequence Masks

In [21]:
from torch.autograd import Variable

target_seq = batch.French.transpose(0,1)
target_pad = FR_TEXT.vocab.stoi['<pad>']
target_msk = (target_seq != target_pad).unsqueeze(1)

The initial input into the decoder will be the **target sequence** (the French translation). The way the decoder predicts each output word is by making use of all the encoder outputs and the French sentence only up until the point of each word it is predicting.

Therefore we need to prevent the first output predictions from being able to see later into the sentence. For this we use the `nopeak_mask`.

In [22]:
# Get seq_len for matrix
size = target_seq.size(1) 

nopeak_mask = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
nopeak_mask = (torch.from_numpy(nopeak_mask) == 0).to(device)

print(nopeak_mask)

target_msk = target_msk & nopeak_mask

tensor([[[ True, False, False,  ..., False, False, False],
         [ True,  True, False,  ..., False, False, False],
         [ True,  True,  True,  ..., False, False, False],
         ...,
         [ True,  True,  True,  ...,  True, False, False],
         [ True,  True,  True,  ...,  True,  True, False],
         [ True,  True,  True,  ...,  True,  True,  True]]], device='cuda:0')


In [23]:
def create_masks(src, trg):
    src_pad = EN_TEXT.vocab.stoi['<pad>']
    trg_pad = FR_TEXT.vocab.stoi['<pad>']
    
    src_mask = (src != src_pad).unsqueeze(-2)

    if trg is not None:
        trg_mask = (trg != trg_pad).unsqueeze(-2)
        
        # Get seq_len for matrix
        size = trg.size(1) 
        np_mask = nopeak_mask(size)
        
        # Move the mask to the same device as the target batch
        np_mask = np_mask.to(trg.device)
            
        trg_mask = trg_mask & np_mask
        
    else:
        trg_mask = None
    return src_mask, trg_mask

In [24]:
def nopeak_mask(size):
    np_mask = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
    np_mask =  Variable(torch.from_numpy(np_mask) == 0)
    
    return np_mask

If we later apply this mask to the attention scores, the values wherever the input is ahead will not be able to contribute when calculating the outputs.
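
The combination of the two masks can be illustrated on a single short sentence. This is a minimal plain-Python sketch; the token indexes are made up, the padding index of 1 is assumed from the batches printed earlier, and the helper names are hypothetical.

```python
PAD = 1  # assumed padding index, matching the 1s seen in the batches above

def pad_mask(seq, pad=PAD):
    # True where the token is real, False where it is padding
    return [tok != pad for tok in seq]

def no_peek_mask(size):
    # Row i may attend to columns 0..i only (lower triangle, diagonal included)
    return [[col <= row for col in range(size)] for row in range(size)]

def target_mask(seq, pad=PAD):
    # A position (row, col) is allowed only if col is a real token and not in the future
    keep = pad_mask(seq, pad)
    return [[allowed and keep[col] for col, allowed in enumerate(row)]
            for row in no_peek_mask(len(seq))]

mask = target_mask([2, 57, 943, 3, 1])  # hypothetical indexes; the last position is padding
for row in mask:
    print(''.join('1' if m else '0' for m in row))
```

Each row shows which positions one target word may look at: everything up to and including itself, minus any padding.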

## Multi-Headed Attention

Once we have our embedded values (with positional encodings) and our masks, we can start building the layers of our model.
Here is an overview of the multi-headed attention layer:

![Multi-Headed Attention](../../../images/multi-head-attention.png)

In the multi-headed attention layer, each **input is split into multiple heads**, which allows the network to simultaneously attend to different subsections of each embedding.

$V$, $K$ and $Q$ stand for ***value***, ***key*** and ***query***. These are terms used in attention functions. In the case of the Encoder, $V$, $K$ and $Q$ will simply be identical copies of the embedding vector (plus positional encoding). They will have the dimensions `batch_size` * `seq_len` * `embedding_dimension`.

In multi-head attention we split the embedding vector into $N$ heads, so they will then have the dimensions `batch_size` * `N` * `seq_len` * (`embedding_dimension` / `N`).

This final dimension (`embedding_dimension` / `N`) we will refer to as $d_k$.
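
The dimension bookkeeping is easier to follow on a toy vector. The sketch below splits one embedding of length `d_model` into `heads` chunks of length $d_k$ using plain Python lists; `split_heads` is a hypothetical helper, and it leaves out the linear projections that are applied before the split in the actual layer.

```python
def split_heads(vector, heads):
    # Split one embedding vector of length d_model into `heads` chunks of length d_k
    d_model = len(vector)
    if d_model % heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    d_k = d_model // heads
    return [vector[h * d_k:(h + 1) * d_k] for h in range(heads)]

embedding = list(range(8))            # toy embedding with d_model = 8
chunks = split_heads(embedding, heads=4)
print(chunks)
```

Concatenating the chunks in order gives back the original vector, which is what the final concatenation step of the layer relies on.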

In [25]:
class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model, dropout = 0.1):
        super().__init__()
        
        self.d_model = d_model
        self.d_k = d_model // heads
        self.h = heads
        
        self.q_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, d_model)
    
    def forward(self, q, k, v, mask=None):
        
        bs = q.size(0)
        
        # Perform linear operation and split into h heads
        
        k = self.k_linear(k).view(bs, -1, self.h, self.d_k)
        q = self.q_linear(q).view(bs, -1, self.h, self.d_k)
        v = self.v_linear(v).view(bs, -1, self.h, self.d_k)
        
        # Transpose to get dimensions bs * h * sl * d_k
       
        k = k.transpose(1,2)
        q = q.transpose(1,2)
        v = v.transpose(1,2)
        
        # Calculate attention using function we will define next
        scores = attention(q, k, v, self.d_k, mask, self.dropout)
        
        # Concatenate heads and put through final linear layer
        concat = scores.transpose(1,2).contiguous()\
        .view(bs, -1, self.d_model)
        
        output = self.out(concat)
    
        return output

## Attention

The equation below is the attention formula from the [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper, and the diagram beneath it does a good job of explaining each step.

![Attention Equation](../../../images/attention.png)

![Attention Equation](../../../images/attention-img.png)

Each arrow in the diagram reflects a part of the equation.

Initially we must **multiply** $Q$ by the transpose of $K$. This is then scaled by **dividing the output by the square root** of $d_k$.

A step that’s not shown in the equation is the masking operation. Before we perform **Softmax**, we apply our mask and hence reduce values where the input is padding (or in the decoder, also where the input is ahead of the current word). Another step not shown is **dropout**, which we will apply after Softmax.

Finally, the last step is doing a **dot product** between the result so far and $V$.
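
The steps above can be traced by hand on a tiny example. The following is a minimal single-head sketch in plain Python, without dropout and without batching; the function names and the toy input are assumptions made for this illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_single_head(q, k, v, mask=None):
    # q, k, v: lists of vectors (seq_len x d_k); mask: seq_len x seq_len booleans or None
    d_k = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        # Scaled dot product of one query with every key
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        if mask is not None:
            # Masked positions get a very negative score, so softmax sends them to ~0
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    # Each output vector is a weighted sum of the value vectors
    output = [[sum(w * v_j[d] for w, v_j in zip(row, v)) for d in range(len(v[0]))]
              for row in weights]
    return output, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal = [[c <= r for c in range(3)] for r in range(3)]
out, w = attention_single_head(x, x, x, mask=causal)
print([round(p, 3) for p in w[1]])
```

With the causal mask, the first position can only attend to itself, so its output is simply its own value vector.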

In [26]:
import torch.nn.functional as F

def attention(q, k, v, d_k, mask=None, dropout=None):
    
    scores = torch.matmul(q, k.transpose(-2, -1)) /  math.sqrt(d_k)
    
    if mask is not None:
        mask = mask.unsqueeze(1)
        scores = scores.masked_fill(mask == 0, -1e9)
    
    scores = F.softmax(scores, dim=-1)
    
    if dropout is not None:
        scores = dropout(scores)
        
    output = torch.matmul(scores, v)
    return output

## Feed-Forward Network

The feed-forward layer just consists of two linear operations, with a **relu** and **dropout** operation in between them. It simply deepens our network, employing linear layers to **analyse patterns** in the attention layer's output.

![FeedForward Neural Network](../../../images/feed-forward-nn.gif)

In [27]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=2048, dropout = 0.1):
        super().__init__() 
        # We set d_ff as a default to 2048
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)
        
    def forward(self, x):
        x = self.dropout(F.relu(self.linear_1(x)))
        x = self.linear_2(x)
        return x

## Normalisation

Normalisation is highly important in deep neural networks. It prevents the range of values in the layers changing too much, meaning the model **trains faster** and has **better ability to generalise**.

![Normalization](../../../images/norm.png)

We will be normalising our results between each layer in the encoder/decoder.
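
As a quick numeric illustration, the sketch below normalises a single feature vector in plain Python. It uses scalar `alpha` and `bias` values for simplicity, whereas the layer in this notebook learns one value per feature; the function name `layer_norm` is chosen for this example.

```python
import math

def layer_norm(x, alpha=1.0, bias=0.0, eps=1e-6):
    # Normalise one feature vector: subtract the mean, divide by the standard deviation
    mean = sum(x) / len(x)
    # Unbiased (n - 1) estimate, which is what torch.Tensor.std uses by default
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (len(x) - 1))
    return [alpha * (v - mean) / (std + eps) + bias for v in x]

normed = layer_norm([2.0, 4.0, 6.0, 8.0])
print([round(v, 3) for v in normed])
```

Whatever the scale of the input vector, the output is centred on zero with a spread close to one, which is what keeps the values in each layer within a stable range.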

In [28]:
class Norm(nn.Module):
    def __init__(self, d_model, eps = 1e-6):
        super().__init__()
    
        self.size = d_model
        # create two learnable parameters to calibrate normalisation
        self.alpha = nn.Parameter(torch.ones(self.size))
        self.bias = nn.Parameter(torch.zeros(self.size))
        self.eps = eps
        
    def forward(self, x):
        norm = self.alpha * (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + self.eps) + self.bias
        return norm

# Building Transformer

Let’s have another look at the overall architecture and start building:

![Transformer](../../../images/transformer.png)

**One last variable**: If you look at the diagram closely you can see an $N_x$ next to the encoder and decoder architectures. In reality, the encoder and decoder in the diagram above represent one layer of an encoder and one of the decoder. $N$ is the variable for the **number of layers** there will be. E.g. if `N=6`, the data goes through six encoder layers (with the architecture seen above), then these outputs are passed to the decoder which also consists of six repeating decoder layers.

We will now build `EncoderLayer` and `DecoderLayer` modules with the architecture shown in the model above. Then when we build the encoder and decoder we can define how many of these layers to have.

## EncoderLayer

In [29]:
# build an encoder layer with one multi-head attention layer and one feed-forward layer

class EncoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout = 0.1):
        super().__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.attn = MultiHeadAttention(heads, d_model)
        self.ff = FeedForward(d_model)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn(x2,x2,x2,mask))
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.ff(x2))
        return x

## DecoderLayer

In [30]:
# build a decoder layer with two multi-head attention layers and one feed-forward layer

class DecoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.norm_3 = Norm(d_model)
        
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)
        
        self.attn_1 = MultiHeadAttention(heads, d_model)
        self.attn_2 = MultiHeadAttention(heads, d_model)
        self.ff = FeedForward(d_model)
        
    def forward(self, x, e_outputs, src_mask, trg_mask):
        # Masked self-attention over the target sequence
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn_1(x2, x2, x2, trg_mask))

        # Attention over the encoder outputs
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.attn_2(x2, e_outputs, e_outputs, src_mask))

        # Position-wise feed-forward network
        x2 = self.norm_3(x)
        x = x + self.dropout_3(self.ff(x2))
        return x

We can then build a convenient cloning function that can generate multiple layers:

In [31]:
import copy

def get_clones(module, N):
    return nn.ModuleList([copy.deepcopy(module) for i in range(N)])

## Encoder

In [32]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, heads):
        super().__init__()
        self.N = N
        self.embed = Embedder(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model)
        self.layers = get_clones(EncoderLayer(d_model, heads), N)
        self.norm = Norm(d_model)
        
    def forward(self, src, mask):
        x = self.embed(src)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, mask)
        return self.norm(x)

## Decoder

In [33]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model, N, heads):
        super().__init__()
        self.N = N
        self.embed = Embedder(vocab_size, d_model)
        self.pe = PositionalEncoder(d_model)
        self.layers = get_clones(DecoderLayer(d_model, heads), N)
        self.norm = Norm(d_model)
        
    def forward(self, trg, e_outputs, src_mask, trg_mask):
        x = self.embed(trg)
        x = self.pe(x)
        for i in range(self.N):
            x = self.layers[i](x, e_outputs, src_mask, trg_mask)
        return self.norm(x)

## Transformer

In [34]:
class Transformer(nn.Module):
    def __init__(self, src_vocab, trg_vocab, d_model, N, heads):
        super().__init__()
        self.encoder = Encoder(src_vocab, d_model, N, heads)
        self.decoder = Decoder(trg_vocab, d_model, N, heads)
        self.out = nn.Linear(d_model, trg_vocab)
        
    def forward(self, src, trg, src_mask, trg_mask):
        e_outputs = self.encoder(src, src_mask)
        d_output = self.decoder(trg, e_outputs, src_mask, trg_mask)
        output = self.out(d_output)
        return output

**Note**: We don't perform softmax on the output as this will be handled automatically by our loss function.
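
To illustrate why the raw outputs (logits) are enough, the sketch below computes the cross-entropy of one prediction directly from logits and compares it with the longer route of applying softmax first. It is plain Python with hypothetical numbers; `cross_entropy_from_logits` is an illustrative helper, not the PyTorch function.

```python
import math

def cross_entropy_from_logits(logits, target_index):
    # Stable log-sum-exp, then the negative log-probability of the target
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

logits = [2.0, 0.5, -1.0]        # hypothetical scores over a 3-word vocabulary
loss = cross_entropy_from_logits(logits, target_index=0)

# Same value computed the long way: softmax first, then -log(p[target])
denominator = sum(math.exp(t) for t in logits)
probs = [math.exp(z) / denominator for z in logits]
print(round(loss, 4), round(-math.log(probs[0]), 4))
```

Both routes give the same number, so applying softmax inside the model as well would only distort the loss.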

# Training the Model

With the transformer built, all that remains is to train it on the dataset. The coding part is done, but be prepared to wait about 2 days for this model to start converging! In this session we train for only a minimal number of epochs; you may try more epochs during your self-study.

Let’s define some parameters first:

In [35]:
embedding_dimension = 512
heads = 4
N = 6

src_vocab = len(EN_TEXT.vocab)
trg_vocab = len(FR_TEXT.vocab)

model = Transformer(src_vocab, trg_vocab, embedding_dimension, N, heads)

model = model.to(device)

# This code is very important! It initialises the parameters with a
# range of values that stops the signal fading or getting too big.
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

optim = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

Now we’re ready to train the transformer model.

In [36]:
import time
import torch

def train_model(epochs, print_every=100, timelimit=None):
    
    model.train()
    
    start = time.time()
    temp = start
    
    total_loss = 0
    min_loss = float('inf')
    
    for epoch in range(epochs):
       
        for i, batch in enumerate(train_iter):
            src = batch.English.transpose(0,1)
            trg = batch.French.transpose(0,1)
            # the French sentence we input has all words except
            # the last, as it is using each word to predict the next
            
            trg_input = trg[:, :-1]
            
            # the words we are trying to predict
            
            targets = trg[:, 1:].contiguous().view(-1)
            
            # create function to make masks using mask code above
            
            src_mask, trg_mask = create_masks(src, trg_input)
            
            preds = model(src, trg_input, src_mask, trg_mask)
            
            ys = trg[:, 1:].contiguous().view(-1)
            
            optim.zero_grad()
            
            loss = F.cross_entropy(preds.view(-1, preds.size(-1)), ys, ignore_index=target_pad)
            
            loss.backward()
            
            optim.step()
                        
            total_loss += loss.item()
            
            if (i + 1) % print_every == 0:
                loss_avg = total_loss / print_every
                
                duration = (time.time() - start) // 60
                
                print("time = %dm, epoch %d, iter = %d, loss = %.3f, %ds per %d iters" % 
                      (duration, epoch + 1, i + 1, loss_avg, time.time() - temp, print_every))
                
                if loss_avg < min_loss:
                    min_loss = loss_avg
                    torch.save(model, "model/training.model")
                    print("Current best model saved", "loss =", loss_avg)
                    
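                if (timelimit and duration >= timelimit):
                    # Stop training altogether; a bare break would only
                    # leave the inner loop and start the next epoch
                    return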
                
                total_loss = 0
                temp = time.time()
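
The shift between `trg_input` and the prediction targets in the loop above is easiest to see on a single sentence. The sketch below uses a hypothetical, already tokenized French sentence as a plain Python list.

```python
trg = ['<sos>', 'je', 'suis', 'étudiant', '<eos>']   # hypothetical target sentence

trg_input = trg[:-1]   # what the decoder sees
targets = trg[1:]      # what it must predict at each position

for seen, expected in zip(trg_input, targets):
    print(f"{seen:>10} -> {expected}")
```

At every position the decoder receives one word and is asked for the next one, so the input never contains `<eos>` and the targets never contain `<sos>`.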

In [37]:
# train_model(1, timelimit=300)

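# Assign the loaded model so that the following cells use the pretrained weights
model = torch.load("model/pretrained.model", map_location=device)
model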

Transformer(
  (encoder): Encoder(
    (embed): Embedder(
      (embed): Embedding(109944, 512)
    )
    (pe): PositionalEncoder(
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layers): ModuleList(
      (0): EncoderLayer(
        (norm_1): Norm()
        (norm_2): Norm()
        (attn): MultiHeadAttention(
          (q_linear): Linear(in_features=512, out_features=512, bias=True)
          (v_linear): Linear(in_features=512, out_features=512, bias=True)
          (k_linear): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (out): Linear(in_features=512, out_features=512, bias=True)
        )
        (ff): FeedForward(
          (linear_1): Linear(in_features=512, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear_2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (dropout_1): Dropout(p=0.1, inplace=False)
        (dropout_2): Dropout(p=0

# Testing the Model

We can use the below function to translate sentences. We can feed it sentences directly from our batches, or input custom strings.

The translator works by running a loop. We start off by encoding the English sentence. We then feed the decoder the `<sos>` token index and the encoder outputs. The decoder makes a prediction for the first word, and we add this to our decoder input with the sos token. We rerun the loop, getting the next prediction and adding this to the decoder input, until we reach the `<eos>` token letting us know it has finished translating.
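
The loop described above is a greedy decoding loop, and its control flow can be sketched without any model at all. In the example below, `toy_scores` is a hypothetical stand-in for the decoder that always continues a fixed phrase; only the loop structure is meant to mirror the translator.

```python
def greedy_decode(next_token_scores, sos, eos, max_len=10):
    # next_token_scores(prefix) returns a dict mapping each candidate token to a score
    outputs = [sos]
    for _ in range(1, max_len):
        scores = next_token_scores(outputs)
        best = max(scores, key=scores.get)   # greedy choice: highest-scoring token
        if best == eos:
            break
        outputs.append(best)
    return outputs[1:]                       # drop the start token

# Hypothetical stand-in for the decoder: always continues a fixed phrase
PHRASE = ['bonjour', 'mon', 'ami', '<eos>']

def toy_scores(prefix):
    nxt = PHRASE[len(prefix) - 1]
    return {tok: (1.0 if tok == nxt else 0.0) for tok in PHRASE}

translation = greedy_decode(toy_scores, '<sos>', '<eos>')
print(translation)
```

Greedy decoding keeps only the single best word at each step; it is simple and fast, at the cost of never revisiting an earlier choice.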

In [38]:
def translate(model, src, max_len = 80, custom_string=False):
    
    model.eval()
    
    if custom_string:
        src = tokenize_en(src)
        # Build the index tensor on whichever device the model is using
        src = torch.LongTensor([[EN_TEXT.vocab.stoi[tok] for tok in src]]).to(device)
            
    src_mask = (src != input_pad).unsqueeze(-2)
    e_outputs = model.encoder(src, src_mask)

    outputs = torch.zeros(max_len).type_as(src.data)
    outputs[0] = torch.LongTensor([FR_TEXT.vocab.stoi['<sos>']])
    
    for i in range(1, max_len):
        trg_mask = np.triu(np.ones((1, i, i)), k=1).astype('uint8')

        trg_mask = (torch.from_numpy(trg_mask) == 0).to(device)

        out = model.out(model.decoder(outputs[:i].unsqueeze(0), e_outputs, src_mask, trg_mask))

        out = F.softmax(out, dim=-1)

        val, ix = out[:, -1].data.topk(1)

        outputs[i] = ix[0][0]

        if ix[0][0] == FR_TEXT.vocab.stoi['<eos>']:
            break
                           
    return ' '.join([FR_TEXT.vocab.itos[ix] for ix in outputs[:i]])

In [39]:
translate(model, "How're you my friend?", custom_string=True)

'<sos> Essential Essential Essential strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises pro-indonésiennes strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourgeoises strasbourg

# Contributors

**Author**
<br>Chee Lam

# References

1. [How to Code The Transformer in Pytorch](https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec#b0ed)

2. [How to Use TorchText for Neural Machine Translation](https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95)