# Attention Is All You Need

### The original transformer
This notebook is an attempt to implement the classic paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) (Vaswani et al., 2017) in PyTorch. *Attention* was previously used in RNN encoder-decoder models, which have the disadvantage of having to loop over the sequence step by step. The authors dispense with recurrence and show how *self-attention* can be used as the *heart* of the model.

The paper, published in 2017, had a significant impact on NLP and deep learning and paved the way for later breakthroughs such as BERT and GPT-3. The model of 2017 is sometimes referred to as the *original transformer*.

<img src="https://pbs.twimg.com/media/DCYHlQCUMAAsLhG?format=jpg&name=small" style="width:512px;height:384px;" alt="Dwight's Attention">

## About this Notebook 👨‍💻
The transformer is trained to translate sentences from French to English. I use a rather small dataset of English-French sentence pairs (e.g. "I don't know", "Je ne sais pas").

The goal was to make a simple end-to-end example and to understand the attention mechanism as well as the model architecture of the first transformer. For a long time the model did not learn as expected. Despite the loss getting smaller every epoch and "semi-sane" inference, the model would never converge 🧐. I could finally fix it by borrowing the code for the sublayers from [the annotated transformer](http://nlp.seas.harvard.edu/annotated-transformer/#embeddings-and-softmax). Another good source is Peter Bloem's [blog post](https://peterbloem.nl/blog/transformers).

Since the original paper is really well written, I mostly quote from it directly to explain what's happening. (All quotes are from the paper *Attention Is All You Need*.)

In [1]:
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from torch.optim.lr_scheduler import LambdaLR
import math 
import numpy as np
import unicodedata
import string
import re
import random
from tqdm import tqdm
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device :", device)

device : cuda


# Text Preprocessing 🤖

NLP involves a lot of data wrangling. To stick to a simple but insightful example, I won't use the usual convenience functions (for example from Hugging Face) to create the text pipeline.
The tokenizer, vocabulary and batch makers are coded in good ol' Python.

*The code for the tokenizer and vocab builder is from [this](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) excellent PyTorch tutorial!*

In [2]:
SOS_token = 0
EOS_token = 1
PAD_token = 2
MAX_LENGTH = 50

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS", 2:"PAD"}
        self.n_words = 3  # Count SOS, EOS and PAD

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

def readLangs(lang1, lang2, reverse=False):
    # Read the whole file and close it again once we are done
    with open('../input/english-french/%s-%s.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH 

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

In [3]:
def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
print(random.choice(pairs))

Read 135842 sentence pairs
Trimmed to 135837 sentence pairs
Counting words...
Counted words:
fra 21326
eng 13039
['voudrais tu te presenter ?', 'would you introduce yourself ?']


### From sentence to indexes and batches

Below are the functions that transform sentences into lists of indexes and then create padded batches.

In [4]:
def indexes_from_sentence(lang, sentence):
    idxs = [lang.word2index[word] for word in sentence.split(' ')]
    idxs.append(EOS_token)
    idxs.insert(0, SOS_token)  # list.insert(position, value): put SOS at the front
    return idxs
    
def batch_from_pairs(pairs):
    batch_inp = [indexes_from_sentence(input_lang, p[0]) for p in pairs]
    longest_seq = max([len(seq) for seq in batch_inp])
    batch_inp = [seq+[PAD_token]*(longest_seq-len(seq)) for seq in batch_inp]
    input_tensor = torch.tensor(batch_inp, dtype=torch.long, device=device)
    
    batch_trg = [indexes_from_sentence(output_lang, p[1]) for p in pairs]
    longest_seq = max([len(seq) for seq in batch_trg])
    batch_trg = [seq+[PAD_token]*(longest_seq-len(seq)) for seq in batch_trg]
    target_tensor = torch.tensor(batch_trg, dtype=torch.long, device=device)    
    
    return input_tensor,target_tensor
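
As a rough illustration of what these two functions produce, here is a dependency-free sketch on a made-up mini vocabulary. The names `toy_word2index`, `toy_indexes` and `toy_pad` are hypothetical stand-ins and not part of the notebook; only the reserved ids 0, 1 and 2 are taken from the cell above.

```python
# Reserved ids as in the notebook; the word ids below are invented for the example.
SOS_token, EOS_token, PAD_token = 0, 1, 2
toy_word2index = {"je": 3, "ne": 4, "sais": 5, "pas": 6, "merci": 7}

def toy_indexes(sentence):
    """Wrap the word ids of a sentence in SOS ... EOS."""
    return [SOS_token] + [toy_word2index[w] for w in sentence.split(" ")] + [EOS_token]

def toy_pad(seqs):
    """Right-pad every sequence with PAD_token up to the longest one in the batch."""
    longest = max(len(s) for s in seqs)
    return [s + [PAD_token] * (longest - len(s)) for s in seqs]

batch = toy_pad([toy_indexes("je ne sais pas"), toy_indexes("merci")])
for row in batch:
    print(row)
# [0, 3, 4, 5, 6, 1]
# [0, 7, 1, 2, 2, 2]
```

Padding per batch (rather than to `MAX_LENGTH`) keeps the tensors as short as the longest sentence in that batch.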

# Implementing the Transformer 

Here is a high-level overview. It's actually just that.<br>
So, once you understand all the parts, you can come back and marvel at the explanatory power of this illustration. 💪🏋️

![Model Architecture](https://www.researchgate.net/profile/Dennis-Gannon-2/publication/339390384/figure/fig1/AS:860759328321536@1582232424168/The-transformer-model-from-Attention-is-all-you-need-Viswani-et-al.jpg)

### Some further notes to the illustration

* It is an encoder-decoder architecture.
* The encoder creates features of the input sequence.
* The decoder uses the output of the encoder to decode the target sequence.
* The decoder uses the shifted targets during training. During prediction the outputs are fed back as the new inputs.
* Both encoder and decoder use stacked self-attention and point-wise, fully connected layers.
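
The "shifted targets" point can be made concrete with a tiny sketch. The word ids 11 to 13 are hypothetical; the slicing is the same idea the training code uses to derive decoder inputs and labels from one target sequence.

```python
SOS, EOS = 0, 1  # same reserved ids as SOS_token / EOS_token in this notebook
target = [SOS, 11, 12, 13, EOS]  # hypothetical ids standing for "SOS i don t know EOS"

decoder_input = target[:-1]  # what the decoder is shown during training
labels = target[1:]          # what it has to predict, shifted by one position

for step, expected in enumerate(labels):
    # With the look-ahead mask, position `step` only sees the inputs up to `step`.
    print(f"position {step}: sees {decoder_input[:step + 1]} -> should predict {expected}")
```

At prediction time there is no target to shift, so the loop starts from `[SOS]` and appends each predicted token before calling the model again.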

Let's go through it step by step.

## Attention!

So here is the magic sauce. 🪄
My intuition of attention is the following. Usually, when we feed our input vectors into a linear layer, the vectors do not *see each other*. There is no interaction. Well, with attention, they DO *see each other*. Look at the formula below. The dot product of the queries and keys is some sort of interbreeding. ❤️ (In self-attention, queries, keys and values are all linear projections of the same vectors.) The softmax turns these scores into weights, and through the product with the values the most interesting interactions are then selected. So we get context-enriched, interaction-heavy feature vectors as output. So that no illegal interactions happen (apart from the incest), a mask is applied in the decoder to prevent interactions with "the future". (Words can only flirt with words that come before.)

If you perform (scaled dot-product) attention several times in parallel, each time with its own learned projections, you get *multi-head attention*. Let's see what the experts say about that.

###  Scaled dot-product attention
The authors wrote:

> We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of
queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the
query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the
values.
In practice, we compute the attention function on a set of queries simultaneously, packed together
into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute
the matrix of outputs as:

![Attention function](https://miro.medium.com/max/720/1*P9sV1xXM10t943bXy_G9yg.png)
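
To see the arithmetic of the formula without any tensor library, here is a minimal sketch in plain Python lists. `softmax` and `toy_attention` are illustrative helpers and the matrices are made up; masking and dropout are left out on purpose.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for small list-of-lists matrices."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One score per key: scaled dot product of this query with every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, weights = toy_attention(Q, K, V)
print([round(w, 3) for w in weights[0]])  # the first query prefers keys 0 and 2
```

Each row of `weights` sums to one, so every output vector is a weighted average of the value vectors.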

### Multihead attention

> Instead of performing a single attention function with dmodel-dimensional keys, values and queries,
we found it beneficial to linearly project the queries, keys and values h times with different, learned
linear projections to dk, dk and dv dimensions, respectively. On each of these projected versions of
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional
output values. These are concatenated and once again projected, resulting in the final values ... Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions. With a single attention head, averaging inhibits this.
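
The bookkeeping behind "h times ... in parallel" is mostly reshaping. The sketch below, with hypothetical helpers `split_heads` and `merge_heads`, shows on plain lists roughly what a `view(...).transpose(1, 2)` on the projected tensor achieves: one wide vector per position is cut into `n_heads` narrower ones, and concatenating them restores the original layout.

```python
def split_heads(x, n_heads):
    """[seq_len][n_heads * d_head] -> [n_heads][seq_len][d_head]"""
    d_head = len(x[0]) // n_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x] for h in range(n_heads)]

def merge_heads(heads):
    """[n_heads][seq_len][d_head] -> [seq_len][n_heads * d_head] (concatenation)"""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

# Toy input: 3 positions, width 8; the values encode (position, column) for readability
x = [[t * 10 + c for c in range(8)] for t in range(3)]
heads = split_heads(x, n_heads=4)
print(heads[1])  # [[2, 3], [12, 13], [22, 23]]
```

With the paper's base setting (width 512, 8 heads) each head works on 64 dimensions, which is why the total cost stays close to that of a single full-width head.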


In [5]:

def attention(q,k,v,dropout,mask=None):
    b,h,l,dk = q.size()
    x = torch.matmul(q,k.transpose(-2,-1)) / dk**0.5
    
    if mask is not None:
        x= x.masked_fill(mask==0,-1e9)
    x = x.softmax(dim=-1)
    x = dropout(x)    
    x = torch.matmul(x,v)
    return x

class MultiHeaderAttention(nn.Module):
    def __init__(self,d_model,dropout,n_heads=8, dk=64,dv=64):
        super(MultiHeaderAttention,self).__init__()
        self.dims = dk,dv,n_heads
        self.q = nn.Linear(d_model, dk*n_heads)
        self.k = nn.Linear(d_model, dk*n_heads)
        self.v = nn.Linear(d_model, dv*n_heads)
        self.dropout = nn.Dropout(p=dropout)

        self.out = nn.Linear(dv*n_heads, d_model)
            
    def forward(self,k,v,q,mask=None):
        b,len_k,len_q,len_v = k.size(0), k.size(1), q.size(1), v.size(1)
        dk,dv,h =self.dims
        q = self.q(q).view(b,len_q,h,dk).transpose(1,2)
        k = self.k(k).view(b,len_k,h,dk).transpose(1,2)
        v = self.v(v).view(b,len_v,h,dv).transpose(1,2) # values have dimension dv, not dk
        if mask is not None:
            mask = mask.unsqueeze(1) #put header dim for broadcasting
            
        x = attention(q,k,v, self.dropout, mask)
        x = x.transpose(1,2).contiguous().view(b,len_q,h*dv) # swap headers and seq_len, concatenate heads
        return self.out(x)

### Positional encoding

The model needs to know the order of the words in a sentence. (Or more precisely, the order of the embeddings)
In the paper they do this by adding a cyclical signal to the word embeddings.

The authors write:
> Since our model contains no recurrence and no convolution, in order for the model to make use of the
> order of the sequence, we must inject some information about the relative or absolute position of the
> tokens in the sequence.

and

> To this end, we add "positional encodings" to the input embeddings at the
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$
as the embeddings, so that the two can be summed. There are many choices of positional encodings,
learned and fixed [9].
In this work, we use sine and cosine functions of different frequencies:
>
> $$PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$$
> $$PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$$
>
> where pos is the position and i is the dimension. That is, each dimension of the positional encoding
corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π.
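
The wavelength claim can be checked numerically. In this sketch `i` is the index of the embedding dimension (0 to d_model - 1), so the pair index of the formula is `i // 2`; `sinusoid` and `wavelength` are illustrative helper names.

```python
import math

def sinusoid(pos, i, d_model):
    """Positional encoding value for position `pos` and embedding dimension `i`."""
    angle = pos / 10000 ** (2 * (i // 2) / d_model)
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)

def wavelength(i, d_model):
    """Number of positions after which dimension `i` repeats."""
    return 2 * math.pi * 10000 ** (2 * (i // 2) / d_model)

d_model = 512
first = wavelength(0, d_model)
last = wavelength(d_model - 1, d_model)
print(first / (2 * math.pi))  # 1.0
print(last / (2 * math.pi))   # a bit below 10000, the upper end is approached but not reached
```

Position 0 always encodes to the pattern 0, 1, 0, 1, ... because sin(0) = 0 and cos(0) = 1.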

In [6]:
class PositionEncoding(nn.Module):
    def __init__(self,max_len,d_model):
        super(PositionEncoding,self).__init__()
        self.max_len = max_len+5
        self.register_buffer('pos_table', self.tensor_pos_encoding(self.max_len, d_model))

    def pos_encoding(self,pos, k):
        """Take a position in the sequence and return its k-dimensional sinusoidal encoding."""
        f = lambda i,k: pos / 10000**(2 * (i // 2) / k)
        return [math.sin(f(i,k)) if i%2==0 else math.cos(f(i,k)) for i in range(0,k)]

    def tensor_pos_encoding(self,max_len,dim):
        return torch.tensor([self.pos_encoding(i,dim) for i in range(max_len)],device=device).view( max_len,dim)

    def forward(self,x):
        return x+ self.pos_table[:x.size(1),:].detach().clone().unsqueeze(0)

### Sublayer connection

Sometimes it's good to learn from the past. That's basically what residual connections do. 

> We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
itself. 

Concerning dropout:
> We apply dropout [33] to the output of each sub-layer, before it is added to the
sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the
positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of
Pdrop = 0.1.
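
The quote describes normalizing *after* the residual sum, while the code borrowed from the annotated transformer normalizes the sublayer *input* instead. The sketch below contrasts the two orderings on a plain list. It is a simplification under stated assumptions: `layer_norm` has no learnable gain or bias, dropout is omitted, and `double` is a stand-in for a real sublayer.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """Ordering from the paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    """Ordering used in the cell below: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

double = lambda v: [2.0 * e for e in v]  # stand-in for attention or feed-forward
x = [1.0, 2.0, 3.0, 6.0]
print([round(v, 3) for v in post_norm(x, double)])
print([round(v, 3) for v in pre_norm(x, double)])
```

In the pre-norm variant the residual path passes `x` through untouched, which is one reason a final `LayerNorm` is applied at the end of the encoder and decoder stacks.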

In [7]:
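# code from "the annotated transformer"
# note: the input is normalized before the sublayer (pre-norm) instead of LayerNorm(x + Sublayer(x))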
class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

### Encoder

The encoder will take our French sentences and *encode* them. Imagine that you are engineering features for your favorite gradient boosting model. The output of the encoder is basically the features of our French sentences, ready to be fed into the decoder.

The authors' description of the encoder:
> The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, positionwise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512.

The application of attention in the encoder is described as follows:
> The encoder contains self-attention layers. In a self-attention layer all of the keys, values
and queries come from the same place, in this case, the output of the previous layer in the
encoder. Each position in the encoder can attend to all positions in the previous layer of the
encoder.

In [8]:
class Encoder(nn.Module):
    def __init__(self, n_input_vocab, d_model,n_hidden,n_layers,dropout):
        super().__init__()
        self.d_model = d_model
        
        self.embedding = nn.Embedding(n_input_vocab,d_model,padding_idx=PAD_token)
        self.dropout = nn.Dropout(p=dropout)
        self.normal = nn.LayerNorm(d_model, eps=1e-6)   
        self.encoder_layers = nn.ModuleList(
            [EncoderLayer(d_model, n_hidden,dropout) for i in range(n_layers)]
            )

        self.pos_enc = PositionEncoding(MAX_LENGTH,d_model)

        
    def forward(self,x,mask):
        x = self.embedding(x) * self.d_model**0.5
        x = self.dropout(self.pos_enc(x)) # dropout on embeddings + positional encodings, as described in the paper
        
        #stack of N = 6 identical layers
        for layer in self.encoder_layers:
            x = layer(x,mask)
        return self.normal(x)


class EncoderLayer(nn.Module):
    def __init__(self, d_model, n_hidden,dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeaderAttention(d_model,dropout)
        self.feed_forward = nn.Sequential(
                        nn.Linear(d_model,n_hidden), 
                        nn.ReLU(),
                        nn.Dropout(p=dropout),
                        nn.Linear(n_hidden,d_model) 
        )
        self.sublayer = nn.ModuleList([SublayerConnection(d_model, dropout) for i in range(2)])

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

### Decoder

The decoder will take the encoder output (features of the French sentences) and the target sequence (here the English sentences) and decode it. The decoder output is then fed into a linear layer whose job it is to predict the next word for each position of the sequence. (See how the labels are made during training to understand why.)

> The decoder is also composed of a stack of N = 6 identical layers. In addition to the two
sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head
attention over the output of the encoder stack. Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This
masking, combined with fact that the output embeddings are offset by one position, ensures that the
predictions for position i can depend only on the known outputs at positions less than i.


The first application of the attention in the decoder (sublayer1) is described as follows:
> Similarly, self-attention layers in the decoder allow each position in the decoder to attend to
all positions in the decoder up to and including that position. We need to prevent leftward
information flow in the decoder to preserve the auto-regressive property. We implement this
inside of scaled dot-product attention by masking out (setting to −∞) all values in the input
of the softmax which correspond to illegal connections. See Figure 2.

The second application of the attention in the decoder (sublayer2) is described as follows:
> In "encoder-decoder attention" layers, the queries come from the previous decoder layer,
and the memory keys and values come from the output of the encoder. This allows every
position in the decoder to attend over all positions in the input sequence. 

In [9]:
class Decoder(nn.Module):
    def __init__(self,n_target_vocab,d_model, n_hidden,n_layers,dropout):
        super().__init__()
        self.d_model = d_model
        
        self.embedding = nn.Embedding(n_target_vocab,d_model,padding_idx=PAD_token)
        self.dropout = nn.Dropout(p=dropout)
        self.normal = nn.LayerNorm(d_model, eps=1e-6)
        self.decoder_layers = nn.ModuleList(
            [DecoderLayer(d_model, n_hidden,dropout) for i in range(n_layers)]
            )
            
        self.pos_enc = PositionEncoding(MAX_LENGTH,d_model)
        
    def forward(self,x,encoder_outputs,self_attn_mask, enc_dec_mask):
        x = self.embedding(x)*self.d_model**0.5
        x = self.dropout(self.pos_enc(x)) # dropout on embeddings + positional encodings, as described in the paper
        
        #stack of N = 6 identical layers
        for layer in self.decoder_layers:
            x = layer(x,encoder_outputs,self_attn_mask, enc_dec_mask)
        
        return self.normal(x)
    
class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_hidden,dropout):
        super(DecoderLayer, self).__init__()

        self.self_attn = MultiHeaderAttention(d_model,dropout)
        self.src_attn = MultiHeaderAttention(d_model,dropout)
        self.feed_forward = nn.Sequential(
                        nn.Linear(d_model,n_hidden), 
                        nn.ReLU(),
                        nn.Dropout(p=dropout),
                        nn.Linear(n_hidden,d_model) 
        )
        self.sublayer = nn.ModuleList([SublayerConnection(d_model, dropout) for i in range(3)])

    def forward(self, x, memory, tgt_mask,src_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(m, m, x,src_mask))
        return self.sublayer[2](x, self.feed_forward)

## The Transformer

Now we have all the pieces for the transformer. Here, the encoder and decoder are instantiated, and the masks that prevent looking at irrelevant (pad tokens) or illegal (looking ahead) content are generated. A linear layer outputs the final scores over the target vocabulary for every position of the sequence. Note that the softmax is not applied inside the model. This is because the cross-entropy loss expects the raw output (logits).
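
How the two kinds of mask combine for the decoder can be shown on a single toy sequence. This is a plain-Python sketch of the idea, not the tensor code of the model; `pad_mask`, `causal_mask` and `target_mask` are illustrative names and the word ids 7 and 8 are made up.

```python
PAD = 2  # same id as PAD_token in this notebook

def pad_mask(seq):
    """True where a real token sits, False at padding positions."""
    return [tok != PAD for tok in seq]

def causal_mask(length):
    """Lower-triangular mask: row i may look at columns 0..i only."""
    return [[col <= row for col in range(length)] for row in range(length)]

def target_mask(seq):
    """A position is visible if it is not in the future and not padding."""
    keep = pad_mask(seq)
    return [[allowed and keep[col] for col, allowed in enumerate(row)]
            for row in causal_mask(len(seq))]

seq = [0, 7, 8, 1, 2]  # SOS, two hypothetical word ids, EOS, PAD
mask = target_mask(seq)
for row in mask:
    print("".join("1" if m else "0" for m in row))
# 10000
# 11000
# 11100
# 11110
# 11110
```

The encoder and the encoder-decoder attention only need the padding part, since every source position may be looked at.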

In [10]:
class Transformer(nn.Module):
    def __init__(self,d_model, n_input_vocab,n_target_vocab, n_hidden,n_layers,dropout):
        super().__init__()

        self.encoder = Encoder(n_input_vocab=n_input_vocab, d_model=d_model,n_hidden=n_hidden,n_layers=n_layers,dropout=dropout)
        self.decoder = Decoder(n_target_vocab=n_target_vocab,d_model=d_model,n_hidden=n_hidden,n_layers=n_layers,dropout=dropout)
        self.out = nn.Linear(d_model,n_target_vocab)  
        
    def get_target_mask(self, target_seq):
        b_sz, len_s = target_seq.size()
        return torch.tril(torch.ones(len_s,len_s,device=device)).bool().expand(1,len_s,len_s)

    def get_pad_mask(self,seq):
        return (seq != PAD_token).unsqueeze(-2)   
    
    def forward(self,input_seq, target_seq):
        trg_mask = self.get_pad_mask(target_seq)
        trg_mask = trg_mask & self.get_target_mask(target_seq).type_as(trg_mask.data)
        inp_mask = self.get_pad_mask(input_seq)

        encoder_out = self.encoder(input_seq,inp_mask)
        decoder_out = self.decoder(target_seq,encoder_out,trg_mask,inp_mask)
        
        out = self.out(decoder_out)
        return out


# Training & Random Tests 💪

It's training time. First we shuffle the dataset and split it into training and test partitions. Then we instantiate the model and set up the optimizer and scheduler. Finally we run the training loop. We get a hot cup of coffee, lean back and look mesmerized at the ever-decreasing loss. ☕️

In [11]:
#splitting the data into train and test datasets

np.random.shuffle(pairs)
train = pairs[:-1000]
test = pairs[-1000:]

#the model with the parameter-values of the paper
transformer1 = Transformer(d_model=512, 
                            n_input_vocab=input_lang.n_words, 
                            n_target_vocab= output_lang.n_words,
                            n_hidden = 2048,
                            n_layers=6, 
                            dropout=0.1).to(device)

for p in transformer1.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

### Optimizer 

> We used the Adam optimizer [20] with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. We varied the learning
rate over the course of training, according to the formula:
>
> $$lrate = d_{\text{model}}^{-0.5} \cdot \min\left(step\_num^{-0.5},\; step\_num \cdot warmup\_steps^{-1.5}\right) \quad (3)$$
>
> This corresponds to increasing the learning rate linearly for the first warmup_steps training steps,
and decreasing it thereafter proportionally to the inverse square root of the step number. We used
warmup_steps = 4000.
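
To get a feeling for the numbers, the schedule can be evaluated by hand. `paper_lr` is an illustrative helper that just evaluates formula (3). The step counts 657 and 1314 assume 21000 samples per epoch in batches of 32, i.e. 657 optimizer steps per epoch, as in the training loop further down.

```python
def paper_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate of formula (3); step 0 is treated as step 1 to avoid 0 ** -0.5."""
    step = max(1, step)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(round(paper_lr(657), 6))   # 0.000115  (after one epoch of 657 steps)
print(round(paper_lr(1314), 6))  # 0.00023   (after two epochs)
print(paper_lr(4000))            # the peak, roughly 7e-4, reached at the end of warmup
```

The first two values agree with the learning rates printed in the training log, which is a handy sanity check that the scheduler is stepped once per batch.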

In [12]:
#the optimizer 
lr= 1
opt1 = optim.Adam(transformer1.parameters(),lr=lr, betas=(0.9, 0.98), eps=1e-09)

def lr_rate(step_num, d_model, factor, warmup_steps):
    step_num =max(1,step_num)
    return factor * (
        d_model ** (-0.5) * min(step_num ** (-0.5), step_num * warmup_steps ** (-1.5))
    )

lr_scheduler = LambdaLR(
    optimizer=opt1,
    lr_lambda=lambda step_num: lr_rate(
        step_num, 512, factor=1, warmup_steps=4000
    ),
)

### Training and evaluation functions

Let's define a function to test the model with random samples from the test set. This will make the learning process more intuitive. Ultimately, we want to see the model translate something it has not seen before, and the better the quality of the translation, the more we will be impressed. 😀

In [13]:
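def pred(input_seq, model):
    """Greedy decoding: feed the tokens produced so far back into the model, one step at a time."""
    outputs = [SOS_token]
    was_training = model.training
    model.eval()  # switch off dropout while translating
    with torch.no_grad():
        for i in range(MAX_LENGTH):
            target_seq = torch.tensor([outputs],device=device)
            output = model(input_seq,target_seq)
            # argmax over the vocabulary at the last position (a softmax would not change the result)
            word_pred = torch.argmax(output[:,-1,:],dim=1)

            outputs.append(word_pred.item())

            if word_pred.item()== EOS_token:
                break
    model.train(was_training)  # restore the previous mode so training can continue

    return outputs[1:]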

def random_model_testing(n_examples,model):
    batch_sz=1
    # draw exactly n_examples pairs (the original relied on the global n_samples)
    test_samples = [random.choice(test) for i in range(n_examples)]
    print("Random Tests")
    print("*"*30)
    for i in range(0,len(test_samples),batch_sz):
        input_tensor, output_tensor = batch_from_pairs(test_samples[i:i+batch_sz])
        out = pred(input_tensor, model)
        print("Pred: ", " ".join([output_lang.index2word[i] for i in out]), "True: ",test_samples[i][1])
    print("*"*30)

The "train_batch" function takes in a batch, runs that batch through the model, calculates the loss and runs backpropagation. Also, the learning rate is adjusted by calling `scheduler.step()`.\
Label smoothing is applied according to the paper:
>During training, we employed label smoothing of value ls = 0.1 [36]. This
hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
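
What label smoothing does to the loss can be sketched for a single prediction. The helper below is an illustration under one assumption, which is my understanding of how `label_smoothing` in `F.cross_entropy` is defined: the target distribution puts `1 - smoothing` on the true class and spreads `smoothing` uniformly over all K classes.

```python
import math

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy of one prediction against a smoothed one-hot target."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))  # log of the softmax denominator
    log_probs = [v - log_z for v in logits]
    k = len(logits)
    loss = 0.0
    for idx, lp in enumerate(log_probs):
        # Assumed target distribution: uniform share plus the remaining mass on the true class
        q = smoothing / k + (1.0 - smoothing) * (idx == target)
        loss -= q * lp
    return loss

confident = [4.0, 0.0, 0.0]  # hypothetical logits, the true class is 0
print(smoothed_cross_entropy(confident, 0, smoothing=0.0))
print(smoothed_cross_entropy(confident, 0, smoothing=0.1))  # larger: over-confidence is penalized
```

Because a very confident prediction is no longer rewarded with a loss near zero, the model "learns to be more unsure", which is the effect the quote refers to.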

In [14]:
def train_batch(input_seq, target_seq, model, optimizer,scheduler):
    target, truth = target_seq[:,:-1], target_seq[:,1:]
    pred = model(input_seq,target)
    
    loss = F.cross_entropy(pred.view(-1,output_lang.n_words), truth.reshape(-1),reduction='sum',label_smoothing=0.1)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()

    return loss.item()

### Let's train our transformer!

Below is a simple training loop. Each epoch we feed the input and target batches to the model, calculate the loss and backpropagate. To make training a bit more interesting, we draw a small random sample (with replacement) for each epoch instead of iterating over all ~135k pairs. It is easier to track the progress like that.

To make the learning process more intuitive, the function `random_model_testing` is called every 25 epochs, where the model translates 10 sentences from French to English. We see that the model gets "saner" as we go. For example, after 25 epochs the model outputs **"go and do you get the hair !"** instead of **"get a haircut"**. So we put our model back in the oven and wait for it to mature...

After 50 epochs, instead of **"you re the most important person in my life "** the model output is **"you are the one in my life"**. Alright, why not? A philosopher... 🤔

In [15]:
n_samples=21000
epochs =200
batch_sz=32

for e in range(epochs):
    loss = 0
    train_samples = [random.choice(train) for i in range(n_samples)]
    for i in range(0,len(train_samples),batch_sz):
        input_tensor, output_tensor = batch_from_pairs(train_samples[i:i+batch_sz])
        loss += train_batch(input_tensor, output_tensor, transformer1, opt1,lr_scheduler)
    print(f"Epoch {e}/{epochs} | loss: {round(loss/n_samples,2)} | learning rate: {round(lr_scheduler.get_last_lr()[0],6)}")
    #random testing
    if e%25==0:
        random_model_testing(10,transformer1)

Epoch 0/200 | loss: 71.54 | learning rate: 0.000115
Random Tests
******************************
Pred:  the is a father is of the of the of the EOS True:  at least we re still in one piece .
Pred:  i m a of the of the . EOS True:  i d like one more blanket .
Pred:  i don t know i don t know it . EOS True:  i didn t feel like going .
Pred:  we re not . EOS True:  we had lunch early .
Pred:  it is a of the of the . EOS True:  my father is in his room .
Pred:  the is the of the of the of the father the of the of the of the of the of the of the of the EOS True:  the british had military bases along new york s hudson river .
Pred:  it is a . EOS True:  that s rather amusing .
Pred:  you re not a of the ? EOS True:  does it hurt when you chew ?
Pred:  you re not to do you do ? EOS True:  didn t you know that oil floats on water ?
Pred:  she was a of the . EOS True:  she made the same mistake again .
******************************
Epoch 1/200 | loss: 46.19 | learning rate: 0.00023
Epoch 2/200 

In [16]:
#more random tests 
random_model_testing(20,transformer1)

Random Tests
******************************
Pred:  you can do it too . EOS True:  you can do it too .
Pred:  you re wrong . EOS True:  you are wrong .
Pred:  she has much better to cook recently . EOS True:  she has improved her skill in cooking recently .
Pred:  don t make any prisoner . EOS True:  don t take any prisoners .
Pred:  don t touch my bicycle . EOS True:  keep your hands off my bicycle .
Pred:  i live here alone . EOS True:  i live here alone .
Pred:  don t be so impatient . EOS True:  don t be so impatient .
Pred:  why am i so tired ? EOS True:  why am i so tired ?
Pred:  i know i gave the right thing this time . EOS True:  i know i got it right this time .
Pred:  this is not a no town . EOS True:  it s non refundable .
Pred:  he is the tallest of the two . EOS True:  he is the taller of the two .
Pred:  i don t know what s fear . EOS True:  i don t know what fear is .
Pred:  i m being on your age when you knew how to drive . EOS True:  i ll treat you like an adult when y

# Conclusion

The model does a decent job after approximately an hour of training on a GPU. Of course, for better results, more training and more data would do the trick.

This was the first time I implemented a high-quality deep-learning paper. I learned that this model is "sensitive", meaning that small errors in the implementation will make the model very *unreasonable*. 🤪 It took some patience, but there is no better way to understand the topic. Hopefully, this is helpful to somebody out there... 😊