# Introduction to Transformer Networks

Transformer Network were introduced in 2017 in the paper [**Attention Is All You Need**](https://arxiv.org/abs/1706.03762). They are designed to handle sequential data (such as text) and could be use in sequence to sequence tasks. Unlike the RNN methods, Transformer Networks does not needs to see the inputs sequentially allowing them to be trained in parallel. Thus, Transformer Networks became the model of choice for NLP as it is now possible to train them on very large datasets.

This Notebook aims to show, in practice, how we move from RNN to Transformer Networks for a Sequence to Sequence (Seq2Seq) task. The task chosen is language translation. This Notebook showcase three different methods :

- **Seq2Seq RNN** : Our baseline model for language translation.
- **Seq2Seq RNN with Attention Mecanism** : An improved version of the RNN model.
- **Transformer Network** : The model introduced in *Attention Is All You Need* dropping the RNN part of the network and keeping only the attention mecanism.

In [1]:
import torch
import unicodedata
import re
import numpy as np
from sklearn.model_selection import train_test_split

# Simple RNN : https://keras.io/examples/nlp/lstm_seq2seq/ ou https://towardsdatascience.com/a-comprehensive-guide-to-neural-machine-translation-using-seq2sequence-modelling-using-pytorch-41c9b84ba350
# RNN + Attention : https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
# Transformer : https://pytorch.org/tutorials/beginner/transformer_tutorial.html

## The Data

The model we showcase are trained for english to french translation. It is a classical Seq2Seq problem for which a lot of datasets exists. We chose to use the data available on the [manythings.org](https://www.manythings.org/anki/) website.

First we load language models from [**Spacy**](https://spacy.io/), a python library for NLP. These language model are used to tokenize each sentences in our dataset. We also use the [**Torchtext**](https://pytorch.org/text/stable/index.html) library as it provides useful class for loading text datasets and generating the vocabulary.

Once preprocessed, the sentences start with a special start of sentence token (*<sos>*) and end of sentence token (*<eos>*) and are transformed into a vector of numbers each number corresponding to one word in the vocabulary of the language.
    

In [2]:
from torchtext.data import Field, BucketIterator
from torchtext.datasets import TranslationDataset
import spacy
import pandas as pd

# Downloading vocabulary for our chosen languages
!python -m spacy download en_core_web_sm --quiet
!python -m spacy download fr_core_news_sm --quiet

spacy_french = spacy.load("fr_core_news_sm")
spacy_english = spacy.load("en_core_web_sm")

def tokenize_french(text):
    return [token.text for token in spacy_french.tokenizer(text)]

def tokenize_english(text):
    return [token.text for token in spacy_english.tokenizer(text)]


french = Field(tokenize=tokenize_french, lower=True,
               init_token="<sos>", eos_token="<eos>")

english = Field(tokenize=tokenize_english, lower=True,
               init_token="<sos>", eos_token="<eos>")

train_data, valid_data, test_data = TranslationDataset.splits(path="./data", exts = (".en", ".fr"),
                                                    fields=(english, french))

french.build_vocab(train_data, max_size=10000, min_freq=3)
english.build_vocab(train_data, max_size=10000, min_freq=3)

print(f"Unique tokens in source (fr) vocabulary: {len(french.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(english.vocab)}")

[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('fr_core_news_sm')




Unique tokens in source (fr) vocabulary: 9609
Unique tokens in target (en) vocabulary: 6535


In [3]:
# Creating the iterator and print sample

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data), 
                                                                      batch_size = BATCH_SIZE,
                                                                      sort_within_batch=True,
                                                                      sort_key=lambda x: len(x.src),
                                                                      device = device)
max_len_fr = []
max_len_en = []
count = 0

for data in train_data:
    max_len_en.append(len(data.src))
    max_len_fr.append(len(data.trg))
    if count > 1000 and count < 1005 :
        print("English - ",*data.src, " Length - ", len(data.src))
        print("French - ",*data.trg, " Length - ", len(data.trg))
        print()
    count += 1

print("Maximum Length of English Sentence {} and French Sentence {} in the dataset".format(max(max_len_en),max(max_len_fr)))
print("Minimum Length of English Sentence {} and French Sentence {} in the dataset".format(min(max_len_en),min(max_len_fr)))


cuda
English -  how 's work ?  Length -  4
French -  comment va le travail ?  Length -  5

English -  hurry back .  Length -  3
French -  reviens vite .  Length -  3

English -  hurry home .  Length -  3
French -  dépêche - toi d' aller chez toi !  Length -  8

English -  hurry home .  Length -  3
French -  dépêchez -vous d' aller chez vous !  Length -  7

Maximum Length of English Sentence 48 and French Sentence 57 in the dataset
Minimum Length of English Sentence 2 and French Sentence 2 in the dataset




# Seq2Seq RNN

A Seq2Seq RNN is made of two component. An encoder network evaluating the input sequence and generating a vector representing the sentence called the **Context Vector**. The context vector is then passed to a decoder network that will construct the input sequence.

## Encoder

The input of the encoder is the tokenized sentence with the start and end of sentence token. The purpose of the encoder is to create a context vector containing all the information needed by the decoder to reconstitute the translated sentence. To process the sequence of word token, we use LSTM layer and the hidden state of the last LSTM layer is used as the context vector.

![](./fig/seq2seq-encoder.png)


In [4]:
import torch.nn as nn

class EncoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, dropout_p):
        super(EncoderLSTM, self).__init__()
        
        # Size of the input vector
        self.input_size = input_size
        
        # Size of the word embedding
        self.embedding_size = embedding_size
        
        # LSTM hidden layer size
        self.hidden_size = hidden_size
        
        # Number of LSTM layers
        self.num_layers = num_layers
        
        # Initializing the network layer
        
        self.dropout = nn.Dropout(dropout_p)
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.LSTM = nn.LSTM(self.embedding_size, self.hidden_size, self.num_layers, dropout=dropout_p)
        
        
    def forward(self, x):
    
        x = self.embedding(x)
        x = self.dropout(x)
        
        outputs, (hidden_state, cell_state) = self.LSTM(x)
        
        return hidden_state, cell_state


## Decoder

As the context vector issued by the encoder are the hidden state of LSTM, the decoder could use those context vector as the initial hidden state of its LSTM unit. In order to predict the next word in the sentence, the decoder network could use the information contained in the hiddent state of its LSTM units and the previous word it predicted :

![](./fig/seq2seq-decoder.png)

The first call of the decoder is initialize with the context vector and the start of sentence token. Then, the decoder use the previous token it emits as an input and could rely on its hidden state to convey the remaining contextual information about the sentence. The decoder is expected to emit the end of sentence token once he reached the end of the sentence.


In [5]:
import torch.nn as nn

class DecoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, dropout_p, output_size):
        super(DecoderLSTM, self).__init__()
        
        # Input size of the decoder (size of the context vector)
        self.input_size = input_size
        
        # Embedding size 
        self.embedding_size = embedding_size
        
        # Hidden unit size
        self.hidden_size = hidden_size
        
        # num of LSTM layers
        self.num_layers = num_layers
        
        # Vocabulary size of the target language
        self.output_size = output_size
        
        self.dropout = nn.Dropout(dropout_p)
        
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        
        self.LSTM = nn.LSTM(self.embedding_size, self.hidden_size, self.num_layers, dropout = dropout_p)
        
        self.fc = nn.Linear(self.hidden_size, self.output_size)
        
        
    def forward(self, x, hidden_state, cell_state):
        
        # Shape of [1, batch_size]
        x = x.unsqueeze(0)
        
        x = self.embedding(x)
        x = self.dropout(x)
        
        outputs, (hidden_state, cell_state) = self.LSTM(x, (hidden_state, cell_state))
        
        predictions = self.fc(outputs)
        
        predictions = predictions.squeeze(0)
        
        return predictions, hidden_state, cell_state



## Joining Encoder and Decoder

The final Seq2seq model could be built by stacking the encoder and the decoder network. But a regularization mecanism is added in the final seq2seq model to ease the learning tasks. The regularization is called the **Teach Force Ration** (tfr) and aims to correct the decoder by providing the actual token from the target sentence as an input instead of reusing the previously predicted token. Here is an example of the full seq2seq network :

![](./fig/seq2seq.png)

We just pass the context vector as the initial hidden state of the decoder and add the TFR mecanism that consist of providing the real target token expected as input of the LSTM instead of the previously predicted token with a probability of 50%. This mecanism is just used during training and not during inference.


In [6]:
import random

class Seq2Seq(nn.Module):
    def __init__(self, Encoder_LSTM, Decoder_LSTM):
        super(Seq2Seq, self).__init__()
        self.Encoder_LSTM = Encoder_LSTM
        self.Decoder_LSTM = Decoder_LSTM
        
    def forward(self, source, target, tfr=0.5):
        
        batch_size = source.shape[1]
        
        target_len = target.shape[0]
        target_vocab_size = self.Decoder_LSTM.output_size
        
        # Creating empty output tensor
        outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(device)
        
        hidden_state_encoder, cell_state_encoder = self.Encoder_LSTM(source)
        
        # Collecting a SOS token from the target in order to "seed" our decoder
        x = target[0]
        
        for i in range(1, target_len):
            output, hidden_state_decoder, cell_state_decoder = self.Decoder_LSTM(x, hidden_state_encoder, cell_state_encoder)
            outputs[i] = output
            best_guess = output.argmax(1) # 0th dimension is batch size, 1st dimension is word embedding
            x = target[i] if random.random() < tfr else best_guess # Either pass the next word correctly from the dataset or use the earlier predicted word


        return outputs
    

# Constructing model

# ENCODER
input_voc_size = len(english.vocab)
encoder_embedding_size = 300
hidden_size = 2048
encoder_num_layers = 2
encoder_dropout = float(0.5)

encoder_lstm = EncoderLSTM(input_voc_size, encoder_embedding_size, hidden_size, encoder_num_layers, encoder_dropout).to(device)

# DECODER
french_voc_size = len(french.vocab)
decoder_embedding_size = 300
hidden_size = 2048
num_layers = 2
decoder_dropout = float(0.5)
output_size = len(english.vocab)

decoder_lstm = DecoderLSTM(french_voc_size, decoder_embedding_size, hidden_size, num_layers, decoder_dropout, french_voc_size).to(device)


seq2seq_model = Seq2Seq(encoder_lstm, decoder_lstm)

print(seq2seq_model)

Seq2Seq(
  (Encoder_LSTM): EncoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embedding): Embedding(6535, 300)
    (LSTM): LSTM(300, 2048, num_layers=2, dropout=0.5)
  )
  (Decoder_LSTM): DecoderLSTM(
    (dropout): Dropout(p=0.5, inplace=False)
    (embedding): Embedding(9609, 300)
    (LSTM): LSTM(300, 2048, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=2048, out_features=9609, bias=True)
  )
)


# Training

In [None]:
import torch
from torch.utils.tensorboard import SummaryWriter
from torchsummary import summary
from utils.seq2seq_utils import bleu, checkpoint_and_save, translate_sentence
import torch.optim as optim

learning_rate = 0.001
writer = SummaryWriter(f"runs/loss_plot")
step = 0

model = Seq2Seq(encoder_lstm, decoder_lstm).to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

pad_idx = french.vocab.stoi["<pad>"]
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

epoch_loss = 0.0
num_epochs = 1000
best_loss = 999999
best_epoch = -1
test_sentence = "Please write down your name, address, and phone number here."
ts1 = []

hist_seq2seq_loss = []
hist_seq2seq_bleu = []

tot_batch = len(train_iterator)

for epoch in range(num_epochs):
    print("Epoch - {} / {}".format(epoch+1, num_epochs))
    
    seq2seq_model.eval()
    translated_test_sentence = translate_sentence(seq2seq_model, test_sentence, english, french, device, max_length=50)
    translated_test_sentence = " ".join(translated_test_sentence)
    print(f"Source : \n {test_sentence}")
    print(f"Translation : \n {translated_test_sentence}")
    ts1.append(translated_test_sentence)
    seq2seq_model.train(True)
    
    for batch_idx, batch in enumerate(train_iterator):
        print(f"{batch_idx} / {tot_batch}", end='\r')
        
        input = batch.src.to(device)
        target = batch.trg.to(device)

        # Pass the input and target for model's forward method
        output = model(input, target)
        output = output[1:].reshape(-1, output.shape[2])
        target = target[1:].reshape(-1)

        # Clear the accumulating gradients
        optimizer.zero_grad()

        # Calculate the loss value for every epoch
        loss = criterion(output, target)

        # Calculate the gradients for weights & biases using back-propagation
        loss.backward()

        # Clip the gradient value is it exceeds > 1
        torch.nn.utils.clip_grad_norm_(seq2seq_model.parameters(), max_norm=1)

        # Update the weights values using the gradients we calculated using bp 
        optimizer.step()
        step += 1
        epoch_loss += loss.item()
        writer.add_scalar("Training loss", loss, global_step=step)
        
    epoch_loss = epoch_loss / len(train_iterator)
    
    hist_seq2seq_loss.append(epoch_loss)
    
    if epoch_loss < best_loss:
        best_loss = epoch_loss
        best_epoch = epoch
        checkpoint_and_save(model, best_loss, epoch, optimizer, epoch_loss) 
        
    if ((epoch - best_epoch) >= 10):
        print("no improvement in 10 epochs, break")
        break
        
    print("Epoch_Loss - {}".format(epoch_loss))
        

    score = bleu(test_data[100:600], seq2seq_model, english, french, device)
    print(f"Bleu score {score*100:.2f}")
    hist_seq2seq_bleu.append(score)
    
    epoch_loss = 0.0

Bleu score 0.00
Epoch - 15 / 1000
Source : 
 Please write down your name, address, and phone number here.
Translation : 
 veuillez vous en prie <eos>
saving873

Epoch_Loss - 1.6856157862307144
Bleu score 0.00
Epoch - 16 / 1000
Source : 
 Please write down your name, address, and phone number here.
Translation : 
 veuillez lui dire à <eos>
saving873

Epoch_Loss - 1.6265845902323859
Bleu score 0.00
Epoch - 17 / 1000
Source : 
 Please write down your name, address, and phone number here.
Translation : 
 écrivez à mon <eos>
saving873

Epoch_Loss - 1.5800690896049534
Bleu score 1.08
Epoch - 18 / 1000
Source : 
 Please write down your name, address, and phone number here.
Translation : 
 veuillez me le <eos>
saving873

Epoch_Loss - 1.5553015100450722
Bleu score 1.03
Epoch - 19 / 1000
Source : 
 Please write down your name, address, and phone number here.
Translation : 
 écrivez je suis suis -ce tu ne ne ne est est est a a a <eos>
saving873

Epoch_Loss - 1.5149828655067041
Bleu score 0.76
Epo

## Results and Conclusion

The Seq2seq model is powerful enough to produce decent translation on relatively small sentences and was once the state of the art for language translation. As the model solely rely on the Context Vector to encode the whole sentence it quickly become a bottleneck to learn complexe seq2seq translation.


In [None]:
# Plotting results

# Seq2Seq + Attention Mecanism

To improve the performance of the Seq2Seq model a new mecanism called **Attention** was introduced. It leverage the output of the encoder's LSTM layers in order to provide more information to the decoder. It is called *Attention* as the architecture of the network encourage the decoder to learn on which part of the encoder's output it should focus. Here is a schema showing the attention module added to the decoder network to leverage the encoder outputs :

![](./fig/attentionModule.png)

Based on the previous hidden state of the decoder and the previous token generated, the attention module compute a vector called the *Attention weights* having the same length as the output of the decoder. By mutliplying the *Attention weights* vectore to the encoder outputs we obtain a new vector representing the attention of the network on the various part of the encoder output.

## Encoder

The encoder is the same as the one of the Seq2seq model. But some changes in the code are required to be able to retrieve the output for each LSTM step. In the Seq2Seq model we passed the whole sequence directly to the LSTM and pytorch only return the last output and hidden state. Now we have to collect the output of the LSTM at each timestep of the sequence forcing us to loop manually on the sequence, to collect the output and the hidden state of the LSTM step and we also have to take care of transmitting the previous hidden state to the next step.

To be able to do that we just added an additional parameter to the *forward* methods representing the hidden state. We also added an *initHidden* method to generate an initial hidden state for the LSTM layer of the encoder. You will see how we use these different methods later in the notebook.


In [None]:
import torch.nn as nn

class AttnEncoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, dropout_p):
        super(EncoderLSTM, self).__init__()
        
        # Size of the input vector
        self.input_size = input_size
        
        # Size of the word embedding
        self.embedding_size = embedding_size
        
        # LSTM hidden layer size
        self.hidden_size = hidden_size
        
        # Number of LSTM layers
        self.num_layers = num_layers
        
        # Initializing the network layer
        
        self.dropout = nn.Dropout(dropout_p)
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.LSTM = nn.LSTM(self.embedding_size, self.hidden_size, self.num_layers, dropout=dropout_p)
        
        
    def forward(self, x, hidden):
        x = self.embedding(x)
        x = self.dropout(x)
        
        outputs, (hidden_state, cell_state) = self.LSTM(x, hidden)
        
        return outputs, hidden_state, cell_state
    
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size).to(device)


## Decoder

The decoder is modified to add the attention mecanism to its process. The schema for the decoder with attention is the following :

![](./fig/attentionDecoder.png)

In [None]:

class AttnDecoderLSTM(nn.Module):
    
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, dropout_p, output_size, max_len):
        super(AttnDecoderRNN, self).__init__()
        
        # Input size of the decoder (size of the context vector)
        self.input_size = input_size
        
        # Embedding size 
        self.embedding_size = embedding_size
        
        # Hidden unit size
        self.hidden_size = hidden_size
        
        # num of LSTM layers
        self.num_layers = num_layers
        
        # Vocabulary size of the target language
        self.output_size = output_size
        
        self.dropout = nn.Dropout(dropout_p)
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.LSTM = nn.LSTM(self.embedding_size, self.hidden_size, self.num_layers, dropout = dropout_p)
        self.fc = nn.Linear(self.hidden_size, self.output_size)
        
        # Attention layer
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.softmax = nn.Softmax()
        
        
    def forward(self, x, hidden_state, cell_state, encoder_outputs):
        
        # Shape of [1, batch_size]
        x = x.unsqueeze(0)
        
        x = self.embedding(x)
        embedding = self.dropout(x)
        
        # Attention computation
        attn_weights = self.softmax(self.attn(torch.cat((embedded[0], hidden_state), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))
        
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)
        
        outputs, (hidden_state, cell_state) = self.LSTM(output, (hidden_state, cell_state))
        
        predictions = self.fc(outputs)
        
        predictions = predictions.squeeze(0)
        
        return predictions, hidden_state, cell_state

    

## Attention Seq2Seq

Now we explicitly loop on the input data in order to collect to outputs of the encoder and pass them as an additionnal parameter for the decoder.

In [None]:
class AttnSeq2Seq(nn.Module):
    def __init__(self, Attn_Encoder_LSTM, Attn_Decoder_LSTM, max_length):
        super(Seq2Seq, self).__init__()
        self.Attn_Encoder_LSTM = Attn_Encoder_LSTM
        self.Attn_Decoder_LSTM = Attn_Decoder_LSTM
        self.max_length = max_length
        
    def forward(self, source, target, tfr=0.5):
        
        batch_size = source.shape[1]
        
        # Processing the encoder part
        
        # Tensor to collect the outputs
        encoder_outputs = torch.zeros(max_length, self.Attn_Encoder_LSTM.hidden_size).to(device)
        
        encoder_hidden = self.Attn_Encoder_LSTM.initHidden()
        
        input_len = source.shape[0]
        for idx in range(input_len):
            encoder_output, hidden_state, cell_state = self.Attn_Encoder_LSTM(source[idx], encoder_hidden)
            encoder_hidden = (hidden_state, cell_state)
            
            encoder_outputs[idx] = encoder_output
            
        # Processing decoder
        # Collecting a SOS token from the target in order to "seed" our decoder
        x = target[0]
        
        for i in range(1, target_len):
            output, hidden_state_decoder, cell_state_decoder = self.Attn_Decoder_LSTM(x, hidden_state_encoder, cell_state_encoder, encoder_outputs)
            outputs[i] = output
            best_guess = output.argmax(1) # 0th dimension is batch size, 1st dimension is word embedding
            x = target[i] if random.random() < tfr else best_guess # Either pass the next word correctly from the dataset or use the earlier predicted word


        return outputs
    