<a href="https://colab.research.google.com/github/EliaTorre/NLP/blob/main/Sequence_to_Sequence_Learning_with_Neural_Networks_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence to Sequence Learning with Neural Networks

In [31]:
import random
import math
import time

import spacy
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator

from torchtext.data.metrics import bleu_score

Downloading Spacy packages 

In [32]:
#!python -m spacy download en_core_web_sm
#!python -m spacy download de_core_news_sm

Initializing the seed to enforce reproducibility of results

In [33]:
seed = 4321
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

Loading the spacy modules for German and English

In [34]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

Defining the tokenizer functions. Following the paper, I reversed the order of the tokens within the German sentences

In [35]:
def deutsch_tokenizer(text):
    return [token.text for token in spacy_de.tokenizer(text)][::-1]

def english_tokenizer(text):
    return [token.text for token in spacy_en.tokenizer(text)]
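
To make the effect of the reversal concrete, here is a minimal sketch. It uses plain whitespace splitting as a stand-in for the spaCy tokenizer (an assumption made only so the snippet runs without downloading a model), and "toy_reversed_tokenizer" is a hypothetical name that does not appear elsewhere in this notebook.

```python
def toy_reversed_tokenizer(text):
    # Whitespace splitting stands in for spacy_de.tokenizer (assumption).
    return text.split()[::-1]

source = "zwei junge Männer sind im Freien"
reversed_tokens = toy_reversed_tokenizer(source)
print(reversed_tokens)
```

The last source word becomes the first token the encoder reads, so the first source words end up closest to the start of decoding.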

Through torchtext's Field class I appended "sos" and "eos" tokens at the beginning and end of the sentences and converted all words to lowercase

In [36]:
DE = Field(tokenize = deutsch_tokenizer, init_token = '<sos>', eos_token = '<eos>', lower = True)
EN = Field(tokenize = english_tokenizer, init_token = '<sos>', eos_token = '<eos>', lower = True)

I downloaded the Multi30k dataset, which contains parallel German, English and French translations of approx. 30k sentences with approx. 12 words per sentence. I used the "splits" method of the torchtext dataset class to divide the dataset into train/validation/test sets, where the "exts" argument specifies which language to use as source and which to use as target

In [37]:
train_data, validation_data, test_data = Multi30k.splits(exts = ('.de', '.en'), fields = (DE, EN))

In [38]:
print(f"# of training instances: {len(train_data.examples)}")
print(f"# of validation instances: {len(validation_data.examples)}")
print(f"# of testing instances: {len(test_data.examples)}")

# of training instances: 29000
# of validation instances: 1014
# of testing instances: 1000


I then build the German (DE) and English (EN) vocabularies from the training data, enforcing that only words which appear at least twice are included; any other word is replaced by an "unk" token

In [39]:
DE.build_vocab(train_data, min_freq = 2)
EN.build_vocab(train_data, min_freq = 2)

In [40]:
print(f"# of tokens in the German vocabulary: {len(DE.vocab)}")
print(f"# of tokens in the English vocabulary: {len(EN.vocab)}")

# of tokens in the German vocabulary: 7855
# of tokens in the English vocabulary: 5893
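
The frequency cut-off can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: "build_toy_vocab" and "numericalize" are hypothetical helpers, the special tokens are listed first by convention, and the remaining words are sorted alphabetically for determinism, which is not necessarily the order torchtext uses.

```python
from collections import Counter

def build_toy_vocab(tokenized_sentences, min_freq=2):
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Special tokens first, then every word seen at least min_freq times.
    itos = ["<unk>", "<pad>", "<sos>", "<eos>"]
    itos += sorted(tok for tok, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    # Words missing from the vocabulary fall back to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [["a", "dog", "runs"], ["a", "dog", "sleeps"], ["a", "cat", "runs"]]
stoi = build_toy_vocab(corpus, min_freq=2)
ids = numericalize(["a", "cat", "sleeps"], stoi)
print(ids)
```

Here "cat" and "sleeps" occur only once in the toy corpus, so both map to the same unknown-word index.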


I defined the device so that a GPU, when available, can be used to speed up the training

In [41]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

I define the iterators with a batch size of 128, so that the data is turned into an iterable object whose batches have a source ("src") and a target ("trg") attribute in which each tokenized word is mapped to its index in the vocabulary. 

I used "BucketIterator" instead of the standard "Iterator" because it groups sentences of similar length into the same batch, which minimizes the amount of padding and speeds up computation

In [42]:
batch = 128
train_iterator, validation_iterator, test_iterator = BucketIterator.splits((train_data, validation_data, test_data), batch_size = batch, device = device)
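
Why grouping by length helps can be seen by counting pad tokens. The sketch below is a simplification: it fully sorts the sentence lengths, whereas BucketIterator only groups sentences of similar length, and "padding_needed" is a hypothetical helper written for this illustration.

```python
def padding_needed(lengths, batch_size):
    """Count the pad tokens needed when consecutive sentences share a batch."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        longest = max(chunk)
        # Every sentence is padded up to the longest one in its batch.
        total += sum(longest - n for n in chunk)
    return total

lengths = [3, 12, 4, 11, 5, 10, 6, 9]      # toy sentence lengths
unsorted_padding = padding_needed(lengths, batch_size=2)
bucketed_padding = padding_needed(sorted(lengths), batch_size=2)
print(unsorted_padding, bucketed_padding)
```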

Following the paper, I developed an encoder-decoder architecture composed of three main classes: Encoder, Decoder and Seq2Seq. 

The Encoder is: "A multilayered Long-Short-Term Memory (LSTM) that maps the input sequence to a vector of fixed dimensionality [...] the LSTM is known to learn problems with long range temporal dependencies, so an LSTM may succeed in this setting". 

The Decoder is: "The second LSTM is essentially a recurrent neural network language model except that it is conditioned on the input sequence". It decodes the target sequence from the vector received from the encoder. 

Finally, the Seq2Seq class wraps both the Encoder and the Decoder to perform the entire encoding-decoding process


The Encoder class takes 5 inputs: 
- "input_dim", i.e., the dimensionality of the source vocabulary.
- "embedding_dim", i.e., the size of the embedding layer.
- "hidden_dim", i.e., the size of the hidden and cell states.
- "n_layers", i.e., the number of layers in the LSTM.
- "dropout", i.e., the share of dropout in our model.

As for the depth of the RNN, as opposed to the 4-layer architecture of the paper, I developed a 2-layer LSTM to keep the computational time reasonable. 

Then I defined the "forward" method of the class, where the input sentence is embedded and dropout is applied. Finally, self.rnn computes the hidden states and produces three outputs: 

- "outputs", i.e., the hidden states of the top layer at each time step. 
- "hidden", i.e., the final hidden state of each of the two layers. 
- "cell", i.e., the final cell state of each of the two layers.

I return just "hidden" and "cell", since "outputs" is not needed in this implementation. 

In [43]:
class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout = dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, de):  
        embedded = self.dropout(self.embedding(de))    
        outputs, (hidden, cell) = self.rnn(embedded)
          
        return hidden, cell

The inputs of the Decoder are the same as those of the Encoder, except that "input_dim" is replaced by "output_dim", i.e., the size of the target vocabulary.

The "forward" method of the Decoder is similar to that of the Encoder; however, since decoding proceeds one token at a time, the input is a batch of single tokens and must be "unsqueezed" to add a sequence dimension of length 1. Then I apply embedding and dropout and compute the new hidden and cell states from the previous ones. Finally, I pass the output through the linear layer to obtain the prediction over the target vocabulary.

In [44]:
class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(output_dim, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout = dropout)
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        input = input.unsqueeze(0)        
        embedded = self.dropout(self.embedding(input))
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        prediction = self.fc_out(output.squeeze(0))
        
        return prediction, hidden, cell

The following "Seq2Seq" class receives the source sequence, computes the context vectors through the "Encoder" class and predicts the target sentence through the "Decoder" class. 

In particular, "Seq2Seq" receives the encoder and decoder as inputs. 
Then, in the "forward" method, I read the dimensions "batch_size", "en_len" and "en_vocab" and use them to create a torch.zeros tensor that will store the predictions. I get the hidden and cell states by encoding the German sequences.

Then a loop runs over the target positions, and at each iteration the decoder predicts a single token. With probability "ratio" the next decoder input is the ground-truth token (teacher forcing); otherwise it is the token the decoder just predicted.

In [45]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, de, en, ratio = 0.5):         
        batch_size = en.shape[1]
        en_len = en.shape[0]
        en_vocab = self.decoder.output_dim
        outputs = torch.zeros(en_len, batch_size, en_vocab).to(self.device)
        hidden, cell = self.encoder(de)
        input = en[0,:]
        for t in range(1, en_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            force = random.random()<ratio
            best = output.argmax(1) 
            input = en[t] if force else best
        
        return outputs
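
The choice of the next decoder input inside the loop can be isolated in a small pure-Python sketch. The names "decoder_inputs" and "always_cat" are hypothetical, and the toy "decoder" is a function that always predicts the same wrong token, so that forced and non-forced inputs are easy to tell apart.

```python
import random

def decoder_inputs(target, predict, ratio, rng):
    """Return the token fed to the decoder at each decoding step."""
    fed = []
    current = target[0]                    # decoding always starts from <sos>
    for t in range(1, len(target)):
        fed.append(current)
        predicted = predict(current)       # the model's guess for position t
        force = rng.random() < ratio       # same test as in Seq2Seq.forward
        current = target[t] if force else predicted
    return fed

target = ["<sos>", "a", "dog", "<eos>"]
always_cat = lambda token: "cat"           # toy "decoder" that is always wrong
rng = random.Random(0)

print(decoder_inputs(target, always_cat, 1.0, rng))   # always the ground truth
print(decoder_inputs(target, always_cat, 0.0, rng))   # always the own prediction
```

With a ratio of 1 the decoder only ever sees reference tokens, while with a ratio of 0 it has to continue from its own (possibly wrong) predictions, which is the situation it faces at inference time.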

Here I initialize the parameters of the model

In [46]:
input_dim, output_dim = len(DE.vocab), len(EN.vocab)
encoder_embedding_dim, decoder_embedding_dim = 256, 256
hidden_dim, n_layers = 512, 2
encoder_dropout, decoder_dropout = 0.5, 0.5

encoder = Encoder(input_dim, encoder_embedding_dim, hidden_dim, n_layers, encoder_dropout)
decoder = Decoder(output_dim, decoder_embedding_dim, hidden_dim, n_layers, decoder_dropout)

model = Seq2Seq(encoder, decoder, device).to(device)

Following the paper, I initialize all model weights from a uniform distribution on the interval (-0.08, 0.08) and apply this initialization to the model

In [47]:
def weights(model):
    for n, p in model.named_parameters():
        nn.init.uniform_(p.data, -0.08, 0.08)
model.apply(weights);

In [48]:
def parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {parameters(model):,} trainable parameters')

The model has 13,899,013 trainable parameters
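
The reported number can be checked by hand. The sketch below recomputes it from the vocabulary sizes and hyperparameters printed above, assuming the standard LSTM parameterization with four gates and two bias vectors per layer; "lstm_params" is a hypothetical helper.

```python
def lstm_params(input_size, hidden_size, n_layers):
    total = 0
    for layer in range(n_layers):
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights and two biases.
        total += 4 * hidden_size * (in_size + hidden_size) + 2 * 4 * hidden_size
    return total

de_vocab, en_vocab = 7855, 5893            # vocabulary sizes printed above
emb, hid, layers = 256, 512, 2

encoder_params = de_vocab * emb + lstm_params(emb, hid, layers)
decoder_params = (en_vocab * emb + lstm_params(emb, hid, layers)
                  + hid * en_vocab + en_vocab)     # fc_out weights and bias
total_params = encoder_params + decoder_params
print(f"{total_params:,}")                 # 13,899,013
```

Most of the decoder's extra parameters come from the output layer, whose size grows linearly with the target vocabulary.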


Here I defined the optimizer, i.e., Adam, and the loss function, i.e., "CrossEntropyLoss". In particular, setting the "ignore_index" parameter to "en_pad_idx" makes it compute the average loss per token while ignoring the "pad" tokens

In [49]:
optimizer = optim.Adam(model.parameters())
en_pad_idx = EN.vocab.stoi[EN.pad_token]
loss_function = nn.CrossEntropyLoss(ignore_index = en_pad_idx)
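
What ignoring the padding index means for the average can be sketched without PyTorch. "masked_cross_entropy" is a hypothetical helper that works on plain lists of logits, and the padding index of 1 is an assumption chosen for the toy data.

```python
import math

def masked_cross_entropy(logits, targets, pad_idx):
    losses = []
    for row, target in zip(logits, targets):
        if target == pad_idx:
            continue                       # pad positions contribute nothing
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[target])      # -log softmax(row)[target]
    # Average over the non-pad tokens only.
    return sum(losses) / len(losses)

pad_idx = 1                                # assumed index of the pad token
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [5.0, 5.0, 5.0]]
targets = [0, 2, pad_idx]
loss = masked_cross_entropy(logits, targets, pad_idx)
print(round(loss, 4))
```

Changing the logits of the padded position leaves the loss untouched, so the model is never rewarded or penalized for what it predicts on padding.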

Next, I define the functions "train" and "test". 

The first one calls "model.train()" so that the dropout layers are active when running the model. It then iterates over the data iterator to update the parameters of the model. For each batch, it: 
- Gets the German and English sentences.
- Sets the previously calculated gradients to zero.
- Runs the model to obtain the predictions. 
- Reshapes the predictions and the targets through .view() to fit the loss function, dropping the first position, which corresponds to the "sos" token. 
- Performs backpropagation and clips the gradient norm to avoid exploding gradients.
- Updates the parameters through optimizer.step(). 
- Updates epoch_loss.

Then it returns the epoch_loss averaged over all batches. 

In [50]:
def train(model, iterator, optimizer, loss_function, clip):
    model.train()
    epoch_loss = 0

    for i, batch in enumerate(iterator):
        de = batch.src
        en = batch.trg
        optimizer.zero_grad()
        output = model(de, en)
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        en = en[1:].view(-1)        
        loss = loss_function(output, en)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()

    return epoch_loss / len(iterator)
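
The clipping step rescales all gradients together when their overall norm exceeds the threshold. Below is a minimal sketch of that idea on a flat list of numbers; "clip_by_global_norm" is a hypothetical helper, and the small constant in the denominator is an assumption used to guard against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm):
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Shrink only when the norm is too large; never scale gradients up.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

The direction of the gradient is preserved, only its length is capped, which is why a single very large batch gradient cannot blow up the update.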

The "test" function works similarly to the "train" function; however, in this case I call model.eval() so that dropout is disabled, and no optimizer step is taken, so the parameters are not updated. Teacher forcing is also turned off by passing a ratio of 0. 

Furthermore, torch.no_grad() is used to avoid the computational cost of tracking gradients.

In [51]:
def test(model, iterator, loss_function):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for i, batch in enumerate(iterator):
            de = batch.src
            en = batch.trg
            output = model(de, en, 0) 
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            en = en[1:].view(-1)
            loss = loss_function(output, en)
            epoch_loss += loss.item()
            
    return epoch_loss / len(iterator)

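The following "translate" function generates an English translation of a tokenized German sentence by greedy decoding: at each step it feeds the previously predicted token back into the decoder and stops at "eos" or after "max" tokens. It is needed to evaluate the model according to the BLEU score.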

In [52]:
def translate(sentence, de_field, en_field, model, device, max = 50):
    model.eval()
    tokens = [token.lower() for token in sentence]
    tokens = [de_field.init_token] + tokens + [de_field.eos_token]
    de_idx = [de_field.vocab.stoi[token] for token in tokens]
    de_tensor = torch.LongTensor(de_idx).unsqueeze(1).to(device)
    
    with torch.no_grad():
        hidden, cell = model.encoder(de_tensor)
    
    en_idx = [en_field.vocab.stoi[en_field.init_token]]
    
    for i in range(max):
        en_tensor = torch.LongTensor([en_idx[-1]]).to(device)
        with torch.no_grad():
            output, hidden, cell = model.decoder(en_tensor, hidden, cell)
        pred_token = output.argmax(1).item()
        en_idx.append(pred_token)
        if pred_token == en_field.vocab.stoi[en_field.eos_token]:
            break

    en_tokens = [en_field.vocab.itos[i] for i in en_idx]

    return en_tokens[1:]

The "bleu" function computes the BLEU score, a metric specifically designed to assess the quality of a translation. It measures the n-gram overlap between the reference and predicted English sequences. torchtext's bleu_score returns a value between 0 and 1, which I multiply by 100 when reporting it, so that 100 corresponds to a perfect match with the reference

In [53]:
def bleu(data, de_field, en_field, model, device, max = 50):
    ens = []
    pred_ens = []  

    for x in data:
        de = vars(x)['src']
        en = vars(x)['trg']
        pred_en = translate(de, de_field, en_field, model, device, max)
        pred_en = pred_en[:-1]
        pred_ens.append(pred_en)
        ens.append([en])
        
    return bleu_score(pred_ens, ens)
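
The core ingredient of BLEU is the clipped (or "modified") n-gram precision. The sketch below computes only that ingredient, not the full score with its brevity penalty and geometric mean, and "ngrams" and "modified_precision" are hypothetical helpers rather than torchtext functions.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it appears in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = ["a", "dog", "runs"]
references = [["a", "dog", "is", "running"]]
p1 = modified_precision(candidate, references, 1)
p2 = modified_precision(candidate, references, 2)
print(p1, p2)
```

The clipping is what prevents a degenerate output such as "the the the" from scoring well against a reference that contains "the" only once.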

Here I defined a function to measure the time elapsed between each epoch of the training process

In [54]:
def count_time(start, end):
    elapsed = end - start
    mins = int(elapsed / 60)
    secs = int(elapsed - (mins * 60))
    return mins, secs

Finally, I defined the training loop, which computes the "train_loss" and "validation_loss" of each epoch and saves the best parameter configuration according to the validation_loss. 

It then prints a summary of the train/validation loss of each epoch and the corresponding perplexity
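
Perplexity is simply the exponential of the average per-token cross-entropy. As a small sanity check, sketched under the assumption that the loss is measured in nats, a model that guesses uniformly over the 5893 English tokens has a perplexity equal to the vocabulary size; "perplexity" is a hypothetical helper mirroring the math.exp call used in the loop.

```python
import math

def perplexity(avg_loss):
    # avg_loss is the mean per-token cross-entropy in nats.
    return math.exp(avg_loss)

uniform_loss = math.log(5893)      # loss of a uniform guess over 5893 tokens
ppl = perplexity(uniform_loss)
print(round(ppl))
```

A perplexity of, say, 60 can therefore be read as the model being about as uncertain as if it were choosing uniformly among 60 tokens at each step.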

In [55]:
epochs, clip, best = 15, 1, float('inf')

for epoch in range(epochs):
    start = time.time()
    train_loss = train(model, train_iterator, optimizer, loss_function, clip)
    validation_loss = test(model, validation_iterator, loss_function)
    end = time.time()
    mins, secs = count_time(start, end)

    if validation_loss < best:
        best = validation_loss
        torch.save(model.state_dict(), 'LSTM-model.pt')

    print(f'Epoch: {epoch + 1}, Time: {mins}m {secs}s')
    print(f'Train Loss: {train_loss:.3f}, Validation Loss: {validation_loss:.3f} ')
    print(f'Train PPL: {math.exp(train_loss):7.3f}, Validation PPL: {math.exp(validation_loss):7.3f}\n')

Epoch: 1, Time: 0m 29s
Train Loss: 5.058, Validation Loss: 4.936 
Train PPL: 157.267, Validation PPL: 139.174

Epoch: 2, Time: 0m 29s
Train Loss: 4.506, Validation Loss: 4.813 
Train PPL:  90.584, Validation PPL: 123.145

Epoch: 3, Time: 0m 29s
Train Loss: 4.173, Validation Loss: 4.684 
Train PPL:  64.883, Validation PPL: 108.189

Epoch: 4, Time: 0m 29s
Train Loss: 3.979, Validation Loss: 4.457 
Train PPL:  53.465, Validation PPL:  86.233

Epoch: 5, Time: 0m 29s
Train Loss: 3.785, Validation Loss: 4.370 
Train PPL:  44.040, Validation PPL:  79.061

Epoch: 6, Time: 0m 29s
Train Loss: 3.637, Validation Loss: 4.248 
Train PPL:  37.979, Validation PPL:  69.989

Epoch: 7, Time: 0m 29s
Train Loss: 3.513, Validation Loss: 4.129 
Train PPL:  33.533, Validation PPL:  62.094

Epoch: 8, Time: 0m 29s
Train Loss: 3.369, Validation Loss: 4.107 
Train PPL:  29.046, Validation PPL:  60.790

Epoch: 9, Time: 0m 30s
Train Loss: 3.264, Validation Loss: 4.070 
Train PPL:  26.142, Validation PPL:  58.569

E

In the following cells, I evaluate the performance of the trained architecture on the unseen test set and compute its BLEU score

In [56]:
model.load_state_dict(torch.load('LSTM-model.pt'))
test_loss = test(model, test_iterator, loss_function)
print(f'Test Loss: {test_loss:.3f}, Test PPL: {math.exp(test_loss):7.3f}')

Test Loss: 3.710, Test PPL:  40.861


In [57]:
# Use a new name so that the imported bleu_score function is not overwritten
test_bleu = bleu(test_data, DE, EN, model, device)
print(f'BLEU score: {test_bleu*100:.2f}')

BLEU score: 14.14
