# Introduction to Natural Language Processing 2 Lab01

In this lab, we are going to implement a pretty decent machine translation model using the transformer and then compare it with differents decoding functions.

---
## Installation

In [1]:
%matplotlib inline

Let's install required packages for the project.

In [2]:
!pip install -U torchtext==0.12 torchdata==0.3.0
!pip install -U spacy sacrebleu
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm

Collecting torchdata==0.3.0
  Downloading torchdata-0.3.0-py3-none-any.whl (47 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.7/47.7 kB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: torchdata
Successfully installed torchdata-0.3.0
Collecting spacy
  Downloading spacy-3.4.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.4/6.4 MB[0m [31m28.3 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting sacrebleu
  Downloading sacrebleu-2.3.1-py3-none-any.whl (118 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m118.9/118.9 kB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m
Collecting thinc<8.2.0,>=8.1.0
  Downloading thinc-8.1.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (806 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m806.2/806.2 kB[0m [31m42.8 MB/s[0m eta [36m0:00:00[0m
Installing collected 

--- 
# Language Translation with nn.Transformer and torchtext

To start with, we just follow the pyTorch [language translation with nn.Transformer and torchtext tutorial.](https://pytorch.org/tutorials/beginner/translation_transformer.html)

This tutorial shows:
- How to train a translation model from scratch using Transformer. 
- Use tochtext library to access  [Multi30k](http://www.statmt.org/wmt16/multimodal-task.html#task1) dataset to train a German to English translation model.


## Data Sourcing and Processing

[torchtext library](https://pytorch.org/text/stable/) has utilities for creating datasets that can be easily
iterated through for the purposes of creating a language translation
model. In this example, we show how to use torchtext's inbuilt datasets, 
tokenize a raw text sentence, build vocabulary, and numericalize tokens into tensor. We will use
[Multi30k dataset from torchtext library](https://pytorch.org/text/stable/datasets.html#multi30k) that yields a pair of source-target raw sentences. 

In [3]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import multi30k, Multi30k
from typing import Iterable, List, Tuple, Callable

# We need to modify the URLs for the dataset since the links to the original dataset are broken
# Refer to https://github.com/pytorch/text/issues/1756#issuecomment-1163664163 for more info
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

# Place-holders
token_transform = {}
vocab_transform = {}

# Create source and target language tokenizer. Make sure to install the dependencies.
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')

def yield_tokens(data_iter: List[str], language: str) -> List[str]:
    """
    Helper function to yield list of tokens.

    Parameters
    ----------
    data_iter: List[str]
        list of tokens to yield
    language: str
        destination language
    
    Returns
    -------
    List[str]
    the yielded tokens' list
    """
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']
 
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data Iterator 
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Create torchtext's Vocab object 
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

# Set UNK_IDX as the default index. This index is returned when the token is not found. 
# If not set, it throws RuntimeError when the queried token is not found in the Vocabulary. 
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
  vocab_transform[ln].set_default_index(UNK_IDX)

2022-11-17 22:34:13.278739: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-17 22:34:13.280016: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-17 22:34:13.281353: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-17 22:34:13.282401: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-17 22:34:13.283401: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from S

## Seq2Seq Network using Transformer

Transformer is a Seq2Seq model introduced in [“Attention is all you
need”](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)_
paper for solving machine translation tasks. 
Below, we will create a Seq2Seq network that uses Transformer. The network
consists of three parts. First part is the embedding layer. This layer converts tensor of input indices
into corresponding tensor of input embeddings. These embedding are further augmented with positional
encodings to provide position information of input tokens to the model. The second part is the 
actual [Transformer](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)_ model. 
Finally, the output of Transformer model is passed through linear layer
that give un-normalized probabilities for each token in the target language. 




In [132]:
from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Fix random state to get reproductible results
torch.random.manual_seed(42)
import random
random.seed(42)

class PositionalEncoding(nn.Module):
    """
    Helper Module that adds positional encoding to the token embedding to introduce a notion of word order.
    
    Attributes
    ----------
    dropout: torch.nn.Module
        dropout layer

    Methods
    -------
    forward(self, token_embedding: Tensor) -> Tensor
        Neural network module forwarding function
    """
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000) -> None:
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor) -> Tensor:
        """
        Neural network module forwarding function

        Parameters
        ----------
        tokens: Tensor
            tokens tensor

        Return
        ------
        Tensor
        Tensor of tokens after dropout
        """
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

class TokenEmbedding(nn.Module):
    """
    Helper Module to convert tensor of input indices into corresponding tensor of token embeddings
    
    Attributes
    ----------
    embedding: torch.nn.Module
        embedding layer

    Methods
    -------
    forward(self, token_embedding: Tensor) -> Tensor
        Neural network module forwarding function
    """
    
    def __init__(self, vocab_size: int, emb_size) -> None:
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor) -> Tensor:
        """
        Neural network module forwarding function

        Parameters
        ----------
        tokens: Tensor
            tokens tensor

        Return
        ------
        Tensor
        Tensor of tokens embedding after forward
        """
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

class Seq2SeqTransformer(nn.Module):
    """
    Seq2Seq Network 
    
    Attributes
    ----------
    transformer: torch.nn.Module
        dropout layer

    Methods
    -------
    forward(self, token_embedding: Tensor) -> Tensor
        Neural network module forwarding function
    """
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1) -> None:
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor) -> Tensor:
        """
        Seq2Seq module forwarding function

        Parameters
        ----------
        src: Tensor
            source tokens tensor
        trg: Tensor
            target tokens tensor
        src_mask: Tensor
            mask for source tokens
        tgt_mask: Tensor
            mask for target tokens
        src_padding_mask: Tensor
            padding mask for source tokens
        tgt_padding_mask: Tensor
            padding mask for target tokens
        memory_key_padding_mask: Tensor
            memory of padding_mask for transformer

        Return
        ------
        Tensor
        Generated outputs after applying transformer on source and target embeddings
        """
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None, 
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor) -> Tensor:
        """
        Encode source tokens by applying transformer

        Parameters
        ----------
        src: Tensor
            source tokens
        src_mask: Tensor
            mask for source tokens

        Return
        ------
        Tensor
        Encoded tensor of source tokens
        """
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor) -> Tensor:
        """
        Decode target tokens by applying transformer decoder 

        Parameters
        ----------
        tgt: Tensor
            target tokens
        memory: Tensor
            memory tensor use by decoder
        tgt_mask: Tensor
            mask for target tokens
        
        Return
        ------
        Tensor
        Decoded tensor of target tokens
        """
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

During training, we need a subsequent word mask that will prevent model to look into
the future words when making predictions. We will also need masks to hide
source and target padding tokens. Below, let's define a function that will take care of both. 




In [5]:
def generate_square_subsequent_mask(sz: int) -> Tensor:
    """
    Generate a subsequent mask that will prevent model to look into the future words when making predictions

    Parameters
    ----------
    sz: int
        sequence length
    
    Return
    ------
    Tensor
    Subsequent mask
    """
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


def create_mask(src : Tensor, tgt: Tensor) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
    """
    Create mask and padding mask for source and target tokens

    Parameters
    ----------
    src: Tensor
        source tokens
    tgt: Tensor
        target tokens

    Return
    ------
    Tuple[Tensor, Tensor, Tensor, Tensor]
    mask for source tokens, mask for target tokens, padding mask for source tokens and padding mask for target tokens
    """
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask

Let's now define the parameters of our model and instantiate it. Below, we also 
define our loss function which is the cross-entropy loss and the optimizer used for training, here Adam.




In [6]:
torch.manual_seed(0)

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, 
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

## Collation

As seen in the ``Data Sourcing and Processing`` section, our data iterator yields a pair of raw strings. 
We need to convert these string pairs into the batched tensors that can be processed by our ``Seq2Seq`` network 
defined previously. Below we define our collate function that convert batch of raw strings into batch tensors that
can be fed directly into our model.   




In [7]:
from torch.nn.utils.rnn import pad_sequence

def sequential_transforms(*transforms: Iterable) -> Callable:
    """
    Helper function to club together sequential operations

    Parameters
    ----------

    transforms: Iterable
        callable function iterator

    Return
    ------
    Callable
    Callable function to apply all sequential operations on a text input
    """
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

def tensor_transform(token_ids: List[int]) -> Tensor:
    """
    Function to add BOS/EOS and create tensor for input sequence indices

    Parameters
    ----------
    token_ids: List[int]
        token indices
    
    Return
    ------
    Tensor
    Final tensor with BOS/EOS added
    """
    return torch.cat((torch.tensor([BOS_IDX]), 
                      torch.tensor(token_ids), 
                      torch.tensor([EOS_IDX])))

# src and tgt language text transforms to convert raw strings into tensors indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], #Tokenization
                                               vocab_transform[ln], #Numericalization
                                               tensor_transform) # Add BOS/EOS and create tensor


def collate_fn(batch: List) -> Tuple[Tensor, Tensor]:
    """
    Function to collate data samples into batch tensors

    Parameters
    ----------
    batch: List[Tensor]
        batch of tensor to collate

    Return
    ------
    Tuple[Tensor, Tensor]
    source batch and target batch that can be fed into our model
    """
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch

Let's define training and evaluation loop that will be called for each 
epoch.




In [8]:
from torch.utils.data import DataLoader

def train_epoch(model : torch.nn.Module, optimizer: torch.optim) -> float:
    """
    Train function for a model used for each epoch

    Parameters
    ----------
    model: torch.nn.Module
        model to train
    optimizer: torch.optim
        optimizer to use for training
    
    Return
    ------
    float
    Training loss
    """
    model.train()
    losses = 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)
    
    for src, tgt in train_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(list(train_dataloader))


def evaluate(model: torch.nn.Module) -> float:
    """
    Evaluation function for a model used for each epoch

    Parameters
    ----------
    model: torch.nn.Module
        model to evaluate

    Return
    ------
    float
    Validation loss
    """
    model.eval()
    losses = 0

    val_iter = Multi30k(split='valid', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in val_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)
        
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()

    return losses / len(list(val_dataloader))

Now we have all the ingredients to train our model. Let's do it !  
We choose to train on 20 epochs for our results.




In [9]:
from timeit import default_timer as timer
NUM_EPOCHS = 20

for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer)
    end_time = timer()
    val_loss = evaluate(transformer)
    print((f"Epoch {epoch:<2}: Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))


Epoch 1 : Train loss: 5.344, Val loss: 4.114, Epoch time = 42.887s
Epoch 2 : Train loss: 3.761, Val loss: 3.320, Epoch time = 43.852s
Epoch 3 : Train loss: 3.162, Val loss: 2.895, Epoch time = 44.492s
Epoch 4 : Train loss: 2.768, Val loss: 2.637, Epoch time = 44.752s
Epoch 5 : Train loss: 2.481, Val loss: 2.442, Epoch time = 44.597s
Epoch 6 : Train loss: 2.250, Val loss: 2.317, Epoch time = 44.455s
Epoch 7 : Train loss: 2.060, Val loss: 2.200, Epoch time = 45.140s
Epoch 8 : Train loss: 1.897, Val loss: 2.112, Epoch time = 44.557s
Epoch 9 : Train loss: 1.754, Val loss: 2.060, Epoch time = 45.015s
Epoch 10: Train loss: 1.631, Val loss: 2.000, Epoch time = 45.844s
Epoch 11: Train loss: 1.524, Val loss: 1.975, Epoch time = 44.672s
Epoch 12: Train loss: 1.420, Val loss: 1.944, Epoch time = 44.455s
Epoch 13: Train loss: 1.333, Val loss: 1.966, Epoch time = 45.275s
Epoch 14: Train loss: 1.252, Val loss: 1.941, Epoch time = 44.875s
Epoch 15: Train loss: 1.173, Val loss: 1.925, Epoch time = 44.

--- 
## Decoding functions

To use our trained model to translate, we need decoding functions.

### Greedy algorithm
The tutorial give a greedy approach at decoding. Let's implement it and test our model with it first.

In [27]:
def greedy_decode(model: torch.nn.Module, src: Tensor, src_mask: Tensor, max_len: int, start_symbol: int) -> Tensor:
    """
    Function to generate output sequence using greedy algorithm 

    Parameters
    ----------
    model: torch.nn.Module
        model to decode
    src: Tensor
        source tokens
    src_mask: Tensor
        mask source tokens
    max_len: int
        number of tokens
    start_symbol: int
        first symbol of output sequence
    
    Return
    ------
    Tensor
    Output sequence
    """
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys

In [28]:
def translate_greedy(model: torch.nn.Module, src_sentence: str) -> str:
    """
    Function to translate input sentence into target language with greedy algorithm

    Parameters
    ----------
    model: torch.nn.Module
        pre-trained model used for translate
    src_sentence: str
        sentence to translate
    
    Return
    ------
    str
    Translated sentence in the target language of the transformer (model)
    """
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")

In [29]:
print(translate_greedy(transformer, "Eine Gruppe von Menschen steht vor einem Iglu."))

 A group of people standing in front of an igloo 


### Top K sampling algorithm

To compare our result with the greedy algorithm, let's implement a top k sampling algorithm with and without temperature.

In [113]:
def top_k_sampling(model: torch.nn.Module, src: Tensor, src_mask: Tensor, max_len: int, start_symbol: int, k: int, temp: float = 1.0) -> Tensor:
    """
    Function to generate output sequence using top K sampling algorithm

    Parameters
    ----------
    model: torch.nn.Module
        model to decode
    src: Tensor
        source tokens
    src_mask: Tensor
        mask source tokens
    max_len: int
        number of tokens
    start_symbol: int
        first symbol of output sequence
    k: int
        top number of samples
    temp: float
        temperature
    
    Return
    ------
    Tensor
    Output sequence
    """
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        prob = prob.div(temp)
        prob = torch.nn.functional.softmax(prob, dim=1)
        
        top_k_prob, top_k_idx = torch.topk(prob, k)
        top_k_prob = top_k_prob.squeeze(0).cpu()
        top_k_idx = top_k_idx.squeeze(0).cpu()
        next_word = torch.multinomial(top_k_prob, 1)[0]
        next_word = top_k_idx[next_word].item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys

In [114]:
def translate_top_k(model: torch.nn.Module, src_sentence: str, k: int, temp : float = 1.0) -> str:
    """
    Function to translate input sentence into target language with top k sampling algorithm

    Parameters
    ----------
    model: torch.nn.Module
        pre-trained model used for translate
    src_sentence: str
        sentence to translate
    k: int
        top number of samples
    temp: float
        temperature
    Return
    ------
    str
    Translated sentence in the target language of the transformer (model)
    """
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = top_k_sampling(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX, k=k, temp=temp).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")

In [115]:
print("Target sentence: A group of people stand in front of an igloo.")
print("--------------")
print("Without temperature variation:")
for k in range(1, 10):
    print(f"k = {k} | temp = 1.0:", translate_top_k(transformer, "Eine Gruppe von Menschen steht vor einem Iglu .", k))
print("--------------")
print("With temperature variation:")
for k in range(1, 10):
    for temp in [0.1, 0.3, 0.5, 0.7, 0.9]:
        print(f"k = {k} | temp = {temp}:", translate_top_k(transformer, "Eine Gruppe von Menschen steht vor einem Iglu .", k, temp))

Target sentence: A group of people stand in front of an igloo.
--------------
Without temperature variation:
k = 1 | temp = 1.0:  A group of people standing in front of an igloo 
k = 2 | temp = 1.0:  A group of people stand in front of an igloo 
k = 3 | temp = 1.0:  A group of people in an olive suit . 
k = 4 | temp = 1.0:  A group of people are standing in front of an igloo . 
k = 5 | temp = 1.0:  A group of people stand in front of an igloo 
k = 6 | temp = 1.0:  A group of people standing in front an auditorium . 
k = 7 | temp = 1.0:  A group of people standing in front of an office setting . 
k = 8 | temp = 1.0:  A group of people in front of an abandoned office . 
k = 9 | temp = 1.0:  A group of people stand in front of an igloo . 
--------------
With temperature variation:
k = 1 | temp = 0.1:  A group of people standing in front of an igloo 
k = 1 | temp = 0.3:  A group of people standing in front of an igloo 
k = 1 | temp = 0.5:  A group of people standing in front of an igloo 
k

We can observe that more that if `k` is too big, the results are bad, and the `temp` (temperature) seems to give better results when it is low.

### Beam search algorithm

Let's code now a beam search (from scratch) for the decoding function.

In [133]:
def beam_search(model: torch.nn.Module, src: Tensor, src_mask: Tensor, max_len: int, start_symbol: int, beam_size: int) -> Tensor:
    """
    Function to generate output sequence using beam search algorithm

    Parameters
    ----------
    model: torch.nn.Module
        model to decode
    src: Tensor
        source tokens
    src_mask: Tensor
        mask source tokens
    max_len: int
        number of tokens
    start_symbol: int
        first symbol of output sequence
    beam_size: int
        number of sample to keep at each step   
    Return
    ------
    Tensor
    Output sequence
    """
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    memory = memory.to(DEVICE)
    tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                .type(torch.bool)).to(DEVICE)
    out = model.decode(ys, memory, tgt_mask)
    out = out.transpose(0, 1)
    prob = model.generator(out[:, -1])

    prob = torch.nn.functional.softmax(prob, dim=1)
    top_k_prob, top_k_idx = torch.topk(prob, beam_size)

    top_k_prob = top_k_prob.squeeze(0).cpu()
    top_k_idx = top_k_idx.squeeze(0).cpu()

    top_k_prob = top_k_prob.tolist()
    top_k_idx = top_k_idx.tolist()

    top_k_prob = [[prob, [idx]] for prob, idx in zip(top_k_prob, top_k_idx)]

    next_word = top_k_prob[0][1][0]
    ys = torch.cat([ys,
                    torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)

    for i in range(max_len-1):
        temp = []
        for top_prob, top_idx in top_k_prob:
            ys = torch.tensor(top_idx).view(-1, 1).to(DEVICE)
            memory = memory.to(DEVICE)
            tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                        .type(torch.bool)).to(DEVICE)
            out = model.decode(ys, memory, tgt_mask)
            out = out.transpose(0, 1)
            prob = model.generator(out[:, -1])
            prob = torch.nn.functional.softmax(prob, dim=1)
            
            top_k_prob, top_k_idx = torch.topk(prob, beam_size)

            top_k_prob = top_k_prob.squeeze(0).cpu()
            top_k_idx = top_k_idx.squeeze(0).cpu()

            top_k_prob = top_k_prob.tolist()
            top_k_idx = top_k_idx.tolist()

            for t_prob, t_idx in zip(top_k_prob, top_k_idx):
                temp.append([top_prob * t_prob, top_idx + [t_idx]])

        top_k_prob = sorted(temp, key=lambda x: x[0], reverse=True)[:beam_size]
        for prob, idx in top_k_prob:
            if idx[:-1] == EOS_IDX:
                break
    return torch.tensor(top_k_prob[0][1])

In [134]:
def translate_beam_search(model: torch.nn.Module, src_sentence: str, beam_size: int) -> str:
    """
    Function to translate input sentence into target language with beam search algorithm 

    Parameters
    ----------
    model: torch.nn.Module
        pre-trained model used for translate
    src_sentence: str
        sentence to translate
    beam_size: int
        beam size use for decoding function
    Return
    ------
    str
    Translated sentence in the target language of the transformer (model)
    """
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = beam_search(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX, beam_size=beam_size)
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")

In [135]:
print("Target sentence: A group of people stand in front of an igloo.")
print("--------------")
print("Beam size variation:")
for size in range(1, 10):
    print(f"size = {size}:", translate_top_k(transformer, "Eine Gruppe von Menschen steht vor einem Iglu .", size))

Target sentence: A group of people stand in front of an igloo.
--------------
Beam size variation:
size = 1:  A group of people standing in front of an igloo 
size = 2:  A group of people stand in front of an igloo 
size = 3:  A group of people in an office setting . 
size = 4:  A group of people standing in an office setting . 
size = 5:  A group of people in front of an igloo . 
size = 6:  A group of people stand in front of a igloo . 
size = 7:  There is a group of people in an igloo 
size = 8:  A group of people stand in front an olive mostly suit . 
size = 9:  A group of people stands in front of an auditorium . 


### Compare decoding functions

Let's now qualitatively compare a few (at least 3) translation samples for each approach (even the greedy one).

In [136]:
k = 3
temp_top_k = 0.7
beam_size = 5

def compare(model: torch.nn.Module, src_sentence: str, tgt_sentence: str) -> None:
    """
    Function to compare the performance of greedy, top-k sampling and beam search algorithms

    Parameters
    ----------
    model: torch.nn.Module
        pre-trained transformer model
    src_sentence: str
        source sentence to translate
    tgt_sentence: str
        expected sentence after translate
    """
    print(f"Source sentence: {src_sentence}")
    print(f"Target sentence: {tgt_sentence}")
    print(f"Greedy search: {translate_greedy(model, src_sentence)}")
    print(f"Top-k (k = {k}) sampling without temperature: {translate_top_k(model, src_sentence, k)}")
    print(f"Top-k (k = {k}) sampling with temperature = {temp_top_k}: {translate_top_k(model, src_sentence, k, temp_top_k)}")
    print(f"Beam search (size = {beam_size}): {translate_beam_search(model, src_sentence, beam_size)}")
    print("------------------")

compare(transformer, "Eine Gruppe von Menschen steht vor einem Iglu .", "A group of people stand in front of an igloo.")
compare(transformer, "Ein Mann in einem gelben Hut, der etwas anstarrt .", "A man in a yellow hat staring at something.")
compare(transformer, "Ich möchte ein Bier.", "I want a beer.")

Source sentence: Eine Gruppe von Menschen steht vor einem Iglu .
Target sentence: A group of people stand in front of an igloo.
Greedy search:  A group of people standing in front of an igloo 
Top-k (k = 3) sampling without temperature:  Group of people standing in front of an olive office . 
Top-k (k = 3) sampling with temperature = 0.7:  A group of people standing in front of an auditorium . 
Beam search (size = 5): A group of people standing in front of an igloo  .    
------------------
Source sentence: Ein Mann in einem gelben Hut, der etwas anstarrt .
Target sentence: A man in a yellow hat staring at something.
Greedy search:  A man in a yellow hat is using something bubble routine . 
Top-k (k = 3) sampling without temperature:  A man in a yellow hat with something tired hair . 
Top-k (k = 3) sampling with temperature = 0.7:  A man in a yellow hat is using something control of them . 
Beam search (size = 5): A man in a yellow hat is making something tea .       
-----------------

## Compute the BLEU score of the model

We are going to use the sacreBLEU implementation to evaluate our model and quantitatively compare the 4 implemented decoding approaches.

In [137]:
from sacrebleu.metrics import BLEU
import sacrebleu

def BLEU_compare(model: torch.nn.Module, sentences: List[str], refs: List[List[str]]) -> None:
    """
    Apply decoding algorithms on sentences and evaluate them with BLUE score metrics with references sentences
    
    Parameters
    ----------
    model: torch.nn.Module
        pre-trained transformer model
    sentences: List[str]
        sentences to translate
    refs: List[List[str]]
        List of list of references for the expected translations
    """
    
    hyps_greedy = ("Greedy", [translate_greedy(model, sentence) for sentence in sentences ])
    hyps_topk_no_temp = ("Top k sampling without temperature", [ translate_top_k(model, sentence, k) for sentence in sentences ])
    hyps_topk_with_temp = ("Top k sampling with temperature = "+ str(temp_top_k), [ translate_top_k(model, sentence, k, temp_top_k) for sentence in sentences ])
    hyps_beam_search = ("Beam search with size = " + str(beam_size), [translate_beam_search(model, sentence, beam_size) for sentence in sentences ])

    hyps = [ hyps_greedy , hyps_topk_no_temp, hyps_topk_with_temp, hyps_beam_search]

    # Compute the corpus score
    bleu = BLEU()
    for name, h in hyps:
        res = bleu.corpus_score(h, refs)
        print(f"{name:<38}: {str(res)}")

refs = [[
         "A group of people stand in front of an igloo.",
         "A man in a yellow hat staring at something.",
         "I want a beer."
         ]]

sentences = ["Eine Gruppe von Menschen steht vor einem Iglu .", "Ein Mann in einem gelben Hut, der etwas anstarrt .", "Ich möchte ein Bier."]
BLEU_compare(transformer, sentences, refs)

Greedy                                : BLEU = 49.55 75.0/56.0/45.5/31.6 (BP = 1.000 ratio = 1.077 hyp_len = 28 ref_len = 26)
Top k sampling without temperature    : BLEU = 55.49 71.9/55.2/50.0/47.8 (BP = 1.000 ratio = 1.231 hyp_len = 32 ref_len = 26)
Top k sampling with temperature = 0.7 : BLEU = 64.55 79.3/65.4/60.9/55.0 (BP = 1.000 ratio = 1.115 hyp_len = 29 ref_len = 26)
Beam search with size = 5             : BLEU = 54.29 78.6/60.0/50.0/36.8 (BP = 1.000 ratio = 1.077 hyp_len = 28 ref_len = 26)


The output values mean of the corpus score:
- FIRST NUMBER: final BLEU score
- NUM/NUM/NUM/NUM: precision value for 1–4 ngram order
- BP: Brevity Penalty
- ratio: ratio between the length of hypothesis and reference sentences
- hyp_len: total number of characters for hypothesis sentences
- ref_len: total number of characters for reference sentences

We can observe that with this parameters for top k sampling and beam search, we get better result than the greedy algoritm, as expected.
We should benchmark the BLUE score with different parameters for the top k sampling and beam search algorithms, with different k sampling, temperature and beam size.

We could observe that temperature for top k sampling is usefull to get better results.

## References

1. Attention is all you need paper.
   https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2. The annotated transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html#positional-encoding
