Guillaume Thibault 1948612

Julien Witty 1949837

# Machine translation

The goal of this lab (TP) is to build a machine translation model.
You will be comparing the performance of three different architectures:
* A vanilla RNN
* A GRU-RNN
* A transformer

You are provided with the code to load and build the pytorch dataset,
and the code for the training loop.
You "only" have to code the architectures.
Of course, the use of built-in torch layers such as `nn.GRU`, `nn.RNN` or `nn.Transformer`
is forbidden, as there would be no exercise otherwise.

The source sentences are in English and the target language is French.

This is also an opportunity for you to see what a basic machine learning pipeline looks like.
Take a look at the given code, you might learn a lot!

Do not forget to **select the runtime type as GPU!**

**Sources**

* Dataset: [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/)

<!---
M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proc. of EAMT, pp. 261-268, Trento, Italy. pdf, bib. [paper](https://aclanthology.org/2012.eamt-1.60.pdf). [website](https://wit3.fbk.eu/2016-01).
-->

* The code is inspired by this [pytorch tutorial](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html).

*This notebook is quite big, use the table of contents to easily navigate through it.*

# Imports and data initializations

We first download and parse the dataset. From the parsed sentences
we can build the vocabularies and the torch datasets.
The end goal of this section is to have an iterator
that yields pairs of translated sentences,
where each sentence is a sequence of tokens.

## Imports

In [None]:
!python3 -m spacy download en > /dev/null
!python3 -m spacy download fr > /dev/null
!pip install torchinfo > /dev/null
!pip install einops > /dev/null
!pip install wandb > /dev/null


from itertools import takewhile
from collections import Counter, defaultdict

import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator, Vocab
from torchtext.datasets import IWSLT2016

import einops
import wandb
from torchinfo import summary

The tokenizers are objects that split a Python string into a list of tokens (words, punctuation marks, special tokens...), returned as a list of strings.

The special tokens each serve a particular purpose:
* *\<unk\>*: Replaces a word that is not in the vocabulary
* *\<pad\>*: Virtual token used as padding so that all the sentences of a batch have the same length
* *\<bos\>*: Token indicating the beginning of a sentence in the target sequence
* *\<eos\>*: Token indicating the end of a sentence in the target sequence
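
As a minimal sketch of how these special tokens fit together: the whitespace tokenizer and the tiny `toy_vocab` below are hypothetical stand-ins for illustration, not the spaCy tokenizers or torchtext vocabularies used in this notebook.

```python
# Toy illustration of the four special tokens.
# `toy_vocab` and the whitespace split are assumptions made for this example.
SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']
toy_vocab = {tok: idx for idx, tok in enumerate(SPECIALS + ['i', 'am', 'happy'])}


def encode(sentence: str) -> list:
    """Wrap a sentence with <bos>/<eos> and map each token to its id."""
    tokens = ['<bos>'] + sentence.lower().split() + ['<eos>']
    # Words missing from the vocabulary fall back to the <unk> id.
    return [toy_vocab.get(tok, toy_vocab['<unk>']) for tok in tokens]


def pad(batch: list) -> list:
    """Right-pad every sequence with <pad> up to the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [toy_vocab['<pad>']] * (max_len - len(seq)) for seq in batch]


batch = pad([encode('I am happy'), encode('I am hungry now')])
print(batch)
```

The words "hungry" and "now" are not in the toy vocabulary, so they are both mapped to the *\<unk\>* id, and the shorter sentence receives one *\<pad\>* id at its end.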

In [None]:
# Original dataset, but there's a bug on Colab with it
# train, valid, _ = IWSLT2016(language_pair=('fr', 'en'))
# train, valid = list(train), list(valid)

# Another dataset, but it is too huge
# !wget https://www.statmt.org/wmt14/training-monolingual-europarl-v7/europarl-v7.en.gz
# !wget https://www.statmt.org/wmt14/training-monolingual-europarl-v7/europarl-v7.fr.gz
# !gunzip europarl-v7.en.gz
# !gunzip europarl-v7.fr.gz

# with open('europarl-v7.en', 'r') as my_file:
#     english = my_file.readlines()

# with open('europarl-v7.fr', 'r') as my_file:
#     french = my_file.readlines()

# dataset = [
#     (en, fr)
#     for en, fr in zip(english, french)
# ]
# print(f'\n{len(dataset):,} sentences.')

# dataset, _ = train_test_split(dataset, test_size=0.8, random_state=0)  # Remove 80% of the dataset (it would be huge otherwise)
# train, valid = train_test_split(dataset, test_size=0.2, random_state=0)  # Split between train and validation dataset

# Our current dataset
!wget http://www.manythings.org/anki/fra-eng.zip
!unzip fra-eng.zip


df = pd.read_csv('fra.txt', sep='\t', names=['english', 'french', 'attribution'])
train = [
    (en, fr) for en, fr in zip(df['english'], df['french'])
]
train, valid = train_test_split(train, test_size=0.1, random_state=0)
print(len(train))

en_tokenizer, fr_tokenizer = get_tokenizer('spacy', language='en'), get_tokenizer('spacy', language='fr')

SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

173106


## Datasets

Functions and classes to build the vocabularies and the torch datasets.
The vocabulary is an object able to transform a string token into the id (an int) of that token in the vocabulary. 
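
To illustrate what such a vocabulary does, here is a small pure-Python sketch. The helper `build_toy_vocab` is hypothetical and only mimics the role of `min_freq` and of the default *\<unk\>* index; the notebook itself relies on torchtext's `build_vocab_from_iterator`.

```python
from collections import Counter


def build_toy_vocab(tokenized_sentences, min_freq, specials):
    """Map tokens to ids: specials first, then tokens seen at least min_freq times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Sort by decreasing frequency, then alphabetically, for a deterministic order.
    kept = sorted(
        (tok for tok, count in counts.items() if count >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}


sentences = [['the', 'cat', 'sleeps'], ['the', 'dog', 'sleeps'], ['the', 'cat', 'eats']]
vocab = build_toy_vocab(sentences, min_freq=2, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

# Rare tokens ('dog' appears only once) are not stored and map to <unk>.
unk = vocab['<unk>']
ids = [vocab.get(tok, unk) for tok in ['the', 'dog', 'sleeps']]
print(vocab)
print(ids)
```

Raising `min_freq` shrinks the vocabulary, at the cost of sending more words to the *\<unk\>* id.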

In [None]:
class TranslationDataset(Dataset):
    def __init__(
            self,
            dataset: list,
            en_vocab: Vocab,
            fr_vocab: Vocab,
            en_tokenizer,
            fr_tokenizer,
        ):
        super().__init__()

        self.dataset = dataset
        self.en_vocab = en_vocab
        self.fr_vocab = fr_vocab
        self.en_tokenizer = en_tokenizer
        self.fr_tokenizer = fr_tokenizer
    
    def __len__(self):
        """Return the number of examples in the dataset.
        """
        return len(self.dataset)

    def __getitem__(self, index: int) -> tuple:
        """Return a sample.

        Args
        ----
            index: Index of the sample.

        Output
        ------
            en_tokens: English tokens of the sample, as a LongTensor.
            fr_tokens: French tokens of the sample, as a LongTensor.
        """
        # Get the strings
        en_sentence, fr_sentence = self.dataset[index]

        # To list of words
        # We also add the beginning-of-sentence and end-of-sentence tokens
        en_tokens = ['<bos>'] + self.en_tokenizer(en_sentence) + ['<eos>']
        fr_tokens = ['<bos>'] + self.fr_tokenizer(fr_sentence) + ['<eos>']

        # To list of tokens
        en_tokens = self.en_vocab(en_tokens)  # list[int]
        fr_tokens = self.fr_vocab(fr_tokens)

        return torch.LongTensor(en_tokens), torch.LongTensor(fr_tokens)


def yield_tokens(dataset, tokenizer, lang):
    """Tokenize the whole dataset and yield the tokens.
    """
    assert lang in ('en', 'fr')
    sentence_idx = 0 if lang == 'en' else 1

    for sentences in dataset:
        sentence = sentences[sentence_idx]
        tokens = tokenizer(sentence)
        yield tokens


def build_vocab(dataset: list, en_tokenizer, fr_tokenizer, min_freq: int):
    """Return two vocabularies, one for each language.
    """
    en_vocab = build_vocab_from_iterator(
        yield_tokens(dataset, en_tokenizer, 'en'),
        min_freq=min_freq,
        specials=SPECIALS,
    )
    en_vocab.set_default_index(en_vocab['<unk>'])  # Default token for unknown words

    fr_vocab = build_vocab_from_iterator(
        yield_tokens(dataset, fr_tokenizer, 'fr'),
        min_freq=min_freq,
        specials=SPECIALS,
    )
    fr_vocab.set_default_index(fr_vocab['<unk>'])

    return en_vocab, fr_vocab


def preprocess(
        dataset: list,
        en_tokenizer,
        fr_tokenizer,
        max_words: int,
    ) -> list:
    """Preprocess the dataset.
    Remove samples where at least one of the sentences is too long.
    Those samples take too much memory.
    Also remove the trailing '\n' at the end of sentences.
    """
    filtered = []

    for en_s, fr_s in dataset:
        if len(en_tokenizer(en_s)) >= max_words or len(fr_tokenizer(fr_s)) >= max_words:
            continue
        
        en_s = en_s.replace('\n', '')
        fr_s = fr_s.replace('\n', '')

        filtered.append((en_s, fr_s))

    return filtered


def build_datasets(
        max_sequence_length: int,
        min_token_freq: int,
        en_tokenizer,
        fr_tokenizer,
        train: list,
        val: list,
    ) -> tuple:
    """Build the training and validation datasets.
    It takes care of the vocabulary creation.

    Args
    ----
        - max_sequence_length: Maximum number of tokens in each sequence.
            Long sequences dramatically increase the VRAM used during training.
        - min_token_freq: Minimum number of occurrences each token must have
            to be saved in the vocabulary. Reducing this number increases
            the size of the vocabularies.
        - en_tokenizer: Tokenizer for the English sentences.
        - fr_tokenizer: Tokenizer for the French sentences.
        - train and val: Lists containing the (English, French) sentence pairs.


    Output
    ------
        - (train_dataset, val_dataset): Tuple of the two TranslationDataset objects.
    """
    datasets = [
        preprocess(samples, en_tokenizer, fr_tokenizer, max_sequence_length)
        for samples in [train, val]
    ]

    en_vocab, fr_vocab = build_vocab(datasets[0], en_tokenizer, fr_tokenizer, min_token_freq)

    datasets = [
        TranslationDataset(samples, en_vocab, fr_vocab, en_tokenizer, fr_tokenizer)
        for samples in datasets
    ]

    return datasets

In [None]:
def generate_batch(data_batch: list, src_pad_idx: int, tgt_pad_idx: int) -> tuple:
    """Add padding to the given batch so that all
    the samples are of the same size.

    Args
    ----
        data_batch: List of samples.
            Each sample is a tuple of LongTensors of varying size.
        src_pad_idx: Source padding index value.
        tgt_pad_idx: Target padding index value.
    
    Output
    ------
        en_batch: Batch of tokens for the padded english sentences.
            Shape of [batch_size, max_en_len].
        fr_batch: Batch of tokens for the padded french sentences.
            Shape of [batch_size, max_fr_len].
    """
    en_batch, fr_batch = [], []
    for en_tokens, fr_tokens in data_batch:
        en_batch.append(en_tokens)
        fr_batch.append(fr_tokens)

    en_batch = pad_sequence(en_batch, padding_value=src_pad_idx, batch_first=True)
    fr_batch = pad_sequence(fr_batch, padding_value=tgt_pad_idx, batch_first=True)
    return en_batch, fr_batch

# Models architecture
This is where you have to code the architectures.

In a machine translation task, the model takes as input the whole
source sentence along with the currently known tokens of the target,
and predicts the next token in the target sequence.
This means that the target tokens are predicted in an autoregressive
manner, starting from the first token (right after the *\<bos\>* token) and producing tokens one by one until the final *\<eos\>* token.

Formally, we define $s = [s_1, ..., s_{N_s}]$ as the source sequence made of $N_s$ tokens.
We also define $t^i = [t_1, ..., t_i]$ as the target sequence at the beginning of the step $i$.

The output of the model parameterized by $\theta$ is:

$$
T_{i+1} = p(t_{i+1} | s, t^i ; \theta )
$$

Where $T_{i+1}$ is the distribution of the next token $t_{i+1}$.

The loss is simply a *cross entropy loss* computed over all the steps, where each class is a token of the vocabulary.
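
The following sketch shows one way to compute such a loss over all steps of a batch. The sizes, the random logits and the choice to ignore padded positions through `ignore_index` are assumptions made for this example, not necessarily what the provided training loop does.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, tgt_len, vocab_size, pad_idx = 2, 4, 10, 1  # hypothetical sizes

# Logits as a model would output them: one score per vocabulary token
# at every target position.
logits = torch.randn(batch_size, tgt_len, vocab_size)
# True next tokens; the second sentence is shorter and padded with pad_idx.
targets = torch.tensor([[5, 7, 2, 3],
                        [4, 3, pad_idx, pad_idx]])

# CrossEntropyLoss expects [N, C] logits and [N] class ids, so the batch
# and time dimensions are flattened. Padded positions are ignored.
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Same value computed by hand: mean negative log-likelihood over real tokens.
log_probs = torch.log_softmax(logits, dim=-1)
nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
keep = targets != pad_idx
manual = nll[keep].mean()
print(loss.item(), manual.item())
```

Both values agree, which shows that the loss is the average of $-\log p(t_{i+1} | s, t^i ; \theta)$ over the non-padded target positions.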

![RNN schema for machine translation](https://www.simplilearn.com/ice9/free_resources_article_thumb/machine-translation-model-with-encoder-decoder-rnn.jpg)

Note that in this image the English sentence is provided in reverse order.

---

In PyTorch, there is no distinction between an intermediate layer and a whole model made of multiple layers.
Every layer or model inherits from `torch.nn.Module`.
A module needs to define the `__init__` method, where you instantiate the layers,
and the `forward` method, where you decide how the inputs and the layers of the module interact.
Thanks to PyTorch's autograd, you do not have
to implement any backward method!

An important piece of advice is to **always look at
the shape of your input and your output.**
From that, you can often guess how the layers should interact
with the inputs to produce the right output.
You can also easily detect when something is going wrong.

You are strongly encouraged to use the `einops` library and the `torch.einsum` function. They usually need fewer operations than 'classical' code, but note that they are a bit trickier to use.
They let you describe tensor manipulations with strings, instead of chaining multiple tensor methods.
You can find a nice presentation of `einops` [here](https://einops.rocks/1-einops-basics/).
A paper about einops is available [here](https://paperswithcode.com/paper/einops-clear-and-reliable-tensor).
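
As a small taste of the string notation, the sketch below uses only `torch.einsum` (the tensors and their sizes are arbitrary) and checks the results against the equivalent 'classical' calls.

```python
import torch

torch.manual_seed(0)
a = torch.randn(3, 4, 5)   # [batch, rows, inner]
b = torch.randn(3, 5, 6)   # [batch, inner, cols]

# Batched matrix product: the index `i` appears in both inputs but not
# in the output, so it is summed over.
prod_einsum = torch.einsum('bri, bic -> brc', a, b)
prod_bmm = torch.bmm(a, b)

# Swapping axes is just a reordering of the index names.
swapped = torch.einsum('bri -> rbi', a)
print(prod_einsum.shape, swapped.shape)
```

The einsum string documents the meaning of each axis, which makes shape mistakes easier to spot than with a chain of `permute` and `bmm` calls.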

**A great tutorial on pytorch can be found [here](https://stanford.edu/class/cs224n/materials/CS224N_PyTorch_Tutorial.html).**
Spending 3 hours on this tutorial is *no* waste of time.

## RNN models

### RNN
Here you have to implement a recurrent neural network. You will need to create a single RNN layer, and a module that stacks these layers. Look up the PyTorch documentation to figure out this module's operations and what is passed from one layer to the next.

The `RNNCell` layer produces one hidden state vector for each sentence in the batch
(useful for the output of the encoder), and also produces one embedding for each
token in each sentence (useful for the output of the decoder).

The `RNN` module is composed of a stack of `RNNCell` layers. The token embeddings
coming out of one `RNNCell` are used as the input of the next `RNNCell` layer.

**Be careful!** Our `RNNCell` implementation is not exactly the same thing as
PyTorch's `nn.RNNCell`. PyTorch implements only the operations for one token
(so you would need to loop through the tokens inside the `RNN` instead).
You are free to implement `RNN` and `RNNCell` the way you want, as long as they have the expected behaviour of an RNN.

The same applies to `GRU` and `GRUCell`.
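
For reference, this is the recurrence of a vanilla (Elman) RNN layer as given in the PyTorch documentation of `nn.RNN`, where $x_t$ is the input at step $t$ and $h_{t-1}$ is the previous hidden state:

$$
h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{t-1} + b_{hh})
$$

In a stack of layers, the sequence of hidden states produced by one layer is the input sequence of the next layer, and each layer keeps its own hidden state.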


In [None]:
# https://nlp.seas.harvard.edu/2018/04/03/attention.html
import copy
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

In [None]:
from einops import rearrange, reduce, repeat
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class RNNCell(nn.Module):
    """A single RNN layer.
    
    Parameters
    ----------
        input_size: Size of each input token.
        hidden_size: Size of each RNN hidden state.
        dropout: Dropout rate.

    Important note: This layer does not do exactly the same thing as nn.RNNCell.
    The PyTorch implementation only performs one step over one token for each batch.
    This implementation takes the whole sequence of each batch and returns the
    final hidden state along with the embeddings of each token in each sequence.
    """
    def __init__(
            self,
            input_size: int,
            hidden_size: int,
            dropout: float,
        ):
        super().__init__() 

        self.Wi = nn.Linear(input_size, hidden_size).to(device)
        self.Wh = nn.Linear(hidden_size, hidden_size).to(device)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.FloatTensor, h: torch.FloatTensor) -> tuple:
        """Go through all the sequence in x, iteratively updating
        the hidden state h.

        Args
        ----
            x: Input sequence.
                Shape of [batch_size, seq_len, input_size].
            h: Initial hidden state.
                Shape of [batch_size, hidden_size].

        Output
        ------
            y: Token embeddings.
                Shape of [batch_size, seq_len, hidden_size].
            h: Last hidden state.
                Shape of [batch_size, hidden_size].
        """
        x = rearrange(x, 'b s i -> s b i')
        y = torch.zeros((x.shape[0], x.shape[1], h.shape[1])).to(x.device)
        for i, seq in enumerate(x):
          # [b h]        [i,h] [b,i]    [h,h] [b,h]
          h = torch.tanh(self.Wi(seq) + self.Wh(h))
          h = self.dropout(h)
          y[i] = h

        y = rearrange(y, 's b h -> b s h')

        return y, h


class RNN(nn.Module):
    """Implementation of an RNN based
    on https://pytorch.org/docs/stable/generated/torch.nn.RNN.html.

    Parameters
    ----------
        input_size: Size of each input token.
        hidden_size: Size of each RNN hidden state.
        num_layers: Number of layers (RNNCell or GRUCell).
        dropout: Dropout rate.

    Note: the `model_type` switch ('RNN' or 'GRU') is not a parameter of this
    module. This module always stacks `RNNCell` layers, the separate `GRU`
    module stacks `GRUCell` layers, and `TranslationRNN` selects between them.
    """
    def __init__(
            self,
            input_size: int,
            hidden_size: int,
            num_layers: int,
            dropout: float,
        ):
        super().__init__()

        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Model
        self.layers = nn.ModuleList([RNNCell(input_size, hidden_size, dropout)])
        self.layers.extend(clones(RNNCell(hidden_size, hidden_size, dropout), num_layers-1))

        self.linear = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.FloatTensor, h: torch.FloatTensor=None) -> tuple:
        """Pass the input sequence through all the RNN cells.
        Returns the output and the final hidden state of each RNN layer

        Args
        ----
            x: Input sequence.
                Shape of [batch_size, seq_len, input_size].
            h: Hidden state for each RNN layer.
                Can be None, in which case an initial hidden state is created.
                Shape of [batch_size, n_layers, hidden_size].

        Output
        ------
            y: Output embeddings for each token after the RNN layers.
                Shape of [batch_size, seq_len, hidden_size].
            h: Final hidden state.
                Shape of [batch_size, n_layers, hidden_size].
        """
        if h is None:
          h = torch.zeros(x.shape[0], self.num_layers, self.hidden_size).to(x.device)

        h = rearrange(h, 'b n h -> n b h')
        h_next = torch.zeros_like(h)  # same shape, dtype and device as h

        y = x
        for i, cell in enumerate(self.layers):
          y, h_next[i] = cell(y, h[i])

        h_next = rearrange(h_next, 'n b h -> b n h') 
        # y = torch.softmax(y, dim=2)
        # y = self.linear(y)

        return y, h_next



In [None]:
# input_size = 5
# hidden_size = 2
# dropout = 0.005
# batch_size = 3
# num_layers = 2

# cell = RNNCell(input_size, hidden_size, dropout)
# x = torch.rand(batch_size, 4, input_size).to(device)
# h = torch.rand(batch_size, hidden_size).to(device)
# print(cell.forward(x, h))

# rnn = RNN(input_size, hidden_size, num_layers, dropout)
# x = torch.rand(batch_size, 4, input_size).to(device)
# h = torch.rand(batch_size, num_layers ,hidden_size).to(device)
# print(rnn.forward(x, h))

### GRU
Here you have to implement a GRU-RNN. This architecture is close to the vanilla RNN but performs different operations. Look up the PyTorch documentation to figure out the differences.
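
For reference, these are the GRU update equations as given in the PyTorch documentation of `nn.GRU`, where $\sigma$ is the sigmoid function and $\odot$ is the element-wise product:

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$

The reset gate $r_t$ controls how much of the previous hidden state enters the candidate $n_t$, and the update gate $z_t$ interpolates between the candidate and the previous hidden state.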

In [None]:
class GRUCell(nn.Module):
    """A single GRU layer.
    
    Parameters
    ----------
        input_size: Size of each input token.
        hidden_size: Size of each RNN hidden state.
        dropout: Dropout rate.
    """
    def __init__(
            self,
            input_size: int,
            hidden_size: int,
            dropout: float,
        ):
        super().__init__()

        self.Wir = nn.Linear(input_size, hidden_size).to(device)
        self.Whr = nn.Linear(hidden_size, hidden_size).to(device)
        self.Wiz = nn.Linear(input_size, hidden_size).to(device)
        self.Whz = nn.Linear(hidden_size, hidden_size).to(device)
        self.Win = nn.Linear(input_size, hidden_size).to(device)
        self.Whn = nn.Linear(hidden_size, hidden_size).to(device)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.FloatTensor, h: torch.FloatTensor) -> tuple:
        """
        Args
        ----
            x: Input sequence.
                Shape of [batch_size, seq_len, input_size].
            h: Initial hidden state.
                Shape of [batch_size, hidden_size].

        Output
        ------
            y: Token embeddings.
                Shape of [batch_size, seq_len, hidden_size].
            h: Last hidden state.
                Shape of [batch_size, hidden_size].
        """
        x = rearrange(x, 'b s i -> s b i')
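        # Plain output buffer (not a learnable nn.Parameter): in-place writes
        # into a leaf tensor that requires grad would fail on CPU.
        y = torch.zeros((x.shape[0], x.shape[1], h.shape[1]), device=x.device)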
        for i, seq in enumerate(x):
          r_t = torch.sigmoid(self.Wir(seq) + self.Whr(h))
          z_t = torch.sigmoid(self.Wiz(seq) + self.Whz(h))
          n_t = torch.tanh(self.Win(seq) + r_t * self.Whn(h))
          h = (1 - z_t) * n_t + z_t * h
          h = self.dropout(h)
          y[i] = h

        y = rearrange(y, 's b i -> b s i')

        return y, h

class GRU(nn.Module):
    """Implementation of a GRU based on https://pytorch.org/docs/stable/generated/torch.nn.GRU.html.

    Parameters
    ----------
        input_size: Size of each input token.
        hidden_size: Size of each RNN hidden state.
        num_layers: Number of layers.
        dropout: Dropout rate.
    """
    def __init__(
            self,
            input_size: int,
            hidden_size: int,
            num_layers: int,
            dropout: float,
        ):
        super().__init__()

        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Model
        self.layers = nn.ModuleList([GRUCell(input_size, hidden_size, dropout)])
        self.layers.extend(clones(GRUCell(hidden_size, hidden_size, dropout), num_layers-1))      

    def forward(self, x: torch.FloatTensor, h: torch.FloatTensor=None) -> tuple:
        """
        Args
        ----
            x: Input sequence
                Shape of [batch_size, seq_len, input_size].
            h: Initial hidden state for each layer.
                If 'None', then an initial hidden state (a zero filled tensor)
                is created.
                Shape of [batch_size, n_layers, hidden_size].

        Output
        ------
            output:
                Shape of [batch_size, seq_len, hidden_size].
            h_n: Final hidden state.
                Shape of [batch_size, n_layers, hidden size].
        """
        if h is None:
          h = torch.zeros(x.shape[0], self.num_layers, self.hidden_size).to(x.device)

        h = rearrange(h, 'b n h -> n b h')
        h_next = torch.zeros_like(h)  # same shape, dtype and device as h
        
        y = x
        for i, cell in enumerate(self.layers):
          y, h_next[i] = cell(y, h[i]) 

        h_next = rearrange(h_next, 'n b h -> b n h')     
        y = torch.softmax(y, dim=2)

        return y, h_next


In [None]:
# input_size = 5
# hidden_size = 2
# dropout = 0.005
# batch_size = 3

# cell = GRUCell(input_size, hidden_size, dropout)
# x = torch.rand(batch_size, 4, input_size).to(device)
# h = torch.rand(batch_size, hidden_size).to(device)
# print(cell.forward(x, h))


# rnn = GRU(input_size, hidden_size, num_layers, dropout)
# x = torch.rand(batch_size, 4, input_size).to(device)
# h = torch.rand(batch_size, num_layers ,hidden_size).to(device)
# print(rnn.forward(x, h))

### Translation RNN

This module instantiates a vanilla RNN or a GRU-RNN and performs the translation task. You have to:
* Encode the source and target sequences
* Pass the final hidden state of the encoder to the decoder (one for each layer)
* Decode the hidden state into the target sequence

We use teacher forcing for training, meaning that each next-token prediction is conditioned on the previous true target tokens rather than on the model's own predictions.
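
A common way to set up teacher forcing is to shift the target sequence by one position, as sketched below. The token ids are hypothetical, and whether the provided training loop slices the tensors exactly this way is an assumption.

```python
import torch

# Hypothetical ids: <pad>=1, <bos>=2, <eos>=3 (same order as SPECIALS).
target = torch.tensor([[2, 10, 11, 12, 3],
                       [2, 20, 21, 3, 1]])

# The decoder reads the true tokens up to step i ...
decoder_input = target[:, :-1]
# ... and is trained to predict the true token at step i + 1.
labels = target[:, 1:]

print(decoder_input)
print(labels)
```

At every position the label is the token that follows the decoder input, so all the steps of a sentence can be trained in a single forward pass.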

In [None]:
from torch.nn.modules.normalization import LayerNorm
from torch.nn.modules.activation import LeakyReLU
class TranslationRNN(nn.Module):
    """Basic RNN encoder and decoder for a translation task.
    It can run as a vanilla RNN or a GRU-RNN.

    Parameters
    ----------
        n_tokens_src: Number of tokens in the source vocabulary.
        n_tokens_tgt: Number of tokens in the target vocabulary.
        dim_embedding: Dimension size of the word embeddings (for both language).
        dim_hidden: Dimension size of the hidden layers in the RNNs
            (for both the encoder and the decoder).
        n_layers: Number of layers in the RNNs.
        dropout: Dropout rate.
        src_pad_idx: Source padding index value.
        tgt_pad_idx: Target padding index value.
        model_type: Either 'RNN' or 'GRU', to select which model we want.
    """

    def __init__(
            self,
            n_tokens_src: int,
            n_tokens_tgt: int,
            dim_embedding: int,
            dim_hidden: int,
            n_layers: int,
            dropout: float,
            src_pad_idx: int,
            tgt_pad_idx: int,
            model_type: str,
        ):
        super().__init__()

        self.dropout_layer = nn.Dropout(dropout)

        self.embedding_en = nn.Embedding(n_tokens_src, dim_embedding, src_pad_idx)
        self.embedding_fr = nn.Embedding(n_tokens_tgt, dim_embedding, tgt_pad_idx)

        if model_type == 'RNN':
          self.model_encoder = RNN(dim_embedding, dim_hidden, n_layers, dropout)
          self.model_decoder = RNN(dim_embedding, dim_hidden, n_layers, dropout)

        elif model_type == 'GRU':
          self.model_encoder = GRU(dim_embedding, dim_hidden, n_layers, dropout)
          self.model_decoder = GRU(dim_embedding, dim_hidden, n_layers, dropout)

        self.fc = nn.Linear(dim_hidden * 2, dim_hidden)
        self.layer_norm = nn.LayerNorm(dim_hidden)

        # Architecture given in the Colab output
        self.sequential = nn.Sequential(
            nn.Linear(dim_hidden, dim_hidden),
            nn.LeakyReLU(),
            nn.LayerNorm(dim_hidden),
            nn.Linear(dim_hidden, dim_hidden),
            nn.LeakyReLU(),
            nn.LayerNorm(dim_hidden),
            nn.Linear(dim_hidden, dim_hidden),
            nn.LeakyReLU(),
            nn.LayerNorm(dim_hidden),
            nn.Linear(dim_hidden, n_tokens_tgt)
        )


    def forward(
        self,
        source: torch.LongTensor,
        target: torch.LongTensor
    ) -> torch.FloatTensor:
        """Predict the target token logits based on the source tokens.

        Args
        ----
            source: Batch of source sentences.
                Shape of [batch_size, src_seq_len].
            target: Batch of target sentences.
                Shape of [batch_size, tgt_seq_len].
        
        Output
        ------
            y: Distributions over the next token for all tokens in each sentence.
                Those need to be the logits only, do not apply a softmax because
                it will be done in the loss computation for numerical stability.
                See https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html for more information.
                Shape of [batch_size, tgt_seq_len, n_tokens_tgt].
        """
        source = torch.fliplr(source)

        embedded_source = self.embedding_en(source)
        encoder_outputs, encoder_hidden = self.model_encoder(embedded_source)

        encoder_hidden = self.layer_norm(encoder_hidden)


        embedded_target = self.embedding_fr(target)
        decoder_output, decoder_hidden = self.model_decoder(embedded_target, encoder_hidden)

        outputs = self.sequential(decoder_output)

        return outputs

## Transformer model
Here you have to code the Transformer architecture.
It is divided into three parts:
* Attention layers
* Encoder and decoder layers
* Main layers (gather the encoder and decoder layers)

The [illustrated transformer](https://jalammar.github.io/illustrated-transformer/) blog can help you
understand how the architecture works.
Once this is done, you can use [the annotated transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html) to get an idea of how to code this architecture.
We encourage you to use `torch.einsum` and the `einops` library as much as you can. It will make your code simpler.

---
**Implementation order**

To help you with the implementation, we advise you to follow this order:
* Implement `TranslationTransformer` and use `nn.Transformer` instead of `Transformer`
* Implement `Transformer` and use `nn.TransformerDecoder` and `nn.TransformerEncoder`
* Implement the `TransformerDecoder` and `TransformerEncoder` and use `nn.MultiheadAttention`
* Implement `MultiheadAttention`

Do not forget to add `batch_first=True` when necessary in the `nn` modules.

### Attention layers
We use a `MultiheadAttention` module that can perform self-attention as well as cross-attention (depending on what you give as queries, keys and values).

**Attention**


The `attention` function takes the multi-head queries, keys and values as input.
It computes the attention between the queries and the keys and returns the attended values.

The implementation of this function can be greatly simplified with *einsums*.
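
For reference, the scaled dot-product attention computed for each head is the standard formula below, where $d_k$ denotes the dimension of the keys of one head:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

When a mask is given, the masked entries of $Q K^\top / \sqrt{d_k}$ are set to $-\infty$ before the softmax, so that they receive a weight of zero.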

**MultiheadAttention**

Computes the multi-head queries, keys and values and feeds them to the `attention` function.
You also need to merge the key padding mask and the attention mask into a single mask.

The implementation of this module can be greatly simplified with *einops.rearrange*.
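
One possible way to merge the two masks is sketched below, assuming boolean masks where `True` means "do not attend" and a score tensor of shape [batch_size, n_heads, seq_len_1, seq_len_2]. The sizes and mask values are made up for the example.

```python
import torch

batch_size, n_heads, len_q, len_k = 2, 3, 4, 4  # hypothetical sizes

# True where the key token is padding. Shape [batch_size, len_k].
key_padding_mask = torch.tensor([[False, False, False, True],
                                 [False, False, True, True]])
# Causal mask: True above the diagonal, so position i cannot see j > i.
# Shape [len_q, len_k].
attn_mask = torch.triu(torch.ones(len_q, len_k, dtype=torch.bool), diagonal=1)

# Add singleton axes so both masks broadcast to [batch, heads, len_q, len_k],
# then combine them: a position is masked if either mask says so.
mask = key_padding_mask[:, None, None, :] | attn_mask[None, None, :, :]

# Apply the merged mask to dummy attention scores before the softmax.
scores = torch.zeros(batch_size, n_heads, len_q, len_k)
scores = scores.masked_fill(mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)
print(mask.shape, weights[1, 0])
```

The merged mask keeps a singleton head axis, so the same mask is shared by all the heads through broadcasting.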

In [None]:
import math

from einops.layers.torch import Rearrange


def attention(
        q: torch.FloatTensor,
        k: torch.FloatTensor,
        v: torch.FloatTensor,
        mask: torch.BoolTensor=None,
        dropout: nn.Dropout=None,
    ) -> tuple:
    """Computes multihead scaled dot-product attention from the
    projected queries, keys and values.

    Args
    ----
        q: Batch of queries.
            Shape of [batch_size, seq_len_1, n_heads, dim_model].
        k: Batch of keys.
            Shape of [batch_size, seq_len_2, n_heads, dim_model].
        v: Batch of values.
            Shape of [batch_size, seq_len_2, n_heads, dim_model].
        mask: Prevents tokens from attending to some other tokens (for padding or autoregressive attention).
            Attention is prevented where the mask is `True`.
            Shape of [batch_size, n_heads, seq_len_1, seq_len_2],
            or broadcastable to that shape.
        dropout: Dropout layer to use.

    Output
    ------
        y: Multihead scaled dot-attention between the queries, keys and values.
            Shape of [batch_size, seq_len_1, n_heads, dim_model].
        attn: Computed attention weights.
            Shape of [batch_size, n_heads, seq_len_1, seq_len_2].
    """
    # Source: https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec
    # Source: https://theaisummer.com/einsum-attention/

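    # Softmax(QK^T/sqrt(d)) and apply mask if defined
    # The einsums index q, k and v as [batch, heads, seq_len, head_dim].
    scaled_dot_prod = torch.einsum("bhld, bhtd -> bhlt", q, k) / math.sqrt(q.shape[3])
    if mask is not None:
      # Boolean masks (True = masked) are used directly; a float mask
      # holding -inf at masked positions is converted to boolean first.
      if mask.dtype != torch.bool:
        mask = mask == -math.inf
      scaled_dot_prod = scaled_dot_prod.masked_fill(mask, -math.inf)
    attn = torch.softmax(scaled_dot_prod, dim=-1)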

    if dropout is not None:
      attn = dropout(attn)

    # Softmax(QK^T/sqrt(d))*V
    y = torch.einsum('bhlt, bhtv -> bhlv', attn, v)

    return y, attn

class MultiheadAttention(nn.Module):
    """Multihead attention module.
    Can be used as a self-attention and cross-attention layer.
    The queries, keys and values are projected into multiple heads
    before computing the attention between those tensors.

    Parameters
    ----------
        dim: Dimension of the input tokens.
        n_heads: Number of heads. `dim` must be divisible by `n_heads`.
        dropout: Dropout rate.
    """
    def __init__(
            self,
            dim: int,
            n_heads: int,
            dropout: float,
        ):
        super().__init__()

        assert dim % n_heads == 0

        self.n_heads = n_heads
        
        self.q_linear = nn.Linear(dim, dim)
        self.v_linear = nn.Linear(dim, dim)
        self.k_linear = nn.Linear(dim, dim)

        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(dim, dim)

    def forward(
            self,
            q: torch.FloatTensor,
            k: torch.FloatTensor,
            v: torch.FloatTensor,
            key_padding_mask: torch.Tensor = None,
            attn_mask: torch.Tensor = None,
        ) -> torch.FloatTensor:
        """Computes the scaled multi-head attention from the input queries,
        keys and values.

        Projects those queries, keys and values before feeding them
        to the `attention` function.

        The masks are additive float masks. Tokens are prevented from attending to
        positions where the mask is `-inf`.

        Args
        ----
            q: Batch of queries.
                Shape of [batch_size, seq_len_1, dim_model].
            k: Batch of keys.
                Shape of [batch_size, seq_len_2, dim_model].
            v: Batch of values.
                Shape of [batch_size, seq_len_2, dim_model].
            key_padding_mask: Prevent attending to padding tokens.
                Shape of [batch_size, seq_len_2].
            attn_mask: Prevent attending to subsequent tokens.
                Shape of [seq_len_1, seq_len_2].

        Output
        ------
            y: Computed multihead attention.
                Shape of [batch_size, seq_len_1, dim_model].
        """
        # Source: https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec
        # https://theaisummer.com/einsum-attention/
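        batch_size = q.size(0)

        # Missing masks default to all-zero additive masks (nothing is masked)
        if key_padding_mask is None:
            key_padding_mask = torch.zeros((batch_size, k.size(1)), device=q.device)

        if attn_mask is None:
            attn_mask = torch.zeros((q.size(1), k.size(1)), device=q.device)

        # Merge both additive masks: a position is masked (-inf) if either mask is -inf there
        attn_mask = attn_mask.unsqueeze(0)  # [1, seq_len_1, seq_len_2]
        key_padding_mask = key_padding_mask.unsqueeze(1)  # [batch_size, 1, seq_len_2]

        # Broadcast to [batch_size, seq_len_1, seq_len_2], then add the heads dimension
        mask = torch.add(key_padding_mask, attn_mask).unsqueeze(1)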

        # Perform linear operation and split into h heads
        q = rearrange(self.q_linear(q), "b s (h k) -> b h s k", h=self.n_heads)
        k = rearrange(self.k_linear(k), "b s (h k) -> b h s k", h=self.n_heads)
        v = rearrange(self.v_linear(v), "b s (h v) -> b h s v", h=self.n_heads)
        
        # Compute the attention with the `attention` function defined above
        y, attn = attention(q, k, v, mask, self.dropout)

        # Concatenate heads and put through final linear layer
        y = rearrange(y, 'b hnum s v -> b s (hnum v)')

        y = self.out(y)
    
        return y


In [None]:
tgt = torch.Tensor([[1,2,0,0,0,0],
                    [1,5,3,8,6,0]])

def subsequent_mask(size):
    "Mask out subsequent positions."
    return torch.triu(torch.ones(size, size) * float('-inf'), diagonal=1).cuda()


def make_padding_mask(tgt, pad_idx):
    "Create an additive mask to hide padding tokens (-inf at padded positions)."
    mask = torch.zeros_like(tgt, dtype=float).cuda()
    mask[tgt == pad_idx] = float('-inf')
    return mask


padding = make_padding_mask(tgt, 0).unsqueeze(1)

sub = subsequent_mask(tgt.size(1))
sub = sub.unsqueeze(0).repeat(tgt.size(0), 1, 1)

print(f"Padding mask {padding.shape}")
print(padding)

print(f"\nSub mask {sub.shape}")
print(sub)

# Merge both additive masks, then add the heads dimension
mask = torch.add(padding, sub).unsqueeze(1)
print(f"\nmask {mask.shape}")
print(mask)



Padding mask torch.Size([2, 1, 6])
tensor([[[0., 0., -inf, -inf, -inf, -inf]],

        [[0., 0., 0., 0., 0., -inf]]], device='cuda:0', dtype=torch.float64)

Sub mask torch.Size([2, 6, 6])
tensor([[[0., -inf, -inf, -inf, -inf, -inf],
         [0., 0., -inf, -inf, -inf, -inf],
         [0., 0., 0., -inf, -inf, -inf],
         [0., 0., 0., 0., -inf, -inf],
         [0., 0., 0., 0., 0., -inf],
         [0., 0., 0., 0., 0., 0.]],

        [[0., -inf, -inf, -inf, -inf, -inf],
         [0., 0., -inf, -inf, -inf, -inf],
         [0., 0., 0., -inf, -inf, -inf],
         [0., 0., 0., 0., -inf, -inf],
         [0., 0., 0., 0., 0., -inf],
         [0., 0., 0., 0., 0., 0.]]], device='cuda:0')

mask torch.Size([2, 1, 6, 6])
tensor([[[[0., -inf, -inf, -inf, -inf, -inf],
          [0., 0., -inf, -inf, -inf, -inf],
          [0., 0., -inf, -inf, -inf, -inf],
          [0., 0., -inf, -inf, -inf, -inf],
          [0., 0., -inf, -inf, -inf, -inf],
          [0., 0., -inf, -inf, -inf, -inf]]],


        [

### Encoder and decoder layers

**TransformerEncoder**

Applies self-attention layers to the source tokens.
It only needs the source key padding mask.


**TransformerDecoder**

Applies masked self-attention layers to the target tokens and cross-attention
layers between the source and the target tokens.
It needs the source and target key padding masks, and the target attention mask.
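
Every sub-layer of the encoder and the decoder (attention or feedforward) is typically wrapped in the same pattern: dropout, residual connection, then layer normalization. The sketch below shows that pattern in isolation. The wrapper name `ResidualSublayer` and the sizes are assumptions made for this example and are not part of the notebook.

```python
import torch
import torch.nn as nn


class ResidualSublayer(nn.Module):
    """Hypothetical wrapper computing LayerNorm(x + Dropout(sublayer(x)))."""

    def __init__(self, d_model: int, dropout: float):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        return self.norm(x + self.dropout(sublayer(x)))


d_model, d_ff = 16, 32  # illustrative sizes
feedforward = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
block = ResidualSublayer(d_model, dropout=0.1)
block.eval()  # disable dropout so the output is deterministic

x = torch.randn(2, 5, d_model)  # [batch_size, seq_len, d_model]
y = block(x, feedforward)
print(y.shape)  # torch.Size([2, 5, 16])
```

The sub-layer must return a tensor of the same shape as its input, otherwise the residual addition fails.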

In [None]:
class TransformerDecoderLayer(nn.Module):
    """Single decoder layer.

    Parameters
    ----------
        d_model: The dimension of the decoder inputs/outputs.
        d_ff: Hidden dimension of the feedforward network.
        nhead: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            d_ff: int,
            nhead: int,
            dropout: float
        ):
        super().__init__()

        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
        self.multihead_attn = MultiheadAttention(d_model, nhead, dropout=dropout)

        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.activation = nn.ReLU()

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(
            self,
            tgt: torch.FloatTensor,
            src: torch.FloatTensor,
            tgt_mask_attn: torch.BoolTensor,
            src_key_padding_mask: torch.BoolTensor,
            tgt_key_padding_mask: torch.BoolTensor,
        ) -> torch.FloatTensor:
        """Decode the next target tokens based on the previous tokens.

        Args
        ----
            src: Batch of source sentences.
                Shape of [batch_size, src_seq_len, dim_model].
            tgt: Batch of target sentences.
                Shape of [batch_size, tgt_seq_len, dim_model].
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [tgt_seq_len, tgt_seq_len].
            src_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, src_seq_len].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, tgt_seq_len].

        Output
        ------
            y:  Batch of sequence of embeddings representing the predicted target tokens
                Shape of [batch_size, tgt_seq_len, dim_model].
        """
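        x = tgt
        # Masked self-attention, cross-attention over the encoder output, then feedforward.
        # Each sub-layer is followed by dropout, a residual connection and a layer norm.
        x = self.norm1(x + self.dropout1(self.self_attn(x, x, x, attn_mask=tgt_mask_attn, key_padding_mask=tgt_key_padding_mask)))
        x = self.norm2(x + self.dropout2(self.multihead_attn(x, src, src, key_padding_mask=src_key_padding_mask)))
        x = self.norm3(x + self._ff_block(x))

        return x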
                       
    def _ff_block(self, x):
        x = self.linear2(self.dropout(self.activation(self.linear1(x))))
        return self.dropout3(x)


class TransformerDecoder(nn.Module):
    """Implementation of the transformer decoder stack.

    Parameters
    ----------
        d_model: The dimension of the decoder inputs/outputs.
        d_ff: Hidden dimension of the feedforward networks.
        num_decoder_layer: Number of stacked decoder layers.
        nhead: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            d_ff: int,
            num_decoder_layer: int,
            nhead: int,
            dropout: float
        ):
        super().__init__()

        decoder_layer = TransformerDecoderLayer(
            d_model=d_model, 
            nhead=nhead, 
            d_ff=d_ff, 
            dropout=dropout
        )
        self.decoder_layers = clones(decoder_layer, num_decoder_layer)

    def forward(
            self,
            tgt: torch.FloatTensor,
            memory: torch.FloatTensor,
            tgt_mask_attn: torch.BoolTensor,
            tgt_key_padding_mask: torch.BoolTensor,
            memory_key_padding_mask: torch.BoolTensor,
        ) -> torch.FloatTensor:
        """Decodes the encoded source sequence by sequentially passing
        it and the target sequence through the decoder stack.

        Args
        ----
            tgt: Batch of target sentences.
                Shape of [batch_size, tgt_seq_len, dim_model].
            memory: Batch of encoded source sentences.
                Shape of [batch_size, src_seq_len, dim_model].
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [tgt_seq_len, tgt_seq_len].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, tgt_seq_len].
            memory_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, src_seq_len].

        Output
        ------
            y:  Batch of sequence of embeddings representing the predicted target tokens
                Shape of [batch_size, tgt_seq_len, dim_model].
        """
        output = tgt
        for decoder_layer in self.decoder_layers:
            output = decoder_layer(
                output,
                memory,
                tgt_mask_attn=tgt_mask_attn,
                tgt_key_padding_mask=tgt_key_padding_mask,
                src_key_padding_mask=memory_key_padding_mask,
            )
        return output

class TransformerEncoderLayer(nn.Module):
    """Single encoder layer.

    Parameters
    ----------
        d_model: The dimension of input tokens.
        d_ff: Hidden dimension of the feedforward network.
        nhead: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            d_ff: int,
            nhead: int,
            dropout: float,
        ):
        super().__init__()

        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)

        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.activation = nn.ReLU()

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(
        self,
        src: torch.FloatTensor,
        key_padding_mask: torch.BoolTensor
        ) -> torch.FloatTensor:
        """Encodes the input. Does not attend to masked inputs.

        Args
        ----
            src: Batch of embedded source tokens.
                Shape of [batch_size, src_seq_len, dim_model].
            key_padding_mask: Mask preventing attention to padding tokens.
                Shape of [batch_size, src_seq_len].

        Output
        ------
            y: Batch of encoded source tokens.
                Shape of [batch_size, src_seq_len, dim_model].
        """
        x = src
        x = self.norm1(x + self.dropout1(self.self_attn(x, x, x, key_padding_mask=key_padding_mask)))
        x = self.norm2(x + self._ff_block(x))
        
        return x

    def _ff_block(self, x):
        x = self.linear2(self.dropout(self.activation(self.linear1(x))))
        return self.dropout2(x)


class TransformerEncoder(nn.Module):
    """Implementation of the transformer encoder stack.

    Parameters
    ----------
        d_model: The dimension of encoders inputs.
        dim_feedforward: Hidden dimension of the feedforward networks.
        num_encoder_layers: Number of stacked encoders.
        nheads: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            dim_feedforward: int,
            num_encoder_layers: int,
            nheads: int,
            dropout: float
        ):
        super().__init__()

        encoder_layer = TransformerEncoderLayer(
            d_model=d_model, 
            nhead=nheads, 
            d_ff=dim_feedforward, 
            dropout=dropout
        )
        self.encoder_layers = clones(encoder_layer, num_encoder_layers)


    def forward(
            self,
            src: torch.FloatTensor,
            key_padding_mask: torch.BoolTensor
        ) -> torch.FloatTensor:
        """Encodes the source sequence by sequentially passing
        it through the encoder stack.

        Args
        ----
            src: Batch of embedded source sentences.
                Shape of [batch_size, src_seq_len, dim_model].
            key_padding_mask: Mask preventing attention to padding tokens.
                Shape of [batch_size, src_seq_len].

        Output
        ------
            y: Batch of encoded source sequence.
                Shape of [batch_size, src_seq_len, dim_model].
        """
        output = src

        for encoder in self.encoder_layers:
            output = encoder(output, key_padding_mask=key_padding_mask)

        return output

### Positional Encoding and Mask

In [None]:
# From: 
# - https://pytorch.org/tutorials/beginner/transformer_tutorial.html
# - https://nlp.seas.harvard.edu/2018/04/03/attention.html#batches-and-masking
import math

class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        # Batch-first layout: one encoding per position, shared across the batch
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Tensor, shape [batch_size, seq_len, embedding_dim]
        """
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

def subsequent_mask(size):
    "Mask out subsequent positions (-inf above the diagonal)."
    return torch.triu(torch.ones(size, size) * float('-inf'), diagonal=1).cuda()


def make_padding_mask(tgt, pad_idx):
    "Create an additive mask to hide padding tokens (-inf at padded positions)."
    mask = torch.zeros_like(tgt, dtype=float).cuda()
    mask[tgt == pad_idx] = float('-inf')
    return mask



### Main layers
This section gathers the `Transformer` and the `TranslationTransformer` modules.

**Transformer**


The classical transformer architecture.
It takes the source and target token embeddings and
performs the forward pass through the encoder and decoder.

**Translation Transformer**

Computes the source and target token embeddings, and applies a final head to produce the next-token logits.
The output must be the raw logits, not the softmax, because `nn.CrossEntropyLoss` applies the log-softmax itself.

It also creates the *src_key_padding_mask*, the *tgt_key_padding_mask* and the *tgt_mask_attn*.
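
To see why the model should return raw logits, the sketch below checks that `nn.CrossEntropyLoss` applied to logits matches an explicit log-softmax followed by the negative log-likelihood. The sizes, the padding index `0` and the use of `ignore_index` are assumptions made for this example; the training loop provided with the TP may configure its loss differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, n_tokens_tgt, pad_idx = 2, 4, 10, 0  # illustrative values

logits = torch.randn(batch_size, seq_len, n_tokens_tgt)  # raw model output, no softmax
targets = torch.tensor([[3, 7, 0, 0],
                        [5, 2, 9, 1]])

# CrossEntropyLoss expects [N, n_classes] logits and [N] class indices.
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss = criterion(logits.reshape(-1, n_tokens_tgt), targets.reshape(-1))

# Equivalent computation: log-softmax followed by the negative log-likelihood.
log_probs = F.log_softmax(logits, dim=-1)
loss_ref = F.nll_loss(
    log_probs.reshape(-1, n_tokens_tgt), targets.reshape(-1), ignore_index=pad_idx
)
print(torch.allclose(loss, loss_ref))
```

Applying a softmax inside the model would therefore normalize the scores twice and distort the loss.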

In [None]:
import torch.nn as nn

class Transformer(nn.Module):
    """Implementation of a Transformer based on the paper: https://arxiv.org/pdf/1706.03762.pdf.

    Parameters
    ----------
        d_model: The dimension of encoders/decoders inputs/outputs.
        nhead: Number of heads for each multi-head attention.
        num_encoder_layers: Number of stacked encoders.
        num_decoder_layers: Number of stacked decoders.
        activation: Name of the feedforward activation (currently unused, the layers always use ReLU).
        dim_feedforward: Hidden dimension of the feedforward networks.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int =512,
            nhead: int=8,
            num_encoder_layers: int=6,
            num_decoder_layers: int=6,
            activation:str = "relu",
            dim_feedforward: int=2048,
            dropout: float=0.1,
        ):
        super().__init__()

        self.d_model = d_model

        self.encoder = TransformerEncoder(
            d_model, 
            dim_feedforward, 
            num_encoder_layers, 
            nhead, 
            dropout
        )
        
        self.decoder = TransformerDecoder(
            d_model, 
            dim_feedforward, 
            num_decoder_layers, 
            nhead, 
            dropout
        )
        

        # self.transformer = nn.Transformer(
        #     d_model=d_model,
        #     nhead=nhead,
        #     num_encoder_layers=num_encoder_layers,
        #     num_decoder_layers=num_decoder_layers,
        #     activation=activation,
        #     dim_feedforward=dim_feedforward,
        #     dropout=dropout,
        #     batch_first=True
        # )



    def forward(
            self,
            src: torch.FloatTensor,
            tgt: torch.FloatTensor,
            tgt_mask_attn: torch.BoolTensor,
            src_key_padding_mask: torch.BoolTensor,
            tgt_key_padding_mask: torch.BoolTensor
        ) -> torch.FloatTensor:
        """Compute next token embeddings.

        Args
        ----
            src: Batch of source sequences.
                Shape of [batch_size, src_seq_len, dim_model].
            tgt: Batch of target sequences.
                Shape of [batch_size, tgt_seq_len, dim_model].
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [tgt_seq_len, tgt_seq_len].
            src_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, src_seq_len].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, tgt_seq_len].

        Output
        ------
            y: Next token embeddings, given the previous target tokens and the source tokens.
                Shape of [batch_size, tgt_seq_len, dim_model].
        """

        # outputs = self.transformer(
        #     src, 
        #     tgt, 
        #     tgt_mask = tgt_mask_attn,  
        #     memory_mask = None, 
        #     src_key_padding_mask = src_key_padding_mask, 
        #     tgt_key_padding_mask = tgt_key_padding_mask, 
        #     memory_key_padding_mask = None
        # )

        memory = self.encoder(src, src_key_padding_mask)
        outputs = self.decoder(tgt, memory, tgt_mask_attn, tgt_key_padding_mask, src_key_padding_mask)

        return outputs
        


class TranslationTransformer(nn.Module):
    """Basic Transformer encoder and decoder for a translation task.
    Manage the masks creation, and the token embeddings.
    Position embeddings can be learnt with a standard `nn.Embedding` layer.

    Parameters
    ----------
        n_tokens_src: Number of tokens in the source vocabulary.
        n_tokens_tgt: Number of tokens in the target vocabulary.
        n_heads: Number of heads for each multi-head attention.
        dim_embedding: Dimension size of the word embeddings (for both language).
        dim_hidden: Dimension size of the feedforward layers
            (for both the encoder and the decoder).
        n_layers: Number of layers in the encoder and decoder.
        dropout: Dropout rate.
        src_pad_idx: Source padding index value.
        tgt_pad_idx: Target padding index value.
    """
    def __init__(
            self,
            n_tokens_src: int,
            n_tokens_tgt: int,
            n_heads: int,
            dim_embedding: int,
            dim_hidden: int,
            n_layers: int,
            dropout: float,
            src_pad_idx: int,
            tgt_pad_idx: int,
        ):
        super().__init__()

        self.n_tokens_src = n_tokens_src
        self.n_tokens_tgt = n_tokens_tgt
        self.n_heads = n_heads
        self.dim_embedding = dim_embedding
        self.dim_hidden = dim_hidden
        self.n_layers = n_layers
        self.dropout = dropout
        self.src_pad_idx = src_pad_idx
        self.tgt_pad_idx = tgt_pad_idx

        self.pos_encoder = PositionalEncoding(dim_embedding, dropout)

        self.embedding_en = nn.Embedding(n_tokens_src, dim_embedding)
        self.embedding_fr = nn.Embedding(n_tokens_tgt, dim_embedding)

        self.my_transformer = Transformer(
            d_model=dim_embedding, 
            nhead=n_heads, 
            num_encoder_layers=n_layers, 
            num_decoder_layers=n_layers, 
            dim_feedforward=dim_hidden, 
            dropout=dropout)
        
        # self.transformer = nn.Transformer(
        #     d_model = dim_embedding, 
        #     nhead = n_heads, 
        #     num_encoder_layers = n_layers, 
        #     num_decoder_layers = n_layers, 
        #     dim_feedforward = dim_hidden, 
        #     dropout = dropout, 
        #     batch_first = True
        # )
        
        self.sequential = nn.Sequential(
            nn.Linear(dim_embedding, dim_embedding),
            nn.LeakyReLU(),
            nn.LayerNorm(dim_embedding),
            nn.Linear(dim_embedding, dim_embedding),
            nn.LeakyReLU(),
            nn.LayerNorm(dim_embedding),
            nn.Linear(dim_embedding, dim_embedding),
            nn.LeakyReLU(),
            nn.LayerNorm(dim_embedding),
            nn.Linear(dim_embedding, n_tokens_tgt)
        )
        self.linear = nn.Linear(dim_embedding, n_tokens_tgt)

    def forward(
            self,
            source: torch.LongTensor,
            target: torch.LongTensor
        ) -> torch.FloatTensor:
        """Predict the target token logits based on the source tokens.

        Args
        ----
            source: Batch of source sentences.
                Shape of [batch_size, seq_len_src].
            target: Batch of target sentences.
                Shape of [batch_size, seq_len_tgt].

        Output
        ------
            y: Distributions over the next token for all tokens in each sentence.
                Those need to be the logits only, do not apply a softmax because
                it will be done in the loss computation for numerical stability.
                See https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html for more information.
                Shape of [batch_size, seq_len_tgt, n_tokens_tgt].
        """
        # Create the masks
        tgt_mask = subsequent_mask(target.size(1))

        # Create the padding masks
        src_key_padding_mask = make_padding_mask(source, self.src_pad_idx)
        tgt_key_padding_mask = make_padding_mask(target, self.tgt_pad_idx)

        # Embeddings
        embedded_en = self.pos_encoder(self.embedding_en(source)*math.sqrt(self.dim_embedding))
        embedded_fr = self.pos_encoder(self.embedding_fr(target)*math.sqrt(self.dim_embedding))

        outputs = self.my_transformer(
            embedded_en, 
            embedded_fr, 
            tgt_mask_attn = tgt_mask, 
            src_key_padding_mask = src_key_padding_mask, 
            tgt_key_padding_mask = tgt_key_padding_mask, 
        )
        
        return self.sequential(outputs)


# Greedy search

Here you have to implement a greedy search to generate a target translation from a trained model and an input source string.
At each step, the next token is simply the most probable one.
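
The idea can be sketched independently of the model. In the minimal example below, `toy_next_token_scores` is a hypothetical stand-in for a trained model and the token ids are made up; only the decoding loop matters.

```python
from typing import Callable, List

BOS, EOS = 0, 1  # hypothetical special-token ids


def toy_next_token_scores(prefix: List[int]) -> List[float]:
    """Stand-in for a trained model: scores over a 5-token vocabulary.

    It favours token `len(prefix) + 1`, then <eos> once the prefix has 4 tokens.
    """
    scores = [0.0] * 5
    target = len(prefix) + 1 if len(prefix) < 4 else EOS
    scores[target] = 1.0
    return scores


def greedy_decode(next_scores: Callable[[List[int]], List[float]], max_len: int) -> List[int]:
    tokens = [BOS]
    while len(tokens) < max_len:
        scores = next_scores(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        tokens.append(best)
        if best == EOS:  # stop as soon as the end-of-sentence token is produced
            break
    return tokens


print(greedy_decode(toy_next_token_scores, max_len=10))  # [0, 2, 3, 4, 1]
```

With a real model, `next_scores` would run a forward pass on the source and the current target prefix and return the logits of the last position.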

In [None]:
# Greedy search is a beam search with a beam width of 1.
def indices_terminated(
        target: torch.FloatTensor,
        eos_token: int
    ) -> tuple:
    """Split the target sentences between the terminated and the non-terminated
    sentence. Return the indices of those two groups.

    Args
    ----
        target: The sentences.
            Shape of [batch_size, n_tokens].
        eos_token: Value of the End-of-Sentence token.

    Output
    ------
        terminated: Indices of the terminated sentences (those that contain the eos_token).
            Shape of [n_terminated, ].
        non-terminated: Indices of the unfinished sentences.
            Shape of [batch_size-n_terminated, ].
    """
    terminated = [i for i, t in enumerate(target) if eos_token in t]
    non_terminated = [i for i, t in enumerate(target) if eos_token not in t]
    return torch.LongTensor(terminated), torch.LongTensor(non_terminated)


def append_beams(
        target: torch.FloatTensor,
        beams: torch.FloatTensor
    ) -> torch.FloatTensor:
    """Add the beam tokens to the current sentences.
    Duplicate the sentences so one token is added per beam per batch.

    Args
    ----
        target: Batch of unfinished sentences.
            Shape of [batch_size, n_tokens].
        beams: Batch of beams for each sentences.
            Shape of [batch_size, n_beams].

    Output
    ------
        target: Batch of sentences with one beam per sentence.
            Shape of [batch_size * n_beams, n_tokens+1].
    """
    batch_size, n_beams = beams.shape
    n_tokens = target.shape[1]

    target = einops.repeat(target, 'b t -> b c t', c=n_beams)  # [batch_size, n_beams, n_tokens]
    beams = beams.unsqueeze(dim=2)  # [batch_size, n_beams, 1]

    target = torch.cat((target, beams), dim=2)  # [batch_size, n_beams, n_tokens+1]
    target = target.view(batch_size*n_beams, n_tokens+1)  # [batch_size * n_beams, n_tokens+1]
    return target


def greedy_search(
        model: nn.Module,
        source: str,
        src_vocab: Vocab,
        tgt_vocab: Vocab,
        src_tokenizer,
        device: str,
        max_sentence_length: int,
    ) -> list:
    """Do a greedy search (a beam search with a beam width of 1) to produce a translation.

    Args
    ----
        model: The translation model. Assumes it produces logits score (before softmax).
        source: The sentence to translate.
        src_vocab: The source vocabulary.
        tgt_vocab: The target vocabulary.
        src_tokenizer: Tokenizer used to split the source sentence into tokens.
        device: Device to which we make the inference.
        max_sentence_length: Maximum number of tokens for the translated sentence.

    Output
    ------
        sentences: List containing the single (translated sentence, likelihood) pair.
    """
    
    src_tokens = ['<bos>'] + src_tokenizer(source) + ['<eos>']
    src_tokens = src_vocab(src_tokens)

    tgt_tokens = ['<bos>']
    tgt_tokens = tgt_vocab(tgt_tokens)

    # To tensor and add unitary batch dimension
    src_tokens = torch.LongTensor(src_tokens).to(device)
    tgt_tokens = torch.LongTensor(tgt_tokens).unsqueeze(dim=0).to(device)
    target_probs = torch.FloatTensor([1]).to(device)
    model.to(device)

    EOS_IDX = tgt_vocab['<eos>']
    with torch.no_grad():
        while tgt_tokens.shape[1] < max_sentence_length:
            batch_size, n_tokens = tgt_tokens.shape

            # Get next beams
            src = einops.repeat(src_tokens, 't -> b t', b=tgt_tokens.shape[0])
            predicted = model.forward(src, tgt_tokens)
            predicted = torch.softmax(predicted, dim=-1)
            probs, predicted = predicted[:, -1].topk(k=1, dim=-1)

            # Separate the terminated sentences from the others
            idx_terminated, idx_not_terminated = indices_terminated(tgt_tokens, EOS_IDX)
            idx_terminated, idx_not_terminated = idx_terminated.to(device), idx_not_terminated.to(device)

            tgt_terminated = torch.index_select(tgt_tokens, dim=0, index=idx_terminated)
            tgt_probs_terminated = torch.index_select(target_probs, dim=0, index=idx_terminated)

            filter_t = lambda t: torch.index_select(t, dim=0, index=idx_not_terminated)
            tgt_others = filter_t(tgt_tokens)
            tgt_probs_others = filter_t(target_probs)
            predicted = filter_t(predicted)
            probs = filter_t(probs)

            # Add the top tokens to the previous target sentences
            tgt_others = append_beams(tgt_others, predicted)

            # Add padding to terminated target
            padd = torch.zeros((len(tgt_terminated), 1), dtype=torch.long, device=device)
            tgt_terminated = torch.cat(
                (tgt_terminated, padd),
                dim=1
            )

            # Update each target sentence probabilities
            tgt_probs_others = torch.repeat_interleave(tgt_probs_others, 1)
            tgt_probs_others *= probs.flatten()
            tgt_probs_terminated *= 0.999  # Penalize short sequences overtime

            # Group up the terminated and the others
            target_probs = torch.cat(
                (tgt_probs_others, tgt_probs_terminated),
                dim=0
            )
            tgt_tokens = torch.cat(
                (tgt_others, tgt_terminated),
                dim=0
            )

            # Keep only the most likely target sentence (beam width of 1)
            target_probs, indices = target_probs.topk(k=1, dim=0)

            tgt_tokens = torch.index_select(tgt_tokens, dim=0, index=indices)

    sentences = []
    for tgt_sentence in tgt_tokens:
        tgt_sentence = list(tgt_sentence)[1:]  # Remove <bos> token
        tgt_sentence = list(takewhile(lambda t: t != EOS_IDX, tgt_sentence))
        tgt_sentence = ' '.join(tgt_vocab.lookup_tokens(tgt_sentence))
        sentences.append(tgt_sentence)

    sentences = [beautify(s) for s in sentences]

    # Join the sentences with their likelihood
    sentences = [(s, p.item()) for s, p in zip(sentences, target_probs)]
    # Sort the sentences by their likelihood
    sentences = [(s, p) for s, p in sorted(sentences, key=lambda k: k[1], reverse=True)]

    return sentences




# Beam search
Beam search is a smarter way of producing a sequence of tokens from
an autoregressive model than just using a greedy search.

The greedy search always chooses the most probable token as the single
next target token, and repeats this process until the *\<eos\>* token is predicted.

Instead, the beam search selects the k most probable tokens at each step.
Each current sequence is duplicated k times and one of the k tokens is appended to each copy, producing k new sequences.

*You don't have to understand this code, but understanding this code once the TP is over could improve your torch tensors skills.*

---

**More explanations**

Since this is done at each step, the number of sequences grows exponentially (k sequences after the first step, k² sequences after the second...).
To keep the number of sequences manageable, we keep only the top-s most likely sequences after each step (`max_target` in the code) and discard the rest.
To do that, we keep track of the likelihood of each sequence.
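The pruning idea above can be sketched on a toy problem. This is a minimal, plain-Python illustration and not the notebook's tensor implementation: the `NEXT` table is a made-up stand-in for the model, and `toy_beam_search`, `k` and `max_sentences` are hypothetical names that mirror `beam_width` and `max_target`.

```python
# Hypothetical toy "model": the next-token distribution depends only on the last token
NEXT = {
    '<bos>': {'a': 0.6, 'b': 0.4},
    'a': {'x': 0.5, 'y': 0.5},
    'b': {'z': 0.9, 'x': 0.1},
    'x': {'<eos>': 1.0},
    'y': {'<eos>': 1.0},
    'z': {'<eos>': 1.0},
}


def toy_beam_search(k, max_sentences, max_len=5):
    """Return (tokens, likelihood) pairs sorted by decreasing likelihood."""
    beams = [(['<bos>'], 1.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, prob in beams:
            if tokens[-1] == '<eos>':
                candidates.append((tokens, prob))  # finished sentence, carried over
                continue
            # Keep the k most probable next tokens of this sentence
            top = sorted(NEXT[tokens[-1]].items(), key=lambda kv: kv[1], reverse=True)[:k]
            for token, p in top:
                candidates.append((tokens + [token], prob * p))
        # Prune: keep only the `max_sentences` most likely sentences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:max_sentences]
    return beams


print(toy_beam_search(k=1, max_sentences=1)[0])  # greedy search
print(toy_beam_search(k=2, max_sentences=2)[0])  # beam search
```

With `k=1` and a single kept sentence the procedure reduces to greedy search: it commits to `a` (probability 0.6) and ends with a likelihood of 0.3. With two beams it keeps `b` alive and finds `b z`, whose likelihood of 0.36 is higher.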

Formally, we define $s = [s_1, ..., s_{N_s}]$ as the source sequence made of $N_s$ tokens.
We also define $t^i = [t_1, ..., t_i]$ as the target sequence at the beginning of the step $i$.

The output of the model parameterized by $\theta$ is:

$$
T_{i+1} = p(t_{i+1} | s, t^i ; \theta )
$$

Where $T_{i+1}$ is the distribution of the next token $t_{i+1}$.

Then, we define the likelihood of a target sentence $t = [t_1, ..., t_{N_t}]$ as:

$$
L(t) = \prod_{i=1}^{N_t - 1} p(t_{i+1} | s, t^{i}; \theta )
$$
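As a worked instance of this likelihood, the sketch below multiplies made-up per-step probabilities and also computes the same quantity as a sum of logarithms. The numbers are illustrative, not model outputs, and the log-space variant is a common alternative that the notebook's `beam_search` does not use.

```python
import math

# Hypothetical values of p(t_{i+1} | s, t^i) for three decoding steps
step_probs = [0.5, 0.8, 0.9]

likelihood = math.prod(step_probs)                     # L(t) as a product
log_likelihood = sum(math.log(p) for p in step_probs)  # log L(t) as a sum

print(likelihood, math.exp(log_likelihood))

# With many small factors the product underflows to 0.0 while the log-sum stays finite
tiny = [1e-6] * 60
print(math.prod(tiny), sum(math.log(p) for p in tiny))
```

For the short sentences of this dataset the plain product is usually fine; the log-sum mostly matters for long sequences or very small probabilities.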

Pseudocode of the beam search:
```
source: [N_s source tokens]  # Shape of [total_source_tokens]
target: [1, <bos> token]  # Shape of [n_sentences, current_target_tokens]
target_prob: [1]  # Shape of [n_sentences]
# We use `n_sentences` as the batch_size dimension

while current_target_tokens <= max_target_length:
    source = repeat(source, n_sentences)  # Shape of [n_sentences, total_source_tokens]
    predicted = model(source, target)[:, -1]  # Predict the next token distributions of all the n_sentences
    tokens_idx, tokens_prob = topk(predicted, k)

    # Append the `n_sentences * k` tokens to the `n_sentences` sentences
    target = repeat(target, k)  # Shape of [n_sentences * k, current_target_tokens]
    target = append_tokens(target, tokens_idx)  # Shape of [n_sentences * k, current_target_tokens + 1]

    # Update the sentences probabilities
    target_prob = repeat(target_prob, k)  # Shape of [n_sentences * k]
    target_prob *= tokens_prob

    if n_sentences * k >= max_sentences:
        target, target_prob = topk_prob(target, target_prob, k=max_sentences)
    else:
        n_sentences *= k

    current_target_tokens += 1
```

In [None]:
def beautify(sentence: str) -> str:
    """Removes useless spaces.
    """
    punc = {'.', ',', ';'}
    for p in punc:
        sentence = sentence.replace(f' {p}', p)
    
    links = {'-', "'"}
    for l in links:
        sentence = sentence.replace(f'{l} ', l)
        sentence = sentence.replace(f' {l}', l)
    
    return sentence
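A quick check of what `beautify` does to space-joined tokens. The function is repeated so that the snippet runs on its own; the input strings are assumed tokenizations, since the exact tokenizer output is not shown here.

```python
def beautify(sentence: str) -> str:
    """Removes useless spaces (copy of the notebook function)."""
    for p in {'.', ',', ';'}:
        sentence = sentence.replace(f' {p}', p)
    for l in {'-', "'"}:
        sentence = sentence.replace(f'{l} ', l)
        sentence = sentence.replace(f' {l}', l)
    return sentence


# Hypothetical token sequences joined with spaces, as done at the end of the search
examples = [
    "J' ai fait ça pour moi .",
    "Donne - moi un œil .",
    "Tom est - il Tom ?",
]
for raw in examples:
    print(repr(raw), '->', repr(beautify(raw)))
```

The space before `?` is left in place because `?` is not in the punctuation set, which matches the translations printed during training (e.g. `Tom est-il Tom ?`).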

In [None]:


def beam_search(
        model: nn.Module,
        source: str,
        src_vocab: Vocab,
        tgt_vocab: Vocab,
        src_tokenizer,
        device: str,
        beam_width: int,
        max_target: int,
        max_sentence_length: int,
    ) -> list:
    """Do a beam search to produce probable translations.

    Args
    ----
        model: The translation model. Assumes it produces linear scores (before softmax).
        source: The sentence to translate.
        src_vocab: The source vocabulary.
        tgt_vocab: The target vocabulary.
        src_tokenizer: Callable that splits the source sentence into tokens.
        device: Device on which the inference is run.
        beam_width: Number of top-k tokens we keep at each stage.
        max_target: Maximum number of target sentences we keep at the end of each stage.
        max_sentence_length: Maximum number of tokens for the translated sentence.

    Output
    ------
        sentences: List of (sentence, likelihood) pairs ordered by decreasing likelihood.
    """
    src_tokens = ['<bos>'] + src_tokenizer(source) + ['<eos>']
    src_tokens = src_vocab(src_tokens)

    tgt_tokens = ['<bos>']
    tgt_tokens = tgt_vocab(tgt_tokens)

    # To tensor and add unitary batch dimension
    src_tokens = torch.LongTensor(src_tokens).to(device)
    tgt_tokens = torch.LongTensor(tgt_tokens).unsqueeze(dim=0).to(device)
    target_probs = torch.FloatTensor([1]).to(device)
    model.to(device)

    EOS_IDX = tgt_vocab['<eos>']
    with torch.no_grad():
        while tgt_tokens.shape[1] < max_sentence_length:
            batch_size, n_tokens = tgt_tokens.shape

            # Get next beams
            src = einops.repeat(src_tokens, 't -> b t', b=tgt_tokens.shape[0])
            predicted = model.forward(src, tgt_tokens)
            predicted = torch.softmax(predicted, dim=-1)
            probs, predicted = predicted[:, -1].topk(k=beam_width, dim=-1)

            # Separate the terminated sentences from the others
            idx_terminated, idx_not_terminated = indices_terminated(tgt_tokens, EOS_IDX)
            idx_terminated, idx_not_terminated = idx_terminated.to(device), idx_not_terminated.to(device)

            tgt_terminated = torch.index_select(tgt_tokens, dim=0, index=idx_terminated)
            tgt_probs_terminated = torch.index_select(target_probs, dim=0, index=idx_terminated)

            filter_t = lambda t: torch.index_select(t, dim=0, index=idx_not_terminated)
            tgt_others = filter_t(tgt_tokens)
            tgt_probs_others = filter_t(target_probs)
            predicted = filter_t(predicted)
            probs = filter_t(probs)

            # Add the top tokens to the previous target sentences
            tgt_others = append_beams(tgt_others, predicted)

            # Add padding to terminated target
            padd = torch.zeros((len(tgt_terminated), 1), dtype=torch.long, device=device)
            tgt_terminated = torch.cat(
                (tgt_terminated, padd),
                dim=1
            )

            # Update each target sentence probabilities
            tgt_probs_others = torch.repeat_interleave(tgt_probs_others, beam_width)
            tgt_probs_others *= probs.flatten()
            tgt_probs_terminated *= 0.999  # Penalize short sequences over time

            # Group up the terminated and the others
            target_probs = torch.cat(
                (tgt_probs_others, tgt_probs_terminated),
                dim=0
            )
            tgt_tokens = torch.cat(
                (tgt_others, tgt_terminated),
                dim=0
            )

            # Keep only the top `max_target` target sentences
            if target_probs.shape[0] <= max_target:
                continue

            target_probs, indices = target_probs.topk(k=max_target, dim=0)
            tgt_tokens = torch.index_select(tgt_tokens, dim=0, index=indices)

    sentences = []
    for tgt_sentence in tgt_tokens:
        tgt_sentence = list(tgt_sentence)[1:]  # Remove <bos> token
        tgt_sentence = list(takewhile(lambda t: t != EOS_IDX, tgt_sentence))
        tgt_sentence = ' '.join(tgt_vocab.lookup_tokens(tgt_sentence))
        sentences.append(tgt_sentence)

    sentences = [beautify(s) for s in sentences]

    # Join the sentences with their likelihood
    sentences = [(s, p.item()) for s, p in zip(sentences, target_probs)]
    # Sort the sentences by their likelihood
    sentences = [(s, p) for s, p in sorted(sentences, key=lambda k: k[1], reverse=True)]

    return sentences

# Training loop
This is a basic training loop. It takes one big configuration dictionary to avoid endless argument lists in the functions.
We use [Weights and Biases](https://wandb.ai/) to log the training runs.
It logs all training information and model performance in the cloud.
You have to create an account to use it. Accounts are free for individuals and research teams.

In [None]:
def print_logs(dataset_type: str, logs: dict):
    """Print the logs.

    Args
    ----
        dataset_type: Either "Train", "Eval", "Test" type.
        logs: Containing the metric's name and value.
    """
    desc = [
        f'{name}: {value:.2f}'
        for name, value in logs.items()
    ]
    desc = '\t'.join(desc)
    desc = f'{dataset_type} -\t' + desc
    desc = desc.expandtabs(5)
    print(desc)


def topk_accuracy(
        real_tokens: torch.LongTensor,
        probs_tokens: torch.FloatTensor,
        k: int,
        tgt_pad_idx: int,
    ) -> torch.FloatTensor:
    """Compute the top-k accuracy.
    We ignore the PAD tokens.

    Args
    ----
        real_tokens: Real tokens of the target sentence.
            Shape of [batch_size * n_tokens].
        probs_tokens: Tokens probability predicted by the model.
            Shape of [batch_size * n_tokens, n_target_vocabulary].
        k: Top-k accuracy threshold.
        tgt_pad_idx: Target padding index value.
    
    Output
    ------
        acc: Scalar top-k accuracy value.
    """
    total = (real_tokens != tgt_pad_idx).sum()

    _, pred_tokens = probs_tokens.topk(k=k, dim=-1)  # [batch_size * n_tokens, k]
    real_tokens = einops.repeat(real_tokens, 'b -> b k', k=k)  # [batch_size * n_tokens, k]

    good = (pred_tokens == real_tokens) & (real_tokens != tgt_pad_idx)
    acc = good.sum() / total
    return acc


def loss_batch(
        model: nn.Module,
        source: torch.LongTensor,
        target: torch.LongTensor,
        config: dict,
    )-> dict:
    """Compute the metrics associated with this batch.
    The metrics are:
        - loss
        - top-1 accuracy
        - top-5 accuracy
        - top-10 accuracy

    Args
    ----
        model: The model to train.
        source: Batch of source tokens.
            Shape of [batch_size, n_src_tokens].
        target: Batch of target tokens.
            Shape of [batch_size, n_tgt_tokens].
        config: Additional parameters.

    Output
    ------
        metrics: Dictionary containing evaluated metrics on this batch.
    """
    device = config['device']
    loss_fn = config['loss'].to(device)
    metrics = dict()

    source, target = source.to(device), target.to(device)
    target_in, target_out = target[:, :-1], target[:, 1:]

    # Loss
    pred = model(source, target_in)  # [batch_size, n_tgt_tokens-1, n_vocab]
    pred = pred.view(-1, pred.shape[2])  # [batch_size * (n_tgt_tokens - 1), n_vocab]
    target_out = target_out.flatten()  # [batch_size * (n_tgt_tokens - 1),]
    metrics['loss'] = loss_fn(pred, target_out)

    # Accuracy - we ignore the padding predictions
    for k in [1, 5, 10]:
        metrics[f'top-{k}'] = topk_accuracy(target_out, pred, k, config['tgt_pad_idx'])

    return metrics


def eval_model(model: nn.Module, dataloader: DataLoader, config: dict) -> dict:
    """Evaluate the model on the given dataloader.
    """
    device = config['device']
    logs = defaultdict(list)

    model.to(device)
    model.eval()

    with torch.no_grad():
        for source, target in dataloader:
            metrics = loss_batch(model, source, target, config)
            for name, value in metrics.items():
                logs[name].append(value.cpu().item())

    for name, values in logs.items():
        logs[name] = np.mean(values)
    return logs


def train_model(model: nn.Module, config: dict):
    """Train the model in a teacher forcing manner.
    """
    train_loader, val_loader = config['train_loader'], config['val_loader']
    train_dataset, val_dataset = train_loader.dataset.dataset, val_loader.dataset.dataset
    optimizer = config['optimizer']
    clip = config['clip']
    device = config['device']

    columns = ['epoch']
    for mode in ['train', 'validation']:
        columns += [
            f'{mode} - {colname}'
            for colname in ['source', 'target', 'predicted', 'likelihood']
        ]
    log_table = wandb.Table(columns=columns)


    print(f'Starting training for {config["epochs"]} epochs, using {device}.')
    for e in range(config['epochs']):
        print(f'\nEpoch {e+1}')

        model.to(device)
        model.train()
        logs = defaultdict(list)

        for batch_id, (source, target) in enumerate(train_loader):
            optimizer.zero_grad()

            metrics = loss_batch(model, source, target, config)
            loss = metrics['loss']

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
        
            for name, value in metrics.items():
                logs[name].append(value.cpu().item())  # Don't forget the '.item' to free the cuda memory
            
            if batch_id % config['log_every'] == 0:
                for name, value in logs.items():
                    logs[name] = np.mean(value)

                train_logs = {
                    f'Train - {m}': v
                    for m, v in logs.items()
                }
                wandb.log(train_logs)
                logs = defaultdict(list)
        
        # Logs
        if len(logs) != 0:
            for name, value in logs.items():
                logs[name] = np.mean(value)
            train_logs = {
                f'Train - {m}': v
                for m, v in logs.items()
            }
        else:
            logs = {
                m.split(' - ')[1]: v
                for m, v in train_logs.items()
            }

        print_logs('Train', logs)

        logs = eval_model(model, val_loader, config)
        print_logs('Eval', logs)
        val_logs = {
            f'Validation - {m}': v
            for m, v in logs.items()
        }

        val_source, val_target = val_dataset[ torch.randint(len(val_dataset), (1,)) ]
        val_pred, val_prob = beam_search(
            model,
            val_source,
            config['src_vocab'],
            config['tgt_vocab'],
            config['src_tokenizer'],
            device,  # It can take a lot of VRAM
            beam_width=10,
            max_target=100,
            max_sentence_length=config['max_sequence_length'],
        )[0]
        print(val_source)
        print(val_pred)

        logs = {**train_logs, **val_logs}  # Merge dictionaries
        wandb.log(logs)  # Upload to the WandB cloud

        # Table logs
        train_source, train_target = train_dataset[ torch.randint(len(train_dataset), (1,)) ]
        train_pred, train_prob = beam_search(
            model,
            train_source,
            config['src_vocab'],
            config['tgt_vocab'],
            config['src_tokenizer'],
            device,  # It can take a lot of VRAM
            beam_width=10,
            max_target=100,
            max_sentence_length=config['max_sequence_length'],
        )[0]

        data = [
            e + 1,
            train_source, train_target, train_pred, train_prob,
            val_source, val_target, val_pred, val_prob,
        ]
        log_table.add_data(*data)
    
    # Log the table at the end of the training
    wandb.log({'Model predictions': log_table})
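Two details of `loss_batch` are easy to miss: the one-token shift between the decoder input and the expected output, and the exclusion of `<pad>` positions from the accuracy. The sketch below reproduces both with plain Python lists instead of tensors; the token ids, the `PAD` value, `scores_from_ranking` and `toy_topk_accuracy` are made up for illustration.

```python
PAD = 0  # hypothetical padding index

# One padded target sentence: <bos>=1, two words (7 and 8), <eos>=2, then padding
target = [1, 7, 8, 2, PAD]
target_in = target[:-1]   # fed to the decoder
target_out = target[1:]   # tokens the decoder should predict at each position


def scores_from_ranking(ranking, n_vocab=10):
    """Build a score row in which `ranking` lists the best tokens in order."""
    row = [0.0] * n_vocab
    for rank, token in enumerate(ranking):
        row[token] = float(n_vocab - rank)
    return row


def toy_topk_accuracy(real_tokens, scores, k, pad_idx):
    """Fraction of non-pad positions whose true token is among the k best scores."""
    hits, total = 0, 0
    for real, row in zip(real_tokens, scores):
        if real == pad_idx:
            continue  # padding positions are neither hits nor misses
        topk = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += real in topk
        total += 1
    return hits / total


scores = [
    scores_from_ranking([7]),        # true token 7 ranked first
    scores_from_ranking([3, 8]),     # true token 8 ranked second
    scores_from_ranking([5, 6, 2]),  # true token 2 ranked third
    scores_from_ranking([4]),        # padding position, ignored
]

for k in (1, 2, 3):
    print(k, toy_topk_accuracy(target_out, scores, k, PAD))
```

Only three of the four positions count, so the accuracy moves in steps of one third as `k` grows.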

# Training the models
We can now finally train the models.
Choose the right hyperparameters, play with them and try to find
ones that lead to good models and good training curves.
Try to reach a loss under 1.0.

For reference, it is possible to get decent results in approximately 20 epochs.
With CUDA enabled, one epoch, even on a big model with a big dataset, shouldn't last more than 10 minutes.
A normal epoch takes between 1 and 5 minutes.

*These timings assume Colab Pro; we should try the free Colab to get better estimates.*

---

To test your implementations, it is easier to run your models
on a CPU instance. Colab lowers the priority of your GPU instances
according to how much time you recently spent on them, and it would be
a waste to use up your GPU time on implementation testing.
Moreover, you should test your models on small datasets and with a small number of parameters.
For example, you could set:
```
MAX_SEQ_LEN = 10
MIN_TOK_FREQ = 20
dim_embedding = 40
dim_hidden = 60
n_layers = 1
```
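One way to switch between such a debug setup and the full run is to keep the overrides in a separate dictionary and merge them on demand. This is only a sketch: `DEBUG`, `debug_overrides` and `run_config` are hypothetical names, and the `config` shown here is a cut-down stand-in for the notebook's dictionary.

```python
# Stand-in for the notebook's full `config` dictionary (only a few keys shown)
config = {
    'epochs': 5,
    'batch_size': 128,
    'dim_embedding': 196,
    'dim_hidden': 256,
    'n_layers': 3,
}

# Hypothetical overrides for a quick smoke test on CPU
debug_overrides = {
    'epochs': 1,
    'dim_embedding': 40,
    'dim_hidden': 60,
    'n_layers': 1,
}

DEBUG = True  # hypothetical flag
# Later keys win, so the overrides replace the full-size values without mutating `config`
run_config = {**config, **debug_overrides} if DEBUG else dict(config)
print(run_config)
```

`MAX_SEQ_LEN` and `MIN_TOK_FREQ` are passed to `build_datasets`, so they have to be changed before the datasets are built rather than in this dictionary.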

You usually don't want to log anything onto WandB when testing your implementation.
To deactivate WandB without having to change any line of code, you can type `!wandb offline` in a cell.

Once you have correctly implemented the models, you can train bigger models on bigger datasets.
When you do this, do not forget to switch the runtime to GPU (and use `!wandb online`)!
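The same switch can be made from Python through the `WANDB_MODE` environment variable, which wandb reads when a run is initialised. A minimal sketch, assuming it is executed before `wandb.init`; the `TESTING` flag is a hypothetical name.

```python
import os

TESTING = True  # hypothetical flag: set to False for real training runs

# 'offline' keeps the logs on disk only, 'online' syncs them to the cloud
os.environ['WANDB_MODE'] = 'offline' if TESTING else 'online'
print(os.environ['WANDB_MODE'])
```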

In [None]:

# Checking GPU and logging to wandb

!wandb login

!nvidia-smi

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Tue Apr 12 00:59:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                    

In [None]:
# Instantiate the datasets

MAX_SEQ_LEN = 60
MIN_TOK_FREQ = 2
train_dataset, val_dataset = build_datasets(
    MAX_SEQ_LEN,
    MIN_TOK_FREQ,
    en_tokenizer,
    fr_tokenizer,
    train,
    valid,
)


print(f'English vocabulary size: {len(train_dataset.en_vocab):,}')
print(f'French vocabulary size: {len(train_dataset.fr_vocab):,}')

print(f'\nTraining examples: {len(train_dataset):,}')
print(f'Validation examples: {len(val_dataset):,}')

English vocabulary size: 11,196
French vocabulary size: 16,970

Training examples: 173,104
Validation examples: 19,235
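The vocabulary sizes depend on `MIN_TOK_FREQ`. `build_datasets` is defined earlier in the notebook; the sketch below only illustrates the general idea of a frequency cutoff, under the assumption that tokens seen fewer than `min_freq` times fall back to `<unk>`. `toy_vocab` and the three-sentence corpus are made up.

```python
from collections import Counter


def toy_vocab(tokenized_sentences, min_freq, specials=('<unk>', '<pad>', '<bos>', '<eos>')):
    """Map each special token and each sufficiently frequent token to an index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}


corpus = [
    ['i', 'like', 'tea'],
    ['i', 'like', 'coffee'],
    ['you', 'like', 'tea'],
]

vocab_1 = toy_vocab(corpus, min_freq=1)  # keeps every token
vocab_2 = toy_vocab(corpus, min_freq=2)  # drops 'coffee' and 'you'

print(len(vocab_1), len(vocab_2))
# A dropped token is looked up as <unk>
print(vocab_2.get('coffee', vocab_2['<unk>']))
```

A higher cutoff gives a smaller output layer and fewer rarely-trained embeddings, at the price of more `<unk>` tokens in the data.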


#### Config map

In [None]:
# Build the model, the dataloaders, optimizer and the loss function
# Log all hyperparameters and arguments into the config dictionary

# Model type "RNN"
config = {
    # General parameters
    'epochs': 5,
    'batch_size': 128,
    'lr': 1e-3,
    'betas': (0.9, 0.99),
    'clip': 5,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',

    # Model parameters
    'n_tokens_src': len(train_dataset.en_vocab),
    'n_tokens_tgt': len(train_dataset.fr_vocab),
    'n_heads': 4,
    'dim_embedding': 196,
    'dim_hidden': 256,
    'n_layers': 3,
    'dropout': 0.1,
    'model_type': 'RNN',

    # Others
    'max_sequence_length': MAX_SEQ_LEN,
    'min_token_freq': MIN_TOK_FREQ,
    'src_vocab': train_dataset.en_vocab,
    'tgt_vocab': train_dataset.fr_vocab,
    'src_tokenizer': en_tokenizer,
    'tgt_tokenizer': fr_tokenizer,
    'src_pad_idx': train_dataset.en_vocab['<pad>'],
    'tgt_pad_idx': train_dataset.fr_vocab['<pad>'],
    'seed': 0,
    'log_every': 50,  # Number of batches between each wandb logs
    'create_bar_chart': False,
}

torch.manual_seed(config['seed'])

config['train_loader'] = DataLoader(
    train_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)

config['val_loader'] = DataLoader(
    val_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)
# model = TranslationRNN(
#     config['n_tokens_src'],
#     config['n_tokens_tgt'],
#     config['dim_embedding'],
#     config['dim_hidden'],
#     config['n_layers'],
#     config['dropout'],
#     config['src_pad_idx'],
#     config['tgt_pad_idx'],
#     config['model_type'],
# )
# Uncommented for testing
model = TranslationTransformer(
    config['n_tokens_src'],
    config['n_tokens_tgt'],
    config['n_heads'],
    config['dim_embedding'],
    config['dim_hidden'],
    config['n_layers'],
    config['dropout'],
    config['src_pad_idx'],
    config['tgt_pad_idx'],
)
# Replace the model to train here
config['optimizer'] = optim.Adam(
    model.parameters(),
    lr=config['lr'],
    betas=config['betas'],
)

weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
config['loss'] = nn.CrossEntropyLoss(
    weight=weight_classes,
    ignore_index=config['tgt_pad_idx'],  # We do not have to learn those
)

summary(
    model,
    input_size=[
        (config['batch_size'], config['max_sequence_length']),
        (config['batch_size'], config['max_sequence_length'])
    ],
    dtypes=[torch.long, torch.long],
    depth=3,
)

Layer (type:depth-idx)                             Output Shape              Param #
TranslationTransformer                             --                        --
├─Transformer: 1                                   --                        --
│    └─TransformerEncoder: 2-2                     --                        --
│    │    └─ModuleList: 3-1                        --                        768,108
│    └─TransformerDecoder: 2                       --                        --
│    │    └─ModuleList: 3-2                        --                        1,232,628
├─Embedding: 1-1                                   [128, 60, 196]            2,194,416
├─PositionalEncoding: 1-2                          [128, 60, 196]            --
│    └─Dropout: 2-1                                [128, 60, 196]            --
├─Transformer: 1                                   --                        --
│    └─TransformerEncoder: 2-2                     --                        --
│    │    └─Modu
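The loss configured above down-weights `<unk>` and ignores `<pad>`. The plain-Python sketch below shows the arithmetic this implies for a mean-reduced weighted cross-entropy: the weighted sum of the per-token losses divided by the sum of the weights of the non-ignored targets. The four-token vocabulary, the logits and `toy_weighted_ce` are illustrative assumptions.

```python
import math

PAD, UNK = 0, 1  # hypothetical indices in a 4-token vocabulary


def toy_weighted_ce(logits, targets, class_weights, ignore_index):
    """Weighted mean of the negative log-likelihoods, skipping ignored targets."""
    weighted_sum, weight_total = 0.0, 0.0
    for row, y in zip(logits, targets):
        if y == ignore_index:
            continue
        log_z = math.log(sum(math.exp(v) for v in row))  # log of the softmax denominator
        weighted_sum += class_weights[y] * (log_z - row[y])
        weight_total += class_weights[y]
    return weighted_sum / weight_total


logits = [
    [0.0, 0.0, 3.0, 0.0],  # confident and correct: the target is token 2
    [0.0, 0.0, 3.0, 0.0],  # confident, but the target is <unk>
    [0.0, 0.0, 0.0, 0.0],  # padding position, ignored
]
targets = [2, UNK, PAD]

print(toy_weighted_ce(logits, targets, [1.0, 1.0, 1.0, 1.0], PAD))  # uniform weights
print(toy_weighted_ce(logits, targets, [1.0, 0.1, 1.0, 1.0], PAD))  # <unk> down-weighted
```

With the lower weight, the miss on the `<unk>` target contributes much less to the average, so the model is pushed less strongly toward predicting `<unk>`.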

In [None]:
from torch.nn.modules import transformer
# Testing
train_loader = DataLoader(
    train_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)
transformer = nn.Transformer()
for batch_id, (source, target) in enumerate(train_loader):
    print(source.shape)
    print(target.shape)
    break

torch.Size([128, 18])
torch.Size([128, 20])


In [None]:
# Build the model, the dataloaders, optimizer and the loss function
# Log all hyperparameters and arguments into the config dictionary

# Model type "RNN"
config = {
    # General parameters
    'epochs': 5,
    'batch_size': 128,
    'lr': 1e-3,
    'betas': (0.9, 0.99),
    'clip': 5,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',

    # Model parameters
    'n_tokens_src': len(train_dataset.en_vocab),
    'n_tokens_tgt': len(train_dataset.fr_vocab),
    'n_heads': 4,
    'dim_embedding': 196,
    'dim_hidden': 256,
    'n_layers': 3,
    'dropout': 0.1,
    'model_type': 'RNN',

    # Others
    'max_sequence_length': MAX_SEQ_LEN,
    'min_token_freq': MIN_TOK_FREQ,
    'src_vocab': train_dataset.en_vocab,
    'tgt_vocab': train_dataset.fr_vocab,
    'src_tokenizer': en_tokenizer,
    'tgt_tokenizer': fr_tokenizer,
    'src_pad_idx': train_dataset.en_vocab['<pad>'],
    'tgt_pad_idx': train_dataset.fr_vocab['<pad>'],
    'seed': 0,
    'log_every': 50,  # Number of batches between each wandb logs
}

torch.manual_seed(config['seed'])

config['train_loader'] = DataLoader(
    train_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)

config['val_loader'] = DataLoader(
    val_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)

# Commented out: the Transformer is instantiated in its own cell below
# model = TranslationTransformer(
#     config['n_tokens_src'],
#     config['n_tokens_tgt'],
#     config['n_heads'],
#     config['dim_embedding'],
#     config['dim_hidden'],
#     config['n_layers'],
#     config['dropout'],
#     config['src_pad_idx'],
#     config['tgt_pad_idx'],
# )


### **Run RNN**

In [None]:
model = TranslationRNN(
    config['n_tokens_src'],
    config['n_tokens_tgt'],
    config['dim_embedding'],
    config['dim_hidden'],
    config['n_layers'],
    config['dropout'],
    config['src_pad_idx'],
    config['tgt_pad_idx'],
    config['model_type'],
)

# Replace the model to train here
config['optimizer'] = optim.Adam(
    model.parameters(),
    lr=config['lr'],
    betas=config['betas'],
)

weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
config['loss'] = nn.CrossEntropyLoss(
    weight=weight_classes,
    ignore_index=config['tgt_pad_idx'],  # We do not have to learn those
)

summary(
    model,
    input_size=[
        (config['batch_size'], config['max_sequence_length']),
        (config['batch_size'], config['max_sequence_length'])
    ],
    dtypes=[torch.long, torch.long],
    depth=3,
)

Layer (type:depth-idx)                   Output Shape              Param #
TranslationRNN                           --                        --
├─RNN: 1                                 --                        --
│    └─ModuleList: 2-1                   --                        --
├─RNN: 1                                 --                        --
│    └─ModuleList: 2-2                   --                        --
├─Embedding: 1-1                         [128, 60, 196]            2,194,416
├─RNN: 1-2                               [128, 60, 256]            --
│    └─ModuleList: 2-1                   --                        --
│    │    └─RNNCell: 3-1                 [128, 60, 256]            116,224
│    │    └─RNNCell: 3-2                 [128, 60, 256]            131,584
│    │    └─RNNCell: 3-3                 [128, 60, 256]            131,584
├─LayerNorm: 1-3                         [128, 3, 256]             512
├─Embedding: 1-4                         [128, 60, 196]       

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
with wandb.init(
        config=config,
        project='INF8225 - TP3',  # Title of your project
        group='RNN - small',  # In what group of runs do you want this run to be in?
        save_code=True,
        mode = 'disabled'
    ):
    train_model(model, config)


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.85     top-1: 0.47    top-5: 0.65    top-10: 0.70
Eval -    loss: 2.67     top-1: 0.49    top-5: 0.67    top-10: 0.73
Was Tom murdered?
Tom est-il Tom ?

Epoch 2
Train -   loss: 2.72     top-1: 0.50    top-5: 0.69    top-10: 0.74
Eval -    loss: 2.42     top-1: 0.52    top-5: 0.71    top-10: 0.77
Where did you stay?
Où as-tu fait ?

Epoch 3
Train -   loss: 2.31     top-1: 0.54    top-5: 0.72    top-10: 0.78
Eval -    loss: 2.30     top-1: 0.53    top-5: 0.72    top-10: 0.78
I was asked to wear a wire.
J'ai dû prendre un cadeau.

Epoch 4
Train -   loss: 2.36     top-1: 0.52    top-5: 0.72    top-10: 0.79
Eval -    loss: 2.23     top-1: 0.54    top-5: 0.74    top-10: 0.79
I don't want you to leave.
Je ne veux pas que tu restes.

Epoch 5
Train -   loss: 2.28     top-1: 0.53    top-5: 0.73    top-10: 0.79
Eval -    loss: 2.18     top-1: 0.55    top-5: 0.74    top-10: 0.80
They treat their employees well.
Ils ont arrêté 

In [None]:
sentence = "I called my friend to say hi."

preds = beam_search(
    model,
    sentence,
    config['src_vocab'],
    config['tgt_vocab'],
    config['src_tokenizer'],
    config['device'],
    beam_width=10,
    max_target=100,
    max_sentence_length=config['max_sequence_length']
)[:5]

for i, (translation, likelihood) in enumerate(preds):
    print(f'{i}. ({likelihood*100:.5f}%) \t {translation}')

0. (0.18937%) 	 J'ai fait ça pour moi.
1. (0.16131%) 	 J'ai fait mes devoirs.
2. (0.13921%) 	 J'ai pris mon argent pour moi.
3. (0.12492%) 	 J'ai fait mes devoirs pour moi.
4. (0.10391%) 	 J'ai pris mon parapluie.


In [None]:
sentence = "I called my friend to say hi."
pred1 = greedy_search(
    model,
    sentence,
    config['src_vocab'],
    config['tgt_vocab'],
    config['src_tokenizer'],
    config['device'],
    config['max_sequence_length']
)[0]

print(f'({pred1[1]*100:.5f}%)       {pred1[0]}')

(0.12492%)       J'ai fait mes devoirs pour moi.


### **Run GRU**

In [None]:
config["model_type"] = "GRU"

model = TranslationRNN(
    config['n_tokens_src'],
    config['n_tokens_tgt'],
    config['dim_embedding'],
    config['dim_hidden'],
    config['n_layers'],
    config['dropout'],
    config['src_pad_idx'],
    config['tgt_pad_idx'],
    config['model_type'],
)

# Replace the model to train here
config['optimizer'] = optim.Adam(
    model.parameters(),
    lr=config['lr'],
    betas=config['betas'],
)

weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
config['loss'] = nn.CrossEntropyLoss(
    weight=weight_classes,
    ignore_index=config['tgt_pad_idx'],  # We do not have to learn those
)

summary(
    model,
    input_size=[
        (config['batch_size'], config['max_sequence_length']),
        (config['batch_size'], config['max_sequence_length'])
    ],
    dtypes=[torch.long, torch.long],
    depth=3,
)

Layer (type:depth-idx)                   Output Shape              Param #
TranslationRNN                           --                        --
├─GRU: 1                                 --                        --
│    └─ModuleList: 2-1                   --                        --
├─GRU: 1                                 --                        --
│    └─ModuleList: 2-2                   --                        --
├─Embedding: 1-1                         [128, 60, 196]            2,194,416
├─GRU: 1-2                               [128, 60, 256]            --
│    └─ModuleList: 2-1                   --                        --
│    │    └─GRUCell: 3-1                 [128, 60, 256]            348,672
│    │    └─GRUCell: 3-2                 [128, 60, 256]            394,752
│    │    └─GRUCell: 3-3                 [128, 60, 256]            394,752
├─LayerNorm: 1-3                         [128, 3, 256]             512
├─Embedding: 1-4                         [128, 60, 196]       
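The parameter counts in the RNN and GRU summaries can be checked by hand. The sketch assumes each recurrent cell holds one input-to-hidden matrix, one hidden-to-hidden matrix and two bias vectors per gate (one gate for the vanilla RNN, three for the GRU); the cell implementations are not shown in this section, so this layout is inferred from the totals. `cell_params` is a hypothetical helper.

```python
def cell_params(dim_in, dim_hidden, n_gates=1):
    """Parameters of a recurrent cell: two weight matrices and two biases per gate."""
    return n_gates * (dim_in * dim_hidden + dim_hidden * dim_hidden + 2 * dim_hidden)


dim_embedding, dim_hidden = 196, 256
n_src_tokens = 11_196  # English vocabulary size printed earlier

print(cell_params(dim_embedding, dim_hidden))             # RNNCell 3-1: 116,224
print(cell_params(dim_hidden, dim_hidden))                # RNNCell 3-2 and 3-3: 131,584
print(cell_params(dim_embedding, dim_hidden, n_gates=3))  # GRUCell 3-1: 348,672
print(cell_params(dim_hidden, dim_hidden, n_gates=3))     # GRUCell 3-2 and 3-3: 394,752
print(n_src_tokens * dim_embedding)                       # source Embedding: 2,194,416
print(2 * dim_hidden)                                     # LayerNorm scale and shift: 512
```

Each GRU cell has exactly three times the parameters of the RNN cell at the same position, which is the main cost of the gating.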

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
with wandb.init(
        config=config,
        project='INF8225 - TP3',  # Title of your project
        group='GRU - small',  # In what group of runs do you want this run to be in?
        save_code=True,
        mode = 'disabled'
    ):
    train_model(model, config)
    


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.57     top-1: 0.50    top-5: 0.68    top-10: 0.75
Eval -    loss: 2.62     top-1: 0.50    top-5: 0.68    top-10: 0.74
I'm a new student.
Je suis un peu en retard.

Epoch 2
Train -   loss: 2.34     top-1: 0.55    top-5: 0.73    top-10: 0.79
Eval -    loss: 2.22     top-1: 0.55    top-5: 0.74    top-10: 0.79
I read the whole book in a day.
J'ai trouvé le livre ce matin.

Epoch 3
Train -   loss: 2.16     top-1: 0.55    top-5: 0.76    top-10: 0.81
Eval -    loss: 2.03     top-1: 0.58    top-5: 0.77    top-10: 0.82
Give me a beer.
Donne-moi un œil.

Epoch 4
Train -   loss: 2.00     top-1: 0.57    top-5: 0.79    top-10: 0.84
Eval -    loss: 1.88     top-1: 0.60    top-5: 0.79    top-10: 0.84
Don't you remember my name?
Ne me dis-tu pas mon nom ?

Epoch 5
Train -   loss: 2.05     top-1: 0.57    top-5: 0.77    top-10: 0.82
Eval -    loss: 1.79     top-1: 0.61    top-5: 0.81    top-10: 0.85
This work is simple enough that ev

In [None]:
sentence = "I called my friend to say hi."

preds = beam_search(
    model,
    sentence,
    config['src_vocab'],
    config['tgt_vocab'],
    config['src_tokenizer'],
    config['device'],
    beam_width=10,
    max_target=100,
    max_sentence_length=config['max_sequence_length']
)[:5]

for i, (translation, likelihood) in enumerate(preds):
    print(f'{i}. ({likelihood*100:.5f}%) \t {translation}')

0. (0.46880%) 	 J'ai oublié mon nom.
1. (0.40330%) 	 J'ai oublié à ma mère.
2. (0.38278%) 	 J'ai répondu à ma mère.
3. (0.26512%) 	 J'ai laissé mon nom.
4. (0.23090%) 	 J'ai écrit mon nom.


In [None]:
sentence = "I called my friend to say hi."
pred1 = greedy_search(
    model,
    sentence,
    config['src_vocab'],
    config['tgt_vocab'],
    config['src_tokenizer'],
    config['device'],
    config['max_sequence_length']
)[0]

print(f'({pred1[1]*100:.5f}%)       {pred1[0]}')

(0.02307%)       J'ai oublié mon nom à ton sujet.



### **Run Transformer**

In [None]:
model = TranslationTransformer(
    config['n_tokens_src'],
    config['n_tokens_tgt'],
    config['n_heads'],
    config['dim_embedding'],
    config['dim_hidden'],
    config['n_layers'],
    config['dropout'],
    config['src_pad_idx'],
    config['tgt_pad_idx'],
)

config['optimizer'] = optim.Adam(
    model.parameters(),
    lr=config['lr'],
    betas=config['betas'],
)

weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
config['loss'] = nn.CrossEntropyLoss(
    weight=weight_classes,
    ignore_index=config['tgt_pad_idx'],  # We do not have to learn those
)

summary(
    model,
    input_size=[
        (config['batch_size'], config['max_sequence_length']),
        (config['batch_size'], config['max_sequence_length'])
    ],
    dtypes=[torch.long, torch.long],
    depth=3,
)

Layer (type:depth-idx)                             Output Shape              Param #
TranslationTransformer                             --                        --
├─Transformer: 1                                   --                        --
│    └─TransformerEncoder: 2-2                     --                        --
│    │    └─ModuleList: 3-1                        --                        768,108
│    └─TransformerDecoder: 2                       --                        --
│    │    └─ModuleList: 3-2                        --                        1,232,628
├─Embedding: 1-1                                   [128, 60, 196]            2,194,416
├─PositionalEncoding: 1-2                          [128, 60, 196]            --
│    └─Dropout: 2-1                                [128, 60, 196]            --
├─Transformer: 1                                   --                        --
│    └─TransformerEncoder: 2-2                     --                        --
│    │    └─Modu

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
with wandb.init(
        config=config,
        project='INF8225 - TP3',  # Title of your project
        group='Transformer Translation - small',  # In what group of runs do you want this run to be in?
        save_code=True,
        mode = 'disabled'
    ):
    train_model(model, config)


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.27     top-1: 0.57    top-5: 0.76    top-10: 0.82
Eval -    loss: 2.13     top-1: 0.59    top-5: 0.77    top-10: 0.82
They already knew.
Elles savais déjà.

Epoch 2
Train -   loss: 1.88     top-1: 0.63    top-5: 0.82    top-10: 0.86
Eval -    loss: 1.72     top-1: 0.64    top-5: 0.83    top-10: 0.87
I can't believe you're getting married.
Je n'arrive pas à croire que tu sois marié.

Epoch 3
Train -   loss: 1.75     top-1: 0.64    top-5: 0.82    top-10: 0.87
Eval -    loss: 1.52     top-1: 0.67    top-5: 0.86    top-10: 0.89
This one's all yours.
C'est tout le monde.

Epoch 4
Train -   loss: 1.63     top-1: 0.65    top-5: 0.85    top-10: 0.89
Eval -    loss: 1.44     top-1: 0.68    top-5: 0.87    top-10: 0.90
Go ahead!
Allez !

Epoch 5
Train -   loss: 1.36     top-1: 0.72    top-5: 0.88    top-10: 0.90
Eval -    loss: 1.35     top-1: 0.70    top-5: 0.88    top-10: 0.91
I thought you'd already done that.
Je pensais qu

In [None]:
sentence = "I called my friend to say hi."

preds = beam_search(
    model,
    sentence,
    config['src_vocab'],
    config['tgt_vocab'],
    config['src_tokenizer'],
    config['device'],
    beam_width=10,
    max_target=100,
    max_sentence_length=config['max_sequence_length']
)[:5]

for i, (translation, likelihood) in enumerate(preds):
    print(f'{i}. ({likelihood*100:.5f}%) \t {translation}')

0. (2.80028%) 	 J'ai appelé ma amie.
1. (2.07530%) 	 J'ai téléphoné à mon ami.
2. (2.06449%) 	 J'ai appelé mon ami.
3. (1.99738%) 	 J'ai appelé mon amie.
4. (0.98363%) 	 J'ai téléphoné à mon amie.


In [None]:
sentence = "I called my friend to say hi."
pred1 = greedy_search(
    model,
    sentence,
    config['src_vocab'],
    config['tgt_vocab'],
    config['src_tokenizer'],
    config['device'],
    config['max_sequence_length']
)[0]

print(f'({pred1[1]*100:.5f}%)       {pred1[0]}')

(0.84398%)       J'ai appelé mon ami à dire bonjour.
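
The outputs above show the greedy translation ending up with a lower likelihood (0.84%) than the best beam hypothesis (2.80%). The sketch below illustrates why this can happen on a toy example. It is not the notebook's `beam_search` or `greedy_search`: the next-token table `TOY_MODEL` and the helpers `greedy_decode` and `beam_decode` are hypothetical, written only for illustration.

```python
import math

# Hypothetical toy "model": maps a prefix (tuple of tokens) to next-token probabilities.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def next_probs(prefix):
    # Prefixes missing from the table are treated as finished sequences.
    return TOY_MODEL.get(tuple(prefix), {})

def greedy_decode():
    seq, logp = [], 0.0
    while next_probs(seq):
        # Always take the locally most likely token.
        token, p = max(next_probs(seq).items(), key=lambda kv: kv[1])
        seq.append(token)
        logp += math.log(p)
    return seq, math.exp(logp)

def beam_decode(beam_width):
    beams = [([], 0.0)]  # (sequence, log-probability)
    while any(next_probs(seq) for seq, _ in beams):
        candidates = []
        for seq, logp in beams:
            probs = next_probs(seq)
            if not probs:  # finished hypothesis, keep it unchanged
                candidates.append((seq, logp))
                continue
            for token, p in probs.items():
                candidates.append((seq + [token], logp + math.log(p)))
        # Keep only the `beam_width` most likely hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return [(seq, math.exp(logp)) for seq, logp in beams]

greedy_seq, greedy_p = greedy_decode()
beam_seq, beam_p = beam_decode(beam_width=2)[0]
print(greedy_seq, round(greedy_p, 2))
print(beam_seq, round(beam_p, 2))
```

Greedy decoding commits to `a` because it is the most likely first token, while the beam keeps `b` alive long enough to discover that `b x` is more likely overall.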


# Questions
1. Explain the differences between Vanilla RNN, GRU-RNN, and Transformers. 
2. Why is positional encoding necessary in Transformers and not in RNNs?
3. Describe the preprocessing process. Detail how the initial dataset is processed before being fed to the translation models.
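
As a hint for question 2, the sketch below shows that dot-product self-attention on its own treats its input as a set: permuting the input tokens only permutes the outputs. It uses plain Python and a hypothetical helper `self_attention` with no learned projections, so it is a simplification of the attention used in the notebook's Transformer.

```python
import math

def self_attention(xs):
    """Unparameterized dot-product self-attention: queries = keys = values = xs."""
    dim = len(xs[0])
    outputs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in xs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, xs)) / total
            for d in range(dim)
        ])
    return outputs

tokens = [[1.0, 0.0], [0.0, 2.0], [1.5, -1.0]]  # three toy token embeddings
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Each output of the shuffled sequence equals the original output of the same token.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, row in zip(perm, out_shuffled)
    for a, b in zip(out[i], row)
)
print(same)
```

An RNN, by contrast, consumes tokens one at a time, so the order of the sequence is built into its computation.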

# Small report - experiments
Once everything is working fine, you can explore and do a little research work.

For example, you can experiment with the hyperparameters.
What is the effect of the different hyperparameters on the final model performance? What about training time?

What other metrics could you use for machine translation? Can you compute them and add them to your WandB report?

Those are only examples, you can do whatever you think will be interesting.
This part accounts for many points, *feel free to go wild!*

---
*Make a small report about your experiments here.*

See report -
Experiments

In [None]:
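from torchtext.data.metrics import bleu_score


def evaluate_blue_score(mod):
    """Compute the BLEU score of `mod` on the first 50 pairs of `config["val_loader"].dataset.dataset`."""
    pairs = config["val_loader"].dataset.dataset[:50]
    source = [s[0] for s in pairs]
    # bleu_score expects, for each sentence, a list of references, each a list of tokens
    target = [[s[1].split(" ")] for s in pairs]

    pred = []
    for sentence in source:
        # beam_width=1 keeps a single hypothesis; [0][0] is the text of the first returned one
        translation = beam_search(
            mod,
            sentence,
            config['src_vocab'],
            config['tgt_vocab'],
            config['src_tokenizer'],
            config['device'],
            beam_width=1,
            max_target=100,
            max_sentence_length=config['max_sequence_length'],
        )[0][0]
        pred.append(translation.split(" "))
    return bleu_score(pred, target)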
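
For reference, BLEU combines clipped n-gram precisions with a brevity penalty. The sketch below is a simplified version for one candidate and one reference, written for illustration only. The helper name `sentence_bleu` and the toy sentences are hypothetical, and this is not torchtext's implementation, which scores a whole corpus and supports several references per sentence.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one candidate and one reference (lists of tokens)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision gives a score of 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "J'ai appelé mon ami pour dire bonjour .".split(" ")
short_candidate = "J'ai appelé mon ami .".split(" ")

perfect_score = sentence_bleu(reference, reference)
short_score = sentence_bleu(short_candidate, reference)
print(perfect_score, round(short_score, 3))
```

The short candidate loses points twice: some of its higher-order n-grams are missing from the reference, and the brevity penalty applies because it is shorter than the reference.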


In [None]:
# Build the model, the dataloaders, optimizer and the loss function
# Log every hyperparameter and argument into the config dictionary

# Model type "RNN"
config = {
    # General parameters
    'epochs': 5,
    'batch_size': 128,
    'lr': 1e-3,
    'betas': (0.9, 0.99),
    'clip': 5,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',

    # Model parameters
    'n_tokens_src': len(train_dataset.en_vocab),
    'n_tokens_tgt': len(train_dataset.fr_vocab),
    'n_heads': 4,
    'dim_embedding': 196,
    'dim_hidden': 256,
    'n_layers': 3,
    'dropout': 0.1,
    'model_type': 'RNN',

    # Others
    'max_sequence_length': MAX_SEQ_LEN,
    'min_token_freq': MIN_TOK_FREQ,
    'src_vocab': train_dataset.en_vocab,
    'tgt_vocab': train_dataset.fr_vocab,
    'src_tokenizer': en_tokenizer,
    'tgt_tokenizer': fr_tokenizer,
    'src_pad_idx': train_dataset.en_vocab['<pad>'],
    'tgt_pad_idx': train_dataset.fr_vocab['<pad>'],
    'seed': 0,
    'log_every': 50,  # Number of batches between each wandb logs
    'create_bar_chart': True
}

torch.manual_seed(config['seed'])

config['train_loader'] = DataLoader(
    train_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)

config['val_loader'] = DataLoader(
    val_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)
# model = TranslationRNN(
#     config['n_tokens_src'],
#     config['n_tokens_tgt'],
#     config['dim_embedding'],
#     config['dim_hidden'],
#     config['n_layers'],
#     config['dropout'],
#     config['src_pad_idx'],
#     config['tgt_pad_idx'],
#     config['model_type'],
# )
# Uncommented for testing
model = TranslationTransformer(
    config['n_tokens_src'],
    config['n_tokens_tgt'],
    config['n_heads'],
    config['dim_embedding'],
    config['dim_hidden'],
    config['n_layers'],
    config['dropout'],
    config['src_pad_idx'],
    config['tgt_pad_idx'],
)
#Replace the model to train here
config['optimizer'] = optim.Adam(
    model.parameters(),
    lr=config['lr'],
    betas=config['betas'],
)

weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
config['loss'] = nn.CrossEntropyLoss(
    weight=weight_classes,
    ignore_index=config['tgt_pad_idx'],  # We do not have to learn those
)
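
To make the effect of `weight` and `ignore_index` in the loss above concrete, here is a pure-Python sketch of a weighted mean cross-entropy: padded targets are skipped, and the sum of weighted losses is divided by the sum of the weights of the remaining targets. This mirrors the documented behaviour of `nn.CrossEntropyLoss` with the default mean reduction, but the function `weighted_cross_entropy` and the toy vocabulary indices are illustrative assumptions.

```python
import math

def weighted_cross_entropy(logits, targets, class_weights, ignore_index):
    """Mean cross-entropy with per-class weights, skipping `ignore_index` targets."""
    weighted_sum, weight_total = 0.0, 0.0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing
        log_norm = math.log(sum(math.exp(v) for v in row))
        nll = log_norm - row[target]  # -log softmax(row)[target]
        weighted_sum += class_weights[target] * nll
        weight_total += class_weights[target]
    return weighted_sum / weight_total

# Toy vocabulary: 0 = <pad>, 1 = <unk>, 2 and 3 = regular tokens (hypothetical indices).
class_weights = [1.0, 0.1, 1.0, 1.0]
logits = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
targets = [1, 2, 0]  # the last position is padding

loss = weighted_cross_entropy(logits, targets, class_weights, ignore_index=0)
print(round(loss, 4))
```

With a weight of 0.1, a prediction on `<unk>` moves the loss ten times less than a prediction on a regular token, which discourages the model from optimizing for the unknown token.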


#### **Analysis of the number of heads**

In [None]:
nhead_array = [2,4,7,14,28]


In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

for head in nhead_array:
    config["n_heads"] = head

    model = TranslationTransformer(
    config['n_tokens_src'],
    config['n_tokens_tgt'],
    config['n_heads'],
    config['dim_embedding'],
    config['dim_hidden'],
    config['n_layers'],
    config['dropout'],
    config['src_pad_idx'],
    config['tgt_pad_idx']
    )
    config['optimizer'] = optim.Adam(
    model.parameters(),
    lr=config['lr'],
    betas=config['betas'],
    )
    weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
    weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
    config['loss'] = nn.CrossEntropyLoss(weight=weight_classes,ignore_index=config['tgt_pad_idx'] )
    with wandb.init(
            config=config,
            project='INF8225 - TP3',  # Title of your project
            group='Transformer - Number of head experiments',  # In what group of runs do you want this run to be in?
            save_code=True,
            mode = 'online',
            name = "Transformer with "+str(head)+ " heads"
        ):
        train_model(model, config)


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.34     top-1: 0.55    top-5: 0.75    top-10: 0.80
Eval -    loss: 2.18     top-1: 0.58    top-5: 0.77    top-10: 0.81
What season do you like the best?
Qu'est-ce que tu aimes le meilleur ?

Epoch 2
Train -   loss: 2.02     top-1: 0.60    top-5: 0.79    top-10: 0.84
Eval -    loss: 1.76     top-1: 0.64    top-5: 0.83    top-10: 0.87
I suppose everyone thinks I'm being a little too picky.
Je suppose que tout le monde pense.

Epoch 3
Train -   loss: 1.67     top-1: 0.62    top-5: 0.84    top-10: 0.88
Eval -    loss: 1.56     top-1: 0.67    top-5: 0.85    top-10: 0.89
Did you do your work?
Avez-vous fait du travail ?

Epoch 4
Train -   loss: 1.64     top-1: 0.64    top-5: 0.85    top-10: 0.89
Eval -    loss: 1.46     top-1: 0.68    top-5: 0.86    top-10: 0.90
You have a great alibi.
Tu as un grand alibi.

Epoch 5
Train -   loss: 1.45     top-1: 0.69    top-5: 0.87    top-10: 0.91
Eval -    loss: 1.37     top-1: 0.70    

Run history:
Train - loss,█▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇█████████████████
Train - top-10,▁▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇█████████████████████
Train - top-5,▁▃▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇████████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▅▆▇█
Validation - top-5,▁▅▆▇█

Run summary:
Train - loss,1.44645
Train - top-1,0.68537
Train - top-10,0.91106
Train - top-5,0.87396
Validation - loss,1.36722
Validation - top-1,0.69596
Validation - top-10,0.90891
Validation - top-5,0.87555


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.30     top-1: 0.57    top-5: 0.76    top-10: 0.80
Eval -    loss: 2.15     top-1: 0.58    top-5: 0.77    top-10: 0.82
She is constantly writing letters.
Elle est fatigué.

Epoch 2
Train -   loss: 1.97     top-1: 0.60    top-5: 0.81    top-10: 0.85
Eval -    loss: 1.72     top-1: 0.64    top-5: 0.83    top-10: 0.87
I felt relieved when my plane landed safely.
Je me suis sentie quand mon avion.

Epoch 3
Train -   loss: 1.59     top-1: 0.66    top-5: 0.85    top-10: 0.88
Eval -    loss: 1.53     top-1: 0.67    top-5: 0.86    top-10: 0.89
I told myself to stay positive.
J'ai dit de rester seul.

Epoch 4
Train -   loss: 1.70     top-1: 0.65    top-5: 0.83    top-10: 0.87
Eval -    loss: 1.42     top-1: 0.69    top-5: 0.87    top-10: 0.90
I told Tom that he shouldn't go out after dark.
Je lui ai dit qu'il ne devrait pas sortir.

Epoch 5
Train -   loss: 1.29     top-1: 0.72    top-5: 0.89    top-10: 0.92
Eval -    loss: 1.

Run history:
Train - loss,█▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█████████████████
Train - top-10,▁▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█████████████████
Train - top-5,▁▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▅▇▇█
Validation - top-5,▁▅▇▇█

Run summary:
Train - loss,1.28873
Train - top-1,0.7229
Train - top-10,0.92235
Train - top-5,0.88953
Validation - loss,1.35465
Validation - top-1,0.69873
Validation - top-10,0.90933
Validation - top-5,0.87757


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.30     top-1: 0.57    top-5: 0.77    top-10: 0.82
Eval -    loss: 2.15     top-1: 0.58    top-5: 0.77    top-10: 0.82
He lost his job.
Il a perdu son travail.

Epoch 2
Train -   loss: 1.97     top-1: 0.62    top-5: 0.81    top-10: 0.85
Eval -    loss: 1.72     top-1: 0.64    top-5: 0.83    top-10: 0.87
We were both afraid to talk.
Nous avons peur de parler tous les deux.

Epoch 3
Train -   loss: 1.56     top-1: 0.66    top-5: 0.86    top-10: 0.90
Eval -    loss: 1.52     top-1: 0.67    top-5: 0.86    top-10: 0.89
Everyone always asks me that.
Tout le monde me fait toujours ainsi.

Epoch 4
Train -   loss: 1.54     top-1: 0.68    top-5: 0.86    top-10: 0.90
Eval -    loss: 1.41     top-1: 0.69    top-5: 0.87    top-10: 0.90
Who hates you?
Qui vous déteste ?

Epoch 5
Train -   loss: 1.40     top-1: 0.69    top-5: 0.88    top-10: 0.91
Eval -    loss: 1.35     top-1: 0.70    top-5: 0.88    top-10: 0.91
Tom doesn't have a

Run history:
Train - loss,█▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇████████████████
Train - top-10,▁▄▅▅▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█████████████████████
Train - top-5,▁▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇████████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▅▇▇█
Validation - top-5,▁▅▇▇█

Run summary:
Train - loss,1.40212
Train - top-1,0.68969
Train - top-10,0.91432
Train - top-5,0.88082
Validation - loss,1.34529
Validation - top-1,0.70068
Validation - top-10,0.90958
Validation - top-5,0.87811


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.33     top-1: 0.56    top-5: 0.76    top-10: 0.80
Eval -    loss: 2.13     top-1: 0.58    top-5: 0.77    top-10: 0.82
We must leave right away.
Nous devons continuer.

Epoch 2
Train -   loss: 1.78     top-1: 0.63    top-5: 0.83    top-10: 0.87
Eval -    loss: 1.72     top-1: 0.64    top-5: 0.83    top-10: 0.87
They look bored.
Ils ont l'air.

Epoch 3
Train -   loss: 1.53     top-1: 0.68    top-5: 0.86    top-10: 0.90
Eval -    loss: 1.52     top-1: 0.67    top-5: 0.86    top-10: 0.89
All of the students have to wear the same uniform.
Tous les élèves doivent porter les étudiants.

Epoch 4
Train -   loss: 1.59     top-1: 0.66    top-5: 0.86    top-10: 0.90
Eval -    loss: 1.40     top-1: 0.69    top-5: 0.87    top-10: 0.90
Tom knows that's true.
Tom connaît ça vrai.

Epoch 5
Train -   loss: 1.42     top-1: 0.69    top-5: 0.87    top-10: 0.91
Eval -    loss: 1.33     top-1: 0.70    top-5: 0.88    top-10: 0.91
Don't cro

Run history:
Train - loss,█▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇████████████████
Train - top-10,▁▄▅▅▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█████████████████████
Train - top-5,▁▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇████████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▄▆▇█
Validation - top-10,▁▅▆▇█
Validation - top-5,▁▅▆▇█

Run summary:
Train - loss,1.41557
Train - top-1,0.68884
Train - top-10,0.91127
Train - top-5,0.86959
Validation - loss,1.32616
Validation - top-1,0.70283
Validation - top-10,0.91185
Validation - top-5,0.88032


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.24     top-1: 0.56    top-5: 0.75    top-10: 0.81
Eval -    loss: 2.14     top-1: 0.58    top-5: 0.77    top-10: 0.82
He always borrows money from me.
Il me prend toujours d'argent.

Epoch 2
Train -   loss: 1.87     top-1: 0.62    top-5: 0.83    top-10: 0.86
Eval -    loss: 1.70     top-1: 0.64    top-5: 0.83    top-10: 0.87
Show me another tie, please.
Montre-moi une autre cravate, s'il vous plaît.

Epoch 3
Train -   loss: 1.68     top-1: 0.65    top-5: 0.85    top-10: 0.88
Eval -    loss: 1.53     top-1: 0.67    top-5: 0.85    top-10: 0.89
I remember you.
Je me rappelle vous.

Epoch 4
Train -   loss: 1.68     top-1: 0.63    top-5: 0.84    top-10: 0.88
Eval -    loss: 1.41     top-1: 0.69    top-5: 0.87    top-10: 0.90
What did you hope to find?
Comment as-tu trouvé ?

Epoch 5
Train -   loss: 1.34     top-1: 0.69    top-5: 0.88    top-10: 0.92
Eval -    loss: 1.34     top-1: 0.70    top-5: 0.88    top-10: 0.91
Wher

Run history:
Train - loss,█▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
Train - top-10,▁▄▅▅▆▆▆▇▇▇▇▇▇▇▇▇█▇██▇███████████████████
Train - top-5,▁▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇█▇███████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▅▆▇█
Validation - top-5,▁▅▆▇█

Run summary:
Train - loss,1.33799
Train - top-1,0.68647
Train - top-10,0.92148
Train - top-5,0.88163
Validation - loss,1.34482
Validation - top-1,0.69937
Validation - top-10,0.91015
Validation - top-5,0.87903


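The number of heads has little effect on performance: across the five runs (2, 4, 7, 14 and 28 heads), the final validation top-1 accuracy stays between 0.696 and 0.703 and the validation loss between 1.33 and 1.37.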
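
One possible explanation for the head-count results above, assuming the notebook's attention follows the usual multi-head design where the embedding is split evenly across heads: changing `n_heads` changes the size of each head but not the total width of the attention layer. The sketch below checks that each head count tested divides `dim_embedding = 196` and prints the resulting per-head dimension.

```python
dim_embedding = 196  # value used in the config above
nhead_array = [2, 4, 7, 14, 28]

for n_heads in nhead_array:
    # Standard multi-head attention requires an even split across heads.
    assert dim_embedding % n_heads == 0, f"{n_heads} does not divide {dim_embedding}"
    head_dim = dim_embedding // n_heads
    print(f"n_heads={n_heads:2d}  head_dim={head_dim:3d}  total={n_heads * head_dim}")

# All head counts that would be valid for this embedding size.
valid_heads = [h for h in range(1, dim_embedding + 1) if dim_embedding % h == 0]
print(valid_heads)
```

Under that assumption, the total stays at 196 in every configuration, which is consistent with the nearly identical validation metrics.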

### **Varying the embedding size**



In [None]:
config = {
    # General parameters
    'epochs': 5,
    'batch_size': 128,
    'lr': 1e-3,
    'betas': (0.9, 0.99),
    'clip': 5,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',

    # Model parameters
    'n_tokens_src': len(train_dataset.en_vocab),
    'n_tokens_tgt': len(train_dataset.fr_vocab),
    'n_heads': 4,
    'dim_embedding': 196,
    'dim_hidden': 256,
    'n_layers': 3,
    'dropout': 0.1,
    'model_type': 'RNN',

    # Others
    'max_sequence_length': MAX_SEQ_LEN,
    'min_token_freq': MIN_TOK_FREQ,
    'src_vocab': train_dataset.en_vocab,
    'tgt_vocab': train_dataset.fr_vocab,
    'src_tokenizer': en_tokenizer,
    'tgt_tokenizer': fr_tokenizer,
    'src_pad_idx': train_dataset.en_vocab['<pad>'],
    'tgt_pad_idx': train_dataset.fr_vocab['<pad>'],
    'seed': 0,
    'log_every': 50,  # Number of batches between each wandb logs
    'create_bar_chart': False
}




NameError: ignored

#### RNN & GRU

In [None]:
for model_type in ["RNN","GRU"]:
    for emb_size in [100,160, 180,196]:
            config['dim_embedding'] = emb_size
            config['model_type'] = model_type
            model = TranslationRNN(
                config['n_tokens_src'],
                config['n_tokens_tgt'],
                config['dim_embedding'],
                config['dim_hidden'],
                config['n_layers'],
                config['dropout'],
                config['src_pad_idx'],
                config['tgt_pad_idx'],
                config['model_type'],
            )
            model.train()


            #Replace the model to train here
            config['optimizer'] = optim.Adam(
                model.parameters(),
                lr=config['lr'],
                betas=config['betas'],
            )

            weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
            weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
            config['loss'] = nn.CrossEntropyLoss(
                weight=weight_classes,
                ignore_index=config['tgt_pad_idx'],  # We do not have to learn those
            )

            device = 'cuda' if torch.cuda.is_available() else 'cpu'
            with wandb.init(
                    config=config,
                    project='INF8225 - TP3',  # Title of your project
                    group='Embedding size',  # In what group of runs do you want this run to be in?
                    save_code=True,
                    mode = 'offline',
                    name = model_type + " emb_size:" + str(emb_size)  # emb_size is an int, convert before concatenating
                ):
                train_model(model, config)
                wandb.log({"bleu_score":evaluate_blue_score(model)})


Starting training for 5 epochs, using cuda.

Epoch 1




RuntimeError: ignored

#### Transformer

In [None]:
for emb_size in [100,160,180,196]:
    config["dim_embedding"] = emb_size
    config['n_heads'] = 4
    model = TranslationTransformer(
    config['n_tokens_src'],
    config['n_tokens_tgt'],
    config['n_heads'],
    config['dim_embedding'],
    config['dim_hidden'],
    config['n_layers'],
    config['dropout'],
    config['src_pad_idx'],
    config['tgt_pad_idx']
    )
    config['optimizer'] = optim.Adam(
    model.parameters(),
    lr=config['lr'],
    betas=config['betas'],
    )
    weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
    weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
    config['loss'] = nn.CrossEntropyLoss(weight=weight_classes,ignore_index=config['tgt_pad_idx'] )
    with wandb.init(
            config=config,
            project='INF8225 - TP3',  # Title of your project
            group='Transformer - emb_size',  # In what group of runs do you want this run to be in?
            save_code=True,
            mode = 'online',
            name = "Transformer with "+ str(emb_size)
        ):
        train_model(model, config)
        wandb.log({"bleu_score":evaluate_blue_score(model)})

Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.83     top-1: 0.49    top-5: 0.69    top-10: 0.74
Eval -    loss: 2.63     top-1: 0.51    top-5: 0.70    top-10: 0.75
I love this view.
J'adore ceci.

Epoch 2
Train -   loss: 2.28     top-1: 0.57    top-5: 0.76    top-10: 0.81
Eval -    loss: 2.10     top-1: 0.59    top-5: 0.78    top-10: 0.82
Don't talk to me about work.
Ne me parle pas.

Epoch 3
Train -   loss: 1.81     top-1: 0.62    top-5: 0.83    top-10: 0.87
Eval -    loss: 1.84     top-1: 0.62    top-5: 0.81    top-10: 0.86
Tom could hardly speak French at all when I first met him.
Je n'ai pas pu parler quand Tom puisse le français.

Epoch 4
Train -   loss: 2.18     top-1: 0.56    top-5: 0.78    top-10: 0.82
Eval -    loss: 1.69     top-1: 0.64    top-5: 0.83    top-10: 0.87
I don't want anything more.
Je ne veux rien.

Epoch 5
Train -   loss: 1.79     top-1: 0.63    top-5: 0.83    top-10: 0.88
Eval -    loss: 1.60     top-1: 0.66    top-5: 0.85    top-10: 0.

Run history:
Train - loss,█▅▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇█▇▇█████████████
Train - top-10,▁▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████████████████
Train - top-5,▁▃▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇█████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▅▇▇█
Validation - top-5,▁▅▆▇█
bleu_score,▁

Run summary:
Train - loss,1.7946
Train - top-1,0.62901
Train - top-10,0.87626
Train - top-5,0.82608
Validation - loss,1.59504
Validation - top-1,0.66133
Validation - top-10,0.88389
Validation - top-5,0.84666
bleu_score,0.16249


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.45     top-1: 0.54    top-5: 0.74    top-10: 0.79
Eval -    loss: 2.29     top-1: 0.56    top-5: 0.75    top-10: 0.80
Tom misled me.
Tom m'a menti.

Epoch 2
Train -   loss: 1.93     top-1: 0.62    top-5: 0.80    top-10: 0.85
Eval -    loss: 1.81     top-1: 0.63    top-5: 0.82    top-10: 0.86
Where did you buy it?
Où l'ai-tu acheté ?

Epoch 3
Train -   loss: 1.76     top-1: 0.65    top-5: 0.84    top-10: 0.87
Eval -    loss: 1.61     top-1: 0.66    top-5: 0.84    top-10: 0.88
The place was almost empty.
La place était presque vide.

Epoch 4
Train -   loss: 1.59     top-1: 0.66    top-5: 0.86    top-10: 0.90
Eval -    loss: 1.49     top-1: 0.68    top-5: 0.86    top-10: 0.90
Would you lend me your pen?
Voudrais-tu me prêter votre stylo   ?

Epoch 5
Train -   loss: 1.52     top-1: 0.67    top-5: 0.85    top-10: 0.90
Eval -    loss: 1.41     top-1: 0.69    top-5: 0.87    top-10: 0.90
Do you have relatives here?
Avez-vou

Run history:
Train - loss,█▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇████████████████
Train - top-10,▁▄▅▅▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█████████████████████
Train - top-5,▁▃▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█▇██████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▅▇▇█
Validation - top-5,▁▅▆▇█
bleu_score,▁

Run summary:
Train - loss,1.51912
Train - top-1,0.67
Train - top-10,0.89847
Train - top-5,0.85018
Validation - loss,1.41499
Validation - top-1,0.68847
Validation - top-10,0.9037
Validation - top-5,0.87019
bleu_score,0.27444


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.33     top-1: 0.56    top-5: 0.75    top-10: 0.80
Eval -    loss: 2.19     top-1: 0.58    top-5: 0.77    top-10: 0.81
She asked me about my mother.
Elle m'a demandé de mon mère.

Epoch 2
Train -   loss: 1.94     top-1: 0.61    top-5: 0.80    top-10: 0.86
Eval -    loss: 1.75     top-1: 0.64    top-5: 0.83    top-10: 0.87
We should leave immediately.
Nous devrions partir immédiatement.

Epoch 3
Train -   loss: 1.61     top-1: 0.66    top-5: 0.84    top-10: 0.88
Eval -    loss: 1.56     top-1: 0.67    top-5: 0.85    top-10: 0.89
I've promised Tom that I would help.
J'espère que j'aiderais Tom.

Epoch 4
Train -   loss: 1.65     top-1: 0.64    top-5: 0.84    top-10: 0.89
Eval -    loss: 1.44     top-1: 0.68    top-5: 0.87    top-10: 0.90
If you want to come, you can.
Si tu veux venir, tu peux venir.

Epoch 5
Train -   loss: 1.46     top-1: 0.66    top-5: 0.87    top-10: 0.90
Eval -    loss: 1.37     top-1: 0.70    top-5

Run history:
Train - loss,█▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
Train - top-10,▁▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇█████████████████████
Train - top-5,▁▃▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇███████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▅▇▇█
Validation - top-5,▁▅▆▇█
bleu_score,▁

Run summary:
Train - loss,1.4591
Train - top-1,0.65561
Train - top-10,0.89885
Train - top-5,0.8654
Validation - loss,1.367
Validation - top-1,0.69644
Validation - top-10,0.9079
Validation - top-5,0.87552
bleu_score,0.173


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.53     top-1: 0.54    top-5: 0.73    top-10: 0.79
Eval -    loss: 2.17     top-1: 0.58    top-5: 0.77    top-10: 0.82
Tom convinced Mary.
Tom a mis Marie.

Epoch 2
Train -   loss: 1.75     top-1: 0.64    top-5: 0.84    top-10: 0.88
Eval -    loss: 1.73     top-1: 0.64    top-5: 0.83    top-10: 0.87
Do you like white chocolate?
Aimez-vous en blanc ?

Epoch 3
Train -   loss: 1.70     top-1: 0.64    top-5: 0.84    top-10: 0.88
Eval -    loss: 1.54     top-1: 0.67    top-5: 0.85    top-10: 0.89
I'd like to talk with you.
J'aimerais vous parler.

Epoch 4
Train -   loss: 1.52     top-1: 0.66    top-5: 0.86    top-10: 0.90
Eval -    loss: 1.42     top-1: 0.69    top-5: 0.87    top-10: 0.90
I've decided to go by train.
J'ai décidé d'y aller en train.

Epoch 5
Train -   loss: 1.38     top-1: 0.69    top-5: 0.88    top-10: 0.91
Eval -    loss: 1.35     top-1: 0.70    top-5: 0.88    top-10: 0.91
When will we leave?
Quand parto

Run history:
Train - loss,█▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▅▅▆▆▆▇▆▇▇▇▇▇▇▇▇▇▇▇▇█▇████████████████
Train - top-10,▁▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇█████████████████████
Train - top-5,▁▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇████████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▅▆▇█
Validation - top-5,▁▅▆▇█
bleu_score,▁

Run summary:
Train - loss,1.37525
Train - top-1,0.68914
Train - top-10,0.91466
Train - top-5,0.88406
Validation - loss,1.3472
Validation - top-1,0.6992
Validation - top-10,0.9101
Validation - top-5,0.8781
bleu_score,0.20832


### Dropout

#### RNN & GRU

In [None]:
for model_type in ["RNN"]:
    for dropout in [0.2,0.3,0.4]:
            config['dim_embedding'] = 196
            config['dropout'] = dropout
            config['model_type'] = model_type
            model = TranslationRNN(
                config['n_tokens_src'],
                config['n_tokens_tgt'],
                config['dim_embedding'],
                config['dim_hidden'],
                config['n_layers'],
                config['dropout'],
                config['src_pad_idx'],
                config['tgt_pad_idx'],
                config['model_type'],
            )
            model.train()


            #Replace the model to train here
            config['optimizer'] = optim.Adam(
                model.parameters(),
                lr=config['lr'],
                betas=config['betas'],
            )

            weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
            weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
            config['loss'] = nn.CrossEntropyLoss(
                weight=weight_classes,
                ignore_index=config['tgt_pad_idx'],  # We do not have to learn those
            )

            device = 'cuda' if torch.cuda.is_available() else 'cpu'
            with wandb.init(
                    config=config,
                    project='INF8225 - TP3',  # Title of your project
                    group='Dropout RNN',  # In what group of runs do you want this run to be in?
                    save_code=True,
                    mode = 'online',
                    name = model_type + " dropout:" + str(dropout)
                ):
                train_model(model, config)
                wandb.log({"bleu_score":evaluate_blue_score(model)})


wandb: Currently logged in as: wittythemighty (use `wandb login --relogin` to force relogin)


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.85     top-1: 0.47    top-5: 0.66    top-10: 0.71
Eval -    loss: 2.80     top-1: 0.47    top-5: 0.65    top-10: 0.71
Here is my bicycle.
Voici un moment.

Epoch 2
Train -   loss: 2.63     top-1: 0.49    top-5: 0.67    top-10: 0.74
Eval -    loss: 2.56     top-1: 0.50    top-5: 0.68    top-10: 0.75
I was detained.
J'étais en sécurité.

Epoch 3
Train -   loss: 2.57     top-1: 0.51    top-5: 0.69    top-10: 0.75
Eval -    loss: 2.44     top-1: 0.51    top-5: 0.70    top-10: 0.76
Your lives will be spared if you surrender.
Vos idées nous rendons tous.

Epoch 4
Train -   loss: 2.42     top-1: 0.52    top-5: 0.70    top-10: 0.78
Eval -    loss: 2.37     top-1: 0.52    top-5: 0.71    top-10: 0.77
Light travels faster than sound.
Peu de personnes ont été tués.

Epoch 5
Train -   loss: 2.58     top-1: 0.50    top-5: 0.68    top-10: 0.74
Eval -    loss: 2.33     top-1: 0.52    top-5: 0.71    top-10: 0.78
It's recommended tha

Run history:
Train - loss,█▅▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▄▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇█▇▇███████████████
Train - top-10,▁▄▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇████████████████████
Train - top-5,▁▄▅▆▆▆▆▇▆▇▇▇▇▇▇▇▇▇▇▇▇███████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▄▆▇█
Validation - top-5,▁▄▆▇█
bleu_score,▁

Run summary:
Train - loss,2.57943
Train - top-1,0.49538
Train - top-10,0.73947
Train - top-5,0.6843
Validation - loss,2.33455
Validation - top-1,0.52329
Validation - top-10,0.77807
Validation - top-5,0.71437
bleu_score,0.08652


Starting training for 5 epochs, using cuda.

Epoch 1



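Run history:

| Metric | History |
|---|---|
| Train - loss | █▃▂▂▂▁▁▁▁ |
| Train - top-1 | ▁▅▆▇▇████ |
| Train - top-10 | ▁▅▆▇▇████ |
| Train - top-5 | ▁▅▆▇▇████ |

Run summary:

| Metric | Final value |
|---|---|
| Train - loss | 3.67701 |
| Train - top-1 | 0.38734 |
| Train - top-10 | 0.61926 |
| Train - top-5 | 0.54986 |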


KeyboardInterrupt: ignored
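
The sweeps in this section train one model per dropout value. The following is a minimal sketch of one way to derive a separate config for each run, so that the swept value is the one the model and the run name actually see. The helper `sweep_configs`, the `base` dictionary and the `model_type` value are hypothetical illustrations, not part of the notebook.

```python
def sweep_configs(base_config, key, values):
    """Yield one shallow copy of base_config per value, with config[key] overridden."""
    for value in values:
        run_config = dict(base_config)  # shallow copy: the base config is left untouched
        run_config[key] = value
        yield run_config


# Hypothetical base config; the notebook's real config holds many more entries.
base = {"dropout": 0.1, "dim_hidden": 256, "n_layers": 2}
model_type = "GRU"  # hypothetical value, used only to build the run names

runs = list(sweep_configs(base, "dropout", [0.1, 0.2, 0.3, 0.4]))
run_names = [model_type + " dropout:" + str(cfg["dropout"]) for cfg in runs]

for name, cfg in zip(run_names, runs):
    print(name, cfg)
```

With this pattern the model for each run would be built from `run_config['dropout']`, so the value in the run name and the value used by the model cannot drift apart.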

#### Transformer
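
The training cell in this section builds its loss with `nn.CrossEntropyLoss(weight=..., ignore_index=...)`, giving the `<unk>` class a weight of 0.1 and skipping padding tokens. The sketch below reproduces that weighted mean in plain Python to show the effect of the two arguments. It assumes the `'mean'` reduction documented for `torch.nn.CrossEntropyLoss`; the function `weighted_cross_entropy` and the tiny four-token vocabulary are hypothetical.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights, ignore_index):
    """Weighted mean cross-entropy over tokens whose target is not ignore_index.

    Computes sum(w[y] * nll) / sum(w[y]), which is the 'mean' reduction
    documented for torch.nn.CrossEntropyLoss when class weights are given.
    """
    total_loss, total_weight = 0.0, 0.0
    for token_logits, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding tokens contribute nothing to the loss
        # log-sum-exp, shifted by the max logit for numerical stability
        shift = max(token_logits)
        log_norm = shift + math.log(sum(math.exp(x - shift) for x in token_logits))
        nll = log_norm - token_logits[target]
        total_loss += class_weights[target] * nll
        total_weight += class_weights[target]
    return total_loss / total_weight


# Hypothetical vocabulary: 0 = <pad>, 1 = <unk>, 2 and 3 = ordinary words.
PAD, UNK = 0, 1
logits = [
    [0.0, 0.0, 2.0, 0.0],  # fairly confident, correct prediction of word 2
    [0.0, 0.0, 0.0, 0.0],  # uniform prediction where the target is <unk>
    [5.0, 0.0, 0.0, 0.0],  # padding position, ignored
]
targets = [2, UNK, PAD]

uniform = weighted_cross_entropy(logits, targets, [1.0, 1.0, 1.0, 1.0], PAD)
downweighted = weighted_cross_entropy(logits, targets, [1.0, 0.1, 1.0, 1.0], PAD)
print(f"all weights 1.0: {uniform:.4f}")
print(f"<unk> weight 0.1: {downweighted:.4f}")
```

Down-weighting `<unk>` makes the poorly predicted `<unk>` token count for less, so the second value is lower than the first. The sketch does not guard against a batch in which every target is padding, where the denominator would be zero.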

In [None]:

for dropout in [0.1, 0.2, 0.3, 0.4]:
    config['dim_embedding'] = 196
    config['n_heads'] = 4
    config['dropout'] = dropout  # Use the swept value; otherwise every run shares the same dropout

    model = TranslationTransformer(
        config['n_tokens_src'],
        config['n_tokens_tgt'],
        config['n_heads'],
        config['dim_embedding'],
        config['dim_hidden'],
        config['n_layers'],
        config['dropout'],
        config['src_pad_idx'],
        config['tgt_pad_idx'],
    )
    config['optimizer'] = optim.Adam(
        model.parameters(),
        lr=config['lr'],
        betas=config['betas'],
    )

    # Weighted loss: padding is ignored and <unk> counts for less
    weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
    weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
    config['loss'] = nn.CrossEntropyLoss(
        weight=weight_classes,
        ignore_index=config['tgt_pad_idx'],
    )

    with wandb.init(
            config=config,
            project='INF8225 - TP3',  # Title of your project
            group='Transformer - Dropout',
            save_code=True,
            mode='online',
            name="Transformer with " + str(dropout)
        ):
        train_model(model, config)
        wandb.log({"bleu_score": evaluate_blue_score(model)})


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.46     top-1: 0.56    top-5: 0.74    top-10: 0.80
Eval -    loss: 2.15     top-1: 0.58    top-5: 0.77    top-10: 0.82
Are you leaving?
Es-tu ?

Epoch 2
Train -   loss: 1.74     top-1: 0.65    top-5: 0.83    top-10: 0.86
Eval -    loss: 1.72     top-1: 0.64    top-5: 0.83    top-10: 0.87
It's sort of strange.
C'est bizarre.

Epoch 3
Train -   loss: 1.67     top-1: 0.64    top-5: 0.84    top-10: 0.88
Eval -    loss: 1.53     top-1: 0.67    top-5: 0.86    top-10: 0.89
Let's start with the easy stuff.
Commençons.

Epoch 4
Train -   loss: 1.57     top-1: 0.68    top-5: 0.87    top-10: 0.90
Eval -    loss: 1.44     top-1: 0.69    top-5: 0.87    top-10: 0.90
The king always wears a crown.
Le roi porte toujours un roi.

Epoch 5
Train -   loss: 1.42     top-1: 0.67    top-5: 0.88    top-10: 0.92
Eval -    loss: 1.34     top-1: 0.70    top-5: 0.88    top-10: 0.91
We can't prove anything.
Nous ne pouvons rien prouver.



Run history:

| Metric | History |
|---|---|
| Train - loss | █▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| Train - top-1 | ▁▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████ |
| Train - top-10 | ▁▄▅▅▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█████████████████████ |
| Train - top-5 | ▁▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇██▇██████████████████ |
| Validation - loss | █▄▃▂▁ |
| Validation - top-1 | ▁▅▆▇█ |
| Validation - top-10 | ▁▅▇▇█ |
| Validation - top-5 | ▁▅▆▇█ |
| bleu_score | ▁ |

Run summary:

| Metric | Final value |
|---|---|
| Train - loss | 1.41811 |
| Train - top-1 | 0.67042 |
| Train - top-10 | 0.91893 |
| Train - top-5 | 0.88184 |
| Validation - loss | 1.34466 |
| Validation - top-1 | 0.70089 |
| Validation - top-10 | 0.90972 |
| Validation - top-5 | 0.8784 |
| bleu_score | 0.24115 |


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.34     top-1: 0.53    top-5: 0.75    top-10: 0.81
Eval -    loss: 2.14     top-1: 0.58    top-5: 0.77    top-10: 0.82
He might possibly say something ambiguous again.
Il pourrait dire quelque chose à nouveau.

Epoch 2
Train -   loss: 1.79     top-1: 0.62    top-5: 0.82    top-10: 0.87
Eval -    loss: 1.72     top-1: 0.64    top-5: 0.83    top-10: 0.87
He said it as a joke.
Il a dit que ça.

Epoch 3
Train -   loss: 1.55     top-1: 0.68    top-5: 0.87    top-10: 0.90
Eval -    loss: 1.54     top-1: 0.67    top-5: 0.85    top-10: 0.89
She liked tennis and became a tennis coach.
Elle est devenu médecin et au tennis.

Epoch 4
Train -   loss: 1.40     top-1: 0.69    top-5: 0.88    top-10: 0.91
Eval -    loss: 1.42     top-1: 0.69    top-5: 0.87    top-10: 0.90
Come on. You're not that old.
Venez ça.

Epoch 5
Train -   loss: 1.44     top-1: 0.68    top-5: 0.87    top-10: 0.91
Eval -    loss: 1.35     top-1: 0.70    top-5: 

Run history:

| Metric | History |
|---|---|
| Train - loss | █▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| Train - top-1 | ▁▃▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇████████████████ |
| Train - top-10 | ▁▄▅▅▆▆▆▇▇▇▇▇▇▇▇▇▇███████████████████████ |
| Train - top-5 | ▁▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇████████████████████ |
| Validation - loss | █▄▃▂▁ |
| Validation - top-1 | ▁▄▆▇█ |
| Validation - top-10 | ▁▅▆▇█ |
| Validation - top-5 | ▁▅▆▇█ |
| bleu_score | ▁ |

Run summary:

| Metric | Final value |
|---|---|
| Train - loss | 1.44293 |
| Train - top-1 | 0.68447 |
| Train - top-10 | 0.90925 |
| Train - top-5 | 0.8737 |
| Validation - loss | 1.34546 |
| Validation - top-1 | 0.70121 |
| Validation - top-10 | 0.91094 |
| Validation - top-5 | 0.87874 |
| bleu_score | 0.25037 |


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.34     top-1: 0.55    top-5: 0.75    top-10: 0.80
Eval -    loss: 2.16     top-1: 0.58    top-5: 0.77    top-10: 0.82
It is more blessed to give than to receive.
C'est plus grande que personne.

Epoch 2
Train -   loss: 1.96     top-1: 0.61    top-5: 0.81    top-10: 0.86
Eval -    loss: 1.73     top-1: 0.64    top-5: 0.83    top-10: 0.87
Tom wants to be strong.
Tom veut être fort.

Epoch 3
Train -   loss: 1.73     top-1: 0.66    top-5: 0.84    top-10: 0.87
Eval -    loss: 1.55     top-1: 0.67    top-5: 0.85    top-10: 0.89
I had a good night's sleep.
J'ai eu une bonne nuit.

Epoch 4
Train -   loss: 1.48     top-1: 0.68    top-5: 0.87    top-10: 0.90
Eval -    loss: 1.43     top-1: 0.69    top-5: 0.87    top-10: 0.90
You need to work faster.
Vous devez travailler plus vite.

Epoch 5
Train -   loss: 1.39     top-1: 0.68    top-5: 0.88    top-10: 0.92
Eval -    loss: 1.36     top-1: 0.70    top-5: 0.88    top-10: 0.91
I

Run history:

| Metric | History |
|---|---|
| Train - loss | █▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| Train - top-1 | ▁▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇████████████████ |
| Train - top-10 | ▁▄▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇█████████████████████ |
| Train - top-5 | ▁▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████████ |
| Validation - loss | █▄▃▂▁ |
| Validation - top-1 | ▁▅▆▇█ |
| Validation - top-10 | ▁▅▆▇█ |
| Validation - top-5 | ▁▅▆▇█ |
| bleu_score | ▁ |

Run summary:

| Metric | Final value |
|---|---|
| Train - loss | 1.39318 |
| Train - top-1 | 0.68338 |
| Train - top-10 | 0.9169 |
| Train - top-5 | 0.87831 |
| Validation - loss | 1.35923 |
| Validation - top-1 | 0.69756 |
| Validation - top-10 | 0.90955 |
| Validation - top-5 | 0.87734 |
| bleu_score | 0.21948 |


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.48     top-1: 0.53    top-5: 0.74    top-10: 0.79
Eval -    loss: 2.17     top-1: 0.58    top-5: 0.77    top-10: 0.81
He doesn't know what he's talking about.
Il ne sait pas quoi dire.

Epoch 2
Train -   loss: 1.95     top-1: 0.63    top-5: 0.80    top-10: 0.85
Eval -    loss: 1.74     top-1: 0.64    top-5: 0.83    top-10: 0.87
What is your opinion on school uniforms?
Quel est votre opinion ?

Epoch 3
Train -   loss: 1.78     top-1: 0.64    top-5: 0.83    top-10: 0.87
Eval -    loss: 1.55     top-1: 0.67    top-5: 0.85    top-10: 0.89
Do you mind if I stay here?
Voyez-vous un inconvénient à ce que je reste ici ?

Epoch 4
Train -   loss: 1.50     top-1: 0.68    top-5: 0.87    top-10: 0.90
Eval -    loss: 1.43     top-1: 0.69    top-5: 0.87    top-10: 0.90
Why didn't you tell this to the police?
Pourquoi n'avez-vous pas dit cette police ?

Epoch 5
Train -   loss: 1.45     top-1: 0.67    top-5: 0.86    top-10: 0.90
Eva

Run history:

| Metric | History |
|---|---|
| Train - loss | █▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| Train - top-1 | ▁▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇████████████████ |
| Train - top-10 | ▁▄▅▅▆▆▆▇▇▇▇▇▇▇▇▇▇▇██████████████████████ |
| Train - top-5 | ▁▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇████████████████████ |
| Validation - loss | █▄▃▂▁ |
| Validation - top-1 | ▁▄▆▇█ |
| Validation - top-10 | ▁▅▇▇█ |
| Validation - top-5 | ▁▅▆▇█ |
| bleu_score | ▁ |

Run summary:

| Metric | Final value |
|---|---|
| Train - loss | 1.45272 |
| Train - top-1 | 0.6731 |
| Train - top-10 | 0.90333 |
| Train - top-5 | 0.86023 |
| Validation - loss | 1.36062 |
| Validation - top-1 | 0.69929 |
| Validation - top-10 | 0.90843 |
| Validation - top-5 | 0.877 |
| bleu_score | 0.24553 |
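
To compare the four Transformer runs at a glance, the BLEU scores from the run summaries above can be collected in one place. This is only a small sketch: the scores are copied from the summaries, and the labels assume the runs finished in loop order and carry the `name` passed to `wandb.init`.

```python
# BLEU scores copied from the four run summaries above, in run order.
# Assumption: labels follow the `name` argument built in the training loop.
bleu_by_run = {
    "Transformer with 0.1": 0.24115,
    "Transformer with 0.2": 0.25037,
    "Transformer with 0.3": 0.21948,
    "Transformer with 0.4": 0.24553,
}

best_run = max(bleu_by_run, key=bleu_by_run.get)
spread = max(bleu_by_run.values()) - min(bleu_by_run.values())

for name, score in bleu_by_run.items():
    print(f"{name:<22} BLEU = {score:.5f}")
print(f"best: {best_run}, spread: {spread:.5f}")
```

The gap between the best and worst run is about 0.03 BLEU, and the validation losses are nearly identical (1.34 to 1.36), so part of this difference may be run-to-run variation; repeating each setting with several seeds would be needed before ranking the dropout values. For reference, the one completed run in the 'Dropout RNN' group reported a BLEU score of 0.08652.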
