# Project INF8225 2025, Machine translation: Comparison performance between encoder decoder architecture and decoder only


For this project we are going to reuse the overall structure of TP3 where we implemented the encoder decoder transformer and add the implementation of our decoder only architecture


* Dataset: [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/)

<!---
M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proc. of EAMT, pp. 261-268, Trento, Italy. pdf, bib. [paper](https://aclanthology.org/2012.eamt-1.60.pdf). [website](https://wit3.fbk.eu/2016-01).
-->

* The code is inspired by this [pytorch tutorial](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html).

# Imports and data initializations

We first download and parse the dataset. From the parsed sentences
we can build the vocabularies and the torch datasets.
The end goal of this section is to have an iterator
that can yield the pairs of translated datasets, and
where each sentences is made of a sequence of tokens.

## Imports

In [4]:
# Note current default torch and cuda was 2.6.0+cu124
# We need to go back to an earlier version compatible with torchtext
# This will generate some dependency issues (incompatible packages), but for things that we will not need for this TP
# !pip install torch==2.1.2+cu121 -f https://download.pytorch.org/whl/torch/ --force-reinstall --no-cache-dir
# !pip install torchtext==0.16.2 --force-reinstall --no-cache-dir
# !pip install numpy==1.23.5 --force-reinstall --no-cache-dir
# !pip install scikit-learn==1.1.3 --force-reinstall --no-cache-dir
# !pip install scipy==1.9.3 --force-reinstall --no-cache-dir
# !pip install spacy einops wandb torchinfo
# !python -m spacy download en_core_web_sm
# !python -m spacy download fr_core_news_sm

!pip install torch==2.1.2+cu121 -f https://download.pytorch.org/whl/torch/
!pip install torchtext==0.16.2 numpy==1.23.5 scikit-learn==1.1.3 scipy==1.9.3 spacy einops wandb torchinfo --force-reinstall --no-cache-dir

!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

Looking in links: https://download.pytorch.org/whl/torch/
Collecting torch==2.1.2+cu121
  Downloading https://download.pytorch.org/whl/cu121/torch-2.1.2%2Bcu121-cp311-cp311-linux_x86_64.whl (2200.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.2/2.2 GB[0m [31m473.1 kB/s[0m eta [36m0:00:00[0m
Collecting triton==2.1.0 (from torch==2.1.2+cu121)
  Downloading triton-2.1.0-0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)
Downloading triton-2.1.0-0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m89.2/89.2 MB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: triton, torch
  Attempting uninstall: triton
    Found existing installation: triton 3.2.0
    Uninstalling triton-3.2.0:
      Successfully uninstalled triton-3.2.0
  Attempting uninstall: torch
    Found existing installation: torch 2.6.0+cu124
    Uninstalling torch-

Collecting torchtext==0.16.2
  Downloading torchtext-0.16.2-cp311-cp311-manylinux1_x86_64.whl.metadata (7.5 kB)
Collecting numpy==1.23.5
  Downloading numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting scikit-learn==1.1.3
  Downloading scikit_learn-1.1.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting scipy==1.9.3
  Downloading scipy-1.9.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (58 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m58.4/58.4 kB[0m [31m11.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting spacy
  Downloading spacy-3.8.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting einops
  Downloading einops-0.8.1-py3-none-any.whl.metadata (13 kB)
Collecting wandb
  Downloading wandb-0.19.10-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting torchinfo
  Downloading torchinfo-1.8.0

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m108.4 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting fr-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.8.0/fr_core_news_sm-3.8.0-py3-none-any.whl (16.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m16.3/16.3 MB[0m [31m24.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages

In [1]:
from itertools import takewhile
from collections import Counter, defaultdict
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
import torch
# cpal
print(torch.__version__)
import torch.nn as nn
import torch.optim as optim
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
import torchtext
# from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator, Vocab
from torchtext.datasets import IWSLT2016
import spacy
import einops
import wandb
from torchinfo import summary
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')



2.1.2+cu121


In [2]:
# Our dataset
!wget http://www.manythings.org/anki/fra-eng.zip
!unzip fra-eng.zip
df = pd.read_csv('fra.txt', sep='\t', names=['english', 'french', 'attribution'])
train = [
    (en, fr) for en, fr in zip(df['english'], df['french'])
]
train, valid = train_test_split(train, test_size=0.1, random_state=0)
print(len(train))
en_nlp = spacy.load('en_core_web_sm')
fr_nlp = spacy.load('fr_core_news_sm')
def en_tokenizer(text):
    return [tok.text.lower() for tok in en_nlp.tokenizer(text)]
def fr_tokenizer(text):
    return [tok.text.lower() for tok in fr_nlp.tokenizer(text)]
SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

--2025-04-27 02:25:16--  http://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7943074 (7.6M) [application/zip]
Saving to: ‘fra-eng.zip’


2025-04-27 02:25:17 (21.6 MB/s) - ‘fra-eng.zip’ saved [7943074/7943074]

Archive:  fra-eng.zip
  inflating: _about.txt              
  inflating: fra.txt                 
209462


The tokenizers are objects that are able to divide a python string into a list of tokens (words, punctuations, special tokens...) as a list of strings.

The special tokens are used for a particular reasons:
* *\<unk\>*: Replace an unknown word in the vocabulary by this default token
* *\<pad\>*: Virtual token used to as padding token so a batch of sentences can have a unique length
* *\<bos\>*: Token indicating the beggining of a sentence in the target sequence
* *\<eos\>*: Token indicating the end of a sentence in the target sequence

## Datasets

Functions and classes to build the vocabularies and the torch datasets.
The vocabulary is an object able to transform a string token into the id (an int) of that token in the vocabulary.

In [3]:
class TranslationDataset(Dataset):
    def __init__(
            self,
            dataset: list,
            en_vocab: Vocab,
            fr_vocab: Vocab,
            en_tokenizer,
            fr_tokenizer,
        ):
        super().__init__()

        self.dataset = dataset
        self.en_vocab = en_vocab
        self.fr_vocab = fr_vocab
        self.en_tokenizer = en_tokenizer
        self.fr_tokenizer = fr_tokenizer

    def __len__(self):
        """Return the number of examples in the dataset.
        """
        return len(self.dataset)

    def __getitem__(self, index: int) -> tuple:
        """Return a sample.

        Args
        ----
            index: Index of the sample.

        Output
        ------
            en_tokens: English tokens of the sample, as a LongTensor.
            fr_tokens: French tokens of the sample, as a LongTensor.
        """
        # Get the strings
        en_sentence, fr_sentence = self.dataset[index]

        # To list of words
        # We also add the beggining-of-sentence and end-of-sentence tokens
        en_tokens = ['<bos>'] + self.en_tokenizer(en_sentence) + ['<eos>']
        fr_tokens = ['<bos>'] + self.fr_tokenizer(fr_sentence) + ['<eos>']

        # To list of tokens
        en_tokens = self.en_vocab(en_tokens)  # list[int]
        fr_tokens = self.fr_vocab(fr_tokens)

        return torch.LongTensor(en_tokens), torch.LongTensor(fr_tokens)


def yield_tokens(dataset, tokenizer, lang):
    """Tokenize the whole dataset and yield the tokens.
    """
    assert lang in ('en', 'fr')
    sentence_idx = 0 if lang == 'en' else 1

    for sentences in dataset:
        sentence = sentences[sentence_idx]
        tokens = tokenizer(sentence)
        yield tokens


def build_vocab(dataset: list, en_tokenizer, fr_tokenizer, min_freq: int):
    """Return two vocabularies, one for each language.
    """
    en_vocab = build_vocab_from_iterator(
        yield_tokens(dataset, en_tokenizer, 'en'),
        min_freq=min_freq,
        specials=SPECIALS,
    )
    en_vocab.set_default_index(en_vocab['<unk>'])  # Default token for unknown words

    fr_vocab = build_vocab_from_iterator(
        yield_tokens(dataset, fr_tokenizer, 'fr'),
        min_freq=min_freq,
        specials=SPECIALS,
    )
    fr_vocab.set_default_index(fr_vocab['<unk>'])

    return en_vocab, fr_vocab


def preprocess(
        dataset: list,
        en_tokenizer,
        fr_tokenizer,
        max_words: int,
    ) -> list:
    """Preprocess the dataset.
    Remove samples where at least one of the sentences are too long.
    Those samples takes too much memory.
    Also remove the pending '\n' at the end of sentences.
    """
    filtered = []

    for en_s, fr_s in dataset:
        if len(en_tokenizer(en_s)) >= max_words or len(fr_tokenizer(fr_s)) >= max_words:
            continue

        en_s = en_s.replace('\n', '')
        fr_s = fr_s.replace('\n', '')

        filtered.append((en_s, fr_s))

    return filtered


def build_datasets(
        max_sequence_length: int,
        min_token_freq: int,
        en_tokenizer,
        fr_tokenizer,
        train: list,
        val: list,
    ) -> tuple:
    """Build the training, validation and testing datasets.
    It takes care of the vocabulary creation.

    Args
    ----
        - max_sequence_length: Maximum number of tokens in each sequences.
            Having big sequences increases dramatically the VRAM taken during training.
        - min_token_freq: Minimum number of occurences each token must have
            to be saved in the vocabulary. Reducing this number increases
            the vocabularies's size.
        - en_tokenizer: Tokenizer for the english sentences.
        - fr_tokenizer: Tokenizer for the french sentences.
        - train and val: List containing the pairs (english, french) sentences.


    Output
    ------
        - (train_dataset, val_dataset): Tuple of the two TranslationDataset objects.
    """
    datasets = [
        preprocess(samples, en_tokenizer, fr_tokenizer, max_sequence_length)
        for samples in [train, val]
    ]

    en_vocab, fr_vocab = build_vocab(datasets[0], en_tokenizer, fr_tokenizer, min_token_freq)

    datasets = [
        TranslationDataset(samples, en_vocab, fr_vocab, en_tokenizer, fr_tokenizer)
        for samples in datasets
    ]

    return datasets


In [4]:
def generate_batch(data_batch: list, src_pad_idx: int, tgt_pad_idx: int) -> tuple:
    """Add padding to the given batch so that all
    the samples are of the same size.

    Args
    ----
        data_batch: List of samples.
            Each sample is a tuple of LongTensors of varying size.
        src_pad_idx: Source padding index value.
        tgt_pad_idx: Target padding index value.

    Output
    ------
        en_batch: Batch of tokens for the padded english sentences.
            Shape of [batch_size, max_en_len].
        fr_batch: Batch of tokens for the padded french sentences.
            Shape of [batch_size, max_fr_len].
    """
    en_batch, fr_batch = [], []
    for en_tokens, fr_tokens in data_batch:
        en_batch.append(en_tokens)
        fr_batch.append(fr_tokens)

    en_batch = pad_sequence(en_batch, padding_value=src_pad_idx, batch_first=True)
    fr_batch = pad_sequence(fr_batch, padding_value=tgt_pad_idx, batch_first=True)
    return en_batch, fr_batch

# Models architecture

## Tranformer encoder-decoder
This is where you have to code the architectures.

In a machine translation task, the model takes as input the whole
source sentence along with the current known tokens of the target,
and predict the next token in the target sequence.
This means that the target tokens are predicted in an autoregressive
manner, starting from the first token (right after the *\<bos\>* token) and producing tokens one by one until the last *\<eos\>* token.

Formally, we define $s = [s_1, ..., s_{N_s}]$ as the source sequence made of $N_s$ tokens.
We also define $t^i = [t_1, ..., t_i]$ as the target sequence at the beginning of the step $i$.

The output of the model parameterized by $\theta$ is:

$$
T_{i+1} = p(t_{i+1} | s, t^i ; \theta )
$$

Where $T_{i+1}$ is the distribution of the next token $t_{i+1}$.

The loss is simply a *cross entropy loss* over the whole steps, where each class is a token of the vocabulary.

![RNN schema for machinea translation](https://www.simplilearn.com/ice9/free_resources_article_thumb/machine-translation-model-with-encoder-decoder-rnn.jpg)

Note that in this image the english sentence is provided in reverse.


## Transformer models
Here you have to code the Full Transformer and Decoder-Only Transformer architectures.
It is divided in three parts:
* Attention layers (done individually)
* Encoder and decoder layers (done individually)
* Full Transformer: gather the encoder and decoder layers (done individually)

The Transformer (or "Full Transformer") is presented in the paper: [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf). The [illustrated transformer](https://jalammar.github.io/illustrated-transformer/) blog can help you
understanding how the architecture works.
Once this is done, you can use [the annontated transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html) to have an idea of how to code this architecture.
We encourage you to use `torch.einsum` and the `einops` library as much as you can. It will make your code simpler.

---
**Implementation order**

To help you with the implementation, we advise you following this order:
* Implement `TranslationTransformer` and use `nn.Transformer` instead of `Transformer`
* Implement `Transformer` and use `nn.TransformerDecoder` and `nn.TransformerEnocder`
* Implement the `TransformerDecoder` and `TransformerEncoder` and use `nn.MultiHeadAttention`
* Implement `MultiHeadAttention`

Do not forget to add `batch_first=True` when necessary in the `nn` modules.

### Positional Encoding


In [5]:
import math
class PositionalEncoding(nn.Module):
    """
    This PE module comes from:
    Pytorch. (2021). LANGUAGE MODELING WITH NN.TRANSFORMER AND TORCHTEXT. https://pytorch.org/tutorials/beginner/transformer_tutorial.html
    """
    def __init__(self, d_model: int, dropout: float, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        position = torch.arange(max_len).unsqueeze(1).to(DEVICE)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)).to(DEVICE)
        pe = torch.zeros(max_len, 1, d_model).to(DEVICE)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = rearrange(x, "b s e -> s b e")
        """
        Args:
            x: Tensor, shape [seq_len, batch_size, embedding_dim]
        """
        x = x + self.pe[:x.size(0)]
        x = rearrange(x, "s b e -> b s e")
        return self.dropout(x)

### Attention layers
We use a `MultiHeadAttention` module, that is able to perform self-attention aswell as cross-attention (depending on what you give as queries, keys and values).

**Attention**


It takes the multiheaded queries, keys and values as input.
It computes the attention between the queries and the keys and return the attended values.

The implementation of this function can greatly be improved with *einsums*.

**MultiheadAttention**

Computes the multihead queries, keys and values and feed them to the `attention` function.
You also need to merge the key padding mask and the attention mask into one mask.

The implementation of this module can greatly be improved with *einops.rearrange*.

In [6]:
from einops.layers.torch import Rearrange
import torch.nn.functional as F



def attention(
        q: torch.FloatTensor,
        k: torch.FloatTensor,
        v: torch.FloatTensor,
        mask: torch.BoolTensor=None,
        dropout: nn.Dropout=None,
    ) -> tuple:
    """Computes multihead scaled dot-product attention from the
    projected queries, keys and values.

    Args
    ----
        q: Batch of queries.
            Shape of [batch_size, seq_len_1, n_heads, dim_model].
        k: Batch of keys.
            Shape of [batch_size, seq_len_2, n_heads, dim_model].
        v: Batch of values.
            Shape of [batch_size, seq_len_2, n_heads, dim_model].
        mask: Prevent tokens to attend to some other tokens (for padding or autoregressive attention).
            Attention is prevented where the mask is `True`.
            Shape of [batch_size, n_heads, seq_len_1, seq_len_2],
            or broadcastable to that shape.
        dropout: Dropout layer to use.

    Output
    ------
        y: Multihead scaled dot-attention between the queries, keys and values.
            Shape of [batch_size, seq_len_1, n_heads, dim_model].
        attn: Computed attention between the keys and the queries.
            Shape of [batch_size, n_heads, seq_len_1, seq_len_2].
    """
    """This code is inspired from http://nlp.seas.harvard.edu/annotated-transformer/#prelims"""
    # On récupère la dimension de chaque tête d'attention
    d_k = q.shape[-1]

    # Calcul du produit scalaire entre les requêtes (q) et les clés (k)
    # On normalise ensuite en divisant par la racine carrée de d_k pour stabiliser l'entraînement
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)

    # Si un masque est fourni, on met à -inf les positions à masquer (empêche l'attention sur ces positions)
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))

    # On applique une softmax pour obtenir des poids d'attention normalisés entre 0 et 1
    attn = F.softmax(scores, dim=-1)

    # Si un dropout est spécifié, on l'applique aux poids d'attention pour régulariser
    if dropout is not None:
        attn = dropout(attn)

    # On applique les poids d'attention sur les valeurs (v) pour obtenir la sortie pondérée
    y = torch.matmul(attn, v)

    # On retourne la sortie finale et les poids d'attention utilisés
    return y, attn

class MultiheadAttention(nn.Module):
    """Multihead attention module.
    Can be used as a self-attention and cross-attention layer.
    The queries, keys and values are projected into multiple heads
    before computing the attention between those tensors.

    Parameters
    ----------
        dim: Dimension of the input tokens.
        n_heads: Number of heads. `dim` must be divisible by `n_heads`.
        dropout: Dropout rate.
    """
    def __init__(
            self,
            dim: int,
            n_heads: int,
            dropout: float,
        ):
        super().__init__()

        assert dim % n_heads == 0

        self.n_heads = n_heads
        self.dim_head = dim // n_heads  # Dimension per head

        # Projection layers
        self.w_q = nn.Linear(dim, dim, bias=False, device=DEVICE)
        self.w_k = nn.Linear(dim, dim, bias=False, device=DEVICE)
        self.w_v = nn.Linear(dim, dim, bias=False, device=DEVICE)
        self.w_o = nn.Linear(dim, dim, bias=False, device=DEVICE)

        self.dropout = nn.Dropout(dropout)

    def forward(
            self,
            q: torch.FloatTensor,
            k: torch.FloatTensor,
            v: torch.FloatTensor,
            key_padding_mask: torch.BoolTensor = None,
            attn_mask: torch.BoolTensor = None,
        ) -> torch.FloatTensor:
        """Computes the scaled multi-head attention form the input queries,
        keys and values.

        Project those queries, keys and values before feeding them
        to the `attention` function.

        The masks are boolean masks. Tokens are prevented to attends to
        positions where the mask is `True`.

        Args
        ----
            q: Batch of queries.
                Shape of [batch_size, seq_len_1, dim_model].
            k: Batch of keys.
                Shape of [batch_size, seq_len_2, dim_model].
            v: Batch of values.
                Shape of [batch_size, seq_len_2, dim_model].
            key_padding_mask: Prevent attending to padding tokens.
                Shape of [batch_size, seq_len_2].
            attn_mask: Prevent attending to subsequent tokens.
                Shape of [seq_len_1, seq_len_2].

        Output
        ------
            y: Computed multihead attention.
                Shape of [batch_size, seq_len_1, dim_model].
        """
        """This code is inspired from https://github.com/harvardnlp/annotated-transformer/blob/debc9fd747bb2123160a98046ad1c2d4da44a567/AnnotatedTransformer.ipynb"""

        # On applique les transformations linéaires pour obtenir les requêtes (q), clés (k) et valeurs (v)
        q = self.w_q(q)
        k = self.w_k(k)
        v = self.w_v(v)

        # On réarrange les dimensions pour séparer les différentes têtes d'attention
        # Avant : [batch_size, seq_len, dim_model]
        # Après  : [batch_size, n_heads, seq_len, dim_head]
        q = rearrange(q, "b s (h d) -> b h s d", h=self.n_heads)
        k = rearrange(k, "b s (h d) -> b h s d", h=self.n_heads)
        v = rearrange(v, "b s (h d) -> b h s d", h=self.n_heads)

        # On calcule l'attention entre les requêtes et les clés
        y, attn_weights = attention(q, k, v, mask=attn_mask, dropout=self.dropout)

        # On réarrange les dimensions pour regrouper les têtes d'attention
        # Avant : [batch_size, n_heads, seq_len, dim_head]
        # Après  : [batch_size, seq_len, dim_model]
        y = rearrange(y, "b h s d -> b s (h d)")

        # Dernière transformation linéaire pour mélanger les têtes d'attention
        return self.w_o(y), attn_weights


### Encoder and decoder layers

**TranformerEncoder**

Apply self-attention layers onto the source tokens.
It only needs the source key padding mask.


**TranformerDecoder**

Apply masked self-attention layers to the target tokens and cross-attention
layers between the source and the target tokens.
It needs the source and target key padding masks, and the target attention mask.

In [7]:
class TransformerDecoderLayer(nn.Module):
    """Single decoder layer.

    Parameters
    ----------
        d_model: The dimension of decoders inputs/outputs.
        dim_feedforward: Hidden dimension of the feedforward networks.
        nheads: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            d_ff: int,
            nhead: int,
            dropout: float
        ):
        super().__init__()

        # Self-attention using MultiheadAttention(hand made)
        self.self_attn = MultiheadAttention(
            dim=d_model,
            n_heads=nhead,
            dropout=dropout
            )

        # Self-attention using nn.MultiheadAttention
        ''' self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=nhead,
            dropout=dropout,
            batch_first=True
            ) '''


        # Cross-attention using MultiheadAttention
        self.cross_attn = MultiheadAttention(
            dim=d_model,
            n_heads=nhead,
            dropout=dropout,
            )

        # Cross-attention using nn.MultiheadAttention
        ''' self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=nhead,
            dropout=dropout,
            batch_first=True
            ) '''

        # Feedforward network
        #Fonction obtenue à l'aide de chatgpt
        self.feedforward = nn.Sequential(
            nn.Linear(d_model, d_ff, device=DEVICE),
            nn.ReLU(),
            nn.Linear(d_ff, d_model, device=DEVICE)
        )

        # Normalization layers
        self.norm_layer1 = nn.LayerNorm(d_model, device=DEVICE)
        self.norm_layer2 = nn.LayerNorm(d_model, device=DEVICE)
        self.norm_layer3 = nn.LayerNorm(d_model, device=DEVICE)

        # Dropout layers
        self.dropout_layer = nn.Dropout(dropout)

    def forward(
            self,
            src: torch.FloatTensor,
            tgt: torch.FloatTensor,
            tgt_mask_attn: torch.BoolTensor,
            src_key_padding_mask: torch.BoolTensor,
            tgt_key_padding_mask: torch.BoolTensor,
        ) -> torch.FloatTensor:
        """Decode the next target tokens based on the previous tokens.

        Args
        ----
            src: Batch of source sentences.
                Shape of [batch_size, src_seq_len, dim_model].
            tgt: Batch of target sentences.
                Shape of [batch_size, tgt_seq_len, dim_model].
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [tgt_seq_len, tgt_seq_len].
            src_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, src_seq_len].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, tgt_seq_len].

        Output
        ------
            y:  Batch of sequence of embeddings representing the predicted target tokens
                Shape of [batch_size, tgt_seq_len, dim_model].
        """

        # Step 1: Self-attention on the target sequence
        ''' self_attn_out, _ = self.self_attn(
        query=tgt, key=tgt, value=tgt,
        attn_mask=tgt_mask_attn,
        key_padding_mask=tgt_key_padding_mask
        ) '''
        self_attn_out, _ = self.self_attn(
        q=tgt, k=tgt, v=tgt,
        attn_mask=tgt_mask_attn,
        key_padding_mask=tgt_key_padding_mask
        )

        tgt = tgt + self.dropout_layer(self_attn_out)  # Residual connection
        tgt = self.norm_layer1(tgt)  # Normalization

        #Cross-attention (target attends to encoded source)
        ''' cross_attn_out, _ = self.cross_attn(
        query=tgt, key=src, value=src,
        key_padding_mask=src_key_padding_mask
        ) '''
        cross_attn_out, _ = self.cross_attn(
        q=tgt, k=src, v=src,
        key_padding_mask=src_key_padding_mask
        )

        tgt = tgt + self.dropout_layer(cross_attn_out)  # Residual connection
        tgt = self.norm_layer2(tgt)  # Normalization

        # Feedforward network
        ff_out = self.feedforward(tgt)
        tgt = tgt + self.dropout_layer(ff_out)  # Residual connection
        tgt = self.norm_layer3(tgt)  # Normalization

        # Output shape: [batch_size, tgt_seq_len, dim_model]
        return tgt


class TransformerDecoder(nn.Module):
    """Implementation of the transformer decoder stack.

    Parameters
    ----------
        d_model: The dimension of decoders inputs/outputs.
        dim_feedforward: Hidden dimension of the feedforward networks.
        num_decoder_layers: Number of stacked decoders.
        nheads: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            d_ff: int,
            num_decoder_layer:int ,
            nhead: int,
            dropout: float
        ):
        super().__init__()

         # Multiple TransformerDecoderLayer added together
        self.layers = nn.ModuleList([
            TransformerDecoderLayer(d_model, d_ff, nhead, dropout) for _ in range(num_decoder_layer)
        ])

    def forward(
            self,
            src: torch.FloatTensor,
            tgt: torch.FloatTensor,
            tgt_mask_attn: torch.BoolTensor,
            src_key_padding_mask: torch.BoolTensor,
            tgt_key_padding_mask: torch.BoolTensor,
        ) -> torch.FloatTensor:
        """Decodes the source sequence by sequentially passing.
        the encoded source sequence and the target sequence through the decoder stack.

        Args
        ----
            src: Batch of encoded source sentences.
                Shape of [batch_size, src_seq_len, dim_model].
            tgt: Batch of taget sentences.
                Shape of [batch_size, tgt_seq_len, dim_model].
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [tgt_seq_len, tgt_seq_len].
            src_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, src_seq_len].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, tgt_seq_len].

        Output
        ------
            y:  Batch of sequence of embeddings representing the predicted target tokens
                Shape of [batch_size, tgt_seq_len, dim_model].
        """
        # Iteration through each layer
        for layer in self.layers:
            tgt = layer(src, tgt, tgt_mask_attn, src_key_padding_mask, tgt_key_padding_mask)

        return tgt


class TransformerEncoderLayer(nn.Module):
    """Single encoder layer.

    Parameters
    ----------
        d_model: The dimension of input tokens.
        dim_feedforward: Hidden dimension of the feedforward networks.
        nheads: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            d_ff: int,
            nhead: int,
            dropout: float,
        ):
        super().__init__()

        # Multi-head self-attention (handmade)
        self.self_attn = MultiheadAttention(
            dim=d_model,
            n_heads=nhead,
            dropout=dropout
            )

        ''' self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=nhead,
            dropout=dropout,
            batch_first=True
            ) '''

        # Feedforward network
        #Obtenu à l'aide de chatgpt
        self.feedforward = nn.Sequential(
            nn.Linear(d_model, d_ff, device=DEVICE),
            nn.ReLU(),
            nn.Linear(d_ff, d_model, device=DEVICE)
        ).to(DEVICE)

        # Normalization layers
        self.norm_layer1 = nn.LayerNorm(d_model, device=DEVICE)
        self.norm_layer2 = nn.LayerNorm(d_model, device=DEVICE)

        # Dropout layers
        self.dropout = nn.Dropout(dropout)


    def forward(
        self,
        src: torch.FloatTensor,
        key_padding_mask: torch.BoolTensor
        ) -> torch.FloatTensor:
        """Encodes the input. Does not attend to masked inputs.

        Args
        ----
            src: Batch of embedded source tokens.
                Shape of [batch_size, src_seq_len, dim_model].
            key_padding_mask: Mask preventing attention to padding tokens.
                Shape of [batch_size, src_seq_len].

        Output
        ------
            y: Batch of encoded source tokens.
                Shape of [batch_size, src_seq_len, dim_model].
        """

        # Apply self-attention on the source sequence
        attn_output, _ = self.self_attn(src, src, src, key_padding_mask=key_padding_mask)

        # Add attention output to the original src (residual connection)
        src = src + self.dropout(attn_output)

        # Normalize result
        src = self.norm_layer1(src)

        # Apply a feedforward network
        ff_output = self.feedforward(src)

        # Add feedforward output (residual connections)
        src = src + self.dropout(ff_output)

        # Normalize result
        src = self.norm_layer2(src)

        # Return the final output
        return src


class TransformerEncoder(nn.Module):
    """Implementation of the transformer encoder stack.

    Parameters
    ----------
        d_model: The dimension of encoders inputs.
        dim_feedforward: Hidden dimension of the feedforward networks.
        num_encoder_layers: Number of stacked encoders.
        nheads: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            dim_feedforward: int,
            num_encoder_layers: int,
            nheads: int,
            dropout: float
        ):
        super().__init__()

        # List of TransformerEncoderLayer
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, dim_feedforward, nheads, dropout) for _ in range(num_encoder_layers)
        ])

    def forward(
            self,
            src: torch.FloatTensor,
            key_padding_mask: torch.BoolTensor
        ) -> torch.FloatTensor:
        """Encodes the source sequence by sequentially passing.
        the source sequence through the encoder stack.

        Args
        ----
            src: Batch of embedded source sentences.
                Shape of [batch_size, src_seq_len, dim_model].
            key_padding_mask: Mask preventing attention to padding tokens.
                Shape of [batch_size, src_seq_len].

        Output
        ------
            y: Batch of encoded source sequence.
                Shape of [batch_size, src_seq_len, dim_model].
        """
        for layer in self.layers:
            src = layer(src, key_padding_mask)

        return src

### Transformer
This section gathers the `Transformer` and the `TranslationTransformer` modules.

**Transformer**


The classical transformer architecture.
It takes the source and target tokens embeddings and
do the forward pass through the encoder and decoder.

**Translation Transformer**

Compute the source and target tokens embeddings, and apply a final head to produce next token logits.
The output must not be the softmax but just the logits, because we use the `nn.CrossEntropyLoss`.

It also creates the *src_key_padding_mask*, the *tgt_key_padding_mask* and the *tgt_mask_attn*.

In [8]:
from einops import rearrange

class Transformer(nn.Module):
    """Implementation of a Transformer based on the paper: https://arxiv.org/pdf/1706.03762.pdf.

    Parameters
    ----------
        d_model: The dimension of encoders/decoders inputs/ouputs.
        nhead: Number of heads for each multi-head attention.
        num_encoder_layers: Number of stacked encoders.
        num_decoder_layers: Number of stacked encoders.
        dim_feedforward: Hidden dimension of the feedforward networks.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            nhead: int,
            num_encoder_layers: int,
            num_decoder_layers: int,
            dim_feedforward: int,
            dropout: float,
        ):
        super().__init__()

        # Encoder handmade
        self.encoder = TransformerEncoder(
            d_model=d_model,
            dim_feedforward=dim_feedforward,
            num_encoder_layers=num_encoder_layers,
            nheads=nhead,
            dropout=dropout
        )

        #Encoder nn.Encoder
        ''' self.encoder = nn.Encoder(
            d_model=d_model,
            dim_feedforward=dim_feedforward,
            num_encoder_layers=num_encoder_layers,
            nheads=nhead,
            dropout=dropout,
            batch_first=True
        ) '''

        # Decoder
        self.decoder = TransformerDecoder(
            d_model=d_model,
            d_ff=dim_feedforward,
            num_decoder_layer=num_decoder_layers,
            nhead=nhead,
            dropout=dropout
        )
        #Decoder nn.Decoder
        ''' self.decoder = nn.Decoder(
            d_model=d_model,
            dim_feedforward=dim_feedforward,
            num_encoder_layers=num_encoder_layers,
            nheads=nhead,
            dropout=dropout,
            batch_first=True
        ) '''

    def forward(
            self,
            src: torch.FloatTensor,
            tgt: torch.FloatTensor,
            tgt_mask_attn: torch.BoolTensor,
            src_key_padding_mask: torch.BoolTensor,
            tgt_key_padding_mask: torch.BoolTensor
        ) -> torch.FloatTensor:
        """Compute next token embeddings.

        Args
        ----
            src: Batch of source sequences.
                Shape of [batch_size, src_seq_len, dim_model].
            tgt: Batch of target sequences.
                Shape of [batch_size, tgt_seq_len, dim_model].
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [tgt_seq_len, tgt_seq_len].
            src_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, src_seq_len].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, tgt_seq_len].

        Output
        ------
            y: Next token embeddings, given the previous target tokens and the source tokens.
                Shape of [batch_size, tgt_seq_len, dim_model].
        """
        # Encode source
        encoder = self.encoder(src, key_padding_mask=src_key_padding_mask)

        # Decode using target and source
        output = self.decoder(
            src=encoder,
            tgt=tgt,
            tgt_mask_attn=tgt_mask_attn,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask
        )

        return output


class TranslationTransformer(nn.Module):
    """Basic Transformer encoder and decoder for a translation task.
    Manage the masks creation, and the token embeddings.
    Position embeddings can be learnt with a standard `nn.Embedding` layer.

    Parameters
    ----------
        n_tokens_src: Number of tokens in the source vocabulary.
        n_tokens_tgt: Number of tokens in the target vocabulary.
        n_heads: Number of heads for each multi-head attention.
        dim_embedding: Dimension size of the word embeddings (for both language).
        dim_hidden: Dimension size of the feedforward layers
            (for both the encoder and the decoder).
        n_layers: Number of layers in the encoder and decoder.
        dropout: Dropout rate.
        src_pad_idx: Source padding index value.
        tgt_pad_idx: Target padding index value.
    """
    def __init__(
            self,
            n_tokens_src: int,
            n_tokens_tgt: int,
            n_heads: int,
            dim_embedding: int,
            dim_hidden: int,
            n_layers: int,
            dropout: float,
            src_pad_idx: int,
            tgt_pad_idx: int,
        ):
        super().__init__()

        # Embedding layers
        self.src_embedding = nn.Embedding(n_tokens_src, dim_embedding, device=DEVICE)
        self.tgt_embedding = nn.Embedding(n_tokens_tgt, dim_embedding, device=DEVICE)

        # Position encoding
        self.position_embedding = PositionalEncoding(dim_embedding, dropout).to(DEVICE)

        # Transformer model handmade
        self.transformer = Transformer(
            d_model=dim_embedding,
            nhead=n_heads,
            num_encoder_layers=n_layers,
            num_decoder_layers=n_layers,
            dim_feedforward=dim_hidden,
            dropout=dropout
        )

        # Transformer model nn.Transformer
        ''' self.transformer = nn.Transformer(
            d_model=dim_embedding,
            nhead=n_heads,
            num_encoder_layers=n_layers,
            num_decoder_layers=n_layers,
            dim_feedforward=dim_hidden,
            dropout=dropout
        ) '''

        # Final output projection layer
        self.final_out = nn.Linear(dim_embedding, n_tokens_tgt, device=DEVICE)

        # Padding indexes
        self.src_pad_idx = src_pad_idx
        self.tgt_pad_idx = tgt_pad_idx

    def forward(
            self,
            source: torch.LongTensor,
            target: torch.LongTensor
        ) -> torch.FloatTensor:
        """Predict the target tokens logites based on the source tokens.

        Args
        ----
            source: Batch of source sentences.
                Shape of [batch_size, seq_len_src].
            target: Batch of target sentences.
                Shape of [batch_size, seq_len_tgt].

        Output
        ------
            y: Distributions over the next token for all tokens in each sentences.
                Those need to be the logits only, do not apply a softmax because
                it will be done in the loss computation for numerical stability.
                See https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html for more informations.
                Shape of [batch_size, seq_len_tgt, n_tokens_tgt].
        """
        # Get token embeddings and add positional encodings
        src_emb = self.src_embedding(source)
        src_emb = self.position_embedding(src_emb)
        tgt_emb = self.tgt_embedding(target)
        tgt_emb = self.position_embedding(tgt_emb)

        # Create attention masks
        tgt_mask = self.generate_causal_mask(target)
        src_key_padding_mask, tgt_key_padding_mask = self.generate_key_padding_mask(source, target)

        # Pass through Transformer
        output = self.transformer(
            src=src_emb,
            tgt=tgt_emb,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            tgt_mask_attn=tgt_mask
        )

        # Convert output to logits over vocabulary
        output = self.final_out(output)

        return output

    def generate_causal_mask(
            self,
            target: torch.LongTensor,
        ) -> tuple:
        """Generate the masks to prevent attending subsequent tokens.

        Args
        ----
            source: Batch of source sentences.
                Shape of [batch_size, seq_len_src].
            target: Batch of target sentences.
                Shape of [batch_size, seq_len_tgt].

        Output
        ------
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [seq_len_tgt, seq_len_tgt].

        """

        seq_len = target.shape[1]

        tgt_mask = torch.ones((seq_len, seq_len), dtype=torch.bool)
        tgt_mask = torch.triu(tgt_mask, diagonal=1).to(target.device)

        return tgt_mask

    def generate_key_padding_mask(
            self,
            source: torch.LongTensor,
            target: torch.LongTensor,
        ) -> tuple:
        """Generate the masks to prevent attending padding tokens.

        Args
        ----
            source: Batch of source sentences.
                Shape of [batch_size, seq_len_src].
            target: Batch of target sentences.
                Shape of [batch_size, seq_len_tgt].

        Output
        ------
            src_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, seq_len_src].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, seq_len_tgt].

        """

        src_key_padding_mask = source == self.src_pad_idx
        tgt_key_padding_mask = target == self.tgt_pad_idx

        return src_key_padding_mask, tgt_key_padding_mask


# Decoder Only Tranformer



In [9]:
class DecoderOnlyLayer(nn.Module):
    """Single Decoder-Only layer with masked self-attention.

    Parameters
    ----------
        d_model: The dimension of input tokens.
        dim_feedforward: Hidden dimension of the feedforward networks.
        nheads: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """
    def __init__(
            self,
            d_model: int,
            d_ff: int,
            nhead: int,
            dropout: float,
        ):
        super().__init__()

        # Self-attention using MultiheadAttention(hand made)
        self.self_attn = MultiheadAttention(
            dim=d_model,
            n_heads=nhead,
            dropout=dropout
            )

        # Feedforward network
        self.feedforward = nn.Sequential(
            nn.Linear(d_model, d_ff, device=DEVICE),
            nn.ReLU(),
            nn.Linear(d_ff, d_model, device=DEVICE)
        )

        # Normalization layers
        self.norm_layer1 = nn.LayerNorm(d_model, device=DEVICE)
        self.norm_layer2 = nn.LayerNorm(d_model, device=DEVICE)

        # Dropout layers
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        seq: torch.FloatTensor,
        attn_mask: torch.BoolTensor,
        key_padding_mask: torch.BoolTensor
        ) -> torch.FloatTensor:
        """Applies masked self-attention and feedforward network.

        Args
        ----
            seq: Batch of embedded input tokens.
                Shape of [batch_size, seq_len, dim_model].
            attn_mask: Causal mask preventing attention to future tokens.
                       Shape of [seq_len, seq_len].
            key_padding_mask: Mask preventing attention to padding tokens.
                Shape of [batch_size, seq_len].

        Output
        ------
            y: Batch of encoded tokens.
                Shape of [batch_size, seq_len, dim_model].
        """
        # Apply self-attention
        attn_output, _ = self.self_attn(
            q=seq,
            k=seq,
            v=seq,
            attn_mask=attn_mask,
            key_padding_mask=key_padding_mask,
            )

        # Add attention output to the original seq (residual connection)
        seq = seq + self.dropout(attn_output)
        # Normalize result
        seq = self.norm_layer1(seq)

        # Apply a feedforward network
        ff_output = self.feedforward(seq)
        # Add feedforward output (residual connections)
        seq = seq + self.dropout(ff_output)
        # Normalize result
        seq = self.norm_layer2(seq)

        # Return the final output
        return seq


class DecoderOnlyTransformer(nn.Module):
    """Stack of Decoder-Only layers with final normalization.
       Acts as the core processing unit, taking embedded inputs.

    Parameters
    ----------
        d_model: The dimension of input/output embeddings.
        d_ff: Hidden dimension of the feedforward networks.
        nhead: Number of heads for each multi-head attention.
        num_decoder_layers: Number of stacked decoder layers.
        dropout: Dropout rate.
    """
    def __init__(
        self,
        d_model: int,
        nhead: int,
        num_decoder_layers: int,
        d_ff: int,
        dropout: float,
    ):
        super().__init__()
        self.decoder = DecoderOnlyLayer(
            d_model=d_model,
            d_ff=d_ff,
            nhead=nhead,
            dropout=dropout
        )

        # Stack of DecoderOnlyLayer instances
        self.layers = nn.ModuleList([
            DecoderOnlyLayer(d_model, d_ff, nhead, dropout)
            for _ in range(num_decoder_layers)
        ])


    def forward(
        self,
        seq: torch.FloatTensor,
        attn_mask: torch.BoolTensor,
        key_padding_mask: torch.BoolTensor
    ) -> torch.FloatTensor:
        """Processes the sequence through the stack of decoder layers.

        Args:
            seq: Batch of embedded input tokens (with positional encoding).
                 Shape of [batch_size, seq_len, d_model].
            attn_mask: Causal mask preventing attention to future tokens.
                       Shape of [seq_len, seq_len].
            key_padding_mask: Mask preventing attention to padding tokens.
                              Shape of [batch_size, seq_len].

        Returns:
            output: Processed sequence embeddings.
                    Shape of [batch_size, seq_len, d_model].
        """
        output = seq
        # Pass through each layer in the stack
        for layer in self.layers:
            output = layer(
                seq=output,
                attn_mask=attn_mask,
                key_padding_mask=key_padding_mask
            )
        return output


class DecoderOnlyTranslationTransformer(nn.Module):
    """Decoder-Only Transformer for sequence generation tasks.
    Manages token embeddings, positional encoding, mask creation,
    and final output projection, using DecoderOnlyTransformer as the core.

    Parameters
    ----------
        n_tokens_vocab: Number of tokens in the vocabulary.
        n_heads: Number of heads for each multi-head attention.
        dim_embedding: Dimension size of the word embeddings.
        dim_hidden: Dimension size of the feedforward layers in the core transformer.
        num_layers: Number of layers in the decoder stack.
        dropout: Dropout rate.
        pad_idx: Padding index value in the vocabulary.
    """
    def __init__(
        self,
        n_tokens_vocab: int,
        n_heads: int,
        dim_embedding: int,
        dim_hidden: int,
        num_layers: int,
        dropout: float,
        pad_idx: int,
    ):
        super().__init__()

        # Embedding layer
        self.token_embedding = nn.Embedding(n_tokens_vocab, dim_embedding)

        # Position encoding
        self.position_embedding = PositionalEncoding(dim_embedding, dropout)

        self.transformer = DecoderOnlyTransformer(
            d_model=dim_embedding,
            nhead=n_heads,
            num_decoder_layers=num_layers,
            d_ff=dim_hidden,
            dropout=dropout
        )

        # Final output projection layer
        self.final_out = nn.Linear(dim_embedding, n_tokens_vocab, device=DEVICE)

        # Padding index
        self.pad_idx = pad_idx

    def forward(
        self,
        sequence: torch.LongTensor,
    ) -> torch.FloatTensor:
        """Predict the next token logits based on the input sequence."""

        # Get token embeddings and add positional encodings
        seq_emb = self.token_embedding(sequence)
        seq_emb = self.position_embedding(seq_emb)

        # Create attention masks
        attn_mask = self.generate_causal_mask(sequence)
        key_padding_mask = self.generate_key_padding_mask(sequence)

        # Passe les embeddings et les masques au DecoderOnlyTransformer
        processed_seq = self.transformer(
            seq=seq_emb,
            attn_mask=attn_mask,
            key_padding_mask=key_padding_mask
        )

        # Convert output to logits over vocabulary
        # Utilise la sortie du DecoderOnlyTransformer
        output = self.final_out(processed_seq)

        return output

    def generate_causal_mask(
        self,
        sequence: torch.LongTensor,
    ) -> torch.BoolTensor:
        """Generate the mask to prevent attending subsequent tokens (causal mask)."""
        seq_len = sequence.shape[1]
        # S'assurer que le masque est créé sur le même device que la séquence
        mask = torch.triu(torch.ones((seq_len, seq_len), device=sequence.device, dtype=torch.bool), diagonal=1)
        return mask

    # Peut être changer cette partie la
    def generate_key_padding_mask(
        self,
        sequence: torch.LongTensor,
    ) -> torch.BoolTensor:
        """Generate the mask to prevent attending padding tokens."""
        mask = (sequence == self.pad_idx)
        return mask

# Greedy search

One idea to explore once you have your model working is to implement a geedy search to generate a target translation from a trained model and an input source string. The next token will simply be the most probable one. Compare this strategy of decoding with the beam search strategy below.

In [10]:
def greedy_search(
        model: nn.Module,
        source: str,
        src_vocab: Vocab,
        tgt_vocab: Vocab,
        src_tokenizer,
        device: str,
        max_sentence_length: int,
    ) -> str:
    """Do a beam search to produce probable translations.

    Args
    ----
        model: The translation model. Assumes it produces logits score (before softmax).
        source: The sentence to translate.
        src_vocab: The source vocabulary.
        tgt_vocab: The target vocabulary.
        device: Device to which we make the inference.
        max_target: Maximum number of target sentences we keep at the end of each stage.
        max_sentence_length: Maximum number of tokens for the translated sentence.

    Output
    ------
        sentence: The translated source sentence.
    """

    #Version obtenue à l'aide de chatgpt non utilisée
    ''' # Tokenize and numericalize the input sentence
    src_tokens = [src_vocab[token] for token in src_tokenizer(source)]
    src_tensor = torch.tensor(src_tokens, dtype=torch.long).unsqueeze(0).to(device)  # [1, seq_len]

    # Prepare initial target input (<sos> token)
    sos_token = tgt_vocab['<sos>']
    eos_token = tgt_vocab['<eos>']
    tgt_tokens = [sos_token]

    model.eval()

    with torch.no_grad():
        for _ in range(max_sentence_length):
            tgt_tensor = torch.tensor(tgt_tokens, dtype=torch.long).unsqueeze(0).to(device)  # [1, cur_len]

            # Generate masks
            tgt_mask_attn = model.generate_causal_mask(tgt_tensor)
            src_key_padding_mask, tgt_key_padding_mask = model.generate_key_padding_mask(src_tensor, tgt_tensor)

            # Forward pass
            logits = model(src_tensor, tgt_tensor, tgt_mask_attn, src_key_padding_mask, tgt_key_padding_mask)

            # Select the last token output
            next_token = logits[:, -1, :].argmax(dim=-1).item()  # Take token with highest probability

            # Stop if EOS token is generated
            if next_token == eos_token:
                break

            tgt_tokens.append(next_token)

    # Convert token indices back to words
    translated_sentence = ' '.join(tgt_vocab.lookup_tokens(tgt_tokens[1:]))  # Ignore <sos>

    return translated_sentence '''

# Beam search
Beam search is a smarter way of producing a sequence of tokens from
an autoregressive model than just using a greedy search.

The greedy search always chooses the most probable token as the unique
and only next target token, and repeat this processus until the *\<eos\>* token is predicted.

Instead, the beam search selects the k-most probable tokens at each step.
From those k tokens, the current sequence is duplicated k times and the k tokens are appended to the k sequences to produce new k sequences.

*You don't have to understand this code, but understanding this code once the TP is over could improve your torch tensors skills.*

---

**More explanations**

Since it is done at each step, the number of sequences grows exponentially (k sequences after the first step, k² sequences after the second...).
In order to keep the number of sequences low, we remove sequences except the top-s most likely sequences.
To do that, we keep track of the likelihood of each sequence.

Formally, we define $s = [s_1, ..., s_{N_s}]$ as the source sequence made of $N_s$ tokens.
We also define $t^i = [t_1, ..., t_i]$ as the target sequence at the beginning of the step $i$.

The output of the model parameterized by $\theta$ is:

$$
T_{i+1} = p(t_{i+1} | s, t^i ; \theta )
$$

Where $T_{i+1}$ is the distribution of the next token $t_{i+1}$.

Then, we define the likelihood of a target sentence $t = [t_1, ..., t_{N_t}]$ as:

$$
L(t) = \prod_{i=1}^{N_t - 1} p(t_{i+1} | s, t_{i}; \theta )
$$

Pseudocode of the beam search:
```
source: [N_s source tokens]  # Shape of [total_source_tokens]
target: [1, <bos> token]  # Shape of [n_sentences, current_target_tokens]
target_prob: [1]  # Shape of [n_sentences]
# We use `n_sentences` as the batch_size dimension

while current_target_tokens <= max_target_length:
    source = repeat(source, n_sentences)  # Shape of [n_sentences, total_source_tokens]
    predicted = model(source, target)[:, -1]  # Predict the next token distributions of all the n_sentences
    tokens_idx, tokens_prob = topk(predicted, k)

    # Append the `n_sentences * k` tokens to the `n_sentences` sentences
    target = repeat(target, k)  # Shape of [n_sentences * k, current_target_tokens]
    target = append_tokens(target, tokens_idx)  # Shape of [n_sentences * k, current_target_tokens + 1]

    # Update the sentences probabilities
    target_prob = repeat(target_prob, k)  # Shape of [n_sentences * k]
    target_prob *= tokens_prob

    if n_sentences * k >= max_sentences:
        target, target_prob = topk_prob(target, target_prob, k=max_sentences)
    else:
        n_sentences *= k

    current_target_tokens += 1
```

In [11]:
def beautify(sentence: str) -> str:
    """Removes useless spaces.
    """
    punc = {'.', ',', ';'}
    for p in punc:
        sentence = sentence.replace(f' {p}', p)

    links = {'-', "'"}
    for l in links:
        sentence = sentence.replace(f'{l} ', l)
        sentence = sentence.replace(f' {l}', l)

    return sentence

In [12]:
def indices_terminated(
        target: torch.FloatTensor,
        eos_token: int
    ) -> tuple:
    """Split the target sentences between the terminated and the non-terminated
    sentence. Return the indices of those two groups.

    Args
    ----
        target: The sentences.
            Shape of [batch_size, n_tokens].
        eos_token: Value of the End-of-Sentence token.

    Output
    ------
        terminated: Indices of the terminated sentences (who's got the eos_token).
            Shape of [n_terminated, ].
        non-terminated: Indices of the unfinished sentences.
            Shape of [batch_size-n_terminated, ].
    """
    terminated = [i for i, t in enumerate(target) if eos_token in t]
    non_terminated = [i for i, t in enumerate(target) if eos_token not in t]
    return torch.LongTensor(terminated), torch.LongTensor(non_terminated)


def append_beams(
        target: torch.FloatTensor,
        beams: torch.FloatTensor
    ) -> torch.FloatTensor:
    """Add the beam tokens to the current sentences.
    Duplicate the sentences so one token is added per beam per batch.

    Args
    ----
        target: Batch of unfinished sentences.
            Shape of [batch_size, n_tokens].
        beams: Batch of beams for each sentences.
            Shape of [batch_size, n_beams].

    Output
    ------
        target: Batch of sentences with one beam per sentence.
            Shape of [batch_size * n_beams, n_tokens+1].
    """
    batch_size, n_beams = beams.shape
    n_tokens = target.shape[1]

    target = einops.repeat(target, 'b t -> b c t', c=n_beams)  # [batch_size, n_beams, n_tokens]
    beams = beams.unsqueeze(dim=2)  # [batch_size, n_beams, 1]

    target = torch.cat((target, beams), dim=2)  # [batch_size, n_beams, n_tokens+1]
    target = target.view(batch_size*n_beams, n_tokens+1)  # [batch_size * n_beams, n_tokens+1]
    return target


def beam_search(
        model: nn.Module,
        source: str,
        config: dict,
        beam_width: int,
        max_target: int,
        max_sentence_length: int,
    ) -> list:
    """Do a beam search to produce probable translations, adapting to model architecture.

    Args
    ----
        model: The translation model. Assumes it produces linear score (before softmax).
        source: The sentence to translate.
        config: Dictionary containing model configuration, including:
            'src_vocab', 'tgt_vocab', 'src_tokenizer', 'device', 'model_type',
            'tgt_pad_idx' (implicitly used via EOS_IDX).
        beam_width: Number of top-k tokens we keep at each stage.
        max_target: Maximum number of target sentences we keep at the end of each stage.
        max_sentence_length: Maximum number of tokens for the translated sentence.

    Output
    ------
        sentences: List of sentences ordered by their likelihood.
    """
    src_vocab = config['src_vocab']
    tgt_vocab = config['tgt_vocab']
    src_tokenizer = config['src_tokenizer']
    device = config['device']
    model_type = config['model_type']
    EOS_IDX = tgt_vocab['<eos>']
    BOS_IDX = tgt_vocab['<bos>']
    PAD_IDX = tgt_vocab['<pad>']

    model.eval()
    model.to(device)

    src_tokens_list = ['<bos>'] + src_tokenizer(source) + ['<eos>']
    src_tokens = torch.LongTensor(src_vocab(src_tokens_list)).unsqueeze(0).to(device)

    tgt_tokens = torch.LongTensor([[BOS_IDX]]).to(device)
    log_probs = torch.FloatTensor([0.0]).to(device)

    completed_hypotheses = []

    with torch.no_grad():
        for _step in range(max_sentence_length):
            if tgt_tokens.shape[0] == 0:
                break

            current_batch_size = tgt_tokens.shape[0]

            if model_type == 'encoder-decoder':
                src_repeated = src_tokens.repeat(current_batch_size, 1)
                logits = model.forward(src_repeated, tgt_tokens)
            elif model_type == 'decoder-only':
                logits = model.forward(tgt_tokens)
            else:
                raise ValueError(f"Unsupported model_type: {model_type}")

            next_token_log_probs = torch.log_softmax(logits[:, -1, :], dim=-1)

            cumulative_log_probs = log_probs.unsqueeze(1) + next_token_log_probs

            k = min(beam_width, next_token_log_probs.shape[-1])
            topk_log_probs, topk_indices = cumulative_log_probs.topk(k, dim=-1)

            all_candidate_log_probs = topk_log_probs.view(-1)
            parent_beam_indices = torch.arange(current_batch_size, device=device).unsqueeze(1).expand(-1, k).reshape(-1)
            all_candidate_tokens = topk_indices.view(-1)

            num_candidates = all_candidate_log_probs.shape[0]
            num_to_keep = min(num_candidates, max_target)

            overall_top_log_probs, overall_top_indices = all_candidate_log_probs.topk(num_to_keep, dim=0)

            selected_parent_beam_indices = parent_beam_indices[overall_top_indices]
            selected_tokens = all_candidate_tokens[overall_top_indices]
            selected_log_probs = overall_top_log_probs

            prev_tgt_tokens = tgt_tokens[selected_parent_beam_indices]
            new_tgt_tokens = torch.cat([prev_tgt_tokens, selected_tokens.unsqueeze(1)], dim=1)

            is_terminated = (selected_tokens == EOS_IDX)
            is_active = ~is_terminated

            terminated_indices = torch.nonzero(is_terminated).squeeze(1)
            for idx in terminated_indices:
                score = selected_log_probs[idx].item()
                sequence = new_tgt_tokens[idx]
                completed_hypotheses.append((score, sequence))

            active_indices = torch.nonzero(is_active).squeeze(1)
            if active_indices.shape[0] == 0:
                break

            tgt_tokens = new_tgt_tokens[active_indices]
            log_probs = selected_log_probs[active_indices]

            if tgt_tokens.shape[0] > beam_width:
                top_active_log_probs, top_active_indices = log_probs.topk(beam_width)
                tgt_tokens = tgt_tokens[top_active_indices]
                log_probs = top_active_log_probs

    if not completed_hypotheses:
        completed_hypotheses = [(lp.item(), seq) for lp, seq in zip(log_probs, tgt_tokens)]

    completed_hypotheses.sort(key=lambda x: x[0], reverse=True)

    sentences = []
    for score, tgt_sentence_tensor in completed_hypotheses:
        tgt_sentence_list = tgt_sentence_tensor.tolist()
        if BOS_IDX in tgt_sentence_list:
            tgt_sentence_list = tgt_sentence_list[tgt_sentence_list.index(BOS_IDX)+1:]
        if EOS_IDX in tgt_sentence_list:
            eos_pos = tgt_sentence_list.index(EOS_IDX)
            tgt_sentence_list = tgt_sentence_list[:eos_pos]

        tgt_sentence_str = ' '.join(tgt_vocab.lookup_tokens(tgt_sentence_list))
        sentences.append(tgt_sentence_str)

    sentences_with_scores = list(zip(sentences, [h[0] for h in completed_hypotheses]))

    return sentences_with_scores[:max_target]

# Training loop
This is a basic training loop code. It takes a big configuration dictionnary to avoid never ending arguments in the functions.
We use [Weights and Biases](https://wandb.ai/) to log the trainings.
It logs every training informations and model performances in the cloud.
You have to create an account to use it. Every accounts are free for individuals or research teams.

In [13]:
def print_logs(dataset_type: str, logs: dict):
    """Print the logs.

    Args
    ----
        dataset_type: Either "Train", "Eval", "Test" type.
        logs: Containing the metric's name and value.
    """
    desc = [
        f'{name}: {value:.2f}'
        for name, value in logs.items()
    ]
    desc = '\t'.join(desc)
    desc = f'{dataset_type} -\t' + desc
    desc = desc.expandtabs(5)
    print(desc)


def topk_accuracy(
        real_tokens: torch.FloatTensor,
        probs_tokens: torch.FloatTensor,
        k: int,
        tgt_pad_idx: int,
    ) -> torch.FloatTensor:
    """Compute the top-k accuracy.
    We ignore the PAD tokens.

    Args
    ----
        real_tokens: Real tokens of the target sentence.
            Shape of [batch_size * n_tokens].
        probs_tokens: Tokens probability predicted by the model.
            Shape of [batch_size * n_tokens, n_target_vocabulary].
        k: Top-k accuracy threshold.
        src_pad_idx: Source padding index value.

    Output
    ------
        acc: Scalar top-k accuracy value.
    """
    total = (real_tokens != tgt_pad_idx).sum()

    _, pred_tokens = probs_tokens.topk(k=k, dim=-1)  # [batch_size * n_tokens, k]
    real_tokens = einops.repeat(real_tokens, 'b -> b k', k=k)  # [batch_size * n_tokens, k]

    good = (pred_tokens == real_tokens) & (real_tokens != tgt_pad_idx)
    acc = good.sum() / total
    return acc


def loss_batch(
        model: nn.Module,
        source: torch.LongTensor,
        target: torch.LongTensor,
        config: dict,
    )-> dict:
    """Compute the metrics associated with this batch.
    The metrics are:
        - loss
        - top-1 accuracy
        - top-5 accuracy
        - top-10 accuracy

    Args
    ----
        model: The model to train.
        source: Batch of source tokens.
            Shape of [batch_size, n_src_tokens].
        target: Batch of target tokens.
            Shape of [batch_size, n_tgt_tokens].
        config: Additional parameters.

    Output
    ------
        metrics: Dictionnary containing evaluated metrics on this batch.
    """
    device = config['device']
    loss_fn = config['loss'].to(device)
    metrics = dict()

    source, target = source.to(device), target.to(device)
    target_in, target_out = target[:, :-1], target[:, 1:]

    # Loss
    if config['model_type'] == 'encoder-decoder':
        pred = model(source, target_in)  # [batch_size, n_tgt_tokens-1, n_vocab]
    elif config['model_type'] == 'decoder-only':
        pred = model(target_in)  # [batch_size, n_tgt_tokens-1, n_vocab]

    # pred = model(source, target_in)  # [batch_size, n_tgt_tokens-1, n_vocab]
    # pred = pred.view(-1, pred.shape[2])  # [batch_size * (n_tgt_tokens - 1), n_vocab]
    pred = pred.view(-1, pred.shape[-1])  # [batch_size * (n_tgt_tokens - 1), n_vocab]
    target_out = target_out.flatten()  # [batch_size * (n_tgt_tokens - 1),]
    metrics['loss'] = loss_fn(pred, target_out)

    # Accuracy - we ignore the padding predictions
    for k in [1, 5, 10]:
        metrics[f'top-{k}'] = topk_accuracy(target_out, pred, k, config['tgt_pad_idx'])

    return metrics


def eval_model(model: nn.Module, dataloader: DataLoader, config: dict) -> dict:
    """Evaluate the model on the given dataloader.
    """
    device = config['device']
    logs = defaultdict(list)

    model.to(device)
    model.eval()

    with torch.no_grad():
        for source, target in dataloader:
            metrics = loss_batch(model, source, target, config)
            for name, value in metrics.items():
                logs[name].append(value.cpu().item())

    for name, values in logs.items():
        logs[name] = np.mean(values)
    return logs


def train_model(model: nn.Module, config: dict):
    """Train the model in a teacher forcing manner.
    """
    train_loader, val_loader = config['train_loader'], config['val_loader']
    train_dataset, val_dataset = train_loader.dataset.dataset, val_loader.dataset.dataset
    optimizer = config['optimizer']
    clip = config['clip']
    device = config['device']

    columns = ['epoch']
    for mode in ['train', 'validation']:
        columns += [
            f'{mode} - {colname}'
            for colname in ['source', 'target', 'predicted', 'likelihood']
        ]
    log_table = wandb.Table(columns=columns)


    print(f'Starting training for {config["epochs"]} epochs, using {device}.')
    for e in range(config['epochs']):
        print(f'\nEpoch {e+1}')

        model.to(device)
        model.train()
        logs = defaultdict(list)

        for batch_id, (source, target) in enumerate(train_loader):
            optimizer.zero_grad()

            metrics = loss_batch(model, source, target, config)
            loss = metrics['loss']

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

            for name, value in metrics.items():
                logs[name].append(value.cpu().item())  # Don't forget the '.item' to free the cuda memory

            if batch_id % config['log_every'] == 0:
                for name, value in logs.items():
                    logs[name] = np.mean(value)

                train_logs = {
                    f'Train - {m}': v
                    for m, v in logs.items()
                }
                wandb.log(train_logs)
                logs = defaultdict(list)

        # Logs
        if len(logs) != 0:
            for name, value in logs.items():
                logs[name] = np.mean(value)
            train_logs = {
                f'Train - {m}': v
                for m, v in logs.items()
            }
        else:
            logs = {
                m.split(' - ')[1]: v
                for m, v in train_logs.items()
            }

        print_logs('Train', logs)

        logs = eval_model(model, val_loader, config)
        print_logs('Eval', logs)
        val_logs = {
            f'Validation - {m}': v
            for m, v in logs.items()
        }

        val_source, val_target = val_dataset[ torch.randint(len(val_dataset), (1,)) ]
        val_pred, val_prob = beam_search(
            model,
            val_source,
            # config['src_vocab'],
            # config['tgt_vocab'],
            # config['src_tokenizer'],
            # device,  # It can take a lot of VRAM
            config=config,
            beam_width=10,
            max_target=100,
            max_sentence_length=config['max_sequence_length'],
        )[0]
        print(val_source)
        print(val_pred)

        logs = {**train_logs, **val_logs}  # Merge dictionnaries
        wandb.log(logs)  # Upload to the WandB cloud

        # Table logs
        train_source, train_target = train_dataset[ torch.randint(len(train_dataset), (1,)) ]
        train_pred, train_prob = beam_search(
            model,
            train_source,
            # config['src_vocab'],
            # config['tgt_vocab'],
            # config['src_tokenizer'],
            # device,  # It can take a lot of VRAM
            config=config,
            beam_width=10,
            max_target=100,
            max_sentence_length=config['max_sequence_length'],
        )[0]

        data = [
            e + 1,
            train_source, train_target, train_pred, train_prob,
            val_source, val_target, val_pred, val_prob,
        ]
        log_table.add_data(*data)

    # Log the table at the end of the training
    wandb.log({'Model predictions': log_table})

# Training the models
We can now finally train the models.
Choose the right hyperparameters, play with them and try to find
ones that lead to good models and good training curves.
Try to reach a loss under 1.0.

So you know, it is possible to get descent results with approximately 20 epochs.
With CUDA enabled, one epoch, even on a big model with a big dataset, shouldn't last more than 10 minutes.
A normal epoch is between 1 to 5 minutes.

*This is considering Colab Pro, we should try using free Colab to get better estimations.*

---

To test your implementations, it is easier to try your models
in a CPU instance. Indeed, Colab reduces your GPU instances priority
with the time you recently past using GPU instances. It would be
sad to consume all your GPU time on implementation testing.
Moreover, you should try your models on small datasets and with a small number of parameters.
For exemple, you could set:
```
MAX_SEQ_LEN = 10
MIN_TOK_FREQ = 20
dim_embedding = 40
dim_hidden = 60
n_layers = 1
```

You usually don't want to log anything onto WandB when testing your implementation.
To deactivate WandB without having to change any line of code, you can type `!wandb offline` in a cell.

Once you have rightly implemented the models, you can train bigger models on bigger datasets.
When you do this, do not forget to change the runtime as GPU (and use `!wandb online`)!

In [14]:
# Checking GPU and logging to wandb

!wandb login

!nvidia-smi

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33malexis-27[0m ([33minf8225_equipe_18[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
Sun Apr 27 02:26:52 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id     

In [15]:
# Instanciate the datasets

# A tester [5, 10, 20, 50, 100]
MAX_SEQ_LEN = 10

MIN_TOK_FREQ = 20
dim_embedding = 40

dim_hidden = 60
n_layers = 1

train_dataset, val_dataset = build_datasets(
    MAX_SEQ_LEN,
    MIN_TOK_FREQ,
    en_tokenizer,
    fr_tokenizer,
    train,
    valid,
)


print(f'English vocabulary size: {len(train_dataset.en_vocab):,}')
print(f'French vocabulary size: {len(train_dataset.fr_vocab):,}')

print(f'\nTraining examples: {len(train_dataset):,}')
print(f'Validation examples: {len(val_dataset):,}')

English vocabulary size: 2,175
French vocabulary size: 2,660

Training examples: 145,333
Validation examples: 16,067


In [20]:
# Build the model, the dataloaders, optimizer and the loss function
# Log every hyperparameters and arguments into the config dictionnary

config = {
    # General parameters

    # A tester [5, 10, 20, 50]
    'epochs': 50,

    # A tester [32, 64, 128, 512, 1024]
    'batch_size': 128,

    'lr': 1e-3,
    'betas': (0.9, 0.99),
    'clip': 5,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',

    # Model parameters
    'n_tokens_src': len(train_dataset.en_vocab),
    'n_tokens_tgt': len(train_dataset.fr_vocab),
    'n_heads': 4,
    'dim_embedding': 196,
    'dim_hidden': 256,
    'n_layers': 3,
    'dropout': 0.1,
    'model_type': 'decoder-only', # 'encoder-decoder' ou 'decoder-only'

    # Others
    'max_sequence_length': MAX_SEQ_LEN,
    'min_token_freq': MIN_TOK_FREQ,
    'src_vocab': train_dataset.en_vocab,
    'tgt_vocab': train_dataset.fr_vocab,
    'src_tokenizer': en_tokenizer,
    'tgt_tokenizer': fr_tokenizer,
    'src_pad_idx': train_dataset.en_vocab['<pad>'],
    'tgt_pad_idx': train_dataset.fr_vocab['<pad>'],
    'seed': 0,
    'log_every': 50,  # Number of batches between each wandb logs
}

torch.manual_seed(config['seed'])

config['train_loader'] = DataLoader(
    train_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)

config['val_loader'] = DataLoader(
    val_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)

if config['model_type'] == 'encoder-decoder':
  model = TranslationTransformer(
      config['n_tokens_src'],
      config['n_tokens_tgt'],
      config['n_heads'],
      config['dim_embedding'],
      config['dim_hidden'],
      config['n_layers'],
      config['dropout'],
      config['src_pad_idx'],
      config['tgt_pad_idx'],
  )
  summary_input_size = [
      (config['batch_size'], config['max_sequence_length']), # src
      (config['batch_size'], config['max_sequence_length'])  # tgt
  ]
  summary_dtypes = [torch.long, torch.long]
elif config['model_type'] == 'decoder-only':
  model = DecoderOnlyTranslationTransformer(
      config['n_tokens_tgt'],
      config['n_heads'],
      config['dim_embedding'],
      config['dim_hidden'],
      config['n_layers'],
      config['dropout'],
      config['tgt_pad_idx'],
  )
  summary_input_size = [
      (config['batch_size'], config['max_sequence_length']), # src
  ]
  summary_dtypes = [torch.long]

config['optimizer'] = optim.Adam(
    model.parameters(),
    lr=config['lr'],
    betas=config['betas'],
)

weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
config['loss'] = nn.CrossEntropyLoss(
    weight=weight_classes,
    ignore_index=config['tgt_pad_idx'],  # We do not have to learn those
)

summary(
    model,
    input_size=summary_input_size,
    dtypes=summary_dtypes,
    depth=3,
)

Layer (type:depth-idx)                        Output Shape              Param #
DecoderOnlyTranslationTransformer             [128, 10, 2660]           --
├─Embedding: 1-1                              [128, 10, 196]            521,360
├─PositionalEncoding: 1-2                     [128, 10, 196]            --
│    └─Dropout: 2-1                           [128, 10, 196]            --
├─DecoderOnlyTransformer: 1-3                 [128, 10, 196]            255,252
│    └─ModuleList: 2-2                        --                        --
│    │    └─DecoderOnlyLayer: 3-1             [128, 10, 196]            255,252
│    │    └─DecoderOnlyLayer: 3-2             [128, 10, 196]            255,252
│    │    └─DecoderOnlyLayer: 3-3             [128, 10, 196]            255,252
├─Linear: 1-4                                 [128, 10, 2660]           524,020
Total params: 2,066,388
Trainable params: 2,066,388
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 231.83
Input size (MB): 0.01


In [21]:
!wandb online  # online / offline / disabled to activate, deactivate or turn off WandB logging
# !wandb offline

with wandb.init(
        config=config,
        project='INF8225 - Projet',  # Title of your project
        group='Transformer Decoder-only',  # In what group of runs do you want this run to be in?
        save_code=True,
        name='Seq: 10, Epoch: 50, Batch size: 128'
    ):
    train_model(model, config)

W&B online. Running your script from this directory will now sync to the cloud.


Starting training for 50 epochs, using cuda.

Epoch 1
Train -   loss: 2.82     top-1: 0.40    top-5: 0.60    top-10: 0.69
Eval -    loss: 2.76     top-1: 0.41    top-5: 0.61    top-10: 0.70
I'm in no hurry.
je ne peux pas y aller .

Epoch 2
Train -   loss: 2.69     top-1: 0.42    top-5: 0.61    top-10: 0.71
Eval -    loss: 2.64     top-1: 0.42    top-5: 0.62    top-10: 0.71
They said no.
je ne l' ai pas vu .

Epoch 3
Train -   loss: 2.63     top-1: 0.42    top-5: 0.62    top-10: 0.72
Eval -    loss: 2.57     top-1: 0.43    top-5: 0.63    top-10: 0.73
Everybody was startled.
je n' ai rien fait de mal .

Epoch 4
Train -   loss: 2.57     top-1: 0.43    top-5: 0.63    top-10: 0.73
Eval -    loss: 2.54     top-1: 0.43    top-5: 0.64    top-10: 0.73
I'm sick of you.
je n' ai rien vu .

Epoch 5
Train -   loss: 2.53     top-1: 0.43    top-5: 0.64    top-10: 0.74
Eval -    loss: 2.50     top-1: 0.43    top-5: 0.64    top-10: 0.74
Why do zebras have stripes?
je ne peux pas le faire .

Epoch 6
Tr

0,1
Train - loss,█▇▇▆▆▅▅▅▄▄▄▄▄▄▄▄▃▃▃▃▃▃▃▃▃▁▃▃▂▂▂▂▃▃▂▂▂▂▂▂
Train - top-1,▁▂▄▄▄▅█▅▅▆▆▆▆▆▆▆▆▇▇▆▆▆▆▆▇▇▇█▇▇▇▇██▇█▇███
Train - top-10,▁▁▃▄▄▄▄▅▅▅▆▅▆▆▆▆▆▆▆▆▆▆▇▆▇▆▆▇▆▇▆▇▇█▇▇▇▇▇▇
Train - top-5,▁▂▂▂▄▄▅▅▅▅▆▅▆▅▅▅▆▆▆▆▆▆▆▇▆▇▆▇▇▇▇▇▇▇▇▇█▇▇█
Validation - loss,█▆▅▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Validation - top-1,▁▃▄▄▅▅▆▆▆▆▇▆▇▇▇▇▇▇▇▇▇▇█▇████████████████
Validation - top-10,▁▂▄▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇██▇█▇██████████████
Validation - top-5,▁▂▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇██▇▇█▇███████████████

0,1
Train - loss,2.20973
Train - top-1,0.46735
Train - top-10,0.78954
Train - top-5,0.69037
Validation - loss,2.3594
Validation - top-1,0.45663
Validation - top-10,0.7669
Validation - top-5,0.66738


In [None]:
sentence = "It is possible to try your work here."

preds = beam_search(
    model,
    sentence,
    config['src_vocab'],
    config['tgt_vocab'],
    config['src_tokenizer'],
    config['device'],
    beam_width=10,
    max_target=100,
    max_sentence_length=config['max_sequence_length']
)[:5]

for i, (translation, likelihood) in enumerate(preds):
    print(f'{i}. ({likelihood*100:.5f}%) \t {translation}')

0. (0.52463%) 	 il est impossible de trouver votre travail ici.
1. (0.35067%) 	 il est impossible de trouver ton travail ici.
2. (0.26617%) 	 il est impossible de travailler votre travail ici.
3. (0.25073%) 	 il est impossible de tenter votre travail ici.
4. (0.22054%) 	 il est impossible de faire votre travail ici.


---
#Understanding the Architecture of a Decoder-Only Transformer: what inspired our project

Sources:

[this blog post](https://medium.com/international-school-of-ai-data-science/building-custom-gpt-with-pytorch-59e5ba8102d4). The [first "GPT" paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf), and the paper cited by this GPT-1 paper for the Decoder Only architecture used for GPT, [i.e. this paper](https://arxiv.org/abs/1801.10198)