<br>
<font>
<div dir=ltr align=center>
<img src="https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png" width=150 height=150> <br>
<font color=0F5298 size=7>
    Machine learning <br>
<font color=2565AE size=5>
    Computer Engineering Department <br>
    Fall 2024<br>
<font color=3C99D size=5>
    Practical Assignment 5 - NLP - Transformer & BERT <br>
</div>
<div dir=ltr align=center>
<font color=0CBCDF size=4>
   &#x1F349; Masoud Tahmasbi  &#x1F349;  &#x1F353; Arash Ziyaei &#x1F353;
<br>
<font color=0CBCDF size=4>
   &#x1F335; Amirhossein Akbari  &#x1F335;
</div>

____

<font color=9999FF size=4>
&#x1F388; Full Name : Sina Beyrami
<br>
<font color=9999FF size=4>
&#x1F388; Student Number : 400105433

<font color=0080FF size=3>
This notebook covers two key topics. First, we implement a transformer model from scratch and apply it to a specific task. Second, we fine-tune the BERT model using LoRA for efficient adaptation to a downstream task.
</font>
<br>

**Note:**
<br>
<font color=66B2FF size=2>In this notebook, you are free to use any function or model from PyTorch to assist with the implementation. However, TensorFlow is not permitted for this exercise. This ensures consistency and alignment with the tools being focused on.</font>
<br>
<font color=red size=3>**Run All Cells Before Submission**</font>: <font color=FF99CC size=2>Before saving and submitting your notebook, please ensure you run all cells from start to finish. This practice guarantees that your notebook is self-consistent and can be evaluated correctly by others.</font>

# Section 1: Transformer

The transformer architecture consists of two main components: an encoder and a decoder. Each of these components is made up of multiple layers that include self-attention mechanisms and feedforward neural networks. The self-attention mechanism is central to the transformer, as it enables the model to assess the importance of different words in a sentence by considering their relationships with one another.


In this assignment, you will build a Transformer model from scratch by implementing its Encoder and Decoder components.

In [None]:
!pip install datasets



In [None]:
# Importing libraries

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split
from torch.utils.tensorboard import SummaryWriter

# Math
import math

# HuggingFace libraries
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# Pathlib
from pathlib import Path

# typing
from typing import Any

# Library for progress bars in loops
from tqdm import tqdm

# Importing library of warnings
import warnings

## Part 1: Input Embeddings
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">When we observe the Transformer architecture image above, we can see that the Embeddings represent the first step of both blocks.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>InputEmbeddings</code> class below is responsible for converting the input token IDs into numerical vectors of <code>d_model</code> dimensions. To prevent the input embeddings from becoming extremely small, we scale them by multiplying by $\sqrt{d_{model}}$.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the image below, we can see how the embeddings are created. First, a sentence is split into tokens (we will explore what tokens are later on). Then, the token IDs (identification numbers) are transformed into the embeddings, which are high-dimensional vectors.</p>
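<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">As a rough illustration of the lookup-and-scale step, the sketch below uses plain Python lists in place of <code>nn.Embedding</code>. The four-word vocabulary, the table values, and the helper name <code>embed</code> are made up for this example and are not part of the assignment.</p>

```python
import math

# Hypothetical toy setup: a 4-word vocabulary and d_model = 4.
d_model = 4
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}

# Stand-in for the learnable weight matrix of nn.Embedding(vocab_size, d_model);
# the numbers are arbitrary and only serve the illustration.
embedding_table = [
    [0.1, -0.2, 0.3, 0.0],
    [0.5, 0.1, -0.4, 0.2],
    [-0.3, 0.2, 0.1, 0.4],
    [0.0, 0.0, 0.25, -0.5],
]

def embed(token_ids):
    """Look up the row of each token ID and scale it by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [[value * scale for value in embedding_table[i]] for i in token_ids]

token_ids = [vocab[word] for word in "the cat sat".split()]
vectors = embed(token_ids)

print(token_ids)          # one ID per word
print(len(vectors[0]))    # each vector has d_model entries
```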

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a class `InputEmbeddings` inheriting from `nn.Module`
# - Initialize the class with two parameters:
#   1. `d_model`: Dimension of the embedding vectors
#   2. `vocab_size`: Size of the vocabulary
# - Create an embedding layer using `nn.Embedding` to map input indices to dense vectors

# - In the `forward` method:
#   1. Pass the input `x` through the embedding layer
#   2. Scale the embeddings by the square root of `d_model` for variance normalization

class InputEmbeddings(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.embed(x) * math.sqrt(self.d_model)

######################  TODO  ########################
######################  TODO  ########################


## Part 2: positional encoding
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the original paper, the authors add the positional encodings to the input embeddings at the bottom of both the encoder and decoder blocks so the model can have some information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two vectors can be summed and we can combine the semantic content from the word embeddings and positional information from the positional encodings.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>PositionalEncoding</code> class below, we will create a matrix of positional encodings <code>pe</code> with dimensions <code>(seq_len, d_model)</code>. We will start by filling it with $0$s. We will then apply the sine function to the even indices of the positional encoding matrix and the cosine function to the odd ones.</p>

<p style="
    margin-bottom: 5;
    font-size: 22px;
    font-weight: 300;
    font-family: 'Helvetica Neue', sans-serif;
    color: #000000;
  ">
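    \begin{equation}
    \text{Even Indices } (2i): \quad \text{PE(pos, } 2i) = \sin\left(\frac{\text{pos}}{10000^{2i / d_{model}}}\right)
    \end{equation}
    \begin{equation}
    \text{Odd Indices } (2i + 1): \quad \text{PE(pos, } 2i + 1) = \cos\left(\frac{\text{pos}}{10000^{2i / d_{model}}}\right)
    \end{equation}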
</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We apply the sine and cosine functions because they allow the model to determine the position of a word based on the position of other words in the sequence, since for any fixed offset $k$, $PE_{pos + k}$ can be represented as a linear function of $PE_{pos}$. This happens due to the properties of sine and cosine functions, where a shift in the input results in a predictable change in the output.</p>
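<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The linear-shift property can be checked numerically. The sketch below, in plain Python, assumes an even <code>d_model</code>; the helper names <code>positional_encoding</code> and <code>shift</code> are hypothetical. It rotates each (sine, cosine) pair of $PE_{pos}$ by an angle that depends only on $k$ and compares the result with $PE_{pos + k}$ computed directly.</p>

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position: sine at even indices, cosine at odd ones."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):          # i plays the role of 2i in the formula
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        pe[i + 1] = math.cos(angle)
    return pe

def shift(pe, k, d_model):
    """Apply a per-frequency 2x2 rotation that depends on k but not on pos."""
    out = [0.0] * d_model
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))  # frequency of this (sin, cos) pair
        s, c = pe[i], pe[i + 1]
        out[i] = s * math.cos(k * w) + c * math.sin(k * w)       # sin(a + b)
        out[i + 1] = -s * math.sin(k * w) + c * math.cos(k * w)  # cos(a + b)
    return out

d_model, pos, k = 8, 5, 3
direct = positional_encoding(pos + k, d_model)
shifted = shift(positional_encoding(pos, d_model), k, d_model)
max_error = max(abs(a - b) for a, b in zip(direct, shifted))
print(max_error)  # difference between the two is at floating-point noise level
```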

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `PositionalEncoding` class inheriting from `nn.Module`
# - Initialize with `d_model`, `seq_len`, and `dropout`
# - Generate a positional encoding matrix using sine and cosine functions
# - Register the positional encoding as a non-trainable buffer
# - In `forward`, add positional encoding to input and apply dropout

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, seq_len, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(seq_len, d_model)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

######################  TODO  ########################
######################  TODO  ########################


## Part 3: layer normalization
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">When we look at the encoder and decoder blocks, we see several normalization layers called <b><i>Add &amp; Norm</i></b>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>LayerNormalization</code> class below performs layer normalization on the input data. During its forward pass, we compute the mean and standard deviation of the input data along the last dimension. We then normalize the input data by subtracting the mean and dividing by the standard deviation plus a small number called epsilon to avoid any division by zero. This process results in a normalized output with a mean of 0 and a standard deviation of 1.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will then scale the normalized output by a learnable parameter <code>alpha</code> and add a learnable parameter called <code>bias</code>. The training process is responsible for adjusting these parameters. The final result is a layer-normalized tensor, which ensures that the scale of the inputs to layers in the network is consistent.</p>
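<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">A minimal sketch of the same computation on a single feature vector, written in plain Python. It assumes scalar <code>alpha</code> and <code>bias</code> and uses the sample standard deviation (dividing by $n - 1$), which is also the default behaviour of <code>x.std</code> in PyTorch; the input values are arbitrary.</p>

```python
import statistics

def layer_norm(x, alpha=1.0, bias=0.0, eps=1e-6):
    """Normalize one feature vector, then scale by alpha and shift by bias."""
    mean = statistics.mean(x)
    std = statistics.stdev(x)  # sample std (n - 1 in the denominator)
    return [alpha * (v - mean) / (std + eps) + bias for v in x]

features = [2.0, 4.0, 6.0, 8.0]
normalized = layer_norm(features)

print(normalized)
print(statistics.mean(normalized), statistics.stdev(normalized))
```

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">With the default <code>alpha = 1</code> and <code>bias = 0</code>, the printed mean is approximately 0 and the standard deviation is slightly below 1 because of <code>eps</code>.</p>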

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `LayerNormalization` class inheriting from `nn.Module`
# - Initialize with `eps` (small value to prevent division by zero)
# - Define trainable parameters:
#   1. `alpha`: Scaling factor initialized to 1
#   2. `bias`: Offset initialized to 0

# - In `forward`, perform layer normalization:
#   1. Compute mean and standard deviation along the last dimension
#   2. Normalize the input using the computed mean and std
#   3. Scale and shift using `alpha` and `bias`

class LayerNormalization(nn.Module):
    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias

######################  TODO  ########################
######################  TODO  ########################


## Part 4: Feed Forward Network
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the fully connected feed-forward network, we apply two linear transformations with a ReLU activation in between. We can mathematically represent this operation as:</p>

<p style="
    margin-bottom: 5;
    font-size: 22px;
    font-weight: 300;
    font-family: 'Helvetica Neue', sans-serif;
    color: #000000;
  ">
    \begin{equation}
    \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
    \end{equation}
</p>


<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">$W_1$ and $W_2$ are the weights, while $b_1$ and $b_2$ are the biases of the two linear transformations.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>FeedForwardBlock</code> below, we will define the two linear transformations, <code>self.linear_1</code> and <code>self.linear_2</code>, and the inner-layer dimension <code>d_ff</code>. The input data will first pass through the <code>self.linear_1</code> transformation, which increases its dimensionality from <code>d_model</code> to <code>d_ff</code>. The output of this operation passes through the ReLU activation function, which introduces non-linearity so the network can learn more complex patterns, and the <code>self.dropout</code> layer is then applied to mitigate overfitting. Finally, the <code>self.linear_2</code> transformation maps the result back to the original <code>d_model</code> dimension.</p>
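<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">To make the formula concrete, here is a small worked example in plain Python with <code>d_model = 2</code> and <code>d_ff = 3</code>. The weights and biases are invented for the illustration, and dropout is left out, as it would be at inference time.</p>

```python
def linear(x, weight, bias):
    """Compute x @ weight + bias for one vector x; weight has shape (in_dim, out_dim)."""
    return [
        sum(xi * row[j] for xi, row in zip(x, weight)) + bias[j]
        for j in range(len(bias))
    ]

def relu(x):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in x]

# Made-up parameters: W1 is (2, 3), W2 is (3, 2).
w1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.5]
w2 = [[2.0, 1.0],
      [0.0, 3.0],
      [1.0, 1.0]]
b2 = [0.5, -0.5]

x = [1.0, -2.0]
hidden = relu(linear(x, w1, b1))   # expand to d_ff and drop negative values
output = linear(hidden, w2, b2)    # project back to d_model

print(hidden)   # [1.0, 0.0, 0.0]
print(output)   # [2.5, 0.5]
```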

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `FeedForwardBlock` class inheriting from `nn.Module`
# - Initialize with `d_model`, `d_ff`, and `dropout`
# - Define:
#   1. `linear_1`: Linear layer projecting from `d_model` to `d_ff`
#   2. Dropout layer for regularization
#   3. `linear_2`: Linear layer projecting back from `d_ff` to `d_model`

# - In `forward`, apply the following steps:
#   1. Pass input through `linear_1` followed by ReLU activation
#   2. Apply dropout
#   3. Pass through `linear_2` to return to original dimensions

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model, d_ff, dropout):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        x = self.linear_1(x)
        x = nn.functional.relu(x)
        x = self.dropout(x)
        x = self.linear_2(x)
        return x

######################  TODO  ########################
######################  TODO  ########################


## Part 5: Multi Head Attention
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The Multi-Head Attention is the most crucial component of the Transformer. It is responsible for helping the model to understand complex relationships and patterns in the data.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The image below displays how the Multi-Head Attention works. It doesn't include <code>batch</code> dimension because it only illustrates the process for one single sentence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The Multi-Head Attention block receives the input data split into queries, keys, and values organized into matrices $Q$, $K$, and $V$. Each matrix contains different facets of the input, and they have the same dimensions as the input.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We then linearly transform each matrix by its respective weight matrix $W^Q$, $W^K$, or $W^V$. These transformations result in new matrices $Q'$, $K'$, and $V'$, which are split into smaller matrices, one for each of the $h$ heads, allowing the model to attend to information from different representation subspaces in parallel. This split creates a separate set of queries, keys, and values for each head.</p>

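<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Each head $i$ then computes scaled dot-product attention, $\text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$, on its own slice of the queries, keys, and values. Finally, we concatenate the heads into a matrix $H$, which is then transformed by another weight matrix $W^O$ to produce the multi-head attention output, a matrix $\text{MH-A}$ that retains the input dimensionality.</p>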
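<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The core of each head is scaled dot-product attention. The sketch below implements it for a single head with plain Python lists; the toy inputs are arbitrary, and it assumes every row of the mask keeps at least one position, since a fully masked row would make the softmax undefined.</p>

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V for one head; q, k, v are lists of row vectors."""
    d_k = len(q[0])
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        for qi in q
    ]
    if mask is not None:
        # Positions where the mask is 0 get -inf, so their softmax weight is 0.
        scores = [
            [s if keep else float("-inf") for s, keep in zip(row, mask_row)]
            for row, mask_row in zip(scores, mask)
        ]
    weights = [softmax(row) for row in scores]
    output = [
        [sum(w * vj[c] for w, vj in zip(w_row, v)) for c in range(len(v[0]))]
        for w_row in weights
    ]
    return output, weights

# Three tokens with d_k = 2, reused as queries, keys, and values (self-attention).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
output, weights = attention(tokens, tokens, tokens, mask=causal)

print(weights[0])  # the first token can only attend to itself
print(output[0])
```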

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `MultiHeadAttentionBlock` class inheriting from `nn.Module`
# - Initialize with `d_model` (model dimensions), `h` (number of heads), and `dropout`:
#   1. Assert `d_model` is divisible by `h`
#   2. Define `d_k` as dimensions per head
#   3. Create weight matrices (`w_q`, `w_k`, `w_v`, `w_o`) for query, key, value, and output
#   4. Add a dropout layer for regularization

# - Implement a static `attention` method to:
#   1. Compute scaled dot-product attention
#   2. Apply mask if provided
#   3. Apply softmax and dropout
#   4. Return weighted values and attention scores

# - In `forward`, perform:
#   1. Linear transformation of input into query, key, and value
#   2. Split into `h` heads and rearrange dimensions
#   3. Compute attention output and scores using `attention`
#   4. Combine heads and apply output weight matrix

class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model, h, dropout):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(q, k, v, mask=None, dropout=None):
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        if dropout is not None:
            attn = dropout(attn)
        output = torch.matmul(attn, v)
        return output, attn

    def forward(self, x_q, x_k, x_v, mask=None):
        bsz = x_q.size(0)
        q = self.w_q(x_q).view(bsz, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(x_k).view(bsz, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(x_v).view(bsz, -1, self.h, self.d_k).transpose(1, 2)
        output, attn = self.attention(q, k, v, mask, self.dropout)
        output = output.transpose(1, 2).contiguous().view(bsz, -1, self.h * self.d_k)
        output = self.w_o(output)
        return output, attn

######################  TODO  ########################
######################  TODO  ########################


## Part 6: Residual Connection
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">When we look at the architecture of the Transformer, we see that each sub-layer, including the <i>self-attention</i> and <i>Feed Forward</i> blocks, has its output added to its original input in the <i>Add &amp; Norm</i> layer. This is known as a skip (or residual) connection, which allows the Transformer to train deep networks more effectively by providing a shortcut for the gradient to flow through during backpropagation. Note that the steps listed in the code cell below apply the normalization to the input before the sub-layer, rather than after the addition.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>ResidualConnection</code> class below is responsible for this process.</p>
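<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">A small plain-Python sketch of the shortcut idea, using the same ordering as the <code>ResidualConnection</code> steps (normalize, apply the sub-layer, add back the input). Dropout is omitted, and the two toy sub-layers are hypothetical stand-ins for attention or feed-forward blocks.</p>

```python
import statistics

def layer_norm(x, eps=1e-6):
    """Normalize one vector to roughly zero mean and unit standard deviation."""
    mean = statistics.mean(x)
    std = statistics.stdev(x)
    return [(v - mean) / (std + eps) for v in x]

def residual(x, sublayer):
    """Compute x + sublayer(norm(x)); dropout is left out of this sketch."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def zero_sublayer(h):
    """A sub-layer that contributes nothing."""
    return [0.0 for _ in h]

def halving_sublayer(h):
    """A sub-layer that returns half of its (normalized) input."""
    return [0.5 * v for v in h]

x = [1.0, 2.0, 4.0]
unchanged = residual(x, zero_sublayer)     # the shortcut passes x through untouched
updated = residual(x, halving_sublayer)    # the sub-layer only adds a correction to x

print(unchanged)
print(updated)
```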

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `ResidualConnection` class inheriting from `nn.Module`
# - Initialize with `dropout`:
#   1. Add a dropout layer for regularization
#   2. Include a layer normalization instance

# - In `forward`:
#   1. Normalize the input using the normalization layer
#   2. Pass the normalized input through the sublayer
#   3. Apply dropout and add the result back to the original input for residual connection

class ResidualConnection(nn.Module):
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization()

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

######################  TODO  ########################
######################  TODO  ########################


## Part 7: Encoder
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will now build the encoder. We create the <code>EncoderBlock</code> class, consisting of the Multi-Head Attention and Feed Forward layers, plus the residual connections.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the original paper, the Encoder Block repeats six times. We create the <code>Encoder</code> class as an assembly of multiple <code>EncoderBlock</code>s. We also add layer normalization as a final step after processing the input through all its blocks.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create an `EncoderBlock` class inheriting from `nn.Module`
# - Initialize with:
#   1. `self_attention_block`: Multi-head attention block
#   2. `feed_forward_block`: Feed-forward block
#   3. `dropout`: Dropout rate for residual connections
# - Define two residual connections for:
#   1. Self-attention block
#   2. Feed-forward block

# - In `forward`:
#   1. Apply the first residual connection with the self-attention block
#   2. Apply the second residual connection with the feed-forward block
#   3. Return the updated tensor after both layers

class EncoderBlock(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout):
        super().__init__()
        self.self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        self.residual_self_attention = ResidualConnection(dropout)
        self.feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
        self.residual_feed_forward = ResidualConnection(dropout)

    def forward(self, x, src_mask=None):
        x = self.residual_self_attention(x, lambda _x: self.self_attention_block(_x, _x, _x, src_mask)[0])
        x = self.residual_feed_forward(x, self.feed_forward_block)
        return x

######################  TODO  ########################
######################  TODO  ########################


In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create an `Encoder` class inheriting from `nn.Module`
# - Initialize with:
#   1. `layers`: A list of `EncoderBlock` instances
#   2. A layer normalization instance for output normalization

# - In `forward`:
#   1. Pass the input tensor `x` through each `EncoderBlock` in `self.layers`
#   2. Apply the mask during each block's forward pass
#   3. Normalize the final output and return it

class Encoder(nn.Module):
    def __init__(self, d_model, h, d_ff, N, dropout):
        super().__init__()
        self.layers = nn.ModuleList([EncoderBlock(d_model, h, d_ff, dropout) for _ in range(N)])
        self.norm = LayerNormalization()

    def forward(self, x, src_mask=None):
        for layer in self.layers:
            x = layer(x, src_mask)
        return self.norm(x)

######################  TODO  ########################
######################  TODO  ########################


## Part 8: Decoder
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Similarly, the Decoder also consists of several DecoderBlocks that repeat six times in the original paper. The main difference is that it has an additional sub-layer that performs multi-head attention with a <i>cross-attention</i> component that uses the output of the Encoder as its keys and values while using the Decoder's input as queries.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">For the Output Embedding, we can use the same <code>InputEmbeddings</code> class we use for the Encoder. You can also notice that the self-attention sub-layer is <i>masked</i>, which restricts the model from accessing future elements in the sequence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will start by building the <code>DecoderBlock</code> class, and then we will build the <code>Decoder</code> class, which will assemble multiple <code>DecoderBlock</code>s.</p>
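<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The masking of future positions can be pictured as a lower-triangular matrix. The sketch below builds one with plain Python; the helper name <code>causal_mask</code> is hypothetical. Entries equal to 0 correspond to the positions that the attention code fills with $-\infty$ before the softmax.</p>

```python
def causal_mask(size):
    """Row i has 1 for positions j <= i (visible) and 0 for future positions."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]

mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In PyTorch, an equivalent mask could be built with <code>torch.tril(torch.ones(size, size))</code>.</p>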

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `DecoderBlock` class inheriting from `nn.Module`
# - Initialize with:
#   1. `self_attention_block`: Multi-head self-attention block
#   2. `cross_attention_block`: Multi-head cross-attention block
#   3. `feed_forward_block`: Feed-forward block
#   4. `dropout`: Dropout rate
# - Define three residual connections for:
#   1. Self-attention block
#   2. Cross-attention block
#   3. Feed-forward block

# - In `forward`:
#   1. Apply the self-attention block with target mask and residual connection
#   2. Apply the cross-attention block with source mask and residual connection
#   3. Apply the feed-forward block with residual connection
#   4. Return the updated tensor

class DecoderBlock(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout):
        super().__init__()
        self.self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        self.residual_self_attention = ResidualConnection(dropout)
        self.cross_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        self.residual_cross_attention = ResidualConnection(dropout)
        self.feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
        self.residual_feed_forward = ResidualConnection(dropout)

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        x = self.residual_self_attention(x, lambda _x: self.self_attention_block(_x, _x, _x, tgt_mask)[0])
        x = self.residual_cross_attention(x, lambda _x: self.cross_attention_block(_x, encoder_output, encoder_output, src_mask)[0])
        x = self.residual_feed_forward(x, self.feed_forward_block)
        return x

######################  TODO  ########################
######################  TODO  ########################


In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `Decoder` class inheriting from `nn.Module`
# - Initialize with:
#   1. `layers`: A list of `DecoderBlock` instances
#   2. A layer normalization instance for the final output

# - In `forward`:
#   1. Pass the input tensor `x` through each `DecoderBlock` in `self.layers`
#   2. Provide `encoder_output`, `src_mask`, and `tgt_mask` to each block
#   3. Normalize the final output using the layer normalization
#   4. Return the normalized output

class Decoder(nn.Module):
    def __init__(self, d_model, h, d_ff, N, dropout):
        super().__init__()
        self.layers = nn.ModuleList([DecoderBlock(d_model, h, d_ff, dropout) for _ in range(N)])
        self.norm = LayerNormalization()

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return self.norm(x)

######################  TODO  ########################
######################  TODO  ########################


<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">You can see in the Decoder image that after running a stack of <code>DecoderBlock</code>s, we have a Linear layer and a Softmax function that output probabilities. The <code>ProjectionLayer</code> class below is responsible for converting the output of the model into a log-probability distribution over the <i>vocabulary</i> (it applies log-softmax rather than softmax), from which we select each output token.</p>
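<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The following plain-Python sketch shows what the log-softmax step does to one vector of scores. The four-token vocabulary and the logits are invented for the example; in the model, the logits would come from the linear layer.</p>

```python
import math

def log_softmax(logits):
    """Log of the softmax, computed stably by subtracting the maximum first."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

vocab = ["[PAD]", "hello", "world", "[EOS]"]   # hypothetical vocabulary
logits = [0.5, 2.0, 1.0, -1.0]                 # hypothetical linear-layer output

log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]     # exponentiate to recover probabilities

# Greedy choice: the token with the highest log-probability.
predicted = vocab[max(range(len(vocab)), key=lambda i: log_probs[i])]

print(sum(probs))   # approximately 1
print(predicted)    # hello
```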

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `ProjectionLayer` class inheriting from `nn.Module`
# - Initialize with:
#   1. `d_model`: Dimension of the model
#   2. `vocab_size`: Size of the output vocabulary
# - Define a linear layer to project from `d_model` to `vocab_size`

# - In `forward`:
#   1. Pass the input through the linear layer
#   2. Apply log Softmax along the last dimension
#   3. Return the log probabilities

class ProjectionLayer(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.linear = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        x = self.linear(x)
        return nn.functional.log_softmax(x, dim=-1)

######################  TODO  ########################
######################  TODO  ########################


## Part 9: Building the Transformer

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We finally have every component of the Transformer architecture ready. We may now construct the Transformer by putting it all together.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>Transformer</code> class below, we will bring together all the components of the model's architecture.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `Transformer` class inheriting from `nn.Module`
# - Initialize with:
#   1. `encoder`: Encoder module
#   2. `decoder`: Decoder module
#   3. `src_embed` and `tgt_embed`: Input embeddings for source and target languages
#   4. `src_pos` and `tgt_pos`: Positional encodings for source and target languages
#   5. `projection_layer`: Linear projection layer for final output

# - Define the `encode` method:
#   1. Apply source embeddings to input
#   2. Add positional encoding
#   3. Pass through the encoder with the source mask
#   4. Return the encoded representation

# - Define the `decode` method:
#   1. Apply target embeddings to input
#   2. Add positional encoding
#   3. Pass through the decoder with encoder output, source mask, and target mask
#   4. Return the decoder's output

# - Define the `project` method:
#   1. Pass decoder output through the projection layer
#   2. Apply log Softmax to obtain probabilities

class Transformer(nn.Module):
    def __init__(self, encoder, decoder, src_embed, tgt_embed, src_pos, tgt_pos, projection_layer):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.src_pos = src_pos
        self.tgt_pos = tgt_pos
        self.projection_layer = projection_layer

    def encode(self, src, src_mask=None):
        x = self.src_embed(src)
        x = self.src_pos(x)
        x = self.encoder(x, src_mask)
        return x

    def decode(self, tgt, memory, src_mask=None, tgt_mask=None):
        x = self.tgt_embed(tgt)
        x = self.tgt_pos(x)
        x = self.decoder(x, memory, src_mask, tgt_mask)
        return x

    def project(self, x):
        return self.projection_layer(x)

######################  TODO  ########################
######################  TODO  ########################


<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The architecture is finally ready. We now define a function called <code>build_transformer</code>, in which we define the parameters and everything we need to have a fully operational Transformer model for the task of <b>machine translation</b>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will set the same parameters as in the original paper, <a href = "https://arxiv.org/pdf/1706.03762.pdf"><i>Attention Is All You Need</i></a>, where $d_{model}$ = 512, $N$ = 6, $h$ = 8, dropout rate $P_{drop}$ = 0.1, and $d_{ff}$ = 2048.</p>
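<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">As a quick sanity check on these settings, the arithmetic below derives the per-head dimension and counts the weights of one attention block and one feed-forward block. It assumes the layer shapes used in this notebook (four <code>d_model</code> × <code>d_model</code> linear layers with bias for attention, two linear layers with bias for the feed-forward block); the helper <code>linear_params</code> is hypothetical.</p>

```python
d_model, N, h, d_ff = 512, 6, 8, 2048

# Each head works on an equal slice of the model dimension.
d_k = d_model // h

def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# w_q, w_k, w_v, and w_o are all d_model x d_model.
attention_params = 4 * linear_params(d_model, d_model)

# linear_1 expands to d_ff, linear_2 projects back to d_model.
feed_forward_params = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)

print(d_k)                   # 64
print(attention_params)      # 1050624
print(feed_forward_params)   # 2099712
```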

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `build_transformer` function with parameters for:
#   1. Vocabulary sizes (`src_vocab_size`, `tgt_vocab_size`)
#   2. Sequence lengths (`src_seq_len`, `tgt_seq_len`)
#   3. Model dimensions (`d_model`, `d_ff`)
#   4. Number of layers (`N`) and heads (`h`)
#   5. Dropout rate (`dropout`)

# - Create:
#   1. Source and target embedding layers
#   2. Positional encoding layers for source and target
#   3. Encoder blocks with self-attention and feed-forward layers
#   4. Decoder blocks with self-attention, cross-attention, and feed-forward layers
#   5. Encoder and Decoder modules using the blocks
#   6. Projection layer to map decoder output to target vocabulary

# - Assemble all components into a `Transformer` instance
# - Initialize parameters with Xavier uniform initialization
# - Return the initialized Transformer

def build_transformer(src_vocab_size, tgt_vocab_size, src_seq_len, tgt_seq_len, d_model=512, d_ff=2048, N=6, h=8, dropout=0.1):
    src_embed = InputEmbeddings(d_model, src_vocab_size)
    tgt_embed = InputEmbeddings(d_model, tgt_vocab_size)
    src_pos = PositionalEncoding(d_model, src_seq_len, dropout)
    tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)
    encoder = Encoder(d_model, h, d_ff, N, dropout)
    decoder = Decoder(d_model, h, d_ff, N, dropout)
    projection_layer = ProjectionLayer(d_model, tgt_vocab_size)
    model = Transformer(encoder, decoder, src_embed, tgt_embed, src_pos, tgt_pos, projection_layer)
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

######################  TODO  ########################
######################  TODO  ########################


The model is now ready to be trained!

## Part 10: Tokenizer

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Tokenization is a crucial preprocessing step for our Transformer model. In this step, we convert raw text into a numeric format that the model can process.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">There are several tokenization strategies. We will use <i>word-level tokenization</i>, which turns each word in a sentence into a token.</p>

<center>
    <img src = "https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8d5e749c-b0bd-4496-85a1-9b4397ad935f_1400x787.jpeg" width = 800 height = 800>
<p style = "font-size: 16px;
            font-family: 'Georgia', serif;
            text-align: center;
            margin-top: 10px;">Different tokenization strategies. Source: <a href = "https://shaankhosla.substack.com/p/talking-tokenization">shaankhosla.substack.com</a>.</p>
</center>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">After tokenizing a sentence, we map each token to a unique integer ID from the vocabulary that the tokenizer builds from the training corpus. Each integer represents a specific word in the vocabulary.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Besides the words in the training corpus, Transformers use special tokens for specific purposes. These are some that we will define right away:</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>• [UNK]:</b> This token is used to identify an unknown word in the sequence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>• [PAD]:</b> Padding token to ensure that all sequences in a batch have the same length, so we pad shorter sentences with this token. We use attention masks to <i>"tell"</i> the model to ignore the padded tokens during training since they don't have any real meaning to the task.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>•  [SOS]:</b> This is a token used to signal the <i>Start of Sentence</i>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>•  [EOS]:</b> This is a token used to signal the <i>End of Sentence</i>.</p>
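
To make the vocabulary and the special tokens concrete, here is a minimal pure-Python sketch of word-level tokenization. It is a simplified stand-in and not the algorithm of the `tokenizers` library: `build_vocab` and `encode` are hypothetical helpers, and splitting is done on whitespace only.

```python
from collections import Counter

SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[SOS]", "[EOS]"]

def build_vocab(sentences, min_frequency=2):
    """Give special tokens the first IDs, then add words seen at least `min_frequency` times."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for word, freq in counts.most_common():
        if freq >= min_frequency:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Replace each whitespace-separated word by its ID, falling back to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

corpus = ["the cat sat", "the dog sat", "the bird flew"]
vocab = build_vocab(corpus)
ids = encode("the cat sat down", vocab)
print(ids)  # [4, 0, 5, 0]
```

Here "cat" appears only once in the toy corpus and "down" never appears, so both map to the `[UNK]` ID.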

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>build_tokenizer</code> function below, we ensure a tokenizer is ready to train the model. It checks if there is an existing tokenizer, and if that is not the case, it trains a new tokenizer.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `build_tokenizer` function with parameters for:
#   1. `config`: Configuration containing tokenizer file path
#   2. `ds`: Dataset to train the tokenizer
#   3. `lang`: Language for which the tokenizer is built

# - Check if the tokenizer file exists:
#   1. If not, create a new tokenizer:
#      - Initialize a word-level tokenizer with an unknown token (`[UNK]`)
#      - Set the pre-tokenizer to split text by whitespace
#      - Define a trainer with special tokens and minimum frequency
#      - Train the tokenizer on all sentences in the dataset
#      - Save the trained tokenizer to the specified file path
#   2. If the file exists, load the tokenizer from the file

# - Return the loaded or trained tokenizer

def build_tokenizer(config, ds, lang):
    tokenizer_path = Path(config['tokenizer_file'].format(lang=lang))
    if not tokenizer_path.is_file():
        tokenizer = Tokenizer(WordLevel(unk_token='[UNK]'))
        tokenizer.pre_tokenizer = Whitespace()
        trainer = WordLevelTrainer(special_tokens=['[UNK]', '[PAD]', '[SOS]', '[EOS]'], min_frequency=2)
        tokenizer.train_from_iterator(get_all_sentences(ds, lang), trainer=trainer)
        tokenizer.save(str(tokenizer_path))
    else:
        tokenizer = Tokenizer.from_file(str(tokenizer_path))
    return tokenizer

######################  TODO  ########################
######################  TODO  ########################


## Part 11: Load Dataset

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">For this task, we will use the <a href = "https://huggingface.co/datasets/opus_books">OpusBooks dataset</a>, available on 🤗 Hugging Face. This dataset consists of two features, <code>id</code> and <code>translation</code>. The <code>translation</code> feature contains pairs of sentences in different languages, such as Spanish and Portuguese, English and French, and so forth.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">I first tried translating sentences from English to Portuguese, my native tongue, but there are only 1.4k examples for this pair, so the results were not satisfying with the current model configuration. I then tried the English-French pair because of its larger number of examples (127k), but it would take too long to train with the current configuration. I finally opted to train the model on the English-Italian pair, the same one used in the <a href = "https://youtu.be/ISNdQcPhsts?si=253J39cose6IdsLv">Coding a Transformer from scratch on PyTorch, with full explanation, training and inference</a> video, as it offered a good balance between performance and training time.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We start by defining the <code>get_all_sentences</code> function to iterate over the dataset and extract the sentences for a given language. The language pair itself is defined later, in the <code>config</code> dictionary.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `get_all_sentences` function to extract sentences from a dataset
# - Accept parameters:
#   1. `ds`: The dataset containing translation pairs
#   2. `lang`: The language key to extract translations

# - Iterate through the dataset:
#   1. Access the 'translation' field of each pair
#   2. Yield the sentence corresponding to the specified language key

def get_all_sentences(ds, lang):
    for item in ds:
        yield item['translation'][lang]

######################  TODO  ########################
######################  TODO  ########################
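
The generator can be tried on a tiny in-memory dataset. In the sketch below, `toy_ds` is a hypothetical two-row list that only mimics the `id` / `translation` record layout described above; the function is repeated so the example runs on its own.

```python
def get_all_sentences(ds, lang):
    # Same generator as above, repeated so this sketch is self-contained.
    for item in ds:
        yield item['translation'][lang]

# Hypothetical records mimicking the OpusBooks layout.
toy_ds = [
    {"id": "0", "translation": {"en": "Good morning.", "it": "Buongiorno."}},
    {"id": "1", "translation": {"en": "Thank you.", "it": "Grazie."}},
]

italian = list(get_all_sentences(toy_ds, "it"))
print(italian)  # ['Buongiorno.', 'Grazie.']
```

Because the function yields one sentence at a time, the tokenizer trainer can consume the corpus without first building a full list in memory.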


<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>get_ds</code> function loads and prepares the dataset for training and validation. In this function, we build or load the tokenizers, split the dataset, and create DataLoaders so the model can iterate over the dataset in batches. It returns the two DataLoader objects plus the tokenizers for the source and target languages.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `get_ds` function to process and prepare the dataset for training
# - Load the `OpusBooks` dataset using:
#   1. Source and target languages from `config`
#   2. Train split of the dataset

# - Build or load tokenizers for source and target languages using `build_tokenizer`

# - Split the dataset into training and validation sets:
#   1. Allocate 90% for training and 10% for validation
#   2. Use `random_split` for randomized splitting

# - Process the splits using a `BilingualDataset` class:
#   1. Convert sentences to tokenized representations
#   2. Apply source and target tokenizers
#   3. Ensure sequence lengths conform to `config`

# - Compute and print the maximum sentence lengths for both source and target languages

# - Create DataLoader objects for training and validation:
#   1. Define batch sizes from `config`
#   2. Enable shuffling for training DataLoader

# - Return:
#   1. Training DataLoader
#   2. Validation DataLoader
#   3. Tokenizer for source language
#   4. Tokenizer for target language

def get_ds(config):
    dataset = load_dataset("opus_books", f"{config['lang_src']}-{config['lang_tgt']}", split="train")
    tokenizer_src = build_tokenizer(config, dataset, config['lang_src'])
    tokenizer_tgt = build_tokenizer(config, dataset, config['lang_tgt'])
    train_size = int(0.9 * len(dataset))
    val_size = len(dataset) - train_size
    train_ds_raw, val_ds_raw = random_split(dataset, [train_size, val_size])
    train_ds = BilingualDataset(train_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])
    val_ds = BilingualDataset(val_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])
    print("Max source length:", max(len(tokenizer_src.encode(item['translation'][config['lang_src']]).ids) for item in dataset))
    print("Max target length:", max(len(tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids) for item in dataset))
    train_loader = DataLoader(train_ds, batch_size=config['batch_size'], shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=1, shuffle=False)
    return train_loader, val_loader, tokenizer_src, tokenizer_tgt

######################  TODO  ########################
######################  TODO  ########################


<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We define the <code>casual_mask</code> function to create a causal mask for the self-attention mechanism of the decoder. This mask prevents each position from attending to later positions in the sequence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We start by making a square grid filled with ones, whose size is set by the <code>size</code> parameter. Calling <code>torch.triu</code> with <code>diagonal=1</code> keeps only the ones strictly above the main diagonal and sets the diagonal and everything below it to zero. The function then flips these values by comparing them with zero, which yields a boolean mask: positions on or below the diagonal become <code>True</code> (visible) and positions above it become <code>False</code> (hidden). This is crucial for models that predict the next token in a sequence, because position <i>i</i> may only attend to positions up to <i>i</i>.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `casual_mask` function to create an upper triangular mask
# - Accept `size` as the dimension of the square matrix
# - Steps:
#   1. Create a square matrix of size `size x size` filled with ones
#   2. Use `torch.triu` with `diagonal=1` to keep only the entries strictly above the main diagonal
#   3. Compare the matrix with zero to obtain a boolean mask
#   4. Return the mask, where `True` marks positions the decoder may attend to

def casual_mask(size):
    mask = torch.triu(torch.ones(size, size), diagonal=1)
    mask = (mask == 0)
    return mask

######################  TODO  ########################
######################  TODO  ########################
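
For a sequence of length 4, the resulting mask looks as follows. The function is repeated here so the sketch runs on its own; the conversion to integers is only for readable printing.

```python
import torch

def casual_mask(size):
    # Ones strictly above the main diagonal mark the "future" positions...
    mask = torch.triu(torch.ones(size, size), diagonal=1)
    # ...and comparing with 0 flips that, so True means "allowed to attend".
    return mask == 0

mask = casual_mask(4)
print(mask.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```

Row *i* describes what position *i* can see: the first token sees only itself, while the last token sees the whole prefix.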


<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>BilingualDataset</code> class processes the texts of the source and target languages in the dataset by tokenizing them and adding all the necessary special tokens. This class also ensures that the sentences fit within the maximum sequence length for both languages, truncating longer ones and padding shorter ones.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `BilingualDataset` class inheriting from `Dataset`
# - Initialize with:
#   1. `ds`: Dataset containing sentence pairs
#   2. `tokenizer_src` and `tokenizer_tgt`: Tokenizers for source and target languages
#   3. `src_lang` and `tgt_lang`: Language identifiers
#   4. `seq_len`: Maximum sequence length for tokens

# - Define special tokens (`[SOS]`, `[EOS]`, `[PAD]`) using the target tokenizer

# - Implement `__len__` to return the number of sentence pairs in the dataset

# - Implement `__getitem__` to:
#   1. Retrieve source and target texts based on the index
#   2. Tokenize source and target texts
#   3. Compute required padding for source and target tokens
#   4. Raise an error if tokenized sentences exceed `seq_len`
#   5. Build `encoder_input` by concatenating `[SOS]`, tokenized text, `[EOS]`, and padding
#   6. Build `decoder_input` by concatenating `[SOS]`, tokenized text, and padding
#   7. Build `label` by concatenating tokenized text, `[EOS]`, and padding
#   8. Ensure all tensors are of length `seq_len`

# - Return a dictionary containing:
#   1. `encoder_input`: Tensor for the encoder
#   2. `decoder_input`: Tensor for the decoder
#   3. `encoder_mask`: Mask for non-padding tokens in the encoder
#   4. `decoder_mask`: Mask for non-padding tokens in the decoder with causal masking
#   5. `label`: Expected output for training
#   6. `src_text` and `tgt_text`: Original source and target texts

class BilingualDataset(Dataset):
    def __init__(self, ds, tokenizer_src, tokenizer_tgt, src_lang, tgt_lang, seq_len):
        self.ds = ds
        self.tokenizer_src = tokenizer_src
        self.tokenizer_tgt = tokenizer_tgt
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang
        self.seq_len = seq_len
        self.sos_id = self.tokenizer_tgt.token_to_id('[SOS]')
        self.eos_id = self.tokenizer_tgt.token_to_id('[EOS]')
        self.pad_id = self.tokenizer_tgt.token_to_id('[PAD]')

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        item = self.ds[idx]
        src_text = item['translation'][self.src_lang]
        tgt_text = item['translation'][self.tgt_lang]
        src_ids = self.tokenizer_src.encode(src_text).ids
        tgt_ids = self.tokenizer_tgt.encode(tgt_text).ids

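        # Truncate over-long sentences (instead of raising an error) so that
        # the special tokens still fit within seq_len.
        if len(src_ids) + 2 > self.seq_len:
            src_ids = src_ids[: self.seq_len - 2]
        if len(tgt_ids) + 2 > self.seq_len:
            tgt_ids = tgt_ids[: self.seq_len - 2]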

        encoder_input = [self.sos_id] + src_ids + [self.eos_id]
        decoder_input = [self.sos_id] + tgt_ids
        label = tgt_ids + [self.eos_id]

        encoder_input += [self.pad_id] * (self.seq_len - len(encoder_input))
        decoder_input += [self.pad_id] * (self.seq_len - len(decoder_input))
        label += [self.pad_id] * (self.seq_len - len(label))

        encoder_input = torch.tensor(encoder_input, dtype=torch.long)
        decoder_input = torch.tensor(decoder_input, dtype=torch.long)
        label = torch.tensor(label, dtype=torch.long)

        # The padding and causal masks are built from these tensors in the training loop.
        return {
            'encoder_input': encoder_input,
            'decoder_input': decoder_input,
            'label': label,
            'src_text': src_text,
            'tgt_text': tgt_text
        }

######################  TODO  ########################
######################  TODO  ########################
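
The way `encoder_input`, `decoder_input`, and `label` are assembled is easiest to see on a toy example. The token IDs below are hypothetical, and the special-token IDs are assumed values chosen for illustration; `pad` is a helper introduced only for this sketch.

```python
SOS, EOS, PAD = 2, 3, 1          # assumed special-token IDs for this toy example
seq_len = 8
src_ids = [10, 11, 12]           # hypothetical IDs of the source sentence
tgt_ids = [20, 21, 22, 23]       # hypothetical IDs of the target sentence

def pad(ids, length, pad_id=PAD):
    """Right-pad a list of IDs up to `length`."""
    return ids + [pad_id] * (length - len(ids))

encoder_input = pad([SOS] + src_ids + [EOS], seq_len)
decoder_input = pad([SOS] + tgt_ids, seq_len)   # what the decoder reads
label = pad(tgt_ids + [EOS], seq_len)           # what it should predict, shifted by one

print(encoder_input)  # [2, 10, 11, 12, 3, 1, 1, 1]
print(decoder_input)  # [2, 20, 21, 22, 23, 1, 1, 1]
print(label)          # [20, 21, 22, 23, 3, 1, 1, 1]
```

At every position, `label` holds the token that follows the corresponding `decoder_input` token, which is what lets the decoder be trained on all positions in parallel.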


## Part 12: Validation Loop

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will now create two functions for the validation loop. The validation loop is crucial to evaluate model performance in translating sentences from data it has not seen during training.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The first function, <code>greedy_decode</code>, produces the model's output by repeatedly picking the most probable next token. The second function, <code>run_validation</code>, runs the validation process, in which we decode the model's output and print it next to the reference text for the target sentence.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `greedy_decode` function to generate the most probable sequence using a trained model
# - Accept parameters:
#   1. `model`: Trained Transformer model
#   2. `source`: Source input sequence
#   3. `source_mask`: Mask for the source sequence
#   4. `tokenizer_src` and `tokenizer_tgt`: Tokenizers for source and target languages
#   5. `max_len`: Maximum sequence length for the output
#   6. `device`: Device to run the computation (e.g., CPU or GPU)

# - Steps:
#   1. Retrieve indices for `[SOS]` and `[EOS]` tokens from the target tokenizer
#   2. Compute encoder output for the source sequence
#   3. Initialize decoder input with `[SOS]`
#   4. Loop until `max_len` is reached or `[EOS]` is generated:
#      - Create a causal mask for the decoder input
#      - Compute decoder output using encoder output and masks
#      - Apply the projection layer to get probabilities for the next token
#      - Select the token with the highest probability and append it to the decoder input
#      - Break the loop if `[EOS]` is generated
#   5. Return the generated sequence of tokens

def greedy_decode(model, source, source_mask, tokenizer_src, tokenizer_tgt, max_len, device):
    sos_id = tokenizer_tgt.token_to_id('[SOS]')
    eos_id = tokenizer_tgt.token_to_id('[EOS]')
    enc_output = model.encode(source.to(device), source_mask.to(device))
    ys = torch.tensor([[sos_id]], dtype=torch.long, device=device)
    for _ in range(max_len):
        tgt_mask = (ys != tokenizer_tgt.token_to_id('[PAD]')).unsqueeze(1).unsqueeze(2)
        size = ys.size(1)
        no_look = casual_mask(size).to(device)
        tgt_mask = tgt_mask & no_look
        dec_output = model.decode(ys, enc_output, source_mask.to(device), tgt_mask)
        prob = model.project(dec_output[:, -1])
        next_word = torch.argmax(prob, dim=-1)
        ys = torch.cat([ys, next_word.unsqueeze(0)], dim=1)
        if next_word.item() == eos_id:
            break
    return ys[0].cpu().numpy()

######################  TODO  ########################
######################  TODO  ########################
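
The control flow of greedy decoding can be isolated from the Transformer itself. In the sketch below, `toy_next_token_scores` is a hypothetical stand-in for the model that follows a fixed rule over a five-token vocabulary; only the loop structure mirrors `greedy_decode`.

```python
def toy_next_token_scores(prefix):
    """Hypothetical stand-in for the model: scores over a 5-token vocabulary."""
    SOS, EOS = 0, 1
    # Deterministic rule: after [SOS] prefer token 2, then 3, then 4, then [EOS].
    preferred = {SOS: 2, 2: 3, 3: 4, 4: EOS}[prefix[-1]]
    return [1.0 if tok == preferred else 0.0 for tok in range(5)]

def greedy_decode_toy(score_fn, sos_id=0, eos_id=1, max_len=10):
    ys = [sos_id]                       # start from [SOS]
    for _ in range(max_len):
        scores = score_fn(ys)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ys.append(next_id)
        if next_id == eos_id:           # stop as soon as [EOS] is produced
            break
    return ys

decoded = greedy_decode_toy(toy_next_token_scores)
print(decoded)  # [0, 2, 3, 4, 1]
```

Greedy decoding commits to the single best token at each step and never revisits that choice, which keeps it simple and fast compared with search strategies such as beam search.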


In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `run_validation` function to evaluate the model on the validation dataset
# - Accept parameters:
#   1. `model`: Trained Transformer model
#   2. `validation_ds`: Validation dataset
#   3. `tokenizer_src` and `tokenizer_tgt`: Tokenizers for source and target languages
#   4. `max_len`: Maximum sequence length for decoding
#   5. `device`: Device to run the computation (e.g., CPU or GPU)
#   6. `print_msg`: Function for displaying output messages
#   7. `global_state`: Optional global state for tracking progress
#   8. `writer`: Optional logging writer (e.g., TensorBoard)
#   9. `num_examples`: Number of examples to process per run (default: 2)

# - Steps:
#   1. Set the model to evaluation mode
#   2. Initialize a counter to track the number of processed examples
#   3. Define a fixed console width for printed messages
#   4. Iterate through the validation dataset:
#      - Retrieve `encoder_input` and `encoder_mask` and move them to the specified device
#      - Ensure batch size is 1 for validation
#      - Use `greedy_decode` to generate the model's predictions
#      - Decode the model's output into human-readable text
#      - Print source, target, and predicted text using `print_msg`
#      - Break the loop after processing the specified number of examples (`num_examples`)
#   5. Ensure no gradients are computed during evaluation

def run_validation(model, validation_ds, tokenizer_src, tokenizer_tgt, max_len, device, print_msg, global_state=None, writer=None, num_examples=2):
    model.eval()
    pad_id_src = tokenizer_src.token_to_id('[PAD]')
    count = 0
    console_width = 80
    with torch.no_grad():
        for batch in validation_ds:
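            # The DataLoader already yields a (1, seq_len) batch, so no extra dimension is added.
            encoder_input = batch['encoder_input'].to(device)
            assert encoder_input.size(0) == 1, "Batch size must be 1 for validation"
            encoder_mask = (encoder_input != pad_id_src).unsqueeze(1).unsqueeze(2)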
            pred_tokens = greedy_decode(model, encoder_input, encoder_mask, tokenizer_src, tokenizer_tgt, max_len, device)
            pred_text = tokenizer_tgt.decode(pred_tokens.tolist())
            src_text = batch['src_text']
            tgt_text = batch['tgt_text']
            print_msg(f"SRC: {src_text}".ljust(console_width))
            print_msg(f"TGT: {tgt_text}".ljust(console_width))
            print_msg(f"PRD: {pred_text}".ljust(console_width))

            count += 1
            if count >= num_examples:
                break

######################  TODO  ########################
######################  TODO  ########################


## Part 13: Training Loop

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We are ready to train our Transformer model on the OpusBooks dataset for the English-to-Italian translation task.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We start by defining the <code>get_model</code> function, which creates the model by calling the <code>build_transformer</code> function we defined previously. This function uses the <code>config</code> dictionary to set a few parameters.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `get_model` function to initialize a Transformer model
# - Accept parameters:
#   1. `config`: Configuration dictionary with model settings
#   2. `vocab_src_len`: Length of the source language vocabulary
#   3. `vocab_tgt_len`: Length of the target language vocabulary

# - Use the `build_transformer` function to:
#   1. Create a Transformer model
#   2. Pass the source and target vocabulary lengths
#   3. Set sequence length (`seq_len`) and embedding dimensionality (`d_model`) from `config`

# - Return the initialized model

def get_model(config, vocab_src_len, vocab_tgt_len):
    model = build_transformer(
        src_vocab_size=vocab_src_len,
        tgt_vocab_size=vocab_tgt_len,
        src_seq_len=config['seq_len'],
        tgt_seq_len=config['seq_len'],
        d_model=config['d_model']
    )
    return model

######################  TODO  ########################
######################  TODO  ########################


<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">I have mentioned the <code>config</code> dictionary several times throughout this notebook. Now, it is time to create it.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the following cell, we will define two functions to configure our model and the training process.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>get_config</code> function, we define crucial parameters for the training process: <code>batch_size</code> is the number of training examples used in one iteration, <code>num_epochs</code> is the number of times the entire dataset is passed forward and backward through the Transformer, <code>lr</code> is the learning rate for the optimizer, and so on. We also define the language pair from the OpusBooks dataset: <code>'lang_src': 'en'</code> selects English as the source language and <code>'lang_tgt': 'it'</code> selects Italian as the target language.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>get_weights_file_path</code> function constructs the file path for saving or loading model weights for any specific epoch.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `get_config` function to return a dictionary of settings for building and training the Transformer model:
#   1. `batch_size`: Number of samples per training batch
#   2. `num_epochs`: Total training epochs
#   3. `lr`: Learning rate for optimization
#   4. `seq_len`: Maximum sequence length for tokens
#   5. `d_model`: Dimensionality of embeddings (e.g., 512)
#   6. `lang_src` and `lang_tgt`: Source and target languages
#   7. `model_folder`: Folder to save model weights
#   8. `model_basename`: Base name for model files
#   9. `preload`: Option to preload a model (default: None)
#   10. `tokenizer_file`: Filename pattern for saving tokenizers
#   11. `experiment_name`: Name of the experiment for logging

# - Define `get_weights_file_path` to construct a file path for saving/retrieving model weights:
#   1. Accept `config` dictionary and `epoch` string as parameters
#   2. Retrieve `model_folder` and `model_basename` from `config`
#   3. Construct the filename with the base name and epoch
#   4. Combine the current directory, model folder, and filename to return the full path

def get_config():
    return {
        'batch_size': 32,
        'num_epochs': 30,
        'lr': 5e-4,
        'seq_len': 64,
        'd_model': 512,
        'lang_src': 'en',
        'lang_tgt': 'it',
        'model_folder': 'checkpoints',
        'model_basename': 'transformer',
        'preload': None,
        'tokenizer_file': 'tokenizer_{lang}.json',
        'experiment_name': 'en-it'
    }

def get_weights_file_path(config, epoch):
    model_folder = Path(config['model_folder'])
    model_folder.mkdir(parents=True, exist_ok=True)
    filename = f"{config['model_basename']}_{epoch}.pth"
    return str(model_folder / filename)

######################  TODO  ########################
######################  TODO  ########################


<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Finally, we define our last function, <code>train_model</code>, which takes the <code>config</code> dictionary as input.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In this function, we will set everything up for the training. We will load the model and its necessary components onto the GPU for faster training, set the <code>Adam</code> optimizer, and configure the <code>CrossEntropyLoss</code> function to compute the differences between the translations output by the model and the reference translations from the dataset. </p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Every loop necessary for iterating over the training batches, performing backpropagation, and computing the gradients is in this function. We will also use it to run the validation function and save the current state of the model.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `train_model` function to train a Transformer model
# - Steps:
#   1. Set up the device (GPU or CPU) for training
#   2. Create a directory to store model weights
#   3. Retrieve dataloaders and tokenizers for source and target languages using `get_ds`
#   4. Initialize the Transformer model using `get_model` and move it to the specified device
#   5. Set up TensorBoard for logging training metrics
#   6. Configure the Adam optimizer with learning rate and epsilon from `config`
#   7. If a pre-trained model exists:
#      - Load the model, optimizer state, and global step
#      - Set the starting epoch for resuming training
#   8. Define a cross-entropy loss function:
#      - Ignore padding tokens
#      - Apply label smoothing to prevent overfitting
#   9. Start training loop:
#      - Iterate over epochs from the initial epoch to `config['num_epochs']`
#      - For each batch in the training dataloader:
#         - Set model to training mode
#         - Move input data, masks, and labels to the device
#         - Pass data through the encoder, decoder, and projection layer
#         - Compute loss between model predictions and labels
#         - Log training loss to TensorBoard
#         - Perform backpropagation and update model parameters
#         - Clear gradients for the next batch
#         - Increment global step counter
#      - After each epoch, run validation using `run_validation`
#      - Save the current model state, optimizer state, and global step
#   10. Save model weights after each epoch

def train_model(config):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    Path(config['model_folder']).mkdir(parents=True, exist_ok=True)
    train_loader, val_loader, tokenizer_src, tokenizer_tgt = get_ds(config)
    model = get_model(config, vocab_src_len=tokenizer_src.get_vocab_size(), vocab_tgt_len=tokenizer_tgt.get_vocab_size()).to(device)
    writer = SummaryWriter(log_dir=f"runs/{config['experiment_name']}")
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'], eps=1e-9)
    global_step = 0
    start_epoch = 0

    if config['preload']:
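        # map_location lets a checkpoint saved on GPU be restored on the current device.
        checkpoint = torch.load(config['preload'], map_location=device)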
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        global_step = checkpoint['global_step']
        start_epoch = checkpoint['epoch']

    pad_id = tokenizer_tgt.token_to_id('[PAD]')
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id, label_smoothing=0.1)

    for epoch in range(start_epoch, config['num_epochs']):
        model.train()
        total_loss = 0.0
        total_correct = 0
        total_tokens = 0

        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{config['num_epochs']}"):
            encoder_input = batch['encoder_input'].to(device)
            decoder_input = batch['decoder_input'].to(device)
            label = batch['label'].to(device)

            encoder_mask = (encoder_input != pad_id).unsqueeze(1).unsqueeze(2)
            seq_len = decoder_input.size(1)
            dec_pad_mask = (decoder_input != pad_id).unsqueeze(1).unsqueeze(2)
            no_look = casual_mask(seq_len).to(device).unsqueeze(0).unsqueeze(1)
            decoder_mask = dec_pad_mask & no_look

            optimizer.zero_grad()
            enc_output = model.encode(encoder_input, encoder_mask)
            dec_output = model.decode(decoder_input, enc_output, encoder_mask, decoder_mask)
            logits = model.project(dec_output)

            loss = criterion(logits.reshape(-1, logits.size(-1)), label.reshape(-1))
            loss.backward()
            optimizer.step()

            with torch.no_grad():
                preds = logits.argmax(dim=-1)
                mask = (label != pad_id)
                num_tokens = mask.sum()
                num_correct = (preds[mask] == label[mask]).sum()

            total_loss += loss.item()
            total_correct += num_correct.item()
            total_tokens += num_tokens.item()

            writer.add_scalar("Loss/train", loss.item(), global_step)
            global_step += 1

        epoch_loss = total_loss / len(train_loader)
        epoch_accuracy = (total_correct / total_tokens) if total_tokens > 0 else 0

        print(f"End of Epoch {epoch+1}/{config['num_epochs']}, "
              f"Loss: {epoch_loss:.4f}, Accuracy: {epoch_accuracy:.4f}")

        writer.add_scalar("EpochLoss/train", epoch_loss, epoch)
        writer.add_scalar("EpochAccuracy/train", epoch_accuracy, epoch)

        run_validation(
            model=model,
            validation_ds=val_loader,
            tokenizer_src=tokenizer_src,
            tokenizer_tgt=tokenizer_tgt,
            max_len=config['seq_len'],
            device=device,
            print_msg=print,
            writer=writer,
            num_examples=2
        )

        torch.save({
            'epoch': epoch+1,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'global_step': global_step
        }, get_weights_file_path(config, epoch+1))

    writer.close()
######################  TODO  ########################
######################  TODO  ########################
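
The decoder mask built in the loop above combines two conditions: a query position may attend only to non-padding tokens, and only to positions at or before itself. The following minimal sketch reproduces that logic with plain Python lists for one toy sequence. `PAD_ID`, `causal_mask_rows`, and `decoder_mask_rows` are illustrative names that do not appear in the notebook, and the sketch assumes that `casual_mask` returns the usual lower-triangular boolean mask.

```python
PAD_ID = 0  # assumed padding id, for illustration only


def causal_mask_rows(seq_len):
    """Row i is True for columns 0..i, so position i cannot look ahead."""
    return [[col <= row for col in range(seq_len)] for row in range(seq_len)]


def decoder_mask_rows(token_ids, pad_id=PAD_ID):
    """Combine the padding mask with the causal mask (the `dec_pad_mask & no_look` step)."""
    not_pad = [tok != pad_id for tok in token_ids]
    causal = causal_mask_rows(len(token_ids))
    size = len(token_ids)
    return [[causal[r][c] and not_pad[c] for c in range(size)] for r in range(size)]


if __name__ == "__main__":
    # Toy decoder input: three real tokens followed by one padding token.
    for row in decoder_mask_rows([5, 7, 9, PAD_ID]):
        print(["x" if allowed else "." for allowed in row])
```

In the tensor version, the same combination happens through broadcasting: the padding mask masks out key columns, and the causal mask restricts each query row.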


We can now train the model!

**First Try:**

I used a T4 GPU on Colab with a learning rate of 1e-4.

In [None]:
if __name__ == '__main__':
    warnings.filterwarnings('ignore')
    config = get_config()
    train_model(config)

Max source length: 309
Max target length: 274


Epoch 1/10: 100%|██████████| 910/910 [04:14<00:00,  3.58it/s]


SRC: ['Vronsky stopped and asked him straight out:']                            
TGT: ['Vronskij si fermò e chiese franco:']                                     
PRD: — Ma il mio mio mio , e non si .                                           
SRC: ['"What a beautiful room!" I exclaimed, as I looked round; for I had never before seen any half so imposing.']
TGT: ["— Che bella stanza! — esclamai, guardando intorno. — È la sala da pranzo; ho aperto la finestra per farvi entrare un po' d'aria."]
PRD: — Ma , ma , e il suo suo suo suo suo suo suo suo suo suo suo suo suo suo suo suo suo suo suo suo .


Epoch 2/10: 100%|██████████| 910/910 [04:14<00:00,  3.57it/s]


SRC: ['Vronsky stopped and asked him straight out:']                            
TGT: ['Vronskij si fermò e chiese franco:']                                     
PRD: Il signor Rochester era un ’ altra volta , ma si .                         
SRC: ['"What a beautiful room!" I exclaimed, as I looked round; for I had never before seen any half so imposing.']
TGT: ["— Che bella stanza! — esclamai, guardando intorno. — È la sala da pranzo; ho aperto la finestra per farvi entrare un po' d'aria."]
PRD: — In quel momento , che era stata un ’ altra volta , ma non si .           


Epoch 3/10: 100%|██████████| 910/910 [04:13<00:00,  3.58it/s]


SRC: ['Vronsky stopped and asked him straight out:']                            
TGT: ['Vronskij si fermò e chiese franco:']                                     
PRD: La sua vita , che si , e , a casa .                                        
SRC: ['"What a beautiful room!" I exclaimed, as I looked round; for I had never before seen any half so imposing.']
TGT: ["— Che bella stanza! — esclamai, guardando intorno. — È la sala da pranzo; ho aperto la finestra per farvi entrare un po' d'aria."]
PRD: — E il suo padre , che si , e il suo padre .                               


Epoch 4/10: 100%|██████████| 910/910 [04:14<00:00,  3.58it/s]


SRC: ['Vronsky stopped and asked him straight out:']                            
TGT: ['Vronskij si fermò e chiese franco:']                                     
PRD: Egli si , e , dopo aver detto , si mise a , e , si mise a parlare di nuovo .
SRC: ['"What a beautiful room!" I exclaimed, as I looked round; for I had never before seen any half so imposing.']
TGT: ["— Che bella stanza! — esclamai, guardando intorno. — È la sala da pranzo; ho aperto la finestra per farvi entrare un po' d'aria."]
PRD: Non c ’ era nulla di nuovo , ma il suo padre si .                          


Epoch 5/10: 100%|██████████| 910/910 [04:13<00:00,  3.58it/s]


SRC: ['Vronsky stopped and asked him straight out:']                            
TGT: ['Vronskij si fermò e chiese franco:']                                     
PRD: Egli si , e , per la sua opinione , si fermò a sé , e si mise a parlare di nuovo .
SRC: ['"What a beautiful room!" I exclaimed, as I looked round; for I had never before seen any half so imposing.']
TGT: ["— Che bella stanza! — esclamai, guardando intorno. — È la sala da pranzo; ho aperto la finestra per farvi entrare un po' d'aria."]
PRD: Non si , e , come un ’ altra volta , si , si , e la sua opinione , si .    


Epoch 6/10: 100%|██████████| 910/910 [04:13<00:00,  3.59it/s]


SRC: ['Vronsky stopped and asked him straight out:']                            
TGT: ['Vronskij si fermò e chiese franco:']                                     
PRD: Egli si avvicinò alla mente , e , per vedere i suoi pensieri , si mise a guardare il suo posto .
SRC: ['"What a beautiful room!" I exclaimed, as I looked round; for I had never before seen any half so imposing.']
TGT: ["— Che bella stanza! — esclamai, guardando intorno. — È la sala da pranzo; ho aperto la finestra per farvi entrare un po' d'aria."]
PRD: Non c ’ era nulla di straordinario , che si di nuovo , e di nuovo , per un po ’ di cui si , di nuovo a lungo , si .


Epoch 7/10: 100%|██████████| 910/910 [04:13<00:00,  3.58it/s]


SRC: ['Vronsky stopped and asked him straight out:']                            
TGT: ['Vronskij si fermò e chiese franco:']                                     
PRD: Egli si , e , per un tratto , si rivolse a sé :                            
SRC: ['"What a beautiful room!" I exclaimed, as I looked round; for I had never before seen any half so imposing.']
TGT: ["— Che bella stanza! — esclamai, guardando intorno. — È la sala da pranzo; ho aperto la finestra per farvi entrare un po' d'aria."]
PRD: Non c ’ era nulla di straordinario , che la sua opinione si , e la quale si di nuovo , si di nuovo .


Epoch 8/10: 100%|██████████| 910/910 [04:14<00:00,  3.58it/s]


SRC: ['Vronsky stopped and asked him straight out:']                            
TGT: ['Vronskij si fermò e chiese franco:']                                     
PRD: Egli si , e , per quanto si , si , si , si di nuovo a sé e di nuovo , per essere messo in modo di nuovo .
SRC: ['"What a beautiful room!" I exclaimed, as I looked round; for I had never before seen any half so imposing.']
TGT: ["— Che bella stanza! — esclamai, guardando intorno. — È la sala da pranzo; ho aperto la finestra per farvi entrare un po' d'aria."]
PRD: Non c ’ è nulla di straordinario , che un ’ altra cosa , e la di nuovo , si dalla stessa maniera di nuovo per un po ’ di cui si .


Epoch 9/10: 100%|██████████| 910/910 [04:14<00:00,  3.58it/s]


SRC: ['Vronsky stopped and asked him straight out:']                            
TGT: ['Vronskij si fermò e chiese franco:']                                     
PRD: Si sentiva che , se ne fosse stato , si era messo a guardare , e si sentiva in modo di nuovo a disagio .
SRC: ['"What a beautiful room!" I exclaimed, as I looked round; for I had never before seen any half so imposing.']
TGT: ["— Che bella stanza! — esclamai, guardando intorno. — È la sala da pranzo; ho aperto la finestra per farvi entrare un po' d'aria."]
PRD: Non c ’ era nulla di straordinario , che si di nuovo il suo posto , e il quale si trovava in una volta , si era messo a ridere di nuovo verso la sua statura , che non si trovava più nulla .


Epoch 10/10: 100%|██████████| 910/910 [04:15<00:00,  3.57it/s]


SRC: ['Vronsky stopped and asked him straight out:']                            
TGT: ['Vronskij si fermò e chiese franco:']                                     
PRD: Si accorse che , per lui , si avvicinava a sé , si era fatto di nuovo , e si mise a guardare il suo sguardo .
SRC: ['"What a beautiful room!" I exclaimed, as I looked round; for I had never before seen any half so imposing.']
TGT: ["— Che bella stanza! — esclamai, guardando intorno. — È la sala da pranzo; ho aperto la finestra per farvi entrare un po' d'aria."]
PRD: Non c ’ era nulla di straordinario , che la sua fantasia era stata in mente , si era messa a guardare dalla sua maniera di , e di nuovo nel momento in cui si era le cose .


**Second Try:**

I used two T4 GPUs (T4 x2) on Kaggle with a learning rate of 5e-4.

The run was terminated by a Net2.Sharif disconnection.
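
Because each epoch ends with a `torch.save` call, an interrupted run like this one can in principle continue from the last saved epoch. Below is a minimal sketch of locating that file. It assumes checkpoint names end in `_<epoch>.pt`; the actual naming produced by `get_weights_file_path` is not shown here, so both the pattern and the `latest_checkpoint` helper are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(folder, pattern=r"_(\d+)\.pt$"):
    """Return the checkpoint path with the highest epoch number, or None if there is none."""
    best_path, best_epoch = None, -1
    for path in Path(folder).glob("*.pt"):
        match = re.search(pattern, path.name)
        # Compare epochs numerically so that epoch 10 ranks above epoch 9.
        if match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = path, int(match.group(1))
    return best_path


if __name__ == "__main__":
    # Demonstration with empty placeholder files instead of real checkpoints.
    with tempfile.TemporaryDirectory() as folder:
        for epoch in (2, 9, 10):
            (Path(folder) / f"tmodel_{epoch}.pt").touch()
        print(latest_checkpoint(folder).name)
```

The returned path could then be passed to `torch.load`, and the stored `epoch`, `model_state_dict`, `optimizer_state_dict`, and `global_step` restored before training continues.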

In [None]:
if __name__ == '__main__':
    warnings.filterwarnings('ignore')
    config = get_config()
    train_model(config)

Max source length: 309
Max target length: 274


Epoch 1/30: 100%|██████████| 910/910 [04:28<00:00,  3.39it/s]


End of Epoch 1/30, Loss: 6.3059, Accuracy: 0.1617
SRC: ['And when out shooting, while he did not seem to be thinking at all, he again and again thought about the old peasant and his family, and felt as if the impression made on him called not only for his attention, but for the solution of some problem related thereto.']
TGT: ['E a caccia, quando gli pareva di non pensare a nulla, che è che non è, di nuovo gli tornava in mente il vecchio con la sua famiglia, e la viva impressione che gli era rimasta sembrava esigesse attenzione non solo per se stessa, ma anche perché gli pareva si collegasse alla soluzione di un qualche cosa.']
PRD: — Ma non si , e non si , e non si .                                        
SRC: ["'You must take me too!"]                                                 
TGT: ['— Prendete anche me con voi.']                                           
PRD: — Ma non si , ma non si .                                                  


Epoch 2/30: 100%|██████████| 910/910 [04:37<00:00,  3.28it/s]


End of Epoch 2/30, Loss: 5.6614, Accuracy: 0.2065
SRC: ['And when out shooting, while he did not seem to be thinking at all, he again and again thought about the old peasant and his family, and felt as if the impression made on him called not only for his attention, but for the solution of some problem related thereto.']
TGT: ['E a caccia, quando gli pareva di non pensare a nulla, che è che non è, di nuovo gli tornava in mente il vecchio con la sua famiglia, e la viva impressione che gli era rimasta sembrava esigesse attenzione non solo per se stessa, ma anche perché gli pareva si collegasse alla soluzione di un qualche cosa.']
PRD: — Ma non c ’ era un , ma non c ’ era un po ’ di .                          
SRC: ["'You must take me too!"]                                                 
TGT: ['— Prendete anche me con voi.']                                           
PRD: — Ma non c ’ è nulla di , ma non c ’ è un di .                             


Epoch 3/30: 100%|██████████| 910/910 [04:37<00:00,  3.28it/s]


End of Epoch 3/30, Loss: 5.3708, Accuracy: 0.2294
SRC: ['And when out shooting, while he did not seem to be thinking at all, he again and again thought about the old peasant and his family, and felt as if the impression made on him called not only for his attention, but for the solution of some problem related thereto.']
TGT: ['E a caccia, quando gli pareva di non pensare a nulla, che è che non è, di nuovo gli tornava in mente il vecchio con la sua famiglia, e la viva impressione che gli era rimasta sembrava esigesse attenzione non solo per se stessa, ma anche perché gli pareva si collegasse alla soluzione di un qualche cosa.']
PRD: — Ma , come , come , come se ne , e , la sua vita .                        
SRC: ["'You must take me too!"]                                                 
TGT: ['— Prendete anche me con voi.']                                           
PRD: — Non c ’ è nulla di , ma non è nulla di .                                 


Epoch 4/30: 100%|██████████| 910/910 [04:37<00:00,  3.28it/s]


End of Epoch 4/30, Loss: 5.1390, Accuracy: 0.2504
SRC: ['And when out shooting, while he did not seem to be thinking at all, he again and again thought about the old peasant and his family, and felt as if the impression made on him called not only for his attention, but for the solution of some problem related thereto.']
TGT: ['E a caccia, quando gli pareva di non pensare a nulla, che è che non è, di nuovo gli tornava in mente il vecchio con la sua famiglia, e la viva impressione che gli era rimasta sembrava esigesse attenzione non solo per se stessa, ma anche perché gli pareva si collegasse alla soluzione di un qualche cosa.']
PRD: — , , — disse , — ma , la mano , la sua .                                  
SRC: ["'You must take me too!"]                                                 
TGT: ['— Prendete anche me con voi.']                                           
PRD: — , , , — disse , e la mano .                                              


Epoch 5/30: 100%|██████████| 910/910 [04:37<00:00,  3.28it/s]


End of Epoch 5/30, Loss: 4.9171, Accuracy: 0.2723
SRC: ['And when out shooting, while he did not seem to be thinking at all, he again and again thought about the old peasant and his family, and felt as if the impression made on him called not only for his attention, but for the solution of some problem related thereto.']
TGT: ['E a caccia, quando gli pareva di non pensare a nulla, che è che non è, di nuovo gli tornava in mente il vecchio con la sua famiglia, e la viva impressione che gli era rimasta sembrava esigesse attenzione non solo per se stessa, ma anche perché gli pareva si collegasse alla soluzione di un qualche cosa.']
PRD: — Il mio , , , e a lungo , , , a questo , .                                
SRC: ["'You must take me too!"]                                                 
TGT: ['— Prendete anche me con voi.']                                           
PRD: — Il mio amico , che non ho mai fatto nulla , e che la , e la di .         


Epoch 6/30: 100%|██████████| 910/910 [04:37<00:00,  3.28it/s]


End of Epoch 6/30, Loss: 4.6885, Accuracy: 0.2954
SRC: ['And when out shooting, while he did not seem to be thinking at all, he again and again thought about the old peasant and his family, and felt as if the impression made on him called not only for his attention, but for the solution of some problem related thereto.']
TGT: ['E a caccia, quando gli pareva di non pensare a nulla, che è che non è, di nuovo gli tornava in mente il vecchio con la sua famiglia, e la viva impressione che gli era rimasta sembrava esigesse attenzione non solo per se stessa, ma anche perché gli pareva si collegasse alla soluzione di un qualche cosa.']
PRD: — Il mio , e il nostro , è vero , e la sua bellezza .                      
SRC: ["'You must take me too!"]                                                 
TGT: ['— Prendete anche me con voi.']                                           
PRD: — Il mio amico , che ho fatto , e la mia parte è stata una volta che non si può esser buona .


Epoch 7/30: 100%|██████████| 910/910 [04:37<00:00,  3.28it/s]


End of Epoch 7/30, Loss: 4.4542, Accuracy: 0.3209
SRC: ['And when out shooting, while he did not seem to be thinking at all, he again and again thought about the old peasant and his family, and felt as if the impression made on him called not only for his attention, but for the solution of some problem related thereto.']
TGT: ['E a caccia, quando gli pareva di non pensare a nulla, che è che non è, di nuovo gli tornava in mente il vecchio con la sua famiglia, e la viva impressione che gli era rimasta sembrava esigesse attenzione non solo per se stessa, ma anche perché gli pareva si collegasse alla soluzione di un qualche cosa.']
PRD: E qui , come al solito , si , si , si , e la stessa cosa .                 
SRC: ["'You must take me too!"]                                                 
TGT: ['— Prendete anche me con voi.']                                           
PRD: — In questo momento , con la sua anima , la sua vita , la quale è passata .


Epoch 8/30: 100%|██████████| 910/910 [04:37<00:00,  3.28it/s]


End of Epoch 8/30, Loss: 4.2069, Accuracy: 0.3536
SRC: ['And when out shooting, while he did not seem to be thinking at all, he again and again thought about the old peasant and his family, and felt as if the impression made on him called not only for his attention, but for the solution of some problem related thereto.']
TGT: ['E a caccia, quando gli pareva di non pensare a nulla, che è che non è, di nuovo gli tornava in mente il vecchio con la sua famiglia, e la viva impressione che gli era rimasta sembrava esigesse attenzione non solo per se stessa, ma anche perché gli pareva si collegasse alla soluzione di un qualche cosa.']
PRD: — Il mio picchiare , — continuò , — è un ’ altra cosa , e poi mi pigli : è così , che non ci sarà mai più bella e buona , e ora ne .
SRC: ["'You must take me too!"]                                                 
TGT: ['— Prendete anche me con voi.']                                           
PRD: — Il mio povero , — continuò , — e io mi , e poi mi di aver

Epoch 9/30: 100%|██████████| 910/910 [04:37<00:00,  3.28it/s]


End of Epoch 9/30, Loss: 3.9568, Accuracy: 0.3936
SRC: ['And when out shooting, while he did not seem to be thinking at all, he again and again thought about the old peasant and his family, and felt as if the impression made on him called not only for his attention, but for the solution of some problem related thereto.']
TGT: ['E a caccia, quando gli pareva di non pensare a nulla, che è che non è, di nuovo gli tornava in mente il vecchio con la sua famiglia, e la viva impressione che gli era rimasta sembrava esigesse attenzione non solo per se stessa, ma anche perché gli pareva si collegasse alla soluzione di un qualche cosa.']
PRD: Il solo argomento del lavoro si era comportato con straordinaria ansietà ; ma il resto era pronto a fuggire e si fermava a fuggire e .
SRC: ["'You must take me too!"]                                                 
TGT: ['— Prendete anche me con voi.']                                           
PRD: — La mia prima opinione era una cosa reale , e la stessa 

Epoch 10/30: 100%|██████████| 910/910 [04:37<00:00,  3.28it/s]


End of Epoch 10/30, Loss: 3.7183, Accuracy: 0.4375
SRC: ['And when out shooting, while he did not seem to be thinking at all, he again and again thought about the old peasant and his family, and felt as if the impression made on him called not only for his attention, but for the solution of some problem related thereto.']
TGT: ['E a caccia, quando gli pareva di non pensare a nulla, che è che non è, di nuovo gli tornava in mente il vecchio con la sua famiglia, e la viva impressione che gli era rimasta sembrava esigesse attenzione non solo per se stessa, ma anche perché gli pareva si collegasse alla soluzione di un qualche cosa.']
PRD: — L ’ ha intesa , signore , l ’ ha intesa , l ’ ha intesa , l ’ ha intesa , l ’ ultimo testimone , la .
SRC: ["'You must take me too!"]                                                 
TGT: ['— Prendete anche me con voi.']                                           
PRD: — L ’ ho già guardato con cura , — continuò il figlio — e io non ho mai veduto il dirit

Epoch 11/30:  74%|███████▍  | 674/910 [03:25<01:11,  3.30it/s]

**Third Try:**

I used a P100 GPU on Kaggle with a learning rate of 5e-4.

I reached a training accuracy above 0.80.

The run was terminated because I fell asleep and my laptop turned off.
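
The accuracy reported at the end of each epoch is a token-level score: the decoder receives the ground-truth `decoder_input` (teacher forcing), and a prediction counts as correct when it matches the label at a non-padding position. This may help explain why the sampled `PRD` translations can still look weak while the accuracy keeps rising. A minimal pure-Python sketch of the same computation follows; `masked_token_accuracy` is an illustrative name, not a function from the notebook.

```python
def masked_token_accuracy(pred_ids, label_ids, pad_id):
    """Fraction of non-padding label positions where the prediction matches the label."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(pred_ids, label_ids):
        for pred, label in zip(pred_seq, label_seq):
            if label == pad_id:
                continue  # padding positions are ignored, as with `mask = (label != pad_id)`
            total += 1
            correct += int(pred == label)
    return correct / total if total > 0 else 0.0


if __name__ == "__main__":
    preds = [[4, 5, 6, 0], [7, 8, 0, 0]]
    labels = [[4, 5, 9, 0], [7, 1, 0, 0]]
    # Five non-padding labels, three of them predicted correctly.
    print(masked_token_accuracy(preds, labels, pad_id=0))
```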

In [None]:
if __name__ == '__main__':
    warnings.filterwarnings('ignore')
    config = get_config()
    train_model(config)

README.md:   0%|          | 0.00/28.1k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/5.73M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/32332 [00:00<?, ? examples/s]

Max source length: 309
Max target length: 274


Epoch 1/30: 100%|██████████| 910/910 [02:36<00:00,  5.83it/s]


End of Epoch 1/30, Loss: 6.3029, Accuracy: 0.1633
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: E che la sua vita , e la sua vita , e la sua vita .                        
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Levin , che la sua vita , e la sua vita , e la sua vita .                  


Epoch 2/30: 100%|██████████| 910/910 [02:35<00:00,  5.85it/s]


End of Epoch 2/30, Loss: 5.6475, Accuracy: 0.2079
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Ma , come , e la mia vita , e la sua vita .                                
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Levin non si era stato un ’ altra , ma che si fosse stato stato stato in camera .


Epoch 3/30: 100%|██████████| 910/910 [02:35<00:00,  5.85it/s]


End of Epoch 3/30, Loss: 5.3631, Accuracy: 0.2303
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: il suo sguardo , e la mia vita , e la sua vita , e la sua vita , e la sua .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: E la signorina Scatcherd , che la sua vita era , e la sua vita , e la sua .


Epoch 4/30: 100%|██████████| 910/910 [02:34<00:00,  5.88it/s]


End of Epoch 4/30, Loss: 5.1249, Accuracy: 0.2527
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Io non aveva mai fatto nulla , e la sua opinione , e la sua opinione .     
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: E , dopo aver fatto il tè , e la sua voce di cui aveva fatto un ’ altra volta , e che non si poteva fare .


Epoch 5/30: 100%|██████████| 910/910 [02:34<00:00,  5.88it/s]


End of Epoch 5/30, Loss: 4.8963, Accuracy: 0.2748
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: , e , come se lo , e la mano .                                             
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Ma , come , per la sua disgrazia , si era messa a letto , e la sua amicizia non si poteva fare .


Epoch 6/30: 100%|██████████| 910/910 [02:34<00:00,  5.89it/s]


End of Epoch 6/30, Loss: 4.6730, Accuracy: 0.2968
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Il suo sguardo era un uomo duro , che non aveva mai avuto il diritto di fare , e che si era fermato in piedi .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Anna si alzò in fretta e , dopo aver parlato , si rivolse a sé il suo sguardo con la sua attenzione .


Epoch 7/30: 100%|██████████| 910/910 [02:34<00:00,  5.91it/s]


End of Epoch 7/30, Loss: 4.4403, Accuracy: 0.3225
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Mio padre , che era stato cambiato , e che non aveva nessuna importanza , e che si deve fare .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Vronskij , , e , , , , la mano , e si mise a .                             


Epoch 8/30: 100%|██████████| 910/910 [02:34<00:00,  5.89it/s]


End of Epoch 8/30, Loss: 4.1998, Accuracy: 0.3545
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Levin , , , in sé , picchiando a terra , e a guardare il suo posto .       
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Il sentimento di cui era stato fatto , secondo il solito , si mostrò le forme delle forme della mia opinione , e per quanto per il resto di cui si deve fare .


Epoch 9/30: 100%|██████████| 910/910 [02:34<00:00,  5.90it/s]


End of Epoch 9/30, Loss: 3.9589, Accuracy: 0.3926
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: " Sulle prime , che le , si a un tratto , che , e che la , sarà presto .   
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Dar ’ ja Aleksandrovna , che , in quel momento , con la massima onestà , si ferma e .


Epoch 10/30: 100%|██████████| 910/910 [02:34<00:00,  5.89it/s]


End of Epoch 10/30, Loss: 3.7226, Accuracy: 0.4364
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: " Sulle prime non ho nulla , sono buone , specialmente una volta che si deve sperare , e non ha fatto nascere da bere il coraggio di .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: , , , , eius , eius , eius , eius , eius , eius , desiderò di non avere gusto per il posto di una persona .


Epoch 11/30: 100%|██████████| 910/910 [02:34<00:00,  5.87it/s]


End of Epoch 11/30, Loss: 3.5035, Accuracy: 0.4791
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: E così , per la prima volta , si , si volse a guardare il polso e , la lingua , per non avere bisogno di un altro .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Dar ’ ja Aleksandrovna , sempre , allungata sulla sedia , e gli disse : — Sarà difficile male , tanto meglio , tanto più che non ha fatto per questo mondo ?


Epoch 12/30: 100%|██████████| 910/910 [02:35<00:00,  5.85it/s]


End of Epoch 12/30, Loss: 3.3066, Accuracy: 0.5193
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Un ’ altra , che era stato , , era il fornello , a prenderlo e a pensare che bisognava condurre la strada .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Dolly , che aveva indovinato in pieno la proposta di parentela , si trovava , e , dopo aver sospirato con cura , si affrettava a disagio .


Epoch 13/30: 100%|██████████| 910/910 [02:35<00:00,  5.87it/s]


End of Epoch 13/30, Loss: 3.1320, Accuracy: 0.5573
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: " , Dick , che tutto questo sarà , come il fatto in casa , non in un luogo , anche in un luogo , in un luogo , in generale .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Dar ’ ja Aleksandrovna , salutandolo , si in un modo che , dopo , si , si in un modo : il giudice , i due bambini , si .


Epoch 14/30: 100%|██████████| 910/910 [02:34<00:00,  5.87it/s]


End of Epoch 14/30, Loss: 2.9768, Accuracy: 0.5910
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Un po ’ in quel luogo , al quale io ero a sedere sulla strada , e che in piedi sembrava un tale casi di cui si trattava o no .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: , Sigonin , e poi , per dir vero , quando si mise a pensare a tutta la gente che aveva sognato per il lavoro , e per l ’ aria non riuscì a nessuno l ’ aveva più .


Epoch 15/30: 100%|██████████| 910/910 [02:34<00:00,  5.88it/s]


End of Epoch 15/30, Loss: 2.8340, Accuracy: 0.6250
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Per l ’ amor di Dio , non c ’ era nulla da fare ; vi ringrazio molto , e poi il fatto che l ’ ha fatto bene o l ’ effetto delle persone più belle .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Anna , che , come direttore di banca , doveva essere , doveva essere il miglior mezzo di quello che doveva fare .


Epoch 16/30: 100%|██████████| 910/910 [02:34<00:00,  5.88it/s]


End of Epoch 16/30, Loss: 2.7114, Accuracy: 0.6535
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: — , , , , , , , , , ’ è vero che non può .                                 
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: A quanto a lei , alla sua prima volta , si stendeva una specie d ’ uva che faceva con tanta disinvoltura e fatica , e che non battuta .


Epoch 17/30: 100%|██████████| 910/910 [02:34<00:00,  5.88it/s]


End of Epoch 17/30, Loss: 2.5960, Accuracy: 0.6825
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Una volta , in quel momento , era , a quanto sembra , si fosse fermata , e il fatto faceva piacere di .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: L ’ altra persona , che aveva incontrato da lui , Levin si era assaltato ; e da vero , invece , aveva dato uno sguardo di una volta , la propria , la baciò .


Epoch 18/30: 100%|██████████| 910/910 [02:34<00:00,  5.88it/s]


End of Epoch 18/30, Loss: 2.4976, Accuracy: 0.7076
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Era dominato da un lato , che era stato ferito , che era in piedi , un ’ occhiata al villaggio , mi portò qua e a domandare che cosa dovevo fare ?
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: — , Anna , — disse , — e tu non andrebbe più nulla a l ’ occasione per a un ’ occhiata .


Epoch 19/30: 100%|██████████| 910/910 [02:34<00:00,  5.88it/s]


End of Epoch 19/30, Loss: 2.4104, Accuracy: 0.7313
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Era come se avessi potuto capire , che in questo momento non era ancora morto , che un di quelle acque , senza aver il diritto di .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: A ogni modo , a vedersi , con le mani e i , le calze , con le spalle al loro , si fecero silenzio .


Epoch 20/30: 100%|██████████| 910/910 [02:34<00:00,  5.89it/s]


End of Epoch 20/30, Loss: 2.3342, Accuracy: 0.7523
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: — , allora , come un tale lavoro , si può evitare un tale di gusto in una situazione tale che è possibile .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: È il fatto che perdei un gran tempo nel mio solo modo con lei , mentre così diceva il miglior modo da far cessare l ’ oppio .


Epoch 21/30: 100%|██████████| 910/910 [02:34<00:00,  5.89it/s]


End of Epoch 21/30, Loss: 2.2643, Accuracy: 0.7715
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: — , ma a non giudicare , a quanto pare , alla stazione termale che vi sia la da noi stessi altri .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: Venne a parlare , a star ritto sul sofà , o no , per dargli il denaro ; ma non c ’ era alcuna parte , quanto tempo mi facessero piacere .


Epoch 22/30: 100%|██████████| 910/910 [02:34<00:00,  5.88it/s]


End of Epoch 22/30, Loss: 2.2050, Accuracy: 0.7887
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Il lavoro dei primi tempi del suo arrivo era un militare ; da altro che il loro aspetto , le sue , le appassionate , e i piccoli nella fretta .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: A volte , sedettero sui suoi riccioli d ’ anelli , lo stesso che e che non altro che lo , e che dovessero stare benissimo .


Epoch 23/30: 100%|██████████| 910/910 [02:34<00:00,  5.88it/s]


End of Epoch 23/30, Loss: 2.1503, Accuracy: 0.8048
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Il conto dell ’ alba era così forte , e il mio bere , mentre si stava legato a gridare così come ad una volta abbia fatto il tempo di riflettere .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: A parte il bel tempo , con quello che era stato ferito . E , guardando il cervello , si sollevò il buon senso , così come un pasto portava il latte , la cosa sola mano a parte di chi ?


Epoch 24/30: 100%|██████████| 910/910 [02:34<00:00,  5.88it/s]


End of Epoch 24/30, Loss: 2.1001, Accuracy: 0.8196
SRC: ["'Just one advantage: I live in my own house, which is neither bought nor hired."]
TGT: ['— L’unico vantaggio è che vivo a casa mia, non è roba comprata, né presa in affitto.']
PRD: Il numero 12 era soddisfazione e più piccolo di quello di cui mi aveva trovata accanto a loro ; a loro si era ammalata , non mangiava né .
SRC: ["Dolly was glad when Anna came in and thereby put an end to Annushka's chatter."]
TGT: ['Dolly fu contenta quando Anna entrò da lei e con la sua presenza fece cessare il chiacchierio di Annuška.']
PRD: A volte , alla mia parte , non so se fosse bella o no , comunque fosse la cosa alcuna .


RuntimeError: [enforce fail at inline_container.cc:603] . unexpected pos 134578688 vs 134578576
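
The run stops after epoch 24 with an error raised from `inline_container.cc`, the PyTorch component that reads and writes its zip-based checkpoint files. The log shows no traceback, so the exact call that failed is unknown. One plausible cause, stated here as an assumption, is a checkpoint write that could not complete (for example, because the disk or quota filled up). If that is the case, writing to a temporary file and renaming it only after a successful write keeps the previous checkpoint intact. The sketch below illustrates that pattern; `save_atomically` and its `save_fn` parameter are hypothetical names, not part of the notebook, and `pickle` is used only as a stand-in serializer so the example runs without PyTorch.

```python
import os
import pickle
import tempfile


def save_atomically(obj, path, save_fn=None):
    """Write obj to a temporary file, then rename it over path.

    Hypothetical helper (not from the notebook). If writing fails midway,
    the temporary file is removed and any existing file at path is untouched.
    """
    if save_fn is None:
        # Stand-in serializer; with PyTorch one could pass torch.save,
        # which also accepts (obj, file_like) arguments.
        def save_fn(o, f):
            pickle.dump(o, f)

    # Create the temp file in the target directory so the final rename
    # stays on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            save_fn(obj, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # replaces the old file in one step
    except BaseException:
        # Clean up the partial file and let the caller see the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In a training loop this could be called once per epoch, for example `save_atomically(model.state_dict(), "checkpoint.pt", save_fn=torch.save)`, where `model` and the file name are placeholders. Checking free disk space before a long run is a complementary precaution.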