<br>
<font>
<div dir=ltr align=center>
<img src="https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png" width=150 height=150> <br>
<font color=0F5298 size=7>
    Machine learning <br>
<font color=2565AE size=5>
    Computer Engineering Department <br>
    Fall 2024<br>
<font color=3C99D size=5>
    Practical Assignment 5 - NLP - Transformer & Bert <br>
</div>
<div dir=ltr align=center>
<font color=0CBCDF size=4>
   &#x1F349; Masoud Tahmasbi  &#x1F349;  &#x1F353; Arash Ziyaei &#x1F353;
<br>
<font color=0CBCDF size=4>
   &#x1F335; Amirhossein Akbari  &#x1F335;
</div>

____

<font color=9999FF size=4>
&#x1F388; Full Name : Amirhosein Rezaei
<br>
<font color=9999FF size=4>
&#x1F388; Student Number : 401105989

<font color=0080FF size=3>
This notebook covers two key topics. First, we implement a transformer model from scratch and apply it to a specific task. Second, we fine-tune the BERT model using LoRA for efficient adaptation to a downstream task.
</font>
<br>

**Note:**
<br>
<font color=66B2FF size=2>In this notebook, you are free to use any function or model from PyTorch to assist with the implementation. However, TensorFlow is not permitted for this exercise. This ensures consistency and alignment with the tools being focused on.</font>
<br>
<font color=red size=3>**Run All Cells Before Submission**</font>: <font color=FF99CC size=2>Before saving and submitting your notebook, please ensure you run all cells from start to finish. This practice guarantees that your notebook is self-consistent and can be evaluated correctly by others.</font>

# Section 1: Transformer

The transformer architecture consists of two main components: an encoder and a decoder. Each of these components is made up of multiple layers that include self-attention mechanisms and feedforward neural networks. The self-attention mechanism is central to the transformer, as it enables the model to assess the importance of different words in a sentence by considering their relationships with one another.


In this assignment, you should design a transformer model from scratch. You are required to implement the Encoder and Decoder components of a Transformer model.

In [None]:
!pip install datasets

# Importing libraries

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split
from torch.utils.tensorboard import SummaryWriter

# Math
import math

# HuggingFace libraries
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# Pathlib
from pathlib import Path

# typing
from typing import Any

# Library for progress bars in loops
from tqdm import tqdm

# Importing library of warnings
import warnings

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

## Part 1: Input Embeddings
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">When we observe the Transformer architecture image above, we can see that the Embeddings represent the first step of both blocks.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>InputEmbedding</code> class below is responsible for converting the input text into numerical vectors of <code>d_model</code> dimensions. To prevent that our input embeddings become extremely small, we normalize them by multiplying them by the $\sqrt{d_{model}}$.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the image below, we can see how the embeddings are created. First, we have a sentence that gets split into tokens—we will explore what tokens are later on—. Then, the token IDs—identification numbers—are transformed into the embeddings, which are high-dimensional vectors.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a class `InputEmbeddings` inheriting from `nn.Module`
# - Initialize the class with two parameters:
#   1. `d_model`: Dimension of the embedding vectors
#   2. `vocab_size`: Size of the vocabulary
# - Create an embedding layer using `nn.Embedding` to map input indices to dense vectors

# - In the `forward` method:
#   1. Pass the input `x` through the embedding layer
#   2. Scale the embeddings by the square root of `d_model` for variance normalization

######################  TODO  ########################
######################  TODO  ########################


class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)
    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)

## Part 2: positional encoding
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the original paper, the authors add the positional encodings to the input embeddings at the bottom of both the encoder and decoder blocks so the model can have some information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two vectors can be summed and we can combine the semantic content from the word embeddings and positional information from the positional encodings.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>PositionalEncoding</code> class below, we will create a matrix of positional encodings <code>pe</code> with dimensions <code>(seq_len, d_model)</code>. We will start by filling it with $0$s.We will then apply the sine function to even indices of the positional encoding matrix while the cosine function is applied to the odd ones.</p>

<p style="
    margin-bottom: 5;
    font-size: 22px;
    font-weight: 300;
    font-family: 'Helvetica Neue', sans-serif;
    color: #000000;
  ">
    \begin{equation}
    \text{Odd Indices } (2i + 1): \quad \text{PE(pos, } 2i + 1) = \cos\left(\frac{\text{pos}}{10000^{2i / d_{model}}}\right)
    \end{equation}
</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We apply the sine and cosine functions because it allows the model to determine the position of a word based on the position of other words in the sequence, since for any fixed offset $k$, $PE_{pos + k}$ can be represented as a linear function of $PE_{pos}$. This happens due to the properties of sine and cosine functions, where a shift in the input results in a predictable change in the output.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `PositionalEncoding` class inheriting from `nn.Module`
# - Initialize with `d_model`, `seq_len`, and `dropout`
# - Generate a positional encoding matrix using sine and cosine functions
# - Register the positional encoding as a non-trainable buffer
# - In `forward`, add positional encoding to input and apply dropout

######################  TODO  ########################
######################  TODO  ########################


class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(seq_len, d_model)
        position = torch.arange(0, seq_len, dtype = torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)

## Part 3: layer normalization
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">When we look at the encoder and decoder blocks, we see several normalization layers called <b><i>Add &amp; Norm</i></b>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>LayerNormalization</code> class below performs layer normalization on the input data. During its forward pass, we compute the mean and standard deviation of the input data. We then normalize the input data by subtracting the mean and dividing by the standard deviation plus a small number called epsilon to avoid any divisions by zero. This process results in a normalized output with a mean 0 and a standard deviation 1.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will then scale the normalized output by a learnable parameter <code>alpha</code> and add a learnable parameter called <code>bias</code>. The training process is responsible for adjusting these parameters. The final result is a layer-normalized tensor, which ensures that the scale of the inputs to layers in the network is consistent.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `LayerNormalization` class inheriting from `nn.Module`
# - Initialize with `eps` (small value to prevent division by zero)
# - Define trainable parameters:
#   1. `alpha`: Scaling factor initialized to 1
#   2. `bias`: Offset initialized to 0

# - In `forward`, perform layer normalization:
#   1. Compute mean and standard deviation along the last dimension
#   2. Normalize the input using the computed mean and std
#   3. Scale and shift using `alpha` and `bias`

######################  TODO  ########################
######################  TODO  ########################

class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 10**-6) -> None:
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        mean = x.mean(dim = -1, keepdim = True)
        std = x.std(dim = -1, keepdim = True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias

## Part 4: Feed Forward Network
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the fully connected feed-forward network, we apply two linear transformations with a ReLU activation in between. We can mathematically represent this operation as:</p>

<p style="
    margin-bottom: 5;
    font-size: 22px;
    font-weight: 300;
    font-family: 'Helvetica Neue', sans-serif;
    color: #000000;
  ">
    \begin{equation}
    \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
    \end{equation}
</p>


<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">$W_1$ and $W_2$ are the weights, while $b_1$ and $b_2$ are the biases of the two linear transformations.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>FeedForwardBlock</code> below, we will define the two linear transformations—<code>self.linear_1</code> and <code>self.linear_2</code>—and the inner-layer <code>d_ff</code>. The input data will first pass through the <code>self.linear_1</code> transformation, which increases its dimensionality from <code>d_model</code> to <code>d_ff</code>. The output of this operation passes through the ReLU activation function, which introduces non-linearity so the network can learn more complex patterns, and the <code>self.dropout</code> layer is applied to mitigate overfitting. The final operation is the <code>self.linear_2</code> transformation to the dropout-modified tensor, which transforms it back to the original <code>d_model</code> dimension.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `FeedForwardBlock` class inheriting from `nn.Module`
# - Initialize with `d_model`, `d_ff`, and `dropout`
# - Define:
#   1. `linear_1`: Linear layer projecting from `d_model` to `d_ff`
#   2. Dropout layer for regularization
#   3. `linear_2`: Linear layer projecting back from `d_ff` to `d_model`

# - In `forward`, apply the following steps:
#   1. Pass input through `linear_1` followed by ReLU activation
#   2. Apply dropout
#   3. Pass through `linear_2` to return to original dimensions

######################  TODO  ########################
######################  TODO  ########################

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))

## Part 5: Multi Head Attention
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The Multi-Head Attention is the most crucial component of the Transformer. It is responsible for helping the model to understand complex relationships and patterns in the data.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The image below displays how the Multi-Head Attention works. It doesn't include <code>batch</code> dimension because it only illustrates the process for one single sentence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The Multi-Head Attention block receives the input data split into queries, keys, and values organized into matrices $Q$, $K$, and $V$. Each matrix contains different facets of the input, and they have the same dimensions as the input.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We then linearly transform each matrix by their respective weight matrices $W^Q$, $W^K$, and $W^V$. These transformations will result in new matrices $Q'$, $K'$, and $V'$, which will be split into smaller matrices corresponding to different heads $h$, allowing the model to attend to information from different representation subspaces in parallel. This split creates multiple sets of queries, keys, and values for each head.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Finally, we concatenate every head into an $H$ matrix, which is then transformed by another weight matrix $W^o$ to produce the multi-head attention output, a matrix $MH-A$ that retains the input dimensionality.</p>

In [None]:
class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.h = h
        assert d_model % h == 0, 'd_model is not divisible by h'
        self.d_k = d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]
        attention_scores = (query @ key.transpose(-2,-1)) / math.sqrt(d_k)
        if mask is not None:
            attention_scores.masked_fill_(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim = -1)
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        return (attention_scores @ value), attention_scores

    def forward(self, q, k, v, mask):
        query = self.w_q(q)
        key = self.w_k(k)
        value = self.w_v(v)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1,2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1,2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1,2)
        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)
        return self.w_o(x)

## Part 6: Residual Connection
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">When we look at the architecture of the Transformer, we see that each sub-layer, including the <i>self-attention</i> and <i>Feed Forward</i> blocks, adds its output to its input before passing it to the <i>Add &amp; Norm</i> layer. This approach integrates the output with the original input in the <i>Add &amp; Norm</i> layer. This process is known as the skip connection, which allows the Transformer to train deep networks more effectively by providing a shortcut for the gradient to flow through during backpropagation.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>ResidualConnection</code> class below is responsible for this process.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `ResidualConnection` class inheriting from `nn.Module`
# - Initialize with `dropout`:
#   1. Add a dropout layer for regularization
#   2. Include a layer normalization instance

# - In `forward`:
#   1. Normalize the input using the normalization layer
#   2. Pass the normalized input through the sublayer
#   3. Apply dropout and add the result back to the original input for residual connection

######################  TODO  ########################
######################  TODO  ########################

class ResidualConnection(nn.Module):
    def __init__(self, dropout: float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization()

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

## Part 7: Encoder
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will now build the encoder. We create the <code>EncoderBlock</code> class, consisting of the Multi-Head Attention and Feed Forward layers, plus the residual connections.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the original paper, the Encoder Block repeats six times. We create the <code>Encoder</code> class as an assembly of multiple <code>EncoderBlock</code>s. We also add layer normalization as a final step after processing the input through all its blocks.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create an `EncoderBlock` class inheriting from `nn.Module`
# - Initialize with:
#   1. `self_attention_block`: Multi-head attention block
#   2. `feed_forward_block`: Feed-forward block
#   3. `dropout`: Dropout rate for residual connections
# - Define two residual connections for:
#   1. Self-attention block
#   2. Feed-forward block

# - In `forward`:
#   1. Apply the first residual connection with the self-attention block
#   2. Apply the second residual connection with the feed-forward block
#   3. Return the updated tensor after both layers

######################  TODO  ########################
######################  TODO  ########################

class EncoderBlock(nn.Module):
    def __init__(self, self_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])

    def forward(self, x, src_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create an `Encoder` class inheriting from `nn.Module`
# - Initialize with:
#   1. `layers`: A list of `EncoderBlock` instances
#   2. A layer normalization instance for output normalization

# - In `forward`:
#   1. Pass the input tensor `x` through each `EncoderBlock` in `self.layers`
#   2. Apply the mask during each block's forward pass
#   3. Normalize the final output and return it

######################  TODO  ########################
######################  TODO  ########################

class Encoder(nn.Module):
    def __init__(self, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

## Part 8: Decoder
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Similarly, the Decoder also consists of several DecoderBlocks that repeat six times in the original paper. The main difference is that it has an additional sub-layer that performs multi-head attention with a <i>cross-attention</i> component that uses the output of the Encoder as its keys and values while using the Decoder's input as queries.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">For the Output Embedding, we can use the same <code>InputEmbeddings</code> class we use for the Encoder. You can also notice that the self-attention sub-layer is <i>masked</i>, which restricts the model from accessing future elements in the sequence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will start by building the <code>DecoderBlock</code> class, and then we will build the <code>Decoder</code> class, which will assemble multiple <code>DecoderBlock</code>s.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `DecoderBlock` class inheriting from `nn.Module`
# - Initialize with:
#   1. `self_attention_block`: Multi-head self-attention block
#   2. `cross_attention_block`: Multi-head cross-attention block
#   3. `feed_forward_block`: Feed-forward block
#   4. `dropout`: Dropout rate
# - Define three residual connections for:
#   1. Self-attention block
#   2. Cross-attention block
#   3. Feed-forward block

# - In `forward`:
#   1. Apply the self-attention block with target mask and residual connection
#   2. Apply the cross-attention block with source mask and residual connection
#   3. Apply the feed-forward block with residual connection
#   4. Return the updated tensor

######################  TODO  ########################
######################  TODO  ########################

class DecoderBlock(nn.Module):
    def __init__(self, self_attention_block: MultiHeadAttentionBlock, cross_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.cross_attention_block = cross_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, tgt_mask))
        x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))
        x = self.residual_connections[2](x, self.feed_forward_block)
        return x

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `Decoder` class inheriting from `nn.Module`
# - Initialize with:
#   1. `layers`: A list of `DecoderBlock` instances
#   2. A layer normalization instance for the final output

# - In `forward`:
#   1. Pass the input tensor `x` through each `DecoderBlock` in `self.layers`
#   2. Provide `encoder_output`, `src_mask`, and `tgt_mask` to each block
#   3. Normalize the final output using the layer normalization
#   4. Return the normalized output

######################  TODO  ########################
######################  TODO  ########################

class Decoder(nn.Module):
    def __init__(self, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return self.norm(x)

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">You can see in the Decoder image that after running a stack of <code>DecoderBlock</code>s, we have a Linear Layer and a Softmax function to the output of probabilities. The <code>ProjectionLayer</code> class below is responsible for converting the output of the model into a probability distribution over the <i>vocabulary</i>, where we select each output token from a vocabulary of possible tokens.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `ProjectionLayer` class inheriting from `nn.Module`
# - Initialize with:
#   1. `d_model`: Dimension of the model
#   2. `vocab_size`: Size of the output vocabulary
# - Define a linear layer to project from `d_model` to `vocab_size`

# - In `forward`:
#   1. Pass the input through the linear layer
#   2. Apply log Softmax along the last dimension
#   3. Return the log probabilities

######################  TODO  ########################
######################  TODO  ########################

class ProjectionLayer(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        return torch.log_softmax(self.proj(x), dim = -1)

## Part 9: Building the Transformer

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We finally have every component of the Transformer architecture ready. We may now construct the Transformer by putting it all together.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>Transformer</code> class below, we will bring together all the components of the model's architecture.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Create a `Transformer` class inheriting from `nn.Module`
# - Initialize with:
#   1. `encoder`: Encoder module
#   2. `decoder`: Decoder module
#   3. `src_embed` and `tgt_embed`: Input embeddings for source and target languages
#   4. `src_pos` and `tgt_pos`: Positional encodings for source and target languages
#   5. `projection_layer`: Linear projection layer for final output

# - Define the `encode` method:
#   1. Apply source embeddings to input
#   2. Add positional encoding
#   3. Pass through the encoder with the source mask
#   4. Return the encoded representation

# - Define the `decode` method:
#   1. Apply target embeddings to input
#   2. Add positional encoding
#   3. Pass through the decoder with encoder output, source mask, and target mask
#   4. Return the decoder's output

# - Define the `project` method:
#   1. Pass decoder output through the projection layer
#   2. Apply log Softmax to obtain probabilities

######################  TODO  ########################
######################  TODO  ########################

class Transformer(nn.Module):
    def __init__(self, encoder: Encoder, decoder: Decoder, src_embed: InputEmbeddings, tgt_embed: InputEmbeddings, src_pos: PositionalEncoding, tgt_pos: PositionalEncoding, projection_layer: ProjectionLayer) -> None:
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.src_pos = src_pos
        self.tgt_pos = tgt_pos
        self.projection_layer = projection_layer

    def encode(self, src, src_mask):
        src = self.src_embed(src)
        src = self.src_pos(src)
        return self.encoder(src, src_mask)

    def decode(self, encoder_output, src_mask, tgt, tgt_mask):
        tgt = self.tgt_embed(tgt)
        tgt = self.tgt_pos(tgt)
        return self.decoder(tgt, encoder_output, src_mask, tgt_mask)

    def project(self, x):
        return self.projection_layer(x)

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The architecture is finally ready. We now define a function called <code>build_transformer</code>, in which we define the parameters and everything we need to have a fully operational Transformer model for the task of <b>machine translation</b>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will set the same parameters as in the original paper, <a href = "https://arxiv.org/pdf/1706.03762.pdf"><i>Attention Is All You Need</i></a>, where $d_{model}$ = 512, $N$ = 6, $h$ = 8, dropout rate $P_{drop}$ = 0.1, and $d_{ff}$ = 2048.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `build_transformer` function with parameters for:
#   1. Vocabulary sizes (`src_vocab_size`, `tgt_vocab_size`)
#   2. Sequence lengths (`src_seq_len`, `tgt_seq_len`)
#   3. Model dimensions (`d_model`, `d_ff`)
#   4. Number of layers (`N`) and heads (`h`)
#   5. Dropout rate (`dropout`)

# - Create:
#   1. Source and target embedding layers
#   2. Positional encoding layers for source and target
#   3. Encoder blocks with self-attention and feed-forward layers
#   4. Decoder blocks with self-attention, cross-attention, and feed-forward layers
#   5. Encoder and Decoder modules using the blocks
#   6. Projection layer to map decoder output to target vocabulary

# - Assemble all components into a `Transformer` instance
# - Initialize parameters with Xavier uniform initialization
# - Return the initialized Transformer

######################  TODO  ########################
######################  TODO  ########################

def build_transformer(src_vocab_size: int, tgt_vocab_size: int, src_seq_len: int, tgt_seq_len: int, d_model: int = 512, N: int = 6, h: int = 8, dropout: float = 0.1, d_ff: int = 2048) -> Transformer:
    src_embed = InputEmbeddings(d_model, src_vocab_size)
    tgt_embed = InputEmbeddings(d_model, tgt_vocab_size)
    src_pos = PositionalEncoding(d_model, src_seq_len, dropout)
    tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)
    encoder_blocks = []
    for _ in range(N):
        encoder_self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
        encoder_block = EncoderBlock(encoder_self_attention_block, feed_forward_block, dropout)
        encoder_blocks.append(encoder_block)
    decoder_blocks = []
    for _ in range(N):
        decoder_self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        decoder_cross_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
        decoder_block = DecoderBlock(decoder_self_attention_block, decoder_cross_attention_block, feed_forward_block, dropout)
        decoder_blocks.append(decoder_block)
    encoder = Encoder(nn.ModuleList(encoder_blocks))
    decoder = Decoder(nn.ModuleList(decoder_blocks))
    projection_layer = ProjectionLayer(d_model, tgt_vocab_size)
    transformer = Transformer(encoder, decoder, src_embed, tgt_embed, src_pos, tgt_pos, projection_layer)
    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return transformer

The model is now ready to be trained!

## Part 10: Tokenizer

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Tokenization is a crucial preprocessing step for our Transformer model. In this step, we convert raw text into a number format that the model can process.  </p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">There are several Tokenization strategies. We will use the <i>word-level tokenization</i> to transform each word in a sentence into a token.</p>

<center>
    <img src = "https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8d5e749c-b0bd-4496-85a1-9b4397ad935f_1400x787.jpeg" width = 800, height= 800>
<p style = "font-size: 16px;
            font-family: 'Georgia', serif;
            text-align: center;
            margin-top: 10px;">Different tokenization strategies. Source: <a href = "https://shaankhosla.substack.com/p/talking-tokenization">shaankhosla.substack.com</a>.</p>
</center>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">After tokenizing a sentence, we map each token to an unique integer ID based on the created vocabulary present in the training corpus during the training of the tokenizer. Each integer number represents a specific word in the vocabulary.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Besides the words in the training corpus, Transformers use special tokens for specific purposes. These are some that we will define right away:</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>• [UNK]:</b> This token is used to identify an unknown word in the sequence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>• [PAD]:</b> Padding token to ensure that all sequences in a batch have the same length, so we pad shorter sentences with this token. We use attention masks to <i>"tell"</i> the model to ignore the padded tokens during training since they don't have any real meaning to the task.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>•  [SOS]:</b> This is a token used to signal the <i>Start of Sentence</i>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>•  [EOS]:</b> This is a token used to signal the <i>End of Sentence</i>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>build_tokenizer</code> function below, we ensure a tokenizer is ready to train the model. It checks if there is an existing tokenizer, and if that is not the case, it trains a new tokenizer.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `build_tokenizer` function with parameters for:
#   1. `config`: Configuration containing tokenizer file path
#   2. `ds`: Dataset to train the tokenizer
#   3. `lang`: Language for which the tokenizer is built

# - Check if the tokenizer file exists:
#   1. If not, create a new tokenizer:
#      - Initialize a word-level tokenizer with an unknown token (`[UNK]`)
#      - Set the pre-tokenizer to split text by whitespace
#      - Define a trainer with special tokens and minimum frequency
#      - Train the tokenizer on all sentences in the dataset
#      - Save the trained tokenizer to the specified file path
#   2. If the file exists, load the tokenizer from the file

# - Return the loaded or trained tokenizer

######################  TODO  ########################
######################  TODO  ########################


def build_tokenizer(config, ds, lang):
    tokenizer_path = Path(config['tokenizer_file'].format(lang))
    if not Path.exists(tokenizer_path):
        tokenizer = Tokenizer(WordLevel(unk_token = '[UNK]'))
        tokenizer.pre_tokenizer = Whitespace()
        trainer = WordLevelTrainer(special_tokens = ["[UNK]", "[PAD]",
                                                     "[SOS]", "[EOS]"], min_frequency = 2)
        tokenizer.train_from_iterator(get_all_sentences(ds, lang), trainer = trainer)
        tokenizer.save(str(tokenizer_path))
    else:
        tokenizer = Tokenizer.from_file(str(tokenizer_path))
    return tokenizer

## Part 11: Load Dataset

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">For this task, we will use the <a href = "opus_books · Datasets at Hugging Face">OpusBooks dataset</a>, available on 🤗Hugging Face. This dataset consists of two features, <code>id</code> and <code>translation</code>. The <code>translation</code> feature contains pairs of sentences in different languages, such as Spanish and Portuguese, English and French, and so forth.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">I first tried translating sentences from English to Portuguese—my native tongue — but there are only 1.4k examples for this pair, so the results were not satisfying in the current configurations for this model. I then tried to use the English-French pair due to its higher number of examples—127k—but it would take too long to train with the current configurations. I then opted to train the model on the English-Italian pair, the same one used in the <a href = "https://youtu.be/ISNdQcPhsts?si=253J39cose6IdsLv">Coding a Transformer from scratch on PyTorch, with full explanation, training and inference
</a> video, as that was a good balance between performance and time of training.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We start by defining the <code>get_all_sentences</code> function to iterate over the dataset and extract the sentences according to the language pair defined—we will do that later.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `get_all_sentences` function to extract sentences from a dataset
# - Accept parameters:
#   1. `ds`: The dataset containing translation pairs
#   2. `lang`: The language key to extract translations

# - Iterate through the dataset:
#   1. Access the 'translation' field of each pair
#   2. Yield the sentence corresponding to the specified language key

######################  TODO  ########################
######################  TODO  ########################

def get_all_sentences(ds, lang):
    for pair in ds:
        yield pair['translation'][lang]

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>get_ds</code> function is defined to load and prepare the dataset for training and validation. In this function, we build or load the tokenizer, split the dataset, and create DataLoaders, so the model can successfully iterate over the dataset in batches. The result of these functions is tokenizers for the source and target languages plus the DataLoader objects.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `get_ds` function to process and prepare the dataset for training
# - Load the `OpusBooks` dataset using:
#   1. Source and target languages from `config`
#   2. Train split of the dataset

# - Build or load tokenizers for source and target languages using `build_tokenizer`

# - Split the dataset into training and validation sets:
#   1. Allocate 90% for training and 10% for validation
#   2. Use `random_split` for randomized splitting

# - Process the splits using a `BilingualDataset` class:
#   1. Convert sentences to tokenized representations
#   2. Apply source and target tokenizers
#   3. Ensure sequence lengths conform to `config`

# - Compute and print the maximum sentence lengths for both source and target languages

# - Create DataLoader objects for training and validation:
#   1. Define batch sizes from `config`
#   2. Enable shuffling for training DataLoader

# - Return:
#   1. Training DataLoader
#   2. Validation DataLoader
#   3. Tokenizer for source language
#   4. Tokenizer for target language

######################  TODO  ########################
######################  TODO  ########################

def get_ds(config):
    ds_raw = load_dataset('opus_books', f'{config["lang_src"]}-{config["lang_tgt"]}', split = 'train')
    reduced_size = int(0.1 * len(ds_raw))
    ds_raw = ds_raw.select(range(reduced_size))
    tokenizer_src = build_tokenizer(config, ds_raw, config['lang_src'])
    tokenizer_tgt = build_tokenizer(config, ds_raw, config['lang_tgt'])
    train_ds_size = int(0.9 * len(ds_raw))
    val_ds_size = len(ds_raw) - train_ds_size
    train_ds_raw, val_ds_raw = random_split(ds_raw, [train_ds_size, val_ds_size])
    train_ds = BilingualDataset(train_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])
    val_ds = BilingualDataset(val_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])
    max_len_src = 0
    max_len_tgt = 0
    for pair in ds_raw:
        src_ids = tokenizer_src.encode(pair['translation'][config['lang_src']]).ids
        tgt_ids = tokenizer_src.encode(pair['translation'][config['lang_tgt']]).ids
        max_len_src = max(max_len_src, len(src_ids))
        max_len_tgt = max(max_len_tgt, len(tgt_ids))
    print(f'Max length of source sentence: {max_len_src}')
    print(f'Max length of target sentence: {max_len_tgt}')
    train_dataloader = DataLoader(train_ds, batch_size = config['batch_size'], shuffle = True)
    val_dataloader = DataLoader(val_ds, batch_size = 1, shuffle = True)
    return train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We define the <code>casual_mask</code> function to create a mask for the attention mechanism of the decoder. This mask prevents the model from having information about future elements in the sequence. </p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We start by making a square grid filled with ones. We determine the grid size with the <code>size</code> parameter. Then, we change all the numbers above the main diagonal line to zeros. Every number on one side becomes a zero, while the rest remain ones. The function then flips all these values, turning ones into zeros and zeros into ones. This process is crucial for models that predict future tokens in a sequence.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `casual_mask` function to create an upper triangular mask
# - Accept `size` as the dimension of the square matrix
# - Steps:
#   1. Create a square matrix of size `size x size` filled with ones
#   2. Use `torch.triu` to make it upper triangular, with zeros below the diagonal
#   3. Convert the matrix to integer type
#   4. Return the mask where zeros represent the causal positions

######################  TODO  ########################
######################  TODO  ########################

def casual_mask(size):
    mask = torch.triu(torch.ones(1, size, size), diagonal = 1).type(torch.int)
    return mask == 0

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>BilingualDataset</code> class processes the texts of the target and source languages in the dataset by tokenizing them and adding all the necessary special tokens. This class also certifies that the sentences are within a maximum sequence length for both languages and pads all necessary sentences.</p>

In [None]:
class BilingualDataset(Dataset):

    # This takes in the dataset contaning sentence pairs, the tokenizers for target and source languages, and the strings of source and target languages
    # 'seq_len' defines the sequence length for both languages
    def __init__(self, ds, tokenizer_src, tokenizer_tgt, src_lang, tgt_lang, seq_len) -> None:
        super().__init__()

        self.seq_len = seq_len
        self.ds = ds
        self.tokenizer_src = tokenizer_src
        self.tokenizer_tgt = tokenizer_tgt
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang

        # Defining special tokens by using the target language tokenizer
        self.sos_token = torch.tensor([tokenizer_tgt.token_to_id("[SOS]")], dtype=torch.int64)
        self.eos_token = torch.tensor([tokenizer_tgt.token_to_id("[EOS]")], dtype=torch.int64)
        self.pad_token = torch.tensor([tokenizer_tgt.token_to_id("[PAD]")], dtype=torch.int64)


    # Total number of instances in the dataset (some pairs are larger than others)
    def __len__(self):
        return len(self.ds)

    # Using the index to retrive source and target texts
    def __getitem__(self, index: Any) -> Any:
        src_target_pair = self.ds[index]
        src_text = src_target_pair['translation'][self.src_lang]
        tgt_text = src_target_pair['translation'][self.tgt_lang]

        # Tokenizing source and target texts
        enc_input_tokens = self.tokenizer_src.encode(src_text).ids
        dec_input_tokens = self.tokenizer_tgt.encode(tgt_text).ids

        # Computing how many padding tokens need to be added to the tokenized texts
        # Source tokens
        enc_num_padding_tokens = self.seq_len - len(enc_input_tokens) - 2 # Subtracting the two '[EOS]' and '[SOS]' special tokens
        # Target tokens
        dec_num_padding_tokens = self.seq_len - len(dec_input_tokens) - 1 # Subtracting the '[SOS]' special token

        # If the texts exceed the 'seq_len' allowed, it will raise an error. This means that one of the sentences in the pair is too long to be processed
        # given the current sequence length limit (this will be defined in the config dictionary below)
        if enc_num_padding_tokens < 0 or dec_num_padding_tokens < 0:
            raise ValueError('Sentence is too long')

        # Building the encoder input tensor by combining several elements
        encoder_input = torch.cat(
            [
            self.sos_token, # inserting the '[SOS]' token
            torch.tensor(enc_input_tokens, dtype = torch.int64), # Inserting the tokenized source text
            self.eos_token, # Inserting the '[EOS]' token
            torch.tensor([self.pad_token] * enc_num_padding_tokens, dtype = torch.int64) # Addind padding tokens
            ]
        )

        # Building the decoder input tensor by combining several elements
        decoder_input = torch.cat(
            [
                self.sos_token, # inserting the '[SOS]' token
                torch.tensor(dec_input_tokens, dtype = torch.int64), # Inserting the tokenized target text
                torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype = torch.int64) # Addind padding tokens
            ]

        )

        # Creating a label tensor, the expected output for training the model
        label = torch.cat(
            [
                torch.tensor(dec_input_tokens, dtype = torch.int64), # Inserting the tokenized target text
                self.eos_token, # Inserting the '[EOS]' token
                torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype = torch.int64) # Adding padding tokens

            ]
        )

        # Ensuring that the length of each tensor above is equal to the defined 'seq_len'
        assert encoder_input.size(0) == self.seq_len
        assert decoder_input.size(0) == self.seq_len
        assert label.size(0) == self.seq_len

        return {
            'encoder_input': encoder_input,
            'decoder_input': decoder_input,
            'encoder_mask': (encoder_input != self.pad_token).unsqueeze(0).unsqueeze(0).int(),
            'decoder_mask': (decoder_input != self.pad_token).unsqueeze(0).unsqueeze(0).int() & casual_mask(decoder_input.size(0)),
            'label': label,
            'src_text': src_text,
            'tgt_text': tgt_text
        }

## Part 12: Validation Loop

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will now create two functions for the validation loop. The validation loop is crucial to evaluate model performance in translating sentences from data it has not seen during training.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will define two functions. The first function, <code>greedy_decode</code>, gives us the model's output by obtaining the most probable next token. The second function, <code>run_validation</code>, is responsible for running the validation process in which we decode the model's output and compare it with the reference text for the target sentence.</p>

In [None]:
# Define function to obtain the most probable next token
def greedy_decode(model, source, source_mask, tokenizer_src, tokenizer_tgt, max_len, device):
    # Retrieving the indices from the start and end of sequences of the target tokens
    sos_idx = tokenizer_tgt.token_to_id('[SOS]')
    eos_idx = tokenizer_tgt.token_to_id('[EOS]')

    # Computing the output of the encoder for the source sequence
    encoder_output = model.encode(source, source_mask)
    # Initializing the decoder input with the Start of Sentence token
    decoder_input = torch.empty(1,1).fill_(sos_idx).type_as(source).to(device)

    # Looping until the 'max_len', maximum length, is reached
    while True:
        if decoder_input.size(1) == max_len:
            break

        # Building a mask for the decoder input
        decoder_mask = casual_mask(decoder_input.size(1)).type_as(source_mask).to(device)

        # Calculating the output of the decoder
        out = model.decode(encoder_output, source_mask, decoder_input, decoder_mask)

        # Applying the projection layer to get the probabilities for the next token
        prob = model.project(out[:, -1])

        # Selecting token with the highest probability
        _, next_word = torch.max(prob, dim=1)
        decoder_input = torch.cat([decoder_input, torch.empty(1,1). type_as(source).fill_(next_word.item()).to(device)], dim=1)

        # If the next token is an End of Sentence token, we finish the loop
        if next_word == eos_idx:
            break

    return decoder_input.squeeze(0) # Sequence of tokens generated by the decoder

In [None]:
# Defining function to evaluate the model on the validation dataset
# num_examples = 2, two examples per run
def run_validation(model, validation_ds, tokenizer_src, tokenizer_tgt, max_len, device, print_msg, global_state, writer, num_examples=2):
    model.eval() # Setting model to evaluation mode
    count = 0 # Initializing counter to keep track of how many examples have been processed

    console_width = 80 # Fixed witdh for printed messages

    # Creating evaluation loop
    with torch.no_grad(): # Ensuring that no gradients are computed during this process
        for batch in validation_ds:
            count += 1
            encoder_input = batch['encoder_input'].to(device)
            encoder_mask = batch['encoder_mask'].to(device)

            # Ensuring that the batch_size of the validation set is 1
            assert encoder_input.size(0) ==  1, 'Batch size must be 1 for validation.'

            # Applying the 'greedy_decode' function to get the model's output for the source text of the input batch
            model_out = greedy_decode(model, encoder_input, encoder_mask, tokenizer_src, tokenizer_tgt, max_len, device)

            # Retrieving source and target texts from the batch
            source_text = batch['src_text'][0]
            target_text = batch['tgt_text'][0] # True translation
            model_out_text = tokenizer_tgt.decode(model_out.detach().cpu().numpy()) # Decoded, human-readable model output

            # Printing results
            print_msg('-'*console_width)
            print_msg(f'SOURCE: {source_text}')
            print_msg(f'TARGET: {target_text}')
            print_msg(f'PREDICTED: {model_out_text}')

            # After two examples, we break the loop
            if count == num_examples:
                break

## Part 13: Training Loop

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We are ready to train our Transformer model on the OpusBook dataset for the English to Italian translation task.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We first start by defining the <code>get_model</code> function to load the model by calling the <code>build_transformer</code> function we have previously defined. This function uses the <code>config</code> dictionary to set a few parameters.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `get_model` function to initialize a Transformer model
# - Accept parameters:
#   1. `config`: Configuration dictionary with model settings
#   2. `vocab_src_len`: Length of the source language vocabulary
#   3. `vocab_tgt_len`: Length of the target language vocabulary

# - Use the `build_transformer` function to:
#   1. Create a Transformer model
#   2. Pass the source and target vocabulary lengths
#   3. Set sequence length (`seq_len`) and embedding dimensionality (`d_model`) from `config`

# - Return the initialized model

######################  TODO  ########################
######################  TODO  ########################

def get_model(config, vocab_src_len, vocab_tgt_len):
    model = build_transformer(vocab_src_len, vocab_tgt_len, config['seq_len'], config['seq_len'], config['d_model'])
    return model

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">I have mentioned the <code>config</code> dictionary several times throughout this notebook. Now, it is time to create it.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the following cell, we will define two functions to configure our model and the training process.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>get_config</code> function, we define crucial parameters for the training process. <code>batch_size</code> for the number of training examples used in one iteration, <code>num_epochs</code> as the number of times the entire dataset is passed forward and backward through the Transformer, <code>lr</code> as the learning rate for the optimizer, etc. We will also finally define the pairs from the OpusBook dataset, <code>'lang_src': 'en'</code> for selecting English as the source language and <code>'lang_tgt': 'it'</code> for selecting Italian as the target language.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>get_weights_file_path</code> function constructs the file path for saving or loading model weights for any specific epoch.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `get_config` function to return a dictionary of settings for building and training the Transformer model:
#   1. `batch_size`: Number of samples per training batch
#   2. `num_epochs`: Total training epochs
#   3. `lr`: Learning rate for optimization
#   4. `seq_len`: Maximum sequence length for tokens
#   5. `d_model`: Dimensionality of embeddings (e.g., 512)
#   6. `lang_src` and `lang_tgt`: Source and target languages
#   7. `model_folder`: Folder to save model weights
#   8. `model_basename`: Base name for model files
#   9. `preload`: Option to preload a model (default: None)
#   10. `tokenizer_file`: Filename pattern for saving tokenizers
#   11. `experiment_name`: Name of the experiment for logging

# - Define `get_weights_file_path` to construct a file path for saving/retrieving model weights:
#   1. Accept `config` dictionary and `epoch` string as parameters
#   2. Retrieve `model_folder` and `model_basename` from `config`
#   3. Construct the filename with the base name and epoch
#   4. Combine the current directory, model folder, and filename to return the full path

######################  TODO  ########################
######################  TODO  ########################

def get_config():
    return{
        'batch_size': 8,
        'num_epochs': 40,
        'lr': 10**-4,
        'seq_len': 350,
        'd_model': 512,
        'lang_src': 'en',
        'lang_tgt': 'it',
        'model_folder': 'weights',
        'model_basename': 'tmodel_',
        'preload': None,
        'tokenizer_file': 'tokenizer_{0}.json',
        'experiment_name': 'runs/tmodel'
    }

def get_weights_file_path(config, epoch: str):
    model_folder = config['model_folder']
    model_basename = config['model_basename']
    model_filename = f"{model_basename}{epoch}.pt"
    return str(Path('.')/ model_folder/ model_filename)

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We finally define our last function, <code>train_model</code>, which takes the <code>config</code> arguments as input. </p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In this function, we will set everything up for the training. We will load the model and its necessary components onto the GPU for faster training, set the <code>Adam</code> optimizer, and configure the <code>CrossEntropyLoss</code> function to compute the differences between the translations output by the model and the reference translations from the dataset. </p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Every loop necessary for iterating over the training batches, performing backpropagation, and computing the gradients is in this function. We will also use it to run the validation function and save the current state of the model.</p>

In [None]:
######################  TODO  ########################
######################  TODO  ########################

# - Define a `train_model` function to train a Transformer model
# - Steps:
#   1. Set up the device (GPU or CPU) for training
#   2. Create a directory to store model weights
#   3. Retrieve dataloaders and tokenizers for source and target languages using `get_ds`
#   4. Initialize the Transformer model using `get_model` and move it to the specified device
#   5. Set up TensorBoard for logging training metrics
#   6. Configure the Adam optimizer with learning rate and epsilon from `config`
#   7. If a pre-trained model exists:
#      - Load the model, optimizer state, and global step
#      - Set the starting epoch for resuming training
#   8. Define a cross-entropy loss function:
#      - Ignore padding tokens
#      - Apply label smoothing to prevent overfitting
#   9. Start training loop:
#      - Iterate over epochs from the initial epoch to `config['num_epochs']`
#      - For each batch in the training dataloader:
#         - Set model to training mode
#         - Move input data, masks, and labels to the device
#         - Pass data through the encoder, decoder, and projection layer
#         - Compute loss between model predictions and labels
#         - Log training loss to TensorBoard
#         - Perform backpropagation and update model parameters
#         - Clear gradients for the next batch
#         - Increment global step counter
#      - After each epoch, run validation using `run_validation`
#      - Save the current model state, optimizer state, and global step
#   10. Save model weights after each epoch

######################  TODO  ########################
######################  TODO  ########################

def train_model(config):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device {device}")
    Path(config['model_folder']).mkdir(parents=True, exist_ok=True)
    train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)
    model = get_model(config,tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size()).to(device)
    writer = SummaryWriter(config['experiment_name'])
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'], eps = 1e-9)
    initial_epoch = 0
    global_step = 0
    if config['preload']:
        model_filename = get_weights_file_path(config, config['preload'])
        print(f'Preloading model {model_filename}')
        state = torch.load(model_filename)
        initial_epoch = state['epoch'] + 1
        optimizer.load_state_dict(state['optimizer_state_dict'])
        global_step = state['global_step']
    loss_fn = nn.CrossEntropyLoss(ignore_index = tokenizer_src.token_to_id('[PAD]'), label_smoothing = 0.1).to(device)
    for epoch in range(initial_epoch, config['num_epochs']):
        batch_iterator = tqdm(train_dataloader, desc = f'Processing epoch {epoch:02d}')
        for batch in batch_iterator:
            model.train()
            encoder_input = batch['encoder_input'].to(device)
            decoder_input = batch['decoder_input'].to(device)
            encoder_mask = batch['encoder_mask'].to(device)
            decoder_mask = batch['decoder_mask'].to(device)
            encoder_output = model.encode(encoder_input, encoder_mask)
            decoder_output = model.decode(encoder_output, encoder_mask, decoder_input, decoder_mask)
            proj_output = model.project(decoder_output)
            label = batch['label'].to(device)
            loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))
            batch_iterator.set_postfix({f"loss": f"{loss.item():6.3f}"})
            writer.add_scalar('train loss', loss.item(), global_step)
            writer.flush()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            global_step += 1
        run_validation(model, val_dataloader, tokenizer_src, tokenizer_tgt, config['seq_len'], device, lambda msg: batch_iterator.write(msg), global_step, writer)
        model_filename = get_weights_file_path(config, f'{epoch:02d}')
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'global_step': global_step
        }, model_filename)

We can now train the model!

In [None]:
if __name__ == '__main__':
    warnings.filterwarnings('ignore')
    config = get_config()
    train_model(config)

Using device cuda


README.md:   0%|          | 0.00/28.1k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/5.73M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/32332 [00:00<?, ? examples/s]

Max length of source sentence: 190
Max length of target sentence: 159


Processing epoch 00: 100%|██████████| 364/364 [02:07<00:00,  2.85it/s, loss=5.615]


--------------------------------------------------------------------------------
SOURCE: "Yes; he said that from mere politeness: I need not go, I am sure," I answered.
TARGET: — L'ha detto per semplice cortesia, ma non vi andrò, — risposi.
PREDICTED: — Non è , — — — — — — — — — — — — — — — .
--------------------------------------------------------------------------------
SOURCE: Sometimes I think I am in Northumberland, and that the noises I hear round me are the bubbling of a little brook which runs through Deepden, near our house;--then, when it comes to my turn to reply, I have to be awakened; and having heard nothing of what was read for listening to the visionary brook, I have no answer ready."
TARGET: Penso di essere nel Northumberland, scambio il rumore che sento intorno a me per il mormorio di un ruscello, che scorreva accanto a casa nostra. Quando viene la mia volta, debbo uscir dal sogno, ma siccome non ho ascoltato, non trovo la risposta.
PREDICTED: — Non , ma , ma , ma , m

Processing epoch 01: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=5.575]


--------------------------------------------------------------------------------
SOURCE: I knew my traveller with his broad and jetty eyebrows; his square forehead, made squarer by the horizontal sweep of his black hair.
TARGET: Riconobbi in lui il viaggiatore dalle sopracciglia corvine, dalla fronte quadrata posta in rilievo, dal taglio orizzontale dei capelli.
PREDICTED: , , , , , , .
--------------------------------------------------------------------------------
SOURCE: His shape, now divested of cloak, I perceived harmonised in squareness with his physiognomy: I suppose it was a good figure in the athletic sense of the term--broad chested and thin flanked, though neither tall nor graceful.
TARGET: Ora che non era più avvolto nella pelliccia, mi accorsi che le membra erano in armonia con i tratti, membra di atleta, dal petto largo e i fianchi raccolti, un insieme senza imponenza e senza grazia.
PREDICTED: Non , ma non , ma non , ma , ma , , , , , .


Processing epoch 02: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=4.697]


--------------------------------------------------------------------------------
SOURCE: I slowly descended.
TARGET: Scesi lentamente.
PREDICTED: .
--------------------------------------------------------------------------------
SOURCE: "Elles changent de toilettes," said Adele; who, listening attentively, had followed every movement; and she sighed.
TARGET: — Si vestono, — disse Adele, che ascoltava ogni movimento, e sospirò:
PREDICTED: — Non mi disse , — disse , — disse , — disse , — disse , — disse .


Processing epoch 03: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=5.041]


--------------------------------------------------------------------------------
SOURCE: "No, Bessie: she came to my crib last night when you were gone down to supper, and said I need not disturb her in the morning, or my cousins either; and she told me to remember that she had always been my best friend, and to speak of her and be grateful to her accordingly."
TARGET: — No, Bessie, — risposi. — Ieri sera quando scendeste per la cena, ella si avvicinò al mio letto e mi dichiarò che partendo non aveva bisogno di disturbare né lei né le mie cugine; mi disse pure che era stata sempre la mia migliore amica e che non lo dimenticassi. Poi mi pregò di parlar bene di lei e di esserle grata.
PREDICTED: — Non è la signorina Temple , — disse , — disse , — disse , — mi disse , — mi disse , — mi disse , — mi disse , — mi disse , — disse , — mi disse , — mi disse , — mi disse , — disse , — mi disse , — mi disse , — mi disse , — mi disse , — mi disse , — mi disse , — mi disse , — mi disse , — mi diss

Processing epoch 04: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=4.160]


--------------------------------------------------------------------------------
SOURCE: Besides this earth, and besides the race of men, there is an invisible world and a kingdom of spirits: that world is round us, for it is everywhere; and those spirits watch us, for they are commissioned to guard us; and if we were dying in pain and shame, if scorn smote us on all sides, and hatred crushed us, angels see our tortures, recognise our innocence (if innocent we be: as I know you are of this charge which Mr. Brocklehurst has weakly and pompously repeated at second-hand from Mrs. Reed; for I read a sincere nature in your ardent eyes and on your clear front), and God waits only the separation of spirit from flesh to crown us with a full reward.
TARGET: Al di là di questa terra vi è un regno invisibile; al disopra di questo mondo abitato dagli uomini, ve n'è uno abitato dagli spiriti, e questi spiriti vegliano su di noi, e se moriamo oppressi dalla vergogna e dal disprezzo, ci riconoscono i

Processing epoch 05: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=4.500]


--------------------------------------------------------------------------------
SOURCE: CHAPTER XVI
TARGET: XVI.
PREDICTED: .
--------------------------------------------------------------------------------
SOURCE: He bent his head a little towards me, and with a single hasty glance seemed to dive into my eyes.
TARGET: Egli chinò la testa verso di me e mi gettò negli occhi uno sguardo rapido.
PREDICTED: La signora Fairfax mi di la porta , e mi il mio giorno mi .


Processing epoch 06: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=4.873]


--------------------------------------------------------------------------------
SOURCE: "Who could want me?" I asked inwardly, as with both hands I turned the stiff door-handle, which, for a second or two, resisted my efforts.
TARGET: — Chi può aspettarmi? — dicevo fra me, mentre con tutte e due le mani giravo la maniglia, che resisteva ai miei sforzi.
PREDICTED: — E che cosa mi disse , — mi disse , — mi disse , — mi disse . — E mi domandò il mio mio sguardo di .
--------------------------------------------------------------------------------
SOURCE: "Do we pay no money? Do they keep us for nothing?"
TARGET: — Paghiamo, o siamo educate gratuitamente?
PREDICTED: — , signore ?


Processing epoch 07: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=4.376]


--------------------------------------------------------------------------------
SOURCE: I sat up in bed by way of arousing this said brain: it was a chilly night; I covered my shoulders with a shawl, and then I proceeded _to think_ again with all my might.
TARGET: Mi sedei sul letto sperando di aiutare il mio povero cervello. La notte era fredda; mi gettai uno scialle sulle spalle e mi rimisi a pensare.
PREDICTED: Mi alzai e mi accorsi che mi accorsi che mi accorsi che mi accorsi che mi accorsi che mi accorsi che mi accorsi che mi accorsi che mi avevano , ma il suo volto era , mi accorsi che mi aveva .
--------------------------------------------------------------------------------
SOURCE: Adele sang the canzonette tunefully enough, and with the _naivete_ of her age.
TARGET: Adele aveva cantato la romanza con giusto tono e con l'ingenuità propria della sua età.
PREDICTED: la e i di , i e i .


Processing epoch 08: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=4.057]


--------------------------------------------------------------------------------
SOURCE: Adele sang the canzonette tunefully enough, and with the _naivete_ of her age.
TARGET: Adele aveva cantato la romanza con giusto tono e con l'ingenuità propria della sua età.
PREDICTED: La signorina Temple la e i e i .
--------------------------------------------------------------------------------
SOURCE: "What about?" "Family troubles, for one thing." "But he has no family."
TARGET: — Da quali pensieri dunque? — Dalle lotte di famiglia.
PREDICTED: — Che cosa sono qui ? — mi domandò .


Processing epoch 09: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=4.200]


--------------------------------------------------------------------------------
SOURCE: Sometimes I think I am in Northumberland, and that the noises I hear round me are the bubbling of a little brook which runs through Deepden, near our house;--then, when it comes to my turn to reply, I have to be awakened; and having heard nothing of what was read for listening to the visionary brook, I have no answer ready."
TARGET: Penso di essere nel Northumberland, scambio il rumore che sento intorno a me per il mormorio di un ruscello, che scorreva accanto a casa nostra. Quando viene la mia volta, debbo uscir dal sogno, ma siccome non ho ascoltato, non trovo la risposta.
PREDICTED: Non mi che il mio spirito del cuore , ma il mio cuore era che un ' altra era in camera mia , e che mi la mia , ma non mi la mia .
--------------------------------------------------------------------------------
SOURCE: "I wish I could forget it," was the answer.
TARGET: — Vorrei dimenticare, ma non posso.
PREDICTED: 

Processing epoch 10: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=4.135]


--------------------------------------------------------------------------------
SOURCE: "Decidedly he has had too much wine," I thought; and I did not know what answer to make to his queer question: how could I tell whether he was capable of being re-transformed?
TARGET: — Deve aver bevuto davvero troppo vino, — pensavo, non sapendo qual risposta dargli.
PREDICTED: — La signorina Temple è qui ? — domandò il signor Rochester con me , — ma che non ha fatto , ma che ha fatto di lui .
--------------------------------------------------------------------------------
SOURCE: And dangerous he looked: his black eyes darted sparks. Calming himself by an effort, he added--
TARGET: Infatti aveva negli occhi uno sguardo terribile, e, facendo uno sforzo per calmarsi, aggiunse:
PREDICTED: E se lo , e il suo volto , che si a voce .


Processing epoch 11: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=3.532]


--------------------------------------------------------------------------------
SOURCE: "Mason!--the West Indies!" he said, in the tone one might fancy a speaking automaton to enounce its single words; "Mason!--the West Indies!" he reiterated; and he went over the syllables three times, growing, in the intervals of speaking, whiter than ashes: he hardly seemed to know what he was doing.
TARGET: — Mason, le Indie occidentali! — disse automaticamente e lo ripetè tre volte. Nel pronunziare quelle parole si faceva sempre più pallido.
PREDICTED: — È la mia parte di una donna di , — disse il signor Bockelhurst . La signora Fairfax è di Millcote , di questa bimba e di non ha potuto un un di di un un di di di , di , di di di .
--------------------------------------------------------------------------------
SOURCE: She paused, and then added, with a sort of assumed indifference, but still in a marked and significant tone--"But you are young, Miss; and I should say a light sleeper: perhaps you 

Processing epoch 12: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=3.499]


--------------------------------------------------------------------------------
SOURCE: "Is there a little girl called Jane Eyre here?" she asked.
TARGET: — C'è qui una bimba, che si chiama Jane Eyre? — domandò.
PREDICTED: — È vero , — disse Bessie Bessie Bessie ? — domandò la voce .
--------------------------------------------------------------------------------
SOURCE: I suppose you are an orphan: are not either your father or your mother dead?"
TARGET: Voi, io e tutte le altre siamo figlie della carità. Dovete essere orfana.
PREDICTED: — Non avete mai la vostra alunna ; non è vero ?


Processing epoch 13: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=3.233]


--------------------------------------------------------------------------------
SOURCE: The strangest thing of all was, that not a soul in the house, except me, noticed her habits, or seemed to marvel at them: no one discussed her position or employment; no one pitied her solitude or isolation.
TARGET: Ciò che sopratutto mi stupiva, si era che in casa nessuno pareva badare alle consuetudini di Grace, nessuno domandava che cosa facesse lassù, nessuno compiangevala della solitudine e dell'isolamento.
PREDICTED: La mia madre aveva lasciato in una donna dolce , di , di non poteva di nuovo . Non era né mai veduto la mia mente .
--------------------------------------------------------------------------------
SOURCE: What my sensations were no language can describe; but just as they all rose, stifling my breath and constricting my throat, a girl came up and passed me: in passing, she lifted her eyes.
TARGET: Nessuna parola può esprimere i miei sentimenti, ma intanto che mi gonfiavano il cuor

Processing epoch 14: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=3.030]


--------------------------------------------------------------------------------
SOURCE: "No, Bessie: she came to my crib last night when you were gone down to supper, and said I need not disturb her in the morning, or my cousins either; and she told me to remember that she had always been my best friend, and to speak of her and be grateful to her accordingly."
TARGET: — No, Bessie, — risposi. — Ieri sera quando scendeste per la cena, ella si avvicinò al mio letto e mi dichiarò che partendo non aveva bisogno di disturbare né lei né le mie cugine; mi disse pure che era stata sempre la mia migliore amica e che non lo dimenticassi. Poi mi pregò di parlar bene di lei e di esserle grata.
PREDICTED: — No , certo ; ho dimenticato di la servitù , e mi ha detto che ho fatto il mio zio ; ma non mi sono così buona che e il mio zio ; e che mi sono così poco dopo l ' ho fatto molto di più .
--------------------------------------------------------------------------------
SOURCE: I knew my traveller 

Processing epoch 15: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=2.720]


--------------------------------------------------------------------------------
SOURCE: This testimonial I accordingly received in about a month, forwarded a copy of it to Mrs. Fairfax, and got that lady's reply, stating that she was satisfied, and fixing that day fortnight as the period for my assuming the post of governess in her house.
TARGET: Ne mandai copia alla signora Fairfax e ricevei la risposta. Ella era soddisfatta e mi diceva che fra quindici giorni dovevo trovarmi al mio nuovo posto.
PREDICTED: Aveva i lineamenti e l ' ordine di quella bambina , perché era in una casa , perché la signora Fairfax non vi era nulla .
--------------------------------------------------------------------------------
SOURCE: By what instinct do you pretend to distinguish between a fallen seraph of the abyss and a messenger from the eternal throne--between a guide and a seducer?"
TARGET: In forza di quale istinto pretendete di conoscere l'angiolo caduto dal messaggero dell'Eterno? La guida dal se

Processing epoch 16: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=2.742]


--------------------------------------------------------------------------------
SOURCE: But, as I was saying: sitting in that window-seat, do you think of nothing but your future school?
TARGET: Ma quando state alla finestra, pensate forse alla vostra futura scuola?
PREDICTED: Ma non che cosa avete fatto con la vostra alunna . Non avete mai veduto ?
--------------------------------------------------------------------------------
SOURCE: "Nothing at all, sir."
TARGET: — Non ho niente, signore.
PREDICTED: — Niente , signore , signore .


Processing epoch 17: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=2.580]


--------------------------------------------------------------------------------
SOURCE: There are many others who have no friends, who must look about for themselves and be their own helpers; and what is their resource?"
TARGET: Ma vi sono tanti che non hanno amici, che debbono cavarsela da sé; quale è dunque la loro risorsa?
PREDICTED: Credete che tutta questa gente e che possa vivere in quella e le altre che per le altre e della mamma ?
--------------------------------------------------------------------------------
SOURCE: "Well," I asked impatiently, "is not Mrs. Reed a hard-hearted, bad woman?"
TARGET: — Ebbene, — le domandai, — la signora Reed non è forse una donna dura e senza cuore?
PREDICTED: — Ebbene , — risposi , — la signora Reed . — Sì , vi prego .


Processing epoch 18: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=2.070]


--------------------------------------------------------------------------------
SOURCE: Miss Temple seemed to remonstrate.
TARGET: La direttrice cercò di fare un'osservazione.
PREDICTED: La signorina Temple era terminata , e la direttrice è la direttrice .
--------------------------------------------------------------------------------
SOURCE: Besides, I know what sort of a mind I have placed in communication with my own: I know it is one not liable to take infection: it is a peculiar mind: it is a unique one.
TARGET: Del resto so con quale spirito, il mio è entrato in comunione; è uno spirito a parte, sul quale il contagio del male non avrà presa.
PREDICTED: Non so se è mia vita mi di una vita vita vita per me , ma io non posso dire di vedere se fosse stata buona .


Processing epoch 19: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.957]


--------------------------------------------------------------------------------
SOURCE: "Well," I asked impatiently, "is not Mrs. Reed a hard-hearted, bad woman?"
TARGET: — Ebbene, — le domandai, — la signora Reed non è forse una donna dura e senza cuore?
PREDICTED: — E che cosa vuole ? — domandò la signora Reed .
--------------------------------------------------------------------------------
SOURCE: "What awful event has taken place?" said she.
TARGET: — Che cosa è accaduto?
PREDICTED: — Come è il vostro nome di vedere ?


Processing epoch 20: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.888]


--------------------------------------------------------------------------------
SOURCE: Mr. Rochester lay down on a sofa in a pretty room called the salon, and Sophie and I had little beds in another place.
TARGET: Mi sentivo male e anche Sofia, anche il signor Rochester.
PREDICTED: Il signor Rochester , dopo aver posato il lume in cucina e mentre io si in silenzio .
--------------------------------------------------------------------------------
SOURCE: "Elles changent de toilettes," said Adele; who, listening attentively, had followed every movement; and she sighed.
TARGET: — Si vestono, — disse Adele, che ascoltava ogni movimento, e sospirò:
PREDICTED: — , — disse , — e prendendo un gatto in ginocchio . — Vi andrà bene di andare in silenzio .


Processing epoch 21: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.864]


--------------------------------------------------------------------------------
SOURCE: "Why," thought I, "does she not explain that she could neither clean her nails nor wash her face, as the water was frozen?"
TARGET: — Perché, — pensavo, — non le dice che non ha potuto lavarsi stamani, essendo l'acqua gelata?
PREDICTED: — Perché torna in carrozza ? — disse Bianca — è partito a e voi . — Sì , avrebbe un ' aria d ' aria nelle ore che .
--------------------------------------------------------------------------------
SOURCE: "Here is to your health, ministrant spirit!" he said.
TARGET: — Alla vostra salute, spirito benefico!
PREDICTED: — , signorina , — disse il — che sia morta .


Processing epoch 22: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=1.893]


--------------------------------------------------------------------------------
SOURCE: "You are not a servant at the hall, of course.
TARGET: — Non siete certo una delle donne di servizio della villa, siete....
PREDICTED: — Non avete mai , tanto triste .
--------------------------------------------------------------------------------
SOURCE: In this room, too, there was a cabinet piano, quite new and of superior tone; also an easel for painting and a pair of globes.
TARGET: Da un lato vi era un pianoforte nuovo e di eccellente fabbrica, due cavalletti e le sfere.
PREDICTED: Mi parve alto e mi parve che una piccola piccola piccola nuova , ma siccome era , una donna si andava a visitare le , i mobili , e incominciò a sé .


Processing epoch 23: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=1.695]


--------------------------------------------------------------------------------
SOURCE: "Vos doigts tremblent comme la feuille, et vos joues sont rouges: mais, rouges comme des cerises!"
TARGET: le dita vi tremano e avete le guance rosse come ciliege.
PREDICTED: — le mie dita per , — rispose egli . — La della dei giovani e , — benché non sia pena .
--------------------------------------------------------------------------------
SOURCE: Merry days were these at Thornfield Hall; and busy days too: how different from the first three months of stillness, monotony, and solitude I had passed beneath its roof!
TARGET: La giornata passava allegramente a Thornfield e l'attività regnava ormai al castello; quale differenza fra quella quindicina e i tre mesi di tranquillità, di monotonia e di solitudine, che avevo passati fra quelle mura!
PREDICTED: L ' campana suonò per la quarta volta ; le alunne si di nuovo e si al refettorio .


Processing epoch 24: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=1.579]


--------------------------------------------------------------------------------
SOURCE: "Decidedly he has had too much wine," I thought; and I did not know what answer to make to his queer question: how could I tell whether he was capable of being re-transformed?
TARGET: — Deve aver bevuto davvero troppo vino, — pensavo, non sapendo qual risposta dargli.
PREDICTED: — Il mio tono di , — rispose lentamente lord Ingram . — E la mia fronte avrebbe dovuto vedere le vostre consuetudini .
--------------------------------------------------------------------------------
SOURCE: I remained an inmate of its walls, after its regeneration, for eight years: six as pupil, and two as teacher; and in both capacities I bear my testimony to its value and importance.
TARGET: Dopo questa rigenerazione io vi rimasi otto anni, sei come alunna e due come maestra. Nell'una e nell'altra posizione potei render giustizia al valore e all'importanza dell'Istituto.
PREDICTED: Avevamo un solo pezzetto di candela e t

Processing epoch 25: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=1.611]


--------------------------------------------------------------------------------
SOURCE: I hurried on my frock and a shawl; I withdrew the bolt and opened the door with a trembling hand.
TARGET: M'infilai un vestito, mi ravvolsi in uno scialle, e, tirando il chiavistello, aprii la porta, tremando.
PREDICTED: La porta a voce bassa , quando c ' era portato il fuoco .
--------------------------------------------------------------------------------
SOURCE: Breakfast-time came at last, and this morning the porridge was not burnt; the quality was eatable, the quantity small.
TARGET: La colazione giunse in fine e la mia parte parvemi scarsa; ne avrei mangiato il doppio.
PREDICTED: la colazione fu terminata , si rese grazie di ciò che non si aveva avuto e si cantò un secondo inno .


Processing epoch 26: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.608]


--------------------------------------------------------------------------------
SOURCE: "Is Mr. Rochester an exacting, fastidious sort of man?"
TARGET: — È forse esigente e tirannico il signor Rochester?
PREDICTED: — Signor Rochester , il signor Rochester ?
--------------------------------------------------------------------------------
SOURCE: His shape, now divested of cloak, I perceived harmonised in squareness with his physiognomy: I suppose it was a good figure in the athletic sense of the term--broad chested and thin flanked, though neither tall nor graceful.
TARGET: Ora che non era più avvolto nella pelliccia, mi accorsi che le membra erano in armonia con i tratti, membra di atleta, dal petto largo e i fianchi raccolti, un insieme senza imponenza e senza grazia.
PREDICTED: La sua voce era meno , ma la sua presenza e mi accorsi che pareva . Il mio spirito si fosse ad alzarmi di .


Processing epoch 27: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.582]


--------------------------------------------------------------------------------
SOURCE: "Don't trouble yourself to give her a character," returned Mr. Rochester: "eulogiums will not bias me; I shall judge for myself.
TARGET: — Non vi date la pena d'analizzare il carattere di lei — disse il signor Rochester. — Gli elogi non hanno nessuna influenza sulla mia opinione; la giudicherò da me.
PREDICTED: — Non mi parlate mai ; John , — diss ' egli , — non voglio bene , — quella .
--------------------------------------------------------------------------------
SOURCE: "No."
TARGET: — No.
PREDICTED: — No .


Processing epoch 28: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.507]


--------------------------------------------------------------------------------
SOURCE: "What is it about?" I continued.
TARGET: — Di che cosa parla?
PREDICTED: — Quale è il mio ? — domandarono le domandai .
--------------------------------------------------------------------------------
SOURCE: "A great deal: you are good to those who are good to you. It is all I ever desire to be.
TARGET: — Ce n'è uno grande, al contrario; siete buona per quelli che sono buoni per voi; è stato sempre quello che ho desiderato.
PREDICTED: — che avete ragione , ve ne prego .


Processing epoch 29: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=1.549]


--------------------------------------------------------------------------------
SOURCE: It depends on yourself to stretch out your hand, and take it up: but whether you will do so, is the problem I study.
TARGET: Dipende da voi di stender la mano e prenderla e studio il vostro volto per sapere se lo farete.
PREDICTED: il bicchiere di vedere se ne , è , avrei fatto un poco , mi la morte .
--------------------------------------------------------------------------------
SOURCE: We now slowly ascended a drive, and came upon the long front of a house: candlelight gleamed from one curtained bow-window; all the rest were dark.
TARGET: Salimmo lentamente una collina e giungemmo davanti alla casa. Si vedevano brillare i lumi dietro la tenda di una finestra bifora; tutto il resto era nel buio.
PREDICTED: Ero dunque rinchiusa di notte in un campo deserto , perché le grandi si più dolci , che non aveva più nulla .


Processing epoch 30: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.570]


--------------------------------------------------------------------------------
SOURCE: "You are not a servant at the hall, of course.
TARGET: — Non siete certo una delle donne di servizio della villa, siete....
PREDICTED: — Non avete mai la tenda ,
--------------------------------------------------------------------------------
SOURCE: "Missis looks stout and well enough in the face, but I think she's not quite easy in her mind: Mr. John's conduct does not please her--he spends a deal of money."
TARGET: — La padrona sta abbastanza bene, ma credo che sia inquieta. La condotta del signor John non le va a genio; egli spende troppo denaro.
PREDICTED: — La signora vuole che via fra un poco di padrone , e la signora Reed è da padrone . Non ha detto che subito ....


Processing epoch 31: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.524]


--------------------------------------------------------------------------------
SOURCE: It emboldened me to ask a question.
TARGET: Essa mi dette animo a rivolgerle una domanda:
PREDICTED: Mi di far la sua domanda .
--------------------------------------------------------------------------------
SOURCE: The want of his animating influence appeared to be peculiarly felt one day that he had been summoned to Millcote on business, and was not likely to return till late.
TARGET: Il bisogno della presenza di lui si fece sentire specialmente un giorno in cui i suoi affari lo chiamarono a Millcote, di dove non avrebbe potuto tornare se non che tardi.
PREDICTED: La madre del suo nome , lo sguardo di vedere una lunga delle sorelle . In quella non lo , e tutte queste cose ; ma i servi non posso dire di Millcote insieme .


Processing epoch 32: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.436]


--------------------------------------------------------------------------------
SOURCE: Well, our ship stopped in the morning, before it was quite daylight, at a great city--a huge city, with very dark houses and all smoky; not at all like the pretty clean town I came from; and Mr. Rochester carried me in his arms over a plank to the land, and Sophie came after, and we all got into a coach, which took us to a beautiful large house, larger than this and finer, called an hotel.
TARGET: Ebbene, il bastimento si fermò la mattina, prima che il sole fosse alzato, in una grande grande città, nera nera, tutta coperta di fumo. Non somigliava punto alla città che avevo lasciata.
PREDICTED: Ella invitò e me lo alla tavola , collocò dinanzi a noi le tazze e i crostini , poi tolse da un cassetto un pan pepato , ravvolto con cura , e la sua mano ne ne delle grosse .
--------------------------------------------------------------------------------
SOURCE: The lesson had comprised part of the reign of

Processing epoch 33: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.513]


--------------------------------------------------------------------------------
SOURCE: "Here is to your health, ministrant spirit!" he said.
TARGET: — Alla vostra salute, spirito benefico!
PREDICTED: — , cara che è accaduto ; mi avete .
--------------------------------------------------------------------------------
SOURCE: "What about?" "Family troubles, for one thing." "But he has no family."
TARGET: — Da quali pensieri dunque? — Dalle lotte di famiglia.
PREDICTED: — Allora , che cosa sia così , — non è vero per me ?


Processing epoch 34: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.424]


--------------------------------------------------------------------------------
SOURCE: I slowly descended.
TARGET: Scesi lentamente.
PREDICTED: poi attorno al cavaliere .
--------------------------------------------------------------------------------
SOURCE: I say scarcely voluntary, for it seemed as if my tongue pronounced words without my will consenting to their utterance: something spoke out of me over which I had no control.
TARGET: Vi era in me una forza che mi spingeva a parlare, nonostante la volontà di tacere.
PREDICTED: L ' argomento era stranamente scelto per una bambina , ma io che l ' originalità stava appunto nell ' udire in bocca di una creaturina accenti d ' amore e di gelosia .


Processing epoch 35: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.471]


--------------------------------------------------------------------------------
SOURCE: The next day commenced as before, getting up and dressing by rushlight; but this morning we were obliged to dispense with the ceremony of washing; the water in the pitchers was frozen.
TARGET: Il giorno seguente incominciò nella stessa maniera che il primo; ci levammo, ci vestimmo senza lume, ma quella mattina fummo dispensate dal lavarci, perché l'acqua era gelata nelle catinelle.
PREDICTED: Il giorno dopo splendeva un sole raggiante , e fu a una escursione nei dintorni .
--------------------------------------------------------------------------------
SOURCE: "I am afraid I never shall do that."
TARGET: — Credo di non potermi consolare mai.
PREDICTED: — Sono stanca , perché non .


Processing epoch 36: 100%|██████████| 364/364 [02:10<00:00,  2.80it/s, loss=1.454]


--------------------------------------------------------------------------------
SOURCE: "Yes--'after life's fitful fever they sleep well,'" I muttered.
TARGET: — Sì, dopo la febbre della vita, dormono tranquilli, — mormorai.
PREDICTED: — Sì , da nove anni ; perché è padrone di questo .
--------------------------------------------------------------------------------
SOURCE: "Burns, you poke your chin most unpleasantly; draw it in." "Burns, I insist on your holding your head up; I will not have you before me in that attitude," &c. &c.
TARGET: Burns, vi ho detto di tener la testa diritta; non voglio vedervi davanti a me in quell'atteggiamento.
PREDICTED: — Come voi , — continuò il viso di , — che siate stato così buona , per me il mio cassetto , e non lo sono ancora sicuro .


Processing epoch 37: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=1.438]


--------------------------------------------------------------------------------
SOURCE: Adele, indeed, no sooner saw Mrs. Fairfax, than she summoned her to her sofa, and there quickly filled her lap with the porcelain, the ivory, the waxen contents of her "boite;" pouring out, meantime, explanations and raptures in such broken English as she was mistress of.
TARGET: Infatti Adele appena ebbe veduto la vedova, la chiamò e le mise in grembo l'avorio, la porcellana e tutto ciò che conteneva la scatola, esprimendo la sua gioia con frasi monche, perché parlava male inglese.
PREDICTED: Adele era sempre solenne . Si sedè su un panchettino che le ; io non le altre che ; aveva rotto le altre e con un grido d ' oro , si sarebbe .
--------------------------------------------------------------------------------
SOURCE: "But John Reed knocked me down, and my aunt shut me up in the red-room."
TARGET: — John Reed mi buttò in terra e la zia mi rinchiuse nella sala rossa.
PREDICTED: — Ho troppa sete p

Processing epoch 38: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=1.428]


--------------------------------------------------------------------------------
SOURCE: Next day, by noon, I was up and dressed, and sat wrapped in a shawl by the nursery hearth.
TARGET: La mattina dopo, verso mezzogiorno, ero alzata e vestita, e, dopo essermi rinvoltata in uno scialle, mi ero seduta accanto al fuoco.
PREDICTED: Il giorno dopo splendeva un sole raggiante , e mi sedei tranquillamente in un canto .
--------------------------------------------------------------------------------
SOURCE: "Yes; he did not stay many minutes in the house: Missis was very high with him; she called him afterwards a 'sneaking tradesman.'
TARGET: — Sì, non rimase molto in casa. La signora gli parlava in tono imperioso, e dietro le spalle lo trattò di vile commerciante.
PREDICTED: — Sì , certo , ma non è responsabile delle colpe di sua madre né delle vostre ; poiché sua madre l ' ha abbandonata e la , ebbene ?


Processing epoch 39: 100%|██████████| 364/364 [02:09<00:00,  2.80it/s, loss=1.467]


--------------------------------------------------------------------------------
SOURCE: It was Miss Reed that found them out: I believe she was envious; and now she and her sister lead a cat and dog life together; they are always quarrelling--"
TARGET: Miss Elisa li scoperse: io credo che sia invidiosa, e ora lei e la sorella vivono insieme come cani e gatti, e si leticano sempre.
PREDICTED: Un ' altra cosa Reed era nella mia mente , e sapevo o piuttosto l ' originalità di studio . Non era né tenerezza nel ; nel cervello , e anche là , e anche più piccolo al castello , e che è stato anche di Lowood .
--------------------------------------------------------------------------------
SOURCE: One evening, in the beginning of June, I had stayed out very late with Mary Ann in the wood; we had, as usual, separated ourselves from the others, and had wandered far; so far that we lost our way, and had to ask it at a lonely cottage, where a man and woman lived, who looked after a herd of half-wil

# Section 2: BERT and LoRA

Welcome to Section 2 of our Machine Learning assignment! I hope you've been enjoying the journey so far! 😊

 In this section, you will gain hands-on experience with [BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers) and [LoRA](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation) for text classification tasks. The section is divided into three main parts, each focusing on different aspects of NLP techniques.

## Assignment Structure

### Part 1: Data Preparation and Preprocessing
In this part, you will work with a text classification dataset. You will learn how to:
- Download and load the dataset
- Perform necessary preprocessing steps
- Implement data cleaning and transformation techniques
- Prepare the data in a format suitable for BERT training

### Part 2: Building a Small BERT Model
You will create and train a small BERT model from scratch using the Hugging Face [Transformers](https://huggingface.co/docs/transformers/en/index) library. This part will help you understand:
- The architecture of BERT
- How to configure and initialize a BERT model
- Training process and optimization
- Model evaluation and performance analysis

### Part 3: Fine-tuning with LoRA
In the final part, you will work with a pre-trained [TinyBERT](https://arxiv.org/abs/1909.10351) model and use LoRA for efficient fine-tuning. You will:
- Load a pre-trained TinyBERT model
- Implement LoRA adaptation and fine-tune the model on our classification task
- Compare the results with the previous approach

---

> **NOTE**:  
> Throughout this notebook, make an effort to include sufficient visualizations to enhance understanding:  
> - In the data processing section, display the results of your operations (e.g., show data samples or distributions after preprocessing).  
> - In the classification section, report various evaluation metrics such as accuracy, precision, recall, and F1-score to thoroughly assess your model's performance.  
> - Additionally, take a moment to compare the sizes of the models discussed in this notebook with today’s enormous models. This will help you appreciate the challenges and computational demands associated with training such massive models. 😵‍💫

---


## Part 1: Data Preparation and Preprocessing
We'll be working with the [Consumer Complaint](https://catalog.data.gov/dataset/consumer-complaint-database) dataset, which contains ***complaints*** submitted by consumers about financial products and services. Our goal is to build a classifier that can automatically identify the type of complaint based on the consumer's text description. For this task, we will work with a smaller subset of the dataset, available for download through this [link](https://drive.google.com/file/d/1SpIHksR-WzruEgUjp1SQKGG8bZPnJJoN/view?usp=sharing).

In [1]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, BertConfig
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

### 1.2 Loading the Data

In [2]:
!pip install gdown
!gdown --id 1SpIHksR-WzruEgUjp1SQKGG8bZPnJJoN
!unzip complaints_small.zip

Downloading...
From (original): https://drive.google.com/uc?id=1SpIHksR-WzruEgUjp1SQKGG8bZPnJJoN
From (redirected): https://drive.google.com/uc?id=1SpIHksR-WzruEgUjp1SQKGG8bZPnJJoN&confirm=t&uuid=a5dd8e12-5774-4521-8a42-3f3c1e9577b7
To: /content/complaints_small.zip
100% 290M/290M [00:01<00:00, 195MB/s]
Archive:  complaints_small.zip
  inflating: complaints_small.csv    


In [14]:
######################  TODO  ########################
######################  TODO  ########################
# Load the dataset
df = pd.read_csv("complaints_small.csv")
######################  TODO  ########################
######################  TODO  ########################

### 1.3 Data Sampling and Class Distribution Analysis

Working with large datasets can be computationally intensive during development. Additionally, imbalanced class distribution can affect model performance. In this section, you'll sample the data and analyze class distributions to make informed decisions about your training dataset.

---

We'll work with a manageable portion of the data to develop and test our approach. While using the complete dataset would likely yield better results, a smaller sample allows us to prototype our solution more efficiently.


In [15]:
######################  TODO  ########################
######################  TODO  ########################

# - Sample a portion of the complete dataset
# - Display the first few rows of your sampled dataset
# - Print the shape of your original and sampled datasets

df = df.sample(frac=0.1, random_state=42)
print(df.head())
print(f"Original dataset shape: {df.shape}")

######################  TODO  ########################
######################  TODO  ########################

                                                  Product  \
335123  Credit reporting or other personal consumer re...   
601718                                           Mortgage   
847752  Credit reporting, credit repair services, or o...   
765316  Credit reporting or other personal consumer re...   
798300  Credit reporting, credit repair services, or o...   

                             Consumer complaint narrative  
335123  Upon reviewing my credit report, I have identi...  
601718  I was doing a rate check to refinance. The age...  
847752  This is my 2nd request that I have been a vict...  
765316  I'm sending this compliant to inform credit bu...  
798300  Im submitting a complaint to you today to info...  
Original dataset shape: (94113, 2)


---

Let's examine the distribution of ***complaints*** types in our dataset. You'll notice that some products have significantly more instances than others, and some categories are quite similar. For example:

- Multiple categories might refer to similar financial products
- Some categories might have very few examples
- Certain categories might be subcategories of others

You have two main approaches to handle this situation:

1. **Merging Similar Classes:** Identify categories that represent similar products/services and Combine them to create more robust, general categories

2. **Selecting Major Classes:** Only select the categories with sufficient representation



> You may choose any approach, but after this step, your data must include **at least five** distinct classes.



In [16]:
######################  TODO  ########################
######################  TODO  ########################

# - Display the number of complaints in each product category
# - Identify which classes are under-represented
class_counts = df['Product'].value_counts()
print(f"Class distribution:\n{class_counts}")

# - Handle class imbalance by choosing and implementing one of these approaches:
#   1. Merge similar product categories (e.g., combining related categories)
#   2. Keep only the major classes with sufficient examples
df['Product'] = df['Product'].replace({
    'Credit card': 'Credit services',
    'Credit reporting': 'Credit services'
})

######################  TODO  ########################
######################  TODO  ########################

Class distribution:
Product
Credit reporting, credit repair services, or other personal consumer reports    32262
Credit reporting or other personal consumer reports                             25121
Debt collection                                                                 11727
Mortgage                                                                         4941
Checking or savings account                                                      4566
Credit card or prepaid card                                                      4269
Credit card                                                                      2504
Student loan                                                                     1880
Money transfer, virtual currency, or money service                               1829
Vehicle loan or lease                                                            1439
Credit reporting                                                                 1231
Payday loan, title loan, o

---
### 1.4 Data Encoding and Text Preprocessing

Before training our model, we need to prepare both our target labels and text data. This involves converting categorical labels into numerical format and cleaning our text data to improve model performance.

In [17]:
######################  TODO  ########################
######################  TODO  ########################

# Label Encoding
# - Apply label encoding to convert product categories into numeric values
label_encoder = LabelEncoder()
df['Product'] = label_encoder.fit_transform(df['Product'])

# Text Preprocessing
# Choose and implement preprocessing steps that you think will improve the quality of your text data.
# Here are some suggestions:
def preprocess_text(text):
    text = text.lower()
    text = ''.join(c for c in text if c.isalnum() or c.isspace())
    text = ' '.join(word for word in text.split() if len(word) > 2)
    return text

df['Consumer complaint narrative'] = df['Consumer complaint narrative'].apply(preprocess_text)
# - Remove special characters and punctuation
# - Remove very short complaints (e.g., less than 10 words)
# - Remove HTML tags if present
df = df[df['Consumer complaint narrative'].str.split().apply(len) >= 10]
print(f"Dataset shape after preprocessing: {df.shape}")

######################  TODO  ########################
######################  TODO  ########################

Dataset shape after preprocessing: (92436, 2)


## 1.5 Dataset Creation and Tokenization

For training our BERT model, we need to:
1. Create a custom Dataset class that will handle tokenization
2. Split the data into training and testing sets
3. Use BERT's tokenizer to convert text into a format suitable for the model

In [18]:
######################  TODO  ########################
######################  TODO  ########################

class ComplaintDataset(Dataset):
    """A custom Dataset class for handling consumer complaints text data with BERT tokenization.

    Parameters:
        texts (List[str]): List of complaint texts to be processed
        labels (List[int]): List of encoded labels corresponding to each text
        tokenizer (BertTokenizer): A BERT tokenizer instance for text processing
        max_len (int, optional): Maximum length for padding/truncating texts. Defaults to 512

    Returns:
        dict: For each item, returns a dictionary containing:
            - input_ids (torch.Tensor): Encoded token ids of the text
            - attention_mask (torch.Tensor): Attention mask for the padded sequence
            - labels (torch.Tensor): Encoded label as a tensor
    """
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts.iloc[idx]
        label = self.labels.iloc[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

######################  TODO  ########################
######################  TODO  ########################

In [8]:
######################  TODO  ########################
######################  TODO  ########################

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    df['Consumer complaint narrative'],
    df['Product'],
    test_size=0.2,
    random_state=42
)

# Initialize tokenizer and create datasets
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
train_dataset = ComplaintDataset(X_train, y_train, tokenizer)
test_dataset = ComplaintDataset(X_test, y_test, tokenizer)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

######################  TODO  ########################
######################  TODO  ########################

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]



## Part 2: Training a Small-Size BERT Model

In this part, we will explore how to build and train a small-sized BERT model for our classification task. Instead of using the full-sized BERT model, which is computationally expensive, we will create a smaller version using the Transformers library.

In [9]:
######################  TODO  ########################
######################  TODO  ########################

# 1. Define your BERT model for sequence classification
#    Ensure that you set up the configuration properly (e.g., specify the number of output labels).
# 2. Print the total number of trainable parameters in the model to understand its size.
config = BertConfig.from_pretrained("bert-base-uncased", num_labels=len(label_encoder.classes_))
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", config=config)

######################  TODO  ########################
######################  TODO  ########################

---

Now that you have defined your model, it's time to train it!☠️

Training a model of this size can take some time, depending on the available resources. To manage this, you can train your model for just **2–3 epochs** to demonstrate progress. Here are some hints:
- **Training Metrics:** Ensure you print enough metrics, such as loss and accuracy, to track the training progress.
- **Interactive Monitoring:** Use the `tqdm` library to display the progress of your training loop in real-time.

In [None]:
######################  TODO  ########################
######################  TODO  ########################

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3

# Training loop
for epoch in range(num_epochs):
    model.train()
    for batch in tqdm(train_loader):
        optimizer.zero_grad()

        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        # TODO: Perform backpropagation and update the optimizer. Hint: Use outputs.loss to access the model's loss.
        loss = outputs.loss
        accuracy = outputs.accuracy
        loss.backward()
        optimizer.step()
        # TODO: Monitor the training process by reporting metrics such as loss and accuracy.
        print(f"Epoch {epoch + 1}/{num_epochs} - Loss: {loss:.4f}, Accuracy: {accuracy:.2f}%")

# TODO : Evaluate the model on test dataset
model.eval()
total, correct = 0, 0
for batch in test_loader:
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']
    labels = batch['labels']
    outputs = lora_model(input_ids=input_ids, attention_mask=attention_mask)
    predictions = torch.argmax(outputs.logits, dim=1)
    correct += (predictions == labels).sum().item()
    total += labels.size(0)

print(f"Test Accuracy: {correct / total * 100:.2f}%")
######################  TODO  ########################
######################  TODO  ########################

Epoch 1/3: 100%|██████████| 4325/4325 [09:00<00:00,  8.00it/s]


Epoch 1/3 - Loss: 0.5887, Accuracy: 79.07%


Epoch 2/3: 100%|██████████| 4325/4325 [08:46<00:00,  8.21it/s]


Epoch 2/3 - Loss: 0.3487, Accuracy: 88.28%


Epoch 3/3: 100%|██████████| 4325/4325 [08:43<00:00,  8.26it/s]


Epoch 3/3 - Loss: 0.2994, Accuracy: 89.89%


Evaluating on Test Set: 100%|██████████| 1082/1082 [01:16<00:00, 14.16it/s]

Test Accuracy: 89.70%





## Part 3: Fine-Tuning TinyBERT with LoRA

As you have experienced, training even a small-sized BERT model can be computationally intensive and time-consuming. To address these challenges, we explore **Parameter-Efficient Fine-Tuning (PEFT)** methods, which allow us to utilize the power of large pretrained models without requiring extensive resources.

---

### **Parameter-Efficient Fine-Tuning (PEFT)**

PEFT methods focus on fine-tuning only a small portion of the model’s parameters while keeping most of the pretrained weights frozen. This drastically reduces the computational and storage requirements while leveraging the rich knowledge embedded in pretrained models.

One popular PEFT method is LoRA (Low-Rank Adaptation).

- **What is LoRA?**

LoRA introduces a mechanism to fine-tune large language models by injecting small low-rank matrices into the model's architecture. Instead of updating all parameters during training, LoRA trains these small matrices while keeping the majority of the original parameters frozen.  This is achieved as follows:

1. **Frozen Weights**: The pretrained weights of the model, represented as a weight matrix $ W \in \mathbb{R}^{d \times k} $, remain **frozen** during fine-tuning.

2. **Low-Rank Decomposition**:
   Instead of directly updating $ W $, LoRA introduces two trainable matrices, $ A \in \mathbb{R}^{d \times r} $ and $ B \in \mathbb{R}^{r \times k} $, where $ r \ll \min(d, k) $.  
   These matrices approximate the update to $ W $ as:
   $$
   \Delta W = A \cdot B
   $$

   Here, $ r $, the rank of the decomposition, is a key hyperparameter that determines the trade-off between computational cost and model capacity.

3. **Adaptation**:
   During training, instead of updating $ W $, the adapted weight is:
   $$
   W' = W + \Delta W = W + A \cdot B
   $$
   Only the low-rank matrices $ A $ and $ B $ are optimized, while $ W $ remains fixed.

4. **Efficiency**:
   Since $ r $ is much smaller than $ d $ and $ k $, the number of trainable parameters in $ A $ and $ B $ is significantly less than in $ W $. This makes the approach highly efficient both in terms of computation and memory.

---

###  **Fine-Tuning TinyBERT**

For this part, we will fine-tune **TinyBERT**, a distilled version of BERT, using the LoRA method.

- **What is TinyBERT?**

TinyBERT is a lightweight version of the original BERT model created through knowledge distillation. It significantly reduces the model size and inference latency while preserving much of the original BERT’s effectiveness. Here are some key characteristics of TinyBERT:
- It is designed to be more resource-efficient for tasks such as classification, question answering, and more.
- TinyBERT retains a compact structure with fewer layers and parameters, making it ideal for fine-tuning with limited computational resources.


> Similar to the previous section, training this model might take some time. Given the resource limitations, you can train the model for just **2-3 epochs** to demonstrate the process.


In [10]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np
from tqdm import tqdm

In [11]:
######################  TODO  ########################
######################  TODO  ########################

# Load the pre-trained TinyBERT
model_name = "prajjwal1/bert-tiny"
base_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define LoRA Configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=32,
    target_modules=["classifier"],
    lora_dropout=0.1
)

######################  TODO  ########################
######################  TODO  ########################

config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/17.8M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

In [12]:
######################  TODO  ########################
######################  TODO  ########################

# Apply LoRA to model
lora_model = get_peft_model(model, lora_config)

# TODO: Show the number of trainable parameters
print(f"Number of trainable parameters with LoRA: {sum(p.numel() for p in lora_model.parameters() if p.requires_grad)}")

# Training configuration
optimizer = AdamW(lora_model.parameters(), lr=5e-5)
criterion = nn.CrossEntropyLoss()

######################  TODO  ########################
######################  TODO  ########################

Number of trainable parameters with LoRA: 8837


In [13]:
######################  TODO  ########################
######################  TODO  ########################

num_epochs = 2

# Training loop
for epoch in range(num_epochs):

    lora_model.train()

    for batch in tqdm(train_loader):
        optimizer.zero_grad()

        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']

        outputs = lora_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        # TODO: Perform backpropagation and update the optimizer. Hint: Use outputs.loss to access the model's loss.
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        # TODO: Monitor the training process by reporting metrics such as loss and accuracy.
        print(f"Epoch {epoch + 1}/{num_epochs} - Loss: {loss:.4f}, Accuracy: {accuracy:.2f}%")


# TODO : Evaluate the model on test dataset
lora_model.eval()
total, correct = 0, 0
for batch in test_loader:
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']
    labels = batch['labels']
    outputs = lora_model(input_ids=input_ids, attention_mask=attention_mask)
    predictions = torch.argmax(outputs.logits, dim=1)
    correct += (predictions == labels).sum().item()
    total += labels.size(0)

print(f"Test Accuracy: {correct / total * 100:.2f}%")
######################  TODO  ########################
######################  TODO  ########################

100%|██████████| 4325/4325 [06:29<00:00, 11.10it/s]


Epoch 1/2 - Loss: 0.8656, Accuracy: 0.6880


100%|██████████| 4325/4325 [05:43<00:00, 12.61it/s]


Epoch 2/2 - Loss: 0.6698, Accuracy: 0.7612


100%|██████████| 1082/1082 [01:17<00:00, 14.02it/s]

Test Accuracy: 0.8069



