Natural Language Processing Tutorial
======

This is the tutorial of the 2024 [Mediterranean Machine Learning Summer School](https://www.m2lschool.org/) on Natural Language Processing!

This tutorial will explore the fundamental aspects of Natural Language Processing (NLP). Basic Python programming skills are expected.
Prior knowledge of standard NLP techniques (e.g. text tokenization and classification with ML) is beneficial but optional when working through the notebooks as they assume minimal prior knowledge.

This tutorial combines detailed analysis and development of essential NLP concepts via custom (i.e. from scratch) implementations. Other necessary NLP components will be developed using PyTorch's NLP library implementations. As a result, the tutorial offers deep understanding and facilitates easy usage in future applications.

## Outline

* Part I: Introduction to Text Tokenization and Classification
  *  Text Classification: Simple Classifier
  *  Text Classification: Encoder-only Transformer

* Part II: Introduction to Decoder-only Transformer and Sparse Mixture of Experts Architecture
  *  Text Generation: Decoder-only Transformer
  *  Text Generation: Decoder-only Transformer + MoE

* Part III: Introduction to Parameter Efficient Fine-tuning
  *  Fine-tuning the full Pre-trained Models
  *  Fine-tuning using Low-Rank Adaptation of Large Language Models (LoRA)

## Notation

* Sections marked with [📚] contain cells that you should read, modify and complete to understand how your changes alter the obtained results.
* External resources are mentioned with [✨]. These provide valuable supplementary information for this tutorial and offer opportunities for further in-depth exploration of the topics covered.


## Libraries

This tutorial leverages [PyTorch](https://pytorch.org/) for neural network implementation and training, complemented by standard Python libraries for data processing and the [Hugging Face](https://huggingface.co/) datasets library for accessing NLP resources.

GPU access is recommended for optimal performance, particularly for model training and text generation. While all code can run on CPU, a CUDA-enabled environment will significantly speed up these processes.

## Credits

The tutorial is created by:

* [Luca Herranz-Celotti](http://LuCeHe.github.io)
* [Georgios Peikos](https://www.linkedin.com/in/peikosgeorgios/)

It is inspired by and synthesizes various online resources, which are cited throughout for reference and further reading.

## Note for Colab users

To grab a GPU (if available), make sure you go to `Edit -> Notebook settings` and choose a GPU under `Hardware accelerator`



In this notebook we will show how a simple sentiment classification task can be solved using first a simple neural network in PyTorch, and then using the great Transformer encoder. Let's begin.

# Chapter I. Simple Architecture for Language Classification

##Step 1: Load Packages

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m527.3/527.3 kB[0m [31m15.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
[2

In [2]:
from tqdm import tqdm

import torch
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.optim as optim

from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer
from tokenizers.processors import BertProcessing

##📚 Step 2: Load a Dataset
We'll use the ✨ [HuggingFace datasets](https://huggingface.co/datasets/nyu-mll/glue) library to load a dataset to play with. Let's use the GLUE MRPC dataset for sentiment analysis. ✨ [GLUE](https://gluebenchmark.com/), the General Language Understanding Evaluation benchmark, is a collection of resources for training, evaluating, and analyzing natural language understanding systems. The ✨ [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398), Microsoft Research Paraphrase Corpus, is a corpus of sentence pairs automatically extracted from online news sources, with human annotations indicating whether the sentences in the pair are semantically equivalent. We will load also another dataset that we will use to create our tokenizer.

In [3]:
# EXERCISE: Load the GLUE MRPC dataset
dataset = load_dataset("glue", "mrpc")

# EXERCISE: Create the tokenizer on a different dataset than the one used for
# training. Load the train split of the wikitext-103-raw-v1 dataset.
# For speed we will use only the 100K sentences.
num_sentences = 100_000
tokenizer_dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train").select(range(num_sentences))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/649k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/75.7k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/308k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

Downloading readme:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/733k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/157M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/157M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/1801350 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

##📚 Step 3: Tokenize the Dataset with tokenizers
In order to turn long texts into numbers that can be used by the mathematics of a neural network, we have to first cut long texts into small pieces, which is called tokenization, which in turn will make the step to turn those pieces into integers very simple.

Four well-known types of tokenization are Character-level, Word-level, BPE and WordPiece, the last two known as two Subword tokenizers:

1. **Character-Level Tokenization:**
  - **Description:** Character-level tokenization breaks down text into individual characters. Each character, including spaces and punctuation, is treated as a separate token. Used in the old days.
  - **Example:** For the sentence "Hello, world!", character-level tokenization would result in tokens: ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']

2. **Word-Level Tokenization:**
  - **Description:** Word-level tokenization splits text into words based on whitespace or punctuation. Each word is considered a separate token. Used in the old days.
  - **Example:** For the sentence "Hello, world!", word-level tokenization would result in tokens: ['Hello', ',', 'world', '!']
3. **Byte-Level Byte Pair Encoding (Byte-level BPE):**
  - **Description:** Byte-level BPE tokenization operates on bytes of the input text. It uses a merge operation to gradually build a vocabulary of byte pairs, making it useful for handling multilingual texts and rare characters. Used by e.g. GPT-2, RoBERTa.
  - **Example:** It creates tokens based on byte pairs, such as "b" and "an" merging into a single token of "ban".
4. **WordPiece Tokenization:**
  - **Description:** WordPiece tokenization breaks words into smaller units. It begins with a basic vocabulary of individual characters and merges the most frequent character sequences to form new tokens. Used by e.g. BERT, DistilBERT, and Electra.
  - **Example:** For the word "tokenization", WordPiece might create tokens like "token", "##ization" where "##" indicates continuation.

Better tokenizers have been developed to serve different purposes in natural language processing and generation tasks, from handling character-level nuances to efficiently managing vocabulary size and handling unseen words.

Typically you will end up using an existing tokenizer, for example the one used by GPT-2 is relatively popular, but here we show you the steps to create one from scratch using the tokenizers library by HuggingFace.

In [4]:
# Set the maximal number of integers fed to the Neural Network per sentence
max_length = 128

# Set the number of elements the tokenizer will create as its vocabulary
vocab_size = 30522

# Initialize the tokenizer with a WordPiece model, using "[UNK]" for unknown tokens
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# EXERCISE: Configure the tokenizer to split input text based on whitespace, as a .pre_tokenizer
tokenizer.pre_tokenizer = Whitespace()

# Display the dataset to be used for training the tokenizer
print(tokenizer_dataset)
train_texts = tokenizer_dataset['text']

# EXERCISE: Train the tokenizer
trainer = WordPieceTrainer(vocab_size=vocab_size, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(train_texts, trainer=trainer)

# Set up post-processing to handle padding and truncation as BERT inputs
tokenizer.post_processor = BertProcessing(
    ("[SEP]", tokenizer.token_to_id("[SEP]")), # Token to mark the end of a sequence
    ("[CLS]", tokenizer.token_to_id("[CLS]")) # Token to mark the beginning of a sequence
)

# EXERCISE: Enable truncation to ensure long sequences do not exceed max_length
tokenizer.enable_truncation(max_length=max_length)

# EXERCISE: Enable padding to ensure short sequences reach the max_length, adding the
# "[PAD]" token at the end of the sentence.
tokenizer.enable_padding(length=max_length, pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")

Dataset({
    features: ['text'],
    num_rows: 100000
})


In [5]:
# Example texts
texts = [
    "Hello, how are you?",
    "I am fine, thank you!",
    "What about you?",
    "[MASK][CLS]"
]

# Show the effect of tokenizing random sentences
for text in texts:
    print('-'*30)
    print("Text:  ", text)
    output = tokenizer.encode(text, 'nice')
    print("Tokens:", output.tokens)
    print("IDs:   ", output.ids)
    print("length:", len(output.ids))

------------------------------
Text:   Hello, how are you?
Tokens: ['[CLS]', 'Hello', ',', 'how', 'are', 'you', '?', '[SEP]', 'nice', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[

Let's apply our newly created tokenizer to the dataset we want to train our Neural Network to solve.

In [6]:
def tokenize_function(batch):
    # EXERCISE: Tokenize each example in the batch
    tokenized_batch = tokenizer.encode_batch(list(zip(batch['sentence1'], batch['sentence2'])))

    # EXERCISE: Prepare tokenized outputs in the required format
    tokenized_dict = {
        'input_ids': [encoding.ids for encoding in tokenized_batch],
        'attention_mask': [encoding.attention_mask for encoding in tokenized_batch]
    }

    return tokenized_dict

# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

Now we need to turn the list of sequences into matrices, also known as mini-batches of data, of shape (batch_size, max_length), which is the standard way to feed data to Neural Networks.

In [7]:
batch_size = 64

# Convert the tokenized datasets to TensorDatasets
def convert_to_tensors(tokenized_dataset):
    input_ids = torch.tensor(tokenized_dataset['input_ids'])
    attention_mask = torch.tensor(tokenized_dataset['attention_mask'])
    # EXERCISE: convert labels to tensor too
    labels = torch.tensor(tokenized_dataset['label'])
    return TensorDataset(input_ids, attention_mask, labels)

train_dataset = convert_to_tensors(tokenized_datasets['train'])
test_dataset = convert_to_tensors(tokenized_datasets['test'])

# Create DataLoader objects
# EXERCISE: set the batch_size and shuffle only the train set
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

## 📚 Step 4: Define the Neural Network

We start with an extremely simple Neural Network. First, each integer defined through the tokenization process is assigned a random vector that is learnable, meaning the training process will change its values. That random vector is called embedding, so each word in the sentence will be represented by a learnable vector.

Second, since each sentence has variable length, we will take the mean of the sentence over the time axis, to end up with a representation of the sentence that is of the length of the embedding vector.

Finally we will use a linear layer to turn that mean embedding, into two possible outcomes: one will represent the network's estimate of the sentence being negative, and the other will represent its estimate of the sentence being positive.

In [8]:
class SimpleModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        super(SimpleModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, output_dim)

    def forward(self, input_ids, attention_mask):
        embedded = self.embedding(input_ids)

        # Apply attention mask
        masked_embedded = embedded * attention_mask.unsqueeze(-1).float()

        # Average the embeddings across the temporal dimension
        # EXERCISE: average each sentence score with the sentence length
        summed = masked_embedded.sum(1)
        counts = attention_mask.sum(1, keepdim=True)
        averaged = summed / counts

        # Pass through the fully connected layer
        output = self.fc(averaged)
        return output

## 📚 Step 5: Train and Evaluate

In [9]:
def train(model, num_epochs = 2, lr=1e-3, weight_decay=0.01):
    # Define loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    # Initialize the model
    initial_embedding_weights = model.embedding.weight.data.clone()

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        pbar = tqdm(train_dataloader)
        for batch in pbar:
            input_ids, attention_mask, labels = batch

            # EXERCISE: Zero the gradients
            optimizer.zero_grad()

            # EXERCISE: Forward pass, pass the inputs and the attention_mask
            outputs = model(input_ids, attention_mask.to(torch.float))

            # EXERCISE: Compute loss
            loss = criterion(outputs, labels)

            # EXERCISE: Backward pass and optimization step
            loss.backward()
            optimizer.step()
            pbar.set_description(f"Loss: {loss.item():.2f}")

    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item()}")

    final_embedding_weights = model.embedding.weight.data

    # Check if the weights have changed
    weights_changed = not torch.equal(initial_embedding_weights, final_embedding_weights)
    print("Embedding weights changed during training:", weights_changed)


In [10]:
def evaluate(model):
    # Evaluation loop
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in tqdm(test_dataloader):
            input_ids, attention_mask, labels = batch

            # EXERCISE: Forward pass with masks
            outputs = model(input_ids, attention_mask)

            # Get predictions
            _, predicted = torch.max(outputs, 1)

            # Update accuracy
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    # EXERCISE: compute the total accuracy
    accuracy = correct / total
    print(f"\n\nTest Accuracy: {accuracy * 100:.2f}%")

In [11]:
# Hyperparameters
vocab_size = tokenizer.get_vocab_size()
embedding_dim = 64
output_dim = 2  # Binary classification

simple_model = SimpleModel(vocab_size, embedding_dim, output_dim)

params = sum(p.numel() for p in simple_model.parameters() if p.requires_grad)
print(f'The model has {params} parameters')

# EXERCISE: play with learning rates in the set [1e-2, 1e-3, 1e-4, 1e-5]
# to find the best choice
train(simple_model, lr=1e-4, weight_decay=0.001)
evaluate(simple_model)

The model has 1953538 parameters


Loss: 0.63: 100%|██████████| 58/58 [00:01<00:00, 37.25it/s]
Loss: 0.69: 100%|██████████| 58/58 [00:01<00:00, 38.94it/s]


Epoch 2/2, Loss: 0.6931377053260803
Embedding weights changed during training: True


100%|██████████| 27/27 [00:00<00:00, 409.23it/s]



Test Accuracy: 65.80%





# Chapter II: Transformer-based Architecture for Language Classification

✨ [Transformers](https://arxiv.org/pdf/1706.03762) appeared as the best option first for language translation, replacing RNNs. Now RNNs are making a come back but Transformers are still the standard, essentially thanks to having what is known as an attention mechanism everywhere in the architecture, that allows them to be able to consider all the previous time steps, while RNNs were in theory limited by being able to see only the previous time step.

The introduction of a typical Transformer-based classifier, like ✨ [BERT](https://arxiv.org/pdf/1810.04805), has to be preceeded by the introduction of the MultiHearAttention as the main ingredient, and of the PositionalEncoding and FeedForward layers used as its building blocks.

In [12]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Constants
num_heads = 8    # Number of attention heads
num_layers = 8   # Number of Transformer layers

## 📚 Step 1: MultiHeadAttention

The key to the MultiHeadAttention mechanism is the softmax attention.
The attention scores are computed using the scaled dot-product of the Query and Key vectors. The formula is:

$$
Attention(Q,K,V)=softmax\Big(\frac{QK^T}{\sqrt{d_k}}\Big) V
$$

where $d_k$ is the dimensionality of the $Q,K,V$ vectors. The softmax operation ensures that the scores are normalized, and the scaling factor $\sqrt{d_k}$ helps mitigate the issue of large dot-product values. Typically $Q,K,V$ are going to be linear projections of the same tensor.

In [13]:
# Multi-Head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads

        # EXERCISE: create the following 4 linear layers without bias
        self.linear_q = nn.Linear(d_model, d_model, bias=False)
        self.linear_k = nn.Linear(d_model, d_model, bias=False)
        self.linear_v = nn.Linear(d_model, d_model, bias=False)
        self.linear_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, query, mask=None):
        batch_size = query.size(0)

        # Linear projections
        Q = self.linear_q(query)
        K = self.linear_k(query)
        V = self.linear_v(query)

        # Split into multiple heads
        Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)  # (batch_size, num_heads, seq_len, d_k)
        K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)  # (batch_size, num_heads, seq_len, d_k)
        V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)  # (batch_size, num_heads, seq_len, d_k)

        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch_size, num_heads, seq_len, seq_len)

        # EXERCISE: divide the scores by the sqrt of
        scores = scores/np.sqrt(self.d_k)

        if mask is not None:
            mask = mask.unsqueeze(1).unsqueeze(1)  # (batch_size, 1, 1, seq_len)
            scores = scores.masked_fill(mask == 0, -1e9)

        # Attention weights
        # EXERCISE: apply the softmax to the scores to have the attn_weights
        attn_weights = torch.softmax(scores, dim=-1)  # (batch_size, num_heads, seq_len, seq_len)

        # Weighted sum of values
        attn_output = torch.matmul(attn_weights, V)  # (batch_size, num_heads, seq_len, d_k)

        # Concatenate heads and project back to d_model
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)  # (batch_size, seq_len, d_model)
        attn_output = self.linear_out(attn_output)  # (batch_size, seq_len, d_model)

        return attn_output

## 📚 Step 2: PositionalEncoding and FeedForward

Next key factors to Transformer-based architectures success are the positional encoding and the interleaved feedforward network. The PositionalEncoding adds positional information to input tokens using sinusoidal functions. The FeedForward layer is a simple feedforward neural network with two linear layers and ReLU activation, that projects the MHA representation into a 4x wider representation.

In [14]:
import math

# Positional Encoding
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0)]
        return x

# Feedforward Layer
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.activation = nn.GELU()
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # EXERCISE: x = linear(activation(linear(x)))
        x = self.activation(self.linear1(x))
        x = self.linear2(x)
        return x

## 📚 Step 3: Transformer Encoder

The final architecture is a sequence of Transformer blocks where a specific sequence of LayerNormalization, Dropout, skip connections and the layers introduced above are used. Finally the Transformer blocks are chained after the embedding and the positional embedding.

In [15]:
# Transformer Decoder Layer
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)

        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(p=0.1)

        self.ff = FeedForward(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(p=0.1)

    def forward(self, tgt, tgt_mask=None):
        tgt2 = self.self_attn(tgt, mask=tgt_mask)

        # EXERCISE: norm(tgt + dropout(tgt))
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)

        # EXERCISE: norm(tgt + dropout(ff(tgt)))
        tgt2 = self.ff(tgt)
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)

        return tgt

# Transformer Encoder
class TransformerEncoder(nn.Module):
    def __init__(self, input_dim, d_model, num_heads, num_layers, d_ff, output_dim):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, d_model)
        self.pos_encoder = PositionalEncoding(d_model)

        # EXERCISE: a list of layers has to be recorded as a ModuleList in pytorch
        self.layers = nn.ModuleList([TransformerEncoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)])
        self.dropout = nn.Dropout(p=0.1)
        self.fc_out = nn.Linear(d_model, output_dim)

    def forward(self, tgt, tgt_mask=None):
        # EXERCISE: do the embedding and follow them with a pos_encoder
        tgt = self.embedding(tgt)
        tgt = self.pos_encoder(tgt)

        for layer in self.layers:
            tgt = layer(tgt, tgt_mask)

        summed = tgt.sum(1)
        counts = tgt_mask.sum(1, keepdim=True)
        tgt = summed / counts

        output = self.fc_out(self.dropout(tgt))
        return output

## 📚 Step 4: Train and Evaluate

In [16]:
bert_model = TransformerEncoder(
    input_dim=vocab_size,
    d_model=embedding_dim,
    num_heads=num_heads,
    num_layers=num_layers,
    d_ff=4*embedding_dim,
    output_dim=output_dim
)

params = sum(p.numel() for p in bert_model.parameters() if p.requires_grad)
print(f'The model has {params} parameters')

# EXERCISE: play with learning rates in the set [1e-2, 1e-3, 1e-4, 1e-5]
# to find the best choice
train(bert_model, lr=1e-2, weight_decay=0.001)
evaluate(bert_model)

The model has 2351362 parameters


Loss: 0.67: 100%|██████████| 58/58 [02:05<00:00,  2.16s/it]
Loss: 0.63: 100%|██████████| 58/58 [02:07<00:00,  2.20s/it]


Epoch 2/2, Loss: 0.6315232515335083
Embedding weights changed during training: True


100%|██████████| 27/27 [00:13<00:00,  1.96it/s]



Test Accuracy: 66.49%



