# Transformers

The goal of this class is simple: learn how the Transformer architecture works by building it from scratch, then fine-tune a pre-trained Transformer for text classification.
Once you understand the architecture well, you will be able to apply it to any task, including the OVERHYPED Large Language Models (LLMs).

Self-attention is a mechanism in machine learning, particularly in natural language processing, that lets a model weigh the importance of the different parts of an input sequence when processing it. It is the key component of Transformer models, which have revolutionized language processing, computer vision, speech recognition, and other tasks. Here is a detailed explanation of how self-attention works, including the mathematical formulas involved:

### 1. Input Representation
- **Input**: Self-attention captures the pairwise interdependence of all the elements that make up an input. Assume we have an input sequence $X = [x_1, x_2, ..., x_n]$, where each $x_i$ is a vector representing a word or token in the sequence.
- **Embeddings**: These vectors are typically embeddings, which are dense representations of the words.

### 2. Calculating Query, Key, and Value Vectors
For each element in the sequence, the self-attention mechanism generates three vectors: a query, a key, and a value. Stacking them over the whole sequence gives the matrices $Q$, $K$, and $V$, computed as follows:

$$ Q = XW^Q $$
$$ K = XW^K $$
$$ V = XW^V $$

Where $ W^Q $, $ W^K $, and $ W^V $ are weight matrices that are learned during training.

### 3. Attention Score Calculation
The attention scores are calculated from the query and key vectors. For each pair of tokens $(i, j)$ in the sequence, the score represents how much focus to put on token $j$ when processing token $i$.

$$ \text{Attention Score} = \frac{QK^T}{\sqrt{d_k}} $$

- $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ keeps the scores from growing with $d_k$; without it, large scores push the softmax into regions where gradients are very small.
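
To see why the scaling matters, here is a small numerical sketch. It assumes hypothetical queries and keys whose components are independent with unit variance (an illustration, not a property of trained models): the dot products then have a standard deviation close to $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to about 1.

```python
import torch

torch.manual_seed(0)
d_k = 64

# Hypothetical queries and keys with independent unit-variance components
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

raw = (q * k).sum(dim=-1)      # one dot product per (query, key) row
scaled = raw / d_k ** 0.5      # same scores after dividing by sqrt(d_k)

raw_std = raw.std().item()        # close to sqrt(d_k) = 8
scaled_std = scaled.std().item()  # close to 1
print(f"raw std: {raw_std:.2f}, scaled std: {scaled_std:.2f}")
```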

### 4. Softmax Normalization
The attention scores are normalized with the softmax function, applied row by row, so that the weights each query assigns to all the keys are positive and sum to 1 (like probabilities):

$$ \text{Softmax}(\text{Attention Score})_{ij} = \frac{\exp(\text{Attention Score}_{ij})}{\sum_{k=1}^{n} \exp(\text{Attention Score}_{ik})} $$

### 5. Output Calculation
Finally, the output is calculated as a weighted sum of the value vectors, using the normalized attention scores as weights:

$$ \text{Output} = \text{Softmax}(\text{Attention Score}) \times V $$
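
Steps 2 to 5 can be put together in a few lines of PyTorch. The following is a minimal functional sketch, not a reference implementation: the function name, the random weight matrices `w_q`, `w_k`, `w_v`, and the tensor sizes are all assumptions chosen for illustration, and there is no masking or bias.

```python
import torch


def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (batch, seq_len, d_model)."""
    q = x @ w_q  # (batch, seq_len, d_k)
    k = x @ w_k  # (batch, seq_len, d_k)
    v = x @ w_v  # (batch, seq_len, d_k)
    # Pairwise scores between every query and every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (batch, seq_len, seq_len)
    # Row-wise softmax: each query's weights over all keys sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    return weights @ v, weights


torch.manual_seed(0)
x = torch.randn(2, 5, 8)  # hypothetical batch: 2 sequences of 5 tokens, d_model = 8
w_q, w_k, w_v = (torch.randn(8, 4) for _ in range(3))  # random stand-ins for learned weights
output, weights = scaled_dot_product_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Each row of `weights` tells you how much one token attends to every token in the same sequence.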

### 6. Multi-Head Attention
In practice, multiple sets of $Q, K, V$ matrices are used to create multiple "heads" of attention. This allows the model to focus on different parts of the input simultaneously. The outputs of these heads are then concatenated and linearly transformed into the final output.


![](https://boring-guy.sh/img/masking-rl/multi_head_attention.svg)

Mathematically we can translate the figure into the following equation:

$$
\text{Attention}(Q, K, V, \text{Mask}) = \operatorname{softmax}\left(\frac{\text{Mask}(Q K^{T})}{\sqrt{d_{k}}}\right) V
$$

$$
\text{MultiHead}(Q, K, V, \text{Mask}) = \operatorname{Concat}(\text{head}_{1}, \ldots, \text{head}_{h}) W^{O}
$$

$$
\text{where } \text{head}_{i} = \text{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}, \text{Mask})
$$
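
The $\text{Mask}$ term in the equations above can be illustrated with a padding mask. The sketch below assumes a hypothetical boolean mask of shape (batch, seq_len) where `True` marks a real token; the sizes and random tensors are only for illustration. Masked key positions receive a very negative score, so the softmax gives them a weight of (almost) 0 in every head.

```python
import torch

torch.manual_seed(0)
batch, heads, seq_len, d_k = 2, 4, 6, 8
q = torch.randn(batch, heads, seq_len, d_k)  # hypothetical per-head queries
k = torch.randn(batch, heads, seq_len, d_k)  # hypothetical per-head keys

# Hypothetical padding mask: the second sequence has its last two positions padded
padding_mask = torch.tensor([[True] * 6, [True] * 4 + [False] * 2])

scores = q @ k.transpose(-2, -1)           # (batch, heads, seq_len, seq_len)
key_mask = padding_mask[:, None, None, :]  # (batch, 1, 1, seq_len): broadcasts over heads and queries
scores = scores.masked_fill(~key_mask, torch.finfo(scores.dtype).min)
weights = torch.softmax(scores / d_k ** 0.5, dim=-1)

print(weights[1, 0, 0])  # the last two weights are 0 for the padded keys
```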


### 7. Role in Transformers
In Transformer models, this self-attention mechanism is used both in the encoder to process the input sequence and in the decoder to generate the output sequence, allowing the model to consider the entire input sequence when making predictions.

This explanation provides a basic overview of the self-attention mechanism. The actual implementation involves additional details and optimizations, especially in large-scale models like GPT (Generative Pre-trained Transformer) or BERT (Bidirectional Encoder Representations from Transformers).

![](https://miro.medium.com/v2/resize:fit:1400/1*BHzGVskWGS_3jEcYYi6miQ.png)
A Transformer is simply a stack of Transformer encoder layers and a stack of Transformer decoder layers. The encoder encodes the input sequence and the decoder generates the output sequence. The decoder is connected to the encoder through an attention mechanism (cross-attention), in which the decoder's queries attend to the encoder's outputs.

### 8. References
- [Self-Attention in NLP](https://www.youtube.com/watch?v=5vcj8kSwBCY)
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
- [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)
- [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)

In [None]:
from typing import Optional
import torch
import torch.nn as nn 

In [None]:
# For all the from-scratch Transformer exercises, you will use the following tensor as an example
dummy_tensor = torch.randn((16, 32, 64))  # A tensor of shape (batch_size, sequence_length, hidden_size)
# You can imagine a batch of 16 sentences of 32 words, each word being a 64-dimensional vector

<div class="alert alert-block alert-info">
    
<b> Exercise 1.0: </b>
* Implement the self-attention mechanism
</div>

In [None]:
class SelfAttention(nn.Module): 
    def __init__(self, input_dim: int, dim_head: int):
        super().__init__()
        self.input_dim = input_dim
        self.dim_head = dim_head
        self.query = nn.Linear(input_dim, dim_head)
        self.key = nn.Linear(input_dim, dim_head)
        self.value = nn.Linear(input_dim, dim_head)
        self.softmax = nn.Softmax(dim=-1)
    
    def forward(self, sequence: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
        """
        sequence: tensor of shape (batch_size, sequence_length, input_dim)
        attention_mask: tensor of shape (batch_size, sequence_length)
        """
        # Compute Q, K, V
        Q = ...  # TODO
        K = ...  # TODO
        V = ...  # TODO

        # Q x K^T, scaled by 1 / sqrt(dim_head) before the softmax
        dot = ...  # TODO

        if attention_mask is not None:
            # Apply the attention mask to the raw scores: masked positions get a
            # very negative value, so after the softmax their weight is (almost) 0
            mask_value = torch.finfo(dot.dtype).min
            mask = ...  # TODO: you may have to reshape the mask so that it broadcasts against dot
            dot.masked_fill_(~mask, mask_value)

        # Apply softmax
        attention_score = ...  # TODO

        # Compute attention score times V
        head = ...  # TODO

        return head, attention_score
            
        
        

In [None]:
sa = SelfAttention(input_dim=64, dim_head=16)
head, attention_score = sa(dummy_tensor)

<div class="alert alert-block alert-info">
    
<b> Exercise 2.0: </b>
* Implement the multihead attention mechanism
</div>

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8, dim_head: int = 64):
        super().__init__()
        inner_dim = dim_head * heads
        self.heads = heads
        self.scale = dim_head**-0.5  # 1/sqrt(dim_head)
        # Fusing Q, K, V into a single to_qkv projection is an optimisation; if you are brave, you can try to implement it
        # self.to_qkv = nn.Linear(
        #     dim, inner_dim * 3, bias=False
        # )  # Wq, Wk, Wv stacked in one matrix, that's why *3
        self.query = nn.Linear(dim, inner_dim)
        self.key = nn.Linear(dim, inner_dim)
        self.value = nn.Linear(dim, inner_dim)
        self.to_out = nn.Linear(inner_dim, dim)  # project the concatenated heads back to the model dimension

    def forward(
        self, x: torch.Tensor, mask: Optional[torch.Tensor] = None
    ):
        h = self.heads

        # With the fused projection: q = x Wq, k = x Wk, v = x Wv in a single matmul
        # qkv = self.to_qkv(x)

        Q = ...  # TODO
        K = ...  # TODO
        V = ...  # TODO
        
        ...
        
        return out, attention_scores

In [None]:
mha = MultiHeadAttention(64, 8, 64)
out, attention_scores = mha(dummy_tensor)

<div class="alert alert-block alert-info">
    
<b> Exercise 3.0: </b>
* Implement the transformer encoder block
</div>

In [None]:
class TransformerEncoderBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8, dim_head: int = 64):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.mha = MultiHeadAttention(dim, heads, dim_head)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )
    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None):
        # The addition is a skip connection; do not hesitate to raise your hand for more details ;)
        attn_out, _ = self.mha(self.ln1(x), mask=mask)  # mha returns (out, attention_scores)
        x = x + attn_out
        ...
        return x

In [None]:
teb = TransformerEncoderBlock(64, 8, 64)
out = teb(dummy_tensor)

<div class="alert alert-block alert-info">
    
<b> (Optional) Exercise 4.0: </b>
* Implement the transformer decoder block
* Combine the encoder and decoder to create a transformer
* Implement the positional encoding
</div>

In [None]:
class TransformerDecoder(nn.Module):
    ...

In [None]:
class PositionalEncoding(nn.Module):
    ...

In [None]:
class Transformer(nn.Module):
    ...

An LLM is just a stack of Transformer encoder and/or decoder layers with a large feedforward dimension and a lot of parameters.

### What is the purpose of BERT? 🤖

BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream tasks.

### What is a downstream task? 🧐

A downstream task is a new task for which a pre-trained model is reused. For example, you could fine-tune BERT on a question answering task and benefit from BERT's pre-trained weights.

### Why do BERT-like models outperform other models? 🎯

BERT outperforms previous methods because it is the first **unsupervised**, deeply bidirectional system for pre-training NLP. 

`I made a bank deposit` 

In the example, a unidirectional (left-to-right) representation of the token `bank` is based only on `I made a` and not on `deposit`.

BERT represents `bank` using both its left and right context starting from the very bottom of a deep neural network, so it is deeply bidirectional.

The ELMo and OpenAI GPT models are other high-performance models that provide contextual latent representations. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. OpenAI GPT uses a left-to-right Transformer.

### How is BERT trained? 🔧

BERT is trained with **two unsupervised tasks**.

#### Task 1: Masked LM

**15%** of the tokens in the input are selected for masking. The model must then recover the tokens that have been hidden.

```python
input = 'the man went to the [MASK1]. he bought a [MASK2] of milk'
label = {'[MASK1]': 'store', '[MASK2]': 'gallon'}
```


Tricks used in the Masked LM task 😎. For each token selected for masking:

- 80% of the time: replace the word with the `[MASK]` token, e.g., `my dog is hairy` → `my dog is [MASK]`
- 10% of the time: replace the word with a random word, e.g., `my dog is hairy` → `my dog is apple`
- 10% of the time: keep the word unchanged, e.g., `my dog is hairy` → `my dog is hairy`

The model is forced to keep a distributional contextual representation of every input token because the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words.
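
The 15% selection and the 80/10/10 rule can be sketched in plain Python. This is a simplified illustration, not BERT's actual data pipeline: the function name `mask_tokens`, the toy vocabulary, and the choice to select each token independently with probability 0.15 are assumptions made for the example.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list for masked LM; returns (corrupted tokens, labels)."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None means "no prediction at this position"
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # token not selected: left untouched, no label
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"  # 80%: replace with the [MASK] token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, labels


tokens = "the man went to the store . he bought a gallon of milk".split()
vocab = sorted(set(tokens))  # toy vocabulary built from the sentence itself
corrupted, labels = mask_tokens(tokens, vocab)
print(corrupted)
print(labels)
```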


#### Task 2: Next sentence prediction

This task is designed to provide BERT an understanding of the relationship between two sentences. This information is not captured by the masked language model task. When choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).

```python

input = '[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]'
label = 'IsNext'

input = '[CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]'
label = 'NotNext'

```
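
The 50/50 sampling of sentence pairs can be sketched as follows. This is a toy version under stated assumptions: the helper `make_nsp_example` is hypothetical, sentences are split on whitespace instead of being tokenized with WordPiece, and the random sentence is drawn from the same small list rather than from the whole corpus.

```python
import random


def make_nsp_example(sentences, index, rng):
    """Build one (tokens, label) pair for the sentence at position `index`."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # 50%: B is the sentence that actually follows A
        sentence_b, label = sentences[index + 1], "IsNext"
    else:
        # 50%: B is a random sentence that is neither A nor its successor
        candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
        sentence_b, label = rng.choice(candidates), "NotNext"
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"] + sentence_b.split() + ["[SEP]"]
    return tokens, label


document = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
]
rng = random.Random(0)
tokens, label = make_nsp_example(document, 0, rng)
print(tokens, label)
```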

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

- **[SEP]**: Separator between sentences.
- **[CLS]**: Token dedicated to classification tasks.

**The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.**


### Which corpora are used to train BERT?

- BooksCorpus (800M words) 📖

- English Wikipedia (2,500M words) 📚

It is critical to use a document-level corpus rather than a shuffled sentence-level corpus in order to extract long contiguous sequences.

### What is fine-tuning?

Fine-tuning consists of further training the pre-trained BERT model on a new task to specialize it. Fine-tuning BERT is relatively inexpensive compared with pre-training. Additional layers must be added on top of the model to turn it into, for example, a classifier or a regressor.
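
As an illustration of "adding layers on top", here is a minimal sketch of a classification head. It is a simplification under stated assumptions: the class name `ClassificationHead` is hypothetical, the encoder output is replaced by a random tensor, and the head reads the `[CLS]` position directly (library implementations may add an extra pooling layer first).

```python
import torch
import torch.nn as nn


class ClassificationHead(nn.Module):
    """Maps the [CLS] representation of an encoder to class logits."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        cls_vector = hidden_states[:, 0]  # (batch, hidden_size): first token is [CLS]
        return self.classifier(self.dropout(cls_vector))


torch.manual_seed(0)
hidden_states = torch.randn(8, 32, 768)  # random stand-in for the encoder output
head = ClassificationHead(hidden_size=768, num_labels=5)
logits = head(hidden_states)
print(logits.shape)
```

During fine-tuning, both the pre-trained encoder and this new head are updated, but only the head starts from random weights.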

### What is the difference between BERT and GPT?

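The GPT model family is based on the Transformer decoder, whereas BERT is based on the Transformer encoder. GPT models are trained with a causal language modeling (CLM) objective: the model learns to predict the next token in a sequence given only the previous tokens, so each position can attend only to the positions before it.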
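
The "previous tokens only" constraint is usually enforced with a causal (lower-triangular) attention mask. The sketch below uses hypothetical uniform scores for a sequence of 4 tokens to show the effect: position $i$ spreads its attention over positions $1$ to $i$ and gives zero weight to the future.

```python
import torch

seq_len = 4
# True on and below the diagonal: position i may attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.zeros(seq_len, seq_len)  # hypothetical uniform attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)

print(weights)  # row i has weight 1 / (i + 1) on the first i + 1 positions
```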

### What dataset is used to train GPT?
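For the most recent versions of GPT (3 & 4), the training data is a very large collection of public web text. GPT-3, for example, was trained on a filtered version of Common Crawl together with WebText2, two book corpora, and English Wikipedia; OpenAI has not published the exact composition of the GPT-4 training data. With the combination of a huge dataset, a self-supervised objective, and a large model, GPT-3 and GPT-4 perform well on a wide variety of tasks.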

### The last secret sauce of GPT 3&4: RLHF
Why does ChatGPT feel like a discussion with a human, and why is it so good at generating text? The answer is RLHF (Reinforcement Learning from Human Feedback). The model is trained with reinforcement learning to generate the answers that humans prefer. If you are curious about it, there is a project on reinforcement learning with human feedback.

## Part 2: Fine-tuning BERT for text classification

In [None]:
import torch

In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][100]

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

A tokenizer is a function that splits a string of text into tokens. For example, the string "I love Paris" could be tokenized into the list of tokens ["I", "love", "Paris"]. There are different types of tokenizers: word tokenizers split strings into words, character tokenizers split strings into characters, and subword tokenizers, such as the WordPiece tokenizer used by BERT, split rare words into smaller pieces.
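
The three kinds of tokenizers can be compared on a toy example. The subword part below is a simplified greedy longest-match sketch in the spirit of WordPiece; the function `wordpiece_tokenize` and the four-entry vocabulary are assumptions for illustration, not the real `bert-base-cased` tokenizer.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


text = "I love Paris"
word_tokens = text.split()  # word tokenizer
char_tokens = list("Paris")  # character tokenizer

toy_vocab = {"flight", "##less", "bird", "##s"}  # hypothetical subword vocabulary
subword_tokens = wordpiece_tokenize("flightless", toy_vocab)
print(word_tokens, char_tokens, subword_tokens)
```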

<div class="alert alert-block alert-info">
    
<b> Exercise 5.0: </b>
* Explain in your own words why we need to tokenize the input
</div>

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(
        examples["text"], padding="max_length", truncation=True, return_tensors="pt"
    )


tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [None]:
tokenized_datasets.set_format(
    "pt", columns=["input_ids", "attention_mask", "label"], output_all_columns=True
)

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))


<div class="alert alert-block alert-info">
    
<b> Exercise 6.0: </b>
* Find in the transformers library the arguments needed to download the "bert-base-cased" model and set the number of labels to 5
</div>

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    ...
)


<div class="alert alert-block alert-info">
    
<b> Exercise 7.0: </b>
* Run inference with the model
</div>

In [None]:
import torch

# Extract the first example
example = small_train_dataset[0]

# Extract the input fields required by the model
input_ids = ...
attention_mask = ... 

# Convert to the format expected by the model (as tensors)
inputs = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
}


# Get predictions from the model
with torch.no_grad():
    outputs = model(**inputs)

# The outputs are logits, you can apply softmax to get probabilities
probabilities = torch.nn.functional.softmax(outputs.logits, dim=1)

print(probabilities)

In [None]:
def evaluate_model(model, dataloader, loss_function, device):
    model.eval()
    total_eval_loss = 0
    correct_predictions = 0
    nb_iteration = 0
    total_data = 0
    for batch in dataloader:
        # Move batch to the same device as model
        # batch = {k: v.to(device) for k, v in batch.items()}
        y = batch.pop("label").to(device)
        text = batch.pop("text")
        # Forward pass: compute predicted outputs by passing inputs to the model
        input_data = {k: batch[k] for k in ("input_ids", "attention_mask")}
        input_data = {k: v.to(device) for k, v in input_data.items()}
        with torch.no_grad():
            outputs = model(**input_data)
        logits = outputs.logits
        total_data += logits.size()[0]  # Get the batch size
        loss = loss_function(logits, y)
        total_eval_loss += loss.item()

        preds = torch.argmax(logits, dim=1)
        correct_predictions += torch.sum(preds == y)
        nb_iteration += 1

    avg_loss = total_eval_loss / nb_iteration
    accuracy = correct_predictions.double() / total_data
    return avg_loss, accuracy


<div class="alert alert-block alert-info">
    
<b> Exercise 8.0: </b>
* Add the AdamW optimizer with an LR of 5e-5
</div>

In [None]:
import torch
from torch.optim import  ...
from torch.nn import CrossEntropyLoss

# Ensure the model is on the correct device (GPU or CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Initialize the optimizer
optimizer = ... 

# Define the loss function
loss_function = CrossEntropyLoss()


<div class="alert alert-block alert-info">
    
<b> Exercise 9.0: </b>
* Perform the training
* Why is the performance so bad?
</div>

In [None]:
from tqdm import tqdm  # for displaying a progress bar

# Define the number of training epochs
epochs = 3

# Start the training loop
for epoch in range(epochs):
    model.train()  # set the model to training mode
    total_loss = 0
    nb_iteration = 0

    train_iter = small_train_dataset.iter(batch_size=8)
    eval_iter = small_eval_dataset.iter(batch_size=8)

    for batch in tqdm(train_iter, desc=f"Epoch {epoch + 1}/{epochs}", unit="batch"):
        # Move batch to the same device as model
        y = batch.pop("label").to(device)
        text = batch.pop("text")
        # Forward pass: compute predicted outputs by passing inputs to the model
        input_data = {k: batch[k] for k in ("input_ids", "attention_mask")}
        input_data = {k: v.to(device) for k, v in input_data.items()}
        outputs = model(**input_data)

        # Compute loss
        loss = loss_function(outputs.logits, y)

        # Backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()

        # Update parameters and zero the gradients
        optimizer.step()
        optimizer.zero_grad()

        nb_iteration += 1

        # Accumulate the training loss
        total_loss += loss.item()

    # Calculate average loss over an epoch
    avg_train_loss = total_loss / nb_iteration

    print(f"\nEpoch {epoch + 1} complete! Average Training Loss: {avg_train_loss:.4f}")
    eval_loss, eval_accuracy = evaluate_model(model, eval_iter, loss_function, device)
    print(f"Validation Loss: {eval_loss:.4f}, Validation Accuracy: {eval_accuracy:.4f}")


<div class="alert alert-block alert-info">
    
<b> Exercise 10.0: </b>
* Reach the best performance possible: add a learning rate scheduler, increase the training dataset size, etc.
</div>