<a href="https://colab.research.google.com/github/Firojpaudel/GenAI-Chronicles/blob/main/Transformers_groundup_learnings/model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 
[![Run in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/kernels/welcome?src=https://github.com/Firojpaudel/GenAI-Chronicles/blob/main/Transformers_groundup_learnings/model.ipynb)

## Learning the Transformers architecture in detail

---

<div align="center">
    <img src= "https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" alt="The transformers architecture" width="250px">
    <p><i>The Transformers Architecture from the OG paper</i></p>
</div>

I have a foundational understanding of the Transformer architecture from the _'Attention Is All You Need'_ paper, but now I'm diving deeper into the concepts. Currently, I'm learning from **Umar Jamil** Sir's 'Coding a Transformer from Scratch' course on PyTorch, combined with **Jay Alammar**'s blog.

Fingers crossed everything goes smoothly! 🤞

---

Starting with the `input_embedding` layer:

<div align="center">
    <img src= "https://images.ctfassets.net/k07f0awoib97/2n4uIQh2bAX7fRmx4AGzyY/a1bc6fa1e2d14ff247716b5f589a2099/Screen_Recording_2023-06-03_at_4.52.54_PM.gif" alt="Input Embedding" width="550px">
    <p><i>The Embedding Mechanism illustrated</i></p>
</div>

In [1]:
##@ All required imports are here in this cell

import torch 
import torch.nn as nn

In [2]:
class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # The embedding is just a lookup table, which PyTorch provides as nn.Embedding
        self.embed = nn.Embedding(vocab_size, d_model)
        
    def forward(self, x):
        # Scale the embeddings by sqrt(d_model), as described in the paper
        return self.embed(x) * (self.d_model ** 0.5)

Okay, so the embedding lookup itself is handled by `nn.Embedding`, but we scale the embeddings by $\sqrt{d_{model}}$ in the forward pass, mainly for two reasons:

1. The scaling keeps the magnitudes of the input embeddings and the positional encodings appropriately balanced.
2. It also keeps the token information from being drowned out once the positional encodings are added.

> _src: "StackOverflow"_

_Well Mathematically_,

Let's assume:

- Input embeddings as $\text{E}$
- Positional encodings as $\text{P}$

Then, the combined input to the model would be: $$\text{X} = \sqrt{d_{model}} \times \text{E} + \text{P}$$


Note that no `softmax` is involved at this stage; it only shows up later, inside the attention block and at the final output layer.

This scaling ensures that the variance of the embeddings is in line with the variance of the positional encodings, leading to more stable training and better convergence.
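
To get a feel for the scaling factor, here is a minimal sketch. The sizes `d_model = 512` and `vocab_size = 100` and the token ids are illustrative assumptions, not values taken from the course. It only compares the raw `nn.Embedding` output with the scaled one:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, vocab_size = 512, 100           # illustrative sizes (assumed)
embed = nn.Embedding(vocab_size, d_model)

tokens = torch.tensor([[1, 5, 42]])      # (batch=1, seq_length=3) of made-up token ids
raw = embed(tokens)                      # (1, 3, 512)
scaled = raw * math.sqrt(d_model)        # same shape, every value multiplied by sqrt(512)

print(raw.shape, scaled.shape)
print(raw.std().item(), scaled.std().item())  # the second is sqrt(d_model) times the first
```

The shape stays the same; only the magnitude of the values changes.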

After `input_embedding`, time for `positional_encoding`

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_length: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.seq_length = seq_length
        self.dropout = nn.Dropout(dropout)
        
        ## Create a positional encoding matrix of size seq_length x d_model
        pe = torch.zeros(seq_length, d_model)
        ## Create a vector of shape (seq_length, 1)
        pos = torch.arange(0, seq_length).unsqueeze(1)  ## The numerator part 
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model)) ## The denominator part
        
        ## Apply sine to even positions
        pe[:, 0::2] = torch.sin(pos * div)
        
        ## Apply cosine to odd positions
        pe[:, 1::2] = torch.cos(pos * div)
        
        ## Add a batch dimension to the positional encoding
        pe = pe.unsqueeze(0)     
        
        ## Register the positional encoding as a buffer
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        x= x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)

_Explaining the code above:_

> "  Since our model contains no recurrence and no convolution, in order for the model to make use of the
 order of the sequence, **we must inject some information about the relative or absolute position of the
 tokens in the sequence**. To this end, we add "positional encodings" to the input embeddings at the
 bottoms of the encoder and decoder stacks. **The positional encodings have the same dimension dmodel
 as the embeddings**, so that the two can be summed. There are many choices of positional encodings,
 learned and fixed " 

 *-src: paper*

Here the authors clearly state that we must inject position information into the sequence, so ofc `seq_length` (the maximum sequence length) needs to be defined.
Next, the positional encoding uses the same dimension `d_model` as the `input_embeddings`, so that the two can be summed. And finally, there's a `dropout`, which is added
to help prevent the model from overfitting.


Then again from the same subsection, the `positional_encoding` formulas are provided. We use different formulas for odd and even dimensions.

The arguments are the same; we just pass them through different functions:

- $\sin$ for even dimensions and
- $\cos$ for odd dimensions.

i.e., $$ \text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$ 
and $$ \text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

So, the numerators and denominators are defined before passing through these functions.

_Numerator part first:_
```python 
pos = torch.arange(0, seq_length).unsqueeze(1)
```

Here, `unsqueeze(1)` adds a second dimension to `pos`, turning it into a column vector of shape `(seq_length, 1)`.

_Denominator part:_

```python
div = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))
```

Well, the denominator was meant to be $10000^{\frac{2i}{d_{model}}}$, but in the coding implementation we can go through exp and log instead:

$$ 10000^{\frac{2i}{d_{model}}} = \exp\left(\ln(10000) \cdot \frac{2i}{d_{model}}\right)$$

Also, the log is negated so that `div` holds the reciprocal, $10000^{-\frac{2i}{d_{model}}}$. That way we multiply `pos * div` instead of dividing, and the factor decreases progressively as $i$ increases.
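
As a quick sanity check, the exp/log form can be compared against the formula written directly. This is a minimal sketch with tiny, made-up sizes (`d_model = 8`, `seq_length = 5`), and the names `two_i` and `direct` are introduced here only for illustration:

```python
import math
import torch

d_model, seq_length = 8, 5               # tiny illustrative sizes (assumed)

pos = torch.arange(0, seq_length).unsqueeze(1)            # (seq_length, 1)
two_i = torch.arange(0, d_model, 2).float()               # the even indices 2i
div = torch.exp(two_i * (-math.log(10000.0) / d_model))   # (d_model / 2,)

# Same quantity written directly as 1 / 10000^(2i / d_model)
direct = 1.0 / (10000.0 ** (two_i / d_model))
print(torch.allclose(div, direct))

# (seq_length, 1) * (d_model / 2,) broadcasts to (seq_length, d_model / 2)
pe = torch.zeros(seq_length, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)
print(pe[0])   # position 0: sin(0) = 0 in even slots, cos(0) = 1 in odd slots
```

The broadcasting of the column vector `pos` against the row of `div` values is what fills the whole matrix without any explicit loop.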


Next, we add a batch dimension to make the positional encoding `pe` compatible with **batch processing**. The resulting shape is `(1, seq_length, d_model)`. 

Then comes registering buffers... We register `pe` as a buffer in the `PositionalEncoding` module so that `pe` is not treated as a parameter but 
is still part of the state of the module. It's useful for saving and loading models.
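
The difference between a parameter and a buffer can be seen on a toy module. `WithBuffer` below is a hypothetical class made up for this illustration, not part of the model:

```python
import torch
import torch.nn as nn

class WithBuffer(nn.Module):             # hypothetical toy module, for illustration only
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))         # learnable
        self.register_buffer('pe', torch.zeros(1, 4, 3))  # saved, but not learnable

m = WithBuffer()
param_names = [name for name, _ in m.named_parameters()]
state_keys = list(m.state_dict().keys())

print(param_names)          # only 'weight' is a parameter
print(state_keys)           # both 'weight' and 'pe' are saved with the model
print(m.pe.requires_grad)   # buffers do not track gradients
```

So an optimizer built from `m.parameters()` never touches `pe`, while `state_dict()` still carries it along.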

---

Next, entering inside the box; coding the easiest one of them first: the _Norm_ part of the **Add & Norm** layer, aka _Layer Normalization_.

**Layer Normalization**: Technique to stabilize and speed up the training of NNs.

Let's assume _a batch_ of 'n' items. Each item has its own mean $\mu$ and variance $\sigma ^2 $, computed over its features. 

Now for normalizing each item in this batch, we calculate the new value as: $$ \hat{x_{i}} = \frac{x_{i} - \mu}{\sqrt{\sigma ^2 + \epsilon}}$$

where, $\epsilon$ is a small constant to prevent the division by zero.

Then, there comes scaling and shifting. aka. the multiplicative and additive steps.

After normalization, the model applies **learnt parameters** $\gamma$(scaling) and $\beta$(shifting) as: $$y_{i} = \gamma \cdot \hat{x_i} + \beta$$

And, this looks very similar to the equation of a **fully connected layer**, i.e., $$y= W \cdot x + b$$

_More on these learnt parameters:_

- $\gamma$ — also known as _multiplicative_ or _scaling_ scales (stretch/shrink) the normalized value $\hat{x_i}$.
    - if $\gamma < 1$, it compresses, 
    - if $\gamma > 1$, it amplifies. 
- $\beta$ — also known as _additive_ or _shifting_ shifts the scaled normalized values up and down. **It allows the network to "center" the activations where needed.**

> _The $\gamma$ and $\beta$ parameters allow flexibility, ensuring the network can adapt the normalized values when necessary._

In [None]:
class LayerNormalization(nn.Module):
    def __init__(self, epsilon: float = 10**-6):
        super().__init__()
        self.epsilon = epsilon
        self.gamma = nn.Parameter(torch.ones(1))  # The scaler
        self.beta = nn.Parameter(torch.zeros(1))  # The shifter aka bias
        
    def forward(self, x):
        mean = x.mean(dim= -1, keepdim=True)
        std = x.std(dim= -1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.epsilon) + self.beta

_Code Explanation incoming:_

Okay, so first we had variance in the theory part. How come we have standard deviation in the code??

Well, standard deviation is just the square root of variance, i.e., $\text{standard\_deviation} = \sqrt{variance}$.
The code adds $\epsilon$ to the standard deviation (`std + epsilon`) instead of putting it under the square root. That is not exactly the same expression, but $\epsilon$ serves the same purpose of avoiding division by zero, and the code stays simple.

Then, we initialize the learnable parameter $\gamma$ to 1 and $\beta$ to 0, so the layer starts out as plain normalization.
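
To see how close the two variants really are, here is a small sketch on made-up values. It compares the formula from the theory part, the `std + epsilon` version, and PyTorch's built-in `F.layer_norm`. One thing to keep in mind: `x.std()` applies Bessel's correction by default (it divides by n - 1), so its result is slightly different from the square root of the population variance:

```python
import torch
import torch.nn.functional as F

eps = 1e-6
x = torch.tensor([[1.0, 2.0, 3.0, 6.0]])      # one item with 4 features (made-up values)

mean = x.mean(dim=-1, keepdim=True)
var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)   # population variance, as in the formula

by_formula = (x - mean) / torch.sqrt(var + eps)             # the equation from the theory part
by_std = (x - mean) / (x.std(dim=-1, keepdim=True) + eps)   # the std + epsilon variant
built_in = F.layer_norm(x, normalized_shape=(4,), eps=eps)

print(by_formula)
print(by_std)        # smaller in magnitude, since x.std() divides by n - 1
print(torch.allclose(by_formula, built_in, atol=1e-5))
```

With only 4 features the gap is visible; for a large `d_model` the two versions get much closer, since n and n - 1 barely differ.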

---

Next stop, let's explore the **Feed Forward Layer**:

From the paper, 
<div align="center">
    <img src= "./screen_shots_from_the_OG_paper/Feed_forward.jpg" width="600px">
    <p><i>SS from the paper</i></p>
</div>

First we compute linear transformation 1, $xW_{1} + b_{1}$, then pass it through the $\text{ReLU}$ activation, and the output acts as the input to linear transformation 2:
$$ \text{FFN}(x) = max(0, xW_{1} + b_{1})W_{2} + b_{2}$$

In [None]:
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff) # The first linear layer with W1 and b1
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model) # The second linear layer with W2 and b2
    
    def forward(self, x):
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))

| Description       | Details                                                        |
|-------------------|----------------------------------------------------------------|
| In Linear Layer 1 | **Input_dimension**: d_model ; **Output_dimension**: d_ff (inner_layer)|
| In Linear Layer 2 | **Input_dimension**: d_ff ; **Output_dimension**: back to d_model      |
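
The dimensions in the table can be traced with a quick shape check. This is a minimal sketch that uses the base-model sizes from the paper (`d_model = 512`, `d_ff = 2048`) and a made-up input batch; dropout is left out to keep it short:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048                 # base-model sizes from the paper
linear_1 = nn.Linear(d_model, d_ff)
linear_2 = nn.Linear(d_ff, d_model)

x = torch.randn(2, 10, d_model)           # (batch, seq_length, d_model), made-up batch
hidden = torch.relu(linear_1(x))          # (2, 10, 2048), negatives clipped to 0
out = linear_2(hidden)                    # (2, 10, 512), back to d_model

print(x.shape, hidden.shape, out.shape)
```

`nn.Linear` only acts on the last dimension, so the same two layers are applied to every position independently.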


---
And finally, time for the most important block: the **Multi-Head Attention Block** 

Before diving into the Multi-Head Attention Block, we need to know about the Scaled Dot-product attention mechanism.

**Scaled Dot-Product Mechanism**: 

<div align="center">
    <img src="./screen_shots_from_the_OG_paper/Scaled-dot_product_attn.jpg" width="250px">
    <p><i>Figure representing Scaled Dot-product Attention</i></p>
</div>

_Explaining the process:_

1. **Sentence Breakdown**:
    - Start with any sentence. Its embedded tokens are turned into "Query", "Keys", and "Values".

2. **Query-Key Multiplication**:
    - Multiply the Query and Keys. This multiplication helps in determining the relevance of each key with respect to the query.

3. **Scaling**:
    - Scale the multiplied values. This step ensures that the values are within a manageable range, preventing extremely large values that could destabilize the training process.

4. **Masking (Optional)**:
    - Pass the scaled values through a masking layer. Masking is used to prevent attending to certain positions, typically in the context of sequence-to-sequence models.

5. **Activation Function**:
    - Apply the activation function (softmax, as mentioned in the paper). This step converts the scaled values into probabilities, highlighting the most relevant keys.

        - Given by formula: $$\text{Attention}(Q,K,V) = \text{softmax}(\frac{QK^T}{\sqrt{d_{k}}})V $$

        where,

        - $d_{k}$ — dimension of keys == dimension of queries (so they can be multiplied duhh.. ) &&

        - $d_{v}$ — dimension of values.

6. **Final Multiplication**:
    - Multiply the output of the activation function with the Values. This step produces the `Attention(Q, K, V)` value, which is a weighted sum of the values based on the relevance scores.
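
The six steps above can be sketched in a few lines. This is only an illustration under assumed sizes (`seq_length = 4`, `d_k = 8`) with random tensors standing in for Q, K and V, and it uses a lower-triangular (causal) mask as one example of the optional masking step:

```python
import math
import torch

torch.manual_seed(0)

seq_length, d_k = 4, 8                          # tiny illustrative sizes (assumed)
q = torch.randn(1, seq_length, d_k)             # (batch, seq_length, d_k)
k = torch.randn(1, seq_length, d_k)
v = torch.randn(1, seq_length, d_k)

scores = q @ k.transpose(-2, -1)                # step 2: (1, 4, 4) relevance scores
scores = scores / math.sqrt(d_k)                # step 3: scaling

# step 4 (optional): a lower-triangular mask, so position i only sees positions <= i
mask = torch.tril(torch.ones(seq_length, seq_length))
scores = scores.masked_fill(mask == 0, -1e9)

weights = scores.softmax(dim=-1)                # step 5: each row becomes probabilities
output = weights @ v                            # step 6: weighted sum of the values

print(weights[0])
print(output.shape)
```

Each row of `weights` sums to 1, and the masked positions end up with a weight of 0 after the softmax.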



**Multi-Head Attention**

<div align="center">
    <img src="./screen_shots_from_the_OG_paper/Multihead.jpg" width="250px">
    <p><i>Figure Representing the Multi-Head Attention Process</i></p>
</div>

- In Scaled Dot-Product Attention, we focus on a single set of queries, keys, and values. However, with Multi-Head Attention, we apply this mechanism across multiple sets of queries, keys, and values, enabling the model to capture different features from the input sequence.

- Each set of queries, keys, and values is linearly projected before being fed into the Scaled Dot-Product Attention layer. This allows each head to focus on different parts of the input, enhancing the model's ability to understand complex patterns.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float):
        super().__init__()
        self.d_model= d_model
        self.h= h
        assert d_model % h ==0, "d_model is not divisible by h"
        
        self.d_k = d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        
    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]
        
        attention_scores = query @ key.transpose(-2, -1) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
        if mask is not None:
            # masked_fill is not in-place, so the result has to be assigned back
            attention_scores = attention_scores.masked_fill(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim = -1)
        
        if dropout is not None:
            attention_scores = dropout(attention_scores)
            
        return (attention_scores @ value), attention_scores
    
    def forward(self, q, k, v, mask):
        query = self.w_q(q)
        key = self.w_k(k)
        value = self.w_v(v)
        
        ## (Batch_size, seq_length, d_model) -> (Batch_size, seq_length, h, d_k) -> (Batch_size, h, seq_length, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1,2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1,2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1,2)
        
        x, self.attention_scores = MultiHeadAttention.attention(query, key, value, mask, self.dropout)
        
        ## (Batch_size, h, seq_length, d_k) -> (Batch_size, seq_length, h, d_k) -> (Batch_size, seq_length, d_model)
        x= x.transpose(1, 2).contiguous().view(x.size(0), -1, self.h * self.d_k)  #@ self.h * self.d_k = d_model
        
        #@ Now finally we multiply this with the output weights:
        return self.w_o(x)       

_Code Explanation of above Cell:_

- **Starting from `h`**: `h` is the number of attention heads. The assertion ensures `d_model` is divisible by `h` so that each head gets an equal portion of model dimension.
- `d_k`: The dimension of each attention head.
- `w_q, w_k, w_v`: Are the linear layers to project input queries(Q), keys(K) and values(V) from `d_model` to `d_model`.
- `w_o`: is the linear layer to combine the outputs of attention heads

- Next, we have defined the attention function statically, 

    Reason behind defining with `@staticmethod`:
    - **No Dependency on Class Instance**: The attention function doesn't need any instance-specific data or attributes to perform its operations. It relies solely on the inputs provided (`query`, `key`, `value`, `mask`, and `dropout`). Therefore, it can be defined as a static method because it doesn't require access to the instance (self) of the class.
    - **Reusability**: Static methods can be called without needing to create an instance of the class. This makes the attention function more flexible and reusable in different contexts, even outside the `MultiHeadAttention` class if needed.

- We’ve reintroduced the dimension d_k inside the static function because the static function acts independently. This time, we simply extract the last value from the query shape as d_k, which represents the dimension of the embedded vectors. This works because it’s just a matrix multiplication after all. 🤔 _(The figure at the end of this cell will definitely help us understand that)_ 

- Then, in `attention_scores`, we have just applied the formula $ \frac{Q K^T}{\sqrt{d_{k}}}$. Here, `key.transpose(-2, -1)` swaps the last two dimensions of the key to give $K^T$.

> PS: If you are wondering what @ is, learn python haiyaa.. why are you even here? ✋😕🤚

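- Likewise, on the same `attention_scores`, if a mask is provided, the positions where the mask is 0 are filled with a very large negative value (`-1e9`), so that softmax gives them a weight of practically 0, i.e., they are just ignored...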

- Then, pass it through the softmax activation

- Also, if dropout is provided, apply it to `attention_scores`. It helps prevent overfitting.

- Then matrix multiplication with values and attention score itself.
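
One detail worth checking on the masking step: `masked_fill` (without the trailing underscore) returns a new tensor and leaves the original untouched, so its result needs to be assigned. A minimal sketch with made-up scores:

```python
import torch

scores = torch.tensor([[2.0, 1.0, 0.5]])
mask = torch.tensor([[1, 1, 0]])                 # 0 marks a position to ignore

scores.masked_fill(mask == 0, -1e9)              # result discarded, scores unchanged
unchanged = scores.clone()

masked = scores.masked_fill(mask == 0, -1e9)     # keep the returned tensor instead
weights = masked.softmax(dim=-1)

print(unchanged)   # still holds 0.5 in the last slot
print(weights)     # the last weight is 0, the first two sum to 1
```

The in-place variant is `masked_fill_`, which would modify `scores` directly.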

_In forward function:_

- We just passed query,key and value through the respective linear layers.

Now explaining this section:
```python 
query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)
```

Okay so typically, the shape/size of `q,k and v` are in the format: `(Batch_size, seq_len, d_model)`. Now, we are just changing the view to be in the format of : `(Batch_size, seq_len, h, d_k)`

Then after transposing 1 and 2 positions, we get the shape of: `(Batch_size, h, seq_len, d_k)`

But the major question is why change the view?
Answer to that is here:

1. **Multi-Head Attention Mechanism:** In the Transformer model, the attention mechanism is split into multiple heads. This allows the model to focus on different parts of the input sequence simultaneously. Each head has its own set of weights and performs its own attention calculation.

2. **Reshaping for Parallel Computation:** The original shape of the tensors `query, key, and value` is `(Batch_size, seq_len, d_model)`. `d_model` is the dimension of the model, which is split into `h` heads, each with a dimension of `d_k` (where $d_{k} = \frac{d_{model}}{h}$). By reshaping the tensors to `(Batch_size, seq_len, h, d_k)`, we prepare them for parallel computation across the multiple heads.

3. **Transposing for Efficient Computation:** After reshaping, we transpose the tensors to `(Batch_size, h, seq_len, d_k)`.
This transposition ensures that the heads are the second dimension, which allows for efficient computation of attention scores across the heads.

Then let's talk about the contiguous memory allocation. After attention, the heads are transposed back and merged into `d_model` again. The `.contiguous()` call ensures that the tensor's memory layout is suitable for the next operation (`view`). This step is necessary because `transpose` only changes the strides and can leave the tensor non-contiguous in memory, which `view` cannot handle.
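
The whole split-and-merge round trip can be followed on a tiny tensor. This is a sketch under assumed sizes (`batch = 2`, `seq_length = 4`, `h = 2`, `d_k = 3`); `heads.contiguous()` is used here only as a stand-in for the attention output, which is a freshly allocated tensor in the per-head layout:

```python
import torch

batch, seq_length, h, d_k = 2, 4, 2, 3          # tiny illustrative sizes (assumed)
d_model = h * d_k

x = torch.randn(batch, seq_length, d_model)     # (2, 4, 6)

# Split into heads: (batch, seq_length, h, d_k) -> (batch, h, seq_length, d_k)
heads = x.view(batch, seq_length, h, d_k).transpose(1, 2)
print(heads.shape)

out = heads.contiguous()                         # stand-in for the attention output
back = out.transpose(1, 2)                       # (batch, seq_length, h, d_k)
print(back.is_contiguous())                      # transpose only changes the strides

try:
    back.view(batch, -1, h * d_k)                # view on the non-contiguous tensor
    view_failed = False
except RuntimeError:
    view_failed = True

merged = back.contiguous().view(batch, -1, h * d_k)   # (2, 4, 6)
print(view_failed, merged.shape, torch.equal(merged, x))
```

Since no attention is actually applied in this sketch, merging the heads gives back exactly the original `x`, which confirms that the reshaping itself does not shuffle any values.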

<div align="center">
    <img src="./screen_shots_from_the_OG_paper/Mechanism_explained.png">
    <p><i>Multihead Attention visually explained</i></p>
    <p><i>*Courtesy: Umar Jamil sir's video*</i></p>
</div>

---

Well, now finally, we define the **residual_connections** or the **skip_connections**:

In [None]:
class ResidualConnection(nn.Module):
    def __init__(self, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization()
    
    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

All the required functions for the Encoder/Decoder blocks are done. Now, we just need to define the Encoder block...

---

In [None]:
class EncoderBlock (nn.Module):
    def __init__(self, self_attn: MultiHeadAttention, feed_forward: FeedForward, dropout: float):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.residual_connection = nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])
        
    def forward(self,x, mask): 
        ''' Why apply the mask here?
        The source mask hides padding tokens, so the self-attention mechanism does not attend to them.
        (Hiding future tokens is the job of the target mask in the decoder.) '''
        x= self.residual_connection[0](x, lambda x: self.self_attn(x, x, x, mask))
        x= self.residual_connection[1](x, self.feed_forward)
        return x

_Code Explanations:_

So I was gonna use 
```python 
    self.residual_connection_1 = ResidualConnection(dropout)
    self.residual_connection_2 = ResidualConnection(dropout)
```
but well, found out that using `nn.ModuleList` is tidier for this kind of thing...

Now, I'd like to explain the forward function in this module:
```python
x= self.residual_connection[0](x, lambda x: self.self_attn(x, x, x, mask))
x= self.residual_connection[1](x, self.feed_forward)
```
So here, first let's bring in the figure to explain this more clearly.

<div align= "center">
    <img src= "./screen_shots_from_the_OG_paper/Encoder_Block_zoomed.jpg" width= "300px">
    <p><i>Encoder Block Zoomed with labelled Skip/Residual Connections</i></p>
</div>

The first residual or skip connection is defined as `[0]`, and the lambda function is used to pass arguments to the self_attn function dynamically. Specifically, it allows the self-attention layer to take $Q,K,V$ `(query, key, value)` all as $x$, along with the mask.

The second residual connection, `[1]`, wraps the feed-forward block, applying it directly to the output from the first residual connection.
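
The role of the lambda is easier to see in isolation. In this sketch, `residual` and `toy_attention` are simplified, hypothetical stand-ins (no normalization, no dropout, no real attention); they only mimic the call signatures:

```python
import torch

def residual(x, sublayer):
    # Simplified stand-in for ResidualConnection: no norm, no dropout
    return x + sublayer(x)

def toy_attention(q, k, v, mask):
    # Hypothetical placeholder with the same 4-argument signature as self_attn
    return (q + k + v) * mask

x = torch.ones(2, 3)
mask = torch.tensor([1.0, 0.0, 1.0])

# residual() only ever calls sublayer(x); the lambda supplies the other arguments
out = residual(x, lambda t: toy_attention(t, t, t, mask))
print(out)
```

The feed-forward block already takes a single argument, which is why it can be passed to the residual connection directly, without a lambda.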

---

Now defining the Encoder as a whole:


In [None]:
class Encoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()
    
    def forward(self, x, mask):
        for layer in self.layers:
            x= layer(x, mask)
        return self.norm(x)  #Since after every layer we have Add and Norm layer.. 🤷

And with this, the encoder is pretty much done...

---

Next, let's go for the **decoder block**

In [None]:
class DecoderBlock(nn.Module):
    def __init__(self, self_attn: MultiHeadAttention, cross_attn: MultiHeadAttention, \
                 feed_forward: FeedForward, dropout: float):
        super().__init__()
        self.self_attn = self_attn
        self.cross_attn = cross_attn
        self.feed_forward = feed_forward
        self.residual_connection = nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])
        
    def forward(self, x, enc_output, src_mask, trgt_mask):
        x = self.residual_connection[0](x, lambda x: self.self_attn(x, x, x, trgt_mask))
        x = self.residual_connection[1](x, lambda x: self.cross_attn(x, enc_output, enc_output, src_mask))
        x = self.residual_connection[2](x, self.feed_forward)
        return x

In [None]:
''' Now coding the Decoder class '''

class Decoder(nn.Module):
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()
    
    def forward(self, x, enc_output, src_mask, trgt_mask):
        for layer in self.layers:
            x = layer(x, enc_output, src_mask, trgt_mask)
        return self.norm(x)