## Learning the Transformers architecture in detail

---

<div align="center">
    <img src= "https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" alt="The transformers architecture" width="250px">
    <p style="text-align: center;"><i>The Transformer Architecture from the OG paper</i></p>
</div>

I have a foundational understanding of the Transformer architecture from the _'Attention Is All You Need'_ paper, and now I'm diving deeper into the concepts. Currently, I'm learning from **Umar Jamil** Sir's 'Coding a Transformer from Scratch' course on PyTorch, combined with **Jay Alammar**'s blog.

Fingers crossed everything goes smoothly! 🤞

---

Starting with the `input_embedding` layer:

<div align="center">
    <img src= "https://images.ctfassets.net/k07f0awoib97/2n4uIQh2bAX7fRmx4AGzyY/a1bc6fa1e2d14ff247716b5f589a2099/Screen_Recording_2023-06-03_at_4.52.54_PM.gif" alt="Input Embedding" width="550px">
    <p><i>The Embedding Mechanism illustrated</i></p>
</div>

In [1]:
##@ All required imports are here in this cell

import torch 
import torch.nn as nn

In [2]:
class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # The embedding is done in PyTorch with a simple nn.Embedding layer
        self.embed = nn.Embedding(vocab_size, d_model)
        
    def forward(self, x):
        # Scale by sqrt(d_model); torch.sqrt expects a tensor, so use a plain power here
        return self.embed(x) * (self.d_model ** 0.5)

The embedding lookup itself is handled by `nn.Embedding`; in the forward pass we additionally scale its output by $\sqrt{d_{model}}$, for two main reasons:

1. The $\sqrt{d_{model}}$ scaling keeps the magnitudes of the input embedding and the positional encoding appropriately balanced.
2. It keeps the token information from being drowned out once the positional encoding is added.

> _src: "StackOverflow"_

_Well Mathematically_,

Let's assume:

- Input embeddings as $\text{E}$
- Positional encodings as $\text{P}$

Then, the combined input to the model would be: $$\text{X} = \sqrt{d_{model}} \times \text{E} + \text{P}$$


Note that no `softmax` is involved at this stage; it only shows up later, inside attention and at the output layer.

This scaling ensures that the variance of the embeddings is in line with the variance of the positional encodings, leading to more stable training and better convergence.
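
As a quick sanity check of the scaling step, here is a minimal sketch. The sizes (`d_model = 512`, `vocab_size = 1000`) and the token ids are made-up values for illustration, not something taken from the course.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, vocab_size = 512, 1000          # illustrative sizes (assumed)
embed = nn.Embedding(vocab_size, d_model)

# A batch of 2 sequences with 4 token ids each (arbitrary ids).
token_ids = torch.tensor([[1, 5, 9, 2], [7, 3, 0, 4]])

raw = embed(token_ids)                   # shape: (2, 4, 512)
scaled = raw * math.sqrt(d_model)        # same shape, larger magnitude

print(raw.shape, scaled.shape)
print(raw.std().item(), scaled.std().item())
```

The scaled tensor keeps the shape `(batch, seq_length, d_model)`; only its magnitude grows, by a factor of $\sqrt{512} \approx 22.6$ in this example.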

After `input_embedding`, time for `positional_encoding`

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_length: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.seq_length = seq_length
        self.dropout = nn.Dropout(dropout)
        
        ## Create a positional encoding matrix of size seq_length x d_model
        pe = torch.zeros(seq_length, d_model)
        ## Create a vector of shape (seq_length, 1)
        pos = torch.arange(0, seq_length).unsqueeze(1)  ## The numerator part 
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model)) ## The denominator part
        
        ## Apply sine to even dimensions (indices 0, 2, 4, ...)
        pe[:, 0::2] = torch.sin(pos * div)
        
        ## Apply cosine to odd dimensions (indices 1, 3, 5, ...)
        pe[:, 1::2] = torch.cos(pos * div)
        
        ## Add a batch dimension to the positional encoding
        pe = pe.unsqueeze(0)     
        
        ## Register the positional encoding as a buffer
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        x= x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)

_Explaining the code above:_

> "Since our model contains no recurrence and no convolution, in order for the model to make use of the
> order of the sequence, **we must inject some information about the relative or absolute position of the
> tokens in the sequence**. To this end, we add "positional encodings" to the input embeddings at the
> bottoms of the encoder and decoder stacks. **The positional encodings have the same dimension $d_{model}$
> as the embeddings**, so that the two can be summed. There are many choices of positional encodings,
> learned and fixed."
>
> _src: paper_

Here the authors state that position information has to be injected into the sequence, so the module needs to know the maximum `seq_length` up front.
Next, the encoding must have the same dimension `d_model` that we used earlier in `InputEmbeddings`, so that the two can be summed. Finally, a `dropout` is applied to the sum in order
to reduce overfitting.


In the same subsection, the paper gives the `positional_encoding` formulas. Different formulas are used for even and odd dimensions.

The arguments are the same; we just pass them through different functions:

- $\sin$ for even dimensions and
- $\cos$ for odd dimensions.

i.e., $$\text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$
and $$\text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$
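
To make the two formulas concrete, here is a deliberately slow sketch that fills the table one entry at a time. The helper name `positional_encoding_loop` and the tiny sizes are hypothetical, chosen only for illustration; the vectorized version in the class above is what you would actually use.

```python
import math

import torch


def positional_encoding_loop(seq_length: int, d_model: int) -> torch.Tensor:
    """Build the table entry by entry, straight from the two formulas."""
    pe = torch.zeros(seq_length, d_model)
    for pos in range(seq_length):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos, 2 * i] = math.sin(angle)       # even dimension -> sine
            pe[pos, 2 * i + 1] = math.cos(angle)   # odd dimension  -> cosine
    return pe


pe = positional_encoding_loop(seq_length=4, d_model=6)
print(pe[0])   # position 0: sin(0) = 0 in even slots, cos(0) = 1 in odd slots
print(pe[1])
```

Position 0 is a handy check: every angle is zero there, so the row alternates between 0 and 1.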

So, the numerators and denominators are defined before passing through these functions.

_Numerator part first:_
```python 
pos = torch.arange(0, seq_length).unsqueeze(1)
```

Here, `pos` is unsqueezed along dimension 1, turning it into a column vector of shape `(seq_length, 1)`.
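
To see why the column shape matters, this small sketch (sizes chosen arbitrarily) shows how a `(seq_length, 1)` column broadcasts against a row of per-dimension frequencies, giving one angle per (position, frequency) pair:

```python
import math

import torch

seq_length, d_model = 6, 8               # tiny illustrative sizes (assumed)

pos = torch.arange(0, seq_length).unsqueeze(1)            # shape (6, 1)
div = torch.exp(
    torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
)                                                         # shape (4,)

# Broadcasting: (6, 1) * (4,) -> (6, 4)
angles = pos * div

print(pos.shape, div.shape, angles.shape)
```

Without the `unsqueeze(1)`, `pos` would have shape `(6,)` and could not be broadcast against `div` of shape `(4,)`.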

_Denominator part:_

```python
div = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))
```

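The formula calls for dividing by $10000^{\frac{2i}{d_{model}}}$, but in code it is easier to compute the reciprocal through `exp` and `log` and then multiply:

$$\frac{1}{10000^{\frac{2i}{d_{model}}}} = \exp\left(-\ln(10000) \cdot \frac{2i}{d_{model}}\right)$$

Here `torch.arange(0, d_model, 2)` supplies the values $2i$ directly. The negative sign is what turns the denominator into a multiplier, so `pos * div` equals $\frac{pos}{10000^{2i/d_{model}}}$, and `div` shrinks progressively as $i$ increases.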
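
The identity is easy to confirm numerically. A minimal sketch, assuming a small `d_model = 8` just for the check:

```python
import math

import torch

d_model = 8                                   # illustrative size (assumed)
two_i = torch.arange(0, d_model, 2).float()   # the values 2i: 0, 2, 4, 6

# exp/log form, as used in the code
div_exp = torch.exp(two_i * (-math.log(10000.0) / d_model))

# direct form, as written in the paper's formula
div_pow = 1.0 / (10000.0 ** (two_i / d_model))

print(div_exp)
print(div_pow)
print(torch.allclose(div_exp, div_pow))
```

Both tensors should agree up to floating-point error, and the values get smaller as $i$ grows.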


Next, we add a batch dimension to make the positional encoding `pe` compatible with **batch processing**. The resulting shape is `(1, seq_length, d_model)`.

Then comes registering the buffer. We register `pe` as a buffer in the `PositionalEncoding` module so that it is not treated as a learnable parameter but
is still part of the module's state. That means it is saved and loaded with the model's `state_dict` and moves with the module across devices.
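
The difference between a buffer and a parameter can be seen on a toy module. `TinyModule` below is a hypothetical class written only for this illustration:

```python
import torch
import torch.nn as nn


class TinyModule(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))         # learnable
        self.register_buffer("pe", torch.zeros(1, 4, 3))  # fixed state


m = TinyModule()
param_names = [name for name, _ in m.named_parameters()]
state_keys = list(m.state_dict().keys())

print(param_names)          # names the optimizer would see
print(state_keys)           # names that get saved and loaded
print(m.pe.requires_grad)   # buffers do not track gradients
```

The buffer shows up in `state_dict()` but not in `named_parameters()`, so an optimizer built from `m.parameters()` never updates it.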

---

Next, entering inside the box and coding the easiest piece first: the normalization half of the **Add & Norm** layer, aka. _Layer Normalization_

**Layer Normalization**: Technique to stabilize and speed up the training of NNs.

Let's assume _a batch_ of 'n' items. For each item we compute its own mean $\mu$ and variance $\sigma^2$ across its features.

Now for normalizing the items in this batch, we calculate the new value as: $$ \hat{x_{i}} = \frac{x_{i} - \mu}{\sqrt{\sigma ^2 + \epsilon}}$$

where, $\epsilon$ is a small constant to prevent the division by zero.

Then, there comes scaling and shifting. aka. the multiplicative and additive steps.

After normalization, the model applies **learnt parameters** $\gamma$(scaling) and $\beta$(shifting) as: $$y_{i} = \gamma \cdot \hat{x_i} + \beta$$

And this looks very similar to the equation of a **fully connected layer**, i.e., $$y= W \cdot x + b$$

_More on these learnt parameters:_

- $\gamma$, also known as the _multiplicative_ or _scaling_ parameter, scales (stretches/shrinks) the normalized value $\hat{x_i}$.
    - if $\gamma < 1$, it compresses,
    - if $\gamma > 1$, it amplifies.
- $\beta$, also known as the _additive_ or _shifting_ parameter, shifts the scaled normalized values up and down. **It allows the network to "center" the activations where needed.**

> _The $\gamma$ and $\beta$ parameters allow flexibility, ensuring the network can adapt the normalized values when necessary._

In [None]:
class LayerNormalization(nn.Module):
    def __init__(self, epsilon: float = 10**-6):
        super().__init__()
        self.epsilon = epsilon
        self.gamma = nn.Parameter(torch.ones(1))  # The scaler
        self.beta = nn.Parameter(torch.zeros(1))  # The shifter aka bias
        
    def forward(self, x):
        mean = x.mean(dim= -1, keepdim=True)
        std = x.std(dim= -1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.epsilon) + self.beta

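_Code Explanation incoming:_

Okay, so the theory part used the variance. How come the code uses the standard deviation?

Well, the standard deviation is just the square root of the variance, so dividing by `std + epsilon` plays the same role as dividing by $\sqrt{\sigma^2 + \epsilon}$: in both cases $\epsilon$ keeps the denominator away from zero.
The two are not numerically identical, since $\epsilon$ sits outside the square root here and `x.std()` applies Bessel's correction by default, but the difference is small and the code is simpler.

i.e., $\text{standard\_deviation} = \sqrt{\text{variance}}$

Then, we initialize the learnable parameter $\gamma$ to 1 and $\beta$ to 0, so the layer starts out as plain normalization.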
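
As a cross-check of the theory formula, here is a minimal sketch comparing a hand-written normalization with PyTorch's built-in `torch.nn.functional.layer_norm`. The tensor sizes are assumptions for illustration, and $\gamma = 1$, $\beta = 0$ are left out since they do nothing at initialization.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 4, 8)        # (batch, seq_length, d_model), assumed sizes
eps = 1e-6

mean = x.mean(dim=-1, keepdim=True)

# Theory version: population variance, epsilon inside the square root
var = x.var(dim=-1, keepdim=True, unbiased=False)
x_hat = (x - mean) / torch.sqrt(var + eps)

# Code version from the class above: sample std, epsilon outside
x_std = (x - mean) / (x.std(dim=-1, keepdim=True) + eps)

reference = F.layer_norm(x, normalized_shape=(8,), eps=eps)

print(torch.allclose(x_hat, reference, atol=1e-5))
print((x_std - reference).abs().max().item())   # small but nonzero gap
```

The theory version should match the built-in closely, while the `std`-based version differs by a roughly constant factor that comes from Bessel's correction.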

---

Next stop, let's explore the **Feed Forward Layer**:

From the paper, 
<div align="center">
    <img src= "./screen_shots_from_the_OG_paper/paper1.jpg" width="600px">
    <p><i>SS from the paper</i></p>
</div>

First we compute linear transformation 1, $xW_{1} + b_{1}$, then pass the result through the $\text{ReLU}$ activation, and its output becomes the input to linear transformation 2:
$$ \text{FFN}(x) = \max(0, xW_{1} + b_{1})W_{2} + b_{2}$$

In [None]:
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff) # The first linear layer with W1 and b1
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model) # The second linear layer with W2 and b2
    
    def forward(self, x):
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))

| Layer          | Input dimension | Output dimension       |
|----------------|-----------------|------------------------|
| Linear Layer 1 | `d_model`       | `d_ff` (inner layer)   |
| Linear Layer 2 | `d_ff`          | back to `d_model`      |
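
The dimension flow in the table can be traced with a short sketch. It uses `d_model = 512` and `d_ff = 2048`, the base-model values from the paper; the batch size and sequence length are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048            # base-model sizes from the paper
linear_1 = nn.Linear(d_model, d_ff)
linear_2 = nn.Linear(d_ff, d_model)

x = torch.randn(2, 10, d_model)      # (batch, seq_length, d_model), arbitrary

hidden = torch.relu(linear_1(x))     # expands to (2, 10, 2048); max(0, .)
out = linear_2(hidden)               # projects back to (2, 10, 512)

print(x.shape, hidden.shape, out.shape)
```

Because both linear layers act only on the last dimension, the same transformation is applied to every position independently.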


---
And finally, it's time for the most important block: the **Multi-Head Attention Block**