## Learning the Transformers architecture in detail

---

<div align="center">
    <img src= "https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" alt="The transformers architecture" width="250px">
    <figcaption><i>The Transformers Architecture from the OG paper</i></figcaption>
</div>

I have a foundational understanding of the Transformer architecture from the _'Attention is All You Need'_ paper, but now I'm diving deeper into the concepts. Currently, I'm learning from **Umar Jamil** Sir's 'Coding a Transformer from Scratch' course on PyTorch, combined with **Jay Alammar**'s blog.

Fingers crossed everything goes smoothly! 🤞

---

Starting with the `input_embedding` layer:

<div align="center">
    <img src= "https://images.ctfassets.net/k07f0awoib97/2n4uIQh2bAX7fRmx4AGzyY/a1bc6fa1e2d14ff247716b5f589a2099/Screen_Recording_2023-06-03_at_4.52.54_PM.gif" alt="Input Embedding" width="550px">
    <figcaption><i>The Embedding Mechanism illustrated</i></figcaption>
</div>

In [1]:
##@ All required imports are here in this cell

import torch 
import torch.nn as nn

In [2]:
class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        ## In PyTorch, the embedding lookup is just an nn.Embedding layer
        self.embed = nn.Embedding(vocab_size, d_model)
        
    def forward(self, x):
        ## Scale by sqrt(d_model). torch.sqrt expects a tensor (and has no dtype argument),
        ## so a plain Python power is used on the integer d_model instead.
        return self.embed(x) * (self.d_model ** 0.5)

Okay, so the embedding lookup itself is handled by `nn.Embedding`, but we scale the embeddings by $\sqrt{d_{model}}$ in the forward pass, for two main reasons:

1. The $\sqrt{d_{model}}$ scaling keeps the magnitudes of the input embedding and the positional encoding appropriately balanced.
2. It also helps preserve the token information, so the embedding is not drowned out by the positional signal once the two are summed.

> _src: "StackOverflow"_

_Well Mathematically_,

Let's assume:

- Input embeddings as $\text{E}$
- Positional encodings as $\text{P}$

Then, the combined input to the model would be: $$\text{X} = \sqrt{d_{model}} \times \text{E} + \text{P}$$


Ofc, dropout is applied to this sum before it enters the first encoder layer, but yeah..

This scaling is meant to keep the magnitude of the embeddings in line with that of the positional encodings (whose values always lie in $[-1, 1]$), which should help with more stable training and better convergence.
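
To make the formula $\text{X} = \sqrt{d_{model}} \times \text{E} + \text{P}$ a bit more concrete, here is a minimal sketch. The sizes, the token ids, and the random stand-in for $\text{P}$ are all assumptions chosen for illustration; the real positional encodings are built in the next section.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, vocab_size = 512, 1000          # illustrative sizes (assumption)
embed = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[1, 5, 42, 7]])    # made-up ids, shape (batch=1, seq_len=4)
E = embed(token_ids)                         # (1, 4, 512)

# Stand-in for the positional encodings: random values in [-1, 1],
# the same range that sin/cos encodings live in (assumption for this demo only)
P = torch.rand(1, 4, d_model) * 2 - 1

X = math.sqrt(d_model) * E + P               # combined input to the model

# The scaling multiplies the overall embedding norm by sqrt(d_model)
scale_ratio = (math.sqrt(d_model) * E).norm() / E.norm()
print(X.shape)
print(scale_ratio.item(), math.sqrt(d_model))
```

The two printed numbers on the last line should agree up to floating-point error, which is all the scaling does: it stretches every embedding vector by the same constant factor before the positions are added.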

After `input_embedding`, time for `positional_encoding`

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_length: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.seq_length = seq_length
        self.dropout = nn.Dropout(dropout)
        
        ## Create a positional encoding matrix of size seq_length x d_model
        pe = torch.zeros(seq_length, d_model)
        ## Create a vector of shape (seq_length, 1)
        pos = torch.arange(0, seq_length).unsqueeze(1)  ## The numerator part 
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model)) ## The denominator part
        
        ## Apply sine to the even dimensions (0, 2, 4, ...)
        pe[:, 0::2] = torch.sin(pos * div)
        
        ## Apply cosine to the odd dimensions (1, 3, 5, ...)
        pe[:, 1::2] = torch.cos(pos * div)
        
        ## Add a batch dimension to the positional encoding: (1, seq_length, d_model)
        pe = pe.unsqueeze(0)
        
        ## Register the positional encoding as a buffer
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        ## Slice the encoding to the actual length of the input sequence before adding it
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)

_Explaining the code above:_

> "Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, **we must inject some information about the relative or absolute position of the tokens in the sequence**. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. **The positional encodings have the same dimension $d_{model}$ as the embeddings**, so that the two can be summed. There are many choices of positional encodings, learned and fixed."

 *-src: paper*

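Here the authors of the paper clearly state that position information has to be injected into the sequence, so ofc a `seq_length` (the maximum number of positions we precompute encodings for) needs to be defined.
Next, the encoding must use the same dimension `d_model` that we used earlier in the `input_embeddings`, so that the two can be summed. And finally there's a `dropout`, which is applied to the sum in order
to help prevent the model from overfitting.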
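
To see these three arguments at work, here is a small usage sketch. It repeats a condensed copy of the class above so that it runs on its own (using `math.log` instead of `torch.log(torch.tensor(...))`, which is my substitution), and the sizes are arbitrary assumptions. Dropout is set to `0.0` only so the output is deterministic.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Condensed copy of the class above, repeated so this cell is self-contained."""
    def __init__(self, d_model: int, seq_length: int, dropout: float):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(seq_length, d_model)
        pos = torch.arange(0, seq_length).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))   # saved with the model, not trained

    def forward(self, x):
        return self.dropout(x + self.pe[:, : x.shape[1], :])

# Illustrative sizes (assumptions): room for 50 positions, input uses only 10
pos_enc = PositionalEncoding(d_model=8, seq_length=50, dropout=0.0)
x = torch.zeros(2, 10, 8)            # (batch, seq_len, d_model), zeros so the output IS the encoding
out = pos_enc(x)

n_params = sum(p.numel() for p in pos_enc.parameters())
print(out.shape)                     # same shape as the input
print(n_params)                      # 0: the encoding is a fixed buffer, nothing is learned
print(out[0, 0])                     # position 0: sin(0)=0 on even dims, cos(0)=1 on odd dims
```

Note how `seq_length` only sets an upper bound: the forward pass slices the buffer down to the length of whatever sequence actually comes in.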


Then, from the same subsection, come the `positional_encoding` formulas. Different formulas are used for the even and odd dimensions.

The argument is the same in both cases; it is just passed through different functions:

- $\sin$ for even dimensions and
- $\cos$ for odd dimensions.

i.e., $$ \text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$
and, $$ \text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$
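
As a sanity check, the two formulas can be evaluated directly, one entry at a time, in plain Python. The helper name `pe_value` and the tiny `d_model = 4` are my own choices for illustration, not something from the course or the paper.

```python
import math

def pe_value(pos: int, dim: int, d_model: int) -> float:
    """Positional encoding for a single (position, dimension) pair, straight from the formulas."""
    i = dim // 2                                   # dims 2i and 2i+1 share the same frequency
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 4                                        # tiny size, for readability only
row0 = [pe_value(0, d, d_model) for d in range(d_model)]
row1 = [pe_value(1, d, d_model) for d in range(d_model)]

print(row0)   # [0.0, 1.0, 0.0, 1.0]
print(row1)   # sin(1), cos(1), sin(1/100), cos(1/100)
```

This loop version is slow but easy to read; the vectorised PyTorch code computes the same table in one shot.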

In the code, the numerator and the denominator are computed separately before being passed through these functions.

_Numerator part first:_
```python
pos = torch.arange(0, seq_length).unsqueeze(1)
```

Here, `pos` is unsqueezed along dimension 1, which turns it from a flat vector of shape `(seq_length,)` into a column vector of shape `(seq_length, 1)`.
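
The column-vector shape matters because of broadcasting: multiplying a `(seq_length, 1)` tensor by a vector of per-dimension factors yields a full `(seq_length, d_model / 2)` grid. A minimal sketch, where `freqs` is a hand-written placeholder for the per-dimension factors (these particular numbers happen to be $10000^{-2i/8}$ for $i = 0..3$):

```python
import torch

seq_length = 5                                       # illustrative size (assumption)

pos_flat = torch.arange(0, seq_length)               # shape (5,)
pos = pos_flat.unsqueeze(1)                          # shape (5, 1): one row per position

freqs = torch.tensor([1.0, 0.1, 0.01, 0.001])        # placeholder factors, shape (4,)

# (5, 1) * (4,) broadcasts to (5, 4): every position paired with every factor
angles = pos * freqs

print(pos_flat.shape, pos.shape, angles.shape)
print(angles[2])                                     # row for position 2: 2 * freqs
```

Without the `unsqueeze(1)`, a `(5,)` tensor times a `(4,)` tensor would raise a shape error instead of producing the grid.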

_Denominator part:_

```python
div = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))
```

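Well, the denominator was meant to be $10000^{\frac{2i}{d_{model}}}$, but in the coding implementation we can go through the log and exponential instead:

$$ 10000^{\frac{2i}{d_{model}}} \equiv \exp\left(\ln(10000) \cdot \frac{2i}{d_{model}}\right)$$

Also, the log is negated so that `div` holds the reciprocal, $10000^{-\frac{2i}{d_{model}}}$. Multiplying `pos` by `div` is then the same as dividing by the denominator, and the factor shrinks progressively as $i$ increases.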
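
A quick numerical check of this identity, as a sketch with an arbitrarily chosen `d_model = 8` (the variable names `div_log` and `div_pow` are mine):

```python
import torch

d_model = 8                                                  # illustrative size (assumption)
two_i = torch.arange(0, d_model, 2).float()                  # the values of 2i: 0, 2, 4, 6

# Log/exp form, as used in the code above
div_log = torch.exp(two_i * (-torch.log(torch.tensor(10000.0)) / d_model))

# Direct form: the reciprocal of 10000^(2i / d_model)
div_pow = 1.0 / (10000.0 ** (two_i / d_model))

print(div_log)
print(div_pow)
print(torch.allclose(div_log, div_pow, rtol=1e-4))           # both forms agree
print(bool(torch.all(div_log[1:] < div_log[:-1])))           # factors shrink as i grows
```

For `d_model = 8` the factors come out as roughly 1, 0.1, 0.01 and 0.001, so later dimension pairs oscillate more and more slowly with position.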