# Overview

We have several notebooks that introduce the Transformer architecture:
 
 * [Encoder in Transformer](https://www.kaggle.com/code/aisuko/encoder-in-transformers-architecture)
 * [Decoder in Transformer](https://www.kaggle.com/code/aisuko/decoder-in-transformers-architecture)
 * [Multiple Head Attention](https://www.kaggle.com/code/aisuko/mask-multi-multi-head-attention)
 

## Let's give a short review of these components.


**Encoder**

Each encoder block has two sub-layers: a `Multi-Head Attention` mechanism and a fully connected `Feed-Forward network`. A residual connection wraps each of the two sub-layers, followed by layer normalization of the sub-layer's output. All sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model}=512$.
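
Written as a formula, in the notation of the original paper, the output of each sub-layer can be summarized as follows, where $\text{Sublayer}(x)$ stands for the function the sub-layer itself implements (attention or feed-forward):

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$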

**Decoder**

The decoder follows a similar structure, but it inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. The self-attention sub-layer in the decoder is also modified to prevent positions from attending to subsequent positions. This masking, combined with the output embeddings being offset by one position, ensures that the predictions for position `i` depend only on the known outputs at positions less than `i`.
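
The masking idea can be illustrated with a minimal sketch in plain Python (no PyTorch, so the arithmetic stays visible). The helper names `causal_mask` and `masked_softmax` are hypothetical and are not part of the notebooks linked above. Position `i` is allowed to attend to itself because the decoder inputs are shifted right by one position. In a PyTorch model, a mask like this would more typically be built with `torch.tril`.

```python
import math

def causal_mask(seq_len):
    """Return a seq_len x seq_len mask: True where attention is allowed (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Masked positions are set to -inf, so exp(-inf) = 0 after the softmax
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        peak = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

mask = causal_mask(4)
scores = [[0.0] * 4 for _ in range(4)]  # uniform raw attention scores
weights = masked_softmax(scores, mask)

print(weights[0])  # [1.0, 0.0, 0.0, 0.0]
print(weights[1])  # [0.5, 0.5, 0.0, 0.0]
```

Each row spreads its attention only over itself and earlier positions, so no information flows from future tokens.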

Both the encoder and decoder blocks are repeated N times. The original paper uses N=6, and we will use a similar value in this notebook.

# Input Embeddings

The `InputEmbeddings` class below is responsible for converting the input tokens into numerical vectors of `d_model` dimensions. To prevent the input embeddings from becoming extremely small, we scale them by multiplying by $\sqrt{d_{model}}$.

In [None]:
import math
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model=d_model # Dimension of vectors (512)
        self.vocab_size=vocab_size # Size of the vocabulary
        self.embedding=nn.Embedding(vocab_size, d_model)
    
    def forward(self, x):
        return self.embedding(x)*math.sqrt(self.d_model) # scaling the embeddings by sqrt(d_model)
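
To make the lookup-and-scale step concrete, here is a rough stand-alone sketch in plain Python. The `table` list is a hypothetical stand-in for the weight matrix of `nn.Embedding` (which PyTorch initializes from a standard normal distribution), and the tiny vocabulary size and token ids are made up for illustration.

```python
import math
import random

random.seed(0)
d_model, vocab_size = 512, 10  # tiny hypothetical vocabulary

# Stand-in for the embedding weight table: one N(0, 1) vector per token id
table = [[random.gauss(0.0, 1.0) for _ in range(d_model)]
         for _ in range(vocab_size)]

def embed(token_ids):
    """Look up each token's vector and scale it by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [[v * scale for v in table[t]] for t in token_ids]

out = embed([3, 1, 4])
print(len(out), len(out[0]))         # 3 512
print(round(math.sqrt(d_model), 2))  # 22.63
```

With $d_{model}=512$, every embedding component is therefore multiplied by roughly 22.63.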

# Positional Encoding

In the original paper, the authors add the positional encodings to the input embeddings at the bottom of both the encoder and decoder stacks so that the model has some information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension $d_{model}$ as the embeddings, so the two vectors can be summed, combining the semantic content of the word embeddings with the positional information of the positional encodings.

In the `PositionalEncoding` class below, we will create a matrix of positional encodings `pe` with dimensions `(seq_len, d_model)`. We will start by filling it with 0s. We will then apply the sine function to even indices of the positional encoding matrix while the cosine function is applied to the odd ones.

$$\text{Even indices } (2i):\quad PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

$$\text{Odd indices } (2i+1):\quad PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$
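
As a sanity check on the two formulas, the following sketch evaluates them directly in plain Python for a tiny table. The function name `positional_encoding` and the sizes `seq_len=4`, `d_model=4` are chosen here purely for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) sinusoidal table as nested lists."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # col is the even column index, i.e. the "2i" of the formulas
        for col in range(0, d_model, 2):
            angle = pos / (10000 ** (col / d_model))
            pe[pos][col] = math.sin(angle)          # even column
            if col + 1 < d_model:
                pe[pos][col + 1] = math.cos(angle)  # odd column
    return pe

pe = positional_encoding(seq_len=4, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0], since sin(0) = 0 and cos(0) = 1
```

Each pair of columns shares one frequency, and the frequency decreases as the column index grows.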

We use the sine and cosine functions because they allow the model to determine the position of a word relative to the positions of other words in the sequence: for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. This follows from the angle-addition identities of sine and cosine, where a shift in the input results in a predictable change in the output.
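
The linear relationship can be checked numerically. For each sine/cosine pair, shifting by $k$ amounts to multiplying the pair by a 2x2 rotation matrix that depends only on $k$ and the pair's frequency, not on `pos`. The sketch below, with hypothetical helper names `pe_pair` and `shift_pair` and arbitrarily chosen values for `pos` and `k`, verifies this in plain Python.

```python
import math

def pe_pair(pos, i, d_model):
    """Return (PE(pos, 2i), PE(pos, 2i+1)) for pair index i."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    return math.sin(omega * pos), math.cos(omega * pos)

def shift_pair(pair, k, i, d_model):
    """Apply the 2x2 rotation for offset k; it does not depend on pos."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    s, c = pair
    ck, sk = math.cos(omega * k), math.sin(omega * k)
    # sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
    # cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
    return ck * s + sk * c, -sk * s + ck * c

d_model, pos, k = 512, 7, 5
for i in (0, 10, 255):
    shifted = shift_pair(pe_pair(pos, i, d_model), k, i, d_model)
    direct = pe_pair(pos + k, i, d_model)
    assert all(abs(a - b) < 1e-9 for a, b in zip(shifted, direct))
```

The assertions pass because the rotated pair at `pos` matches the pair computed directly at `pos + k`.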

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model:int, seq_len:int, dropout:float) -> None:
        super().__init__()
        self.d_model=d_model # Dimensionality of the model
        self.seq_len=seq_len # Maximum sequence length
        self.dropout=nn.Dropout(dropout) # dropout layer to prevent overfitting
        
        # creating a positional encoding matrix of shape (seq_len, d_model) filled with zeros
        pe=torch.zeros(seq_len, d_model)
        
        # creating a tensor representing positions (0 to seq_len -1)
        position=torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1) # transforming `position` into a 2D tensor[seq_len,1]
        
        # creating the division term 1/10000^(2i/d_model), computed via exp and log
        div_term=torch.exp(torch.arange(0, d_model, 2).float()*(-math.log(10000.0)/d_model))
        
        # apply sine to even indices in pe
        pe[:,0::2]=torch.sin(position*div_term)
        
        # apply cosine to odd indices in pe
        pe[:,1::2]=torch.cos(position*div_term)
        
        # adding an extra dimension at the beginning of pe matrix for batch handling
        pe=pe.unsqueeze(0) # shape becomes (1, seq_len, d_model)
        
        # registering 'pe' as buffer, buffer is a tensor not considered as a model parameter
        self.register_buffer('pe',pe)
    
    def forward(self, x):
        # adding positional encoding to the input tensor X
        x=x+(self.pe[:,:x.shape[1],:].requires_grad_(False))
        return self.dropout(x) # dropout for regularization