# Overview

This notebook builds on several earlier notebooks that introduce the Transformer architecture:
 
 * [Encoder in Transformer](https://www.kaggle.com/code/aisuko/encoder-in-transformers-architecture)
 * [Decoder in Transformer](https://www.kaggle.com/code/aisuko/decoder-in-transformers-architecture)
 * [Multi-Head Attention](https://www.kaggle.com/code/aisuko/mask-multi-multi-head-attention)
 

## A short review of these components


**Encoder**

Each encoder block has two sub-layers: a `Multi-Head Attention` mechanism and a fully connected `Feed-Forward` network. A residual connection wraps each of the two sub-layers, followed by layer normalization of the sub-layer output. All sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model}=512$.

**Decoder**

The decoder follows a similar structure, but it inserts a third sub-layer that performs multi-head attention over the output of the encoder block. The self-attention sub-layer in the decoder block is also modified to prevent positions from attending to subsequent positions. This masking ensures that the predictions for position `i` depend solely on the known outputs at positions less than `i`.
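
The masking described above can be illustrated with a lower-triangular matrix. The following is a minimal sketch, not the notebook's own implementation: `seq_len` and the all-zero scores are arbitrary illustrative choices, and `torch.tril` is just one common way to build such a mask.

```python
import torch

seq_len = 4  # hypothetical sequence length, chosen only for illustration

# Lower-triangular matrix: entry (i, j) is True when position i may attend to position j (j <= i)
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Uniform raw scores, so the effect of the mask is easy to see
scores = torch.zeros(seq_len, seq_len)

# Disallowed (future) positions get -inf so that softmax assigns them zero weight
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(causal_mask.int())
print(weights)
```

Each row of `weights` spreads its attention only over the current and earlier positions, so nothing to the right of the diagonal receives any weight.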

Both the encoder and decoder blocks are repeated N times. The original paper uses N=6, and we will define a similar value in this notebook.

# Input Embeddings

The `InputEmbeddings` class below is responsible for converting the input tokens into numerical vectors of `d_model` dimensions. To keep the input embeddings from becoming extremely small, we scale them by multiplying by $\sqrt{d_{model}}$.

In [None]:
import math
import torch # needed by the later cells (torch.zeros, torch.arange, ...)
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model=d_model # Dimension of vectors (512)
        self.vocab_size=vocab_size # Size of the vocabulary
        self.embedding=nn.Embedding(vocab_size, d_model)
    
    def forward(self, x):
        return self.embedding(x)*math.sqrt(self.d_model) # scaling the embeddings by sqrt(d_model)
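
As a quick shape check, the sketch below applies the same lookup-and-scale operation to a small batch. It is only an illustration: `vocab_size = 1000` and the random token ids are assumptions, and the embedding layer is created directly rather than through the `InputEmbeddings` class so that the snippet runs on its own.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, vocab_size = 512, 1000  # vocab_size is an arbitrary illustrative value
embedding = nn.Embedding(vocab_size, d_model)

# A batch of 2 sequences with 5 token ids each (random ids, for illustration only)
token_ids = torch.randint(0, vocab_size, (2, 5))

# Same operation as InputEmbeddings.forward: look up, then scale by sqrt(d_model)
scaled = embedding(token_ids) * math.sqrt(d_model)

print(scaled.shape)  # torch.Size([2, 5, 512])
```

Every token id becomes a vector of length `d_model`, so the output has shape `(batch, seq_len, d_model)`.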

# Positional Encoding

In the original paper, the authors add the positional encodings to the input embeddings at the bottom of both the encoder and decoder block so the model can have some information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two vectors can be summed and we can combine the semantic content from the word embeddings and positional information from the positional encodings.

In the `PositionalEncoding` class below, we will create a matrix of positional encodings `pe` with dimensions `(seq_len, d_model)`. We will start by filling it with 0s. We will then apply the sine function to even indices of the positional encoding matrix while the cosine function is applied to the odd ones.

$$\text{Even indices } (2i):\quad PE(pos, 2i)=\sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

$$\text{Odd indices } (2i+1):\quad PE(pos, 2i+1)=\cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

We apply the sine and cosine functions because they allow the model to determine the position of a word based on the positions of other words in the sequence, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. This follows from the properties of sine and cosine, where a shift in the input results in a predictable change in the output.
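
The linear relationship can be checked numerically for a single (sine, cosine) pair. In the sketch below, the pair index `i`, the position `pos`, and the offset `k` are arbitrary illustrative values, and `pe_pair` is a hypothetical helper. By the angle-addition identities, the shift corresponds to a rotation matrix that depends only on `k`.

```python
import math

d_model = 512
i = 3            # index of the (sine, cosine) pair; arbitrary illustrative choice
pos, k = 10, 7   # position and offset; arbitrary illustrative choices

omega = 1.0 / (10000 ** (2 * i / d_model))  # angular frequency of this pair


def pe_pair(p):
    """Return the (sine, cosine) pair of the encoding at position p."""
    return math.sin(omega * p), math.cos(omega * p)


s, c = pe_pair(pos)

# Rotation matrix that depends only on the offset k, not on pos
rot = [[math.cos(omega * k), math.sin(omega * k)],
       [-math.sin(omega * k), math.cos(omega * k)]]

# Matrix-vector product rot @ [s, c]
shifted = (rot[0][0] * s + rot[0][1] * c,
           rot[1][0] * s + rot[1][1] * c)

print(shifted)
print(pe_pair(pos + k))
```

Both printed pairs agree up to floating-point error, which is the sense in which $PE_{pos+k}$ is a linear function of $PE_{pos}$.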

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model:int, seq_len:int, dropout:float) -> None:
        super().__init__()
        self.d_model=d_model # Dimensionality of the model
        self.seq_len=seq_len # Maximum sequence length
        self.dropout=nn.Dropout(dropout) # dropout layer to prevent overfitting
        
        # creating a positional encoding matrix of shape (seq_len, d_model) filled with zeros
        pe=torch.zeros(seq_len, d_model)
        
        # creating a tensor representing positions (0 to seq_len -1)
        position=torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1) # transforming `position` into a 2D tensor[seq_len,1]
        
        # creating the division term for the positional encoding formula
        div_term=torch.exp(torch.arange(0, d_model, 2).float()*(-math.log(10000.0)/d_model))
        
        # apply sine to even indices in pe
        pe[:,0::2]=torch.sin(position*div_term)
        
        # apply cosine to odd indices in pe
        pe[:,1::2]=torch.cos(position*div_term)
        
        # adding an extra dimension at the beginning of pe matrix for batch handling
        pe=pe.unsqueeze(0) # (seq_len, d_model) --> (1, seq_len, d_model)
        
        # registering 'pe' as buffer, buffer is a tensor not considered as a model parameter
        self.register_buffer('pe',pe)
    
    def forward(self, x):
        # adding positional encoding to the input tensor X
        x=x+(self.pe[:,:x.shape[1],:].requires_grad_(False))
        return self.dropout(x) # dropout for regularization
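
The `div_term` expression can look opaque, because it computes $1/10000^{2i/d_{model}}$ through an exponential of a logarithm. The sketch below, with small illustrative sizes (`d_model = 8`, `seq_len = 6`, a zero-filled input), compares it against the direct form and then traces the shapes when the encoding is added to a batch. It rebuilds the tensors by hand instead of using the class, so that it runs on its own.

```python
import math
import torch

d_model, seq_len = 8, 6  # small illustrative values; the notebook uses d_model = 512

# Log-space form used in the PositionalEncoding class
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

# Direct form from the formula; arange(0, d_model, 2) already holds the values 2i
direct = 1.0 / (10000.0 ** (torch.arange(0, d_model, 2).float() / d_model))
print(torch.allclose(div_term, direct))

# Build the encoding matrix and add a leading batch dimension
position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # even columns
pe[:, 1::2] = torch.cos(position * div_term)  # odd columns
pe = pe.unsqueeze(0)                          # (1, seq_len, d_model)

# A batch of 2 sequences with 4 tokens each; pe is sliced to the actual length and broadcast
x = torch.zeros(2, 4, d_model)
out = x + pe[:, :x.shape[1], :]
print(out.shape)
```

Because the input can be shorter than `seq_len`, the slice `pe[:, :x.shape[1], :]` keeps only as many positions as there are tokens.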

# Layer Normalization

The architecture contains several normalization steps, labeled **Add & Norm** in the diagram from the original paper.

The `LayerNormalization` class below performs layer normalization on the input data. During its forward pass, we compute the mean and standard deviation of the input data along the last dimension. We then normalize the input data by subtracting the mean and dividing by the standard deviation plus a small number called **epsilon**, which avoids any division by zero. This process results in a normalized output with a mean of 0 and a standard deviation of 1.

We then scale the normalized output by a learnable parameter `alpha` and add a learnable parameter called `bias`. The training process is responsible for adjusting these parameters. The final result is a layer-normalized tensor, which keeps the scale of the inputs to the layers in the network consistent.

In [None]:
# creating layer normalization

class LayerNormalization(nn.Module):
    # we define epsilon as 0.000001 to avoid division by zero
    def __init__(self, eps: float=10**-6)-> None:
        super().__init__()
        self.eps=eps
        
        # we define alpha as a trainable parameter and initialize it with ones
        self.alpha=nn.Parameter(torch.ones(1)) # One-dimensional tensor that will be used to scale the input data
        
        # we define bias as a trainable parameter and initialize it with zeros
        self.bias=nn.Parameter(torch.zeros(1)) # One-dimensional tensor that will be added to the input data
        
    def forward(self, x):
        mean=x.mean(dim=-1, keepdim=True) # computing the mean of the input data. Keeping the number of dimensions unchanged
        std=x.std(dim=-1, keepdim=True) # computing the standard deviation of the input data. Keeping the number of dimensions unchanged
        
        # returning the normalized input
        return self.alpha*(x-mean)/(std+self.eps)+self.bias
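
The normalization step can be checked in isolation. The sketch below uses a random tensor with small illustrative sizes and omits `alpha` and `bias`, which start at 1 and 0 and therefore have no effect before training. As a point of comparison, it also calls `torch.nn.functional.layer_norm`, which uses the biased variance and adds epsilon inside the square root, so its result differs slightly from the formula used in this notebook.

```python
import torch

torch.manual_seed(0)

x = torch.randn(2, 5, 16)  # (batch, seq_len, d_model) with small illustrative sizes
eps = 1e-6

mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)  # unbiased estimate (divides by n - 1) by default
normalized = (x - mean) / (std + eps)

print(normalized.mean(dim=-1).abs().max())  # close to 0
print(normalized.std(dim=-1))               # close to 1 for every token

# Built-in functional layer norm, for comparison only
reference = torch.nn.functional.layer_norm(x, normalized_shape=(16,), eps=eps)
print((normalized - reference).abs().max())  # small but non-zero difference
```

The gap between the two results comes from the different variance estimators, and it shrinks as the normalized dimension grows.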

# Feed-Forward Network

In the fully connected feed-forward network, we apply two linear transformations with a ReLU activation in between. We can mathematically represent this operation as:

$$\text{FFN}(x)=\max(0, xW_{1}+b_{1})W_{2}+b_{2}$$

$W_{1}$ and $W_{2}$ are the weights, while $b_{1}$ and $b_{2}$ are the biases of the two linear transformations.

In the `FeedForwardBlock` below, we define the two linear transformations, `self.linear_1` and `self.linear_2`, and the inner-layer dimension `d_ff`. The input data first passes through the `self.linear_1` transformation, which increases its dimensionality from `d_model` to `d_ff`. The output of this operation passes through the ReLU activation function, which introduces non-linearity so the network can learn more complex patterns, and the `self.dropout` layer is then applied to mitigate overfitting. The final operation applies the `self.linear_2` transformation to the dropout-modified tensor, which transforms it back to the original `d_model` dimension.

In [None]:
class FeedForwardBlock(nn.Module):
    def __init__(self,d_model:int, d_ff:int, dropout:float) -> None:
        super().__init__()
        # First linear transformation
        self.linear_1=nn.Linear(d_model, d_ff) # W1 & b1
        self.dropout=nn.Dropout(dropout) # Dropout to prevent overfitting
        
        # Second linear transformation
        self.linear_2=nn.Linear(d_ff, d_model) # W2 & b2
    
    def forward(self, x):
        # (batch, seq_len, d_model) --> (batch, seq_len, d_ff) --> (batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))
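
To connect the code with the formula, the sketch below evaluates $\max(0, xW_{1}+b_{1})W_{2}+b_{2}$ explicitly and compares it with the chained `nn.Linear` calls. The sizes are small illustrative values (the original paper uses 512 and 2048), and dropout is left out, which corresponds to evaluation mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff = 8, 32  # small illustrative sizes
linear_1 = nn.Linear(d_model, d_ff)
linear_2 = nn.Linear(d_ff, d_model)

x = torch.randn(2, 5, d_model)  # (batch, seq_len, d_model)

# Module form, as in FeedForwardBlock.forward without dropout
module_out = linear_2(torch.relu(linear_1(x)))

# Explicit form of the formula.
# nn.Linear stores its weight as (out_features, in_features), hence the transposes.
W1, b1 = linear_1.weight.T, linear_1.bias
W2, b2 = linear_2.weight.T, linear_2.bias
manual_out = torch.clamp(x @ W1 + b1, min=0) @ W2 + b2

print(module_out.shape)
print(torch.allclose(module_out, manual_out, atol=1e-5))
```

The same weights are applied independently at every position, which is why the sequence dimension passes through unchanged.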

# Multi-Head Attention

The Multi-Head Attention is the most crucial component of the Transformer. It is responsible for helping the model to understand complex relationships and patterns in the data.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/963/621/063/011/468/original/629903a63a938f0a.png)

The Multi-Head Attention block receives the input data split into queries, keys, and values organized into matrices Q, K and V. Each matrix contains different facets of the input, and they have the same dimensions as the input.

![](https://hostux.social/system/media_attachments/files/111/603/992/766/474/377/small/5df72b068852f4da.webp)

We then linearly transform each matrix by its respective weight matrix: $W^Q$, $W^K$ or $W^V$. These transformations result in new matrices $Q'$, $K'$ and $V'$, which are split into $h$ smaller matrices, one per head, allowing the model to attend to information from different representation subspaces in parallel. This split creates a separate set of queries, keys, and values for each head.

Finally, we concatenate every head into a matrix $H$, which is then transformed by another weight matrix $W^O$ to produce the multi-head attention output, a matrix $MH\text{-}A$ that retains the input dimensionality.
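
The split into heads and the later concatenation are pure reshaping operations. The sketch below shows one common way to do this with `view` and `transpose`. The batch size and sequence length are illustrative, `h = 8` is the value from the original paper and is an assumption here, and the random tensor `q_prime` merely stands in for $Q'$.

```python
import torch

batch, seq_len, d_model, h = 2, 5, 512, 8  # batch and seq_len are illustrative
d_k = d_model // h  # dimensions per head

q_prime = torch.randn(batch, seq_len, d_model)  # stands in for Q' = Q W^Q

# (batch, seq_len, d_model) -> (batch, seq_len, h, d_k) -> (batch, h, seq_len, d_k)
heads = q_prime.view(batch, seq_len, h, d_k).transpose(1, 2)
print(heads.shape)

# Reverse the split: (batch, h, seq_len, d_k) -> (batch, seq_len, h * d_k)
concatenated = heads.transpose(1, 2).contiguous().view(batch, seq_len, h * d_k)
print(concatenated.shape)
```

Splitting and concatenating without any attention in between returns the original tensor, which confirms that no information is lost by the reshaping itself.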

In [None]:
class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h:int, dropout:float)-> None: # h= number of heads
        super().__init__()
        self.d_model=d_model
        self.h=h
        
        # we ensure that the dimension of the model is divisible by the number of heads
        assert d_model %h==0, 'd_model is not divisible by h'
        
        # d_k is the dimension of each attention head's key, query, and values vectors
        self.d_k =d_model // h # d_k formula, like in the original paper
        
        # defining the weight matrices
        self.w_q=nn.Linear(d_model, d_model) # W_q
        self.w_k=nn.Linear(d_model, d_model) # W_k
        self.w_v=nn.Linear(d_model, d_model) # W_v
        self.w_o=nn.Linear(d_model, d_model) # W_o
        
        self.dropout=nn.Dropout(dropout) # Dropout layer to avoid overfitting
        
    @staticmethod
    def attention(query, key, value, mask, dropout:nn.Dropout): # mask => when we want certain words not to interact with others, we hide them
        d_k=query.shape[-1] # the last dimension of query, key and value
        
        # we calculate the Attention(Q,K,V) as in the formula in the image above
        attention_scores=(query@key.transpose(-2, -1))/math.sqrt(d_k) # @ is the matrix multiplication operator
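
The cell above stops after computing the scaled scores. The remaining steps of scaled dot-product attention might look like the sketch below. This is not the notebook's own code: `scaled_dot_product_attention` is a hypothetical standalone helper, the optional `mask` and `dropout` arguments and the `-1e9` fill value are assumptions, and the tensor sizes are illustrative.

```python
import math
import torch


def scaled_dot_product_attention(query, key, value, mask=None, dropout=None):
    """Hypothetical standalone helper; not the notebook's own method."""
    d_k = query.shape[-1]
    # (batch, h, seq_len, d_k) @ (batch, h, d_k, seq_len) -> (batch, h, seq_len, seq_len)
    attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 are hidden by giving them a very negative score
        attention_scores = attention_scores.masked_fill(mask == 0, -1e9)
    # Softmax over the key dimension turns scores into weights that sum to 1
    attention_weights = attention_scores.softmax(dim=-1)
    if dropout is not None:
        attention_weights = dropout(attention_weights)
    # Weighted sum of the values, plus the weights for inspection
    return attention_weights @ value, attention_weights


torch.manual_seed(0)
q = k = v = torch.randn(1, 2, 4, 8)    # (batch, h, seq_len, d_k), illustrative sizes
causal = torch.tril(torch.ones(4, 4))  # broadcasts over batch and heads
output, weights = scaled_dot_product_attention(q, k, v, mask=causal)
print(output.shape, weights.shape)
```

Returning the weights alongside the output is a design choice that makes it easy to visualize which positions each token attends to.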

# Credit

* https://www.kaggle.com/code/lusfernandotorres/transformer-from-scratch-with-pytorch/notebook?scriptVersionId=157547654
* https://www.youtube.com/watch?v=ISNdQcPhsts&t=9595s