## What Are Transformers, and Why Use Them?

In Natural Language Processing (NLP), a *transformer* is a deep learning architecture that uses a *"multi-head attention"* mechanism to model the relationships between the words in a sentence. Because it captures complex contextual relationships across a whole sequence, it performs strongly on NLP tasks such as *machine translation*, *sentiment analysis*, and *question answering*, and it has largely replaced older models such as RNNs in modern NLP.

### 1. Send all words to the encoder in parallel
The attention mechanism solved the problem with longer sentences, but we are still sending the words one at a time. We are not yet able to send all the words at once.

This is not scalable: for a huge document, sending one word at a time costs a lot of time.

So we need a different architecture to handle this problem.

#### Transformers

Transformers do not use LSTM RNNs; instead they use a self-attention module.

All the words can be sent to the encoder in parallel. (Positional encoding is an important concept here.)

Hence it is scalable: we can train this model on huge amounts of data in less time than an RNN.

GPT and BERT are examples of transformers. Because these models are pre-trained on huge amounts of data, we can use *transfer learning* to build SOTA *(State of the Art)* models on top of them.

### 2. Contextual Embedding
*Contextual embedding*: an embedding in which each word's vector reflects its relationship with the other words in the sentence.

Example: I'm Krishna and I love to play Cricket.

In the above sentence, "I" is related to the name "Krishna", and a contextual embedding captures this relationship. This is achieved by the self-attention mechanism.

Transformers follow an encoder-decoder architecture, with multiple encoders and multiple decoders.



https://arxiv.org/html/1706.03762v7 - Attention Is All You Need

https://jalammar.github.io/illustrated-transformer/

#### 1. High Level Architecture

In [1]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="high_level_1.png", width=700) 

In [2]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="high_level_2.png", width=700) 

In [4]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="transformer_arch.png", width=700) 

#### 2. Encoders & Decoders

In [5]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="whatisinside_encoder_decoder.png", width=700) 

In [7]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="encoder.png", width=700) 

Here z1, z2, & z3 are contextual vectors

In [8]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="encoder1.png", width=700) 

The feed-forward neural network (ANN) transforms the vectors z1, z2, & z3 into new vectors and sends them to the next encoder.
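As a rough illustration of the feed-forward step above, here is a minimal NumPy sketch. It assumes the two-layer form with a ReLU used in the "Attention Is All You Need" paper; the sizes, the random weights, and the name `feed_forward` are made up for this example and do not come from the images.

```python
import numpy as np

def feed_forward(z, w1, b1, w2, b2):
    """Position-wise feed-forward network: ReLU(z @ w1 + b1) @ w2 + b2."""
    hidden = np.maximum(0.0, z @ w1 + b1)  # ReLU non-linearity
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 4, 8                 # illustrative sizes (assumption)
z = rng.normal(size=(3, d_model))    # rows stand in for z1, z2, z3
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

ffn_out = feed_forward(z, w1, b1, w2, b2)
print(ffn_out.shape)  # (3, 4)
```

The same weights are applied to every row independently, so each contextual vector is transformed on its own and keeps its original length.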

#### 3. Self Attention

In [11]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="self_attention_1.png", width=200) 

Self-attention, implemented in the transformer as scaled dot-product attention, is a crucial mechanism that allows the model to weigh the importance of different tokens in the input sequence relative to each other.

The self-attention layer converts *<strong>embedded vectors</strong>* into *<strong>contextual embedding</strong>* vectors using the scaled dot-product attention method.

<u>Inputs</u>: Queries, Keys, & Values



<u><font color='#FFA500'>Query Vector</font></u> (Q): Query vectors represent the token for which we are calculating the attention. They help determine the importance of the other tokens in the context of the current token.

<b>Importance:</b>

<font color='#FFA500'><u>Focus Determination</u></font>: Queries help the model to decide which parts of the sequence to focus on for each specific token. By calculating the dot product between a query vector and all key vectors, the model assesses how much attention to give to each token relative to the current token.

<u><font color='#FFA500'>Contextual Understanding</font></u>: Queries contribute to understanding the relationship between the current token and the rest of the sequence, which is essential for capturing dependencies and context.

<u><font color='#FFA500'>Key Vector</font></u> (K): Key vectors represent all the tokens in the sequence and are used to compare with the query vectors to calculate attention scores.

<b>Importance:</b>

<font color='#FFA500'><u>Relevance Measurement</u></font>: Keys are compared with queries to measure the relevance or compatibility of each token with the current token. This comparison helps in determining how much attention each token should receive.

<u><font color='#FFA500'>Information Retrieval</font></u>: Keys play a critical role in retrieving the most relevant information from the sequence by providing a basis for the attention mechanism to compute similarity scores.

<u><font color='#FFA500'>Value Vector</font></u> (V): Value vectors hold the actual information that will be aggregated to form the output of the attention mechanism.

<b>Importance:</b>

<font color='#FFA500'><u>Information Aggregation</u></font>: Values contain the data that will be weighted by the attention scores.  The weighted sum of values forms the output of the self-attention mechanism, which is then passed on to the next layers in the network.

<u><font color='#FFA500'>Context Preservation</font></u>: By weighting the values according to the attention scores, the model preserves and aggregates relevant context from the entire sequence, which is crucial for tasks like translation, summarization, and more.

1. Token Embedding
2. Linear Transformation

    We create Q, K, & V by multiplying the embeddings by learned weight matrices $ W_Q $, $ W_K $, & $ W_V $.

Here we initialize weight matrices sized to match the embedding dimension and compute the Q, K, & V vectors by matrix multiplication. In this example the weights are initialized as identity matrices to keep the arithmetic easy to follow.
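The linear transformation above can be sketched in a few lines of NumPy. The embedding values in `X` are made up for illustration (they are not the numbers from the image); the identity weights follow the simplification described above.

```python
import numpy as np

# Hypothetical embeddings: 3 tokens, 4 dimensions each (values are made up).
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])

d_model = X.shape[1]
# Identity weights, as in the simplified example; a real model learns these.
W_Q = np.eye(d_model)
W_K = np.eye(d_model)
W_V = np.eye(d_model)

Q = X @ W_Q
K = X @ W_K
V = X @ W_V
print(np.array_equal(Q, X))  # True
```

With identity weights Q, K, and V are all equal to the embeddings themselves, which is why this choice is convenient for following the later steps by hand.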

In [2]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="linear_transformation.png", width=500) 

3. Compute Attention Score

In [3]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="compute_att_score.png", width=500) 
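
As a small worked example of the attention score step, the sketch below takes the dot product of every query with every key. The `Q` values are made up, and `K` is set equal to `Q` on the assumption of identity weight matrices.

```python
import numpy as np

# Made-up query vectors for 3 tokens; K equals Q under identity weights.
Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])
K = Q.copy()

# scores[i, j] = dot product of query i with key j
scores = Q @ K.T
print(scores)
# [[2. 0. 2.]
#  [0. 8. 4.]
#  [2. 4. 4.]]
```

Row `i` of `scores` holds the raw, unscaled similarity of token `i` to every token in the sequence.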

4. Scaling

    We scale down the scores by dividing them by the square root of the key vector dimension, $ \sqrt{d_K} $. Scaling in the attention mechanism is crucial to prevent the dot products from growing too large, <font color="orange">which ensures stable gradients during training.</font>

    Without scaling there are two problems: 1. Exploding gradients, 2. Softmax saturation (the vanishing gradient problem).

In [4]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="without_scaling.png", width=500) 

Here, most of the attention weight is assigned to the first key vector, and very little to the second. This leads to softmax saturation: the gradients become very small, so the weights are barely updated during backpropagation (the vanishing gradient problem).

In [5]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="with_Scaling.png", width=500) 
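
To make the effect of scaling concrete, here is a small sketch comparing softmax on raw and scaled scores. The scores and `d_k = 64` are made-up numbers chosen for illustration, not the values from the images.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

d_k = 64                                  # assumed key dimension
raw_scores = np.array([16.0, 8.0, 0.0])   # made-up dot products

unscaled = softmax(raw_scores)
scaled = softmax(raw_scores / np.sqrt(d_k))  # divide by sqrt(d_k) = 8

print(np.round(unscaled, 4))
print(np.round(scaled, 4))
```

Without scaling, almost all of the weight lands on the first score (saturation). After dividing by $ \sqrt{d_K} $ the scores become 2, 1, and 0, and the weights are spread much more evenly.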

Summary of Importance of Scaling:

<b>Stabilizing Training</b>: Scaling prevents extremely large dot products, which helps in stabilizing the gradients during backpropagation, making the training process more stable and efficient.

<b>Preventing Saturation</b>: By scaling the dot products, the softmax function produces more balanced attention weights, preventing the model from focusing too heavily on a single token and ignoring others.

<b>Improved Learning</b>: Balanced attention weights enable the model to learn better representations by considering multiple relevant tokens in the sequence, leading to better performance on tasks that require context understanding.

<b>Scaling ensures that the dot products are kept within a range that allows the softmax function to operate effectively, providing a more balanced distribution of attention weights and improving the overall learning process of the model.</b>

In [8]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="scaling.png", width=500) 

# here we only showed the result for one word, with respect to the other words

5. Applying Softmax

In [7]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="apply_softmax.png", width=500) 

6. Weighted sum of values
    
    We multiply the attention weights by the corresponding value vectors and sum the results.
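
Putting the steps above together (scores, scaling, softmax, weighted sum), a minimal NumPy sketch of scaled dot-product attention might look like this. The function names and the input values are illustrative assumptions, and identity weights are assumed so that Q, K, and V all equal the embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 3 and 4: score, then scale
    weights = softmax(scores, axis=-1)   # step 5: each row sums to 1
    return weights @ V, weights          # step 6: weighted sum of values

# Made-up embeddings; with identity weights Q = K = V = X.
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])

context, weights = scaled_dot_product_attention(X, X, X)
print(weights.sum(axis=-1))  # each row sums to 1
print(context.shape)         # (3, 4)
```

Each row of `context` is the contextual vector for one token: a mix of all value vectors, weighted by how strongly that token attends to each of the others.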


In [9]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="weighted_sum.png", width=500) 

In [11]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="summary_self_attention.png", width=500) 

In [12]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="sa_one.png", width=500) 

In [14]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="sa_2.png", width=500) 

#### 4. Multihead Attention

In [15]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="mha_one.png", width=500) 

In [16]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="mha_two.png", width=500) 
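
As a rough sketch of the idea behind multi-head attention, the code below splits the projected Q, K, and V into several heads, runs scaled dot-product attention in each head, concatenates the results, and applies an output projection. The sizes, the random weights, and the names (`multi_head_attention`, `W_O`) are assumptions for illustration and are not taken from the images.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O                                  # output projection

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))  # 3 tokens, d_model = 8 (made-up sizes)
W_Q, W_K, W_V, W_O = (rng.normal(size=(8, 8)) for _ in range(4))

mha_out = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads=2)
print(mha_out.shape)  # (3, 8)
```

Each head works on its own slice of the projected vectors, so different heads can learn to focus on different relationships between the tokens.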

#### 5. Feed Forward Neural Network with Multihead attention

In [17]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="feed_forward_prep.png", width=500) 

https://colab.research.google.com/drive/1hXIQ77A4TYS4y3UthWF-Ci7V7vVUoxmQ?usp=sharing#scrollTo=twSVFOM9SopW

https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1

#### 6. Positional Encoding

Representing the order of the sequence


Appending the index to the end of each word vector becomes a problem when the position values get large. Instead, we add a positional vector of the same length as the word vector.

In [20]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="positional_encoding_one.png", width=700) 

Types of Positional Encoding:

1. Sinusoidal Positional Encoding
2. Learned Positional Encoding - Positional encodings are learned during training.

<b>Sinusoidal Positional Encoding:</b>  It uses sine and cosine functions of different frequencies to create the positional encoding.

In [22]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="pe_formula.png", width=500) 

Where,

* $pos$ is the position

* $i$ is the dimension index

* $d_\text{model}$ is the dimensionality of the embeddings
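
A minimal NumPy sketch of sinusoidal positional encoding follows. It assumes the formula in the image is the standard one from the "Attention Is All You Need" paper, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_\text{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_\text{model}})$, and that $d_\text{model}$ is even. The function name is made up for this example.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix; assumes d_model is even."""
    positions = np.arange(max_len)[:, None]        # pos, shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=5, d_model=4)
print(pe.shape)  # (5, 4)
print(pe[0])     # [0. 1. 0. 1.]
```

The resulting row for each position has the same length as a word vector, so it can be added to the embedding element by element.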

In [21]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="positional_encoding.png", width=500) 

In [23]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="pe_one.png", width=500) 

#### 7. Layer Normalization

Normalization: 1. Batch, 2. Layer

Batch normalization normalizes each feature (dimension) across the output vectors $Z_1$, $Z_2$, ..., $Z_n$.

Layer normalization normalizes across the features within each individual vector $Z_1$, $Z_2$, ..., $Z_n$: we calculate $\mu$ & $\sigma$ for each vector and then compute the z-score of its elements.

Transformers apply layer normalization. With batch normalization, positions that are all zeros (such as padding) would skew the shared statistics and affect learning.

$\gamma$ & $\beta$ are learnable scale and shift parameters. They let the model rescale, or even undo, the normalization of the final Z vectors when that helps learning.
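
Layer normalization with the learnable parameters can be sketched as follows. The input values, the `eps` constant, and the function name are assumptions for illustration; $\gamma$ and $\beta$ start at 1 and 0 so the output is the plain z-score.

```python
import numpy as np

def layer_norm(z, gamma, beta, eps=1e-5):
    """Normalize each row (one token vector) across its features."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    z_hat = (z - mu) / np.sqrt(var + eps)  # z-score per vector
    return gamma * z_hat + beta            # learnable scale and shift

# Made-up output vectors Z_1 and Z_2 with 4 features each.
Z = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
gamma = np.ones(4)   # learnable scale, initialized to 1
beta = np.zeros(4)   # learnable shift, initialized to 0

normalized = layer_norm(Z, gamma, beta)
print(np.round(normalized.mean(axis=-1), 6))  # approximately 0 for each row
```

Because the statistics are computed per vector, the result for one token does not depend on the other tokens or on the other sequences in the batch.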



In [25]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="learnable_params.png", width=500) 

In [24]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="add_and_normalize.png", width=500) 

In [26]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="normalize_one.png", width=500) 

#### 8. Encoder Architecture

In [27]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="encoder-decoer-arch.png", width=500) 

In [29]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="encoder_arch.png", width=700) 

Why Residuals?

In [30]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="residuals.png", width=700) 

Why Feed Forward Neural Net?

In [32]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="ffnn.png", width=700) 

#### 9. Decoder in Transformers

The transformer decoder is responsible for generating the output sequence one token at a time, using the encoder's output and the previously generated tokens.

1. Masked Multi Head Self Attention
2. Multi Head Attention (Encoder Decoder Attention)
3. Feed Forward Neural Network

The inputs are sent in a single pass, but the output is generated one word at a time.

1. Training Mechanism
2. Inference Mechanism
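
The one-token-at-a-time generation at inference can be sketched with a toy loop. Everything here is hypothetical: `toy_decoder_step` is a stand-in for a trained decoder (it ignores the encoder output and simply favors the next vocabulary entry), and the vocabulary is made up.

```python
import numpy as np

VOCAB = ["<start>", "i", "am", "a", "student", "<end>"]  # toy vocabulary

def toy_decoder_step(encoder_output, generated_ids):
    """Hypothetical stand-in for a trained decoder.

    Returns a probability distribution over VOCAB that favors the entry
    right after the last generated token.
    """
    next_id = min(generated_ids[-1] + 1, len(VOCAB) - 1)
    probs = np.full(len(VOCAB), 0.01)
    probs[next_id] = 1.0 - 0.01 * (len(VOCAB) - 1)
    return probs

def greedy_decode(encoder_output, max_len=10):
    generated = [VOCAB.index("<start>")]
    for _ in range(max_len):
        # Each step sees the encoder output and all tokens generated so far.
        probs = toy_decoder_step(encoder_output, generated)
        next_id = int(np.argmax(probs))  # pick the most likely token
        generated.append(next_id)
        if VOCAB[next_id] == "<end>":
            break
    return [VOCAB[i] for i in generated]

tokens = greedy_decode(encoder_output=None)
print(tokens)  # ['<start>', 'i', 'am', 'a', 'student', '<end>']
```

During training the whole target sequence is already known, so the decoder can process all positions at once with a mask; the step-by-step loop is only needed at inference.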

In [33]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="masked_multi_head_one.png", width=700) 

#### 10. Masked Multi Head Attention

1. Look Ahead Mask
2. Padding Mask

In [34]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="masked_multi_head_two.png", width=700) 

In [35]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="masked_multi_head_3.png", width=700) 

#### Masked Application

Masking helps manage the structure of the sequences being processed and ensures the model behaves correctly during training and inference.

Reason,

Handling variable-length sequences with a padding mask

Purpose,

1. To handle sequences of different lengths in a batch.
2. To ensure that padding tokens, which are added to make the sequences a uniform length, do not affect the model's predictions.
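
As a small illustration of the padding mask described above, the sketch below marks which positions in a batch hold real tokens. The token ids and the choice of `0` as the padding id are assumptions for this example.

```python
import numpy as np

PAD_ID = 0  # assumed id of the padding token

# Two made-up sequences padded to the same length of 5.
batch = np.array([[5, 7, 9, 0, 0],
                  [3, 4, 0, 0, 0]])

# True where there is a real token, False where there is padding.
padding_mask = batch != PAD_ID
print(padding_mask.astype(int))
# [[1 1 1 0 0]
#  [1 1 0 0 0]]
```

Positions marked `0` are later excluded from attention so that padding cannot influence the prediction.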

Padding mask

In [36]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="padding_mask.png", width=700) 

Look Ahead Masking -> Maintain Auto Regressive Property

This ensures that each position in the decoder output sequence can attend only to itself and earlier positions, never to future positions.
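
A look-ahead mask is simply a lower-triangular matrix. Here is a minimal sketch for a sequence of length 4 (the length is arbitrary):

```python
import numpy as np

seq_len = 4  # arbitrary length for illustration

# Row i is the query position, column j is the key position.
# True means position i is allowed to attend to position j.
look_ahead_mask = np.tril(np.ones((seq_len, seq_len))).astype(bool)
print(look_ahead_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

The first position can see only itself, while the last position can see the whole sequence generated so far.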

In [37]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="combined_mask_1.png", width=700) 

In [38]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="combined_mask_2.png", width=700) 

In [39]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="masked_score.png", width=700) 

In [40]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="masking.png", width=700) 
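
The sketch below shows one way the padding mask and the look-ahead mask can be combined and applied to the attention scores before softmax. The token ids, the constant scores, and the use of `-1e9` as the "minus infinity" fill value are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

PAD_ID = 0                           # assumed padding id
token_ids = np.array([5, 7, 9, 0])   # made-up sequence; last token is padding
seq_len = len(token_ids)

look_ahead = np.tril(np.ones((seq_len, seq_len))).astype(bool)
not_padding = (token_ids != PAD_ID)[None, :]  # hide padded key columns
combined_mask = look_ahead & not_padding

scores = np.ones((seq_len, seq_len))  # stand-in for Q K^T / sqrt(d_k)
masked_scores = np.where(combined_mask, scores, -1e9)
weights = softmax(masked_scores)
print(np.round(weights, 2))
```

After softmax, masked positions get a weight of effectively zero, so each token attends only to earlier, non-padding tokens.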

#### 11. Encoder-Decoder Multi Head Attention

In [41]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="ed-multihead-attention.png", width=700) 

In [42]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="ed-multihead.png", width=700) 
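
In encoder-decoder attention, the queries come from the decoder while the keys and values come from the encoder output. A minimal sketch under that assumption, with made-up sizes, random inputs, and a hypothetical function name `cross_attention`:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_Q, W_K, W_V):
    Q = decoder_states @ W_Q   # queries from the decoder
    K = encoder_output @ W_K   # keys from the encoder
    V = encoder_output @ W_V   # values from the encoder
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 4
encoder_output = rng.normal(size=(3, d_model))  # 3 source tokens
decoder_states = rng.normal(size=(2, d_model))  # 2 target tokens so far
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

cross_out, cross_weights = cross_attention(
    decoder_states, encoder_output, W_Q, W_K, W_V)
print(cross_weights.shape)  # (2, 3): each target token attends to 3 source tokens
```

No look-ahead mask is needed here, because every target position is allowed to see the entire input sentence.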

#### 12. Final Decoder Linear and Softmax Layer

In [46]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="final.png", width=700)



In [45]:
Image(url="softmax.png", width=700)

In [48]:
Image(url="target_trained.png", width=700)

Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process.
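
The final linear and softmax layers can be sketched as follows for a single time step. The toy vocabulary, the random weights, and the variable names are all made up for this illustration; a trained model would have learned `W_vocab`.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

vocab = ["i", "am", "a", "student", "<eos>"]  # toy vocabulary (assumption)
rng = np.random.default_rng(0)
d_model = 4

decoder_output = rng.normal(size=d_model)         # decoder vector for one step
W_vocab = rng.normal(size=(d_model, len(vocab)))  # linear layer weights

logits = decoder_output @ W_vocab  # one score per vocabulary word
probs = softmax(logits)            # scores turned into probabilities
predicted_word = vocab[int(np.argmax(probs))]

print(np.round(probs, 3))
print(predicted_word)
```

Every word receives a non-zero probability, even the unlikely ones, which is the softmax property mentioned above.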