## What are Transformers, and why?

In Natural Language Processing (NLP), a "transformer" is a deep learning architecture that uses a *"multi-head attention"* mechanism to model the relationships between the words in a sentence. Because it captures complex contextual relationships within a sequence, it achieves strong results on NLP tasks such as *machine translation*, *sentiment analysis*, and *question answering*, and it has largely replaced older models such as RNNs as the leading approach in modern NLP.

### 1. Send all words to the encoder in parallel
The attention mechanism solved the problem with longer sentences, but we are still sending the words to the encoder one at a time. We cannot send all the words at once.

Not scalable -> if the document is huge, sending one word at a time takes a huge amount of time.

So we need a different architecture to handle this problem.

#### Transformers

Transformers do not use LSTM RNNs; instead, they use a self-attention module.

All the words can be sent to the encoder in parallel. (Positional encoding -> an important concept, since it tells the model the order of the words.)

Hence it is scalable: we can train this model on huge amounts of data in less time than an RNN.

GPT and BERT are examples of transformers. Because these models are pretrained on huge amounts of data, we can use *transfer* learning to build SOTA *(State of the Art)* models on top of them.
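Because all words enter the encoder at once, the model needs some other signal for word order. One common choice is the sinusoidal positional encoding from "Attention Is All You Need", which is added to the word embeddings. The sketch below is a minimal NumPy version; the function name, the toy sizes, and the random embeddings are assumptions made for illustration, and it assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even. Even columns hold sines, odd columns hold cosines.
    """
    positions = np.arange(seq_len)[:, np.newaxis]           # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]     # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

# Toy example: 3 words, each with a hypothetical 8-dimensional embedding.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 8))
encoder_input = embeddings + sinusoidal_positional_encoding(3, 8)
print(encoder_input.shape)  # (3, 8)
```

Every row of `encoder_input` can now be processed at the same time, while still carrying information about its position in the sentence.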

### 2. Contextual Embedding
*Contextual embedding*: an embedding process in which each word's vector reflects its relationship with the other words in the sentence.

Example: I'm Krishna and I love to play Cricket. 

In the above sentence, "I" is related to the name "Krishna", and this relationship is captured in the contextual embedding. This is achieved by the self-attention mechanism.

Transformers follow an encoder-decoder architecture, with a stack of multiple encoders and a stack of multiple decoders.



https://arxiv.org/html/1706.03762v7 - Attention Is All You Need

https://jalammar.github.io/illustrated-transformer/

#### 1. High Level Architecture

In [1]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="high_level_1.png", width=700) 

In [2]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="high_level_2.png", width=700) 

In [4]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="transformer_arch.png", width=700) 

#### 2. Encoders & Decoders

In [5]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="whatisinside_encoder_decoder.png", width=700) 

In [7]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="encoder.png", width=700) 

Here z1, z2, & z3 are the contextual vectors.

In [8]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="encoder1.png", width=700) 

The Feed Forward Neural Network (ANN) converts the z1, z2, & z3 vectors into new vectors and sends them to the next encoder.
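The feed-forward step can be sketched as two linear layers with a ReLU in between, applied to each contextual vector independently. This is only a rough illustration: the function name, the random weights, and the sizes `d_model = 4` and `d_ff = 16` are assumptions, and the residual connections and layer normalization used in a full encoder are left out.

```python
import numpy as np

def position_wise_ffn(z, w1, b1, w2, b2):
    """Apply the same two-layer network to every row (token) of z."""
    hidden = np.maximum(0.0, z @ w1 + b1)  # linear layer followed by ReLU
    return hidden @ w2 + b2                # project back to d_model

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16                      # hypothetical toy sizes
z = rng.normal(size=(3, d_model))          # rows play the role of z1, z2, z3
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = position_wise_ffn(z, w1, b1, w2, b2)
print(out.shape)  # (3, 4)
```

Since each row is transformed on its own, the three vectors can be processed in parallel, and the output keeps the same shape so it can be passed to the next encoder.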

#### 3. Self Attention

In [11]:
# import image module 
from IPython.display import Image  
# get the image 
Image(url="self_attention_1.png", width=200) 

Self-attention, which the transformer computes using scaled dot-product attention, is a crucial mechanism in the architecture that allows the model to weigh the importance of different tokens in the input sequence relative to each other.

The self-attention layer converts *<strong>embedded vectors</strong>* into *<strong>contextual embedding</strong>* vectors using the Scaled Dot-Product Attention method.
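In the original paper, scaled dot-product attention is written as softmax(Q Kᵀ / sqrt(d_k)) V, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension. A minimal NumPy sketch of a single attention head is shown here. The function names, the random projection matrices `w_q`, `w_k`, `w_v`, and the toy sizes are assumptions for illustration; in a trained model the projections are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q = x @ w_q                          # one query vector per token
    k = x @ w_k                          # one key vector per token
    v = x @ w_v                          # one value vector per token
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # attention weights, one row per token
    return weights @ v, weights          # contextual vectors and the weights

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))              # 3 tokens, hypothetical 4-dim embeddings
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))

z, weights = self_attention(x, w_q, w_k, w_v)
print(z.shape)                # (3, 4)
print(weights.sum(axis=-1))   # each row sums to 1, up to floating point error
```

Each row of `z` is a weighted mix of all the value vectors, which is what makes the output embedding contextual.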

<u>Inputs</u>: Queries, Keys, & Values

<u><font color='#FFA500'>Query Vector</font></u> (Q): Query vectors represent the token for which we are calculating the attention. They help determine the importance of the other tokens in the context of the current token.

<u><font color='#FFA500'>Importance</font></u>: Queries help the model to decide which parts of the sequence to focus on for each specific token. By calculating the dot product between a query vector and all key vectors, the model assesses how much attention to give to each token relative to the current token.

<u><font color='#FFA500'>Contextual Understanding</font></u>: Queries contribute to understanding the relationship between the current token and the rest of the sequence, which is essential for capturing dependencies and context.