# Transformers Intuition

* A Transformer is a deep learning architecture designed for sequence data such as text, audio, and code. Unlike an RNN or LSTM, it does not process the sequence step by step; it looks at the whole sequence at once.

* Its core idea is attention: the model interprets each word in the context of all the other words. This largely removes the memory bottleneck of recurrent models, which have to carry earlier words forward through a single hidden state.

* The encoder part builds a representation of the input's meaning and the decoder part generates the output. Translation models and large language models are built from these parts.

* This design allows parallel processing, which makes training faster and makes large datasets practical to handle.

### Why it matters in AI/ML:

* A Transformer captures context well. Whether a word appears at the start or the end of a sentence, the attention mechanism can relate it to every other word.

* It handles long sequences much better than recurrent models. RNNs and LSTMs tend to lose track of earlier information in long sentences, whereas a Transformer can connect any two positions directly.

* In practice, Transformers serve as the backbone for translation, chatbots, sentiment analysis, document understanding, coding assistants, and vision tasks.

### Important components intuition:

* Self-Attention:

    - This mechanism compares each word with every other word in the sentence.

    - Example intuition:

        - Sentence: "The cat sat on the mat because it was warm"

        - What does "it" refer to?

        - A trained attention layer would give "it" its highest weights on the likely referents, "mat" and "cat", because those relations are the strongest.

    - The self-attention formula is built from queries, keys, and values: comparing queries with keys lets the model decide how much weight each word should get, and the values are then combined using those weights.
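
The query/key/value idea above can be sketched in a few lines. This is a minimal illustration of scaled dot-product attention, not the notebook's own code: the sizes and the random projection matrices `w_q`, `w_k`, `w_v` are assumptions chosen for the example (in a real model they are learned).

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_model = 5, 16            # illustrative sizes (assumed)
x = torch.randn(seq_len, d_model)   # one embedding vector per token

# Hypothetical projection matrices; a trained model learns these.
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

q, k, v = x @ w_q, x @ w_k, x @ w_v

# Compare every query with every key, scaled by sqrt of the key dimension.
scores = (q @ k.T) / math.sqrt(d_model)

# Softmax turns each row of scores into weights that sum to 1.
weights = torch.softmax(scores, dim=-1)

# Each output row is a weighted sum of the value vectors.
out = weights @ v

print(weights.shape)  # torch.Size([5, 5])
print(out.shape)      # torch.Size([5, 16])
```

Row `i` of `weights` says how much token `i` draws from every other token, which is the "how much weight per word" decision described above.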

### Positional Encoding:

* Self-attention on its own has no notion of word order. Positional encoding therefore gives each position a distinct pattern, which is added to the word vector so that the model can tell the order of the sequence.

* Example:

    - Position 1 = one sine/cosine pattern

    - Position 2 = a different pattern

    - This is how the model knows which word came earlier and which came later.
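
The sine/cosine patterns mentioned above could be generated as follows. This sketch assumes the fixed sinusoidal scheme with base 10000 and an even `d_model`; the helper name `sinusoidal_positions` and the sizes are made up for illustration, and many models use learned position embeddings instead.

```python
import math
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) table of sine/cosine position patterns.

    Assumes d_model is even.
    """
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # One frequency per pair of dimensions, decreasing geometrically.
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=10, d_model=16)

# The pattern is added to the token embeddings before the first layer.
embeddings = torch.randn(10, 16)
x = embeddings + pe

print(pe.shape)                       # torch.Size([10, 16])
print(torch.allclose(pe[0], pe[1]))   # False: each position gets its own pattern
```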

### Multi-Head Attention:

* A single attention head looks at one kind of relationship; multiple heads can look at several kinds at once.

* One head may pick up syntax.

* One head may track noun-adjective relations.

* One head may track long-range dependencies.

* The heads' outputs are combined to give a richer representation.
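
To make the "several heads, then combine" idea concrete, here is a hand-rolled sketch. It assumes random (untrained) projection matrices and illustrative sizes, and it leaves out the final output projection that a full multi-head layer applies after concatenation.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d_model, num_heads = 6, 32, 4   # illustrative sizes (assumed)
head_dim = d_model // num_heads

x = torch.randn(seq_len, d_model)

# Hypothetical projections; a trained model learns these.
w_q = torch.randn(d_model, d_model) / math.sqrt(d_model)
w_k = torch.randn(d_model, d_model) / math.sqrt(d_model)
w_v = torch.randn(d_model, d_model) / math.sqrt(d_model)

def split_heads(t: torch.Tensor) -> torch.Tensor:
    # (seq_len, d_model) -> (num_heads, seq_len, head_dim)
    return t.reshape(seq_len, num_heads, head_dim).transpose(0, 1)

q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

# Every head computes its own attention pattern over the same tokens.
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
weights = torch.softmax(scores, dim=-1)      # (num_heads, seq_len, seq_len)
per_head = weights @ v                       # (num_heads, seq_len, head_dim)

# Combine: concatenate the heads back into one vector per token.
combined = per_head.transpose(0, 1).reshape(seq_len, d_model)

print(weights.shape)   # torch.Size([4, 6, 6])
print(combined.shape)  # torch.Size([6, 32])
```

Each slice `weights[h]` is a separate attention map, so different heads are free to focus on different relationships.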

### Feed-Forward Network:

* After attention, simple dense layers refine the features of each token. This part further cleans up the representation that attention produced.
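
A rough sketch of such a feed-forward block is shown below. It assumes two linear layers with a ReLU in between and a hidden size of 2048 (the default `dim_feedforward` of `nn.TransformerEncoderLayer`); dropout is left out to keep it short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff = 512, 2048   # d_ff is an assumed hidden size

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # expand each token vector
    nn.ReLU(),
    nn.Linear(d_ff, d_model),   # project back to the model size
)

x = torch.randn(10, 1, d_model)   # (sequence, batch, features)
out = ffn(x)

# The same weights are applied to every position independently,
# so running one token on its own gives the same result.
single = ffn(x[3])
same = torch.allclose(out[3], single, atol=1e-4)

print(out.shape)  # torch.Size([10, 1, 512])
print(same)
```

Unlike attention, this step does not mix information between tokens; it only transforms each token's vector.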

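#### Simple Transformer encoder block: intuition code

In [2]:
import torch
import torch.nn as nn

# One encoder layer: self-attention followed by a feed-forward network.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512,  # size of each token vector
    nhead=8,      # number of attention heads
)

# Stack two identical layers.
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Input layout is (sequence length, batch size, feature size)
# because batch_first defaults to False:
# 10 = sequence length, 1 = batch size, 512 = feature size
x = torch.randn(10, 1, 512)

out = encoder(x)
print("Shapes : ", out.shape)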

Shapes :  torch.Size([10, 1, 512])




* The Transformer encoder takes each input token as a 512-dimensional vector.

* The 8 heads analyze different relations in the input in parallel.

* 2 layers means attention is applied twice in sequence, with the second layer refining the output of the first.

* The output shape equals the input shape because the encoder transforms each position of the sequence; it does not shrink the sequence.
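
As a quick check of the shape-preserving behavior described above, the sketch below feeds sequences of different lengths through a small encoder. The sizes (`d_model=32`, `nhead=4`, batch of 2) are assumptions chosen to keep the example fast, not values from the notebook.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()  # turn off dropout for a deterministic forward pass

shapes = []
with torch.no_grad():
    for seq_len in (3, 10, 25):
        x = torch.randn(seq_len, 2, 32)   # (sequence, batch, features)
        out = encoder(x)
        shapes.append((tuple(x.shape), tuple(out.shape)))
        print(tuple(x.shape), "->", tuple(out.shape))
```

Whatever the sequence length, each input position comes out as one vector of the same size.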

# Attention Concept

* Attention is a mechanism in which the model looks at each input word in relation to all the other words. The model decides which words deserve more focus and which deserve less.

* It works like a weighting system. Important words get a high weight and unimportant words get a low weight, which makes the actual meaning of the sentence clearer.

* Attention uses query, key, and value vectors. The query represents what the current word is looking for, the key describes what each word offers for matching, and the value carries the actual information that gets passed on.

* The final output of attention is a weighted sum of the values. This gives the model contextual understanding: which words a word connects to, what it means, and what role it plays in the sentence.

* Why attention is important in AI/ML:

    - It makes long-distance relations in a sentence easy to identify. Example: "The dog that chased the cat was hungry." Here attention can link "dog" and "hungry" directly, even though other words sit between them.

    - Attention lets the model decide which part of the input to highlight, which makes the model more flexible.

    - It is computed in parallel, so training is faster and long sequences are handled efficiently.

    - Much of today's real-world translation, summarization, QA, and chatbot technology relies on the attention concept.

* Why the Transformer beats the RNN:

    - Transformers process the whole sequence at once, while an RNN runs step by step. This makes Transformers faster to train and allows parallel training on GPUs.

    - An RNN tends to lose information over long sentences. Thanks to attention, a Transformer captures long-range relations much more reliably.

    - Vanishing and exploding gradients are common in RNNs because gradients must flow back through every time step. In a Transformer, attention links any two positions directly, which greatly reduces this problem.

    - Multi-head attention captures diverse relations at the same time, whereas an RNN has to squeeze everything into a single hidden state. This is why a Transformer tends to give a deeper, richer representation.
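
The step-by-step versus all-at-once contrast can be sketched as below. This is only an illustration with untrained modules and assumed sizes: an `nn.RNNCell` has to be called once per time step, while `nn.MultiheadAttention` relates all positions in a single call.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch, dim = 10, 1, 64
x = torch.randn(seq_len, batch, dim)   # (sequence, batch, features)

# RNN: each step needs the hidden state from the previous step,
# so the time steps cannot be computed in parallel.
cell = nn.RNNCell(input_size=dim, hidden_size=dim)
h = torch.zeros(batch, dim)
rnn_steps = 0
for t in range(seq_len):
    h = cell(x[t], h)
    rnn_steps += 1

# Attention: one call compares every position with every other position.
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1)
out, weights = attn(x, x, x)

print(rnn_steps)      # 10 sequential steps
print(h.shape)        # torch.Size([1, 64]): one summary state
print(out.shape)      # torch.Size([10, 1, 64]): one vector per position
print(weights.shape)  # torch.Size([1, 10, 10])
```

The RNN ends with a single hidden state that has to summarize the whole sequence, while attention keeps one context-aware vector per position.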

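In [None]:
import torch
import torch.nn as nn

# Single-head self-attention over 64-dimensional token vectors.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=1)

# (sequence length, batch size, feature size)
x = torch.randn(10, 1, 64)

# Self-attention: the same tensor is used as query, key, and value.
out, weights = attn(x, x, x)

# weights has shape (batch size, query position, key position).
print("Weights Shape : ", weights.shape)
print("weights : ", weights)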

Weights Shape :  torch.Size([1, 10, 10])
weights :  tensor([[[0.0626, 0.3061, 0.0382, 0.0558, 0.0549, 0.1004, 0.0743, 0.0686,
          0.1158, 0.1233],
         [0.0660, 0.1007, 0.1701, 0.0915, 0.0942, 0.1478, 0.1223, 0.0703,
          0.0479, 0.0892],
         [0.0770, 0.0664, 0.2037, 0.1080, 0.1336, 0.1089, 0.0847, 0.1126,
          0.0501, 0.0551],
         [0.1166, 0.0980, 0.1191, 0.1305, 0.0652, 0.0625, 0.0830, 0.1158,
          0.1242, 0.0852],
         [0.1067, 0.1805, 0.1572, 0.0457, 0.0714, 0.0833, 0.1001, 0.0772,
          0.0652, 0.1126],
         [0.0454, 0.1487, 0.1179, 0.0732, 0.0689, 0.1364, 0.1197, 0.1037,
          0.0947, 0.0913],
         [0.1401, 0.0575, 0.0734, 0.1147, 0.0814, 0.0431, 0.2510, 0.0864,
          0.0951, 0.0573],
         [0.1763, 0.1109, 0.1233, 0.0747, 0.1359, 0.0957, 0.0468, 0.0745,
          0.0544, 0.1075],
         [0.1119, 0.0924, 0.0357, 0.1069, 0.0967, 0.1099, 0.0321, 0.1522,
          0.1546, 0.1075],
         [0.0857, 0.0689, 0.0783, 0.116

* `weights` holds the attention weights, which give a direct view of where the model is focusing.

* It shows how strongly each token in the sequence is connected to every other token.

* This is the mechanism that lets Transformers get past the limitations of RNNs and produce richer context.
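
One way to read such a weight matrix is sketched below. It assumes the same untrained single-head setup as the cell above, so the actual numbers are random; the point is only how to interpret the rows.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=1)
x = torch.randn(10, 1, 64)   # (sequence, batch, features)

with torch.no_grad():
    _, weights = attn(x, x, x)   # (batch, query position, key position)

# Each row is a softmax output, so it sums to 1.
row_sums = weights.sum(dim=-1)

# For each query token, the key position it attends to most.
top_key = weights.argmax(dim=-1)

print(row_sums)
print(top_key)
```

Row `i` of `weights[0]` is the focus distribution of token `i`, and `top_key[0, i]` is the token it is most connected to.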