<h1 style="text-align: center; text-decoration: underline ;">
    Self-Attention from Scratch
</h1>


##  Imports and data initialization

In [2]:
import torch
import torch.nn.functional as F
from torch import nn
torch.set_printoptions(sci_mode=False)

In [3]:
torch.manual_seed(74)
B,T,C = 4,8,2 
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

## V1 : naive implementation 

> ### <u> Intuition </u>
> - We are simply **averaging all the tokens seen so far (positions `0..t`) within each separate batch example**  
> and using that average to **predict the next one**.  
>
> - While this approach is quite **lossy** (the average discards token order), it is still a **good starting point**,  
> because the information that seems lost can be brought back later, once the uniform weights are replaced with data-dependent ones in V4.
> 
> <br>

In [4]:
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1]
        xbow[b,t] = torch.mean(xprev,0)


## V2 : Efficient Averaging using tril and matmul 

> ### <u>Intuition</u>
> - `tril` gives us the **lower triangular** part of a matrix.  
> - When we row-normalize a `tril` matrix full of ones (divide each row by its sum), we get:
> 
> <div style="text-align: center;">
> <pre>
>  [1,    0,    0   ]
>  [0.5,  0.5,  0   ]
>  [0.33, 0.33, 0.33]
> </pre>
> </div>
> 
> - The **bottom-left triangle** now contains **weights** that sum to 1 in each row,  
> while the top-right triangle is zeros.  
> 
> - If we now multiply this matrix by `x` using `@` <u>(the matrix multiplication operator)</u>, each output row is the average of the current token and all previous tokens, which is the same result as the V1 loop.
> 
> <br>


In [5]:
wei = torch.tril(torch.ones(T,T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x  # (T,T) @ (B,T,C) -> (B,T,C): wei is broadcast across the batch (B = 4),
xbow2.shape      # so every example in the batch is multiplied by the same normalized matrix

torch.Size([4, 8, 2])
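
A quick sanity check that the loop version and the matmul version agree. This is a minimal, self-contained sketch: it rebuilds both results with its own seed (an arbitrary choice, not the notebook's) and compares them with a small absolute tolerance, since the two paths round floating-point values slightly differently.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for this sketch
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# V1: explicit loops over batch and time
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(0)

# V2: one batched matmul with a row-normalized lower-triangular matrix
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x

# Compare with a tolerance rather than exact equality
same = torch.allclose(xbow, xbow2, atol=1e-6)
print(same)
```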

## V3 : Adding Softmax to the implementation 

In [6]:
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0 , float('-inf'))
wei = F.softmax(wei,dim=-1)
xbow3 = wei @ x 


In [7]:
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
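
Why the softmax version reproduces the averaging weights: every allowed position has a score of 0 and every masked position has a score of `-inf`. Since `exp(0) = 1` and `exp(-inf) = 0`, row `t` normalizes to `1/(t+1)` over the allowed positions. The sketch below checks this on a small `T` (chosen only for readability) and also shows what the softmax form buys us: once the scores are no longer all zero, the weights stop being uniform but stay causal.

```python
import torch
import torch.nn.functional as F

T = 4  # small size, chosen only to keep the example readable
tril = torch.tril(torch.ones(T, T))

# V2-style weights: divide each row of the mask by its sum
wei_norm = tril / tril.sum(1, keepdim=True)

# V3-style weights: zero scores, future positions masked to -inf, then softmax
scores = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
wei_soft = F.softmax(scores, dim=-1)
same_weights = torch.allclose(wei_norm, wei_soft)

# With non-zero scores the weights are no longer uniform, but the mask still holds
scores2 = torch.zeros(T, T)
scores2[3, 0] = 2.0  # illustrative: token 3 finds token 0 more interesting
wei2 = F.softmax(scores2.masked_fill(tril == 0, float("-inf")), dim=-1)

print(same_weights)
print(wei2[3])
```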

## V4 : Complete self-attention

> As we can see above, `wei` gives equal weight to all previous tokens. In practice this is not what we want, because a given token might find some tokens more interesting than others. Therefore each token emits a key vector and a query vector: the query describes what the token is looking for, and the key describes what the token contains, so that it can answer the queries coming from the other tokens. The dot product between a query and a key gives the affinity between two tokens, and these affinities replace the zeros in `wei`. The reason this is called self-attention is that the keys, queries, and values are all computed from the same input `x`.
> 
><br>

In [8]:
head_size = 16 
key  = nn.Linear(C,head_size, bias = False)
query  = nn.Linear(C,head_size, bias = False)
value  = nn.Linear(C,head_size, bias = False)

k = key(x)    # (B,T,head_size): what each token contains
q = query(x)  # (B,T,head_size): what each token is looking for
v = value(x)  # (B,T,head_size): instead of aggregating the raw tokens x directly, we aggregate
              # the value each token holds, which also gives the output in head_size dimensions

wei = q @ k.transpose(-2,-1)  # (B,T,T): dot product of every query with every key

tril = torch.tril(torch.ones(T,T))
wei = wei.masked_fill(tril == 0 , float('-inf'))  # hide future tokens
wei = F.softmax(wei,dim=-1)  # normalize each row so the weights sum to 1

xbow4 = wei @ v  # (B,T,head_size)


In [None]:
xbow4
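
One way to convince yourself that the mask really makes the head causal is to change only the last token and confirm that the outputs at earlier positions do not move. The sketch below is self-contained and mirrors the V4 cell (no scaling). The sizes, the seed, and the helper name `causal_head` are assumptions made for this illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)  # arbitrary seed for this sketch
B, T, C, head_size = 2, 6, 4, 8  # illustrative sizes, not the notebook's

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

def causal_head(x):
    """Single masked self-attention head (hypothetical helper mirroring V4)."""
    T = x.shape[1]
    q, k, v = query(x), key(x), value(x)
    wei = q @ k.transpose(-2, -1)                  # (B, T, T) affinities
    mask = torch.tril(torch.ones(T, T)) == 0       # True above the diagonal
    wei = wei.masked_fill(mask, float("-inf"))     # hide future tokens
    wei = F.softmax(wei, dim=-1)                   # rows sum to 1
    return wei @ v, wei

x = torch.randn(B, T, C)
out, wei = causal_head(x)

# Perturb only the last token; earlier positions should be unaffected
x_perturbed = x.clone()
x_perturbed[:, -1] += 10.0
out_perturbed, _ = causal_head(x_perturbed)

unchanged = torch.allclose(out[:, :-1], out_perturbed[:, :-1], atol=1e-6)
rows_sum_to_one = torch.allclose(wei.sum(-1), torch.ones(B, T))
print(unchanged, rows_sum_to_one)
```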

> ## NOTES
> - Attention is just a communication mechanism; it can be applied to any arbitrary directed graph.
> - Attention has no notion of space. It operates over sets of vectors, which is why the tokens need positional encodings.
> - There is no communication across the examples in a batch; each example has its own independent attention.
> - Encoders, unlike decoders, are not causal: there is no `tril` mask, so the future can talk to the past. GPT is a decoder-only architecture. Attention itself works the same way in both encoders and decoders.
> - Attention comes in several types. In self-attention the keys, queries, and values all come from the same source `x`. In cross-attention the queries come from `x`, while the keys and values come from a separate source (for example, an encoder's output).
> - The attention scores need to be scaled by `head_size**-0.5` before the softmax. Without scaling, the variance of the scores grows with `head_size`, and high variance before the softmax pushes its output toward a one-hot vector, so almost all attention is paid to a single token instead of being spread across tokens.
>
><br>
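
To make the cross-attention note concrete, here is a minimal sketch. The only change from self-attention is where the keys and values come from. The tensor `context`, its length `S`, and the sizes are hypothetical stand-ins for something like an encoder's output; there is no causal mask here, which is a common choice for cross-attention but an assumption of this example.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)  # arbitrary seed for this sketch
B, T, S, C, head_size = 2, 5, 7, 4, 8  # S: length of the hypothetical external sequence

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

x = torch.randn(B, T, C)        # tokens that produce the queries
context = torch.randn(B, S, C)  # hypothetical external source (e.g. encoder output)

q = query(x)        # (B, T, head_size)
k = key(context)    # (B, S, head_size)
v = value(context)  # (B, S, head_size)

wei = q @ k.transpose(-2, -1) * head_size**-0.5  # (B, T, S) scaled affinities
wei = F.softmax(wei, dim=-1)                     # each query attends over all S positions
out = wei @ v                                    # (B, T, head_size)
print(wei.shape, out.shape)
```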

In [9]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2,-1)  # unscaled scores
k2 = torch.randn(B,T,head_size)
q2 = torch.randn(B,T,head_size)
wei2 = q2 @ k2.transpose(-2,-1) * head_size**-0.5  # scores scaled by 1/sqrt(head_size)

In [10]:
k.var() , k2.var()

(tensor(1.0069), tensor(0.9967))

In [11]:
q.var() ,  q2.var()

(tensor(0.9302), tensor(0.8952))

In [12]:
wei.var() , wei2.var()

(tensor(15.6977), tensor(1.0194))
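
These two numbers match a short variance calculation. Assume the entries of `q` and `k` are independent with zero mean and unit variance; this holds for `torch.randn` samples but only approximately for learned projections. Writing $d$ for `head_size`:

$$
\operatorname{Var}\!\left(\sum_{i=1}^{d} q_i k_i\right)
= \sum_{i=1}^{d} \operatorname{Var}(q_i)\,\operatorname{Var}(k_i)
= d,
\qquad
\operatorname{Var}\!\left(\frac{1}{\sqrt{d}} \sum_{i=1}^{d} q_i k_i\right)
= \frac{d}{d} = 1 .
$$

With $d = 16$, the unscaled variance should be close to 16 and the scaled one close to 1, in line with the measured 15.70 and 1.02.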

In [13]:
print("Softmax(wei)  [0][0]:")
print(torch.softmax(wei, dim=-1)[0][0])   # nearly one-hot

print("\nSoftmax(wei2) [0][0]:")
print(torch.softmax(wei2, dim=-1)[0][0])  # better distribution

Softmax(wei)  [0][0]:
tensor([    0.0000,     0.8555,     0.1444,     0.0000,     0.0000,     0.0000,
            0.0000,     0.0001])

Softmax(wei2) [0][0]:
tensor([0.2348, 0.0406, 0.1533, 0.0552, 0.0287, 0.3193, 0.0635, 0.1047])
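
The same sharpening effect can be reproduced on a tiny hand-written vector. In this sketch the logits and the factor of 8 are arbitrary illustrative values: multiplying the scores by a large constant, which is what a large unscaled variance amounts to, concentrates the softmax mass on the largest entry.

```python
import torch

# Arbitrary small logits, chosen for illustration
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])

soft = torch.softmax(logits, dim=-1)       # fairly diffuse distribution
sharp = torch.softmax(logits * 8, dim=-1)  # same logits, larger spread

print(soft)
print(sharp)
print(soft.max().item(), sharp.max().item())
```

Both outputs sum to 1, but the second puts most of its mass on the position of the largest logit.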
