## Coding the attention mechanism
* Here we will code the attention mechanism from scratch.
* We will implement two variants:
    1. Self Attention
    2. Causal Attention
* Let's start with self attention.


## 1. Self Attention
* We will use the sentence "The dog chased the rabbit" as our running example.

In [4]:
import torch

input = torch.tensor(
    [
        [0.86,0.97,0.89],#The
        [0.56,0.40,0.67],#dog
        [0.33,0.82,0.55],#chased
        [0.19,0.38,0.85],#the
        [0.25,0.36,0.88]#rabbit
    ]
)
print(input)

tensor([[0.8600, 0.9700, 0.8900],
        [0.5600, 0.4000, 0.6700],
        [0.3300, 0.8200, 0.5500],
        [0.1900, 0.3800, 0.8500],
        [0.2500, 0.3600, 0.8800]])


* Let's assume the above tensor is the embedding of our input sentence.
* In the attention mechanism, each token embedding in the sequence is projected into its corresponding query, key, and value vectors by multiplying the embedding with the respective query, key, and value matrices.
* The query, key, and value projection matrices are parameters of the model that are learned during training.
* Here we will not train anything; instead we generate our own query, key, and value matrices with random values.
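
To illustrate the point about learned parameters, here is a minimal sketch of how the three projections could be declared as trainable weights. It assumes `torch.nn.Linear` without a bias as the projection layer; the names `query_proj`, `key_proj`, `value_proj` and the random stand-in input `x` are hypothetical and not part of this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make this sketch reproducible

d_in, d_out = 3, 3  # embedding size and projection size used in this example

# Hypothetical trainable projections. nn.Linear stores a weight of shape
# (d_out, d_in) and computes x @ weight.T, so each layer acts as one learned matrix.
query_proj = nn.Linear(d_in, d_out, bias=False)
key_proj = nn.Linear(d_in, d_out, bias=False)
value_proj = nn.Linear(d_in, d_out, bias=False)

x = torch.rand(5, d_in)  # stand-in for 5 token embeddings
queries = query_proj(x)

print(queries.shape)                    # torch.Size([5, 3])
print(query_proj.weight.requires_grad)  # True: an optimizer would update this weight
```

In the rest of this section we keep plain random tensors, since no training takes place.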


In [2]:
w_query = torch.randn(3,3)
w_key = torch.randn(3,3)
w_value = torch.randn(3,3)
print(w_query)
print(w_key)
print(w_value)

tensor([[ 0.3472,  1.5337, -1.2087],
        [ 1.0625,  0.9326, -0.3823],
        [ 0.2039, -0.9414,  0.9970]])
tensor([[ 1.6860, -0.9125,  1.8669],
        [ 0.7066, -0.7550,  0.6424],
        [-0.3040, -0.0108,  0.8090]])
tensor([[-0.4256, -1.4681,  0.3927],
        [-0.5141,  1.2863, -0.1176],
        [-0.7740, -0.3166, -0.5723]])


* Now that we have the weight matrices, let's compute the query, key, and value vectors by multiplying the input embeddings with each of these matrices.
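
In matrix form, this step can be written as below. The symbols are chosen here for illustration: $X$ stands for the 5×3 input tensor and $W_Q$, $W_K$, $W_V$ for the three 3×3 weight matrices.

```latex
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V
```

Row $i$ of $Q$, $K$, and $V$ is the query, key, and value vector of token $i$, so each result again has shape 5×3.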

In [13]:
query_vectors = torch.matmul(input, w_query)
key_vectors = torch.matmul(input, w_key)
value_vectors = torch.matmul(input, w_value)

print("Query vectors:")
print(query_vectors)
print("\nKey vectors:")
print(key_vectors)
print("\nValue vectors:")
print(value_vectors)

Query vectors:
tensor([[ 1.5107,  1.3857, -0.5229],
        [ 0.7560,  0.6011, -0.1617],
        [ 1.0980,  0.7531, -0.1639],
        [ 0.6430, -0.1544,  0.4726],
        [ 0.6487, -0.1093,  0.4376]])

Key vectors:
tensor([[ 1.8649, -1.5266,  2.9487],
        [ 1.0232, -0.8202,  1.8445],
        [ 0.9686, -0.9261,  1.5878],
        [ 0.3305, -0.4694,  1.2865],
        [ 0.4084, -0.5094,  1.4099]])

Value vectors:
tensor([[-1.5536, -0.2967, -0.2857],
        [-0.9626, -0.5198, -0.2106],
        [-0.9877,  0.3961, -0.2816],
        [-0.9342, -0.0593, -0.4565],
        [-0.9726, -0.1826, -0.4478]])


* The next step is to calculate the attention scores. This is done by taking the dot product of each token's query vector with the key vectors of all tokens in the sequence, including its own.
* Row i of the resulting 5×5 matrix holds the raw (unnormalized) scores that indicate how strongly token i should attend to each token in the sequence.

In [24]:
attention_scores = torch.matmul(query_vectors, key_vectors.transpose(-2,-1))

print("Attention Scores:")
print(attention_scores)

Attention Scores:
tensor([[-0.8400, -0.5553, -0.6502, -0.8240, -0.8262],
        [ 0.0153, -0.0178, -0.0812, -0.2404, -0.2255],
        [ 0.4146,  0.2034,  0.1058, -0.2016, -0.1664],
        [ 2.8285,  1.6563,  1.5163,  0.8930,  1.0076],
        [ 2.6671,  1.5606,  1.4245,  0.8287,  0.9376]])
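
To see what the matrix multiplication does for a single entry, here is a small dependency-free sketch. It recomputes two scores from the first row by hand, using the rounded query and key vectors printed earlier; the helper `dot` is introduced only for this illustration.

```python
# Query vector of the first token ("The") and key vectors of "The" and "dog",
# copied from the printed output (rounded to 4 decimals).
q_the = [1.5107, 1.3857, -0.5229]
k_the = [1.8649, -1.5266, 2.9487]
k_dog = [1.0232, -0.8202, 1.8445]

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

score_the_the = dot(q_the, k_the)  # entry [0, 0] of the score matrix
score_the_dog = dot(q_the, k_dog)  # entry [0, 1] of the score matrix

print(round(score_the_the, 2), round(score_the_dog, 2))  # -0.84 -0.56
```

Up to rounding, these agree with the first two entries of the first row of the attention score matrix.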


* To prevent large dot products from destabilizing the gradients, we scale the attention scores by the square root of the key dimension (3 in our example):

In [25]:
scaled_scores = attention_scores / torch.sqrt(torch.tensor(3.0))
scaled_scores

tensor([[-0.4850, -0.3206, -0.3754, -0.4757, -0.4770],
        [ 0.0088, -0.0103, -0.0469, -0.1388, -0.1302],
        [ 0.2393,  0.1174,  0.0611, -0.1164, -0.0961],
        [ 1.6330,  0.9562,  0.8754,  0.5156,  0.5817],
        [ 1.5399,  0.9010,  0.8224,  0.4784,  0.5413]])
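
The reason for dividing by the square root of the dimension can be sketched with a small simulation. It assumes that query and key components are independent with zero mean and unit variance, which is an idealization and not a property of the example tensors above. Under that assumption the standard deviation of a dot product grows roughly like the square root of the dimension, and the division brings it back to about 1. The function name `dot_product_std` is hypothetical.

```python
import random
import statistics

random.seed(0)  # only to make this sketch reproducible

def dot_product_std(dim, samples=2000):
    """Empirical standard deviation of q . k for random unit-variance vectors."""
    dots = []
    for _ in range(samples):
        q = [random.gauss(0.0, 1.0) for _ in range(dim)]
        k = [random.gauss(0.0, 1.0) for _ in range(dim)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

for dim in (4, 64):
    raw = dot_product_std(dim)
    scaled = raw / dim ** 0.5
    print(f"dim={dim:3d}  std of q.k = {raw:6.2f}  after scaling = {scaled:5.2f}")
```

With only 3 dimensions, as in our example, the effect is small, but with larger dimensions unscaled scores would push softmax towards very peaked outputs with tiny gradients.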

* Next we convert these scores into probabilities by applying softmax row-wise, so that the weights in each row sum to 1.

In [26]:
attention_weights = torch.softmax(scaled_scores,dim=-1)
attention_weights

tensor([[0.1883, 0.2219, 0.2101, 0.1900, 0.1898],
        [0.2146, 0.2105, 0.2030, 0.1851, 0.1867],
        [0.2417, 0.2139, 0.2022, 0.1693, 0.1728],
        [0.3768, 0.1915, 0.1767, 0.1233, 0.1317],
        [0.3663, 0.1933, 0.1787, 0.1267, 0.1349]])
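
As a by-hand check of what `torch.softmax` does for one row, here is a dependency-free sketch applied to the first row of the scaled scores (rounded values copied from the output above). The helper `softmax` is written only for this illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    largest = max(row)
    exps = [math.exp(v - largest) for v in row]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

# First row of the scaled scores (the token "The").
first_row = [-0.4850, -0.3206, -0.3754, -0.4757, -0.4770]
weights = softmax(first_row)

print([round(w, 4) for w in weights])
print(sum(weights))  # the weights of one row add up to 1 (up to float error)
```

Up to rounding, the result agrees with the first row of `attention_weights`.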

* Now let's compute the final output embeddings. Each output row is the sum of all value vectors, weighted by that token's attention weights.
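
For a single token, this weighted sum can be sketched by hand. The snippet below uses the rounded attention weights of the first token and the rounded value vectors printed earlier, so its result is only accurate to about three decimals.

```python
# Attention weights of the first token ("The") over all five tokens.
weights_the = [0.1883, 0.2219, 0.2101, 0.1900, 0.1898]

# Value vectors of the five tokens, copied from the printed output.
value_vectors = [
    [-1.5536, -0.2967, -0.2857],  # The
    [-0.9626, -0.5198, -0.2106],  # dog
    [-0.9877,  0.3961, -0.2816],  # chased
    [-0.9342, -0.0593, -0.4565],  # the
    [-0.9726, -0.1826, -0.4478],  # rabbit
]

# Weighted sum of the value vectors, one output component at a time.
output_the = [
    sum(w * v[col] for w, v in zip(weights_the, value_vectors))
    for col in range(3)
]
print([round(x, 3) for x in output_the])
```

The matrix product of the attention weights with the value vectors performs this same computation for all five tokens at once.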

In [27]:
output = torch.matmul(attention_weights,value_vectors)
output

tensor([[-1.0756, -0.1339, -0.3314],
        [-1.0911, -0.1378, -0.3310],
        [-1.1074, -0.1444, -0.3257],
        [-1.1876, -0.1728, -0.3130],
        [-1.1813, -0.1705, -0.3140]])
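
The steps of this section can be collected into one function. This is a minimal sketch for a single sequence without a batch dimension; the function name `self_attention` is introduced here, and the tensors are the rounded values printed above, so the result matches the notebook output only up to rounding.

```python
import torch

def self_attention(x, w_query, w_key, w_value):
    """Scaled dot-product self attention for one sequence of shape (tokens, dim)."""
    queries = x @ w_query
    keys = x @ w_key
    values = x @ w_value
    d_k = keys.shape[-1]
    # Pairwise query-key scores, scaled by sqrt(d_k).
    scores = queries @ keys.transpose(-2, -1) / d_k ** 0.5
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ values, weights

x = torch.tensor([
    [0.86, 0.97, 0.89],  # The
    [0.56, 0.40, 0.67],  # dog
    [0.33, 0.82, 0.55],  # chased
    [0.19, 0.38, 0.85],  # the
    [0.25, 0.36, 0.88],  # rabbit
])
w_query = torch.tensor([[ 0.3472,  1.5337, -1.2087],
                        [ 1.0625,  0.9326, -0.3823],
                        [ 0.2039, -0.9414,  0.9970]])
w_key = torch.tensor([[ 1.6860, -0.9125,  1.8669],
                      [ 0.7066, -0.7550,  0.6424],
                      [-0.3040, -0.0108,  0.8090]])
w_value = torch.tensor([[-0.4256, -1.4681,  0.3927],
                        [-0.5141,  1.2863, -0.1176],
                        [-0.7740, -0.3166, -0.5723]])

out, weights = self_attention(x, w_query, w_key, w_value)
print(out)
```

Returning the weights alongside the output is a design choice made for this sketch, since it makes the attention pattern easy to inspect.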

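## 2. Causal Attention

* In causal attention, each token may attend only to itself and to the tokens before it, never to tokens that come later in the sequence.
* We start from the same scaled attention scores as in self attention and mask the positions above the diagonal before applying softmax.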

In [20]:
scaled_scores = attention_scores / torch.sqrt(torch.tensor(3.0))
scaled_scores

tensor([[-0.4850, -0.3206, -0.3754, -0.4757, -0.4770],
        [ 0.0088, -0.0103, -0.0469, -0.1388, -0.1302],
        [ 0.2393,  0.1174,  0.0611, -0.1164, -0.0961],
        [ 1.6330,  0.9562,  0.8754,  0.5156,  0.5817],
        [ 1.5399,  0.9010,  0.8224,  0.4784,  0.5413]])

In [21]:
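# Keep the lower triangle; future positions become 0, which softmax still maps to exp(0) = 1,
# so they keep a non-zero weight below. A strict causal mask would use -inf there instead.
masked_scores = torch.tril(scaled_scores)
masked_scores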

tensor([[-0.4850,  0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0088, -0.0103,  0.0000,  0.0000,  0.0000],
        [ 0.2393,  0.1174,  0.0611,  0.0000,  0.0000],
        [ 1.6330,  0.9562,  0.8754,  0.5156,  0.0000],
        [ 1.5399,  0.9010,  0.8224,  0.4784,  0.5413]])

In [22]:
attention_weights = torch.softmax(masked_scores, dim=-1)
attention_weights

tensor([[0.1334, 0.2167, 0.2167, 0.2167, 0.2167],
        [0.2018, 0.1980, 0.2001, 0.2001, 0.2001],
        [0.2328, 0.2060, 0.1948, 0.1832, 0.1832],
        [0.4001, 0.2033, 0.1876, 0.1309, 0.0782],
        [0.3663, 0.1933, 0.1787, 0.1267, 0.1349]])

In [23]:
output = torch.matmul(attention_weights,value_vectors)
output

tensor([[-1.0429, -0.1188, -0.3407],
        [-1.0832, -0.1320, -0.3366],
        [-1.1017, -0.1433, -0.3304],
        [-1.2008, -0.1722, -0.3047],
        [-1.1813, -0.1705, -0.3140]])
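
Note that `torch.tril` fills the future positions with 0, and softmax turns a score of 0 into a positive weight: in the weights above, the first token still assigns 0.2167 to each later token. A common way to exclude future tokens completely, sketched here as an alternative, is to fill those positions with `-inf` before softmax, since `exp(-inf)` is 0. The names `future_mask` and `causal_weights` are introduced for this sketch, and the scores are the rounded values printed above.

```python
import torch

# Scaled scores copied from the cell output above (rounded to 4 decimals).
scaled_scores = torch.tensor([
    [-0.4850, -0.3206, -0.3754, -0.4757, -0.4770],
    [ 0.0088, -0.0103, -0.0469, -0.1388, -0.1302],
    [ 0.2393,  0.1174,  0.0611, -0.1164, -0.0961],
    [ 1.6330,  0.9562,  0.8754,  0.5156,  0.5817],
    [ 1.5399,  0.9010,  0.8224,  0.4784,  0.5413],
])

seq_len = scaled_scores.shape[0]

# True strictly above the diagonal, i.e. at positions that lie in the future.
future_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Replace future scores with -inf so that softmax assigns them exactly zero weight.
masked_scores = scaled_scores.masked_fill(future_mask, float("-inf"))
causal_weights = torch.softmax(masked_scores, dim=-1)

print(causal_weights)
```

With this mask the first token attends only to itself (its row becomes `[1, 0, 0, 0, 0]`), every row still sums to 1, and the last row is unchanged because the last token has no future positions. The causal output then follows as before from `torch.matmul(causal_weights, value_vectors)`.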