# Implementing Attention Mechanisms

We will cover the following:
* Why we need attention mechanisms in neural networks
* Implementing a basic self-attention framework, progressing to an enhanced self-attention mechanism
* A causal attention module that allows LLMs to generate one token at a time
* Masking randomly selected attention weights with dropout to reduce overfitting
* Stacking multiple causal attention modules into a multi-head attention module

### Types of Attention Mechanisms
* Simplified self-attention
* Self-attention
* Causal attention
* Multi-head attention

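## 3.2 Capturing Data Dependencies with Attention Mechanisms
Initially, the Bahdanau attention mechanism was used with RNNs to give the decoder selective access to all input tokens (the encoder's hidden states) at each decoding step, rather than only the final hidden state.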

**Self-attention** - This mechanism allows each position in the input sequence to consider the relevance of (or attend
to) all other positions in the same sequence when computing the representation of the sequence.

## 3.3 Attending to Various Parts of the Input
The "self" in self-attention refers to the mechanism's ability to compute attention weights by relating different positions within the same input sequence.
The goal of self-attention is to compute a **context vector** for each input element that combines information
from all other inputs. \
Take note of the following terms:
* input elements x - the embedding vectors of the tokens in the input sequence
* attention scores - a vector describing how strongly one input element (the query) attends to every element in the input sequence
* attention weights - the normalized attention scores
* context vector - a vector derived by summing all input vectors, each multiplied by its attention weight
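
The terms above can be written compactly. The notation below is an assumption introduced for illustration (the notes only name the input elements x): $x^{(i)}$ is the i-th input vector, $\omega_{ij}$ is the attention score of query i against input j, $\alpha_{ij}$ is the corresponding attention weight, $z^{(i)}$ is the context vector, and $T$ is the number of tokens. Softmax is assumed as the normalization step.

$$
\omega_{ij} = x^{(i)} \cdot x^{(j)}, \qquad
\alpha_{ij} = \frac{\exp(\omega_{ij})}{\sum_{k=1}^{T} \exp(\omega_{ik})}, \qquad
z^{(i)} = \sum_{j=1}^{T} \alpha_{ij}\, x^{(j)}
$$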

In [1]:
# Context vector calculation for the input "Your journey starts with one step"
import torch
inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89], # Your    (x^1)
        [0.55, 0.87, 0.66], # journey (x^2)
        [0.57, 0.85, 0.64], # starts  (x^3)
        [0.22, 0.58, 0.33], # with    (x^4)
        [0.77, 0.25, 0.01], # one     (x^5)
        [0.05, 0.80, 0.55]  # step    (x^6)
    ]
)

### How to calculate attention scores
We will calculate the attention scores of the second element with respect to every element in the sequence. Each score is
the dot product of the current element (the query token) and one of the input elements, including the query itself.
This gives a set of scores which, once normalized into attention weights, are used to compute the context vector.


In [23]:
# Calculating the attention scores for the second input (x^2) as the query
query = inputs[1]
attention_scores_2 = torch.empty(inputs.shape[0]) # Empty tensor with one slot per input token
for index, x_i in enumerate(inputs): # Iterate through each embedding vector in the input
    attention_scores_2[index] = torch.dot(x_i, query) # Dot product of the current input and the query
print(attention_scores_2) # Print the attention scores

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.6476, 1.0865])


**Dot Product** \
A dot product multiplies two vectors element-wise and then sums the results.


In [15]:
# Dot product computed by hand: multiply element-wise, then sum
query = inputs[0]
result = 0.
for index, element in enumerate(inputs[0]):
    result += element * query[index]
print(result)
print(torch.dot(inputs[0], query)) # Same value from the built-in torch.dot

tensor(0.9995)
tensor(0.9995)


In [35]:
# Calculating the attention matrix
# This is a matrix of the attention scores of all the inputs
attention_scores_matrix = torch.empty(inputs.shape[0], inputs.shape[0])
for index, query in enumerate(inputs):
    attention_scores_matrix[index] = torch.matmul(inputs, query)
print(attention_scores_matrix)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.3775, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.6476, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.6578, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3177, 0.6565],
        [0.3775, 0.6476, 0.6578, 0.3177, 0.6555, 0.2440],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2440, 0.9450]])


In [19]:
# We can also obtain the attention_scores_matrix by multiplying the inputs by their transpose
print(torch.matmul(inputs, inputs.T))

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.3775, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.6476, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.6578, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3177, 0.6565],
        [0.3775, 0.6476, 0.6578, 0.3177, 0.6555, 0.2440],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2440, 0.9450]])


In [36]:
print(inputs @ inputs.T)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.3775, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.6476, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.6578, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3177, 0.6565],
        [0.3775, 0.6476, 0.6578, 0.3177, 0.6555, 0.2440],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2440, 0.9450]])
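
As a quick sanity check, the row-by-row loop and the single matrix product should produce the same score matrix, and that matrix should be symmetric because the dot product is commutative. The following is a minimal sketch; the names `loop_scores` and `matmul_scores` are illustrative and not part of the notebook.

```python
import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.01],  # one
    [0.05, 0.80, 0.55],  # step
])

# Row-by-row computation, one query at a time
loop_scores = torch.empty(inputs.shape[0], inputs.shape[0])
for index, query in enumerate(inputs):
    loop_scores[index] = torch.matmul(inputs, query)

# Single matrix product with the transpose
matmul_scores = inputs @ inputs.T

print(torch.allclose(loop_scores, matmul_scores))      # both methods agree
print(torch.allclose(matmul_scores, matmul_scores.T))  # the matrix is symmetric
```

The matrix product is generally preferable because it avoids the Python loop.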


**Dot Product and Similarity** \
The dot product of two vectors is a measure of similarity because it quantifies how closely the two vectors are aligned.
A higher dot product indicates a greater degree of similarity between the vectors.
Therefore, in the context of self-attention, it determines the extent to which each element attends to every other element
in the sequence.
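
To make the similarity interpretation concrete, the sketch below ranks the other tokens by their dot product with the query "journey". It reuses the example embeddings from this notebook; the `tokens` list and the `ranked` variable are illustrative additions.

```python
import torch

tokens = ["Your", "journey", "starts", "with", "one", "step"]
inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.01],  # one
    [0.05, 0.80, 0.55],  # step
])

query_index = 1  # "journey"
scores = inputs @ inputs[query_index]  # dot product of the query with every input

# Rank the other tokens by their dot product with the query, highest first
ranked = sorted(
    ((tokens[i], scores[i].item()) for i in range(len(tokens)) if i != query_index),
    key=lambda pair: pair[1],
    reverse=True,
)
for token, score in ranked:
    print(f"{token:>8}: {score:.4f}")
```

With these embeddings, "starts" ranks highest and "one" ranks lowest for the query "journey". Note that the dot product also grows with vector length, so it is an unnormalized measure of similarity.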

**Normalization** \
We have to normalize each of the attention scores we obtained. \
The goal of normalization is to obtain attention weights that sum up to 1.

**Attention Weights** \
We obtain the attention weights by normalizing the attention scores. The simplest way to do this is to divide each attention score by the
sum of all the attention scores.

**The Softmax Function** \
In practice, the softmax function is used for normalization. It handles extreme values better and
gives more favorable gradient properties during training.
It also ensures that the attention weights are always positive, making the output interpretable as probabilities or relative
importance, where higher weights indicate greater importance.
PyTorch provides its own `torch.softmax()` function, which is preferable to a naive implementation because it is numerically more stable for very large or very small inputs.

In [30]:
# Naive softmax implementation
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

# Scores for the second input (x^2) as the query: row 2 of the attention scores matrix
attention_weights = softmax_naive(attention_scores_matrix[1])
print("Attention weights: ", attention_weights)
print("Attention weights sum: ", attention_weights.sum())

Attention weights:  tensor([0.1394, 0.2394, 0.2347, 0.1248, 0.1026, 0.1591])
Attention weights sum:  tensor(1.)
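
The stability concern with a naive softmax can be seen with large scores. The sketch below compares a naive version against a variant that subtracts the maximum before exponentiating, which is a common stabilization trick. The function `softmax_stable` and the sample scores are assumptions for illustration; whether `torch.softmax` uses exactly this approach internally is not stated in these notes.

```python
import torch

def softmax_naive(x):
    # Direct translation of the formula; exp() overflows for large inputs
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Subtracting the maximum does not change the result mathematically,
    # but keeps every exponent <= 0 so exp() cannot overflow
    shifted = x - x.max()
    return torch.exp(shifted) / torch.exp(shifted).sum(dim=0)

large_scores = torch.tensor([1000.0, 1001.0, 1002.0])  # hypothetical extreme scores

naive_result = softmax_naive(large_scores)
stable_result = softmax_stable(large_scores)
builtin_result = torch.softmax(large_scores, dim=0)

print("naive:  ", naive_result)    # exp(1000) overflows to inf, so inf / inf gives nan
print("stable: ", stable_result)
print("builtin:", builtin_result)
```

For scores in the small range used in this notebook, the naive and built-in versions give the same weights; the difference only shows up at extreme values.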


In [43]:
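# Perform normalization: divide each score by the sum of all scores
attention_scores_2 = attention_scores_matrix[1] # Row 2 of the matrix: scores for the query x^2
attention_weight_2 = attention_scores_2 / attention_scores_2.sum()
print(attention_weight_2)

tensor([0.1468, 0.2299, 0.2269, 0.1297, 0.0996, 0.1671])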


In [26]:
print(attention_weight_2.sum())

tensor(1.)
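
Dividing by the sum works here because all scores in this example are positive. The sketch below uses hypothetical scores that include a negative value to show how sum normalization can produce a negative "weight", while softmax keeps every weight positive.

```python
import torch

scores = torch.tensor([1.0, -0.5, 2.0])  # hypothetical scores, one of them negative

sum_normalized = scores / scores.sum()             # can contain negative entries
softmax_normalized = torch.softmax(scores, dim=0)  # always strictly positive

print("sum-normalized:    ", sum_normalized)
print("softmax-normalized:", softmax_normalized)
```

Both results sum to 1, but only the softmax output can be read as a set of probabilities.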


## The Context Vector
The context vector is an enriched embedding vector.
It combines all input vectors, each weighted by its attention weight with respect to the query.
To obtain it, we multiply each input vector by the corresponding attention weight and sum the results.

In [34]:
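# Computing the context vector for the second input (x^2)
query = inputs[1]
# Softmax-normalized attention weights for this query
attention_weights_2 = torch.softmax(torch.matmul(inputs, query), dim=0)
context_vector_2 = torch.zeros(query.shape)
for index, x_i in enumerate(inputs):
    # Accumulate the weighted sum of all input vectors
    context_vector_2 += attention_weights_2[index] * x_i
print(context_vector_2)

tensor([0.4398, 0.6540, 0.5620])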


In [38]:
# Normalize the attention scores matrix
attention_weights_matrix = torch.softmax(attention_scores_matrix, dim=1)
print(attention_weights_matrix)

tensor([[0.2118, 0.2025, 0.2000, 0.1254, 0.1137, 0.1465],
        [0.1394, 0.2394, 0.2347, 0.1248, 0.1026, 0.1591],
        [0.1399, 0.2384, 0.2341, 0.1250, 0.1053, 0.1574],
        [0.1441, 0.2082, 0.2053, 0.1467, 0.1231, 0.1727],
        [0.1477, 0.1935, 0.1955, 0.1391, 0.1950, 0.1292],
        [0.1391, 0.2194, 0.2138, 0.1427, 0.0945, 0.1905]])


In [39]:
# check that the rows sum up to 1
for item in attention_weights_matrix:
    print(item.sum())

tensor(1.)
tensor(1.)
tensor(1.)
tensor(1.)
tensor(1.)
tensor(1.)
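
The `dim` argument decides which axis is normalized. The sketch below, using illustrative variable names, checks the row sums in one call and shows that the columns of a row-normalized matrix generally do not sum to 1.

```python
import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.01],  # one
    [0.05, 0.80, 0.55],  # step
])
attention_scores_matrix = inputs @ inputs.T

# dim=-1 normalizes across the last axis, so each row (one query) sums to 1
row_normalized = torch.softmax(attention_scores_matrix, dim=-1)

print(row_normalized.sum(dim=-1))  # row sums
print(row_normalized.sum(dim=0))   # column sums, not normalized
```

Normalizing along the last axis is what we want here, because each row holds the weights for one query token.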


In [41]:
# Compute the context vector matrix
context_vector_matrix = torch.matmul(attention_weights_matrix, inputs) # same as attention_weights_matrix @ inputs
print(context_vector_matrix)

tensor([[0.4389, 0.5964, 0.5733],
        [0.4398, 0.6540, 0.5620],
        [0.4411, 0.6521, 0.5605],
        [0.4291, 0.6312, 0.5416],
        [0.4686, 0.5895, 0.5032],
        [0.4160, 0.6522, 0.5583]])
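
The three steps above (scores, weights, context vectors) can be collected into one function. This is a minimal sketch of the simplified self-attention described in these notes, without trainable weights; the function name `simplified_self_attention` is an illustrative choice.

```python
import torch

def simplified_self_attention(inputs):
    """Return one context vector per input row, without trainable weights."""
    attention_scores = inputs @ inputs.T                          # (n, n) pairwise dot products
    attention_weights = torch.softmax(attention_scores, dim=-1)   # each row sums to 1
    return attention_weights @ inputs                             # (n, d) weighted sums of inputs

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.01],  # one
    [0.05, 0.80, 0.55],  # step
])

context_vectors = simplified_self_attention(inputs)
print(context_vectors.shape)
print(context_vectors[1])  # context vector for "journey"
```

The second row should match the second row of the context vector matrix printed above.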
