# Overview

The Attention mechanism generates scores, determining how much focus to place on each data part. These are used to create a weighted sum of the inputs, which feeds intot he next network layer. This allows the model to capture context and relationships within the data that might be missed with traditional, fixed approaches to processing sequences.

# String to Numberic arrays

We wil simulate the process of setences -> tokens -> embeddings(vectors).

In [1]:
# tokenize
sentence="The weather is beautiful today in Melbourne"
words=sentence.split()

# simulate embeddings
embed={word: [ord(char) -96 for char in word.lower()] for word in words}
print("Word embeddings:")
for word, embedding in embed.items():
    print(f"{word}:{embedding}")
    
# generate attention weights
total_characters=sum(len(word) for word in words)
attention_weights={word: len(word)/total_characters for word in words}
print("\nAttention Weights")
for word, weight in attention_weights.items():
    print(f"{word}:{weight:.3f}")
    
# compute weighted sum of embeddings
weighted_embeddings={word: [weight*val for val in embedding] 
                     for word, embedding in embed.items() 
                     for word_weight, weight in attention_weights.items() if word==word_weight}
final_vector= [0]*len(max(embed.values(), key=len))
for embedding in weighted_embeddings.values():
    final_vector=[sum(x) for x in zip(final_vector, embedding)]
    
print("\nFinal Vector Before Transformation:")
print(final_vector)

processed_output=sum(final_vector)
print(f"\nProcessed Output: {processed_output}")

Word embeddings:
The:[20, 8, 5]
weather:[23, 5, 1, 20, 8, 5, 18]
is:[9, 19]
beautiful:[2, 5, 1, 21, 20, 9, 6, 21, 12]
today:[20, 15, 4, 1, 25]
in:[9, 14]
Melbourne:[13, 5, 12, 2, 15, 21, 18, 14, 5]

Attention Weights
The:0.081
weather:0.189
is:0.054
beautiful:0.243
today:0.135
in:0.054
Melbourne:0.243

Final Vector Before Transformation:
[13.297297297297298, 7.837837837837839]

Processed Output: 21.135135135135137


# The Scaled Dot Product Attention

Attention can be done is many different ways. Dot product attention where the alignment funciton is a dot product. [Paper](https://arxiv.org/pdf/1508.04025v5.pdf). And scaled dot product attention was introduced by [Paper](https://arxiv.org/abs/1706.03762).

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/125/036/271/570/911/small/b4d359206e83008b.webp)

Here is a step-by-step approach to scaled dot product attention


## Step1: Deriving Q, K and V from Input Embeddings

### Q
Matrix contianing the query vectors. **These represent the set of items you want to draw attntion to**. In the context of processing a sentence, **a query is typically associated with the current word**. The model uses the query to seek out **relevant information acorss the sequence**.

### K
Matrix containing the key vectors. **Keys are paired with the values and are used to retrieve information**. Each key is associated with a value in a way that the model can use the similarity between a quary and a key to **determine how much attention to pay to the corresponding value**.

### V
Matrix containing the value vectors. **Values hold the actual information the model wants to retrieve**. **Once the model determines which key(and thereby values) are most relevant to a given query**, it aggregates these values, weighted by their relevance, to produce the output.


## Step2: Obtain Q, K and V

**1. Tokenizing and Embeddings**

Tokenizing the input and embedding the token. These embeddings capture the semantic meaning of the input elements.

**2. Learned Linear Transformations**

The model learns separate lienar transformations(weight matrices) to project the input embeddings into query, key and value spaces. These transformations are part of the model's parameters and are optimized during training. In short, we apply these **weight matrices** to the input embeddings to get Q, K and V.

**3. Purpose of Different Spaces**

BY projecting the input into three different spaces, the model can independently manipulate the aspects of the input that are used to calculate attention weights (via Q and K) and the aspects that are used to compute the output of the attention mechnism (via V).


## Step3: Calculate Dot Products of Q and K Transpose

We use numpy's built in function to do this.

## Step4: Get the Scaled Dot Product

We will divide dot products by the square root of the dimensions of the keys to prevent large values of dot products.

## Step5: Apply Softmax to the Scaled Dot Product

Softmax is basically a math function that converts numbers in the vectos into probabilities. Here it transforms a given array of numbers(the vector) into an array of new numbers that add to 1.

## Step6: Multiply by V

Multiply the attention weights with the value matrix V. This step aggregates the values based on the weights, **essentially selecting which values to focus on**. It's basically connecting the probabilities back tot he input matrix via V.

In [2]:
import numpy as np

def softmax(z):
    exp_scores=np.exp(z) # this is e^z where e is a math constant
    probabilities=exp_scores/np.sum(exp_scores)
    return probabilities

vector_array=[2.0,1.0, 0.1]
print(f"Test softmax: {softmax(vector_array)}")

# simulate input embeddings
X=np.random.rand(10,16) # 10 elements, each is a 16-dimensional vector

# initialize weight matrices for queries, keys, and values
W_Q=np.random.rand(16,16) # dimension chosen for example purposes
W_K=np.random.rand(16,16)
W_V=np.random.rand(16,16)

# compute queries, keys and values
Q=np.dot(X, W_Q)
K=np.dot(X, W_K)
V=np.dot(X, W_V)

# Calulate dot products of Q and K Transpose
dot_product=np.dot(Q, K.T)

# Get the Scaled Dot Product
d_k=K.shape[-1]
scaled_dot_product =dot_product/(np.sqrt(d_k))

# Apply Softmax to the Scaled Dot Product
attention_weights=softmax(scaled_dot_product)
print(f"\nSoftmax probabilities:{attention_weights}")

# Multiply by V
output=np.dot(attention_weights, V)
print(f"\nOutput:{output}")

Test softmax: [0.65900114 0.24243297 0.09856589]

Softmax probabilities:[[4.32998049e-26 3.61905363e-18 7.68690348e-18 1.93320955e-19
  1.05466928e-20 1.36151497e-20 7.26962744e-19 2.84585507e-19
  5.59467740e-17 2.95566807e-20]
 [5.88669680e-19 2.80923880e-05 1.21716717e-04 1.58252314e-07
  1.49977318e-09 1.37981948e-09 2.25480403e-06 2.80652025e-07
  3.81273701e-03 6.25136086e-09]
 [1.60347517e-18 2.53910646e-04 1.18440729e-03 1.59045369e-06
  8.06628380e-09 1.03642422e-08 1.85415675e-05 2.92487764e-06
  4.89879897e-02 4.87276624e-08]
 [6.19644087e-20 6.39497291e-07 2.10552398e-06 5.33383647e-09
  5.06620913e-11 5.98807319e-11 4.99802902e-08 1.01120592e-08
  5.05837443e-05 2.24678588e-10]
 [6.33971370e-21 9.04927416e-09 3.36648201e-08 1.12829171e-10
  1.46662877e-12 1.99890269e-12 8.25447245e-10 1.53658576e-10
  6.93568740e-07 5.24842012e-12]
 [1.87789995e-21 7.06756408e-10 2.02619597e-09 7.97124319e-12
  1.78268766e-13 1.70331937e-13 6.02065534e-11 1.47646523e-11
  3.80461087e-08 4.

# Acknowledge

* https://medium.com/the-research-nest/explained-attention-mechanism-in-ai-e9bb6f0b0b4d