# Self-Attention Mechanism Example
This example uses the sentence "The cat sat on a mat" and computes the attention output for the word "cat".

We create three vectors for each token:
- Query: the "question" a token asks of the other tokens. Queries are compared against keys to find the relevant information in the input.
- Key: the "label" attached to each token. Attention scores are calculated by comparing a query with every key.
- Value: the actual content associated with each key. The attention scores are used to form a weighted combination of the values.
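
For reference, the steps that follow implement scaled dot-product attention, which is commonly written in matrix form as shown here. In this notation $Q$, $K$, and $V$ stack the query, key, and value vectors of all tokens as rows, and $d_k$ is the dimension of the key vectors (3 in this example).

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```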

Step 1: Create embeddings. For simplicity, we define them by hand and keep them very low-dimensional (3 dimensions).

In a real model the embeddings are usually learned during training.

In [8]:
import numpy as np

embeddings = {
    "The": np.array([1, 0, 1]),
    "cat": np.array([0, 1, 0]),
    "sat": np.array([1, 1, 0]),
    "on": np.array([0, 0, 1]),
    "a": np.array([1, 0, 0]),
    "mat": np.array([0, 1, 1])
}
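
Learned embeddings are usually stored as one matrix with a row per vocabulary entry, so the dictionary above can be viewed as a lookup table. A minimal sketch of that view, assuming a hypothetical `vocab` index mapping, an `embedding_matrix` array, and an `embed` helper (none of these names appear in the original notebook):

```python
import numpy as np

# Hypothetical vocabulary: each token maps to a row index
vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "a": 4, "mat": 5}

# One row per token; in a trained model these values would be learned
embedding_matrix = np.array([
    [1, 0, 1],  # The
    [0, 1, 0],  # cat
    [1, 1, 0],  # sat
    [0, 0, 1],  # on
    [1, 0, 0],  # a
    [0, 1, 1],  # mat
])

def embed(tokens):
    """Look up the embedding row for each token (illustrative helper)."""
    indices = [vocab[token] for token in tokens]
    return embedding_matrix[indices]

sentence = ["The", "cat", "sat", "on", "a", "mat"]
X = embed(sentence)  # shape (6, 3): one row per token
print(X.shape)
```

Stacking the tokens as rows of `X` is what makes it possible to compute queries, keys, and values for the whole sentence with a single matrix product per projection.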

Step 2: Create the projection matrices $W_Q$, $W_K$, and $W_V$. In a real model these matrices are learned; here they are fixed by hand.

In [2]:
W_Q = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0]
])

W_K = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 1]
])

W_V = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
])

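Step 3: Calculate the query, key, and value vectors by multiplying each embedding by $W_Q$, $W_K$, and $W_V$. Because $W_V$ is the identity matrix here, the values equal the embeddings.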

In [3]:
Q = {word: np.dot(embedding, W_Q) for word, embedding in embeddings.items()}
K = {word: np.dot(embedding, W_K) for word, embedding in embeddings.items()}
V = {word: np.dot(embedding, W_V) for word, embedding in embeddings.items()}

print("Q:", Q)
print("K:", K)
print("V:", V)

Q: {'The': array([2, 1, 1]), 'cat': array([0, 1, 0]), 'sat': array([1, 1, 1]), 'on': array([1, 1, 0]), 'a': array([1, 0, 1]), 'mat': array([1, 2, 0])}
K: {'The': array([0, 2, 1]), 'cat': array([1, 0, 1]), 'sat': array([1, 1, 1]), 'on': array([0, 1, 1]), 'a': array([0, 1, 0]), 'mat': array([1, 1, 2])}
V: {'The': array([1, 0, 1]), 'cat': array([0, 1, 0]), 'sat': array([1, 1, 0]), 'on': array([0, 0, 1]), 'a': array([1, 0, 0]), 'mat': array([0, 1, 1])}


Step 4: Calculate the attention scores (the dot product of the query with every key) and scale them by $1/\sqrt{d_k}$

In [4]:
def attention_scores(Q_word, K):
    # Dot product of one query vector with every key vector
    scores = {word: np.dot(Q_word, K_word) for word, K_word in K.items()}
    return scores

def scale_scores(scores, d_k):
    # Divide each score by sqrt(d_k), rounded to 3 decimals for display
    return {word: round(score / np.sqrt(d_k), 3) for word, score in scores.items()}

Q_cat = Q["cat"]
scores = attention_scores(Q_cat, K)
scaled_scores = scale_scores(scores, d_k=3)

print(scaled_scores)

# For comparison, keep the unscaled scores as well (d_k=1 divides by 1)
unscaled_scores = scale_scores(scores, d_k=1)

{'The': 1.155, 'cat': 0.0, 'sat': 0.577, 'on': 0.577, 'a': 0.577, 'mat': 0.577}
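
One common motivation for dividing by $\sqrt{d_k}$ is that dot products of higher-dimensional vectors tend to have larger magnitude, which pushes softmax toward a near one-hot distribution. The sketch below checks this empirically under an assumption not made in the notebook: query and key components are drawn independently from a standard normal distribution. The names `rng` and `n_samples` and the choice `d_k = 64` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n_samples = 10_000
d_k = 64  # illustrative dimension, much larger than the toy d_k=3

# Assumption: components are independent with mean 0 and variance 1
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

raw = np.sum(q * k, axis=1)   # unscaled dot products, one per sample
scaled = raw / np.sqrt(d_k)   # scaled as in Step 4

raw_var = raw.var()
scaled_var = scaled.var()
print("variance of raw scores:   ", round(raw_var, 2))     # roughly d_k
print("variance of scaled scores:", round(scaled_var, 2))  # roughly 1
```

With only three dimensions, as in this notebook, the effect is small, which is why the scaled and unscaled weights printed later differ only moderately.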


Step 5: Apply softmax to the scores to obtain the attention weights

In [5]:
def softmax(scores):
    exp_scores = np.exp(list(scores.values()))
    sum_exp_scores = np.sum(exp_scores)
    return {word: round(exp_score / sum_exp_scores, 3) for word, exp_score in zip(scores.keys(), exp_scores)}

attention_weights = softmax(scaled_scores)
print(attention_weights)
# Softmax of the unscaled scores, for comparison
attention_weights_unscaled = softmax(unscaled_scores)

{'The': 0.281, 'cat': 0.089, 'sat': 0.158, 'on': 0.158, 'a': 0.158, 'mat': 0.158}
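
The `softmax` above exponentiates the scores directly, which is fine for the small values in this example but can overflow for large scores. A common variant subtracts the maximum score first, which leaves the result mathematically unchanged. A minimal sketch, using a hypothetical `stable_softmax` helper that works on a plain array rather than the notebook's dictionaries:

```python
import numpy as np

def stable_softmax(scores):
    """Softmax over a 1-D array, shifted by the max to avoid overflow."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)  # largest entry becomes 0
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Scaled scores for "cat" from Step 4, in sentence order
cat_scores = np.array([1.155, 0.0, 0.577, 0.577, 0.577, 0.577])
weights = stable_softmax(cat_scores)
print(np.round(weights, 3))

# Scores this large would overflow np.exp without the shift
large = stable_softmax(np.array([1000.0, 1001.0, 1002.0]))
print(np.round(large, 3))
```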


Step 6: Calculate the output as the sum of the value vectors, weighted by the attention weights

In [6]:
output_cat = sum(attention_weights[word] * V[word] for word in V)

print("Attention Weights Scaled:", attention_weights)
print("Attention Weights Unscaled:", attention_weights_unscaled)
print("Output for 'cat':", output_cat)

Attention Weights Scaled: {'The': 0.281, 'cat': 0.089, 'sat': 0.158, 'on': 0.158, 'a': 0.158, 'mat': 0.158}
Attention Weights Unscaled: {'The': 0.384, 'cat': 0.052, 'sat': 0.141, 'on': 0.141, 'a': 0.141, 'mat': 0.141}
Output for 'cat': [0.597 0.405 0.597]


The attention weights indicate how much focus the word "cat" gives to each word in the sentence, including itself. They are non-negative and sum to 1; the printed values add up to 1.002 only because each weight is rounded to three decimals.

The output vector for "cat" is a weighted sum of the value vectors of all words, weighted by the attention weights.

This output vector can be interpreted as the new representation of the word "cat" after considering the context provided by the entire sentence. It combines information from all the words, with more weight given to the words that "cat" pays more attention to.
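
The per-word computation above can also be carried out for all tokens at once with matrix operations. The sketch below is one way to do that, assuming the same embeddings and weight matrices as in Steps 1 and 2; the function name `self_attention` and the row ordering of `X` are illustrative choices. Because it skips the intermediate rounding, the row for "cat" may differ from the printed output in the third decimal.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for all tokens (rows of X)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_tokens, n_tokens)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Rows follow sentence order: The, cat, sat, on, a, mat
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0],
              [0, 0, 1], [1, 0, 0], [0, 1, 1]])
W_Q = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
W_K = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 1]])
W_V = np.eye(3, dtype=int)  # identity, as in Step 2

outputs, weights = self_attention(X, W_Q, W_K, W_V)
print("Weights for 'cat':", np.round(weights[1], 3))
print("Output for 'cat': ", np.round(outputs[1], 3))
```

Each row of `outputs` is the context-aware representation of one token, so the same call yields the new vectors for "The", "sat", and the remaining words as well.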

# Multi-Head Attention Example

Multi-head attention works analogously to self-attention, except that each head has its own matrices $W_Q$, $W_K$, and $W_V$.

In addition, a single matrix $W_O$ combines the concatenated head outputs at the end.
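
In formula form, with $h$ heads, this is commonly written as shown here. The superscript notation $W_Q^{(i)}$ for the matrices of head $i$ is a choice made for this formula (the code uses suffixes such as `W_Q1`), $X$ stacks the token embeddings as rows, and $\mathrm{Attention}$ denotes the single-head computation from Steps 3 to 6.

```latex
\mathrm{head}_i = \mathrm{Attention}\!\left(X W_Q^{(i)},\; X W_K^{(i)},\; X W_V^{(i)}\right), \qquad
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W_O
```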

The example again uses the sentence "The cat sat on a mat" and focuses on the word "cat", this time with two attention heads.

In [7]:
import numpy as np

embeddings = {
    "The": np.array([1, 0, 1]),
    "cat": np.array([0, 1, 0]),
    "sat": np.array([1, 1, 0]),
    "on": np.array([0, 0, 1]),
    "a": np.array([1, 0, 0]),
    "mat": np.array([0, 1, 1])
}


W_Q1 = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0]
])

W_K1 = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 1]
])

W_V1 = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
])

W_Q2 = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0]
])

W_K2 = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1]
])

W_V2 = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 1]
])

Q1 = {word: np.dot(embedding, W_Q1) for word, embedding in embeddings.items()}
K1 = {word: np.dot(embedding, W_K1) for word, embedding in embeddings.items()}
V1 = {word: np.dot(embedding, W_V1) for word, embedding in embeddings.items()}

Q2 = {word: np.dot(embedding, W_Q2) for word, embedding in embeddings.items()}
K2 = {word: np.dot(embedding, W_K2) for word, embedding in embeddings.items()}
V2 = {word: np.dot(embedding, W_V2) for word, embedding in embeddings.items()}

Q_cat1 = Q1["cat"]
scores1 = attention_scores(Q_cat1, K1)
scaled_scores1 = scale_scores(scores1, d_k=3)
attention_weights1 = softmax(scaled_scores1)

Q_cat2 = Q2["cat"]
scores2 = attention_scores(Q_cat2, K2)
scaled_scores2 = scale_scores(scores2, d_k=3)
attention_weights2 = softmax(scaled_scores2)

output_cat1 = sum(attention_weights1[word] * V1[word] for word in V1)
output_cat2 = sum(attention_weights2[word] * V2[word] for word in V2)

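# Concatenate the two head outputs into one 6-dimensional vector
output_cat = np.concatenate([output_cat1, output_cat2])

# W_O is learned in a real model; random values stand in for it here,
# so the final output changes from run to run
W_O = np.random.rand(output_cat.shape[0], 3)
final_output_cat = np.dot(output_cat, W_O)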

print("Attention Weights Head 1:", attention_weights1)
print("Attention Weights Head 2:", attention_weights2)
print("Output for 'cat' Head 1:", output_cat1)
print("Output for 'cat' Head 2:", output_cat2)
print("Final Output for 'cat':", final_output_cat)

Attention Weights Head 1: {'The': 0.281, 'cat': 0.089, 'sat': 0.158, 'on': 0.158, 'a': 0.158, 'mat': 0.158}
Attention Weights Head 2: {'The': 0.424, 'cat': 0.042, 'sat': 0.134, 'on': 0.134, 'a': 0.134, 'mat': 0.134}
Output for 'cat' Head 1: [0.597 0.405 0.597]
Output for 'cat' Head 2: [0.31  0.692 1.002]
Final Output for 'cat': [1.08429367 1.54793053 2.56099683]
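
The two heads above are written out by hand. With more heads, a loop over per-head weight matrices is easier to maintain. A minimal sketch under a few assumptions: the helper names `attention_head` and `multi_head_attention` are hypothetical, `W_O` is drawn from a seeded random generator purely as a stand-in for a learned matrix, and no intermediate rounding is applied, so head outputs may differ from the printed ones in the third decimal.

```python
import numpy as np

def attention_head(X, W_Q, W_K, W_V):
    """One head of scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, heads, W_O):
    """Run every head, concatenate along features, then project with W_O."""
    head_outputs = [attention_head(X, *head) for head in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_O

# Rows follow sentence order: The, cat, sat, on, a, mat
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0],
              [0, 0, 1], [1, 0, 0], [0, 1, 1]])
heads = [
    (np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]]),   # W_Q1
     np.array([[0, 1, 0], [1, 0, 1], [0, 1, 1]]),   # W_K1
     np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])),  # W_V1
    (np.array([[0, 1, 0], [1, 0, 1], [1, 1, 0]]),   # W_Q2
     np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]]),   # W_K2
     np.array([[0, 1, 0], [1, 0, 1], [0, 0, 1]])),  # W_V2
]

rng = np.random.default_rng(0)         # seeded so reruns give the same W_O
W_O = rng.random((3 * len(heads), 3))  # stand-in for a learned matrix

head1 = attention_head(X, *heads[0])
head2 = attention_head(X, *heads[1])
final = multi_head_attention(X, heads, W_O)
print("Head 1 output for 'cat':", np.round(head1[1], 3))
print("Head 2 output for 'cat':", np.round(head2[1], 3))
print("Final output shape:", final.shape)
```

Seeding the generator keeps the final projection repeatable across runs, which the unseeded `np.random.rand` call in the cell above does not.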
