# Attention and Multi-Head Attention — Formulas + Numerical Example

This notebook explains **self-attention** and **multi-head attention** step by step.

Contents:
1. What are Q, K, V?
2. Self-attention formula
3. Input embeddings (numerical example)
4. Simple (single-head) self-attention
5. Multi-head attention formula
6. Multi-head attention (2 heads, numerical)
7. Concatenate heads and output projection

Example sentence:
```
I love machine learning
```

## 1. What are Q, K, V?

We start with a sequence of $N$ token embeddings, stacked as the rows of $X$:

$X = [x_1, x_2, \dots, x_N], \quad x_i \in \mathbb{R}^{d_{model}}, \quad X \in \mathbb{R}^{N \times d_{model}}$

Each token is projected into three different vectors:

$Q = XW_Q,\quad K = XW_K,\quad V = XW_V$

where:
- $W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_k}$ are learned matrices
- $Q, K, V \in \mathbb{R}^{N \times d_k}$

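**Q** (query) expresses what each token is looking for; it is compared against every key to compute relevance.

**K** (key) expresses what each token offers for matching against the queries.

**V** (value) contains the information that is mixed into the output, weighted by relevance.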
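As a rough illustration of the projection step, the sketch below builds $Q$, $K$, $V$ from a random input and checks their shapes. The sizes (`N = 4`, `d_model = 4`, `d_k = 3`) and the random matrices are assumptions chosen for the demo; in a real model the projection matrices are learned.

```python
import torch

torch.manual_seed(0)                 # reproducible random values

N, d_model, d_k = 4, 4, 3            # illustrative sizes (assumed)
X = torch.randn(N, d_model)          # one token embedding per row

# Random stand-ins for the learned projection matrices
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

print(Q.shape, K.shape, V.shape)     # each has shape (N, d_k)

# Row i of Q depends only on token i
print(torch.allclose(Q[2], X[2] @ W_Q))
```

Each token gets its own query, key, and value row; tokens only interact later, when queries are compared with keys.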

## 2. Self-Attention Formula

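Step 1: Similarity scores

$S = QK^T, \quad S \in \mathbb{R}^{N \times N}$

Step 2: Scaling

$\hat{S} = \frac{QK^T}{\sqrt{d_k}}$

Dividing by $\sqrt{d_k}$ keeps the scores from growing with the vector dimension, which would push the softmax toward a one-hot distribution.

Step 3: Softmax (row-wise)

$A = \text{softmax}(\hat{S})$

Each row of $A$ is non-negative and sums to 1.

Step 4: Output

$\boxed{\text{Attention}(Q,K,V) = AV}$

Output shape: $\mathbb{R}^{N \times d_k}$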
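To make the effect of the scaling step concrete, here is a small sketch in plain Python that applies softmax to one row of scores with and without the division by $\sqrt{d_k}$. The score values and `d_k = 4` are made up for illustration, and the `softmax` helper is a hypothetical stand-in for `torch.softmax`.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 4
raw_scores = [4.0, 2.0, 0.0]                        # one row of S (made-up values)
scaled = [s / math.sqrt(d_k) for s in raw_scores]   # same row after scaling

a_raw = softmax(raw_scores)
a_scaled = softmax(scaled)

print([round(a, 3) for a in a_raw])      # roughly [0.867, 0.117, 0.016]
print([round(a, 3) for a in a_scaled])   # roughly [0.665, 0.245, 0.09]
```

Without scaling, almost all of the weight lands on the largest score; with scaling, the distribution is flatter and the other tokens still contribute.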

## 3. Input embeddings (numerical example)

In [None]:

import torch
import math

X = torch.tensor([
    [1., 0., 1., 0.],   # I
    [0., 1., 1., 0.],   # love
    [1., 1., 0., 1.],   # machine
    [0., 1., 0., 1.]    # learning
])

X


## 4. Simple (Single-Head) Self-Attention

For clarity, we use identity matrices for $W_Q, W_K, W_V$, so $Q = K = V = X$ and $d_k = d_{model} = 4$.

In [None]:

d_model = X.shape[1]

WQ = torch.eye(d_model)
WK = torch.eye(d_model)
WV = torch.eye(d_model)

Q = X @ WQ
K = X @ WK
V = X @ WV

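d_k = Q.shape[1]

# Scaled dot-product scores, shape (N, N)
scores = Q @ K.T / math.sqrt(d_k)
# Row-wise softmax: each row of weights sums to 1
weights = torch.softmax(scores, dim=1)
# Each output row is a weighted sum of the rows of V
output_single = weights @ V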

scores, weights, output_single
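As a sanity check on the cell above, the scores can be worked out by hand: with identity projections they are the pairwise dot products of the embeddings divided by $\sqrt{4} = 2$. The sketch below repeats the computation in a self-contained form and compares it with the hand-computed matrix (the name `expected` is introduced here for illustration).

```python
import math
import torch

X = torch.tensor([
    [1., 0., 1., 0.],   # I
    [0., 1., 1., 0.],   # love
    [1., 1., 0., 1.],   # machine
    [0., 1., 0., 1.]    # learning
])

# Identity projections give Q = K = V = X and d_k = 4
scores = X @ X.T / math.sqrt(X.shape[1])

# Pairwise dot products computed by hand, then divided by 2
expected = torch.tensor([
    [1.0, 0.5, 0.5, 0.0],
    [0.5, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.5, 1.0],
    [0.0, 0.5, 1.0, 1.0],
])

weights = torch.softmax(scores, dim=1)

print(torch.allclose(scores, expected))                    # True
print(torch.allclose(weights.sum(dim=1), torch.ones(4)))   # True
```

The score matrix is symmetric here only because $Q = K$; with distinct learned projections it generally is not.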


## 5. Multi-Head Attention Formula

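For $h$ heads, each head $i = 1, \dots, h$ has its own projection matrices $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \in \mathbb{R}^{d_{model} \times d_k}$:

$Q^{(i)} = XW_Q^{(i)},\quad K^{(i)} = XW_K^{(i)},\quad V^{(i)} = XW_V^{(i)}$

$O^{(i)} = \text{Attention}(Q^{(i)}, K^{(i)}, V^{(i)})$

Concatenate outputs:

$O = \text{Concat}(O^{(1)}, \dots, O^{(h)}), \quad O \in \mathbb{R}^{N \times h d_k}$

Final projection, with $W_O \in \mathbb{R}^{h d_k \times d_{model}}$:

$\boxed{\text{MHA}(X) = OW_O}$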
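The formulas above can be sketched as a small function that loops over the heads. This is a minimal illustration, not how optimized libraries implement it; the function name `multi_head_attention`, the `head_weights` argument, and the sizes (`N = 5`, `d_model = 8`, `h = 2`) are assumptions made for this example.

```python
import math
import torch

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    scores = Q @ K.T / math.sqrt(Q.shape[1])
    return torch.softmax(scores, dim=1) @ V

def multi_head_attention(X, head_weights, W_O):
    """head_weights: list of (W_Q, W_K, W_V) tuples, one per head."""
    outputs = [attention(X @ W_Q, X @ W_K, X @ W_V)
               for W_Q, W_K, W_V in head_weights]
    O = torch.cat(outputs, dim=1)    # (N, h * d_k)
    return O @ W_O                   # (N, d_model)

torch.manual_seed(0)
N, d_model, h = 5, 8, 2
d_k = d_model // h                   # common choice: split d_model across heads

X = torch.randn(N, d_model)
head_weights = [tuple(torch.randn(d_model, d_k) for _ in range(3))
                for _ in range(h)]
W_O = torch.randn(h * d_k, d_model)

Y = multi_head_attention(X, head_weights, W_O)
print(Y.shape)                       # (N, d_model)
```

The output has the same shape as the input, which is what lets attention layers be stacked.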

## 6. Multi-Head Attention (2 heads, numerical)

- Number of heads: $h = 2$
- Head dimension: $d_k = 2$
- Concatenated dimension: $h \cdot d_k = 4$ (equal to $d_{model}$)

In [None]:

# Projection matrices for two heads

# Head 1 -> first two features
WQ1 = WK1 = WV1 = torch.tensor([
    [1., 0.],
    [0., 1.],
    [0., 0.],
    [0., 0.]
])

# Head 2 -> last two features
WQ2 = WK2 = WV2 = torch.tensor([
    [0., 0.],
    [0., 0.],
    [1., 0.],
    [0., 1.]
])
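One way to read these matrices: they are the two column blocks of a $4 \times 4$ identity matrix, so each head simply selects half of the features. The sketch below checks this with `torch.chunk`; the names `W_full`, `W_head1`, and `W_head2` are introduced here for illustration.

```python
import torch

d_model, h = 4, 2
W_full = torch.eye(d_model)

# Split the identity into h column blocks of shape (d_model, d_model // h)
W_head1, W_head2 = W_full.chunk(h, dim=1)

WQ1 = torch.tensor([[1., 0.], [0., 1.], [0., 0.], [0., 0.]])
WQ2 = torch.tensor([[0., 0.], [0., 0.], [1., 0.], [0., 1.]])

print(torch.equal(W_head1, WQ1))   # True
print(torch.equal(W_head2, WQ2))   # True

x = torch.tensor([[1., 0., 1., 0.]])   # embedding of "I"
print(x @ W_head1)                     # first two features:  [[1., 0.]]
print(x @ W_head2)                     # last two features:   [[1., 0.]]
```

Many implementations follow the same idea in general: project once with a full-width matrix, then split the result into per-head slices.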


In [None]:

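def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = Q.shape[1]
    # Scaled similarity scores, shape (N, N)
    scores = Q @ K.T / math.sqrt(d_k)
    # Row-wise softmax: each row of weights sums to 1
    weights = torch.softmax(scores, dim=1)
    return weights @ V, weights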
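A quick way to build intuition for this function is a degenerate case: if every query is the zero vector, all scores are 0, the softmax is uniform, and each output row is the plain average of the value rows. The sketch below checks that; the input values are chosen only for illustration.

```python
import math
import torch

def attention(Q, K, V):
    d_k = Q.shape[1]
    scores = Q @ K.T / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=1)
    return weights @ V, weights

V = torch.tensor([[1., 0.], [0., 1.], [1., 1.], [0., 1.]])
K = V.clone()
Q = torch.zeros(4, 2)                  # zero queries: every score is 0

out, w = attention(Q, K, V)

print(w[0])      # uniform weights: 0.25 for every token
print(out[0])    # mean of the rows of V: [0.5, 0.75]
```

Non-zero queries shift weight toward the keys they align with, moving the output away from this average.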


In [None]:

# Head 1
Q1, K1, V1 = X @ WQ1, X @ WK1, X @ WV1
O1, A1 = attention(Q1, K1, V1)

# Head 2
Q2, K2, V2 = X @ WQ2, X @ WK2, X @ WV2
O2, A2 = attention(Q2, K2, V2)

O1, A1, O2, A2
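One thing worth noticing in these results: tokens that look identical inside a head's subspace get identical attention rows in that head. In head 1, "love" and "learning" both project to `[0, 1]`; in head 2, "I" and "love" both project to `[1, 0]`, and "machine" and "learning" both project to `[0, 1]`. The sketch below checks this, using column slices of `X` in place of the projection matrices (equivalent here because the projections just select features).

```python
import math
import torch

X = torch.tensor([
    [1., 0., 1., 0.],   # I
    [0., 1., 1., 0.],   # love
    [1., 1., 0., 1.],   # machine
    [0., 1., 0., 1.]    # learning
])

def attention(Q, K, V):
    scores = Q @ K.T / math.sqrt(Q.shape[1])
    weights = torch.softmax(scores, dim=1)
    return weights @ V, weights

H1, H2 = X[:, :2], X[:, 2:]        # same as X @ WQ1 and X @ WQ2
O1, A1 = attention(H1, H1, H1)
O2, A2 = attention(H2, H2, H2)

print(torch.allclose(A1[1], A1[3]))   # True: love vs learning in head 1
print(torch.allclose(A2[0], A2[1]))   # True: I vs love in head 2
print(torch.allclose(A2[2], A2[3]))   # True: machine vs learning in head 2
print(torch.allclose(A1, A2))         # False: the heads attend differently
```

This is the point of using several heads: each one measures similarity in a different subspace and can produce a different attention pattern.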


## 7. Concatenate heads and output projection

In [None]:

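# Concatenate head outputs along the feature dimension: (N, 2) + (N, 2) -> (N, 4)
O = torch.cat([O1, O2], dim=1)

# Identity output projection for clarity, so Y equals O
WO = torch.eye(O.shape[1])
Y = O @ WO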

Y
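It can be instructive to compare this result with the single-head output from section 4. Both have shape `(4, 4)`, but the values differ, because each head computes its attention weights from only half of the features. The self-contained sketch below reproduces both computations under the same identity-projection assumptions used in this notebook.

```python
import math
import torch

X = torch.tensor([
    [1., 0., 1., 0.],   # I
    [0., 1., 1., 0.],   # love
    [1., 1., 0., 1.],   # machine
    [0., 1., 0., 1.]    # learning
])

def attention(Q, K, V):
    scores = Q @ K.T / math.sqrt(Q.shape[1])
    return torch.softmax(scores, dim=1) @ V

# Single head with identity projections: Q = K = V = X
single = attention(X, X, X)

# Two heads, each seeing two features, then concatenated (identity W_O)
H1, H2 = X[:, :2], X[:, 2:]
Y = torch.cat([attention(H1, H1, H1), attention(H2, H2, H2)], dim=1)

print(single.shape, Y.shape)               # both (4, 4)
print(torch.allclose(single, Y))           # False
print(single[0, 0].item(), Y[0, 0].item()) # about 0.62 vs about 0.67
```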


## Summary
- Q, K, V are linear projections of the same input $X$
- Self-attention computes, for each token, a softmax-weighted sum of the value vectors
- Multi-head attention runs this computation in parallel on $h$ lower-dimensional subspaces
- The head outputs are concatenated and projected with $W_O$