### Transformer self-attention concept 

In transformers, the attention mechanism allows the model to focus on different parts of the input sequence when producing an output. The scaled dot-product attention is defined as:

$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Where:

- Q: Query matrix
- K: Key matrix
- V: Value matrix
- $d_k$: Dimension of the key vectors

Let us implement an example and see how it works.

### Attention with Embeddings and Positional Encoding

This example demonstrates three concepts:

1. How attention is computed
2. How adding positional encoding changes the attention weights
3. How multi-head attention works

Start by defining embeddings and positional encoding:

- Let’s use a simple sentence: "I love dogs".
- We’ll represent each word as a 4-dimensional embedding and add positional encoding.

Word embeddings (one-hot vectors chosen for illustration, not learned):

- I: [1, 0, 0, 0]
- love: [0, 1, 0, 0]
- dogs: [0, 0, 1, 0]

Positional Encoding (Simplified):

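We’ll use a simplified positional encoding for positions 0, 1, and 2. It is a linear stand-in for the sinusoidal encoding used in the original Transformer; without going into details, it tells the model where each token sits in the sequence: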

- Position 0: [0, 0, 0, 0]
- Position 1: [0.5, 0.5, 0.5, 0.5]
- Position 2: [1, 1, 1, 1]

Combined Input:

Word embedding + positional encoding:

- I: [1, 0, 0, 0] + [0, 0, 0, 0] = [1, 0, 0, 0]
- love: [0, 1, 0, 0] + [0.5, 0.5, 0.5, 0.5] = [0.5, 1.5, 0.5, 0.5]
- dogs: [0, 0, 1, 0] + [1, 1, 1, 1] = [1, 1, 2, 1]


In [None]:
# Functions to compute the attention scores, plus the example inputs used as Q, K, and V

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Ensure Q, K, V are 3D arrays with shape (batch_size, seq_len, d_model)
    # If not, add a batch dimension
    if Q.ndim == 2:
        Q = Q[np.newaxis, :, :]  # Add batch dimension
        K = K[np.newaxis, :, :]
        V = V[np.newaxis, :, :]

    # Compute QK^T
    matmul_qk = np.matmul(Q, K.transpose(0, 2, 1))  # Shape: (batch_size, seq_len, seq_len)

    # Scale by sqrt(d_k)
    dk = K.shape[-1]
    scaled_attention_logits = matmul_qk / np.sqrt(dk)

    # Apply softmax to get attention weights
    attention_weights = softmax(scaled_attention_logits)  # Shape: (batch_size, seq_len, seq_len)

    # Multiply attention weights with V
    output = np.matmul(attention_weights, V)  # Shape: (batch_size, seq_len, d_model)

    return output, attention_weights

# Embeddings
embeddings = np.array([
    [1, 0, 0, 0],  # I
    [0, 1, 0, 0],  # love
    [0, 0, 1, 0],  # dogs
])

# Positional encoding
positional_encoding = np.array([
    [0, 0, 0, 0],      # Position 0
    [0.5, 0.5, 0.5, 0.5],  # Position 1
    [1, 1, 1, 1],      # Position 2
])

# Combined input
combined = embeddings + positional_encoding

#### Attention without positional encoding

In [None]:
# Attention without positional encoding
Q_no_pos = K_no_pos = V_no_pos = embeddings
output_no_pos, attention_weights_no_pos = scaled_dot_product_attention(Q_no_pos, K_no_pos, V_no_pos)

print("Attention Weights Without Positional Encoding:\n", attention_weights_no_pos)

#### How to Read the Matrix

1. Rows represent the query word (the word that is "attending").
2. Columns represent the key word (the word being attended to).
3. Each value is a probability (sums to 1 for each row).
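
The reading rules above can be checked numerically. The following is a minimal sketch, assuming the same one-hot embeddings are used directly as Q and K (no learned projections); it prints each query row with its sum so the "each row sums to 1" property is visible.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

# One-hot embeddings for "I", "love", "dogs" (same values as the example above)
X = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

# With Q = K = X, the scaled logits are X X^T / sqrt(d_k)
weights = softmax(X @ X.T / np.sqrt(X.shape[-1]))

words = ["I", "love", "dogs"]
for word, row in zip(words, weights):
    # Each row is a probability distribution over the keys (columns)
    print(f"{word:>4} -> {np.round(row, 3)}  (row sum = {row.sum():.1f})")
```

Rows are normalized, but columns are not: a key can receive more or less than a total weight of 1 across all queries.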

Let us plot it as a heatmap to make it easier to read.

In [None]:
# Attention weights without positional encoding
attention_weights = attention_weights_no_pos[0]

# Labels for the words
words = ["I", "love", "dogs"]

# Create a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(
    attention_weights,
    annot=True,
    cmap="YlGnBu",
    xticklabels=words,
    yticklabels=words,
    fmt=".3f",
    linewidths=0.5,
    linecolor="gray"
)

# Add title and labels
plt.title("Attention Weights Without Positional Encoding")
plt.xlabel("Key (Attended To)")
plt.ylabel("Query (Attending)")
plt.show()

##### Results without positional encoding
- Each word attends most strongly to itself (diagonal values ~0.45).
- Each word attends equally but weakly to the other words (off-diagonal values ~0.27).
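
These two values can be derived by hand, assuming (as in the cell above) that Q and K are the one-hot embeddings and $d_k = 4$. Then $QK^T$ is the 3×3 identity matrix, and dividing by $\sqrt{4} = 2$ leaves a logit of 0.5 on the diagonal and 0 elsewhere, so every row of the softmax is:

$$\text{diagonal} = \frac{e^{0.5}}{e^{0.5} + 2e^{0}} \approx \frac{1.649}{3.649} \approx 0.452, \qquad \text{off-diagonal} = \frac{e^{0}}{e^{0.5} + 2e^{0}} \approx \frac{1}{3.649} \approx 0.274$$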

#### Attention with positional encoding

In [None]:
# Attention with positional encoding
Q_with_pos = K_with_pos = V_with_pos = combined
output_with_pos, attention_weights_with_pos = scaled_dot_product_attention(Q_with_pos, K_with_pos, V_with_pos)

print("\nAttention Weights With Positional Encoding:\n", attention_weights_with_pos)

In [None]:
# Attention weights with positional encoding
attention_weights = attention_weights_with_pos[0]

# Labels for the words
words = ["I", "love", "dogs"]

# Create a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(
    attention_weights,
    annot=True,
    cmap="YlGnBu",
    xticklabels=words,
    yticklabels=words,
    fmt=".3f",
    linewidths=0.5,
    linecolor="gray"
)

# Add title and labels
plt.title("Attention Weights With Positional Encoding")
plt.xlabel("Key (Attended To)")
plt.ylabel("Query (Attending)")
plt.show()


#### Why Does Positional Encoding Change the Attention?

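1. "love" puts more weight on "dogs" (about 0.50) than on itself (about 0.39).
2. "dogs" attends strongly to itself (about 0.82).
3. "I" splits its attention equally between itself and "dogs" (about 0.36 each).

These shifts come from the geometry of the toy vectors, not from grammar. The positional offsets grow with position, so later tokens have larger vectors and therefore larger dot products with every query. Nothing is learned in this example, so the pattern should not be read as the model discovering subjects and objects.

Without positional encoding, the attention weights were symmetric and each word attended mostly to itself.

With positional encoding, the weights depend on where a token sits as well as on which word it is. In a trained transformer, learned projection matrices use this positional signal to pick up word order and syntactic roles; here the effect is purely a consequence of the hand-picked vectors.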
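
To see where the changed weights come from, it helps to look at the raw dot products before the softmax. The following is a minimal sketch, assuming the combined vectors from earlier are used directly as Q and K; the `norms` variable is introduced here only for illustration.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

# Embedding + positional encoding, as computed earlier
combined = np.array([
    [1.0, 0.0, 0.0, 0.0],  # I
    [0.5, 1.5, 0.5, 0.5],  # love
    [1.0, 1.0, 2.0, 1.0],  # dogs
])

raw_scores = combined @ combined.T         # QK^T with Q = K = combined
norms = np.linalg.norm(combined, axis=-1)  # Vector length per token
weights = softmax(raw_scores / np.sqrt(combined.shape[-1]))

print("Vector norms:", np.round(norms, 3))
print("Raw scores (QK^T):\n", raw_scores)
print("Attention weights:\n", np.round(weights, 3))
```

The "dogs" column has the largest raw score in every row where it is not tied, which is why it draws the most attention after the softmax.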

-------

#### Multi-Head Attention for "I love dogs"

- Instead of computing attention once, we split the Q, K, and V matrices into multiple "heads" (e.g., 2 or 8).
- Each head computes its own attention weights, allowing the model to capture diverse patterns.
- The outputs of all heads are concatenated and projected back to the original dimension.

We shall use 2 heads.

Because the projection matrices below are random, each head will show a different attention pattern, which illustrates how separate heads can look at the same sequence in different ways.

In [None]:
# let us rewrite the functions for multi-head

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    matmul_qk = np.matmul(Q, K.transpose(0, 2, 1))
    dk = K.shape[-1]
    scaled_attention_logits = matmul_qk / np.sqrt(dk)
    attention_weights = softmax(scaled_attention_logits)
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Embeddings (3 words, 4 dimensions)
embeddings = np.array([
    [1, 0, 0, 0],  # I
    [0, 1, 0, 0],  # love
    [0, 0, 1, 0],  # dogs
], dtype=np.float32)

# Positional encoding
positional_encoding = np.array([
    [0, 0, 0, 0],      # Position 0
    [0.5, 0.5, 0.5, 0.5],  # Position 1
    [1, 1, 1, 1],      # Position 2
], dtype=np.float32)

# Combined input (add batch dimension)
combined = (embeddings + positional_encoding)[np.newaxis, :, :]  # Shape: (1, 3, 4)

# Number of heads and dimensions per head
num_heads = 2
d_model = 4
depth = d_model // num_heads  # 2 dimensions per head

# Random projection matrices for Q, K, V (seeded so the heatmaps are reproducible)
np.random.seed(0)
W_Q = np.random.randn(d_model, d_model)
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

# Project to Q, K, V
Q = np.matmul(combined, W_Q)  # Shape: (1, 3, 4)
K = np.matmul(combined, W_K)
V = np.matmul(combined, W_V)

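# Helpers to split the model dimension into heads and to merge the heads back
def split_heads(x, num_heads):
    batch_size, seq_len, d_model = x.shape
    depth = d_model // num_heads  # Dimensions per head, derived from the input shape
    x = x.reshape(batch_size, seq_len, num_heads, depth)
    return np.transpose(x, (0, 2, 1, 3))  # Shape: (batch_size, num_heads, seq_len, depth)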

def combine_heads(x):
    batch_size, _, seq_len, _ = x.shape
    x = np.transpose(x, (0, 2, 1, 3))  # Shape: (batch_size, seq_len, num_heads, depth)
    return x.reshape(batch_size, seq_len, -1)  # Concatenate heads

# Split into heads
Q_heads = split_heads(Q, num_heads)  # Shape: (1, 2, 3, 2)
K_heads = split_heads(K, num_heads)
V_heads = split_heads(V, num_heads)

# Compute attention for each head
attention_outputs = []
attention_weights_all = []

for i in range(num_heads):
    Q_head = Q_heads[0, i]  # Shape: (3, 2)
    K_head = K_heads[0, i]
    V_head = V_heads[0, i]

    # Add batch dimension
    Q_head = Q_head[np.newaxis, :, :]  # Shape: (1, 3, 2)
    K_head = K_head[np.newaxis, :, :]
    V_head = V_head[np.newaxis, :, :]

    output, attention_weights = scaled_dot_product_attention(Q_head, K_head, V_head)
    attention_outputs.append(output)
    attention_weights_all.append(attention_weights)

# Concatenate outputs from all heads
attention_output = np.concatenate(attention_outputs, axis=-1)  # Shape: (1, 3, 4)

# Print attention weights for each head
for i, weights in enumerate(attention_weights_all):
    print(f"Attention Weights for Head {i+1}:\n", weights[0])
    plt.figure(figsize=(5, 3))
    sns.heatmap(weights[0], annot=True, cmap="YlGnBu", xticklabels=["I", "love", "dogs"], yticklabels=["I", "love", "dogs"])
    plt.title(f"Attention Head {i+1}")
    plt.show()
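
#### Output projection (sketch)

The list at the start of this section says the concatenated head outputs are projected back to the original dimension, a step the cell above stops short of. The following is a minimal sketch of that last step. It assumes a hypothetical output matrix `W_O` (random here, learned in a real model) and uses random stand-in head outputs, not the notebook's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)  # Fixed seed, only so the sketch is repeatable

batch_size, seq_len, num_heads, depth = 1, 3, 2, 2
d_model = num_heads * depth  # 4, as in the example above

# Stand-in for the per-head attention outputs: (batch, heads, seq_len, depth)
head_outputs = rng.standard_normal((batch_size, num_heads, seq_len, depth))

# Concatenate heads: (batch, heads, seq_len, depth) -> (batch, seq_len, heads * depth)
concatenated = head_outputs.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

# Hypothetical output projection W_O (an assumption for illustration)
W_O = rng.standard_normal((d_model, d_model))
projected = concatenated @ W_O  # Shape: (1, 3, 4)

print("Concatenated shape:", concatenated.shape)
print("Projected shape:", projected.shape)
```

The projection lets the model mix information across heads; without it, each slice of the output would depend on one head only.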