# Self Attention in Transformers

**Self-attention** is the core mechanism inside Transformer neural networks.  
It allows every word in a sentence to look at **every other word** (including itself) to build a richer, context-aware representation.

Unlike RNNs, which process words sequentially and struggle with long-range dependencies, self-attention processes all words **in parallel** and, when no mask is applied, captures **both past and future context** simultaneously.

### How it works at a high level:
1. Each word generates three vectors: **Query (Q)**, **Key (K)**, and **Value (V)**.
2. The **dot product** of Q and K determines how much attention each word pays to every other word.
3. These scores are **scaled** (divided by √d_k) to keep gradients stable.
4. An optional **mask** is applied (in the decoder) to prevent attending to future words.
5. **Softmax** converts raw scores into a probability distribution.
6. The probability-weighted sum of **V** vectors produces the new context-aware representations.

## Generate Data

We create random **Query (Q)**, **Key (K)**, and **Value (V)** matrices to simulate the inputs to self-attention.

- **L** (sequence length) = 4 → represents 4 words/tokens in a sentence (e.g., "my name is Aj").
- **d_k** (key/query dimension) = 8 → each Q and K vector has 8 dimensions.
- **d_v** (value dimension) = 8 → each V vector has 8 dimensions.

In a real Transformer, Q, K, and V are produced by multiplying the input embeddings (typically 512-dimensional) by learned weight matrices. Here we use random values for demonstration.
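The projection step described above can be sketched as follows. This is only an illustration under stated assumptions: the names `x`, `W_q`, `W_k`, `W_v`, and `d_model` do not appear in this notebook, and the weight matrices are random stand-ins for parameters that would normally be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the sketch is reproducible

L, d_model, d_k, d_v = 4, 512, 8, 8   # d_model is an assumed embedding size

# Hypothetical input embeddings: one d_model-dimensional row per token
x = rng.standard_normal((L, d_model))

# Random stand-ins for the learned projection matrices
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

q = x @ W_q   # (L, d_k)
k = x @ W_k   # (L, d_k)
v = x @ W_v   # (L, d_v)

print(q.shape, k.shape, v.shape)
```

All three matrices come from the same input `x`, which is what makes this *self*-attention.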

In [None]:
import numpy as np
import math

# -------------------------------------------------------------------
# Define dimensions for our self-attention example
# -------------------------------------------------------------------
# L   = Sequence length (number of tokens/words in the input sentence)
#       Example: "my name is Aj" → 4 tokens, so L = 4
# d_k = Dimension of the Query (Q) and Key (K) vectors
#       Each word's Q and K will be a vector of size 8
# d_v = Dimension of the Value (V) vectors
#       Each word's V will be a vector of size 8
# In real Transformers, d_k and d_v are typically 64 (with 512-dim embeddings split across 8 heads)
L, d_k, d_v = 4, 8, 8

# -------------------------------------------------------------------
# Generate random Q, K, V matrices
# -------------------------------------------------------------------
# q (Query):  Shape (L, d_k) = (4, 8)
#   - "What am I looking for?" — each word asks what context it needs
# k (Key):    Shape (L, d_k) = (4, 8)
#   - "What can I offer?" — each word advertises what information it contains
# v (Value):  Shape (L, d_v) = (4, 8)
#   - "What information do I actually carry?" — the real content each word provides
#
# np.random.randn generates values from a standard normal distribution (mean=0, std=1)
q = np.random.randn(L, d_k)   # 4 words, each with an 8-dim query vector
k = np.random.randn(L, d_k)   # 4 words, each with an 8-dim key vector
v = np.random.randn(L, d_v)   # 4 words, each with an 8-dim value vector

In [None]:
# -------------------------------------------------------------------
# Print the generated Q, K, V matrices
# -------------------------------------------------------------------
# Each matrix is (4, 8): 4 rows (one per word) × 8 columns (vector dimensions)
# Row 0 → word 1 ("my"), Row 1 → word 2 ("name"), etc.
print("Q\n", q)   # Query matrix — what each word is searching for
print("K\n", k)   # Key matrix — what each word offers as a match
print("V\n", v)   # Value matrix — the actual information each word carries

Q
 [[ 0.11672673 -2.54870451 -1.44065948  0.93661829  1.36278968  1.04252277
  -0.01310938 -1.3163937 ]
 [ 0.26721599 -0.90218255  0.07417847 -0.10430246  0.52684253 -0.07081531
  -0.60511725 -0.55225527]
 [-0.93297509  0.28724456  1.37184579  0.41589874  0.34981245 -0.24753755
  -1.24497125  0.05044148]
 [-0.11414585 -0.01545749 -0.58376828 -0.40193907  0.93931836 -1.94334363
  -0.34770465  1.50103406]]
K
 [[ 1.1226585  -0.85645535  0.54315044  1.36560451  0.52539476 -0.94502504
  -0.48444661  0.46268014]
 [-0.53713766 -1.16937329 -0.57988617  0.92713577 -0.85995607 -0.40352635
   0.26555146 -1.83159914]
 [-2.06994435 -0.09514715 -1.64928361 -0.17375184  0.13146819 -1.76335363
   1.56568846  0.69751826]
 [ 0.32910684 -0.1939204  -0.80444134  0.78816869  0.35599408  0.28309835
  -0.25970963  1.49744622]]
V
 [[-0.00368231  1.43739233 -0.59614565 -1.23171219  1.12030717 -0.98620738
  -0.15461465 -1.03106383]
 [ 0.85585446 -1.79878344  0.67321704  0.05607552 -0.15542661 -1.41264124
  -0.4

## Self Attention

The self-attention formula computes how much each word should attend to every other word:

$$
\text{self attention} = \text{softmax}\bigg(\frac{Q \cdot K^T}{\sqrt{d_k}} + M\bigg)
$$

**Step-by-step breakdown:**
1. **Q · K^T** → Dot product of Query and Key (transposed). Produces an (L × L) matrix where entry [i, j] = how much word i should attend to word j.
2. **÷ √d_k** → Scale the scores so their variance stays near 1. This keeps softmax away from saturated, near one-hot outputs, where gradients vanish.
3. **+ M** → Add mask (optional, used in decoder to block future words).
4. **softmax** → Convert raw scores into probabilities (each row sums to 1).

Then compute the new context-aware value vectors:

$$
\text{new V} = \text{self attention} \cdot V
$$

Each word's new representation = weighted sum of **all** value vectors, where the weights come from the attention scores.
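Written out per word, the two formulas above become the following. The symbols $A_{ij}$ (attention weight from word $i$ to word $j$) and the lowercase row vectors $q_i$, $k_j$, $v_j$ are notation assumed here for illustration; they are the rows of $Q$, $K$, and $V$.

$$
A_{ij} = \frac{\exp\big(q_i \cdot k_j / \sqrt{d_k} + M_{ij}\big)}{\sum_{j'} \exp\big(q_i \cdot k_{j'} / \sqrt{d_k} + M_{ij'}\big)}, \qquad \text{new } v_i = \sum_{j} A_{ij}\, v_j
$$

Where $M_{ij} = -\infty$, the numerator is 0, so word $j$ contributes nothing to the new representation of word $i$.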

In [None]:
# -------------------------------------------------------------------
# Step 1: Compute Q · K^T  (raw attention scores, BEFORE scaling)
# -------------------------------------------------------------------
# np.matmul(q, k.T) performs matrix multiplication:
#   q shape:   (L, d_k) = (4, 8)
#   k.T shape: (d_k, L) = (8, 4)
#   Result:    (L, L)   = (4, 4)
#
# The resulting 4×4 matrix:
#   - Row i, Column j = dot product of word i's Query with word j's Key
#   - Higher value → word i should pay MORE attention to word j
#   - This is the "affinity" or "compatibility" between every pair of words
#
# Example: entry [0, 2] tells us how much word 0 ("my") attends to word 2 ("is")
np.matmul(q, k.T)

array([[ 1.9385252 ,  5.43647918, -0.38370563,  1.24225801],
       [ 1.35187753,  1.19807371, -1.70999851, -0.38129862],
       [ 1.06382646, -0.86860778, -1.86251774, -0.68520405],
       [ 2.21209236, -2.81995366,  5.32327746,  2.24049732]])

In [None]:
# -------------------------------------------------------------------
# WHY do we need to divide by √d_k?  (Variance analysis)
# -------------------------------------------------------------------
# When Q and K are drawn from standard normal distributions (mean=0, var=1),
# their dot product has variance ≈ d_k (the dimension of the vectors).
#
# Problem: If d_k is large (e.g., 64 or 512), the dot product values become
# very large, pushing softmax into regions with extremely small gradients
# (the "saturation" zone), making training very slow or unstable.
#
# Solution: Divide by √d_k to bring the variance back to approximately 1.
#
# Let's check on our sample (only 16 dot products, so the estimate is noisy):
#   - q.var() ≈ 1       (Q values have variance ~1, as expected from randn)
#   - k.var() ≈ 1       (K values have variance ~1)
#   - (Q·K^T).var() is ~d_k = 8 in expectation, well above the variance of Q or K
q.var(), k.var(), np.matmul(q, k.T).var()

(0.8672192297664698, 0.9229851723027697, 5.1446872979260165)
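With only a 4×4 score matrix, the measured variance is a rough estimate. A larger sample makes the claim easier to see. The sketch below is an assumption-laden check, not part of the original pipeline: `n`, `q_big`, `k_big`, and `dots` are names introduced here, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 8
n = 100_000   # number of independent (query, key) pairs; an arbitrary choice

q_big = rng.standard_normal((n, d_k))
k_big = rng.standard_normal((n, d_k))

# One dot product per (query, key) pair
dots = np.sum(q_big * k_big, axis=1)

print(dots.var())                    # should land close to d_k = 8
print((dots / np.sqrt(d_k)).var())   # should land close to 1
```

Each dot product is a sum of `d_k` independent terms with variance 1, which is why the variance grows linearly with `d_k`.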

In [None]:
# -------------------------------------------------------------------
# Step 2: Scale the dot product by dividing by √d_k
# -------------------------------------------------------------------
# math.sqrt(d_k) = √8 ≈ 2.83
# Dividing by this factor reduces the variance of the scores from ~d_k back to ~1
# This ensures softmax receives moderate values, producing well-distributed
# attention weights (not all near 0 or 1).
#
# scaled shape: (4, 4) — same as before, but values are smaller and more stable
scaled = np.matmul(q, k.T) / math.sqrt(d_k)

# Check the variance reduction:
#   - q.var() ≈ 1 (unchanged)
#   - k.var() ≈ 1 (unchanged)
#   - scaled.var() = (Q·K^T).var() / d_k, so it is ~1 in expectation
#     (a 4×4 sample is small, so the printed value will only be roughly 1)
q.var(), k.var(), scaled.var()

(0.8672192297664698, 0.9229851723027697, 0.643085912240752)

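Notice the reduction in the variance of the product: dividing the scores by √d_k divides their variance by d_k = 8 (here from about 5.14 to about 0.64).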

In [None]:
# -------------------------------------------------------------------
# Display the scaled attention scores matrix
# -------------------------------------------------------------------
# This is a 4×4 matrix (L × L):
#   - Each row corresponds to a word in the sentence
#   - Each column corresponds to a word it could attend to
#   - Values are now in a reasonable range (roughly -2 to +2) thanks to scaling
#   - These scores will next be masked (in decoder) and then passed through softmax
scaled

array([[ 0.68537216,  1.92208565, -0.13566043,  0.43920453],
       [ 0.47796088,  0.42358302, -0.60457577, -0.13480942],
       [ 0.37611945, -0.30709922, -0.65849946, -0.24225621],
       [ 0.78209275, -0.99700418,  1.88206279,  0.79213542]])
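To see why unscaled scores are a problem for softmax, the sketch below compares softmax on raw and scaled scores for a large dimension. It is a standalone illustration: `row_softmax`, `q_vec`, `keys`, and the choice `d_k = 512` are assumptions made here to exaggerate the effect.

```python
import numpy as np

def row_softmax(x):
    # Hypothetical helper: softmax over the last axis
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512   # a large dimension, chosen to exaggerate the effect

q_vec = rng.standard_normal(d_k)       # one query
keys = rng.standard_normal((4, d_k))   # four keys

raw = keys @ q_vec                     # unscaled scores, std around sqrt(d_k)
scaled_scores = raw / np.sqrt(d_k)     # scaled scores, std around 1

print(row_softmax(raw))             # typically close to one-hot
print(row_softmax(scaled_scores))   # typically much more spread out
```

Dividing all scores by a constant greater than 1 can never increase the largest softmax weight, so scaling always makes the distribution at least as spread out as before.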

## Masking

- **Purpose:** Prevent words from "cheating" by looking at future words that haven't been generated yet.
- **Where it's used:** Only in the **decoder** (not in the encoder).
  - **Encoder:** Every word can attend to every other word (full context) → no mask needed.
  - **Decoder:** During generation, word at position t can only attend to words at positions ≤ t → mask required.
- **How it works:**
  - A lower-triangular matrix of 1s is created (future positions = 0).
  - Zeros are replaced with **-∞** (negative infinity) so that after softmax, those positions get a probability of 0.
  - Ones are replaced with **0** so they don't change the attention scores when added.

**Example for L=4:**
```
Word 1 can see: [word1,   -∞,    -∞,    -∞  ]
Word 2 can see: [word1, word2,   -∞,    -∞  ]
Word 3 can see: [word1, word2, word3,   -∞  ]
Word 4 can see: [word1, word2, word3, word4 ]
```

In [None]:
# -------------------------------------------------------------------
# Step 3a: Create the causal (look-ahead) mask
# -------------------------------------------------------------------
# np.tril = lower TRIangular matrix (keeps values on and below the diagonal)
# np.ones((L, L)) creates a 4×4 matrix of all 1s, then tril zeros out the upper triangle
#
# Result:
#   [[1, 0, 0, 0],    ← word 0 can only see itself
#    [1, 1, 0, 0],    ← word 1 can see words 0 and 1
#    [1, 1, 1, 0],    ← word 2 can see words 0, 1, and 2
#    [1, 1, 1, 1]]    ← word 3 can see all words (0, 1, 2, 3)
#
# 1 = "allowed to attend" | 0 = "blocked (future word)"
mask = np.tril(np.ones( (L, L) ))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

In [None]:
# -------------------------------------------------------------------
# Step 3b: Convert the binary mask into an additive mask
# -------------------------------------------------------------------
# We transform 0s → -infinity and 1s → 0 so we can ADD this mask to the scores:
#
#   mask[mask == 0] = -inf   → blocked positions become -∞
#       When we later add -∞ to an attention score, softmax(−∞) → 0
#       This means the model assigns ZERO attention to future words.
#
#   mask[mask == 1] = 0      → allowed positions become 0
#       Adding 0 leaves the attention score unchanged.
#
# After this transformation, the mask looks like:
#   [[  0, -inf, -inf, -inf],
#    [  0,    0, -inf, -inf],
#    [  0,    0,    0, -inf],
#    [  0,    0,    0,    0]]
mask[mask == 0] = -np.inf
mask[mask == 1] = 0

In [None]:
# -------------------------------------------------------------------
# Display the additive mask
# -------------------------------------------------------------------
# Verify the mask:
#   - 0 entries → these positions are allowed (attention score unchanged)
#   - -inf entries → these positions are blocked (softmax will output 0 here)
# Row i shows which words word i is allowed to see (0) vs blocked from seeing (-inf)
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])
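The same additive mask can also be built in a single step, without overwriting a binary mask in place. This is a sketch of one alternative; the names `future` and `causal_mask` are introduced here for illustration.

```python
import numpy as np

L = 4

# True strictly above the diagonal, i.e. at the future positions
future = np.triu(np.ones((L, L), dtype=bool), k=1)

# -inf where attention is blocked, 0.0 where it is allowed
causal_mask = np.where(future, -np.inf, 0.0)
print(causal_mask)
```

Building the mask this way avoids any dependence on the order of the two in-place assignments.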

In [None]:
# -------------------------------------------------------------------
# Step 3c: Apply the mask to the scaled attention scores
# -------------------------------------------------------------------
# Element-wise addition: scaled + mask
#   - Where mask is 0 → score stays the same (word is visible)
#   - Where mask is -inf → score becomes -inf (word is hidden/future)
#
# After softmax, -inf entries will become 0 probability, effectively
# preventing the decoder from "peeking" at future tokens.
#
# Result shape: (4, 4) — same as the scaled matrix
scaled + mask

array([[ 0.68537216,        -inf,        -inf,        -inf],
       [ 0.47796088,  0.42358302,        -inf,        -inf],
       [ 0.37611945, -0.30709922, -0.65849946,        -inf],
       [ 0.78209275, -0.99700418,  1.88206279,  0.79213542]])

## Softmax

Softmax converts raw attention scores into a **probability distribution** — each row sums to 1.

$$
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$

**Why softmax?**
- It turns arbitrary real-valued scores into values between 0 and 1.
- Higher scores → higher probability → more attention.
- Scores of **-∞** (masked future words) → e^(-∞) = 0 → zero attention (exactly what we want!).
- Each row becomes a valid probability distribution, so the weighted sum of values is a proper weighted average.

In [None]:
# -------------------------------------------------------------------
# Step 4: Define the softmax function
# -------------------------------------------------------------------
def softmax(x):
    """
    Compute softmax along the last axis (each row independently).
    
    For each row:
      1. Exponentiate every element: e^(x_i)
      2. Sum all exponentiated values in that row: Σ e^(x_j)
      3. Divide each exponentiated value by the sum → probabilities
    
    The .T (transpose) trick:
      - np.exp(x) computes e^(x_i) for all elements
      - np.sum(np.exp(x), axis=-1) sums across the last axis (each row), giving shape (L,)
      - We transpose, divide (broadcasting divides each column by the row sum), 
        then transpose back to get the correct shape (L, L)
    
    Result: Each row sums to 1.0 — a valid probability distribution.
    For -inf inputs: e^(-inf) = 0, so masked positions get 0 probability.
    """
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T
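
The function above exponentiates the scores directly, which can overflow when scores are large. A common variant subtracts each row's maximum first, which leaves the result unchanged mathematically. The sketch below shows that variant; `stable_softmax` is a name assumed here, not part of this notebook.

```python
import numpy as np

def stable_softmax(x):
    """Row-wise softmax that subtracts each row's max before exponentiating."""
    # Assumes every row has at least one finite entry; a row of all -inf
    # would produce NaN here, just as it would in the plain version.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

scores = np.array([[1000.0, 1001.0, -np.inf],
                   [0.5, -np.inf, -np.inf]])
probs = stable_softmax(scores)
print(probs)
print(probs.sum(axis=-1))   # each row sums to 1
```

Using `keepdims=True` also removes the need for the double transpose, because the row sums keep a shape that broadcasts against the rows directly.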

In [None]:
# -------------------------------------------------------------------
# Step 5: Compute the attention weights
# -------------------------------------------------------------------
# Apply softmax to (scaled scores + mask):
#   attention = softmax( (Q·K^T)/√d_k + M )
#
# Result shape: (4, 4) — the attention weight matrix
#   - attention[i][j] = probability that word i attends to word j
#   - Each row sums to 1.0
#   - Upper triangle is 0 (due to mask → -inf → softmax → 0)
#   - This means each word only attends to itself and previous words (causal)
attention = softmax(scaled + mask)

In [None]:
# -------------------------------------------------------------------
# Display the attention weight matrix
# -------------------------------------------------------------------
# Each row i shows HOW MUCH word i attends to every other word:
#   Row 0: [1.0,   0,    0,    0  ] ← word 0 can only see itself (100%)
#   Row 1: [0.3, 0.7,   0,    0  ] ← word 1 splits attention between word 0 and itself
#   Row 2: [0.1, 0.2, 0.7,   0  ] ← word 2 distributes attention across words 0-2
#   Row 3: [0.1, 0.2, 0.3, 0.4]   ← word 3 attends to all words 0-3
# (Actual values depend on the random Q, K inputs)
#
# Notice: upper triangle is always 0 (masked future words!)
attention

array([[1.        , 0.        , 0.        , 0.        ],
       [0.51359112, 0.48640888, 0.        , 0.        ],
       [0.53753304, 0.27144826, 0.1910187 , 0.        ],
       [0.19293995, 0.03256643, 0.57960627, 0.19488734]])
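
Two properties of this matrix are worth checking explicitly: every row sums to 1, and everything above the diagonal is zero. The sketch below runs both checks on the values printed above; `attention_out` is a name assumed here so the block stands on its own.

```python
import numpy as np

# Attention weights copied from the output above
attention_out = np.array([
    [1.        , 0.        , 0.        , 0.        ],
    [0.51359112, 0.48640888, 0.        , 0.        ],
    [0.53753304, 0.27144826, 0.1910187 , 0.        ],
    [0.19293995, 0.03256643, 0.57960627, 0.19488734],
])

# Each row should be a probability distribution
print(np.allclose(attention_out.sum(axis=-1), 1.0))

# Everything strictly above the diagonal should be exactly zero (causal)
print(np.all(np.triu(attention_out, k=1) == 0))
```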

In [None]:
# -------------------------------------------------------------------
# Step 6: Compute the new context-aware Value vectors
# -------------------------------------------------------------------
# new_v = attention · V
#   attention shape: (L, L) = (4, 4)   — attention weights
#   v shape:         (L, d_v) = (4, 8) — original value vectors
#   Result:          (L, d_v) = (4, 8) — new value vectors enriched with context
#
# What this does for each word i:
#   new_v[i] = Σ_j  attention[i][j] * v[j]
#
#   = weighted sum of ALL value vectors, where the weights are the attention weights
#
# Illustrative example for word 2 (made-up weights; the real ones depend on Q and K):
#   new_v[2] = 0.1*v[0] + 0.2*v[1] + 0.7*v[2] + 0.0*v[3]
#   (word 2 gets 70% of its own info, 20% from word 1, 10% from word 0, 0% from word 3)
#
# This is the KEY output of self-attention: each word's new representation
# now incorporates context from the words it attended to!
new_v = np.matmul(attention, v)
new_v

array([[-0.00368231,  1.43739233, -0.59614565, -1.23171219,  1.12030717,
        -0.98620738, -0.15461465, -1.03106383],
       [ 0.41440401, -0.13671232,  0.02128364, -0.60532081,  0.49977893,
        -1.1936286 , -0.27463831, -1.10169151],
       [ 0.32673907,  0.72121642, -0.00947672, -0.59897862,  0.90155754,
        -0.88535361, -0.21384855, -0.7053796 ],
       [ 0.18700384,  1.67754576,  0.33105314, -0.41795742,  1.4258469 ,
        -0.18788199, -0.10285145,  0.54683565]])

In [None]:
# -------------------------------------------------------------------
# Compare: Original V vs New V (context-enriched)
# -------------------------------------------------------------------
# The ORIGINAL value vectors — each word's representation in isolation,
# with NO context from other words.
# Compare this with new_v above to see how self-attention has mixed
# information from different words based on the attention weights.
#
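# Key insight: in general new_v[i] ≠ v[i], because attention has blended in
# information from the other words that word i attends to.
# Exception: with the causal mask, word 0 can only attend to itself,
# so new_v[0] is identical to v[0].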
v

array([[-0.00368231,  1.43739233, -0.59614565, -1.23171219,  1.12030717,
        -0.98620738, -0.15461465, -1.03106383],
       [ 0.85585446, -1.79878344,  0.67321704,  0.05607552, -0.15542661,
        -1.41264124, -0.40136933, -1.17626611],
       [ 0.50465335,  2.28693419,  0.67128338,  0.2506863 ,  1.78802234,
         0.14775751, -0.11405725,  0.88026286],
       [-0.68069105,  0.68385101,  0.17994557, -1.68013201,  0.91543969,
        -0.19108312,  0.03160471,  1.40527326]])
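
The weighted sum can be reproduced by hand for a single word. The sketch below recomputes the new vector for word 2 from the printed attention weights and value vectors; `v_rows`, `weights_row2`, and `manual_new_v2` are names assumed here, and the numbers are copied from the outputs above.

```python
import numpy as np

# Rows 0 to 2 of V, copied from the output above
v_rows = np.array([
    [-0.00368231,  1.43739233, -0.59614565, -1.23171219,
      1.12030717, -0.98620738, -0.15461465, -1.03106383],
    [ 0.85585446, -1.79878344,  0.67321704,  0.05607552,
     -0.15542661, -1.41264124, -0.40136933, -1.17626611],
    [ 0.50465335,  2.28693419,  0.67128338,  0.2506863 ,
      1.78802234,  0.14775751, -0.11405725,  0.88026286],
])

# Row 2 of the attention matrix; word 3 is masked, so its weight is 0
weights_row2 = np.array([0.53753304, 0.27144826, 0.1910187])

# new_v[2] as an explicit weighted sum of the visible value vectors
manual_new_v2 = weights_row2 @ v_rows
print(manual_new_v2)
```

The result should agree with row 2 of `new_v` up to rounding of the printed digits.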

## Complete Self-Attention Function

Below, we encapsulate the entire self-attention pipeline into a reusable function.  
This function works for **both encoder and decoder**:
- **Encoder:** Call without mask → every word attends to every other word (full bidirectional context).
- **Decoder:** Call with mask → each word can only attend to itself and previous words (causal/autoregressive).

In a real Transformer with **multi-headed attention**, this function would be called multiple times (once per head) with different learned projections of Q, K, V, and the outputs would be concatenated.
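
The multi-head idea described above can be sketched roughly as follows. Everything here is a toy assumption: the sizes, the helper name `attention`, and the random matrices standing in for learned projections. The sketch also omits the final output projection that real implementations apply after concatenation.

```python
import math
import numpy as np

def attention(q, k, v):
    # Unmasked scaled dot-product attention for 2-D inputs
    scores = q @ k.T / math.sqrt(q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
L, d_model, num_heads = 4, 16, 2       # assumed toy sizes
d_head = d_model // num_heads

x = rng.standard_normal((L, d_model))  # stand-in for input embeddings

head_outputs = []
for _ in range(num_heads):
    # Random stand-ins for each head's learned projection matrices
    W_q = rng.standard_normal((d_model, d_head))
    W_k = rng.standard_normal((d_model, d_head))
    W_v = rng.standard_normal((d_model, d_head))
    head_outputs.append(attention(x @ W_q, x @ W_k, x @ W_v))

# Concatenate along the feature axis: (L, num_heads * d_head)
multi_head = np.concatenate(head_outputs, axis=-1)
print(multi_head.shape)
```

Because each head has its own projections, each one can learn to attend to a different kind of relationship between words.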

In [None]:
# ===================================================================
# COMPLETE SCALED DOT-PRODUCT ATTENTION FUNCTION
# ===================================================================
# This encapsulates the entire self-attention pipeline into one function.

def softmax(x):
    """
    Row-wise softmax: converts each row of scores into a probability distribution.
    
    Args:
        x: Input matrix of shape (L, L) — raw attention scores
    Returns:
        Matrix of same shape where each row sums to 1.0
    """
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

def scaled_dot_product_attention(q, k, v, mask=None):
    """
    Compute Scaled Dot-Product Attention.
    
    This is the CORE attention mechanism used in every Transformer layer.
    
    Formula: Attention(Q, K, V) = softmax( (Q · K^T) / √d_k + mask ) · V
    
    Args:
        q:    Query matrix  — shape (L, d_k) — "what am I looking for?"
        k:    Key matrix    — shape (L, d_k) — "what do I contain?"
        v:    Value matrix  — shape (L, d_v) — "what information do I provide?"
        mask: Optional mask — shape (L, L)   — additive mask (0 for allowed, -inf for blocked)
              • Pass None for ENCODER (no masking, full bidirectional attention)
              • Pass a causal additive mask for DECODER (-inf above the diagonal, 0 elsewhere)
    
    Returns:
        out:       New context-enriched value vectors — shape (L, d_v)
        attention: Attention weight matrix — shape (L, L), each row sums to 1
    
    Steps:
        1. Compute Q · K^T          → raw affinity scores        (L, L)
        2. Divide by √d_k           → stabilize variance         (L, L)
        3. Add mask (if provided)    → block future positions     (L, L)
        4. Apply softmax             → convert to probabilities   (L, L)
        5. Multiply by V             → weighted sum of values     (L, d_v)
    """
    # Step 1 & 2: Scaled dot product
    d_k = q.shape[-1]                                  # Get the dimension from Q's last axis
    scaled = np.matmul(q, k.T) / math.sqrt(d_k)       # (L, L) — scaled attention scores
    
    # Step 3: Apply mask (only in decoder, skip in encoder)
    if mask is not None:
        scaled = scaled + mask   # -inf entries → will become 0 after softmax
    
    # Step 4: Softmax — convert scores to probabilities
    attention = softmax(scaled)  # (L, L) — each row is a probability distribution
    
    # Step 5: Weighted sum of value vectors
    out = np.matmul(attention, v)  # (L, d_v) — new context-aware representations
    
    return out, attention
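
The function above assumes 2-D inputs: on arrays with a leading batch axis, `k.T` would reverse every axis rather than just the last two. One possible batched variant is sketched below. The name `batched_attention` and the batch size `B` are assumptions made here, and the softmax is written inline with the max-subtraction trick.

```python
import math
import numpy as np

def batched_attention(q, k, v, mask=None):
    """Scaled dot-product attention over inputs shaped (..., L, d)."""
    d_k = q.shape[-1]
    # Swap only the last two axes; k.T would reverse every axis
    scores = np.matmul(q, np.swapaxes(k, -1, -2)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores + mask   # an (L, L) mask broadcasts over the batch
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return np.matmul(weights, v), weights

rng = np.random.default_rng(0)
B, L, d_k, d_v = 2, 4, 8, 8      # B is an assumed batch size
q = rng.standard_normal((B, L, d_k))
k = rng.standard_normal((B, L, d_k))
v = rng.standard_normal((B, L, d_v))
causal = np.where(np.triu(np.ones((L, L), dtype=bool), k=1), -np.inf, 0.0)

out, weights = batched_attention(q, k, v, mask=causal)
print(out.shape, weights.shape)
```

With the causal mask, the first token of every sequence attends only to itself, so `out[:, 0]` equals `v[:, 0]`.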

In [None]:
# -------------------------------------------------------------------
# Run the complete self-attention function and display all results
# -------------------------------------------------------------------
# Call with mask=mask → Decoder-style causal attention
# (To use for Encoder, simply pass mask=None for full bidirectional attention)
values, attention = scaled_dot_product_attention(q, k, v, mask=mask)

# Display all inputs and outputs for comparison:
print("Q\n", q)            # Original query vectors (what each word is looking for)
print("K\n", k)            # Original key vectors (what each word offers)
print("V\n", v)            # Original value vectors (each word's raw information)
print("New V\n", values)   # Context-enriched value vectors (after attention blending)
print("Attention\n", attention)  # Attention weights — shows how much each word attends to others
                                 # Row i, Col j = how much word i attends to word j
                                 # Upper triangle should be 0 (masked future words)

Q
 [[ 0.11672673 -2.54870451 -1.44065948  0.93661829  1.36278968  1.04252277
  -0.01310938 -1.3163937 ]
 [ 0.26721599 -0.90218255  0.07417847 -0.10430246  0.52684253 -0.07081531
  -0.60511725 -0.55225527]
 [-0.93297509  0.28724456  1.37184579  0.41589874  0.34981245 -0.24753755
  -1.24497125  0.05044148]
 [-0.11414585 -0.01545749 -0.58376828 -0.40193907  0.93931836 -1.94334363
  -0.34770465  1.50103406]]
K
 [[ 1.1226585  -0.85645535  0.54315044  1.36560451  0.52539476 -0.94502504
  -0.48444661  0.46268014]
 [-0.53713766 -1.16937329 -0.57988617  0.92713577 -0.85995607 -0.40352635
   0.26555146 -1.83159914]
 [-2.06994435 -0.09514715 -1.64928361 -0.17375184  0.13146819 -1.76335363
   1.56568846  0.69751826]
 [ 0.32910684 -0.1939204  -0.80444134  0.78816869  0.35599408  0.28309835
  -0.25970963  1.49744622]]
V
 [[-0.00368231  1.43739233 -0.59614565 -1.23171219  1.12030717 -0.98620738
  -0.15461465 -1.03106383]
 [ 0.85585446 -1.79878344  0.67321704  0.05607552 -0.15542661 -1.41264124
  -0.4