# Understanding Transformers: A Deep Dive into Attention Mechanisms

Welcome! This tutorial will help you understand how transformers work, with a special focus on the **attention mechanism** - the core innovation that makes transformers so powerful.

## What You'll Learn
1. Why transformers were invented
2. The attention mechanism (in detail!)
3. Self-attention step by step
4. Multi-head attention
5. Building a simple transformer from scratch

Let's get started!

In [None]:
# Install required packages (uncomment if needed)
# !pip install numpy matplotlib torch

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("Libraries imported successfully!")

## Part 1: Why Do We Need Attention?

Before transformers, we used **Recurrent Neural Networks (RNNs)** for sequence tasks. But RNNs had problems:

1. **Sequential Processing**: RNNs process words one at a time, so each step has to wait for the previous one
2. **Memory Issues**: Information from early in a long sequence fades as it passes through many steps
3. **No Parallelization**: Because of that step-by-step dependency, the words of a sequence can't be processed simultaneously

**Attention** solves these problems by letting the model look at ALL words at once and decide which ones are important!

### Intuition: How Humans Read

Consider this sentence: "The animal didn't cross the street because **it** was too tired."

When you read "it", your brain automatically knows it refers to "animal" (not "street"). You **attend** to the relevant word. That's what attention mechanisms do!

## Part 2: Understanding Attention Step by Step

### The Core Idea

Attention asks three questions for each word:
1. **Query (Q)**: What am I looking for?
2. **Key (K)**: What do I contain?
3. **Value (V)**: What do I actually represent?

Think of it like a database:
- **Query**: Your search term
- **Key**: The index that helps you find relevant items
- **Value**: The actual data you retrieve
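
To make the database analogy concrete, here is a minimal sketch of a "soft" lookup. The toy keys, values, and query are made up for illustration; the point is that attention returns a weighted mix of all values instead of a single exact match.

```python
import numpy as np

# Hypothetical toy "database": three keys, each with one stored value.
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [-1.0, 0.0]])
values = np.array([[10.0], [20.0], [30.0]])

# A query that points in the same direction as the first key.
query = np.array([4.0, 0.0])

# Similarity of the query to every key (dot products): [4, 0, -4].
scores = keys @ query

# Softmax turns the similarities into weights that sum to 1.
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

# The soft lookup blends all values, dominated by the best-matching key.
retrieved = weights @ values

print("weights:  ", np.round(weights, 3))
print("retrieved:", np.round(retrieved, 3))
```

Because the first key matches the query best, the retrieved value lands close to 10, with small contributions from the other two entries.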

### The Attention Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let's break this down with a simple example!

In [None]:
# Let's start with a simple example: "The cat sat"
# We'll represent each word with a vector (in practice, these come from embeddings)

sentence = ["The", "cat", "sat"]
seq_len = len(sentence)  # number of words in the sentence
d_model = 4  # dimension of our word vectors (kept small for clarity)

# Create simple word embeddings (random for demonstration)
embeddings = np.random.randn(seq_len, d_model)

print("Our sentence:", sentence)
print("\nWord embeddings (each word is a", d_model, "dimensional vector):")
for i, word in enumerate(sentence):
    print(f"{word}: {embeddings[i]}")

### Step 1: Create Query, Key, and Value matrices

We create Q, K, and V by multiplying our embeddings with learned weight matrices.

Think of these weight matrices as "projections" that transform our words into different representations for different purposes.

In [None]:
# Initialize weight matrices (in practice, these are learned during training)
d_k = d_model  # dimension of keys and queries
d_v = d_model  # dimension of values

W_q = np.random.randn(d_model, d_k)  # Query weight matrix
W_k = np.random.randn(d_model, d_k)  # Key weight matrix
W_v = np.random.randn(d_model, d_v)  # Value weight matrix

# Compute Q, K, V for all words
Q = embeddings @ W_q  # (3, 4) @ (4, 4) = (3, 4)
K = embeddings @ W_k
V = embeddings @ W_v

print("Query matrix Q (shape:", Q.shape, ")")
print(Q)
print("\nKey matrix K (shape:", K.shape, ")")
print(K)
print("\nValue matrix V (shape:", V.shape, ")")
print(V)

### Step 2: Calculate Attention Scores

Now we compute how much each word should "attend" to every other word.

We do this by multiplying Q by K^T, which takes the dot product of every query with every key. The dot product measures how well a query matches a key:
- High score = words are related
- Low score = words are not related
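
As a quick illustration of why the dot product works as a similarity score, here is a small sketch with hand-picked vectors (all of them hypothetical): a key pointing the same way as the query scores high, an orthogonal key scores zero, and an opposed key scores negative.

```python
import numpy as np

query = np.array([1.0, 2.0, 0.0])

aligned_key = np.array([2.0, 4.0, 0.0])     # same direction as the query
orthogonal_key = np.array([0.0, 0.0, 3.0])  # shares no direction with the query
opposed_key = np.array([-1.0, -2.0, 0.0])   # points the opposite way

aligned_score = query @ aligned_key
orthogonal_score = query @ orthogonal_key
opposed_score = query @ opposed_key

print("aligned:   ", aligned_score)
print("orthogonal:", orthogonal_score)
print("opposed:   ", opposed_score)
```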

In [None]:
# Calculate attention scores: Q @ K^T
attention_scores = Q @ K.T  # (3, 4) @ (4, 3) = (3, 3)

print("Raw attention scores:")
print(attention_scores)
print("\nShape:", attention_scores.shape)
print("\nInterpretation:")
print("- Row i, Column j: How much word i attends to word j")
print(f"- For example, '{sentence[1]}' attending to '{sentence[0]}': {attention_scores[1, 0]:.2f}")

### Step 3: Scale the Scores

We divide by √d_k to prevent the dot products from getting too large.

**Why?** Large values push the softmax function into regions with tiny gradients, making training difficult.
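
A rough numerical check of this, assuming the query and key components are independent with unit variance: the spread of the raw dot products grows like √d_k, and dividing by √d_k brings it back to about 1. The helper name `dot_product_std` and the sample sizes are choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=5000):
    """Empirical std of q.k (raw and scaled) for random unit-variance q and k."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (4, 64, 256):
    raw_std, scaled_std = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:.2f}")
```

The raw standard deviation should come out near 2, 8, and 16 for these three sizes, while the scaled one stays near 1 regardless of d_k.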

In [None]:
# Scale by square root of dimension
scaled_scores = attention_scores / np.sqrt(d_k)

print("Scaled attention scores:")
print(scaled_scores)
print("\nScaling factor (√d_k):", np.sqrt(d_k))

### Step 4: Apply Softmax

Softmax converts each row of scores into a probability distribution: every weight is positive and each row sums to 1.

This gives us **attention weights** - how much focus each word gets.
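
Before applying softmax to the real scores, it can help to see how it behaves on a tiny vector. The sketch below uses a hypothetical helper named `softmax_1d` and shows three things: the weights sum to 1, adding a constant to every score changes nothing, and larger scores make the distribution sharper (which is exactly why unscaled scores are a problem).

```python
import numpy as np

def softmax_1d(scores):
    """Softmax over a 1-D array, shifted by the max for numerical stability."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

scores = np.array([1.0, 2.0, 3.0])

weights = softmax_1d(scores)               # moderate scores: soft distribution
same_weights = softmax_1d(scores + 100.0)  # shifting all scores changes nothing
peaked = softmax_1d(scores * 10.0)         # larger scores: nearly one-hot

print("weights:", np.round(weights, 3))
print("shifted:", np.round(same_weights, 3))
print("peaked: ", np.round(peaked, 3))
```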

In [None]:
# Apply softmax to each row
def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))  # subtract max for numerical stability
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

attention_weights = softmax(scaled_scores)

print("Attention weights:")
print(attention_weights)
print("\nEach row sums to 1:", attention_weights.sum(axis=1))

# Visualize attention weights
plt.figure(figsize=(8, 6))
plt.imshow(attention_weights, cmap='Blues', aspect='auto')
plt.colorbar(label='Attention Weight')
plt.xticks(range(len(sentence)), sentence)
plt.yticks(range(len(sentence)), sentence)
plt.xlabel('Key (attending TO)')
plt.ylabel('Query (attending FROM)')
plt.title('Attention Weight Visualization')

# Add text annotations
for i in range(len(sentence)):
    for j in range(len(sentence)):
        plt.text(j, i, f'{attention_weights[i, j]:.2f}', 
                ha='center', va='center', color='red')

plt.tight_layout()
plt.show()

### Step 5: Compute Weighted Sum

Finally, we multiply attention weights by the Value matrix.

This creates a new representation for each word that incorporates information from all other words!
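
Two extreme cases give a feel for what this weighted sum does. In the sketch below (toy numbers, chosen for illustration), one-hot attention weights simply copy one row of the value matrix, while uniform weights return the average of all rows. Real attention weights fall somewhere in between.

```python
import numpy as np

# Hypothetical value matrix: 3 words, 2 features each.
V_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])

one_hot = np.array([[0.0, 1.0, 0.0]])  # all attention on the second word
uniform = np.full((1, 3), 1.0 / 3.0)   # equal attention on every word

copied = one_hot @ V_demo    # picks out the second row of V_demo
averaged = uniform @ V_demo  # mean of all rows of V_demo

print("one-hot weights ->", copied)
print("uniform weights ->", averaged)
```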

In [None]:
# Multiply attention weights by values
output = attention_weights @ V  # (3, 3) @ (3, 4) = (3, 4)

print("Output after attention:")
print(output)
print("\nShape:", output.shape)
print("\nNotice: Each word now has a new representation that combines information from all words!")

## Part 3: Self-Attention Implementation

Now let's put it all together in a clean function!

In [None]:
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.
    
    Args:
        Q: Query matrix (batch_size, seq_len, d_k)
        K: Key matrix (batch_size, seq_len, d_k)
        V: Value matrix (batch_size, seq_len, d_v)
        mask: Optional mask (batch_size, seq_len, seq_len)
    
    Returns:
        output: Attention output (batch_size, seq_len, d_v)
        attention_weights: Attention weights (batch_size, seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    
    # Step 1: Compute attention scores
    # Swap the last two axes of K so this works for both 2-D and batched NumPy inputs
    scores = Q @ np.swapaxes(K, -2, -1)  # (batch, seq_len, seq_len)
    
    # Step 2: Scale
    scores = scores / np.sqrt(d_k)
    
    # Step 3: Apply mask (if provided)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    
    # Step 4: Softmax
    attention_weights = softmax(scores)
    
    # Step 5: Weighted sum
    output = attention_weights @ V
    
    return output, attention_weights

# Test our function
output, weights = scaled_dot_product_attention(Q, K, V)
print("Output from our attention function:")
print(output)
print("\nAttention weights:")
print(weights)

## Part 4: Multi-Head Attention

**Key Insight**: Different words might be related in different ways!

Examples:
- Grammatical relationships (subject-verb)
- Semantic relationships (synonyms, antonyms)
- Positional relationships (nearby words)

**Multi-head attention** runs attention multiple times in parallel, each learning different types of relationships.

### How it works:
1. Split Q, K, V into multiple "heads"
2. Run attention independently on each head
3. Concatenate the results
4. Apply a final linear transformation
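
The first three steps can be sketched in a few lines of NumPy. This is a simplified illustration, not a full layer: it assumes Q = K = V = x (no learned projections) and leaves out the final linear transformation, and it handles all heads at once through broadcasting instead of a loop.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, d_model, num_heads = 1, 3, 8, 2
d_head = d_model // num_heads

def split_heads(x):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
    return x.reshape(batch_size, seq_len, num_heads, d_head).transpose(0, 2, 1, 3)

def combine_heads(x):
    # (batch, num_heads, seq_len, d_head) -> (batch, seq_len, d_model)
    return x.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

x = rng.standard_normal((batch_size, seq_len, d_model))

# Step 1: split into heads (projections skipped, an assumption of this sketch).
Qh = Kh = Vh = split_heads(x)

# Step 2: attention on every head at once; matmul broadcasts over batch and head.
scores = Qh @ np.swapaxes(Kh, -2, -1) / np.sqrt(d_head)  # (1, 2, 3, 3)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
per_head = weights @ Vh                                   # (1, 2, 3, 4)

# Step 3: concatenate the heads back into d_model features.
combined = combine_heads(per_head)                        # (1, 3, 8)

print("per-head output:", per_head.shape)
print("combined output:", combined.shape)
```

Splitting and then combining is lossless: `combine_heads(split_heads(x))` gives back `x` unchanged, so the heads only partition the feature dimension.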

In [None]:
class MultiHeadAttention:
    def __init__(self, d_model, num_heads):
        """
        Multi-head attention layer.
        
        Args:
            d_model: Dimension of the model
            num_heads: Number of attention heads
        """
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension per head
        
        # Weight matrices for all heads (combined)
        self.W_q = np.random.randn(d_model, d_model) * 0.01
        self.W_k = np.random.randn(d_model, d_model) * 0.01
        self.W_v = np.random.randn(d_model, d_model) * 0.01
        self.W_o = np.random.randn(d_model, d_model) * 0.01  # output projection
    
    def split_heads(self, x):
        """
        Split the last dimension into (num_heads, d_k).
        
        Input shape: (batch_size, seq_len, d_model)
        Output shape: (batch_size, num_heads, seq_len, d_k)
        """
        batch_size, seq_len, d_model = x.shape
        x = x.reshape(batch_size, seq_len, self.num_heads, self.d_k)
        return x.transpose(0, 2, 1, 3)  # (batch_size, num_heads, seq_len, d_k)
    
    def combine_heads(self, x):
        """
        Combine heads back into single dimension.
        
        Input shape: (batch_size, num_heads, seq_len, d_k)
        Output shape: (batch_size, seq_len, d_model)
        """
        batch_size, num_heads, seq_len, d_k = x.shape
        x = x.transpose(0, 2, 1, 3)  # (batch_size, seq_len, num_heads, d_k)
        return x.reshape(batch_size, seq_len, self.d_model)
    
    def forward(self, x, mask=None):
        """
        Forward pass of multi-head attention.
        
        Args:
            x: Input (batch_size, seq_len, d_model)
            mask: Optional mask
        
        Returns:
            output: Attention output (batch_size, seq_len, d_model)
            attention_weights: Weights from all heads
        """
        batch_size = x.shape[0]
        
        # Linear projections
        Q = x @ self.W_q  # (batch_size, seq_len, d_model)
        K = x @ self.W_k
        V = x @ self.W_v
        
        # Split into multiple heads
        Q = self.split_heads(Q)  # (batch_size, num_heads, seq_len, d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)
        
        # Apply attention to each head
        attention_outputs = []
        attention_weights_list = []
        
        for i in range(self.num_heads):
            output, weights = scaled_dot_product_attention(
                Q[:, i, :, :], K[:, i, :, :], V[:, i, :, :], mask
            )
            attention_outputs.append(output)
            attention_weights_list.append(weights)
        
        # Stack outputs from all heads
        attention_output = np.stack(attention_outputs, axis=1)  # (batch, num_heads, seq_len, d_k)
        
        # Combine heads
        output = self.combine_heads(attention_output)  # (batch_size, seq_len, d_model)
        
        # Final linear projection
        output = output @ self.W_o
        
        return output, attention_weights_list

# Test multi-head attention
d_model = 8
num_heads = 2
seq_len = 3
batch_size = 1

# Create input (batch_size, seq_len, d_model)
x = np.random.randn(batch_size, seq_len, d_model)

mha = MultiHeadAttention(d_model, num_heads)
output, attention_weights = mha.forward(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"\nNumber of attention heads: {num_heads}")
print(f"Dimension per head: {mha.d_k}")

# Visualize attention from different heads
fig, axes = plt.subplots(1, num_heads, figsize=(12, 4))
for i in range(num_heads):
    ax = axes[i] if num_heads > 1 else axes
    im = ax.imshow(attention_weights[i][0], cmap='Blues', aspect='auto')
    ax.set_title(f'Head {i+1}')
    ax.set_xlabel('Key')
    ax.set_ylabel('Query')
    plt.colorbar(im, ax=ax)

plt.tight_layout()
plt.suptitle('Attention Patterns from Different Heads', y=1.02)
plt.show()

print("\nNotice: Each head has its own attention pattern. These weights are random; training is what makes heads specialize.")

## Part 5: Building a Complete Transformer Block

A transformer block consists of:
1. **Multi-head attention layer**
2. **Add & Normalize** (residual connection + layer normalization)
3. **Feed-forward network** (2 linear layers with activation)
4. **Add & Normalize** (another residual connection)
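
To see what "Add & Normalize" does on its own, here is a small NumPy sketch. It assumes a plain layer normalization without the learnable gain and bias that `nn.LayerNorm` includes, and it uses random numbers as a stand-in for the output of the attention or feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))             # 3 tokens, 8 features each
sublayer_out = rng.standard_normal((3, 8))  # stand-in for attention or FFN output

# Residual connection ("Add"), then layer normalization ("Normalize").
y = layer_norm(x + sublayer_out)

print("per-token mean:", np.round(y.mean(axis=-1), 4))
print("per-token std: ", np.round(y.std(axis=-1), 4))
```

The residual connection keeps the original input in the sum, and the normalization rescales every token independently of the others.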

Let's implement this using PyTorch for cleaner code!

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        """
        A single transformer block.
        
        Args:
            d_model: Model dimension
            num_heads: Number of attention heads
            d_ff: Dimension of feed-forward network
            dropout: Dropout rate
        """
        super().__init__()
        
        # Multi-head attention
        self.attention = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        
        # Feed-forward network
        self.ff_network = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Forward pass.
        
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            mask: Optional attention mask
        
        Returns:
            Output tensor (batch_size, seq_len, d_model)
        """
        # Multi-head attention with residual connection
        attn_output, _ = self.attention(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward network with residual connection
        ff_output = self.ff_network(x)
        x = self.norm2(x + ff_output)
        
        return x

# Test the transformer block
d_model = 64
num_heads = 8
d_ff = 256
batch_size = 2
seq_len = 10

# Create random input
x = torch.randn(batch_size, seq_len, d_model)

# Create transformer block
transformer_block = TransformerBlock(d_model, num_heads, d_ff)

# Forward pass
output = transformer_block(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"\nTransformer block parameters:")
print(f"  - Model dimension (d_model): {d_model}")
print(f"  - Number of heads: {num_heads}")
print(f"  - Feed-forward dimension: {d_ff}")
print(f"\nTotal parameters: {sum(p.numel() for p in transformer_block.parameters()):,}")

## Part 6: Positional Encoding

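**Problem**: Attention has no sense of word order!

Without positional information, shuffling the input words only shuffles the outputs in the same way. Each word in "The cat sat" and "Sat cat the" would get exactly the same representation.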

**Solution**: Add positional information to word embeddings.

We use sine and cosine functions at different frequencies:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
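
Here is a hedged sketch that ties the problem and the formulas together. `sinusoidal_encoding` is a direct, loop-based transcription of the two equations above, and `self_attention` is a stripped-down attention with Q = K = V = x (both names and the simplification are assumptions of this sketch). Random vectors stand in for the word embeddings.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Loop-based transcription of the sine/cosine formulas."""
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos, 2 * i] = np.sin(angle)      # even dimension
            pe[pos, 2 * i + 1] = np.cos(angle)  # odd dimension
    return pe

def self_attention(x):
    # Identity projections (Q = K = V = x) keep the sketch short.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
words = rng.standard_normal((3, 4))  # stand-ins for "The", "cat", "sat"
perm = [2, 1, 0]                     # reversed order: "sat", "cat", "The"
pe = sinusoidal_encoding(3, 4)

# Without positions: shuffling the words only shuffles the outputs.
plain = self_attention(words)
plain_shuffled = self_attention(words[perm])
print("order ignored without PE:", np.allclose(plain_shuffled, plain[perm]))

# With positions: a word's output now depends on where it sits.
with_pos = self_attention(words + pe)
with_pos_shuffled = self_attention(words[perm] + pe)
print("order ignored with PE:   ", np.allclose(with_pos_shuffled, with_pos[perm]))
```

The first check should report that order is ignored, and the second should report that it is not, because the same word now receives a different positional vector in each ordering.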

In [None]:
def get_positional_encoding(seq_len, d_model):
    """
    Generate positional encodings.
    
    Args:
        seq_len: Sequence length
        d_model: Model dimension
    
    Returns:
        Positional encodings (seq_len, d_model)
    """
    position = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    pos_encoding = np.zeros((seq_len, d_model))
    pos_encoding[:, 0::2] = np.sin(position * div_term)  # even indices
    pos_encoding[:, 1::2] = np.cos(position * div_term)  # odd indices
    
    return pos_encoding

# Visualize positional encodings
seq_len = 50
d_model = 128

pos_encoding = get_positional_encoding(seq_len, d_model)

plt.figure(figsize=(12, 6))
plt.imshow(pos_encoding, cmap='RdBu', aspect='auto')
plt.colorbar(label='Encoding Value')
plt.xlabel('Dimension')
plt.ylabel('Position')
plt.title('Positional Encoding Visualization')
plt.tight_layout()
plt.show()

print(f"Positional encoding shape: {pos_encoding.shape}")
print("\nNotice the wavelike patterns! Each position gets a unique encoding.")

# Show how positions differ
plt.figure(figsize=(10, 4))
for i in [0, 10, 20, 30, 40]:
    plt.plot(pos_encoding[i, :50], label=f'Position {i}')
plt.xlabel('Dimension')
plt.ylabel('Value')
plt.title('Positional Encodings for Different Positions (first 50 dims)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Part 7: Putting It All Together - Complete Transformer

Now let's build a complete transformer model with:
1. Input embedding
2. Positional encoding
3. Multiple transformer blocks
4. Output projection

In [None]:
class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_len, dropout=0.1):
        """
        A simple transformer model.
        
        Args:
            vocab_size: Size of vocabulary
            d_model: Model dimension
            num_heads: Number of attention heads
            num_layers: Number of transformer blocks
            d_ff: Feed-forward dimension
            max_seq_len: Maximum sequence length
            dropout: Dropout rate
        """
        super().__init__()
        
        self.d_model = d_model
        
        # Token embedding
        self.embedding = nn.Embedding(vocab_size, d_model)
        
        # Positional encoding (fixed, not learned)
        self.register_buffer('pos_encoding', 
                           torch.FloatTensor(get_positional_encoding(max_seq_len, d_model)))
        
        # Transformer blocks
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        
        # Output projection
        self.output_layer = nn.Linear(d_model, vocab_size)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Forward pass.
        
        Args:
            x: Input token indices (batch_size, seq_len)
            mask: Optional attention mask
        
        Returns:
            Output logits (batch_size, seq_len, vocab_size)
        """
        seq_len = x.size(1)
        
        # Embed tokens and scale
        x = self.embedding(x) * np.sqrt(self.d_model)
        
        # Add positional encoding
        x = x + self.pos_encoding[:seq_len, :]
        x = self.dropout(x)
        
        # Pass through transformer blocks
        for block in self.transformer_blocks:
            x = block(x, mask)
        
        # Project to vocabulary
        logits = self.output_layer(x)
        
        return logits

# Create a small transformer
vocab_size = 1000
d_model = 64
num_heads = 4
num_layers = 2
d_ff = 256
max_seq_len = 100

model = SimpleTransformer(vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_len)

# Test with random input
batch_size = 2
seq_len = 20
x = torch.randint(0, vocab_size, (batch_size, seq_len))

output = model(x)

print("Model Architecture:")
print(model)
print(f"\nInput shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

# Count parameters by component
print("\nParameters by component:")
print(f"  Embedding: {model.embedding.weight.numel():,}")
print(f"  Transformer blocks: {sum(p.numel() for block in model.transformer_blocks for p in block.parameters()):,}")
print(f"  Output layer: {model.output_layer.weight.numel() + model.output_layer.bias.numel():,}")

## Part 8: Real-World Example - Attention Visualization

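Let's visualize attention on an example sentence! The embeddings and the attention layer below are randomly initialized, so the patterns show the mechanics rather than learned linguistic relationships.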

In [None]:
def visualize_attention_pattern(sentence, attention_weights, head_idx=0):
    """
    Visualize attention pattern for a sentence.
    
    Args:
        sentence: List of words
        attention_weights: Attention weights tensor
        head_idx: Which attention head to visualize
    """
    # Get attention for specified head
    attn = attention_weights[head_idx].detach().numpy()
    
    fig, ax = plt.subplots(figsize=(10, 8))
    im = ax.imshow(attn, cmap='Blues', aspect='auto')
    
    # Set ticks and labels
    ax.set_xticks(range(len(sentence)))
    ax.set_yticks(range(len(sentence)))
    ax.set_xticklabels(sentence, rotation=45, ha='right')
    ax.set_yticklabels(sentence)
    
    ax.set_xlabel('Attending TO (Key)')
    ax.set_ylabel('Attending FROM (Query)')
    ax.set_title(f'Attention Pattern (Head {head_idx})')
    
    # Add colorbar
    cbar = plt.colorbar(im, ax=ax)
    cbar.set_label('Attention Weight')
    
    # Add value annotations
    for i in range(len(sentence)):
        for j in range(len(sentence)):
            text = ax.text(j, i, f'{attn[i, j]:.2f}',
                         ha='center', va='center', 
                         color='red' if attn[i, j] > 0.3 else 'black',
                         fontsize=8)
    
    plt.tight_layout()
    plt.show()

# Create a simple example
example_sentence = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = len(example_sentence)
d_model = 64
num_heads = 4

# Create simple embeddings
embeddings = torch.randn(1, seq_len, d_model)

# Create attention layer
attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# Get attention weights
with torch.no_grad():
    _, attention_weights = attention(embeddings, embeddings, embeddings, average_attn_weights=False)

# Visualize different heads
print("Attention Patterns from Different Heads")
print("Each head produces its own pattern. This layer is untrained, so the patterns are not yet meaningful.\n")

for head_idx in range(min(2, num_heads)):
    visualize_attention_pattern(example_sentence, attention_weights[0], head_idx)

## Part 9: Key Takeaways & Summary

### What We Learned:

1. **Attention Mechanism**
   - Allows model to focus on relevant parts of input
   - Uses Query, Key, Value paradigm
   - Formula: Attention(Q,K,V) = softmax(QK^T / √d_k)V

2. **Self-Attention**
   - Each word attends to all words (including itself)
   - Captures relationships between words
   - Parallel processing (unlike RNNs)

3. **Multi-Head Attention**
   - Multiple attention mechanisms in parallel
   - Each head learns different relationships
   - More expressive than single attention

4. **Transformer Block**
   - Multi-head attention + Feed-forward network
   - Residual connections + Layer normalization
   - Stack multiple blocks for deeper models

5. **Positional Encoding**
   - Adds word order information
   - Sine/cosine functions at different frequencies
   - Added to input embeddings

### Why Transformers Are Powerful:

✅ **Parallelization**: Process all words simultaneously
✅ **Long-range dependencies**: Can attend to any word regardless of distance
✅ **Flexibility**: Same architecture for many tasks (translation, generation, etc.)
✅ **Scalability**: Can be trained on massive datasets

### Next Steps:

1. Study encoder-decoder architecture (for translation)
2. Learn about different transformer variants (BERT, GPT, T5)
3. Understand training techniques (learning rate schedules, warmup)
4. Explore applications (NLP, vision transformers, protein folding)

### Recommended Resources:

- "Attention Is All You Need" paper (Vaswani et al., 2017)
- The Illustrated Transformer (Jay Alammar)
- Stanford CS224N: Natural Language Processing
- Hugging Face Transformers library documentation

## Bonus: Interactive Exercise

Try modifying the code to experiment with different configurations!

Ideas to explore:
1. Change the number of attention heads - what happens?
2. Modify the dimension of the model (d_model)
3. Add more transformer blocks
4. Try different sentences and observe attention patterns
5. Implement causal masking for a decoder (prevent attending to future tokens)
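
As a starting point for the masking idea, here is one possible sketch in NumPy. It assumes the convention used by the `mask == 0` check in `scaled_dot_product_attention` earlier in this tutorial: 1 means "may attend" and 0 means "blocked". The scores are random placeholders.

```python
import numpy as np

seq_len = 4

# Lower-triangular mask: position i may attend to positions 0..i only.
causal_mask = np.tril(np.ones((seq_len, seq_len)))

rng = np.random.default_rng(0)
scores = rng.standard_normal((seq_len, seq_len))  # placeholder attention scores

# Blocked positions get a very negative score, so softmax gives them ~0 weight.
masked = np.where(causal_mask == 0, -1e9, scores)
masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)
weights = weights / weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

Every entry above the diagonal comes out as zero, so no word receives information from later words. Note that a boolean `attn_mask` passed to PyTorch's `nn.MultiheadAttention` uses the opposite convention (`True` marks positions that may not be attended to), so a mask built this way would need to be inverted before being passed to `TransformerBlock`.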

In [None]:
# YOUR EXPERIMENTS HERE!
# Try changing these parameters:

d_model = 64          # Try: 32, 64, 128, 256
num_heads = 4         # Try: 1, 2, 4, 8 (must divide d_model)
num_layers = 2        # Try: 1, 2, 4, 6
d_ff = 256           # Try: 128, 256, 512, 1024

# Build and test your model!
vocab_size = 1000
max_seq_len = 100

custom_model = SimpleTransformer(
    vocab_size=vocab_size,
    d_model=d_model,
    num_heads=num_heads,
    num_layers=num_layers,
    d_ff=d_ff,
    max_seq_len=max_seq_len
)

print(f"Custom model parameters: {sum(p.numel() for p in custom_model.parameters()):,}")

# Test it!
test_input = torch.randint(0, vocab_size, (1, 10))
test_output = custom_model(test_input)
print(f"Input shape: {test_input.shape}")
print(f"Output shape: {test_output.shape}")

## Congratulations! 🎉

You now understand how transformers work, especially the attention mechanism!

Remember:
- **Attention** is about focusing on relevant information
- **Self-attention** lets words interact with each other
- **Multi-head attention** learns multiple types of relationships
- **Transformers** stack these mechanisms to build powerful models

Keep learning and experimenting! 🚀