# Module 16: Transformers

**Goal:** Understand how attention works and why transformers revolutionized NLP.

**Prerequisites:** Module 15 (Neural Networks)

**Expected Runtime:** ~25 minutes

**Outputs:**
- Implemented basic attention mechanism
- Visualized attention patterns
- Used pre-trained transformers

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
plt.rcParams['figure.figsize'] = (12, 5)

## Part 1: The Attention Mechanism from Scratch

Let's implement self-attention step by step.

In [None]:
def softmax(x):
    """Compute softmax along last axis."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """
    Compute scaled dot-product attention.
    
    Q: Query matrix (seq_len, d_k)
    K: Key matrix (seq_len, d_k)
    V: Value matrix (seq_len, d_v)
    
    Returns: 
        output: Attention output (seq_len, d_v)
        weights: Attention weights (seq_len, seq_len)
    """
    d_k = K.shape[-1]
    
    # Step 1: Compute attention scores (Q @ K^T)
    scores = Q @ K.T
    
    # Step 2: Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)
    
    # Step 3: Apply softmax to get attention weights
    weights = softmax(scores)
    
    # Step 4: Weighted sum of values
    output = weights @ V
    
    return output, weights

print("✓ Attention function defined")
print("\nFormula: Attention(Q, K, V) = softmax(QK^T / √d_k) × V")

In [None]:
# Simple example with 4 tokens, embedding dimension 8
np.random.seed(42)

seq_len = 4
d_model = 8

# Simulate input embeddings
tokens = ['The', 'cat', 'sat', 'down']
X = np.random.randn(seq_len, d_model)  # Input embeddings

# Weight matrices for Q, K, V
W_Q = np.random.randn(d_model, d_model) * 0.1
W_K = np.random.randn(d_model, d_model) * 0.1
W_V = np.random.randn(d_model, d_model) * 0.1

# Compute Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Run attention
output, weights = attention(Q, K, V)

print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"\nWeights sum per row: {weights.sum(axis=1)}")
print("(Each row sums to 1 after softmax)")
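
Step 2 of `attention` divides the scores by sqrt(d_k). The sketch below is one way to see why: without scaling, dot products grow with `d_k` and softmax puts nearly all weight on a single token. It assumes unit-variance random `Q` and `K`. The helper `mean_max_weight` and its parameters are hypothetical names introduced for this illustration only.

```python
import numpy as np

def softmax(x):
    """Compute softmax along the last axis."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def mean_max_weight(d_k, scale, seq_len=16, trials=200, seed=0):
    """Average of the largest attention weight per row over random Q and K.

    Values near 1.0 mean softmax has saturated onto a single token.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        Q = rng.standard_normal((seq_len, d_k))
        K = rng.standard_normal((seq_len, d_k))
        scores = Q @ K.T
        if scale:
            scores = scores / np.sqrt(d_k)
        total += softmax(scores).max(axis=-1).mean()
    return total / trials

for d_k in (4, 64, 512):
    unscaled = mean_max_weight(d_k, scale=False)
    scaled = mean_max_weight(d_k, scale=True)
    print(f"d_k={d_k:4d}  unscaled max weight: {unscaled:.2f}  scaled: {scaled:.2f}")
```

With scaling, the largest weight should stay roughly stable as `d_k` grows; without it, the rows become close to one-hot, which leaves very small gradients for training.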

## Part 2: Visualizing Attention Patterns

In [None]:
# Visualize attention weights
fig, ax = plt.subplots(figsize=(6, 5))

sns.heatmap(weights, annot=True, fmt='.2f', cmap='RdPu',
            xticklabels=tokens, yticklabels=tokens, ax=ax)

ax.set_xlabel('Attending to (Keys)')
ax.set_ylabel('Token (Queries)')
ax.set_title('Self-Attention Weights')

plt.tight_layout()
plt.show()

print("💡 Each row shows how much one token 'attends' to every token in the sequence.")
print("   Row 'cat' shows how strongly 'cat' attends to each token, itself included.")

## Part 3: How Attention Captures Meaning

Let's see how attention could resolve ambiguity.

In [None]:
# Simulate attention for an ambiguous sentence
# "The bank by the river had many fish"

tokens_amb = ['The', 'bank', 'by', 'the', 'river', 'had', 'many', 'fish']
n = len(tokens_amb)

# Create mock attention weights that show "bank" attending to "river" and "fish"
# (learned to disambiguate bank = riverbank, not financial institution)
mock_attention = np.ones((n, n)) * 0.05  # Low baseline

# Each token attends somewhat to itself
np.fill_diagonal(mock_attention, 0.3)

# "bank" (idx 1) attends to "river" (idx 4) and "fish" (idx 7)
mock_attention[1, 4] = 0.35  # river
mock_attention[1, 7] = 0.2   # fish

# "fish" attends to "river" and "bank"
mock_attention[7, 4] = 0.3
mock_attention[7, 1] = 0.25

# Normalize rows
mock_attention = mock_attention / mock_attention.sum(axis=1, keepdims=True)

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))

sns.heatmap(mock_attention, annot=True, fmt='.2f', cmap='RdPu',
            xticklabels=tokens_amb, yticklabels=tokens_amb, ax=ax)

ax.set_xlabel('Attending to')
ax.set_ylabel('Token')
ax.set_title('"The bank by the river had many fish" - Attention Disambiguates')

plt.tight_layout()
plt.show()

print("💡 Notice how 'bank' attends strongly to 'river' and 'fish'")
print("   This context helps the model understand bank = riverbank, not financial bank")
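
The heatmap shows the weights, but what the model actually uses is the weighted average of value vectors that those weights produce. Here is a minimal sketch of that step for the "bank" row. It rebuilds only the "bank" part of the mock weights, and the value matrix `V` is random stand-in data (an assumption, since the cell above defines no values).

```python
import numpy as np

tokens_amb = ['The', 'bank', 'by', 'the', 'river', 'had', 'many', 'fish']
n = len(tokens_amb)

# Hand-built mock weights for the "bank" row, as in the cell above (not learned).
mock_attention = np.full((n, n), 0.05)
np.fill_diagonal(mock_attention, 0.3)
mock_attention[1, 4] = 0.35  # bank -> river
mock_attention[1, 7] = 0.2   # bank -> fish
mock_attention /= mock_attention.sum(axis=1, keepdims=True)

# Hypothetical value vectors: one random 4-dimensional vector per token.
rng = np.random.default_rng(0)
V = rng.standard_normal((n, 4))

bank = tokens_amb.index('bank')
context = mock_attention[bank] @ V  # weighted average of all value vectors

# Rank the tokens that contribute most to the new representation of "bank".
order = np.argsort(mock_attention[bank])[::-1]
for idx in order[:3]:
    print(f"{tokens_amb[idx]:>6}: {mock_attention[bank, idx]:.2f}")
```

The new vector for "bank" is therefore dominated by the value vectors of "river", "bank" itself, and "fish", which is how context flows into the token's representation.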

## Part 4: Multi-Head Attention

Different "heads" learn different patterns.

In [None]:
def multi_head_attention(X, num_heads=4):
    """
    Simplified multi-head attention.
    
    Each head learns different Q, K, V projections.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    
    all_weights = []
    all_outputs = []
    
    for head in range(num_heads):
        # Each head has its own projection matrices
        np.random.seed(42 + head)
        W_Q = np.random.randn(d_model, d_head) * 0.1
        W_K = np.random.randn(d_model, d_head) * 0.1
        W_V = np.random.randn(d_model, d_head) * 0.1
        
        Q = X @ W_Q
        K = X @ W_K
        V = X @ W_V
        
        output, weights = attention(Q, K, V)
        all_weights.append(weights)
        all_outputs.append(output)
    
    # Concatenate head outputs
    multi_output = np.concatenate(all_outputs, axis=-1)
    
    return multi_output, all_weights

# Run multi-head attention
multi_output, head_weights = multi_head_attention(X, num_heads=4)

print(f"Multi-head output shape: {multi_output.shape}")
print(f"Number of attention heads: {len(head_weights)}")

In [None]:
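# Visualize different attention heads
fig, axes = plt.subplots(1, 4, figsize=(14, 3))

# Labels are illustrative: these heads use random projections, not learned ones
head_names = ['Head 1: Syntactic', 'Head 2: Semantic', 'Head 3: Local', 'Head 4: Mixed']

# Loop variable is `w` so the Part 1 `weights` used by the self-check is not overwritten
for i, (w, name) in enumerate(zip(head_weights, head_names)):
    sns.heatmap(w, annot=True, fmt='.2f', cmap='RdPu',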
                xticklabels=tokens, yticklabels=tokens, ax=axes[i],
                cbar=False)
    axes[i].set_title(name, fontsize=10)
    if i > 0:
        axes[i].set_ylabel('')
        axes[i].set_yticklabels([])

plt.tight_layout()
plt.show()

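print("💡 In a trained model, different heads learn to focus on different relationships.")
print("   Some track syntax, some semantics, some nearby words, etc.")
print("   (The projections here are random, so the patterns above are illustrative only.)")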
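
The loop in `multi_head_attention` is easy to read, but the same computation is usually done for all heads at once by reshaping. The sketch below is one possible batched version. It also applies an output projection `W_O` after concatenation, as in the standard transformer formulation; the simplified function above omits that step. The function name and the random weights are assumptions made for this example.

```python
import numpy as np

def softmax(x):
    """Compute softmax along the last axis."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def multi_head_attention_batched(X, W_Q, W_K, W_V, W_O, num_heads):
    """Multi-head self-attention computed for all heads in one pass."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

    # One (seq_len, seq_len) score matrix per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores)
    heads = weights @ V  # (num_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model), then mix them with W_O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 4
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

out, w = multi_head_attention_batched(X, W_Q, W_K, W_V, W_O, num_heads)
print(out.shape, w.shape)  # (4, 8) (4, 4, 4)
```

Each head works in a `d_model // num_heads` dimensional subspace, so the total cost stays close to that of a single full-width head.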

## Part 5: Positional Encoding

Transformers need position information since they process all tokens in parallel.
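
To make the problem concrete: self-attention on its own has no notion of order. The following sketch, which uses random embeddings and weights as stand-ins, shuffles the input tokens and checks that the output is just the original output shuffled in the same way.

```python
import numpy as np

def softmax(x):
    """Compute softmax along the last axis."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention, no positional information."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))

perm = np.array([2, 0, 3, 1])  # a fixed reordering of the four tokens
out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)

# Shuffling the input rows only shuffles the output rows the same way.
print(np.allclose(out_shuffled, out[perm]))  # True
```

Without extra position information, "cat sat" and "sat cat" would yield the same per-token representations, which is why a positional signal is added to the embeddings.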

In [None]:
def sinusoidal_positional_encoding(seq_len, d_model):
    """
    Create sinusoidal positional encodings.
    
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    PE = np.zeros((seq_len, d_model))
    
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            div = 10000 ** (i / d_model)
            PE[pos, i] = np.sin(pos / div)
            if i + 1 < d_model:
                PE[pos, i+1] = np.cos(pos / div)
    
    return PE

# Generate positional encodings
seq_len = 50
d_model = 64
PE = sinusoidal_positional_encoding(seq_len, d_model)

# Visualize
fig, ax = plt.subplots(figsize=(12, 4))

im = ax.imshow(PE.T, aspect='auto', cmap='RdBu_r')
ax.set_xlabel('Position')
ax.set_ylabel('Dimension')
ax.set_title('Sinusoidal Positional Encoding')
plt.colorbar(im)

plt.tight_layout()
plt.show()

print("💡 Each position gets a unique pattern of sine/cosine waves.")
print("   This lets the model distinguish word 1 from word 2 from word 50.")
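
The nested loops above can also be written with NumPy broadcasting. The sketch below is one such vectorized version, followed by the usual way the encoding is applied: element-wise addition to the token embeddings. The function name and the random `embeddings` array are stand-ins for illustration.

```python
import numpy as np

def positional_encoding_vectorized(seq_len, d_model):
    """Sinusoidal positional encoding computed without Python loops."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, ceil(d_model / 2))
    angles = positions / (10000 ** (even_dims / d_model))

    PE = np.zeros((seq_len, d_model))
    PE[:, 0::2] = np.sin(angles)
    # For odd d_model there is one fewer cosine column than sine columns
    PE[:, 1::2] = np.cos(angles)[:, : d_model // 2]
    return PE

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((50, 64))  # stand-in token embeddings
PE = positional_encoding_vectorized(50, 64)

# Position information is added to the embeddings before the first attention layer.
X_with_position = embeddings + PE
print(X_with_position.shape)  # (50, 64)
```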

## Part 6: Using Pre-trained Transformers

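In practice, you rarely implement attention yourself. Libraries such as Hugging Face `transformers` provide pre-trained models behind a simple API.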

In [None]:
# Pre-trained Transformers with HuggingFace
# Installation: pip install transformers torch

print("=== Pre-trained Transformer Usage ===")
print("""
Installation:
  pip install transformers torch

Example code:

from transformers import pipeline

# 1. Sentiment Analysis
sentiment = pipeline('sentiment-analysis')
result = sentiment("This product is amazing!")
# Output: [{'label': 'POSITIVE', 'score': 0.9998}]

# 2. Text Generation
generator = pipeline('text-generation', model='gpt2')
result = generator("Machine learning is", max_length=30)
# Output: [{'generated_text': 'Machine learning is a powerful...'}]

# 3. Zero-Shot Classification (no training needed!)
classifier = pipeline('zero-shot-classification')
result = classifier(
    "I need to cancel my order",
    candidate_labels=['support', 'sales', 'billing']
)
# Output: {'labels': ['support', ...], 'scores': [0.85, ...]}

# 4. Named Entity Recognition
ner = pipeline('ner', aggregation_strategy='simple')
result = ner("Apple CEO Tim Cook announced new products in California")
# Output: [{'entity_group': 'ORG', 'word': 'Apple'}, ...]

💡 Key models:
  • distilbert-base-uncased: Fast, good for classification
  • gpt2: Text generation (local, free)
  • all-MiniLM-L6-v2: Embeddings (via sentence-transformers)
""")

## Part 7: Context Window Limitations

In [None]:
# Demonstrate context window constraints

context_windows = {
    'GPT-2': 1024,
    'GPT-3': 2048,
    'GPT-4': 8192,
    'GPT-4-Turbo': 128000,
    'Claude-3': 200000,
}

# Approximate word counts (1 token ≈ 0.75 words)
approx_words = {k: int(v * 0.75) for k, v in context_windows.items()}

fig, ax = plt.subplots(figsize=(10, 5))

models = list(context_windows.keys())
# Named window_sizes so the `tokens` list from Part 1 is not overwritten
window_sizes = list(context_windows.values())

bars = ax.bar(models, window_sizes, color='#ec4899')
ax.set_ylabel('Context Window (tokens)')
ax.set_title('Transformer Context Windows')
ax.set_yscale('log')

# Add labels
for bar, size, words in zip(bars, window_sizes, approx_words.values()):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height(),
            f'{size:,}\n(~{words:,} words)', ha='center', va='bottom', fontsize=8)

plt.tight_layout()
plt.show()

print("💡 Attention is O(n²) in sequence length - that's why context windows are limited.")
print("   Newer models use techniques to extend context while managing compute costs.")
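
To put numbers on the quadratic growth, here is a back-of-the-envelope sketch of the memory needed to store one full attention matrix. It assumes 4-byte (float32) values and counts a single head in a single layer; `attention_matrix_bytes` is a hypothetical helper, and optimized implementations may avoid materializing the full matrix.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=4):
    """Memory for one seq_len x seq_len attention matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_value

for n in (1024, 8192, 128000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>7,} tokens -> {gib:,.3f} GiB per attention matrix")
```

Doubling the sequence length quadruples this figure, which is why long-context models rely on more memory-efficient attention techniques.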

## Self-Check

Run the asserts below to verify that your attention implementation is correct.

In [None]:
# SELF-CHECK: Verify your attention implementation
assert weights.shape[0] == weights.shape[1], "Attention weights should be square (seq_len x seq_len)"
assert np.allclose(weights.sum(axis=1), 1.0), "Attention weights should sum to 1 per row"
# Compare against X.shape: names such as `d_model` are reassigned in later cells
assert output.shape == X.shape, "Output should match input dimensions"
print(f"✅ Self-check passed! Attention matrix: {weights.shape}, output: {output.shape}")

## Part 8: Stakeholder Summary

### TODO: Write a 3-bullet summary (~100 words) for the PM

Template:
- **What attention does:** Allows the model to focus on relevant parts of input. Like a spotlight that highlights [important words/context].
- **Why transformers win:** Process text in parallel (fast training), capture long-range dependencies better than [RNNs/LSTMs].
- **Context window impact:** Our model can process up to ____ tokens (~____ words). For longer documents, we need [chunking/summarization].

### Your Summary:

*Write your explanation here...*

---

## Key Takeaways

1. **Attention** = learning which parts of input are relevant to each other
2. **Self-attention formula:** softmax(QK^T / √d_k) × V
3. **Multi-head attention** learns multiple relationship types
4. **Positional encoding** adds sequence order information
5. **Context windows** limit how much text models can process
6. **Pre-trained models** (BERT, GPT) are the standard approach

### Next Steps
- Explore the interactive playground
- Complete the quiz
- Move to Module 17: LLM Fundamentals