# 🤖 Week 4: Transformers & Tokenization

**Learning Objectives:**
1. Understand the Attention mechanism
2. Master tokenization strategies (BPE, WordPiece, SentencePiece)
3. Explore Transformer architecture (Encoder/Decoder)
4. Analyze how tokenization affects model performance

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re

plt.style.use('seaborn-v0_8-darkgrid')
np.random.seed(42)

---
# Section 1: Theory
---

## What is Attention?

**Core Idea**: Not all inputs are equally important for a given output.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Where:
- **Q** (Query): What am I looking for?
- **K** (Key): What do I contain?
- **V** (Value): What do I return?
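
One way to read the formula is as a soft dictionary lookup. The sketch below uses made-up toy values (three one-hot keys, scalar values 10, 20 and 30, and a query pointing at the second key); none of these numbers come from a trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Three keys (one-hot directions) and the value stored under each key.
K = np.eye(3)
V = np.array([[10.0], [20.0], [30.0]])

# A single query that points strongly at the second key.
Q = np.array([[0.0, 8.0, 0.0]])

d_k = K.shape[-1]
weights = softmax(Q @ K.T / np.sqrt(d_k))  # shape (1, 3), sums to 1
output = weights @ V                       # shape (1, 1)

print("weights:", weights.round(3))
print("output:", output.round(2))
```

Because the query aligns with the second key, almost all of the weight lands there. The two small leftover weights are equal, so they balance out around 20 and the output is 20.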

## Why Tokenization Matters

| Tokenizer | Vocab Size | Handling Unknown Words |
|-----------|------------|------------------------|
| Word-level | Large (100K+) | Mapped to a single `<UNK>` (OOV) token |
| Character | Small (~100) | No OOV words, but very long sequences |
| BPE/WordPiece | Medium (30-50K) | Decomposed into known subwords |
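
The first two rows of the table can be checked on a toy example. The two sentences below are invented for illustration; the point is only that an unseen word is lost at the word level but still representable at the character level, at the cost of a longer sequence.

```python
# Toy training and test sentences (hypothetical, chosen for illustration).
train = "the cat sat on the mat"
test = "the hat sat"

# Vocabularies built from the training sentence only.
word_vocab = set(train.split())
char_vocab = set(train.replace(" ", ""))

# Units of the test sentence that the vocabulary cannot represent.
oov_words = [w for w in test.split() if w not in word_vocab]
oov_chars = [c for c in test.replace(" ", "") if c not in char_vocab]

# Sequence length under each scheme (spaces ignored for simplicity).
word_len = len(test.split())
char_len = len(test.replace(" ", ""))

print(f"Word-level: {word_len} tokens, OOV: {oov_words}")
print(f"Char-level: {char_len} tokens, OOV: {oov_chars}")
```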

---
# Section 2: Hands-On Implementation
---

## 2.1 Simple Attention Mechanism

In [None]:
def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled Dot-Product Attention.
    
    Args:
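        Q: Query matrix (seq_len_q, d_k)
        K: Key matrix (seq_len_k, d_k)
        V: Value matrix (seq_len_k, d_v)
        mask: Optional array broadcastable to (seq_len_q, seq_len_k);
            1 marks a blocked position, 0 a visible one
    
    Returns:
        output: Attention output (seq_len_q, d_v)
        weights: Attention weights (seq_len_q, seq_len_k); each row sums to 1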
    """
    d_k = K.shape[-1]
    
    # Compute attention scores
    scores = Q @ K.T / np.sqrt(d_k)
    
    # Apply mask if provided
    if mask is not None:
        scores = scores + (mask * -1e9)
    
    # Softmax to get attention weights
    weights = softmax(scores)
    
    # Weighted sum of values
    output = weights @ V
    
    return output, weights

In [None]:
# Example: Simple attention
seq_len = 4
d_k = 8
d_v = 8

# Random Q, K, V matrices
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)

print(f"Input shape: ({seq_len}, {d_k})")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"\nAttention weights (rows sum to 1):\n{weights}")
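
The `mask` argument is additive: wherever the mask is 1, a large negative number is added to the score, so that position receives effectively zero weight after the softmax. The sketch below repeats a compact copy of the two functions so that it runs on its own, and uses a hypothetical padding mask that hides the last key position.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Compact copy of the function defined earlier in this section.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = scores + (mask * -1e9)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

# Assumption for this example: the last token is padding.
# Rows index queries, columns index keys; 1 = blocked, 0 = visible.
pad_mask = np.zeros((seq_len, seq_len))
pad_mask[:, -1] = 1

_, masked_weights = scaled_dot_product_attention(Q, K, V, pad_mask)
print(masked_weights.round(3))
```

The last column of the printed matrix is zero, and each row still sums to 1 over the remaining three positions.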

## 2.2 Multi-Head Attention

In [None]:
class MultiHeadAttention:
    """Multi-Head Attention mechanism."""
    
    def __init__(self, d_model, num_heads):
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Projection matrices; each one holds all heads side by side (see split_heads)
        self.W_Q = np.random.randn(d_model, d_model) * 0.1
        self.W_K = np.random.randn(d_model, d_model) * 0.1
        self.W_V = np.random.randn(d_model, d_model) * 0.1
        self.W_O = np.random.randn(d_model, d_model) * 0.1
    
    def split_heads(self, x):
        """Split into multiple heads."""
        batch_size, seq_len, _ = x.shape
        return x.reshape(batch_size, seq_len, self.num_heads, self.d_k).transpose(0, 2, 1, 3)
    
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.shape[0]
        
        # Linear projections
        Q = Q @ self.W_Q
        K = K @ self.W_K
        V = V @ self.W_V
        
        # Split heads
        Q = self.split_heads(Q)
        K = self.split_heads(K)
        V = self.split_heads(V)
        
        # Attention for each head, over every batch element
        head_outputs = []
        all_weights = []
        for h in range(self.num_heads):
            results = [
                scaled_dot_product_attention(Q[b, h], K[b, h], V[b, h], mask)
                for b in range(batch_size)
            ]
            head_outputs.append(np.stack([out for out, _ in results]))
            all_weights.append(np.stack([w for _, w in results]))
        
        # Concatenate heads: (batch_size, seq_len, d_model)
        concat = np.concatenate(head_outputs, axis=-1)
        
        # Final linear projection
        output = concat @ self.W_O
        
        return output, all_weights

In [None]:
# Test Multi-Head Attention
d_model = 64
num_heads = 8
seq_len = 10

mha = MultiHeadAttention(d_model, num_heads)

# Input: (batch_size, seq_len, d_model)
x = np.random.randn(1, seq_len, d_model)

output, weights = mha.forward(x, x, x)  # Self-attention
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of attention heads: {len(weights)}")

## 2.3 Tokenization from Scratch

In [None]:
class SimpleTokenizer:
    """Simple word-level tokenizer."""
    
    def __init__(self):
        self.word_to_id = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}
        self.id_to_word = {v: k for k, v in self.word_to_id.items()}
        self.vocab_size = 4
    
    def fit(self, texts):
        """Build vocabulary from texts."""
        for text in texts:
            words = text.lower().split()
            for word in words:
                if word not in self.word_to_id:
                    self.word_to_id[word] = self.vocab_size
                    self.id_to_word[self.vocab_size] = word
                    self.vocab_size += 1
    
    def encode(self, text):
        """Convert text to token IDs."""
        words = text.lower().split()
        return [self.word_to_id.get(w, 1) for w in words]  # 1 = <UNK>
    
    def decode(self, ids):
        """Convert token IDs back to text."""
        return " ".join([self.id_to_word.get(i, "<UNK>") for i in ids])

In [None]:
# Test simple tokenizer
texts = [
    "The cat sat on the mat",
    "The dog played in the garden",
    "Machine learning is amazing"
]

tokenizer = SimpleTokenizer()
tokenizer.fit(texts)

print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"\nVocabulary: {tokenizer.word_to_id}")

# Encode and decode
test_text = "The cat is amazing"
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)

print(f"\nOriginal: '{test_text}'")
print(f"Encoded: {encoded}")
print(f"Decoded: '{decoded}'")
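
`SimpleTokenizer` splits on whitespace only, so punctuation stays glued to the neighbouring word and `mat.` becomes a different vocabulary entry from `mat`. One possible refinement, sketched here with a hypothetical helper `split_words` that is not part of the class above, separates punctuation with a regular expression.

```python
import re

def split_words(text):
    """Split into lowercase words and standalone punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())

text = "The cat sat on the mat."
naive = text.lower().split()
refined = split_words(text)

print(naive)
print(refined)
```

With the refined split, `mat` and `.` are separate tokens, so the word maps to the same ID whether or not it ends a sentence.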

## 2.4 BPE Tokenizer (Simplified)

In [None]:
class SimpleBPE:
    """Simplified Byte-Pair Encoding tokenizer."""
    
    def __init__(self, vocab_size=100):
        self.vocab_size = vocab_size
        self.merges = {}
        self.vocab = set()
    
    def get_pair_counts(self, word_freqs):
        """Count adjacent pairs."""
        pairs = Counter()
        for word, freq in word_freqs.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[symbols[i], symbols[i+1]] += freq
        return pairs
    
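    def merge_pair(self, word_freqs, pair):
        """Merge every occurrence of `pair` into a single symbol."""
        new_word_freqs = {}
        # Match the pair only at symbol boundaries, so ("a", "b") does not
        # also merge the tail of a longer symbol such as "xa" with "b".
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        replacement = "".join(pair)
        
        for word, freq in word_freqs.items():
            new_word = pattern.sub(lambda _: replacement, word)
            new_word_freqs[new_word] = freq
        
        return new_word_freqs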
    
    def fit(self, texts, num_merges=10):
        """Learn BPE merges from texts."""
        # Count word frequencies
        word_freqs = Counter()
        for text in texts:
            for word in text.lower().split():
                # Add space between characters + end token
                word_freqs[" ".join(list(word)) + " </w>"] += 1
        
        # Learn merges
        for i in range(num_merges):
            pairs = self.get_pair_counts(word_freqs)
            if not pairs:
                break
            best_pair = max(pairs, key=pairs.get)
            word_freqs = self.merge_pair(word_freqs, best_pair)
            self.merges[best_pair] = "".join(best_pair)
            print(f"Merge {i+1}: {best_pair} -> {''.join(best_pair)}")
        
        # Build vocabulary
        self.vocab = set()
        for word in word_freqs.keys():
            for token in word.split():
                self.vocab.add(token)
        
        return self

In [None]:
# Train BPE
corpus = [
    "low lower lowest",
    "new newer newest",
    "show shower"
]

bpe = SimpleBPE()
bpe.fit(corpus, num_merges=10)

print(f"\nFinal vocabulary ({len(bpe.vocab)} tokens):")
print(sorted(bpe.vocab))
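
Training only learns the merge rules; segmenting a new word means replaying those rules in the order they were learned. The helper below is a minimal sketch of that step. `apply_bpe` is a hypothetical function, and the merge list is written by hand for illustration rather than taken from the training run above.

```python
def apply_bpe(word, merges):
    """Segment one word by replaying merges in the order they were learned."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair when both symbols match exactly at this position.
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hand-written merge list (an assumption for this example).
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]

seen = apply_bpe("lower", merges)
unseen = apply_bpe("slower", merges)
print(seen)
print(unseen)
```

A word that was never seen during training, such as `slower`, still decomposes into known pieces plus a leftover character. With the class above, `list(bpe.merges)` should give the learned pairs in order, since dictionaries preserve insertion order in Python 3.7+.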

---
# Section 3: Visualizations
---

## 3.1 Attention Weights Heatmap

In [None]:
# Visualize attention patterns
sentence = ["The", "cat", "sat", "on", "the", "mat"]

# Simulated (random) attention weights, not taken from a trained model
np.random.seed(42)
attention = softmax(np.random.randn(len(sentence), len(sentence)))

plt.figure(figsize=(10, 8))
sns.heatmap(attention, xticklabels=sentence, yticklabels=sentence,
            annot=True, fmt=".2f", cmap="Blues")
plt.title("Attention Weights (Self-Attention)")
plt.xlabel("Key Position")
plt.ylabel("Query Position")
plt.show()

## 3.2 Multi-Head Attention Comparison

In [None]:
# Visualize multiple attention heads
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

for i, ax in enumerate(axes.flat):
    # Simulated (random) attention for each head, not taken from a trained model
    np.random.seed(i)
    head_attention = softmax(np.random.randn(len(sentence), len(sentence)))
    
    sns.heatmap(head_attention, ax=ax, cmap="Blues", cbar=False,
                xticklabels=sentence if i >= 4 else [],
                yticklabels=sentence if i % 4 == 0 else [])
    ax.set_title(f"Head {i+1}")

plt.suptitle("Multi-Head Attention: Different Heads Learn Different Patterns", fontsize=14)
plt.tight_layout()
plt.show()

## 3.3 Token Distribution Analysis

In [None]:
# Compare tokenization strategies
sample_text = "Natural language processing enables machines to understand human communication"

# Word-level tokenization
word_tokens = sample_text.split()

# Character-level tokenization
char_tokens = list(sample_text.replace(" ", "_"))

# Hand-written BPE-like split (illustrative, not produced by SimpleBPE)
subword_tokens = ["Nat", "ural", "_lang", "uage", "_process", "ing", 
                  "_enables", "_machines", "_to", "_understand", 
                  "_human", "_commun", "ication"]

print(f"Original: {sample_text}")
print(f"\nWord-level ({len(word_tokens)} tokens): {word_tokens}")
print(f"\nCharacter-level ({len(char_tokens)} tokens): {char_tokens[:20]}...")
print(f"\nSubword/BPE ({len(subword_tokens)} tokens): {subword_tokens}")

In [None]:
# Visualize token counts
methods = ['Word-level', 'Character-level', 'Subword (BPE)']
token_counts = [len(word_tokens), len(char_tokens), len(subword_tokens)]

plt.figure(figsize=(10, 5))
bars = plt.bar(methods, token_counts, color=['steelblue', 'coral', 'seagreen'])
plt.ylabel('Number of Tokens')
plt.title('Tokenization Methods Comparison')

for bar, count in zip(bars, token_counts):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, 
             str(count), ha='center', fontsize=12)

plt.show()

---
# Section 4: Unit Tests
---

In [None]:
def run_tests():
    print("Running Unit Tests...\n")
    
    # Test 1: Softmax sums to 1
    x = np.array([1, 2, 3])
    assert abs(softmax(x).sum() - 1.0) < 1e-6
    print("✓ Softmax sum test passed")
    
    # Test 2: Attention output shape
    Q = np.random.randn(5, 8)
    K = np.random.randn(5, 8)
    V = np.random.randn(5, 16)
    output, weights = scaled_dot_product_attention(Q, K, V)
    assert output.shape == (5, 16)
    assert weights.shape == (5, 5)
    print("✓ Attention shape test passed")
    
    # Test 3: Attention weights sum to 1
    assert np.allclose(weights.sum(axis=-1), 1.0)
    print("✓ Attention weights normalization test passed")
    
    # Test 4: Tokenizer encode/decode
    tok = SimpleTokenizer()
    tok.fit(["hello world"])
    encoded = tok.encode("hello world")
    decoded = tok.decode(encoded)
    assert decoded == "hello world"
    print("✓ Tokenizer encode/decode test passed")
    
    # Test 5: Unknown token handling
    encoded = tok.encode("hello unknown")
    assert 1 in encoded  # 1 = <UNK>
    print("✓ Unknown token handling test passed")
    
    print("\n🎉 All tests passed!")

run_tests()

---
# Section 5: Interview Prep
---

## Key Questions

### Q1: Explain the Attention mechanism in simple terms.

**Answer:**
- Attention allows the model to focus on relevant parts of the input
- Uses Query-Key-Value: Query asks "what's relevant?", Keys answer, Values provide content
- Softmax creates importance weights that sum to 1
- Enables long-range dependencies without recurrence

### Q2: What's the difference between encoder and decoder in Transformers?

**Answer:**
- **Encoder**: Bidirectional, sees entire input (BERT)
- **Decoder**: Autoregressive; each position attends only to itself and earlier tokens (GPT)
- **Encoder-Decoder**: Full Transformer for seq2seq (T5, BART)

### Q3: How does tokenization affect model performance?

**Answer:**
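- Fine-grained tokens (e.g. characters): small vocabulary, but longer sequences, slower processing, and less text per context window
- Coarse tokens (e.g. whole words): shorter sequences, but a large vocabulary, a bigger embedding table, and more OOV words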
- Subword (BPE) balances both: handles OOV, reasonable sequence length
- Domain-specific tokenizers can improve performance

### Q4: Why do we scale by sqrt(d_k) in attention?

**Answer:**
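- If query and key components are independent with zero mean and unit variance, their dot product has variance d_k, so unscaled scores grow with dimension
- Large scores saturate the softmax, where gradients are close to zero
- Dividing by sqrt(d_k) brings the variance back to 1, which keeps training stable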
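
The variance argument can be checked empirically. The sketch below assumes queries and keys drawn from a standard normal distribution, which is a simplification of what a trained model produces, and compares the spread of raw and scaled dot products for a few values of `d_k`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # number of random query/key pairs per setting

results = {}
for d_k in (4, 64, 256):
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    dots = np.sum(q * k, axis=1)  # one dot product per pair
    results[d_k] = (dots.std(), (dots / np.sqrt(d_k)).std())

for d_k, (raw_std, scaled_std) in results.items():
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:.2f}")
```

The raw standard deviation tracks sqrt(d_k), while the scaled one stays close to 1 regardless of dimension.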

---
# Section 6: Exercises
---

In [None]:
# Exercise 1: Implement Positional Encoding
def positional_encoding(seq_len, d_model):
    """
    Create sinusoidal positional encodings.
    
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    # TODO: Your implementation here
    pass


# Exercise 2: Implement Masked Self-Attention
def create_causal_mask(seq_len):
    """
    Create mask for decoder self-attention.
    Position i can only attend to positions <= i.
    """
    # TODO: Your implementation here
    pass


# Exercise 3: Implement a simple Transformer block
class TransformerBlock:
    """Single Transformer encoder block."""
    def __init__(self, d_model, num_heads, d_ff):
        # TODO: Initialize MHA, FFN, Layer Norms
        pass
    
    def forward(self, x):
        # TODO: Implement forward pass with residual connections
        pass

---
# Section 7: Deliverable
---

## What You Built:

1. **Scaled Dot-Product Attention** - Core attention mechanism
2. **Multi-Head Attention** - Parallel attention heads
3. **Simple Tokenizer** - Word-level tokenization
4. **BPE Tokenizer** - Subword tokenization

## Key Takeaways:

- Attention enables modeling long-range dependencies
- Multi-head attention learns different relationship patterns
- Tokenization choice determines sequence length, vocabulary size, and how unknown words are handled
- BPE/WordPiece balance vocabulary size and sequence length

## Next Week: FastAPI Backend
- Building REST APIs
- Async endpoints
- JWT authentication