# Lab C.3: Induction Head Analysis

**Module:** C - Mechanistic Interpretability  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand what induction heads are and why they matter
- [ ] Create datasets that test induction behavior
- [ ] Identify induction heads through attention pattern analysis
- [ ] Verify induction head composition with previous token heads
- [ ] Measure the impact of ablating induction heads

---

## Prerequisites

- Completed: Lab C.1 and C.2
- Knowledge of: Activation patching, attention patterns

---

## Real-World Context

**Induction heads are arguably the most important circuit discovered in transformers.**

They are believed to help explain:
- **In-context learning**: How models learn from examples in the prompt
- **Few-shot learning**: Why giving examples helps the model
- **Pattern completion**: Predicting what comes next in repeated sequences

Understanding induction heads gives insight into one of the core mechanisms that makes large language models powerful!

---

## ELI5: What are Induction Heads?

> **Imagine you're playing a pattern game**: Someone says "Apple, Banana, Apple, ___"
>
> **You know the answer is "Banana"!** Why? Because:
> 1. You notice "Apple" appeared before
> 2. You remember what came after "Apple" last time ("Banana")
> 3. You predict "Banana" will come again
>
> **That's exactly what induction heads do!** They complete patterns by:
> 1. Finding where the current token appeared before
> 2. Looking at what came after it
> 3. Predicting that same thing will come again
>
> **The formal pattern:** `[A][B]...[A]` → `[B]`
>
> **Real examples:**
> - "Harry Potter...Harry" → " Potter"
> - "New York City...New York" → " City"
> - "def foo():...def" → " foo" (in code!)

---

## ELI5: The Induction Circuit

> **Induction heads don't work alone!** They need a partner:
>
> **1. Previous Token Head (Layer 1-2)**
> - Job: "Copy information about token[i] to position[i+1]"
> - After "Apple" at position 0, position 1 now "knows" about Apple
>
> **2. Induction Head (Layer 5-6+)**  
> - Job: "Find positions that know about my current token"
> - When at the second "Apple" (position 2), looks for positions that "know" Apple
> - Finds position 1 (which got Apple info from previous token head)
> - Copies what's at position 1 → "Banana"!
>
> **It's a two-step dance:**
> 1. Previous token head moves info forward: `[Apple] → [Banana gets Apple's info]`
> 2. Induction head uses that info: `[Apple] attends to [position with Apple info]`

```
Position:    0       1       2       3
Token:     Apple   Banana  Apple    ???
                     ↑       |
                     └───────┘
      Induction head at position 2 attends here!
      (Finds "Banana" and predicts it for position 3)
```
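
The two-step dance can be mimicked in plain Python. This is only a toy sketch of the logic, not how attention actually computes it: the function names are made up for illustration, and real heads work with soft attention over vectors rather than exact string matches.

```python
from typing import List, Optional


def previous_token_step(tokens: List[str]) -> List[Optional[str]]:
    """Step 1: each position records which token came immediately before it."""
    return [None] + tokens[:-1]


def induction_step(tokens: List[str]) -> Optional[str]:
    """Step 2: from the last position, find an earlier position whose recorded
    previous token equals the current token, and copy the token found there."""
    prev_info = previous_token_step(tokens)
    query = tokens[-1]
    # Scan from most recent to oldest, skipping the query position itself.
    for pos in range(len(tokens) - 2, -1, -1):
        if prev_info[pos] == query:
            return tokens[pos]
    return None  # No earlier occurrence, so the rule makes no prediction.


print(induction_step(["Apple", "Banana", "Apple"]))  # prints Banana
```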

---

## Part 1: Setup

In [None]:
# Core imports
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gc
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass
import random

# TransformerLens
from transformer_lens import HookedTransformer, utils
from transformer_lens.hook_points import HookPoint

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Settings
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
torch.set_grad_enabled(False)

print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")

In [None]:
# Load model
model = HookedTransformer.from_pretrained(
    "gpt2-small",
    device="cuda" if torch.cuda.is_available() else "cpu"
)

print(f"Loaded GPT-2 Small: {model.cfg.n_layers} layers, {model.cfg.n_heads} heads/layer")
print(f"Total heads: {model.cfg.n_layers * model.cfg.n_heads}")

---

## Part 2: Creating Induction Test Data

To find induction heads, we need data that tests the `[A][B]...[A]` → `[B]` pattern.

In [None]:
def create_repeated_sequence(seq_len: int = 25, seed: Optional[int] = None) -> torch.Tensor:
    """
    Create a sequence of random tokens, then repeat it.
    
    This creates the pattern: [A][B][C]...[A][B][C]...
    Induction heads should predict each token based on what followed it before.
    
    Args:
        seq_len: Length of the first half (will be doubled)
        seed: Random seed for reproducibility
    
    Returns:
        Token tensor of shape [1, seq_len * 2]
    """
    if seed is not None:
        torch.manual_seed(seed)
    
    # Use tokens in a safe range (avoid special tokens)
    first_half = torch.randint(1000, 10000, (1, seq_len), device=model.cfg.device)
    
    # Repeat the sequence
    repeated = torch.cat([first_half, first_half], dim=1)
    
    return repeated

# Create example
repeated_tokens = create_repeated_sequence(seq_len=10, seed=42)
print(f"Token shape: {repeated_tokens.shape}")
print(f"First half:  {repeated_tokens[0, :10].tolist()}")
print(f"Second half: {repeated_tokens[0, 10:].tolist()}")
print(f"\nAre they equal? {torch.equal(repeated_tokens[0, :10], repeated_tokens[0, 10:])}")

In [None]:
# Test that the model does well on this task
def test_induction_accuracy(model, seq_len: int = 20, n_tests: int = 5):
    """Test how well the model predicts the next token in repeated sequences."""
    
    total_correct = 0
    total_tested = 0
    
    for seed in range(n_tests):
        tokens = create_repeated_sequence(seq_len, seed=seed)
        
        # Get model predictions
        logits = model(tokens)
        predictions = logits[0, :-1, :].argmax(dim=-1)  # Predict next token
        
        # In the second half, each prediction should match what followed in first half
        # Position i in second half (i >= seq_len) should predict tokens[i+1-seq_len]
        for pos in range(seq_len, 2 * seq_len - 1):
            predicted = predictions[pos].item()
            actual = tokens[0, pos + 1].item()  # What actually comes next
            expected = tokens[0, pos + 1 - seq_len].item()  # What came after in first half
            
            # These should all be the same (repeated sequence)
            assert actual == expected, "Bug in test setup"
            
            if predicted == actual:
                total_correct += 1
            total_tested += 1
    
    accuracy = total_correct / total_tested
    print(f"Induction accuracy: {accuracy:.1%} ({total_correct}/{total_tested})")
    return accuracy

accuracy = test_induction_accuracy(model, seq_len=20, n_tests=10)

### What Just Happened?

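If the accuracy printed above is high, the model is showing strong induction behavior: when it sees `[A][B]...[A]`, it predicts `[B]`. Random token sequences cannot be predicted from ordinary language statistics, so this is good evidence that an induction-like mechanism is active. The next parts locate the heads responsible.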

---

## Part 3: Finding Induction Heads Through Attention Patterns

Induction heads have a distinctive attention pattern: when at position `i` in the second half, they attend to position `i - seq_len + 1` (the position *after* where this token appeared before).
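
The index arithmetic can be checked on a synthetic attention matrix before touching the model. The sketch below assumes a single head's pattern indexed as `pattern[query, key]`; the helper names are hypothetical and the random matrix is a stand-in, not real model output. The induction targets all lie on one diagonal, so `np.diagonal` can read them in a single call.

```python
import numpy as np


def induction_score_from_pattern(pattern: np.ndarray, seq_len: int) -> float:
    """Mean attention from each second-half query to its induction target.

    pattern: [2 * seq_len, 2 * seq_len] attention matrix for one head,
             indexed as pattern[query, key].
    """
    # Entries pattern[q, q - (seq_len - 1)] lie on a diagonal below the main one.
    stripe = np.diagonal(pattern, offset=-(seq_len - 1))
    # stripe[0] belongs to query seq_len - 1 (still in the first half), so drop it.
    return float(stripe[1:].mean())


def loop_score(pattern: np.ndarray, seq_len: int) -> float:
    """Same quantity written as an explicit loop over second-half positions."""
    total = 0.0
    for pos in range(seq_len, 2 * seq_len):
        total += pattern[pos, pos - seq_len + 1]
    return total / seq_len


seq_len = 5
rng = np.random.default_rng(0)

# Synthetic causal attention: lower-triangular rows that each sum to 1.
raw = np.tril(rng.random((2 * seq_len, 2 * seq_len)))
pattern = raw / raw.sum(axis=1, keepdims=True)

# An idealized induction head puts all second-half attention on the target.
ideal = np.zeros((2 * seq_len, 2 * seq_len))
for pos in range(seq_len, 2 * seq_len):
    ideal[pos, pos - seq_len + 1] = 1.0

print(induction_score_from_pattern(pattern, seq_len), loop_score(pattern, seq_len))
print(induction_score_from_pattern(ideal, seq_len))
```

The two functions should agree on any pattern, and the idealized head scores exactly 1.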

In [None]:
def compute_induction_scores(
    model: HookedTransformer,
    seq_len: int = 25,
    n_samples: int = 10
) -> np.ndarray:
    """
    Compute induction scores for all attention heads.
    
    Induction score = average attention to the "induction target" position.
    For position i in second half, the target is position (i - seq_len + 1).
    
    Returns:
        Array of shape [n_layers, n_heads] with induction scores.
    """
    n_layers = model.cfg.n_layers
    n_heads = model.cfg.n_heads
    scores = np.zeros((n_layers, n_heads))
    
    for sample in range(n_samples):
        # Create repeated sequence
        tokens = create_repeated_sequence(seq_len, seed=sample)
        
        # Get attention patterns
        _, cache = model.run_with_cache(tokens)
        
        for layer in range(n_layers):
            pattern = cache["pattern", layer][0]  # [n_heads, seq, seq]
            
            for head in range(n_heads):
                # For each position in second half, check attention to induction target
                induction_attn = 0
                count = 0
                
                for pos in range(seq_len, 2 * seq_len):
                    # Induction target: position after where this token appeared before
                    target = pos - seq_len + 1
                    if target > 0:  # Valid position
                        induction_attn += pattern[head, pos, target].item()
                        count += 1
                
                scores[layer, head] += induction_attn / count if count > 0 else 0
        
        # Clean up
        del cache
    
    scores /= n_samples
    return scores

print("Computing induction scores (this may take a minute)...")
induction_scores = compute_induction_scores(model, seq_len=25, n_samples=10)
print("Done!")

In [None]:
# Visualize induction scores
fig = px.imshow(
    induction_scores,
    labels={"x": "Head", "y": "Layer", "color": "Induction Score"},
    color_continuous_scale="YlOrRd",
    title="Induction Scores Across All Attention Heads"
)
fig.update_layout(width=800, height=600)
fig.show()

In [None]:
# Find top induction heads
print("Top Induction Heads:")
print("=" * 40)

flat_scores = induction_scores.flatten()
sorted_indices = np.argsort(flat_scores)[::-1]

top_induction_heads = []
for i, idx in enumerate(sorted_indices[:10]):
    layer = idx // model.cfg.n_heads
    head = idx % model.cfg.n_heads
    score = induction_scores[layer, head]
    print(f"{i+1}. L{layer}H{head}: score = {score:.3f}")
    if score > 0.15:  # Threshold for "real" induction heads
        top_induction_heads.append((layer, head, score))

print(f"\nFound {len(top_induction_heads)} strong induction heads (score > 0.15)")

### Interpreting Induction Scores

- **High scores (>0.15)**: Strong induction heads that consistently attend to the "answer" position
- **Medium scores (0.05-0.15)**: Partial induction behavior, may be backup heads
- **Low scores (<0.05)**: Not induction heads

Typically, induction heads appear in **middle layers (4-7)** of GPT-2 Small.
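
As a small aside, the ranking and bucketing described above can be written with `np.unravel_index` instead of manual `//` and `%` arithmetic. This is a sketch on a made-up score grid (the numbers are illustrative, not measured), and `top_heads` and `label_score` are hypothetical helper names.

```python
import numpy as np


def top_heads(scores: np.ndarray, k: int = 3):
    """Return the k highest-scoring (layer, head, score) triples, best first."""
    order = np.argsort(scores, axis=None)[::-1][:k]
    layers, heads = np.unravel_index(order, scores.shape)
    return [(int(l), int(h), float(scores[l, h])) for l, h in zip(layers, heads)]


def label_score(score: float) -> str:
    """Bucket a score using the thresholds from this lab."""
    if score > 0.15:
        return "strong"
    if score >= 0.05:
        return "partial"
    return "none"


# Made-up scores for a tiny 3-layer, 4-head grid (illustration only).
toy_scores = np.zeros((3, 4))
toy_scores[1, 2] = 0.40
toy_scores[2, 0] = 0.20
toy_scores[0, 3] = 0.07

for layer, head, score in top_heads(toy_scores, k=3):
    print(f"L{layer}H{head}: {score:.2f} ({label_score(score)})")
```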

---

## Part 4: Visualizing Induction Head Attention Patterns

Let's look at the attention patterns of our identified induction heads.

In [None]:
# Create a specific repeated sequence for visualization
seq_len = 10
tokens = create_repeated_sequence(seq_len, seed=123)

# Get attention patterns
logits, cache = model.run_with_cache(tokens)

# Convert to string tokens for labels
token_strs = [f"{i}:{model.tokenizer.decode(t.item())[:6]}" 
              for i, t in enumerate(tokens[0])]
print("Token labels:")
print(token_strs)

In [None]:
# Visualize top induction head
if top_induction_heads:
    layer, head, score = top_induction_heads[0]
    
    pattern = cache["pattern", layer][0, head].detach().cpu().numpy()
    
    fig = px.imshow(
        pattern,
        labels={"x": "Key (Source)", "y": "Query (Attending)", "color": "Attention"},
        x=token_strs,
        y=token_strs,
        color_continuous_scale="Blues",
        title=f"Induction Head L{layer}H{head} Attention Pattern (score={score:.3f})"
    )
    fig.update_layout(width=700, height=600)
    
    # Add annotation showing expected induction pattern
    # Positions 10-19 should attend to positions 1-10
    for i in range(seq_len, 2 * seq_len):
        target = i - seq_len + 1
        if target > 0:
            fig.add_shape(
                type="rect",
                x0=target-0.5, x1=target+0.5,
                y0=i-0.5, y1=i+0.5,
                line=dict(color="red", width=1)
            )
    
    fig.show()
    print("Red boxes: Expected induction attention targets")
else:
    print("No strong induction heads found!")

In [None]:
# Compare multiple induction heads
if len(top_induction_heads) >= 2:
    fig = make_subplots(
        rows=1, cols=min(3, len(top_induction_heads)),
        subplot_titles=[f"L{l}H{h} (score={s:.2f})" for l, h, s in top_induction_heads[:3]]
    )
    
    for i, (layer, head, score) in enumerate(top_induction_heads[:3]):
        pattern = cache["pattern", layer][0, head].detach().cpu().numpy()
        
        fig.add_trace(
            go.Heatmap(
                z=pattern,
                colorscale="Blues",
                showscale=(i == 0)
            ),
            row=1, col=i+1
        )
    
    fig.update_layout(
        title="Top 3 Induction Heads Comparison",
        height=400,
        width=1000
    )
    fig.show()

### What to Look For

In induction head patterns, you should see:
- **Diagonal stripe in bottom-left**: Positions in the second half (rows at the bottom) attending to offset positions in the first half (columns on the left)
- **Offset of 1**: They attend to position `(i - seq_len + 1)`, not `(i - seq_len)`

This offset is crucial - it's what makes them look at *what came after* the repeated token!

---

## Part 5: Finding Previous Token Heads

Induction heads need previous token heads to work. Let's find those too!

In [None]:
def compute_prev_token_scores(
    model: HookedTransformer,
    seq_len: int = 50,
    n_samples: int = 10
) -> np.ndarray:
    """
    Compute previous token head scores.
    
    Previous token score = average attention to position (i-1).
    """
    n_layers = model.cfg.n_layers
    n_heads = model.cfg.n_heads
    scores = np.zeros((n_layers, n_heads))
    
    for sample in range(n_samples):
        # Random tokens (don't need repeated for this)
        tokens = torch.randint(1000, 10000, (1, seq_len), device=model.cfg.device)
        
        _, cache = model.run_with_cache(tokens)
        
        for layer in range(n_layers):
            pattern = cache["pattern", layer][0]  # [n_heads, seq, seq]
            
            for head in range(n_heads):
                # Average attention to previous position
                prev_attn = 0
                for pos in range(1, seq_len):
                    prev_attn += pattern[head, pos, pos - 1].item()
                
                scores[layer, head] += prev_attn / (seq_len - 1)
        
        del cache
    
    scores /= n_samples
    return scores

print("Computing previous token head scores...")
prev_token_scores = compute_prev_token_scores(model)
print("Done!")

In [None]:
# Visualize previous token scores
fig = px.imshow(
    prev_token_scores,
    labels={"x": "Head", "y": "Layer", "color": "Prev Token Score"},
    color_continuous_scale="Greens",
    title="Previous Token Head Scores"
)
fig.update_layout(width=800, height=600)
fig.show()

In [None]:
# Find top previous token heads
print("Top Previous Token Heads:")
print("=" * 40)

flat_scores = prev_token_scores.flatten()
sorted_indices = np.argsort(flat_scores)[::-1]

top_prev_heads = []
for i, idx in enumerate(sorted_indices[:10]):
    layer = idx // model.cfg.n_heads
    head = idx % model.cfg.n_heads
    score = prev_token_scores[layer, head]
    print(f"{i+1}. L{layer}H{head}: score = {score:.3f}")
    # Previous token heads should be in early layers. GPT-2 Small has a
    # well-known previous token head in layer 4, so layer 4 is included.
    if score > 0.3 and layer <= 4:
        top_prev_heads.append((layer, head, score))

print(f"\nFound {len(top_prev_heads)} previous token heads (layers 0-4, score > 0.3)")

### Comparing Head Types

- **Previous Token Heads**: Appear in early layers (roughly 0-4 in GPT-2 Small), attend to position i-1
- **Induction Heads**: Appear in later layers (4-7), always after the previous token head they read from, attend to position i-seq_len+1

This layer ordering is crucial - previous token heads must run first to set up the information that induction heads use!
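
To get a feel for what a previous token score means, it helps to compute it on two synthetic extremes. The sketch below uses made-up attention matrices (not model output) and a hypothetical `prev_token_score` helper that reads the first sub-diagonal, which is the same quantity the loop above accumulates.

```python
import numpy as np


def prev_token_score(pattern: np.ndarray) -> float:
    """Mean of pattern[pos, pos - 1] over pos >= 1 (the first sub-diagonal)."""
    return float(np.diagonal(pattern, offset=-1).mean())


n = 50  # matches the default seq_len used for previous token scores in this lab

# Idealized previous token head: every query puts all attention on pos - 1.
ideal = np.eye(n, k=-1)
ideal[0, 0] = 1.0  # position 0 has nothing before it, so it attends to itself

# Uninformative head: attention spread evenly over all visible positions.
uniform = np.tril(np.ones((n, n)))
uniform /= uniform.sum(axis=1, keepdims=True)

print(f"ideal:   {prev_token_score(ideal):.3f}")
print(f"uniform: {prev_token_score(uniform):.3f}")
```

At this sequence length the uniform head scores about 0.07, so the 0.3 threshold used in this lab sits well above what an unfocused head would reach.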

---

## Part 6: Verifying Composition

Let's verify that induction heads actually read from previous token heads through K-composition.

In [None]:
def compute_k_composition(
    model: HookedTransformer,
    source_layer: int,
    source_head: int,
    target_layer: int,
    target_head: int
) -> float:
    """
    Compute K-composition score between two heads.
    
    K-composition: Does source head's output get used by target head's keys?
    
    Higher score = source head writes vectors that target head uses for key matching.
    """
    if target_layer <= source_layer:
        return 0.0  # Can't compose backwards
    
    # Get weight matrices
    W_O = model.W_O[source_layer, source_head]  # [d_head, d_model]
    W_V = model.W_V[source_layer, source_head]  # [d_model, d_head]
    W_K = model.W_K[target_layer, target_head]  # [d_model, d_head]
    
    # OV circuit: what does source head write to residual stream?
    OV = W_V @ W_O  # [d_model, d_model]
    
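    # K-composition: how much does target's K read from OV output?
    # TransformerLens weights right-multiply the residual stream, so the source
    # head writes resid @ OV and the target head's keys read it via @ W_K.
    # We measure: ||OV @ W_K|| / (||OV|| * ||W_K||)  (Frobenius norms)
    composition = (OV @ W_K).norm().item()
    normalized = composition / (W_K.norm().item() * OV.norm().item() + 1e-10)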
    
    return normalized

# Compute composition between all prev token and induction head pairs
if top_prev_heads and top_induction_heads:
    print("K-Composition Scores (Previous Token → Induction):")
    print("=" * 60)
    
    for src_layer, src_head, src_score in top_prev_heads[:3]:
        for tgt_layer, tgt_head, tgt_score in top_induction_heads[:3]:
            comp = compute_k_composition(model, src_layer, src_head, tgt_layer, tgt_head)
            print(f"L{src_layer}H{src_head} (prev) → L{tgt_layer}H{tgt_head} (ind): {comp:.3f}")
else:
    print("Need both previous token and induction heads to compute composition")

### Interpreting Composition Scores

High K-composition means the induction head uses the previous token head's output for its key matching. This is exactly what we expect from the induction circuit!

- **High scores (>0.1)**: Strong composition, likely part of induction circuit
- **Low scores (<0.05)**: Weak composition, less likely to be partners
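
A normalized composition score is not zero even for unrelated matrices, so it can help to compare against what random matrices of the same shape would give. The sketch below is an illustration on random NumPy matrices, not on GPT-2 weights. It assumes the right-multiply convention (a head writes `resid @ OV`, keys are `resid @ W_K`), and the sizes and names are chosen only for the example.

```python
import numpy as np


def composition_score(OV: np.ndarray, W_K: np.ndarray) -> float:
    """Frobenius-normalized K-composition in the right-multiply convention."""
    return float(
        np.linalg.norm(OV @ W_K) / (np.linalg.norm(OV) * np.linalg.norm(W_K))
    )


rng = np.random.default_rng(0)
d_model, d_head = 256, 32  # illustrative sizes, smaller than GPT-2 Small

W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))
OV = W_V @ W_O  # [d_model, d_model], what the source head writes

# Unrelated key matrix: reads from a random subspace of the residual stream.
W_K_random = rng.standard_normal((d_model, d_head))

# Aligned key matrix: reads from exactly the subspace that W_O writes into.
W_K_aligned = W_O.T @ rng.standard_normal((d_head, d_head))

random_score = composition_score(OV, W_K_random)
aligned_score = composition_score(OV, W_K_aligned)
print(f"random baseline: {random_score:.3f}")
print(f"aligned keys:    {aligned_score:.3f}")
```

The aligned case should score clearly higher than the random baseline. The baseline itself shrinks as `d_model` grows, which is worth keeping in mind when reading fixed thresholds.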

---

## Part 7: Ablation Study

Let's verify induction heads are actually important by ablating (removing) them!
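
Before hooking into the model, here is a minimal NumPy sketch of the two ablation styles on a stand-in array shaped like `hook_z` (`[batch, seq, n_heads, d_head]`). The array is random and the sizes are arbitrary; the mean here is taken over batch and positions while keeping one value per `d_head` dimension, which is one common choice rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, n_heads, d_head = 2, 4, 3, 5
z = rng.standard_normal((batch, seq, n_heads, d_head))  # stand-in for hook_z
head = 1

# Zero ablation: remove the head's output entirely.
z_zero = z.copy()
z_zero[:, :, head, :] = 0.0

# Mean ablation: replace the head's output at every position with its average
# over batch and sequence, keeping a separate mean per d_head dimension.
z_mean = z.copy()
head_mean = z[:, :, head, :].mean(axis=(0, 1), keepdims=True)  # [1, 1, d_head]
z_mean[:, :, head, :] = head_mean

print("other heads untouched:", np.array_equal(z_zero[:, :, 0, :], z[:, :, 0, :]))
print("mean-ablated head is constant across positions:",
      np.allclose(z_mean[0, 0, head], z_mean[1, 3, head]))
```

Zero ablation can push activations far from anything the model saw in training, which is why mean ablation is often used as a gentler comparison.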

In [None]:
def ablate_heads(
    model: HookedTransformer,
    tokens: torch.Tensor,
    heads_to_ablate: List[Tuple[int, int]],
    method: str = "zero"
) -> torch.Tensor:
    """
    Ablate specific heads and return logits.
    
    Args:
        model: The transformer model
        tokens: Input tokens
        heads_to_ablate: List of (layer, head) tuples to ablate
        method: 'zero' to zero out, 'mean' to replace with mean
    
    Returns:
        Logits after ablation
    """
    def ablate_hook(activation, hook, layer, head):
        # activation shape: [batch, seq, n_heads, d_head]
        if method == "zero":
            activation[:, :, head, :] = 0
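        elif method == "mean":
            # Replace with this head's mean over batch and positions (one value per d_head dim)
            activation[:, :, head, :] = activation[:, :, head, :].mean(dim=(0, 1), keepdim=True)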
        return activation
    
    # Create hooks for all heads to ablate
    hooks = []
    for layer, head in heads_to_ablate:
        hook_name = f"blocks.{layer}.attn.hook_z"
        hook_fn = lambda act, hook, l=layer, h=head: ablate_hook(act, hook, l, h)
        hooks.append((hook_name, hook_fn))
    
    return model.run_with_hooks(tokens, fwd_hooks=hooks)

In [None]:
def measure_induction_loss(
    model: HookedTransformer,
    seq_len: int = 25,
    heads_to_ablate: Optional[List[Tuple[int, int]]] = None,
    n_samples: int = 5
) -> Tuple[float, float]:
    """
    Measure loss on induction task, optionally with ablations.
    
    Returns:
        (loss, accuracy) tuple
    """
    total_loss = 0
    total_correct = 0
    total_count = 0
    
    for seed in range(n_samples):
        tokens = create_repeated_sequence(seq_len, seed=seed)
        
        if heads_to_ablate:
            logits = ablate_heads(model, tokens, heads_to_ablate)
        else:
            logits = model(tokens)
        
        # Compute loss on second half predictions
        for pos in range(seq_len, 2 * seq_len - 1):
            target = tokens[0, pos + 1]
            pred_logits = logits[0, pos, :]
            
            # Cross-entropy loss
            loss = torch.nn.functional.cross_entropy(
                pred_logits.unsqueeze(0), target.unsqueeze(0)
            )
            total_loss += loss.item()
            
            # Accuracy
            if pred_logits.argmax() == target:
                total_correct += 1
            
            total_count += 1
    
    return total_loss / total_count, total_correct / total_count

In [None]:
# Baseline without ablation
baseline_loss, baseline_acc = measure_induction_loss(model)
print(f"Baseline (no ablation):")
print(f"  Loss: {baseline_loss:.3f}")
print(f"  Accuracy: {baseline_acc:.1%}")
print()

In [None]:
# Ablate induction heads
if top_induction_heads:
    induction_head_list = [(l, h) for l, h, _ in top_induction_heads]
    
    ablated_loss, ablated_acc = measure_induction_loss(
        model, heads_to_ablate=induction_head_list
    )
    
    print(f"With induction heads ablated ({len(induction_head_list)} heads):")
    print(f"  Loss: {ablated_loss:.3f} (change: {ablated_loss - baseline_loss:+.3f})")
    print(f"  Accuracy: {ablated_acc:.1%} (change: {(ablated_acc - baseline_acc)*100:+.1f}%)")
    print()
    print(f"Loss increase: {(ablated_loss / baseline_loss - 1) * 100:.0f}%")

In [None]:
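# Compare: ablate the same number of random heads (excluding the induction heads)
random.seed(42)
n_random = len(top_induction_heads) if top_induction_heads else 3
excluded = {(l, h) for l, h, _ in top_induction_heads}
candidates = [(l, h) for l in range(model.cfg.n_layers)
              for h in range(model.cfg.n_heads) if (l, h) not in excluded]
random_heads = random.sample(candidates, n_random)  # no duplicates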

random_loss, random_acc = measure_induction_loss(
    model, heads_to_ablate=random_heads
)

print(f"With random heads ablated ({len(random_heads)} heads):")
print(f"  Loss: {random_loss:.3f} (change: {random_loss - baseline_loss:+.3f})")
print(f"  Accuracy: {random_acc:.1%} (change: {(random_acc - baseline_acc)*100:+.1f}%)")

### Interpreting Ablation Results

If induction heads are truly important:
- Ablating them should **significantly increase loss** and **decrease accuracy**
- Ablating random heads should have **much smaller effect**

This provides causal evidence that these heads are responsible for induction!

---

## Try It Yourself

### Exercise 1: Visualize Previous Token Head Patterns
Visualize the attention patterns of the top previous token heads. Do they show the expected diagonal pattern?

<details>
<summary>Hint</summary>

Use the same visualization code as for induction heads. Previous token heads should show a diagonal stripe one position below the main diagonal.
</details>

In [None]:
# Exercise 1: Your code here



### Exercise 2: Ablate Previous Token Heads
What happens to induction performance if you ablate the previous token heads instead of the induction heads?

<details>
<summary>Hint</summary>

If induction heads depend on previous token heads, ablating the latter should also hurt induction performance - maybe even more!
</details>

In [None]:
# Exercise 2: Your code here



### Exercise 3: Natural Language Induction
Test induction on natural language: does "Harry Potter is a wizard. Harry" → " Potter"?
Do the same induction heads activate for this more realistic example?

<details>
<summary>Hint</summary>

Tokenize the natural language example and check attention patterns. Real text may trigger different/additional heads due to semantic content.
</details>

In [None]:
# Exercise 3: Your code here



---

## Common Mistakes

### Mistake 1: Wrong Induction Target Position
```python
# Wrong: Looking at position (i - seq_len)
target = pos - seq_len  # This is where the token APPEARED

# Correct: Looking at position (i - seq_len + 1)
target = pos - seq_len + 1  # This is what FOLLOWED that token
```
**Why:** Induction heads look at what *came after* the repeated token, not the token itself!
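
The off-by-one is easy to check by hand on a tiny repeated sequence. This is a plain-Python sketch with letters standing in for tokens; `check_targets` is a hypothetical helper written only for this check.

```python
def check_targets(first_half, pos):
    """Return (token at pos - seq_len, token at pos - seq_len + 1, true next token)."""
    seq_len = len(first_half)
    tokens = first_half + first_half  # the repeated sequence
    same_token = tokens[pos - seq_len]            # where the current token APPEARED
    induction_target = tokens[pos - seq_len + 1]  # what FOLLOWED that token
    next_token = tokens[pos + 1]                  # what the model must predict
    return same_token, induction_target, next_token


# Position 5 in "A B C D A B C D" is the second "B".
same, target, nxt = check_targets(["A", "B", "C", "D"], pos=5)
print(same, target, nxt)  # prints: B C C
```

Only the `+ 1` version lands on the token that matches the correct prediction.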

### Mistake 2: Confusing Induction with Copying
```python
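# Sequence: "A B C ... A" (the query is the second "A")
# Duplicate-token head: attends to the first "A" (the same token)
# Induction head: attends to "B" (the token that came after the first "A")
```
**Why:** Induction is about completing a pattern by looking one step past the earlier match, not about attending to the matching token itself.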

### Mistake 3: Layer Order Confusion
```python
# Wrong: Expecting induction heads in layer 0
# Induction heads need previous token heads to run first!

# Typical ordering in GPT-2 Small (approximate):
# Early layers (about 0-4): Previous token heads
# Later layers (about 4-7): Induction heads, always after the head they read from
```
**Why:** Induction requires two-step composition that needs layer separation.

---

## Checkpoint

You've learned:
- What induction heads are and how they work
- How to create test data for induction behavior
- How to identify induction heads through attention patterns
- The role of previous token heads in the induction circuit
- How to verify importance through ablation studies

---

## Challenge (Optional)

**Induction Head Formation**

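The original Anthropic paper found that induction heads form abruptly in a narrow window early in training, after roughly 2.5 to 5 billion tokens. Can you:

1. Load a small model with publicly released training checkpoints (the standard GPT-2 weights are only available as final checkpoints, so you will need a different model)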
2. Measure induction scores at each checkpoint
3. Find when induction heads "turn on"

This is advanced but reveals fascinating training dynamics!

In [None]:
# Challenge: Training dynamics
# Your code here



---

## Further Reading

- [In-context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) - The original Anthropic paper
- [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html) - Foundation for circuit analysis
- [Induction Heads - Illustrated (LessWrong)](https://www.lesswrong.com/posts/TvrfY4c9eaGLeyDkE/induction-heads-illustrated) - Illustrated guide

---

## Cleanup

In [None]:
# Clear GPU memory
del cache
gc.collect()
torch.cuda.empty_cache()

print("Memory cleared!")

---

## What's Next?

In **Lab C.4**, we'll explore **Sparse Autoencoders (SAEs)** - a cutting-edge technique for extracting interpretable features from neural networks. SAEs can find individual concepts like "Python code", "questions", or "French text" hidden in model activations!

**Next:** [Lab C.4: Feature Extraction with SAEs](04-feature-extraction-saes.ipynb)