# Lab C.2: Activation Patching on IOI

**Module:** C - Mechanistic Interpretability  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the IOI (Indirect Object Identification) task
- [ ] Implement activation patching to identify causal components
- [ ] Find which layers and heads are crucial for IOI
- [ ] Create heatmaps of component importance
- [ ] Compare your findings to published IOI circuit research

---

## Prerequisites

- Completed: Lab C.1 (TransformerLens Setup)
- Knowledge of: Attention patterns, caching activations

---

## Real-World Context

**Why IOI?** The Indirect Object Identification task is the "fruit fly" of mechanistic interpretability - a simple, well-defined task where we can fully reverse-engineer the model's algorithm. The [original IOI paper](https://arxiv.org/abs/2211.00593) is considered a landmark in interpretability research.

Understanding how to find circuits in simple tasks gives us the tools to tackle more complex behaviors like:
- Why does the model hallucinate?
- How does it do multi-step reasoning?
- Where does it store factual knowledge?

---

## ELI5: What is Activation Patching?

> **Imagine you're a detective investigating a crime**. You have two scenarios:
>
> 1. **Clean scenario**: The crime happened (model predicts correctly)
> 2. **Corrupted scenario**: The crime didn't happen (model predicts incorrectly)
>
> **Activation patching** is like swapping one piece of evidence at a time from the "no crime" scenario into the "crime" scenario. If swapping that one piece suddenly makes everyone think no crime happened, you've found crucial evidence!
>
> **For neural networks:**
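> - Clean run: "John and Mary went to the store. John gave a book to" → model predicts "Mary"
> - Corrupted run: "John and Mary went to the store. Mary gave a book to" → model predicts "John"
> - Patching: Rerun the clean prompt, but overwrite layer 5's activations with the ones cached from the corrupted run
> - If the prediction flips from "Mary" to "John" → layer 5 carries information the task depends on!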
>
> This is **causal intervention** - we're not just observing, we're actively changing things to prove what matters!

---

## ELI5: The IOI Task

> **Consider this sentence:** "John and Mary went to the store. John gave a book to ___"
>
> **Who should fill in the blank?** Mary! Because:
> 1. John already appeared as the giver
> 2. Mary is the other person mentioned
> 3. So Mary must be the receiver
>
> **This is Indirect Object Identification (IOI)**:
> - The subject (John) does an action
> - The indirect object (Mary) receives the action
> - The model must figure out that "to ___" should be Mary, not John
>
> **Why is this interesting?** It requires the model to:
> 1. Track who's mentioned (John and Mary)
> 2. Notice John appears twice (as subject)
> 3. Conclude Mary should fill the blank (by elimination)
>
> A simple task, but it reveals sophisticated circuitry!
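
The elimination rule above can be written down as a few lines of plain Python. This is only a sketch of the *task*, not of how GPT-2 solves it; `label_ioi_roles` and its `names` argument are hypothetical helpers introduced here for illustration.

```python
from collections import Counter
from typing import Tuple

def label_ioi_roles(prompt: str, names: Tuple[str, ...]) -> Tuple[str, str]:
    """Return (subject, indirect_object) for a two-name IOI prompt.

    The subject is the name mentioned twice; the indirect object is the
    name mentioned once.
    """
    # Strip simple punctuation so "store." still splits into a clean word.
    words = prompt.replace(".", " ").replace(",", " ").split()
    counts = Counter(w for w in words if w in names)
    if sorted(counts.values()) != [1, 2]:
        raise ValueError(f"Expected one name once and one name twice, got {dict(counts)}")
    subject = next(n for n, c in counts.items() if c == 2)
    indirect_object = next(n for n, c in counts.items() if c == 1)
    return subject, indirect_object

roles = label_ioi_roles(
    "John and Mary went to the store. John gave a book to",
    names=("John", "Mary"),
)
print(roles)  # ('John', 'Mary'): John is the subject, Mary fills the blank
```

The rest of this lab asks which internal components of the model implement something like this rule.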

---

## Part 1: Setup and IOI Dataset

In [None]:
# Core imports
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gc
from typing import List, Dict, Optional, Tuple, Callable
from dataclasses import dataclass
from functools import partial
import random

# TransformerLens
from transformer_lens import HookedTransformer, utils
from transformer_lens.hook_points import HookPoint

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Settings
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
torch.set_grad_enabled(False)  # We don't need gradients for interpretability

print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")

In [None]:
# Load model
model = HookedTransformer.from_pretrained(
    "gpt2-small",
    device="cuda" if torch.cuda.is_available() else "cpu"
)

print(f"Loaded GPT-2 Small: {model.cfg.n_layers} layers, {model.cfg.n_heads} heads/layer")

In [None]:
# Create IOI dataset
# Clean: "[Name1] and [Name2] ... [Name1] gave ... to" -> [Name2]
# Corrupted: Same prompt, but [Name2] is the subject of the second sentence -> [Name1]

@dataclass
class IOIExample:
    """A single IOI example with clean and corrupted versions."""
    clean: str
    corrupted: str
    answer: str  # The correct completion (e.g., " Mary")
    wrong_answer: str  # What corrupted version would predict
    
def create_ioi_dataset(n_samples: int = 100, seed: int = 42) -> List[IOIExample]:
    """
    Create IOI dataset with clean and corrupted examples.
    
    Clean: "John and Mary went to the store. John gave a book to" -> "Mary"
    Corrupted: "John and Mary went to the store. Mary gave a book to" -> "John"
    
    The corruption swaps which name appears as the subject of the second sentence.
    """
    random.seed(seed)
    
    names = ["John", "Mary", "Alice", "Bob", "James", "Emma", 
             "Michael", "Sarah", "David", "Lisa", "Tom", "Kate"]
    places = ["store", "park", "beach", "library", "cafe", "museum", "restaurant"]
    objects = ["book", "ball", "key", "letter", "gift", "phone", "bag", "drink"]
    
    dataset = []
    
    for _ in range(n_samples):
        # Pick two different names
        name1, name2 = random.sample(names, 2)
        place = random.choice(places)
        obj = random.choice(objects)
        
        # Clean: Name1 is the subject, so Name2 is the answer
        clean = f"{name1} and {name2} went to the {place}. {name1} gave a {obj} to"
        
        # Corrupted: Name2 is the subject, so Name1 would be the answer
        corrupted = f"{name1} and {name2} went to the {place}. {name2} gave a {obj} to"
        
        dataset.append(IOIExample(
            clean=clean,
            corrupted=corrupted,
            answer=f" {name2}",
            wrong_answer=f" {name1}"
        ))
    
    return dataset

# Create dataset
ioi_dataset = create_ioi_dataset(n_samples=50)

# Show example
example = ioi_dataset[0]
print("IOI Example:")
print("=" * 60)
print(f"Clean:     '{example.clean}'")
print(f"Answer:    '{example.answer}'")
print()
print(f"Corrupted: '{example.corrupted}'")
print(f"Would predict: '{example.wrong_answer}'")

In [None]:
# Verify the model actually does IOI correctly
def test_ioi_accuracy(model, dataset: List[IOIExample], n_test: int = 10) -> float:
    """Test how well the model does on IOI."""
    correct = 0
    
    for ex in dataset[:n_test]:
        tokens = model.to_tokens(ex.clean)
        logits = model(tokens)
        
        # Get prediction
        pred_token = logits[0, -1, :].argmax().item()
        pred_str = model.tokenizer.decode(pred_token)
        
        # Check if correct
        if pred_str.strip() == ex.answer.strip():
            correct += 1
            status = "✓"
        else:
            status = "✗"
        
        print(f"{status} Predicted '{pred_str}', expected '{ex.answer}'")
    
    accuracy = correct / max(1, min(n_test, len(dataset)))  # divide by the number of examples actually evaluated
    print(f"\nAccuracy: {accuracy:.0%}")
    return accuracy

print("Testing model on IOI task:")
print("=" * 60)
accuracy = test_ioi_accuracy(model, ioi_dataset)

### What Just Happened?

GPT-2 Small should get most of these prompts right (check the accuracy printed above). It has learned to track which name is the subject and predict the other name as the indirect object. Now let's figure out *how* it does this.

---

## Part 2: Understanding Activation Patching

### The Core Idea

1. **Run clean prompt** → The model prefers the correct answer "Mary"
2. **Run corrupted prompt** → The model now prefers "John" (right for the corrupted prompt, wrong for the clean one)
3. **Patch**: Rerun the clean prompt, but replace some activations with the values cached from the corrupted run
4. **Measure**: How far does the prediction move toward the corrupted answer?

If patching component X makes the model predict "John" instead of "Mary", then X carries information the IOI behavior depends on!
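
Before touching GPT-2, the four steps can be tried on a toy "model" whose output is just the sum of three named components. This is a minimal sketch under made-up assumptions: the component names, inputs, and the `toy_run` helper are all hypothetical and have nothing to do with TransformerLens.

```python
from typing import Callable, Dict, Optional, Tuple

ToyInput = Tuple[float, float]

# Each component reads the input and writes one number; the output is their sum.
TOY_COMPONENTS: Dict[str, Callable[[ToyInput], float]] = {
    "reads_feature_0": lambda x: 2.0 * x[0],
    "reads_feature_1": lambda x: x[1],
    "constant_bias": lambda x: 0.5,
}

def toy_run(x: ToyInput, patch: Optional[Dict[str, float]] = None):
    """Run the toy model, optionally overwriting some component outputs."""
    cache: Dict[str, float] = {}
    for name, fn in TOY_COMPONENTS.items():
        value = fn(x)
        if patch is not None and name in patch:
            value = patch[name]  # the patch: use the cached value instead
        cache[name] = value
    return sum(cache.values()), cache

toy_clean_input, toy_corrupted_input = (1.0, 3.0), (-1.0, 3.0)  # differ in feature 0 only
toy_clean_out, _ = toy_run(toy_clean_input)                       # step 1
toy_corrupted_out, toy_corrupted_cache = toy_run(toy_corrupted_input)  # step 2

toy_effects = {}
for name in TOY_COMPONENTS:
    # Step 3: rerun the clean input with one component taken from the corrupted run.
    patched_out, _ = toy_run(toy_clean_input, patch={name: toy_corrupted_cache[name]})
    # Step 4: 0 = still behaves like clean, 1 = behaves like corrupted.
    toy_effects[name] = (toy_clean_out - patched_out) / (toy_clean_out - toy_corrupted_out)

print(toy_effects)
```

Only `reads_feature_0` gets an effect of 1.0, because it is the only component whose output differs between the two inputs. The rest of the lab does the same thing with real activations.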

In [None]:
# Let's understand the patching process step by step
ex = ioi_dataset[0]

# Tokenize both versions
clean_tokens = model.to_tokens(ex.clean)
corrupted_tokens = model.to_tokens(ex.corrupted)

print("Clean tokens:")
print(model.to_str_tokens(clean_tokens))
print(f"\nCorrupted tokens:")
print(model.to_str_tokens(corrupted_tokens))

# Note: Patching assumes both prompts tokenize to the same length,
# so that position i refers to the same slot in the clean and corrupted runs.
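
It can help to check that alignment explicitly rather than by eye. The sketch below uses a hypothetical helper, `find_mismatched_positions`, and hand-written token lists that stand in for tokenizer output (they are an illustration, not the actual output of `model.to_str_tokens`).

```python
from typing import List

def find_mismatched_positions(clean_str_tokens: List[str],
                              corrupted_str_tokens: List[str]) -> List[int]:
    """Return the positions where two equal-length token lists differ."""
    if len(clean_str_tokens) != len(corrupted_str_tokens):
        raise ValueError(
            f"Length mismatch: {len(clean_str_tokens)} vs {len(corrupted_str_tokens)} tokens; "
            "position-wise patching would mix up unrelated positions"
        )
    return [i for i, (c, k) in enumerate(zip(clean_str_tokens, corrupted_str_tokens)) if c != k]

# Hand-written stand-ins for the two tokenized prompts.
demo_clean = ["<|endoftext|>", "John", " and", " Mary", " went", " to", " the",
              " store", ".", " John", " gave", " a", " book", " to"]
demo_corrupted = ["<|endoftext|>", "John", " and", " Mary", " went", " to", " the",
                  " store", ".", " Mary", " gave", " a", " book", " to"]

mismatches = find_mismatched_positions(demo_clean, demo_corrupted)
print(mismatches)  # → [9]
```

In the notebook, the same check could be run on `model.to_str_tokens(ex.clean)` and `model.to_str_tokens(ex.corrupted)`; a single differing position (the second-sentence subject) is what this dataset is designed to produce.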

In [None]:
# Get answer token IDs
answer_token = model.to_single_token(ex.answer)
wrong_token = model.to_single_token(ex.wrong_answer)

print(f"Answer token: '{ex.answer}' -> ID {answer_token}")
print(f"Wrong token: '{ex.wrong_answer}' -> ID {wrong_token}")

In [None]:
# Run both and cache activations
clean_logits, clean_cache = model.run_with_cache(clean_tokens)
corrupted_logits, corrupted_cache = model.run_with_cache(corrupted_tokens)

# Get probabilities for the answer
clean_probs = torch.softmax(clean_logits[0, -1, :], dim=-1)
corrupted_probs = torch.softmax(corrupted_logits[0, -1, :], dim=-1)

print(f"Clean prediction:")
print(f"  P('{ex.answer}'): {clean_probs[answer_token].item():.2%}")
print(f"  P('{ex.wrong_answer}'): {clean_probs[wrong_token].item():.2%}")

print(f"\nCorrupted prediction:")
print(f"  P('{ex.answer}'): {corrupted_probs[answer_token].item():.2%}")
print(f"  P('{ex.wrong_answer}'): {corrupted_probs[wrong_token].item():.2%}")

### The Metric: Logit Difference

We'll use **logit difference** as our metric:
- Clean run: logit(Mary) - logit(John) = positive (prefers Mary)
- Corrupted run: logit(Mary) - logit(John) = negative (prefers John)

When we patch, we measure how far the logit difference moves from its clean value toward its corrupted value.
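
One reason to prefer logit difference over raw probability: it equals the log odds between the two names and is unaffected by what the model thinks of every other token. The sketch below checks this on made-up logits over a four-token vocabulary (the numbers and the helper name `logit_diff_and_log_odds` are assumptions for illustration).

```python
import math
from typing import List, Tuple

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_diff_and_log_odds(logits: List[float], a: int, b: int) -> Tuple[float, float, float]:
    """Return (P(a), logit[a] - logit[b], log(P(a) / P(b)))."""
    probs = softmax(logits)
    return probs[a], logits[a] - logits[b], math.log(probs[a] / probs[b])

# Index 0 plays the role of " Mary", index 1 of " John".
logits_quiet = [3.0, 1.0, 0.0, -2.0]
logits_noisy = [3.0, 1.0, 5.0, -2.0]  # an unrelated token (index 2) is now very likely

results = [logit_diff_and_log_odds(l, a=0, b=1) for l in (logits_quiet, logits_noisy)]
for p_a, diff, log_odds in results:
    print(f"P(Mary)={p_a:.3f}  logit diff={diff:.2f}  log odds={log_odds:.2f}")
```

The probability of " Mary" drops sharply in the second case, while the logit difference and the log odds stay at 2.00 in both.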

In [None]:
def compute_logit_diff(logits, answer_token, wrong_token, position=-1):
    """
    Compute logit difference: logit(correct) - logit(incorrect)
    Positive = prefers correct, Negative = prefers incorrect
    """
    return (logits[0, position, answer_token] - logits[0, position, wrong_token]).item()

clean_logit_diff = compute_logit_diff(clean_logits, answer_token, wrong_token)
corrupted_logit_diff = compute_logit_diff(corrupted_logits, answer_token, wrong_token)

print(f"Clean logit diff: {clean_logit_diff:.2f} (positive = prefers '{ex.answer}')")
print(f"Corrupted logit diff: {corrupted_logit_diff:.2f} (negative = prefers '{ex.wrong_answer}')")
print(f"\nDifference: {clean_logit_diff - corrupted_logit_diff:.2f}")

---

## Part 3: Implementing Activation Patching

Now let's implement patching! We'll patch activations from the corrupted run into the clean run and measure how it affects the output.

In [None]:
def patch_residual_stream(
    model: HookedTransformer,
    clean_tokens: torch.Tensor,
    corrupted_cache: dict,
    layer: int,
    position: Optional[int] = None
) -> torch.Tensor:
    """
    Patch the residual stream at a specific layer.
    
    Args:
        model: The transformer model
        clean_tokens: Tokens for clean run
        corrupted_cache: Cached activations from corrupted run
        layer: Which layer to patch
        position: Which position to patch (None = all positions)
    
    Returns:
        Logits after patching
    """
    def patch_hook(activation, hook):
        # activation shape: [batch, seq, d_model]
        corrupted_activation = corrupted_cache[hook.name]
        
        if position is None:
            # Patch all positions
            return corrupted_activation
        else:
            # Patch only specific position
            activation[:, position, :] = corrupted_activation[:, position, :]
            return activation
    
    # Run with hook
    hook_name = f"blocks.{layer}.hook_resid_post"
    patched_logits = model.run_with_hooks(
        clean_tokens,
        fwd_hooks=[(hook_name, patch_hook)]
    )
    
    return patched_logits

# Test patching layer 5
patched_logits = patch_residual_stream(model, clean_tokens, corrupted_cache, layer=5)
patched_logit_diff = compute_logit_diff(patched_logits, answer_token, wrong_token)

print(f"Original clean logit diff: {clean_logit_diff:.2f}")
print(f"Patched (layer 5) logit diff: {patched_logit_diff:.2f}")
print(f"Effect: {clean_logit_diff - patched_logit_diff:.2f}")

In [None]:
def compute_patching_effect(
    clean_logit_diff: float,
    corrupted_logit_diff: float,
    patched_logit_diff: float
) -> float:
    """
    Compute normalized patching effect.
    
    Effect = 0: Patching had no effect (still behaves like clean)
    Effect = 1: Patching fully corrupted behavior (behaves like corrupted)
    
    Formula: (clean - patched) / (clean - corrupted)
    """
    return (clean_logit_diff - patched_logit_diff) / (clean_logit_diff - corrupted_logit_diff + 1e-10)

In [None]:
# Patch every layer and measure effects
layer_effects = []

for layer in range(model.cfg.n_layers):
    patched_logits = patch_residual_stream(model, clean_tokens, corrupted_cache, layer=layer)
    patched_diff = compute_logit_diff(patched_logits, answer_token, wrong_token)
    effect = compute_patching_effect(clean_logit_diff, corrupted_logit_diff, patched_diff)
    layer_effects.append(effect)
    
# Plot
fig = px.bar(
    x=list(range(model.cfg.n_layers)),
    y=layer_effects,
    title="Residual Stream Patching Effects by Layer",
    labels={"x": "Layer", "y": "Patching Effect (0=none, 1=full)"}
)
fig.add_hline(y=0, line_dash="dash", line_color="gray")
fig.show()

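### Interpreting Layer Effects

- **High effect** (close to 1): Patching here pushes the model to the corrupted answer, so the patched activations carry information IOI depends on
- **Low effect** (close to 0): The patched activations don't matter much for IOI
- **Negative effect**: Patching actually increases the preference for the correct answer (rare, but interesting)

One caveat: the loop above patches `hook_resid_post` at *every* position. That overwrites the entire state passed on to later layers, so the patched run is identical to the corrupted run and every layer shows an effect of about 1. To localize where the information lives, pass a `position` argument to `patch_residual_stream` and sweep over both layers and positions; the effect then concentrates at the layers and positions where the IOI circuit operates.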
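
The difference between patching all positions and patching a single position can be seen in a toy residual stream built with NumPy. This is a sketch under invented assumptions: the random weights, the `causal_mean` mixing step, and `toy_resid_run` are stand-ins, not a transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, SEQ, D = 3, 4, 3
toy_weights = [rng.normal(size=(D, D)) for _ in range(N_LAYERS)]
toy_readout = rng.normal(size=D)

def causal_mean(x: np.ndarray) -> np.ndarray:
    # Position i sees the mean of positions 0..i (a crude stand-in for causal attention).
    return np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]

def toy_resid_run(x, patch_layer=None, patch_pos=None, source_states=None):
    """Return (scalar metric, per-layer residual states), optionally patching one layer."""
    states = []
    for layer in range(N_LAYERS):
        x = x + np.tanh(causal_mean(x) @ toy_weights[layer])  # new array each layer
        if layer == patch_layer:
            if patch_pos is None:
                x = source_states[layer].copy()                  # overwrite every position
            else:
                x[patch_pos] = source_states[layer][patch_pos]   # overwrite one position
        states.append(x.copy())
    return float(x[-1] @ toy_readout), states  # metric is read from the final position

toy_clean = rng.normal(size=(SEQ, D))
toy_corrupted = toy_clean.copy()
toy_corrupted[1] = rng.normal(size=D)  # the two inputs differ only at position 1

clean_metric, _ = toy_resid_run(toy_clean)
corrupted_metric, corrupted_states = toy_resid_run(toy_corrupted)

def toy_effect(patched_metric: float) -> float:
    return (clean_metric - patched_metric) / (clean_metric - corrupted_metric)

all_pos_effects = [toy_effect(toy_resid_run(toy_clean, l, None, corrupted_states)[0])
                   for l in range(N_LAYERS)]
last_pos_effects = [toy_effect(toy_resid_run(toy_clean, l, -1, corrupted_states)[0])
                    for l in range(N_LAYERS)]
print("all positions:", np.round(all_pos_effects, 3))
print("last position:", np.round(last_pos_effects, 3))
```

Patching all positions gives an effect of 1 at every layer, which says nothing about where the computation happens. Patching only the last position gives layer-dependent values, reaching exactly 1 at the final layer because that is where the metric is read out.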

---

## Part 4: Head-Level Patching

Now let's go deeper - instead of patching entire layers, let's patch individual attention heads to find exactly which heads matter!

In [None]:
def patch_attention_head(
    model: HookedTransformer,
    clean_tokens: torch.Tensor,
    corrupted_cache: dict,
    layer: int,
    head: int
) -> torch.Tensor:
    """
    Patch a specific attention head's output.
    
    We patch the 'z' tensor which is the weighted combination of values,
    before the output projection.
    """
    def patch_hook(activation, hook):
        # activation shape: [batch, seq, n_heads, d_head]
        corrupted_activation = corrupted_cache[hook.name]
        activation[:, :, head, :] = corrupted_activation[:, :, head, :]
        return activation
    
    hook_name = f"blocks.{layer}.attn.hook_z"
    patched_logits = model.run_with_hooks(
        clean_tokens,
        fwd_hooks=[(hook_name, patch_hook)]
    )
    
    return patched_logits

In [None]:
# Patch every head and measure effects
n_layers = model.cfg.n_layers
n_heads = model.cfg.n_heads

head_effects = np.zeros((n_layers, n_heads))

print("Computing head patching effects...")
for layer in range(n_layers):
    for head in range(n_heads):
        patched_logits = patch_attention_head(model, clean_tokens, corrupted_cache, layer, head)
        patched_diff = compute_logit_diff(patched_logits, answer_token, wrong_token)
        effect = compute_patching_effect(clean_logit_diff, corrupted_logit_diff, patched_diff)
        head_effects[layer, head] = effect
    print(f"  Layer {layer} done")

print("\nDone!")

In [None]:
# Visualize head effects as a heatmap
fig = px.imshow(
    head_effects,
    labels={"x": "Head", "y": "Layer", "color": "Effect"},
    color_continuous_scale="RdBu_r",
    color_continuous_midpoint=0,
    title="Attention Head Patching Effects (IOI Task)"
)
fig.update_layout(width=800, height=600)
fig.show()

In [None]:
# Find the most important heads
print("Most important attention heads for IOI:")
print("=" * 50)

# Flatten and sort
flat_effects = head_effects.flatten()
sorted_indices = np.argsort(np.abs(flat_effects))[::-1]

for i, idx in enumerate(sorted_indices[:15]):
    layer = idx // n_heads
    head = idx % n_heads
    effect = head_effects[layer, head]
    direction = "→ corrupts" if effect > 0 else "→ helps"
    print(f"{i+1:2d}. L{layer}H{head}: effect = {effect:.3f} {direction}")

### Interpreting Head Effects

From the IOI paper, we expect to find:

1. **Name Mover Heads** (positive effect): These heads move the correct name to the output
   - They attend from the final position to the indirect object name
   - Patching them breaks the correct prediction

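2. **Negative Name Mover Heads** (negative effect): These heads write in the opposite direction to the name movers, lowering the logit of the correct name
   - Patching them can actually help the correct prediction (less common)
   - They are distinct from **S-Inhibition Heads**, which stop the name movers from attending to the subject name and should show a positive effect when patched

3. **Backup Name Mover Heads**: Heads that do the same job as name movers but only kick in when the main name movers are knocked out

Keep in mind that a positive effect alone does not make a head a name mover: S-inhibition, induction, and duplicate token heads earlier in the circuit also hurt the prediction when patched.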

---

## Part 5: Aggregating Over the Dataset

Let's run patching on multiple examples to get robust results.

In [None]:
def run_patching_experiment(
    model: HookedTransformer,
    dataset: List[IOIExample],
    n_samples: int = 20
) -> np.ndarray:
    """
    Run head patching experiment over multiple examples.
    Returns aggregated effects [n_layers, n_heads].
    """
    n_layers = model.cfg.n_layers
    n_heads = model.cfg.n_heads
    
    all_effects = np.zeros((n_layers, n_heads))
    n_used = 0  # examples that actually contribute to the average
    
    for i, ex in enumerate(dataset[:n_samples]):
        if i % 5 == 0:
            print(f"Processing example {i+1}/{n_samples}...")
        
        # Tokenize
        clean_tokens = model.to_tokens(ex.clean)
        corrupted_tokens = model.to_tokens(ex.corrupted)
        
        # Get answer tokens
        answer_token = model.to_single_token(ex.answer)
        wrong_token = model.to_single_token(ex.wrong_answer)
        
        # Run both
        clean_logits, clean_cache = model.run_with_cache(clean_tokens)
        corrupted_logits, corrupted_cache = model.run_with_cache(corrupted_tokens)
        
        # Baselines
        clean_diff = compute_logit_diff(clean_logits, answer_token, wrong_token)
        corrupted_diff = compute_logit_diff(corrupted_logits, answer_token, wrong_token)
        
        # Skip if the model doesn't get this one right, or if the corrupted
        # prompt doesn't lower the logit diff (normalization would be unstable)
        if clean_diff <= 0 or clean_diff - corrupted_diff < 1e-6:
            continue
        n_used += 1
        
        # Patch each head
        for layer in range(n_layers):
            for head in range(n_heads):
                patched_logits = patch_attention_head(
                    model, clean_tokens, corrupted_cache, layer, head
                )
                patched_diff = compute_logit_diff(patched_logits, answer_token, wrong_token)
                effect = compute_patching_effect(clean_diff, corrupted_diff, patched_diff)
                all_effects[layer, head] += effect
        
        # Clean up
        del clean_cache, corrupted_cache
    
    # Average over the examples that were not skipped
    if n_used == 0:
        raise ValueError("No usable examples: every example was skipped")
    all_effects /= n_used
    
    return all_effects

# Run experiment
print("Running patching experiment on dataset...")
print("(This will take a few minutes)\n")

aggregated_effects = run_patching_experiment(model, ioi_dataset, n_samples=20)
print("\nDone!")

In [None]:
# Visualize aggregated results
fig = px.imshow(
    aggregated_effects,
    labels={"x": "Head", "y": "Layer", "color": "Avg Effect"},
    color_continuous_scale="RdBu_r",
    color_continuous_midpoint=0,
    title="Aggregated Head Patching Effects (IOI Task, 20 examples)"
)
fig.update_layout(width=800, height=600)
fig.show()

In [None]:
# Find most important heads from aggregated results
print("Most important attention heads (aggregated over dataset):")
print("=" * 60)

flat_effects = aggregated_effects.flatten()
sorted_indices = np.argsort(np.abs(flat_effects))[::-1]

name_movers = []
negative_heads = []

for i, idx in enumerate(sorted_indices[:20]):
    layer = idx // n_heads
    head = idx % n_heads
    effect = aggregated_effects[layer, head]
    
    if effect > 0.05:
        name_movers.append((layer, head, effect))
    elif effect < -0.05:
        negative_heads.append((layer, head, effect))

print("\nPositive-effect heads (patching hurts; candidates include name movers and S-inhibition heads):")
for layer, head, effect in sorted(name_movers, key=lambda x: -x[2]):
    print(f"  L{layer}H{head}: {effect:.3f}")

print("\nNegative-effect heads (patching helps; candidates include negative name movers):")
for layer, head, effect in sorted(negative_heads, key=lambda x: x[2]):
    print(f"  L{layer}H{head}: {effect:.3f}")

---

## Part 6: Comparing to Published Results

The original IOI paper identified specific head roles. Let's compare our findings!

In [None]:
# Known IOI circuit heads from the literature
# (From "Interpretability in the Wild" paper; partial list, e.g. backup name mover heads are omitted)

IOI_CIRCUIT = {
    "name_mover": [
        (9, 9), (9, 6), (10, 0)  # L9H9, L9H6, L10H0
    ],
    "negative": [
        (10, 7), (11, 10)  # L10H7, L11H10
    ],
    "s_inhibition": [
        (7, 3), (7, 9), (8, 6), (8, 10)
    ],
    "induction": [
        (5, 5), (6, 9)
    ],
    "duplicate_token": [
        (0, 1), (3, 0)
    ],
    "previous_token": [
        (2, 2), (4, 11)
    ]
}

print("Known IOI Circuit Heads from Literature:")
print("=" * 50)
for role, heads in IOI_CIRCUIT.items():
    head_strs = ", ".join([f"L{l}H{h}" for l, h in heads])
    print(f"{role}: {head_strs}")

In [None]:
# Annotate our heatmap with known circuit heads
fig = go.Figure()

# Add heatmap
fig.add_trace(go.Heatmap(
    z=aggregated_effects,
    colorscale="RdBu_r",
    zmid=0,
    colorbar_title="Effect"
))

# Add annotations for known circuit heads
colors = {
    "name_mover": "green",
    "negative": "red",
    "s_inhibition": "purple",
    "induction": "orange",
    "duplicate_token": "cyan",
    "previous_token": "yellow"
}

for role, heads in IOI_CIRCUIT.items():
    for layer, head in heads:
        # Add circle marker
        fig.add_shape(
            type="circle",
            x0=head-0.4, x1=head+0.4,
            y0=layer-0.4, y1=layer+0.4,
            line=dict(color=colors[role], width=2)
        )

fig.update_layout(
    title="Head Patching Effects with Known IOI Circuit Heads Marked",
    xaxis_title="Head",
    yaxis_title="Layer",
    width=900,
    height=700
)
fig.show()

print("\nLegend:")
for role, color in colors.items():
    print(f"  {color}: {role}")

### Comparing Our Results to Literature

The heads we identified should overlap significantly with the known IOI circuit! Common findings:

1. **Name Mover Heads** (L9H9, L9H6, L10H0) have high positive effect
2. **Negative Heads** (L10H7, L11H10) may show negative effect
3. **Earlier layer heads** contribute indirectly by setting up the computation

Some variation is expected due to:
- Random sampling of examples
- Different corruption strategies
- Backup circuits that can compensate
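
To make "overlap significantly" concrete, one option is to list which known heads appear among your top-k heads by absolute effect. The sketch below uses hypothetical helpers (`top_k_heads`, `overlap_report`) and a synthetic effects grid with made-up values, so the printed overlap says nothing about GPT-2 itself.

```python
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer, head)

def top_k_heads(effects: Sequence[Sequence[float]], k: int) -> List[Head]:
    """Return the k (layer, head) pairs with the largest absolute effect."""
    scored = [(abs(value), (layer, head))
              for layer, row in enumerate(effects)
              for head, value in enumerate(row)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [head for _, head in scored[:k]]

def overlap_report(found: List[Head], known: Dict[str, List[Head]]) -> Dict[str, List[Head]]:
    """For each role, list the known heads that also appear in `found`."""
    found_set = set(found)
    return {role: [h for h in heads if h in found_set] for role, heads in known.items()}

# Synthetic 12x12 grid: zeros everywhere except four made-up effects.
demo_effects = [[0.0] * 12 for _ in range(12)]
demo_effects[9][9] = 0.4
demo_effects[9][6] = 0.3
demo_effects[10][7] = -0.2
demo_effects[1][1] = 0.1  # a head that is not in the known circuit

demo_known = {
    "name_mover": [(9, 9), (9, 6), (10, 0)],
    "negative": [(10, 7), (11, 10)],
}

demo_found = top_k_heads(demo_effects, k=4)
demo_report = overlap_report(demo_found, demo_known)
print(demo_found)
print(demo_report)
```

The same two functions should also accept the notebook's `aggregated_effects` array and `IOI_CIRCUIT` dictionary, since they only iterate over rows and values.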

---

## Try It Yourself

### Exercise 1: MLP Patching
Patch MLP layers instead of attention heads. Which layers are most important?

<details>
<summary>Hint</summary>

Use `f"blocks.{layer}.hook_mlp_out"` as the hook name. MLPs don't have heads, so you patch the entire layer.
</details>

In [None]:
# Exercise 1: Your code here
# Implement MLP patching



### Exercise 2: Position-Specific Patching
Patch only specific positions (e.g., just the name positions) instead of all positions. Does this reveal which positions carry the important information?

<details>
<summary>Hint</summary>

Modify the patch function to only patch specific token positions. First find where the names appear in the token list.
</details>

In [None]:
# Exercise 2: Your code here
# Implement position-specific patching



### Exercise 3: Attention Pattern Analysis
For the identified name mover heads, visualize their attention patterns. Do they actually attend to the indirect object name?

<details>
<summary>Hint</summary>

Use `cache["pattern", layer][0, head]` to get the attention pattern. Look at which positions the final token attends to.
</details>

In [None]:
# Exercise 3: Your code here
# Visualize attention patterns for name mover heads



---

## Common Mistakes

### Mistake 1: Forgetting to Use Corrupted Cache
```python
# Wrong: Patching with clean cache (does nothing!)
def patch_hook(activation, hook):
    return clean_cache[hook.name]  # Same values the clean run already has

# Correct: Patch with corrupted cache
def patch_hook(activation, hook):
    return corrupted_cache[hook.name]  # Introduces counterfactual
```
**Why:** The whole point is to see what happens when we inject "corrupted" information!

### Mistake 2: Not Normalizing Effects
```python
# Wrong: Using raw logit differences
effect = clean_diff - patched_diff  # Scale depends on example!

# Correct: Normalize by the clean-corrupted gap
effect = (clean_diff - patched_diff) / (clean_diff - corrupted_diff)
```
**Why:** Normalization makes effects comparable across examples with different logit scales.
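
A small worked example, with made-up logit differences, shows what normalization buys. The helper `normalized_effect` is a hypothetical variant of the formula above that refuses to divide by a near-zero gap.

```python
def normalized_effect(clean_diff: float, corrupted_diff: float, patched_diff: float) -> float:
    """Fraction of the clean-to-corrupted gap that the patch closes."""
    gap = clean_diff - corrupted_diff
    if abs(gap) < 1e-6:
        raise ValueError("clean and corrupted logit diffs are too close to normalize")
    return (clean_diff - patched_diff) / gap

# Two made-up examples in which the patch moves the metric halfway to corrupted.
demo_cases = {
    "confident example": dict(clean_diff=4.0, corrupted_diff=-4.0, patched_diff=0.0),
    "hesitant example": dict(clean_diff=1.0, corrupted_diff=-1.0, patched_diff=0.0),
}

demo_summary = {}
for label, vals in demo_cases.items():
    raw = vals["clean_diff"] - vals["patched_diff"]
    norm = normalized_effect(**vals)
    demo_summary[label] = (raw, norm)
    print(f"{label}: raw effect = {raw:.1f}, normalized effect = {norm:.2f}")
```

The raw effects differ by a factor of four (4.0 vs 1.0), yet both patches do the same relative damage, which the normalized value of 0.50 makes visible.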

### Mistake 3: Only Looking at One Example
```python
# Wrong: Drawing conclusions from one example
effect = run_patching(example_1)  # Could be an outlier!

# Correct: Aggregate over many examples
effects = [run_patching(ex) for ex in dataset]
mean_effect = np.mean(effects)
```
**Why:** Single examples can be noisy. Aggregation gives robust results.
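
When aggregating, it is worth reporting the spread along with the mean so that one outlier does not go unnoticed. The sketch below assumes a list of per-example effects for a single head; the numbers are invented and `summarize_effects` is a hypothetical helper.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple

def summarize_effects(effects: List[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for per-example effects."""
    if len(effects) < 2:
        raise ValueError("need at least two examples to estimate the spread")
    return mean(effects), stdev(effects) / math.sqrt(len(effects))

# Made-up effects of one head on five examples; the fourth looks like an outlier.
demo_per_example = [0.42, 0.35, 0.51, 0.05, 0.47]

demo_mean, demo_sem = summarize_effects(demo_per_example)
print(f"mean effect = {demo_mean:.3f} ± {demo_sem:.3f} (n={len(demo_per_example)})")
```

A large standard error relative to the mean is a hint to look at individual examples before assigning the head a role.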

---

## Checkpoint

You've learned:
- What the IOI task is and why it's important for interpretability
- How activation patching works as a causal intervention
- How to patch residual stream, attention heads, and other components
- How to interpret patching results and identify important circuit components
- How your findings compare to published IOI circuit research

---

## Challenge (Optional)

**Replicate Path Patching**

Instead of patching a head's entire output, try patching the *specific path* through which a head's output reaches the answer. This involves:

1. Identifying which later heads read from the name mover heads
2. Patching only the contribution from source head to target head

This more precisely isolates the causal mechanism. See the IOI paper for details on path patching methodology.

In [None]:
# Challenge: Path patching
# Your code here



---

## Further Reading

- [Interpretability in the Wild (IOI Paper)](https://arxiv.org/abs/2211.00593)
- [Causal Scrubbing](https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing) - A more rigorous patching methodology
- [ROME: Rank-One Model Editing](https://rome.baulab.info/) - Uses causal tracing (a form of activation patching) to locate factual associations before editing them

---

## Cleanup

In [None]:
# Clear GPU memory
del clean_cache, corrupted_cache
gc.collect()
torch.cuda.empty_cache()

print("Memory cleared!")

---

## What's Next?

In **Lab C.3**, we'll study **Induction Heads** - a fundamental circuit for in-context learning. Induction heads are one of the most important discoveries in mechanistic interpretability, explaining how models learn to copy patterns!

**Next:** [Lab C.3: Induction Head Analysis](03-induction-head-analysis.ipynb)