# Hidden Test: Circuit Function Validation

This notebook tests whether each attention head in the student's identified circuit matches its hypothesized function, as described in the instructor's plan. Only attention heads are checked; the circuit's MLP nodes are not tested here.

## Test Goals

1. Load the student's identified circuit
2. Load GPT2-small and IOI dataset
3. For each head category, verify it performs its hypothesized function:
   - **Duplicate Token Heads**: Should attend from S2 to S1
   - **S-Inhibition Heads**: Should attend from END to S2
   - **Name-Mover Heads**: Should attend from END to IO
4. Provide quantitative metrics for each head's performance
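
Each check above reads a single entry of a head's attention pattern: the row is the query position (where attention comes from) and the column is the key position (where it goes). The toy NumPy sketch below illustrates that indexing under a causal mask. It does not use the model: the scores are random, the helper name `causal_attention_pattern` is hypothetical, and the positions are chosen only for illustration.

```python
import numpy as np

def causal_attention_pattern(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax over keys, with future keys masked out (causal)."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))        # allow key_pos <= query_pos
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Random toy scores for a 14-token prompt (illustration only, not GPT2 output)
rng = np.random.default_rng(0)
pattern = causal_attention_pattern(rng.normal(size=(14, 14)))

# Illustrative positions for S1, IO, S2, and END
s1_pos, io_pos, s2_pos, end_pos = 2, 4, 9, 13

s2_to_s1 = pattern[s2_pos, s1_pos]    # query = S2,  key = S1
end_to_s2 = pattern[end_pos, s2_pos]  # query = END, key = S2
end_to_io = pattern[end_pos, io_pos]  # query = END, key = IO
print(f"S2->S1: {s2_to_s1:.3f}, END->S2: {end_to_s2:.3f}, END->IO: {end_to_io:.3f}")
```

Because each row sums to 1 over the visible keys, a mean score of 0.3 means the head puts roughly 30% of its attention at that query position on the target token.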

In [1]:
import os
import torch
import json
import numpy as np
from pathlib import Path

# Set working directory
os.chdir('/home/smallyan/critic_model_mechinterp')

# Check GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Load the student's circuit
circuit_path = 'runs/circuits_claude_2025-11-09_14-46-37/results/real_circuits_1.json'
with open(circuit_path, 'r') as f:
    circuit = json.load(f)

print(f"\nLoaded circuit with {len(circuit['nodes'])} nodes")
print(f"Attention heads: {len([n for n in circuit['nodes'] if n.startswith('a')])}")
print(f"MLPs: {len([n for n in circuit['nodes'] if n.startswith('m')])}")

Using device: cuda
GPU: NVIDIA A40

Loaded circuit with 44 nodes
Attention heads: 31
MLPs: 12
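
The counts above add up to 43 of the 44 nodes, so one node name starts with neither `a` nor `m`. The output does not show which node that is. The sketch below buckets names by prefix and reports the remainder explicitly; the helper `count_node_kinds` and the example names (including `input` and `m0`) are assumptions for illustration, not the contents of the student's file.

```python
from collections import Counter

def count_node_kinds(nodes):
    """Bucket node names by prefix: 'a' = attention head, 'm' = MLP, else other."""
    def kind(name):
        if name.startswith('a'):
            return 'attention'
        if name.startswith('m'):
            return 'mlp'
        return 'other'
    return Counter(kind(n) for n in nodes)

# Hypothetical node list; in the notebook this would be circuit['nodes']
example_nodes = ['input', 'a0.h1', 'a3.h0', 'm0', 'm1']
counts = count_node_kinds(example_nodes)
print(dict(counts))
```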


In [2]:
# Install and import necessary libraries
print("Loading TransformerLens and datasets...")

try:
    from transformer_lens import HookedTransformer
    from datasets import load_dataset
    print("✓ Libraries loaded successfully")
except ImportError as e:
    print(f"Error importing libraries: {e}")
    print("Attempting to install...")
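    import subprocess
    import sys
    # Install into the interpreter running this notebook, not whichever pip is first on PATH
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "transformer-lens", "datasets"], check=True)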
    from transformer_lens import HookedTransformer
    from datasets import load_dataset
    print("✓ Libraries installed and loaded")

Loading TransformerLens and datasets...


✓ Libraries loaded successfully


In [4]:
# Load GPT2-small model
print("Loading GPT2-small model...")
model = HookedTransformer.from_pretrained("gpt2-small", device=device)
print(f"✓ Model loaded: {model.cfg.n_layers} layers, {model.cfg.n_heads} heads per layer")
print(f"  d_model: {model.cfg.d_model}, d_head: {model.cfg.d_head}")

# Load the IOI dataset, trying the name from the instructor's prompt first
print("\nLoading IOI dataset...")
try:
    dataset = load_dataset("mib-bench/ioi", split="train")
except Exception as e:
    print(f"Could not load mib-bench/ioi ({e}); trying alternative dataset name")
    # Alternative: load from canonical source
    dataset = load_dataset("Eleuther/ioi", split="train")
    
print(f"✓ Dataset loaded: {len(dataset)} examples")

# Use first 50 examples for testing (different from student's 100)
test_size = 50
test_dataset = dataset.select(range(test_size))
print(f"Using {test_size} examples for hidden testing")

Loading GPT2-small model...


Loaded pretrained model gpt2-small into HookedTransformer
✓ Model loaded: 12 layers, 12 heads per layer
  d_model: 768, d_head: 64

Loading IOI dataset...


✓ Dataset loaded: 10000 examples
Using 50 examples for hidden testing


In [5]:
# Examine the dataset structure
print("Dataset structure:")
print(test_dataset[0])
print("\nDataset features:", test_dataset.features)

Dataset structure:
{'template': 'As {name_A} and {name_B} left the {place}, {name_C} gave a {object} to', 'metadata': {'indirect_object': 'Maria', 'object': 'fridge', 'place': 'consulate', 'random_a': 'Bob', 'random_b': 'Max', 'random_c': 'Gene', 'subject': 'Carl'}, 'prompt': 'As Carl and Maria left the consulate, Carl gave a fridge to', 'choices': ['Maria', 'Carl'], 'answerKey': 0, 'abc_counterfactual': {'answerKey': -1, 'choices': ['Maria', 'Carl', 'Gene'], 'prompt': 'As Carl and Maria left the consulate, Gene gave a fridge to'}, 'random_names_counterfactual': {'answerKey': 1, 'choices': ['Bob', 'Max'], 'prompt': 'As Bob and Max left the consulate, Bob gave a fridge to'}, 's1_io_flip_counterfactual': {'answerKey': 0, 'choices': ['Maria', 'Carl'], 'prompt': 'As Maria and Carl left the consulate, Carl gave a fridge to'}, 's2_io_flip_counterfactual': {'answerKey': 1, 'choices': ['Maria', 'Carl'], 'prompt': 'As Carl and Maria left the consulate, Maria gave a fridge to'}, 'random_names_s1

In [6]:
# Helper function to identify key token positions in IOI prompts
def get_token_positions(prompt_text, subject, io):
    """
    Identify positions of S1, S2, IO, and END tokens
    Example: "As Carl and Maria left the consulate, Carl gave a fridge to"
    S1 = first "Carl", IO = "Maria", S2 = second "Carl", END = "to"
    """
    tokens = model.to_tokens(prompt_text, prepend_bos=True)
    tokens_str = model.to_str_tokens(prompt_text, prepend_bos=True)
    
    # Find subject and IO token positions
    subject_positions = []
    io_positions = []
    
    for i, tok in enumerate(tokens_str):
        if subject.lower() in tok.lower():
            subject_positions.append(i)
        if io.lower() in tok.lower():
            io_positions.append(i)
    
    # S1 is first subject mention, S2 is second subject mention
    s1_pos = subject_positions[0] if len(subject_positions) > 0 else None
    s2_pos = subject_positions[1] if len(subject_positions) > 1 else None
    io_pos = io_positions[0] if len(io_positions) > 0 else None
    end_pos = len(tokens_str) - 1  # Last token position
    
    return {
        's1': s1_pos,
        's2': s2_pos,
        'io': io_pos,
        'end': end_pos,
        'tokens': tokens,
        'tokens_str': tokens_str
    }

# Test on first example
example = test_dataset[0]
positions = get_token_positions(
    example['prompt'], 
    example['metadata']['subject'],
    example['metadata']['indirect_object']
)

print("Example prompt:", example['prompt'])
print(f"Subject: {example['metadata']['subject']}")
print(f"IO: {example['metadata']['indirect_object']}")
print("\nToken positions:")
print(f"  S1 (first subject): position {positions['s1']} = '{positions['tokens_str'][positions['s1']]}'")
print(f"  IO: position {positions['io']} = '{positions['tokens_str'][positions['io']]}'")
print(f"  S2 (second subject): position {positions['s2']} = '{positions['tokens_str'][positions['s2']]}'")
print(f"  END: position {positions['end']} = '{positions['tokens_str'][positions['end']]}'")
print(f"\nAll tokens: {positions['tokens_str']}")

Example prompt: As Carl and Maria left the consulate, Carl gave a fridge to
Subject: Carl
IO: Maria

Token positions:
  S1 (first subject): position 2 = ' Carl'
  IO: position 4 = ' Maria'
  S2 (second subject): position 9 = ' Carl'
  END: position 13 = ' to'

All tokens: ['<|endoftext|>', 'As', ' Carl', ' and', ' Maria', ' left', ' the', ' consulate', ',', ' Carl', ' gave', ' a', ' fridge', ' to']
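
The helper above matches names by case-insensitive substring, which can also fire on unrelated tokens that happen to contain the name (a hypothetical subject `Al` would match the token ` also`). A stricter variant, sketched below, compares whitespace-stripped tokens for exact equality. The function name `find_name_positions` is hypothetical, and the sketch assumes each name is a single token; a name split across several tokens would come back as `None`.

```python
def find_name_positions(tokens_str, subject, io):
    """Locate S1, S2, IO, and END by exact match on whitespace-stripped tokens."""
    stripped = [tok.strip() for tok in tokens_str]
    subject_positions = [i for i, tok in enumerate(stripped) if tok == subject]
    io_positions = [i for i, tok in enumerate(stripped) if tok == io]
    return {
        's1': subject_positions[0] if len(subject_positions) > 0 else None,
        's2': subject_positions[1] if len(subject_positions) > 1 else None,
        'io': io_positions[0] if io_positions else None,
        'end': len(tokens_str) - 1,  # last token position
    }

# Token list copied from the example output above
tokens_str = ['<|endoftext|>', 'As', ' Carl', ' and', ' Maria', ' left', ' the',
              ' consulate', ',', ' Carl', ' gave', ' a', ' fridge', ' to']
print(find_name_positions(tokens_str, 'Carl', 'Maria'))
# → {'s1': 2, 's2': 9, 'io': 4, 'end': 13}
```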


## Test 1: Duplicate Token Heads (S2 → S1 Attention)

Testing whether the identified duplicate token heads actually attend from S2 to S1 as hypothesized.

In [7]:
# Extract the duplicate token heads from student's circuit
# According to student's documentation, these are the duplicate token heads:
duplicate_token_heads = [
    'a0.h1', 'a0.h5', 'a0.h10', 'a1.h11', 'a3.h0', 'a3.h6'
]

print("Testing Duplicate Token Heads (S2 → S1 attention)")
print("=" * 70)
print(f"Testing {len(duplicate_token_heads)} heads identified by student\n")

# Collect attention patterns across test examples
attention_scores = {head: [] for head in duplicate_token_heads}

for i, example in enumerate(test_dataset):
    prompt = example['prompt']
    subject = example['metadata']['subject']
    io = example['metadata']['indirect_object']
    
    # Get positions
    pos = get_token_positions(prompt, subject, io)
    
    if pos['s1'] is None or pos['s2'] is None:
        continue
    
    # Run model with cache
    tokens = pos['tokens']
    logits, cache = model.run_with_cache(tokens)
    
    # Check attention for each head
    for head_name in duplicate_token_heads:
        # Parse head name: a{layer}.h{head}
        layer = int(head_name.split('.')[0][1:])
        head = int(head_name.split('.')[1][1:])
        
        # Get attention pattern: [batch, head, query_pos, key_pos]
        attn_pattern = cache['pattern', layer][0, head]
        
        # Get attention from S2 to S1
        s2_to_s1_attn = attn_pattern[pos['s2'], pos['s1']].item()
        attention_scores[head_name].append(s2_to_s1_attn)

# Calculate statistics
print("Results:\n")
results_dup = []
for head_name in duplicate_token_heads:
    scores = attention_scores[head_name]
    mean_score = np.mean(scores)
    std_score = np.std(scores)
    
    # Threshold: consider "strong" if mean > 0.3
    status = "✓ PASS" if mean_score > 0.3 else "⚠ WEAK" if mean_score > 0.15 else "✗ FAIL"
    
    print(f"{head_name}: {mean_score:.4f} ± {std_score:.4f}  {status}")
    results_dup.append({
        'head': head_name,
        'mean': mean_score,
        'std': std_score,
        'pass': mean_score > 0.3
    })

pass_rate = sum(1 for r in results_dup if r['pass']) / len(results_dup)
print(f"\n{'=' * 70}")
print(f"Pass rate: {pass_rate:.1%} ({sum(1 for r in results_dup if r['pass'])}/{len(results_dup)} heads)")
print(f"Overall: {'✓ PASS' if pass_rate >= 0.5 else '✗ FAIL'}")

Testing Duplicate Token Heads (S2 → S1 attention)
Testing 6 heads identified by student



Results:

a0.h1: 0.5171 ± 0.0331  ✓ PASS
a0.h5: 0.6064 ± 0.0459  ✓ PASS
a0.h10: 0.2378 ± 0.0235  ⚠ WEAK
a1.h11: 0.6511 ± 0.0816  ✓ PASS
a3.h0: 0.7024 ± 0.1097  ✓ PASS
a3.h6: 0.0111 ± 0.0061  ✗ FAIL

Pass rate: 66.7% (4/6 heads)
Overall: ✓ PASS
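
The name parsing and the three-way grading used in this cell can be isolated into small helpers, which makes the thresholds explicit and reusable across the three tests. The sketch below is one way to do that; `parse_head` and `grade` are hypothetical names, and the means are copied from the results printed above rather than recomputed.

```python
import re

HEAD_RE = re.compile(r'^a(\d+)\.h(\d+)$')

def parse_head(name):
    """Split a node name such as 'a10.h7' into (layer, head) integers."""
    match = HEAD_RE.match(name)
    if match is None:
        raise ValueError(f"not an attention head name: {name!r}")
    return int(match.group(1)), int(match.group(2))

def grade(mean_score, pass_thr=0.3, weak_thr=0.15):
    """Three-way status using the Test 1 thresholds as defaults."""
    if mean_score > pass_thr:
        return 'PASS'
    if mean_score > weak_thr:
        return 'WEAK'
    return 'FAIL'

# Mean S2->S1 attention per head, copied from the output above
reported_means = {
    'a0.h1': 0.5171, 'a0.h5': 0.6064, 'a0.h10': 0.2378,
    'a1.h11': 0.6511, 'a3.h0': 0.7024, 'a3.h6': 0.0111,
}

statuses = {name: grade(mean) for name, mean in reported_means.items()}
for name, status in statuses.items():
    layer, head = parse_head(name)
    print(f"layer {layer:2d}, head {head:2d}: {reported_means[name]:.4f}  {status}")

n_pass = sum(status == 'PASS' for status in statuses.values())
print(f"Pass rate: {n_pass}/{len(statuses)}")
```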


## Test 2: S-Inhibition Heads (END → S2 Attention)

Testing whether the identified S-inhibition heads attend from END to S2 as hypothesized.

In [8]:
# Extract the S-inhibition heads from student's circuit
s_inhibition_heads = [
    'a7.h3', 'a7.h9', 'a8.h2', 'a8.h3', 'a8.h5', 'a8.h6', 
    'a8.h10', 'a9.h0', 'a9.h2', 'a9.h7', 'a9.h8', 'a11.h6'
]

print("Testing S-Inhibition Heads (END → S2 attention)")
print("=" * 70)
print(f"Testing {len(s_inhibition_heads)} heads identified by student\n")

# Collect attention patterns
attention_scores_sinh = {head: [] for head in s_inhibition_heads}

for i, example in enumerate(test_dataset):
    prompt = example['prompt']
    subject = example['metadata']['subject']
    io = example['metadata']['indirect_object']
    
    # Get positions
    pos = get_token_positions(prompt, subject, io)
    
    if pos['s2'] is None or pos['end'] is None:
        continue
    
    # Run model with cache
    tokens = pos['tokens']
    logits, cache = model.run_with_cache(tokens)
    
    # Check attention for each head
    for head_name in s_inhibition_heads:
        layer = int(head_name.split('.')[0][1:])
        head = int(head_name.split('.')[1][1:])
        
        attn_pattern = cache['pattern', layer][0, head]
        
        # Get attention from END to S2
        end_to_s2_attn = attn_pattern[pos['end'], pos['s2']].item()
        attention_scores_sinh[head_name].append(end_to_s2_attn)

# Calculate statistics
print("Results:\n")
results_sinh = []
for head_name in s_inhibition_heads:
    scores = attention_scores_sinh[head_name]
    mean_score = np.mean(scores)
    std_score = np.std(scores)
    
    status = "✓ PASS" if mean_score > 0.2 else "⚠ WEAK" if mean_score > 0.1 else "✗ FAIL"
    
    print(f"{head_name}: {mean_score:.4f} ± {std_score:.4f}  {status}")
    results_sinh.append({
        'head': head_name,
        'mean': mean_score,
        'std': std_score,
        'pass': mean_score > 0.2
    })

pass_rate = sum(1 for r in results_sinh if r['pass']) / len(results_sinh)
print(f"\n{'=' * 70}")
print(f"Pass rate: {pass_rate:.1%} ({sum(1 for r in results_sinh if r['pass'])}/{len(results_sinh)} heads)")
print(f"Overall: {'✓ PASS' if pass_rate >= 0.5 else '✗ FAIL'}")

Testing S-Inhibition Heads (END → S2 attention)
Testing 12 heads identified by student



Results:

a7.h3: 0.1545 ± 0.0798  ⚠ WEAK
a7.h9: 0.4982 ± 0.1904  ✓ PASS
a8.h2: 0.0973 ± 0.0524  ✗ FAIL
a8.h3: 0.0239 ± 0.0202  ✗ FAIL
a8.h5: 0.2846 ± 0.1254  ✓ PASS
a8.h6: 0.7311 ± 0.2438  ✓ PASS
a8.h10: 0.2955 ± 0.1403  ✓ PASS
a9.h0: 0.0159 ± 0.0096  ✗ FAIL
a9.h2: 0.0284 ± 0.0209  ✗ FAIL
a9.h7: 0.2574 ± 0.1052  ✓ PASS
a9.h8: 0.0519 ± 0.0402  ✗ FAIL
a11.h6: 0.0340 ± 0.0238  ✗ FAIL

Pass rate: 41.7% (5/12 heads)
Overall: ✗ FAIL


## Test 3: Name-Mover Heads (END → IO Attention)

Testing whether the identified name-mover heads attend from END to IO as hypothesized.

In [9]:
# Extract the name-mover heads from student's circuit
name_mover_heads = [
    'a0.h6', 'a6.h0', 'a9.h6', 'a9.h9', 'a10.h0', 'a10.h1', 
    'a10.h2', 'a10.h3', 'a10.h6', 'a10.h7', 'a10.h10', 
    'a11.h8', 'a11.h10'
]

print("Testing Name-Mover Heads (END → IO attention)")
print("=" * 70)
print(f"Testing {len(name_mover_heads)} heads identified by student\n")

# Collect attention patterns
attention_scores_nm = {head: [] for head in name_mover_heads}

for i, example in enumerate(test_dataset):
    prompt = example['prompt']
    subject = example['metadata']['subject']
    io = example['metadata']['indirect_object']
    
    # Get positions
    pos = get_token_positions(prompt, subject, io)
    
    if pos['io'] is None or pos['end'] is None:
        continue
    
    # Run model with cache
    tokens = pos['tokens']
    logits, cache = model.run_with_cache(tokens)
    
    # Check attention for each head
    for head_name in name_mover_heads:
        layer = int(head_name.split('.')[0][1:])
        head = int(head_name.split('.')[1][1:])
        
        attn_pattern = cache['pattern', layer][0, head]
        
        # Get attention from END to IO
        end_to_io_attn = attn_pattern[pos['end'], pos['io']].item()
        attention_scores_nm[head_name].append(end_to_io_attn)

# Calculate statistics
print("Results:\n")
results_nm = []
for head_name in name_mover_heads:
    scores = attention_scores_nm[head_name]
    mean_score = np.mean(scores)
    std_score = np.std(scores)
    
    status = "✓ PASS" if mean_score > 0.3 else "⚠ WEAK" if mean_score > 0.15 else "✗ FAIL"
    
    print(f"{head_name}: {mean_score:.4f} ± {std_score:.4f}  {status}")
    results_nm.append({
        'head': head_name,
        'mean': mean_score,
        'std': std_score,
        'pass': mean_score > 0.3
    })

pass_rate = sum(1 for r in results_nm if r['pass']) / len(results_nm)
print(f"\n{'=' * 70}")
print(f"Pass rate: {pass_rate:.1%} ({sum(1 for r in results_nm if r['pass'])}/{len(results_nm)} heads)")
print(f"Overall: {'✓ PASS' if pass_rate >= 0.5 else '✗ FAIL'}")

Testing Name-Mover Heads (END → IO attention)
Testing 13 heads identified by student



Results:

a0.h6: 0.0374 ± 0.0149  ✗ FAIL
a6.h0: 0.0067 ± 0.0047  ✗ FAIL
a9.h6: 0.7181 ± 0.2424  ✓ PASS
a9.h9: 0.7788 ± 0.2562  ✓ PASS
a10.h0: 0.3855 ± 0.1809  ✓ PASS
a10.h1: 0.3324 ± 0.1829  ✓ PASS
a10.h2: 0.2074 ± 0.1485  ⚠ WEAK
a10.h3: 0.2245 ± 0.2207  ⚠ WEAK
a10.h6: 0.2867 ± 0.1773  ⚠ WEAK
a10.h7: 0.7611 ± 0.2583  ✓ PASS
a10.h10: 0.3456 ± 0.1510  ✓ PASS
a11.h8: 0.0117 ± 0.0095  ✗ FAIL
a11.h10: 0.6296 ± 0.2239  ✓ PASS

Pass rate: 53.8% (7/13 heads)
Overall: ✓ PASS


## Summary of Hidden Tests

### Overall Circuit Function Validation

Testing whether the student's identified circuit nodes match their hypothesized functions using 50 independent test examples.

In [10]:
print("=" * 80)
print("HIDDEN TEST RESULTS SUMMARY")
print("=" * 80)

# Compile all results
all_results = {
    "Duplicate Token Heads (S2→S1)": {
        "tested": len(results_dup),
        "passed": sum(1 for r in results_dup if r['pass']),
        "pass_rate": sum(1 for r in results_dup if r['pass']) / len(results_dup),
        "overall": "✓ PASS" if sum(1 for r in results_dup if r['pass']) / len(results_dup) >= 0.5 else "✗ FAIL"
    },
    "S-Inhibition Heads (END→S2)": {
        "tested": len(results_sinh),
        "passed": sum(1 for r in results_sinh if r['pass']),
        "pass_rate": sum(1 for r in results_sinh if r['pass']) / len(results_sinh),
        "overall": "✓ PASS" if sum(1 for r in results_sinh if r['pass']) / len(results_sinh) >= 0.5 else "✗ FAIL"
    },
    "Name-Mover Heads (END→IO)": {
        "tested": len(results_nm),
        "passed": sum(1 for r in results_nm if r['pass']),
        "pass_rate": sum(1 for r in results_nm if r['pass']) / len(results_nm),
        "overall": "✓ PASS" if sum(1 for r in results_nm if r['pass']) / len(results_nm) >= 0.5 else "✗ FAIL"
    }
}

print("\n### Test Results by Category:\n")
for category, results in all_results.items():
    print(f"{category}")
    print(f"  Heads tested: {results['tested']}")
    print(f"  Heads passed: {results['passed']}")
    print(f"  Pass rate: {results['pass_rate']:.1%}")
    print(f"  Status: {results['overall']}")
    print()

# Overall statistics
total_heads = sum(r['tested'] for r in all_results.values())
total_passed = sum(r['passed'] for r in all_results.values())
overall_pass_rate = total_passed / total_heads

print("=" * 80)
print("OVERALL CIRCUIT VALIDATION:")
print(f"  Total heads tested: {total_heads}")
print(f"  Total heads passed: {total_passed}")
print(f"  Overall pass rate: {overall_pass_rate:.1%}")
print(f"  Status: {'✓ PASS' if overall_pass_rate >= 0.5 else '✗ FAIL'}")
print("=" * 80)

HIDDEN TEST RESULTS SUMMARY

### Test Results by Category:

Duplicate Token Heads (S2→S1)
  Heads tested: 6
  Heads passed: 4
  Pass rate: 66.7%
  Status: ✓ PASS

S-Inhibition Heads (END→S2)
  Heads tested: 12
  Heads passed: 5
  Pass rate: 41.7%
  Status: ✗ FAIL

Name-Mover Heads (END→IO)
  Heads tested: 13
  Heads passed: 7
  Pass rate: 53.8%
  Status: ✓ PASS

OVERALL CIRCUIT VALIDATION:
  Total heads tested: 31
  Total heads passed: 16
  Overall pass rate: 51.6%
  Status: ✓ PASS
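
The overall status pools all heads into one pass rate, so a failing category can be outvoted by the others. The sketch below recomputes the pooled figure from the per-category counts reported above and, for comparison, a stricter rule that requires every category to pass. The stricter rule and the helper name `summarize` are illustrative assumptions, not part of the grading used in this notebook.

```python
def summarize(category_counts, min_rate=0.5):
    """Per-category and pooled pass rates from (passed, tested) counts."""
    rows = {}
    for name, (passed, tested) in category_counts.items():
        rate = passed / tested
        rows[name] = {
            'tested': tested,
            'passed': passed,
            'pass_rate': rate,
            'overall': rate >= min_rate,
        }
    total_passed = sum(r['passed'] for r in rows.values())
    total_tested = sum(r['tested'] for r in rows.values())
    return rows, total_passed / total_tested

# (passed, tested) counts copied from the summary above
counts = {
    'Duplicate Token Heads (S2→S1)': (4, 6),
    'S-Inhibition Heads (END→S2)': (5, 12),
    'Name-Mover Heads (END→IO)': (7, 13),
}

rows, pooled_rate = summarize(counts)
pooled_pass = pooled_rate >= 0.5                      # rule used in this notebook
every_category_pass = all(r['overall'] for r in rows.values())  # stricter alternative

print(f"Pooled pass rate: {pooled_rate:.1%}")
print(f"Pooled rule passes: {pooled_pass}")
print(f"All categories pass: {every_category_pass}")
```

Under the pooled rule the circuit passes at 16/31, while the per-category rule would fail it because of the S-inhibition heads.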


## Critical Analysis

### Key Findings

1. **Duplicate Token Heads**: Strong performance (66.7% pass rate, 4/6 heads)
   - Top performers: a3.h0 (0.70), a1.h11 (0.65), a0.h5 (0.61)
   - Weak performer: a0.h10 (0.24), below the 0.3 pass threshold
   - Failing head: a3.h6 (0.01), likely misclassified

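2. **S-Inhibition Heads**: Weak performance (41.7% pass rate, 5/12 heads) ⚠️
   - Top performers: a8.h6 (0.73), a7.h9 (0.50)
   - Six heads show mean END→S2 attention below 0.1 (a8.h2, a8.h3, a9.h0, a9.h2, a9.h8, a11.h6)
   - This category fails even though its pass threshold (0.2) is lower than the 0.3 used for the other two, so it needs refinement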

3. **Name-Mover Heads**: Moderate performance (53.8% pass rate, 7/13 heads), only just above the 50% cutoff
   - Top performers: a9.h9 (0.78), a10.h7 (0.76), a9.h6 (0.72)
   - Failing heads: a0.h6 (0.04), a6.h0 (0.01), and a11.h8 (0.01); the first two sit in early layers

### Issues Identified

1. **Possible overfitting to training data**: Student used 100 examples, but 15 of the 31 heads fall below the pass threshold on these 50 test examples
2. **Loose selection criteria**: Some heads were likely included to maximize budget usage rather than for functional accuracy
3. **S-Inhibition category**: Only 5 of the 12 identified heads (41.7%) exceed the 0.2 mean END→S2 attention threshold

### Recommendations

1. Use stricter attention thresholds when selecting heads (e.g., mean attention > 0.4; these tests used 0.3, or 0.2 for S-inhibition heads)
2. Perform cross-validation with held-out test sets
3. Consider removing low-performing heads from the circuit
4. Re-examine S-inhibition head selection methodology
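
To gauge how much the first recommendation would change the picture, the per-head means reported above can be re-scored at several thresholds. The sketch below does this with the printed values only (no model run); the threshold grid and the helper name `passes_at` are illustrative choices.

```python
# Mean attention per head, copied from the three test outputs above
reported_means = {
    'Duplicate Token (S2→S1)': [0.5171, 0.6064, 0.2378, 0.6511, 0.7024, 0.0111],
    'S-Inhibition (END→S2)': [0.1545, 0.4982, 0.0973, 0.0239, 0.2846, 0.7311,
                              0.2955, 0.0159, 0.0284, 0.2574, 0.0519, 0.0340],
    'Name-Mover (END→IO)': [0.0374, 0.0067, 0.7181, 0.7788, 0.3855, 0.3324,
                            0.2074, 0.2245, 0.2867, 0.7611, 0.3456, 0.6296],
}

def passes_at(means, threshold):
    """Number of heads whose mean attention exceeds the threshold."""
    return sum(m > threshold for m in means)

thresholds = [0.2, 0.3, 0.4]  # illustrative grid
sweep = {
    category: {thr: passes_at(means, thr) for thr in thresholds}
    for category, means in reported_means.items()
}

for category, by_thr in sweep.items():
    total = len(reported_means[category])
    cells = ", ".join(f">{thr}: {n}/{total}" for thr, n in by_thr.items())
    print(f"{category}: {cells}")
```

At a 0.4 threshold only 2 of the 12 S-inhibition heads and 4 of the 13 name-mover heads would remain, while the duplicate token category keeps the same 4 heads it has at 0.3.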