# Day 5-6: Advanced Techniques

**Goal:** Expand probe toolkit with position analysis, attention head probing, and component-specific extraction

**Learning Objectives:**
1. Compare different token positions for probe accuracy
2. Extract and probe attention head outputs
3. Extract and probe MLP layer outputs
4. Build a reusable ProbeToolkit class

**Timeline:** 6-8 hours

**Why This Matters for Faithfulness Detection:**
- H2 predicts information is concentrated at conclusion tokens ("therefore", "so")
- We need to systematically compare positions to test this
- Different model components may encode different aspects of faithfulness
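
**Sketch: locating conclusion tokens.** To test H2 later, we will need the token indices where conclusion markers occur. The helper below is a minimal sketch, not part of the toolkit: `find_marker_positions`, `CONCLUSION_MARKERS`, and the toy token list are hypothetical names introduced for illustration. The marker set contains only the two examples named above.

```python
# Hypothetical helper; the marker set holds only the examples named in H2.
CONCLUSION_MARKERS = {"therefore", "so"}

def find_marker_positions(str_tokens, markers=CONCLUSION_MARKERS):
    """Return the indices of tokens that match a conclusion marker.

    Tokens are stripped and lowercased before comparison because GPT-2 style
    tokenizers attach the leading space to the token (e.g. " so").
    """
    positions = []
    for i, tok in enumerate(str_tokens):
        if tok.strip().lower() in markers:
            positions.append(i)
    return positions

# Toy token list standing in for real tokenizer output.
example_tokens = ["<|endoftext|>", "2", " +", " 2", " is", " 4", ",",
                  " so", " the", " answer", " is", " 4"]
marker_positions = find_marker_positions(example_tokens)
print(marker_positions)  # [7]
```

The returned indices could then be passed to a position-specific extraction step such as the one built in Part 1.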

---

## Setup

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import transformer_lens as tl

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Load GPT-2 small
model = tl.HookedTransformer.from_pretrained("gpt2-small")
print(f"Model loaded: {model.cfg.n_layers} layers, {model.cfg.d_model} dimensions")
print(f"Attention heads per layer: {model.cfg.n_heads}")
print(f"Head dimension: {model.cfg.d_head}")

In [None]:
# Reuse dataset from Day 3-4
positive_sentences = [
    "I love this movie!",
    "This is amazing and wonderful!",
    "Great job, fantastic work!",
    "Absolutely brilliant performance!",
    "I'm so happy with the results!",
    "This exceeded all my expectations!",
    "Wonderful experience, highly recommend!",
    "Best decision I ever made!",
    "This is perfect in every way!",
    "I'm thrilled with how this turned out!",
    "Outstanding quality and service!",
    "This brings me so much joy!",
    "Incredible work, truly impressive!",
    "I'm delighted with this purchase!",
    "This is exactly what I needed!",
    "Five stars, absolutely love it!",
    "This made my day so much better!",
    "I'm grateful for this opportunity!",
    "This is surprisingly excellent!",
    "I'm really pleased with the outcome!",
]

negative_sentences = [
    "I hate this movie.",
    "This is terrible and awful.",
    "Poor job, disappointing work.",
    "Absolutely dreadful performance.",
    "I'm so unhappy with the results.",
    "This failed all my expectations.",
    "Terrible experience, avoid this.",
    "Worst decision I ever made.",
    "This is flawed in every way.",
    "I'm upset with how this turned out.",
    "Poor quality and bad service.",
    "This brings me so much frustration.",
    "Mediocre work, not impressive.",
    "I'm disappointed with this purchase.",
    "This is not what I needed.",
    "One star, completely hate it.",
    "This ruined my day completely.",
    "I regret this opportunity.",
    "This is surprisingly horrible.",
    "I'm really displeased with the outcome.",
]

print(f"Dataset: {len(positive_sentences)} positive, {len(negative_sentences)} negative")

---

## Part 1: Token Position Analysis (2-3 hours)

**Question:** Which token position contains the most sentiment information?

**Options to explore:**
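- **Last token:** Common choice; in a causal model it can attend to every earlier token. Note that in this dataset every positive sentence ends in `!` and every negative one in `.`, so a last-token probe may be reading punctuation rather than sentiment
- **First token:** Position 0. TransformerLens prepends a BOS token by default, so this is `<|endoftext|>` for every sentence; expect near-chance accuracy and treat it as a sanity baseline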
- **Mean pooling:** Average across all tokens
- **Max pooling:** Maximum activation across positions
- **Specific positions:** e.g., position 0, 1, 2, etc.

**For CoT faithfulness:** This maps directly to H2 - whether information is concentrated at conclusion markers
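
**Sketch: why position matters in a causal model.** The toy example below uses a single attention head with identity Q/K/V maps, which is a simplifying assumption and not GPT-2's actual computation. It illustrates that position 0 only ever sees itself, while the last position sees the whole sentence. `causal_self_attention` and the random "sentences" are hypothetical stand-ins.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head causal self-attention with identity Q/K/V maps (toy sketch).

    x: array of shape (seq_len, d). Position i may attend only to positions <= i.
    """
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # True above the diagonal marks "future" positions that must be hidden.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
bos = rng.normal(size=(1, 8))                       # shared first token (stands in for BOS)
sent_a = np.vstack([bos, rng.normal(size=(4, 8))])  # one toy "sentence"
sent_b = np.vstack([bos, rng.normal(size=(6, 8))])  # a different toy "sentence"

out_a = causal_self_attention(sent_a)
out_b = causal_self_attention(sent_b)
print(np.allclose(out_a[0], out_b[0]))    # True: position 0 sees only itself
print(np.allclose(out_a[-1], out_b[-1]))  # False: last position sees different context
```

If every sentence starts with the same token, position 0 carries no sentence-specific information, so a probe trained there should sit at chance.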

### 1.1 Get ALL Token Activations

In [None]:
def get_all_token_activations(model, sentences, layer=6):
    """
    Get activations for ALL tokens in each sentence.
    
    Args:
        model: HookedTransformer model
        sentences: List of strings
        layer: Which layer to extract from
    
    Returns:
        List of numpy arrays, each of shape (seq_len, d_model)
        (Note: different sentences may have different seq_len)
    """
    all_activations = []
    
    for sentence in sentences:
        _, cache = model.run_with_cache(sentence)
        # Shape: [1, seq_len, d_model]
        acts = cache["resid_post", layer][0, :, :].cpu().numpy()
        all_activations.append(acts)
    
    return all_activations

# Test it
test_acts = get_all_token_activations(model, ["Hello world!"], layer=6)
print(f"Single sentence activations shape: {test_acts[0].shape}")

# Show token breakdown (to_str_tokens prepends the BOS token by default)
print(f"Tokens: {model.to_str_tokens('Hello world!')}")

### 1.2 Extract Features by Position Method

In [None]:
def extract_features(activations, method='last'):
    """
    Extract single feature vector from variable-length activations.
    
    Args:
        activations: List of numpy arrays, each (seq_len, d_model)
        method: 'last', 'first', 'mean', 'max'
    
    Returns:
        numpy array of shape (n_samples, d_model)
    """
    features = []
    
    for acts in activations:
        if method == 'last':
            features.append(acts[-1])
        elif method == 'first':
            features.append(acts[0])
        elif method == 'mean':
            features.append(acts.mean(axis=0))
        elif method == 'max':
            features.append(acts.max(axis=0))
        else:
            raise ValueError(f"Unknown method: {method}")
    
    return np.array(features)

# Demo
test_acts = get_all_token_activations(model, ["Test sentence here"], layer=6)
for method in ['last', 'first', 'mean', 'max']:
    feat = extract_features(test_acts, method)
    print(f"{method}: shape {feat.shape}, mean={feat.mean():.4f}")

### 1.3 Compare Position Methods

In [None]:
def train_and_evaluate_probe(X_pos, X_neg, test_size=0.3, random_state=42):
    """
    Helper function to train probe and return results.
    """
    X = np.vstack([X_pos, X_neg])
    y = np.array([1]*len(X_pos) + [0]*len(X_neg))
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    
    probe = LogisticRegression(max_iter=1000, random_state=random_state)
    probe.fit(X_train, y_train)
    
    return {
        'probe': probe,
        'train_acc': probe.score(X_train, y_train),
        'test_acc': probe.score(X_test, y_test),
    }
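
**Sketch: cross-validation as an alternative to a single split.** With 40 sentences, a single 70/30 split leaves only 12 test examples, so one lucky or unlucky split can move the score a lot. The sketch below averages over stratified folds instead. `cross_validated_probe_accuracy` is a hypothetical helper, and the data is synthetic Gaussian noise standing in for real activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cross_validated_probe_accuracy(X_pos, X_neg, n_splits=5, random_state=42):
    """Return (mean, std) of probe accuracy across stratified folds."""
    X = np.vstack([X_pos, X_neg])
    y = np.array([1] * len(X_pos) + [0] * len(X_neg))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()

# Synthetic stand-in for activations: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X_pos_demo = rng.normal(loc=+1.0, size=(20, 16))
X_neg_demo = rng.normal(loc=-1.0, size=(20, 16))

mean_acc, std_acc = cross_validated_probe_accuracy(X_pos_demo, X_neg_demo)
print(f"CV accuracy: {mean_acc:.2%} (std {std_acc:.2%})")
```

The standard deviation across folds gives a rough sense of how much a single-split number could vary.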

In [None]:
# Get all activations once (for efficiency)
print("Extracting all token activations...")
pos_all_acts = get_all_token_activations(model, positive_sentences, layer=6)
neg_all_acts = get_all_token_activations(model, negative_sentences, layer=6)
print("Done!")

In [None]:
# Compare different position methods
methods = ['last', 'first', 'mean', 'max']
position_results = []

print("\n=== Position Method Comparison ===\n")

for method in methods:
    X_pos = extract_features(pos_all_acts, method)
    X_neg = extract_features(neg_all_acts, method)
    
    results = train_and_evaluate_probe(X_pos, X_neg)
    
    position_results.append({
        'method': method,
        'train_acc': results['train_acc'],
        'test_acc': results['test_acc'],
        'gap': results['train_acc'] - results['test_acc']
    })
    
    print(f"{method:6s}: Train={results['train_acc']:.2%}, Test={results['test_acc']:.2%}")

df_position = pd.DataFrame(position_results)
print(f"\nBest method: {df_position.loc[df_position['test_acc'].idxmax(), 'method']}")
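
**Sketch: how precise is a 12-example test score?** With 40 sentences and `test_size=0.3`, the test set holds 12 sentences, so accuracy moves in steps of about 8 percentage points. The sketch below computes a Wilson score interval for a binomial proportion. `wilson_interval` is a hypothetical helper, and the 10-out-of-12 count is an illustrative input, not a result from this notebook.

```python
import math

def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p_hat = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p_hat + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n_total + z**2 / (4 * n_total**2)
    ) / denom
    return center - half_width, center + half_width

# Illustrative input: 10 of 12 test sentences classified correctly.
low, high = wilson_interval(n_correct=10, n_total=12)
print(f"10/12 correct: approx. 95% interval [{low:.2f}, {high:.2f}]")
```

Intervals this wide mean that small differences between position methods on this dataset should be read with caution.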

In [None]:
# Visualize position comparison
fig, ax = plt.subplots(figsize=(10, 6))

x = np.arange(len(methods))
width = 0.35

bars1 = ax.bar(x - width/2, df_position['train_acc'], width, label='Train', color='steelblue')
bars2 = ax.bar(x + width/2, df_position['test_acc'], width, label='Test', color='coral')

ax.set_xlabel('Position Method', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Sentiment Detection: Position Method Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(methods)
ax.legend()
ax.set_ylim([0.4, 1.05])
ax.grid(True, alpha=0.3, axis='y')

# Add value labels
for bar in bars1 + bars2:
    height = bar.get_height()
    ax.annotate(f'{height:.0%}',
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),
                textcoords="offset points",
                ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('position_method_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

### 1.4 Position-by-Position Analysis

**Let's go deeper:** Which specific token positions carry the most information?

In [None]:
def extract_at_position(activations, position):
    """
    Extract activation at specific position.
    
    Args:
        activations: List of numpy arrays
        position: int (positive) or negative (from end)
    
    Returns:
        numpy array of shape (n_valid_samples, d_model)
        List of valid indices (samples that had enough tokens)
    """
    features = []
    valid_indices = []
    
    for i, acts in enumerate(activations):
        seq_len = acts.shape[0]
        
        # Handle negative indices (from end)
        if position < 0:
            actual_pos = seq_len + position
        else:
            actual_pos = position
        
        # Check if position is valid for this sequence
        if 0 <= actual_pos < seq_len:
            features.append(acts[actual_pos])
            valid_indices.append(i)
    
    return np.array(features), valid_indices

# Test
test_feat, test_idx = extract_at_position(pos_all_acts, position=0)
print(f"Position 0: {len(test_feat)} valid samples")
test_feat, test_idx = extract_at_position(pos_all_acts, position=-1)
print(f"Position -1 (last): {len(test_feat)} valid samples")

In [None]:
# Analyze multiple positions
positions_to_test = [0, 1, 2, 3, -3, -2, -1]  # First few and last few
position_specific_results = []

print("\n=== Position-Specific Analysis ===\n")

for pos in positions_to_test:
    X_pos_feat, pos_idx = extract_at_position(pos_all_acts, pos)
    X_neg_feat, neg_idx = extract_at_position(neg_all_acts, pos)
    
    # Only proceed if we have enough samples
    if len(X_pos_feat) >= 5 and len(X_neg_feat) >= 5:
        # Need balanced dataset - take minimum
        min_samples = min(len(X_pos_feat), len(X_neg_feat))
        X_pos_feat = X_pos_feat[:min_samples]
        X_neg_feat = X_neg_feat[:min_samples]
        
        results = train_and_evaluate_probe(X_pos_feat, X_neg_feat)
        
        position_specific_results.append({
            'position': pos,
            'n_samples': min_samples * 2,
            'train_acc': results['train_acc'],
            'test_acc': results['test_acc'],
        })
        
        # Human-readable label: "last" for -1, "last-1" for -2, and so on
        pos_label = str(pos) if pos >= 0 else ("last" if pos == -1 else f"last{pos+1}")
        print(f"Position {pos_label:>6s}: n={min_samples*2:2d}, Train={results['train_acc']:.2%}, Test={results['test_acc']:.2%}")
    else:
        print(f"Position {pos:3d}: Insufficient samples (pos={len(X_pos_feat)}, neg={len(X_neg_feat)})")

In [None]:
# Visualize position-specific results
if position_specific_results:
    df_pos_specific = pd.DataFrame(position_specific_results)
    
    fig, ax = plt.subplots(figsize=(10, 6))
    
    positions = [str(p) for p in df_pos_specific['position']]
    x = np.arange(len(positions))
    width = 0.35
    
    bars1 = ax.bar(x - width/2, df_pos_specific['train_acc'], width, label='Train', color='steelblue')
    bars2 = ax.bar(x + width/2, df_pos_specific['test_acc'], width, label='Test', color='coral')
    
    ax.set_xlabel('Token Position', fontsize=12)
    ax.set_ylabel('Accuracy', fontsize=12)
    ax.set_title('Sentiment Detection by Token Position', fontsize=14, fontweight='bold')
    ax.set_xticks(x)
    ax.set_xticklabels(positions)
    ax.set_ylim([0.3, 1.05])
    ax.axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label='Random')
    ax.legend()  # Called after axhline so the 'Random' line appears in the legend
    ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.savefig('position_specific_analysis.png', dpi=150, bbox_inches='tight')
    plt.show()

### Position Analysis Reflection

**Answer these questions:**

1. **Which position(s) work best?** Is this what you expected?
   - Your answer:

2. **Why might the last token contain more sentiment information?**
   - Your answer:

3. **For CoT faithfulness detection, which positions might be most informative?** (Think about where conclusions appear)
   - Your answer:

---

## Part 2: Attention Head Probing (2-3 hours)

**Question:** Do specific attention heads specialize in sentiment?

**Background:**
- GPT-2 small has 12 layers × 12 heads = 144 attention heads
- Each head produces a d_head=64 dimensional output
- Different heads may specialize in different tasks

**Why this matters:** Understanding which components encode information helps us target probes effectively
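
**Sketch: per-head output before and after `W_O`.** The toy example below uses random arrays with GPT-2 small's head layout to show the two views of a head's output: the `d_head`-dimensional vector before the output projection, and the `d_model`-dimensional contribution after it. To my understanding these correspond to TransformerLens's `hook_z` and `hook_result` respectively. The weights here are random stand-ins, not model weights.

```python
import numpy as np

# Toy sizes matching GPT-2 small's head layout (12 heads x 64 dims = 768).
n_heads, d_head, d_model, seq_len = 12, 64, 768, 5
rng = np.random.default_rng(0)

z = rng.normal(size=(seq_len, n_heads, d_head))           # per-head output before W_O
W_O = rng.normal(size=(n_heads, d_head, d_model)) * 0.02  # per-head output projection

# Per-head contribution to the residual stream: one d_model vector per head.
per_head_result = np.einsum("phd,hdm->phm", z, W_O)

# Summing over heads gives the attention layer output (bias omitted).
attn_out = per_head_result.sum(axis=1)

# Equivalent view: concatenate heads, then apply one (n_heads*d_head, d_model) matrix.
attn_out_concat = (
    z.reshape(seq_len, n_heads * d_head) @ W_O.reshape(n_heads * d_head, d_model)
)

print(z.shape, per_head_result.shape, attn_out.shape)
print(np.allclose(attn_out, attn_out_concat))  # True
```

Probing the 64-dimensional view keeps the probe small. Probing the 768-dimensional view measures what the head actually writes into the residual stream.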

### 2.1 Extract and Scan Heads at One Layer

In [None]:
def get_attention_head_output(model, sentences, layer=6, head=0):
    """
    Extract output from a specific attention head.
    
    Args:
        model: HookedTransformer model
        sentences: List of strings
        layer: Which layer (0-11 for GPT-2 small)
        head: Which head (0-11 for GPT-2 small)
    
    Returns:
        numpy array of shape (n_sentences, d_head)
    """
    activations = []
    
    for sentence in sentences:
        _, cache = model.run_with_cache(sentence)
        
        # Get per-head attention output (hook_z)
        # Shape: [batch, pos, n_heads, d_head]
        # This is each head's output BEFORE being projected by W_O.
        # (hook_result is the post-W_O version with last dim d_model, and is only
        # cached after calling model.set_use_attn_result(True).)
        head_output = cache[f"blocks.{layer}.attn.hook_z"]
        
        # Get last token's activation for the specified head
        # Shape: [d_head] = [64]
        head_act = head_output[0, -1, head, :].cpu().numpy()
        activations.append(head_act)
    
    return np.array(activations)

# Test
test_head = get_attention_head_output(model, ["Test sentence"], layer=6, head=0)
print(f"Attention head output shape: {test_head.shape}")  # Should be (1, 64)

In [None]:
# Scan all heads at a single layer
layer_to_scan = 6
head_results = []

print(f"\n=== Scanning All Heads at Layer {layer_to_scan} ===\n")

for head in range(model.cfg.n_heads):  # 12 heads
    X_pos = get_attention_head_output(model, positive_sentences, layer=layer_to_scan, head=head)
    X_neg = get_attention_head_output(model, negative_sentences, layer=layer_to_scan, head=head)
    
    results = train_and_evaluate_probe(X_pos, X_neg)
    
    head_results.append({
        'layer': layer_to_scan,
        'head': head,
        'train_acc': results['train_acc'],
        'test_acc': results['test_acc'],
    })
    
    print(f"Head {head:2d}: Train={results['train_acc']:.2%}, Test={results['test_acc']:.2%}")

df_heads = pd.DataFrame(head_results)
best_head = df_heads.loc[df_heads['test_acc'].idxmax()]
print(f"\nBest head: {best_head['head']:.0f} with test accuracy {best_head['test_acc']:.2%}")

In [None]:
# Visualize head comparison
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(model.cfg.n_heads)
width = 0.35

bars1 = ax.bar(x - width/2, df_heads['train_acc'], width, label='Train', color='steelblue', alpha=0.8)
bars2 = ax.bar(x + width/2, df_heads['test_acc'], width, label='Test', color='coral', alpha=0.8)

ax.set_xlabel('Attention Head', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title(f'Sentiment Detection by Attention Head (Layer {layer_to_scan})', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_ylim([0.3, 1.05])
ax.axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label='Random')
ax.legend()  # Called after axhline so the 'Random' line appears in the legend
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(f'attention_head_comparison_layer{layer_to_scan}.png', dpi=150, bbox_inches='tight')
plt.show()

### 2.2 Multi-Layer Head Heatmap

**Let's see which heads across a sample of layers carry sentiment information**

In [None]:
# Scan all heads at a sample of layers (this will take a few minutes)
layers_to_scan = [0, 3, 6, 9, 11]  # Sample of layers
all_head_results = []

print("Scanning heads across layers... (this may take 2-3 minutes)\n")

for layer in layers_to_scan:
    print(f"Layer {layer}...", end=" ")
    for head in range(model.cfg.n_heads):
        X_pos = get_attention_head_output(model, positive_sentences, layer=layer, head=head)
        X_neg = get_attention_head_output(model, negative_sentences, layer=layer, head=head)
        
        results = train_and_evaluate_probe(X_pos, X_neg)
        
        all_head_results.append({
            'layer': layer,
            'head': head,
            'test_acc': results['test_acc'],
        })
    print("done")

print("\nComplete!")

In [None]:
# Create heatmap
df_all_heads = pd.DataFrame(all_head_results)
pivot_heads = df_all_heads.pivot(index='layer', columns='head', values='test_acc')

fig, ax = plt.subplots(figsize=(14, 6))

sns.heatmap(pivot_heads, annot=True, fmt='.0%', cmap='RdYlGn', 
            center=0.5, vmin=0.3, vmax=1.0, ax=ax,
            cbar_kws={'label': 'Test Accuracy'})

ax.set_xlabel('Attention Head', fontsize=12)
ax.set_ylabel('Layer', fontsize=12)
ax.set_title('Sentiment Detection: Layer × Head Heatmap', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('attention_head_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

# Find best heads
top_heads = df_all_heads.nlargest(5, 'test_acc')
print("\nTop 5 sentiment-detecting heads:")
for _, row in top_heads.iterrows():
    print(f"  Layer {row['layer']:.0f}, Head {row['head']:.0f}: {row['test_acc']:.2%}")
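
**Sketch: how good does the "best head" look under pure chance?** The scan above trains 5 × 12 = 60 probes and evaluates each on 12 test sentences, so the top score is partly a selection effect. The simulation below estimates the expected best accuracy if every probe were guessing at random. It assumes the probes are independent, which real heads sharing one test split are not, so treat the number as a rough reference point only. `simulate_best_chance_accuracy` is a hypothetical helper.

```python
import numpy as np

def simulate_best_chance_accuracy(n_probes, n_test, n_trials=2000, seed=0):
    """Mean of the best accuracy among n_probes coin-flip classifiers."""
    rng = np.random.default_rng(seed)
    # Each entry: number of correct guesses out of n_test at 50% accuracy.
    correct = rng.binomial(n=n_test, p=0.5, size=(n_trials, n_probes))
    best = correct.max(axis=1) / n_test
    return best.mean()

# 5 layers x 12 heads = 60 probes, each scored on 12 test sentences.
expected_best = simulate_best_chance_accuracy(n_probes=60, n_test=12)
print(f"Expected best-of-60 accuracy under pure chance: {expected_best:.2%}")
```

A head is only interesting if it clearly beats this reference and keeps doing so on a different random split.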

### Attention Head Reflection

1. **Are some heads notably better than others?** What does this suggest?
   - Your answer:

2. **Do later-layer heads tend to perform better?** Why might this be?
   - Your answer:

3. **How does head dimensionality (64) compare to residual stream (768)?** What are the tradeoffs?
   - Your answer:

---

## Part 3: MLP Layer Probing (1-2 hours)

**Question:** Does the MLP output contain different information than the residual stream?

**Background:**
- Each transformer block: Attention → Add to residual → MLP → Add to residual
- MLP is thought to store factual knowledge
- Residual stream accumulates information from all components
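
**Sketch: how the residual stream accumulates component outputs.** The toy arithmetic below mirrors the block structure just described, with random arrays standing in for real activations. In the actual model, each component reads a LayerNorm-ed copy of the residual stream, which is omitted here. The variable names follow TransformerLens's hook naming as I understand it (`resid_pre`, `resid_mid`, `resid_post`).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 768

# Random stand-ins for one block's activations.
resid_pre = rng.normal(size=(seq_len, d_model))        # stream entering the block
attn_out = rng.normal(size=(seq_len, d_model)) * 0.1   # what attention writes
mlp_out = rng.normal(size=(seq_len, d_model)) * 0.1    # what the MLP writes

resid_mid = resid_pre + attn_out   # after the attention update
resid_post = resid_mid + mlp_out   # after the MLP update

# mlp_out is only this block's increment; resid_post holds everything so far.
recovered_mlp = resid_post - resid_mid
print(np.allclose(recovered_mlp, mlp_out))  # True
```

This is why a probe on `mlp_out` asks "what did this layer's MLP add?", while a probe on `resid_post` asks "what is available at this depth?".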

In [None]:
def get_mlp_output(model, sentences, layer=6):
    """
    Extract MLP layer output.
    
    Args:
        model: HookedTransformer model
        sentences: List of strings
        layer: Which layer
    
    Returns:
        numpy array of shape (n_sentences, d_model)
    """
    activations = []
    
    for sentence in sentences:
        _, cache = model.run_with_cache(sentence)
        
        # MLP output (before adding to residual)
        # Shape: [batch, pos, d_model]
        mlp_out = cache[f"blocks.{layer}.hook_mlp_out"]
        
        # Get last token
        act = mlp_out[0, -1, :].cpu().numpy()
        activations.append(act)
    
    return np.array(activations)

# Test
test_mlp = get_mlp_output(model, ["Test sentence"], layer=6)
print(f"MLP output shape: {test_mlp.shape}")  # Should be (1, 768)

In [None]:
# Compare Residual vs MLP across layers
layers_to_test = [0, 3, 6, 9, 11]
component_results = []

print("\n=== Residual vs MLP Comparison ===\n")

for layer in layers_to_test:
    # Residual stream
    X_pos_resid = get_all_token_activations(model, positive_sentences, layer=layer)
    X_neg_resid = get_all_token_activations(model, negative_sentences, layer=layer)
    X_pos_resid = extract_features(X_pos_resid, 'last')
    X_neg_resid = extract_features(X_neg_resid, 'last')
    resid_results = train_and_evaluate_probe(X_pos_resid, X_neg_resid)
    
    # MLP output
    X_pos_mlp = get_mlp_output(model, positive_sentences, layer=layer)
    X_neg_mlp = get_mlp_output(model, negative_sentences, layer=layer)
    mlp_results = train_and_evaluate_probe(X_pos_mlp, X_neg_mlp)
    
    component_results.append({
        'layer': layer,
        'component': 'residual',
        'test_acc': resid_results['test_acc']
    })
    component_results.append({
        'layer': layer,
        'component': 'mlp',
        'test_acc': mlp_results['test_acc']
    })
    
    print(f"Layer {layer:2d}: Residual={resid_results['test_acc']:.2%}, MLP={mlp_results['test_acc']:.2%}")

df_components = pd.DataFrame(component_results)

In [None]:
# Visualize component comparison
fig, ax = plt.subplots(figsize=(10, 6))

df_resid = df_components[df_components['component'] == 'residual']
df_mlp = df_components[df_components['component'] == 'mlp']

ax.plot(df_resid['layer'], df_resid['test_acc'], marker='o', linewidth=2, 
        label='Residual Stream', color='steelblue', markersize=8)
ax.plot(df_mlp['layer'], df_mlp['test_acc'], marker='s', linewidth=2, 
        label='MLP Output', color='coral', markersize=8)

ax.set_xlabel('Layer', fontsize=12)
ax.set_ylabel('Test Accuracy', fontsize=12)
ax.set_title('Sentiment Detection: Residual Stream vs MLP', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.set_ylim([0.4, 1.0])
ax.axhline(y=0.5, color='red', linestyle='--', alpha=0.5)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('residual_vs_mlp.png', dpi=150, bbox_inches='tight')
plt.show()

### MLP Analysis Reflection

1. **Does MLP or residual stream perform better?** Why might this be?
   - Your answer:

2. **What information might MLP encode that residual doesn't (or vice versa)?**
   - Your answer:

---

## Part 4: Building the ProbeToolkit Class (1-2 hours)

**Goal:** Create a reusable class that encapsulates everything we've learned

**This toolkit will be essential for Week 4's faithfulness detection experiments**

In [None]:
class ProbeToolkit:
    """
    Reusable toolkit for probe experiments.
    
    Supports:
    - Multiple layers
    - Multiple positions (last, first, mean, max, specific index)
    - Multiple components (residual, mlp, attention heads)
    - Systematic comparison across configurations
    """
    
    def __init__(self, model):
        self.model = model
        self.n_layers = model.cfg.n_layers
        self.n_heads = model.cfg.n_heads
        self.d_model = model.cfg.d_model
        self.d_head = model.cfg.d_head
    
    def extract_activations(self, sentences, layer, position='last', component='residual', head=None):
        """
        Flexible activation extraction.
        
        Args:
            sentences: List of strings
            layer: Layer number (0 to n_layers-1)
            position: 'last', 'first', 'mean', 'max', or int for specific position
            component: 'residual', 'mlp', or 'attention'
            head: Head number (required if component='attention')
        
        Returns:
            numpy array of shape (n_sentences, d)
        """
        activations = []
        
        for sentence in sentences:
            _, cache = self.model.run_with_cache(sentence)
            
            # Get raw activations based on component
            if component == 'residual':
                # Shape: [1, seq_len, d_model]
                raw_acts = cache["resid_post", layer][0, :, :].cpu().numpy()
            elif component == 'mlp':
                # Shape: [1, seq_len, d_model]
                raw_acts = cache[f"blocks.{layer}.hook_mlp_out"][0, :, :].cpu().numpy()
            elif component == 'attention':
                if head is None:
                    raise ValueError("Must specify head for attention component")
                # Per-head output before W_O (hook_z). Shape: [1, seq_len, n_heads, d_head]
                raw_acts = cache[f"blocks.{layer}.attn.hook_z"][0, :, head, :].cpu().numpy()
            else:
                raise ValueError(f"Unknown component: {component}")
            
            # Extract based on position
            if position == 'last':
                act = raw_acts[-1]
            elif position == 'first':
                act = raw_acts[0]
            elif position == 'mean':
                act = raw_acts.mean(axis=0)
            elif position == 'max':
                act = raw_acts.max(axis=0)
            elif isinstance(position, int):
                if -len(raw_acts) <= position < len(raw_acts):
                    act = raw_acts[position]  # Supports negative indexing
                else:
                    # Position out of bounds for this sentence - use last
                    act = raw_acts[-1]
            else:
                raise ValueError(f"Unknown position: {position}")
            
            activations.append(act)
        
        return np.array(activations)
    
    def train_probe(self, X_pos, X_neg, test_size=0.3, random_state=42):
        """
        Train probe with proper train/test split.
        
        Returns:
            dict with 'probe', 'train_acc', 'test_acc', 'X_train', 'y_train',
            'X_test', 'y_test', 'predictions', 'probabilities'
        """
        X = np.vstack([X_pos, X_neg])
        y = np.array([1]*len(X_pos) + [0]*len(X_neg))
        
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state, stratify=y
        )
        
        probe = LogisticRegression(max_iter=1000, random_state=random_state)
        probe.fit(X_train, y_train)
        
        return {
            'probe': probe,
            'train_acc': probe.score(X_train, y_train),
            'test_acc': probe.score(X_test, y_test),
            'X_train': X_train,
            'y_train': y_train,
            'X_test': X_test,
            'y_test': y_test,
            'predictions': probe.predict(X_test),
            'probabilities': probe.predict_proba(X_test)
        }
    
    def systematic_comparison(self, pos_sentences, neg_sentences,
                             layers=None, positions=('last',), 
                             components=('residual',)):
        """
        Compare probes across multiple configurations.
        
        Args:
            pos_sentences: Positive class sentences
            neg_sentences: Negative class sentences
            layers: List of layers to test (default: sample across model)
            positions: List of positions to test
            components: List of components to test ('residual' or 'mlp';
                'attention' needs a head, so call extract_activations directly)
        
        Returns:
            pandas DataFrame with results
        """
        if layers is None:
            # Default: sample layers across the model
            layers = [0, self.n_layers//4, self.n_layers//2, 
                     3*self.n_layers//4, self.n_layers-1]
        
        results = []
        
        for layer in layers:
            for position in positions:
                for component in components:
                    print(f"Testing: layer={layer}, pos={position}, comp={component}")
                    
                    try:
                        X_pos = self.extract_activations(
                            pos_sentences, layer, position, component
                        )
                        X_neg = self.extract_activations(
                            neg_sentences, layer, position, component
                        )
                        
                        probe_results = self.train_probe(X_pos, X_neg)
                        
                        results.append({
                            'layer': layer,
                            'position': str(position),
                            'component': component,
                            'train_acc': probe_results['train_acc'],
                            'test_acc': probe_results['test_acc'],
                            'gap': probe_results['train_acc'] - probe_results['test_acc']
                        })
                    except Exception as e:
                        print(f"  Error: {e}")
        
        return pd.DataFrame(results)
    
    def test_generalization(self, probe, new_pos, new_neg, layer, 
                           position='last', component='residual'):
        """
        Test trained probe on new distribution.
        """
        X_new_pos = self.extract_activations(new_pos, layer, position, component)
        X_new_neg = self.extract_activations(new_neg, layer, position, component)
        X_new = np.vstack([X_new_pos, X_new_neg])
        y_new = np.array([1]*len(new_pos) + [0]*len(new_neg))
        
        return {
            'accuracy': probe.score(X_new, y_new),
            'predictions': probe.predict(X_new),
            'true_labels': y_new
        }
    
    def get_probe_direction(self, probe):
        """
        Extract the learned direction from a trained probe.
        
        Returns:
            numpy array of shape (d,) - the direction separating classes
        """
        return probe.coef_[0]
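
**Sketch: using a probe direction.** Once a direction has been extracted, one common use is to project activations onto it and compare the scores of the two classes. The example below is a minimal sketch on synthetic data. It uses a difference-of-means vector as a stand-in for `probe.coef_[0]`, which is a different technique from logistic regression. `project_onto_direction` is a hypothetical helper.

```python
import numpy as np

def project_onto_direction(X, direction):
    """Scalar projection of each row of X onto the unit-normalized direction."""
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return X @ (direction / norm)

# Synthetic stand-in for activations: classes differ along the first axis.
rng = np.random.default_rng(0)
true_direction = np.zeros(16)
true_direction[0] = 1.0
X_pos_demo = rng.normal(size=(20, 16)) + 2.0 * true_direction
X_neg_demo = rng.normal(size=(20, 16)) - 2.0 * true_direction

# Stand-in for probe.coef_[0]: difference of class means (not a trained probe).
direction = X_pos_demo.mean(axis=0) - X_neg_demo.mean(axis=0)

pos_scores = project_onto_direction(X_pos_demo, direction)
neg_scores = project_onto_direction(X_neg_demo, direction)
print(f"Mean projection: pos={pos_scores.mean():.2f}, neg={neg_scores.mean():.2f}")
```

With a real logistic regression probe, remember that the decision also depends on the intercept, so the sign of the projection alone is not the prediction.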

In [None]:
# Test the toolkit
toolkit = ProbeToolkit(model)

print(f"Model info: {toolkit.n_layers} layers, {toolkit.d_model} dimensions")
print(f"Attention: {toolkit.n_heads} heads × {toolkit.d_head} dimensions each")

In [None]:
# Run systematic comparison with the toolkit
print("\n=== Systematic Comparison with ProbeToolkit ===\n")

results_df = toolkit.systematic_comparison(
    positive_sentences, 
    negative_sentences,
    layers=[0, 3, 6, 9, 11],
    positions=['last', 'mean'],
    components=['residual', 'mlp']
)

print("\n" + "="*60)
print(results_df.to_string(index=False))

In [None]:
# Visualize systematic comparison results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Layer comparison by component
for component in results_df['component'].unique():
    df_comp = results_df[(results_df['component'] == component) & 
                         (results_df['position'] == 'last')]
    axes[0].plot(df_comp['layer'], df_comp['test_acc'], 
                marker='o', linewidth=2, label=component, markersize=8)

axes[0].set_xlabel('Layer', fontsize=12)
axes[0].set_ylabel('Test Accuracy', fontsize=12)
axes[0].set_title('Component Comparison (position=last)', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim([0.4, 1.0])

# Plot 2: Heatmap of all configurations
# Create combined config column for heatmap
results_df['config'] = results_df['position'] + '_' + results_df['component']
pivot = results_df.pivot(index='layer', columns='config', values='test_acc')

sns.heatmap(pivot, annot=True, fmt='.0%', cmap='RdYlGn', 
            center=0.5, ax=axes[1], cbar_kws={'label': 'Test Accuracy'})
axes[1].set_title('All Configurations Heatmap', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('systematic_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

---

## Summary & Key Findings

**Document your findings from Day 5-6:**

### Position Analysis
- Best position method: 
- Best specific position(s): 
- Key insight: 

### Attention Head Analysis
- Best performing heads: 
- Layer pattern: 
- Key insight: 

### Component Analysis
- Residual vs MLP: 
- Key insight: 

### For CoT Faithfulness Detection
Based on what I learned, my predictions for faithfulness detection:
- Best layers to probe: 
- Best positions: 
- Components to prioritize: 

---

## Next Steps

**Before moving to Week 2 (Reasoning Models):**

✅ Confirm you can:
- Extract activations at any position (last, first, mean, specific index)
- Extract attention head outputs
- Extract MLP outputs
- Use the ProbeToolkit for systematic comparisons

✅ Understand:
- How position affects probe accuracy
- Which attention heads carry task-relevant information
- Difference between residual stream and MLP representations

**Week 2 Preview:** 
- Setting up reasoning models (Qwen via nnsight or Gemini API)
- Understanding Chain-of-Thought structure
- Building CoT-specific analysis tools

---

**Save your work and the ProbeToolkit class - you'll use it extensively in Week 4!**