# Week 2 Exercise: Steering Language Models

In this exercise, you'll gain hands-on experience with:
- Loading and examining a transformer architecture
- Extracting and visualizing activation vectors
- Finding induction heads
- Creating steering vectors from contrastive pairs
- Applying steering to control model behavior
- Using Neuronpedia to find concept features

## Setup

Install required libraries:

In [None]:
!pip install transformers torch numpy matplotlib einops circuitsvis -q

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer
from einops import rearrange
import warnings
warnings.filterwarnings('ignore')

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## Part 1: Loading and Examining a Transformer

Let's load GPT-2 and explore its architecture.

In [None]:
# Load GPT-2 small
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = model.to(device)
model.eval()  # Evaluation mode

print(f"Model: {model_name}")
print(f"Number of layers: {model.config.n_layer}")
print(f"Hidden size: {model.config.n_embd}")
print(f"Number of attention heads: {model.config.n_head}")
print(f"Vocabulary size: {model.config.vocab_size}")

### Examining Architecture Components

Let's look at the main components:

In [None]:
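# Print model structure
print("\nModel Structure:")
print("="*80)
for name, module in model.named_modules():
    is_leaf = len(list(module.children())) == 0
    # All 12 blocks are identical, so list only the first one (transformer.h.0)
    in_later_block = name.startswith("transformer.h.") and not name.startswith("transformer.h.0.")
    if is_leaf and not in_later_block:
        print(f"{name}: {module.__class__.__name__}")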

Key components:
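- `transformer.wte`: Token embeddings (map token IDs to vectors)
- `transformer.wpe`: Position embeddings
- `transformer.h`: Transformer blocks (attention + MLP)
- `lm_head`: Output projection from hidden states to vocabulary logits (weights tied to `transformer.wte`)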

Each transformer block has:
- Attention (multihead)
- MLP (feedforward)
- Layer normalization

## Part 2: Extracting Activation Vectors

Let's extract activations at different layers to see internal representations.

In [None]:
def get_activations(text, layer_idx=-1):
    """
    Extract activation vectors at a specific layer.
    
    Args:
        text: Input text
        layer_idx: Index into hidden_states (0 = embedding output, i = output of block i-1, -1 = output after the final layer norm)
    
    Returns:
        activations: [num_tokens, hidden_size] tensor
    """
    inputs = tokenizer(text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        # output_hidden_states=True gives us all layer activations
        outputs = model(**inputs, output_hidden_states=True)
        
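        # hidden_states is a tuple of (num_layers+1) tensors:
        # index 0 is the embedding output, index i is the output of block i-1,
        # and the last entry has the final layer norm applied.
        # Each is [batch_size, seq_len, hidden_size]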
        hidden_states = outputs.hidden_states
        
        # Extract the desired layer
        activations = hidden_states[layer_idx][0]  # Remove batch dimension
    
    return activations.cpu()

# Test it
text = "The quick brown fox jumps over the lazy dog"
tokens = tokenizer.tokenize(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")

# Get activations from last layer
activations = get_activations(text, layer_idx=-1)
print(f"\nActivation shape: {activations.shape}")
print(f"  {activations.shape[0]} tokens × {activations.shape[1]} dimensions")

### Visualizing Activations Across Layers

In [None]:
def visualize_layer_activations(text, token_idx=-1):
    """
    Show how activation magnitudes change across layers for a specific token.
    """
    inputs = tokenizer(text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden_states = outputs.hidden_states
    
    # Extract activation norms at each layer
    norms = []
    for layer_activations in hidden_states:
        # Get the specific token's activation
        token_activation = layer_activations[0, token_idx, :]
        # Compute L2 norm
        norms.append(token_activation.norm().item())
    
    # Plot
    plt.figure(figsize=(10, 5))
    plt.plot(norms, marker='o')
    plt.xlabel('Hidden state index (0 = embedding output)')
    plt.ylabel('Activation Magnitude (L2 norm)')
    plt.title(f'Activation magnitude across layers for token: "{tokenizer.tokenize(text)[token_idx]}"')
    plt.grid(True, alpha=0.3)
    plt.show()

# Visualize
text = "The cat sat on the mat"
visualize_layer_activations(text, token_idx=-1)

### Exercise 2.1: Compare Activation Patterns

Extract activations for similar vs. dissimilar words and compute their similarity.

In [None]:
def cosine_similarity(v1, v2):
    """Compute cosine similarity between two vectors."""
    return (v1 @ v2) / (v1.norm() * v2.norm())

# Compare "cat" vs "dog" vs "democracy"
words = ["The cat", "The dog", "The democracy"]
activations_list = []

for word in words:
    act = get_activations(word, layer_idx=-1)
    # Get the last token's activation
    activations_list.append(act[-1])

# Compute pairwise similarities
print("Cosine Similarities:")
for i, word1 in enumerate(words):
    for j, word2 in enumerate(words):
        if i < j:
            sim = cosine_similarity(activations_list[i], activations_list[j])
            print(f"  {word1} ↔ {word2}: {sim:.4f}")

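**Question:** Are "The cat" and "The dog" more similar to each other than either is to "The democracy"? Why? Last-layer activations in GPT-2 tend to have high cosine similarity for almost any pair of inputs, so focus on the relative differences rather than the absolute values.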
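
When every pair scores close to 1, a large component shared by all vectors may be hiding the differences. The sketch below uses made-up 4-dimensional vectors, not GPT-2 activations, to show how subtracting the mean vector can expose the structure. Mean-centering is one common diagnostic and an addition to the exercise, not a required step.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D arrays."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "activations": a large shared offset plus a small word-specific part
shared = np.full(4, 10.0)
vectors = {
    "cat": shared + np.array([1.0, 0.5, 0.0, 0.0]),
    "dog": shared + np.array([0.5, 1.0, 0.0, 0.0]),
    "democracy": shared + np.array([0.0, 0.0, 1.0, 1.0]),
}
pairs = [("cat", "dog"), ("cat", "democracy"), ("dog", "democracy")]

# Raw cosine similarity is dominated by the shared offset
raw = {p: cosine(vectors[p[0]], vectors[p[1]]) for p in pairs}

# Subtracting the mean vector removes the shared offset
mean = np.mean(list(vectors.values()), axis=0)
centered_vectors = {w: v - mean for w, v in vectors.items()}
centered = {p: cosine(centered_vectors[p[0]], centered_vectors[p[1]]) for p in pairs}

for p in pairs:
    print(f"{p[0]} vs {p[1]}: raw = {raw[p]:.4f}, centered = {centered[p]:+.4f}")
```

With real activations, the mean would be estimated over many inputs rather than the three being compared.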

## Part 3: Finding Induction Heads

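Let's identify attention heads that perform induction: given a sequence of the form `A B ... A`, an induction head attends to the token that followed the earlier `A` and helps the model predict `B` again.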

In [None]:
def get_attention_patterns(text):
    """
    Extract attention patterns from all heads.
    
    Returns:
        List of attention matrices, one per layer
        Each is [num_heads, seq_len, seq_len]
    """
    inputs = tokenizer(text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
        attentions = outputs.attentions  # Tuple of attention tensors
    
    # Convert to list of numpy arrays
    return [attn[0].cpu().numpy() for attn in attentions]

# Test with induction-inducing text
induction_text = "Alice Bob Alice Bob Alice Bob Alice"
tokens = tokenizer.tokenize(induction_text)
print(f"Tokens: {tokens}")

attentions = get_attention_patterns(induction_text)
print(f"\nNumber of layers: {len(attentions)}")
print(f"Attention shape per layer: {attentions[0].shape}")
print(f"  {attentions[0].shape[0]} heads × {attentions[0].shape[1]} query tokens × {attentions[0].shape[2]} key tokens")

### Visualizing Attention Patterns

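In a repeated sequence, an induction head attends from each token to the token right after that token's previous occurrence, which shows up as an off-diagonal stripe in the plot. Heads that simply attend to position i-1 are previous-token heads, a separate component that induction heads rely on.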

In [None]:
def plot_attention_head(attention_matrix, tokens, layer, head):
    """
    Plot attention pattern for a specific head.
    """
    plt.figure(figsize=(8, 8))
    plt.imshow(attention_matrix, cmap='viridis', aspect='auto')
    plt.colorbar(label='Attention Weight')
    plt.xlabel('Key Position (attending to)')
    plt.ylabel('Query Position (attending from)')
    plt.title(f'Layer {layer}, Head {head}')
    
    # Add token labels
    plt.xticks(range(len(tokens)), tokens, rotation=90)
    plt.yticks(range(len(tokens)), tokens)
    plt.tight_layout()
    plt.show()

# Look at a few heads
# Layer 5, head 1 (0-indexed) is commonly reported as an induction head in GPT-2 small
layer_idx = 5
head_idx = 1

attention_matrix = attentions[layer_idx][head_idx]
plot_attention_head(attention_matrix, tokens, layer_idx, head_idx)
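
The off-diagonal stripe can also be scored numerically. The sketch below is one possible metric, with the hypothetical name `induction_score`: for a prompt built from a repeated block of `period` tokens, it averages the attention from each position i to position i - period + 1. Synthetic attention matrices are used here so the behavior is easy to verify. Note that for the two-token pattern above (`period = 2`) the induction target coincides with the previous token, so a longer repeated block is needed to tell the two head types apart.

```python
import numpy as np

def induction_score(attn, period):
    """
    Mean attention from each query position i to position i - period + 1,
    i.e. the token that followed the previous occurrence of the current token.

    attn: [seq_len, seq_len] attention matrix for one head
    period: length of the repeated block
    """
    seq_len = attn.shape[0]
    queries = np.arange(period, seq_len)  # positions that have an earlier copy
    return float(attn[queries, queries - period + 1].mean())

seq_len, period = 8, 4

# Idealized induction head: 0.9 of the attention goes to the induction target
induction_head = np.zeros((seq_len, seq_len))
induction_head[:, 0] = 1.0  # default: attend to the first token
for i in range(period, seq_len):
    induction_head[i, 0] = 0.1
    induction_head[i, i - period + 1] = 0.9

# Idealized previous-token head: all attention goes to position i-1
prev_token_head = np.zeros((seq_len, seq_len))
prev_token_head[0, 0] = 1.0
for i in range(1, seq_len):
    prev_token_head[i, i - 1] = 1.0

print(f"induction head score:      {induction_score(induction_head, period):.2f}")
print(f"previous-token head score: {induction_score(prev_token_head, period):.2f}")
```

On the real model, the same function could be applied to `attentions[layer][head]` for every layer and head, using a prompt made of a repeated block of tokens.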

### Exercise 3.1: Find Previous-Token Heads

A "previous token head" attends strongly to position i-1. Let's find them.

In [None]:
def detect_previous_token_heads(attentions):
    """
    Find heads that predominantly attend to the previous token.
    """
    results = []
    
    for layer_idx, layer_attn in enumerate(attentions):
        num_heads = layer_attn.shape[0]
        
        for head_idx in range(num_heads):
            attn_matrix = layer_attn[head_idx]
            
            # For each query position i (except first),
            # check if it attends mostly to position i-1
            prev_token_scores = []
            for i in range(1, attn_matrix.shape[0]):
                # Attention from position i to position i-1
                prev_token_scores.append(attn_matrix[i, i-1])
            
            # Average attention to previous token
            avg_prev = np.mean(prev_token_scores)
            
            if avg_prev > 0.3:  # Threshold
                results.append((layer_idx, head_idx, avg_prev))
    
    return results

# Find previous-token heads
prev_heads = detect_previous_token_heads(attentions)
print("Previous-Token Heads (candidates for induction head cooperation):")
for layer, head, score in prev_heads:
    print(f"  Layer {layer}, Head {head}: {score:.3f} avg attention to previous token")

### Testing Induction Behavior

Let's see if the model actually performs induction copying.

In [None]:
def test_induction(pattern_text):
    """
    Test if the model can copy patterns.
    """
    inputs = tokenizer(pattern_text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits[0, -1, :]  # Last position
        probs = torch.softmax(logits, dim=0)
    
    # Get top predictions
    top_k = 5
    top_probs, top_indices = torch.topk(probs, top_k)
    
    print(f"Pattern: '{pattern_text}'")
    print(f"Top {top_k} next token predictions:")
    for prob, idx in zip(top_probs, top_indices):
        token = tokenizer.decode([idx.item()])  # convert the 0-d tensor to a plain int
        print(f"  '{token}' → {prob.item():.4f}")

# Test induction
test_induction("foo bar baz foo bar baz foo")
print()
test_induction("Once upon a time, there was a cat. Once upon a time, there was a")

**Question:** Does the model predict tokens that continue the pattern? Successful copying is consistent with induction heads at work, though the predictions alone do not show which heads are responsible.

## Part 4: Extracting Steering Vectors

Now let's create steering vectors using contrastive pairs.

In [None]:
def extract_steering_vector(positive_prompts, negative_prompts, layer_idx=-1):
    """
    Extract a steering vector from contrastive examples.
    
    Args:
        positive_prompts: List of texts with the target concept
        negative_prompts: List of texts without the target concept
        layer_idx: Which layer to extract from
    
    Returns:
        Steering vector (mean difference)
    """
    pos_activations = []
    neg_activations = []
    
    # Extract activations for positive examples
    for prompt in positive_prompts:
        acts = get_activations(prompt, layer_idx)
        # Use the last token's activation
        pos_activations.append(acts[-1])
    
    # Extract activations for negative examples
    for prompt in negative_prompts:
        acts = get_activations(prompt, layer_idx)
        neg_activations.append(acts[-1])
    
    # Compute mean difference
    pos_mean = torch.stack(pos_activations).mean(dim=0)
    neg_mean = torch.stack(neg_activations).mean(dim=0)
    
    steering_vector = pos_mean - neg_mean
    
    return steering_vector

# Example: Create a "happiness" steering vector
happy_prompts = [
    "I am so happy and joyful",
    "This is wonderful and delightful",
    "I feel great and excited",
    "Everything is amazing",
]

sad_prompts = [
    "I am so sad and depressed",
    "This is terrible and awful",
    "I feel bad and upset",
    "Everything is horrible",
]

happiness_vector = extract_steering_vector(happy_prompts, sad_prompts, layer_idx=-1)
print("Extracted happiness steering vector")
print(f"Shape: {happiness_vector.shape}")
print(f"Magnitude: {happiness_vector.norm():.4f}")
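
To see why a difference of means can recover a concept direction, it helps to try the same computation on data where the answer is known. The sketch below uses synthetic activations with a planted direction, not GPT-2 activations. The sizes, noise level, and the name `concept` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, n_per_class = 16, 50

# Planted unit-length "concept" direction (hypothetical, for illustration)
concept = np.zeros(hidden_size)
concept[0] = 1.0

# Synthetic activations: Gaussian noise plus or minus the concept direction
pos = rng.normal(size=(n_per_class, hidden_size)) + 2.0 * concept
neg = rng.normal(size=(n_per_class, hidden_size)) - 2.0 * concept

# Difference of class means, the same computation as extract_steering_vector
steering = pos.mean(axis=0) - neg.mean(axis=0)

# The extracted vector should point close to the planted direction
cos = float(steering @ concept / np.linalg.norm(steering))
print(f"cosine(steering, concept) = {cos:.3f}")

# Projections onto the steering direction separate the two classes
unit = steering / np.linalg.norm(steering)
pos_proj, neg_proj = (pos @ unit).mean(), (neg @ unit).mean()
print(f"mean projection: pos = {pos_proj:.2f}, neg = {neg_proj:.2f}")
```

With only four prompts per class, as in the happiness example, the noise term is much larger relative to the signal, which is one reason to use more contrastive pairs for the project.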

### Exercise 4.1: Create Your Own Steering Vector

Design contrastive pairs for a concept relevant to your project.

In [None]:
# TODO: Create your concept steering vector

my_positive_prompts = [
    # Add prompts that exhibit your concept
]

my_negative_prompts = [
    # Add prompts that lack your concept
]

# Extract vector
# my_steering_vector = extract_steering_vector(my_positive_prompts, my_negative_prompts)
# print(f"My concept steering vector: {my_steering_vector.shape}")

## Part 5: Applying Steering Vectors

Now let's use our steering vector to modify model behavior.

In [None]:
def generate_with_steering(prompt, steering_vector, layer_idx, alpha=1.0, max_length=50):
    """
    Generate text with a steering vector added to one hidden state.
    
    Args:
        prompt: Input text
        steering_vector: Vector to add to activations, shape [hidden_size]
        layer_idx: Index into hidden_states, same convention as get_activations
            (0 = embedding output, i = output of block i-1, -1 = final layer norm output)
        alpha: Strength of steering
        max_length: Maximum total length in tokens (prompt + generated)
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    steering_vector = steering_vector.to(device)
    
    def steering_hook(module, input, output):
        # Blocks may return a tuple (hidden_states, ...); other modules return a tensor
        if isinstance(output, tuple):
            return (output[0] + alpha * steering_vector,) + output[1:]
        # Broadcasting adds the vector at every position
        return output + alpha * steering_vector
    
    # Pick the module whose output is hidden_states[layer_idx] in Hugging Face GPT-2,
    # so the vector is added at the same point it was extracted from
    n_layer = model.config.n_layer
    layer_idx = layer_idx % (n_layer + 1)  # map negative indices, e.g. -1 -> n_layer
    if layer_idx == 0:
        target_module = model.transformer.drop  # embedding output
    elif layer_idx == n_layer:
        target_module = model.transformer.ln_f  # final layer norm output
    else:
        target_module = model.transformer.h[layer_idx - 1]
    
    # Register hook
    handle = target_module.register_forward_hook(steering_hook)
    
    # Generate with steering
    try:
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=max_length,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id
            )
    finally:
        # Always remove the hook, even if generation fails
        handle.remove()
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test steering
test_prompt = "Today I went to the park and"

print("="*80)
print("NO STEERING:")
baseline = generate_with_steering(test_prompt, torch.zeros_like(happiness_vector), layer_idx=-1, alpha=0.0)
print(baseline)

print("\n" + "="*80)
print("POSITIVE STEERING (more happy):")
positive = generate_with_steering(test_prompt, happiness_vector, layer_idx=-1, alpha=2.0)
print(positive)

print("\n" + "="*80)
print("NEGATIVE STEERING (more sad):")
negative = generate_with_steering(test_prompt, happiness_vector, layer_idx=-1, alpha=-2.0)
print(negative)
print("="*80)
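
The hook mechanism that `generate_with_steering` relies on can be seen in isolation on a toy network. The sketch below is a minimal illustration, assuming a two-layer `nn.Sequential` as a stand-in for the language model and an arbitrary vector `v` as a stand-in for a steering vector. It checks that the hooked forward pass equals adding `alpha * v` to the first layer's output by hand, and that removing the hook restores the original behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))  # stand-in for a model
x = torch.randn(3, 4)                                  # batch of 3 inputs
v = torch.tensor([1.0, 0.0, -1.0, 0.0])                # stand-in steering vector
alpha = 2.0

def add_vector(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output
    return output + alpha * v

with torch.no_grad():
    baseline = toy(x)

    handle = toy[0].register_forward_hook(add_vector)
    try:
        steered = toy(x)
    finally:
        handle.remove()  # later calls are unaffected

    # Manual equivalent of the hooked forward pass
    expected = toy[1](toy[0](x) + alpha * v)
    after_removal = toy(x)

print(torch.allclose(steered, expected))        # True
print(torch.allclose(after_removal, baseline))  # True
```

Wrapping the forward pass in `try`/`finally` matters in a notebook: a hook left registered after an error would silently steer every later cell.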

### Exercise 5.1: Vary Steering Strength

Test different values of alpha to see how steering strength affects output.

In [None]:
# Test multiple steering strengths
test_prompt = "The movie was"
alphas = [-3.0, -1.0, 0.0, 1.0, 3.0]

for alpha in alphas:
    result = generate_with_steering(
        test_prompt, 
        happiness_vector, 
        layer_idx=-1, 
        alpha=alpha,
        max_length=30
    )
    print(f"\nα = {alpha:+.1f}: {result}")

### Exercise 5.2: Compare Layers

Which layer is best for steering? Let's find out.

In [None]:
# Extract steering vectors at different layers
layers_to_test = [0, 3, 6, 9, 11]
test_prompt = "I think that"

for layer in layers_to_test:
    # Extract vector at this layer
    steering_vec = extract_steering_vector(happy_prompts, sad_prompts, layer_idx=layer)
    
    # Apply steering
    result = generate_with_steering(
        test_prompt,
        steering_vec,
        layer_idx=layer,
        alpha=2.0,
        max_length=30
    )
    
    print(f"\nLayer {layer:2d}: {result}")

**Question:** Which layers produce the strongest steering effects? Why might middle or late layers work better?

## Part 6: Introduction to Neuronpedia

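Neuronpedia hosts pre-computed sparse autoencoder (SAE) features. An SAE decomposes a model's activations into a larger set of sparsely active features, many of which correspond to recognizable concepts. Let's explore how to use it.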

### Using Neuronpedia

1. **Visit**: Go to [neuronpedia.org](https://www.neuronpedia.org/)

2. **Select Model**: Choose GPT-2 small or another model

3. **Search Features**: Use the search bar to find concepts
   - Example: "positive sentiment", "medical", "legal"

4. **Examine Features**: For each feature, you can see:
   - Examples of text that maximally activate it
   - Which tokens trigger it
   - Layer and feature index

5. **Export Vectors**: For some models and SAE releases, feature vectors can be downloaded

### Exercise 6.1: Neuronpedia Exploration

Visit Neuronpedia and:
1. Search for features related to your concept
2. Record 3-5 relevant features (layer, index, description)
3. Note what kinds of examples activate each feature
4. Compare to your contrastive steering vectors: do they capture similar patterns?

### Manual SAE Feature Vector (Example)

If you have access to SAE weights, you can extract specific features:

In [None]:
# This is a placeholder - actual SAE usage requires loading trained dictionaries
# For your project, you may:
# 1. Download feature vectors from Neuronpedia
# 2. Load pre-trained SAEs
# 3. Train your own SAE (advanced)

print("SAE features would be used similarly to steering vectors:")
print("1. Load/download feature vector from Neuronpedia")
print("2. Apply it using generate_with_steering()")
print("3. Compare results with contrastive steering vectors")
print("\nFor this week's assignment, focus on contrastive extraction.")
print("SAE features provide an alternative that you can explore and compare.")
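
To make the connection between SAE features and steering vectors concrete, the sketch below shows the arithmetic of a sparse autoencoder on toy data. It assumes a standard ReLU SAE and uses random weights in place of a trained dictionary. The names `W_enc`, `W_dec`, `b_enc`, `b_dec`, and `feature_idx` are illustrative and do not correspond to a specific Neuronpedia or SAE library API.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, n_features = 8, 32  # toy sizes; real SAEs are far larger

# Random weights stand in for a trained SAE
W_enc = rng.normal(size=(hidden_size, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, hidden_size))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(hidden_size)

def sae_encode(activation):
    """Feature activations: ReLU of an affine map of the activation."""
    return np.maximum(0.0, (activation - b_dec) @ W_enc + b_enc)

def sae_decode(features):
    """Reconstruction: weighted sum of decoder directions plus a bias."""
    return features @ W_dec + b_dec

activation = rng.normal(size=hidden_size)  # stand-in for one token's activation
features = sae_encode(activation)
reconstruction = sae_decode(features)

# One feature's decoder row lives in activation space,
# so it can play the same role as a contrastive steering vector
feature_idx = 5                      # hypothetical feature index
feature_vector = W_dec[feature_idx]  # shape [hidden_size]
print(f"active features: {(features > 0).sum()} of {n_features}")
print(f"feature vector shape: {feature_vector.shape}")
```

A decoder row obtained this way is only meaningful at the layer and hook point the SAE was trained on, so it should be applied at that same location.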

## Part 7: Putting It All Together

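A complete project workflow template for your concept. Fill in the example lists before running: `extract_steering_vector` fails on an empty list.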

In [None]:
# Template for your project assignment

# 1. Define your concept
MY_CONCEPT = "[Your concept here]"

# 2. Create contrastive pairs
positive_examples = [
    # Add 10-20 examples with your concept
]

negative_examples = [
    # Add 10-20 examples without your concept
]

# 3. Extract steering vectors at multiple layers
layer_vectors = {}
for layer in [0, 3, 6, 9, 11]:
    layer_vectors[layer] = extract_steering_vector(
        positive_examples, 
        negative_examples, 
        layer_idx=layer
    )

# 4. Test steering on examples
test_prompts = [
    # Add test cases
]

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    
    # Baseline
    baseline = generate_with_steering(
        prompt, torch.zeros_like(layer_vectors[6]), 6, alpha=0.0
    )
    print(f"  Baseline: {baseline}")
    
    # Positive steering
    positive = generate_with_steering(
        prompt, layer_vectors[6], 6, alpha=2.0
    )
    print(f"  Positive: {positive}")
    
    # Negative steering
    negative = generate_with_steering(
        prompt, layer_vectors[6], 6, alpha=-2.0
    )
    print(f"  Negative: {negative}")

# 5. Analyze and document results

## Reflection Questions

Answer these in your project writeup:

1. **Activation Patterns**: How do activation vectors for similar concepts compare? What does this tell you about distributed representations?

2. **Layer Analysis**: Which layer(s) best encode your concept? Why might certain layers be better than others?

3. **Steering Quality**: How does steering strength (alpha) affect:
   - Presence of your concept
   - Text coherence and fluency
   - Unintended side effects

4. **Contrastive Pairs**: How did you design your contrastive examples? What makes a good contrast for your concept?

5. **Neuronpedia**: Did you find SAE features matching your concept? How do they compare to your extracted vectors?

6. **Linear Representation**: Do your results support the linear representation hypothesis? Why or why not?

## Next Steps

For your assignment:
1. Create comprehensive contrastive datasets for your concept
2. Extract and analyze steering vectors across layers
3. Demonstrate successful steering on diverse examples
4. Explore Neuronpedia for related features
5. Document your findings and insights

Save your steering vectors - you'll use them in future weeks!
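
One simple way to save the vectors is a single NumPy archive keyed by layer. The sketch below is a suggestion rather than a required format. It uses random stand-in vectors and a temporary directory, and the file name and `layer_<n>` key scheme are assumptions. Real tensors can be converted with `vec.cpu().numpy()` first; calling `torch.save` on the dictionary is an equally reasonable choice.

```python
import os
import tempfile
import numpy as np

# Stand-in vectors keyed by layer; with real tensors use vec.cpu().numpy()
layer_vectors = {
    layer: np.random.default_rng(layer).normal(size=768)
    for layer in [0, 3, 6, 9, 11]
}

# Save all vectors into one .npz archive, one named array per layer
path = os.path.join(tempfile.mkdtemp(), "steering_vectors.npz")
np.savez(path, **{f"layer_{layer}": vec for layer, vec in layer_vectors.items()})

# Reload and rebuild the layer -> vector mapping
with np.load(path) as data:
    restored = {int(key.split("_")[1]): data[key] for key in data.files}

print(sorted(restored))
```

Recording the concept name, the prompts, and the extraction layer alongside the file makes the vectors much easier to reuse later.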