# Week 2 Exercise: Steering Language Models

In this exercise, you'll gain hands-on experience with:
- Loading a transformer and examining its architecture
- Extracting and visualizing activation vectors
- Creating steering vectors from contrastive pairs
- Applying steering to control model behavior
- Using Neuronpedia to find concept features

## Setup

Install required libraries:

In [None]:
!pip install transformers torch numpy matplotlib einops circuitsvis -q

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer
from einops import rearrange
import warnings
warnings.filterwarnings('ignore')

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## Part 1: Loading and Examining a Transformer

Let's load GPT-2 and explore its architecture.

In [None]:
# Load GPT-2 small
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = model.to(device)
model.eval()  # Evaluation mode

print(f"Model: {model_name}")
print(f"Number of layers: {model.config.n_layer}")
print(f"Hidden size: {model.config.n_embd}")
print(f"Number of attention heads: {model.config.n_head}")
print(f"Vocabulary size: {model.config.vocab_size}")

### Examining Architecture Components

Let's look at the main components:

In [None]:
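# Print model structure
print("\nModel Structure:")
print("="*80)
for name, module in model.named_modules():
    is_leaf = len(list(module.children())) == 0  # Leaf modules only
    # Every block has the same structure, so show only block 0 (transformer.h.0)
    in_other_block = name.startswith("transformer.h.") and not name.startswith("transformer.h.0.")
    if is_leaf and not in_other_block:
        print(f"{name}: {module.__class__.__name__}")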

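Key components (GPT-2 is decoder-only, so there is no separate encoder):
- `transformer.wte`: Token embeddings (map token IDs to vectors)
- `transformer.wpe`: Position embeddings
- `transformer.h`: Stack of transformer blocks (attention + MLP)
- `transformer.ln_f`: Final layer normalization
- `lm_head`: Output layer (maps hidden states to vocabulary logits)

Each transformer block has:
- Multi-head self-attention
- MLP (feedforward)
- Layer normalization before the attention and before the MLP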
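
To see these pieces concretely, the sketch below lists a module's direct children with their parameter counts. `summarize_children` is a hypothetical helper, not part of `transformers`; it is shown on a toy block with arbitrary sizes so it runs without downloading a model.

```python
import torch.nn as nn

def summarize_children(module: nn.Module):
    """Return (name, class name, parameter count) for each direct child."""
    rows = []
    for name, child in module.named_children():
        n_params = sum(p.numel() for p in child.parameters())
        rows.append((name, child.__class__.__name__, n_params))
    return rows

# Toy stand-in for a transformer block (assumption: sizes are arbitrary).
toy_block = nn.ModuleDict({
    "ln_1": nn.LayerNorm(8),
    "mlp": nn.Sequential(nn.Linear(8, 32), nn.GELU(), nn.Linear(32, 8)),
})

for name, cls, n_params in summarize_children(toy_block):
    print(f"{name:6s} {cls:12s} {n_params:>6,d} parameters")
```

In the notebook, calling `summarize_children(model.transformer.h[0])` on the loaded GPT-2 should list the block's `ln_1`, `attn`, `ln_2`, and `mlp` children.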

## Part 2: Extracting Activation Vectors

Let's extract activations at different layers to see internal representations.

In [None]:
def get_activations(text, layer_idx=-1):
    """
    Extract activation vectors at a specific layer.
    
    Args:
        text: Input text
        layer_idx: Index into hidden_states (0 = embedding output,
            i = output of block i, -1 = last layer)
    
    Returns:
        activations: [num_tokens, hidden_size] tensor
    """
    inputs = tokenizer(text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        # output_hidden_states=True gives us all layer activations
        outputs = model(**inputs, output_hidden_states=True)
        
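        # hidden_states is a tuple of (num_layers+1) tensors:
        # index 0 is the embedding output, index i is the output of block i
        # (the last entry is taken after the final layer norm).
        # Each is [batch_size, seq_len, hidden_size]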
        hidden_states = outputs.hidden_states
        
        # Extract the desired layer
        activations = hidden_states[layer_idx][0]  # Remove batch dimension
    
    return activations.cpu()

# Test it
text = "The quick brown fox jumps over the lazy dog"
tokens = tokenizer.tokenize(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")

# Get activations from last layer
activations = get_activations(text, layer_idx=-1)
print(f"\nActivation shape: {activations.shape}")
print(f"  {activations.shape[0]} tokens × {activations.shape[1]} dimensions")

### Visualizing Activations Across Layers

In [None]:
def visualize_layer_activations(text, token_idx=-1):
    """
    Show how activation magnitudes change across layers for a specific token.
    """
    inputs = tokenizer(text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden_states = outputs.hidden_states
    
    # Extract activation norms at each layer
    norms = []
    for layer_activations in hidden_states:
        # Get the specific token's activation
        token_activation = layer_activations[0, token_idx, :]
        # Compute L2 norm
        norms.append(token_activation.norm().item())
    
    # Plot
    plt.figure(figsize=(10, 5))
    plt.plot(norms, marker='o')
    plt.xlabel('Layer')
    plt.ylabel('Activation Magnitude (L2 norm)')
    plt.title(f'Activation magnitude across layers for token: "{tokenizer.tokenize(text)[token_idx]}"')
    plt.grid(True, alpha=0.3)
    plt.show()

# Visualize
text = "The cat sat on the mat"
visualize_layer_activations(text, token_idx=-1)

### Exercise 2.1: Compare Activation Patterns

Extract activations for similar vs. dissimilar words and compute their similarity.

In [None]:
def cosine_similarity(v1, v2):
    """Compute cosine similarity between two vectors."""
    return (v1 @ v2) / (v1.norm() * v2.norm())

# Compare "cat" vs "dog" vs "democracy"
words = ["The cat", "The dog", "The democracy"]
activations_list = []

for word in words:
    act = get_activations(word, layer_idx=-1)
    # Get the last token's activation
    activations_list.append(act[-1])

# Compute pairwise similarities
print("Cosine Similarities:")
for i, word1 in enumerate(words):
    for j, word2 in enumerate(words):
        if i < j:
            sim = cosine_similarity(activations_list[i], activations_list[j])
            print(f"  {word1} ↔ {word2}: {sim:.4f}")

**Question:** Are "cat" and "dog" more similar to each other than to "democracy"? Why?

## Part 3: Extracting Steering Vectors

Now let's create steering vectors using contrastive pairs.
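
A common way to build such a vector is the difference of mean activations: run the positive and the negative examples through the model, average each group's activation at one layer, and subtract. The following is a minimal sketch of one possible `extract_steering_vector`, not necessarily the course's reference implementation. The last-token pooling, the optional `lm`/`tok` overrides, and the indexing convention (`layer_idx` is a block index, read from `hidden_states[layer_idx + 1]`) are assumptions made for illustration.

```python
import torch

def extract_steering_vector(positive_examples, negative_examples, layer_idx,
                            lm=None, tok=None):
    """Mean last-token activation of the positives minus that of the negatives.

    layer_idx is a 0-based block index; block i's output is
    hidden_states[i + 1] because hidden_states[0] is the embedding output.
    """
    # Assumption: fall back to the notebook's global `model` / `tokenizer`
    # when no explicit model or tokenizer is passed in.
    lm = model if lm is None else lm
    tok = tokenizer if tok is None else tok
    dev = next(lm.parameters()).device

    def mean_last_token(texts):
        vecs = []
        for text in texts:
            inputs = tok(text, return_tensors="pt").to(dev)
            with torch.no_grad():
                out = lm(**inputs, output_hidden_states=True)
            # [batch=0, last token, all hidden dimensions]
            vecs.append(out.hidden_states[layer_idx + 1][0, -1, :])
        return torch.stack(vecs).mean(dim=0)

    return (mean_last_token(positive_examples)
            - mean_last_token(negative_examples)).cpu()
```

With GPT-2 loaded as above, `extract_steering_vector(positive_examples, negative_examples, layer_idx=6)` should return a 768-dimensional tensor. One caveat: for the final block, Hugging Face's GPT-2 reports the hidden state after the final layer norm, so a vector extracted there lives in a slightly different space from that block's raw output.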

### Exercise 3.1: Create Your Own Steering Vector

Design contrastive pairs for a concept relevant to your project.
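
One way to check that your pairs isolate the concept is to project activations onto the steering direction: positive examples should score higher than negative ones, ideally on examples held out from extraction. The sketch below illustrates the idea. `projection_scores` is a hypothetical helper, and random tensors stand in for real activations.

```python
import torch

def projection_scores(activations: torch.Tensor, steering_vector: torch.Tensor) -> torch.Tensor:
    """Scalar projection of each row of `activations` onto the steering direction."""
    direction = steering_vector / steering_vector.norm()
    return activations @ direction

# Synthetic stand-ins (assumption): 5 positive and 5 negative activations, 768-d.
torch.manual_seed(0)
concept = torch.randn(768)
pos_acts = torch.randn(5, 768) + concept
neg_acts = torch.randn(5, 768) - concept

vector = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
pos_scores = projection_scores(pos_acts, vector)
neg_scores = projection_scores(neg_acts, vector)
print(f"mean positive score: {pos_scores.mean().item():.2f}")
print(f"mean negative score: {neg_scores.mean().item():.2f}")
```

In the notebook, the activation matrices could be built by stacking `get_activations(text, layer_idx)[-1]` over your examples. If the two score distributions overlap heavily, the pairs may differ in more than the intended concept.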

## Part 4: Applying Steering Vectors

Now let's use our steering vector to modify model behavior.
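
Steering is usually implemented by adding `alpha * steering_vector` to the hidden state at one layer during the forward pass. The sketch below does this with a PyTorch forward hook on `transformer.h[layer_idx]`. It is one possible `generate_with_steering`, not necessarily the course's reference implementation. The `lm`/`tok` overrides, the `max_new_tokens` default, and greedy decoding are assumptions made for illustration.

```python
import torch

def generate_with_steering(prompt, steering_vector, layer_idx, alpha=1.0,
                           max_new_tokens=30, lm=None, tok=None):
    """Generate text while adding alpha * steering_vector to one block's output."""
    # Assumption: fall back to the notebook's global `model` / `tokenizer`.
    lm = model if lm is None else lm
    tok = tokenizer if tok is None else tok
    dev = next(lm.parameters()).device
    vec = steering_vector.to(dev)

    def add_vector(module, inputs, output):
        # A block may return a tuple whose first element is the hidden state,
        # or the hidden state tensor itself; handle both.
        if isinstance(output, tuple):
            return (output[0] + alpha * vec,) + tuple(output[1:])
        return output + alpha * vec

    handle = lm.transformer.h[layer_idx].register_forward_hook(add_vector)
    try:
        inputs = tok(prompt, return_tensors="pt").to(dev)
        with torch.no_grad():
            out_ids = lm.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,  # greedy decoding keeps comparisons repeatable
                pad_token_id=tok.eos_token_id,
            )
    finally:
        handle.remove()  # always detach the hook, even if generation fails
    return tok.decode(out_ids[0], skip_special_tokens=True)
```

Removing the hook in a `finally` block matters in a notebook: a hook left attached would silently steer every later call to the model. Calling the function with `alpha=0.0` gives the unsteered baseline.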

### Exercise 4.1: Vary Steering Strength

Test different values of alpha to see how steering strength affects output.
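
A sweep can be organized as a loop over alpha values that collects one output per setting. In the sketch below, `sweep_alpha` and `print_sweep` are hypothetical helpers. The generation function is passed in as an argument, and a stand-in generator is used so the example runs without a model.

```python
def sweep_alpha(generate_fn, prompt, steering_vector, layer_idx,
                alphas=(-4.0, -2.0, 0.0, 2.0, 4.0)):
    """Run generate_fn once per alpha and collect the outputs in order."""
    results = {}
    for alpha in alphas:
        results[alpha] = generate_fn(prompt, steering_vector, layer_idx, alpha=alpha)
    return results

def print_sweep(results):
    """Print one line per alpha so outputs are easy to compare by eye."""
    for alpha, text in results.items():
        print(f"alpha={alpha:+.1f}: {text}")

# Stand-in generator so the sketch runs without a model (assumption).
def fake_generate(prompt, steering_vector, layer_idx, alpha=1.0):
    return f"{prompt} [layer {layer_idx}, alpha {alpha:+.1f}]"

print_sweep(sweep_alpha(fake_generate, "I think that", None, 6))
```

In the notebook, pass `generate_with_steering` and a real steering vector in place of `fake_generate` and `None`. Including `0.0` in the sweep gives a baseline, and very large magnitudes often degrade fluency, so it helps to note where the output stops being coherent.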

### Exercise 4.2: Compare Layers

Which layer is best for steering? Let's find out.
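
Because activation norms change across layers, the same `alpha` can be a small nudge at one layer and a large one at another. The sketch below steers at each layer in turn and can optionally rescale every vector to a common norm first. `compare_layers` and `rescale_to_norm` are hypothetical helpers, equalizing norms is one possible design choice rather than a required step, and a stand-in generator replaces the model.

```python
import torch

def rescale_to_norm(vector: torch.Tensor, target_norm: float) -> torch.Tensor:
    """Return `vector` scaled to have L2 norm `target_norm`."""
    return vector * (target_norm / vector.norm())

def compare_layers(generate_fn, prompt, layer_vectors, alpha=2.0, target_norm=None):
    """Steer at each layer in turn; optionally equalize vector norms first."""
    outputs = {}
    for layer, vec in sorted(layer_vectors.items()):
        if target_norm is not None:
            vec = rescale_to_norm(vec, target_norm)
        outputs[layer] = generate_fn(prompt, vec, layer, alpha=alpha)
    return outputs

# Stand-in generator (assumption) that only reports the perturbation size.
def fake_generate(prompt, vec, layer, alpha=1.0):
    return f"layer {layer}: |alpha * v| = {(alpha * vec).norm().item():.1f}"

layer_vectors = {3: torch.ones(4), 6: 5 * torch.ones(4)}
for text in compare_layers(fake_generate, "Hi", layer_vectors,
                           alpha=2.0, target_norm=1.0).values():
    print(text)
```

In the notebook, pass `generate_with_steering` and your per-layer vectors. Leaving `target_norm=None` uses each vector's natural magnitude, which is also informative: it shows how strongly each layer separates your contrastive pairs.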

**Question:** Which layers produce the strongest steering effects? Why might middle or late layers work better?

## Part 5: Introduction to Neuronpedia

Neuronpedia hosts pre-computed sparse autoencoder (SAE) features for several models. Let's explore how to use it.

### Exercise 5.1: Neuronpedia Exploration

1. **Visit**: Go to [neuronpedia.org](https://www.neuronpedia.org/)

2. **Select Model**: Choose GPT-2 small or another model

3. **Search Features**: Use the search bar to find concepts
   - Example: "positive sentiment", "medical", "legal"

4. **Examine Features**: For each feature, you can see:
   - Examples of text that maximally activate it
   - Which tokens trigger it
   - Layer and feature index

5. **Export Vectors**: Some versions allow downloading feature vectors

### Exercise 5.2: Relate Neuronpedia Features to Your Concept

Visit Neuronpedia and:
1. Search for features related to your concept
2. Record 3-5 relevant features (layer, index, description)
3. Note what kinds of examples activate each feature
4. Compare to your contrastive steering vectors: do they capture similar patterns?
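
For step 4, one simple quantitative comparison is the cosine similarity between a feature's direction and your steering vector. The sketch below assumes you already hold each feature direction as a 1-D tensor in the same layer's activation space as the steering vector. How you obtain it is not shown, the labels are made up, and `direction_similarity` and `rank_features` are hypothetical helpers.

```python
import torch
import torch.nn.functional as F

def direction_similarity(feature_direction, steering_vector):
    """Cosine similarity between two 1-D direction vectors."""
    return F.cosine_similarity(feature_direction, steering_vector, dim=0).item()

def rank_features(feature_directions, steering_vector):
    """Sort {label: direction} by similarity to the steering vector, highest first."""
    scored = {label: direction_similarity(d, steering_vector)
              for label, d in feature_directions.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Synthetic 4-d stand-ins (assumption); real GPT-2 small directions are 768-d.
steering = torch.tensor([1.0, 0.0, 0.0, 0.0])
features = {
    "aligned": torch.tensor([2.0, 0.0, 0.0, 0.0]),
    "orthogonal": torch.tensor([0.0, 1.0, 0.0, 0.0]),
    "opposed": torch.tensor([-1.0, 0.0, 0.0, 0.0]),
}
for label, sim in rank_features(features, steering):
    print(f"{label:10s} {sim:+.2f}")
```

A similarity near zero does not mean the feature is irrelevant: a contrastive steering vector can blend several features, so also compare the texts that activate each feature against your positive examples.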

In [None]:
# This is a placeholder - actual SAE usage requires loading trained dictionaries
# For your project, you may:
# 1. Download feature vectors from Neuronpedia
# 2. Load pre-trained SAEs
# 3. Train your own SAE (advanced)

print("SAE features would be used similarly to steering vectors:")
print("1. Load/download feature vector from Neuronpedia")
print("2. Apply it using generate_with_steering()")
print("3. Compare results with contrastive steering vectors")
print("\nFor this week's assignment, focus on contrastive extraction.")
print("SAE features provide an alternative that you can explore and compare.")

## Part 6: Putting It All Together

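The template below walks through the complete project workflow for your concept. It calls the `extract_steering_vector` and `generate_with_steering` helpers, so make sure both are defined before running it.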

In [None]:
# Template for your project assignment

# 1. Define your concept
MY_CONCEPT = "[Your concept here]"

# 2. Create contrastive pairs
positive_examples = [
    # Add 10-20 examples with your concept
]

negative_examples = [
    # Add 10-20 examples without your concept
]

# 3. Extract steering vectors at multiple layers
layer_vectors = {}
for layer in [0, 3, 6, 9, 11]:
    layer_vectors[layer] = extract_steering_vector(
        positive_examples, 
        negative_examples, 
        layer_idx=layer
    )

# 4. Test steering on examples
test_prompts = [
    # Add test cases
]

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    
    # Baseline
    baseline = generate_with_steering(
        prompt, torch.zeros_like(layer_vectors[6]), 6, alpha=0.0
    )
    print(f"  Baseline: {baseline}")
    
    # Positive steering
    positive = generate_with_steering(
        prompt, layer_vectors[6], 6, alpha=2.0
    )
    print(f"  Positive: {positive}")
    
    # Negative steering
    negative = generate_with_steering(
        prompt, layer_vectors[6], 6, alpha=-2.0
    )
    print(f"  Negative: {negative}")

# 5. Analyze and document results

## Next Steps

For your assignment:
1. Create comprehensive contrastive datasets for your concept
2. Extract and analyze steering vectors across layers
3. Demonstrate successful steering on diverse examples
4. Explore Neuronpedia for related features
5. Document your findings and insights

Save your steering vectors - you'll use them in future weeks!
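
One way to save them is with `torch.save`, storing the concept name next to a dictionary that maps layer index to vector. The sketch below is a minimal round trip. The file name, the payload layout, and the two helper functions are assumptions, and placeholder vectors stand in for real ones.

```python
import os
import tempfile
import torch

def save_steering_vectors(layer_vectors, path, concept):
    """Save {layer: vector} together with the concept name."""
    payload = {
        "concept": concept,
        "vectors": {int(k): v.detach().cpu() for k, v in layer_vectors.items()},
    }
    torch.save(payload, path)

def load_steering_vectors(path):
    """Load the concept name and the {layer: vector} dictionary."""
    payload = torch.load(path, map_location="cpu")
    return payload["concept"], payload["vectors"]

# Round-trip demo with placeholder vectors (assumption: 768-d, as in GPT-2 small).
demo_vectors = {3: torch.zeros(768), 6: torch.ones(768)}
path = os.path.join(tempfile.mkdtemp(), "steering_vectors.pt")
save_steering_vectors(demo_vectors, path, concept="demo")
concept, restored = load_steering_vectors(path)
print(concept, sorted(restored))
```

In the notebook, pass the `layer_vectors` dictionary from the template and a path you will keep. It is also worth recording the model name next to the file, since a vector is only meaningful for the model and layer it came from.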