# Steering with SAE Features via NDIF

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nix07/neural-mechanics-web/blob/main/labs/week2/sae_steering_ndif.ipynb)

This notebook demonstrates **activation steering** using Sparse Autoencoder (SAE) features on large models via NDIF.

**Key Idea:** SAE decoder vectors represent interpretable "directions" in activation space. By adding these vectors to model activations during generation, we can steer model behavior toward specific concepts.
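
A compact way to write this down, using notation introduced here for illustration (it follows the standard SAE formulation rather than the API of any particular library):

$$
\hat{x} = b_{\text{dec}} + \sum_i f_i(x)\, W_{\text{dec}}[i], \qquad h' = h + \alpha\, W_{\text{dec}}[i]
$$

Here $f_i(x) \ge 0$ is the activation of feature $i$, $W_{\text{dec}}[i]$ is its decoder row, $h$ is the residual-stream activation at the steered layer, and $\alpha$ is the steering strength.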

## What We'll Do
1. Load pretrained SAE weights from HuggingFace
2. Find features related to humor/puns using Neuronpedia
3. Extract decoder vectors for those features
4. Use nnsight/NDIF to steer generation on large models

## References
- [SAE Lens](https://github.com/jbloomAus/SAELens) - Library for training and loading SAEs
- [Neuronpedia](https://www.neuronpedia.org/) - Platform for exploring SAE features
- [Steering Using SAE Features](https://docs.neuronpedia.org/steering) - Neuronpedia steering docs
- [nnsight](https://nnsight.net/) - Neural network inspection library

## Setup

In [None]:
!pip install -q nnsight sae-lens requests

In [None]:
import torch
import requests
import json
from nnsight import LanguageModel, CONFIG
from sae_lens import SAE

# Configure NDIF API key from Colab secrets
try:
    from google.colab import userdata
    CONFIG.set_default_api_key(userdata.get('NDIF_API'))
except Exception:
    pass  # Not in Colab, or the NDIF_API secret is not set

# Use NDIF for remote execution on large models
REMOTE = True

# For local testing, set REMOTE = False
if REMOTE:
    MODEL_ID = "google/gemma-2-2b"
else:
    MODEL_ID = "gpt2"

In [None]:
# Load the model
model = LanguageModel(MODEL_ID, device_map="auto")

print(f"Model: {MODEL_ID}")
print(f"Layers: {model.config.num_hidden_layers}")
print(f"Hidden size: {model.config.hidden_size}")

## Part 1: Load Pretrained SAE from HuggingFace

SAE Lens provides pretrained SAEs for many models. Each SAE has:
- **Encoder**: Maps activations to sparse feature activations
- **Decoder**: Maps feature activations back to residual stream (these are our steering vectors!)
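
To make the two halves concrete, here is a minimal sketch of a toy SAE in plain Python. The weights are made up, and the plain ReLU encoder is an assumption for illustration; real pretrained SAEs may use a different activation function and are far larger.

```python
# Toy SAE: 2-dimensional activations, 3 features (all weights are made up).
W_enc = [[1.0, 0.0, -1.0],   # shape (d_in=2, d_sae=3)
         [0.0, 1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0],         # shape (d_sae=3, d_in=2): one row per feature
         [0.0, 1.0],
         [-0.5, -0.5]]
b_dec = [0.0, 0.0]

def encode(x):
    """Sparse feature activations: ReLU(x @ W_enc + b_enc)."""
    pre = [sum(x[i] * W_enc[i][j] for i in range(len(x))) + b_enc[j]
           for j in range(len(b_enc))]
    return [max(0.0, p) for p in pre]

def decode(f):
    """Reconstruction in activation space: f @ W_dec + b_dec."""
    return [sum(f[j] * W_dec[j][k] for j in range(len(f))) + b_dec[k]
            for k in range(len(b_dec))]

x = [2.0, 0.5]
features = encode(x)      # the third feature stays off for this input
x_hat = decode(features)  # reconstruction built from the active decoder rows
```

Each row of `W_dec` plays the role that `sae.W_dec[feature_idx]` plays later in the notebook: it is the direction written into activation space when that feature is active.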

In [None]:
# Load a pretrained SAE for Gemma-2-2B
# See: https://github.com/jbloomAus/SAELens/blob/main/sae_lens/pretrained_saes.yaml

# For Gemma-2-2B, we use the Gemma Scope residual-stream SAEs.
# The "-canonical" release exposes one recommended SAE per layer and width.
# TARGET_LAYER is the layer whose OUTPUT we steer, so it must match the SAE's hook point.

if "gemma" in MODEL_ID.lower():
    SAE_RELEASE = "gemma-scope-2b-pt-res-canonical"
    SAE_ID = "layer_12/width_16k/canonical"  # Layer 12, 16k features
    TARGET_LAYER = 12
elif "gpt2" in MODEL_ID.lower():
    SAE_RELEASE = "gpt2-small-res-jb"
    SAE_ID = "blocks.6.hook_resid_pre"
    TARGET_LAYER = 5  # resid_pre of block 6 is the output of block 5
else:
    raise ValueError(f"No pretrained SAE configured for {MODEL_ID}")

print(f"Loading SAE: {SAE_RELEASE} / {SAE_ID}")

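# Load the SAE. Depending on the installed sae-lens version, from_pretrained
# returns either the SAE alone or a (sae, cfg_dict, sparsity) tuple.
loaded = SAE.from_pretrained(
    release=SAE_RELEASE,
    sae_id=SAE_ID,
    device="cpu"  # Load to CPU, we'll send vectors to NDIF
)
sae = loaded[0] if isinstance(loaded, tuple) else loaded

print("SAE loaded!")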
print(f"  Features: {sae.cfg.d_sae}")
print(f"  Hidden dim: {sae.cfg.d_in}")
print(f"  Decoder shape: {sae.W_dec.shape}")

## Part 2: Find Interesting Features via Neuronpedia

Neuronpedia catalogs SAE features with automated explanations. Let's search for features related to humor, jokes, and puns.
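
Each catalogued feature also has a dashboard page you can open in a browser. The helper below is a small sketch for building that link; `neuronpedia_url` is a hypothetical name introduced here, and the source ID `12-gemmascope-res-16k` is an assumption about how Neuronpedia labels the layer-12, 16k-wide Gemma Scope residual SAE, so check it against the site before relying on it.

```python
def neuronpedia_url(model_id, source, index):
    """Build the dashboard URL for one feature (hypothetical helper)."""
    return f"https://www.neuronpedia.org/{model_id}/{source}/{index}"

# Assumed source ID for the layer-12, 16k-wide Gemma Scope residual SAE
example_url = neuronpedia_url("gemma-2-2b", "12-gemmascope-res-16k", 1000)
print(example_url)
```

Reading the top activating examples on the dashboard is a useful sanity check before trusting an automated explanation.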

In [None]:
# Neuronpedia API helper functions
NP_API_BASE = "https://www.neuronpedia.org/api"

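def search_features(query, model_id="gemma-2-2b", limit=10):
    """Search Neuronpedia for features matching a query.

    Returns the parsed JSON response, or [] if the request fails.
    NOTE: the endpoint path and parameters here are assumptions; consult the
    Neuronpedia API docs if every search comes back empty.
    """
    url = f"{NP_API_BASE}/search"
    params = {
        "q": query,
        "modelId": model_id,
        "limit": limit
    }
    try:
        response = requests.get(url, params=params, timeout=10)
        if response.status_code == 200:
            return response.json()
    except (requests.RequestException, ValueError):
        pass  # Network failure or a non-JSON response
    return []

def get_feature_info(model_id, source, index):
    """Get detailed info about a specific feature, or None if the request fails."""
    url = f"{NP_API_BASE}/feature/{model_id}/{source}/{index}"
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.json()
    except (requests.RequestException, ValueError):
        pass  # Network failure or a non-JSON response
    return None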

In [None]:
# Search for humor-related features
search_terms = ["humor", "joke", "funny", "pun", "wordplay", "comedy"]

found_features = []
np_model_id = "gemma-2-2b" if "gemma" in MODEL_ID.lower() else "gpt2-small"

print("Searching Neuronpedia for humor-related features...\n")

for term in search_terms:
    results = search_features(term, model_id=np_model_id, limit=3)
    if results:
        print(f"'{term}':")
        for f in results:
            source = f.get('source', '')
            index = f.get('index', '')
            expl = (f.get('explanation') or '')[:60]  # explanation may be missing or None
            print(f"  {source}/{index}: {expl}")
            found_features.append({
                'source': source,
                'index': index,
                'explanation': f.get('explanation', ''),
                'term': term
            })
    else:
        print(f"'{term}': No results")

print(f"\nFound {len(found_features)} potentially relevant features")

## Part 3: Extract Steering Vectors

Each SAE feature has a corresponding decoder vector. This vector represents the "direction" in activation space that the feature encodes.
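
Before combining several features, it can help to check how similar their directions are: near-parallel decoder vectors add little beyond a larger effective strength. The sketch below computes cosine similarity on plain Python lists; the example vectors are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions
```

With the loaded SAE, the equivalent check is `torch.nn.functional.cosine_similarity(sae.W_dec[i], sae.W_dec[j], dim=0)`.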

In [None]:
def get_steering_vector(sae, feature_idx):
    """
    Extract the steering vector for a specific feature.
    
    The decoder weight W_dec[feature_idx] is the direction in 
    activation space that this feature represents.
    """
    # W_dec shape: (d_sae, d_in) - each row is a feature's decoder vector
    steering_vector = sae.W_dec[feature_idx].detach().clone()
    return steering_vector

# Example: Get steering vector for a specific feature
# You can change this to any feature index you want to explore
EXAMPLE_FEATURE_IDX = 1000  # Change this based on Neuronpedia search

steering_vec = get_steering_vector(sae, EXAMPLE_FEATURE_IDX)
print(f"Steering vector for feature {EXAMPLE_FEATURE_IDX}:")
print(f"  Shape: {steering_vec.shape}")
print(f"  Norm: {steering_vec.norm().item():.4f}")

In [None]:
# Create a combined steering vector from multiple features
def create_combined_steering_vector(sae, feature_indices, weights=None):
    """
    Combine multiple feature vectors into a single steering direction.
    
    Args:
        sae: The loaded SAE
        feature_indices: List of feature indices to combine
        weights: Optional weights for each feature (default: equal)
        
    Returns:
        Combined steering vector
    """
    if weights is None:
        weights = [1.0] * len(feature_indices)
    
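    # Match the decoder's dtype and device; detach so the result carries no gradient
    W_dec = sae.W_dec.detach()
    combined = torch.zeros(sae.cfg.d_in, dtype=W_dec.dtype, device=W_dec.device)
    for idx, weight in zip(feature_indices, weights):
        combined += weight * W_dec[idx]

    # Normalize to unit length (guard against a zero vector, e.g. cancelling weights)
    norm = combined.norm()
    if norm == 0:
        raise ValueError("Combined steering vector has zero norm")
    combined = combined / norm

    return combined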

# Example: Combine features related to humor
# Replace these with actual feature indices from Neuronpedia
HUMOR_FEATURES = [1000, 2000, 3000]  # Placeholder - replace with real indices

humor_steering = create_combined_steering_vector(sae, HUMOR_FEATURES)
print(f"Combined humor steering vector:")
print(f"  Shape: {humor_steering.shape}")
print(f"  Norm: {humor_steering.norm().item():.4f}")

## Part 4: Steering Generation via NDIF

Now we use nnsight to add our steering vector to the model's activations during generation. This nudges the model toward generating content related to our target concept.
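
Stripped of the model, the intervention itself is just a broadcasted addition: the same vector, scaled by the strength, is added at every token position. A minimal sketch on nested lists (the function name `steer` and the numbers are illustrative, not part of nnsight):

```python
def steer(resid, direction, strength):
    """Add strength * direction to every position of a (seq_len, d_model) list."""
    for pos in resid:
        if len(pos) != len(direction):
            raise ValueError("direction must match the hidden size")
    return [[h + strength * d for h, d in zip(pos, direction)] for pos in resid]

resid = [[1.0, 2.0],   # position 0
         [3.0, 4.0]]   # position 1
direction = [0.0, 1.0]

steered = steer(resid, direction, strength=2.0)
unchanged = steer(resid, direction, strength=0.0)
```

A strength of zero leaves the activations untouched, which is why strength 0.0 can serve as the baseline in the comparisons later in the notebook.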

In [None]:
class SAESteeringGenerator:
    """
    Generate text with SAE-based steering via NDIF.
    """
    
    def __init__(self, model, steering_vector, target_layer, strength=1.0):
        self.model = model
        self.steering_vector = steering_vector
        self.target_layer = target_layer
        self.strength = strength
    
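    def generate(self, prompt, max_new_tokens=50, remote=True):
        """
        Generate text with steering applied.

        The steering vector is added to the residual stream at the target layer
        at every position (prompt and generated tokens). Each step re-runs the
        full sequence, so the intervention is reapplied on every forward pass.
        """
        tokens = self.model.tokenizer.encode(prompt)
        generated = list(tokens)

        # GPT-2 exposes its blocks as transformer.h; Gemma uses model.layers
        if self.model.config.model_type == "gpt2":
            layers = self.model.transformer.h
        else:
            layers = self.model.model.layers

        for _ in range(max_new_tokens):
            input_ids = torch.tensor([generated])

            with self.model.trace(remote=remote) as tracer:
                with tracer.invoke(input_ids):
                    # Get the residual stream at the target layer
                    resid = layers[self.target_layer].output[0]

                    # Add steering vector to all positions
                    steering = self.steering_vector.to(resid.device)
                    resid[:, :, :] = resid + self.strength * steering

                    # Get output logits
                    logits = self.model.output.logits.save()

            # Greedy decoding: pick the highest-scoring next token.
            # Older nnsight versions wrap saved values in a proxy with .value;
            # newer versions hand back the tensor directly.
            next_logits = getattr(logits, "value", logits)[0, -1, :]
            next_token = torch.argmax(next_logits).item()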
            
            if next_token == self.model.tokenizer.eos_token_id:
                break
            
            generated.append(next_token)
        
        return self.model.tokenizer.decode(generated)
    
    def generate_unsteered(self, prompt, max_new_tokens=50, remote=True):
        """
        Generate text WITHOUT steering for comparison.
        """
        tokens = self.model.tokenizer.encode(prompt)
        generated = list(tokens)
        
        for _ in range(max_new_tokens):
            input_ids = torch.tensor([generated])
            
            with self.model.trace(remote=remote) as tracer:
                with tracer.invoke(input_ids):
                    logits = self.model.output.logits.save()
            
            # Greedy decoding; .value only exists on older nnsight proxies
            next_logits = getattr(logits, "value", logits)[0, -1, :]
            next_token = torch.argmax(next_logits).item()
            
            if next_token == self.model.tokenizer.eos_token_id:
                break
            
            generated.append(next_token)
        
        return self.model.tokenizer.decode(generated)

In [None]:
# Create a steered generator
# Use a single feature's decoder vector as the steering direction

# Get a steering vector (replace with a real humor-related feature index)
STEERING_FEATURE = 5000  # Placeholder - find a real humor feature on Neuronpedia
steering_vec = get_steering_vector(sae, STEERING_FEATURE)

generator = SAESteeringGenerator(
    model=model,
    steering_vector=steering_vec,
    target_layer=TARGET_LAYER,
    strength=2.0  # Adjust strength to control effect
)

print(f"Generator created with:")
print(f"  Feature: {STEERING_FEATURE}")
print(f"  Layer: {TARGET_LAYER}")
print(f"  Strength: {generator.strength}")

In [None]:
# Compare steered vs unsteered generation
test_prompts = [
    "Why do programmers prefer",
    "The scientist walked into the lab and",
    "My favorite thing about cooking is",
]

print("Comparing steered vs unsteered generation:")
print("=" * 60)

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    
    unsteered = generator.generate_unsteered(prompt, max_new_tokens=30, remote=REMOTE)
    print(f"\nUnsteered: {unsteered}")
    
    steered = generator.generate(prompt, max_new_tokens=30, remote=REMOTE)
    print(f"\nSteered:   {steered}")
    print("-" * 40)

## Part 5: Exploring Steering Strength

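The strength parameter controls how strongly we push the model toward the target concept. Its effect depends on the size of the scaled steering vector relative to the residual stream's norm at the target layer, so useful values vary by model and layer. Let's explore different strengths.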
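
One way to pick strengths on a comparable scale is to express them as a fraction of the typical residual-stream norm. This is a heuristic sketch, not part of the original notebook: `strength_for_fraction` is a hypothetical helper, and the residual norm of 100.0 is a made-up value that you would replace with one measured from the model.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def strength_for_fraction(resid_norm, direction, fraction):
    """Strength such that ||strength * direction|| == fraction * resid_norm."""
    d_norm = l2_norm(direction)
    if d_norm == 0:
        raise ValueError("direction must be non-zero")
    return fraction * resid_norm / d_norm

# Made-up numbers: residual norm 100, steering direction with norm 5
strength = strength_for_fraction(resid_norm=100.0, direction=[3.0, 4.0], fraction=0.5)
```

If the strengths in the sweep below produce no visible change, the steering vector may simply be small compared with the activations it is added to.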

In [None]:
# Test different steering strengths
prompt = "Why do electricians make good"
strengths = [0.0, 1.0, 2.0, 5.0, 10.0]

print(f"Prompt: {prompt}")
print("=" * 60)

for strength in strengths:
    generator.strength = strength
    output = generator.generate(prompt, max_new_tokens=30, remote=REMOTE)
    print(f"\nStrength {strength:4.1f}: {output}")

## Part 6: Positive vs Negative Steering

We can steer TOWARD a concept (positive) or AWAY from it (negative).
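
The sign of the strength determines which way the activation moves along the feature direction. Writing $d$ for the steering vector and $\hat{d}$ for its unit-length version (notation introduced here), the projection of the steered activation shifts by exactly $\alpha \lVert d \rVert$:

$$
\langle h + \alpha d,\ \hat{d} \rangle = \langle h, \hat{d} \rangle + \alpha \lVert d \rVert, \qquad \hat{d} = \frac{d}{\lVert d \rVert}
$$

A positive $\alpha$ increases the component along the feature direction and a negative $\alpha$ decreases it; components orthogonal to $d$ are unchanged.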

In [None]:
# Compare positive and negative steering
prompt = "The comedian walked on stage and said"

print(f"Prompt: {prompt}")
print("=" * 60)

for strength in [-3.0, 0.0, 3.0]:
    generator.strength = strength
    output = generator.generate(prompt, max_new_tokens=40, remote=REMOTE)
    
    label = "NEGATIVE" if strength < 0 else ("NEUTRAL" if strength == 0 else "POSITIVE")
    print(f"\n{label} ({strength:+.1f}): {output}")

## Exercise 1: Find Your Own Steering Features

Use Neuronpedia to find features related to a concept you're interested in, then test steering with them.

In [None]:
# TODO: Search for features related to your concept
# my_concept = "sarcasm"  # or "poetry", "science", etc.
# results = search_features(my_concept, model_id=np_model_id)

# TODO: Create a steering vector from those features
# my_features = [...]  # feature indices from Neuronpedia
# my_steering = create_combined_steering_vector(sae, my_features)

# TODO: Test generation with your steering vector
# my_generator = SAESteeringGenerator(model, my_steering, TARGET_LAYER, strength=2.0)
# output = my_generator.generate("Your prompt here", remote=REMOTE)

pass

## Exercise 2: Layer-Wise Steering

Different layers may have different effects when steered. Try steering at different layers.
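
Each layer needs its own SAE, so a first step is to generate the SAE IDs to load. The sketch below assumes the `layer_{n}/width_{w}/canonical` naming pattern of the `gemma-scope-2b-pt-res-canonical` release and a 26-layer model; both are assumptions to verify against `pretrained_saes.yaml` and `model.config.num_hidden_layers`. The helper name is hypothetical.

```python
def gemma_scope_sae_ids(layers, width="16k", num_layers=26):
    """Map each layer index to an assumed canonical Gemma Scope SAE ID."""
    for layer in layers:
        if not 0 <= layer < num_layers:
            raise ValueError(f"layer {layer} is outside 0..{num_layers - 1}")
    return {layer: f"layer_{layer}/width_{width}/canonical" for layer in layers}

sae_ids = gemma_scope_sae_ids([4, 8, 12, 16, 20])
for layer, sae_id in sae_ids.items():
    print(layer, sae_id)
```

Remember to pass the matching layer index as `target_layer` when building each generator, since a decoder vector only makes sense in the layer its SAE was trained on.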

In [None]:
# TODO: Test steering at different layers
# Note: You'll need to load SAEs for different layers
# layers_to_test = [4, 8, 12, 16, 20]

# Question: Does steering at earlier vs later layers have different effects?

pass

## Exercise 3: Contrastive Steering

Combine positive and negative features to create more precise steering.
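
As a warm-up on toy data, the arithmetic looks like this: sum the "positive" vectors, subtract the sum of the "negative" ones, then normalize. The vectors below are made up and `contrastive_direction` is a hypothetical helper; with a real SAE you would use rows of `sae.W_dec` instead.

```python
import math

def contrastive_direction(positive, negative):
    """Unit vector along (sum of positive vectors) - (sum of negative vectors)."""
    dim = len(positive[0])
    diff = [sum(v[k] for v in positive) - sum(v[k] for v in negative)
            for k in range(dim)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("positive and negative sums cancel exactly")
    return [x / norm for x in diff]

positive = [[3.0, 0.0], [0.0, 1.0]]  # toy "humor" decoder vectors
negative = [[0.0, -3.0]]             # toy "serious" decoder vector
direction = contrastive_direction(positive, negative)
```

Normalizing at the end keeps the strength parameter comparable between the contrastive vector and a single-feature vector of unit norm.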

In [None]:
# TODO: Create a contrastive steering vector
# Example: "humor" - "serious" to get pure humor without formality

# humor_features = [...]
# serious_features = [...]

# contrastive = (sum of humor decoder vecs) - (sum of serious decoder vecs)
# contrastive = contrastive / contrastive.norm()

pass

## Summary

In this notebook, we learned:

1. **SAE decoder vectors** are interpretable directions in activation space

2. **Neuronpedia** catalogs features with explanations, helping us find relevant features

3. **Steering via NDIF** lets us modify large model behavior without fine-tuning

4. **Strength parameter** controls how strongly we push toward the concept

5. **Positive/negative steering** lets us amplify or suppress concepts

### Key Insights

- SAE features provide a "vocabulary" for describing model behavior
- Steering is fast (no training) but approximate (may have side effects)
- Combining multiple features can create more specific effects
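- Layer choice matters: steering at early layers passes through more downstream computation and tends to have broader effects, while steering at late layers acts more directly on the output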

### Connections to Course Themes

| Week | Method | Connection |
|------|--------|-----------|
| 1 | Logit Lens | SAE features explain what logit lens shows |
| 4 | Geometry | Decoder vectors ARE the geometric directions |
| 5 | Causation | Steering is a form of causal intervention |
| 6 | Probes | Compare: probe direction vs SAE decoder |

### Next Steps

1. Explore more features on [Neuronpedia](https://www.neuronpedia.org/)
2. Try steering for your project's concept
3. Compare SAE steering with mass-mean-difference steering (Week 4)