# Exploring SAE Features with Neuronpedia

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nix07/neural-mechanics-web/blob/main/labs/week2/neuronpedia_explorer.ipynb)

This notebook demonstrates how to explore **Sparse Autoencoder (SAE) features** using [Neuronpedia](https://www.neuronpedia.org/), an open-source platform for interpretability research.

**Key Idea:** SAEs decompose model activations into interpretable features. Neuronpedia catalogs millions of these features with automated explanations and example activations, letting us explore what concepts a model has learned.

We'll use **GPT-2 Small** to explore how SAE features activate on **puns**, looking for features that might encode humor, wordplay, or dual meanings.
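
To make the "decompose activations into features" idea concrete, here is a minimal sketch of one common SAE formulation: features are `ReLU(W_enc x + b_enc)`, and the reconstruction is a weighted sum of decoder directions. The weights, sizes, and function names below are toy values invented for illustration; they are not taken from any real SAE on Neuronpedia.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def sae_encode(x, w_enc, b_enc):
    """Feature activations: ReLU(W_enc @ x + b_enc)."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def sae_decode(features, w_dec, b_dec):
    """Reconstruction: b_dec + sum_i features[i] * w_dec[i]."""
    x_hat = list(b_dec)
    for f, direction in zip(features, w_dec):
        for j, d in enumerate(direction):
            x_hat[j] += f * d
    return x_hat

# Toy setup (made up): a 2-dimensional activation and 3 features
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]
B_DEC = [0.0, 0.0]

activation = [2.0, 0.0]
features = sae_encode(activation, W_ENC, B_ENC)
reconstruction = sae_decode(features, W_DEC, B_DEC)

print("features:", features)            # only feature 0 is active (sparse)
print("reconstruction:", reconstruction)
```

Only one of the three features fires on this input, which is the sparsity that makes individual features easier to inspect. Real SAEs have thousands of features and learned weights, and some variants differ in detail (for example, subtracting the decoder bias before encoding).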

## References
- [Neuronpedia](https://www.neuronpedia.org/)
- [Neuronpedia Documentation](https://docs.neuronpedia.org/)
- [Sparse Autoencoders paper](https://arxiv.org/abs/2309.08600)

## Setup

Install the Neuronpedia Python library:

In [None]:
!pip install -q neuronpedia requests

In [None]:
import requests
import json
from IPython.display import display, HTML, IFrame

# Base URL for Neuronpedia API
NP_API_BASE = "https://www.neuronpedia.org/api"

# We'll use GPT-2 Small throughout this notebook
# Check Neuronpedia for available models and SAE sources
MODEL_ID = "gpt2-small"

## Understanding Neuronpedia's Feature System

Every SAE feature on Neuronpedia has a unique identifier with three parts:

1. **Model ID**: e.g., `gpt2-small`, `gemma-2-2b`
2. **Source**: Layer number + SAE info, e.g., `6-res-jb` (Layer 6, residual stream, Joseph Bloom)
3. **Index**: The feature number within that SAE
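
The three-part identifier above can be handled as a small structured value. This is a sketch, assuming the `model/source/index` path layout used in this notebook and that the layer number is the part of the source before the first hyphen (as in the examples). `FeatureId`, `parse_feature_id`, `split_source`, and `feature_path` are hypothetical helpers, not part of the Neuronpedia library.

```python
from typing import NamedTuple, Tuple

class FeatureId(NamedTuple):
    model_id: str   # e.g. "gpt2-small"
    source: str     # e.g. "6-res-jb"
    index: int      # feature number within the SAE

def parse_feature_id(path: str) -> FeatureId:
    """Parse 'model/source/index' into its three parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected model/source/index, got {path!r}")
    model_id, source, index = parts
    return FeatureId(model_id, source, int(index))

def split_source(source: str) -> Tuple[int, str]:
    """Split a source like '6-res-jb' into (layer, SAE name).

    Assumes the layer number precedes the first hyphen.
    """
    layer, _, sae_name = source.partition("-")
    return int(layer), sae_name

def feature_path(fid: FeatureId) -> str:
    """Rebuild the 'model/source/index' path."""
    return f"{fid.model_id}/{fid.source}/{fid.index}"

fid = parse_feature_id("gpt2-small/6-res_scefr-ajt/650")
print(fid)
print(split_source(fid.source))
```

Keeping the index as an `int` makes sorting and range checks simpler; convert it back to a string when building URLs.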

Let's explore what's available:

In [None]:
def get_feature(model_id, source, index):
    """Fetch a specific SAE feature from Neuronpedia."""
    url = f"{NP_API_BASE}/feature/{model_id}/{source}/{index}"
    # A timeout keeps the notebook from hanging if the API is unreachable
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code}")
        return None

def display_feature(feature_data):
    """Display feature information nicely."""
    if not feature_data:
        return
    
    print(f"Feature: {feature_data.get('modelId')}/{feature_data.get('source')}/{feature_data.get('index')}")
    print(f"\nExplanation: {feature_data.get('explanation', 'No explanation available')}")
    
    # Show top activating examples if available
    activations = feature_data.get('activations', [])
    if activations:
        print(f"\nTop activating examples ({len(activations)} total):")
        for i, act in enumerate(activations[:5]):
            tokens = act.get('tokens', [])
            # Reconstruct the text from its tokens
            text = ''.join(tokens)
            # Truncate long examples to 100 characters
            snippet = f"{text[:100]}..." if len(text) > 100 else text
            print(f"  {i+1}. {snippet}")
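
Printing the full example text hides where in the text the feature actually fires. The sketch below marks the single most strongly activating token, assuming each activation record holds a `values` list that runs parallel to `tokens`. The helper names and the toy numbers are made up for illustration.

```python
def peak_token(tokens, values):
    """Return (position, token, value) for the strongest activation, or None."""
    if not tokens or len(tokens) != len(values):
        return None
    peak = max(range(len(values)), key=lambda i: values[i])
    return peak, tokens[peak], values[peak]

def mark_peak(tokens, values):
    """Rebuild the text with the peak token wrapped in [[ ]]."""
    found = peak_token(tokens, values)
    if found is None:
        return "".join(tokens)
    peak, token, _ = found
    return "".join(tokens[:peak]) + f"[[{token}]]" + "".join(tokens[peak + 1:])

# Toy record (invented values, not real Neuronpedia data)
toy_tokens = ["He", " lost", " interest", "."]
toy_values = [0.0, 0.4, 3.2, 0.1]

print(peak_token(toy_tokens, toy_values))
print(mark_peak(toy_tokens, toy_values))
```

In `display_feature`, `mark_peak(act.get('tokens', []), act.get('values', []))` could be printed instead of the plain joined text.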

## Exploring Features on GPT-2 Small (Demo)

Let's start with GPT-2 Small, which has well-documented SAE features. We'll then apply what we learn to larger models.

In [None]:
# Example: Look up a known interesting feature
# This is feature 650 from layer 6 residual stream SAE on GPT-2 Small
feature = get_feature("gpt2-small", "6-res_scefr-ajt", "650")
display_feature(feature)

## Searching for Humor/Pun-Related Features

Neuronpedia provides search functionality. Let's look for features that might relate to humor, jokes, or wordplay.

In [None]:
def search_features(query, model_id="gpt2-small", limit=10):
    """Search for features by explanation text."""
    url = f"{NP_API_BASE}/search"
    params = {
        "q": query,
        "modelId": model_id,
        "limit": limit
    }
    # A timeout keeps the notebook from hanging if the API is unreachable
    response = requests.get(url, params=params, timeout=30)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Search failed: {response.status_code}")
        return []

# Search for features related to humor/jokes
humor_features = search_features("joke", limit=5)
print("Features related to 'joke':")
for f in humor_features:
    print(f"  - {f.get('source')}/{f.get('index')}: {f.get('explanation', 'No explanation')[:80]}...")

In [None]:
# Search for features that might capture wordplay
wordplay_terms = ["pun", "double meaning", "humor", "funny", "joke"]

for term in wordplay_terms:
    results = search_features(term, limit=3)
    print(f"\n'{term}' features:")
    if results:
        for f in results:
            print(f"  {f.get('source')}/{f.get('index')}: {f.get('explanation', '')[:60]}")
    else:
        print("  No results found")
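
Searching several terms tends to return overlapping features, and a feature that matches more than one wordplay term is arguably a stronger candidate. This sketch merges per-term result lists and ranks features by how many terms matched them. It assumes each result is a dict with `source`, `index`, and `explanation` keys, as in the cells above. `merge_results` and the toy results are hypothetical.

```python
def merge_results(results_by_term):
    """Merge {term: [result, ...]} into a ranked list of unique features."""
    merged = {}
    for term, results in results_by_term.items():
        for f in results:
            key = (f.get("source"), f.get("index"))
            entry = merged.setdefault(
                key, {"explanation": f.get("explanation") or "", "terms": []}
            )
            if term not in entry["terms"]:
                entry["terms"].append(term)
    # Features matched by more terms come first; ties are broken by key
    return sorted(merged.items(), key=lambda kv: (-len(kv[1]["terms"]), str(kv[0])))

# Toy search results (invented feature indices and explanations)
toy_results = {
    "pun": [{"source": "6-res-jb", "index": "11", "explanation": "wordplay"}],
    "joke": [
        {"source": "6-res-jb", "index": "11", "explanation": "wordplay"},
        {"source": "8-res-jb", "index": "42", "explanation": "jokes"},
    ],
}

ranked = merge_results(toy_results)
for (source, index), info in ranked:
    print(f"{source}/{index}: {info['explanation']} (matched: {info['terms']})")
```

With live data, `results_by_term` could be built as `{term: search_features(term, limit=3) for term in wordplay_terms}`.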

## Interactive Feature Exploration

The best way to explore features is through Neuronpedia's web interface. Let's create links to explore specific features.

In [None]:
def neuronpedia_url(model_id, source, index):
    """Generate a Neuronpedia URL for a feature."""
    return f"https://www.neuronpedia.org/{model_id}/{source}/{index}"

def display_feature_link(model_id, source, index, description=""):
    """Display a clickable link to a feature."""
    url = neuronpedia_url(model_id, source, index)
    display(HTML(f'<a href="{url}" target="_blank">{model_id}/{source}/{index}</a>: {description}'))

# Some interesting features to explore
print("Explore these features on Neuronpedia:")
display_feature_link("gpt2-small", "6-res_scefr-ajt", "650", "Example feature")

## Finding Features that Activate on Puns

Let's use Neuronpedia's inference API to find which SAE features activate most strongly on pun-containing text.

In [None]:
# Note: This requires the neuronpedia_inference_client package
# !pip install neuronpedia_inference_client

# For now, let's manually look at what features might activate on puns
# by examining the Neuronpedia web interface

pun_examples = [
    "Why do electricians make good swimmers? Because they know the current.",
    "Why did the banker break up with his girlfriend? He lost interest.",
    "Time flies like an arrow; fruit flies like a banana.",
    "I used to be a banker, but I lost interest.",
]

print("Pun examples to explore on Neuronpedia:")
for pun in pun_examples:
    print(f"  - {pun}")

print("\nTo find activating features:")
print("1. Go to https://www.neuronpedia.org/gpt2-small")
print("2. Use the 'Test Prompt' feature to enter these puns")
print("3. See which features light up on the pun words!")

## Exercise 1: Explore Feature Families

Some features form "families" that capture related concepts. Can you find features that capture:
- Electricity-related words
- Water/swimming words  
- Question words ("why", "what", "how")

These might all activate together in our electrician pun!

In [None]:
# TODO: Search for features in these categories
categories = ["electricity", "water", "swimming", "question"]

for cat in categories:
    print(f"\n=== {cat.upper()} ===")
    results = search_features(cat, model_id="gpt2-small", limit=5)
    for f in results:
        source = f.get('source', '')
        index = f.get('index', '')
        expl = f.get('explanation', '')[:60]
        print(f"  {source}/{index}: {expl}")

## Exercise 2: Compare Pun vs Literal Contexts

The word "current" appears in both pun and literal contexts. Do the same features activate in both cases?

In [None]:
# Contexts to compare:
pun_context = "Why do electricians make good swimmers? Because they know the current."
literal_electrical = "The electrical current flows through the wire at high voltage."
literal_water = "The river current was too strong for the small boat."

print("Compare these contexts on Neuronpedia:")
print(f"\n1. PUN: {pun_context}")
print(f"\n2. ELECTRICAL: {literal_electrical}")
print(f"\n3. WATER: {literal_water}")
print("\nQuestion: Which features activate on 'current' in each context?")
print("Are there features unique to the pun context?")
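
Once you have recorded which features fire on "current" in each context (for example, by hand from the web interface), the comparison reduces to set operations. The sketch below assumes you have collected a `{feature_id: activation}` dict per context. The feature names and activation values are placeholders, not real measurements.

```python
def active_features(activations, threshold=0.0):
    """Set of feature ids whose activation exceeds the threshold."""
    return {fid for fid, value in activations.items() if value > threshold}

def compare_contexts(pun, literals, threshold=0.0):
    """Compare the pun context against each literal context."""
    pun_set = active_features(pun, threshold)
    literal_sets = {
        name: active_features(acts, threshold) for name, acts in literals.items()
    }
    all_literal = set().union(*literal_sets.values())
    return {
        "unique_to_pun": pun_set - all_literal,
        "shared_with": {name: pun_set & s for name, s in literal_sets.items()},
    }

# Placeholder activations on the token "current" (invented for illustration)
pun_acts = {"feat_electric": 2.1, "feat_water": 1.3, "feat_other": 0.7}
literal_acts = {
    "electrical": {"feat_electric": 3.0, "feat_voltage": 1.0},
    "water": {"feat_water": 2.5, "feat_boat": 0.0},
}

report = compare_contexts(pun_acts, literal_acts)
print("Unique to pun:", report["unique_to_pun"])
print("Shared:", report["shared_with"])
```

If the pun context only shares features with the two literal contexts and has nothing unique, that would suggest the model represents the pun as overlapping literal meanings rather than through a dedicated "pun" feature.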

## Exercise 3: Steering with Features

Neuronpedia supports "steering": boosting or suppressing specific features to change model behavior. Can we make a model more likely to generate puns?

In [None]:
# Steering requires an API key
# Sign up at neuronpedia.org and get your key from neuronpedia.org/account

# Example steering code (requires API key):
"""
import os
from neuronpedia.np_vector import NPVector

os.environ["NEURONPEDIA_API_KEY"] = "YOUR_API_KEY"

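# Create a steering vector from features associated with humor
# This would boost those features during generation
# NOTE: `np_vector` is not defined in this snippet; it must be an NPVector
# instance created beforehand (see the Neuronpedia docs for how to build one)

response = np_vector.steer_chat(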
    steered_chat_messages=[{
        "role": "user", 
        "content": "Why do electricians make good swimmers?"
    }]
)

print(response)
"""

print("To try steering:")
print("1. Get an API key from neuronpedia.org/account")
print("2. Find features that activate on jokes/puns")
print("3. Use NPVector to boost those features during generation")
print("4. See if the model generates more puns!")
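
Conceptually, boosting or suppressing a feature is often described as adding a scaled copy of that feature's decoder direction to the model's activations. The sketch below shows only that arithmetic on toy vectors. It is not Neuronpedia's steering API, and `steer`, `dot`, and the vectors are hypothetical.

```python
def steer(activation, direction, strength):
    """Add `strength` times the feature direction to the activation."""
    return [a + strength * d for a, d in zip(activation, direction)]

def dot(u, v):
    """Dot product: how much of `v` is present in `u`."""
    return sum(a * b for a, b in zip(u, v))

# Toy vectors (invented): a 3-dimensional activation and a unit "humor" direction
residual = [1.0, 0.0, 0.0]
humor_direction = [0.0, 1.0, 0.0]

boosted = steer(residual, humor_direction, 4.0)      # positive strength boosts
suppressed = steer(residual, humor_direction, -4.0)  # negative strength suppresses

print("before:", dot(residual, humor_direction))
print("boosted:", dot(boosted, humor_direction))
print("suppressed:", dot(suppressed, humor_direction))
```

A positive strength pushes the activation along the feature direction and a negative one pushes against it. Choosing the strength is empirical: too small has no visible effect, and too large tends to degrade the generated text.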

## Exploring Larger Models

GPT-2 Small is great for learning, but Neuronpedia also hosts SAEs for larger models, whose features may capture richer concepts. Let's explore what's available:

In [None]:
# Available models on Neuronpedia (check website for current list)
models_to_explore = [
    "gpt2-small",
    "gpt2-medium", 
    "gpt2-large",
    "gemma-2-2b",
    "gemma-2-9b",
    # GPT OSS / EleutherAI models may have different IDs
]

print("Models available on Neuronpedia:")
for m in models_to_explore:
    url = f"https://www.neuronpedia.org/{m}"
    print(f"  - {m}: {url}")

print("\nNote: Check neuronpedia.org for the most current list of models and SAEs")

## Summary

In this notebook, we learned:

1. **SAE Features** decompose model activations into (hopefully) interpretable units

2. **Neuronpedia** catalogs millions of features with automated explanations and examples

3. **Feature Search** helps find features related to specific concepts like humor or wordplay

4. **Puns are challenging** because a single word may activate features for several meanings at once, and it is unclear whether any feature represents the double meaning itself

### Questions to Consider

- Do puns activate "humor" features, or just multiple literal-meaning features?
- Can you find features that specifically capture ambiguity or double meanings?
- How do feature activations differ between pun and literal uses of the same word?

### Next Steps

1. Explore the Neuronpedia web interface for hands-on feature browsing
2. Use the inference client to find features on your own text
3. Try steering to see if boosting "humor" features changes model behavior