# Phase 3: SAE Comparison & Advanced Analysis

## Objectives

1. Load and compare multiple pre-trained SAEs (different layers, different architectures)
2. Run systematic feature analysis on each SAE
3. Compare specialist feature discovery across different SAEs
4. Understand how SAE architecture and training location affect feature interpretability
5. Document which SAE configurations produce the most useful monosemantic features

## What We'll Learn

- How different SAE training locations (residual stream vs MLP output) affect learned features
- Whether deeper layers learn more specialized features than shallow layers
- Which SAE configurations produce the best domain specialists (code, math, languages, etc.)
- How to systematically compare interpretability across multiple decompositions
- Trade-offs between different SAE architectures for understanding model internals

## Research Questions

- **Layer Depth**: Do layers 8 and 10 learn more specialized features than layer 6?
- **Hook Point Type**: Do MLP-output SAEs differ meaningfully from residual-stream SAEs?
- **Specialist Discovery**: Can we find better code specialists, emoji specialists, or language-specific features in other SAEs?
- **General vs Specific**: Which SAE produces the optimal balance of general and specialist features?

## Available SAEs for Comparison

We'll systematically test:
- **Layer 6 Residual Stream** (`6-res-jb`) - our Phase 2 baseline
- **Layer 8 Residual Stream** (`8-res-jb`) - deeper processing
- **Layer 10 Residual Stream** (`10-res-jb`) - near output layers
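- **Layer 11 Residual Stream** (`11-res-jb`) - input to the final transformer block
- **Layer 6 MLP Output** (`6-mlp-out`) - different information stream (planned; the cells below load only the four residual-stream SAEs)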

## Methodology

For each SAE, we'll:
1. Extract features from our 70-text diverse dataset (Python, URLs, Math, Non-English, Social/Emoji, Formal, Conversational)
2. Run all Phase 2 analyses: strongest, frequent, selective, and category-specialist searches
3. Record specialist scores for each of the 7 categories
4. Compare feature interpretability using Neuronpedia
5. Aggregate results to determine which SAE is most useful for interpretability
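
The specialist score recorded in step 3 can be illustrated with a toy example. The sketch below assumes the definition used by the analysis code later in this notebook (number of texts inside the category whose activation exceeds a threshold, minus the number outside it). The function name `specialist_score` and the activation values are made up for illustration.

```python
def specialist_score(activations, labels, category, threshold=5.0):
    """Count strong activations inside `category` minus those outside it."""
    inside = sum(1 for a, lab in zip(activations, labels)
                 if lab == category and a > threshold)
    outside = sum(1 for a, lab in zip(activations, labels)
                  if lab != category and a > threshold)
    return inside - outside


# Hypothetical activations of a single feature on six texts
acts = [7.2, 6.1, 0.0, 5.5, 0.3, 0.0]
labels = ["Python", "Python", "Python", "Math", "Math", "Social"]

print(specialist_score(acts, labels, "Python"))  # 2 inside - 1 outside = 1
```

A positive score means the feature fires strongly more often inside the category than outside it; with 10 texts per category the best possible score is 10.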

## Expected Outcomes

- Comparative analysis showing which SAE types learn the best specialists
- Evidence for or against the hypothesis that deeper layers = more specialized features
- Recommendations for which SAE to use for specific interpretability tasks
- Foundation for understanding how SAE training location affects decomposition quality

## Prerequisites

- Completed Phase 1 (model loaded, initial activations cached)
- Completed Phase 2 (diverse dataset created, analysis pipeline built)
- Cached data: `../data/phase1_activations.pt` (optional, will extract fresh if needed)
- Phase 2 code: Feature discovery methods and visualization tools

In [1]:
# ============================================================================
# CELL 2: Import Libraries
# ============================================================================

import html
import os
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import torch
from IPython.display import HTML, Markdown, display
from plotly.subplots import make_subplots

warnings.filterwarnings('ignore')

# TransformerLens and SAELens
from transformer_lens import HookedTransformer
from sae_lens import SAE

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("✅ All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {torch.device('cuda' if torch.cuda.is_available() else 'cpu')}")

✅ All libraries imported successfully!
PyTorch version: 2.9.0+cpu
Device: cpu


In [2]:
# ============================================================================
# CELL 3: Check SAE Cache Status
# ============================================================================

from pathlib import Path
import os

print("🔍 Checking SAE Cache Status")
print("=" * 70)

# SAEs can be cached in two locations:
# 1. Direct SAELens cache: ~/.cache/sae_lens/
# 2. HuggingFace hub cache: ~/.cache/huggingface/hub/

sae_lens_cache = Path.home() / ".cache" / "sae_lens"
hf_cache = Path.home() / ".cache" / "huggingface" / "hub"

required_saes = {
    "6-res-jb": "blocks.6.hook_resid_pre",
    "8-res-jb": "blocks.8.hook_resid_pre",
    "10-res-jb": "blocks.10.hook_resid_pre",
    "11-res-jb": "blocks.11.hook_resid_pre",
}

def check_sae_cached(sae_path):
    """Check if SAE is cached in either location"""
    # Check direct cache
    direct_path = sae_lens_cache / sae_path
    if direct_path.exists() and (direct_path / "sae_weights.safetensors").exists():
        return True, direct_path
    
    # Check HuggingFace cache
    if hf_cache.exists():
        for root, dirs, files in os.walk(hf_cache):
            if sae_path in root and "sae_weights.safetensors" in files:
                return True, Path(root)
    
    return False, None

all_cached = True
for sae_name, sae_path in required_saes.items():
    is_cached, cache_path = check_sae_cached(sae_path)
    if is_cached:
        print(f"✅ {sae_name}: Cached at {cache_path}")
    else:
        print(f"❌ {sae_name}: Not cached")
        all_cached = False

print("\n" + "=" * 70)

if all_cached:
    print("\n✅ All SAEs are cached and ready to load!")
    print("💡 Proceed to the next cell to load the SAEs")
else:
    print("\n⚠️  Some SAEs need to be downloaded")
    print("\n📥 Run this in terminal to download missing SAEs:")
    print("python3 -c \"from sae_lens import SAE; SAE.from_pretrained('gpt2-small-res-jb', 'blocks.X.hook_resid_pre', 'cpu')\"")

print("=" * 70)

🔍 Checking SAE Cache Status
✅ 6-res-jb: Cached at /home/thebuleganteng/.cache/sae_lens/blocks.6.hook_resid_pre
✅ 8-res-jb: Cached at /home/thebuleganteng/.cache/sae_lens/blocks.8.hook_resid_pre
✅ 10-res-jb: Cached at /home/thebuleganteng/.cache/sae_lens/blocks.10.hook_resid_pre
✅ 11-res-jb: Cached at /home/thebuleganteng/.cache/sae_lens/blocks.11.hook_resid_pre


✅ All SAEs are cached and ready to load!
💡 Proceed to the next cell to load the SAEs
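
The cache check above walks the whole Hugging Face cache with `os.walk` and matches the hook point as a substring of the directory path. A possible alternative is sketched below using `Path.rglob` with an exact directory-name match. The function name `find_cached_sae` is hypothetical, and the sketch assumes (as the cell above does) that the weights file is called `sae_weights.safetensors` and sits directly inside a directory named after the hook point.

```python
import tempfile
from pathlib import Path


def find_cached_sae(cache_root, hook_point, weights_name="sae_weights.safetensors"):
    """Return the first directory under cache_root that is named hook_point
    and contains the weights file, or None if there is no such directory."""
    cache_root = Path(cache_root)
    if not cache_root.exists():
        return None
    for weights_file in cache_root.rglob(weights_name):
        if weights_file.parent.name == hook_point:
            return weights_file.parent
    return None


# Demo on a throwaway directory that mimics the assumed cache layout
with tempfile.TemporaryDirectory() as tmp:
    sae_dir = Path(tmp) / "blocks.6.hook_resid_pre"
    sae_dir.mkdir()
    (sae_dir / "sae_weights.safetensors").touch()

    found = find_cached_sae(tmp, "blocks.6.hook_resid_pre")
    missing = find_cached_sae(tmp, "blocks.8.hook_resid_pre")
    print(found is not None, missing is None)  # True True
```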


In [4]:
# ============================================================================
# CELL 4: Load LLM & All SAEs for Comparison
# ============================================================================

from pathlib import Path

print("🔧 Loading GPT-2 and multiple SAEs...")
print("=" * 70)

# Load model
model = HookedTransformer.from_pretrained(
    "gpt2-small",
    device="cpu"
)
print(f"✅ Model loaded: {model.cfg.model_name}")
print()

# SAE cache location (all SAEs are here now)
sae_cache_base = Path.home() / ".cache" / "sae_lens"

# SAE Configuration
available_saes = {
    "6-res-jb": {
        "description": "Layer 6 Residual Stream",
        "hook_point": "blocks.6.hook_resid_pre",
        "d_in": 768,
        "d_sae": 24576
    },
    "8-res-jb": {
        "description": "Layer 8 Residual Stream",
        "hook_point": "blocks.8.hook_resid_pre",
        "d_in": 768,
        "d_sae": 24576
    },
    "10-res-jb": {
        "description": "Layer 10 Residual Stream",
        "hook_point": "blocks.10.hook_resid_pre",
        "d_in": 768,
        "d_sae": 24576
    },
    "11-res-jb": {
        "description": "Layer 11 Residual Stream (Final Layer)",
        "hook_point": "blocks.11.hook_resid_pre",
        "d_in": 768,
        "d_sae": 24576
    }
}

# Load SAEs from disk
loaded_saes = {}

print("Loading SAEs from cache...")
for sae_name, sae_config in available_saes.items():
    print(f"\n📦 Loading {sae_name}: {sae_config['description']}")

    sae_path = sae_cache_base / sae_config['hook_point']

    if not sae_path.exists():
        print(f"   ❌ Not found at: {sae_path}")
        continue
    
    try:
        # Load directly from disk (no download, no progress bar issues)
        sae = SAE.load_from_disk(str(sae_path))
        
        loaded_saes[sae_name] = {
            "sae": sae,
            "config": sae_config
        }
        print(f"   ✅ Loaded successfully")
        print(f"   📊 Dimensions: {sae_config['d_in']} → {sae_config['d_sae']}")
        print(f"   🎯 Hook point: {sae_config['hook_point']}")

    except Exception as e:
        print(f"   ❌ Failed to load: {e}")
        import traceback
        traceback.print_exc()

print("\n" + "=" * 70)
print(f"✅ Successfully loaded {len(loaded_saes)}/{len(available_saes)} SAEs")
if loaded_saes:
    print(f"📋 Available SAEs: {list(loaded_saes.keys())}")
    print("\n🎯 Ready for Phase 3 comparison analysis!")
else:
    print("⚠️  No SAEs loaded successfully.")

print("=" * 70)

🔧 Loading GPT-2 and multiple SAEs...
Loaded pretrained model gpt2-small into HookedTransformer
✅ Model loaded: gpt2

Loading SAEs from cache...

📦 Loading 6-res-jb: Layer 6 Residual Stream
   ✅ Loaded successfully
   📊 Dimensions: 768 → 24576
   🎯 Hook point: blocks.6.hook_resid_pre

📦 Loading 8-res-jb: Layer 8 Residual Stream
   ✅ Loaded successfully
   📊 Dimensions: 768 → 24576
   🎯 Hook point: blocks.8.hook_resid_pre

📦 Loading 10-res-jb: Layer 10 Residual Stream
   ✅ Loaded successfully
   📊 Dimensions: 768 → 24576
   🎯 Hook point: blocks.10.hook_resid_pre

📦 Loading 11-res-jb: Layer 11 Residual Stream (Final Layer)
   ✅ Loaded successfully
   📊 Dimensions: 768 → 24576
   🎯 Hook point: blocks.11.hook_resid_pre

✅ Successfully loaded 4/4 SAEs
📋 Available SAEs: ['6-res-jb', '8-res-jb', '10-res-jb', '11-res-jb']

🎯 Ready for Phase 3 comparison analysis!
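
The loading cell prints `d_in` and `d_sae` from the hand-written `available_saes` dictionary without comparing them to the loaded object. A small check like the one below could catch a mismatch. It is only a sketch: the helper name `check_sae_dims` is hypothetical, and it assumes the loaded SAE exposes its dimensions as `sae.cfg.d_in` and `sae.cfg.d_sae`. A stand-in object is used here so the example runs without loading any weights.

```python
from types import SimpleNamespace


def check_sae_dims(sae, expected):
    """Compare an SAE's configured dimensions with the expected values.

    Returns a list of human-readable mismatch messages (empty if all match).
    """
    problems = []
    for key in ("d_in", "d_sae"):
        actual = getattr(sae.cfg, key, None)
        if actual != expected[key]:
            problems.append(f"{key}: expected {expected[key]}, got {actual}")
    return problems


# Stand-in object (assumption: real SAEs carry the same attribute names)
fake_sae = SimpleNamespace(cfg=SimpleNamespace(d_in=768, d_sae=24576))

print(check_sae_dims(fake_sae, {"d_in": 768, "d_sae": 24576}))  # []
print(check_sae_dims(fake_sae, {"d_in": 768, "d_sae": 32768}))
```

In the notebook this would be called inside the loading loop as `check_sae_dims(sae, sae_config)`.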


In [5]:
# ============================================================================
# CELL 5: Load Diverse Dataset
# ============================================================================

print("\n📚 Loading Diverse Test Dataset")
print("=" * 70)

# Create dataset organized by category
categories = {
    "Python": [
        "def factorial(n):\n    return 1 if n == 0 else n * factorial(n-1)",
        "import torch\nimport numpy as np\nfrom transformers import AutoModel",
        "class NeuralNetwork(nn.Module):\n    def __init__(self):",
        "for i in range(len(data)):\n    result.append(data[i] ** 2)",
        "try:\n    x = int(input())\nexcept ValueError:\n    print('Error')",
        "lambda x: x ** 2 + 3 * x - 5",
        "if __name__ == '__main__':\n    main()",
        "return [x for x in lst if x > 0]",
        "print(f'Result: {sum(values) / len(values):.2f}')",
        "pip install transformers torch numpy pandas",
    ],
    "URLs": [
        "https://www.github.com/anthropics/claude",
        "Visit our website at http://example.com/products",
        "<html><body><h1>Welcome</h1></body></html>",
        "<div class='container'><p>Content here</p></div>",
        "GET /api/v1/users HTTP/1.1",
        "mailto:support@example.com",
        "www.stackoverflow.com/questions/12345",
        "ftp://files.example.org/downloads/",
        "Click here: https://bit.ly/abc123",
        "Check out reddit.com/r/machinelearning",
    ],
    "Math": [
        "f(x) = x^2 + 2x + 1",
        "∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C",
        "lim(x→0) sin(x)/x = 1",
        "∑(i=1 to n) i = n(n+1)/2",
        "√(a^2 + b^2) = c",
        "P(A|B) = P(B|A)P(A) / P(B)",
        "E = mc^2",
        "∇f(x,y) = (∂f/∂x, ∂f/∂y)",
        "det([[a,b],[c,d]]) = ad - bc",
        "sin^2(θ) + cos^2(θ) = 1",
    ],
    "Non-English": [
        "Bonjour, comment allez-vous aujourd'hui?",
        "你好，今天天气怎么样？",
        "Hola, ¿cómo estás?",
        "Guten Tag, wie geht es Ihnen?",
        "Здравствуйте, как дела?",
        "こんにちは、元気ですか？",
        "مرحبا، كيف حالك؟",
        "안녕하세요, 잘 지내셨어요?",
        "Ciao, come stai?",
        "Olá, como você está?",
    ],
    "Social": [
        "omg that's so funny 😂😂😂",
        "can't wait for the weekend!! 🎉🎊",
        "just got coffee ☕ feeling good ✨",
        "bruh why is this happening 💀",
        "yaaaas queen!!! 👑💅✨",
        "ngl this is pretty cool 🔥",
        "lmaooo i'm dying 😭😭",
        "tbh idk what to do 🤷‍♀️",
        "mood af rn 💯",
        "this slaps fr fr 🎵🔥",
    ],
    "Formal": [
        "The phenomenon was observed under controlled laboratory conditions.",
        "In accordance with the aforementioned regulations, we hereby submit this proposal.",
        "The hypothesis was tested using a double-blind randomized controlled trial.",
        "Pursuant to Article 12, Section 3 of the aforementioned statute.",
        "The results indicate a statistically significant correlation (p < 0.05).",
        "This paper examines the theoretical frameworks underlying modern economics.",
        "The defendant pleaded not guilty to all charges in the indictment.",
        "We acknowledge the contributions of all co-authors and funding agencies.",
        "The experimental methodology followed established protocols.",
        "In conclusion, further research is warranted to investigate this phenomenon.",
    ],
    "Conversational": [
        "Hey, what's up? Want to grab lunch later?",
        "I think the meeting went pretty well today.",
        "The weather is nice, maybe we should go for a walk.",
        "Did you see that movie everyone's talking about?",
        "I'm planning a trip to Japan next summer.",
        "That restaurant has the best pizza in town.",
        "My cat keeps knocking things off the table.",
        "The traffic was terrible this morning.",
        "I need to finish this project by Friday.",
        "Let's catch up over coffee sometime.",
    ]
}

# Flatten into lists for analysis
texts = []
labels = []

for category, text_list in categories.items():
    for text in text_list:
        texts.append(text)
        labels.append(category)

print(f"✅ Dataset loaded: {len(texts)} texts across {len(categories)} categories")
print(f"📋 Categories: {list(categories.keys())}")
print(f"📊 Texts per category: {len(texts) // len(categories)}")
print("=" * 70)


📚 Loading Diverse Test Dataset
✅ Dataset loaded: 70 texts across 7 categories
📋 Categories: ['Python', 'URLs', 'Math', 'Non-English', 'Social', 'Formal', 'Conversational']
📊 Texts per category: 10
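
Because the specialist analysis later in this notebook maps texts back to categories by string equality, a text that appears in two categories (or twice in one) would be counted more than once. The sketch below shows one way to sanity-check a category dictionary before running the analyses. The helper name `check_dataset` and the toy data are illustrative only.

```python
from collections import Counter


def check_dataset(categories):
    """Return (texts per category, sorted list of texts that occur more than once)."""
    sizes = {name: len(items) for name, items in categories.items()}
    counts = Counter(text for items in categories.values() for text in items)
    duplicates = sorted(text for text, n in counts.items() if n > 1)
    return sizes, duplicates


toy = {
    "Python": ["import os", "print('hi')"],
    "Conversational": ["Hey, what's up?", "print('hi')"],  # deliberate overlap
}
sizes, duplicates = check_dataset(toy)
print(sizes)       # {'Python': 2, 'Conversational': 2}
print(duplicates)  # ["print('hi')"]
```

Running `check_dataset(categories)` on the real dataset should report 10 texts per category and an empty duplicate list.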


In [6]:
# ============================================================================
# CELL 6: Define Helper Functions for Analysis and Displaying Results
# ============================================================================

print("\n🔬 Defining Helper Functions for Analysis and Displaying Results")
print("=" * 70)

'''
Note: throughout this cell, "features" is a 2D tensor of SAE feature activations.
Useful attributes and methods:
    features.shape  - torch.Size([<rows, one per text>, <columns, one per SAE feature>])
    features.dtype  - the data type (e.g., float32)
    features.device - where it's stored (CPU or GPU)
    features.max()  - maximum value
    features.min()  - minimum value
    features.sum()  - sum of all values
    features.mean() - mean of all values
'''


# Define the function to identify the feature with highest max activation (strongest)
def analyze_strongest(features: torch.Tensor, texts: list) -> dict:
    '''
    Notes:
        1. max_activations is a named tuple with TWO 1D tensors:
            1. max_activations.values   # Shape: [24576] - the MAX VALUE for each feature
            2. max_activations.indices  # Shape: [24576] - WHICH TEXT had that max for each feature

            Example:
                     Feature 0   Feature 1   Feature 2
            Text 0   [  2.3        0.0         1.5    ]
            Text 1   [  0.0        8.4         0.0    ]
            Text 2   [  1.2        0.0         3.2    ]
            Text 3   [  5.1        2.1         0.8    ]

            max_activations = features.max(dim=0)

            max_activations.values   # [5.1, 8.4, 3.2] <- the max values
            max_activations.indices  # [3, 1, 2]       <- which text (row) had that max

        2. argmax (used for strongest_feature_idx) vs. max (used for strongest_max_val):
            argmax = "argument of the maximum" = WHERE the maximum is (the index/position)
            max = WHAT the maximum is (the actual value)
    '''
    max_activations = features.max(dim=0)  # (values, indices): per-feature max value and the text that produced it
    strongest_feature_idx = max_activations.values.argmax().item()  # Which of the 24,576 features has the highest max? e.g., feature #10399
    strongest_max_val = max_activations.values.max().item()  # What is that highest max value? e.g., 16.85
    text_idx = max_activations.indices[strongest_feature_idx].item()  # Index of the text that produced this highest max value
    text = texts[text_idx]  # The actual text associated with text_idx
    return {
        'feature_idx': strongest_feature_idx,
        'value': strongest_max_val,
        'text': text
    }


# Define the function to identify the most frequently activated feature
def analyze_frequent(features: torch.Tensor, texts: list) -> dict:
    feature_frequency = (features > 0).sum(dim=0)  # Number of texts on which each feature is active
    most_frequent_feature_idx = feature_frequency.argmax().item()  # Index of the most frequently active feature
    most_frequent_feature_count = feature_frequency.max().item()  # Number of texts on which that feature is active

    return {
        'feature_idx': most_frequent_feature_idx,
        'value': most_frequent_feature_count,
        'text': None
    }


# Define the function to identify the most selective feature (high activation but rare)
def analyze_selective(features: torch.Tensor, texts: list, threshold: float = 5.0) -> dict:

    # Boolean mask: which (text, feature) pairs exceed threshold
    strong_activations = (features > threshold)

    # Count how many texts each feature activates strongly on
    strong_activation_counts = strong_activations.sum(dim=0)
    
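    # Find features that activate strongly on at least 1 text
    has_strong_activation = strong_activation_counts > 0

    # Guard: without this, argmin over an all-inf tensor would silently return feature 0
    if not has_strong_activation.any():
        raise ValueError(f"No feature exceeds threshold {threshold}; try a lower threshold.")

    # Among those, find the one active in the FEWEST texts
    # Set infinite count for features with no strong activations
    selectivity_counts = strong_activation_counts.clone().float()
    selectivity_counts[~has_strong_activation] = float('inf')

    # Get the feature with minimum count (most selective); ties go to the first such feature
    most_selective_idx = selectivity_counts.argmin().item()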
    selective_max_val = features[:, most_selective_idx].max().item()
    selective_count = strong_activation_counts[most_selective_idx].item()
    total_active_count = (features[:, most_selective_idx] > 0).sum().item()

    # Find which text had the maximum activation
    text_idx = features[:, most_selective_idx].argmax().item()
    text = texts[text_idx]

    return {
        'feature_idx': most_selective_idx,
        'value': selective_max_val,  # Max activation (for consistency with other functions)
        'text': text,
        'selective_count': selective_count,  # Strong activations (>threshold)
        'total_active_count': total_active_count  # Any activations (>0)
    }


# Define the function to find the best specialist feature for each category
def analyze_specialists(features: torch.Tensor, texts: list, categories: dict, threshold: float = 5.0) -> dict:
    '''
    For each category, finds the best specialist feature.
    A specialist activates strongly inside the category but rarely outside it.
    
    Args:
        features: [num_texts, num_features] tensor
        texts: List of text strings
        categories: Dict mapping category names to lists of their texts
        threshold: Minimum activation to be considered "strong"
    
    Returns:
        Dict mapping category names to their best specialist feature info
    '''
    results = {}

    for cat_name, cat_texts in categories.items():

        # Find indices of this category's texts in the full text list
        indices = [i for i, text in enumerate(texts) if text in cat_texts]

        # Get features for this category
        cat_features = features[indices, :]

        # Find features with highest MAX activation in this category
        cat_max = cat_features.max(dim=0)
        top_features = cat_max.values.topk(5)

        # Initialize variables to hold specialists
        best_specialist_idx = None
        best_score = -1
        best_info = None

        # Indices of all texts outside this category (the same for every candidate feature)
        other_indices = [i for i in range(len(texts)) if i not in indices]

        # Loop over the top 5 features for this category, looking at each feature's max value and index
        for max_val, feat_idx in zip(top_features.values, top_features.indices):
            feat_idx_item = feat_idx.item()

            # Count strong activations inside vs outside this category
            strong_inside = (features[indices, feat_idx_item] > threshold).sum().item()
            strong_outside = (features[other_indices, feat_idx_item] > threshold).sum().item()
            
            # Specialist score: inside - outside
            specialist_score = strong_inside - strong_outside
            
            if specialist_score > best_score:
                best_score = specialist_score
                best_specialist_idx = feat_idx_item
                best_info = {
                    'feature_idx': feat_idx_item,
                    'value': max_val.item(),
                    'score': specialist_score,
                    'strong_inside': strong_inside,
                    'strong_outside': strong_outside,
                    'text': None  # Could add if needed
                }
        
        results[cat_name] = best_info
    
    return results


# Define extract_features function
def extract_features(texts, sae, hook_point):
    '''Extract SAE features for a list of texts using the model.'''
    # Tokenize texts
    tokens = model.to_tokens(texts, prepend_bos=True)
    
    # Run model and capture activations at the hook point
    with torch.no_grad():
        _, cache = model.run_with_cache(tokens, names_filter=[hook_point])
    
    # Get activations from cache
    activations = cache[hook_point]  # Shape: [batch, seq_len, d_model]
    
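    # Take mean across sequence dimension.
    # Caveat: this average includes the BOS token and any padding positions,
    # and the SAE then encodes the averaged vector rather than individual tokens.
    activations = activations.mean(dim=1)  # Shape: [batch, d_model]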
    
    # Pass through SAE encoder
    features = sae.encode(activations)  # Shape: [batch, d_sae]
    
    return features


# Helper function to create Neuronpedia links
def neuronpedia_link(sae_name, feature_idx):
    return f"https://neuronpedia.org/gpt2-small/{sae_name}/{feature_idx}"
    

print("✅ analyze_strongest() defined")
print("✅ analyze_frequent() defined")
print("✅ analyze_selective() defined")
print("✅ analyze_specialists() defined")
print("✅ extract_features() defined")
print("✅ neuronpedia_link() defined")

print("=" * 70)


🔬 Defining Helper Functions for Analysis and Displaying Results
✅ analyze_strongest() defined
✅ analyze_frequent() defined
✅ analyze_selective() defined
✅ analyze_specialists() defined
✅ extract_features() defined
✅ neuronpedia_link() defined
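
`extract_features` averages the residual activations over every position, BOS and padding included, and only then applies the SAE encoder. One alternative worth comparing is to encode each token separately and pool the feature activations over real tokens only. The sketch below shows just the pooling step on synthetic data. The function name `masked_max_pool` is hypothetical, the mask is assumed to be built by the caller (for example from the pad and BOS token ids), and the choice of max pooling is one option among several rather than a recommendation from the original analysis.

```python
import torch


def masked_max_pool(token_features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Max-pool per-token feature activations over positions where mask is True.

    token_features: [batch, seq_len, d_sae] activations, assumed non-negative
    mask:           [batch, seq_len] boolean, True for tokens that should count
    returns:        [batch, d_sae]
    """
    # Zero out ignored positions; this is safe only because activations are >= 0
    masked = token_features * mask.unsqueeze(-1).to(token_features.dtype)
    return masked.max(dim=1).values


# Toy input: 2 texts, 3 positions, 2 features
feats = torch.tensor([
    [[9.0, 0.0], [1.0, 2.0], [0.0, 4.0]],   # position 0 plays the role of BOS
    [[9.0, 0.0], [3.0, 0.5], [7.0, 7.0]],   # position 2 plays the role of padding
])
mask = torch.tensor([
    [False, True, True],
    [False, True, False],
])
pooled = masked_max_pool(feats, mask)
print(pooled.tolist())  # [[1.0, 4.0], [3.0, 0.5]]
```

The result has the same `[num_texts, num_features]` shape as the tensor returned by `extract_features`, so the four `analyze_*` functions could be run on it unchanged.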


In [7]:
# ============================================================================
# CELL 7: Discover and Analyze Most Interesting Features
# ============================================================================

print("\n🔬 Finding and Analyzing Most Interesting Features Identified by Each SAE")
print("=" * 70)

# Store results in a nested dictionary
results = {
    'strongest': {},
    'most_frequent': {},
    'most_selective': {},
    'specialists': {}
}

# Loop through the loaded SAEs
for sae_name in loaded_saes:
    print(f"running tests with sae_name: {sae_name}...")
    
    sae_obj = loaded_saes[sae_name]['sae']
    hook_point = loaded_saes[sae_name]['config']['hook_point']

    # Extract features for this SAE (all 70 texts at once)
    print(f"   Extracting features from hook_point: {hook_point}...")
    features = extract_features(texts, sae_obj, hook_point)  # [70, 24576]

    # Run all 4 analyses and store results
    print(f"   Running analyses...")
    results['strongest'][sae_name] = analyze_strongest(features=features, texts=texts)
    results['most_frequent'][sae_name] = analyze_frequent(features=features, texts=texts)
    results['most_selective'][sae_name] = analyze_selective(features=features, texts=texts)
    results['specialists'][sae_name] = analyze_specialists(features=features, texts=texts, categories=categories)
    print(f"   ✅ {sae_name} complete")


print("\n" + "=" * 70)
print("✅ All SAE analyses complete!")
print("=" * 70)




🔬 Finding and Analyzing Most Interesting Features Identified by Each SAE
running tests with sae_name: 6-res-jb...
   Extracting features from hook_point: blocks.6.hook_resid_pre...
   Running analyses...
   ✅ 6-res-jb complete
running tests with sae_name: 8-res-jb...
   Extracting features from hook_point: blocks.8.hook_resid_pre...
   Running analyses...
   ✅ 8-res-jb complete
running tests with sae_name: 10-res-jb...
   Extracting features from hook_point: blocks.10.hook_resid_pre...
   Running analyses...
   ✅ 10-res-jb complete
running tests with sae_name: 11-res-jb...
   Extracting features from hook_point: blocks.11.hook_resid_pre...
   Running analyses...
   ✅ 11-res-jb complete

✅ All SAE analyses complete!
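
The loop above extracts features once per SAE, and later display code in this notebook calls `extract_features` again for the same SAEs. A small memoizing wrapper avoids repeating the forward pass. This is a sketch only: the class name `FeatureCache` is hypothetical, and `fake_extract` stands in for a call such as `extract_features(texts, sae_obj, hook_point)`.

```python
class FeatureCache:
    """Memoize per-SAE feature extraction so each SAE is run only once."""

    def __init__(self, extract_fn):
        self._extract_fn = extract_fn
        self._store = {}
        self.misses = 0  # number of real extractions performed

    def get(self, sae_name):
        if sae_name not in self._store:
            self.misses += 1
            self._store[sae_name] = self._extract_fn(sae_name)
        return self._store[sae_name]


# Stand-in extractor so the sketch runs without a model
def fake_extract(sae_name):
    return [len(sae_name), 0.0]


cache = FeatureCache(fake_extract)
first = cache.get("6-res-jb")
second = cache.get("6-res-jb")   # served from the cache, no second extraction
cache.get("8-res-jb")
print(cache.misses)  # 2
```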


In [8]:
# ============================================================================
# CELL 8: Print Detailed Per-SAE Output
# ============================================================================

# After running analyses for each SAE, print detailed findings
n_texts = len(texts)
n_categories = len(categories)

for sae_name in loaded_saes:
    print(f"\n{'='*70}")
    print(f"📊 DETAILED ANALYSIS FOR SAE: {sae_name}")
    print(f"{'='*70}")

    # Show strongest feature details
    strongest = results['strongest'][sae_name]
    print(f"\n1️⃣ STRONGEST Feature: #{strongest['feature_idx']}")
    print(f"   Max activation: {strongest['value']:.2f}")
    print(f"   Text: {strongest['text'][:100]}...")
    print(f"   🔗 {neuronpedia_link(sae_name, strongest['feature_idx'])}")

    # Show most frequent
    frequent = results['most_frequent'][sae_name]
    print(f"\n2️⃣ MOST FREQUENT Feature: #{frequent['feature_idx']}")
    print(f"   Active in: {int(frequent['value'])}/{n_texts} texts ({100*frequent['value']/n_texts:.1f}%)")
    print(f"   🔗 {neuronpedia_link(sae_name, frequent['feature_idx'])}")

    # Show selective (5.0 is the default threshold of analyze_selective)
    selective = results['most_selective'][sae_name]
    print(f"\n3️⃣ MOST SELECTIVE Feature: #{selective['feature_idx']}")
    print(f"   Max: {selective['value']:.2f}")
    print(f"   Strong activations (>5.0): {selective['selective_count']}/{n_texts}")
    print(f"   Any activations (>0): {selective['total_active_count']}/{n_texts}")
    print(f"   🔗 {neuronpedia_link(sae_name, selective['feature_idx'])}")

    # Show category specialists summary
    print(f"\n4️⃣ CATEGORY SPECIALISTS:")
    specialist_count = 0
    for cat_name, cat_data in results['specialists'][sae_name].items():
        if cat_data and cat_data['score'] > 0:
            specialist_count += 1
            print(f"   ✅ {cat_name}: Feature #{cat_data['feature_idx']} (score: {cat_data['score']})")
            print(f"      🔗 {neuronpedia_link(sae_name, cat_data['feature_idx'])}")
        else:
            print(f"   ❌ {cat_name}: No specialist found")

    print(f"\n📊 Summary: {specialist_count}/{n_categories} categories have specialists (score > 0)")



📊 DETAILED ANALYSIS FOR SAE: 6-res-jb

1️⃣ STRONGEST Feature: #6819
   Max activation: 21.18
   Text: The experimental methodology followed established protocols....
   🔗 https://neuronpedia.org/gpt2-small/6-res-jb/6819

2️⃣ MOST FREQUENT Feature: #316
   Active in: 70/70 texts (100.0%)
   🔗 https://neuronpedia.org/gpt2-small/6-res-jb/316

3️⃣ MOST SELECTIVE Feature: #20066
   Max: 8.27
   Strong activations (>5.0): 1/70
   Any activations (>0): 1/70
   🔗 https://neuronpedia.org/gpt2-small/6-res-jb/20066

4️⃣ CATEGORY SPECIALISTS:
   ❌ Python: No specialist found
   ❌ URLs: No specialist found
   ❌ Math: No specialist found
   ❌ Non-English: No specialist found
   ❌ Social: No specialist found
   ❌ Formal: No specialist found
   ❌ Conversational: No specialist found

📊 Summary: 0/7 categories have specialists (score > 0)

📊 DETAILED ANALYSIS FOR SAE: 8-res-jb

1️⃣ STRONGEST Feature: #20644
   Max activation: 38.34
   Text: E = mc^2..
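
For a side-by-side comparison, the per-SAE summary line printed above (categories with a specialist score above 0) can be collected into one dictionary. The sketch below assumes input in the same shape as `results['specialists']`, that is, SAE name to category name to an info dict with a `'score'` key or `None`. The helper name `count_specialists` and the toy scores are made up and are not results from this notebook.

```python
def count_specialists(specialists_by_sae):
    """Map each SAE name to the number of categories whose best feature scores > 0."""
    summary = {}
    for sae_name, per_category in specialists_by_sae.items():
        summary[sae_name] = sum(
            1 for info in per_category.values()
            if info is not None and info["score"] > 0
        )
    return summary


# Made-up scores in the same shape as results['specialists']
toy_results = {
    "sae_a": {"Python": {"score": 3}, "Math": {"score": 0}, "URLs": None},
    "sae_b": {"Python": {"score": 1}, "Math": {"score": 2}, "URLs": {"score": -1}},
}
print(count_specialists(toy_results))  # {'sae_a': 1, 'sae_b': 2}
```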

In [14]:
# ============================================================================
# CELL 9: Display HTML Table Summarizing All Output
# ============================================================================

def display_comparison_table(results):
    '''
    Display comparison table of SAE analysis results using HTML.
    '''
    
    # Get list of SAE names
    sae_names = list(results['strongest'].keys())
    
    # Helper function to create Neuronpedia link
    def neuronpedia_link(sae_name, feature_idx):
        return f"https://neuronpedia.org/gpt2-small/{sae_name}/{feature_idx}"
    
    # Helper function to sanitize text for HTML
    def sanitize_text(text, max_length=40):
        if text is None:
            return "N/A"
        # Replace newlines with spaces
        text = text.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
        # Replace multiple spaces with single space
        text = ' '.join(text.split())
        # Truncate
        if len(text) > max_length:
            text = text[:max_length] + "..."
        # HTML escape
        text = html.escape(text)
        return text
    
    # Build HTML table
    html_content = """
    <style>
        .sae-comparison-table {
            border-collapse: collapse;
            width: 100%;
            margin: 20px 0;
        }
        .sae-comparison-table th, .sae-comparison-table td {
            border: 1px solid #ddd;
            padding: 12px;
            text-align: left;
            vertical-align: top;
        }
        .sae-comparison-table th {
            background-color: #4CAF50;
            color: white;
            font-weight: bold;
        }
        .sae-comparison-table tr:nth-child(even) {
            background-color: #f2f2f2;
        }
        .sae-comparison-table .category-header {
            background-color: #e7f3e7;
            font-weight: bold;
        }
        .feature-link {
            color: #0066cc;
            text-decoration: none;
            font-weight: bold;
        }
        .feature-link:hover {
            text-decoration: underline;
        }
        .text-preview {
            font-family: monospace;
            font-size: 0.9em;
            color: #666;
            margin-top: 5px;
        }
        .feature-item {
            margin-bottom: 10px;
            padding-bottom: 8px;
            border-bottom: 1px solid #eee;
        }
        .feature-item:last-child {
            border-bottom: none;
        }
    </style>
    
    <h2>SAE Comparison: Feature Analysis Results</h2>
    <table class="sae-comparison-table">
        <tr>
            <th>Analysis</th>
    """
    
    # Add SAE name headers
    for sae_name in sae_names:
        html_content += f"<th>{sae_name}</th>"
    html_content += "</tr>\n"
    
    # Row 1: Top 5 Strongest Features
    html_content += '<tr><td><strong>Top 5 Strongest Features</strong></td>'
    for sae_name in sae_names:
        html_content += '<td>'
        
        # Get top 5 strongest features for this SAE
        # Need to re-extract features to find top 5
        sae_obj = loaded_saes[sae_name]['sae']
        hook_point = loaded_saes[sae_name]['config']['hook_point']
        features_tensor = extract_features(texts, sae_obj, hook_point)
        max_activations = features_tensor.max(dim=0)
        top_5 = max_activations.values.topk(5)
        
        for rank, (max_val, feat_idx) in enumerate(zip(top_5.values, top_5.indices), 1):
            feat_idx_item = feat_idx.item()
            text_idx = max_activations.indices[feat_idx_item].item()
            text_preview = sanitize_text(texts[text_idx], max_length=40)
            link = neuronpedia_link(sae_name, feat_idx_item)
            
            html_content += f'''
                <div class="feature-item">
                    {rank}. <a href="{link}" class="feature-link" target="_blank">#{feat_idx_item}</a>
                    (max: {max_val:.2f})<br>
                    <div class="text-preview">{text_preview}</div>
                </div>
            '''
        html_content += '</td>'
    html_content += '</tr>\n'
    
    # Row 2: Top 5 Most Frequent Features
    html_content += '<tr><td><strong>Top 5 Most Frequent Features</strong></td>'
    for sae_name in sae_names:
        html_content += '<td>'
        
        # Get top 5 most frequent features
        sae_obj = loaded_saes[sae_name]['sae']
        hook_point = loaded_saes[sae_name]['config']['hook_point']
        features_tensor = extract_features(texts, sae_obj, hook_point)
        feature_frequency = (features_tensor > 0).sum(dim=0)
        top_5 = feature_frequency.topk(5)
        
        for rank, (count, feat_idx) in enumerate(zip(top_5.values, top_5.indices), 1):
            feat_idx_item = feat_idx.item()
            link = neuronpedia_link(sae_name, feat_idx_item)
            
            html_content += f'''
                <div class="feature-item">
                    {rank}. <a href="{link}" class="feature-link" target="_blank">#{feat_idx_item}</a>
                    (active: {int(count)}/{len(texts)})
                </div>
            '''
        html_content += '</td>'
    html_content += '</tr>\n'
    
    # Row 3: Top 5 Most Selective Features
    html_content += '<tr><td><strong>Top 5 Most Selective Features</strong></td>'
    for sae_name in sae_names:
        html_content += '<td>'
        
        # Get top 5 most selective features
        sae_obj = loaded_saes[sae_name]['sae']
        hook_point = loaded_saes[sae_name]['config']['hook_point']
        features_tensor = extract_features(texts, sae_obj, hook_point)
        
        threshold = 5.0
        strong_activations = (features_tensor > threshold)
        strong_activation_counts = strong_activations.sum(dim=0)
        has_strong_activation = strong_activation_counts > 0
        
        # Find top 5 most selective (fewest strong activations, but at least 1)
        selectivity_counts = strong_activation_counts.clone().float()
        selectivity_counts[~has_strong_activation] = float('inf')
        
        # Get features sorted by selectivity (ascending - fewest activations first)
        sorted_indices = selectivity_counts.argsort()
        top_5_selective = sorted_indices[:5]
        
        for rank, feat_idx in enumerate(top_5_selective, 1):
            feat_idx_item = feat_idx.item()
            selective_count = strong_activation_counts[feat_idx_item].item()
            max_val = features_tensor[:, feat_idx_item].max().item()
            text_idx = features_tensor[:, feat_idx_item].argmax().item()
            text_preview = sanitize_text(texts[text_idx], max_length=40)
            link = neuronpedia_link(sae_name, feat_idx_item)
            
            html_content += f'''
                <div class="feature-item">
                    {rank}. <a href="{link}" class="feature-link" target="_blank">#{feat_idx_item}</a>
                    (max: {max_val:.2f}, selective: {selective_count}/{len(texts)})<br>
                    <div class="text-preview">{text_preview}</div>
                </div>
            '''
        html_content += '</td>'
    html_content += '</tr>\n'
    
    # Category specialists header
    html_content += f'<tr class="category-header"><td colspan="{len(sae_names)+1}"><strong>Category Specialists</strong></td></tr>\n'
    
    # Rows for each category
    category_names = list(results['specialists'][sae_names[0]].keys())
    for cat_name in category_names:
        html_content += f'<tr><td><em>{cat_name}</em></td>'
        for sae_name in sae_names:
            cat_data = results['specialists'][sae_name][cat_name]
            if cat_data:
                link = neuronpedia_link(sae_name, cat_data['feature_idx'])
                score_emoji = "✅" if cat_data['score'] > 0 else "❌"
                html_content += f'''
                    <td>
                        <a href="{link}" class="feature-link" target="_blank">Feature #{cat_data['feature_idx']}</a><br>
                        <strong>Score:</strong> {cat_data['score']} {score_emoji}<br>
                        <strong>Max:</strong> {cat_data['value']:.2f}
                    </td>
                '''
            else:
                html_content += '<td>N/A</td>'
        html_content += '</tr>\n'
    
    html_content += '</table>\n'
    
    # Summary section
    html_content += '<h3>Summary</h3><table class="sae-comparison-table"><tr><th>SAE</th><th>Total Specialists (score > 0)</th></tr>'
    for sae_name in sae_names:
        specialist_count = sum(
            1 for cat_data in results['specialists'][sae_name].values() 
            if cat_data and cat_data['score'] > 0
        )
        html_content += f'<tr><td>{sae_name}</td><td>{specialist_count}/{len(category_names)}</td></tr>'
    html_content += '</table>'
    
    # Display the HTML
    display(HTML(html_content))

print("✅ display_comparison_table() defined")
print("=" * 70)

# Then display as table
display_comparison_table(results)

✅ display_comparison_table() defined


Analysis,6-res-jb,8-res-jb,10-res-jb,11-res-jb
Top 5 Strongest Features,1. #6819  (max: 21.18)  The experimental methodology followed es...  2. #23123  (max: 11.33)  ∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C  3. #979  (max: 10.88)  ∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C  4. #316  (max: 9.48)  lambda x: x ** 2 + 3 * x - 5  5. #23111  (max: 9.09)  ∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C,"1. #20644  (max: 38.34)  E = mc^2  2. #13670  (max: 13.55)  안녕하세요, 잘 지내셨어요?  3. #11746  (max: 12.85)  The weather is nice, maybe we should go ...  4. #11533  (max: 12.45)  ∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C  5. #4078  (max: 11.85)  مرحبا، كيف حالك؟","1. #10658  (max: 43.24)  E = mc^2  2. #1794  (max: 13.05)  print(f'Result: {sum(values) / len(value...  3. #21412  (max: 12.36)  안녕하세요, 잘 지내셨어요?  4. #16384  (max: 12.25)  print(f'Result: {sum(values) / len(value...  5. #4268  (max: 11.49)  The weather is nice, maybe we should go ...","1. #8100  (max: 71.13)  E = mc^2  2. #5717  (max: 22.34)  안녕하세요, 잘 지내셨어요?  3. #9530  (max: 17.36)  안녕하세요, 잘 지내셨어요?  4. #14349  (max: 12.50)  P(A|B) = P(B|A)P(A) / P(B)  5. #11379  (max: 11.70)  P(A|B) = P(B|A)P(A) / P(B)"
Top 5 Most Frequent Features,1. #2039  (active: 70/70)  2. #979  (active: 70/70)  3. #7496  (active: 70/70)  4. #9088  (active: 70/70)  5. #316  (active: 70/70),1. #4078  (active: 70/70)  2. #6955  (active: 70/70)  3. #7662  (active: 70/70)  4. #11533  (active: 70/70)  5. #818  (active: 70/70),1. #1794  (active: 70/70)  2. #4268  (active: 70/70)  3. #9576  (active: 70/70)  4. #16384  (active: 70/70)  5. #12421  (active: 70/70),1. #8032  (active: 70/70)  2. #7548  (active: 70/70)  3. #11379  (active: 70/70)  4. #12555  (active: 70/70)  5. #11266  (active: 70/70)
Top 5 Most Selective Features,"1. #20066  (max: 8.27, selective: 1/70)  안녕하세요, 잘 지내셨어요?  2. #6819  (max: 21.18, selective: 44/70)  The experimental methodology followed es...  3. #23373  (max: 7.93, selective: 70/70)  I think the meeting went pretty well tod...  4. #23123  (max: 11.33, selective: 70/70)  ∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C  5. #316  (max: 9.48, selective: 70/70)  lambda x: x ** 2 + 3 * x - 5","1. #13670  (max: 13.55, selective: 1/70)  안녕하세요, 잘 지내셨어요?  2. #10516  (max: 10.21, selective: 1/70)  안녕하세요, 잘 지내셨어요?  3. #20644  (max: 38.34, selective: 58/70)  E = mc^2  4. #7662  (max: 10.36, selective: 70/70)  Pursuant to Article 12, Section 3 of the...  5. #11533  (max: 12.45, selective: 70/70)  ∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C","1. #2401  (max: 5.80, selective: 1/70)  ∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C  2. #4410  (max: 10.20, selective: 1/70)  안녕하세요, 잘 지내셨어요?  3. #21412  (max: 12.36, selective: 1/70)  안녕하세요, 잘 지내셨어요?  4. #1303  (max: 10.46, selective: 1/70)  Здравствуйте, как дела?  5. #10658  (max: 43.24, selective: 61/70)  E = mc^2","1. #9530  (max: 17.36, selective: 1/70)  안녕하세요, 잘 지내셨어요?  2. #4826  (max: 5.74, selective: 1/70)  你好，今天天气怎么样？  3. #9590  (max: 8.17, selective: 1/70)  مرحبا، كيف حالك؟  4. #5717  (max: 22.34, selective: 1/70)  안녕하세요, 잘 지내셨어요?  5. #22917  (max: 5.34, selective: 1/70)  ∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C"
Category Specialists,Category Specialists,Category Specialists,Category Specialists,Category Specialists
Python,,,,
URLs,,,,
Math,,,,
Non-English,,Feature #13670  Score: 1 ✅  Max: 13.55,Feature #21412  Score: 1 ✅  Max: 12.36,Feature #5717  Score: 1 ✅  Max: 22.34
Social,,,,
Formal,,,,

SAE,Total Specialists (score > 0)
6-res-jb,0/7
8-res-jb,1/7
10-res-jb,1/7
11-res-jb,1/7
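
Several "most selective" entries in the table are strong on 70/70 texts. That follows from the ranking rule in `display_comparison_table()`: features are sorted by how many texts exceed the threshold, and features that never exceed it are excluded, so when fewer than five features are strong on only a few texts, the remaining slots go to features that are strong almost everywhere. A minimal sketch of that rule on made-up activations follows. Plain Python lists stand in for `features_tensor`, and `rank_selective` and the toy numbers are illustrative assumptions, not part of the notebook.

```python
def rank_selective(acts, threshold=5.0, k=5):
    """Rank features by how few texts activate them above `threshold`.

    acts: list of rows, one per text; each row holds one activation per feature.
    Returns up to k (feature_idx, strong_count) pairs, fewest strong texts first.
    Features that never exceed the threshold are skipped.
    """
    n_features = len(acts[0])
    counts = [sum(1 for row in acts if row[j] > threshold) for j in range(n_features)]
    candidates = [(j, c) for j, c in enumerate(counts) if c > 0]
    candidates.sort(key=lambda pair: pair[1])  # stable sort: ties keep feature order
    return candidates[:k]


# Toy data: 4 texts x 4 features (made-up numbers, not real SAE activations).
toy_acts = [
    [9.0, 0.0, 6.0, 1.0],
    [8.5, 0.0, 0.0, 2.0],
    [9.2, 0.0, 0.0, 0.5],
    [8.8, 0.0, 0.0, 3.0],
]
ranking = rank_selective(toy_acts, threshold=5.0, k=3)
print(ranking)
```

Here only two features ever cross the threshold, so a request for three returns the one rare feature followed by the one that is strong on every text. A cap on the strong-text count (for example, at most a few texts) would be one way to keep always-on features out of this row.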


In [11]:
# ============================================================================
# CELL 10: Dataframe Output
# ============================================================================
# After analyses, show top activating texts for each interesting feature

for sae_name in loaded_saes:
    num_top_features = 5
    
    print(f"\n{'='*70}")
    print(f"🔬 {sae_name} - Top {num_top_features} Strongest Features")
    print(f"{'='*70}")
    
    features_tensor = extract_features(texts, loaded_saes[sae_name]['sae'], 
                                      loaded_saes[sae_name]['config']['hook_point'])
    
    # Find top n strongest features
    max_activations = features_tensor.max(dim=0)
    top_n_features = max_activations.values.topk(num_top_features)
    
    # Display each of the top n features
    for rank, (max_val, feat_idx) in enumerate(zip(top_n_features.values, top_n_features.indices), 1):
        feat_idx_item = feat_idx.item()
        
        # Create DataFrame showing top activations for this feature
        feature_acts = features_tensor[:, feat_idx_item]
        df = pd.DataFrame({
            'Text': texts,
            'Activation': feature_acts.detach().cpu().numpy(),  # .cpu() keeps this working if features live on a GPU
            'Category': labels
        })
        df = df.sort_values('Activation', ascending=False)
        
        print(f"\n{rank}️⃣ Feature #{feat_idx_item} - Max Activation: {max_val:.2f}")
        print(f"🔗 {neuronpedia_link(sae_name, feat_idx_item)}")
        display(df.head(10))


🔬 6-res-jb - Top 5 Strongest Features

1️⃣ Feature #6819 - Max Activation: 21.18
🔗 https://neuronpedia.org/gpt2-small/6-res-jb/6819


Unnamed: 0,Text,Activation,Category
58,The experimental methodology followed establis...,21.178993,Formal
67,The traffic was terrible this morning.,18.877275,Conversational
26,E = mc^2,18.829359,Math
38,"Ciao, come stai?",17.305628,Non-English
65,That restaurant has the best pizza in town.,17.103907,Conversational
61,I think the meeting went pretty well today.,16.717604,Conversational
50,The phenomenon was observed under controlled l...,16.284458,Formal
55,This paper examines the theoretical frameworks...,16.190865,Formal
69,Let's catch up over coffee sometime.,15.394181,Conversational
66,My cat keeps knocking things off the table.,15.328994,Conversational



2️⃣ Feature #23123 - Max Activation: 11.33
🔗 https://neuronpedia.org/gpt2-small/6-res-jb/23123


Unnamed: 0,Text,Activation,Category
21,∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C,11.33449,Math
27,"∇f(x,y) = (∂f/∂x, ∂f/∂y)",11.227591,Math
23,∑(i=1 to n) i = n(n+1)/2,11.215016,Math
25,P(A|B) = P(B|A)P(A) / P(B),11.207876,Math
5,lambda x: x ** 2 + 3 * x - 5,11.189367,Python
29,sin^2(θ) + cos^2(θ) = 1,11.164784,Math
24,√(a^2 + b^2) = c,11.105572,Math
51,In accordance with the aforementioned regulati...,11.087137,Formal
22,lim(x→0) sin(x)/x = 1,11.077873,Math
20,f(x) = x^2 + 2x + 1,11.071542,Math



3️⃣ Feature #979 - Max Activation: 10.88
🔗 https://neuronpedia.org/gpt2-small/6-res-jb/979


Unnamed: 0,Text,Activation,Category
21,∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C,10.883331,Math
25,P(A|B) = P(B|A)P(A) / P(B),10.813282,Math
34,"Здравствуйте, как дела?",10.763273,Non-English
27,"∇f(x,y) = (∂f/∂x, ∂f/∂y)",10.753177,Math
29,sin^2(θ) + cos^2(θ) = 1,10.707228,Math
23,∑(i=1 to n) i = n(n+1)/2,10.640139,Math
22,lim(x→0) sin(x)/x = 1,10.612128,Math
30,"Bonjour, comment allez-vous aujourd'hui?",10.600821,Non-English
20,f(x) = x^2 + 2x + 1,10.591829,Math
24,√(a^2 + b^2) = c,10.58762,Math



4️⃣ Feature #316 - Max Activation: 9.48
🔗 https://neuronpedia.org/gpt2-small/6-res-jb/316


Unnamed: 0,Text,Activation,Category
5,lambda x: x ** 2 + 3 * x - 5,9.482623,Python
23,∑(i=1 to n) i = n(n+1)/2,9.40764,Math
21,∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C,9.403224,Math
4,try:\n x = int(input())\nexcept ValueError:...,9.390193,Python
22,lim(x→0) sin(x)/x = 1,9.362902,Math
20,f(x) = x^2 + 2x + 1,9.361799,Math
29,sin^2(θ) + cos^2(θ) = 1,9.353755,Math
25,P(A|B) = P(B|A)P(A) / P(B),9.342552,Math
46,lmaooo i'm dying 😭😭,9.329171,Social
3,for i in range(len(data)):\n result.append(...,9.327734,Python



5️⃣ Feature #23111 - Max Activation: 9.09
🔗 https://neuronpedia.org/gpt2-small/6-res-jb/23111


Unnamed: 0,Text,Activation,Category
21,∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C,9.094437,Math
5,lambda x: x ** 2 + 3 * x - 5,8.963695,Python
23,∑(i=1 to n) i = n(n+1)/2,8.957463,Math
27,"∇f(x,y) = (∂f/∂x, ∂f/∂y)",8.945869,Math
28,"det([[a,b],[c,d]]) = ad - bc",8.913188,Math
20,f(x) = x^2 + 2x + 1,8.901364,Math
24,√(a^2 + b^2) = c,8.898706,Math
29,sin^2(θ) + cos^2(θ) = 1,8.898414,Math
0,def factorial(n):\n return 1 if n == 0 else...,8.889197,Python
22,lim(x→0) sin(x)/x = 1,8.859401,Math



🔬 8-res-jb - Top 5 Strongest Features

1️⃣ Feature #20644 - Max Activation: 38.34
🔗 https://neuronpedia.org/gpt2-small/8-res-jb/20644


Unnamed: 0,Text,Activation,Category
26,E = mc^2,38.339603,Math
58,The experimental methodology followed establis...,37.222916,Formal
67,The traffic was terrible this morning.,34.819199,Conversational
38,"Ciao, come stai?",33.546543,Non-English
69,Let's catch up over coffee sometime.,32.713402,Conversational
61,I think the meeting went pretty well today.,31.619547,Conversational
68,I need to finish this project by Friday.,30.955946,Conversational
50,The phenomenon was observed under controlled l...,30.624462,Formal
66,My cat keeps knocking things off the table.,30.274088,Conversational
65,That restaurant has the best pizza in town.,29.899927,Conversational



2️⃣ Feature #13670 - Max Activation: 13.55
🔗 https://neuronpedia.org/gpt2-small/8-res-jb/13670


Unnamed: 0,Text,Activation,Category
37,"안녕하세요, 잘 지내셨어요?",13.550343,Non-English
50,The phenomenon was observed under controlled l...,0.0,Formal
49,this slaps fr fr 🎵🔥,0.0,Social
48,mood af rn 💯,0.0,Social
47,tbh idk what to do 🤷‍♀️,0.0,Social
46,lmaooo i'm dying 😭😭,0.0,Social
45,ngl this is pretty cool 🔥,0.0,Social
44,yaaaas queen!!! 👑💅✨,0.0,Social
5,lambda x: x ** 2 + 3 * x - 5,0.0,Python
42,just got coffee ☕ feeling good ✨,0.0,Social



3️⃣ Feature #11746 - Max Activation: 12.85
🔗 https://neuronpedia.org/gpt2-small/8-res-jb/11746


Unnamed: 0,Text,Activation,Category
62,"The weather is nice, maybe we should go for a ...",12.850475,Conversational
51,In accordance with the aforementioned regulati...,12.832219,Formal
33,"Guten Tag, wie geht es Ihnen?",12.78705,Non-English
53,"Pursuant to Article 12, Section 3 of the afore...",12.628222,Formal
59,"In conclusion, further research is warranted t...",12.586558,Formal
39,"Olá, como você está?",12.567338,Non-English
64,I'm planning a trip to Japan next summer.,12.537554,Conversational
69,Let's catch up over coffee sometime.,12.493813,Conversational
30,"Bonjour, comment allez-vous aujourd'hui?",12.489168,Non-English
36,مرحبا، كيف حالك؟,12.464737,Non-English



4️⃣ Feature #11533 - Max Activation: 12.45
🔗 https://neuronpedia.org/gpt2-small/8-res-jb/11533


Unnamed: 0,Text,Activation,Category
21,∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C,12.445198,Math
8,print(f'Result: {sum(values) / len(values):.2f}'),12.402194,Python
27,"∇f(x,y) = (∂f/∂x, ∂f/∂y)",12.350143,Math
23,∑(i=1 to n) i = n(n+1)/2,12.272026,Math
25,P(A|B) = P(B|A)P(A) / P(B),12.202666,Math
28,"det([[a,b],[c,d]]) = ad - bc",12.180933,Math
3,for i in range(len(data)):\n result.append(...,12.103069,Python
0,def factorial(n):\n return 1 if n == 0 else...,12.074641,Python
29,sin^2(θ) + cos^2(θ) = 1,12.0705,Math
22,lim(x→0) sin(x)/x = 1,12.029311,Math



5️⃣ Feature #4078 - Max Activation: 11.85
🔗 https://neuronpedia.org/gpt2-small/8-res-jb/4078


Unnamed: 0,Text,Activation,Category
36,مرحبا، كيف حالك؟,11.845957,Non-English
25,P(A|B) = P(B|A)P(A) / P(B),11.797411,Math
31,你好，今天天气怎么样？,11.730913,Non-English
23,∑(i=1 to n) i = n(n+1)/2,11.708187,Math
47,tbh idk what to do 🤷‍♀️,11.634768,Social
44,yaaaas queen!!! 👑💅✨,11.607357,Social
8,print(f'Result: {sum(values) / len(values):.2f}'),11.606942,Python
28,"det([[a,b],[c,d]]) = ad - bc",11.606054,Math
24,√(a^2 + b^2) = c,11.571928,Math
11,Visit our website at http://example.com/products,11.561803,URLs



🔬 10-res-jb - Top 5 Strongest Features

1️⃣ Feature #10658 - Max Activation: 43.24
🔗 https://neuronpedia.org/gpt2-small/10-res-jb/10658


Unnamed: 0,Text,Activation,Category
26,E = mc^2,43.244122,Math
58,The experimental methodology followed establis...,42.267998,Formal
67,The traffic was terrible this morning.,41.522003,Conversational
38,"Ciao, come stai?",40.216732,Non-English
15,mailto:support@example.com,38.608032,URLs
65,That restaurant has the best pizza in town.,38.24905,Conversational
68,I need to finish this project by Friday.,37.560989,Conversational
69,Let's catch up over coffee sometime.,37.476276,Conversational
66,My cat keeps knocking things off the table.,36.747162,Conversational
50,The phenomenon was observed under controlled l...,36.091499,Formal



2️⃣ Feature #1794 - Max Activation: 13.05
🔗 https://neuronpedia.org/gpt2-small/10-res-jb/1794


Unnamed: 0,Text,Activation,Category
8,print(f'Result: {sum(values) / len(values):.2f}'),13.046694,Python
3,for i in range(len(data)):\n result.append(...,13.035959,Python
1,import torch\nimport numpy as np\nfrom transfo...,12.893911,Python
7,return [x for x in lst if x > 0],12.860558,Python
33,"Guten Tag, wie geht es Ihnen?",12.808798,Non-English
28,"det([[a,b],[c,d]]) = ad - bc",12.807419,Math
39,"Olá, como você está?",12.751987,Non-English
62,"The weather is nice, maybe we should go for a ...",12.686335,Conversational
51,In accordance with the aforementioned regulati...,12.659408,Formal
53,"Pursuant to Article 12, Section 3 of the afore...",12.64129,Formal



3️⃣ Feature #21412 - Max Activation: 12.36
🔗 https://neuronpedia.org/gpt2-small/10-res-jb/21412


Unnamed: 0,Text,Activation,Category
37,"안녕하세요, 잘 지내셨어요?",12.363623,Non-English
50,The phenomenon was observed under controlled l...,0.0,Formal
49,this slaps fr fr 🎵🔥,0.0,Social
48,mood af rn 💯,0.0,Social
47,tbh idk what to do 🤷‍♀️,0.0,Social
46,lmaooo i'm dying 😭😭,0.0,Social
45,ngl this is pretty cool 🔥,0.0,Social
44,yaaaas queen!!! 👑💅✨,0.0,Social
5,lambda x: x ** 2 + 3 * x - 5,0.0,Python
42,just got coffee ☕ feeling good ✨,0.0,Social



4️⃣ Feature #16384 - Max Activation: 12.25
🔗 https://neuronpedia.org/gpt2-small/10-res-jb/16384


Unnamed: 0,Text,Activation,Category
8,print(f'Result: {sum(values) / len(values):.2f}'),12.245234,Python
3,for i in range(len(data)):\n result.append(...,12.232844,Python
25,P(A|B) = P(B|A)P(A) / P(B),12.196186,Math
2,class NeuralNetwork(nn.Module):\n def __ini...,11.979583,Python
4,try:\n x = int(input())\nexcept ValueError:...,11.924099,Python
1,import torch\nimport numpy as np\nfrom transfo...,11.842025,Python
23,∑(i=1 to n) i = n(n+1)/2,11.804591,Math
28,"det([[a,b],[c,d]]) = ad - bc",11.786098,Math
36,مرحبا، كيف حالك؟,11.758161,Non-English
11,Visit our website at http://example.com/products,11.684157,URLs



5️⃣ Feature #4268 - Max Activation: 11.49
🔗 https://neuronpedia.org/gpt2-small/10-res-jb/4268


Unnamed: 0,Text,Activation,Category
62,"The weather is nice, maybe we should go for a ...",11.491919,Conversational
51,In accordance with the aforementioned regulati...,11.423798,Formal
59,"In conclusion, further research is warranted t...",11.248917,Formal
56,The defendant pleaded not guilty to all charge...,11.221302,Formal
65,That restaurant has the best pizza in town.,11.119753,Conversational
50,The phenomenon was observed under controlled l...,11.089849,Formal
64,I'm planning a trip to Japan next summer.,11.05995,Conversational
1,import torch\nimport numpy as np\nfrom transfo...,11.048014,Python
54,The results indicate a statistically significa...,11.037558,Formal
55,This paper examines the theoretical frameworks...,11.003435,Formal



🔬 11-res-jb - Top 5 Strongest Features

1️⃣ Feature #8100 - Max Activation: 71.13
🔗 https://neuronpedia.org/gpt2-small/11-res-jb/8100


Unnamed: 0,Text,Activation,Category
26,E = mc^2,71.129883,Math
58,The experimental methodology followed establis...,68.0382,Formal
67,The traffic was terrible this morning.,64.544769,Conversational
38,"Ciao, come stai?",63.847511,Non-English
15,mailto:support@example.com,63.060844,URLs
69,Let's catch up over coffee sometime.,60.790672,Conversational
65,That restaurant has the best pizza in town.,59.239334,Conversational
48,mood af rn 💯,57.534817,Social
50,The phenomenon was observed under controlled l...,57.157982,Formal
61,I think the meeting went pretty well today.,55.314991,Conversational



2️⃣ Feature #5717 - Max Activation: 22.34
🔗 https://neuronpedia.org/gpt2-small/11-res-jb/5717


Unnamed: 0,Text,Activation,Category
37,"안녕하세요, 잘 지내셨어요?",22.343342,Non-English
50,The phenomenon was observed under controlled l...,0.0,Formal
49,this slaps fr fr 🎵🔥,0.0,Social
48,mood af rn 💯,0.0,Social
47,tbh idk what to do 🤷‍♀️,0.0,Social
46,lmaooo i'm dying 😭😭,0.0,Social
45,ngl this is pretty cool 🔥,0.0,Social
44,yaaaas queen!!! 👑💅✨,0.0,Social
5,lambda x: x ** 2 + 3 * x - 5,0.0,Python
42,just got coffee ☕ feeling good ✨,0.0,Social



3️⃣ Feature #9530 - Max Activation: 17.36
🔗 https://neuronpedia.org/gpt2-small/11-res-jb/9530


Unnamed: 0,Text,Activation,Category
37,"안녕하세요, 잘 지내셨어요?",17.355253,Non-English
50,The phenomenon was observed under controlled l...,0.0,Formal
49,this slaps fr fr 🎵🔥,0.0,Social
48,mood af rn 💯,0.0,Social
47,tbh idk what to do 🤷‍♀️,0.0,Social
46,lmaooo i'm dying 😭😭,0.0,Social
45,ngl this is pretty cool 🔥,0.0,Social
44,yaaaas queen!!! 👑💅✨,0.0,Social
5,lambda x: x ** 2 + 3 * x - 5,0.0,Python
42,just got coffee ☕ feeling good ✨,0.0,Social



4️⃣ Feature #14349 - Max Activation: 12.50
🔗 https://neuronpedia.org/gpt2-small/11-res-jb/14349


Unnamed: 0,Text,Activation,Category
25,P(A|B) = P(B|A)P(A) / P(B),12.501636,Math
3,for i in range(len(data)):\n result.append(...,12.407318,Python
8,print(f'Result: {sum(values) / len(values):.2f}'),12.273678,Python
4,try:\n x = int(input())\nexcept ValueError:...,12.246566,Python
1,import torch\nimport numpy as np\nfrom transfo...,12.10624,Python
0,def factorial(n):\n return 1 if n == 0 else...,12.019621,Python
28,"det([[a,b],[c,d]]) = ad - bc",11.960091,Math
21,∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C,11.930597,Math
29,sin^2(θ) + cos^2(θ) = 1,11.850236,Math
20,f(x) = x^2 + 2x + 1,11.805093,Math



5️⃣ Feature #11379 - Max Activation: 11.70
🔗 https://neuronpedia.org/gpt2-small/11-res-jb/11379


Unnamed: 0,Text,Activation,Category
25,P(A|B) = P(B|A)P(A) / P(B),11.70452,Math
27,"∇f(x,y) = (∂f/∂x, ∂f/∂y)",11.1396,Math
28,"det([[a,b],[c,d]]) = ad - bc",11.105947,Math
8,print(f'Result: {sum(values) / len(values):.2f}'),10.990323,Python
21,∫(x^2 + 3x)dx = x^3/3 + 3x^2/2 + C,10.971671,Math
3,for i in range(len(data)):\n result.append(...,10.833378,Python
4,try:\n x = int(input())\nexcept ValueError:...,10.8096,Python
24,√(a^2 + b^2) = c,10.788852,Math
29,sin^2(θ) + cos^2(θ) = 1,10.787041,Math
23,∑(i=1 to n) i = n(n+1)/2,10.760444,Math


In [12]:
# ============================================================================
# CELL 11: Interpretation Section
# ============================================================================


# After table display
print("\n" + "="*70)
print("💭 INTERPRETATION GUIDE")
print("="*70)
print("""
- STRONGEST: Feature with highest peak activation (may still be general-purpose)
- FREQUENT: Feature that fires consistently across many text types  
- SELECTIVE: Feature with high activation but only on specific texts (rare specialist)
- CATEGORY SPECIALISTS: Features that fire strongly in one category but rarely outside

Specialist Score = (strong activations inside category) - (strong activations outside)
✅ Positive score = True specialist
❌ Zero/negative score = General feature that fires across categories
""")

# Compare SAEs
print("\n🔬 CROSS-SAE COMPARISON:")

# Get SAE names from results
sae_names = list(results['strongest'].keys())

# Count categories with a positive specialist score for each SAE
specialist_counts = {}
for sae_name in sae_names:
    specialists = results['specialists'][sae_name]
    count = sum(1 for cat_data in specialists.values()
                if cat_data and cat_data['score'] > 0)
    specialist_counts[sae_name] = count
    print(f"   {sae_name}: {count}/{len(specialists)} specialists found")

# Only report the "no specialists" finding when every SAE has zero
if all(count == 0 for count in specialist_counts.values()):
    print("\n⚠️ KEY FINDING: None of the SAEs learned strong category specialists!")
    print("   This suggests these particular decompositions favor general features.")


💭 INTERPRETATION GUIDE

- STRONGEST: Feature with highest peak activation (may still be general-purpose)
- FREQUENT: Feature that fires consistently across many text types  
- SELECTIVE: Feature with high activation but only on specific texts (rare specialist)
- CATEGORY SPECIALISTS: Features that fire strongly in one category but rarely outside

Specialist Score = (strong activations inside category) - (strong activations outside)
✅ Positive score = True specialist
❌ Zero/negative score = General feature that fires across categories


🔬 CROSS-SAE COMPARISON:
   6-res-jb: 0/7 specialists found
   8-res-jb: 1/7 specialists found
   10-res-jb: 1/7 specialists found
   11-res-jb: 1/7 specialists found
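
The specialist score from the interpretation guide can be sketched as a small function. This is a hedged illustration rather than the notebook's own scoring code: it assumes "strong" means an activation above the same 5.0 threshold used for the selectivity row, uses plain Python lists in place of `features_tensor`, and `specialist_score` plus the toy labels and values are made up for the example.

```python
def specialist_score(acts, labels, feature_idx, category, threshold=5.0):
    """Strong activations inside `category` minus strong activations outside it."""
    inside = outside = 0
    for row, label in zip(acts, labels):
        if row[feature_idx] > threshold:  # assumption: "strong" = above threshold
            if label == category:
                inside += 1
            else:
                outside += 1
    return inside - outside


# Made-up activations: 4 texts x 2 features (not real SAE outputs).
toy_labels = ["Non-English", "Non-English", "Math", "Social"]
toy_acts = [
    [13.5, 11.8],
    [0.0, 11.7],
    [0.0, 11.6],
    [0.0, 11.5],
]

narrow = specialist_score(toy_acts, toy_labels, feature_idx=0, category="Non-English")
broad = specialist_score(toy_acts, toy_labels, feature_idx=1, category="Non-English")
print(narrow, broad)
```

Under these assumptions, a feature that is strong on a single in-category text and silent elsewhere scores 1, while a feature that is strong on every text scores zero or below, which is how the guide separates specialists from general features. A score of 1 can therefore reflect a single-text feature rather than one covering the whole category.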
