In [2]:
# 🎓 ECE685 Project 2: Multi-Layer SAE-Guided LLM Safety
# ============================================================================
# 
# This notebook extends the single-layer approach by hooking SAEs to MULTIPLE 
# layers of the LLM (layers 12 and 20), concatenating the features, and 
# comparing performance with the single-layer baseline.
#
# Key improvements:
# 1. Hook SAE at Layer 12 (mid-level features) AND Layer 20 (high-level features)
# 2. Concatenate features from both layers for richer representations
# 3. Compare single-layer vs multi-layer detection performance
# 4. Test multi-layer steering
#
# ============================================================================

print("🎓 ECE685 Project 2: Multi-Layer SAE Experiment")
print("=" * 60)
print("This notebook hooks SAEs at MULTIPLE layers to boost performance.")
print("Layers: 12 (mid-level) + 20 (high-level semantic)")
print("=" * 60)


# 🎓 ECE685 Project 2: Multi-Layer SAE-Guided LLM Safety

## Exploring Sparsity Across Multiple Layers

This notebook extends our single-layer approach by **hooking SAEs to multiple layers** of the LLM to boost performance.

### Key Hypothesis
- **Layer 12** (middle): Captures structural and syntactic features
- **Layer 20** (deep): Captures more abstract semantic concepts
- **Combined**: Concatenating features from both layers provides richer representations

### What We'll Do:
1. **Load models**: Gemma-2-2B-IT + SAEs from Gemma Scope for layers 12 AND 20
2. **Multi-layer capture**: Hook both layers, encode with respective SAEs
3. **Feature comparison**: Compare single-layer vs multi-layer feature discovery
4. **Detection boost**: Train classifiers on concatenated features
5. **Steering comparison**: Test if multi-layer steering improves results

### Experimental Design:
| Condition | Layers | Feature Dim | Expected Benefit |
|-----------|--------|-------------|------------------|
| Single-L12 | 12 only | 16,384 | Baseline (structural) |
| Single-L20 | 20 only | 16,384 | High-level semantic |
| Multi-Layer | 12 + 20 | 32,768 | Combined strengths |
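
The feature dimensions in the table follow from simple concatenation. The sketch below is a minimal illustration, assuming each layer's SAE yields a 16,384-wide code vector; `concat_codes` and `locate_feature` are hypothetical helper names, not part of this notebook. It also shows how an index in the 32,768-wide vector can be traced back to its source layer.

```python
# Hypothetical helpers; 16,384 is the per-layer SAE width from the table above.
LAYER_WIDTHS = {12: 16_384, 20: 16_384}

def concat_codes(codes_by_layer):
    """Concatenate per-layer code vectors in ascending layer order."""
    combined = []
    for layer in sorted(codes_by_layer):
        combined.extend(codes_by_layer[layer])
    return combined

def locate_feature(concat_index, widths=LAYER_WIDTHS):
    """Map an index in the concatenated vector back to (layer, local index)."""
    if concat_index < 0:
        raise IndexError("index must be non-negative")
    offset = 0
    for layer in sorted(widths):
        if concat_index < offset + widths[layer]:
            return layer, concat_index - offset
        offset += widths[layer]
    raise IndexError(f"index {concat_index} is outside the concatenated vector")

codes = {12: [0.0] * 16_384, 20: [0.0] * 16_384}
combined = concat_codes(codes)
print(len(combined))           # 32768
print(locate_feature(100))     # (12, 100)
print(locate_feature(16_384))  # (20, 0)
```

Keeping the layer order fixed (ascending here) matters: a classifier weight at index 16,384 only stays interpretable if that index always refers to the first Layer 20 feature.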

## 🔧 Setup: Install Dependencies & Login to HuggingFace

### ⚠️ IMPORTANT: Follow These Steps Exactly!

| Step | Action |
|------|--------|
| 1️⃣ | **Run the installation cell below** (wait for it to finish, ~2-3 min) |
| 2️⃣ | **RESTART THE RUNTIME**: Click `Runtime → Restart runtime` |
| 3️⃣ | **After restart, SKIP the installation cell** (don't run it again!) |
| 4️⃣ | **Run the verification cell** to confirm everything works |
| 5️⃣ | Continue with the HuggingFace login and rest of notebook |

> 🛑 **Why restart?** A running Python process keeps an already-imported NumPy loaded in memory. After installing a different version, you MUST restart the runtime so that the new version is the one that gets imported.
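
The restart requirement comes from Python's module cache. The following is a small sketch of that behaviour, using the standard-library `json` module as a stand-in for NumPy so that it runs anywhere; the variable names are illustrative only.

```python
import importlib
import sys

import json  # stand-in for numpy: any module the process has already imported

first = sys.modules["json"]

# A second import does not re-read the package from disk; it returns the cached object.
import json as json_again
same_object = json_again is first

# importlib.import_module goes through the same cache.
also_cached = importlib.import_module("json") is first

print(same_object, also_cached)  # True True
```

Because the cached module object is reused, files replaced on disk by `%pip install` are only picked up once a fresh interpreter starts.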

In [None]:
# ============================================================================
# STEP 1: Install dependencies (RUN THIS FIRST, THEN RESTART!)
# ============================================================================
# 
# ⚠️  CRITICAL: After this cell finishes, you MUST:
#     1. Click Runtime → Restart runtime (or Ctrl+M .)
#     2. After restart, SKIP this cell
#     3. Run the verification cell below
#
# ============================================================================

# Uninstall any existing numpy first to avoid conflicts
%pip uninstall -y numpy
# Install compatible numpy version
%pip install numpy==1.26.4
# Install other dependencies
%pip install -q transformers accelerate datasets torch pandas matplotlib scikit-learn tqdm
# Install sae-lens (must be after numpy to avoid conflicts)
%pip install -q sae-lens


In [1]:
# ============================================================================
# STEP 2: Verify installation (run AFTER restarting runtime)
# ============================================================================

import numpy as np
print(f"NumPy version: {np.__version__}")

if np.__version__.startswith("2."):
    print("\n" + "="*70)
    print("❌ ERROR: Wrong NumPy version detected!")
    print("="*70)
    print(f"\n   You have NumPy {np.__version__}, but need 1.26.x")
    print("\n   FIX: Click Runtime → Restart runtime, then run THIS cell again")
    print("="*70)
    raise RuntimeError("Please restart the runtime and try again!")

print(f"✓ NumPy version: {np.__version__} (compatible)")

import sae_lens
print(f"✓ sae-lens version: {sae_lens.__version__}")

import torch
print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"  GPU: {torch.cuda.get_device_name(0)}")

import transformers
print(f"✓ transformers version: {transformers.__version__}")

print("\n" + "="*50)
print("✅ All packages loaded successfully!")
print("="*50)

NumPy version: 2.0.2

❌ ERROR: Wrong NumPy version detected!

   You have NumPy 2.0.2, but need 1.26.x

   FIX: Click Runtime → Restart runtime, then run THIS cell again


RuntimeError: Please restart the runtime and try again!

In [None]:
# ============================================================================
# STEP 3: HuggingFace Login + Google Drive Mounting (COLAB ONLY)
# ============================================================================

from google.colab import drive
import os

# Mount Google Drive to save results
drive.mount('/content/drive', force_remount=False)
print("✓ Google Drive mounted at /content/drive")

# Setup HuggingFace authentication
from huggingface_hub import login

HF_TOKEN = os.getenv("HF_TOKEN")
if HF_TOKEN is None:
    print("\n" + "="*70)
    print("⚠️  Enter your HuggingFace token when prompted below")
    print("="*70)
    print("To get token: https://huggingface.co/settings/tokens")
    print("To accept license: https://huggingface.co/google/gemma-2-2b-it")
    print("="*70)
    login()  # This will prompt for token
else:
    login(token=HF_TOKEN)
    print("✓ HuggingFace authenticated!")

print("\n✅ Setup complete!")

In [None]:
# ============================================================================
# STEP 4: Setup Checkpoint & Result Saving to Google Drive
# ============================================================================
# 
# Save results during long experiments to Google Drive
# so they persist even if the Colab session disconnects
#
# ============================================================================

import json
import pickle
import os
from pathlib import Path
from datetime import datetime

class ColabExperimentCheckpoint:
    """Manage checkpoints and results for long-running Colab experiments"""
    
    def __init__(self, experiment_name="default"):
        self.experiment_name = experiment_name
        # Save to Google Drive so results persist after disconnect
        self.results_dir = Path("/content/drive/My Drive/ECE685_Results")
        self.results_dir.mkdir(parents=True, exist_ok=True)
        self.results = {}
        self.load_latest_checkpoint()
    
    def load_latest_checkpoint(self):
        """Load the most recent checkpoint if it exists"""
        checkpoint_files = sorted(self.results_dir.glob(f"{self.experiment_name}_*.pkl"))
        if checkpoint_files:
            latest = checkpoint_files[-1]
            with open(latest, 'rb') as f:
                self.results = pickle.load(f)
            print(f"✓ Loaded checkpoint: {latest.name}")
            print(f"  Contains {len(self.results)} result entries")
        else:
            print("ℹ No previous checkpoint found. Starting fresh.")
    
    def save_result(self, key, value):
        """Save a single result"""
        self.results[key] = value
        print(f"  → Saved: {key}")
    
    def save_batch(self, batch_dict):
        """Save multiple results at once"""
        self.results.update(batch_dict)
        print(f"  → Saved {len(batch_dict)} results")
    
    def checkpoint(self):
        """Save current results to Google Drive"""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        checkpoint_path = self.results_dir / f"{self.experiment_name}_{timestamp}.pkl"
        
        with open(checkpoint_path, 'wb') as f:
            pickle.dump(self.results, f)
        
        print(f"✓ Checkpoint saved to Google Drive: {checkpoint_path.name}")
        print(f"  Path: {checkpoint_path}")
        return checkpoint_path
    
    def get_result(self, key, default=None):
        """Retrieve a saved result"""
        return self.results.get(key, default)
    
    def summary(self):
        """Print summary of saved results"""
        print(f"\n📊 Experiment Summary: {self.experiment_name}")
        print(f"{'='*60}")
        if self.results:
            for i, key in enumerate(self.results.keys(), 1):
                print(f"  {i}. {key}")
        else:
            print("  (No results saved yet)")
        print(f"{'='*60}\n")

# Create checkpoint manager for this experiment
checkpoint = ColabExperimentCheckpoint(experiment_name="multilayer_sae")

print("\n✅ Checkpoint system ready!")
print("\nUsage examples:")
print("  checkpoint.save_result('layer_12_features', features)")
print("  checkpoint.save_batch({'key1': val1, 'key2': val2})")
print("  checkpoint.checkpoint()  # Save to Google Drive")
print("  checkpoint.get_result('key')")
print(f"\nResults saved to: {checkpoint.results_dir}")

## 📦 Multi-Layer Configuration

**Key Change**: We now configure TWO layers (12 and 20) instead of just one.

In [None]:
# ============================================================================
# MULTI-LAYER CONFIGURATION
# ============================================================================
from dataclasses import dataclass, field
from pathlib import Path
from typing import Literal, Optional, Dict, Any, List
import json

@dataclass
class MultiLayerConfig:
    """Configuration for multi-layer SAE experiment"""
    
    # Model settings
    gemma_model_name: str = "google/gemma-2-2b-it"
    toxicity_model_name: str = "unitary/unbiased-toxic-roberta"
    dtype: str = "bfloat16"
    
    # ========== MULTI-LAYER SETTINGS ==========
    # Hook at BOTH layers 12 (mid) and 20 (deep)
    hook_layers: List[int] = field(default_factory=lambda: [12, 20])
    
    # SAE settings for EACH layer (Gemma Scope)
    sae_release: str = "gemma-scope-2b-pt-res-canonical"
    sae_ids: Dict[int, str] = field(default_factory=lambda: {
        12: "layer_12/width_16k/canonical",
        20: "layer_20/width_16k/canonical",
    })
    
    # Data settings - smaller for faster experiments
    nq_sample_limit: int = 1000  # Reduced for speed (vs 3610)
    hh_sample_limit: int = 1000  # Reduced for speed (vs 8552)
    rtp_sample_limit: int = 1000  # Reduced for speed
    
    # Batch sizes
    data_batch_size: int = 32  # Smaller batch for multi-layer (more memory)
    steering_batch_size: int = 8
    
    # Steering settings
    steering_samples: int = 100  # Per condition
    steering_strength_ratios: list = field(default_factory=lambda: [0.0, 0.1, 0.2, 0.3])
    
    # Feature discovery
    top_k_features: int = 100
    
    # Experiment
    device: str = "cuda"
    seed: int = 42

CONFIG = MultiLayerConfig()

print("✓ Multi-Layer Config loaded")
print(f"  Model: {CONFIG.gemma_model_name}")
print(f"  Hook Layers: {CONFIG.hook_layers}")
print(f"  SAE IDs:")
for layer, sae_id in CONFIG.sae_ids.items():
    print(f"    Layer {layer}: {sae_id}")
print(f"  Data limits: NQ={CONFIG.nq_sample_limit}, HH={CONFIG.hh_sample_limit}, RTP={CONFIG.rtp_sample_limit}")
print(f"  Combined feature dim: {16384 * len(CONFIG.hook_layers)} (16k × {len(CONFIG.hook_layers)} layers)")

✓ Multi-Layer Config loaded
  Model: google/gemma-2-2b-it
  Hook Layers: [12, 20]
  SAE IDs:
    Layer 12: layer_12/width_16k/canonical
    Layer 20: layer_20/width_16k/canonical
  Data limits: NQ=1000, HH=1000, RTP=1000
  Combined feature dim: 32768 (16k × 2 layers)
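
Because `hook_layers` and `sae_ids` are configured separately, they can drift apart. The sketch below shows one possible consistency check; `LayerConfigSketch` and `missing_sae_layers` are hypothetical names that mirror only the two layer-related fields of the config above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LayerConfigSketch:  # hypothetical: mirrors only the layer fields of MultiLayerConfig
    hook_layers: List[int] = field(default_factory=lambda: [12, 20])
    sae_ids: Dict[int, str] = field(default_factory=lambda: {
        12: "layer_12/width_16k/canonical",
        20: "layer_20/width_16k/canonical",
    })

def missing_sae_layers(cfg):
    """Return hooked layers that have no SAE id, or whose id names another layer."""
    problems = []
    for layer in cfg.hook_layers:
        sae_id = cfg.sae_ids.get(layer)
        if sae_id is None or not sae_id.startswith(f"layer_{layer}/"):
            problems.append(layer)
    return problems

ok = LayerConfigSketch()
broken = LayerConfigSketch(hook_layers=[12, 20, 24])
print(missing_sae_layers(ok))      # []
print(missing_sae_layers(broken))  # [24]
```

Running a check like this before loading any weights surfaces a mismatched layer early, rather than as a `KeyError` partway through SAE loading.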


In [None]:
# ============================================================================
# MULTI-LAYER SAE WRAPPER
# ============================================================================
import torch
from sae_lens import SAE

@dataclass
class SparseAutoencoder:
    """Wrapper for a single Gemma Scope SAE"""
    encoder_weight: torch.Tensor
    decoder_weight: torch.Tensor
    bias: torch.Tensor
    layer: int
    n_features: int

    @classmethod
    def load(cls, layer: int, sae_id: str) -> "SparseAutoencoder":
        """Load pretrained SAE for a specific layer"""
        print(f"  Loading SAE for Layer {layer}: {sae_id}")
        
        sae, cfg_dict, sparsity = SAE.from_pretrained(
            release=CONFIG.sae_release,
            sae_id=sae_id,
            device="cpu",
        )
        
        encoder_w = sae.W_enc.detach().cpu().t()
        decoder_w = sae.W_dec.detach().cpu()
        
        print(f"    ✓ Layer {layer} SAE: {sae.cfg.d_sae} features, d_in={sae.cfg.d_in}")
        
        return cls(
            encoder_weight=encoder_w,
            decoder_weight=decoder_w,
            bias=sae.b_enc.detach().cpu() if hasattr(sae, "b_enc") else torch.zeros(sae.cfg.d_sae),
            layer=layer,
            n_features=sae.cfg.d_sae,
        )

    @torch.inference_mode()
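    def encode(self, residual: torch.Tensor) -> torch.Tensor:
        """Encode residual to sparse codes"""
        residual = residual.float().cpu()
        if residual.dim() == 3:
            residual = residual.reshape(-1, residual.size(-1))
        projected = torch.nn.functional.linear(residual, self.encoder_weight, self.bias)
        # NOTE: Gemma Scope SAEs are JumpReLU models with a learned per-feature
        # threshold. Plain ReLU ignores that threshold, so these codes are denser
        # than the SAE's native activations.
        return torch.nn.functional.relu(projected)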

    @torch.inference_mode()
    def decode(self, codes: torch.Tensor) -> torch.Tensor:
        """Decode codes back to residual space"""
        return torch.nn.functional.linear(codes, self.decoder_weight.t())


class MultiLayerSAE:
    """Manages SAEs for multiple layers and concatenates features"""
    
    def __init__(self, layers: List[int], sae_ids: Dict[int, str]):
        self.layers = sorted(layers)
        self.saes = {}
        
        print("Loading Multi-Layer SAEs...")
        for layer in self.layers:
            self.saes[layer] = SparseAutoencoder.load(layer, sae_ids[layer])
        
        self.total_features = sum(sae.n_features for sae in self.saes.values())
        print(f"✓ Multi-Layer SAE ready: {len(self.layers)} layers, {self.total_features} total features")
    
    def encode_single(self, residuals: Dict[int, torch.Tensor], layer: int) -> torch.Tensor:
        """Encode residual from a single layer"""
        return self.saes[layer].encode(residuals[layer])
    
    def encode_concat(self, residuals: Dict[int, torch.Tensor]) -> torch.Tensor:
        """Encode and concatenate features from all layers"""
        codes_list = []
        for layer in self.layers:
            codes = self.saes[layer].encode(residuals[layer])
            codes_list.append(codes)
        return torch.cat(codes_list, dim=-1)
    
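    def build_steering_vector(self, layer: int, f_plus_ids: list, f_minus_ids: list,
                               plus_weight: float = -1.0, minus_weight: float = 0.5) -> torch.Tensor:
        """Build a unit-norm steering vector for a specific layer.

        Sums the decoder rows of `f_plus_ids` scaled by `plus_weight` and those of
        `f_minus_ids` scaled by `minus_weight`; indices at or beyond `n_features`
        are skipped.
        """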
        sae = self.saes[layer]
        d_in = sae.decoder_weight.shape[1]
        direction = torch.zeros(d_in)
        
        for idx in f_plus_ids:
            if idx < sae.n_features:
                direction += plus_weight * sae.decoder_weight[idx]
        for idx in f_minus_ids:
            if idx < sae.n_features:
                direction += minus_weight * sae.decoder_weight[idx]
        
        if direction.norm() > 0:
            direction = direction / direction.norm()
        return direction

print("✓ Multi-Layer SAE classes defined")

✓ Multi-Layer SAE classes defined
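
To make the steering-vector arithmetic concrete, here is a dependency-free sketch of the same idea on a toy decoder. `build_direction` is a hypothetical stand-in for `build_steering_vector`, using plain lists instead of tensors; unlike the method above, it also skips negative indices.

```python
import math

def build_direction(decoder_rows, plus_ids, minus_ids, plus_weight=-1.0, minus_weight=0.5):
    """Weighted sum of decoder rows, scaled to unit L2 norm (zero vector if nothing is selected)."""
    d_in = len(decoder_rows[0])
    direction = [0.0] * d_in
    for ids, weight in ((plus_ids, plus_weight), (minus_ids, minus_weight)):
        for idx in ids:
            if 0 <= idx < len(decoder_rows):  # skip out-of-range feature ids
                for k in range(d_in):
                    direction[k] += weight * decoder_rows[idx][k]
    norm = math.sqrt(sum(x * x for x in direction))
    return [x / norm for x in direction] if norm > 0 else direction

# Toy decoder: 3 features in a 2-dimensional residual space.
toy_decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec = build_direction(toy_decoder, plus_ids=[0], minus_ids=[1])
print([round(x, 4) for x in vec])  # [-0.8944, 0.4472]
```

With the default weights, feature 0 is pushed in the negative direction and feature 1 in the positive direction. Normalisation means the overall steering magnitude has to be set separately, which is presumably the role of `steering_strength_ratios` in the config.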


In [None]:
# ============================================================================
# MULTI-LAYER GEMMA INTERFACE
# ============================================================================
from contextlib import contextmanager
from transformers import AutoModelForCausalLM, AutoTokenizer

class MultiLayerCapture:
    """Captures residual stream from MULTIPLE layers"""
    
    def __init__(self, layers: List[int]):
        self.layers = layers
        self.residuals = {layer: None for layer in layers}
    
    def make_hook(self, layer: int):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            self.residuals[layer] = hidden.detach()
        return hook
    
    def clear(self):
        self.residuals = {layer: None for layer in self.layers}


class MultiLayerGemma:
    """Gemma with multi-layer activation capture"""
    
    def __init__(self, model_name: str = None):
        model_id = model_name or CONFIG.gemma_model_name
        dtype = getattr(torch, CONFIG.dtype)
        
        print(f"Loading {model_id}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=dtype,
            device_map="auto",
        )
        self.capture = MultiLayerCapture(CONFIG.hook_layers)
        self._handles = []
        print(f"✓ Gemma loaded on {self.model.device}")
        print(f"  Capturing layers: {CONFIG.hook_layers}")
    
    def register_hooks(self):
        """Register hooks for all target layers"""
        self.remove_hooks()
        for layer in self.capture.layers:
            block = self.model.model.layers[layer]
            handle = block.register_forward_hook(self.capture.make_hook(layer))
            self._handles.append(handle)
    
    def remove_hooks(self):
        for handle in self._handles:
            handle.remove()
        self._handles = []
    
    @torch.inference_mode()
    def generate_with_capture(self, prompt: str, max_new_tokens: int = 50) -> dict:
        """Generate text and capture residuals from all layers"""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        
        self.capture.clear()
        self.register_hooks()
        
        try:
            output_ids = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        finally:
            self.remove_hooks()
        
        text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        
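        # The hooks overwrite on every forward pass, so each stored tensor comes from the
        # final decoding step; take its last position (not the last prompt token)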
        residuals = {}
        for layer, res in self.capture.residuals.items():
            if res is not None:
                residuals[layer] = res[:, -1, :].cpu()
        
        return {"text": text, "residuals": residuals}
    
    @torch.inference_mode()
    def generate_batch_with_capture(self, prompts: list, max_new_tokens: int = 50) -> list:
        """Batch generate with multi-layer residual capture"""
        if not prompts:
            return []
        
        inputs = self.tokenizer(
            prompts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(self.model.device)
        
        # Store residuals per layer
        batch_residuals = {layer: [] for layer in self.capture.layers}
        
        def make_batch_hook(layer):
            def hook(module, inp, output):
                hidden = output[0] if isinstance(output, tuple) else output
                batch_residuals[layer].append(hidden.detach().cpu())
            return hook
        
        # Register hooks
        handles = []
        for layer in self.capture.layers:
            block = self.model.model.layers[layer]
            handle = block.register_forward_hook(make_batch_hook(layer))
            handles.append(handle)
        
        try:
            output_ids = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        finally:
            for handle in handles:
                handle.remove()
        
        texts = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        
        # Build results
        results = []
        for i, text in enumerate(texts):
            residuals = {}
            for layer in self.capture.layers:
                if batch_residuals[layer]:
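                    # With KV caching, generate() runs the block once on the full prompt and
                    # then once per new token; entry 0 holds the prompt-wide hidden states.
                    hidden = batch_residuals[layer][0]
                    if i < hidden.shape[0]:
                        # Locate the last real prompt token from the attention mask so the
                        # lookup is correct for both left- and right-padded batches.
                        mask = inputs['attention_mask'][i]
                        last_idx = int(mask.nonzero()[-1])
                        residuals[layer] = hidden[i, last_idx, :].unsqueeze(0)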
            
            results.append({"text": text, "residuals": residuals})
        
        return results

print("✓ MultiLayerGemma class defined")

✓ MultiLayerGemma class defined
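
When prompts of different lengths are batched, the position of the last real prompt token depends on which side the tokenizer pads. The sketch below uses plain lists in place of attention-mask tensors; `last_real_token_index` is a hypothetical helper, not something the notebook defines.

```python
def last_real_token_index(attention_mask_row):
    """Index of the last position whose mask value is 1."""
    for pos in range(len(attention_mask_row) - 1, -1, -1):
        if attention_mask_row[pos] == 1:
            return pos
    raise ValueError("row contains no real tokens")

right_padded = [1, 1, 1, 0, 0]
left_padded = [0, 0, 1, 1, 1]

print(last_real_token_index(right_padded))  # 2
print(last_real_token_index(left_padded))   # 4
print(sum(left_padded) - 1)                 # 2: the count-based index hits the first real token here
```

Counting real tokens and subtracting one only works for right-padded rows. Scanning the mask gives the right position either way, which matters because decoder-only models are usually batched with left padding for generation.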


In [None]:
# ============================================================================
# TOXICITY CLASSIFIER
# ============================================================================
from transformers import AutoModelForSequenceClassification

@dataclass
class ToxicityScore:
    probability: float
    label: int

class ToxicityWrapper:
    def __init__(self, model_name: str = None):
        model_id = model_name or CONFIG.toxicity_model_name
        print(f"Loading toxicity classifier: {model_id}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_id)
        self.model.eval()
        print("✓ Toxicity classifier loaded")

    @torch.inference_mode()
    def score(self, text: str, threshold: float = 0.5) -> ToxicityScore:
        if not text or len(text.strip()) == 0:
            return ToxicityScore(probability=0.0, label=0)
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
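        logits = self.model(**inputs).logits
        # unbiased-toxic-roberta is a multi-label classifier, so each logit gets its
        # own sigmoid. The "toxicity" index is read from the model config and is
        # assumed to be 0 if the label is not listed there.
        tox_idx = self.model.config.label2id.get("toxicity", 0)
        prob = torch.sigmoid(logits)[0, tox_idx].item()
        return ToxicityScore(probability=prob, label=int(prob >= threshold))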

print("✓ ToxicityWrapper class defined")

✓ ToxicityWrapper class defined
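
How the classifier's logits are turned into probabilities depends on whether its labels are mutually exclusive. The sketch below contrasts softmax with per-label sigmoid on made-up logits for three independent labels; the numbers are illustrative and are not outputs of the toxicity model.

```python
import math

def softmax(logits):
    """Normalise logits so they sum to 1 (labels compete with each other)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logits: the first two labels are both strongly positive.
logits = [4.0, 4.0, -3.0]

soft = softmax(logits)
sig = [sigmoid(x) for x in logits]
print([round(p, 3) for p in soft])  # [0.5, 0.5, 0.0]
print([round(p, 3) for p in sig])   # [0.982, 0.982, 0.047]
```

Under softmax, two labels that are both clearly present split the probability mass and neither exceeds a 0.5 threshold by much. Per-label sigmoid scores each one on its own, which is the usual reading for a multi-label head.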


---

# 🚀 Part 1: Load Models

Now we load:
1. **Gemma-2-2B-IT** with multi-layer hooks
2. **SAEs for BOTH layers 12 and 20**
3. **Toxicity Classifier**

In [None]:
# ============================================================================
# LOAD ALL MODELS
# ============================================================================

print("=" * 60)
print("LOADING MODELS (MULTI-LAYER)")
print("=" * 60)

# 1. Load Gemma with multi-layer capture
gemma = MultiLayerGemma()
hidden_size = gemma.model.config.hidden_size
print(f"  Hidden size: {hidden_size}")

# 2. Load SAEs for BOTH layers
print()
multi_sae = MultiLayerSAE(CONFIG.hook_layers, CONFIG.sae_ids)

# 3. Load toxicity classifier
print()
tox = ToxicityWrapper()

print("\n" + "=" * 60)
print("✓ ALL MODELS LOADED!")
print(f"  Gemma: {CONFIG.gemma_model_name}")
print(f"  SAE Layers: {CONFIG.hook_layers}")
print(f"  Total SAE features: {multi_sae.total_features}")
print("=" * 60)

LOADING MODELS (MULTI-LAYER)
Loading google/gemma-2-2b-it...
`torch_dtype` is deprecated! Use `dtype` instead!
✓ Gemma loaded on cpu
  Capturing layers: [12, 20]
  Hidden size: 2304

Loading Multi-Layer SAEs...
  Loading SAE for Layer 12: layer_12/width_16k/canonical
layer_12/width_16k/average_l0_82/params.(…):   0%|          | 0.00/302M [00:00<?, ?B/s]
    ✓ Layer 12 SAE: 16384 features, d_in=2304
  Loading SAE for Layer 20: layer_20/width_16k/canonical
layer_20/width_16k/average_l0_71/params.(…):   0%|          | 0.00/302M [00:00<?, ?B/s]
    ✓ Layer 20 SAE: 16384 features, d_in=2304
✓ Multi-Layer SAE ready: 2 layers, 32768 total features

Loading toxicity classifier: unitary/unbiased-toxic-roberta...
✓ Toxicity classifier loaded

✓ ALL MODELS LOADED!
  Gemma: google/gemma-2-2b-it
  SAE Layers: [12, 20]
  Total SAE features: 32768

## 📊 Load Datasets

In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np
from tqdm import tqdm

print("Loading datasets from HuggingFace...")

# NQ-Open: Question answering with ground truth
nq_dataset = load_dataset("nq_open", split="validation")
print(f"✓ NQ-Open: {len(nq_dataset)} samples")

# RealToxicityPrompts
rtp_dataset = load_dataset("allenai/real-toxicity-prompts", split="train")
print(f"✓ RealToxicityPrompts: {len(rtp_dataset)} samples")

# Anthropic HH: Safe conversations
hh_dataset = load_dataset("Anthropic/hh-rlhf", split="test")
print(f"✓ Anthropic HH: {len(hh_dataset)} samples")

Loading datasets from HuggingFace...
✓ NQ-Open: 3610 samples
✓ RealToxicityPrompts: 99442 samples
✓ Anthropic HH: 8552 samples


## 🔬 Multi-Layer Activation Capture & Labeling

For each prompt, we now:
1. Run through Gemma and capture residuals from **BOTH layers 12 and 20**
2. Encode with respective SAEs
3. Store **BOTH** single-layer codes AND concatenated codes for comparison
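
The encoding step can be sketched without any ML libraries. The example below assumes the usual SAE encoder form (an affine map followed by a nonlinearity) and compares plain ReLU with a JumpReLU-style threshold, the activation Gemma Scope SAEs are trained with. All weights, thresholds, and function names are made up for illustration.

```python
def pre_activations(residual, w_enc, b_enc):
    """Compute residual @ W_enc + b_enc, with W_enc given as d_in rows of d_sae values."""
    return [
        sum(residual[i] * w_enc[i][j] for i in range(len(residual))) + b_enc[j]
        for j in range(len(b_enc))
    ]

def relu_codes(pre):
    """Plain ReLU: keep every positive pre-activation."""
    return [max(0.0, p) for p in pre]

def jumprelu_codes(pre, threshold):
    """JumpReLU-style: keep a pre-activation only where it exceeds that feature's threshold."""
    return [p if p > t else 0.0 for p, t in zip(pre, threshold)]

# Toy SAE: 2-dimensional residual, 3 features.
residual = [1.0, 2.0]
w_enc = [[0.5, -1.0, 0.125], [0.25, 0.5, 0.125]]
b_enc = [0.0, 0.5, 0.0]
threshold = [0.5, 0.2, 0.5]

pre = pre_activations(residual, w_enc, b_enc)
print(relu_codes(pre))                 # [1.0, 0.5, 0.375]
print(jumprelu_codes(pre, threshold))  # [1.0, 0.5, 0.0]
```

The third feature is weakly active under ReLU but is zeroed by its threshold, so thresholded codes are sparser. Whichever variant is used, it should be applied consistently to both layers before the per-layer codes are concatenated.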

In [None]:
# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

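def check_hallucination(model_answer: str, ground_truth_answers: list) -> bool:
    """Heuristically flag a model answer as a hallucination.

    Returns False (not hallucinated) if any ground-truth string contains or is
    contained in the answer, or shares at least min(2, n) words with it, where n
    is the number of distinct words in that ground truth. Answers shorter than
    two words are always flagged, even when they match a ground truth exactly.
    """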
    model_answer_lower = model_answer.lower().strip()
    
    if len(model_answer_lower.split()) < 2:
        return True
    
    for gt in ground_truth_answers:
        gt_lower = gt.lower().strip()
        if gt_lower in model_answer_lower or model_answer_lower in gt_lower:
            return False
        gt_words = set(gt_lower.split())
        answer_words = set(model_answer_lower.split())
        if len(gt_words & answer_words) >= min(2, len(gt_words)):
            return False
    
    return True


def process_nq_multilayer(dataset, limit: int, batch_size: int = None) -> pd.DataFrame:
    """Process NQ-Open with multi-layer SAE encoding"""
    if batch_size is None:
        batch_size = CONFIG.data_batch_size
    
    questions = []
    ground_truths = []
    
    for i, row in enumerate(dataset):
        if i >= limit:
            break
        questions.append(row['question'])
        ground_truths.append(row['answer'])
    
    print(f"  Processing {len(questions)} questions...")
    records = []
    
    for batch_start in tqdm(range(0, len(questions), batch_size), desc="NQ-Open"):
        batch_end = min(batch_start + batch_size, len(questions))
        batch_q = questions[batch_start:batch_end]
        batch_gt = ground_truths[batch_start:batch_end]
        
        results = gemma.generate_batch_with_capture(batch_q, max_new_tokens=50)
        
        for j, (q, gt, res) in enumerate(zip(batch_q, batch_gt, results)):
            if not res['residuals']:
                continue
            
            answer = res['text'].replace(q, '').strip()
            
            # Encode with EACH layer's SAE
            codes_per_layer = {}
            for layer in CONFIG.hook_layers:
                if layer in res['residuals']:
                    codes_per_layer[layer] = multi_sae.encode_single(res['residuals'], layer).squeeze(0).numpy()
            
            # Also get concatenated codes
            codes_concat = multi_sae.encode_concat(res['residuals']).squeeze(0).numpy()
            
            is_halluc = check_hallucination(answer, gt)
            
            records.append({
                'id': batch_start + j,
                'prompt': q,
                'answer': answer,
                'ground_truth': gt,
                'label': int(is_halluc),
                'codes_l12': codes_per_layer.get(12),
                'codes_l20': codes_per_layer.get(20),
                'codes_concat': codes_concat,
            })
    
    df = pd.DataFrame(records)
    n_halluc = df['label'].sum()
    print(f"✓ NQ: {len(df)} samples, {n_halluc} hallucinated ({n_halluc/len(df)*100:.1f}%)")
    return df


def process_toxicity_multilayer(dataset, name: str, limit: int, 
                                 text_field: str = 'prompt',
                                 batch_size: int = None) -> pd.DataFrame:
    """Process toxicity dataset with multi-layer SAE encoding"""
    if batch_size is None:
        batch_size = CONFIG.data_batch_size
    
    prompts = []
    indices = []
    
    for i, row in enumerate(dataset):
        if len(prompts) >= limit:
            break
        
        if text_field == 'prompt' and isinstance(row.get('prompt'), dict):
            prompt = row['prompt'].get('text', '')
        elif text_field == 'chosen':
            text = row['chosen']
            if 'Human:' in text:
                prompt = text.split('Human:')[1].split('Assistant:')[0].strip()
            else:
                prompt = text[:200]
        else:
            prompt = row.get(text_field, '')
        
        if prompt and len(prompt) >= 5:
            prompts.append(prompt[:500])
            indices.append(i)
    
    print(f"  Processing {len(prompts)} prompts...")
    records = []
    
    for batch_start in tqdm(range(0, len(prompts), batch_size), desc=name):
        batch_end = min(batch_start + batch_size, len(prompts))
        batch_p = prompts[batch_start:batch_end]
        batch_idx = indices[batch_start:batch_end]
        
        results = gemma.generate_batch_with_capture(batch_p, max_new_tokens=50)
        
        for j, (p, idx, res) in enumerate(zip(batch_p, batch_idx, results)):
            if not res['residuals']:
                continue
            
            completion = res['text'].replace(p, '').strip()
            
            codes_per_layer = {}
            for layer in CONFIG.hook_layers:
                if layer in res['residuals']:
                    codes_per_layer[layer] = multi_sae.encode_single(res['residuals'], layer).squeeze(0).numpy()
            
            codes_concat = multi_sae.encode_concat(res['residuals']).squeeze(0).numpy()
            
            tox_score = tox.score(completion)
            
            records.append({
                'id': idx,
                'prompt': p,
                'completion': completion,
                'toxicity': tox_score.probability,
                'label': tox_score.label,
                'codes_l12': codes_per_layer.get(12),
                'codes_l20': codes_per_layer.get(20),
                'codes_concat': codes_concat,
            })
    
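    df = pd.DataFrame(records)
    if df.empty:
        # No residuals were captured, so there is no 'label' column to summarize
        print(f"⚠️ {name}: no samples processed")
        return df
    n_toxic = int(df['label'].sum())
    print(f"✓ {name}: {len(df)} samples, {n_toxic} toxic ({n_toxic/len(df)*100:.1f}%)")
    return df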

print("✓ Multi-layer processing functions defined")

In [None]:
# ============================================================================
# PROCESS ALL DATASETS WITH MULTI-LAYER ENCODING
# ============================================================================
import random
random.seed(CONFIG.seed)

print(f"\n{'='*60}")
print("PROCESSING DATASETS (MULTI-LAYER)")
print(f"{'='*60}")
print(f"Layers: {CONFIG.hook_layers}")
print(f"Batch size: {CONFIG.data_batch_size}")

# 1. NQ-Open (Hallucination)
print("\n1/3: Processing NQ-Open...")
nq_df = process_nq_multilayer(nq_dataset, limit=CONFIG.nq_sample_limit)

# 2. Anthropic HH (Safe)
print("\n2/3: Processing Anthropic HH...")
hh_df = process_toxicity_multilayer(hh_dataset, "HH", limit=CONFIG.hh_sample_limit, text_field='chosen')

# 3. RealToxicityPrompts
rtp_indices = list(range(len(rtp_dataset)))
random.shuffle(rtp_indices)
rtp_shuffled = rtp_dataset.select(rtp_indices[:CONFIG.rtp_sample_limit])

print("\n3/3: Processing RTP...")
rtp_df = process_toxicity_multilayer(rtp_shuffled, "RTP", limit=CONFIG.rtp_sample_limit, text_field='prompt')

# Combine for safety task
safety_df = pd.concat([rtp_df, hh_df], ignore_index=True)

print(f"\n{'='*60}")
print("DATA READY!")
print(f"{'='*60}")
print(f"Hallucination: {len(nq_df)} samples")
print(f"Safety: {len(safety_df)} samples (RTP: {len(rtp_df)}, HH: {len(hh_df)})")


---

# 🔍 Part 2: Multi-Layer Feature Discovery & Detection Comparison

This is the **key experiment**: We compare detection performance using:
1. **Layer 12 only** (baseline)
2. **Layer 20 only** (deep layer)
3. **Layers 12+20 concatenated** (multi-layer)

The hypothesis is that combining features from multiple layers should improve detection.
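
The comparison is only meaningful if the selected features never see the test labels. Below is a minimal sketch of correlation-based selection restricted to the training rows. It runs on synthetic data: `select_top_k_on_train` and the random matrix are illustrative stand-ins, not part of this notebook's pipeline.

```python
import numpy as np


def select_top_k_on_train(X_train, y_train, top_k=50):
    """Rank features by |Pearson r| with the labels, using training rows only."""
    X = np.asarray(X_train, dtype=np.float64)
    y = np.asarray(y_train, dtype=np.float64)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = yc @ Xc                                   # covariance numerator per feature
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    valid = denom > 0                               # skip constant features
    score = np.full(X.shape[1], -1.0)
    score[valid] = np.abs(num[valid] / denom[valid])
    k = min(top_k, int(valid.sum()))
    return np.argsort(-score)[:k]


# Synthetic stand-in for SAE codes: 200 samples, 1000 features
rng = np.random.default_rng(0)
X = rng.random((200, 1000))
y = rng.integers(0, 2, size=200)
X[:, 7] += 2.0 * y                                  # plant one informative feature

# Split first, then select features from the training rows only
perm = rng.permutation(len(y))
test_idx, train_idx = perm[:40], perm[40:]
selected = select_top_k_on_train(X[train_idx], y[train_idx], top_k=10)

X_train_sel = X[train_idx][:, selected]
X_test_sel = X[test_idx][:, selected]
```

With the real data, the same split indices would then be reused to fit and score the classifier, so the held-out rows play no part in choosing features.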

In [None]:
# ============================================================================
# FEATURE DISCOVERY & DETECTION FUNCTIONS
# ============================================================================
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report

plt.style.use('seaborn-v0_8-whitegrid')


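def discover_features(codes_matrix: np.ndarray, labels: np.ndarray, 
                       task_name: str, top_k: int = 50):
    """Rank SAE features by absolute Pearson correlation with the labels.

    Returns (top_all, f_plus, f_minus): the top_k feature indices overall, plus the
    positively and negatively correlated ones among them (at most top_k // 2 each).
    Falls back to variance ranking when only one class is present.
    """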
    unique, counts = np.unique(labels, return_counts=True)
    
    if len(unique) < 2:
        print(f"  ⚠️ {task_name}: Only one class, using variance-based selection")
        var = codes_matrix.var(axis=0)
        top_idx = np.argsort(var)[-top_k:][::-1]
        return top_idx, top_idx[:top_k//2], top_idx[top_k//2:]
    
    correlations = []
    for i in range(codes_matrix.shape[1]):
        feat = codes_matrix[:, i]
        if feat.std() > 0:
            corr = np.corrcoef(feat, labels)[0, 1]
            if not np.isnan(corr):
                correlations.append((i, corr))
    
    correlations.sort(key=lambda x: abs(x[1]), reverse=True)
    
    f_plus = [c[0] for c in correlations[:top_k] if c[1] > 0][:top_k//2]
    f_minus = [c[0] for c in correlations[:top_k] if c[1] < 0][:top_k//2]
    top_all = [c[0] for c in correlations[:top_k]]
    
    return top_all, f_plus, f_minus


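def train_and_evaluate(codes_matrix: np.ndarray, labels: np.ndarray, 
                        feature_indices: list, name: str):
    """Train a logistic-regression detector on the selected features and return test metrics.

    Note: if feature_indices were chosen using labels from the full dataset, the
    held-out metrics are optimistically biased, because the selection step has
    already seen the test labels.
    """
    # Use len() rather than truthiness: feature_indices may be a NumPy array
    use_subset = feature_indices is not None and len(feature_indices) > 0
    X = codes_matrix[:, feature_indices] if use_subset else codes_matrix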
    y = labels
    
    unique, counts = np.unique(y, return_counts=True)
    if len(unique) < 2:
        return {'accuracy': 1.0, 'f1': 0.0, 'auroc': 0.5}
    
    min_class = min(counts)
    stratify = y if min_class >= 5 else None
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=CONFIG.seed, stratify=stratify
    )
    
    clf = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=CONFIG.seed)
    clf.fit(X_train, y_train)
    
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]
    
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred, zero_division=0),
        'auroc': roc_auc_score(y_test, y_prob) if len(np.unique(y_test)) > 1 else 0.5
    }
    
    return metrics

print("✓ Detection functions defined")

In [None]:
# ============================================================================
# COMPARE SINGLE-LAYER vs MULTI-LAYER DETECTION
# ============================================================================

print("="*70)
print("LAYER COMPARISON EXPERIMENT")
print("="*70)
print("\nComparing detection performance across layer configurations...")

# Store results
results = {
    'hallucination': {},
    'safety': {}
}

TOP_K = 50  # Features to use

# ==================== HALLUCINATION TASK ====================
print("\n" + "-"*70)
print("HALLUCINATION DETECTION")
print("-"*70)

halluc_labels = nq_df['label'].values

# Layer 12 only
# Mask rows with missing codes so labels stay aligned with the stacked matrix
mask_l12 = np.array([c is not None for c in nq_df['codes_l12']])
codes_l12 = np.stack(list(nq_df.loc[mask_l12, 'codes_l12']))
labels_l12 = halluc_labels[mask_l12]
top_l12, fp_l12, fm_l12 = discover_features(codes_l12, labels_l12, "Halluc-L12", TOP_K)
metrics_l12 = train_and_evaluate(codes_l12, labels_l12, top_l12, "L12")
results['hallucination']['layer_12'] = metrics_l12
print(f"\n  Layer 12:  Acc={metrics_l12['accuracy']:.3f}, F1={metrics_l12['f1']:.3f}, AUC={metrics_l12['auroc']:.3f}")

# Layer 20 only
# Mask rows with missing codes so labels stay aligned with the stacked matrix
mask_l20 = np.array([c is not None for c in nq_df['codes_l20']])
codes_l20 = np.stack(list(nq_df.loc[mask_l20, 'codes_l20']))
labels_l20 = halluc_labels[mask_l20]
top_l20, fp_l20, fm_l20 = discover_features(codes_l20, labels_l20, "Halluc-L20", TOP_K)
metrics_l20 = train_and_evaluate(codes_l20, labels_l20, top_l20, "L20")
results['hallucination']['layer_20'] = metrics_l20
print(f"  Layer 20:  Acc={metrics_l20['accuracy']:.3f}, F1={metrics_l20['f1']:.3f}, AUC={metrics_l20['auroc']:.3f}")

# Multi-layer (concatenated)
# Mask rows with missing codes so labels stay aligned with the stacked matrix
mask_concat = np.array([c is not None for c in nq_df['codes_concat']])
codes_concat = np.stack(list(nq_df.loc[mask_concat, 'codes_concat']))
labels_concat = halluc_labels[mask_concat]
top_concat, fp_concat, fm_concat = discover_features(codes_concat, labels_concat, "Halluc-Multi", TOP_K)
metrics_multi = train_and_evaluate(codes_concat, labels_concat, top_concat, "Multi")
results['hallucination']['multi_layer'] = metrics_multi
print(f"  Multi-Layer: Acc={metrics_multi['accuracy']:.3f}, F1={metrics_multi['f1']:.3f}, AUC={metrics_multi['auroc']:.3f}")

# ==================== SAFETY TASK ====================
print("\n" + "-"*70)
print("SAFETY DETECTION")
print("-"*70)

safety_labels = safety_df['label'].values

# Layer 12 only
# Mask rows with missing codes so labels stay aligned with the stacked matrix
mask_l12_s = np.array([c is not None for c in safety_df['codes_l12']])
codes_l12_s = np.stack(list(safety_df.loc[mask_l12_s, 'codes_l12']))
labels_l12_s = safety_labels[mask_l12_s]
top_l12_s, _, _ = discover_features(codes_l12_s, labels_l12_s, "Safety-L12", TOP_K)
metrics_l12_s = train_and_evaluate(codes_l12_s, labels_l12_s, top_l12_s, "L12")
results['safety']['layer_12'] = metrics_l12_s
print(f"\n  Layer 12:  Acc={metrics_l12_s['accuracy']:.3f}, F1={metrics_l12_s['f1']:.3f}, AUC={metrics_l12_s['auroc']:.3f}")

# Layer 20 only
# Mask rows with missing codes so labels stay aligned with the stacked matrix
mask_l20_s = np.array([c is not None for c in safety_df['codes_l20']])
codes_l20_s = np.stack(list(safety_df.loc[mask_l20_s, 'codes_l20']))
labels_l20_s = safety_labels[mask_l20_s]
top_l20_s, _, _ = discover_features(codes_l20_s, labels_l20_s, "Safety-L20", TOP_K)
metrics_l20_s = train_and_evaluate(codes_l20_s, labels_l20_s, top_l20_s, "L20")
results['safety']['layer_20'] = metrics_l20_s
print(f"  Layer 20:  Acc={metrics_l20_s['accuracy']:.3f}, F1={metrics_l20_s['f1']:.3f}, AUC={metrics_l20_s['auroc']:.3f}")

# Multi-layer
# Mask rows with missing codes so labels stay aligned with the stacked matrix
mask_concat_s = np.array([c is not None for c in safety_df['codes_concat']])
codes_concat_s = np.stack(list(safety_df.loc[mask_concat_s, 'codes_concat']))
labels_concat_s = safety_labels[mask_concat_s]
top_concat_s, _, _ = discover_features(codes_concat_s, labels_concat_s, "Safety-Multi", TOP_K)
metrics_multi_s = train_and_evaluate(codes_concat_s, labels_concat_s, top_concat_s, "Multi")
results['safety']['multi_layer'] = metrics_multi_s
print(f"  Multi-Layer: Acc={metrics_multi_s['accuracy']:.3f}, F1={metrics_multi_s['f1']:.3f}, AUC={metrics_multi_s['auroc']:.3f}")

In [None]:
# ============================================================================
# VISUALIZATION: LAYER COMPARISON
# ============================================================================

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Prepare data for plotting
layers = ['Layer 12', 'Layer 20', 'Multi-Layer']
metrics_names = ['Accuracy', 'F1 Score', 'ROC-AUC']

halluc_data = [
    [results['hallucination']['layer_12']['accuracy'],
     results['hallucination']['layer_20']['accuracy'],
     results['hallucination']['multi_layer']['accuracy']],
    [results['hallucination']['layer_12']['f1'],
     results['hallucination']['layer_20']['f1'],
     results['hallucination']['multi_layer']['f1']],
    [results['hallucination']['layer_12']['auroc'],
     results['hallucination']['layer_20']['auroc'],
     results['hallucination']['multi_layer']['auroc']],
]

safety_data = [
    [results['safety']['layer_12']['accuracy'],
     results['safety']['layer_20']['accuracy'],
     results['safety']['multi_layer']['accuracy']],
    [results['safety']['layer_12']['f1'],
     results['safety']['layer_20']['f1'],
     results['safety']['multi_layer']['f1']],
    [results['safety']['layer_12']['auroc'],
     results['safety']['layer_20']['auroc'],
     results['safety']['multi_layer']['auroc']],
]

x = np.arange(len(layers))
width = 0.35

for idx, (ax, metric_name) in enumerate(zip(axes, metrics_names)):
    bars1 = ax.bar(x - width/2, halluc_data[idx], width, label='Hallucination', color='#3498db', edgecolor='black')
    bars2 = ax.bar(x + width/2, safety_data[idx], width, label='Safety', color='#e74c3c', edgecolor='black')
    
    ax.set_ylabel(metric_name)
    ax.set_title(f'{metric_name} by Layer Configuration')
    ax.set_xticks(x)
    ax.set_xticklabels(layers)
    ax.set_ylim(0, 1.1)
    # Draw the reference line before building the legend so its label is included
    ax.axhline(0.5, color='gray', linestyle='--', alpha=0.5, label='Random')
    ax.legend()
    
    # Add value labels
    for bar in bars1:
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
                f'{bar.get_height():.2f}', ha='center', fontsize=9)
    for bar in bars2:
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                f'{bar.get_height():.2f}', ha='center', fontsize=9)

plt.suptitle('Single-Layer vs Multi-Layer SAE Detection Performance', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('multilayer_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Saved multilayer_comparison.png")

In [None]:
# ============================================================================
# SUMMARY TABLE
# ============================================================================

print("\n" + "="*80)
print("MULTI-LAYER EXPERIMENT SUMMARY")
print("="*80)

print(f"\n{'Configuration':<15} {'Task':<15} {'Accuracy':<12} {'F1 Score':<12} {'ROC-AUC':<12}")
print("-"*66)

for task in ['hallucination', 'safety']:
    for config in ['layer_12', 'layer_20', 'multi_layer']:
        m = results[task][config]
        config_display = config.replace('_', ' ').title()
        print(f"{config_display:<15} {task.title():<15} {m['accuracy']:<12.3f} {m['f1']:<12.3f} {m['auroc']:<12.3f}")
    print()

# Calculate improvements
print("-"*66)
print("MULTI-LAYER IMPROVEMENT vs LAYER 12 BASELINE:")
print("-"*66)

for task in ['hallucination', 'safety']:
    base_auc = results[task]['layer_12']['auroc']
    multi_auc = results[task]['multi_layer']['auroc']
    improvement = (multi_auc - base_auc) / base_auc * 100 if base_auc > 0 else 0
    
    l20_auc = results[task]['layer_20']['auroc']
    l20_improvement = (l20_auc - base_auc) / base_auc * 100 if base_auc > 0 else 0
    
    print(f"\n  {task.title()}:")
    print(f"    Layer 12 (baseline): AUC = {base_auc:.3f}")
    print(f"    Layer 20:            AUC = {l20_auc:.3f} ({l20_improvement:+.1f}%)")
    print(f"    Multi-Layer:         AUC = {multi_auc:.3f} ({improvement:+.1f}%)")

---

# 🎮 Part 3: Multi-Layer Steering (Optional)

Test if steering at multiple layers simultaneously improves results.

**Strategy**: Apply steering vectors at BOTH layer 12 and layer 20.
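
One common way to turn SAE feature indices into a residual-stream direction is to combine the corresponding decoder rows. The sketch below assumes a decoder matrix of shape `(n_features, d_model)` and is not necessarily what `multi_sae.build_steering_vector` does. `steering_vector_from_decoder`, `promote_idx`, and `suppress_idx` are hypothetical names, and the random matrix stands in for real SAE weights.

```python
import numpy as np


def steering_vector_from_decoder(W_dec, promote_idx, suppress_idx):
    """Build a unit-norm direction from SAE decoder rows.

    W_dec: (n_features, d_model) decoder matrix (assumed layout).
    promote_idx / suppress_idx: feature indices to push toward / away from.
    """
    vec = np.zeros(W_dec.shape[1], dtype=np.float64)
    if len(promote_idx) > 0:
        vec += W_dec[promote_idx].mean(axis=0)
    if len(suppress_idx) > 0:
        vec -= W_dec[suppress_idx].mean(axis=0)
    norm = np.linalg.norm(vec)
    # Normalize so the steering strength is controlled entirely by alpha
    return vec / norm if norm > 0 else vec


# Toy decoder: 16 features, 8-dimensional residual stream
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(16, 8))
vec = steering_vector_from_decoder(W_dec, promote_idx=[1, 4], suppress_idx=[2, 9])

alpha = 15.0
shift = alpha * vec  # what a hook would add to the last-position hidden state
```

Normalizing to unit length means the alpha passed to the hook is the size of the shift in residual-stream units, which makes it easier to compare against the typical residual norm.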

In [None]:
# ============================================================================
# MULTI-LAYER STEERING EXPERIMENT
# ============================================================================

# Build steering vectors for each layer
print("Building steering vectors for multi-layer steering...")

# Use discovered F+ and F- features
N_STEER = 25

# For hallucination - we need to map concat indices back to layer-specific
# For simplicity, just use the layer-specific features we discovered earlier
steering_vectors = {
    12: multi_sae.build_steering_vector(12, list(fp_l12[:N_STEER]), list(fm_l12[:N_STEER])),
    20: multi_sae.build_steering_vector(20, list(fp_l20[:N_STEER]), list(fm_l20[:N_STEER])),
}

print(f"✓ Steering vectors built for layers {list(steering_vectors.keys())}")


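def generate_with_multilayer_steering(gemma_model, prompt: str, 
                                       steering_vectors: dict,
                                       alphas: dict,
                                       max_new_tokens: int = 50):
    """Generate text while steering at multiple layers.

    For each layer with alpha > 0, a forward hook adds alpha * steering_vector to the
    last-position hidden state of that layer's output on every forward pass.
    Layers with alpha <= 0 are left unhooked. Hooks are always removed afterwards.
    """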
    inputs = gemma_model.tokenizer(prompt, return_tensors="pt").to(gemma_model.model.device)
    
    handles = []
    
    # Register steering hooks for each layer
    for layer, steer_vec in steering_vectors.items():
        alpha = alphas.get(layer, 0.0)
        if alpha > 0:
            def make_steering_hook(sv, a):
                def hook(module, inp, output):
                    hidden = output[0] if isinstance(output, tuple) else output
                    shifted = hidden.clone()
                    steer = sv.to(shifted.device).to(shifted.dtype) * a
                    shifted[:, -1, :] += steer
                    if isinstance(output, tuple):
                        return (shifted,) + output[1:]
                    return shifted
                return hook
            
            block = gemma_model.model.model.layers[layer]
            handle = block.register_forward_hook(make_steering_hook(steer_vec, alpha))
            handles.append(handle)
    
    try:
        output_ids = gemma_model.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=gemma_model.tokenizer.eos_token_id,
        )
    finally:
        for h in handles:
            h.remove()
    
    return gemma_model.tokenizer.decode(output_ids[0], skip_special_tokens=True)

print("✓ Multi-layer steering function defined")

In [None]:
# ============================================================================
# RUN MULTI-LAYER STEERING COMPARISON
# ============================================================================

print("="*70)
print("MULTI-LAYER STEERING EXPERIMENT")
print("="*70)

# Test on a subset of prompts
test_prompts = nq_df['prompt'].tolist()[:CONFIG.steering_samples]
test_gt = nq_df['ground_truth'].tolist()[:CONFIG.steering_samples]

# Steering conditions to compare
# Use moderate alpha values (10-20% of typical residual norm ~150)
ALPHA_VALUE = 15.0  # Moderate strength

steering_conditions = {
    'baseline': {12: 0.0, 20: 0.0},
    'layer_12_only': {12: ALPHA_VALUE, 20: 0.0},
    'layer_20_only': {12: 0.0, 20: ALPHA_VALUE},
    'both_layers': {12: ALPHA_VALUE, 20: ALPHA_VALUE},
}

steering_results = {}

for condition_name, alphas in steering_conditions.items():
    print(f"\n--- Testing: {condition_name} ---")
    print(f"    Alphas: L12={alphas[12]:.1f}, L20={alphas[20]:.1f}")
    
    halluc_count = 0
    
    for prompt, gt in tqdm(zip(test_prompts, test_gt), total=len(test_prompts), desc=condition_name):
        text = generate_with_multilayer_steering(gemma, prompt, steering_vectors, alphas)
        answer = text.replace(prompt, '').strip()
        
        is_halluc = check_hallucination(answer, gt)
        if is_halluc:
            halluc_count += 1
    
    halluc_rate = halluc_count / len(test_prompts)
    steering_results[condition_name] = {
        'alphas': alphas,
        'halluc_rate': halluc_rate,
        'halluc_count': halluc_count,
        'n_samples': len(test_prompts)
    }
    
    print(f"    Hallucination rate: {halluc_rate*100:.1f}% ({halluc_count}/{len(test_prompts)})")

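print("\n" + "="*70)
print("STEERING RESULTS SUMMARY")
print("="*70)

# Report each condition's rate and its change relative to the unsteered baseline
base_rate = steering_results['baseline']['halluc_rate']
for condition_name, r in steering_results.items():
    delta_pts = (r['halluc_rate'] - base_rate) * 100
    print(f"  {condition_name:<15} {r['halluc_rate']*100:5.1f}%  ({delta_pts:+.1f} pts vs baseline)")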

In [None]:
# ============================================================================
# VISUALIZE STEERING RESULTS
# ============================================================================

fig, ax = plt.subplots(figsize=(10, 6))

conditions = list(steering_results.keys())
halluc_rates = [steering_results[c]['halluc_rate'] * 100 for c in conditions]

colors = ['#95a5a6', '#3498db', '#e74c3c', '#9b59b6']
bars = ax.bar(conditions, halluc_rates, color=colors, edgecolor='black', linewidth=2)

ax.set_ylabel('Hallucination Rate (%)', fontsize=12)
ax.set_title('Multi-Layer Steering: Hallucination Rates', fontsize=14, fontweight='bold')
ax.set_ylim(0, 100)

# Add baseline reference line
baseline_rate = steering_results['baseline']['halluc_rate'] * 100
ax.axhline(baseline_rate, color='red', linestyle='--', alpha=0.7, label=f'Baseline: {baseline_rate:.1f}%')

# Value labels
for bar, rate in zip(bars, halluc_rates):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
            f'{rate:.1f}%', ha='center', fontsize=11, fontweight='bold')

ax.legend()
plt.xticks(rotation=15)
plt.tight_layout()
plt.savefig('multilayer_steering.png', dpi=150, bbox_inches='tight')
plt.show()

print("✓ Saved multilayer_steering.png")

---

# ✅ Multi-Layer Experiment Complete!

## Summary

### Detection Comparison (Single vs Multi-Layer)

| Configuration | Halluc AUC | Safety AUC | Notes |
|---------------|------------|------------|-------|
| Layer 12 only | - | - | Baseline (structural features) |
| Layer 20 only | - | - | Deep layer (semantic features) |
| Multi-Layer (12+20) | - | - | Combined (32k features) |

*Fill in the values from your experimental results above.*
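
Rather than copying numbers by hand, the table rows could be generated from the `results` dictionary built in the comparison cell. A small sketch, assuming that dictionary keeps the `task -> config -> metric` layout used above. `results_to_markdown` is a hypothetical helper, and the zeros in `placeholder_results` are dummy values, not experimental results.

```python
def results_to_markdown(results):
    """Format hallucination and safety AUCs as markdown table rows."""
    rows = [
        ("Layer 12 only", "layer_12"),
        ("Layer 20 only", "layer_20"),
        ("Multi-Layer (12+20)", "multi_layer"),
    ]
    lines = [
        "| Configuration | Halluc AUC | Safety AUC |",
        "|---------------|------------|------------|",
    ]
    for label, key in rows:
        halluc_auc = results["hallucination"][key]["auroc"]
        safety_auc = results["safety"][key]["auroc"]
        lines.append(f"| {label} | {halluc_auc:.3f} | {safety_auc:.3f} |")
    return "\n".join(lines)


# Dummy values only, to show the expected structure of `results`
placeholder_results = {
    task: {cfg: {"auroc": 0.0} for cfg in ("layer_12", "layer_20", "multi_layer")}
    for task in ("hallucination", "safety")
}
table_md = results_to_markdown(placeholder_results)
print(table_md)
```

In the notebook itself, calling `results_to_markdown(results)` after the comparison cell would produce rows that can be pasted into the table above.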

### Key Findings

1. **Layer 20** may capture more abstract semantic concepts than Layer 12
2. **Multi-layer concatenation** potentially provides complementary information
3. **Steering at both layers** may show different effects than single-layer steering
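
To check whether the concatenated detector actually draws on both layers, the selected concatenated indices can be mapped back to per-layer indices. A small sketch, assuming `encode_concat` places layer 12's 16,384 features first and layer 20's second. The ordering is an assumption, and `split_concat_indices` is a hypothetical helper.

```python
def split_concat_indices(indices, layers=(12, 20), features_per_layer=16384):
    """Map indices in the concatenated feature space back to (layer, local index)."""
    per_layer = {layer: [] for layer in layers}
    for idx in indices:
        block, local = divmod(int(idx), features_per_layer)
        per_layer[layers[block]].append(local)
    return per_layer


# Example: one index from the first block, two from the second
mapping = split_concat_indices([5, 16384, 20000])
counts = {layer: len(local) for layer, local in mapping.items()}
```

Applied to `top_concat`, the per-layer counts give a quick read on how many of the top features come from each depth, and the local indices could feed layer-specific steering vectors.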

### Files Generated
- `multilayer_comparison.png` - Detection performance comparison
- `multilayer_steering.png` - Steering experiment results

### For the Report
Add this to your LaTeX Discussion section:

```latex
\subsection{Multi-Layer Analysis}
Following the suggestion to explore sparsity across different depths of the LLM, 
we compared SAE features from layers 12 (middle) and 20 (deep). Our hypothesis 
was that deeper layers encode more abstract semantic concepts that could improve 
detection and steering performance.

Results show that [INSERT YOUR FINDINGS HERE]. When concatenating features from 
both layers (32,768 total features), we observed [INSERT OBSERVATION]. For 
steering, applying interventions at multiple layers simultaneously 
[IMPROVED/DID NOT SIGNIFICANTLY CHANGE] the hallucination rate compared to 
single-layer steering.
```