# Day 8-9: nnsight Setup with Qwen

**Goal:** Learn nnsight fundamentals and verify the setup by training a sentiment probe

**Learning Objectives:**
1. Understand why we're transitioning from TransformerLens to nnsight
2. Load and interact with Qwen2.5-7B-Instruct using nnsight
3. Extract activations from any layer
4. Perform basic interventions (activation patching)
5. Verify setup works by training a sentiment probe

**Timeline:** 6-7 hours

**Environment:** Vast.ai with GPU (16GB+ VRAM recommended)

---

## Why nnsight Instead of TransformerLens?

In Days 3-6, you used **TransformerLens** with GPT-2 to learn probing fundamentals. Now we're switching to **nnsight** for your actual CoT faithfulness research.

**The key difference:**

| Aspect | TransformerLens | nnsight |
|--------|-----------------|--------|
| **Model support** | GPT-2, GPT-Neo, limited models | ANY HuggingFace model |
| **Qwen/Llama/DeepSeek** | NOT SUPPORTED | Fully supported |
| **CoT reasoning models** | Cannot use | Required for this |
| **Best use case** | Learning mech interp basics | Production research on modern models |

**Bottom line:** This course uses TransformerLens only for GPT-2. For CoT faithfulness research on Qwen and other modern reasoning models, use nnsight (or a similar tool that wraps HuggingFace models directly).

GPT-2 doesn't do meaningful chain-of-thought reasoning - it's too small. You need a model like Qwen2.5-7B-Instruct that actually reasons through problems.

---

## Part 1: Environment Setup & Installation

In [None]:
# Install required packages (run this first on a fresh Vast.ai instance)
# Uncomment and run if packages aren't installed

# Quote the version specifiers: an unquoted ">" is treated as a shell redirect
# !pip install "nnsight>=0.3.0"
# !pip install "torch>=2.0.0"
# !pip install "transformers>=4.36.0"
# !pip install accelerate
# !pip install numpy scikit-learn matplotlib pandas

In [None]:
# Verify GPU is available
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("WARNING: No GPU detected! This notebook requires a GPU.")
    print("If on Vast.ai, make sure you selected a GPU instance.")

In [None]:
# Import all required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# nnsight - the key library for this notebook
from nnsight import LanguageModel

print("All imports successful!")

---

## Part 2: Loading Qwen with nnsight

In [None]:
# Load Qwen2.5-7B-Instruct with nnsight
# This will download the model on first run (~15GB)

print("Loading Qwen2.5-7B-Instruct... (this may take a few minutes on first run)")

model = LanguageModel(
    "Qwen/Qwen2.5-7B-Instruct",
    device_map="auto",  # Automatically place on GPU
    torch_dtype=torch.float16  # Use half precision to save memory
)

print("Model loaded successfully!")

In [None]:
# Explore the model architecture
# This is DIFFERENT from TransformerLens - we're working with HuggingFace structure

print("Model Architecture:")
print("=" * 50)
print(model)

In [None]:
# Key model properties for Qwen2.5-7B
# Access the underlying HuggingFace model config

config = model.config
print(f"Model: {config.model_type}")
print(f"Hidden size (d_model): {config.hidden_size}")
print(f"Number of layers: {config.num_hidden_layers}")
print(f"Number of attention heads: {config.num_attention_heads}")
print(f"Vocabulary size: {config.vocab_size}")

### Key Structural Differences from TransformerLens

| TransformerLens (GPT-2) | nnsight (Qwen) |
|------------------------|----------------|
| `model.blocks[i]` | `model.model.layers[i]` |
| `cache["resid_post", layer]` | `model.model.layers[layer].output[0]` |
| 12 layers, 768 hidden | 28 layers, 3584 hidden |
| Automatic caching | Explicit `.save()` required |

The nnsight library wraps the HuggingFace model, so we access layers through `model.model.layers[i]`.

---

## Part 3: Basic Tracing - Saving Activations

The core concept in nnsight is the **trace context**. Inside a `with model.trace()` block:
- Operations create "proxy" objects
- Nothing executes until the context exits
- Use `.save()` to preserve values you want to access later
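
To make the deferred-execution idea concrete, here is a toy sketch in plain Python. It is not how nnsight is implemented; `ToyTrace` and `Proxy` are hypothetical names invented for illustration, and the only point is that a value requested inside the `with` block does not exist until the block exits.

```python
class Proxy:
    """Placeholder for a value that has not been computed yet (toy example)."""

    def __init__(self, trace, fn):
        self.trace = trace
        self.fn = fn        # zero-argument function that produces the real value
        self.value = None   # filled in only when the trace exits

    def __mul__(self, k):
        # Record a new deferred operation instead of computing anything now
        return Proxy(self.trace, lambda: self.fn() * k)

    def save(self):
        # Ask the trace to keep this value after the context exits
        self.trace.to_save.append(self)
        return self


class ToyTrace:
    """Toy context manager that runs recorded operations on exit."""

    def __init__(self, x):
        self.x = x
        self.to_save = []

    def __enter__(self):
        return self

    def input(self):
        return Proxy(self, lambda: self.x)

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            for proxy in self.to_save:
                proxy.value = proxy.fn()  # execution happens here
        return False


with ToyTrace(21) as t:
    doubled = (t.input() * 2).save()
    inside = doubled.value   # still None: nothing has run yet

after = doubled.value        # populated once the context has exited
print(inside, after)
```

The same pattern explains why a value read inside the trace block is not yet a real tensor, and why forgetting `.save()` leaves you with nothing to inspect afterwards.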

In [None]:
# Simplest example: save hidden states at one layer

with model.trace("Hello, world!"):
    # Access layer 14 (middle of the 28 layers)
    # .output[0] gets the hidden states (first element of output tuple)
    hidden_states = model.model.layers[14].output[0].save()

# After the context exits, hidden_states.value contains the actual tensor
print(f"Hidden states shape: {hidden_states.value.shape}")
print(f"Expected: [batch=1, seq_len, hidden_size={config.hidden_size}]")

In [None]:
# Understanding the shape:
# - Batch dimension (1 because we passed a single string)
# - Sequence length (number of tokens in "Hello, world!")
# - Hidden dimension (3584 for Qwen2.5-7B)

# Let's check the tokenization
tokens = model.tokenizer.encode("Hello, world!")
print(f"Tokens: {tokens}")
print(f"Token strings: {[model.tokenizer.decode([t]) for t in tokens]}")
print(f"Number of tokens: {len(tokens)}")

In [None]:
# Save activations at MULTIPLE layers

layers_to_extract = [0, 7, 14, 21, 27]  # Early, mid, late layers
saved_activations = {}

with model.trace("The capital of France is Paris."):
    for layer in layers_to_extract:
        saved_activations[layer] = model.model.layers[layer].output[0].save()

# Check shapes
print("Activations at each layer:")
for layer, acts in saved_activations.items():
    print(f"  Layer {layer:2d}: shape = {acts.value.shape}")

In [None]:
# Helper function: Extract final token activation at specified layer
# This mirrors what we did in TransformerLens

def get_final_token_activation_nnsight(model, text, layer):
    """
    Extract the activation of the final token at a specified layer.
    
    Args:
        model: nnsight LanguageModel
        text: Input string
        layer: Layer number (0 to num_layers-1)
    
    Returns:
        numpy array of shape (hidden_size,)
    """
    with model.trace(text):
        hidden = model.model.layers[layer].output[0].save()
    
    # Get final token's activation: [batch=0, position=-1, :]
    final_act = hidden.value[0, -1, :].cpu().numpy()
    return final_act

# Test it
test_act = get_final_token_activation_nnsight(model, "Test sentence.", layer=14)
print(f"Activation shape: {test_act.shape}")
print(f"Expected: ({config.hidden_size},)")

### Comparison: TransformerLens vs nnsight Activation Extraction

**TransformerLens:**
```python
_, cache = model.run_with_cache(text)
activation = cache["resid_post", layer][0, -1, :].cpu().numpy()
```

**nnsight:**
```python
with model.trace(text):
    hidden = model.model.layers[layer].output[0].save()
activation = hidden.value[0, -1, :].cpu().numpy()
```

Key differences:
1. nnsight requires explicit `.save()` - values aren't automatically cached
2. nnsight uses deferred execution - operations run when context exits
3. nnsight uses HuggingFace model structure (`model.model.layers`)

---

## Part 3.5: Accessing Attention and MLP Components

In Day 5-6, you learned that different components (attention vs MLP) serve different computational roles:
- **Attention:** Routes information between positions ("where should I look?")
- **MLP:** Transforms information within position ("what should I compute?")
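
The distinction can be illustrated with a toy sketch in plain Python. The two functions below are made-up stand-ins (uniform causal averaging for attention, a simple per-position nonlinearity for the MLP), not Qwen's actual computations; they only show which component lets one position influence another.

```python
def toy_attention(xs):
    """Causal uniform attention: position i outputs the mean of positions 0..i."""
    return [sum(xs[: i + 1]) / (i + 1) for i in range(len(xs))]


def toy_mlp(xs):
    """Position-wise transform: each output depends only on its own input."""
    return [max(0.0, 2.0 * x - 1.0) for x in xs]


seq_a = [1.0, 2.0, 3.0]
seq_b = [9.0, 2.0, 3.0]  # only position 0 differs

# Does changing position 0 affect the output at position 2?
attn_changed = toy_attention(seq_a)[2] != toy_attention(seq_b)[2]
mlp_changed = toy_mlp(seq_a)[2] != toy_mlp(seq_b)[2]

print(f"Attention output at position 2 changed: {attn_changed}")
print(f"MLP output at position 2 changed:       {mlp_changed}")
```

Only the attention stand-in propagates the change from position 0 to position 2; the MLP stand-in leaves position 2 untouched.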

Here's how to access these components in nnsight (compared to TransformerLens).

In [None]:
# First, let's explore the structure of a single layer in Qwen
# This shows us what submodules we can access

print("Structure of layer 0:")
print("=" * 50)
for name, module in model.model.layers[0].named_children():
    print(f"  {name}: {type(module).__name__}")

In [None]:
# Access MLP output at a specific layer
# The MLP in Qwen is called "mlp" (in GPT-2/TransformerLens it was also "mlp")

with model.trace("The capital of France is Paris."):
    # Get the MLP output (after the MLP computation, before adding to residual)
    mlp_output = model.model.layers[14].mlp.output.save()
    
    # For comparison, also get the full layer output (residual stream)
    layer_output = model.model.layers[14].output[0].save()

print(f"MLP output shape: {mlp_output.value.shape}")
print(f"Layer output shape: {layer_output.value.shape}")
print(f"\nBoth should have hidden_size={config.hidden_size} as last dimension")

In [None]:
# Access Attention output at a specific layer
# In Qwen, the self-attention module is called "self_attn"

with model.trace("The capital of France is Paris."):
    # Get the attention output (after attention computation, before adding to residual)
    # Note: attention output is typically a tuple, we want the first element
    attn_output = model.model.layers[14].self_attn.output[0].save()

print(f"Attention output shape: {attn_output.value.shape}")
print(f"Expected: [batch, seq_len, hidden_size={config.hidden_size}]")

In [None]:
# Access Attention PATTERNS (the attention weights after softmax)
# This requires accessing internal attention computation
# We need to enable output_attentions in the model config

# Method 1: Access attention weights by modifying the forward call
# This is more involved - we need to look at what the attention module outputs

# Let's check what the self_attn module returns
with model.trace("Hello world"):
    # The full output of self_attn - let's see what's there
    full_attn_output = model.model.layers[14].self_attn.output.save()

print(f"Attention module output type: {type(full_attn_output.value)}")
if isinstance(full_attn_output.value, tuple):
    print(f"Number of elements in tuple: {len(full_attn_output.value)}")
    for i, elem in enumerate(full_attn_output.value):
        if elem is not None:
            print(f"  Element {i}: shape = {elem.shape}")

In [None]:
# To get attention patterns (weights), we need to enable output_attentions
# This makes the model return attention weights as part of its output

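# Temporarily modify config to output attentions
# Note: fused attention kernels (SDPA, FlashAttention) often do not return
# weights. If the weights come back as None, try reloading the model with
# attn_implementation="eager" (a HuggingFace from_pretrained option).
original_output_attentions = model.config.output_attentions
model.config.output_attentions = True

try:
    with model.trace("The capital of France is Paris."):
        # Now attention weights should be in the output
        attn_with_weights = model.model.layers[14].self_attn.output.save()
finally:
    # Restore original setting even if the trace raises
    model.config.output_attentions = original_output_attentions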

# Check what we got
print(f"Output type: {type(attn_with_weights.value)}")
if isinstance(attn_with_weights.value, tuple):
    print(f"Number of elements: {len(attn_with_weights.value)}")
    for i, elem in enumerate(attn_with_weights.value):
        if elem is not None:
            print(f"  Element {i}: shape = {elem.shape}")
            if len(elem.shape) == 4:  # [batch, num_heads, seq, seq]
                print(f"    -> This is likely the attention pattern!")

In [None]:
# Helper functions for extracting different components

def get_mlp_output(model, text, layer, position='last'):
    """
    Extract MLP output at a specific layer.
    
    Args:
        model: nnsight LanguageModel
        text: Input string
        layer: Layer number
        position: 'last', 'first', 'mean', or int
    
    Returns:
        numpy array of shape (hidden_size,)
    """
    with model.trace(text):
        mlp_out = model.model.layers[layer].mlp.output.save()
    
    acts = mlp_out.value[0]  # Remove batch dim
    
    if position == 'last':
        return acts[-1, :].cpu().numpy()
    elif position == 'first':
        return acts[0, :].cpu().numpy()
    elif position == 'mean':
        return acts.mean(dim=0).cpu().numpy()
    elif isinstance(position, int):
        return acts[position, :].cpu().numpy()
    else:
        raise ValueError(f"Unknown position: {position}")


def get_attention_output(model, text, layer, position='last'):
    """
    Extract attention output at a specific layer.
    
    Args:
        model: nnsight LanguageModel
        text: Input string  
        layer: Layer number
        position: 'last', 'first', 'mean', or int
    
    Returns:
        numpy array of shape (hidden_size,)
    """
    with model.trace(text):
        attn_out = model.model.layers[layer].self_attn.output[0].save()
    
    acts = attn_out.value[0]  # Remove batch dim
    
    if position == 'last':
        return acts[-1, :].cpu().numpy()
    elif position == 'first':
        return acts[0, :].cpu().numpy()
    elif position == 'mean':
        return acts.mean(dim=0).cpu().numpy()
    elif isinstance(position, int):
        return acts[position, :].cpu().numpy()
    else:
        raise ValueError(f"Unknown position: {position}")


def get_attention_patterns(model, text, layer):
    """
    Extract attention patterns (weights after softmax) at a specific layer.
    
    Args:
        model: nnsight LanguageModel
        text: Input string
        layer: Layer number
    
    Returns:
        numpy array of shape (num_heads, seq_len, seq_len)
    """
    # Enable attention output temporarily; restore it even if the trace fails
    original = model.config.output_attentions
    model.config.output_attentions = True
    try:
        with model.trace(text):
            attn_out = model.model.layers[layer].self_attn.output.save()
    finally:
        model.config.output_attentions = original
    
    # Attention weights are typically the second element of the tuple
    # Shape: [batch, num_heads, seq_len, seq_len]
    # (this element may be None when a fused attention kernel is in use)
    if isinstance(attn_out.value, tuple) and len(attn_out.value) > 1:
        patterns = attn_out.value[1]
        if patterns is not None:
            return patterns[0].cpu().numpy()  # Remove batch dim
    
    return None


# Test the helper functions
print("Testing component extraction functions:")
print("=" * 50)

test_mlp = get_mlp_output(model, "Test sentence.", layer=14, position='last')
print(f"MLP output shape: {test_mlp.shape}")

test_attn = get_attention_output(model, "Test sentence.", layer=14, position='last')
print(f"Attention output shape: {test_attn.shape}")

test_patterns = get_attention_patterns(model, "Test sentence.", layer=14)
if test_patterns is not None:
    print(f"Attention patterns shape: {test_patterns.shape}")
    print(f"  -> [num_heads={test_patterns.shape[0]}, seq_len={test_patterns.shape[1]}, seq_len={test_patterns.shape[2]}]")
else:
    print("Attention patterns: Not available (model may not support output_attentions)")

### Component Access Comparison: TransformerLens vs nnsight

| Component | TransformerLens | nnsight (Qwen) |
|-----------|-----------------|----------------|
| **Residual stream (layer output)** | `cache["resid_post", layer]` | `model.model.layers[layer].output[0]` |
| **MLP output** | `cache["blocks.{layer}.hook_mlp_out"]` | `model.model.layers[layer].mlp.output` |
| **Attention output** | `cache["attn_out", layer]` | `model.model.layers[layer].self_attn.output[0]` |
| **Attention patterns** | `cache["blocks.{layer}.attn.hook_pattern"]` | Enable `output_attentions=True`, then `self_attn.output[1]` |
| **Individual head outputs** | `cache["blocks.{layer}.attn.hook_z"]` | Requires accessing internal attention computation |
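
If you prefer to keep these paths in one place, a small lookup helper is one option. The sketch below is hypothetical (`COMPONENT_PATHS` and `nnsight_path` are not part of nnsight); it only returns the attribute path from the table as a string, which you would then type inside a trace block.

```python
# Hypothetical lookup mirroring the table above. The values are attribute
# paths written as strings, not live nnsight objects.
COMPONENT_PATHS = {
    "resid": "model.model.layers[{layer}].output[0]",
    "mlp": "model.model.layers[{layer}].mlp.output",
    "attn": "model.model.layers[{layer}].self_attn.output[0]",
}


def nnsight_path(component, layer):
    """Return the nnsight attribute path for a component at a given layer."""
    if component not in COMPONENT_PATHS:
        raise ValueError(f"Unknown component: {component}")
    return COMPONENT_PATHS[component].format(layer=layer)


print(nnsight_path("mlp", 14))
print(nnsight_path("attn", 14))
```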

**Key insight from Day 5-6:** You found that middle-layer attention heads outperformed residual stream probes for sentiment. With these extraction functions, you can test if the same pattern holds for Qwen and for faithfulness detection.

In [None]:
# Quick comparison: Residual vs MLP vs Attention at the same layer
# This mirrors the analysis you did in Day 5-6

test_text = "I love this movie!"
layer = 14

# Extract all three components
with model.trace(test_text):
    residual = model.model.layers[layer].output[0].save()
    mlp = model.model.layers[layer].mlp.output.save()
    attn = model.model.layers[layer].self_attn.output[0].save()

# Get final token activations
# (cast to float32 first so the norms below cannot overflow in float16)
residual_act = residual.value[0, -1, :].float().cpu().numpy()
mlp_act = mlp.value[0, -1, :].float().cpu().numpy()
attn_act = attn.value[0, -1, :].float().cpu().numpy()

print(f"Component activations at layer {layer}, final token:")
print(f"  Residual stream: norm = {np.linalg.norm(residual_act):.2f}")
print(f"  MLP output:      norm = {np.linalg.norm(mlp_act):.2f}")
print(f"  Attention output: norm = {np.linalg.norm(attn_act):.2f}")

# Note: residual ≈ previous_residual + attention + mlp
# So residual has accumulated information from both components
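
The additive structure in that note can be checked on a toy block. The sketch below uses plain Python lists and made-up stand-ins for the attention and MLP sublayers; it omits the layer norms a real Qwen block applies before each sublayer, so it illustrates the bookkeeping only, not the model's arithmetic.

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def toy_attn(h):
    """Made-up stand-in for the attention sublayer."""
    return [0.5 * x for x in h]


def toy_mlp(h):
    """Made-up stand-in for the MLP sublayer."""
    return [x * x for x in h]


resid_in = [1.0, -2.0, 0.5]

attn_out = toy_attn(resid_in)
resid_mid = add(resid_in, attn_out)    # residual after the attention sublayer
mlp_out = toy_mlp(resid_mid)           # MLP reads the updated residual
resid_out = add(resid_mid, mlp_out)    # residual after the full block

# The block output is the input plus the two sublayer contributions
reconstructed = add(add(resid_in, attn_out), mlp_out)
print(resid_out == reconstructed)
```

In the real model the same identity should hold only approximately, since float16 rounding adds small differences.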

---

## Part 4: Generation with Activation Access

For CoT faithfulness research, you need to:
1. Generate reasoning (multi-token output)
2. Access activations during that generation

nnsight uses `.generate()` for this.
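
One detail worth keeping in mind when reading activation shapes: HuggingFace generation uses a key/value cache by default, so after the first forward pass each layer processes only the newest token. The sketch below is plain Python with a hypothetical helper, `step_seq_lens`; it makes no nnsight calls, assumes generation does not stop early, and only counts positions.

```python
def step_seq_lens(prompt_len, max_new_tokens):
    """Sequence length each layer sees on every forward pass with KV caching."""
    if max_new_tokens < 1:
        return []
    # The first pass processes the whole prompt; each later pass processes
    # only the single token produced by the previous pass.
    return [prompt_len] + [1] * (max_new_tokens - 1)


lens = step_seq_lens(prompt_len=16, max_new_tokens=5)
print(f"Per-pass sequence lengths: {lens}")
print(f"Total positions with activations: {sum(lens)}")
```

Under these assumptions, an activation saved from a single forward pass covers either the prompt or one generated token, not the whole sequence, and the final generated token never passes through the layers at all.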

In [None]:
# Generate text with activation access

prompt = "What is 17 * 23? Let me think step by step."

with model.generate(max_new_tokens=150) as generator:
    with generator.invoke(prompt):
        # Save activations at layer 14 during generation
        hidden_during_gen = model.model.layers[14].output[0].save()

# Decode the generated output
output_tokens = generator.output[0]
output_text = model.tokenizer.decode(output_tokens)

print("Generated text:")
print("=" * 50)
print(output_text)
print("=" * 50)
print(f"\nActivations shape: {hidden_during_gen.value.shape}")

In [None]:
# Understanding the activation shape during generation:
# The shape is [batch, seq_len, hidden_size].
# Caveat: with KV caching, only the first forward pass sees the full prompt and
# each later pass sees one new token, so a single .save() may hold only the
# prompt positions. Compare the numbers below to see what was captured.

prompt_tokens = model.tokenizer.encode(prompt)
print(f"Prompt tokens: {len(prompt_tokens)}")
print(f"Sequence length in saved activations: {hidden_during_gen.value.shape[1]}")
print(f"Positions beyond the prompt: {hidden_during_gen.value.shape[1] - len(prompt_tokens)}")

In [None]:
# Map tokens to their activations
# This is important for position analysis in faithfulness research

all_token_ids = output_tokens.tolist()
all_tokens = [model.tokenizer.decode([t]) for t in all_token_ids]

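# Only index positions that actually have saved activations
n_show = min(20, len(all_token_ids), hidden_during_gen.value.shape[1])
print(f"Token-Activation mapping (first {n_show} positions):")
print("-" * 60)
for i, (tok_id, tok_str) in enumerate(zip(all_token_ids[:n_show], all_tokens[:n_show])):
    act_norm = hidden_during_gen.value[0, i, :].float().norm().item()
    print(f"Pos {i:3d}: '{tok_str:15s}' | Activation norm: {act_norm:.2f}")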

---

## Part 5: Intervention Basics

Beyond reading activations, nnsight lets you **modify** them during forward pass. This is activation patching.
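
Before applying this to Qwen, the idea can be seen on a toy two-layer "model" in plain Python. Everything here is made up for illustration (`layer1`, `layer2`, and the `patch` argument are hypothetical); the sketch shows the two interventions used in this part: zeroing a position and copying an activation from a source run into a target run.

```python
def layer1(x):
    """Toy first layer: doubles every position."""
    return [2 * v for v in x]


def layer2(h):
    """Toy second layer: sums over positions to produce one output."""
    return sum(h)


def run(x, patch=None):
    """Run the toy model, optionally editing the intermediate activation."""
    h = layer1(x)
    if patch is not None:
        h = patch(h)  # the intervention happens between the two layers
    return layer2(h), h


source_out, source_h = run([1, 2, 3])
target_out, _ = run([4, 5, 6])

# Activation patching: replace the target's activation with the source's
patched_out, _ = run([4, 5, 6], patch=lambda h: source_h)

# Ablation: zero out the last position only
zeroed_out, _ = run([4, 5, 6], patch=lambda h: h[:-1] + [0])

print(source_out, target_out, patched_out, zeroed_out)
```

Here the patched target run reproduces the source output, which is the signature of an activation that fully determines the downstream result. In a real model the effect is usually partial, which is what makes patching informative.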

In [None]:
# Simple intervention: zero out the last token's activation at layer 10

prompt = "The capital of France is"

# First, get baseline output
with model.generate(max_new_tokens=5) as generator:
    with generator.invoke(prompt):
        pass  # No intervention

baseline_output = model.tokenizer.decode(generator.output[0])
print(f"Baseline: {baseline_output}")

# Now with intervention
with model.generate(max_new_tokens=5) as generator:
    with generator.invoke(prompt):
        # Zero out the last position at layer 10
        model.model.layers[10].output[0][:, -1, :] = 0

intervened_output = model.tokenizer.decode(generator.output[0])
print(f"Intervened (zero layer 10): {intervened_output}")

In [None]:
# Cross-prompt intervention (activation patching)
# Extract activation from one prompt, inject into another

source_prompt = "The capital of Germany is"
target_prompt = "The capital of France is"

# Step 1: Extract activation from source
with model.trace(source_prompt):
    source_activation = model.model.layers[14].output[0].save()

print(f"Source activation shape: {source_activation.value.shape}")

# Step 2: Inject into target and generate
with model.generate(max_new_tokens=5) as generator:
    with generator.invoke(target_prompt):
        # Replace layer 14 activation with source
        # (both prompts must tokenize to the same length for the shapes to match)
        model.model.layers[14].output[0][:, :, :] = source_activation.value

patched_output = model.tokenizer.decode(generator.output[0])
print(f"\nTarget prompt: '{target_prompt}'")
print(f"Patched with source: '{source_prompt}'")
print(f"Output: {patched_output}")

### Comparison: TransformerLens vs nnsight Interventions

**TransformerLens (hook-based):**
```python
def hook_fn(activation, hook):
    activation[:, -1, :] = 0
    return activation

model.run_with_hooks(prompt, fwd_hooks=[("blocks.10.hook_resid_post", hook_fn)])
```

**nnsight (assignment-based):**
```python
with model.trace(prompt):
    model.model.layers[10].output[0][:, -1, :] = 0
```

nnsight's approach is more intuitive - you just assign to the values you want to change.

---

## Part 6: Building Probe-Compatible Extraction Class

Let's build a reusable class that mirrors the ProbeToolkit from TransformerLens days.

In [None]:
class NNsightActivationExtractor:
    """
    Reusable toolkit for extracting activations from nnsight models.
    Mirrors the ProbeToolkit from TransformerLens exercises.
    """
    
    def __init__(self, model):
        self.model = model
        self.config = model.config
        self.num_layers = model.config.num_hidden_layers
        self.hidden_size = model.config.hidden_size
    
    def get_layer_activation(self, text, layer, position='last'):
        """
        Extract activation at a specific layer and position.
        
        Args:
            text: Input string
            layer: Layer number (0 to num_layers-1)
            position: 'last', 'first', 'mean', or int for specific position
        
        Returns:
            numpy array of shape (hidden_size,)
        """
        with self.model.trace(text):
            hidden = self.model.model.layers[layer].output[0].save()
        
        acts = hidden.value[0]  # Remove batch dimension
        
        if position == 'last':
            return acts[-1, :].cpu().numpy()
        elif position == 'first':
            return acts[0, :].cpu().numpy()
        elif position == 'mean':
            return acts.mean(dim=0).cpu().numpy()
        elif isinstance(position, int):
            return acts[position, :].cpu().numpy()
        else:
            raise ValueError(f"Unknown position: {position}")
    
    def get_batch_activations(self, texts, layer, position='last'):
        """
        Extract activations for multiple texts.
        
        Args:
            texts: List of input strings
            layer: Layer number
            position: Position strategy
        
        Returns:
            numpy array of shape (n_texts, hidden_size)
        """
        activations = []
        for text in texts:
            act = self.get_layer_activation(text, layer, position)
            activations.append(act)
        return np.array(activations)
    
    def compare_layers(self, texts_pos, texts_neg, layers, position='last'):
        """
        Compare probe performance across multiple layers.
        
        Args:
            texts_pos: List of positive class texts
            texts_neg: List of negative class texts
            layers: List of layer numbers to test
            position: Position strategy
        
        Returns:
            DataFrame with layer, train_acc, test_acc
        """
        results = []
        
        for layer in layers:
            print(f"Testing layer {layer}...")
            
            # Extract activations
            X_pos = self.get_batch_activations(texts_pos, layer, position)
            X_neg = self.get_batch_activations(texts_neg, layer, position)
            
            # Combine into dataset
            X = np.vstack([X_pos, X_neg])
            y = np.array([1] * len(texts_pos) + [0] * len(texts_neg))
            
            # Train/test split
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.3, random_state=42, stratify=y
            )
            
            # Train probe
            probe = LogisticRegression(max_iter=1000, random_state=42)
            probe.fit(X_train, y_train)
            
            train_acc = probe.score(X_train, y_train)
            test_acc = probe.score(X_test, y_test)
            
            results.append({
                'layer': layer,
                'train_acc': train_acc,
                'test_acc': test_acc,
                'gap': train_acc - test_acc
            })
        
        return pd.DataFrame(results)

# Create extractor instance
extractor = NNsightActivationExtractor(model)
print(f"Extractor ready for model with {extractor.num_layers} layers, {extractor.hidden_size} hidden dim")

In [None]:
# Test the extractor
test_act = extractor.get_layer_activation("Test sentence.", layer=14, position='last')
print(f"Single activation shape: {test_act.shape}")

test_batch = extractor.get_batch_activations(
    ["First test.", "Second test.", "Third test."],
    layer=14,
    position='last'
)
print(f"Batch activation shape: {test_batch.shape}")

---

## Part 7: Full Probe Verification

Let's verify our setup by training the same sentiment probe from Day 3-4 on Qwen activations.
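
As a reminder of what a linear probe does, here is a toy sketch in plain Python. It swaps in a difference-of-means classifier for the `LogisticRegression` probe used in this notebook, and the 2-D points are invented stand-ins for activations, so it shows the idea only.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_diff_probe(X_pos, X_neg):
    """Fit a linear probe whose direction is the difference of class means."""
    mu_pos, mu_neg = mean_vec(X_pos), mean_vec(X_neg)
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    # Place the decision boundary halfway between the two class means
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def predict(w, b, x):
    """Return 1 (positive) if x falls on the positive side of the boundary."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Made-up 2-D "activations": the first coordinate carries the label signal
toy_pos = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0]]
toy_neg = [[-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.3]]

w, b = fit_mean_diff_probe(toy_pos, toy_neg)
preds = [predict(w, b, x) for x in toy_pos + toy_neg]
labels = [1] * len(toy_pos) + [0] * len(toy_neg)
toy_acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"Toy probe accuracy: {toy_acc:.2f}")
```

A probe succeeds when the label is linearly readable from the activations, which is the property the layer comparison in this part measures on Qwen.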

In [None]:
# Same sentiment dataset from Day 3-4
positive_sentences = [
    "I love this movie!",
    "This is amazing and wonderful!",
    "Great job, fantastic work!",
    "I absolutely loved every minute of that film.",
    "This is the best coffee I've ever tasted.",
    "She was so kind and helpful throughout the process.",
    "What a beautiful day to be outside.",
    "I'm thrilled with how the project turned out.",
    "The team did an outstanding job on this.",
    "I can't wait to visit again next year.",
    "This restaurant exceeded all my expectations.",
    "He's such a talented and generous person.",
    "I feel incredibly grateful for this opportunity.",
    "The service here is always fantastic.",
    "That was the most fun I've had in ages.",
    "I'm so proud of what we accomplished together.",
    "This book changed my perspective completely.",
    "The sunset tonight was absolutely stunning.",
    "I've never felt more welcomed anywhere.",
    "Everything about this experience was delightful.",
    "She gave the most inspiring speech I've ever heard.",
    "I'm genuinely excited about what's next.",
    "This made my whole week better.",
]

negative_sentences = [
    "I hate this movie.",
    "This is terrible and awful.",
    "Poor job, disappointing work.",
    "I was deeply disappointed by the outcome.",
    "The food was cold and tasteless.",
    "This has been the worst customer service experience.",
    "I regret wasting my time on this.",
    "The whole event was a complete disaster.",
    "I'm frustrated with how poorly this was handled.",
    "Nothing about this met my expectations.",
    "The quality has really gone downhill.",
    "I felt completely ignored the entire time.",
    "This product broke after just one use.",
    "What a miserable waste of money.",
    "I've never been so let down by a company.",
    "The atmosphere was unpleasant and unwelcoming.",
    "I'm annoyed that nobody bothered to help.",
    "This ruined what should have been a good day.",
    "The wait was unbearable and unnecessary.",
    "I found the whole thing incredibly tedious.",
    "They clearly don't care about their customers.",
    "I'm upset that this turned out so badly.",
    "Everything that could go wrong did."
]

print(f"Dataset: {len(positive_sentences)} positive, {len(negative_sentences)} negative")

In [None]:
# Compare probes across layers
# Qwen has 28 layers, so we test: early (0, 7), middle (14), late (21, 27)

layers_to_test = [0, 7, 14, 21, 27]

print("Running layer comparison... (this will take a few minutes)\n")
results = extractor.compare_layers(
    positive_sentences, 
    negative_sentences, 
    layers_to_test,
    position='last'
)

print("\n" + "=" * 60)
print("RESULTS: Sentiment Detection by Layer (Qwen2.5-7B)")
print("=" * 60)
print(results.to_string(index=False))

In [None]:
# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy by layer
ax1.plot(results['layer'], results['train_acc'], marker='o', label='Train', linewidth=2)
ax1.plot(results['layer'], results['test_acc'], marker='s', label='Test', linewidth=2)
ax1.set_xlabel('Layer', fontsize=12)
ax1.set_ylabel('Accuracy', fontsize=12)
ax1.set_title('Sentiment Detection Accuracy by Layer (Qwen2.5-7B)', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0.4, 1.05])

# Plot 2: Train/test gap
ax2.bar(results['layer'], results['gap'], color='coral', alpha=0.7)
ax2.axhline(y=0.15, color='red', linestyle='--', label='15% threshold')
ax2.set_xlabel('Layer', fontsize=12)
ax2.set_ylabel('Train - Test Accuracy', fontsize=12)
ax2.set_title('Overfitting Analysis', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('qwen_sentiment_probe_layer_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

best_layer = results.loc[results['test_acc'].idxmax(), 'layer']
best_acc = results.loc[results['test_acc'].idxmax(), 'test_acc']
print(f"\nBest performing layer: {best_layer} with {best_acc:.2%} test accuracy")

In [None]:
# Generalization test with different distribution
different_positive = [
    "The mentorship programme completely changed my career trajectory.",
    "I've never tasted homemade pasta this good before.",
    "Seeing my daughter graduate was the proudest moment of my life.",
    "This neighbourhood has such a wonderful sense of community.",
    "The therapist really helped me work through my anxiety.",
    "I'm blown away by how talented this band is.",
    "The renovation turned out even better than we imagined.",
    "Volunteering there has been incredibly rewarding.",
    "I finally finished the marathon and it felt amazing.",
    "The customer support team solved everything in one call.",
]

different_negative = [
    "The dentist appointment was as painful as I feared.",
    "I'm gutted that the concert got cancelled last minute.",
    "This laptop overheats constantly and crashes without warning.",
    "The landlord refuses to fix anything in this flat.",
    "I felt completely blindsided by their decision.",
    "The commute is draining the life out of me.",
    "I'm heartbroken that the relationship ended this way.",
    "The sequel completely failed to capture the original's magic.",
    "I've been struggling to sleep properly for weeks.",
    "The interview went terribly and I know I won't get the job.",
]

print(f"Different distribution: {len(different_positive)} positive, {len(different_negative)} negative")

In [None]:
# Train probe on original, test on different distribution
# Use the best layer from previous analysis

best_layer_idx = int(results.loc[results['test_acc'].idxmax(), 'layer'])
print(f"Using best layer: {best_layer_idx}")

# Get training data activations
X_pos_train = extractor.get_batch_activations(positive_sentences, best_layer_idx)
X_neg_train = extractor.get_batch_activations(negative_sentences, best_layer_idx)
X_train = np.vstack([X_pos_train, X_neg_train])
y_train = np.array([1] * len(positive_sentences) + [0] * len(negative_sentences))

# Train probe
probe = LogisticRegression(max_iter=1000, random_state=42)
probe.fit(X_train, y_train)
train_acc = probe.score(X_train, y_train)

# Get different distribution activations
X_pos_diff = extractor.get_batch_activations(different_positive, best_layer_idx)
X_neg_diff = extractor.get_batch_activations(different_negative, best_layer_idx)
X_diff = np.vstack([X_pos_diff, X_neg_diff])
y_diff = np.array([1] * len(different_positive) + [0] * len(different_negative))

# Test on different distribution
diff_acc = probe.score(X_diff, y_diff)

print(f"\n=== Generalization Test (Qwen2.5-7B) ===")
print(f"Training accuracy: {train_acc:.2%}")
print(f"Different distribution accuracy: {diff_acc:.2%}")
print(f"Accuracy change: {(diff_acc - train_acc):.2%}")

if diff_acc > 0.7:
    print("\n✓ Probe generalizes well! Setup verified.")
else:
    print("\n⚠ Probe generalization is weak - may need investigation.")

### Comparison: GPT-2 (TransformerLens) vs Qwen (nnsight)

Record your results here:

| Metric | GPT-2 (Day 3-4) | Qwen2.5-7B (Today) |
|--------|-----------------|--------------------|
| Best layer | ___ | ___ |
| Best test accuracy | ___% | ___% |
| Generalization accuracy | ___% | ___% |

**Observations:**
- 
- 
- 

---

## Part 8: Troubleshooting & Success Criteria

### Common Issues

**1. Out of Memory (OOM) Errors**
- Reduce batch size (process fewer sentences at once)
- Ensure using `torch_dtype=torch.float16`
- Clear cache between operations: `torch.cuda.empty_cache()`
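
A rough way to see why half precision matters is to estimate the weight footprint. The sketch below assumes roughly 7.6 billion parameters for Qwen2.5-7B (an approximate figure, not read from the model) and ignores activations, the KV cache, and framework overhead, so treat the results as lower bounds.

```python
# APPROX_PARAMS is an assumed, approximate parameter count for Qwen2.5-7B;
# check the model card for the exact value.
APPROX_PARAMS = 7.6e9

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(n_params, dtype):
    """Return the memory needed for the weights alone, in decimal GB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(APPROX_PARAMS, dtype):.1f} GB for weights")
```

Under these assumptions, float16 weights come to about 15 GB, which leaves little headroom on a 16 GB card once activations are added.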

**2. Slow Inference**
- Check that model is on GPU: `next(model.parameters()).device`
- Use `device_map="auto"` for optimal placement
- Consider using flash attention if available

**3. Shape Mismatches**
- Qwen has 28 layers (vs GPT-2's 12)
- Qwen has 3584 hidden size (vs GPT-2's 768)
- Always check `model.config` for correct dimensions

**4. Vast.ai Specific**
- Make sure SSH is configured correctly
- Use persistent storage for model weights to avoid re-downloading
- Check that you have enough disk space (~15GB for Qwen)

In [None]:
# Success criteria verification
print("=" * 60)
print("SUCCESS CRITERIA CHECKLIST")
print("=" * 60)

checks = {
    "Model loads without OOM": True,  # If you got here, it loaded
    "Can extract activations at any layer": test_act.shape == (config.hidden_size,),
    "Activations have correct shape": test_batch.shape == (3, config.hidden_size),
    "Probe achieves >70% test accuracy": results['test_acc'].max() > 0.7,
    "Can generate text with activation access": hidden_during_gen.value is not None,
}

for check, passed in checks.items():
    status = "✓" if passed else "✗"
    print(f"  [{status}] {check}")

if all(checks.values()):
    print("\n" + "=" * 60)
    print("ALL CHECKS PASSED! Setup is verified.")
    print("You're ready for the CoT understanding notebook.")
    print("=" * 60)
else:
    print("\n⚠ Some checks failed. Review the issues above.")

---

## Next Steps

Now that nnsight setup is verified, you're ready for:

1. **Understanding CoT Structure** (Day 8-9 continued)
   - Analyze how Qwen generates reasoning
   - Identify reasoning markers and structure

2. **Building CoT Dataset** (Day 8-9 continued)
   - Generate faithful/unfaithful CoT examples
   - Create evaluation framework

**Key concepts you now understand:**
- Why nnsight (not TransformerLens) for reasoning models
- How to extract activations with `.trace()` and `.save()`
- How to intervene on activations (patching)
- How to access activations during generation

---

**Save your work and record your results!**