# Day 3-4: First Probes - Sentiment Detection

**Goal:** Build hands-on probe expertise through sentiment classification

**Learning Objectives:**
1. Train your first probe on transformer activations
2. Compare probe performance across layers
3. Test generalization and identify failure modes
4. Understand spurious correlations

**Timeline:** 8-10 hours

---

## Setup

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import transformer_lens as tl

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"TransformerLens ready!")

In [None]:
# Load GPT-2 small
model = tl.HookedTransformer.from_pretrained("gpt2-small")
print(f"Model loaded: {model.cfg.n_layers} layers, {model.cfg.d_model} dimensions")

---

## Part 1: Basic Sentiment Probe (2 hours)

**Concept:** Can we train a probe to detect sentiment from layer activations?

**Hypothesis:** If sentiment information is linearly accessible at layer 6, a simple logistic regression should achieve >70% accuracy.

In [None]:
# Dataset: Positive and negative sentences
positive_sentences = [
    "I love this movie!",
    "This is amazing and wonderful!",
    "Great job, fantastic work!",
    "Absolutely brilliant performance!",
    "I'm so happy with the results!",
    "This exceeded all my expectations!",
    "Wonderful experience, highly recommend!",
    "Best decision I ever made!",
    "This is perfect in every way!",
    "I'm thrilled with how this turned out!",
    "Outstanding quality and service!",
    "This brings me so much joy!",
    "Incredible work, truly impressive!",
    "I'm delighted with this purchase!",
    "This is exactly what I needed!",
    "Five stars, absolutely love it!",
    "This made my day so much better!",
    "I'm grateful for this opportunity!",
    "This is surprisingly excellent!",
    "I'm really pleased with the outcome!",
]

negative_sentences = [
    "I hate this movie.",
    "This is terrible and awful.",
    "Poor job, disappointing work.",
    "Absolutely dreadful performance.",
    "I'm so unhappy with the results.",
    "This failed all my expectations.",
    "Terrible experience, avoid this.",
    "Worst decision I ever made.",
    "This is flawed in every way.",
    "I'm upset with how this turned out.",
    "Poor quality and bad service.",
    "This brings me so much frustration.",
    "Mediocre work, not impressive.",
    "I'm disappointed with this purchase.",
    "This is not what I needed.",
    "One star, completely hate it.",
    "This ruined my day completely.",
    "I regret this opportunity.",
    "This is surprisingly horrible.",
    "I'm really displeased with the outcome.",
]

print(f"Dataset: {len(positive_sentences)} positive, {len(negative_sentences)} negative")

### Extract Activations

**Key Decision:** Which token's activation should we use?
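- **Last token:** In a causal model like GPT-2, the only position that can attend to the whole sentence (what we'll start with)
- First token: Cannot attend to later tokens, so it carries little information about sentence content
- Mean pooling: Average across all tokens; mixes in early positions that have seen only part of the sentence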
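
The three options above can be sketched on a toy array. This is a minimal illustration, not part of the notebook's pipeline: `pool_activations` and `toy_acts` are hypothetical names, and the input is assumed to be a single sentence's `[seq_len, d_model]` matrix already converted to NumPy.

```python
import numpy as np

def pool_activations(layer_acts: np.ndarray, strategy: str = "last") -> np.ndarray:
    """Reduce a [seq_len, d_model] activation matrix to one [d_model] vector."""
    if strategy == "last":
        return layer_acts[-1]            # final token only
    if strategy == "first":
        return layer_acts[0]             # first position only
    if strategy == "mean":
        return layer_acts.mean(axis=0)   # average over all token positions
    raise ValueError(f"Unknown strategy: {strategy!r}")

# Toy stand-in for one sentence: 4 tokens, 3 dimensions (GPT-2 small uses 768).
toy_acts = np.arange(12, dtype=float).reshape(4, 3)
for name in ("last", "first", "mean"):
    print(name, pool_activations(toy_acts, name))
```

If the tokenizer prepends a BOS token, `first` selects that token rather than the first word of the sentence, which is one more reason it tends to be uninformative.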

In [None]:
def get_final_token_activation(model, sentences, layer=6):
    """
    Extract activation of the final token at specified layer.
    
    Args:
        model: HookedTransformer model
        sentences: List of strings
        layer: Which layer to extract from (0-11 for GPT-2 small)
    
    Returns:
        numpy array of shape (n_sentences, d_model)
    """
    activations = []
    
    for sentence in sentences:
        # Run model and cache activations (probing needs no gradients)
        with torch.no_grad():
            _, cache = model.run_with_cache(sentence)
        
        # Extract residual stream at specified layer
        # Shape: [batch=1, seq_len, d_model=768]
        layer_acts = cache["resid_post", layer]
        
        # Get final token's activation
        final_act = layer_acts[0, -1, :].cpu().numpy()
        activations.append(final_act)
    
    return np.array(activations)

# Test it
test_acts = get_final_token_activation(model, ["Test sentence"], layer=6)
print(f"Activation shape: {test_acts.shape}")  # Should be (1, 768)

In [None]:
# Extract activations for our dataset at layer 6
print("Extracting activations... (this may take a minute)")

X_pos = get_final_token_activation(model, positive_sentences, layer=6)
X_neg = get_final_token_activation(model, negative_sentences, layer=6)

# Combine into single dataset
X = np.vstack([X_pos, X_neg])
y = np.array([1]*len(positive_sentences) + [0]*len(negative_sentences))

print(f"X shape: {X.shape}")  # (40, 768)
print(f"y shape: {y.shape}")  # (40,)
print(f"Class balance: {np.sum(y)} positive, {len(y) - np.sum(y)} negative")

### Train the Probe

**What's happening:** Logistic regression learns a weight vector (direction in 768-dim space) that best separates positive from negative.
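
To make "a direction in activation space" concrete, here is a small sketch of what a logistic-regression probe computes at prediction time. The weights, bias, and inputs are made-up 3-dimensional values chosen for illustration; `probe_probability` is a hypothetical helper, not a library function.

```python
import numpy as np

def probe_probability(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """P(positive | x) for a logistic-regression probe with weights w and bias b."""
    logit = x @ w + b                      # signed projection onto the probe direction
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid turns the logit into a probability

# Hypothetical 3-dimensional probe (the real one has 768 weights).
w_demo = np.array([2.0, 0.0, -1.0])
b_demo = 0.5

x_along = np.array([1.0, 0.0, 0.0])    # points along w: logit = 2.5
x_orth = np.array([0.0, 5.0, 0.0])     # orthogonal to w: logit = bias only
x_against = np.array([0.0, 0.0, 2.0])  # points against w: logit = -1.5

for name, x in [("along", x_along), ("orthogonal", x_orth), ("against", x_against)]:
    print(f"{name:10s} P(positive) = {probe_probability(x, w_demo, b_demo):.3f}")
```

For a fitted binary scikit-learn `LogisticRegression`, the corresponding quantities are `probe.coef_[0]` (the weight vector) and `probe.intercept_[0]` (the bias), and `probe.decision_function(X)` returns the logits. Movement orthogonal to the weight vector does not change the prediction, however large it is.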

In [None]:
# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")

In [None]:
# Train the probe
probe = LogisticRegression(max_iter=1000, random_state=42)
probe.fit(X_train, y_train)

# Evaluate
train_acc = probe.score(X_train, y_train)
test_acc = probe.score(X_test, y_test)

print(f"\n=== Probe Performance ===")
print(f"Train accuracy: {train_acc:.2%}")
print(f"Test accuracy:  {test_acc:.2%}")
print(f"Train/test gap: {(train_acc - test_acc):.2%}")

# Detailed test set performance
y_pred = probe.predict(X_test)
print(f"\n{classification_report(y_test, y_pred, target_names=['Negative', 'Positive'])}")

### Reflection Questions

**Before moving on, think about:**
1. Is your test accuracy >70%? (Success criterion)
2. Is the train/test gap <15%? (Overfitting check)
3. What does this tell you about sentiment being linearly accessible at layer 6?
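
When judging the 70% criterion, keep in mind that a 30% split of 40 sentences leaves only 12 test examples, so each one moves accuracy by about 8 points. The sketch below uses a Wilson score interval to show how wide the uncertainty is; the interval choice and the `wilson_interval` helper are assumptions added for illustration, not part of the original exercise.

```python
import math

def wilson_interval(n_correct: int, n_total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for an accuracy of n_correct / n_total."""
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return center - half_width, center + half_width

# Example: 9 of 12 test sentences classified correctly (75% accuracy).
low, high = wilson_interval(9, 12)
print(f"9/12 correct: 95% interval roughly {low:.0%} to {high:.0%}")
```

With 9 of 12 correct, the interval runs from roughly 47% to 91%, so a single split cannot cleanly separate "just above 70%" from "near chance". Repeating the split with several `random_state` values is one cheap way to see this spread directly.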

**Write your observations here:**
- 
- 
- 

---

## Part 2: Multi-Layer Comparison (2 hours)

**Question:** Which layer has the most accessible sentiment information?

**Hypothesis:** Later layers should be better (information gets refined as it flows through).

**Why this matters:** For CoT faithfulness, understanding WHERE in the model information crystallizes is crucial.

In [None]:
# Test multiple layers
layers_to_test = [0, 2, 4, 6, 8, 10, 11]
results = []

print("Testing multiple layers... (this will take a few minutes)\n")

for layer in layers_to_test:
    # Extract activations at this layer.
    # Loop-local names (suffix _l) leave Part 1's X_test, y_test, and test_acc
    # untouched; Part 3 reads test_acc as the layer-6 baseline.
    X_pos_l = get_final_token_activation(model, positive_sentences, layer=layer)
    X_neg_l = get_final_token_activation(model, negative_sentences, layer=layer)
    X_l = np.vstack([X_pos_l, X_neg_l])
    y_l = np.array([1]*len(positive_sentences) + [0]*len(negative_sentences))
    
    # Train/test split (same seed as Part 1, so the same sentences are held out)
    X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(
        X_l, y_l, test_size=0.3, random_state=42, stratify=y_l
    )
    
    # Train probe
    probe_layer = LogisticRegression(max_iter=1000, random_state=42)
    probe_layer.fit(X_train_l, y_train_l)
    
    # Evaluate
    train_acc_l = probe_layer.score(X_train_l, y_train_l)
    test_acc_l = probe_layer.score(X_test_l, y_test_l)
    
    results.append({
        'layer': layer,
        'train_acc': train_acc_l,
        'test_acc': test_acc_l,
        'gap': train_acc_l - test_acc_l
    })
    
    print(f"Layer {layer:2d}: Train={train_acc_l:.2%}, Test={test_acc_l:.2%}, Gap={train_acc_l - test_acc_l:.2%}")

print("\nDone!")

In [None]:
# Visualize results
import pandas as pd

df_results = pd.DataFrame(results)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy by layer
ax1.plot(df_results['layer'], df_results['train_acc'], marker='o', label='Train', linewidth=2)
ax1.plot(df_results['layer'], df_results['test_acc'], marker='s', label='Test', linewidth=2)
ax1.set_xlabel('Layer', fontsize=12)
ax1.set_ylabel('Accuracy', fontsize=12)
ax1.set_title('Sentiment Detection Accuracy by Layer', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0.4, 1.0])

# Plot 2: Train/test gap
ax2.bar(df_results['layer'], df_results['gap'], color='coral', alpha=0.7)
ax2.axhline(y=0.15, color='red', linestyle='--', label='15% threshold')
ax2.set_xlabel('Layer', fontsize=12)
ax2.set_ylabel('Train - Test Accuracy', fontsize=12)
ax2.set_title('Overfitting Analysis', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('sentiment_probe_layer_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\nBest performing layer (test acc): Layer {df_results.loc[df_results['test_acc'].idxmax(), 'layer']:.0f}")

### Analysis Questions

**Look at your plots and answer:**

1. **Does accuracy improve in later layers?** Why or why not?
   - Your answer:

2. **Which layer has the best test accuracy?** Is this surprising?
   - Your answer:

3. **Are any layers severely overfitting (gap >15%)?** What could cause this?
   - Your answer:

4. **Connection to CoT faithfulness:** If you were probing for "faithful reasoning," would you focus on early, middle, or late layers? Why?
   - Your answer:
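
As a hint for question 3: with 28 training examples and 768 dimensions, a linear classifier can fit almost any labelling, including a meaningless one. The sketch below demonstrates this on pure noise. It is an illustration under stated assumptions: the features are random Gaussians rather than GPT-2 activations, a minimum-norm least-squares fit stands in for logistic regression, and all `ctrl_` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_model = 28, 1000, 768   # 28 matches the 70% split of 40 sentences

# Random features with random labels: there is nothing real to learn here.
X_ctrl_train = rng.standard_normal((n_train, d_model))
y_ctrl_train = rng.choice([-1.0, 1.0], size=n_train)
X_ctrl_test = rng.standard_normal((n_test, d_model))
y_ctrl_test = rng.choice([-1.0, 1.0], size=n_test)

# Minimum-norm least-squares fit of a linear classifier (stand-in for the probe).
w_ctrl, *_ = np.linalg.lstsq(X_ctrl_train, y_ctrl_train, rcond=None)

ctrl_train_acc = np.mean(np.sign(X_ctrl_train @ w_ctrl) == y_ctrl_train)
ctrl_test_acc = np.mean(np.sign(X_ctrl_test @ w_ctrl) == y_ctrl_test)

print(f"Random-label train accuracy: {ctrl_train_acc:.2%}")
print(f"Random-label test accuracy:  {ctrl_test_acc:.2%}")
```

Perfect training accuracy on noise means that train accuracy alone says nothing about what a layer encodes. Only held-out accuracy, ideally compared against a shuffled-label control like this one, is informative.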

---

## Part 3: Generalization Testing (2-3 hours)

**Critical Question:** Does the probe learn actual sentiment, or spurious correlations?

**This is THE KEY CONCEPT for your faithfulness research!**

**Test 1:** Different writing style (but same sentiment)

In [None]:
# Different distribution: More formal/elaborate style
different_positive = [
    "Absolutely delightful experience today.",
    "Couldn't be happier with the outcome.",
    "This has genuinely exceeded my expectations.",
    "I find this remarkably satisfying.",
    "The quality is truly exceptional.",
    "I'm thoroughly impressed by this.",
    "This represents outstanding value.",
    "I'm genuinely pleased with this choice.",
    "The experience was notably positive.",
    "This is remarkably well executed.",
]

different_negative = [
    "Utterly disappointing and frustrating.",
    "Completely unsatisfactory results.",
    "This failed to meet basic expectations.",
    "I find this remarkably unsatisfying.",
    "The quality is truly poor.",
    "I'm thoroughly unimpressed by this.",
    "This represents terrible value.",
    "I'm genuinely displeased with this choice.",
    "The experience was notably negative.",
    "This is remarkably poorly executed.",
]

print(f"Different distribution: {len(different_positive)} positive, {len(different_negative)} negative")

In [None]:
# Use the probe trained on layer 6 from Part 1
# (Make sure you still have 'probe' from earlier)

# Extract activations from different distribution
X_diff_pos = get_final_token_activation(model, different_positive, layer=6)
X_diff_neg = get_final_token_activation(model, different_negative, layer=6)
X_diff = np.vstack([X_diff_pos, X_diff_neg])
y_diff = np.array([1]*len(different_positive) + [0]*len(different_negative))

# Evaluate on different distribution
diff_acc = probe.score(X_diff, y_diff)
y_diff_pred = probe.predict(X_diff)

print(f"\n=== Generalization Test ===")
print(f"Original test accuracy: {test_acc:.2%}")
print(f"Different distribution accuracy: {diff_acc:.2%}")
print(f"Accuracy drop: {(test_acc - diff_acc):.2%}")
print(f"\n{classification_report(y_diff, y_diff_pred, target_names=['Negative', 'Positive'])}")

### Critical Analysis

**Answer these carefully (they're crucial for understanding probes):**

1. **Why might a probe get 90% on the test set but 60% on a different distribution?**
   - Your answer:

2. **What spurious correlations might the probe have learned?** (Examples: sentence length, specific words, punctuation)
   - Your answer:

3. **How would you test whether the probe detects sentiment rather than sentence length?**
   - Your answer:
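
One way to start on question 2 is to tabulate surface features by class before looking at any activations. Because the probe reads the final token, the last character of each sentence is a natural first thing to check. The sketch below does this for a handful of sentences copied from Parts 1 and 3; `final_char_counts` and the `*_sample` lists are hypothetical names for illustration.

```python
from collections import Counter

def final_char_counts(sentences):
    """Count the last character of each sentence (after stripping whitespace)."""
    return Counter(s.strip()[-1] for s in sentences)

# A few sentences copied from the Part 1 training data.
train_pos_sample = ["I love this movie!", "Great job, fantastic work!", "Best decision I ever made!"]
train_neg_sample = ["I hate this movie.", "Poor job, disappointing work.", "Worst decision I ever made."]

# A few sentences copied from the Part 3 "different distribution" data.
shifted_pos_sample = ["The quality is truly exceptional.", "This represents outstanding value."]
shifted_neg_sample = ["The quality is truly poor.", "This represents terrible value."]

for name, sents in [
    ("train positive", train_pos_sample),
    ("train negative", train_neg_sample),
    ("shifted positive", shifted_pos_sample),
    ("shifted negative", shifted_neg_sample),
]:
    print(f"{name:17s} {dict(final_char_counts(sents))}")
```

If a surface feature lines up with the labels in training but not in the shifted set, the probe may be reading that feature. Rerunning Part 1 with normalized punctuation is one way to test that directly.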

---

## Part 4: Spurious Correlation Testing (Bonus, 2 hours)

**Let's actually test for spurious correlations!**

In [None]:
# Test hypothesis: Does probe use sentence length as a shortcut?

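# Create a dataset where positive = SHORT and negative = LONG.
# The Part 1 training pairs are roughly length-matched, so a large accuracy
# drop here would suggest the probe is sensitive to length.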
short_positive = [
    "Great!",
    "Love it!",
    "Perfect!",
    "Amazing!",
    "Excellent!",
]

long_negative = [
    "This is absolutely terrible, disappointing, and completely unsatisfactory in every possible way.",
    "I cannot express how utterly frustrated and unhappy I am with this poor quality product.",
    "The experience was remarkably negative, frustrating, and failed to meet any of my expectations.",
    "This represents a terrible value proposition with numerous flaws and significant shortcomings throughout.",
    "I am thoroughly displeased and disappointed with every aspect of this regrettable purchase decision.",
]

# Test if probe gets confused
X_short_pos = get_final_token_activation(model, short_positive, layer=6)
X_long_neg = get_final_token_activation(model, long_negative, layer=6)
X_adversarial = np.vstack([X_short_pos, X_long_neg])
y_adversarial = np.array([1]*len(short_positive) + [0]*len(long_negative))

adv_acc = probe.score(X_adversarial, y_adversarial)
y_adv_pred = probe.predict(X_adversarial)

print(f"\n=== Adversarial Test (Length vs Sentiment) ===")
print(f"Accuracy: {adv_acc:.2%}")
print(f"\nConfusion matrix:")
print(confusion_matrix(y_adversarial, y_adv_pred))
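print("\nIf the probe had learned 'long = positive': accuracy would be near 0%")
print("If the probe detects true sentiment: accuracy should be near 100%")
print("Caveat: '!' vs '.' still matches the labels here, so high accuracy alone does not rule out a punctuation shortcut")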
print(f"\nActual accuracy: {adv_acc:.2%} - What does this tell you?")
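
The test above fills only two cells of a length-by-sentiment grid (short positive, long negative). A fuller check crosses both factors and reports accuracy per cell. The sketch below shows only the bookkeeping: the labels and predictions are made-up placeholders, and `accuracy_by_group` is a hypothetical helper. In the notebook, the predictions would come from `probe.predict(...)` on four sentence lists you write yourself.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group_name: accuracy} for predictions tagged with a group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

# Made-up labels and predictions for a crossed design (two sentences per cell).
demo_true = [1, 1, 1, 1, 0, 0, 0, 0]
demo_pred = [1, 1, 1, 0, 0, 0, 1, 0]
demo_groups = [
    "short_pos", "short_pos", "long_pos", "long_pos",
    "short_neg", "short_neg", "long_neg", "long_neg",
]

for group, acc in accuracy_by_group(demo_true, demo_pred, demo_groups).items():
    print(f"{group:10s} accuracy = {acc:.0%}")
```

If accuracy depends on the length column rather than the sentiment row, the probe is picking up length. Uniform accuracy across all four cells is what a sentiment-driven probe would produce.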

### Final Reflection: Connecting to Faithfulness

**You've now experienced the core challenges of probe-based detection.**

**For your CoT faithfulness research (Week 4), you'll face:**

1. **The generalization problem:** Just as a sentiment probe might fail on different writing styles, a faithfulness probe might fail when the reasoning style changes

2. **The spurious correlation problem:** Just as a probe might use length instead of sentiment, a faithfulness probe might key on the word "therefore" instead of actual faithfulness

3. **The layer selection problem:** You need to know WHERE in the model to probe (just like we tested multiple layers)

4. **The position selection problem:** You need to know WHICH tokens to probe (conclusion tokens? all tokens?)

**Your hypotheses (H1-H4) directly test these challenges!**

---

## Next Steps

**Before moving to Day 5-6 (Advanced Techniques):**

✅ Confirm you can:
- Extract activations from any layer
- Train a probe and evaluate it
- Compare performance across layers
- Test generalization on different distributions
- Identify potential spurious correlations

✅ Understand:
- What probes detect (linearly accessible information)
- When probes fail (non-linear encodings, spurious correlations, distribution shift)
- Why this matters for faithfulness detection

**Day 5-6 Preview:** Token position analysis, cross-position probing, activation extraction from MLP/attention layers

---

**Save your work and take notes on what you learned!**