# Manual Activation Patching Experiments

This notebook demonstrates manual activation patching using TransformerLens, following the approach outlined in plan.md.

## Overview
- Load positive thought patterns from the dataset
- Extract activations from "clean" positive examples
- Inject these activations into "corrupted" negative/neutral prompts
- Generate longer sequences (50-70 tokens) to observe the effect
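
The steps above can be illustrated without any model weights. The following is a toy sketch in plain Python, not the `ActivationPatcher` implementation: the three "layers" are made-up affine maps, and `run`, `LAYERS`, and the hook dictionary are hypothetical names introduced only for illustration. It shows the core mechanic of caching an activation from a clean run and overwriting the same location during a corrupted run.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Hook = Callable[[Vector], Vector]

def make_layer(scale: float, shift: float) -> Callable[[Vector], Vector]:
    """Build a toy 'layer' that applies an elementwise affine map."""
    return lambda x: [scale * v + shift for v in x]

# Toy three-layer "model"; a real transformer block is far more complex.
LAYERS = [make_layer(2.0, 0.0), make_layer(1.0, 1.0), make_layer(0.5, -1.0)]

def run(x: Vector,
        hooks: Optional[Dict[int, Hook]] = None,
        cache: Optional[Dict[int, Vector]] = None) -> Vector:
    """Run all layers; optionally overwrite or record each layer's output."""
    hooks = hooks or {}
    for idx, layer in enumerate(LAYERS):
        x = layer(x)
        if idx in hooks:
            x = hooks[idx](x)      # the hook may replace the activation
        if cache is not None:
            cache[idx] = list(x)   # store a copy of the activation
    return x

clean_input = [1.0, 2.0]
corrupted_input = [-1.0, -2.0]

# Step 1: run the clean input and cache every layer's activation.
clean_cache: Dict[int, Vector] = {}
clean_output = run(clean_input, cache=clean_cache)

# Step 2: run the corrupted input, overwriting layer 1 with the clean activation.
patch_layer = 1
patched_output = run(
    corrupted_input,
    hooks={patch_layer: lambda _: clean_cache[patch_layer]},
)

# Reference: the corrupted input without any patching.
corrupted_output = run(corrupted_input)

print("clean:    ", clean_output)
print("corrupted:", corrupted_output)
print("patched:  ", patched_output)
```

Because this toy overwrites the entire activation at one layer, the patched run matches the clean run exactly from that layer onward. In a transformer, patching only a few token positions leaves the other positions untouched, so the output is usually a blend rather than a copy.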

In [None]:
import sys
import os
import torch
import json
import random
from IPython.display import display, HTML

# Add the local TransformerLens checkout to the import path (adjust this absolute path for your machine)
sys.path.append('/home/koalacrown/Desktop/Code/Projects/turnaround/turn_point/third_party/TransformerLens')

# Import our activation patcher from its local directory (also machine-specific)
sys.path.append('/home/koalacrown/Desktop/Code/Projects/turnaround/turn_point/manual_activation_patching')
from activation_patcher import ActivationPatcher

print("Imports successful!")
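
The absolute paths in the import cell only work on one machine. One way to make them portable, sketched here under assumptions, is to derive them from a single project root. The environment variable name `TURN_POINT_ROOT` and the helper `resolve_project_root` are hypothetical; the subdirectory names are taken from the paths used in this notebook.

```python
import os
from pathlib import Path

def resolve_project_root(env_var: str = "TURN_POINT_ROOT", default: str = ".") -> Path:
    """Return the project root from an environment variable, else the default directory."""
    return Path(os.environ.get(env_var, default)).expanduser().resolve()

# TURN_POINT_ROOT is a hypothetical variable name; it is not defined by the project.
PROJECT_ROOT = resolve_project_root()

# Subdirectories mirror the layout implied by the notebook's hardcoded paths.
TRANSFORMER_LENS_DIR = PROJECT_ROOT / "third_party" / "TransformerLens"
PATCHER_DIR = PROJECT_ROOT / "manual_activation_patching"
DATASET_PATH = PROJECT_ROOT / "data" / "final" / "positive_patterns.jsonl"

print(TRANSFORMER_LENS_DIR)
print(PATCHER_DIR)
print(DATASET_PATH)
```

With this in place, the `sys.path.append(...)` calls could take `str(TRANSFORMER_LENS_DIR)` and `str(PATCHER_DIR)` instead of literal strings.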

## Initialize the Model and Load Dataset

In [None]:
# Choose your model here - change this and re-run to experiment with different models
MODEL_NAME = "gpt2-small"  # Change to: gpt2-medium, EleutherAI/gpt-neo-125m, etc.

# Initialize the activation patcher
patcher = ActivationPatcher(MODEL_NAME)

# Load the positive patterns dataset
dataset_path = "/home/koalacrown/Desktop/Code/Projects/turnaround/turn_point/data/final/positive_patterns.jsonl"
patterns = patcher.load_dataset(dataset_path)

print(f"Loaded {len(patterns)} positive thought patterns")
print(f"Model info: {patcher.get_model_info()}")
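
The implementation of `patcher.load_dataset` is not shown in this notebook. As a rough sketch of what reading a `.jsonl` file of this shape might involve, the hypothetical helper below parses one JSON object per line and checks for the four fields that later cells access. The demo writes a throwaway file rather than touching the real dataset.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Field names taken from the keys this notebook reads from each pattern.
REQUIRED_FIELDS = (
    "cognitive_pattern_name",
    "cognitive_pattern_type",
    "positive_thought_pattern",
    "reference_negative_example",
)

def load_jsonl_patterns(path) -> List[Dict]:
    """Parse one JSON object per non-empty line; raise if a required field is missing."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [f for f in REQUIRED_FIELDS if f not in record]
            if missing:
                raise KeyError(f"line {line_number} is missing fields: {missing}")
            records.append(record)
    return records

# Demo on a temporary file with one made-up record and a trailing blank line.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_patterns.jsonl"
    demo_record = {field: f"example {field}" for field in REQUIRED_FIELDS}
    demo_path.write_text(json.dumps(demo_record) + "\n\n", encoding="utf-8")
    demo_patterns = load_jsonl_patterns(demo_path)

print(f"Loaded {len(demo_patterns)} demo pattern(s)")
```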

In [None]:
# List all supported models (optional - run this to see available models)
# ActivationPatcher.list_supported_models()

## Explore the Dataset

Let's look at a few examples from our positive patterns dataset:

In [None]:
# Display a few example patterns
for i, pattern in enumerate(patterns[:3]):
    print(f"\n=== Example {i+1} ===")
    print(f"Pattern Name: {pattern['cognitive_pattern_name']}")
    print(f"Type: {pattern['cognitive_pattern_type']}")
    print(f"Positive Pattern: {pattern['positive_thought_pattern'][:200]}...")
    print(f"Negative Example: {pattern['reference_negative_example'][:150]}...")
    print("-" * 80)

## Experiment 1: Basic Activation Patching

Let's start with a simple experiment using one positive pattern:

In [None]:
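# Select a random pattern for our first experiment
random.seed(42)  # Reproducible pattern and prompt selection
torch.manual_seed(42)  # Reproducible sampling if generation draws from torch's RNG
sample_pattern = random.choice(patterns)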

print(f"Selected Pattern: {sample_pattern['cognitive_pattern_name']}")
print(f"Clean Text: {sample_pattern['positive_thought_pattern'][:300]}...")

# Candidate corrupted (negative) prompts; one is picked at random below
corrupted_prompts = [
    "I feel completely overwhelmed and don't know how to",
    "Everything seems hopeless and I can't figure out why",
    "My thoughts are spiraling out of control and I feel like",
    "I'm stuck in negative thinking patterns and can't seem to",
    "The anxiety is taking over and I don't think I can"
]

corrupted_text = random.choice(corrupted_prompts)
clean_text = sample_pattern['positive_thought_pattern']

print(f"\nCorrupted Prompt: {corrupted_text}")

In [None]:
# Extract key words for patching
target_words = patcher._extract_key_words(clean_text)
print(f"Target words for patching: {target_words}")

# Perform the patching and generation
predicted_token, generated_text = patcher.patch_and_generate(
    clean_text=clean_text,
    corrupted_text=corrupted_text,
    target_words=target_words,
    num_placeholder_tokens=5,
    layer_idx=-1,  # Last layer
    max_new_tokens=60
)

print("\n" + "="*100)
print("EXPERIMENT 1 RESULTS:")
print("="*100)
print(f"Generated Text:\n{generated_text}")
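
The cell above relies on the private method `patcher._extract_key_words`, whose logic is not shown here. The following is a hypothetical stand-in, not the patcher's actual method: it keeps the most frequent non-stopword words, breaking ties by first appearance. The stopword list and the length threshold are illustrative assumptions.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stopword list; a real one would be much larger.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "can", "for", "i", "in",
    "is", "it", "my", "myself", "of", "on", "that", "the", "to", "what", "with",
}

def extract_key_words(text: str, max_words: int = 5) -> List[str]:
    """Return up to max_words frequent non-stopword words, ties broken by first appearance."""
    words = re.findall(r"[a-z']+", text.lower())
    # Drop stopwords and very short words (assumed to carry little content).
    candidates = [w for w in words if w not in STOPWORDS and len(w) > 3]
    counts = Counter(candidates)
    first_seen = {}
    for position, word in enumerate(candidates):
        first_seen.setdefault(word, position)
    ranked = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
    return ranked[:max_words]

sample = "Challenges are temporary and challenges can teach me. I can control my focus."
print(extract_key_words(sample, max_words=3))
```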

## Experiment 2: Different Layers Comparison

Let's see how patching at different layers affects the output:

In [None]:
# Test patching at different layers
layers_to_test = [0, 3, 6, 9, -1]  # Early, middle, late, and final layers
sample_pattern = patterns[5]  # Use a different pattern

clean_text = sample_pattern['positive_thought_pattern']
corrupted_text = "I can't stop worrying about everything and feel completely"
target_words = patcher._extract_key_words(clean_text)

print(f"Pattern: {sample_pattern['cognitive_pattern_name']}")
print(f"Target words: {target_words}")
print(f"Corrupted prompt: {corrupted_text}")
print("\n" + "="*100)

layer_results = {}

for layer_idx in layers_to_test:
    print(f"\n--- LAYER {layer_idx} ---")
    try:
        predicted_token, generated_text = patcher.patch_and_generate(
            clean_text=clean_text,
            corrupted_text=corrupted_text,
            target_words=target_words,
            num_placeholder_tokens=5,
            layer_idx=layer_idx,
            max_new_tokens=50
        )
        layer_results[layer_idx] = generated_text
        print(f"Generated: {generated_text}")
    except Exception as e:
        print(f"Error at layer {layer_idx}: {e}")
        layer_results[layer_idx] = f"Error: {e}"
    print("-" * 80)
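
The list `[0, 3, 6, 9, -1]` mixes absolute indices with a negative index for the final layer. The sketch below shows one way to normalize such indices and build the TransformerLens hook name for the post-block residual stream, which follows the pattern `blocks.{layer}.hook_resid_post`. Whether `ActivationPatcher` resolves negative indices this way is an assumption; `resolve_layer` and `resid_post_hook_name` are hypothetical helpers.

```python
def resolve_layer(layer_idx: int, n_layers: int) -> int:
    """Map a possibly negative layer index onto the range [0, n_layers)."""
    if not -n_layers <= layer_idx < n_layers:
        raise IndexError(f"layer {layer_idx} is out of range for a {n_layers}-layer model")
    return layer_idx % n_layers

def resid_post_hook_name(layer_idx: int, n_layers: int) -> str:
    """Build the hook name for the residual stream after the given block."""
    return f"blocks.{resolve_layer(layer_idx, n_layers)}.hook_resid_post"

N_LAYERS_GPT2_SMALL = 12  # GPT-2 small has 12 transformer blocks

for idx in [0, 3, 6, 9, -1]:
    print(idx, "->", resid_post_hook_name(idx, N_LAYERS_GPT2_SMALL))
```

For a 12-layer model, `-1` and `11` name the same layer, so the list covers layers 0, 3, 6, 9, and 11. If `MODEL_NAME` is changed to a model with a different depth, the same list samples different relative depths.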

## Experiment 3: Multiple Patterns Comparison

Let's compare how different cognitive patterns affect the generation:

In [None]:
# Test with multiple different cognitive patterns
test_patterns = patterns[:5]  # Use first 5 patterns
corrupted_text = "I feel trapped and don't see a way forward because"

print(f"Testing with corrupted prompt: '{corrupted_text}'")
print("\n" + "="*100)

pattern_results = []

for i, pattern in enumerate(test_patterns):
    print(f"\n--- PATTERN {i+1}: {pattern['cognitive_pattern_name']} ---")
    
    clean_text = pattern['positive_thought_pattern']
    target_words = patcher._extract_key_words(clean_text)
    
    print(f"Clean text (first 100 chars): {clean_text[:100]}...")
    print(f"Target words: {target_words}")
    
    try:
        predicted_token, generated_text = patcher.patch_and_generate(
            clean_text=clean_text,
            corrupted_text=corrupted_text,
            target_words=target_words,
            num_placeholder_tokens=5,
            layer_idx=-1,
            max_new_tokens=65
        )
        
        pattern_results.append({
            'pattern_name': pattern['cognitive_pattern_name'],
            'generated_text': generated_text
        })
        
        print(f"\nGENERATED TEXT:")
        print(generated_text)
        
    except Exception as e:
        print(f"Error with pattern {i+1}: {e}")
        pattern_results.append({
            'pattern_name': pattern['cognitive_pattern_name'],
            'generated_text': f"Error: {e}"
        })
    
    print("-" * 100)

## Experiment 4: Different Number of Patch Positions

Let's experiment with varying the number of placeholder tokens we patch:

In [None]:
# Test with different numbers of placeholder tokens
placeholder_counts = [1, 3, 5, 7]
sample_pattern = patterns[10]  # Use another pattern

clean_text = sample_pattern['positive_thought_pattern']
corrupted_text = "My mind keeps racing with negative thoughts and I feel"
target_words = patcher._extract_key_words(clean_text)

print(f"Pattern: {sample_pattern['cognitive_pattern_name']}")
print(f"Target words: {target_words}")
print(f"Corrupted prompt: {corrupted_text}")
print("\n" + "="*100)

placeholder_results = {}

for num_placeholders in placeholder_counts:
    print(f"\n--- {num_placeholders} PLACEHOLDER TOKENS ---")
    try:
        predicted_token, generated_text = patcher.patch_and_generate(
            clean_text=clean_text,
            corrupted_text=corrupted_text,
            target_words=target_words,
            num_placeholder_tokens=num_placeholders,
            layer_idx=-1,
            max_new_tokens=55
        )
        placeholder_results[num_placeholders] = generated_text
        print(f"Generated: {generated_text}")
    except Exception as e:
        print(f"Error with {num_placeholders} placeholders: {e}")
        placeholder_results[num_placeholders] = f"Error: {e}"
    print("-" * 80)

## Experiment 5: Baseline Comparison (No Patching)

Let's generate text without any patching to see the baseline behavior:

In [None]:
# Generate baseline text without patching
test_prompts = [
    "I feel completely overwhelmed and don't know how to",
    "My thoughts are spiraling out of control and I feel like",
    "Everything seems hopeless and I can't figure out why",
    "I'm stuck in negative thinking patterns and can't seem to",
    "The anxiety is taking over and I don't think I can"
]

print("BASELINE GENERATIONS (No Patching):")
print("="*100)

baseline_results = []

for i, prompt in enumerate(test_prompts):
    print(f"\n--- Prompt {i+1}: {prompt} ---")
    
    # Tokenize and generate without any hooks
    tokens = patcher.model.to_tokens(prompt)
    generated_tokens = patcher.model.generate(
        tokens,
        max_new_tokens=60,
        temperature=0.7,
        do_sample=True
    )
    generated_text = patcher.model.to_string(generated_tokens[0])
    
    baseline_results.append({
        'prompt': prompt,
        'generated_text': generated_text
    })
    
    print(f"Generated: {generated_text}")
    print("-" * 80)

## Experiment 6: Custom Pattern Testing

Interactive cell for testing your own patterns:

In [None]:
# Interactive experiment - modify these variables to test your own patterns

# You can change these to experiment with different combinations
CUSTOM_CLEAN_TEXT = "I'm taking a moment to acknowledge my feelings and remind myself that challenges are temporary. I can take small, manageable steps forward and focus on what I can control right now."

CUSTOM_CORRUPTED_TEXT = "I don't know what to do anymore and feel completely"

CUSTOM_TARGET_WORDS = ["acknowledge", "feelings", "challenges", "temporary", "control"]  # Or leave empty to auto-extract

CUSTOM_NUM_PLACEHOLDERS = 5
CUSTOM_LAYER = -1  # -1 for last layer
CUSTOM_MAX_TOKENS = 70

print("CUSTOM EXPERIMENT:")
print("="*100)
print(f"Clean text: {CUSTOM_CLEAN_TEXT}")
print(f"Corrupted text: {CUSTOM_CORRUPTED_TEXT}")

# Auto-extract target words if not provided
if not CUSTOM_TARGET_WORDS:
    CUSTOM_TARGET_WORDS = patcher._extract_key_words(CUSTOM_CLEAN_TEXT)

print(f"Target words: {CUSTOM_TARGET_WORDS}")
print(f"Settings: {CUSTOM_NUM_PLACEHOLDERS} placeholders, layer {CUSTOM_LAYER}, {CUSTOM_MAX_TOKENS} max tokens")

try:
    predicted_token, generated_text = patcher.patch_and_generate(
        clean_text=CUSTOM_CLEAN_TEXT,
        corrupted_text=CUSTOM_CORRUPTED_TEXT,
        target_words=CUSTOM_TARGET_WORDS,
        num_placeholder_tokens=CUSTOM_NUM_PLACEHOLDERS,
        layer_idx=CUSTOM_LAYER,
        max_new_tokens=CUSTOM_MAX_TOKENS
    )
    
    print("\nGENERATED TEXT:")
    print(generated_text)
    
except Exception as e:
    print(f"Error in custom experiment: {e}")

## Results Summary

Let's create a summary of our experiments:

In [None]:
print("EXPERIMENT SUMMARY")
print("="*100)

print("\n1. Basic Activation Patching:")
print("   - Patched activations from positive patterns into corrupted prompts")
print(f"   - Most recent generation: {len(generated_text.split()) if 'generated_text' in locals() else 'N/A'} whitespace-separated words")

print("\n2. Layer Comparison:")
if 'layer_results' in locals():
    for layer, result in layer_results.items():
        status = "Success" if not result.startswith("Error") else "Failed"
        print(f"   - Layer {layer}: {status}")

print("\n3. Pattern Comparison:")
if 'pattern_results' in locals():
    for result in pattern_results:
        status = "Success" if not result['generated_text'].startswith("Error") else "Failed"
        print(f"   - {result['pattern_name']}: {status}")

print("\n4. Placeholder Count Testing:")
if 'placeholder_results' in locals():
    for count, result in placeholder_results.items():
        status = "Success" if not result.startswith("Error") else "Failed"
        print(f"   - {count} placeholders: {status}")

print("\n5. Baseline vs Patched:")
print(f"   - Baseline generations: {len(baseline_results) if 'baseline_results' in locals() else 0} completed")
print("   - Compare these with the patched outputs above to judge the effect of patching")

print("\nTechnical Details:")
print(f"   - Model: {patcher.model_name}")
print(f"   - Dataset size: {len(patterns)} patterns")
print(f"   - Primary patching layer: Last residual stream (resid_post)")
print(f"   - Token generation: 50-70 tokens per experiment")
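
The summary above only prints to the screen, so the generated texts are lost when the kernel restarts. A minimal sketch for persisting them follows, assuming the result containers have the shapes used in this notebook (dicts keyed by layer or placeholder count, and lists of dicts). `save_results` and the output filename are hypothetical; the demo uses stand-in data and a temporary directory.

```python
import json
import tempfile
from pathlib import Path

def save_results(path, **result_sets) -> Path:
    """Write the named result collections to one JSON file and return its path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        json.dump(result_sets, handle, indent=2, ensure_ascii=False)
    return path

# Stand-in data shaped like the notebook's result containers.
demo_layer_results = {0: "text from layer 0", -1: "Error: example failure"}
demo_pattern_results = [{"pattern_name": "example", "generated_text": "some text"}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = save_results(
        Path(tmp) / "results" / "patching_results.json",
        layer_results=demo_layer_results,
        pattern_results=demo_pattern_results,
    )
    reloaded = json.loads(out_path.read_text(encoding="utf-8"))

# JSON object keys are always strings, so integer layer indices come back as "0" and "-1".
print(list(reloaded["layer_results"]))
```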

## Next Steps and Extensions

Ideas for further experimentation:

1. **Different Activation Types**: Try patching attention patterns, MLP activations, etc.
2. **Multiple Layer Patching**: Patch at multiple layers simultaneously
3. **Quantitative Evaluation**: Add metrics to measure positivity/negativity of generated text
4. **Fine-grained Control**: Patch specific attention heads or neurons
5. **Longer Sequences**: Test with even longer generation sequences
6. **Different Models**: Try with larger models or different architectures

Modify the cells above to explore these possibilities!
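
As a starting point for the quantitative evaluation idea, the sketch below scores text with a tiny hand-written word lexicon. This is a deliberately crude assumption-laden baseline: the word lists are illustrative, `positivity_score` is a hypothetical helper, and a trained sentiment classifier would likely be a better measure in practice.

```python
import re

# Illustrative word lists only; they are not derived from the dataset.
POSITIVE_WORDS = {"calm", "hope", "hopeful", "capable", "progress", "grateful", "manageable", "control"}
NEGATIVE_WORDS = {"hopeless", "overwhelmed", "stuck", "trapped", "anxiety", "worrying", "spiraling"}

def positivity_score(text: str) -> float:
    """Return (positive - negative) / (positive + negative), or 0.0 if no lexicon word occurs."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    total = positive + negative
    return (positive - negative) / total if total else 0.0

for text in ["I feel calm and hopeful", "I feel hopeless and stuck", "The sky is blue"]:
    print(f"{positivity_score(text):+.2f}  {text}")
```

Scoring each entry in `baseline_results` and `pattern_results` with the same function would give a first numeric comparison between baseline and patched generations. Because generation is sampled, averaging over several runs per prompt would make that comparison more meaningful.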