# Manual Activation Patching Experiments

### How activation patching works here (what goes where)

- **Source sentence (we capture from)**: `clean_text`
  - Set in each experiment (e.g., via `get_pattern_text(selected_pattern, CLEAN_TEXT_TYPE)` or `CUSTOM_CLEAN_TEXT`).
  - **Capture layer**: `capture_layer_idx` (resid_post at that layer). `-1` = last layer.
  - **Token selection**: `token_selection_strategy` chooses which token positions in `clean_text` provide activations (e.g., `KEYWORDS`, `MID_TOKENS`, `LAST_COUPLE`, `LAST_TOKEN`, `ALL_TOKENS`).

- **Target prompt (we patch into)**: depends on mode
  - **Zero-placeholder mode** (`zero_placeholder_mode=True`): use either a named `template_name` (with `0` markers) or a `prompt_input` string containing standalone `0` tokens. Those `0` positions are rendered as `<eos>` and become patch points.
  - **Placeholder-token mode** (`zero_placeholder_mode=False`): the notebook auto‑prepends `num_placeholder_tokens` copies of `<eos>` to `corrupted_text`; those initial tokens are the patch points.

- **Where we patch (layer)**: `patch_layer_idx` (resid_post). Defaults to `capture_layer_idx` if not provided. Can be an int or list of ints.

- **Generation**: Hooks stay active during the forward pass, so the model uses the patched activations both when computing the next-token logits and while generating the continuation.

- **Mental model**: Capture activations from `clean_text` at selected tokens → inject them at `<eos>` positions in the target prompt → generate continuation while hooks are active.
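
The capture → inject flow above can be pictured with plain Python lists standing in for residual-stream activations. This is only a sketch under stated assumptions: `mean_vector`, `blend`, and `patch_positions` are hypothetical helpers, and the blend rule `(1 - strength) * original + strength * captured` is an assumed reading of the `overlay_strength` setting, not the confirmed behaviour of `ActivationPatcher`.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Average several activation vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def blend(original: Vector, captured: Vector, strength: float) -> Vector:
    """Assumed blend rule: strength 0.0 keeps the original, 1.0 replaces it."""
    return [(1.0 - strength) * o + strength * c for o, c in zip(original, captured)]


def patch_positions(target: List[Vector], captured: Vector,
                    positions: Sequence[int], strength: float = 1.0) -> List[Vector]:
    """Return a copy of `target` with `captured` blended in at `positions`."""
    patched = [list(row) for row in target]
    for pos in positions:
        patched[pos] = blend(patched[pos], captured, strength)
    return patched


# Toy residual stream: one 2-dimensional vector per token position.
clean_acts = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
target_acts = [[0.0, 0.0], [0.0, 0.0], [9.0, 9.0]]

# Capture: average the last two token vectors of the clean text.
captured = mean_vector(clean_acts[-2:])
# Inject: blend the captured vector into the two placeholder positions.
patched = patch_positions(target_acts, captured, positions=[0, 1], strength=0.5)

print(captured)  # [4.0, 3.0]
print(patched)   # [[2.0, 1.5], [2.0, 1.5], [9.0, 9.0]]
```

In the real notebook the same idea is applied to tensors inside a forward hook, and the hook stays registered while the continuation is generated.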


In [2]:
import sys
import os
import torch
import json
import random
from IPython.display import display, HTML

# Add TransformerLens to path
sys.path.append('/Users/ivanculo/Desktop/Projects/turn_point/third_party/TransformerLens')

# Import our activation patcher with all utilities
sys.path.append('/Users/ivanculo/Desktop/Projects/turn_point/manual_activation_patching')
from activation_patcher import ActivationPatcher, TokenSelectionStrategy
from interpretation_templates import INTERPRETATION_TEMPLATES

# Import utility functions from ActivationPatcher class
load_cognitive_patterns = ActivationPatcher.load_cognitive_patterns
get_pattern_by_index = ActivationPatcher.get_pattern_by_index
get_pattern_by_type = ActivationPatcher.get_pattern_by_type
get_random_pattern_by_type = ActivationPatcher.get_random_pattern_by_type
get_patterns_by_type = ActivationPatcher.get_patterns_by_type
list_available_pattern_types = ActivationPatcher.list_available_pattern_types
get_pattern_text = ActivationPatcher.get_pattern_text
get_template = ActivationPatcher.get_template
show_pattern_info = ActivationPatcher.show_pattern_info
clear_memory = ActivationPatcher.clear_memory
filter_patterns_by_count = ActivationPatcher.filter_patterns_by_count
get_filtered_patterns_by_type = ActivationPatcher.get_filtered_patterns_by_type

print("Imports and utilities loaded successfully!")

Imports and utilities loaded successfully!


## Initialize the Model and Load Dataset

In [6]:
# Choose your model here - change this and re-run to experiment with different models
# MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # Change to: gpt2-medium, EleutherAI/gpt-neo-125m, etc.
MODEL_NAME = "gpt2-medium"
# Set the BOS token to match MODEL_NAME. "<bos>" is the Gemma BOS token;
# GPT-2 models such as gpt2-medium use "<|endoftext|>" instead.
BOS_TOKEN = "<bos>"

# Initialize the activation patcher
patcher = ActivationPatcher(MODEL_NAME)

# Load the FULL cognitive patterns dataset (all 40 examples per pattern type)
patterns, pattern_types = load_cognitive_patterns()

# Show available pattern types
# list_available_pattern_types(pattern_types)

Using Apple Silicon GPU (MPS)
Loading model: gpt2-medium on device: mps
Loaded pretrained model gpt2-medium into HookedTransformer
✓ Model loaded successfully
  - Family: gpt2
  - Size: medium
  - Layers: 24
  - Model dimension: 1024
  - Vocabulary size: 50257


## Experiment 1: Basic Activation Patching

Let's start with a simple experiment that patches activations from a single pattern example (the `negative` text of one pattern in the run below):

### What goes where in Experiment 1

- **We capture from**: `clean_text` built from the dataset using `CLEAN_TEXT_TYPE`.
- **We patch into**: a prompt constructed from the `template_name` with `0` markers (zero‑placeholder mode). Each `0` is rendered as `<eos>` and marks a patch point in the prompt.
- **Capture layer**: `capture_layer_idx=16` (resid_post at layer 16).
- **Patch layer**: `patch_layer_idx=2` (resid_post at layer 2), so activations captured at a deeper layer are injected at an earlier one.
- **Tokens used to extract activations**: `TOKEN_STRATEGY` (set to `LAST_COUPLE` here, which averages the last `num_strategy_tokens=5` tokens).
- **Why**: This injects the end‑of‑thought representation from the clean example into the template to steer generation at the `<eos>` slots.
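
As a rough illustration of the `LAST_COUPLE` selection, the sketch below computes which positions would be averaged for a given sequence length. `last_couple_positions` is a hypothetical helper written for this note, not a function from `activation_patcher`; it is only meant to mirror the positions reported in the run logs.

```python
from typing import List


def last_couple_positions(seq_len: int, num_tokens: int) -> List[int]:
    """Indices of the final `num_tokens` positions (fewer if the sequence is short)."""
    start = max(0, seq_len - num_tokens)
    return list(range(start, seq_len))


# Sequence length 48 with num_strategy_tokens=5, as in the Experiment 1 log.
print(last_couple_positions(seq_len=48, num_tokens=5))  # [43, 44, 45, 46, 47]

# Sequence length 51 with num_strategy_tokens=3, as in the manual-prompt log.
print(last_couple_positions(seq_len=51, num_tokens=3))  # [48, 49, 50]
```

The activation vectors at these positions are then averaged into a single vector, which is why the logs report "Patching with 1 total activation vectors".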


Available cognitive pattern types:
 1. Cognitive depletion pattern (40 examples)
 2. Intrusive suicidal fixation (40 examples)
 3. Negative self-evaluative loop (40 examples)
 4. Internal dialectical processing (40 examples)
 5. Fragmented perceptual reasoning (40 examples)
 6. Hyper-attuned interoception (40 examples)
 7. Autobiographical integration (40 examples)
 8. Over-elaborative recounting (40 examples)
 9. Entrapment cognition (40 examples)
10. Existential rumination (40 examples)
11. Learned helplessness loop (40 examples)
12. Instrumental suicidal reasoning (40 examples)
13. Cognitive disorganization (40 examples)
Available text types: positive, negative, transition

Available interpretation templates: ['cognitive_pattern', 'emotional_state', 'general_concept', 'decision_making']

💡 Each experiment can now filter to use 1-40 examples per pattern type at runtime!

In [20]:
# EXPERIMENT 1: Basic Activation Patching with Template (Zero-Placeholder Mode)

# ===== MODULAR CONFIGURATION SECTION =====
# Easily modify these parameters to experiment with different settings
NUM_EXAMPLES_PER_TYPE = 30  # Number of examples to use per pattern type (1-40)
PATTERN_TYPE = "Instrumental suicidal reasoning"  # Choose from the list above
PATTERN_INDEX_WITHIN_TYPE = 0 # If there are multiple examples of this type, choose which one (0-based)
CLEAN_TEXT_TYPE = "negative"  # "positive", "negative", or "transition"
TEMPLATE_NAME = "cognitive_pattern"  # Template to use for corrupted text
TOKEN_STRATEGY = TokenSelectionStrategy.LAST_COUPLE  # Alternatives: KEYWORDS, MID_TOKENS, LAST_TOKEN, ALL_TOKENS

# ===== ACTIVATION PATCHING PARAMETERS =====
OVERLAY_STRENGTH = 0.7 # 0.0 = no effect, 1.0 = full effect, >1.0 = amplified
REPLACING_MODE = 'normalized'  # 'normalized' (blend), 'addition' (add), 'direct' (replace)
NORMALIZE_MAGNITUDES = False  # Scale activations to match original magnitude
# ============================================

# ===== GENERATION PARAMETERS =====
GEN_TEMPERATURE = 1.4
GEN_TOP_K = 50        # e.g., 50
GEN_TOP_P = 0.9        # e.g., 0.9
GEN_DO_SAMPLE = True
GEN_REPETITION_PENALTY = 1.1  # mapped to HookedTransformer's freq_penalty
# ============================================

# Apply runtime filtering to use only NUM_EXAMPLES_PER_TYPE examples per pattern type
filtered_patterns, filtered_pattern_types = filter_patterns_by_count(pattern_types, NUM_EXAMPLES_PER_TYPE)

# Load the selected pattern and template from filtered dataset
selected_pattern = get_pattern_by_type(filtered_pattern_types, PATTERN_TYPE, PATTERN_INDEX_WITHIN_TYPE)
template = get_template(TEMPLATE_NAME)

# Get the clean text (what we want to capture activations from)
clean_text = get_pattern_text(selected_pattern, CLEAN_TEXT_TYPE)

# Run activation patching using zero-placeholder mode
print(f"\n⚡ Running activation patching (zero-placeholder mode)...")
print(f"  Clean text: {clean_text[:100]}...")

predicted_token, generated_text = patcher.patch_and_generate(
    clean_text=clean_text,
    corrupted_text=None,  # ignored in zero-placeholder mode when template_name is provided
    capture_layer_idx=16,  # capture resid_post at layer 16
    patch_layer_idx=2,     # inject into resid_post at layer 2
    max_new_tokens=60,
    bos_token=BOS_TOKEN,
    token_selection_strategy=TOKEN_STRATEGY,
    num_strategy_tokens=5,
    zero_placeholder_mode=True,
    template_name=TEMPLATE_NAME,  # use template with '0' markers
    overlay_strength=OVERLAY_STRENGTH,  # Control activation intensity
    replacing_mode=REPLACING_MODE,  # Blending method
    normalize_magnitudes=NORMALIZE_MAGNITUDES,  # Scale activations
    temperature=GEN_TEMPERATURE,
    top_k=GEN_TOP_K,
    top_p=GEN_TOP_P,
    do_sample=GEN_DO_SAMPLE,
    freq_penalty=GEN_REPETITION_PENALTY
)

print("\n" + "="*80)
print("🎊 EXPERIMENT 1 RESULTS:")
print("="*80)
print(f"Model: {patcher.model_name}")
print(f"Pattern Type: {PATTERN_TYPE}")
print(f"Pattern Name: {selected_pattern['cognitive_pattern_name']}")
print(f"Clean text type: {CLEAN_TEXT_TYPE}")
print(f"Template: {TEMPLATE_NAME}")
print(f"Token strategy: {TOKEN_STRATEGY.value}")
print(f"Overlay strength: {OVERLAY_STRENGTH}")
print(f"Replacing mode: {REPLACING_MODE}")
print(f"Normalize magnitudes: {NORMALIZE_MAGNITUDES}")
print(f"Temperature: {GEN_TEMPERATURE}")
print(f"Top-k: {GEN_TOP_K}")
print(f"Top-p: {GEN_TOP_P}")
print(f"Do sample: {GEN_DO_SAMPLE}")
print(f"Repetition penalty (freq_penalty): {GEN_REPETITION_PENALTY}")
print(f"\n📊 Generated Text:")
print(f"{generated_text}")
print("="*80)


⚡ Running activation patching (zero-placeholder mode)...
  Clean text: I'm not being selfish; I just want to end this constant pain and suffering. It's exhausting trying t...
✅ Model hooks reset to clean state
Clean text: I'm not being selfish; I just want to end this constant pain and suffering. It's exhausting trying to find reasons to go on, and honestly, what's the point of keeping going when it feels like no one cares?
Corrupted text: User: 0 0 0 0 The following is a summary of the previous:
Assistant:
Capture layers: [16]
Patch layers: [2]


Processing capture layers:   0%|          | 0/1 [00:00<?, ?it/s]

Using token selection strategy: last_couple
Sequence length: 48
Averaged last 5 tokens from positions: [43, 44, 45, 46, 47]


                                                                


Patching with 1 total activation vectors
Using 1 hooks across layers: [16] -> [2]
Patching at hook: blocks.2.hook_resid_post for positions: [2] (mode: normalized, strength: 0.7)

Next token prediction after patching: '
'




  0%|          | 0/60 [00:00<?, ?it/s]

Patching at hook: blocks.2.hook_resid_post for positions: [2] (mode: normalized, strength: 0.7)

🎊 EXPERIMENT 1 RESULTS:
Model: gpt2-medium
Pattern Type: Instrumental suicidal reasoning
Pattern Name: Suicidal Planning & Rationalization
Clean text type: negative
Template: cognitive_pattern
Token strategy: last_couple
Overlay strength: 0.7
Replacing mode: normalized
Normalize magnitudes: False
Temperature: 1.4
Top-k: 50
Top-p: 0.9
Do sample: True
Repetition penalty (freq_penalty): 1.1

📊 Generated Text:
User: 0 0 0 0 The following is a summary of the previous:
Assistant: 5.E2 by June 3 and WO, because our way to everyone, or roms on howls as (oron are all, 1DET high water.comondatta Road Dogus and mustered the same-maneu—"A10 -- but five time.imos


### Manual zero‑placeholder prompt (what gets patched where)

- **We capture from**: the same `clean_text` as Experiment 1.
- **We patch into**: `manual_prompt` where each standalone `0` is rendered as `<eos>` and becomes a patch point.
- **Positions chosen**: The exact tokenized positions of the `<eos>` markers after rendering are located and used for patching.
- **Strategy/layers**: `TokenSelectionStrategy.LAST_COUPLE` from `capture_layer_idx=-1` → patched into `patch_layer_idx=-1`.
- **Tip**: Add or remove `0` markers to control how many `<eos>` positions get injected.
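
To make the marker handling concrete, here is a word-level sketch of how standalone `0` markers could be rendered and located. It is an approximation under stated assumptions: `render_zero_markers` is a hypothetical helper, it splits on whitespace rather than using the model's tokenizer, and the real patch positions are token indices that also depend on any BOS token or chat template.

```python
from typing import List, Tuple


def render_zero_markers(prompt: str, placeholder: str = "<eos>") -> Tuple[str, List[int]]:
    """Replace standalone '0' words with `placeholder`.

    Returns the rendered prompt and the word indices of the replaced markers.
    """
    words = prompt.split()
    positions = [i for i, word in enumerate(words) if word == "0"]
    rendered = " ".join(placeholder if word == "0" else word for word in words)
    return rendered, positions


rendered, positions = render_zero_markers("0 0 0 0 The following is a summary of the previous:")
print(rendered)
print(positions)  # [0, 1, 2, 3]

# A prompt without any standalone '0' yields no patch points at all.
print(render_zero_markers("Hello"))  # ('Hello', [])

# Digits inside a longer word are not treated as markers.
print(render_zero_markers("Score 10 then 0"))  # ('Score 10 then <eos>', [3])
```

Checking that the returned position list is non-empty before generating is a cheap way to catch prompts that would otherwise run unpatched.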


In [None]:
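# Manual prompt with '0' markers (zero-placeholder mode)
# Place a standalone '0' wherever activations should be patched, e.g. "0 0 0 Hello".
# NOTE: "Hello" contains no '0' markers, so no positions are patched in this run
# (the log below reports `positions: []`).
manual_prompt = "Hello"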

print(f"Manual prompt: {manual_prompt}")

predicted_token, generated_text = patcher.patch_and_generate(
    clean_text=clean_text,
    corrupted_text=None,  # ignored in zero-placeholder mode
    capture_layer_idx=-1,
    patch_layer_idx=-1,
    max_new_tokens=60,
    bos_token=BOS_TOKEN,
    token_selection_strategy=TokenSelectionStrategy.LAST_COUPLE,
    num_strategy_tokens=3,
    zero_placeholder_mode=True,
    prompt_input=manual_prompt
)

print("\n" + "="*100)
print("MANUAL PROMPT RESULTS:")
print("="*100)
print(f"Model: {patcher.model_name}")
print(f"BOS token used: '{BOS_TOKEN}'")
print(f"Clean text (first 100 chars): {clean_text[:100]}...")
print(f"\nGenerated Text:\n{generated_text}")
print("="*100)

Manual prompt: Hello
✅ Model hooks reset to clean state
Clean text: I'm not being selfish; I just want to end this constant pain and suffering. It's exhausting trying to find reasons to go on, and honestly, what's the point of keeping going when it feels like no one cares?
Corrupted text: User: Hello
Assistant:
Capture layers: [17]
Patch layers: [17]


Processing capture layers:   0%|          | 0/1 [00:00<?, ?it/s]

Using token selection strategy: last_couple
Sequence length: 51
Averaged last 3 tokens from positions: [48, 49, 50]


                                                                


Patching with 1 total activation vectors
Using 1 hooks across layers: [17] -> [17]




Patching at hook: blocks.17.hook_resid_post for positions: [] (mode: normalized, strength: 1.0)

Next token prediction after patching: ' maneu'


  0%|          | 0/60 [00:00<?, ?it/s]

Patching at hook: blocks.17.hook_resid_post for positions: [] (mode: normalized, strength: 1.0)

MANUAL PROMPT RESULTS:
Model: google/gemma-2b
BOS token used: '<bos>'
Clean text (first 100 chars): I'm not being selfish; I just want to end this constant pain and suffering. It's exhausting trying t...

Generated Text:
User: Hello
Assistant: maneu increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa increa affor affor affor affor affor affor affor affor affor affor affor affor affor affor affor affor affor affor affor affor


#### 🔄 Reset Model State

Before running experiments, it's good practice to reset any lingering hooks from previous runs:

In [None]:
# 🔄 RESET MODEL HOOKS - Run this cell to reset the model to clean state
# This is especially important when switching between different experiments

patcher.reset_hooks()

print("Model is now ready for clean experiments!")
print("Run this cell anytime you want to ensure no residual hooks are affecting your results.")
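
One way to make the reset habit automatic is to wrap each experiment in a context manager that resets hooks on entry and on exit, even if generation raises. The sketch below is an assumption-laden illustration: `ToyPatcher` is a hypothetical stand-in used only so the example runs on its own, and `clean_hooks` relies solely on the object exposing a `reset_hooks()` method, as `patcher` does in this notebook.

```python
from contextlib import contextmanager
from typing import Iterator, List


class ToyPatcher:
    """Hypothetical stand-in that only tracks hook names."""

    def __init__(self) -> None:
        self.hooks: List[str] = []

    def add_hook(self, name: str) -> None:
        self.hooks.append(name)

    def reset_hooks(self) -> None:
        self.hooks.clear()


@contextmanager
def clean_hooks(patcher) -> Iterator:
    """Reset hooks before the block runs and again afterwards, even on error."""
    patcher.reset_hooks()
    try:
        yield patcher
    finally:
        patcher.reset_hooks()


toy = ToyPatcher()
toy.add_hook("stale_hook_from_previous_run")

with clean_hooks(toy) as p:
    print(p.hooks)  # [] because the stale hook was cleared on entry
    p.add_hook("blocks.2.hook_resid_post")

print(toy.hooks)  # [] because hooks were cleared again on exit
```

With the real object this would read `with clean_hooks(patcher): ...` around a `patch_and_generate` call.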

## Experiment 2: Different Layers Comparison

Let's see how patching at different layers affects the output:

### What goes where in Experiment 2 (layer sweep)

- **We capture from**: `clean_text` of `sample_pattern`.
- **We patch into**: `corrupted_text` with `num_placeholder_tokens=5` `<eos>` tokens prepended (placeholder‑token mode).
- **Capture/Patch layers**: the same `layer_idx` on each iteration (sweeps from early layers to the last layer, `-1`).
- **Token selection**: `TokenSelectionStrategy.KEYWORDS` from the clean input.
- **Goal**: Compare how different patch layers influence continuation.
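
Because the sweep mixes absolute indices with `-1`, it can help to see which concrete layers are visited for a given model depth. The sketch below assumes Python-style negative indexing (consistent with "`-1` = last layer") and the `blocks.{i}.hook_resid_post` naming that appears in the patching logs; `resolve_layer_index` and `resid_post_hook_name` are hypothetical helpers, not part of `ActivationPatcher`.

```python
def resolve_layer_index(layer_idx: int, n_layers: int) -> int:
    """Map a possibly negative layer index onto the range 0..n_layers-1."""
    if not -n_layers <= layer_idx < n_layers:
        raise IndexError(f"layer {layer_idx} is out of range for {n_layers} layers")
    return layer_idx % n_layers


def resid_post_hook_name(layer_idx: int, n_layers: int) -> str:
    """Hook name in the format that appears in the patching logs."""
    return f"blocks.{resolve_layer_index(layer_idx, n_layers)}.hook_resid_post"


layers_to_test = [0, 3, 6, 9, -1]

# For a 24-layer model such as gpt2-medium, -1 resolves to layer 23.
print([resolve_layer_index(i, n_layers=24) for i in layers_to_test])  # [0, 3, 6, 9, 23]
print(resid_post_hook_name(-1, n_layers=24))  # blocks.23.hook_resid_post
```

On a 24-layer model this list leaves layers 10 through 22 untested, so adding a few indices from that range gives a more even sweep.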


In [None]:
# EXPERIMENT 2: Different Layers Comparison

# ===== MODULAR CONFIGURATION SECTION =====
NUM_EXAMPLES = 15  # Number of examples to use per pattern type (1-40)
LAYERS_TO_TEST = [0, 3, 6, 9, -1]  # Early-to-middle layers plus the final layer (-1)
TOKEN_STRATEGY = TokenSelectionStrategy.KEYWORDS  # Alternatives: MID_TOKENS, LAST_COUPLE, LAST_TOKEN, ALL_TOKENS
PATTERN_INDEX = 5  # Index into the filtered dataset (0-based)
CORRUPTED_TEXT = "I can't stop worrying about everything and feel completely"
MAX_NEW_TOKENS = 50
# ============================================

# Keep NUM_EXAMPLES examples per pattern type, as in Experiment 1
filtered_patterns, filtered_pattern_types = filter_patterns_by_count(pattern_types, NUM_EXAMPLES)
# Fall back to the first pattern if PATTERN_INDEX is out of range
sample_pattern = filtered_patterns[PATTERN_INDEX] if PATTERN_INDEX < len(filtered_patterns) else filtered_patterns[0]

clean_text = sample_pattern['positive_thought_pattern']

print(f"📋 EXPERIMENT 2 SETUP - LAYER COMPARISON:")
print(f"📊 Using {NUM_EXAMPLES} examples per pattern type from dataset")
print(f"🧠 Pattern: {sample_pattern['cognitive_pattern_name']}")
print(f"🔧 Token strategy: {TOKEN_STRATEGY.value}")
print(f"📝 Corrupted prompt: {CORRUPTED_TEXT}")
print(f"🏗️ Testing layers: {LAYERS_TO_TEST}")
print("\n" + "="*100)

layer_results = {}

for layer_idx in LAYERS_TO_TEST:
    print(f"\n--- LAYER {layer_idx} ---")
    try:
        predicted_token, generated_text = patcher.patch_and_generate(
            clean_text=clean_text,
            corrupted_text=CORRUPTED_TEXT,
            num_placeholder_tokens=5,
            capture_layer_idx=layer_idx,  # Capture from this layer
            patch_layer_idx=layer_idx,    # Patch into the same layer
            max_new_tokens=MAX_NEW_TOKENS,
            token_selection_strategy=TOKEN_STRATEGY,
            num_strategy_tokens=3
        )
        layer_results[layer_idx] = generated_text
        print(f"Generated: {generated_text}")
    except Exception as e:
        print(f"Error at layer {layer_idx}: {e}")
        layer_results[layer_idx] = f"Error: {e}"
    print("-" * 80)

## Experiment 3: Multiple Patterns Comparison

In [None]:
# EXPERIMENT 3: Multiple Patterns Comparison

# ===== MODULAR CONFIGURATION SECTION =====
NUM_EXAMPLES_PER_TYPE = 8  # Number of examples to use per pattern type (1-40)
NUM_PATTERNS_TO_TEST = 5  # How many different patterns to test
TOKEN_STRATEGY = TokenSelectionStrategy.KEYWORDS  # Change to: MID_TOKENS, LAST_COUPLE, LAST_TOKEN, ALL_TOKENS
CORRUPTED_TEXT = "I feel trapped and don't see a way forward because"
MAX_NEW_TOKENS = 65
CAPTURE_LAYER = -1
PATCH_LAYER = -1
NUM_PLACEHOLDERS = 5
# ============================================

# Apply runtime filtering to get limited dataset
filtered_patterns, filtered_pattern_types = filter_patterns_by_count(pattern_types, NUM_EXAMPLES_PER_TYPE)

# Get test patterns from filtered dataset
test_patterns = filtered_patterns[:NUM_PATTERNS_TO_TEST]

print(f"📋 EXPERIMENT 3 SETUP - MULTIPLE PATTERNS COMPARISON:")
print(f"📊 Using {NUM_EXAMPLES_PER_TYPE} examples per pattern type (runtime filtered)")
print(f"🔢 Total filtered patterns available: {len(filtered_patterns)} (was {len(patterns)})")
print(f"🧪 Testing {NUM_PATTERNS_TO_TEST} different patterns")
print(f"📝 Corrupted prompt: '{CORRUPTED_TEXT}'")
print(f"🔧 Token strategy: {TOKEN_STRATEGY.value}")
print(f"🏗️ Layers: capture={CAPTURE_LAYER}, patch={PATCH_LAYER}")
print(f"📍 Placeholders: {NUM_PLACEHOLDERS}, Max tokens: {MAX_NEW_TOKENS}")
print("\n" + "="*100)

pattern_results = []

for i, pattern in enumerate(test_patterns):
    print(f"\n--- PATTERN {i+1}/{NUM_PATTERNS_TO_TEST}: {pattern['cognitive_pattern_name']} ---")
    
    clean_text = pattern['positive_thought_pattern']
    
    print(f"Clean text (first 100 chars): {clean_text[:100]}...")
    
    try:
        predicted_token, generated_text = patcher.patch_and_generate(
            clean_text=clean_text,
            corrupted_text=CORRUPTED_TEXT,
            num_placeholder_tokens=NUM_PLACEHOLDERS,
            capture_layer_idx=CAPTURE_LAYER,  # Layer to capture activations from
            patch_layer_idx=PATCH_LAYER,      # Layer to patch activations into
            max_new_tokens=MAX_NEW_TOKENS,
            token_selection_strategy=TOKEN_STRATEGY,
            num_strategy_tokens=3
        )
        
        pattern_results.append({
            'pattern_name': pattern['cognitive_pattern_name'],
            'generated_text': generated_text
        })
        
        print(f"\n✨ GENERATED TEXT:")
        print(generated_text)
        
    except Exception as e:
        print(f"❌ Error with pattern {i+1}: {e}")
        pattern_results.append({
            'pattern_name': pattern['cognitive_pattern_name'],
            'generated_text': f"Error: {e}"
        })
    
    print("-" * 100)

print(f"\n🎊 EXPERIMENT 3 COMPLETED!")
num_successful = sum(not r['generated_text'].startswith('Error') for r in pattern_results)
print(f"✅ Successfully tested {num_successful} out of {len(test_patterns)} patterns")
print(f"📊 Using {NUM_EXAMPLES_PER_TYPE} examples per pattern type (runtime filtered)")

## Experiment 4: Different Number of Patch Positions

Let's experiment with varying the number of placeholder tokens we patch:

In [None]:
# EXPERIMENT 4: Different Number of Patch Positions

# ===== MODULAR CONFIGURATION SECTION =====
NUM_EXAMPLES = 12  # Number of examples to choose from dataset (1-40)
PLACEHOLDER_COUNTS = [1, 3, 5, 7]  # Different numbers of placeholder tokens to test
TOKEN_STRATEGY = TokenSelectionStrategy.KEYWORDS  # Change to: MID_TOKENS, LAST_COUPLE, LAST_TOKEN, ALL_TOKENS
PATTERN_INDEX = 10  # Which pattern to use from the limited set
CORRUPTED_TEXT = "My mind keeps racing with negative thoughts and I feel"
MAX_NEW_TOKENS = 55
CAPTURE_LAYER = -1
PATCH_LAYER = -1
# ============================================

# Get limited patterns and select one for testing
limited_patterns = patterns[:NUM_EXAMPLES]
sample_pattern = limited_patterns[PATTERN_INDEX] if PATTERN_INDEX < len(limited_patterns) else limited_patterns[0]

clean_text = sample_pattern['positive_thought_pattern']

print(f"📋 EXPERIMENT 4 SETUP - PLACEHOLDER COUNT TESTING:")
print(f"📊 Selected from {NUM_EXAMPLES} examples in dataset")
print(f"🧠 Pattern: {sample_pattern['cognitive_pattern_name']}")
print(f"🔧 Token strategy: {TOKEN_STRATEGY.value}")
print(f"📝 Corrupted prompt: {CORRUPTED_TEXT}")
print(f"📍 Testing placeholder counts: {PLACEHOLDER_COUNTS}")
print(f"🏗️ Layers: capture={CAPTURE_LAYER}, patch={PATCH_LAYER}")
print("\n" + "="*100)

placeholder_results = {}

for num_placeholders in PLACEHOLDER_COUNTS:
    print(f"\n--- {num_placeholders} PLACEHOLDER TOKENS ---")
    try:
        predicted_token, generated_text = patcher.patch_and_generate(
            clean_text=clean_text,
            corrupted_text=CORRUPTED_TEXT,
            num_placeholder_tokens=num_placeholders,
            capture_layer_idx=CAPTURE_LAYER,  # Layer to capture activations from
            patch_layer_idx=PATCH_LAYER,      # Layer to patch activations into
            max_new_tokens=MAX_NEW_TOKENS,
            token_selection_strategy=TOKEN_STRATEGY,
            num_strategy_tokens=3
        )
        placeholder_results[num_placeholders] = generated_text
        print(f"✨ Generated: {generated_text}")
    except Exception as e:
        print(f"❌ Error with {num_placeholders} placeholders: {e}")
        placeholder_results[num_placeholders] = f"Error: {e}"
    print("-" * 80)

print(f"\n🎊 EXPERIMENT 4 COMPLETED!")
print(f"✅ Tested {len(PLACEHOLDER_COUNTS)} different placeholder configurations")
print(f"📊 Using pattern from limited dataset of {NUM_EXAMPLES} examples")
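
The placeholder-token mode used in Experiment 4 can be sketched at the string level as follows. This is a simplified picture under stated assumptions: `prepend_placeholders` is a hypothetical helper, the placeholders are joined with spaces for readability, and the returned positions ignore any BOS token or chat-template prefix that the real pipeline may add in front of them.

```python
from typing import List, Tuple


def prepend_placeholders(text: str, num_placeholder_tokens: int,
                         placeholder: str = "<eos>") -> Tuple[str, List[int]]:
    """Prepend `num_placeholder_tokens` placeholders and return their positions."""
    if num_placeholder_tokens < 0:
        raise ValueError("num_placeholder_tokens must be non-negative")
    prefix = " ".join([placeholder] * num_placeholder_tokens)
    prompt = f"{prefix} {text}" if prefix else text
    # Assumption: the placeholders occupy the first positions of the prompt.
    return prompt, list(range(num_placeholder_tokens))


corrupted_text = "My mind keeps racing with negative thoughts and I feel"
for count in [1, 3, 5, 7]:
    prompt, positions = prepend_placeholders(corrupted_text, count)
    print(count, positions)
    print(prompt)
```

More placeholders mean more positions carrying the captured activation, which is the variable this experiment sweeps.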

## Experiment 6: Custom Pattern Testing

In [None]:
# EXPERIMENT 6: Custom Pattern Testing - Interactive Sandbox

# ===== MODULAR CONFIGURATION SECTION =====
# You can change these to experiment with different combinations
NUM_EXAMPLES = 20  # Number of examples to choose from in dataset (1-40)

CUSTOM_CLEAN_TEXT = "I'm taking a moment to acknowledge my feelings and remind myself that challenges are temporary. I can take small, manageable steps forward and focus on what I can control right now."
CUSTOM_CORRUPTED_TEXT = "I don't know what to do anymore and feel completely lost"
CUSTOM_TOKEN_STRATEGY = TokenSelectionStrategy.KEYWORDS  # Change to: MID_TOKENS, LAST_COUPLE, LAST_TOKEN, ALL_TOKENS
CUSTOM_NUM_PLACEHOLDERS = 5
CUSTOM_CAPTURE_LAYER = -1  # Layer to capture activations from
CUSTOM_PATCH_LAYER = 3     # Layer to patch activations into
CUSTOM_MAX_TOKENS = 70

# Strategy comparison settings
COMPARISON_NUM_PLACEHOLDERS = 3
COMPARISON_MAX_TOKENS = 50
COMPARISON_STRATEGIES = [TokenSelectionStrategy.KEYWORDS, TokenSelectionStrategy.MID_TOKENS, 
                        TokenSelectionStrategy.LAST_COUPLE, TokenSelectionStrategy.LAST_TOKEN, 
                        TokenSelectionStrategy.ALL_TOKENS]
# ============================================

print("📋 EXPERIMENT 6 SETUP - CUSTOM PATTERN TESTING:")
print("="*100)
print(f"📊 Working with {NUM_EXAMPLES} examples from dataset")
print(f"📝 Clean text: {CUSTOM_CLEAN_TEXT}")
print(f"📝 Corrupted text: {CUSTOM_CORRUPTED_TEXT}")
print(f"🔧 Token strategy: {CUSTOM_TOKEN_STRATEGY.value}")
print(f"🏗️ Settings: {CUSTOM_NUM_PLACEHOLDERS} placeholders, capture layer {CUSTOM_CAPTURE_LAYER}, patch layer {CUSTOM_PATCH_LAYER}, {CUSTOM_MAX_TOKENS} max tokens")

try:
    predicted_token, generated_text = patcher.patch_and_generate(
        clean_text=CUSTOM_CLEAN_TEXT,
        corrupted_text=CUSTOM_CORRUPTED_TEXT,
        num_placeholder_tokens=CUSTOM_NUM_PLACEHOLDERS,
        capture_layer_idx=CUSTOM_CAPTURE_LAYER,  # Layer to capture activations from
        patch_layer_idx=CUSTOM_PATCH_LAYER,      # Layer to patch activations into
        max_new_tokens=CUSTOM_MAX_TOKENS,
        token_selection_strategy=CUSTOM_TOKEN_STRATEGY,
        num_strategy_tokens=3
    )
    
    print("\n✨ GENERATED TEXT:")
    print(generated_text)
    
except Exception as e:
    print(f"❌ Error in custom experiment: {e}")

print("\n" + "="*100)
print("🔧 TOKEN STRATEGY COMPARISON:")
print("="*100)
print(f"📊 Comparing strategies with {NUM_EXAMPLES} examples available")

# Compare all token strategies
for strategy in COMPARISON_STRATEGIES:
    print(f"\n--- STRATEGY: {strategy.value.upper()} ---")
    try:
        predicted_token, generated_text = patcher.patch_and_generate(
            clean_text=CUSTOM_CLEAN_TEXT,
            corrupted_text=CUSTOM_CORRUPTED_TEXT,
            num_placeholder_tokens=COMPARISON_NUM_PLACEHOLDERS,
            capture_layer_idx=-1,
            patch_layer_idx=-1,
            max_new_tokens=COMPARISON_MAX_TOKENS,
            token_selection_strategy=strategy,
            num_strategy_tokens=3
        )
        print(f"✨ Generated: {generated_text}")
    except Exception as e:
        print(f"❌ Error: {e}")
    print("-" * 60)

print(f"\n🎊 EXPERIMENT 6 COMPLETED!")
print(f"✅ Tested custom settings and {len(COMPARISON_STRATEGIES)} token strategies")
print(f"📊 Dataset scope: {NUM_EXAMPLES} examples per cognitive pattern type")

## 🛠️ Troubleshooting & Utilities

In [8]:
# 🛠️ TROUBLESHOOTING UTILITIES

# 1. Reset hooks if experiments behave unexpectedly
def quick_reset():
    patcher.reset_hooks()
    print("🔄 Hooks reset - Model is clean!")

# 2. Show model details (wraps the patcher's instance method)
def check_model_info():
    patcher.check_model_info()

# 3. Clear GPU/system memory if needed (wraps the static method)
def clear_memory_util():
    ActivationPatcher.clear_memory()

# Quick access functions
print("Available utilities:")
print("- quick_reset() - Reset model hooks")
print("- check_model_info() - Display model details") 
print("- clear_memory_util() - Clear GPU/system memory")
print("- patcher.reset_hooks() - Direct reset call")

# Uncomment any line below to run:
# quick_reset()
# check_model_info()
# clear_memory_util()

Available utilities:
- quick_reset() - Reset model hooks
- check_model_info() - Display model details
- clear_memory_util() - Clear GPU/system memory
- patcher.reset_hooks() - Direct reset call


In [10]:
patcher.reset_hooks()

✅ Model hooks reset to clean state


In [12]:
# Test 1: Baseline generation without any patching
print("=== TEST 1: Baseline Generation (No Patching) ===")

# Reset model to clean state
patcher.reset_hooks()

# Simple test prompt
test_prompt = "I'm struggling with negative thoughts and need help finding a way forward."

# Generate without any patching/hooks
tokens = patcher.model.to_tokens(test_prompt, prepend_bos=False)
print(f"Input: {test_prompt}")
print(f"Tokenized length: {tokens.shape[1]}")

try:
    generated_tokens = patcher.model.generate(
        tokens,
        max_new_tokens=30,
        temperature=0.7,
        do_sample=True
    )
    baseline_output = patcher.model.to_string(generated_tokens[0])
    print(f"Baseline output: {baseline_output}")
except Exception as e:
    print(f"Error in baseline generation: {e}")

print("\n" + "="*80)

=== TEST 1: Baseline Generation (No Patching) ===
✅ Model hooks reset to clean state
Input: I'm struggling with negative thoughts and need help finding a way forward.
Tokenized length: 15


  0%|          | 0/30 [00:00<?, ?it/s]

Baseline output: I'm struggling with negative thoughts and need help finding a way forward. (9994222222222222222 and and beam surelyampa \ fourth today. (



In [13]:
# Test 2: overlay_strength=0 (should bypass hooks completely)
print("=== TEST 2: overlay_strength=0 (Should bypass hooks) ===")

# Reset model
patcher.reset_hooks()

# Test with overlay_strength=0 - this should generate without any patching
clean_text = "I'm taking a moment to acknowledge my feelings and remind myself that challenges are temporary."
corrupted_text = "I feel trapped and don't see a way forward because"

try:
    predicted_token, generated_text = patcher.patch_and_generate(
        clean_text=clean_text,
        corrupted_text=corrupted_text,
        capture_layer_idx=-1,
        patch_layer_idx=-1,
        max_new_tokens=20,
        overlay_strength=0.0,  # Should completely bypass patching
        token_selection_strategy=TokenSelectionStrategy.LAST_TOKEN,
        zero_placeholder_mode=False,
        num_placeholder_tokens=3
    )
    
    print(f"Generated with overlay_strength=0: {generated_text}")
    
except Exception as e:
    print(f"Error with overlay_strength=0: {e}")

print("\n" + "="*80)

=== TEST 2: overlay_strength=0 (Should bypass hooks) ===
✅ Model hooks reset to clean state
✅ Model hooks reset to clean state


  0%|          | 0/20 [00:00<?, ?it/s]

Generated with overlay_strength=0: <bos><start_of_turn>user
I feel trapped and don't see a way forward because<end_of_turn>
<start_of_turn>model
RenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOfRenderAtEndOf



In [14]:
# Test 3: Chat template formatting issue
print("=== TEST 3: Chat Template Analysis ===")

# Check if chat template is being used
print(f"Model should use chat template: {patcher._should_use_chat_template()}")

# Test formatting with and without chat template
test_text = "I feel trapped and don't see a way forward"

# With chat template
formatted_with_chat = patcher._apply_chat_template_text(test_text, add_generation_prompt=True)
print(f"\nWith chat template: {repr(formatted_with_chat)}")

# Without chat template  
formatted_without_chat = test_text
print(f"Without chat template: {repr(formatted_without_chat)}")

# Test tokenization of both
tokens_with_chat = patcher.model.to_tokens(formatted_with_chat, prepend_bos=False)
tokens_without_chat = patcher.model.to_tokens(formatted_without_chat, prepend_bos=False)

print(f"\nTokens with chat template length: {tokens_with_chat.shape[1]}")
print(f"Tokens without chat template length: {tokens_without_chat.shape[1]}")

# Test generation without chat template
print("\n--- Testing generation WITHOUT chat template ---")
try:
    patcher.reset_hooks()
    generated_tokens = patcher.model.generate(
        tokens_without_chat,
        max_new_tokens=20,
        temperature=0.7,
        do_sample=True
    )
    no_chat_output = patcher.model.to_string(generated_tokens[0])
    print(f"Output without chat template: {no_chat_output}")
except Exception as e:
    print(f"Error without chat template: {e}")

print("\n" + "="*80)

=== TEST 3: Chat Template Analysis ===
Model should use chat template: True

With chat template: "<bos><start_of_turn>user\nI feel trapped and don't see a way forward<end_of_turn>\n<start_of_turn>model\n"
Without chat template: "I feel trapped and don't see a way forward"

Tokens with chat template length: 20
Tokens without chat template length: 11

--- Testing generation WITHOUT chat template ---
✅ Model hooks reset to clean state


  0%|          | 0/20 [00:00<?, ?it/s]

Output without chat template: I feel trapped and don't see a way forward T N marketing Trust BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit BoxFit



In [32]:
# Test 4: Different generation parameters
print("=== TEST 4: Testing Different Generation Parameters ===")

test_prompt = "I need help dealing with difficult emotions"
tokens = patcher.model.to_tokens(test_prompt, prepend_bos=False)

# Test different parameter combinations
test_configs = [
    {"temperature": 1.0, "do_sample": True, "top_p": 0.9, "desc": "High temp, top_p"},
    {"temperature": 0.3, "do_sample": True, "top_p": 0.9, "desc": "Low temp, top_p"},
    {"temperature": 0.0, "do_sample": False, "desc": "Greedy sampling"},
    {"temperature": 0.7, "do_sample": True, "top_k": 50, "desc": "Top-k sampling"},
    {"temperature": 0.8, "do_sample": True, "repetition_penalty": 1.2, "desc": "With repetition penalty"}
]

for i, config in enumerate(test_configs):
    config = dict(config)  # Work on a copy so test_configs keeps "desc" when the cell is re-run
    desc = config.pop("desc")
    print(f"\n--- Test {i+1}: {desc} ---")
    print(f"Parameters: {config}")
    
    try:
        patcher.reset_hooks()
        generated_tokens = patcher.model.generate(
            tokens,
            max_new_tokens=15,
            **config
        )
        output = patcher.model.to_string(generated_tokens[0])
        print(f"Output: {output}")
    except Exception as e:
        print(f"Error: {e}")

print("\n" + "="*80)

=== TEST 4: Testing Different Generation Parameters ===

--- Test 1: High temp, top_p ---
Parameters: {'temperature': 1.0, 'do_sample': True, 'top_p': 0.9}
✅ Model hooks reset to clean state


Output: I need help dealing with difficult emotions   increa increa increa increa increa increa increa increa increa increa increa increa increa increa

--- Test 2: Low temp, top_p ---
Parameters: {'temperature': 0.3, 'do_sample': True, 'top_p': 0.9}
✅ Model hooks reset to clean state


Output: I need help dealing with difficult emotionsvity affor increa increa increa increa increa increa increa increa increa increa increa increa increa

--- Test 3: Greedy sampling ---
Parameters: {'temperature': 0.0, 'do_sample': False}
✅ Model hooks reset to clean state


Output: I need help dealing with difficult emotions ( maneu increa increa increa increa increa increa increa increa increa increa increa increa increa

--- Test 4: Top-k sampling ---
Parameters: {'temperature': 0.7, 'do_sample': True, 'top_k': 50}
✅ Model hooks reset to clean state


Output: I need help dealing with difficult emotionsvity affor increa increa increa increa increa increa increa increa increa increa increa increa increa

--- Test 5: With repetition penalty ---
Parameters: {'temperature': 0.8, 'do_sample': True, 'repetition_penalty': 1.2}
✅ Model hooks reset to clean state
Error: HookedTransformer.generate() got an unexpected keyword argument 'repetition_penalty'
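
The error above indicates that `HookedTransformer.generate()` in this environment does not accept `repetition_penalty`. One way to keep a parameter sweep running is to drop unsupported keyword arguments before the call. The sketch below is a minimal, hypothetical helper: `filter_supported_kwargs` and `fake_generate` are illustrative names, not part of TransformerLens or `activation_patcher`. It relies only on `inspect.signature` from the standard library.

```python
import inspect


def filter_supported_kwargs(fn, kwargs):
    """Split kwargs into (supported, dropped) based on fn's signature.

    Hypothetical helper for illustration; not part of TransformerLens.
    """
    params = inspect.signature(fn).parameters
    # If fn declares **kwargs, every keyword is accepted at call time.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), {}
    supported = {k: v for k, v in kwargs.items() if k in params}
    dropped = {k: v for k, v in kwargs.items() if k not in params}
    return supported, dropped


def fake_generate(tokens, max_new_tokens=10, temperature=1.0,
                  do_sample=True, top_k=None, top_p=None):
    """Stand-in with a generate-like signature (an assumption, for the demo only)."""
    return tokens


config = {"temperature": 0.8, "do_sample": True, "repetition_penalty": 1.2}
supported, dropped = filter_supported_kwargs(fake_generate, config)
print(f"Supported: {supported}")
print(f"Dropped:   {dropped}")
```

In the loop above, the same helper could be called with `patcher.model.generate` as `fn`, printing `dropped` so that skipped parameters stay visible in the output.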



In [33]:
# Check current model and test basic generation
print("=== CURRENT MODEL STATUS ===")
print(f"Model name: {patcher.model_name}")
print(f"Model family: {patcher.model_config['family']}")
print(f"Model size: {patcher.model_config['size']}")
print(f"Device: {next(patcher.model.parameters()).device}")

# Reset to clean state
patcher.reset_hooks()

# Test simple generation
test_input = "The weather today is"
tokens = patcher.model.to_tokens(test_input, prepend_bos=False)

print(f"\nInput: '{test_input}'")
print(f"Tokenized: {tokens}")
print(f"Token length: {tokens.shape[1]}")

# Test with very conservative parameters
try:
    generated = patcher.model.generate(
        tokens,
        max_new_tokens=10,
        temperature=1.0,
        do_sample=False  # Greedy decoding
    )
    output = patcher.model.to_string(generated[0])
    print(f"Generated: '{output}'")
except Exception as e:
    print(f"Generation error: {e}")

=== CURRENT MODEL STATUS ===
Model name: google/gemma-2b
Model family: gemma
Model size: 2b
Device: mps:0
✅ Model hooks reset to clean state

Input: 'The weather today is'
Tokenized: tensor([[ 651, 8957, 3646,  603]], device='mps:0')
Token length: 4


Generated: 'The weather today is maneu increa increa increa increa increa increa increa increa increa'
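
Every call in these tests passes `prepend_bos=False`, and the tokenized prompt printed above contains no `<bos>` id. Whether the missing `<bos>` contributes to the repeated `increa` tokens is not established by these cells; a run with `prepend_bos=True` would be the direct comparison. The sketch below only shows how to check a token list for a leading BOS id. `starts_with_bos` is a hypothetical helper, and the id `2` is an assumption about the Gemma tokenizer; in the notebook it should be read from `patcher.model.tokenizer.bos_token_id`.

```python
def starts_with_bos(token_ids, bos_id):
    """Return True if the sequence is non-empty and begins with the BOS id."""
    return len(token_ids) > 0 and token_ids[0] == bos_id


# Token ids printed above for "The weather today is" (prepend_bos=False).
prompt_ids = [651, 8957, 3646, 603]

# Assumption: 2 is the <bos> id for this tokenizer. Prefer reading
# tokenizer.bos_token_id over hard-coding the value.
ASSUMED_BOS_ID = 2

print(starts_with_bos(prompt_ids, ASSUMED_BOS_ID))
print(starts_with_bos([ASSUMED_BOS_ID] + prompt_ids, ASSUMED_BOS_ID))
```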


In [34]:
# Check tokenization details and test different approaches
print("=== TOKENIZATION ANALYSIS ===")

# Check tokenizer properties
print(f"Tokenizer type: {type(patcher.model.tokenizer)}")
print(f"Vocab size: {patcher.model.cfg.d_vocab}")
print(f"BOS token: {patcher.model.tokenizer.bos_token}")
print(f"EOS token: {patcher.model.tokenizer.eos_token}")
print(f"PAD token: {patcher.model.tokenizer.pad_token}")

# Test tokenization
test_texts = [
    "Hello world",
    "The weather is nice",
    "I am feeling good today"
]

for text in test_texts:
    tokens = patcher.model.to_tokens(text, prepend_bos=False)
    decoded = patcher.model.to_string(tokens[0])
    print(f"'{text}' -> {tokens[0].tolist()} -> '{decoded}'")

=== TOKENIZATION ANALYSIS ===
Tokenizer type: <class 'transformers.models.gemma.tokenization_gemma_fast.GemmaTokenizerFast'>
Vocab size: 256000
BOS token: <bos>
EOS token: <eos>
PAD token: <pad>
'Hello world' -> [4521, 2134] -> 'Hello world'
'The weather is nice' -> [651, 8957, 603, 4866] -> 'The weather is nice'
'I am feeling good today' -> [235285, 1144, 7965, 1426, 3646] -> 'I am feeling good today'


In [35]:
# Test different generation parameters systematically
print("=== GENERATION PARAMETER TESTING ===")

test_input = "The weather today is"
tokens = patcher.model.to_tokens(test_input, prepend_bos=False)

# Test 1: Pure greedy decoding
print("\n1. Pure greedy decoding:")
try:
    patcher.reset_hooks()
    gen1 = patcher.model.generate(tokens, max_new_tokens=8, do_sample=False)
    print(f"   {patcher.model.to_string(gen1[0])}")
except Exception as e:
    print(f"   Error: {e}")

# Test 2: Low temperature sampling
print("\n2. Low temperature sampling:")
try:
    patcher.reset_hooks() 
    gen2 = patcher.model.generate(tokens, max_new_tokens=8, temperature=0.1, do_sample=True)
    print(f"   {patcher.model.to_string(gen2[0])}")
except Exception as e:
    print(f"   Error: {e}")

# Test 3: Top-p sampling
print("\n3. Top-p sampling:")
try:
    patcher.reset_hooks()
    gen3 = patcher.model.generate(tokens, max_new_tokens=8, temperature=0.8, do_sample=True, top_p=0.9)
    print(f"   {patcher.model.to_string(gen3[0])}")
except Exception as e:
    print(f"   Error: {e}")

# Test 4: Check model forward pass directly
print("\n4. Direct forward pass (logits):")
try:
    patcher.reset_hooks()
    with torch.no_grad():
        logits = patcher.model(tokens)
        next_token_logits = logits[0, -1, :]
        top_tokens = torch.topk(next_token_logits, 5)
        print(f"   Top 5 next token predictions:")
        for i, (score, token_id) in enumerate(zip(top_tokens.values, top_tokens.indices)):
            token_str = patcher.model.to_string([token_id.item()])
            print(f"     {i+1}. '{token_str}' (score: {score:.3f})")
except Exception as e:
    print(f"   Error: {e}")

=== GENERATION PARAMETER TESTING ===

1. Pure greedy decoding:
✅ Model hooks reset to clean state


   The weather today is maneu increa increa increa increa increa increa increa

2. Low temperature sampling:
✅ Model hooks reset to clean state


   The weather today is maneu increa increa increa increa increa increa increa

3. Top-p sampling:
✅ Model hooks reset to clean state


   The weather today is maneu increa increa increa increa increa increa increa

4. Direct forward pass (logits):
✅ Model hooks reset to clean state
   Top 5 next token predictions:
     1. ' maneu' (score: 262.000)
     2. ' increa' (score: 250.750)
     3. ' impra' (score: 247.000)
     4. ' affor' (score: 246.250)
     5. ' reluct' (score: 244.375)
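
The gap between the top two scores (262.000 vs 250.750) helps explain why greedy decoding, low-temperature sampling, and top-p sampling all produced the same text: after a softmax, nearly all probability mass lands on the first token. The sketch below renormalizes only the five printed scores, so it is an approximation; the full 256000-entry vocabulary would lower the top probability somewhat, and temperature scaling is ignored.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Scores copied from the "Top 5 next token predictions" output above.
top5 = {
    " maneu": 262.0,
    " increa": 250.75,
    " impra": 247.0,
    " affor": 246.25,
    " reluct": 244.375,
}

probs = softmax(list(top5.values()))
for token, p in zip(top5, probs):
    print(f"{token!r}: {p:.6f}")
```

With a distribution this peaked, sampling settings have little room to change the outcome, so varying `temperature` or `top_p` is unlikely to fix the generation on its own.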


In [36]:
# Check model configuration and loading details
print("=== MODEL CONFIGURATION ANALYSIS ===")

print(f"Model config:")
print(f"  n_layers: {patcher.model.cfg.n_layers}")
print(f"  d_model: {patcher.model.cfg.d_model}")
print(f"  n_heads: {patcher.model.cfg.n_heads}")
print(f"  d_head: {patcher.model.cfg.d_head}")
print(f"  d_vocab: {patcher.model.cfg.d_vocab}")
print(f"  max_pos_embeds: {patcher.model.cfg.n_ctx}")

# Check model dtype and device
print(f"\nModel parameters:")
for name, param in list(patcher.model.named_parameters())[:3]:  # First 3 parameters
    print(f"  {name}: {param.dtype}, {param.device}, shape: {param.shape}")

# Check if model weights look reasonable
print(f"\nEmbedding weights sample:")
embed_weights = patcher.model.embed.W_E[:5, :5]  # First 5x5 of embedding matrix
print(f"  Values: {embed_weights}")
print(f"  Range: [{embed_weights.min():.3f}, {embed_weights.max():.3f}]")

# Test several unrelated inputs to see whether the top prediction is always " maneu"
different_inputs = ["Hello", "Python is", "Machine learning", "Today I will"]

print(f"\nTesting different inputs for top prediction:")
for inp in different_inputs:
    tokens = patcher.model.to_tokens(inp, prepend_bos=False)
    with torch.no_grad():
        logits = patcher.model(tokens)
        next_token_id = torch.argmax(logits[0, -1, :]).item()
        next_token_str = patcher.model.to_string([next_token_id])
        print(f"  '{inp}' -> '{next_token_str}'")

=== MODEL CONFIGURATION ANALYSIS ===
Model config:
  n_layers: 18
  d_model: 2048
  n_heads: 8
  d_head: 256
  d_vocab: 256000
  max_pos_embeds: 8192

Model parameters:
  embed.W_E: torch.float16, mps:0, shape: torch.Size([256000, 2048])
  blocks.0.ln1.w: torch.float16, mps:0, shape: torch.Size([2048])
  blocks.0.ln2.w: torch.float16, mps:0, shape: torch.Size([2048])

Embedding weights sample:
  Values: tensor([[24.2188, -1.5684,  4.3086,  4.4180,  1.8779],
        [ 6.8477, -7.7773, -6.4062, -0.6489,  0.2803],
        [ 5.3906,  0.8672, -1.4365,  0.0980, -0.4888],
        [24.2188, -1.5576,  4.2852,  4.4180,  1.9219],
        [15.5547, -3.5801, -0.3037, -0.9775,  2.6074]], device='mps:0',
       dtype=torch.float16, grad_fn=<SliceBackward0>)
  Range: [-7.777, 24.219]

Testing different inputs for top prediction:
  'Hello' -> ' shenan'
  'Python is' -> ' maneu'
  'Machine learning' -> 'summar'
  'Today I will' -> ' maneu'


In [37]:
# Sanity check: load a different, well-supported model and compare its generation
print("=== TESTING MODEL RELOAD ===")

# First, check what happens with a better-supported model
print("Testing with gpt2-medium (known to work well):")

try:
    # Create a new patcher with gpt2-medium
    test_patcher = ActivationPatcher("gpt2-medium")
    
    # Test generation with the new model
    test_input = "The weather today is"
    tokens = test_patcher.model.to_tokens(test_input, prepend_bos=False)
    
    generated = test_patcher.model.generate(
        tokens, 
        max_new_tokens=8, 
        temperature=0.8,
        do_sample=True
    )
    
    output = test_patcher.model.to_string(generated[0])
    print(f"GPT2-medium output: '{output}'")
    
except Exception as e:
    print(f"Error with gpt2-medium: {e}")

print(f"\nCurrent Gemma model generation issues likely due to:")
print(f"1. Model not fully supported by TransformerLens")
print(f"2. Float16 precision causing numerical instabilities") 
print(f"3. Gemma-specific tokenization/generation quirks")
print(f"4. Model may need different generation parameters or constraints")

=== TESTING MODEL RELOAD ===
Testing with gpt2-medium (known to work well):
Using Apple Silicon GPU (MPS)
Loading model: gpt2-medium on device: mps


Loaded pretrained model gpt2-medium into HookedTransformer
✓ Model loaded successfully
  - Family: gpt2
  - Size: medium
  - Layers: 24
  - Model dimension: 1024
  - Vocabulary size: 50257


GPT2-medium output: 'The weather today is a bit.Z (May, knee'

Current Gemma model generation issues likely due to:
1. Model not fully supported by TransformerLens
2. Float16 precision causing numerical instabilities
3. Gemma-specific tokenization/generation quirks
4. Model may need different generation parameters or constraints
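
Hypothesis 2 concerns float16, which is the dtype reported for the Gemma parameters in the configuration output earlier. As background, the sketch below uses the standard-library `struct` module (format code `e`, IEEE 754 half precision) to show how coarse float16 is at the magnitude of the logits seen above and where it overflows. It illustrates the number format only and does not show that precision is the cause of the degenerate output; `to_float16` is a hypothetical helper name.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Between 256 and 512, float16 values are spaced 0.25 apart.
print(to_float16(262.3))

# 65504 is the largest finite float16 value.
print(to_float16(65504.0))

# Larger magnitudes cannot be represented; struct raises instead of returning inf.
try:
    to_float16(70000.0)
except OverflowError as exc:
    print(f"overflow: {exc}")
```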


In [38]:
# Test activation patching with the working GPT2-medium model
print("=== TESTING ACTIVATION PATCHING WITH GPT2-MEDIUM ===")

# Simple test with overlay_strength=0 (should be clean generation)
clean_text = "I'm taking a moment to acknowledge my feelings and remind myself that challenges are temporary."
corrupted_text = "I feel trapped and don't see a way forward because"

try:
    predicted_token, generated_text = test_patcher.patch_and_generate(
        clean_text=clean_text,
        corrupted_text=corrupted_text,
        capture_layer_idx=-1,
        patch_layer_idx=-1,
        max_new_tokens=20,
        overlay_strength=0.0,  # No patching
        token_selection_strategy=TokenSelectionStrategy.LAST_TOKEN,
        zero_placeholder_mode=False,
        num_placeholder_tokens=3,
        use_chat_template=False  # GPT-2 is a base model with no chat template
    )
    
    print(f"✅ overlay_strength=0 (clean): {generated_text}")
    
except Exception as e:
    print(f"Error with overlay_strength=0: {e}")

# Test with actual patching (overlay_strength=0.5)
try:
    predicted_token, generated_text = test_patcher.patch_and_generate(
        clean_text=clean_text,
        corrupted_text=corrupted_text,
        capture_layer_idx=-1,
        patch_layer_idx=-1,
        max_new_tokens=20,
        overlay_strength=0.5,  # With patching
        token_selection_strategy=TokenSelectionStrategy.LAST_TOKEN,
        zero_placeholder_mode=False,
        num_placeholder_tokens=3,
        use_chat_template=False
    )
    
    print(f"✅ overlay_strength=0.5 (patched): {generated_text}")
    
except Exception as e:
    print(f"Error with overlay_strength=0.5: {e}")

=== TESTING ACTIVATION PATCHING WITH GPT2-MEDIUM ===
✅ Model hooks reset to clean state


✅ overlay_strength=0 (clean): I feel trapped and don't see a way forward because they are EXTRA.















✅ Model hooks reset to clean state
Clean text: I'm taking a moment to acknowledge my feelings and remind myself that challenges are temporary.
Corrupted text: 0 0 0 I feel trapped and don't see a way forward because
Capture layers: [23]
Patch layers: [23]


Using token selection strategy: last_token
Sequence length: 17
Selected last token at position: 16


                                                                


Patching with 1 total activation vectors
Using 1 hooks across layers: [23] -> [23]




Patching at hook: blocks.23.hook_resid_post for positions: [1] (mode: normalized, strength: 0.5)

Next token prediction after patching: ' of'


Patching at hook: blocks.23.hook_resid_post for positions: [1] (mode: normalized, strength: 0.5)
✅ overlay_strength=0.5 (patched): 0 0 0 I feel trapped and don't see a way forward because "I/Sarose No data provided by Melissa McCarthy, Chewah many have any other
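
The log lines report `mode: normalized, strength: 0.5`, but the notebook does not show how `ActivationPatcher` combines the captured and target activations. The sketch below is one plausible reading, stated as an assumption rather than the actual implementation: a linear blend of the two vectors, rescaled to the target's original norm. `overlay` is a hypothetical function, and plain Python lists stand in for residual-stream tensors.

```python
import math


def overlay(target, source, strength, normalize=True):
    """Blend source into target; optionally rescale to target's L2 norm.

    Assumed behavior for illustration only, not the patcher's actual code.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    blended = [(1.0 - strength) * t + strength * s
               for t, s in zip(target, source)]
    if not normalize:
        return blended
    target_norm = math.sqrt(sum(t * t for t in target))
    blended_norm = math.sqrt(sum(b * b for b in blended))
    if blended_norm == 0.0:
        return blended  # Nothing to rescale
    scale = target_norm / blended_norm
    return [b * scale for b in blended]


target_act = [3.0, 4.0]   # Stand-in for the activation at a placeholder position
source_act = [0.0, 10.0]  # Stand-in for the activation captured from clean_text

print(overlay(target_act, source_act, strength=0.0))
print(overlay(target_act, source_act, strength=0.5))
```

Under this reading, `strength=0.0` returns the target activation unchanged, which matches the notebook's use of `overlay_strength=0.0` as the clean baseline.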
