# SelfIE Psychology Experiments

This notebook applies SelfIE (Self-Interpretation of Embeddings) to cognitive pattern analysis. The workflow mirrors the manual activation patching experiments, but instead of patching activations it relies on the model's own ability to describe its internal representations.

## How SelfIE Works Here

- **Source text**: We extract internal representations from cognitive pattern text
- **Interpretation**: The model describes what these representations mean in natural language
- **Analysis**: We compare interpretations between different types of cognitive patterns

Unlike activation patching, which modifies model behavior, SelfIE reveals what the model "thinks" its internal representations mean, giving a natural-language view of how it processes each cognitive pattern.
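
The three steps above can be sketched with plain Python lists standing in for hidden states. This is a toy illustration under stated assumptions, not the `selfie_patcher` implementation: `extract_vector`, `inject_vector`, and the placeholder positions are hypothetical names, and a real run would operate on the model's tensors during a second forward pass.

```python
from typing import List

Vector = List[float]


def extract_vector(hidden_states: List[List[Vector]], layer: int, token: int) -> Vector:
    """Return a copy of the vector at (layer, token) from the source-text pass."""
    return list(hidden_states[layer][token])


def inject_vector(
    prompt_states: List[Vector], placeholder_positions: List[int], vector: Vector
) -> List[Vector]:
    """Return a copy of the prompt's vectors with the placeholders overwritten by `vector`."""
    patched = [list(v) for v in prompt_states]
    for pos in placeholder_positions:
        patched[pos] = list(vector)
    return patched


# Toy source pass: 2 layers x 3 tokens x 2 dimensions.
source_hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # layer 0
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],  # layer 1
]

# Toy interpretation prompt with 4 positions; positions 1 and 2 are placeholders.
prompt_hidden = [[0.0, 0.0] for _ in range(4)]

vec = extract_vector(source_hidden, layer=-1, token=-1)  # last layer, last token
patched_prompt = inject_vector(prompt_hidden, [1, 2], vec)
print(patched_prompt)
```

In the actual technique, the model then generates text from the patched prompt, and that generated text is the interpretation.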

In [None]:
import sys
import os
import torch
import pandas as pd
import json
from IPython.display import display, HTML
import warnings

# Add our SelfIE wrapper to path
sys.path.append('/Users/ivanculo/Desktop/Projects/turn_point/psych_selfie')
sys.path.append('/Users/ivanculo/Desktop/Projects/turn_point/manual_activation_patching')

# Import SelfIE wrapper
from selfie_patcher import SelfIEPatcher, TokenSelectionStrategy

# Import utility functions from the original activation patcher for dataset loading
try:
    from activation_patcher import ActivationPatcher
    load_cognitive_patterns = ActivationPatcher.load_cognitive_patterns
    get_pattern_by_index = ActivationPatcher.get_pattern_by_index
    get_pattern_by_type = ActivationPatcher.get_pattern_by_type
    get_pattern_text = ActivationPatcher.get_pattern_text
    filter_patterns_by_count = ActivationPatcher.filter_patterns_by_count
    list_available_pattern_types = ActivationPatcher.list_available_pattern_types
    print("✓ Successfully imported cognitive pattern utilities")
except ImportError as e:
    warnings.warn(f"Could not import pattern utilities: {e}")
    print("⚠️  Will use manual pattern definitions")

print("SelfIE Psychology Experiment Setup Complete!")

## Initialize Model and Load Dataset
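
The experiments below read four fields from each pattern record. A minimal sketch of a check that could run after loading, assuming the dataset uses the same field names as the manual fallback patterns in this notebook; `find_invalid_patterns` is a hypothetical helper and the second sample record is deliberately incomplete.

```python
from typing import Dict, List

REQUIRED_KEYS = (
    "cognitive_pattern_name",
    "positive_thought_pattern",
    "negative_thought_pattern",
    "transition_thought_pattern",
)


def find_invalid_patterns(patterns: List[Dict[str, object]]) -> Dict[int, List[str]]:
    """Map the index of each invalid pattern to its missing or empty keys."""
    problems: Dict[int, List[str]] = {}
    for index, pattern in enumerate(patterns):
        missing = []
        for key in REQUIRED_KEYS:
            value = pattern.get(key)
            # A field counts as present only if it is a non-blank string.
            if not isinstance(value, str) or not value.strip():
                missing.append(key)
        if missing:
            problems[index] = missing
    return problems


sample_patterns = [
    {
        "cognitive_pattern_name": "Anxiety Response",
        "positive_thought_pattern": "I can handle challenging situations by taking things one step at a time.",
        "negative_thought_pattern": "Everything is going to go wrong and I won't be able to cope.",
        "transition_thought_pattern": "I'm feeling anxious, but I can use coping strategies to manage this.",
    },
    {"cognitive_pattern_name": "Incomplete", "positive_thought_pattern": ""},
]

print(find_invalid_patterns(sample_patterns))
```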

In [None]:
# Model Selection - Choose a LLaMA-compatible model for SelfIE
# Note: SelfIE currently works best with LLaMA models
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # Change to available LLaMA model
# Alternative models to try:
# MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
# MODEL_NAME = "huggyllama/llama-7b"

# Initialize the SelfIE patcher
print(f"Initializing SelfIE with model: {MODEL_NAME}")
print("Note: This requires transformers==4.34.0 (see requirements.txt)")

try:
    selfie_patcher = SelfIEPatcher(MODEL_NAME)
    print("✓ SelfIE patcher initialized successfully!")
except Exception as e:
    print(f"❌ Error initializing SelfIE: {e}")
    print("\n🔧 Troubleshooting tips:")
    print("1. Ensure you have transformers==4.34.0 installed")
    print("2. Make sure the model is available and you have access")
    print("3. Try a different LLaMA model")
    print("4. Check your GPU memory and CUDA setup")

# Load the cognitive patterns dataset (if available)
try:
    patterns, pattern_types = load_cognitive_patterns()
    print(f"\n✓ Loaded {len(patterns)} cognitive patterns")
    list_available_pattern_types(pattern_types)
except Exception:
    print("\n⚠️  Using manual pattern definitions for this demo")
    # Define some sample patterns manually
    patterns = [
        {
            'cognitive_pattern_name': 'Negative Self-Evaluation',
            'positive_thought_pattern': "I'm learning and growing from my mistakes, and that's part of being human.",
            'negative_thought_pattern': "I always mess everything up and never do anything right.",
            'transition_thought_pattern': "I made a mistake, but I can learn from this experience."
        },
        {
            'cognitive_pattern_name': 'Anxiety Response',
            'positive_thought_pattern': "I can handle challenging situations by taking things one step at a time.",
            'negative_thought_pattern': "Everything is going to go wrong and I won't be able to cope.",
            'transition_thought_pattern': "I'm feeling anxious, but I can use coping strategies to manage this."
        }
    ]
    pattern_types = {'Negative Self-Evaluation': [0], 'Anxiety Response': [1]}
    print(f"✓ Using {len(patterns)} manual patterns for demonstration")

## Experiment 1: Basic SelfIE Interpretation

Let's start with a simple experiment to interpret internal representations of cognitive patterns using SelfIE.

### What this experiment does:
- **Input**: We provide a cognitive pattern text (positive or negative thought)
- **Extraction**: SelfIE extracts internal representations from specific layers and token positions
- **Interpretation**: The model describes what these representations mean in natural language
- **Analysis**: We examine the interpretations to understand how the model processes different cognitive patterns
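
The extraction step refers to specific layers and token positions, and the configuration in the next cell uses negative layer indices. A minimal sketch of how such indices could map to absolute positions, assuming Python-style indexing from the end and a model with 32 decoder layers such as Llama-2-7B; `resolve_layers` and `last_token_position` are hypothetical helpers, and the wrapper may resolve indices differently.

```python
from typing import List


def resolve_layers(layers: List[int], num_layers: int) -> List[int]:
    """Map possibly negative layer indices to absolute indices in [0, num_layers)."""
    resolved = []
    for layer in layers:
        absolute = layer + num_layers if layer < 0 else layer
        if not 0 <= absolute < num_layers:
            raise IndexError(f"layer {layer} is out of range for {num_layers} layers")
        resolved.append(absolute)
    return resolved


def last_token_position(num_tokens: int) -> int:
    """Absolute position of the final token in a sequence of num_tokens tokens."""
    if num_tokens <= 0:
        raise ValueError("sequence must contain at least one token")
    return num_tokens - 1


print(resolve_layers([-1, -2, -3], num_layers=32))  # [31, 30, 29]
print(last_token_position(12))                      # 11
```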

In [None]:
# EXPERIMENT 1: Basic SelfIE Interpretation

# ===== CONFIGURATION SECTION =====
PATTERN_INDEX = 0  # Which pattern to analyze
TEXT_TYPE = "negative_thought_pattern"  # "positive_thought_pattern", "negative_thought_pattern", "transition_thought_pattern"
INTERPRETATION_TEMPLATE = "cognitive_pattern"  # Template for interpretation
LAYERS_TO_INTERPRET = [-1, -2, -3]  # Which layers to extract representations from (negative = from end)
TOKEN_STRATEGY = TokenSelectionStrategy.LAST_TOKEN  # How to select tokens (printed below; not passed to interpret_text in this cell)
MAX_INTERPRETATION_TOKENS = 40  # Max tokens for interpretation
BATCH_SIZE = 2
# ====================================

# Get the pattern to analyze
if PATTERN_INDEX < len(patterns):
    selected_pattern = patterns[PATTERN_INDEX]
    input_text = selected_pattern[TEXT_TYPE]
    pattern_name = selected_pattern['cognitive_pattern_name']
else:
    print("⚠️  Pattern index out of range; using a fallback sample text")
    input_text = "I feel overwhelmed and don't know how to handle this situation."
    pattern_name = "Sample Pattern"

print(f"\n🧠 EXPERIMENT 1: SelfIE Interpretation")
print(f"Pattern: {pattern_name}")
print(f"Text type: {TEXT_TYPE}")
print(f"Input text: {input_text}")
print(f"Layers to interpret: {LAYERS_TO_INTERPRET}")
print(f"Token strategy: {TOKEN_STRATEGY.value}")
print("\n" + "="*80)

try:
    # Perform SelfIE interpretation
    interpretation_results = selfie_patcher.interpret_text(
        text=input_text,
        layers_to_interpret=LAYERS_TO_INTERPRET,
        interpretation_template=INTERPRETATION_TEMPLATE,
        max_new_tokens=MAX_INTERPRETATION_TOKENS,
        batch_size=BATCH_SIZE
    )
    
    print("\n🎊 EXPERIMENT 1 RESULTS:")
    print("=" * 80)
    
    # Display results in a readable format
    for idx, row in interpretation_results.iterrows():
        print(f"\n📍 Layer {row['layer']}, Token {row['token']} ('{row['token_decoded']}'):")
        print(f"   Interpretation: {row['interpretation'].strip()}")
    
    print("\n📊 Full Results DataFrame:")
    display(interpretation_results[['layer', 'token', 'token_decoded', 'interpretation']])
    
except Exception as e:
    print(f"❌ Error in SelfIE interpretation: {e}")
    print("\n🔧 Troubleshooting:")
    print("1. Check that the model is properly loaded")
    print("2. Verify the model is LLaMA-compatible")
    print("3. Ensure sufficient GPU memory")
    print("4. Try reducing batch_size or max_new_tokens")

print("\n" + "="*80)

## Experiment 2: Comparing Positive vs Negative Patterns

Let's compare how the model interprets positive versus negative cognitive patterns.
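
Beyond reading the two interpretations side by side, a simple lexical comparison can show which words are unique to each. A minimal sketch using only the standard library; the two example strings are made up for illustration and are not model output, and word overlap is only a rough proxy for a difference in meaning.

```python
import re
from typing import Set


def content_words(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up strings standing in for interpretation text.
positive_interp = "The message expresses growth and acceptance of mistakes"
negative_interp = "The message expresses failure and harsh criticism of mistakes"

pos_words = content_words(positive_interp)
neg_words = content_words(negative_interp)

print("shared:", sorted(pos_words & neg_words))
print("only positive:", sorted(pos_words - neg_words))
print("only negative:", sorted(neg_words - pos_words))
print("jaccard:", round(jaccard(pos_words, neg_words), 3))
```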

In [None]:
# EXPERIMENT 2: Positive vs Negative Pattern Comparison

# ===== CONFIGURATION SECTION =====
COMPARISON_PATTERN_INDEX = 0
COMPARISON_LAYER = -1  # Focus on final layer
COMPARISON_STRATEGY = TokenSelectionStrategy.LAST_TOKEN  # Printed below; not passed to interpret_text in this cell
COMPARISON_TEMPLATE = "psychological_state"
COMPARISON_MAX_TOKENS = 30
# ====================================

if COMPARISON_PATTERN_INDEX < len(patterns):
    selected_pattern = patterns[COMPARISON_PATTERN_INDEX]
    
    positive_text = selected_pattern['positive_thought_pattern']
    negative_text = selected_pattern['negative_thought_pattern']
    pattern_name = selected_pattern['cognitive_pattern_name']
    
    print(f"\n🔄 EXPERIMENT 2: Positive vs Negative Comparison")
    print(f"Pattern: {pattern_name}")
    print(f"Layer: {COMPARISON_LAYER}")
    print(f"Strategy: {COMPARISON_STRATEGY.value}")
    print("\n" + "="*80)
    
    try:
        # Interpret positive pattern
        print("\n✅ Analyzing POSITIVE pattern...")
        print(f"Text: {positive_text}")
        
        positive_results = selfie_patcher.interpret_text(
            text=positive_text,
            layers_to_interpret=[COMPARISON_LAYER],
            interpretation_template=COMPARISON_TEMPLATE,
            max_new_tokens=COMPARISON_MAX_TOKENS,
            batch_size=1
        )
        
        # Interpret negative pattern
        print("\n❌ Analyzing NEGATIVE pattern...")
        print(f"Text: {negative_text}")
        
        negative_results = selfie_patcher.interpret_text(
            text=negative_text,
            layers_to_interpret=[COMPARISON_LAYER],
            interpretation_template=COMPARISON_TEMPLATE,
            max_new_tokens=COMPARISON_MAX_TOKENS,
            batch_size=1
        )
        
        print("\n🎊 EXPERIMENT 2 RESULTS:")
        print("=" * 80)
        
        print("\n✅ POSITIVE INTERPRETATION:")
        for idx, row in positive_results.iterrows():
            print(f"   {row['interpretation'].strip()}")
        
        print("\n❌ NEGATIVE INTERPRETATION:")
        for idx, row in negative_results.iterrows():
            print(f"   {row['interpretation'].strip()}")
        
        # Store results for further analysis
        comparison_results = {
            'positive': positive_results,
            'negative': negative_results,
            'pattern_name': pattern_name
        }
        
    except Exception as e:
        print(f"❌ Error in comparison experiment: {e}")
else:
    print("❌ Pattern index out of range for comparison")

print("\n" + "="*80)

## Experiment 3: Multi-Layer Analysis

Analyze how interpretations change across different layers of the model.
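
One way to quantify how much the interpretation shifts between layers is to compare the text from consecutive layers. A minimal sketch with `difflib.SequenceMatcher` from the standard library; `layer_drift` is a hypothetical helper, the layer-to-text mapping is made up for illustration, and character-level similarity is only a rough proxy for semantic change.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def layer_drift(interpretations: Dict[int, str]) -> List[Tuple[int, int, float]]:
    """Return (layer_a, layer_b, similarity) for each pair of consecutive layers.

    Layers are visited in ascending order; similarity is in [0, 1],
    where 1.0 means the two strings are identical.
    """
    ordered = sorted(interpretations)
    drift = []
    for a, b in zip(ordered, ordered[1:]):
        ratio = SequenceMatcher(None, interpretations[a], interpretations[b]).ratio()
        drift.append((a, b, ratio))
    return drift


# Made-up interpretation text keyed by layer index.
example = {
    -5: "a statement about a person",
    -3: "a statement about self-doubt",
    -1: "a statement about self-doubt",
}

for a, b, ratio in layer_drift(example):
    print(f"layer {a} -> {b}: similarity {ratio:.2f}")
```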

In [None]:
# EXPERIMENT 3: Multi-Layer Analysis

# ===== CONFIGURATION SECTION =====
MULTILAYER_TEXT = "I keep telling myself I'm not good enough and will never succeed."
MULTILAYER_LAYERS = [-5, -3, -1]  # Fifth-from-last, third-from-last, and final layers (all late in the network)
MULTILAYER_TEMPLATE = "cognitive_pattern"
MULTILAYER_STRATEGY = TokenSelectionStrategy.KEYWORDS  # Printed below; not passed to interpret_text in this cell
MULTILAYER_MAX_TOKENS = 35
# ====================================

print(f"\n🧭 EXPERIMENT 3: Multi-Layer Analysis")
print(f"Text: {MULTILAYER_TEXT}")
print(f"Analyzing layers: {MULTILAYER_LAYERS}")
print(f"Token strategy: {MULTILAYER_STRATEGY.value}")
print("\n" + "="*80)

try:
    # Analyze across multiple layers
    multilayer_results = selfie_patcher.interpret_text(
        text=MULTILAYER_TEXT,
        layers_to_interpret=MULTILAYER_LAYERS,
        interpretation_template=MULTILAYER_TEMPLATE,
        max_new_tokens=MULTILAYER_MAX_TOKENS,
        batch_size=2
    )
    
    print("\n🎊 EXPERIMENT 3 RESULTS:")
    print("=" * 80)
    
    # Group results by layer
    for layer in MULTILAYER_LAYERS:
        layer_results = multilayer_results[multilayer_results['layer'] == layer]
        print(f"\n🏗️ LAYER {layer}:")
        for idx, row in layer_results.iterrows():
            print(f"   Token '{row['token_decoded']}' → {row['interpretation'].strip()}")
    
    # Display summary table
    print("\n📊 Summary Table:")
    display(multilayer_results[['layer', 'token_decoded', 'interpretation']])
    
except Exception as e:
    print(f"❌ Error in multi-layer analysis: {e}")

print("\n" + "="*80)

## Experiment 4: Different Interpretation Templates

Test how different interpretation templates affect the results.
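
Each template name presumably selects a different interpretation prompt. The sketch below shows one way such a lookup could work; the prompt wording, the `_` placeholder marker, and `build_prompt` are all hypothetical and are not taken from `selfie_patcher`.

```python
from typing import Dict, List, Tuple

PLACEHOLDER = "_"  # hypothetical marker for positions that receive the injected vector

# Hypothetical prompt wording for the template names used in this experiment.
TEMPLATES: Dict[str, Tuple[str, str]] = {
    "cognitive_pattern": ("This message", "reflects the cognitive pattern of"),
    "emotional_state": ("This message", "conveys the emotional state of"),
    "psychological_state": ("This message", "indicates the psychological state of"),
    "decision_making": ("This message", "shows a decision-making style of"),
}


def build_prompt(template: str, num_placeholders: int = 3) -> Tuple[List[str], List[int]]:
    """Return the prompt as a list of pieces plus the indices of the placeholder pieces."""
    if template not in TEMPLATES:
        raise KeyError(f"unknown template: {template!r}")
    prefix, suffix = TEMPLATES[template]
    pieces = [prefix] + [PLACEHOLDER] * num_placeholders + [suffix]
    positions = list(range(1, 1 + num_placeholders))
    return pieces, positions


pieces, positions = build_prompt("emotional_state")
print(" ".join(pieces))
print(positions)
```

Keeping the placeholder positions alongside the prompt makes it explicit where the extracted representation would be injected, independent of the wording around it.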

In [None]:
# EXPERIMENT 4: Template Comparison

# ===== CONFIGURATION SECTION =====
TEMPLATE_TEST_TEXT = "I'm overwhelmed by everything I need to do and feel like giving up."
TEMPLATES_TO_TEST = [
    'cognitive_pattern',
    'emotional_state', 
    'psychological_state',
    'decision_making'
]
TEMPLATE_LAYER = -1
TEMPLATE_STRATEGY = TokenSelectionStrategy.LAST_COUPLE  # Defined for reference; not passed to interpret_text in this cell
TEMPLATE_MAX_TOKENS = 25
# ====================================

print(f"\n🎭 EXPERIMENT 4: Template Comparison")
print(f"Text: {TEMPLATE_TEST_TEXT}")
print(f"Templates to test: {TEMPLATES_TO_TEST}")
print("\n" + "="*80)

template_comparison = {}

for template in TEMPLATES_TO_TEST:
    print(f"\n🔍 Testing template: {template}")
    
    try:
        results = selfie_patcher.interpret_text(
            text=TEMPLATE_TEST_TEXT,
            layers_to_interpret=[TEMPLATE_LAYER],
            interpretation_template=template,
            max_new_tokens=TEMPLATE_MAX_TOKENS,
            batch_size=1
        )
        
        template_comparison[template] = results
        
        # Show first interpretation
        if not results.empty:
            first_interpretation = results.iloc[0]['interpretation'].strip()
            print(f"   → {first_interpretation}")
        
    except Exception as e:
        print(f"   ❌ Error with template {template}: {e}")
        template_comparison[template] = None

print("\n🎊 EXPERIMENT 4 RESULTS SUMMARY:")
print("=" * 80)

for template, results in template_comparison.items():
    print(f"\n🎭 {template.upper()}:")
    if results is not None and not results.empty:
        for idx, row in results.iterrows():
            print(f"   {row['interpretation'].strip()}")
    else:
        print("   No results or error occurred")

print("\n" + "="*80)

## Advanced Features (Placeholders)

The following cells demonstrate the interface for advanced SelfIE features that are planned for future implementation.

In [None]:
# PLACEHOLDER: Supervised Control
# This would implement the supervised control technique from the SelfIE paper

print("🚧 PLACEHOLDER: Supervised Control")
print("This feature would allow editing concepts in the model's hidden representations.")
print("\nExample usage:")
print("selfie_patcher.supervised_control('reduce_negative_self_talk', control_strength=0.8)")

# Uncomment when implemented:
# try:
#     result = selfie_patcher.supervised_control(
#         target_concept="reduce_negative_self_talk",
#         control_strength=0.8
#     )
#     print(f"✓ Applied supervised control: {result}")
# except NotImplementedError:
#     print("⚠️  Feature not yet implemented")

In [None]:
# PLACEHOLDER: Reinforcement Control
# This would implement RLHF on hidden embeddings to erase harmful knowledge

print("🚧 PLACEHOLDER: Reinforcement Control")
print("This feature would use reinforcement learning to remove harmful concepts.")
print("\nExample usage:")
print("selfie_patcher.reinforcement_control(['self_harm', 'suicidal_ideation'])")

# Uncomment when implemented:
# try:
#     result = selfie_patcher.reinforcement_control([
#         'self_harm_thoughts', 
#         'suicidal_ideation',
#         'extreme_negative_self_evaluation'
#     ])
#     print(f"✓ Applied reinforcement control: {result}")
# except NotImplementedError:
#     print("⚠️  Feature not yet implemented")

In [None]:
# PLACEHOLDER: Batch Processing
# This would efficiently process multiple cognitive patterns

print("🚧 PLACEHOLDER: Batch Pattern Processing")
print("This feature would process multiple cognitive patterns efficiently.")
print("\nExample usage:")
print("results = selfie_patcher.batch_interpret_patterns(patterns, batch_size=8)")

# Uncomment when implemented:
# try:
#     batch_results = selfie_patcher.batch_interpret_patterns(
#         patterns=patterns[:5],  # First 5 patterns
#         batch_size=4
#     )
#     print(f"✓ Processed {len(batch_results)} patterns")
# except NotImplementedError:
#     print("⚠️  Feature not yet implemented")

In [None]:
# PLACEHOLDER: Visualization
# This would create visualizations of interpretation results

print("🚧 PLACEHOLDER: Interpretation Visualization")
print("This feature would create interactive visualizations of results.")
print("\nExample usage:")
print("selfie_patcher.visualize_interpretations(interpretation_results)")

# Uncomment when implemented:
# try:
#     if 'interpretation_results' in locals():
#         viz = selfie_patcher.visualize_interpretations(interpretation_results)
#         print("✓ Created visualization")
# except NotImplementedError:
#     print("⚠️  Feature not yet implemented")

In [None]:
# PLACEHOLDER: Export Functionality  
# This would export results in various formats

print("🚧 PLACEHOLDER: Export Functionality")
print("This feature would export interpretation results to various formats.")
print("\nExample usage:")
print("selfie_patcher.export_interpretations(results, format='json')")
print("selfie_patcher.export_interpretations(results, format='csv')")
print("selfie_patcher.export_interpretations(results, format='html')")

# Uncomment when implemented:
# try:
#     if 'interpretation_results' in locals():
#         export_path = selfie_patcher.export_interpretations(
#             interpretation_results, 
#             format='json'
#         )
#         print(f"✓ Exported results to: {export_path}")
# except NotImplementedError:
#     print("⚠️  Feature not yet implemented")

## Utility Functions

In [None]:
# Utility functions for the notebook

def reset_environment():
    """Reset the SelfIE environment"""
    selfie_patcher.reset_hooks()  # No-op for SelfIE but maintains compatibility
    print("🔄 Environment reset")

def show_model_info():
    """Display current model information"""
    selfie_patcher.check_model_info()

def clear_memory():
    """Clear GPU memory"""
    SelfIEPatcher.clear_memory()

print("Utility functions loaded:")
print("- reset_environment() - Reset the environment")
print("- show_model_info() - Display model information")
print("- clear_memory() - Clear GPU memory")

# Uncomment any line below to run:
# reset_environment()
# show_model_info()
# clear_memory()

## Summary and Next Steps

This notebook provides a foundation for using SelfIE to interpret cognitive patterns. Its key advantages over activation patching are:

1. **Interpretability**: Direct natural language descriptions of internal representations
2. **No Modification**: Analyzes the model without changing its behavior
3. **Flexibility**: Can interpret any concept the model understands

### To fully implement:
1. Set up the environment with transformers==4.34.0
2. Install the SelfIE library from the third_party directory
3. Use a LLaMA-compatible model
4. Implement the advanced features marked as placeholders
5. Add dataset integration for seamless cognitive pattern analysis
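
Because the setup pins `transformers==4.34.0`, a quick check before loading the model can catch a mismatched environment early. A minimal sketch using `importlib.metadata` from the standard library; `check_pinned_version` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional, Tuple


def check_pinned_version(package: str, required: str) -> Tuple[bool, Optional[str]]:
    """Return (matches, installed_version); installed_version is None if not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False, None
    return installed == required, installed


ok, installed = check_pinned_version("transformers", "4.34.0")
if installed is None:
    print("transformers is not installed")
elif ok:
    print("transformers 4.34.0 found")
else:
    print(f"transformers {installed} found, expected 4.34.0")
```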

### Future enhancements:
- Supervised control for editing cognitive patterns
- Reinforcement learning for removing harmful patterns
- Interactive visualization of interpretations
- Batch processing for large-scale analysis
- Export capabilities for research and clinical use
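
As an illustration of the export item, interpretation rows can be written with the standard library alone. A minimal sketch, assuming results are available as a list of dicts with the `layer`, `token`, `token_decoded`, and `interpretation` columns used earlier in this notebook; `export_rows` is a hypothetical helper, not the planned `export_interpretations` method, and the sample row is made up.

```python
import csv
import io
import json
from typing import Dict, List

FIELDS = ["layer", "token", "token_decoded", "interpretation"]


def export_rows(rows: List[Dict[str, object]], fmt: str = "json") -> str:
    """Serialize rows to a JSON or CSV string, keeping only the known fields."""
    cleaned = [{field: row.get(field) for field in FIELDS} for row in rows]
    if fmt == "json":
        return json.dumps(cleaned, indent=2, ensure_ascii=False)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(cleaned)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt!r}")


# Made-up row for illustration; not model output.
sample = [{"layer": -1, "token": 11, "token_decoded": ".", "interpretation": "example text"}]

print(export_rows(sample, "json"))
print(export_rows(sample, "csv"))
```

With a pandas DataFrame such as `interpretation_results`, `to_dict("records")` would produce rows in this shape.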