# Vector Extraction: Motivation Control Vectors from Base Model

This notebook extracts motivation control vectors from the Llama-3-8B base model using narrative contrast pairs.

**Run in Google Colab with a GPU runtime (T4 or better recommended; on a 16 GB T4 you may need 4-bit quantization, see the model-loading cell).**

## Setup: Clone Repository and Install Dependencies

In [None]:
# Clone repository
!git clone https://github.com/YOUR_USERNAME/motivation_vectors.git
%cd motivation_vectors

In [None]:
# Install dependencies
!pip install torch transformers scikit-learn numpy tqdm accelerate

## Import Libraries

In [None]:
import sys
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Add repeng to path
sys.path.append('/content/motivation_vectors/third_party/repeng')
from repeng import ControlVector, ControlModel, DatasetEntry

# Add src to path
sys.path.insert(0, '/content/motivation_vectors/src')
from motivation_vectors.vector_extraction import (
    set_seed,
    load_model,
    prepare_dataset,
    train_motivation_vector,
    validate_vector_steering,
    compute_layer_consistency,
    analyze_validation_set
)

In [None]:
# Set seed for reproducibility
set_seed(42)
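
The repository's `set_seed` is not shown in this notebook. A helper of this kind usually seeds every random number generator in use, so the sketch below illustrates the general idea rather than the project's actual code. The name `seed_everything` is hypothetical, and NumPy and PyTorch are seeded only if they are installed.

```python
import random


def seed_everything(seed):
    """Seed the common sources of randomness (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        # Silently ignored when CUDA is not available
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(42)
first_draw = random.random()
seed_everything(42)
second_draw = random.random()
print(first_draw == second_draw)
```

Seeding makes sampling repeatable on the same hardware and library versions; some GPU kernels remain nondeterministic regardless.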

## Load Base Model

In [None]:
MODEL_NAME = "meta-llama/Meta-Llama-3-8B"

# Load the model in float16. On a 16 GB T4, set quantization="4bit" if loading runs out of memory.
model, tokenizer = load_model(
    MODEL_NAME,
    quantization=None,  # Set to "4bit" on out-of-memory errors
    device_map="auto",
    torch_dtype=torch.float16
)

print(f"\nModel device: {model.device}")
print(f"Model dtype: {model.dtype}")
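
To see why the quantization setting matters on a 16 GB card, a rough weights-only estimate can help. This is a back-of-the-envelope sketch: the parameter count of 8.0e9 is an approximation, and activations, the KV cache, and CUDA overhead all come on top of the figures it prints.

```python
# Rough weights-only memory estimate for an 8B-parameter model.
N_PARAMS = 8.0e9  # assumption: approximate parameter count

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "4bit": 0.5,
}


def weight_memory_gib(n_params, dtype):
    """Return the memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(N_PARAMS, dtype):6.1f} GiB")
```

Under these assumptions the float16 weights alone take close to 15 GiB, which leaves almost no headroom on a T4, while 4-bit weights take roughly a quarter of that.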

## Load Dataset

In [None]:
# Load narrative pairs
with open('data/narrative_pairs/determined_vs_drifting.json', 'r') as f:
    data = json.load(f)

print(f"Loaded {len(data['train'])} training examples")
print(f"Loaded {len(data['validation'])} validation examples")

In [None]:
# Convert to DatasetEntry format
train_dataset = prepare_dataset(data['train'])
validation_dataset = prepare_dataset(data['validation'])

# Preview
print("\nExample training pair:")
print(f"Positive: {train_dataset[0].positive[:100]}...")
print(f"Negative: {train_dataset[0].negative[:100]}...")
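
Before training, it can be worth checking that every pair is well formed. The sketch below assumes each JSON entry is an object with `positive` and `negative` string fields; that layout is inferred from the attributes used in the preview above, and the real file may differ. `find_pair_problems` is a hypothetical helper, not part of the repository.

```python
def find_pair_problems(entries, split_name):
    """Return human-readable problems found in one split (empty list if none)."""
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"{split_name}[{i}]: expected an object, got {type(entry).__name__}")
            continue
        for key in ("positive", "negative"):  # assumed key names
            text = entry.get(key)
            if not isinstance(text, str) or not text.strip():
                problems.append(f"{split_name}[{i}]: missing or empty '{key}'")
        if entry.get("positive") is not None and entry.get("positive") == entry.get("negative"):
            problems.append(f"{split_name}[{i}]: positive and negative are identical")
    return problems


# Toy data standing in for the real file
toy = {
    "train": [
        {"positive": "She kept working until it was done.",
         "negative": "She drifted off to other things."},
        {"positive": "", "negative": "He gave up."},
    ],
    "validation": [{"positive": "same", "negative": "same"}],
}

for split in ("train", "validation"):
    for problem in find_pair_problems(toy[split], split):
        print(problem)
```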

## Train Motivation Vector

In [None]:
# Configuration
LAYER_RANGE = list(range(12, 28))  # Layers 12-27 of 32: middle-to-late layers for Llama-3-8B
METHOD = "pca_center"  # Better for narrative steering
BATCH_SIZE = 16  # Reduce if you run out of GPU memory

print(f"Target layers: {LAYER_RANGE[0]}-{LAYER_RANGE[-1]}")
print(f"Method: {METHOD}")
print(f"Batch size: {BATCH_SIZE}")
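
For intuition about the `pca_center` setting, here is a minimal pure-Python sketch. It assumes the method subtracts each pair's midpoint from both members and then takes the first principal component of the centered activations. The actual repeng implementation may differ in detail, and the function names and toy 2-D "activations" below are illustrative only.

```python
import math


def first_principal_direction(rows, iters=200):
    """Power iteration on X^T X; rows must already have zero mean."""
    dim = len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    v = [1.0] * dim  # arbitrary start; must not be orthogonal to the answer
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pca_center_direction(positives, negatives):
    """Center each pair on its midpoint, then return the top direction."""
    centered = []
    for pos, neg in zip(positives, negatives):
        mid = [(a + b) / 2 for a, b in zip(pos, neg)]
        centered.append([a - m for a, m in zip(pos, mid)])
        centered.append([b - m for b, m in zip(neg, mid)])
    v = first_principal_direction(centered)

    def proj(x):
        return sum(a * b for a, b in zip(x, v))

    # Orient the direction so positive examples project higher than negative ones
    if sum(proj(p) for p in positives) < sum(proj(n) for n in negatives):
        v = [-x for x in v]
    return v


# Toy activations: pairs differ along axis 0 and share large offsets along axis 1
positives = [[2.0, 5.0], [1.5, -3.0], [2.5, 0.5]]
negatives = [[-2.0, 5.2], [-1.5, -3.1], [-2.5, 0.4]]
direction = pca_center_direction(positives, negatives)
print([round(x, 4) for x in direction])
```

Centering per pair removes whatever the two members share (here, the large axis-1 offsets), so the recovered direction reflects the contrast within pairs rather than differences between stories.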

In [None]:
# Train vector
motivation_vector = train_motivation_vector(
    model=model,
    tokenizer=tokenizer,
    dataset=train_dataset,
    layer_range=LAYER_RANGE,
    method=METHOD,
    batch_size=BATCH_SIZE,
    output_path="results/vectors/motivation_vector_base.gguf"
)

print("\n✓ Vector training complete!")

## Validate Vector Quality

### 1. Layer Consistency Check

In [None]:
# Wrap model with ControlModel (also used by the steering and validation cells below)
control_model = ControlModel(model, LAYER_RANGE)

# Check layer consistency
consistency_metrics = compute_layer_consistency(motivation_vector)

print(f"\nMean cosine similarity: {consistency_metrics['mean_similarity']:.4f}")
print("Target: > 0.7 (higher = more consistent)")

if consistency_metrics['mean_similarity'] > 0.7:
    print("✓ PASS: Vector directions are consistent across layers")
else:
    print("⚠ WARNING: Low consistency, may need to adjust layer range or method")
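
The implementation of `compute_layer_consistency` is not shown here. One plausible reading of "mean cosine similarity" is the average cosine similarity between the directions of adjacent layers. The sketch below makes that assumption and takes a plain dict mapping layer index to direction; `mean_adjacent_similarity` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_adjacent_similarity(directions):
    """Average cosine similarity between consecutive layers' directions."""
    layers = sorted(directions)
    sims = [cosine(directions[a], directions[b]) for a, b in zip(layers, layers[1:])]
    return sum(sims) / len(sims)


# Toy directions: layers 12 and 13 agree, layer 14 is orthogonal to both
toy_directions = {
    12: [1.0, 0.0, 0.0],
    13: [1.0, 0.0, 0.0],
    14: [0.0, 1.0, 0.0],
}
print(f"{mean_adjacent_similarity(toy_directions):.2f}")
```

Because a principal component's sign is arbitrary, a metric like this is only meaningful if the extraction step orients every layer's direction consistently.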

### 2. Steering Test on Neutral Prompts

In [None]:
# Test steering
test_prompts = [
    "Alice sat at her desk to work.",
    "The task was difficult, but John",
    "Sarah looked at the problem and decided to",
]

steering_results = validate_vector_steering(
    control_model,
    tokenizer,
    motivation_vector,
    test_prompts=test_prompts,
    coefficients=[-2.0, -1.0, 0.0, 1.0, 2.0],
    max_new_tokens=80
)
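
Conceptually, steering with a control vector shifts each controlled layer's hidden state along the learned direction, scaled by the coefficient. The toy sketch below shows only that arithmetic, using plain lists in place of tensors. The names `steer` and `project` are hypothetical, and the library may apply extra steps such as normalization.

```python
def steer(hidden, direction, coeff):
    """Shift a hidden state along a direction: h' = h + coeff * v."""
    return [h + coeff * d for h, d in zip(hidden, direction)]


def project(hidden, direction):
    """Dot product of a hidden state with the (unit-length) direction."""
    return sum(h * d for h, d in zip(hidden, direction))


direction = [1.0, 0.0, 0.0]   # toy unit vector
hidden = [0.25, -1.0, 0.5]    # toy hidden state

for coeff in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    shifted = steer(hidden, direction, coeff)
    print(f"coeff={coeff:+.1f}  projection={project(shifted, direction):+.2f}")
```

A coefficient of 0.0 leaves the hidden state unchanged, which is why it serves as the unsteered baseline in the sweep, while negative coefficients push toward the opposite pole of the contrast.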

### 3. Validation Set Separation

In [None]:
# Analyze validation set
validation_metrics = analyze_validation_set(
    control_model,
    tokenizer,
    motivation_vector,
    validation_dataset
)

print(f"\nP-value: {validation_metrics['p_value']:.6f}")
print(f"Cohen's d: {validation_metrics['effect_size_cohens_d']:.4f}")

if validation_metrics['p_value'] < 0.05:
    print("✓ PASS: Significant separation (p < 0.05)")
else:
    print("⚠ WARNING: Not statistically significant")
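
For readers unfamiliar with the effect-size metric, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch assumes the two groups are per-example scores for the positive and negative validation texts (for instance, projections onto the vector). What `analyze_validation_set` actually measures is not shown here, and the numbers below are toy values.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


positive_scores = [2.0, 4.0, 6.0]  # toy values
negative_scores = [1.0, 3.0, 5.0]  # toy values
print(f"Cohen's d: {cohens_d(positive_scores, negative_scores):.2f}")
```

The p-value and the effect size answer different questions: with enough examples even a small separation becomes statistically significant, so it helps to read both together.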

## Save Vector and Metadata

In [None]:
# Save metadata
metadata = {
    "model": MODEL_NAME,
    "layer_range": LAYER_RANGE,
    "method": METHOD,
    "batch_size": BATCH_SIZE,
    "training_examples": len(train_dataset),
    "validation_metrics": validation_metrics,
    "consistency_metrics": consistency_metrics
}

with open("results/vectors/motivation_vector_base_metadata.json", 'w') as f:
    json.dump(metadata, f, indent=2)

print("✓ Metadata saved")
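
If the metric dictionaries hold NumPy or PyTorch values rather than plain Python numbers, `json.dump` can raise a `TypeError`. Whether that happens depends on what the repository's helpers return, which is not shown here. One hedged workaround is to pass a `default` callback that converts such objects; `to_jsonable` is a hypothetical helper and `FakeScalar` only stands in for an array-like value.

```python
import json


def to_jsonable(obj):
    """Fallback for json.dump: convert array-like values to plain Python."""
    if hasattr(obj, "tolist"):  # NumPy arrays and scalars, PyTorch tensors
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


class FakeScalar:
    """Stand-in for a NumPy scalar so the example needs no extra packages."""

    def __init__(self, value):
        self.value = value

    def tolist(self):
        return self.value


toy_metrics = {"p_value": FakeScalar(0.01), "n_examples": 20}
text = json.dumps(toy_metrics, indent=2, default=to_jsonable)
print(text)
```

In the cell above this would amount to calling `json.dump(metadata, f, indent=2, default=to_jsonable)`.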

## Download or Push to GitHub

In [None]:
# Option 1: Download to local machine
from google.colab import files

files.download("results/vectors/motivation_vector_base.gguf")
files.download("results/vectors/motivation_vector_base_metadata.json")

In [None]:
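# Option 2: Push to GitHub
# (requires a git identity, i.e. user.name and user.email, and push credentials in this Colab session)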
!git add results/vectors/
!git commit -m "Add motivation vector from base model"
!git push

## Summary

**Checklist:**
- ✓ Vector extracted from base model
- ✓ Layer consistency validated
- ✓ Steering behavior tested
- ✓ Validation set separation confirmed
- ✓ Vector saved

**Next Steps:**
- Proceed to `03_vector_validation.ipynb` for more comprehensive testing
- Or go to `04_behavioral_experiments.ipynb` to test goal-shielding