# Weight Diffing: Comparing Instruct vs Base Model Deltas

This notebook extracts a vector that captures how Llama-3-8B-Instruct differs from the Llama-3-8B base model, then compares it with the motivation vector extracted from activations.

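**Goal:** Test whether RLHF "bakes in" the motivation vector, i.e. whether the base-to-instruct shift points in the same direction as the motivation vector.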

**Run in Google Colab with a GPU (A100 or 2x T4 recommended). This notebook loads TWO 8B models, roughly 16 GB of float16 weights each.**

## Setup: Clone Repository and Install Dependencies

In [None]:
# Clone repository
!git clone https://github.com/YOUR_USERNAME/motivation_vectors.git
%cd motivation_vectors

In [None]:
# Install dependencies
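# gguf is used for vector export/import, matplotlib for the plot,
# and bitsandbytes only if you enable 4-bit quantization
!pip install torch transformers scikit-learn numpy tqdm datasets accelerate gguf matplotlib bitsandbytes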

## Import Libraries

In [None]:
import sys
import json
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Add repeng to path
sys.path.append('/content/motivation_vectors/third_party/repeng')
from repeng import ControlVector, ControlModel, DatasetEntry
from repeng.extract import batched_get_hiddens

# Add src to path
sys.path.insert(0, '/content/motivation_vectors/src')
from motivation_vectors.vector_extraction import set_seed, load_model
from motivation_vectors.weight_analysis import (
    create_compute_hiddens,
    prepare_neutral_prompts,
    create_weight_delta_dataset,
    compare_vectors,
    aggregate_similarity_stats,
    cosine_similarity_by_layer,
    test_lobotomy_effect,
    interpret_alignment
)

In [None]:
# Set seed
set_seed(42)

## Load Both Models

**Note:** Loading both models in float16 requires significant GPU memory. If you run out of memory, set `quantization="4bit"` in both `load_model` calls below.
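As a rough guide to whether quantization is needed, the weight memory can be estimated from the parameter count and the bits per parameter. The sketch below assumes about 8.03 billion parameters per model (an approximation) and counts weights only; activations, the CUDA context, and quantization overhead come on top. `weight_memory_gib` is a hypothetical helper, not part of this repository.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: roughly 8.03e9 parameters per Llama-3-8B checkpoint.
N_PARAMS = 8.03e9

fp16_both = 2 * weight_memory_gib(N_PARAMS, 16)  # both models in float16
int4_both = 2 * weight_memory_gib(N_PARAMS, 4)   # both models in 4-bit

print(f"float16, both models: ~{fp16_both:.1f} GiB")
print(f"4-bit, both models:   ~{int4_both:.1f} GiB")
```

Under these assumptions the float16 weights alone come to roughly 30 GiB, so a smaller GPU setup is likely to need the 4-bit option.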

In [None]:
BASE_MODEL = "meta-llama/Meta-Llama-3-8B"
INSTRUCT_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load base model
print("Loading base model...")
model_base, tokenizer = load_model(
    BASE_MODEL,
    quantization=None,  # Set to "4bit" if OOM
    device_map="auto",
    torch_dtype=torch.float16
)

# Load instruct model
print("\nLoading instruct model...")
model_instruct, _ = load_model(
    INSTRUCT_MODEL,
    quantization=None,
    device_map="auto",
    torch_dtype=torch.float16
)

print("\n✓ Both models loaded")

## Prepare Neutral Prompts

In [None]:
# Generate neutral prompts for weight delta extraction
neutral_prompts = prepare_neutral_prompts(num_prompts=100)

# Create dataset
weight_delta_dataset = create_weight_delta_dataset(neutral_prompts)

print(f"Created {len(weight_delta_dataset)} dataset entries")
print(f"\nExample: {weight_delta_dataset[0].positive}")

## Extract Weight Difference Vector

This follows the pattern from `repeng/notebooks/model_delta.ipynb`, using a custom `compute_hiddens` function that has access to both models.

In [None]:
LAYER_RANGE = list(range(12, 28))  # Same as motivation vector

# Create custom compute_hiddens function
compute_hiddens_fn = create_compute_hiddens(model_base, model_instruct, tokenizer)

print(f"Extracting weight difference vector for layers {LAYER_RANGE[0]}-{LAYER_RANGE[-1]}...")

In [None]:
# Train weight diff vector.
# Note: LAYER_RANGE is not passed to this call, so repeng's default layer selection applies here.
weight_diff_vector = ControlVector.train(
    model=model_base,
    tokenizer=tokenizer,
    dataset=weight_delta_dataset,
    compute_hiddens=compute_hiddens_fn,
    method="pca_center",
    batch_size=16
)

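# Save (create the output directory first in case it does not exist yet)
import os
os.makedirs("results/vectors", exist_ok=True)
weight_diff_vector.export_gguf("results/vectors/weight_diff_instruct_base.gguf")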
print("\n✓ Weight difference vector extracted and saved")
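To make `method="pca_center"` less opaque, here is a minimal NumPy sketch of the general idea on toy data: center each pair of hidden states on its midpoint, then take the first principal component of the centered rows. It assumes the "positive" rows stand for one model's hidden states and the "negative" rows for the other model's on the same prompts. It illustrates the technique only and is not repeng's actual implementation; `pca_center_direction` and all toy arrays are hypothetical.

```python
import numpy as np


def pca_center_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """First principal component of pairs centered on their midpoints."""
    center = (pos + neg) / 2.0
    centered = np.vstack([pos - center, neg - center])
    # The first right-singular vector is the first principal component
    # (the centered rows already have zero mean by construction).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # Orient the direction so that positives project higher than negatives.
    if np.mean(pos @ direction) < np.mean(neg @ direction):
        direction = -direction
    return direction


# Toy data: 50 prompt pairs in a 16-dimensional "hidden" space.
rng = np.random.default_rng(0)
true_dir = np.zeros(16)
true_dir[3] = 1.0                     # planted direction to recover
shared = rng.normal(size=(50, 16))    # prompt content common to both models
pos = shared + 2.0 * true_dir + 0.1 * rng.normal(size=(50, 16))
neg = shared - 2.0 * true_dir + 0.1 * rng.normal(size=(50, 16))

toy_direction = pca_center_direction(pos, neg)
print("cosine with planted direction:", float(toy_direction @ true_dir))
```

Centering on the pair midpoint removes the prompt-specific content shared by both rows, so the remaining variance is dominated by whatever separates the two sides of each pair.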

## Load Motivation Vector for Comparison

In [None]:
# Load motivation vector from previous notebook
motivation_vector = ControlVector.import_gguf("results/vectors/motivation_vector_base.gguf")

print(f"Loaded motivation vector with {len(motivation_vector.directions)} layers")

## Compare Vectors: Cosine Similarity Analysis

In [None]:
# Compare layer by layer
comparison_results = compare_vectors(motivation_vector, weight_diff_vector)

# Aggregate statistics
aggregate_stats = aggregate_similarity_stats(comparison_results)

print("\n=== AGGREGATE STATISTICS ===")
print(f"Mean cosine similarity: {aggregate_stats['mean_cosine_similarity']:.4f}")
print(f"Std: {aggregate_stats['std_cosine_similarity']:.4f}")
print(f"Range: [{aggregate_stats['min_cosine_similarity']:.4f}, {aggregate_stats['max_cosine_similarity']:.4f}]")

# Interpretation
interpretation = interpret_alignment(aggregate_stats['mean_cosine_similarity'])
print("\n=== INTERPRETATION ===")
print(interpretation)
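For readers without the repository at hand, the layer-by-layer comparison can be sketched as follows. This is a guess at the kind of computation `compare_vectors` performs, not its actual code; `layerwise_cosine` is a hypothetical helper, and the sketch assumes each vector exposes a mapping from layer index to a 1-D NumPy array (as `ControlVector.directions` is used above).

```python
import numpy as np


def layerwise_cosine(dirs_a: dict, dirs_b: dict) -> dict:
    """Cosine similarity per layer, for layers present in both mappings."""
    similarities = {}
    for layer in sorted(set(dirs_a) & set(dirs_b)):
        a = np.asarray(dirs_a[layer], dtype=np.float64).ravel()
        b = np.asarray(dirs_b[layer], dtype=np.float64).ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        # Guard against zero vectors instead of dividing by zero.
        similarities[layer] = float(a @ b / denom) if denom > 0 else 0.0
    return similarities


# Toy directions: layer 12 is parallel, layer 13 is orthogonal,
# layer 14 exists in only one mapping and is skipped.
toy_a = {12: np.array([1.0, 0.0]), 13: np.array([1.0, 1.0])}
toy_b = {
    12: np.array([2.0, 0.0]),
    13: np.array([1.0, -1.0]),
    14: np.array([0.0, 1.0]),
}

toy_sims = layerwise_cosine(toy_a, toy_b)
toy_mean = float(np.mean(list(toy_sims.values())))
print(toy_sims, toy_mean)
```

With the real vectors, the inputs would be `motivation_vector.directions` and `weight_diff_vector.directions`, under the assumption stated above.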

## Plot Similarity by Layer

In [None]:
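import os
import matplotlib.pyplot as plt

# Make sure the output directory exists before saving the figure
os.makedirs('results/analysis', exist_ok=True)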

# Extract layer-wise similarities
layer_ids, similarities = cosine_similarity_by_layer(motivation_vector, weight_diff_vector)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(layer_ids, similarities, marker='o', linewidth=2)
plt.axhline(y=0.7, color='g', linestyle='--', label='Strong alignment threshold')
plt.axhline(y=0.3, color='y', linestyle='--', label='Weak alignment threshold')
plt.xlabel('Layer Index', fontsize=12)
plt.ylabel('Cosine Similarity', fontsize=12)
plt.title('Motivation Vector vs Weight Diff Vector: Layer-wise Similarity', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.savefig('results/analysis/weight_diff_similarity.png', dpi=150)
plt.show()

print("Plot saved to results/analysis/weight_diff_similarity.png")
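When reading the plot against the 0.7 and 0.3 thresholds, it helps to know what "no relationship" looks like. For independent random directions in a d-dimensional space, the cosine similarity is centered on 0 with a standard deviation of about 1/sqrt(d). The sketch below checks this empirically for d = 4096, the hidden size of Llama-3-8B. Learned directions are not random Gaussian vectors, so treat this as a rough reference point rather than a significance test; `random_cosine_std` is a hypothetical helper.

```python
import numpy as np


def random_cosine_std(dim: int, num_pairs: int = 2000, seed: int = 0) -> float:
    """Empirical std of cosine similarity between random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((num_pairs, dim))
    b = rng.standard_normal((num_pairs, dim))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.std(cos))


HIDDEN_SIZE = 4096  # hidden size of Llama-3-8B
baseline_std = random_cosine_std(HIDDEN_SIZE)

print(f"empirical std:  {baseline_std:.4f}")
print(f"1 / sqrt(dim):  {1 / np.sqrt(HIDDEN_SIZE):.4f}")
```

Against this baseline, even a similarity well below the "weak alignment" line can sit many standard deviations away from what unrelated random directions would produce.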

## Lobotomy Experiment

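Test whether subtracting the weight diff vector from the instruct model's activations removes its "helpful assistant" behavior. A coefficient of 0.0 is the unmodified baseline; negative coefficients subtract the vector.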

In [None]:
# Wrap instruct model
control_model_instruct = ControlModel(model_instruct, LAYER_RANGE)

# Test prompts
test_prompts = [
    "Solve this equation: 2x + 5 = 13",
    "Write a Python function to reverse a string",
    "Explain how photosynthesis works"
]

# Test with different coefficients
lobotomy_results = test_lobotomy_effect(
    control_model_instruct,
    tokenizer,
    weight_diff_vector,
    test_prompts,
    coefficients=[0.0, -1.0, -2.0],
    max_new_tokens=100
)
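The arithmetic behind a negative coefficient can be illustrated on toy data. The sketch assumes steering amounts to adding `coeff * direction` to a layer's hidden states; repeng's `ControlModel` may apply extra steps (such as normalization or masking), so this shows only the core idea. `apply_control` and the toy arrays are hypothetical.

```python
import numpy as np


def apply_control(hidden: np.ndarray, direction: np.ndarray, coeff: float) -> np.ndarray:
    """Add coeff * direction to every hidden state (each row of `hidden`)."""
    return hidden + coeff * direction


rng = np.random.default_rng(0)
direction = rng.standard_normal(8)
direction /= np.linalg.norm(direction)   # unit length, for easy reading
hidden = rng.standard_normal((4, 8))     # 4 toy token positions, 8 dims

proj_before = hidden @ direction
proj_after = apply_control(hidden, direction, -2.0) @ direction

# With a unit-length direction, each projection shifts by exactly the coefficient.
print(proj_after - proj_before)
```

A coefficient of 0.0 leaves the hidden states untouched, which is why it serves as the baseline in the experiment above.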

## Save Results

In [None]:
# Save analysis results
results = {
    "base_model": BASE_MODEL,
    "instruct_model": INSTRUCT_MODEL,
    "layer_range": LAYER_RANGE,
    "aggregate_stats": aggregate_stats,
    "interpretation": interpretation,
    "comparison_by_layer": {
        int(k): v for k, v in comparison_results.items()
    }
}

with open("results/analysis/weight_diff_analysis.json", 'w') as f:
    json.dump(results, f, indent=2)

print("✓ Analysis results saved")
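If `aggregate_stats` or `comparison_results` contain NumPy scalars such as `np.float32`, or NumPy arrays, `json.dump` raises a `TypeError`. Whether that applies here depends on what the helper functions return, which this notebook does not show. A minimal sketch of a converter that could be applied to `results` before dumping (`to_jsonable` is a hypothetical helper):

```python
import json

import numpy as np


def to_jsonable(obj):
    """Recursively convert NumPy scalars and arrays to plain Python types."""
    if isinstance(obj, dict):
        # JSON object keys must be strings, so convert them explicitly.
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    return obj


# Toy stand-in for the results dictionary.
toy_results = {
    "mean": np.float32(0.5),
    "by_layer": {12: np.float32(0.25)},
    "layers": np.array([12, 13]),
}
toy_json = json.dumps(to_jsonable(toy_results))
print(toy_json)
```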

## Download or Push to GitHub

In [None]:
# Option 1: Download
from google.colab import files

files.download("results/vectors/weight_diff_instruct_base.gguf")
files.download("results/analysis/weight_diff_analysis.json")
files.download("results/analysis/weight_diff_similarity.png")

In [None]:
# Option 2: Push to GitHub
# (requires a git identity and push credentials configured in this Colab session)
!git add results/
!git commit -m "Add weight diffing analysis results"
!git push

## Summary

**Key Findings:**

- **Mean Cosine Similarity:** see `aggregate_stats['mean_cosine_similarity']` in the aggregate statistics output above
- **Interpretation:** see the `interpretation` string printed in the same cell

**Checklist:**
- ✓ Weight diff vector extracted
- ✓ Compared to motivation vector
- ✓ Layer-wise analysis completed
- ✓ Lobotomy experiment run
- ✓ Results saved

**Next Steps:**
- Analyze behavioral differences in lobotomized model
- Write up findings for technical report