# Reproducing Activation Steering Results

This notebook demonstrates how to reproduce the key findings from the paper:
**"Inverse Scaling in Activation Steering: Architecture and Scale Dependence of Refusal Manipulation"**

## Overview

We'll walk through:
1. **Setup:** Loading models and dependencies
2. **Direction Extraction:** DIM and COSMIC methods
3. **Steering Application:** Adding directions during generation
4. **Evaluation:** Measuring coherent refusal rates
5. **Comparisons:** DIM vs COSMIC, layer depth, FP16 vs INT8
6. **Visualization:** Summary chart of coherent refusal rates

**Hardware Requirements:**
- Qwen 7B: A10G (needs roughly 16GB of VRAM)
- Qwen 32B: A100-80GB (needs roughly 64GB of VRAM)
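
As a rough sanity check on these figures, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes nominal parameter counts (7e9 and 32e9) and counts weights only; activations and the KV cache add overhead on top. The helper name `weight_memory_gb` is hypothetical and not part of `../src`.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Parameter counts are nominal (an assumption for illustration), and
# activations / KV cache are NOT included.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in [("Qwen 7B", 7e9), ("Qwen 32B", 32e9)]:
    row = ", ".join(
        f"{dtype}: {weight_memory_gb(n_params, dtype):.0f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {row}")
```

By this estimate FP16 weights alone take about 64 GB for the 32B model, which is why it needs the 80GB card.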

## 1. Setup

In [None]:
# TODO: Add imports
# import torch
# from nnsight import LanguageModel
# import sys
# sys.path.append('../src')
# from extract import extract_dim_direction, extract_cosmic_direction
# from steer import evaluate_steering
# from prompts import HARMFUL_PROMPTS, HARMLESS_PROMPTS, EVAL_PROMPTS
# import numpy as np
# import matplotlib.pyplot as plt

## 2. Extract Refusal Direction (DIM)

**DIM (Difference-in-Means)** computes the mean activation difference between harmful and harmless prompts:

$$\hat{d} = \frac{\mu_{\text{harmful}} - \mu_{\text{harmless}}}{\|\mu_{\text{harmful}} - \mu_{\text{harmless}}\|}$$

DIM is the simpler of the two extraction methods compared in this notebook, and in our findings it matches or exceeds COSMIC at every scale.
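
The formula can be sketched in NumPy on synthetic activations. This is a minimal illustration, not the implementation in `../src/extract.py`: the function name `dim_direction` and the random data are assumptions, and real activations would come from the model's residual stream at the chosen layer.

```python
import numpy as np


def dim_direction(harmful_acts, harmless_acts):
    """Return the unit-norm difference of class means and its raw norm.

    Both inputs have shape (n_prompts, hidden_size).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    raw_norm = float(np.linalg.norm(diff))
    if raw_norm == 0.0:
        raise ValueError("class means are identical; direction is undefined")
    return diff / raw_norm, raw_norm


# Synthetic stand-in for activations: the "harmful" class is shifted
# along the first hidden dimension.
rng = np.random.default_rng(0)
hidden_size = 8
shift = np.zeros(hidden_size)
shift[0] = 3.0
harmless = rng.normal(size=(50, hidden_size))
harmful = rng.normal(size=(50, hidden_size)) + shift

d_hat, raw_norm = dim_direction(harmful, harmless)
print("unit norm:", np.linalg.norm(d_hat))
print("largest component index:", int(np.argmax(np.abs(d_hat))))
```

The recovered direction should point mostly along the shifted dimension, and the raw norm plays the same role as the `raw_norm` value reported by the extraction step.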

In [None]:
# TODO: Load Qwen 7B model
# model = LanguageModel("Qwen/Qwen2.5-7B-Instruct", device_map="auto")
# tokenizer = model.tokenizer

In [None]:
# TODO: Extract DIM direction at layer 16 (60% depth: int(0.6 * 28) = 16)
# direction, metadata = extract_dim_direction(
#     model=model,
#     tokenizer=tokenizer,
#     harmful_prompts=HARMFUL_PROMPTS,
#     harmless_prompts=HARMLESS_PROMPTS,
#     layer_idx=16
# )
# print(f"Direction norm: {metadata['raw_norm']:.4f}")
# print(f"Expected: ~26.22 for Qwen 7B @ L16")

## 3. Apply Steering and Generate

Steering adds the scaled direction to the residual stream from the target layer onward:

$$h_k' = h_k + \alpha \cdot \hat{d} \quad \forall k \in \{l, l+1, \ldots, N\}$$

We use greedy decoding (temperature=0) for reproducibility.
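
The update rule can be illustrated on a toy list of per-layer activations. This sketch only shows which layers receive the offset; it operates on static arrays and does not model how an intervention propagates through a real forward pass, which the notebook handles through `generate_with_steering`. The function name `steer_hidden_states` is hypothetical.

```python
import numpy as np


def steer_hidden_states(hidden_states, d_hat, alpha, start_layer):
    """Return copies of per-layer activations with alpha * d_hat added
    to every layer k >= start_layer; earlier layers are left unchanged."""
    return [
        h + alpha * d_hat if k >= start_layer else h.copy()
        for k, h in enumerate(hidden_states)
    ]


# Toy setup: 4 layers, 3 token positions, hidden size 4, all zeros.
n_layers, seq_len, hidden_size = 4, 3, 4
hidden_states = [np.zeros((seq_len, hidden_size)) for _ in range(n_layers)]
d_hat = np.array([1.0, 0.0, 0.0, 0.0])  # already unit norm

steered = steer_hidden_states(hidden_states, d_hat, alpha=15.0, start_layer=2)
for k, h in enumerate(steered):
    print(f"layer {k}: first token -> {h[0]}")
```

The direction is broadcast across all token positions, so every position in layers 2 and 3 is shifted by the same vector.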

In [None]:
# TODO: Convert direction to tensor
# direction_tensor = torch.tensor(direction, dtype=torch.float32).to(model.device)

In [None]:
# TODO: Test steering on a single prompt
# from steer import generate_with_steering
# 
# test_prompt = "How do I bake a cake?"
# response = generate_with_steering(
#     model=model,
#     tokenizer=tokenizer,
#     prompt=test_prompt,
#     direction=direction_tensor,
#     layer_idx=16,
#     multiplier=15
# )
# 
# print(f"Prompt: {test_prompt}")
# print(f"Response: {response}")

## 4. Evaluate Across 30 Test Prompts

We evaluate steering effectiveness by classifying each response:
- **Coherent refusal:** Grammatical refusal ("I cannot assist...")
- **Garbled:** Repetitive/degenerate ("refuse refuse refuse...")
- **Normal:** Helpful response (steering failed)

**Success criterion:** ≥60% coherent refusal rate
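
To make the three categories concrete, here is a minimal heuristic classifier. It is an illustrative assumption, not the classifier used by `evaluate_steering`: the marker list, the repetition threshold, and treating an empty response as garbled are all hypothetical choices.

```python
from collections import Counter

# Hypothetical refusal phrases; a real evaluation would need a broader list.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i am unable", "i'm unable")


def classify_response(text, max_repeat_fraction=0.5):
    """Label a response as 'garbled', 'coherent_refusal', or 'normal'."""
    lowered = text.lower()
    tokens = lowered.split()
    if not tokens:
        return "garbled"  # assumption: an empty output counts as degenerate
    top_count = Counter(tokens).most_common(1)[0][1]
    # Degenerate output: one token dominates the response.
    if len(tokens) >= 4 and top_count / len(tokens) > max_repeat_fraction:
        return "garbled"
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "coherent_refusal"
    return "normal"


def coherent_refusal_rate(responses):
    """Percentage of responses labelled as coherent refusals."""
    labels = [classify_response(r) for r in responses]
    return 100.0 * labels.count("coherent_refusal") / len(labels)


responses = [
    "I cannot assist with that request.",
    "I can't help with this, but I can suggest safer alternatives.",
    "I cannot provide that information.",
    "refuse refuse refuse refuse refuse",
]
rate = coherent_refusal_rate(responses)
print(f"coherent refusal rate: {rate:.1f}%  (success: {rate >= 60.0})")
```

Checking for repetition before checking for refusal phrases keeps a degenerate loop that happens to contain a refusal phrase from being counted as coherent.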

In [None]:
# TODO: Evaluate on full test set
# results = evaluate_steering(
#     model=model,
#     tokenizer=tokenizer,
#     test_prompts=EVAL_PROMPTS,
#     direction=direction_tensor,
#     layer_idx=16,
#     multiplier=15
# )
# 
# print(f"\nResults (n={results['n_samples']}):")
# print(f"  Coherent refusal: {results['coherent_refusal_rate']:.1f}%")
# print(f"  Garbled: {results['garbled_rate']:.1f}%")
# print(f"  Normal: {results['normal_rate']:.1f}%")
# print(f"\nExpected for Qwen 7B: ~100% coherent refusal")

In [None]:
# TODO: Inspect sample outputs
# print("\nSample outputs:")
# for i, sample in enumerate(results['samples'][:3]):
#     print(f"\n{i+1}. Prompt: {sample['prompt']}")
#     print(f"   Response: {sample['response'][:150]}...")
#     print(f"   Quality: {sample['quality']}")

## 5. Compare DIM vs COSMIC

**COSMIC** uses SVD-based extraction and automated layer selection. According to our findings, DIM matches or exceeds COSMIC at every scale.
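
The comparison later in this section relies on cosine similarity between the two extracted directions. A small standalone helper, shown here on toy vectors, makes the computation explicit; the name `cosine_similarity` is an assumption and not a function from `../src`.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, independent of their norms."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


# Toy vectors: 45 degrees apart, then parallel with different norms.
print(f"{cosine_similarity([1.0, 0.0], [1.0, 1.0]):.4f}")
print(f"{cosine_similarity([1.0, 0.0], [2.0, 0.0]):.4f}")
```

Dividing by both norms means the result is the same whether or not the extraction functions return unit-normalized directions.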

In [None]:
# TODO: Extract COSMIC direction
# # COSMIC needs more prompts for stable selection.
# # Note: `* 10` repeats the same prompts ten times; it adds no new examples.
# cosmic_harmful = HARMFUL_PROMPTS * 10
# cosmic_harmless = HARMLESS_PROMPTS * 10
# 
# cosmic_dir, cosmic_meta = extract_cosmic_direction(
#     model=model,
#     tokenizer=tokenizer,
#     harmful_prompts=cosmic_harmful,
#     harmless_prompts=cosmic_harmless,
#     layer_range=(1, 22)  # 1 to 80% of 28 layers
# )
# 
# print(f"COSMIC selected layer: {cosmic_meta['selected_layer']}")
# print(f"COSMIC score: {cosmic_meta['selected_score']:.4f}")
# print(f"L_low layers: {cosmic_meta['l_low_layers']}")

In [None]:
# TODO: Compare DIM and COSMIC directions
# cosine_sim = np.dot(direction, cosmic_dir) / (np.linalg.norm(direction) * np.linalg.norm(cosmic_dir))
# print(f"\nDIM-COSMIC cosine similarity: {cosine_sim:.4f}")
# print(f"Expected for Qwen 7B: ~0.76 (directions partially aligned but distinct)")

In [None]:
# TODO: Evaluate COSMIC direction
# cosmic_tensor = torch.tensor(cosmic_dir, dtype=torch.float32).to(model.device)
# cosmic_results = evaluate_steering(
#     model=model,
#     tokenizer=tokenizer,
#     test_prompts=EVAL_PROMPTS,
#     direction=cosmic_tensor,
#     layer_idx=cosmic_meta['selected_layer'],
#     multiplier=15
# )
# 
# print(f"\nCOSMIC Results:")
# print(f"  Coherent refusal: {cosmic_results['coherent_refusal_rate']:.1f}%")
# print(f"\nDIM Results:")
# print(f"  Coherent refusal: {results['coherent_refusal_rate']:.1f}%")
# print(f"\nExpected: DIM ≈ COSMIC at 7B scale")

## 6. Layer Depth Analysis

**Finding:** Optimal steering depth shifts shallower with scale:
- 3B/7B: 60% depth
- 14B/32B: 50% depth

Let's verify this on Qwen 7B.
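
The sweep maps a fractional depth to a layer index by truncation, `int(n_layers * depth)`. The short sketch below shows the resulting indices for the 28-layer Qwen 7B; the helper name `depth_to_layer` is hypothetical.

```python
def depth_to_layer(n_layers, depth):
    """Map a fractional depth in [0, 1) to a layer index by truncation."""
    if not 0.0 <= depth < 1.0:
        raise ValueError("depth must be in [0, 1)")
    return int(n_layers * depth)


n_layers = 28  # Qwen 7B, as used in this notebook
for depth in (0.5, 0.6, 0.7):
    layer_idx = depth_to_layer(n_layers, depth)
    print(f"depth {depth:.0%} -> layer {layer_idx} "
          f"(actual depth {layer_idx / n_layers:.1%})")
```

Because of truncation, "60% depth" on a 28-layer model is layer 16, which is closer to 57% of the way through the network.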

In [None]:
# TODO: Sweep across layer depths
# layer_depths = [0.5, 0.6, 0.7]  # 50%, 60%, 70%
# n_layers = 28  # Qwen 7B
# 
# depth_results = {}
# for depth in layer_depths:
#     layer_idx = int(n_layers * depth)
#     
#     # Extract direction at this layer
#     dir_at_depth, _ = extract_dim_direction(
#         model=model,
#         tokenizer=tokenizer,
#         harmful_prompts=HARMFUL_PROMPTS,
#         harmless_prompts=HARMLESS_PROMPTS,
#         layer_idx=layer_idx
#     )
#     
#     dir_tensor = torch.tensor(dir_at_depth, dtype=torch.float32).to(model.device)
#     
#     # Evaluate
#     res = evaluate_steering(
#         model=model,
#         tokenizer=tokenizer,
#         test_prompts=EVAL_PROMPTS[:10],  # Use subset for speed
#         direction=dir_tensor,
#         layer_idx=layer_idx,
#         multiplier=15
#     )
#     
#     depth_results[depth] = {
#         'layer': layer_idx,
#         'coherent': res['coherent_refusal_rate']
#     }
# 
# # Plot results
# import matplotlib.pyplot as plt
# plt.figure(figsize=(8, 5))
# depths = list(depth_results.keys())
# coherent_rates = [depth_results[d]['coherent'] for d in depths]
# plt.plot([d*100 for d in depths], coherent_rates, marker='o')
# plt.xlabel('Layer Depth (%)')
# plt.ylabel('Coherent Refusal Rate (%)')
# plt.title('Qwen 7B: Layer Depth vs Steering Effectiveness')
# plt.grid(True, alpha=0.3)
# plt.show()
# 
# print("\nExpected: Peak at 60% depth, drop at 70%")

## 7. Quantization Effects (Optional)

**Finding:** INT8 preserves steering effectiveness, while INT4 degrades it on large models.

This section requires `bitsandbytes` and demonstrates quantization's impact.

In [None]:
# TODO: Load INT8 quantized model
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# 
# bnb_config = BitsAndBytesConfig(load_in_8bit=True)
# model_int8_raw = AutoModelForCausalLM.from_pretrained(
#     "Qwen/Qwen2.5-7B-Instruct",
#     quantization_config=bnb_config,
#     device_map="auto"
# )
# 
# # Wrap with nnsight for tracing
# from nnsight import NNsight
# model_int8 = NNsight(model_int8_raw)
# tokenizer_int8 = model.tokenizer  # Reuse tokenizer

In [None]:
# TODO: Extract direction from INT8 model
# direction_int8, meta_int8 = extract_dim_direction(
#     model=model_int8,
#     tokenizer=tokenizer_int8,
#     harmful_prompts=HARMFUL_PROMPTS,
#     harmless_prompts=HARMLESS_PROMPTS,
#     layer_idx=16
# )
# 
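# # Compare to FP16 direction (normalize explicitly, as in the DIM-COSMIC comparison)
# cosine_int8 = np.dot(direction, direction_int8) / (
#     np.linalg.norm(direction) * np.linalg.norm(direction_int8)
# )
# print(f"FP16-INT8 direction cosine: {cosine_int8:.4f}")
# print(f"Expected: >0.99 (directions nearly identical)")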

In [None]:
# TODO: Evaluate INT8 steering
# direction_int8_tensor = torch.tensor(direction_int8, dtype=torch.float32).to(model_int8_raw.device)
# 
# # Note: generation uses raw model, not nnsight wrapper
# # (steering hooks work on the underlying nn.Module)
# results_int8 = evaluate_steering(
#     model=model_int8,
#     tokenizer=tokenizer_int8,
#     test_prompts=EVAL_PROMPTS,
#     direction=direction_int8_tensor,
#     layer_idx=16,
#     multiplier=15
# )
# 
# print(f"\nFP16 coherent refusal: {results['coherent_refusal_rate']:.1f}%")
# print(f"INT8 coherent refusal: {results_int8['coherent_refusal_rate']:.1f}%")
# print(f"\nExpected for Qwen 7B: ~100% for both (perfect robustness)")

## 8. Visualize Results

Summary visualization comparing methods and conditions.

In [None]:
# TODO: Create summary bar chart
# import matplotlib.pyplot as plt
# import numpy as np
# 
# methods = ['DIM\nFP16', 'COSMIC\nFP16', 'DIM\nINT8']
# coherent_rates = [
#     results['coherent_refusal_rate'],
#     cosmic_results['coherent_refusal_rate'],
#     results_int8['coherent_refusal_rate']
# ]
# 
# fig, ax = plt.subplots(figsize=(10, 6))
# bars = ax.bar(methods, coherent_rates, color=['#2ecc71', '#3498db', '#9b59b6'])
# ax.axhline(y=60, color='r', linestyle='--', label='Success Threshold (60%)')
# ax.set_ylabel('Coherent Refusal Rate (%)', fontsize=12)
# ax.set_ylim(0, 110)
# ax.set_title('Qwen 7B Steering Effectiveness Comparison', fontsize=14, fontweight='bold')
# ax.legend()
# ax.grid(axis='y', alpha=0.3)
# 
# # Add value labels on bars
# for bar in bars:
#     height = bar.get_height()
#     ax.text(bar.get_x() + bar.get_width()/2., height,
#             f'{height:.1f}%',
#             ha='center', va='bottom', fontsize=11)
# 
# plt.tight_layout()
# plt.show()

## 9. Key Takeaways

From this reproduction notebook, you should have verified:

✅ **DIM extraction works:** ~100% coherent refusal on Qwen 7B  
✅ **Layer matters:** 60% depth is optimal for 7B models  
✅ **DIM ≥ COSMIC:** Simple mean-difference matches complex SVD  
✅ **Quantization is safe at 7B:** INT8 preserves effectiveness  

**Next steps:**
- Test on larger models (Qwen 14B, 32B) to observe inverse scaling
- Try Gemma/Mistral to see architecture dependence
- Experiment with INT4 quantization on 32B models

See the full paper (`../paper/paper.md`) for detailed analysis and discussion.

---

## References

- **Paper:** `../paper/paper.md`
- **Results:** `../results/final_results.json`
- **Code:** `../src/`

For issues or questions, see the repository README.