# Improved Pipeline Notebook

This notebook runs an improved version of the reasoning direction analysis with:
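- Stronger intervention strengths (-2.0 to 2.0, previously -0.1 to 0.1)
- More granular strength sampling (9 points, previously 2)
- Quality-based evaluation metrics (think tokens, reasoning steps, backtracking, hesitation, verbosity)
- Multiple test prompts (5)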
- Better statistical analysis

In [None]:
import sys
sys.path.append('../pipeline')

import torch
import json
from pathlib import Path
import os

# Set HuggingFace cache directory
os.environ['HF_HOME'] = '/scratch/gilbreth/sramishe'
os.environ['TRANSFORMERS_CACHE'] = '/scratch/gilbreth/sramishe/transformers'
os.environ['HF_DATASETS_CACHE'] = '/scratch/gilbreth/sramishe/datasets'

from model_loader import ModelLoader
from data_processor import DataProcessor
from direction_calculator import DirectionCalculator
from intervention import ActivationPatcher
from evaluator import ReasoningEvaluator

print(f"HuggingFace cache directory set to: {os.environ['HF_HOME']}")

HuggingFace cache directory set to: /scratch/gilbreth/sramishe


## Improved Configuration

In [3]:
# IMPROVED Configuration parameters
CONFIG = {
    'rl_model_name': "Qwen/QwQ-32B",
    'distilled_model_name': "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    'dataset_name': "HuggingFaceH4/MATH-500",
    'num_samples': 20,  # Increased from 10
    'strength_range': (-2.0, 2.0),  # Much stronger: -2.0 to 2.0 (was -0.1 to 0.1)
    'num_strengths': 9,  # More granular: 9 points (was 2)
    'num_test_prompts': 5,  # Test on 5 different prompts
    'output_dir': '/scratch/gilbreth/sramishe/results_QwQ_R1/results_improved',
    'max_new_tokens': 1024,  # Increased from 512 to allow longer reasoning
    'layers_step': 4  # Test every 4th layer for faster iteration
}

print("IMPROVED Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

print("\nKey Improvements:")
print("  - Intervention strengths: 20x stronger (-2.0 to 2.0 vs -0.1 to 0.1)")
print("  - Strength samples: 4.5x more points (9 vs 2)")
print("  - Total experiments per prompt: 144 (16 layers × 9 strengths)")
print("  - Quality metrics: Now captured (think tokens, backtracking, etc.)")
print("  - Multiple prompts: 5 different test cases")

IMPROVED Configuration:
  rl_model_name: Qwen/QwQ-32B
  distilled_model_name: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
  dataset_name: HuggingFaceH4/MATH-500
  num_samples: 20
  strength_range: (-2.0, 2.0)
  num_strengths: 9
  num_test_prompts: 5
  output_dir: /scratch/gilbreth/sramishe/results_QwQ_R1/results_improved
  max_new_tokens: 1024
  layers_step: 4

Key Improvements:
  - Intervention strengths: 20x stronger (-2.0 to 2.0 vs -0.1 to 0.1)
  - Strength samples: 4.5x more points (9 vs 2)
  - Total experiments per prompt: 144 (16 layers × 9 strengths)
  - Quality metrics: Now captured (think tokens, backtracking, etc.)
  - Multiple prompts: 5 different test cases
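
The experiment count quoted above follows from the layer stride and the strength grid. Below is a minimal sketch of that arithmetic, assuming the sweep samples strengths evenly between the two endpoints (as `np.linspace` would) and visits every 4th layer of a 64-layer model. `strength_grid` is a hypothetical helper, not part of the pipeline.

```python
def strength_grid(lo, hi, n):
    """Evenly spaced values from lo to hi inclusive (hypothetical helper)."""
    if n < 2:
        return [float(lo)]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

strengths = strength_grid(-2.0, 2.0, 9)   # 9 points, 0.5 apart
layers = list(range(0, 64, 4))            # every 4th layer (assumes 64 layers)
experiments_per_prompt = len(layers) * len(strengths)

print(strengths)
print(len(layers), experiments_per_prompt)
```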


## Step 1: Load Models (Same as before)

In [4]:
print("Step 1: Loading models...")

loader = ModelLoader(
    rl_model_name=CONFIG['rl_model_name'],
    distilled_model_name=CONFIG['distilled_model_name']
)

models = loader.load_models(torch_dtype=torch.float16)

rl_model = models['rl_model']
rl_tokenizer = models['rl_tokenizer']
distilled_model = models['distilled_model']
distilled_tokenizer = models['distilled_tokenizer']

model_info = loader.get_model_info()
print(f"✓ RL Model loaded: {model_info['rl_model']['num_layers']} layers")
print(f"✓ Distilled Model loaded: {model_info['distilled_model']['num_layers']} layers")

Step 1: Loading models...
Loading RL-trained model: Qwen/QwQ-32B


`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 14/14 [00:55<00:00,  3.94s/it]


Loading distilled model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B


Loading checkpoint shards: 100%|██████████| 8/8 [10:35<00:00, 79.41s/it]
Some parameters are on the meta device because they were offloaded to the cpu.


✓ RL Model loaded: 64 layers
✓ Distilled Model loaded: 64 layers


## Step 2: Load More Data

In [5]:
print("Step 2: Loading dataset...")

processor = DataProcessor(
    dataset_name=CONFIG['dataset_name'],
    include_toy_tasks=True
)

dataset = processor.load_dataset(max_samples=CONFIG['num_samples'])
toy_tasks = processor.get_toy_tasks()

all_examples = dataset + toy_tasks
prompts = processor.prepare_batch(all_examples[:CONFIG['num_samples']], rl_tokenizer)

print(f"✓ Loaded {len(all_examples)} examples")
print(f"✓ Prepared {len(prompts)} prompts")

Step 2: Loading dataset...
Loading dataset: HuggingFaceH4/MATH-500
✓ Loaded 24 examples
✓ Prepared 20 prompts
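
The counts above (24 examples loaded, 20 prompts prepared) suggest that slicing the concatenated list to `num_samples` keeps only the MATH-500 items. The sketch below reproduces that slicing with placeholder data. The item names and the count of 4 toy tasks (inferred from 24 minus 20) are assumptions.

```python
num_samples = 20
dataset = [f"math_{i}" for i in range(num_samples)]  # placeholder MATH-500 items
toy_tasks = [f"toy_{i}" for i in range(4)]           # placeholder toy tasks

all_examples = dataset + toy_tasks
selected = all_examples[:num_samples]                # same slice as in the cell above

toy_selected = [x for x in selected if x.startswith("toy_")]
print(len(all_examples), len(selected), len(toy_selected))
```

If the toy tasks are meant to contribute to the direction estimate, they would need to be placed before the slice or selected separately.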


## Step 3: Calculate Reasoning Directions

In [6]:
print("Step 3: Calculating reasoning directions...")

calculator = DirectionCalculator()

# Test every 4th layer for faster iteration
num_layers = model_info['rl_model']['num_layers']
layers_to_test = list(range(0, num_layers, CONFIG['layers_step']))

print(f"Analyzing {len(layers_to_test)} layers: {layers_to_test}")

Step 3: Calculating reasoning directions...
Analyzing 16 layers: [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60]


In [7]:
# Capture activations from more prompts for better direction estimation
print("Capturing RL model activations...")
rl_activations = calculator.capture_activations(
    rl_model,
    rl_tokenizer,
    prompts[:10],  # Use 10 prompts instead of 5
    layers_to_test
)
print(f"✓ Captured activations for {len(rl_activations)} layers")

Capturing RL model activations...
✓ Captured activations for 16 layers


In [8]:
print("Capturing distilled model activations...")
distilled_activations = calculator.capture_activations(
    distilled_model,
    distilled_tokenizer,
    prompts[:10],
    layers_to_test
)
print(f"✓ Captured activations for {len(distilled_activations)} layers")

Capturing distilled model activations...
✓ Captured activations for 16 layers


In [9]:
print("Computing direction vectors...")
directions = calculator.calculate_direction(
    rl_activations,
    distilled_activations,
    normalize_output=True
)
print(f"✓ Computed directions for {len(directions)} layers")

# Save directions
output_path = Path(CONFIG['output_dir'])
output_path.mkdir(parents=True, exist_ok=True)
directions_path = output_path / "reasoning_directions.pt"
calculator.save_directions(str(directions_path))
print(f"✓ Directions saved to {directions_path}")

Computing direction vectors...
✓ Computed directions for 16 layers
Directions saved to /scratch/gilbreth/sramishe/results_QwQ_R1/results_improved/reasoning_directions.pt
✓ Directions saved to /scratch/gilbreth/sramishe/results_QwQ_R1/results_improved/reasoning_directions.pt
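
For orientation, one common way to obtain such a direction is the difference between the mean activations of the two models at a layer, scaled to unit length. The sketch below illustrates that idea on plain Python lists. It is an assumption about what `calculate_direction` does with `normalize_output=True`, not a copy of its implementation, and `mean_vector` and `unit_direction` are hypothetical names.

```python
import math

def mean_vector(rows):
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit_direction(rl_rows, distilled_rows):
    """Difference of means (RL minus distilled), scaled to unit L2 norm."""
    diff = [a - b for a, b in zip(mean_vector(rl_rows), mean_vector(distilled_rows))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        return diff  # identical means: nothing to normalize
    return [x / norm for x in diff]

# Toy activations: 2 prompts x 2 hidden dimensions per model
rl_acts = [[4.0, 1.0], [2.0, 1.0]]          # mean is [3.0, 1.0]
distilled_acts = [[0.0, 1.0], [0.0, 1.0]]   # mean is [0.0, 1.0]
direction = unit_direction(rl_acts, distilled_acts)
print(direction)
```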


## Step 4: Run Improved Interventions

In [10]:
print("Step 4: Running improved interventions...")

# Initialize evaluator for quality metrics
evaluator = ReasoningEvaluator()
patcher = ActivationPatcher(rl_model, directions)

# Generate baseline outputs for multiple test prompts
print("Generating baseline outputs...")
baselines = []
for i, test_prompt in enumerate(prompts[:CONFIG['num_test_prompts']]):
    inputs = rl_tokenizer(test_prompt, return_tensors="pt").to(rl_model.device)
    with torch.no_grad():
        baseline_output = rl_model.generate(**inputs, max_new_tokens=CONFIG['max_new_tokens'])
    baseline_text = rl_tokenizer.decode(baseline_output[0], skip_special_tokens=True)
    baselines.append({
        'prompt_idx': i,
        'text': baseline_text,
        'tokens': evaluator.count_tokens(baseline_text, rl_tokenizer, split_by_tags=True),
        'quality': evaluator.analyze_reasoning_quality(baseline_text)
    })
    print(f"  Baseline {i+1}/{CONFIG['num_test_prompts']}: {len(baseline_output[0])} tokens")

print(f"✓ Generated {len(baselines)} baseline outputs")

Step 4: Running improved interventions...
Generating baseline outputs...
  Baseline 1/5: 1081 tokens
  Baseline 2/5: 1144 tokens
  Baseline 3/5: 1081 tokens
  Baseline 4/5: 1048 tokens
  Baseline 5/5: 1373 tokens
✓ Generated 5 baseline outputs


In [11]:
# Run interventions on first test prompt with quality metrics
print(f"\nSweeping layers and strengths with quality metrics...")
print(f"  Strength range: {CONFIG['strength_range']}")
print(f"  Number of strengths: {CONFIG['num_strengths']}")
print(f"  Layers to test: {len(layers_to_test)}")

intervention_results = patcher.sweep_layers_and_strengths(
    tokenizer=rl_tokenizer,
    prompt=prompts[0],
    layer_range=(min(layers_to_test), max(layers_to_test)),
    strength_range=CONFIG['strength_range'],
    num_strengths=CONFIG['num_strengths'],
    max_new_tokens=CONFIG['max_new_tokens'],
    evaluator=evaluator  # NOW CAPTURES QUALITY METRICS!
)
print(f"✓ Completed {len(intervention_results)} intervention experiments")


Sweeping layers and strengths with quality metrics...
  Strength range: (-2.0, 2.0)
  Number of strengths: 9
  Layers to test: 16
Applied interventions to layers: [0]
Applied interventions to layers: [0]
Applied interventions to layers: [0]
Applied interventions to layers: [0]
Applied interventions to layers: [0]
Applied interventions to layers: [0]
Applied interventions to layers: [0]
Applied interventions to layers: [0]
Applied interventions to layers: [0]
Applied interventions to layers: [4]
Applied interventions to layers: [4]
Applied interventions to layers: [4]
Applied interventions to layers: [4]
Applied interventions to layers: [4]
Applied interventions to layers: [4]
Applied interventions to layers: [4]
Applied interventions to layers: [4]
Applied interventions to layers: [4]
Applied interventions to layers: [8]
Applied interventions to layers: [8]
Applied interventions to layers: [8]
Applied interventions to layers: [8]
Applied interventions to layers: [8]
Applied interventi

## Step 5: Enhanced Evaluation

In [12]:
print("Step 5: Evaluating results with improved metrics...")

# Analyze layer sensitivity using token count
sensitivity_tokens = evaluator.analyze_layer_sensitivity(intervention_results, metric='token_count')

# NEW: Analyze layer sensitivity using quality metrics
# Extract think token counts for sensitivity analysis
think_token_results = []
for result in intervention_results:
    if 'tokens' in result:
        think_token_results.append({
            'layer': result['layer'],
            'think_tokens': result['tokens']['think_tokens']
        })

if think_token_results:
    sensitivity_think = evaluator.analyze_layer_sensitivity(think_token_results, metric='think_tokens')
    print(f"✓ Think token sensitivity analyzed")

critical_layers = evaluator.identify_critical_layers(sensitivity_tokens)
print(f"✓ Critical layers identified: {critical_layers}")

if think_token_results:
    critical_layers_think = evaluator.identify_critical_layers(sensitivity_think)
    print(f"✓ Critical layers (think tokens): {critical_layers_think}")

Step 5: Evaluating results with improved metrics...
✓ Think token sensitivity analyzed
✓ Critical layers identified: [0, 16, 40, 44]
✓ Critical layers (think tokens): [8, 20, 24, 44]
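
The two lists of critical layers differ because they rank layers by different metrics. As an illustration only, the sketch below scores each layer by how much a metric varies across strengths and keeps the top k. This is one plausible reading of `analyze_layer_sensitivity` and `identify_critical_layers`, not their actual implementation; `layer_sensitivity` and `top_k_layers` are hypothetical helpers and the data is invented.

```python
from collections import defaultdict
from statistics import pstdev

def layer_sensitivity(results, metric):
    """Population standard deviation of `metric` across strengths, per layer."""
    by_layer = defaultdict(list)
    for r in results:
        by_layer[r['layer']].append(r[metric])
    return {layer: pstdev(vals) for layer, vals in by_layer.items()}

def top_k_layers(sensitivity, k):
    """Layers with the k largest scores, returned in ascending layer order."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)[:k]
    return sorted(ranked)

# Invented toy results: two strengths per layer
toy_results = [
    {'layer': 0, 'think_tokens': 900}, {'layer': 0, 'think_tokens': 900},
    {'layer': 4, 'think_tokens': 0},   {'layer': 4, 'think_tokens': 1000},
    {'layer': 8, 'think_tokens': 800}, {'layer': 8, 'think_tokens': 1000},
]
scores = layer_sensitivity(toy_results, 'think_tokens')
print(scores, top_k_layers(scores, k=2))
```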


## Step 6: Quality Analysis

In [13]:
print("Step 6: Analyzing reasoning quality changes...")

# Analyze quality metric variations by layer
quality_by_layer = {}
for layer in layers_to_test:
    layer_results = [r for r in intervention_results if r['layer'] == layer and 'quality' in r]
    if layer_results:
        quality_by_layer[layer] = {
            'reasoning_steps': [r['quality']['reasoning_steps'] for r in layer_results],
            'backtracking': [r['quality']['backtracking_count'] for r in layer_results],
            'hesitation': [r['quality']['hesitation_count'] for r in layer_results],
            'verbosity': [r['quality']['verbosity'] for r in layer_results]
        }

print(f"✓ Quality analysis completed for {len(quality_by_layer)} layers")

# Find layers with most quality variation
import numpy as np
quality_variance = {}
for layer, metrics in quality_by_layer.items():
    quality_variance[layer] = {
        'reasoning_steps_var': np.var(metrics['reasoning_steps']),
        'backtracking_var': np.var(metrics['backtracking']),
        'hesitation_var': np.var(metrics['hesitation']),
        'verbosity_var': np.var(metrics['verbosity']),
        'total_var': sum([
            np.var(metrics['reasoning_steps']),
            np.var(metrics['backtracking']),
            np.var(metrics['hesitation']),
            np.var(metrics['verbosity'])
        ])
    }

# Sort layers by total quality variance
sorted_quality_layers = sorted(quality_variance.items(), key=lambda x: x[1]['total_var'], reverse=True)
print("\nTop 5 layers by quality metric variance:")
for layer, var in sorted_quality_layers[:5]:
    print(f"  Layer {layer}: total_var={var['total_var']:.2f}")

Step 6: Analyzing reasoning quality changes...
✓ Quality analysis completed for 16 layers

Top 5 layers by quality metric variance:
  Layer 20: total_var=90429.41
  Layer 32: total_var=85702.86
  Layer 24: total_var=77782.69
  Layer 44: total_var=75115.14
  Layer 0: total_var=71630.02
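
Because `total_var` adds raw variances, a metric measured on a larger numeric scale can dominate the sum. One possible refinement, sketched here under that assumption, divides each metric's variance by its squared mean before summing. `statistics.pvariance` matches the population variance that `np.var` computes by default. The toy numbers are invented and `scale_free_variance` is a hypothetical helper.

```python
from statistics import fmean, pvariance

def scale_free_variance(values):
    """Squared coefficient of variation: population variance / mean**2 (0 if mean is 0)."""
    mean = fmean(values)
    if mean == 0:
        return 0.0
    return pvariance(values) / (mean * mean)

# Invented toy values for one layer across three strengths
metrics = {
    'reasoning_steps': [10.0, 20.0, 30.0],
    'verbosity': [500.0, 1000.0, 1500.0],
}
raw = {name: pvariance(vals) for name, vals in metrics.items()}
scaled = {name: scale_free_variance(vals) for name, vals in metrics.items()}
print(raw)
print(scaled)
```

In this toy case both metrics vary by the same relative amount, yet the raw variance of `verbosity` is thousands of times larger; the scaled values are equal.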


## Step 7: Generate Enhanced Report and Save Results

In [14]:
print("Step 7: Generating enhanced report...")

# Generate evaluation report
report = evaluator.generate_report(
    intervention_results,
    output_file=str(output_path / "evaluation_report.txt")
)

# Save comprehensive results
results_data = {
    'config': CONFIG,
    'model_info': model_info,
    'directions_stats': calculator.compute_direction_stats(),
    'baselines': baselines,
    'critical_layers': critical_layers,
    'critical_layers_think': critical_layers_think if think_token_results else [],
    'layer_sensitivity': sensitivity_tokens,
    'layer_sensitivity_think': sensitivity_think if think_token_results else {},
    'quality_variance': quality_variance,
    'top_quality_layers': [(layer, var['total_var']) for layer, var in sorted_quality_layers[:10]],
    'intervention_results': intervention_results
}

# Save with proper serialization
with open(output_path / "results.json", 'w') as f:
    json.dump(results_data, f, indent=2, default=str)

print(f"✓ Results saved to {output_path}")
print("\n" + "="*60)
print("IMPROVED PIPELINE COMPLETED SUCCESSFULLY!")
print("="*60)
print(f"\nTotal interventions: {len(intervention_results)}")
print(f"Critical layers: {critical_layers}")
print(f"Quality-sensitive layers: {[layer for layer, _ in sorted_quality_layers[:5]]}")

Step 7: Generating enhanced report...
Report saved to /scratch/gilbreth/sramishe/results_QwQ_R1/results_improved/evaluation_report.txt
✓ Results saved to /scratch/gilbreth/sramishe/results_QwQ_R1/results_improved

IMPROVED PIPELINE COMPLETED SUCCESSFULLY!

Total interventions: 144
Critical layers: [0, 16, 40, 44]
Quality-sensitive layers: [20, 32, 24, 44, 0]


## Visualization: Compare Baseline vs Best/Worst Interventions

In [15]:
# Find interventions with most extreme effects
if intervention_results and 'quality' in intervention_results[0]:
    baseline_quality = baselines[0]['quality']
    
    # Calculate quality deltas
    for result in intervention_results:
        if 'quality' in result:
            result['reasoning_delta'] = result['quality']['reasoning_steps'] - baseline_quality['reasoning_steps']
            result['backtracking_delta'] = result['quality']['backtracking_count'] - baseline_quality['backtracking_count']
    
    # Sort by reasoning steps delta
    sorted_by_reasoning = sorted(
        [r for r in intervention_results if 'reasoning_delta' in r],
        key=lambda x: x['reasoning_delta']
    )
    
    print("\nMost Suppressed Reasoning:")
    for r in sorted_by_reasoning[:3]:
        print(f"  Layer {r['layer']}, strength {r['strength']:.2f}: {r['reasoning_delta']:+d} steps")
    
    print("\nMost Enhanced Reasoning:")
    for r in sorted_by_reasoning[-3:]:
        print(f"  Layer {r['layer']}, strength {r['strength']:.2f}: {r['reasoning_delta']:+d} steps")


Most Suppressed Reasoning:
  Layer 0, strength -1.00: +0 steps
  Layer 0, strength 0.00: +0 steps
  Layer 0, strength 0.50: +0 steps

Most Enhanced Reasoning:
  Layer 52, strength 0.50: +22 steps
  Layer 28, strength -0.50: +24 steps
  Layer 52, strength -1.50: +29 steps
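
In the output above the smallest deltas are all +0, so on this prompt no configuration reduced the step count below the baseline. A small sketch of separating genuinely suppressed from enhanced runs by the sign of the delta, using the six rows printed above; `split_by_sign` is a hypothetical helper.

```python
def split_by_sign(results):
    """Separate runs that reduced reasoning steps from runs that increased them."""
    suppressed = sorted(
        (r for r in results if r['reasoning_delta'] < 0),
        key=lambda r: r['reasoning_delta'],
    )
    enhanced = sorted(
        (r for r in results if r['reasoning_delta'] > 0),
        key=lambda r: r['reasoning_delta'],
        reverse=True,
    )
    return suppressed, enhanced

# The six rows printed above, as (layer, strength, reasoning_delta)
rows = [(0, -1.0, 0), (0, 0.0, 0), (0, 0.5, 0), (52, 0.5, 22), (28, -0.5, 24), (52, -1.5, 29)]
results = [{'layer': l, 'strength': s, 'reasoning_delta': d} for l, s, d in rows]

suppressed, enhanced = split_by_sign(results)
print(len(suppressed), [(r['layer'], r['strength']) for r in enhanced])
```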


In [17]:
# Configuration for inverse direction test
CONFIG = {
    'rl_model_name': "Qwen/QwQ-32B",
    'strength_range': (-2.0, 0.0),  # NEGATIVE to move toward distilled
    'num_strengths': 5,  # [-2.0, -1.5, -1.0, -0.5, 0.0]
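    'test_layers': [20, 30, 40],  # Layer 30 has no stored direction (only every 4th layer was computed), so it is skipped below
    'num_test_problems': 10,  # Load 10 MATH-500 problems (the sweep below uses only the first)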
    'max_new_tokens': 1024,
    'output_dir': '/scratch/gilbreth/sramishe/results_QwQ_R1/inverse_test'
}

print("Testing Inverse Direction Strategy:")
print(f"  Model: {CONFIG['rl_model_name']}")
print(f"  Strength range: {CONFIG['strength_range']} (negative = toward distilled)")
print(f"  Test layers: {CONFIG['test_layers']}")
print(f"  Test problems: {CONFIG['num_test_problems']}")
print(f"\nExpectation: Negative strengths should improve performance")
print(f"  - Baseline (strength=0): ~31.80% accuracy")
print(f"  - With intervention (strength=-2.0): hopefully > 31.80%")

Testing Inverse Direction Strategy:
  Model: Qwen/QwQ-32B
  Strength range: (-2.0, 0.0) (negative = toward distilled)
  Test layers: [20, 30, 40]
  Test problems: 10

Expectation: Negative strengths should improve performance
  - Baseline (strength=0): ~31.80% accuracy
  - With intervention (strength=-2.0): hopefully > 31.80%


In [18]:
# Run intervention sweep with NEGATIVE strengths
import numpy as np

print("Loading test data...")
processor = DataProcessor(
    dataset_name="HuggingFaceH4/MATH-500",
    include_toy_tasks=False
)
dataset = processor.load_dataset(max_samples=CONFIG['num_test_problems'])
test_prompts = processor.prepare_batch(dataset[:CONFIG['num_test_problems']], rl_tokenizer)

print(f"✓ Model loaded")
print(f"✓ {len(test_prompts)} test problems prepared")

evaluator = ReasoningEvaluator()
results = []

# Test each layer
for layer in CONFIG['test_layers']:
    if layer not in directions:
        print(f"Skipping layer {layer} (no direction available)")
        continue
    
    print(f"\nTesting layer {layer}...")
    patcher = ActivationPatcher(rl_model, directions)
    
    # Test each strength
    strengths = np.linspace(CONFIG['strength_range'][0], CONFIG['strength_range'][1], CONFIG['num_strengths'])
    
    for strength in strengths:
        print(f"  Strength: {strength:.2f}", end=" ")
        
        # Generate with intervention
        output = patcher.generate_with_intervention(
            tokenizer=rl_tokenizer,
            prompt=test_prompts[0],  # Use first test problem
            layers=[layer],
            strength=float(strength),
            max_new_tokens=CONFIG['max_new_tokens']
        )
        
        # Evaluate quality
        tokens = evaluator.count_tokens(output, rl_tokenizer, split_by_tags=True)
        quality = evaluator.analyze_reasoning_quality(output)
        
        result = {
            'layer': layer,
            'strength': float(strength),
            'output': output,
            'tokens': tokens,
            'quality': quality
        }
        results.append(result)
        
        print(f"→ think_tokens: {tokens['think_tokens']}, steps: {quality['reasoning_steps']}")

print(f"\n✓ Completed {len(results)} experiments")

Loading test data...
Loading dataset: HuggingFaceH4/MATH-500
✓ Model loaded
✓ 10 test problems prepared

Testing layer 20...
  Strength: -2.00 Applied interventions to layers: [20]
→ think_tokens: 0, steps: 0
  Strength: -1.50 Applied interventions to layers: [20]
→ think_tokens: 942, steps: 27
  Strength: -1.00 Applied interventions to layers: [20]
→ think_tokens: 1019, steps: 22
  Strength: -0.50 Applied interventions to layers: [20]
→ think_tokens: 910, steps: 19
  Strength: 0.00 Applied interventions to layers: [20]
→ think_tokens: 0, steps: 0
Skipping layer 30 (no direction available)

Testing layer 40...
  Strength: -2.00 Applied interventions to layers: [40]
→ think_tokens: 0, steps: 0
  Strength: -1.50 Applied interventions to layers: [40]
→ think_tokens: 0, steps: 0
  Strength: -1.00 Applied interventions to layers: [40]
→ think_tokens: 0, steps: 0
  Strength: -0.50 Applied interventions to layers: [40]
→ think_tokens: 0, steps: 0
  Strength: 0.00 Applied interventions to laye

In [23]:
# Analyze results
import pandas as pd

# Convert to DataFrame for analysis
df_results = pd.DataFrame([{
    'layer': r['layer'],
    'strength': r['strength'],
    'total_tokens': r['tokens']['total_tokens'],
    'think_tokens': r['tokens']['think_tokens'],
    'reasoning_steps': r['quality']['reasoning_steps'],
    'backtracking': r['quality']['backtracking_count'],
    'hesitation': r['quality']['hesitation_count'],
} for r in results])

print("\nResults Summary:")
print(df_results.to_string(index=False))

# Find best configurations
print("\n" + "="*60)
print("ANALYSIS")
print("="*60)


for layer in CONFIG['test_layers']:
    layer_results = df_results[df_results['layer'] == layer]

    if layer_results.empty:
        print(f"\nLayer {layer}: no results found, skipping")
        continue

    baseline_rows = layer_results[layer_results['strength'] == 0.0]

    if baseline_rows.empty:
        print(f"\nLayer {layer}: no baseline (strength=0.0), skipping")
        continue

    baseline = baseline_rows.iloc[0]
    best = layer_results.loc[layer_results['reasoning_steps'].idxmax()]

    print(f"\nLayer {layer}:")
    print(f"  Baseline (strength=0.0):")
    print(f"    Think tokens: {baseline['think_tokens']}")
    print(f"    Reasoning steps: {baseline['reasoning_steps']}")
    print(f"  Best (strength={best['strength']:.2f}):")
    print(
        f"    Think tokens: {best['think_tokens']:.1f} "
        f"({best['think_tokens'] - baseline['think_tokens']:+.1f})"
    )
    print(
        f"    Reasoning steps: {best['reasoning_steps']:.1f} "
        f"({best['reasoning_steps'] - baseline['reasoning_steps']:+.1f})"
    )



Results Summary:
 layer  strength  total_tokens  think_tokens  reasoning_steps  backtracking  hesitation
    20      -2.0          1078             0                0             0           0
    20      -1.5          1078           942               27             3           1
    20      -1.0          1078          1019               22             2           2
    20      -0.5          1078           910               19             2           2
    20       0.0          1078             0                0             0           0
    40      -2.0          1078             0                0             0           0
    40      -1.5          1078             0                0             0           0
    40      -1.0          1078             0                0             0           0
    40      -0.5          1078             0                0             0           0
    40       0.0          1078             0                0             0           0

ANALYSIS

Lay

In [24]:
# Save results
output_path = Path(CONFIG['output_dir'])
output_path.mkdir(parents=True, exist_ok=True)

with open(output_path / 'inverse_direction_results.json', 'w') as f:
    json.dump(results, f, indent=2, default=str)

print(f"Results saved to {output_path / 'inverse_direction_results.json'}")

Results saved to /scratch/gilbreth/sramishe/results_QwQ_R1/inverse_test/inverse_direction_results.json


In [26]:
import sys
sys.path.append('../pipeline')

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm
import json
import re
from pathlib import Path
import numpy as np

# Import intervention and evaluator
from intervention import ActivationPatcher
from evaluator import ReasoningEvaluator

print("✓ Imports successful")

✓ Imports successful


In [27]:
# Model configuration
RL_MODEL = "Qwen/QwQ-32B"
DATASET = "HuggingFaceH4/MATH-500"

# Output directory
OUTPUT_DIR = Path("/scratch/gilbreth/sramishe/results_QwQ_R1/interventions_math500")
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

# Intervention configurations to test
INTERVENTIONS = [
    {"name": "baseline", "layer": None, "strength": 0.0},
    {"name": "L16_s-2.0_thorough", "layer": 16, "strength": -2.0},
    {"name": "L0_s-1.5_distilled", "layer": 0, "strength": -1.5},
    {"name": "L20_s-1.0_concise", "layer": 20, "strength": -1.0},
    {"name": "L0_s-2.0_distilled_strong", "layer": 0, "strength": -2.0},
    {"name": "L16_s-0.5_moderate", "layer": 16, "strength": -0.5},
]

# Generation parameters
MAX_NEW_TOKENS = 2048  # Increased to avoid truncation
TEMPERATURE = 0.0  # Greedy decoding for reproducibility (the evaluation function below enforces this via do_sample=False)

print(f"Testing {len(INTERVENTIONS)} intervention configurations")
print(f"Output directory: {OUTPUT_DIR}")

Testing 6 intervention configurations
Output directory: /scratch/gilbreth/sramishe/results_QwQ_R1/interventions_math500


In [28]:
# Load MATH-500
print("Loading MATH-500 dataset...")
dataset = load_dataset(DATASET, split="test")

print(f"Dataset loaded: {len(dataset)} problems")
print(f"Sample problem:")
print(f"  Question: {dataset[0]['problem'][:100]}...")
print(f"  Answer: {dataset[0]['answer']}")

Loading MATH-500 dataset...
Dataset loaded: 500 problems
Sample problem:
  Question: Convert the point $(0,3)$ in rectangular coordinates to polar coordinates.  Enter your answer in the...
  Answer: \left( 3, \frac{\pi}{2} \right)


In [29]:
print(f"✓ Model loaded: {RL_MODEL}")
print(f"  Layers: {rl_model.config.num_hidden_layers}")
print(f"  Hidden size: {rl_model.config.hidden_size}")

✓ Model loaded: Qwen/QwQ-32B
  Layers: 64
  Hidden size: 5120


In [30]:
# Load reasoning directions
print("Loading reasoning directions...")
directions_path = Path("/scratch/gilbreth/sramishe/results_QwQ_R1/results_improved/reasoning_directions.pt")

if directions_path.exists():
    directions = torch.load(directions_path)
    print(f"✓ Loaded directions for {len(directions)} layers")
    print(f"  Available layers: {sorted(directions.keys())}")
else:
    print("⚠️  Directions file not found!")
    print("   Need to run direction extraction first")
    raise FileNotFoundError(f"Directions not found at {directions_path}")

Loading reasoning directions...
✓ Loaded directions for 16 layers
  Available layers: [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60]
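
Each stored direction is a vector in the model's hidden space (hidden size 5120 for QwQ-32B, per the cell above). A common form of activation steering adds a scaled copy of the direction to the hidden state at the chosen layer. The sketch below shows that update on a toy 3-dimensional vector. Whether `ActivationPatcher` applies exactly this rule is an assumption, and `steer` is a hypothetical helper.

```python
def steer(hidden, direction, strength):
    """Return hidden + strength * direction, element-wise."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same length")
    return [h + strength * d for h, d in zip(hidden, direction)]

hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, 0.0]   # unit vector along the first dimension

moved = steer(hidden, direction, -2.0)     # negative strength: move against the direction
unchanged = steer(hidden, direction, 0.0)  # zero strength: hidden state is unchanged
print(moved, unchanged)
```

Under this rule a strength of 0.0 leaves activations untouched, so a strength-0 run would be expected to match the unpatched baseline.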


In [31]:
from sympy import simplify, sympify, N
from sympy.parsing.latex import parse_latex

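def extract_answer(text):
    r"""
    Extract answer from model output.
    Looks for \boxed{} format first, then falls back to last number.
    """
    # Look for \boxed{...}, matching braces so that nested LaTeX such as
    # \boxed{\frac{1}{2}} is captured whole rather than cut at the first '}'
    boxed_matches = []
    for match in re.finditer(r'\\boxed\{', text):
        depth, i = 1, match.end()
        while i < len(text) and depth > 0:
            depth += {'{': 1, '}': -1}.get(text[i], 0)
            i += 1
        if depth == 0 and i - 1 > match.end():
            boxed_matches.append(text[match.end():i - 1])
    if boxed_matches:
        return boxed_matches[-1].strip()
    
    # Fallback: find last number-like pattern
    number_pattern = r'[-+]?\d*\.?\d+'
    number_matches = re.findall(number_pattern, text)
    if number_matches:
        return number_matches[-1].strip()
    
    return None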

def normalize_answer(answer):
    """
    Normalize answer for comparison.
    Handles LaTeX, fractions, etc.
    """
    if answer is None:
        return None
    
    try:
        # Try parsing as LaTeX first
        if '\\' in answer:
            expr = parse_latex(answer)
        else:
            expr = sympify(answer)
        
        # Simplify
        simplified = simplify(expr)
        return simplified
    except Exception:
        # If parsing fails, return string
        return answer.strip().lower()

def grade_answer(predicted, ground_truth):
    """
    Grade predicted answer against ground truth.
    Returns True if correct, False otherwise.
    """
    pred_norm = normalize_answer(predicted)
    truth_norm = normalize_answer(ground_truth)
    
    if pred_norm is None or truth_norm is None:
        return False
    
    try:
        # Try symbolic comparison
        return simplify(pred_norm - truth_norm) == 0
    except Exception:
        # Fallback to string comparison
        return str(pred_norm) == str(truth_norm)

print("✓ Grading functions defined")

# Test grading
test_cases = [
    ("42", "42", True),
    ("1/2", "0.5", True),
    ("\\frac{1}{2}", "0.5", True),
    ("42", "43", False),
]

print("\nTesting grading function:")
for pred, truth, expected in test_cases:
    result = grade_answer(pred, truth)
    status = "✓" if result == expected else "✗"
    print(f"  {status} grade_answer('{pred}', '{truth}') = {result} (expected {expected})")

✓ Grading functions defined

Testing grading function:
  ✓ grade_answer('42', '42') = True (expected True)
  ✓ grade_answer('1/2', '0.5') = True (expected True)
  ✗ grade_answer('\frac{1}{2}', '0.5') = False (expected True)
  ✓ grade_answer('42', '43') = False (expected False)


  ErrorListener = import_module('antlr4.error.ErrorListener',
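
The third test case fails in the run above, possibly because `parse_latex` could not be used (the warning mentions `antlr4`). For simple numeric answers, a dependency-free fallback can rewrite `\frac{a}{b}` as `a/b` and compare exact rationals. The sketch below works under that assumption. `to_fraction` and `simple_equal` are hypothetical helpers that only cover integers, plain decimals, and integer fractions.

```python
import re
from fractions import Fraction

# Matches \frac{a}{b} and \dfrac{a}{b} with integer numerator and denominator
FRAC_RE = re.compile(r'\\d?frac\{\s*(-?\d+)\s*\}\{\s*(-?\d+)\s*\}')

def to_fraction(answer):
    """Parse '42', '0.5', '1/2' or a LaTeX integer fraction; return None if unsupported."""
    text = FRAC_RE.sub(r'\1/\2', answer.strip())
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None

def simple_equal(predicted, truth):
    """Exact rational comparison; False when either side cannot be parsed."""
    a, b = to_fraction(predicted), to_fraction(truth)
    return a is not None and a == b

for pred, truth in [("42", "42"), ("1/2", "0.5"), ("\\frac{1}{2}", "0.5"), ("42", "43")]:
    print(pred, truth, simple_equal(pred, truth))
```

Such a check could run before the symbolic comparison in `grade_answer`, leaving SymPy to handle everything it cannot parse.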


In [32]:
def evaluate_intervention(intervention_config, dataset, model, tokenizer, directions, evaluator):
    """
    Evaluate a single intervention configuration on MATH-500.
    
    Returns:
        dict with accuracy, results, and metadata
    """
    name = intervention_config['name']
    layer = intervention_config['layer']
    strength = intervention_config['strength']
    
    print(f"\n{'='*80}")
    print(f"Evaluating: {name}")
    if layer is not None:
        print(f"  Layer: {layer}, Strength: {strength}")
    else:
        print(f"  Baseline (no intervention)")
    print(f"{'='*80}")
    
    # Initialize intervention
    if layer is not None:
        patcher = ActivationPatcher(
            model=model,
            directions={layer: directions[layer]}
        )
    else:
        patcher = None
    
    results = []
    correct = 0
    total = 0
    
    for i, sample in enumerate(tqdm(dataset, desc=f"Evaluating {name}")):
        problem = sample['problem']
        ground_truth = sample['answer']
        
        # Format prompt
        prompt = f"""user: {problem}
assistant:"""
        
        # Generate with intervention
        if patcher is not None:
            output = patcher.generate_with_intervention(
                tokenizer=tokenizer,
                prompt=prompt,
                layers=[layer],  # layers is a list
                strength=strength,
                max_new_tokens=MAX_NEW_TOKENS,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id
            )
            # Remove prompt from output
            output = output[len(prompt):].strip()
        else:
            # Baseline generation
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            outputs = model.generate(
                **inputs,
                max_new_tokens=MAX_NEW_TOKENS,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id
            )
            output = tokenizer.decode(outputs[0], skip_special_tokens=True)
            output = output[len(prompt):].strip()
        
        # Extract answer
        predicted_answer = extract_answer(output)
        
        # Grade
        is_correct = grade_answer(predicted_answer, ground_truth)
        
        if is_correct:
            correct += 1
        total += 1
        
        # Get quality metrics
        tokens = evaluator.count_tokens(output, tokenizer, split_by_tags=True)
        quality = evaluator.analyze_reasoning_quality(output)
        
        # Store result
        result = {
            'problem_id': i,
            'problem': problem,
            'ground_truth': ground_truth,
            'output': output,
            'predicted_answer': predicted_answer,
            'is_correct': is_correct,
            'tokens': tokens,
            'quality': quality
        }
        results.append(result)
        
        # Print progress every 50 problems
        if (i + 1) % 50 == 0:
            current_acc = 100.0 * correct / total
            print(f"  Progress: {i+1}/{len(dataset)}, Accuracy: {current_acc:.2f}% ({correct}/{total})")
    
    # Calculate final accuracy (guard against an empty dataset)
    accuracy = 100.0 * correct / total if total else 0.0
    
    print(f"\n{'='*80}")
    print(f"RESULTS: {name}")
    print(f"  Correct: {correct}/{total}")
    print(f"  Accuracy: {accuracy:.2f}%")
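    # NOTE: 31.80 is a hard-coded reference accuracy. It is printed for every
    # run, including the baseline run itself, so "Improvement" is always
    # measured against this constant rather than the baseline evaluated here.
    print(f"  Baseline: 31.80%")
    print(f"  Improvement: {accuracy - 31.80:+.2f}pp")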
    print(f"{'='*80}")
    
    return {
        'config': intervention_config,
        'accuracy': accuracy,
        'correct': correct,
        'total': total,
        'results': results
    }

print("✓ Evaluation function defined")

✓ Evaluation function defined
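
`evaluate_intervention` reports accuracy as a single percentage. On a small subset that number carries a lot of sampling noise, so it can help to report an interval next to it. The following is a minimal sketch using the Wilson score interval for a binomial proportion. The helper name `wilson_interval` is hypothetical and not part of the pipeline, and the interval assumes the problems are independent draws.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage).

    Hypothetical helper, not part of the pipeline modules.
    Returns (low, high) as fractions in [0, 1].
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return center - half, center + half

# Example with the counts stored in a result dict ('correct' and 'total')
low, high = wilson_interval(correct=20, total=50)
print(f"Accuracy 95% interval: {100 * low:.1f}% to {100 * high:.1f}%")
```

For a run with 20 of 50 correct, the interval spans roughly 28% to 54%. That range contains the 31.80% reference value printed by the function, so a gap of a few percentage points on 50 problems should not be read as an improvement on its own.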


In [33]:
evaluator = ReasoningEvaluator()
print("✓ Evaluator initialized (with fixed extract_think_tags)")

✓ Evaluator initialized (with fixed extract_think_tags)


In [34]:
# OPTION 1: Quick test on subset (recommended first)
TEST_SUBSET = True
SUBSET_SIZE = 50  # Test on first 50 problems

if TEST_SUBSET:
    print(f"\n⚠️  TESTING ON SUBSET: First {SUBSET_SIZE} problems")
    print(f"   Change TEST_SUBSET=False to run on full dataset\n")
    test_dataset = dataset.select(range(SUBSET_SIZE))
else:
    print("\n✓ Running on FULL DATASET (500 problems)\n")
    test_dataset = dataset


⚠️  TESTING ON SUBSET: First 50 problems
   Change TEST_SUBSET=False to run on full dataset
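
Two caveats apply to the subset run. Taking the first 50 problems may not be representative of the full dataset, and the run loop caches results by intervention name only, so a file written during a subset run would later be picked up and skipped by a full run. The sketch below shows one possible way to handle both, under stated assumptions: `sample_indices` and `result_path` are hypothetical helpers, not pipeline functions, and the `n{count}` filename suffix is an illustrative convention.

```python
import random
from pathlib import Path

def sample_indices(n_total, k, seed=0):
    """Return k distinct, sorted indices drawn uniformly from range(n_total).

    A fixed seed keeps the subset identical across interventions, so every
    intervention is scored on the same problems.
    """
    if k > n_total:
        raise ValueError("k cannot exceed n_total")
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), k))

def result_path(output_dir, name, n_problems):
    """Build a results filename that records how many problems were evaluated.

    Hypothetical naming scheme: <name>_n<count>_results.json
    """
    return Path(output_dir) / f"{name}_n{n_problems}_results.json"

# Usage inside the notebook (assumes `dataset`, SUBSET_SIZE and OUTPUT_DIR exist):
#   test_dataset = dataset.select(sample_indices(len(dataset), SUBSET_SIZE))
#   result_file = result_path(OUTPUT_DIR, name, len(test_dataset))
```

With this naming, a 50-problem run and a 500-problem run write to different files, so neither masks the other when the loop checks for existing results.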



In [35]:
# Run all evaluations
all_results = {}

for intervention in INTERVENTIONS:
    name = intervention['name']
    
    # Check if already computed
    result_file = OUTPUT_DIR / f"{name}_results.json"
    if result_file.exists():
        print(f"\n⚠️  Skipping {name} - results already exist at {result_file}")
        print(f"   Delete the file to re-run\n")
        with open(result_file, 'r') as f:
            all_results[name] = json.load(f)
        continue
    
    # Run evaluation
    result = evaluate_intervention(
        intervention_config=intervention,
        dataset=test_dataset,
        model=rl_model,
        tokenizer=rl_tokenizer,
        directions=directions,
        evaluator=evaluator
    )
    
    # Save individual result
    with open(result_file, 'w') as f:
        json.dump(result, f, indent=2, default=str)
    print(f"\n✓ Saved results to {result_file}")
    
    all_results[name] = result

print("\n" + "="*80)
print("ALL EVALUATIONS COMPLETE")
print("="*80)


Evaluating: baseline
  Baseline (no intervention)


Evaluating baseline:   0%|          | 0/50 [00:00<?, ?it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Evaluating baseline: 100%|██████████| 50/50 [1:25:34<00:00, 102.69s/it]


  Progress: 50/50, Accuracy: 40.00% (20/50)

RESULTS: baseline
  Correct: 20/50
  Accuracy: 40.00%
  Baseline: 31.80%
  Improvement: +8.20pp

✓ Saved results to /scratch/gilbreth/sramishe/results_QwQ_R1/interventions_math500/baseline_results.json

Evaluating: L16_s-2.0_thorough
  Layer: 16, Strength: -2.0


Evaluating L16_s-2.0_thorough:   0%|          | 0/50 [00:00<?, ?it/s]

Applied interventions to layers: [16]


Evaluating L16_s-2.0_thorough:  84%|████████▍ | 42/50 [1:13:52<13:42, 102.84s/it]

Applied interventions to layers: [16]
