# Test Top Interventions on MATH-500 with Accuracy Measurement

## Objective

Test whether reasoning style changes from interventions translate to **correctness improvements**.

## Baseline Performance

- **RL Model (QwQ-32B)**: 31.80% (159/500)
- **Distilled Model**: 43.40% (217/500)
- **Gap to close**: 11.6 percentage points

## Top Intervention Candidates

Based on corrected quality analysis:

1. **Layer 16, strength=-2.0**: Most thorough reasoning (28 steps, 4 backtracking)
   - Hypothesis: Thoroughness catches more errors

2. **Layer 0, strength=-1.5**: Move toward distilled model (which performs better)
   - Hypothesis: Steering toward better performer improves accuracy

3. **Layer 20, strength=-1.0**: Most concise reasoning (11 steps, 1 backtracking)
   - Hypothesis: Efficiency avoids over-thinking errors

4. **Layer 0, strength=-2.0**: Stronger move toward distilled (13 steps, complete)
   - Alternative to test strength effect

5. **Layer 16, strength=-0.5**: Moderate reasoning (12 steps, complete)
   - Control: non-truncated result at Layer 16
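
Each candidate above is a (layer, strength) pair applied along a learned reasoning direction. The internals of `ActivationPatcher` are not shown in this notebook, so the following is only a minimal sketch under one common assumption: the hidden state at the chosen layer is shifted additively along the unit-normalized direction. The function name `steer` and the plain-list vectors are illustrative, not part of the pipeline.

```python
import math

def steer(hidden, direction, strength):
    """Shift one hidden-state vector along a unit-normalized direction.

    ASSUMPTION: additive steering, h' = h + strength * d / ||d||.
    The real ActivationPatcher may scale, project, or normalize differently.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + strength * d / norm for h, d in zip(hidden, direction)]

hidden = [1.0, 2.0, 3.0]
direction = [0.0, 3.0, 4.0]  # Euclidean norm is 5
# A negative strength moves the state against the direction vector.
print(steer(hidden, direction, -2.0))
```

Under this reading, the sign of `strength` picks the side of the direction to move toward, and its magnitude sets the step size in units of the direction's norm.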

## Success Criteria

- **Minimal**: >32.80% (+1pp, ~9% of gap closed)
- **Moderate**: 34.80-36.80% (+3-5pp, ~26-43% of gap)
- **Strong**: 39.80-41.80% (+8-10pp, ~69-86% of gap)
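
The thresholds can be checked as fractions of the 11.6pp gap between the two baselines. The helper below does that arithmetic; the names `BASELINE_ACC`, `DISTILLED_ACC`, and `gap_closed` are illustrative and do not appear elsewhere in the notebook.

```python
BASELINE_ACC = 31.80   # RL model accuracy (%), from the baseline section
DISTILLED_ACC = 43.40  # distilled model accuracy (%), from the baseline section

def gap_closed(accuracy: float) -> float:
    """Return the percentage of the baseline-to-distilled gap closed by `accuracy`."""
    return 100.0 * (accuracy - BASELINE_ACC) / (DISTILLED_ACC - BASELINE_ACC)

# Print each threshold boundary as an absolute gain and as a share of the gap.
for acc in (32.80, 34.80, 36.80, 39.80, 41.80):
    print(f"{acc:.2f}% -> +{acc - BASELINE_ACC:.1f}pp, {gap_closed(acc):.1f}% of gap")
```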

In [None]:
import sys
sys.path.append('../pipeline')

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm
import json
import re
from pathlib import Path
import numpy as np

# Import intervention and evaluator
from intervention import ActivationPatcher
from evaluator import ReasoningEvaluator

print("✓ Imports successful")

## Configuration

In [None]:
# Model configuration
RL_MODEL = "Qwen/QwQ-32B"
DATASET = "HuggingFaceH4/MATH-500"

# Output directory
OUTPUT_DIR = Path("/scratch/gilbreth/sramishe/results_QwQ_R1/interventions_math500")
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

# Intervention configurations to test
INTERVENTIONS = [
    {"name": "baseline", "layer": None, "strength": 0.0},
    {"name": "L16_s-2.0_thorough", "layer": 16, "strength": -2.0},
    {"name": "L0_s-1.5_distilled", "layer": 0, "strength": -1.5},
    {"name": "L20_s-1.0_concise", "layer": 20, "strength": -1.0},
    {"name": "L0_s-2.0_distilled_strong", "layer": 0, "strength": -2.0},
    {"name": "L16_s-0.5_moderate", "layer": 16, "strength": -0.5},
]

# Generation parameters
MAX_NEW_TOKENS = 2048  # Increased to avoid truncation
TEMPERATURE = 0.0  # For reference only: generation below uses do_sample=False (greedy decoding)

print(f"Testing {len(INTERVENTIONS)} intervention configurations")
print(f"Output directory: {OUTPUT_DIR}")

## Load Dataset

In [None]:
# Load MATH-500
print("Loading MATH-500 dataset...")
dataset = load_dataset(DATASET, split="test")

print(f"Dataset loaded: {len(dataset)} problems")
print(f"Sample problem:")
print(f"  Question: {dataset[0]['problem'][:100]}...")
print(f"  Answer: {dataset[0]['answer']}")

## Load Model and Directions

In [None]:
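import os

# NOTE: transformers was already imported in the first cell, and the Hugging Face
# cache location is resolved from these variables at import time. Setting them here
# is too late for this kernel, so the cache directory is also passed explicitly below.
os.environ['HF_HOME'] = '/scratch/gilbreth/sramishe'
os.environ['TRANSFORMERS_CACHE'] = '/scratch/gilbreth/sramishe/transformers'
HF_CACHE_DIR = os.environ['TRANSFORMERS_CACHE']

print("Loading RL model...")
rl_model = AutoModelForCausalLM.from_pretrained(
    RL_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
    cache_dir=HF_CACHE_DIR
)
tokenizer = AutoTokenizer.from_pretrained(RL_MODEL, cache_dir=HF_CACHE_DIR)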

print(f"✓ Model loaded: {RL_MODEL}")
print(f"  Layers: {rl_model.config.num_hidden_layers}")
print(f"  Hidden size: {rl_model.config.hidden_size}")

In [None]:
# Load reasoning directions
print("Loading reasoning directions...")
directions_path = Path("/scratch/gilbreth/sramishe/results_QwQ_R1/results_improved/reasoning_directions.pt")

if directions_path.exists():
    directions = torch.load(directions_path)
    print(f"✓ Loaded directions for {len(directions)} layers")
    print(f"  Available layers: {sorted(directions.keys())}")
else:
    print("⚠️  Directions file not found!")
    print("   Need to run direction extraction first")
    raise FileNotFoundError(f"Directions not found at {directions_path}")
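
Because a full run takes hours, it can help to confirm up front that every configured layer has a direction, rather than hitting a `KeyError` partway through. The sketch below assumes `directions` is a dict keyed by integer layer index, as the `directions[layer]` lookup later in the notebook suggests; the helper name and the example data are hypothetical.

```python
def missing_direction_layers(interventions, directions):
    """Return the sorted layers that the interventions need but `directions` lacks."""
    needed = {cfg["layer"] for cfg in interventions if cfg["layer"] is not None}
    return sorted(needed - set(directions))

# Illustrative stand-ins for INTERVENTIONS and the loaded directions dict.
example_interventions = [
    {"name": "baseline", "layer": None, "strength": 0.0},
    {"name": "L16_s-2.0_thorough", "layer": 16, "strength": -2.0},
    {"name": "L0_s-1.5_distilled", "layer": 0, "strength": -1.5},
]
example_directions = {0: "vec0", 20: "vec20"}  # placeholder values instead of tensors

print(missing_direction_layers(example_interventions, example_directions))  # [16]
```

In the notebook this would be called as `missing_direction_layers(INTERVENTIONS, directions)`, raising an error if the returned list is non-empty. If the saved file uses string keys, the comparison would need a conversion first.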

## Answer Extraction and Grading Functions

In [None]:
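from sympy import simplify, sympify
from sympy.parsing.latex import parse_latex  # default backend needs antlr4-python3-runtime

def extract_boxed(text):
    r"""Return the contents of the last \boxed{...}, handling nested braces."""
    marker = '\\boxed{'
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 1
    for i in range(content_start, len(text)):
        if text[i] == '{':
            depth += 1
        elif text[i] == '}':
            depth -= 1
            if depth == 0:
                return text[content_start:i].strip()
    return None  # unbalanced braces, e.g. output truncated mid-answer

def extract_answer(text):
    r"""
    Extract answer from model output.
    Looks for \boxed{} format first, then falls back to last number.
    """
    # Look for \boxed{...}; a [^}]+ regex would cut \boxed{\frac{1}{2}} at the first '}'
    boxed = extract_boxed(text)
    if boxed:
        return boxed
    
    # Fallback: find last number-like pattern
    number_pattern = r'[-+]?\d*\.?\d+'
    number_matches = re.findall(number_pattern, text)
    if number_matches:
        return number_matches[-1].strip()
    
    return None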

def normalize_answer(answer):
    """
    Normalize answer for comparison.
    Handles LaTeX, fractions, etc.
    """
    if answer is None:
        return None
    
    try:
        # Try parsing as LaTeX first
        if '\\' in answer:
            expr = parse_latex(answer)
        else:
            expr = sympify(answer)
        
        # Simplify
        simplified = simplify(expr)
        return simplified
    except Exception:
        # If parsing fails (not valid LaTeX or SymPy syntax), fall back to a plain string
        return answer.strip().lower()

def grade_answer(predicted, ground_truth):
    """
    Grade predicted answer against ground truth.
    Returns True if correct, False otherwise.
    """
    pred_norm = normalize_answer(predicted)
    truth_norm = normalize_answer(ground_truth)
    
    if pred_norm is None or truth_norm is None:
        return False
    
    try:
        # Try symbolic comparison
        return simplify(pred_norm - truth_norm) == 0
    except Exception:
        # Fallback to string comparison (e.g. when either side is an unparsed string)
        return str(pred_norm) == str(truth_norm)

print("✓ Grading functions defined")

# Test grading
test_cases = [
    ("42", "42", True),
    ("1/2", "0.5", True),
    ("\\frac{1}{2}", "0.5", True),
    ("42", "43", False),
]

print("\nTesting grading function:")
for pred, truth, expected in test_cases:
    result = grade_answer(pred, truth)
    status = "✓" if result == expected else "✗"
    print(f"  {status} grade_answer('{pred}', '{truth}') = {result} (expected {expected})")
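
When symbolic parsing fails, grading falls back to string equality, so presentation-only LaTeX differences can turn a correct answer into a mismatch. The helper below is a hedged sketch of a pre-normalization step that could run before that string comparison. The name `strip_latex_wrappers` and the particular set of rewrites are assumptions, not part of the pipeline, and the list is not exhaustive.

```python
import re

def strip_latex_wrappers(answer: str) -> str:
    """Remove presentation-only LaTeX so equivalent answers compare equal as strings."""
    s = answer.strip().strip("$")                     # inline math delimiters
    s = re.sub(r"\\(left|right)(?![a-zA-Z])", "", s)  # \left( ... \right) -> ( ... )
    s = re.sub(r"\\text\{([^{}]*)\}", r"\1", s)       # \text{Evelyn} -> Evelyn
    s = s.replace("\\!", "").replace("\\,", "")       # thin-space commands
    s = s.replace("\\dfrac", "\\frac")                # display-style fractions
    s = re.sub(r"\s+", "", s)                         # whitespace is not significant
    return s.lower()                                  # matches the lowercase string fallback

print(strip_latex_wrappers("$\\left( 3, \\frac{\\pi}{2} \\right)$"))
print(strip_latex_wrappers("\\text{Evelyn}"))
```

Lowercasing mirrors the existing string fallback in `normalize_answer`. It is lossy for case-sensitive LaTeX such as `\Pi`, so it is only appropriate on the fallback path.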

## Evaluation Function

In [None]:
def evaluate_intervention(intervention_config, dataset, model, tokenizer, directions, evaluator):
    """
    Evaluate a single intervention configuration on MATH-500.
    
    Returns:
        dict with accuracy, results, and metadata
    """
    name = intervention_config['name']
    layer = intervention_config['layer']
    strength = intervention_config['strength']
    
    print(f"\n{'='*80}")
    print(f"Evaluating: {name}")
    if layer is not None:
        print(f"  Layer: {layer}, Strength: {strength}")
    else:
        print(f"  Baseline (no intervention)")
    print(f"{'='*80}")
    
    # Initialize intervention
    if layer is not None:
        patcher = ActivationPatcher(
            model=model,
            directions={layer: directions[layer]}
        )
    else:
        patcher = None
    
    results = []
    correct = 0
    total = 0
    
    for i, sample in enumerate(tqdm(dataset, desc=f"Evaluating {name}")):
        problem = sample['problem']
        ground_truth = sample['answer']
        
        # Format prompt
        prompt = f"""user: {problem}
assistant:"""
        
        # Generate with intervention
        if patcher is not None:
            output = patcher.generate_with_intervention(
                tokenizer=tokenizer,
                prompt=prompt,
                layers=[layer],  # layers is a list
                strength=strength,
                max_new_tokens=MAX_NEW_TOKENS,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id
            )
            # Remove prompt from output
            output = output[len(prompt):].strip()
        else:
            # Baseline generation
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            outputs = model.generate(
                **inputs,
                max_new_tokens=MAX_NEW_TOKENS,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id
            )
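            # Decode only the newly generated tokens; slicing the decoded string by
            # len(prompt) can misalign if decoding does not reproduce the prompt exactly
            prompt_len = inputs["input_ids"].shape[1]
            output = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()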
        
        # Extract answer
        predicted_answer = extract_answer(output)
        
        # Grade
        is_correct = grade_answer(predicted_answer, ground_truth)
        
        if is_correct:
            correct += 1
        total += 1
        
        # Get quality metrics
        tokens = evaluator.count_tokens(output, tokenizer, split_by_tags=True)
        quality = evaluator.analyze_reasoning_quality(output)
        
        # Store result
        result = {
            'problem_id': i,
            'problem': problem,
            'ground_truth': ground_truth,
            'output': output,
            'predicted_answer': predicted_answer,
            'is_correct': is_correct,
            'tokens': tokens,
            'quality': quality
        }
        results.append(result)
        
        # Print progress every 50 problems
        if (i + 1) % 50 == 0:
            current_acc = 100.0 * correct / total
            print(f"  Progress: {i+1}/{len(dataset)}, Accuracy: {current_acc:.2f}% ({correct}/{total})")
    
    # Calculate final accuracy
    accuracy = 100.0 * correct / total
    
    print(f"\n{'='*80}")
    print(f"RESULTS: {name}")
    print(f"  Correct: {correct}/{total}")
    print(f"  Accuracy: {accuracy:.2f}%")
    print("  Reference baseline (full MATH-500): 31.80%")
    print(f"  Difference vs. reference: {accuracy - 31.80:+.2f}pp")
    print(f"{'='*80}")
    
    return {
        'config': intervention_config,
        'accuracy': accuracy,
        'correct': correct,
        'total': total,
        'results': results
    }

print("✓ Evaluation function defined")

## Initialize Evaluator

In [None]:
evaluator = ReasoningEvaluator()
print("✓ Evaluator initialized (with fixed extract_think_tags)")

## Run Evaluations

**Note:** This will take ~6-8 hours total (1+ hour per intervention configuration).

To save time, you can:
1. Test on a subset first (e.g., first 50 problems)
2. Run interventions in parallel across multiple GPUs rather than sequentially
3. Save checkpoints after each intervention
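
The evaluation loop below saves results only after a whole configuration finishes. A finer-grained option is to append each problem's result to a JSONL file and skip finished problems on restart. This is a minimal sketch of that idea; the helper names and the file name are hypothetical, and the inner loop stands in for the expensive generation step.

```python
import json
import tempfile
from pathlib import Path

def load_done_ids(checkpoint: Path) -> set:
    """Return the problem_ids already recorded in a JSONL checkpoint file."""
    if not checkpoint.exists():
        return set()
    with open(checkpoint) as f:
        return {json.loads(line)["problem_id"] for line in f if line.strip()}

def append_result(checkpoint: Path, result: dict) -> None:
    """Append one per-problem result as a single JSON line."""
    with open(checkpoint, "a") as f:
        f.write(json.dumps(result, default=str) + "\n")

# Demo with a temporary directory; the second pass simulates a restart.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "baseline_partial.jsonl"  # hypothetical file name
    for run in range(2):
        done = load_done_ids(ckpt)
        for problem_id in range(5):
            if problem_id in done:
                continue  # already evaluated before the restart
            append_result(ckpt, {"problem_id": problem_id, "is_correct": problem_id % 2 == 0})
    print(sorted(load_done_ids(ckpt)))
```

A process killed in the middle of a write could leave a truncated last line, so a production version would want to tolerate or discard a final line that fails to parse.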

In [None]:
# OPTION 1: Quick test on subset (recommended first)
TEST_SUBSET = True
SUBSET_SIZE = 50  # Test on first 50 problems

if TEST_SUBSET:
    print(f"\n⚠️  TESTING ON SUBSET: First {SUBSET_SIZE} problems")
    print(f"   Change TEST_SUBSET=False to run on full dataset\n")
    test_dataset = dataset.select(range(SUBSET_SIZE))
else:
    print("\n✓ Running on FULL DATASET (500 problems)\n")
    test_dataset = dataset

In [None]:
# Run all evaluations
all_results = {}

for intervention in INTERVENTIONS:
    name = intervention['name']
    
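    # Check if already computed. The problem count is part of the file name so that
    # a 50-problem subset run is never reused as if it were a full 500-problem run.
    result_file = OUTPUT_DIR / f"{name}_n{len(test_dataset)}_results.json"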
    if result_file.exists():
        print(f"\n⚠️  Skipping {name} - results already exist at {result_file}")
        print(f"   Delete the file to re-run\n")
        with open(result_file, 'r') as f:
            all_results[name] = json.load(f)
        continue
    
    # Run evaluation
    result = evaluate_intervention(
        intervention_config=intervention,
        dataset=test_dataset,
        model=rl_model,
        tokenizer=tokenizer,
        directions=directions,
        evaluator=evaluator
    )
    
    # Save individual result
    with open(result_file, 'w') as f:
        json.dump(result, f, indent=2, default=str)
    print(f"\n✓ Saved results to {result_file}")
    
    all_results[name] = result

print("\n" + "="*80)
print("ALL EVALUATIONS COMPLETE")
print("="*80)

## Summary and Comparison

In [None]:
# Create summary table
import pandas as pd

summary_data = []
for name, result in all_results.items():
    config = result['config']
    summary_data.append({
        'Name': name,
        'Layer': config['layer'] if config['layer'] is not None else 'N/A',
        'Strength': config['strength'],
        'Accuracy (%)': result['accuracy'],
        'Correct': result['correct'],
        'Total': result['total'],
        'Improvement (pp)': result['accuracy'] - 31.80
    })

summary_df = pd.DataFrame(summary_data)
summary_df = summary_df.sort_values('Accuracy (%)', ascending=False)

print("\n" + "="*80)
print("SUMMARY: INTERVENTION ACCURACY COMPARISON")
print("="*80)
print(summary_df.to_string(index=False))
print("\nBaseline (RL model, no intervention): 31.80%")
print("Target (Distilled model): 43.40%")
print("Gap to close: 11.6pp")

# Save summary
summary_file = OUTPUT_DIR / "summary.csv"
summary_df.to_csv(summary_file, index=False)
print(f"\n✓ Summary saved to {summary_file}")

## Detailed Analysis: Best Intervention

In [None]:
# Find best intervention
best_name = summary_df.iloc[0]['Name']
best_result = all_results[best_name]
best_accuracy = best_result['accuracy']
improvement = best_accuracy - 31.80

print("\n" + "="*80)
print(f"BEST INTERVENTION: {best_name}")
print("="*80)
print(f"Accuracy: {best_accuracy:.2f}%")
print(f"Improvement: {improvement:+.2f}pp")
print(f"Gap closed: {100*improvement/11.6:.1f}% of 11.6pp gap")

if improvement > 1.0:
    print("\n✅ SUCCESS: Achieved >1pp improvement (minimal goal)")
    if improvement > 3.0:
        print("✅ MODERATE SUCCESS: Achieved >3pp improvement")
    if improvement > 8.0:
        print("✅ STRONG SUCCESS: Achieved >8pp improvement")
elif improvement > 0:
    print("\n⚠️  Marginal improvement (<=1pp), below the minimal success threshold")
else:
    print("\n❌ No improvement over baseline")

# Quality metrics comparison
best_results = best_result['results']
correct_results = [r for r in best_results if r['is_correct']]
incorrect_results = [r for r in best_results if not r['is_correct']]

if len(correct_results) > 0 and len(incorrect_results) > 0:
    print("\n" + "-"*80)
    print("QUALITY METRICS: Correct vs Incorrect")
    print("-"*80)
    
    # Average reasoning steps
    correct_steps = np.mean([r['quality']['reasoning_steps'] for r in correct_results])
    incorrect_steps = np.mean([r['quality']['reasoning_steps'] for r in incorrect_results])
    print(f"Average reasoning steps:")
    print(f"  Correct: {correct_steps:.1f}")
    print(f"  Incorrect: {incorrect_steps:.1f}")
    
    # Average think tokens
    correct_tokens = np.mean([r['tokens']['think_tokens'] for r in correct_results])
    incorrect_tokens = np.mean([r['tokens']['think_tokens'] for r in incorrect_results])
    print(f"\nAverage think tokens:")
    print(f"  Correct: {correct_tokens:.1f}")
    print(f"  Incorrect: {incorrect_tokens:.1f}")
    
    # Backtracking
    correct_backtrack = np.mean([r['quality']['backtracking_count'] for r in correct_results])
    incorrect_backtrack = np.mean([r['quality']['backtracking_count'] for r in incorrect_results])
    print(f"\nAverage backtracking:")
    print(f"  Correct: {correct_backtrack:.1f}")
    print(f"  Incorrect: {incorrect_backtrack:.1f}")
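
The threshold checks above compare point estimates only. Since every configuration is run on the same problems, one way to ask whether a difference exceeds noise is an exact McNemar test on the paired outcomes. The sketch below uses only the standard library; the function names are illustrative and the outcome flags are made up for demonstration, not measured results.

```python
from math import comb

def mcnemar_exact(only_baseline_correct: int, only_intervention_correct: int) -> float:
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes.

    Only discordant problems (solved by exactly one of the two runs) carry
    information; under the null hypothesis each is equally likely to favor either run.
    """
    b, c = only_baseline_correct, only_intervention_correct
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

def discordant_counts(baseline_flags, intervention_flags):
    """Count problems solved by only one of the two runs (lists aligned by problem)."""
    b = sum(1 for x, y in zip(baseline_flags, intervention_flags) if x and not y)
    c = sum(1 for x, y in zip(baseline_flags, intervention_flags) if y and not x)
    return b, c

# Made-up outcome flags for illustration only.
base = [True, False, False, True, False, False, False, False]
steered = [True, True, True, False, True, True, False, True]
b, c = discordant_counts(base, steered)
print(b, c, round(mcnemar_exact(b, c), 4))
```

In the notebook, the flags could be built as `[r['is_correct'] for r in all_results[name]['results']]` for the baseline and for the intervention, provided both were run on the same problems in the same order.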

## Sample Outputs: Correct vs Incorrect

In [None]:
# Show sample correct answer
if len(correct_results) > 0:
    print("\n" + "="*80)
    print("SAMPLE CORRECT ANSWER")
    print("="*80)
    sample = correct_results[0]
    print(f"Problem: {sample['problem'][:200]}...")
    print(f"\nGround Truth: {sample['ground_truth']}")
    print(f"Predicted: {sample['predicted_answer']}")
    print(f"\nReasoning steps: {sample['quality']['reasoning_steps']}")
    print(f"Think tokens: {sample['tokens']['think_tokens']}")
    print(f"\nOutput (first 500 chars):\n{sample['output'][:500]}...")

# Show sample incorrect answer
if len(incorrect_results) > 0:
    print("\n" + "="*80)
    print("SAMPLE INCORRECT ANSWER")
    print("="*80)
    sample = incorrect_results[0]
    print(f"Problem: {sample['problem'][:200]}...")
    print(f"\nGround Truth: {sample['ground_truth']}")
    print(f"Predicted: {sample['predicted_answer']}")
    print(f"\nReasoning steps: {sample['quality']['reasoning_steps']}")
    print(f"Think tokens: {sample['tokens']['think_tokens']}")
    print(f"\nOutput (first 500 chars):\n{sample['output'][:500]}...")

## Save Final Report

In [None]:
# Generate report (summary_df.to_markdown requires the optional `tabulate` package)
report = f"""# MATH-500 Intervention Accuracy Results

## Summary

Dataset: {'SUBSET (first ' + str(SUBSET_SIZE) + ' problems)' if TEST_SUBSET else 'FULL MATH-500 (500 problems)'}

### Baseline Performance
- RL Model (no intervention): 31.80%
- Distilled Model: 43.40%
- Gap: 11.6pp

### Intervention Results

{summary_df.to_markdown(index=False)}

### Best Intervention

**{best_name}**
- Accuracy: {best_accuracy:.2f}%
- Improvement: {improvement:+.2f}pp
- Gap closed: {100*improvement/11.6:.1f}%

### Interpretation

"""

if improvement > 1.0:
    report += f"✅ **SUCCESS**: Intervention improved accuracy by {improvement:.2f}pp\n\n"
    report += "This suggests that activation steering can improve accuracy on this benchmark.\n\n"
    
    if improvement > 3.0:
        report += "The improvement is substantial (>3pp), closing 25%+ of the gap.\n"
else:
    report += f"❌ **NO MEANINGFUL IMPROVEMENT**: Intervention changed accuracy by {improvement:+.2f}pp (minimal goal: >1pp)\n\n"
    report += "Reasoning style changes (verbosity, thoroughness) did not translate to correctness.\n\n"

# Save report
report_file = OUTPUT_DIR / "REPORT.md"
with open(report_file, 'w') as f:
    f.write(report)

print("\n" + "="*80)
print("REPORT SAVED")
print("="*80)
print(f"Location: {report_file}")
print("\nAll results saved to:")
print(f"  - Individual results: {OUTPUT_DIR}/*_results.json")
print(f"  - Summary CSV: {OUTPUT_DIR}/summary.csv")
print(f"  - Report: {OUTPUT_DIR}/REPORT.md")