# Test: Can Inverse Direction Improve RL Model?

**Hypothesis**: The distilled model (43.40%) outperforms the RL model (31.80%), so steering the RL model's activations along the INVERSE direction (toward the distilled model) should improve its accuracy.

**Method**:
- Original direction: `rl_activations - distilled_activations`
- Inverse direction: `distilled_activations - rl_activations` (just negate)
- Apply to RL model with positive strength to move toward distilled behavior

In [None]:
import sys
sys.path.append('../pipeline')

import torch
import json
from pathlib import Path
import os

os.environ['HF_HOME'] = '/scratch/gilbreth/sramishe'
os.environ['TRANSFORMERS_CACHE'] = '/scratch/gilbreth/sramishe/transformers'
os.environ['HF_DATASETS_CACHE'] = '/scratch/gilbreth/sramishe/datasets'

from model_loader import ModelLoader
from data_processor import DataProcessor
from direction_calculator import DirectionCalculator
from intervention import ActivationPatcher
from evaluator import ReasoningEvaluator

print(f"HuggingFace cache directory set to: {os.environ['HF_HOME']}")

HuggingFace cache directory set to: /scratch/gilbreth/sramishe


## Key Insight

We already have the directions computed as:
```python
direction[layer] = rl_mean - distilled_mean
```

To get the inverse (improvement) direction, we simply **negate it**:
```python
improvement_direction[layer] = -direction[layer]
# Which equals: distilled_mean - rl_mean
```

Or equivalently, we can use the **original direction with negative strength**:
```python
# These are equivalent:
# Option 1: rl + (+strength) × (-direction)
# Option 2: rl + (-strength) × (+direction)
```
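The equivalence can be checked numerically. The sketch below uses plain Python lists as stand-ins for activation tensors; the toy values and the `steer` helper are illustrative assumptions, not part of the pipeline.

```python
def steer(activation, direction, strength):
    """Return activation + strength * direction, element-wise."""
    return [a + strength * d for a, d in zip(activation, direction)]

# Toy values (hypothetical), chosen to be exactly representable in floating point
rl_mean = [1.0, 2.0, 3.0]
distilled_mean = [0.5, 2.5, 2.0]

direction = [r - d for r, d in zip(rl_mean, distilled_mean)]  # rl - distilled
inverse_direction = [-x for x in direction]                   # distilled - rl

strength = 1.0
option_1 = steer(rl_mean, inverse_direction, +strength)  # positive strength, inverse direction
option_2 = steer(rl_mean, direction, -strength)          # negative strength, original direction

assert option_1 == option_2
# At strength 1.0 the mean RL activation lands exactly on the distilled mean
assert option_1 == distilled_mean
```

Because the two options produce identical activations, negating the strength is enough and no second set of direction vectors needs to be stored.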

In [2]:
# Load existing directions
directions_path = Path('/scratch/gilbreth/sramishe/results_QwQ_R1/results/reasoning_directions.pt')

if directions_path.exists():
    directions = torch.load(directions_path)
    print(f"Loaded {len(directions)} direction vectors")
    print(f"Layers: {sorted(directions.keys())}")
    
    # The inverse directions are just the negatives
    inverse_directions = {layer: -direction for layer, direction in directions.items()}
    print(f"\nCreated {len(inverse_directions)} inverse directions")
else:
    print("ERROR: Directions not found. Run run_pipeline.ipynb first.")

Loaded 13 direction vectors
Layers: [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60]

Created 13 inverse directions


## Interpretation of Strengths

When using the **original directions** (`rl - distilled`):
- **Positive strength** (+1.0): Enhances RL-specific behavior (move AWAY from distilled)
- **Negative strength** (-1.0): Suppresses RL-specific behavior (move TOWARD distilled)

When using **inverse directions** (`distilled - rl`):
- **Positive strength** (+1.0): Move TOWARD distilled model (should improve!)
- **Negative strength** (-1.0): Move AWAY from distilled model (should degrade)

**Recommendation for Testing**:
1. Use **original directions** with **negative strengths** [-2.0, -1.5, -1.0, -0.5]
2. This is simpler than creating new inverse directions
3. Negative strength with original direction = positive strength with inverse direction

In [4]:
# Configuration for inverse direction test
CONFIG = {
    'rl_model_name': "Qwen/QwQ-32B",
    'strength_range': (-2.0, 0.0),  # NEGATIVE to move toward distilled
    'num_strengths': 5,  # [-2.0, -1.5, -1.0, -0.5, 0.0]
    'test_layers': [20, 30, 40],  # Test a few key layers
    'num_test_problems': 10,  # Test on 10 MATH-500 problems
    'max_new_tokens': 1024,
    'output_dir': '/scratch/gilbreth/sramishe/results_QwQ_R1/inverse_test'
}

print("Testing Inverse Direction Strategy:")
print(f"  Model: {CONFIG['rl_model_name']}")
print(f"  Strength range: {CONFIG['strength_range']} (negative = toward distilled)")
print(f"  Test layers: {CONFIG['test_layers']}")
print(f"  Test problems: {CONFIG['num_test_problems']}")
print(f"\nExpectation: Negative strengths should improve performance")
print(f"  - Baseline (strength=0): ~31.80% accuracy")
print(f"  - With intervention (strength=-2.0): hopefully > 31.80%")

Testing Inverse Direction Strategy:
  Model: Qwen/QwQ-32B
  Strength range: (-2.0, 0.0) (negative = toward distilled)
  Test layers: [20, 30, 40]
  Test problems: 10

Expectation: Negative strengths should improve performance
  - Baseline (strength=0): ~31.80% accuracy
  - With intervention (strength=-2.0): hopefully > 31.80%


## Hypotheses to Test

### H1: Negative Strength Improves Accuracy
- **Prediction**: Applying negative strength (moving toward distilled) increases MATH-500 accuracy
- **Metric**: Correctness on 10 test problems
- **Success criterion**: Accuracy with strength=-2.0 > Baseline accuracy
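
H1 needs a correctness score, which the sweep in this notebook does not compute. A minimal sketch of exact-match grading follows, assuming the model's final answer appears in a `\boxed{...}` expression and that a reference answer string is available for each problem. The helpers `extract_boxed` and `accuracy` are hypothetical and not part of `ReasoningEvaluator`; exact string matching will also miss equivalent answers written differently.

```python
from typing import List, Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces

def accuracy(outputs: List[str], references: List[str]) -> float:
    """Fraction of outputs whose last boxed answer exactly matches the reference."""
    correct = sum(extract_boxed(o) == r.strip() for o, r in zip(outputs, references))
    return correct / len(references)

# Toy strings (hypothetical), not model outputs from this experiment
toy_outputs = [
    "So the answer is \\boxed{\\frac{1}{2}}.",
    "I could not finish.",
    "First \\boxed{3}, on reflection \\boxed{4}",
]
toy_references = ["\\frac{1}{2}", "7", "4"]
toy_accuracy = accuracy(toy_outputs, toy_references)
```

Running this once per strength, over all test prompts, would give the accuracy values that the success criterion compares.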

### H2: Layer Specificity
- **Prediction**: Middle layers (20-40) show strongest improvement
- **Metric**: Accuracy improvement by layer
- **Success criterion**: Layer 20 or 30 shows largest accuracy gain

### H3: Reasoning Quality Improves
- **Prediction**: Negative strength increases reasoning steps and reduces errors
- **Metric**: Think tokens, reasoning steps, backtracking
- **Success criterion**: More structured reasoning at strength=-2.0 vs baseline

### H4: Dose-Response Relationship
- **Prediction**: Stronger negative values → more improvement (up to saturation)
- **Metric**: Accuracy vs strength curve
- **Success criterion**: Monotonic improvement from 0.0 to -2.0
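
The H4 success criterion can be expressed as a small check over an accuracy-vs-strength curve. This is a sketch only: `is_monotonic_improvement` is a hypothetical helper, and the curves below are made-up placeholders, not results from this experiment.

```python
def is_monotonic_improvement(curve):
    """curve: list of (strength, accuracy) pairs.

    Returns True if accuracy never drops as strength moves from 0.0
    toward more negative values.
    """
    # Sort so that 0.0 comes first and the most negative strength last
    ordered = sorted(curve, key=lambda pair: pair[0], reverse=True)
    accuracies = [acc for _, acc in ordered]
    return all(later >= earlier for earlier, later in zip(accuracies, accuracies[1:]))

# Placeholder curves (hypothetical values)
rising = [(0.0, 0.3), (-0.5, 0.3), (-1.0, 0.4), (-1.5, 0.4), (-2.0, 0.5)]
collapsing = [(0.0, 0.3), (-1.0, 0.5), (-2.0, 0.2)]

rising_ok = is_monotonic_improvement(rising)
collapsing_ok = is_monotonic_improvement(collapsing)
```

The check allows plateaus (`>=`), which matches the "up to saturation" wording in the prediction. With only 10 test problems each accuracy step is 10 percentage points, so small differences should be read cautiously.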

In [5]:
# Load RL model and test data
print("Loading model...")
loader = ModelLoader(
    rl_model_name=CONFIG['rl_model_name'],
    distilled_model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"  # Not needed for this test, but load_models() still loads it
)

models = loader.load_models(torch_dtype=torch.float16)
rl_model = models['rl_model']
rl_tokenizer = models['rl_tokenizer']

print("Loading test data...")
processor = DataProcessor(
    dataset_name="HuggingFaceH4/MATH-500",
    include_toy_tasks=False
)
dataset = processor.load_dataset(max_samples=CONFIG['num_test_problems'])
test_prompts = processor.prepare_batch(dataset[:CONFIG['num_test_problems']], rl_tokenizer)

print(f"✓ Model loaded")
print(f"✓ {len(test_prompts)} test problems prepared")

Loading model...
Loading RL-trained model: Qwen/QwQ-32B


`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 14/14 [13:03<00:00, 56.00s/it]


Loading distilled model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B


Loading checkpoint shards: 100%|██████████| 8/8 [10:38<00:00, 79.76s/it]


Loading test data...
Loading dataset: HuggingFaceH4/MATH-500
✓ Model loaded
✓ 10 test problems prepared


In [None]:
# Run intervention sweep with NEGATIVE strengths
import numpy as np

evaluator = ReasoningEvaluator()
results = []

# Test each layer
for layer in CONFIG['test_layers']:
    if layer not in directions:
        print(f"Skipping layer {layer} (no direction available)")
        continue
    
    print(f"\nTesting layer {layer}...")
    patcher = ActivationPatcher(rl_model, directions)
    
    # Test each strength
    strengths = np.linspace(CONFIG['strength_range'][0], CONFIG['strength_range'][1], CONFIG['num_strengths'])
    
    for strength in strengths:
        print(f"  Strength: {strength:.2f}", end=" ")
        
        # Generate with intervention
        output = patcher.generate_with_intervention(
            tokenizer=rl_tokenizer,
            prompt=test_prompts[0],  # Only the first test problem is used; the other prompts are not evaluated here
            layers=[layer],
            strength=float(strength),
            max_new_tokens=CONFIG['max_new_tokens']
        )
        
        # Evaluate quality
        tokens = evaluator.count_tokens(output, rl_tokenizer, split_by_tags=True)
        quality = evaluator.analyze_reasoning_quality(output)
        
        result = {
            'layer': layer,
            'strength': float(strength),
            'output': output,
            'tokens': tokens,
            'quality': quality
        }
        results.append(result)
        
        print(f"→ think_tokens: {tokens['think_tokens']}, steps: {quality['reasoning_steps']}")

print(f"\n✓ Completed {len(results)} experiments")


Testing layer 20...
  Strength: -2.00 Applied interventions to layers: [20]


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
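
This error usually means a tensor on the CPU was combined with one on the GPU. One likely cause here is that the direction vectors loaded with `torch.load` live on the CPU while the layer activations are on `cuda:0`. How `ActivationPatcher` applies the direction is not shown in this notebook, so the following is only a sketch of the general fix: move the direction to the activation's device and dtype inside the hook. The `make_steering_hook` helper and the toy `nn.Linear` layer are assumptions for illustration.

```python
import torch
from torch import nn

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Return a forward hook that adds strength * direction to a layer's output."""
    def hook(module, inputs, output):
        # Decoder layers often return a tuple whose first element is the hidden state
        hidden = output[0] if isinstance(output, tuple) else output
        # Match the activation's device and dtype so the addition cannot mix cuda and cpu tensors
        delta = strength * direction.to(device=hidden.device, dtype=hidden.dtype)
        steered = hidden + delta
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Toy stand-in for one transformer layer (hypothetical; not the notebook's model)
layer = nn.Linear(4, 4, bias=False)
with torch.no_grad():
    layer.weight.copy_(torch.eye(4))  # identity weights make the effect easy to read

# The direction deliberately uses a different dtype than the layer
direction = torch.tensor([1.0, 0.0, 0.0, 0.0], dtype=torch.float64)
handle = layer.register_forward_hook(make_steering_hook(direction, strength=-2.0))

x = torch.zeros(1, 4)
steered_out = layer(x)   # hook active
handle.remove()
plain_out = layer(x)     # hook removed
```

Resolving the device inside the hook, rather than once up front, also covers models sharded across several GPUs, where each layer may sit on a different device.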

In [None]:
# Analyze results
import pandas as pd

# Convert to DataFrame for analysis
df_results = pd.DataFrame([{
    'layer': r['layer'],
    'strength': r['strength'],
    'total_tokens': r['tokens']['total_tokens'],
    'think_tokens': r['tokens']['think_tokens'],
    'reasoning_steps': r['quality']['reasoning_steps'],
    'backtracking': r['quality']['backtracking_count'],
    'hesitation': r['quality']['hesitation_count'],
} for r in results])

print("\nResults Summary:")
print(df_results.to_string(index=False))

# Find best configurations
print("\n" + "="*60)
print("ANALYSIS")
print("="*60)

for layer in CONFIG['test_layers']:
    layer_results = df_results[df_results['layer'] == layer]
    baseline = layer_results[layer_results['strength'] == 0.0].iloc[0]
    best = layer_results.loc[layer_results['reasoning_steps'].idxmax()]
    
    print(f"\nLayer {layer}:")
    print(f"  Baseline (strength=0.0):")
    # Row lookups are upcast to float64 (the frame mixes int and float columns),
    # so cast back to int before printing or using the 'd' format code
    print(f"    Think tokens: {int(baseline['think_tokens'])}")
    print(f"    Reasoning steps: {int(baseline['reasoning_steps'])}")
    print(f"  Best (strength={best['strength']:.2f}):")
    print(f"    Think tokens: {int(best['think_tokens'])} ({int(best['think_tokens'] - baseline['think_tokens']):+d})")
    print(f"    Reasoning steps: {int(best['reasoning_steps'])} ({int(best['reasoning_steps'] - baseline['reasoning_steps']):+d})")

## Next Steps

If this preliminary test shows promise:

1. **Full Evaluation**: Run on all 500 MATH-500 problems with correctness grading
2. **Optimal Strength**: Find the strength value that maximizes accuracy
3. **Multi-Layer**: Test applying inverse direction to multiple layers simultaneously
4. **Different Layers**: Test every layer that has a computed direction (the 13 layers from 0 to 60), not just 20, 30, and 40, to find the most effective ones
5. **Compare to Fine-Tuning**: How does this compare to actually fine-tuning the RL model on distilled model outputs?

## Expected Outcomes

**If hypothesis is correct:**
- Negative strengths should improve reasoning quality
- RL model accuracy should move closer to distilled model (43.40%)
- This would validate that the direction captures meaningful reasoning differences

**If hypothesis is wrong:**
- No improvement with negative strengths
- Suggests the activation difference doesn't capture reasoning capability
- May need more sophisticated direction extraction (multiple directions, PCA, etc.)

In [None]:
# Save results
output_path = Path(CONFIG['output_dir'])
output_path.mkdir(parents=True, exist_ok=True)

with open(output_path / 'inverse_direction_results.json', 'w') as f:
    json.dump(results, f, indent=2, default=str)

print(f"Results saved to {output_path / 'inverse_direction_results.json'}")