# Re-analyze Improved Results with Fixed Evaluator

This notebook re-analyzes the existing improved experiment results using the **FIXED** evaluator that now handles incomplete `<think>` tags.

**Why**: The original analysis had a bug where truncated generations (missing `</think>`) showed `think_tokens=0`.

**What we do**: Load saved outputs from `results_improved/results.json` and re-extract quality metrics using the fixed code.

**Expected**: The re-analysis should reveal the true reasoning patterns that the evaluation bug masked.
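
The fix can be pictured as follows. This is only a minimal sketch of the idea, not the actual `ReasoningEvaluator` code: the helper name `split_think` and its return shape are assumptions made for illustration. When the opening `<think>` tag is present but the closing tag is missing, everything after the opening tag is treated as reasoning rather than being dropped.

```python
from typing import Tuple

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def split_think(text: str) -> Tuple[str, str, bool]:
    """Hypothetical helper: return (think_text, other_text, truncated)."""
    start = text.find(THINK_OPEN)
    if start == -1:
        # No opening tag at all: nothing counts as reasoning.
        return "", text, False

    body_start = start + len(THINK_OPEN)
    end = text.find(THINK_CLOSE, body_start)
    if end == -1:
        # Truncated generation: the closing tag never arrived, so the
        # rest of the text is still reasoning content.
        return text[body_start:], text[:start], True

    other = text[:start] + text[end + len(THINK_CLOSE):]
    return text[body_start:end], other, False


think, other, truncated = split_think("Q: 2+2? <think>Let me add the numbers")
print(truncated, repr(think))
```

A tokenizer can then be applied to the two returned strings separately to obtain think and non-think token counts.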

In [1]:
import os
import sys

# Set the Hugging Face cache locations before any import that may load
# transformers or huggingface_hub; they read these variables at import time.
os.environ['HF_HOME'] = '/scratch/gilbreth/sramishe'
os.environ['TRANSFORMERS_CACHE'] = '/scratch/gilbreth/sramishe/transformers'

sys.path.append('../pipeline')

import json
from collections import defaultdict
from pathlib import Path

import numpy as np
import pandas as pd
from transformers import AutoTokenizer

# Import the FIXED evaluator
from evaluator import ReasoningEvaluator

print("✓ Imports successful")
print("✓ Using FIXED evaluator that handles incomplete tags")


✓ Imports successful
✓ Using FIXED evaluator that handles incomplete tags


## Load Existing Results

In [2]:
# Load results
results_path = Path('/scratch/gilbreth/sramishe/results_QwQ_R1/results_improved/results.json')

with open(results_path, 'r') as f:
    data = json.load(f)

print(f"Loaded results from: {results_path}")
print(f"\nConfiguration:")
for key, value in data['config'].items():
    print(f"  {key}: {value}")

print(f"\nData keys: {list(data.keys())}")
print(f"Total intervention results: {len(data['intervention_results'])}")
print(f"Baseline generations: {len(data['baselines'])}")

Loaded results from: /scratch/gilbreth/sramishe/results_QwQ_R1/results_improved/results.json

Configuration:
  rl_model_name: Qwen/QwQ-32B
  distilled_model_name: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
  dataset_name: HuggingFaceH4/MATH-500
  num_samples: 20
  strength_range: [-2.0, 2.0]
  num_strengths: 9
  num_test_prompts: 5
  output_dir: /scratch/gilbreth/sramishe/results_QwQ_R1/results_improved
  max_new_tokens: 1024
  layers_step: 4

Data keys: ['config', 'model_info', 'directions_stats', 'baselines', 'critical_layers', 'critical_layers_think', 'layer_sensitivity', 'layer_sensitivity_think', 'quality_variance', 'top_quality_layers', 'intervention_results']
Total intervention results: 144
Baseline generations: 5


## Initialize Fixed Evaluator and Tokenizer

In [3]:
# Initialize fixed evaluator
evaluator = ReasoningEvaluator()

# Load tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(data['config']['rl_model_name'])

print("✓ Fixed evaluator initialized")
print("✓ Tokenizer loaded")

Loading tokenizer...
✓ Fixed evaluator initialized
✓ Tokenizer loaded


## Test Fix on Baseline Generation

Verify that the fix works on the baseline generation that previously showed `think_tokens=0`.

In [4]:
# Get first baseline
baseline = data['baselines'][0]

print("BASELINE GENERATION (Original metrics - BUGGY)")
print("="*80)
print(f"Total tokens (saved): {baseline['tokens']['total_tokens']}")
print(f"Think tokens (saved): {baseline['tokens']['think_tokens']}")
print(f"Non-think tokens (saved): {baseline['tokens']['non_think_tokens']}")
print(f"Reasoning steps (saved): {baseline['quality']['reasoning_steps']}")
print(f"Has think tags (saved): {baseline['quality']['has_think_tags']}")

print("\n" + "="*80)
print("BASELINE GENERATION (Re-analyzed with FIXED evaluator)")
print("="*80)

# Re-analyze with FIXED evaluator
text = baseline['text']

# Check for incomplete tags
has_opening = '<think>' in text
has_closing = '</think>' in text
print(f"Has <think>: {has_opening}")
print(f"Has </think>: {has_closing}")
if has_opening and not has_closing:
    print("⚠️  TRUNCATED: Has opening tag but missing closing tag")

# Re-extract with fixed code
tokens_fixed = evaluator.count_tokens(text, tokenizer, split_by_tags=True)
quality_fixed = evaluator.analyze_reasoning_quality(text)

print(f"\nTotal tokens (fixed): {tokens_fixed['total_tokens']}")
print(f"Think tokens (fixed): {tokens_fixed['think_tokens']}")
print(f"Non-think tokens (fixed): {tokens_fixed['non_think_tokens']}")
print(f"Reasoning steps (fixed): {quality_fixed['reasoning_steps']}")
print(f"Has think tags (fixed): {quality_fixed['has_think_tags']}")
print(f"Backtracking (fixed): {quality_fixed['backtracking_count']}")
print(f"Hesitation (fixed): {quality_fixed['hesitation_count']}")
print(f"Verbosity (fixed): {quality_fixed['verbosity']}")

print("\n" + "="*80)
print("COMPARISON")
print("="*80)
print(f"Think tokens: {baseline['tokens']['think_tokens']} → {tokens_fixed['think_tokens']} ({tokens_fixed['think_tokens'] - baseline['tokens']['think_tokens']:+d})")
print(f"Reasoning steps: {baseline['quality']['reasoning_steps']} → {quality_fixed['reasoning_steps']} ({quality_fixed['reasoning_steps'] - baseline['quality']['reasoning_steps']:+d})")

if tokens_fixed['think_tokens'] > 0 and baseline['tokens']['think_tokens'] == 0:
    print("\n✅ FIX CONFIRMED: Now detecting think content that was previously missed!")
else:
    print("\n⚠️  No change detected - investigating further")

BASELINE GENERATION (Original metrics - BUGGY)
Total tokens (saved): 1077
Think tokens (saved): 0
Non-think tokens (saved): 1077
Reasoning steps (saved): 0
Has think tags (saved): False

BASELINE GENERATION (Re-analyzed with FIXED evaluator)
Has <think>: True
Has </think>: False
⚠️  TRUNCATED: Has opening tag but missing closing tag

Total tokens (fixed): 1077
Think tokens (fixed): 1024
Non-think tokens (fixed): 52
Reasoning steps (fixed): 15
Has think tags (fixed): True
Backtracking (fixed): 2
Hesitation (fixed): 1
Verbosity (fixed): 680

COMPARISON
Think tokens: 0 → 1024 (+1024)
Reasoning steps: 0 → 15 (+15)

✅ FIX CONFIRMED: Now detecting think content that was previously missed!


## Re-analyze ALL Intervention Results

Now re-analyze all 144 intervention results with the fixed evaluator.

In [5]:
print("Re-analyzing all intervention results with FIXED evaluator...")
print(f"Total results to process: {len(data['intervention_results'])}")

# Store fixed results
fixed_results = []
changes_detected = 0

for i, result in enumerate(data['intervention_results']):
    # Get saved (buggy) metrics
    old_tokens = result.get('tokens', {})
    old_quality = result.get('quality', {})
    old_think_tokens = old_tokens.get('think_tokens', 0)
    
    # Re-analyze with fixed evaluator
    text = result['output']
    new_tokens = evaluator.count_tokens(text, tokenizer, split_by_tags=True)
    new_quality = evaluator.analyze_reasoning_quality(text)
    
    # Track if anything changed
    if new_tokens['think_tokens'] != old_think_tokens:
        changes_detected += 1
    
    # Store fixed result
    fixed_result = {
        'layer': result['layer'],
        'strength': result['strength'],
        'output': text,
        'token_count': result['token_count'],
        'tokens_old': old_tokens,
        'quality_old': old_quality,
        'tokens': new_tokens,
        'quality': new_quality,
        'changed': new_tokens['think_tokens'] != old_think_tokens
    }
    fixed_results.append(fixed_result)
    
    if (i + 1) % 20 == 0:
        print(f"  Processed {i+1}/{len(data['intervention_results'])}...")

print(f"\n✓ Re-analysis complete!")
print(f"Changes detected: {changes_detected}/{len(fixed_results)} results")
print(f"Percentage changed: {100*changes_detected/len(fixed_results):.1f}%")

Re-analyzing all intervention results with FIXED evaluator...
Total results to process: 144
  Processed 20/144...
  Processed 40/144...
  Processed 60/144...
  Processed 80/144...
  Processed 100/144...
  Processed 120/144...
  Processed 140/144...

✓ Re-analysis complete!
Changes detected: 122/144 results
Percentage changed: 84.7%


## Analyze Changes by Layer

In [6]:
print("="*80)
print("CHANGES BY LAYER")
print("="*80)

layers = sorted(set(r['layer'] for r in fixed_results))

for layer in layers:
    layer_results = [r for r in fixed_results if r['layer'] == layer]
    changed = sum(1 for r in layer_results if r['changed'])
    
    if changed > 0:
        print(f"\nLayer {layer}: {changed}/{len(layer_results)} results changed")
        
        # Show which strengths changed
        changed_strengths = [r['strength'] for r in layer_results if r['changed']]
        print(f"  Strengths affected: {sorted(set(changed_strengths))}")
        
        # Show old vs new for changed results
        for r in layer_results:
            if r['changed']:
                old = r['tokens_old'].get('think_tokens', 0)
                new = r['tokens']['think_tokens']
                print(f"    Strength {r['strength']:>5.1f}: think_tokens {old:>4d} → {new:>4d} ({new-old:+5d})")

CHANGES BY LAYER

Layer 0: 6/9 results changed
  Strengths affected: [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0]
    Strength  -1.0: think_tokens    0 → 1024 (+1024)
    Strength   0.0: think_tokens    0 → 1025 (+1025)
    Strength   0.5: think_tokens    0 → 1025 (+1025)
    Strength   1.0: think_tokens    0 → 1025 (+1025)
    Strength   1.5: think_tokens    0 → 1025 (+1025)
    Strength   2.0: think_tokens    0 → 1025 (+1025)

Layer 4: 9/9 results changed
  Strengths affected: [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
    Strength  -2.0: think_tokens    0 → 1025 (+1025)
    Strength  -1.5: think_tokens    0 → 1025 (+1025)
    Strength  -1.0: think_tokens    0 → 1025 (+1025)
    Strength  -0.5: think_tokens    0 → 1025 (+1025)
    Strength   0.0: think_tokens    0 → 1025 (+1025)
    Strength   0.5: think_tokens    0 → 1025 (+1025)
    Strength   1.0: think_tokens    0 → 1025 (+1025)
    Strength   1.5: think_tokens    0 → 1025 (+1025)
    Strength   2.0: think_tokens    0 → 1025 (+1025)

...

## Detailed Analysis: Layer 0

In the original (buggy) analysis, Layer 0 looked the most promising: only negative strengths appeared to induce reasoning.

In [7]:
print("="*80)
print("LAYER 0 - DETAILED ANALYSIS (Before vs After Fix)")
print("="*80)

layer0_results = sorted([r for r in fixed_results if r['layer'] == 0], key=lambda x: x['strength'])

print(f"\n{'Strength':<10} {'Old Think':<12} {'New Think':<12} {'Delta':<8} {'Old Steps':<12} {'New Steps':<12} {'Status'}")
print("-"*100)

for r in layer0_results:
    old_think = r['tokens_old'].get('think_tokens', 0)
    new_think = r['tokens']['think_tokens']
    delta = new_think - old_think
    
    old_steps = r['quality_old'].get('reasoning_steps', 0)
    new_steps = r['quality']['reasoning_steps']
    
    if new_think > 0:
        # Rows affected by the fix are flagged by the "*" marker below.
        status = "✓ Reasoning"
    else:
        status = "⚠️  Suppressed"
    
    changed_marker = "*" if r['changed'] else " "
    
    print(f"{changed_marker}{r['strength']:>9.1f}  {old_think:<12} {new_think:<12} {delta:>+7d}  {old_steps:<12} {new_steps:<12} {status}")

print("\n* = Changed with fix")

# Key finding
baseline = [r for r in layer0_results if r['strength'] == 0.0][0]
print(f"\n{'='*80}")
print("KEY FINDING FOR LAYER 0:")
print(f"{'='*80}")
print(f"Baseline (strength=0.0):")
print(f"  Old think_tokens: {baseline['tokens_old'].get('think_tokens', 0)}")
print(f"  New think_tokens: {baseline['tokens']['think_tokens']}")

if baseline['tokens']['think_tokens'] > 0 and baseline['tokens_old'].get('think_tokens', 0) == 0:
    print(f"\n✅ BASELINE NOW SHOWS REASONING!")
    print(f"   - The 'only negative strengths work' pattern was due to evaluation bug")
    print(f"   - Baseline was truncated, missing </think> tag")
    print(f"   - Need to re-evaluate whether negative strengths are truly better")
elif baseline['tokens']['think_tokens'] == 0:
    print(f"\n⚠️  Baseline still shows no reasoning")
    print(f"   - Pattern holds: negative strengths DO induce reasoning at layer 0")
    print(f"   - This supports the inverse direction hypothesis")

LAYER 0 - DETAILED ANALYSIS (Before vs After Fix)

Strength   Old Think    New Think    Delta    Old Steps    New Steps    Status
----------------------------------------------------------------------------------------------------
      -2.0  884          884               +0  13           13           ✓ Reasoning
      -1.5  787          787               +0  12           12           ✓ Reasoning
*     -1.0  0            1024           +1024  0            17           ✓ Reasoning
      -0.5  889          889               +0  15           15           ✓ Reasoning
*      0.0  0            1025           +1025  0            16           ✓ Reasoning
*      0.5  0            1025           +1025  0            16           ✓ Reasoning
*      1.0  0            1025           +1025  0            18           ✓ Reasoning
*      1.5  0            1025           +1025  0            23           ✓ Reasoning
*      2.0  0            1025           +1025  0            16           ✓ Reasoning

* = Changed with fix

## Detailed Analysis: Layer 16

In the original analysis, Layer 16 was the most sensitive layer (32-token range).

In [8]:
print("="*80)
print("LAYER 16 - DETAILED ANALYSIS (Before vs After Fix)")
print("="*80)

layer16_results = sorted([r for r in fixed_results if r['layer'] == 16], key=lambda x: x['strength'])

print(f"\n{'Strength':<10} {'Old Think':<12} {'New Think':<12} {'Delta':<8} {'Old Steps':<12} {'New Steps':<12} {'Status'}")
print("-"*100)

for r in layer16_results:
    old_think = r['tokens_old'].get('think_tokens', 0)
    new_think = r['tokens']['think_tokens']
    delta = new_think - old_think
    
    old_steps = r['quality_old'].get('reasoning_steps', 0)
    new_steps = r['quality']['reasoning_steps']
    
    if new_think > 0:
        status = "✓ Reasoning"
    else:
        status = "⚠️  Suppressed"
    
    changed_marker = "*" if r['changed'] else " "
    
    print(f"{changed_marker}{r['strength']:>9.1f}  {old_think:<12} {new_think:<12} {delta:>+7d}  {old_steps:<12} {new_steps:<12} {status}")

print("\n* = Changed with fix")

LAYER 16 - DETAILED ANALYSIS (Before vs After Fix)

Strength   Old Think    New Think    Delta    Old Steps    New Steps    Status
----------------------------------------------------------------------------------------------------
*     -2.0  0            1025           +1025  0            28           ✓ Reasoning
*     -1.5  0            1025           +1025  0            21           ✓ Reasoning
*     -1.0  0            1025           +1025  0            22           ✓ Reasoning
      -0.5  802          802               +0  12           12           ✓ Reasoning
       0.0  973          973               +0  18           18           ✓ Reasoning
*      0.5  0            1025           +1025  0            18           ✓ Reasoning
*      1.0  0            1025           +1025  0            24           ✓ Reasoning
*      1.5  0            1025           +1025  0            26           ✓ Reasoning
*      2.0  0            1025           +1025  0            18           ✓ Reasoning

* = Changed with fix

## Detailed Analysis: Layer 20

In the original analysis, Layer 20 showed the highest quality variance.

In [9]:
print("="*80)
print("LAYER 20 - DETAILED ANALYSIS (Before vs After Fix)")
print("="*80)

layer20_results = sorted([r for r in fixed_results if r['layer'] == 20], key=lambda x: x['strength'])

print(f"\n{'Strength':<10} {'Old Think':<12} {'New Think':<12} {'Delta':<8} {'Old Steps':<12} {'New Steps':<12} {'Status'}")
print("-"*100)

for r in layer20_results:
    old_think = r['tokens_old'].get('think_tokens', 0)
    new_think = r['tokens']['think_tokens']
    delta = new_think - old_think
    
    old_steps = r['quality_old'].get('reasoning_steps', 0)
    new_steps = r['quality']['reasoning_steps']
    
    if new_think > 0:
        status = "✓ Reasoning"
    else:
        status = "⚠️  Suppressed"
    
    changed_marker = "*" if r['changed'] else " "
    
    print(f"{changed_marker}{r['strength']:>9.1f}  {old_think:<12} {new_think:<12} {delta:>+7d}  {old_steps:<12} {new_steps:<12} {status}")

print("\n* = Changed with fix")

LAYER 20 - DETAILED ANALYSIS (Before vs After Fix)

Strength   Old Think    New Think    Delta    Old Steps    New Steps    Status
----------------------------------------------------------------------------------------------------
*     -2.0  0            1025           +1025  0            19           ✓ Reasoning
      -1.5  992          992               +0  19           19           ✓ Reasoning
*     -1.0  0            1025           +1025  0            11           ✓ Reasoning
*     -0.5  0            1025           +1025  0            24           ✓ Reasoning
*      0.0  0            1025           +1025  0            17           ✓ Reasoning
*      0.5  0            1025           +1025  0            21           ✓ Reasoning
       1.0  973          973               +0  16           16           ✓ Reasoning
       1.5  918          918               +0  20           20           ✓ Reasoning
*      2.0  0            1025           +1025  0            23           ✓ Reasoning

* = Changed with fix

## Summary Statistics Comparison

In [10]:
print("="*80)
print("SUMMARY: OLD VS NEW METRICS")
print("="*80)

# Calculate statistics
old_think_tokens = [r['tokens_old'].get('think_tokens', 0) for r in fixed_results]
new_think_tokens = [r['tokens']['think_tokens'] for r in fixed_results]

old_has_reasoning = sum(1 for t in old_think_tokens if t > 0)
new_has_reasoning = sum(1 for t in new_think_tokens if t > 0)

print(f"\nTotal results: {len(fixed_results)}")
print(f"\nResults with reasoning (think_tokens > 0):")
print(f"  Old (buggy): {old_has_reasoning}/{len(fixed_results)} ({100*old_has_reasoning/len(fixed_results):.1f}%)")
print(f"  New (fixed): {new_has_reasoning}/{len(fixed_results)} ({100*new_has_reasoning/len(fixed_results):.1f}%)")
print(f"  Change: {new_has_reasoning - old_has_reasoning:+d} results now show reasoning")

print(f"\nThink tokens statistics:")
print(f"  Old mean: {np.mean(old_think_tokens):.1f}")
print(f"  New mean: {np.mean(new_think_tokens):.1f}")
print(f"  Old std: {np.std(old_think_tokens):.1f}")
print(f"  New std: {np.std(new_think_tokens):.1f}")

# Breakdown by result
unchanged = sum(1 for r in fixed_results if not r['changed'])
increased = sum(1 for r in fixed_results if r['changed'] and r['tokens']['think_tokens'] > r['tokens_old'].get('think_tokens', 0))
decreased = sum(1 for r in fixed_results if r['changed'] and r['tokens']['think_tokens'] < r['tokens_old'].get('think_tokens', 0))

print(f"\nBreakdown:")
print(f"  Unchanged: {unchanged}/{len(fixed_results)} ({100*unchanged/len(fixed_results):.1f}%)")
print(f"  Increased: {increased}/{len(fixed_results)} ({100*increased/len(fixed_results):.1f}%)")
print(f"  Decreased: {decreased}/{len(fixed_results)} ({100*decreased/len(fixed_results):.1f}%)")

SUMMARY: OLD VS NEW METRICS

Total results: 144

Results with reasoning (think_tokens > 0):
  Old (buggy): 22/144 (15.3%)
  New (fixed): 144/144 (100.0%)
  Change: +122 results now show reasoning

Think tokens statistics:
  Old mean: 141.6
  New mean: 1010.0
  Old std: 334.6
  New std: 44.6

Breakdown:
  Unchanged: 22/144 (15.3%)
  Increased: 122/144 (84.7%)
  Decreased: 0/144 (0.0%)
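
As a quick consistency check on the summary above: the old mean of 141.6 averages 122 zeros together with 22 genuine counts, so the mean among the 22 results that were detected before the fix can be backed out by simple arithmetic. The sketch below uses only the figures printed in this notebook; the function name is illustrative.

```python
def implied_nonzero_mean(overall_mean: float, total: int, nonzero: int) -> float:
    """Mean of the non-zero entries when all remaining entries are exactly zero."""
    return overall_mean * total / nonzero


total_results = 144
old_with_reasoning = 22
newly_detected = 122

# The two groups should partition the full result set.
assert old_with_reasoning + newly_detected == total_results

print(round(implied_nonzero_mean(141.6, total_results, old_with_reasoning), 1))  # 926.8
```

The implied value of about 927 think tokens is in line with the unchanged rows shown in the per-layer tables (787 to 992). It also explains why the standard deviation drops from 334.6 to 44.6: the old distribution mixed zeros with values near 900, while the fixed one no longer contains zeros.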


## Visualization: Before vs After

In [11]:
# Create DataFrame for analysis
df = pd.DataFrame([{
    'layer': r['layer'],
    'strength': r['strength'],
    'think_tokens_old': r['tokens_old'].get('think_tokens', 0),
    'think_tokens_new': r['tokens']['think_tokens'],
    'reasoning_steps_old': r['quality_old'].get('reasoning_steps', 0),
    'reasoning_steps_new': r['quality']['reasoning_steps'],
    'changed': r['changed']
} for r in fixed_results])

print("\nDataFrame created with", len(df), "rows")
print("\nSample:")
print(df.head(10).to_string())

# Save fixed results
output_path = Path('/scratch/gilbreth/sramishe/results_QwQ_R1/results_improved')
df.to_csv(output_path / 'fixed_analysis.csv', index=False)
print(f"\n✓ Saved to {output_path / 'fixed_analysis.csv'}")


DataFrame created with 144 rows

Sample:
   layer  strength  think_tokens_old  think_tokens_new  reasoning_steps_old  reasoning_steps_new  changed
0      0      -2.0               884               884                   13                   13    False
1      0      -1.5               787               787                   12                   12    False
2      0      -1.0                 0              1024                    0                   17     True
3      0      -0.5               889               889                   15                   15    False
4      0       0.0                 0              1025                    0                   16     True
5      0       0.5                 0              1025                    0                   16     True
6      0       1.0                 0              1025                    0                   18     True
7      0       1.5                 0              1025                    0                   23     True
8      0       2.0                 0              1025                    0                   16     True
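
The before/after pattern can also be viewed as a layer-by-strength grid. The sketch below is a plain-text stand-in for a heatmap, using only the three layers whose full tables appear earlier in this notebook (0, 16, and 20). The `UNCHANGED` dictionary is transcribed by hand from those tables, and the function name is illustrative; with the full data one would build the same grid from `df`.

```python
STRENGTHS = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]

# Strengths whose think_tokens did NOT change, per layer
# (hand-copied from the Layer 0, 16, and 20 tables above).
UNCHANGED = {
    0: {-2.0, -1.5, -0.5},
    16: {-0.5, 0.0},
    20: {-1.5, 1.0, 1.5},
}


def changed_grid(unchanged, strengths):
    """Render '*' where the fix changed think_tokens and '.' where it did not."""
    lines = ["layer " + " ".join(f"{s:>5.1f}" for s in strengths)]
    for layer in sorted(unchanged):
        marks = ["." if s in unchanged[layer] else "*" for s in strengths]
        lines.append(f"{layer:>5} " + " ".join(f"{m:>5}" for m in marks))
    return "\n".join(lines)


print(changed_grid(UNCHANGED, STRENGTHS))
```

Each `*` marks a generation that hit the token limit before emitting `</think>` and was therefore miscounted by the old evaluator.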

## Key Findings Summary

In [12]:
print("="*80)
print("KEY FINDINGS FROM RE-ANALYSIS")
print("="*80)

# Finding 1: How many results were affected?
print(f"\n1. SCOPE OF BUG")
print(f"   - {changes_detected}/{len(fixed_results)} results affected ({100*changes_detected/len(fixed_results):.1f}%)")
print(f"   - These results previously showed think_tokens=0 due to truncation")
print(f"   - Now correctly detected with fixed evaluator")

# Finding 2: Which layers were most affected?
layers_affected = defaultdict(int)
for r in fixed_results:
    if r['changed']:
        layers_affected[r['layer']] += 1

print(f"\n2. LAYERS MOST AFFECTED BY BUG")
for layer, count in sorted(layers_affected.items(), key=lambda x: x[1], reverse=True):
    print(f"   Layer {layer}: {count}/{data['config']['num_strengths']} results changed")

# Finding 3: Impact on Layer 0 hypothesis
layer0_baseline = [r for r in fixed_results if r['layer'] == 0 and r['strength'] == 0.0][0]
print(f"\n3. LAYER 0 INVERSE DIRECTION HYPOTHESIS")
print(f"   Baseline (strength=0.0) think_tokens:")
print(f"   - Old: {layer0_baseline['tokens_old'].get('think_tokens', 0)}")
print(f"   - New: {layer0_baseline['tokens']['think_tokens']}")

if layer0_baseline['changed']:
    print(f"   ⚠️  HYPOTHESIS INVALIDATED")
    print(f"   - Baseline WAS truncated, not truly suppressed")
    print(f"   - Need to compare quality, not just presence of reasoning")
else:
    print(f"   ✓ HYPOTHESIS STILL VALID")
    print(f"   - Baseline truly shows no reasoning")
    print(f"   - Negative strengths genuinely induce reasoning")

# Finding 4: Overall impact
print(f"\n4. OVERALL IMPACT")
print(f"   Results showing reasoning:")
print(f"   - Old: {old_has_reasoning}/{len(fixed_results)} ({100*old_has_reasoning/len(fixed_results):.1f}%)")
print(f"   - New: {new_has_reasoning}/{len(fixed_results)} ({100*new_has_reasoning/len(fixed_results):.1f}%)")
print(f"   - Net change: {new_has_reasoning - old_has_reasoning:+d} ({100*(new_has_reasoning - old_has_reasoning)/len(fixed_results):+.1f}%)")

print("\n" + "="*80)
print("CONCLUSION")
print("="*80)
print("\nThe evaluation bug fix successfully detects think content in truncated")
print("generations. This provides more accurate quality metrics for analysis.")
print("\nNext steps:")
print("1. Review Layer 0 baseline - does it truly suppress reasoning?")
print("2. Compare QUALITY not just PRESENCE of reasoning")
print("3. Test inverse direction on MATH-500 with correctness grading")
print("="*80)

KEY FINDINGS FROM RE-ANALYSIS

1. SCOPE OF BUG
   - 122/144 results affected (84.7%)
   - These results previously showed think_tokens=0 due to truncation
   - Now correctly detected with fixed evaluator

2. LAYERS MOST AFFECTED BY BUG
   Layer 4: 9/9 results changed
   Layer 36: 9/9 results changed
   Layer 48: 9/9 results changed
   Layer 56: 9/9 results changed
   Layer 60: 9/9 results changed
   Layer 8: 8/9 results changed
   Layer 12: 8/9 results changed
   Layer 40: 8/9 results changed
   Layer 16: 7/9 results changed
   Layer 24: 7/9 results changed
   Layer 28: 7/9 results changed
   Layer 44: 7/9 results changed
   Layer 52: 7/9 results changed
   Layer 0: 6/9 results changed
   Layer 20: 6/9 results changed
   Layer 32: 6/9 results changed

3. LAYER 0 INVERSE DIRECTION HYPOTHESIS
   Baseline (strength=0.0) think_tokens:
   - Old: 0
   - New: 1025
   ⚠️  HYPOTHESIS INVALIDATED
   - Baseline WAS truncated, not truly suppressed
   - Need to compare quality, not just presence of reasoning
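
One possible way to start comparing quality rather than presence is to normalize reasoning steps by think-token count, since truncated generations are all capped near the same length. The sketch below is only an illustration: the "steps per 100 think tokens" metric is an assumption introduced here, not something the evaluator computes, and the rows are copied from the fixed Layer 0 table above.

```python
from statistics import mean

# (strength, think_tokens, reasoning_steps) from the fixed Layer 0 table.
LAYER0 = [
    (-2.0, 884, 13), (-1.5, 787, 12), (-1.0, 1024, 17), (-0.5, 889, 15),
    (0.0, 1025, 16),
    (0.5, 1025, 16), (1.0, 1025, 18), (1.5, 1025, 23), (2.0, 1025, 16),
]


def steps_per_100_tokens(think_tokens: int, steps: int) -> float:
    """Illustrative density metric; returns 0.0 when there is no think content."""
    return 100.0 * steps / think_tokens if think_tokens else 0.0


def density_by_sign(rows):
    """Average step density for negative and for positive strengths."""
    negative = [steps_per_100_tokens(t, s) for strength, t, s in rows if strength < 0]
    positive = [steps_per_100_tokens(t, s) for strength, t, s in rows if strength > 0]
    return mean(negative), mean(positive)


neg_density, pos_density = density_by_sign(LAYER0)
print(f"negative: {neg_density:.2f}  positive: {pos_density:.2f}")
```

With four rows per group on a single prompt, any difference here is only suggestive; a real comparison would need the correctness grading listed in the next steps.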

## Export Fixed Results

Save the re-analyzed results for future use.

In [13]:
# Save fixed results to JSON
output_data = {
    'config': data['config'],
    'model_info': data['model_info'],
    'directions_stats': data.get('directions_stats', {}),
    'baselines': data['baselines'],
    'intervention_results_fixed': fixed_results,
    'changes_detected': changes_detected,
    'old_has_reasoning': old_has_reasoning,
    'new_has_reasoning': new_has_reasoning,
}

output_file = output_path / 'results_fixed.json'
with open(output_file, 'w') as f:
    json.dump(output_data, f, indent=2, default=str)

print(f"✓ Fixed results saved to {output_file}")
print(f"✓ CSV analysis saved to {output_path / 'fixed_analysis.csv'}")
print(f"\nYou can now use these corrected results for further analysis!")

✓ Fixed results saved to /scratch/gilbreth/sramishe/results_QwQ_R1/results_improved/results_fixed.json
✓ CSV analysis saved to /scratch/gilbreth/sramishe/results_QwQ_R1/results_improved/fixed_analysis.csv

You can now use these corrected results for further analysis!
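
One caveat when reloading `results_fixed.json`: because the export passes `default=str` to `json.dump`, any value that the `json` module cannot serialize natively is written as a string and stays a string after loading. The sketch below demonstrates the effect with a `pathlib.Path` as a stand-in; whether the real results contain such values (for example NumPy integers) is an assumption that should be checked against the file.

```python
import json
from pathlib import Path


def round_trip(record: dict) -> dict:
    """Serialize with default=str, as in the export cell, then load back."""
    return json.loads(json.dumps(record, default=str))


original = {"think_tokens": 1025, "source": Path("results.json")}
restored = round_trip(original)

# Native ints survive unchanged; the Path comes back as a plain string.
print(type(restored["think_tokens"]).__name__, type(restored["source"]).__name__)
```

If numeric fields come back as strings, converting them explicitly (for example with `int(...)`) before computing statistics avoids silent type errors.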
