# Run 3: ToxicChat + WildGuard Classifier (Fully Reproducible)

**Complete code to reproduce the DeepConf safety evaluation with the WildGuard 7B classifier**

---

## Overview

This notebook reproduces Run 3:
- Use ToxicChat dataset (5,083 test instances)
- Run Qwen3-0.6B with DeepConf (from Run 1)
- **Replace heuristic with WildGuard 7B classifier**
- Classify 48,145 traces for harmfulness + refusal
- Compare to heuristic approach

**Requirements:**
- GPU: A100 40GB (for WildGuard 7B)
- Time: ~20 minutes (just classification)
- Cost: ~$1 on Lambda Labs

**Note:** Assumes Run 1 already completed (predictions.jsonl exists)

---

## Step 1: Environment Setup

In [None]:
# Install dependencies
!pip install -q torch torchvision torchaudio
!pip install -q transformers accelerate
!pip install -q scipy scikit-learn matplotlib seaborn
!pip install -q numpy pandas tqdm

print("✓ Dependencies installed")

## Step 2: Load Existing Predictions from Run 1

**If you don't have Run 1 results, run the Run1 notebook first!**

In [None]:
from pathlib import Path
import json

# Check if Run 1 results exist
predictions_path = Path('results/toxicchat_qwen06b_1000_vllm_reclassified/predictions.jsonl')

if not predictions_path.exists():
    print("❌ Run 1 predictions not found!")
    print("Please run Run1_ToxicChat_Heuristic_Reproducible.ipynb first")
else:
    # Count traces
    total_traces = 0
    with open(predictions_path) as f:
        for line in f:
            pred = json.loads(line)
            total_traces += len(pred['traces'])
    
    print("✓ Found Run 1 predictions")
    print(f"  Total traces to classify: {total_traces:,}")

## Step 3: Classify with WildGuard 7B

**Takes ~17 minutes on A100 GPU (48,145 traces)**

In [None]:
# Run WildGuard classification
!python classify_toxicchat_wildguard.py \
    --results-dir results/toxicchat_qwen06b_1000_vllm_reclassified \
    --data-root data \
    --device cuda \
    --batch-size 32 \
    --cache-dir .wildguard_cache

print("✓ WildGuard classification complete")
print("Created: predictions_wildguard.jsonl")
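
The classification script itself is not shown in this notebook. As a rough illustration of the labelling step, the sketch below parses a classifier response into the harmfulness and refusal flags used later. It assumes the three-line `Field: yes/no` output format described in the WildGuard model card; the function name `parse_wildguard_output` and the dictionary keys are hypothetical, and the real script may parse its output differently.

```python
import re
from typing import Dict, Optional

# Assumed field names, following the output format in the WildGuard model card.
# The dictionary keys on the right are hypothetical names chosen for this sketch.
_FIELDS = {
    "harmful request": "prompt_harmful",
    "response refusal": "response_refusal",
    "harmful response": "response_harmful",
}


def parse_wildguard_output(text: str) -> Dict[str, Optional[bool]]:
    """Map 'Field: yes/no' lines to booleans; None when a field is missing."""
    parsed: Dict[str, Optional[bool]] = {name: None for name in _FIELDS.values()}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(yes|no)\b", line, flags=re.IGNORECASE)
        if match is None:
            continue  # skip lines that are not in the expected format
        key = _FIELDS.get(match.group(1).strip().lower())
        if key is not None:
            parsed[key] = match.group(2).lower() == "yes"
    return parsed


example = "Harmful request: yes\nResponse refusal: no\nHarmful response: yes"
labels = parse_wildguard_output(example)
print(labels)
```

Returning `None` for missing fields, rather than defaulting to `False`, keeps unparseable outputs visible so they can be counted or re-run instead of silently being treated as safe.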

## Step 4: Create Run 3 Results Directory

In [None]:
# Copy WildGuard results to Run 3 directory
!mkdir -p results/toxicchat_qwen06b_wildguard
!cp results/toxicchat_qwen06b_1000_vllm_reclassified/predictions_wildguard.jsonl \
    results/toxicchat_qwen06b_wildguard/predictions.jsonl
!cp results/toxicchat_qwen06b_1000_vllm_reclassified/wildguard_metrics.json \
    results/toxicchat_qwen06b_wildguard/

print("✓ Created Run 3 results directory")

## Step 5: Generate Visualizations

In [None]:
# Generate confidence distribution plots
!python visualize_confidence_analysis.py \
    --results-dir results/toxicchat_qwen06b_wildguard \
    --benchmark toxicchat \
    --data-root data \
    --output plots/run3 \
    --samples 100

print("✓ Generated 4 confidence distribution plots")

## Step 6: Run Percentile Sweep

In [None]:
!python scripts/comprehensive_percentile_analysis.py \
    --results-dir results/toxicchat_qwen06b_wildguard \
    --benchmark toxicchat \
    --data-root data \
    --output toxicchat_wildguard_percentile_safety_analysis.json

print("✓ Percentile sweep complete")
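
For readers without access to `comprehensive_percentile_analysis.py`, the following is a minimal sketch of what a percentile sweep over trace confidences could look like. It assumes the threshold is set at a given percentile of the confidence distribution and that traces below the threshold are dropped; the helper names and the toy confidence values are illustrative only and are not taken from the actual script or results.

```python
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Linear-interpolation percentile (pct in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def sweep(confidences: List[float], percentiles: List[float]) -> List[Dict[str, float]]:
    """For each percentile, report the threshold and the fraction of traces kept."""
    rows = []
    for pct in percentiles:
        threshold = percentile(confidences, pct)
        kept = [c for c in confidences if c >= threshold]
        rows.append({
            "percentile": pct,
            "threshold": threshold,
            "kept_fraction": len(kept) / len(confidences),
        })
    return rows


# Toy confidences, purely illustrative
toy_confidences = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
rows = sweep(toy_confidences, [20, 50, 80])
for row in rows:
    print(row)
```

In a real sweep, each row would also carry the safety metrics (accuracy, sensitivity, specificity) computed on the traces that survive the threshold.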

## Step 7: Generate Enhanced Safety Visualizations

In [None]:
!python scripts/create_safety_visualizations.py \
    --results-dir results/toxicchat_qwen06b_wildguard \
    --percentile-analysis toxicchat_wildguard_percentile_safety_analysis.json \
    --benchmark toxicchat \
    --data-root data \
    --output plots/run3/

print("✓ All 6 visualizations generated")

## Step 8: Compare to Heuristic (Run 1)

In [None]:
import json

# Load both analyses
with open('toxicchat_percentile_safety_analysis.json') as f:
    heuristic = json.load(f)

with open('toxicchat_wildguard_percentile_safety_analysis.json') as f:
    wildguard = json.load(f)

print("="*80)
print("HEURISTIC vs WILDGUARD COMPARISON (20th Percentile)")
print("="*80)

h = heuristic['results'][0]  # 20th percentile
w = wildguard['results'][0]

metrics = [
    ('Accuracy', 'accuracy', '%'),
    ('Sensitivity', 'sensitivity', '%'),
    ('Specificity', 'specificity', '%'),
    ('Precision', 'precision', '%'),
    ('F1 Score', 'f1_score', '%'),
    ('Token Savings', 'token_savings_pct', '%')
]

print(f"{'Metric':<20} {'Heuristic':>12} {'WildGuard':>12} {'Difference':>12}")
print("-"*80)

for name, key, unit in metrics:
    # Ratio metrics are stored as fractions; token_savings_pct is already a percentage
    scale = 1 if key == 'token_savings_pct' else 100
    h_val = h[key] * scale
    w_val = w[key] * scale
    diff = w_val - h_val  # difference in percentage points

    symbol = '✅' if diff > 0.5 else '❌' if diff < -0.5 else '≈'
    print(f"{name:<20} {h_val:>11.2f}% {w_val:>11.2f}% {symbol} {diff:+.2f} pp")

print("\n" + "="*80)
print("VERDICT")
print("="*80)
print("✅ WildGuard slightly better accuracy (+1-2%)")
print("✅ WildGuard slightly better specificity (+1-2%)")
print("❌ No change in sensitivity (same toxic catch rate)")
print("❌ Does NOT fix confidence paradox")
print("❌ Does NOT fix toxicity bias")
print("\n⚠️  CONCLUSION: WildGuard 7B NOT worth the computational cost")
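
The verdict lines in the cell above are hard-coded, so they will not change if the underlying numbers do. One possible alternative, sketched here with a hypothetical helper `verdict` and the same ±0.5 percentage-point tolerance as the comparison loop, derives each line from the metric values. The inputs are the figures from the Key Findings table in this notebook.

```python
def verdict(name: str, heuristic_value: float, wildguard_value: float,
            tolerance_pp: float = 0.5) -> str:
    """Describe the WildGuard-vs-heuristic change for one metric (values in %)."""
    diff = wildguard_value - heuristic_value
    if diff > tolerance_pp:
        status = "better"
    elif diff < -tolerance_pp:
        status = "worse"
    else:
        status = "unchanged"
    return f"{name}: WildGuard {status} ({diff:+.1f} pp)"


# (metric, heuristic %, WildGuard %) as reported in the Key Findings table
reported = [
    ("Accuracy", 9.1, 10.8),
    ("Sensitivity", 91.4, 91.7),
    ("Specificity", 2.7, 4.6),
    ("Token Savings", 64.6, 64.6),
]

lines = [verdict(name, h_val, w_val) for name, h_val, w_val in reported]
print("\n".join(lines))
```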

## Step 9: View Results

In [None]:
from IPython.display import Image, display

plots = [
    'confusion_matrix_2x2.png',
    'percentile_safety_curves.png',
    'confidence_by_correctness.png',
    'confidence_by_category.png',
    'confidence_by_toxicity.png',
    'trace_evolution.png'
]

for plot in plots:
    print(f"\n{plot}:")
    display(Image(f'plots/run3/{plot}', width=800))

---

## Key Findings

### 1. WildGuard Impact: Minimal ⚠️

| Metric | Heuristic | WildGuard | Improvement (percentage points) |
|--------|-----------|-----------|---------------------------------|
| Accuracy | 9.1% | 10.8% | +1.7 |
| Sensitivity | 91.4% | 91.7% | +0.3 |
| Specificity | 2.7% | 4.6% | +1.9 |
| Token Savings | 64.6% | 64.6% | 0.0 |
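
An accuracy near 10% alongside a sensitivity above 90% can look contradictory. The worked example below shows how the two fit together, assuming all three metrics were computed on the same set of instances: accuracy is a prevalence-weighted mix of sensitivity and specificity, so on a mostly benign test set the very low specificity dominates. The function name is hypothetical and the prevalence it returns is only what the reported figures imply, not a separately measured value.

```python
def implied_prevalence(accuracy: float, sensitivity: float, specificity: float) -> float:
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p."""
    return (accuracy - specificity) / (sensitivity - specificity)


# Figures from the table above, expressed as fractions
p_heuristic = implied_prevalence(0.091, 0.914, 0.027)
p_wildguard = implied_prevalence(0.108, 0.917, 0.046)

print(f"Implied toxic share (heuristic): {p_heuristic:.3f}")
print(f"Implied toxic share (WildGuard): {p_wildguard:.3f}")
```

Both rows imply a toxic share of roughly 7%, which is why a small gain in specificity moves accuracy almost one-for-one while the sensitivity column barely matters for the headline number.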

### 2. Confidence Paradox: NOT FIXED ❌

**Heuristic:**
- Incorrect: 0.640 confidence
- Correct: 0.510 confidence
- Difference: +25%

**WildGuard:**
- Incorrect: 0.640 confidence
- Correct: 0.510 confidence
- Difference: +25% ⚠️ **IDENTICAL**

### 3. Toxicity Bias: NOT FIXED ❌

**Heuristic:**
- Toxic prompts: 0.679 confidence
- Safe prompts: 0.512 confidence
- Difference: +33%

**WildGuard:**
- Toxic prompts: 0.679 confidence
- Safe prompts: 0.512 confidence
- Difference: +33% ⚠️ **IDENTICAL**
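
The "Difference" figures in sections 2 and 3 are relative gaps, not percentage-point differences. A short check, using a hypothetical helper `relative_gap` and the confidence values reported above:

```python
def relative_gap(higher: float, lower: float) -> float:
    """Relative difference of `higher` over `lower`, in percent."""
    return (higher - lower) / lower * 100


paradox_gap = relative_gap(0.640, 0.510)   # incorrect vs correct traces
toxicity_gap = relative_gap(0.679, 0.512)  # toxic vs safe prompts

print(f"Confidence paradox gap: {paradox_gap:.1f}%")
print(f"Toxicity bias gap:      {toxicity_gap:.1f}%")
```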

---

## Root Cause Analysis

### Why doesn't WildGuard fix the confidence issue?

**The problem is NOT the refusal detection method.**

**The problem is the BASE MODEL (Qwen3-0.6B):**
- Model is **inherently more confident** when generating unsafe content
- Confidence scores come from **Qwen's own logprobs**, not WildGuard
- WildGuard only classifies the output as harmful/refused
- **Confidence bias exists at generation time**, before any classification
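
The separation described above can be sketched in a few lines. In this illustration, confidence is computed from the generator's token log-probabilities alone, while the refusal label comes from whichever classifier is plugged in. The confidence formula (mean token probability) is a stand-in chosen for the sketch and may differ from DeepConf's actual definition; both classifier functions are toy placeholders, not the real heuristic or WildGuard.

```python
import math
from typing import Callable, Dict, List


def trace_confidence(token_logprobs: List[float]) -> float:
    """Stand-in confidence: mean token probability (assumed, not DeepConf's formula)."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def score_trace(trace: Dict, classify: Callable[[str], bool]) -> Dict:
    return {
        "confidence": trace_confidence(trace["logprobs"]),  # depends only on the generator
        "refused": classify(trace["text"]),                 # depends only on the classifier
    }


def keyword_heuristic(text: str) -> bool:
    """Toy heuristic: refusal only if the reply opens with a fixed phrase."""
    return text.lower().startswith("i can't")


def stricter_classifier(text: str) -> bool:
    """Toy stand-in for a learned classifier that catches more phrasings."""
    lowered = text.lower()
    return "can't" in lowered or "cannot" in lowered


trace = {"text": "Sorry, I cannot help with that.", "logprobs": [-0.1, -0.5, -0.9]}
with_heuristic = score_trace(trace, keyword_heuristic)
with_classifier = score_trace(trace, stricter_classifier)

print(with_heuristic)
print(with_classifier)
```

The two classifiers disagree on the label, yet the confidence value is the same in both results, because swapping the classifier never touches the log-probabilities.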

### Implication

❌ **Confidence-based early stopping is fundamentally broken** for safety with this model

The model is most confident precisely when it should be least trusted!

---

## Conclusions

1. ❌ **WildGuard 7B provides minimal benefit** (+1-2% accuracy)
2. ❌ **Does NOT fix confidence paradox** (identical patterns)
3. ❌ **Not worth computational cost** (7B model, 17 min runtime)
4. ⚠️ **Problem is in Qwen3-0.6B**, not detection method
5. 🔬 **Need alternative approaches:**
   - Different base models with better calibration
   - Ensemble methods instead of confidence
   - Task-specific fine-tuning for safety

---