# Run 2: WildGuardMix + Heuristic - Full Reproducible

**Complete code to reproduce DeepConf safety evaluation on WildGuardMix**

---

## Overview

This notebook reproduces Run 2:
- Download WildGuardMix dataset (1,725 test instances)
- Run Qwen3-0.6B with DeepConf (10 traces per instance)
- Apply heuristic refusal detection
- Generate all 6 visualizations
- Compare to ToxicChat results

**Requirements:**
- GPU: A100 40GB or similar
- Time: ~2 hours (smaller dataset)
- Cost: ~$3 on Lambda Labs

**Note:** Requires HuggingFace authentication for WildGuardMix

---

## Step 1: Environment Setup

In [None]:
# Install dependencies
!pip install -q torch torchvision torchaudio
!pip install -q transformers accelerate vllm
!pip install -q datasets huggingface_hub
!pip install -q scipy scikit-learn matplotlib seaborn
!pip install -q numpy pandas tqdm

print("✓ Dependencies installed")

## Step 2: HuggingFace Authentication

In [None]:
from huggingface_hub import login

# Login to HuggingFace (required for WildGuardMix)
# Get your token from: https://huggingface.co/settings/tokens
login()

print("✓ HuggingFace authentication complete")

## Step 3: Download WildGuardMix Dataset

In [None]:
!python download_wildguardmix.py

print("✓ WildGuardMix downloaded")
print("  Test: 1,725 instances")
print("  Train: 86,759 instances")

## Step 4: Run DeepConf Experiment

**Takes ~1-2 hours on A100 GPU**

In [None]:
# Run experiment on WildGuardMix
!python run_experiment.py \
    --model Qwen/Qwen3-0.6B \
    --benchmark wildguardmix \
    --num-instances 1725 \
    --min-traces 3 \
    --max-traces 10 \
    --output results/wildguardmix_qwen06b_baseline

print("✓ Experiment complete")

## Step 5: Generate Confidence Distribution Plots

In [None]:
!python visualize_confidence_analysis.py \
    --results-dir results/wildguardmix_qwen06b_baseline \
    --benchmark wildguardmix \
    --data-root data \
    --output plots/run2 \
    --samples 100

print("✓ Generated 4 confidence distribution plots")

## Step 6: Run Percentile Sweep Analysis

In [None]:
!python scripts/comprehensive_percentile_analysis.py \
    --results-dir results/wildguardmix_qwen06b_baseline \
    --benchmark wildguardmix \
    --data-root data \
    --output wildguardmix_percentile_safety_analysis.json

print("✓ Percentile sweep complete")

## Step 7: Generate Enhanced Safety Visualizations

In [None]:
!python scripts/create_safety_visualizations.py \
    --results-dir results/wildguardmix_qwen06b_baseline \
    --percentile-analysis wildguardmix_percentile_safety_analysis.json \
    --benchmark wildguardmix \
    --data-root data \
    --output plots/run2/

print("✓ All 6 visualizations generated")

## Step 8: View Results

In [None]:
from IPython.display import Image, display
import json

# Load analysis results
with open('wildguardmix_percentile_safety_analysis.json') as f:
    analysis = json.load(f)

print("="*70)
print("PERCENTILE SWEEP RESULTS")
print("="*70)
print(f"{'%ile':>4} | {'Acc':>6} | {'Sens':>6} | {'Spec':>6} | {'Savings':>7}")
print("-"*70)
for result in analysis['results']:
    print(f"{result['percentile']:>4} | "
          f"{result['accuracy']*100:>6.2f} | "
          f"{result['sensitivity']*100:>6.2f} | "
          f"{result['specificity']*100:>6.2f} | "
          f"{result['token_savings_pct']:>6.1f}%")

# Display all plots
plots = [
    'confusion_matrix_2x2.png',
    'percentile_safety_curves.png',
    'confidence_by_correctness.png',
    'confidence_by_category.png',
    'confidence_by_toxicity.png',
    'trace_evolution.png'
]

for plot in plots:
    print(f"\n{plot}:")
    display(Image(f'plots/run2/{plot}', width=800))

---

## Key Findings

### 1. Better Accuracy than ToxicChat ✅
- **WildGuardMix:** 41-42% accuracy
- **ToxicChat:** 9-10% accuracy
- +32% absolute improvement

### 2. Minimal Confidence Bias ✅
- Incorrect: 0.655 confidence
- Correct: 0.640 confidence  
- Only 2% difference (vs 25% in ToxicChat)

### 3. No Toxicity Confidence Bias ✅
- Toxic prompts: 0.648 confidence
- Safe prompts: 0.646 confidence
- Nearly identical (vs 33% difference in ToxicChat)

### 4. Similar Sensitivity ✅
- 92% toxic catch rate (vs 91-94% in ToxicChat)
- Consistent safety performance

---

## WildGuardMix vs ToxicChat Comparison

| Metric | ToxicChat | WildGuardMix | Winner |
|--------|-----------|--------------|--------|
| Accuracy | 9-10% | 41-42% | ✅ WildGuardMix |
| Sensitivity | 91-94% | 92% | ≈ Tie |
| Specificity | 2-4% | 2% | ≈ Tie |
| Confidence bias (incorrect) | +25% | +2% | ✅ WildGuardMix |
| Confidence bias (toxic) | +33% | 0% | ✅ WildGuardMix |
| Token savings | 64.6% | 64.3% | ≈ Tie |

### Why WildGuardMix is Better

1. **Gold-standard labels** - Expert-annotated refusal + harmfulness
2. **Balanced dataset** - 44% toxic vs 7% in ToxicChat
3. **Better coverage** - Diverse adversarial examples

---

## Conclusions

1. ✅ **WildGuardMix provides better evaluation** - Higher quality labels
2. ✅ **Less confidence bias** - But still present (incorrect > correct)
3. ✅ **Similar efficiency gains** - 64% token savings at 20th percentile
4. ⚠️ **Dataset quality matters** - ToxicChat may have labeling issues

---