# Run 1: ToxicChat + Heuristic - Fully Reproducible

**Complete code to reproduce DeepConf safety evaluation from scratch**

---

## Overview

This notebook contains all code to reproduce Run 1:
- Download ToxicChat dataset
- Run Qwen3-0.6B with DeepConf (3 to 10 traces per instance)
- Apply heuristic refusal detection
- Generate all 6 visualizations
- Compute percentile sweep analysis

**Requirements:**
- GPU: A100 40GB or similar (for Qwen3 inference)
- Time: ~4 hours
- Cost: ~$5 on Lambda Labs

---

## Step 1: Environment Setup

In [None]:
# Install dependencies
!pip install -q torch torchvision torchaudio
!pip install -q transformers accelerate vllm
!pip install -q datasets huggingface_hub
!pip install -q scipy scikit-learn matplotlib seaborn
!pip install -q numpy pandas tqdm

print("✓ Dependencies installed")

## Step 2: Download ToxicChat Dataset

In [None]:
from datasets import load_dataset
import os

# Download ToxicChat
dataset = load_dataset("lmsys/toxic-chat", "toxicchat0124")

# Save to disk
os.makedirs('data/toxicchat', exist_ok=True)
dataset['test'].to_json('data/toxicchat/test.jsonl')
dataset['train'].to_json('data/toxicchat/train.jsonl')

print("✓ ToxicChat downloaded")
print(f"  Test: {len(dataset['test'])} instances")
print(f"  Train: {len(dataset['train'])} instances")

## Step 3: Load Qwen3-0.6B with vLLM

In [None]:
from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    gpu_memory_utilization=0.9,
    max_model_len=2048
)

# Sampling parameters for DeepConf
sampling_params = SamplingParams(
    temperature=1.0,
    top_p=1.0,
    max_tokens=256,
    logprobs=5  # For confidence computation
)

print("✓ Qwen3-0.6B loaded with vLLM")
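
The `logprobs=5` setting above asks the sampler to return the top candidate log-probabilities at each position, which is the raw material for a confidence score. As a minimal sketch, assuming a DeepConf-style definition (token confidence as the negative mean log-probability of the top-k candidates, averaged over the trace), the computation could look like the following. The helper names `token_confidence` and `trace_confidence` are hypothetical, and `run_experiment.py` may define confidence differently.

```python
import math

def token_confidence(top_logprobs):
    """Negative mean log-probability over the top-k candidates at one position.

    A peaked distribution puts almost all mass on one token, so the other
    candidates have very negative log-probs and the score is high.
    """
    return -sum(top_logprobs) / len(top_logprobs)

def trace_confidence(per_token_top_logprobs):
    """Average the per-token confidence over every position in a trace."""
    scores = [token_confidence(lps) for lps in per_token_top_logprobs]
    return sum(scores) / len(scores)

# Toy data with k=2 candidates per position (made-up probabilities).
peaked = [[math.log(0.98), math.log(0.01)], [math.log(0.97), math.log(0.02)]]
flat = [[math.log(0.50), math.log(0.40)], [math.log(0.45), math.log(0.45)]]

print(f"peaked trace: {trace_confidence(peaked):.3f}")
print(f"flat trace:   {trace_confidence(flat):.3f}")
```

With vLLM, the per-position inputs would come from the `logprobs` field of each completion, where every position maps token IDs to objects carrying a `.logprob` value; the exact structure can vary between vLLM versions, so check it before wiring this in.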

## Step 4: Run DeepConf Experiment

**This takes ~2-4 hours on A100 GPU**

In [None]:
# Run experiment (use command line for better progress tracking)
!python run_experiment.py \
    --model Qwen/Qwen3-0.6B \
    --benchmark toxicchat \
    --num-instances 5083 \
    --min-traces 3 \
    --max-traces 10 \
    --output results/toxicchat_qwen06b_baseline

print("✓ Experiment complete")

## Step 5: Apply Majority Voting Fix

Replaces the `any()` aggregation over traces, a critical bug that caused false positives, with majority voting.

In [None]:
# Reclassify with majority voting
!python scripts/reclassify_with_majority_voting.py \
    --input results/toxicchat_qwen06b_baseline/predictions.jsonl \
    --output results/toxicchat_qwen06b_reclassified/predictions.jsonl

print("✓ Reclassification complete")
print("Expected: accuracy improves from ~77% to ~89%")
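
To illustrate why switching from `any()` to majority voting matters, here is a minimal sketch of the two aggregation rules over per-trace boolean flags. The function names and the tie-breaking rule (a tie counts as not flagged) are assumptions for illustration; `scripts/reclassify_with_majority_voting.py` may behave differently.

```python
def any_vote(trace_flags):
    """Flag the instance if at least one trace is flagged (the buggy rule)."""
    return any(trace_flags)

def majority_vote(trace_flags):
    """Flag the instance only if more than half of its traces are flagged.

    Assumption: an exact tie is treated as not flagged.
    """
    return sum(trace_flags) > len(trace_flags) / 2

# One outlier trace out of five is enough to flip any(), but not the majority.
flags = [True, False, False, False, False]
print(any_vote(flags), majority_vote(flags))  # True False
```

With up to 10 sampled traces per instance at temperature 1.0, a single outlier trace is likely, which is why the `any()` rule inflates false positives.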

## Step 6: Generate All Visualizations

In [None]:
# Generate confidence distribution plots
!python visualize_confidence_analysis.py \
    --results-dir results/toxicchat_qwen06b_reclassified \
    --benchmark toxicchat \
    --data-root data \
    --output plots/run1 \
    --samples 100

print("✓ Generated 4 confidence distribution plots")

## Step 7: Run Percentile Sweep Analysis

In [None]:
# Analyze percentile thresholds (20-90%)
!python scripts/comprehensive_percentile_analysis.py \
    --results-dir results/toxicchat_qwen06b_reclassified \
    --benchmark toxicchat \
    --data-root data \
    --output toxicchat_percentile_safety_analysis.json

print("✓ Percentile sweep complete")
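
The sweep itself lives in `scripts/comprehensive_percentile_analysis.py`, which is not shown here. As a rough sketch of the underlying idea, the snippet below derives a confidence threshold at each percentile from 20 to 90 and counts how many traces sit at or above it. The `percentile` helper, the toy confidence values, and the choice to keep traces above the threshold are all assumptions; the script's actual stopping rule may use the threshold differently.

```python
def percentile(values, p):
    """p-th percentile with linear interpolation between sorted neighbors."""
    if not values:
        raise ValueError("values must be non-empty")
    xs = sorted(values)
    rank = (len(xs) - 1) * p / 100
    lo = int(rank)                      # index of the lower neighbor
    hi = min(lo + 1, len(xs) - 1)       # index of the upper neighbor
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)

def sweep(confidences, percentiles=range(20, 100, 10)):
    """Return {percentile: (threshold, number of traces at or above it)}."""
    out = {}
    for p in percentiles:
        thr = percentile(confidences, p)
        out[p] = (thr, sum(c >= thr for c in confidences))
    return out

# Made-up trace confidences for illustration only.
confs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
for p, (thr, kept) in sweep(confs).items():
    print(f"p={p:>2}  threshold={thr:.2f}  traces kept={kept}")
```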

## Step 8: Generate Enhanced Safety Visualizations

In [None]:
# Generate 2x2 confusion matrix + percentile curves
!python scripts/create_safety_visualizations.py \
    --results-dir results/toxicchat_qwen06b_reclassified \
    --percentile-analysis toxicchat_percentile_safety_analysis.json \
    --benchmark toxicchat \
    --data-root data \
    --output plots/run1/

print("✓ Generated all 6 visualizations")
print("\nFiles:")
!ls -lh plots/run1/

## Step 9: View Results

Display all generated visualizations

In [None]:
from IPython.display import Image, display
import json

# Load analysis results
with open('toxicchat_percentile_safety_analysis.json') as f:
    analysis = json.load(f)

print("="*70)
print("PERCENTILE SWEEP RESULTS")
print("="*70)
print(f"{'%ile':>4} | {'Acc':>6} | {'Sens':>6} | {'Spec':>6} | {'Savings':>7}")
print("-"*70)
for result in analysis['results']:
    print(f"{result['percentile']:>4} | "
          f"{result['accuracy']*100:>6.2f} | "
          f"{result['sensitivity']*100:>6.2f} | "
          f"{result['specificity']*100:>6.2f} | "
          f"{result['token_savings_pct']:>6.1f}%")

print("\n" + "="*70)
print("VISUALIZATIONS")
print("="*70 + "\n")

# Display all plots
plots = [
    'confusion_matrix_2x2.png',
    'percentile_safety_curves.png', 
    'confidence_by_correctness.png',
    'confidence_by_category.png',
    'confidence_by_toxicity.png',
    'trace_evolution.png'
]

for plot in plots:
    print(f"\n{plot}:")
    # Pass the path explicitly as a filename rather than as raw image data
    display(Image(filename=f'plots/run1/{plot}', width=800))
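
For reference, the table columns printed above (accuracy, sensitivity, specificity, token savings) follow standard definitions. The sketch below shows one way to compute them, assuming binary labels where 1 means toxic; the function names and the token-savings formula (relative to a full-trace baseline) are assumptions, not the analysis script's actual code.

```python
def safety_metrics(y_true, y_pred):
    """Accuracy, sensitivity (toxic recall), and specificity from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed toxic cases
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "missed_toxic": fn,
    }

def token_savings_pct(tokens_used, tokens_baseline):
    """Percentage of tokens saved relative to generating every trace in full."""
    return 100 * (1 - tokens_used / tokens_baseline)

# Toy labels for illustration only.
metrics = safety_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
print(metrics)
print(f"{token_savings_pct(354, 1000):.1f}% tokens saved")
```

Sensitivity is the column to watch for safety work, since every false negative is a toxic prompt that slipped through.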

---

## Key Findings

### 1. Confidence Paradox ⚠️
- **Incorrect predictions have HIGHER confidence** (0.640 vs 0.510)
- High confidence does NOT indicate correct predictions

### 2. Toxicity Bias ⚠️
- **Toxic prompts get 33% HIGHER confidence** (0.679 vs 0.512)
- Model is more confident when generating unsafe content

### 3. Safety vs Efficiency Trade-off
- **Best sensitivity:** 90th percentile (93.6%, 23 missed toxic)
- **Best efficiency:** 20th percentile (64.6% token savings)
- Lower percentiles provide better efficiency with similar safety

### 4. Uncertain Compliance Paradox
- **Uncertain compliance has BEST accuracy** (95.8%)
- Uncertain predictions are more reliable than confident ones

---

## Conclusions

1. ❌ **Confidence-based early stopping is fundamentally flawed** for safety
2. ✅ **Lower percentile thresholds (20-30%) are optimal** for efficiency
3. ⚠️ **Sensitivity is the most critical metric** (catch all toxic content)
4. 🔬 **Need alternative methods** - confidence does not correlate with safety

---