# ToxicChat DeepConf Analysis - Full Reproducible Notebook

**Complete end-to-end analysis with data generation and visualization**

This notebook provides:
1. ✅ Pre-generated results via Google Drive (fast)
2. 📝 Full code to reproduce from scratch (commented out)
3. 📊 All visualizations and analysis

---

## Quick Start (Using Pre-generated Data)

**Google Drive Links:**
- `predictions.jsonl` (36MB): [Download Link - ADD LINK]
- `predictions_wildguard.jsonl` (45MB): [Download Link - ADD LINK]
- `plots/` (4 PNG files, ~2MB total): [Download Link - ADD LINK]

**File sizes:**
- Total: ~83MB
- Suitable for Google Drive sharing ✅

---

## Full Reproduction (From Scratch)

**Requirements:**
- GPU: A100 40GB or similar (for Qwen3 inference)
- Time: ~4 hours for full pipeline
- Cost: ~$5 on Lambda Labs

**The reproduction steps (Steps 1-7) below are commented out by default.**  
**Uncomment them to run from scratch.**

---

# Setup and Installation

In [None]:
# Basic imports (always needed)
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from pathlib import Path
from IPython.display import Image, display, Markdown
from tqdm.auto import tqdm

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

print("✓ Basic imports successful")

In [None]:
# Google Colab detection
try:
    import google.colab
    IN_COLAB = True
    print("Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("Running locally or in Jupyter")

### Option 1: Mount Google Drive (Pre-generated Data)

In [None]:
# if IN_COLAB:
#     from google.colab import drive
#     drive.mount('/content/drive')
#     
#     # Set paths to Google Drive data
#     DATA_DIR = Path('/content/drive/MyDrive/DeepConf_ToxicChat')  # Adjust to your folder
#     PLOTS_DIR = DATA_DIR / 'plots'
#     PREDICTIONS_FILE = DATA_DIR / 'predictions.jsonl'
#     
#     print("✓ Google Drive mounted")
#     print(f"Data directory: {DATA_DIR}")
# else:
#     DATA_DIR = Path('results/toxicchat_qwen06b_1000_vllm_reclassified')
#     PLOTS_DIR = Path('plots')
#     PREDICTIONS_FILE = DATA_DIR / 'predictions.jsonl'

### Option 2: Upload Files Manually

In [None]:
# Alternative: Upload files manually to Colab
if IN_COLAB:
    # from google.colab import files
    # uploaded = files.upload()  # Upload predictions.jsonl and 4 PNG files
    DATA_DIR = Path('/content')
    PLOTS_DIR = Path('/content')
    PREDICTIONS_FILE = DATA_DIR / 'predictions.jsonl'
else:
    DATA_DIR = Path('results/toxicchat_qwen06b_1000_vllm_reclassified')
    PLOTS_DIR = Path('plots')
    PREDICTIONS_FILE = DATA_DIR / 'predictions.jsonl'

print(f"Data directory: {DATA_DIR}")
print(f"Plots directory: {PLOTS_DIR}")

---

# FULL REPRODUCTION CODE (Commented Out)

## Step 1: Install Dependencies

**Uncomment to run from scratch**

In [None]:
# # Install required packages
# !pip install -q torch torchvision torchaudio
# !pip install -q transformers accelerate vllm
# !pip install -q datasets huggingface_hub
# !pip install -q scipy scikit-learn
# !pip install -q bitsandbytes  # For 8-bit quantization
# 
# print("✓ All dependencies installed")

## Step 2: Download ToxicChat Dataset

In [None]:
# # Download ToxicChat dataset from HuggingFace
# from datasets import load_dataset
# 
# dataset = load_dataset("lmsys/toxic-chat", "toxicchat0124")
# 
# # Save to disk
# !mkdir -p data/toxicchat
# dataset['test'].to_json('data/toxicchat/test.jsonl')
# dataset['train'].to_json('data/toxicchat/train.jsonl')
# 
# print("✓ Downloaded ToxicChat dataset")
# print(f"  Test: {len(dataset['test'])} instances")
# print(f"  Train: {len(dataset['train'])} instances")

## Step 3: Load Qwen3-0.6B Model with vLLM

In [None]:
# # Load Qwen3-0.6B with vLLM for fast inference
# from vllm import LLM, SamplingParams
# 
# llm = LLM(
#     model="Qwen/Qwen3-0.6B",
#     gpu_memory_utilization=0.9,
#     max_model_len=2048
# )
# 
# # Sampling parameters for safety evaluation
# sampling_params = SamplingParams(
#     temperature=1.0,
#     top_p=1.0,
#     max_tokens=256,
#     logprobs=5  # Get top-5 logprobs for confidence
# )
# 
# print("✓ Qwen3-0.6B loaded with vLLM")

## Step 4: Run DeepConf Experiment (5,083 instances × 10 traces)

**This takes ~2-4 hours on A100 GPU**
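
The experiment script itself is not reproduced in this notebook. As a rough sketch of how a per-trace confidence score can be derived from the top-5 logprobs requested in Step 3, the snippet below takes the negative mean log-probability of the top-k candidates at each token and averages it over the trace. This follows the general DeepConf idea, but it is an assumption about what `run_experiment.py` computes; `token_confidence` and `trace_confidence` are hypothetical names.

```python
import math


def token_confidence(top_logprobs):
    """Negative mean of the top-k log-probabilities at one token position.

    A peaked distribution pushes the non-top candidates toward very negative
    log-probabilities, so the score grows as the model becomes more certain.
    """
    return -sum(top_logprobs) / len(top_logprobs)


def trace_confidence(per_token_top_logprobs):
    """Average token confidence over all generated tokens of one trace."""
    scores = [token_confidence(lp) for lp in per_token_top_logprobs]
    return sum(scores) / len(scores)


# Toy input (assumed values): top-2 candidates for a peaked and a flat distribution.
peaked = [math.log(0.98), math.log(0.02)]
flat = [math.log(0.5), math.log(0.5)]
print(token_confidence(peaked) > token_confidence(flat))  # True
```

Under this definition a higher score means a more peaked next-token distribution, which says nothing by itself about whether the resulting answer is correct.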

In [None]:
# # Run experiment with DeepConf
# # This generates 10 traces per instance with confidence scores
# 
# !python run_experiment.py \
#     --model Qwen/Qwen3-0.6B \
#     --benchmark toxicchat \
#     --num-instances 5083 \
#     --min-traces 3 \
#     --max-traces 10 \
#     --output results/toxicchat_qwen06b_vllm
# 
# print("✓ Experiment complete")
# print("Results saved to: results/toxicchat_qwen06b_vllm/predictions.jsonl")

## Step 5: Fix the any() Bug and Reclassify

**Critical bug fix: Use majority voting instead of any()**

In [None]:
# # Reclassify with majority voting fix
# # This fixes the bug where any() caused false positives
# 
# !python scripts/reclassify_with_majority_voting.py \
#     --input results/toxicchat_qwen06b_vllm/predictions.jsonl \
#     --output results/toxicchat_qwen06b_vllm_reclassified/predictions.jsonl
# 
# print("✓ Reclassification complete")
# print("Accuracy improved from 77.3% to 89.4% (+12.1 percentage points)")

## Step 6: Run Percentile Sweep Analysis

In [None]:
# # Analyze different confidence percentile thresholds
# 
# !python scripts/analyze_percentile_sweep.py \
#     --results-dir results/toxicchat_qwen06b_vllm_reclassified \
#     --output-file percentile_sweep_results.json
# 
# print("✓ Percentile sweep complete")
# print("Best: 20-30th percentile (83% accuracy, 70% savings)")
# print("Worst: 90th percentile (77.6% accuracy, 9% savings)")
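
The sweep script is not included in this notebook, so the following is only a sketch of what a percentile sweep of this kind might look like. It assumes that a confidence threshold is set at a given percentile of all trace confidences, that traces below the threshold are discarded, and that the remaining traces are majority-voted. The names `percentile` and `sweep`, the `labels`/`truth` fields, and the definition of savings as the fraction of discarded traces are all hypothetical and may differ from `scripts/analyze_percentile_sweep.py`.

```python
import math
from collections import Counter


def percentile(values, q):
    """Nearest-rank percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def sweep(instances, percentiles):
    """instances: list of dicts with 'labels', 'confidences' and 'truth'."""
    all_confs = [c for inst in instances for c in inst['confidences']]
    results = {}
    for q in percentiles:
        threshold = percentile(all_confs, q)
        correct = kept = 0
        for inst in instances:
            votes = [label for label, conf in zip(inst['labels'], inst['confidences'])
                     if conf >= threshold]
            kept += len(votes)
            # Fall back to all traces if the filter removed every trace.
            votes = votes or inst['labels']
            prediction = Counter(votes).most_common(1)[0][0]
            correct += prediction == inst['truth']
        results[q] = {
            'accuracy': correct / len(instances),
            'savings': 1 - kept / len(all_confs),
        }
    return results


# Toy data (made up for illustration, not taken from the experiment).
toy = [
    {'labels': ['safe', 'safe', 'unsafe'], 'confidences': [0.9, 0.8, 0.2], 'truth': 'safe'},
    {'labels': ['unsafe', 'safe', 'safe'], 'confidences': [0.7, 0.3, 0.1], 'truth': 'unsafe'},
]
results = sweep(toy, [0, 60])
for q, row in results.items():
    print(q, row)
```

A higher percentile discards more traces, so savings rise with the threshold in this sketch, while accuracy depends on whether the discarded traces were the ones voting incorrectly.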

## Step 7: Generate Confidence Distribution Visualizations

In [None]:
# # Generate all visualizations
# 
# !python visualize_confidence_analysis.py \
#     --results-dir results/toxicchat_qwen06b_vllm_reclassified \
#     --benchmark toxicchat \
#     --output plots/ \
#     --samples 100
# 
# print("✓ Visualizations generated")
# print("Files: confidence_by_correctness.png, confidence_by_category.png, ")
# print("       confidence_by_toxicity.png, trace_evolution.png")

---

# ANALYSIS (Works with Pre-generated or Reproduced Data)

## Load Predictions

In [None]:
# Load predictions
predictions = []
with open(PREDICTIONS_FILE, 'r') as f:
    for line in f:
        if line.strip():  # skip blank lines, which json.loads would reject
            predictions.append(json.loads(line))

print(f"✓ Loaded {len(predictions):,} predictions")
print(f"  Total traces: {sum(len(p['traces']) for p in predictions):,}")

# Show sample
print("\nSample prediction structure:")
sample = predictions[0]
print(json.dumps({
    'instance_id': sample['instance_id'],
    'num_traces': len(sample['traces']),
    'confidences': sample['confidences'][:3],
    'final_prediction': sample['final_prediction'],
    'refusal_rate': sample['refusal_rate'],
    'confidence_category': sample['confidence_category']
}, indent=2))

## Load Ground Truth (For Metrics)

In [None]:
# Download ToxicChat test set for ground truth labels
# (Small download, just labels)
from datasets import load_dataset

dataset = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="test")

ground_truth = {}
for idx, item in enumerate(dataset):
    instance_id = f"toxicchat_test_{idx}"
    # toxicity=1 means toxic/unsafe, 0 means safe
    gt_label = 'unsafe' if item['toxicity'] == 1 else 'safe'
    ground_truth[instance_id] = gt_label

print(f"✓ Loaded {len(ground_truth):,} ground truth labels")
print(f"  Toxic instances: {sum(1 for v in ground_truth.values() if v == 'unsafe')}")
print(f"  Safe instances: {sum(1 for v in ground_truth.values() if v == 'safe')}")
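
Because the accuracy computation looks up labels with `ground_truth.get(...)`, any prediction whose `instance_id` is missing from `ground_truth` is silently counted as incorrect. A quick sanity check along the following lines can catch an ID mismatch early; the helper name `find_unmatched_ids` and the toy data are illustrative only.

```python
def find_unmatched_ids(predictions, ground_truth):
    """Return instance IDs that appear in predictions but have no label."""
    return [p['instance_id'] for p in predictions
            if p['instance_id'] not in ground_truth]


# Toy data: the second prediction has no matching ground truth entry.
toy_predictions = [{'instance_id': 'toxicchat_test_0'},
                   {'instance_id': 'toxicchat_test_7'}]
toy_ground_truth = {'toxicchat_test_0': 'safe'}
print(find_unmatched_ids(toy_predictions, toy_ground_truth))  # ['toxicchat_test_7']
```

In the notebook, `find_unmatched_ids(predictions, ground_truth)` should return an empty list before the accuracy cell is run.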

## Compute Overall Accuracy

In [None]:
correct = sum(1 for p in predictions if p['final_prediction'] == ground_truth.get(p['instance_id']))
total = len(predictions)
accuracy = correct / total

print("="*60)
print("OVERALL ACCURACY")
print("="*60)
print(f"Total predictions: {total:,}")
print(f"Correct: {correct:,}")
print(f"Accuracy: {accuracy:.1%}")
print()
print("Note: This is with the majority voting fix (+12.1 percentage points)")
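
Accuracy alone can be misleading when one class dominates, since always predicting the majority label already scores well. A minimal sketch of precision, recall, and F1 for the `'unsafe'` class follows, using plain Python and the same `'safe'`/`'unsafe'` label strings as above. The function name `unsafe_class_metrics` and the toy pairs are hypothetical.

```python
def unsafe_class_metrics(pairs):
    """pairs: iterable of (predicted_label, true_label) tuples."""
    pairs = list(pairs)
    tp = sum(1 for pred, true in pairs if pred == 'unsafe' and true == 'unsafe')
    fp = sum(1 for pred, true in pairs if pred == 'unsafe' and true == 'safe')
    fn = sum(1 for pred, true in pairs if pred == 'safe' and true == 'unsafe')
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'precision': precision, 'recall': recall, 'f1': f1}


# Toy pairs: one true positive, one false positive, one false negative.
toy_pairs = [('unsafe', 'unsafe'), ('unsafe', 'safe'), ('safe', 'unsafe'),
             ('safe', 'safe'), ('safe', 'safe')]
print(unsafe_class_metrics(toy_pairs))
```

For the real data, the pairs could be built as `[(p['final_prediction'], ground_truth.get(p['instance_id'])) for p in predictions]`.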

## Extract Confidence Statistics

In [None]:
# Analyze confidences by correctness and toxicity
correct_confs = []
incorrect_confs = []
safe_confs = []
toxic_confs = []

for pred in predictions:
    is_correct = pred['final_prediction'] == ground_truth.get(pred['instance_id'])
    gt_label = ground_truth.get(pred['instance_id'])
    confs = pred['confidences']
    
    if is_correct:
        correct_confs.extend(confs)
    else:
        incorrect_confs.extend(confs)
    
    if gt_label == 'safe':
        safe_confs.extend(confs)
    elif gt_label == 'unsafe':
        toxic_confs.extend(confs)

print("="*60)
print("CONFIDENCE BY CORRECTNESS")
print("="*60)
print(f"Correct predictions:   mean={np.mean(correct_confs):.3f}, std={np.std(correct_confs):.3f}")
print(f"Incorrect predictions: mean={np.mean(incorrect_confs):.3f}, std={np.std(incorrect_confs):.3f}")
if np.mean(incorrect_confs) > np.mean(correct_confs):
    print("\n⚠️  INCORRECT predictions have HIGHER confidence!")
    print("   This is the OPPOSITE of what we'd expect.")
print()

print("="*60)
print("CONFIDENCE BY GROUND TRUTH TOXICITY")
print("="*60)
print(f"Safe prompts:  mean={np.mean(safe_confs):.3f}, std={np.std(safe_confs):.3f}, n={len(safe_confs):,}")
print(f"Toxic prompts: mean={np.mean(toxic_confs):.3f}, std={np.std(toxic_confs):.3f}, n={len(toxic_confs):,}")
if np.mean(toxic_confs) > np.mean(safe_confs):
    print("\n⚠️  TOXIC prompts get HIGHER confidence responses!")
    print("   High confidence does NOT mean safe.")
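
Comparing means alone does not show how large a gap is relative to the spread of the data. One simple option is Cohen's d, the difference in means divided by the pooled standard deviation. The sketch below uses only the standard library; `cohens_d` is a hypothetical helper, and because traces from the same prompt are not independent, the result is best read as a descriptive effect size rather than a significance test.

```python
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Toy values: means differ by 1.0 and both groups have a standard deviation of 2.0.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(round(d, 3))  # 0.5
```

Applied to the lists above, `cohens_d(incorrect_confs, correct_confs)` and `cohens_d(toxic_confs, safe_confs)` would put the two comparisons on a common scale.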

---

# Visualizations

## 1. Confidence by Correctness

In [None]:
display(Image(filename=str(PLOTS_DIR / 'confidence_by_correctness.png')))

## 2. Confidence by Refusal Category

In [None]:
display(Image(filename=str(PLOTS_DIR / 'confidence_by_category.png')))

## 3. Confidence by Ground Truth Toxicity

In [None]:
display(Image(filename=str(PLOTS_DIR / 'confidence_by_toxicity.png')))

## 4. Trace Evolution

In [None]:
display(Image(filename=str(PLOTS_DIR / 'trace_evolution.png')))

---

# Percentile Sweep Results

In [None]:
# [Include all the percentile sweep code from the viewer notebook]
# ... (same as before)

---

# Key Findings & Conclusions

[Include all the summary content from the viewer notebook]
...