# ToxicChat Confidence Distribution Analysis

**Dataset:** ToxicChat test set (5,083 instances, ~48K traces)  
**Model:** Qwen3-0.6B  
**Experiment:** DeepConf with majority voting fix

---

## Executive Summary

This notebook presents a comprehensive analysis of confidence distributions in the ToxicChat safety evaluation experiment.

### Critical Findings

1. **Confidence is INVERSELY related to correctness** - Incorrect predictions have higher confidence (0.640) than correct ones (0.510)
2. **Higher confidence on toxic prompts** - Model is more confident when responding to toxic content (0.679 vs 0.512)
3. **Uncertain compliance is better** - 95.7% accuracy vs 88.4% for confident compliance
4. **Lower percentiles work better** - 20th percentile: 83% accuracy with 70% token savings

## Setup and Imports

In [None]:
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from IPython.display import Image, display, Markdown

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

print("‚úì Imports successful")

## Load Data

In [None]:
# If running in Colab, uncomment and run this to mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# If running locally, use relative paths
results_dir = Path('results/toxicchat_qwen06b_1000_vllm_reclassified')
plots_dir = Path('plots')

# Load predictions
predictions_path = results_dir / 'predictions.jsonl'
predictions = []
with predictions_path.open('r') as f:
    for line in f:
        predictions.append(json.loads(line))

# Load metrics
metrics_path = results_dir / 'reclassification_metrics.json'
with metrics_path.open('r') as f:
    metrics = json.load(f)

print(f"‚úì Loaded {len(predictions)} predictions")
print(f"‚úì Overall accuracy: {metrics['overall_accuracy']:.1%}")

## Key Metrics Summary

In [None]:
# Overall metrics
print("="*60)
print("OVERALL METRICS")
print("="*60)
print(f"Total Predictions:  {metrics['total_predictions']:,}")
print(f"Correct:            {metrics['correct_predictions']:,}")
print(f"Accuracy:           {metrics['overall_accuracy']:.1%}")
print()

# Category breakdown
print("="*60)
print("CATEGORY BREAKDOWN")
print("="*60)
for cat, data in metrics['categories'].items():
    print(f"\n{cat.replace('_', ' ').title()}:")
    print(f"  Count:      {data['count']:,}")
    print(f"  Correct:    {data['correct']:,}")
    print(f"  Accuracy:   {data['accuracy']:.1%}")

## Extract Confidence Statistics

In [None]:
# Load ground truth for correctness analysis
import sys
sys.path.append('src')
from benchmark_loaders import ToxicChatLoader

instances = ToxicChatLoader.load('data/toxicchat', split='test')
ground_truth = {inst.id: inst.ground_truth_label for inst in instances}

# Analyze confidences
correct_confs = []
incorrect_confs = []
safe_confs = []
toxic_confs = []

for pred in predictions:
    is_correct = pred['final_prediction'] == ground_truth.get(pred['instance_id'])
    gt_label = ground_truth.get(pred['instance_id'])
    confs = pred['confidences']
    
    if is_correct:
        correct_confs.extend(confs)
    else:
        incorrect_confs.extend(confs)
    
    if gt_label == 'safe':
        safe_confs.extend(confs)
    elif gt_label == 'unsafe':
        toxic_confs.extend(confs)

print("="*60)
print("CONFIDENCE BY CORRECTNESS")
print("="*60)
print(f"Correct predictions:   mean={np.mean(correct_confs):.3f}, std={np.std(correct_confs):.3f}")
print(f"Incorrect predictions: mean={np.mean(incorrect_confs):.3f}, std={np.std(incorrect_confs):.3f}")
print(f"\n‚ö†Ô∏è  INCORRECT predictions have HIGHER confidence!")
print()

print("="*60)
print("CONFIDENCE BY GROUND TRUTH TOXICITY")
print("="*60)
print(f"Safe prompts:  mean={np.mean(safe_confs):.3f}, std={np.std(safe_confs):.3f}, n={len(safe_confs)}")
print(f"Toxic prompts: mean={np.mean(toxic_confs):.3f}, std={np.std(toxic_confs):.3f}, n={len(toxic_confs)}")
print(f"\n‚ö†Ô∏è  TOXIC prompts get HIGHER confidence responses!")

---

# Visualizations

## 1. Confidence by Correctness

**Key Finding:** Incorrect predictions have **higher** confidence than correct ones!

- Incorrect: mean = 0.640
- Correct: mean = 0.510

This is the **opposite** of what we'd expect. The model is overconfident when wrong.

In [None]:
display(Image(filename='plots/confidence_by_correctness.png'))

---

## 2. Confidence by Refusal Category

**Key Findings:**

1. **Uncertain Compliance (14.2% of traces)** - Mean: 0.197, Accuracy: **95.7%**
2. **Confident Compliance (85.7% of traces)** - Mean: 0.577, Accuracy: **88.4%**
3. **Refusals are extremely rare** - Only 102 refusals out of 48,145 traces (0.2%)

**Implication:** Low-confidence compliance is actually MORE accurate!

In [None]:
display(Image(filename='plots/confidence_by_category.png'))

---

## 3. Confidence by Ground Truth Toxicity

**Key Finding:** The model has **higher confidence** when responding to toxic prompts!

- Safe prompts: mean = 0.512
- Toxic prompts: mean = 0.679

**Implication:** Confidence doesn't signal safety. High confidence may indicate the model is responding to something it shouldn't.

In [None]:
display(Image(filename='plots/confidence_by_toxicity.png'))

---

## 4. Trace Evolution Across Multiple Generations

**Key Finding:** Confidence stays **remarkably stable** across all 10 traces.

**Implications:**
- Additional traces don't add new information
- Early stopping at 3 traces (20th percentile) is sufficient
- More traces = more opportunities for high-confidence errors

**Note:** Only 4 instances where model correctly handled toxic prompts (out of 362 toxic prompts)!

In [None]:
display(Image(filename='plots/trace_evolution.png'))

---

# Percentile Sweep Results

Analysis of different confidence percentile thresholds for early stopping.

In [None]:
# Percentile sweep data from ANALYSIS_RESULTS.md
percentile_data = {
    'Percentile': [20, 30, 40, 50, 60, 70, 80, 90],
    'Accuracy': [83.04, 83.04, 82.29, 81.21, 80.05, 78.30, 77.75, 77.59],
    'Avg Traces': [3.00, 3.00, 4.02, 5.28, 6.29, 8.04, 8.70, 9.07],
    'Token Savings': [70.0, 70.0, 60.1, 47.0, 36.4, 19.4, 12.8, 9.1],
    'Cost Reduction': ['7.0x', '7.0x', '5.6x', '4.2x', '3.1x', '1.9x', '1.6x', '1.4x']
}

import pandas as pd
df = pd.DataFrame(percentile_data)
display(df.style.highlight_max(subset=['Accuracy', 'Token Savings'], color='lightgreen')
                 .highlight_min(subset=['Avg Traces'], color='lightgreen'))

In [None]:
# Plot accuracy vs efficiency trade-off
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy vs Percentile
axes[0].plot(percentile_data['Percentile'], percentile_data['Accuracy'], 
             marker='o', linewidth=2, markersize=8, color='blue')
axes[0].axhline(y=83.04, color='green', linestyle='--', alpha=0.5, label='Best (20-30th)')
axes[0].axhline(y=77.59, color='red', linestyle='--', alpha=0.5, label='Default (90th)')
axes[0].set_xlabel('Confidence Percentile Threshold')
axes[0].set_ylabel('Accuracy (%)')
axes[0].set_title('Accuracy vs Percentile Threshold')
axes[0].grid(alpha=0.3)
axes[0].legend()

# Token Savings vs Percentile
axes[1].plot(percentile_data['Percentile'], percentile_data['Token Savings'], 
             marker='s', linewidth=2, markersize=8, color='green')
axes[1].fill_between(percentile_data['Percentile'], percentile_data['Token Savings'], 
                      alpha=0.3, color='green')
axes[1].set_xlabel('Confidence Percentile Threshold')
axes[1].set_ylabel('Token Savings (%)')
axes[1].set_title('Token Savings vs Percentile Threshold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n‚úì Lower percentiles are BETTER: higher accuracy AND lower cost!")

---

# Why Lower Percentiles Work Better

## The "Overthinking Penalty"

**20th percentile (3 traces):**
- Accuracy: 83.04%
- Cost: 7x cheaper

**90th percentile (9 traces):**
- Accuracy: 77.59%
- Cost: 1.4x cheaper

**Why this happens:**

1. **Incorrect predictions have higher confidence (0.640)**
   - More traces = more high-confidence wrong answers
   - Majority voting can still be swayed

2. **Confidence stays stable across traces**
   - Additional traces don't add new information
   - Just add noise and false positives

3. **Low-confidence traces are actually good**
   - Uncertain compliance: 95.7% accurate
   - Confident compliance: 88.4% accurate
   - Early stopping preserves good low-confidence predictions

---

# Hypothesis Validation

## Original Hypothesis
> "Low-confidence non-refusals are riskier and should trigger additional scrutiny"

## Reality from Data

‚úÖ **Low-confidence compliance has 95.7% accuracy**  
‚ùå **High-confidence compliance has 88.4% accuracy**  
‚ùå **Low confidence actually signals MORE careful, accurate responses**  

## Revised Understanding

**Confidence is an efficiency signal, NOT a safety signal:**

‚úÖ **For efficiency:** Low confidence allows early stopping without hurting accuracy  
‚ùå **For safety:** Confidence doesn't predict toxicity or harmfulness  
‚ùå **Counterintuitive:** Higher confidence often means more errors  

## What Confidence CAN'T Do

- ‚ùå Predict whether a response is harmful
- ‚ùå Identify when the model is wrong (inverse relationship!)
- ‚ùå Signal that a toxic prompt should be refused

## What Confidence CAN Do

- ‚úÖ Enable early stopping for efficiency (3 traces vs 10)
- ‚úÖ Reduce costs by 70% without hurting accuracy
- ‚úÖ Identify when the model is uncertain (but uncertainty = better accuracy!)

---

# The Refusal Problem

**Critical Issue:** Qwen3-0.6B barely refuses toxic prompts.

In [None]:
# Calculate refusal statistics
total_traces = sum(len(p['confidences']) for p in predictions)
total_refusals = metrics['categories']['confident_refusal']['count'] + \
                 metrics['categories']['uncertain_refusal']['count']

toxic_instances = sum(1 for inst in instances if inst.ground_truth_label == 'unsafe')
toxic_traces_expected = toxic_instances * 10  # Each instance has 10 traces

print("="*60)
print("REFUSAL ANALYSIS")
print("="*60)
print(f"Total traces:              {total_traces:,}")
print(f"Total refusals:            {total_refusals} ({total_refusals/total_traces*100:.2f}%)")
print()
print(f"Toxic prompts in dataset:  {toxic_instances} ({toxic_instances/len(instances)*100:.1f}%)")
print(f"Expected toxic traces:     {toxic_traces_expected:,}")
print(f"Actual refusals:           {total_refusals}")
print()
print(f"‚ö†Ô∏è  Refusal rate on toxic content: {total_refusals/toxic_traces_expected*100:.1f}%")
print(f"‚ö†Ô∏è  Toxic responses leaked:        {(toxic_traces_expected - total_refusals)/toxic_traces_expected*100:.1f}%")
print()
print("Conclusion: Confidence-based early stopping won't help with safety.")
print("           Need WildGuard or similar classifier for proper toxicity detection.")

---

# Recommendations

## 1. Use 20-30th Percentile for Efficiency ‚úÖ

- **5.5% better accuracy** (83% vs 77.6%)
- **70% token savings** (3 traces vs 9)
- **7x cost reduction**
- Validated finding despite bug fix

## 2. Don't Use Confidence for Safety ‚ùå

- Confidence is inversely related to correctness
- High confidence on toxic prompts is common
- Need separate safety classifier (WildGuard)

## 3. Early Stopping is About Speed, Not Safety

**Frame as:**
- "Reduce inference cost without hurting performance"
- "Avoid overthinking penalty with efficient early stopping"

**Don't claim:**
- "Identify risky predictions to filter out"
- "Use confidence to improve safety"

The data shows these are opposite goals!

## 4. Next Steps for Safety

- Run WildGuard classifier on these results
- Compare WildGuard's confidence to Qwen's confidence
- Test on WildGuardMix with gold-standard refusal labels
- Statistical hypothesis testing (Chi-square, effect sizes)

---

# Cost Analysis

In [None]:
# Cost calculations
num_instances = 5083
gpu_rate = 1.29  # $/hour

# Original approach (90th percentile)
orig_traces = 9.07
orig_hours = 4.0
orig_cost = orig_hours * gpu_rate

# Optimal approach (20th percentile)
opt_traces = 3.00
opt_hours = 1.3
opt_cost = opt_hours * gpu_rate

savings = orig_cost - opt_cost
savings_pct = (savings / orig_cost) * 100

print("="*60)
print(f"COST ANALYSIS FOR {num_instances:,} INSTANCES")
print("="*60)
print()
print("Original approach (90th percentile):")
print(f"  Avg traces:  {orig_traces:.2f}")
print(f"  GPU time:    {orig_hours:.1f} hours")
print(f"  Cost:        ${orig_cost:.2f}")
print(f"  Accuracy:    77.59%")
print()
print("Optimal approach (20th percentile):")
print(f"  Avg traces:  {opt_traces:.2f}")
print(f"  GPU time:    {opt_hours:.1f} hours")
print(f"  Cost:        ${opt_cost:.2f}")
print(f"  Accuracy:    83.04%")
print()
print(f"üí∞ Savings:    ${savings:.2f} ({savings_pct:.0f}% cheaper)")
print(f"üìà Accuracy:   +5.45 percentage points better!")
print()
print("At scale (1M queries/month):")
monthly_orig = (orig_cost / num_instances) * 1_000_000
monthly_opt = (opt_cost / num_instances) * 1_000_000
annual_savings = (monthly_orig - monthly_opt) * 12
print(f"  Original:  ${monthly_orig:,.0f}/month")
print(f"  Optimal:   ${monthly_opt:,.0f}/month")
print(f"  üí∞ Annual savings: ${annual_savings:,.0f}")

---

# Conclusion

## Key Takeaways

1. **Confidence predicts the opposite of what we expected**
   - Higher confidence ‚Üí More errors
   - Toxic prompts ‚Üí Higher confidence
   - Uncertain compliance ‚Üí Best accuracy (95.7%)

2. **Lower percentiles are objectively better**
   - 20-30th: 83% accuracy, 70% savings
   - 90th: 77.6% accuracy, 9% savings
   - The "overthinking penalty" is real

3. **Confidence is for efficiency, not safety**
   - Use for early stopping to reduce costs
   - Don't use for safety filtering
   - Need WildGuard for actual toxicity detection

4. **Qwen3-0.6B doesn't refuse enough**
   - Only 2.8% refusal rate on toxic content
   - 97.2% of toxic prompts got responses
   - Confidence alone won't fix this

## Next Steps

- ‚úÖ Bug fixed and validated (majority voting)
- ‚úÖ Percentile sweep complete
- ‚úÖ Comprehensive visualizations created
- üîÑ Run WildGuard classifier for better safety detection
- üîÑ Test on WildGuardMix with gold-standard labels
- üìä Statistical hypothesis testing
- üìù Write up findings for publication

---

**Generated:** November 23, 2024  
**Experiment ID:** toxicchat_qwen06b_1000_vllm_reclassified  
**Branch:** claude/deepconf-adapter-implementation-01Ej26SYeEfWTu9weEEYSU2a