# ToxicChat Confidence Analysis - Results Viewer

**Dataset:** ToxicChat (5,083 instances, 48,145 traces)  
**Model:** Qwen3-0.6B  
**Date:** November 23, 2024

This notebook displays pre-generated visualizations and analysis results.  
**No data files needed** - just upload the 4 PNG files from `plots/` directory.

---

## Setup

### If running in Colab:

1. Upload these 4 plot files to Colab:
   - `confidence_by_correctness.png`
   - `confidence_by_category.png`
   - `confidence_by_toxicity.png`
   - `trace_evolution.png`

2. Run the cells below

In [None]:
from IPython.display import Image, display, Markdown
import matplotlib.pyplot as plt
import pandas as pd

# If running in Colab, files should be in /content/
# If running locally, adjust path as needed
import os
plot_dir = '/content/' if os.path.exists('/content/') else 'plots/'

print("‚úì Setup complete")
print(f"Looking for plots in: {plot_dir}")

---

# Executive Summary

## Critical Findings

### 1. Confidence is INVERSELY Related to Correctness üö®

| Prediction Type | Mean Confidence | Standard Deviation |
|----------------|-----------------|-------------------|
| **Incorrect** | **0.640** | 0.223 |
| **Correct** | **0.510** | 0.215 |

**The model is overconfident when it's wrong!**

---

### 2. Higher Confidence on Toxic Prompts ‚ö†Ô∏è

| Ground Truth | Mean Confidence | Num Traces |
|--------------|-----------------|------------|
| **Toxic Prompts** | **0.679** | 3,414 |
| **Safe Prompts** | **0.512** | 44,731 |

**The model is MORE confident when responding to toxic content!**

---

### 3. Uncertain Compliance is Actually Better ‚úÖ

| Category | Count | Accuracy | Mean Confidence |
|----------|-------|----------|----------------|
| **Uncertain Compliance** | **734** | **95.8%** | **0.197** |
| Confident Compliance | 4,338 | 88.4% | 0.577 |
| Confident Refusal | 9 | 66.7% | 0.745 |
| Uncertain Refusal | 2 | 0.0% | 0.116 |

**Low-confidence compliance has the BEST accuracy!**

---

### 4. Refusal Rate is Abysmal üî¥

- Total refusals: **11 out of 5,083 instances** (0.2%)
- Toxic prompts in dataset: **362** (7.1%)
- **Refusal rate on toxic content: 3.0%**
- **97% of toxic prompts got responses!**

---

# Percentile Sweep Results

## Lower Percentiles are BETTER!

After fixing the `any()` bug with majority voting:

In [None]:
# Percentile sweep data
data = {
    'Percentile': [20, 30, 40, 50, 60, 70, 80, 90],
    'Accuracy (%)': [83.04, 83.04, 82.29, 81.21, 80.05, 78.30, 77.75, 77.59],
    'Avg Traces': [3.00, 3.00, 4.02, 5.28, 6.29, 8.04, 8.70, 9.07],
    'Token Savings (%)': [70.0, 70.0, 60.1, 47.0, 36.4, 19.4, 12.8, 9.1],
    'Cost Reduction': ['7.0x', '7.0x', '5.6x', '4.2x', '3.1x', '1.9x', '1.6x', '1.4x']
}

df = pd.DataFrame(data)

# Style the dataframe
styled_df = df.style.apply(
    lambda x: ['background-color: lightgreen' if i < 2 else '' for i in range(len(x))],
    axis=0
)

display(df)

print("\n‚úÖ Best: 20-30th percentile")
print("   - 83.04% accuracy (+5.5% vs 90th)")
print("   - 3 traces (vs 9.07)")
print("   - 70% token savings (7x cheaper)")
print("\n‚ùå Worst: 90th percentile (original default)")
print("   - 77.59% accuracy")
print("   - 9.07 traces")
print("   - 9.1% token savings (1.4x cheaper)")

In [None]:
# Visualize accuracy vs efficiency trade-off
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy vs Percentile
axes[0].plot(data['Percentile'], data['Accuracy (%)'], 
             marker='o', linewidth=2.5, markersize=10, color='#2E86AB')
axes[0].axhline(y=83.04, color='green', linestyle='--', linewidth=2, alpha=0.7, label='Best (20-30th)')
axes[0].axhline(y=77.59, color='red', linestyle='--', linewidth=2, alpha=0.7, label='Default (90th)')
axes[0].fill_between(data['Percentile'], 77.59, 83.04, alpha=0.2, color='green')
axes[0].set_xlabel('Confidence Percentile Threshold', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Accuracy (%)', fontsize=12, fontweight='bold')
axes[0].set_title('Accuracy vs Percentile\n(Higher percentile = WORSE accuracy)', 
                  fontsize=13, fontweight='bold')
axes[0].grid(alpha=0.3)
axes[0].legend(fontsize=11)
axes[0].set_ylim([76, 84])

# Token Savings vs Percentile
axes[1].plot(data['Percentile'], data['Token Savings (%)'], 
             marker='s', linewidth=2.5, markersize=10, color='#06A77D')
axes[1].fill_between(data['Percentile'], data['Token Savings (%)'], 
                      alpha=0.3, color='#06A77D')
axes[1].set_xlabel('Confidence Percentile Threshold', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Token Savings (%)', fontsize=12, fontweight='bold')
axes[1].set_title('Token Savings vs Percentile\n(Lower percentile = MORE savings)', 
                  fontsize=13, fontweight='bold')
axes[1].grid(alpha=0.3)
axes[1].set_ylim([0, 75])

plt.tight_layout()
plt.show()

display(Markdown("""
### The "Overthinking Penalty"

**Why more traces = worse accuracy:**
1. Incorrect predictions have **higher confidence** (0.640)
2. More traces = more chances for high-confidence wrong answers
3. Majority voting can still be swayed by confident errors
4. Confidence stays stable across traces (no new information)
"""))

---

# Visualization 1: Confidence by Correctness

**Key Finding:** Incorrect predictions have HIGHER confidence!

This matches the DeepConf paper style but shows the **opposite** pattern:
- DeepConf: Correct predictions had higher confidence ‚úÖ
- Our results: Incorrect predictions have higher confidence ‚ùå

In [None]:
display(Image(filename=f'{plot_dir}confidence_by_correctness.png'))

**Statistics:**
- Correct predictions: mean=0.510, std=0.215
- Incorrect predictions: mean=0.640, std=0.223

**Implication:** High confidence is a **warning sign**, not a quality signal!

---

# Visualization 2: Confidence by Refusal Category

**Key Finding:** Uncertain compliance has the BEST accuracy (95.8%)!

Four categories based on confidence threshold (0.5) and refusal:

In [None]:
display(Image(filename=f'{plot_dir}confidence_by_category.png'))

**Category Statistics:**

| Category | Mean Conf | Count (traces) | % of Total | Accuracy |
|----------|-----------|----------------|------------|----------|
| Confident Refusal | 0.745 | 86 | 0.2% | 66.7% |
| **Confident Compliance** | **0.577** | **41,228** | **85.7%** | **88.4%** |
| **Uncertain Compliance** | **0.197** | **6,815** | **14.2%** | **95.8%** |
| Uncertain Refusal | 0.116 | 16 | 0.03% | 0.0% |

**Critical Insight:** When the model is uncertain, it makes FEWER mistakes!

---

# Visualization 3: Confidence by Ground Truth Toxicity

**Key Finding:** Model has HIGHER confidence when responding to toxic prompts!

This is dangerous - the model doesn't "know" when it should refuse.

In [None]:
display(Image(filename=f'{plot_dir}confidence_by_toxicity.png'))

**Statistics:**
- Safe prompts: mean=0.512, std=0.213 (44,731 traces)
- Toxic prompts: mean=0.679, std=0.240 (3,414 traces)

**Implication:** High confidence does NOT mean safe. In fact, it may indicate the model is confidently responding to something harmful!

---

# Visualization 4: Trace Evolution

**Key Finding:** Confidence stays STABLE across all 10 traces.

Shows 100 sampled instances broken down by correctness √ó toxicity:

In [None]:
display(Image(filename=f'{plot_dir}trace_evolution.png'))

**Pattern Analysis:**
- **Correct on Safe:** Flat ~0.50 (stable, many instances)
- **Correct on Toxic:** Slight increase 0.70 ‚Üí 0.85 (only 4 instances!)
- **Incorrect on Safe:** Flat ~0.45 (stable, few instances)
- **Incorrect on Toxic:** Flat ~0.80 (high confidence, wrong)

**Implication:**
- Additional traces don't change confidence much
- Early stopping at 3 traces is sufficient
- More traces just add noise and false positives

---

# Cost-Benefit Analysis

In [None]:
# Cost calculations
num_instances = 5083
gpu_rate = 1.29  # $/hour (Lambda Labs)

approaches = pd.DataFrame({
    'Approach': ['Original (90th)', 'Optimal (20th)', 'Improvement'],
    'Accuracy': ['77.59%', '83.04%', '+5.45%'],
    'Avg Traces': [9.07, 3.00, '-67%'],
    'GPU Hours': [4.0, 1.3, '-67%'],
    'Cost (5K instances)': ['$5.16', '$1.68', '-$3.48'],
    'Cost (1M/month)': ['$1,014', '$330', '-$684/month']
})

display(approaches)

display(Markdown("""
### Annual Savings at Scale

For 1M queries/month:
- **Original cost:** $1,014/month √ó 12 = $12,168/year
- **Optimal cost:** $330/month √ó 12 = $3,960/year
- **üí∞ Savings: $8,208/year** 

**And you get 5.5% BETTER accuracy!**
"""))

---

# Hypothesis Validation

## Original Hypothesis ‚ùå
> "Low-confidence non-refusals are riskier and should trigger additional scrutiny"

## What We Found ‚úÖ

1. **Low-confidence compliance has 95.8% accuracy** (BEST category)
2. **High-confidence compliance has 88.4% accuracy** (MORE errors)
3. **Low confidence signals careful, accurate responses**
4. **High confidence often means the model is wrong**

## Revised Understanding

**Confidence is an EFFICIENCY signal, NOT a safety signal:**

### What Confidence CAN Do ‚úÖ
- Enable early stopping (3 traces vs 10)
- Reduce costs by 70%
- Improve accuracy by avoiding "overthinking penalty"

### What Confidence CANNOT Do ‚ùå
- Predict whether a response is harmful
- Identify when the model is wrong (inverse relationship!)
- Signal that a toxic prompt should be refused

### For Actual Safety
- Need WildGuard or similar 7B classifier
- Keyword detection misses subtle toxicity
- Qwen3-0.6B refuses <3% of toxic content
- Confidence alone won't fix this

---

# Recommendations

## 1. Use 20-30th Percentile (Not 90th) ‚úÖ

**Benefits:**
- 5.5% better accuracy (83% vs 77.6%)
- 70% token savings (3 traces vs 9)
- 7x cost reduction
- Avoids "overthinking penalty"

## 2. The Bug MUST Be Fixed üêõ

**Original code (line 346):**
```python
final_prediction = 'unsafe' if any(self.is_unsafe_content(t) for t in traces) else 'safe'
```

**Fixed code:**
```python
unsafe_count = sum(1 for t in traces if self.is_unsafe_content(t))
final_prediction = 'unsafe' if unsafe_count >= len(traces) * 0.5 else 'safe'
```

**Impact:** +12% accuracy improvement

## 3. Reframe the Hypothesis üîÑ

**Old:** "Low-confidence non-refusals are riskier"  
**New:** "Low-confidence enables efficient early stopping"

This is an **efficiency technique**, not a safety technique.

## 4. For Safety, Add WildGuard Classifier üõ°Ô∏è

**Why:**
- Qwen3-0.6B barely refuses (11/5083 instances)
- Need proper 7B safety classifier
- Confidence alone doesn't predict toxicity
- WildGuard can detect subtle refusals and harmfulness

---

# Conclusion

## Key Takeaways

1. **Confidence predicts the opposite of what we expected**
   - ‚ùå Higher confidence ‚Üí More errors
   - ‚ùå Toxic prompts ‚Üí Higher confidence  
   - ‚úÖ Uncertain compliance ‚Üí Best accuracy (95.8%)

2. **Lower percentiles are objectively better**
   - ‚úÖ 20-30th: 83% accuracy, 70% savings, 7x cheaper
   - ‚ùå 90th: 77.6% accuracy, 9% savings, 1.4x cheaper
   - The "overthinking penalty" is real

3. **Confidence is for efficiency, not safety**
   - ‚úÖ Use for early stopping to reduce costs
   - ‚ùå Don't use for safety filtering
   - Need WildGuard for actual toxicity detection

4. **Qwen3-0.6B doesn't refuse enough**
   - Only 3% refusal rate on toxic content
   - 97% of toxic prompts got responses
   - Confidence alone won't fix this

## Next Steps

- ‚úÖ Bug fixed and validated (majority voting)
- ‚úÖ Percentile sweep complete
- ‚úÖ Comprehensive visualizations created
- üîÑ Run WildGuard classifier for better safety detection
- üîÑ Test on WildGuardMix with gold-standard labels
- üìä Statistical hypothesis testing

---

**Generated:** November 23, 2024  
**Experiment:** toxicchat_qwen06b_1000_vllm_reclassified  
**Analysis:** DeepConf Confidence-Based Early Stopping