# Run 2: WildGuardMix + Heuristic Refusal Detection

**DeepConf Safety Evaluation - Viewer Notebook**

---

## Executive Summary

**Dataset:** WildGuardMix (1,725 test instances)
**Model:** Qwen3-0.6B
**Refusal Detection:** Pattern-based heuristics
**Traces per instance:** 10 (max)

### Critical Findings 🚨

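1. **Confidence by Correctness (Minimal Gap)**
   - Incorrect: 0.655 confidence
   - Correct: 0.640 confidence
   - ⚠️ Only ~2% difference (0.015 absolute), and incorrect predictions are slightly more confident than correct ones (vs 25% difference in ToxicChat)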

2. **Toxicity Bias (Minimal)**
   - Toxic prompts: 0.648 confidence
   - Safe prompts: 0.646 confidence
   - ✅ Nearly identical (vs 33% difference in ToxicChat)

3. **Safety Performance**
   - ✅ High sensitivity: 92% of toxic prompts caught
   - ✅ **Higher accuracy than ToxicChat: 41-42%** (vs 9-10%)
   - ❌ Low specificity: only 2% of safe prompts correctly allowed

4. **Optimal Strategy**
   - **Best safety:** 80-90th percentile (92.4% sensitivity, 57 toxic prompts missed)
   - **Best efficiency:** 20th percentile (64.3% token savings)
   - **Trade-off:** Higher percentiles favor safety and lower percentiles favor token savings, similar to ToxicChat

### WildGuardMix vs ToxicChat

- ✅ **Better accuracy** (41% vs 9%)
- ✅ **Less confidence bias** (nearly equal for safe/toxic)
- ✅ **Similar sensitivity** (92% vs 91-94%)
- ❌ **Similarly low specificity** (2% vs 2-4%)

---


## Key Metrics

| Metric | Value |
|--------|-------|
| Total instances | 1,725 |
| Toxic prompts | ~753 (43.7%) |
| Safe prompts | ~972 (56.3%) |
| Total traces analyzed | 16,370 |
| Avg traces per instance | 9.5 |

### Confidence Statistics

| Category | Mean | Std Dev |
|----------|------|--------|
| Correct predictions | 0.640 | 0.165 |
| Incorrect predictions | 0.655 | 0.157 |
| Safe prompts | 0.646 | 0.154 |
| Toxic prompts | 0.648 | 0.170 |

✅ **Nearly equal confidence across categories (minimal bias)**
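To put "nearly equal" on a common scale, the gaps in the table can be expressed as a relative difference and as a standardized effect size. The sketch below is illustrative only: `relative_gap` and `cohens_d_equal_n` are hypothetical helper names, and the pooled standard deviation assumes equal group sizes because per-group trace counts are not reported here.

```python
import math


def relative_gap(a: float, b: float) -> float:
    """Difference between a and b, expressed as a fraction of b."""
    return (a - b) / b


def cohens_d_equal_n(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Cohen's d with an equal-weight pooled SD.

    Assumption: both groups are treated as the same size, since the
    per-group counts are not listed in the table above.
    """
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Incorrect vs correct predictions (means and std devs from the table).
gap = relative_gap(0.655, 0.640)
d = cohens_d_equal_n(0.655, 0.157, 0.640, 0.165)
print(f"relative gap: {gap:.1%}, Cohen's d: {d:.2f}")
# relative gap: 2.3%, Cohen's d: 0.09
```

Under these assumptions the gap between incorrect and correct predictions is under a tenth of a standard deviation, which matches the "minimal" reading above.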

---


## 1. Safety Confusion Matrix

Ground Truth × Model Prediction


In [2]:
from IPython.display import Image, display
display(Image("plots/run2/confusion_matrix_2x2.png"))
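The summary metrics follow directly from the four cells of a 2×2 matrix. The sketch below treats "toxic" as the positive class, which is an assumption about how the matrix is oriented. The counts are approximations reconstructed from the rounded figures in this notebook (about 753 toxic prompts, 57 missed, roughly 2% of about 972 safe prompts allowed); they are not the actual cell values, and `Confusion` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # toxic prompts flagged by the model
    fn: int  # toxic prompts missed
    fp: int  # safe prompts flagged
    tn: int  # safe prompts allowed

    @property
    def sensitivity(self) -> float:
        """Share of toxic prompts that were flagged."""
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        """Share of safe prompts that were allowed."""
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        """Share of all prompts classified correctly."""
        total = self.tp + self.fn + self.fp + self.tn
        return (self.tp + self.tn) / total


# Approximate counts, reconstructed from rounded summary figures (assumption).
cm = Confusion(tp=696, fn=57, fp=953, tn=19)
print(f"{cm.sensitivity:.1%} {cm.specificity:.1%} {cm.accuracy:.1%}")
# 92.4% 2.0% 41.4%
```

With these approximate counts, high sensitivity and very low specificity combine to an accuracy in the reported 41-42% range.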

## 2. Percentile Safety Curves

Trade-offs between accuracy, sensitivity, specificity, and token savings across confidence percentiles


In [3]:
from IPython.display import Image, display
display(Image("plots/run2/percentile_safety_curves.png"))

## 3. Confidence by Correctness

Mean confidence differs little between correct (0.640) and incorrect (0.655) predictions


In [4]:
from IPython.display import Image, display
display(Image("plots/run2/confidence_by_correctness.png"))

## 4. Confidence by Refusal Category

Four categories: `confident_refusal`, `confident_compliance`, `uncertain_refusal`, `uncertain_compliance`


In [5]:
from IPython.display import Image, display
display(Image("plots/run2/confidence_by_category.png"))
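The four category names combine a refusal decision with a confidence level. A minimal sketch of how such labels could be produced is shown below. Everything specific in it is an assumption: the regular expressions are hypothetical stand-ins for the run's actual heuristics, and the 0.65 confidence cutoff is illustrative, not the threshold used in this run.

```python
import re

# Hypothetical refusal patterns; the run's real heuristics are not shown here.
REFUSAL_PATTERNS = [
    re.compile(r"\bi\s+(?:cannot|can't|won't|will not)\b", re.IGNORECASE),
    re.compile(r"\bi(?:'m| am)\s+(?:sorry|unable)\b", re.IGNORECASE),
    re.compile(r"\bi\s+(?:must|have to)\s+decline\b", re.IGNORECASE),
]


def is_refusal(text: str) -> bool:
    """Return True if any refusal pattern matches the response text."""
    return any(p.search(text) for p in REFUSAL_PATTERNS)


def categorize(text: str, confidence: float, threshold: float = 0.65) -> str:
    """Label a trace as {confident,uncertain}_{refusal,compliance}.

    The default threshold is an illustrative value, not taken from the run.
    """
    level = "confident" if confidence >= threshold else "uncertain"
    action = "refusal" if is_refusal(text) else "compliance"
    return f"{level}_{action}"


print(categorize("I'm sorry, but I can't help with that.", 0.81))
print(categorize("Sure, here is a summary.", 0.42))
```

Pattern matching of this kind is sensitive to phrasing (for example, typographic apostrophes are not handled above), which is one reason heuristic refusal detection can misclassify responses.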

## 5. Confidence by Ground Truth Toxicity

✅ Nearly identical mean confidence for safe (0.646) vs toxic (0.648) prompts


In [6]:
from IPython.display import Image, display
display(Image("plots/run2/confidence_by_toxicity.png"))

## 6. Trace Evolution

How confidence changes across the traces of each instance (up to 10 traces, 100 sampled instances)


In [7]:
from IPython.display import Image, display
display(Image("plots/run2/trace_evolution.png"))
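One way to compute a curve like this is to sample instances and average confidence at each trace position, skipping instances that have fewer traces. The sketch below is an assumption about the aggregation, not the code that produced the plot. `mean_confidence_by_trace` and its input layout (one list of per-trace confidences per instance) are hypothetical.

```python
import random
from statistics import fmean


def mean_confidence_by_trace(conf_per_instance, max_traces=10, sample_size=100, seed=0):
    """Mean confidence at each trace position over a random sample of instances.

    conf_per_instance: list of lists, one confidence value per trace.
    Positions that no sampled instance reaches are returned as None.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(sample_size, len(conf_per_instance))
    sampled = rng.sample(conf_per_instance, k)

    means = []
    for pos in range(max_traces):
        # Only instances that actually have a trace at this position contribute.
        values = [traces[pos] for traces in sampled if len(traces) > pos]
        means.append(fmean(values) if values else None)
    return means


# Toy data: two instances with different numbers of traces.
toy = [[0.5, 0.7], [0.7, 0.9, 1.0]]
print(mean_confidence_by_trace(toy, max_traces=4))
```

Averaging only over instances that reach a given position matters here because instances have at most 10 traces and average 9.5, so later positions are based on fewer instances.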