Motivation
DeepMind shipped AI co-clinician (https://deepmind.google/blog/ai-co-clinician/) on 2026-04-30 with a dual-agent (Planner monitors Talker) safety architecture. The Planner is doing what activation-level probes do — gating an agent based on a monitored signal. The DeepMind work is closed-weight and research-only.
FabricationGuard L31 (https://huggingface.co/openinterp/fabricationguard-qwen36-27b-l31-v2) is the open analogue at the residual-stream level. It was trained on HaluEval and reaches AUROC 0.88 when tested cross-task on SimpleQA. Its transfer to medical data has not been measured.
Why this matters
- CZI Biohub committed $500M to AI-biology on 2026-04-29
- EU AI Act Article 14 enforcement on high-risk medical AI starts Aug 2026
- DeepMind's post admits AI underperforms physicians at red-flag identification, which is rare-class detection under distribution shift: a paradigmatic linear-probe use case
- /probebench/applications/medical-ai page on the site needs measured numbers, not just methodology
Acceptance criteria
Run FabricationGuard L31 (no recalibration, then with recalibration) on:
- MedQA (USMLE multiple choice) — adapt fabrication detection to MC by scoring residual at end-of-question
- PubMedQA (yes/no/maybe with biomedical abstract) — open-ended factual
- HaluEval-medical subset if available, else HaluEval-QA filtered for medical entities
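The multiple-choice adaptation for MedQA (scoring the residual at end-of-question) might look roughly like the sketch below. It assumes the probe is a logistic linear readout over a single residual vector; the names `last_token_residual`, `probe_score`, `w`, and `b` are hypothetical, and the actual FabricationGuard checkpoint format may differ. Extracting the residuals from Qwen3.6-27B is left out and represented here as a plain list of per-token vectors.

```python
import math
from typing import Sequence


def last_token_residual(residuals: Sequence[Sequence[float]]) -> Sequence[float]:
    """Return the residual vector at the final token of the question.

    `residuals` is assumed to hold one vector per prompt token, taken from
    the probed layer during a single forward pass.
    """
    if not residuals:
        raise ValueError("expected at least one token position")
    return residuals[-1]


def probe_score(h: Sequence[float], w: Sequence[float], b: float) -> float:
    """Logistic linear probe: sigmoid(w . h + b). Hypothetical probe form."""
    if len(h) != len(w):
        raise ValueError("residual and probe weight dimensions differ")
    logit = sum(wi * hi for wi, hi in zip(w, h)) + b
    # Numerically stable sigmoid for large positive or negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Toy usage: two token positions, a 2-dimensional "residual".
toy_residuals = [[0.0, 0.0], [1.0, 0.0]]
toy_score = probe_score(last_token_residual(toy_residuals), w=[1.0, -1.0], b=0.0)
```

Scoring only the last question token keeps the cost at one forward pass per prompt, with no generation step.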
For each dataset, report AUROC, the per-class confusion matrix, N, and a threshold sweep. Compare against the baseline (Qwen3.6 logprob).
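The reporting requirements can be sketched with a few standard-library helpers. This is a minimal illustration, not the project's eval harness: the function names are hypothetical, labels are assumed to be 1 for fabricated and 0 for faithful, and scores are assumed to be higher for more likely fabrications.

```python
from typing import Dict, List, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC via the Mann-Whitney U statistic, with average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one example of each class")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def confusion(scores: Sequence[float], labels: Sequence[int],
              threshold: float) -> Dict[str, int]:
    """Confusion counts when predicting positive for score >= threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for s, y in zip(scores, labels):
        predicted_positive = s >= threshold
        if predicted_positive:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts


def threshold_sweep(scores: Sequence[float], labels: Sequence[int],
                    thresholds: Sequence[float]) -> List[Tuple[float, Dict[str, int]]]:
    """Confusion counts at each candidate threshold."""
    return [(t, confusion(scores, labels, t)) for t in thresholds]


# Toy data only, not measured results.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auroc = auroc(toy_scores, toy_labels)
toy_sweep = threshold_sweep(toy_scores, toy_labels, [0.25, 0.5, 0.75])
```

The same functions can be run on the logprob baseline scores, so the probe and the baseline are compared on identical examples and thresholds.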
Compute: 1 Colab Pro session, ~2-3h. Single Qwen3.6-27B forward pass per prompt.
Output
When eval completes, add a new Atlas entry (type: probe-result) at https://github.com/OpenInterpretability/registry/atlas/2026/ documenting the measured transfer numbers. Then update /probebench/applications/medical-ai with the real cross-domain AUROC table.
If AUROC drops below 0.65 on medical: publish as honest-negative finding (atlas-entry type, "cross-domain hallucination probes don't generalize to medical without recalibration") — still high value for the field.
Link to related work