Medical transfer test — FabricationGuard on MedQA + PubMedQA + HaluEval-medical #11

@caiovicentino

Motivation

DeepMind shipped AI co-clinician (https://deepmind.google/blog/ai-co-clinician/) on 2026-04-30 with a dual-agent (Planner monitors Talker) safety architecture. The Planner is doing what activation-level probes do — gating an agent based on a monitored signal. The DeepMind work is closed-weight and research-only.

FabricationGuard L31 (https://huggingface.co/openinterp/fabricationguard-qwen36-27b-l31-v2) is the open analogue at the residual-stream level. It was trained on HaluEval and reaches AUROC 0.88 when tested cross-task on SimpleQA. Its transfer to medical data has not been measured.

Why this matters

  • CZI Biohub committed $500M to AI-biology on 2026-04-29
  • EU AI Act Article 14 enforcement on high-risk medical AI starts Aug 2026
  • DeepMind's post admits AI underperforms physicians at red-flag identification, which is rare-class detection under distribution shift: a paradigmatic use case for linear probes
  • The /probebench/applications/medical-ai page on the site needs measured numbers, not just methodology

Acceptance criteria

Run FabricationGuard L31 (no recalibration, then with recalibration) on:

  • MedQA (USMLE multiple choice): adapt fabrication detection to multiple choice by scoring the residual stream at the end of the question
  • PubMedQA (yes/no/maybe with biomedical abstract): open-ended factual
  • HaluEval-medical subset if available, else HaluEval-QA filtered for medical entities
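
A minimal sketch of the end-of-question scoring step, assuming the probe is a logistic-style linear readout over the residual vector at the last question token. The function name `probe_score` and the `weights`/`bias` arguments are hypothetical; the published FabricationGuard artifact may store and apply its parameters differently.

```python
import math
from typing import Sequence


def probe_score(
    residuals: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Score one prompt from its per-token residual vectors.

    residuals: one vector per token, taken from the monitored layer
               (assumed here to be the layer the probe was trained on).
    weights, bias: hypothetical linear-probe parameters.
    Returns a value in (0, 1); higher means "more likely fabrication".
    """
    if not residuals:
        raise ValueError("need at least one token")
    last = residuals[-1]  # end-of-question position
    if len(last) != len(weights):
        raise ValueError("residual and probe dimensions differ")
    logit = sum(h * w for h, w in zip(last, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy usage with a 2-dimensional "residual stream" and two tokens.
toy_residuals = [[1.0, 2.0], [0.5, -0.5]]
toy_score = probe_score(toy_residuals, weights=[2.0, 2.0], bias=0.0)
```

For MedQA the same call would be made once per question, with the answer options either included in or excluded from the prompt; which of those is the right adaptation is one of the things the run should pin down.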

For each dataset, report AUROC, the per-class confusion matrix, N, and a threshold sweep. Compare against the baseline (Qwen3.6 logprob).
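
The reporting requirements above could be computed roughly as follows. This is a dependency-free sketch, not the project's evaluation harness: the function names are hypothetical, labels are assumed to be 1 for fabrication and 0 otherwise, and a score at or above the threshold is assumed to count as a positive prediction.

```python
from typing import Dict, List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUROC (Mann-Whitney U); tied scores share an average rank."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs both classes present")
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1..j
        rank_sum += avg_rank * sum(lbl for _, lbl in pairs[i:j])
        i = j
    u = rank_sum - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def confusion(scores: Sequence[float], labels: Sequence[int],
              threshold: float) -> Dict[str, int]:
    """Per-class confusion counts at a single threshold."""
    out = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred and y:
            out["tp"] += 1
        elif pred and not y:
            out["fp"] += 1
        elif not pred and y:
            out["fn"] += 1
        else:
            out["tn"] += 1
    return out


def threshold_sweep(scores: Sequence[float], labels: Sequence[int],
                    thresholds: Sequence[float]) -> List[Dict[str, float]]:
    """Confusion counts at each threshold, one row per threshold."""
    return [{"threshold": t, **confusion(scores, labels, t)} for t in thresholds]


# Toy data only; these are not measured results.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
report = {
    "n": len(toy_labels),
    "auroc": auroc(toy_scores, toy_labels),
    "sweep": threshold_sweep(toy_scores, toy_labels, [0.25, 0.5, 0.75]),
}
```

Running the same functions on the logprob baseline scores gives a like-for-like comparison, since both the probe and the baseline reduce to one scalar per prompt.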

Compute: 1 Colab Pro session, ~2-3h. Single Qwen3.6-27B forward pass per prompt.

Output

When eval completes, add a new Atlas entry (type: probe-result) at https://github.com/OpenInterpretability/registry/atlas/2026/ documenting the measured transfer numbers. Then update /probebench/applications/medical-ai with the real cross-domain AUROC table.

If AUROC drops below 0.65 on the medical sets, publish it as an honest-negative finding (atlas-entry type, "cross-domain hallucination probes don't generalize to medical without recalibration"). That result is still high value for the field.

Link to related work

Metadata

    Labels

      • fabricationguard: FabricationGuard hallucination probe
      • paper: Research paper writing / submission / review
      • probebench: Probe leaderboard / categorical evaluation

    Projects

    Status
    Todo
