A full-stack trust-and-performance pipeline for high-stakes voice AI. Find correctness failures. Optimize the compute path. Prove clinical-reasoning lift. Deploy the agent stack. Package it all into a 911 call-center demo anyone can interact with.
Prism42 proves three things before the demo runs: the agents are
correct, they are fast, and they are clinically safer than
baseline. The pipeline is four stages composing into one deployable
system — see docs/pipeline-narrative.md
for the full thesis.
No speculative findings. No benchmark numbers we didn't measure ourselves. No AI-slop. Every claim on the landing page traces to a session ID; every session ID reproduces with a shell command; every agent the public talks to has an auditor running the same dialectic that found the kernel bugs.
- Find correctness failures — kernel layer. Five-role adversarial dialectic (defender / attacker / synthesizer / executor / adjudicator, coordinated) running as Anthropic Managed Agents on Claude Opus 4.7. Every finding compiles and runs on real GPU hardware before shipping. See `mla/` + `scripts/`.
- Optimize the compute path — inference layer. Clean-process measurement rubric: fresh subprocess per run, 200 CUDA-event samples, 3 replicates, full p10/p50/p90/p99 distribution. Six benchmark-gaming detectors (`mla/prism/gaming_patterns.py`). Six mechanisms counter the "AI-slop benchmark number" pattern.
- Prove clinical-reasoning lift — reasoning layer. HealthBench Hard (OpenAI `simple-evals`, Apache 2.0, vendored) as the primary rubric grader. First public Opus 4.7 HealthBench Hard baseline: 0.196 ± 0.068 (N = 3, 95% CI, 30-example subset). Canonical 1000-example parent set pinned at `corpus/pins/healthbench-hard-1000.yaml`. Paired-design harness delta gates on CI-excludes-zero.
- Deploy the agent stack — voice / product layer. ElevenLabs Conversational AI front end over the Managed Agents session layer. Live-call voice-facing agents phased by call stage (intake → triage → dispatch → PDI → handoff). In-session oversight agents (safety-monitor, OHCA-detector, intent-verifier, rubric-live) on every turn. Post-session auditor runs the dialectic over the call transcript to produce the physician-readable QI summary. Packaged as a public 911 call-center simulation at www.thegoatnote.com/prism42.
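The clean-process rubric in the inference stage can be sketched as below. This is an illustrative sketch only: the child-snippet protocol (the child prints a JSON list of per-sample timings) and both helper names are assumptions, and the real harness samples CUDA events rather than wall-clock time.

```python
import json
import statistics
import subprocess
import sys

def clean_process_samples(child_snippet: str, replicates: int = 3) -> list[float]:
    """Run each replicate in a fresh interpreter so no warm caches or JIT
    state leak between runs (the 'fresh subprocess per run' rule)."""
    samples: list[float] = []
    for _ in range(replicates):
        proc = subprocess.run(
            [sys.executable, "-c", child_snippet],
            capture_output=True, text=True, check=True,
        )
        # Illustrative protocol: the child emits a JSON list of timings.
        samples.extend(json.loads(proc.stdout))
    return samples

def report(samples: list[float]) -> dict[str, float]:
    """Report the full tail, not a single cherry-picked mean."""
    q = statistics.quantiles(sorted(samples), n=100)
    return {"p10": q[9], "p50": q[49], "p90": q[89], "p99": q[98]}
```

Reporting the whole p10/p50/p90/p99 distribution is itself one of the anti-gaming mechanisms: a harness that only looks fast at the median cannot hide a fat p99 tail.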
The agents visitors interact with at www.thegoatnote.com/prism42 are
the same ones whose correctness, performance, and clinical-reasoning
lift were measured in stages 1–3. No "benchmark agent" vs "demo agent"
bait-and-switch. Every public call produces a structured post-call
verdict from the same dialectic that audits the kernels.
That continuity is the credibility mechanism. See
docs/pipeline-narrative.md for how the
stages compose.
Defender asserts a rubric invariant, attacker perturbs the prompt,
synthesizer packages a candidate delta, executor runs stock Opus 4.7 vs.
harness-modified Opus 4.7 against the OpenAI simple-evals grader,
adjudicator scores. The harness does not grade itself.
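The turn order above can be sketched as a plain function pipeline. Every name here is illustrative, not the repo's Managed Agents API; the point is only that the verdict comes from the adjudicator, never from the harness under test.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    score: float

def run_dialectic(defender, attacker, synthesizer, executor, adjudicator) -> Verdict:
    invariant = defender()                     # rubric invariant to defend
    perturbed = attacker(invariant)            # adversarial prompt perturbation
    delta = synthesizer(invariant, perturbed)  # candidate harness delta
    stock, modified = executor(delta)          # stock vs harness-modified runs,
                                               # both scored by the external grader
    return adjudicator(stock, modified)        # only the adjudicator produces the verdict
```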
Every technique ships only after an external public-benchmark delta:
| Benchmark | Role | Grader |
|---|---|---|
| HealthBench Hard | primary clinical metric | simple-evals (Apache 2.0, vendored, pinned at `third_party/simple-evals/`) |
| MedQA (USMLE) | null-result control — \|Δ\| expected ≈ 0 | exact-match |
| PubMedQA | RAG validator — retrieval must lift ≥10 pp | exact-match |
| MMLU-Medical-6 | breadth / null-result | exact-match |
| MedAgentBench | agentic clinical (H1 epic) | side-effect verifier |
Anthropic's Opus 4.7 launch page publishes zero medical benchmarks. Prism
establishes the full public suite, one run at a time. Every row in
docs/opus47-baseline-card.md is either a direct quote from a named
source with a fetch-date, or pending.
Baseline + harness deltas follow the determinism-aware gate defined in
CLAUDE.md §4: aggregates reported as mean ± 95% CI over N ≥ 3 runs;
the paired harness-delta CI must exclude 0 at α=0.05 before any
technique ships.
Clinical findings are model-behavior observations that route privately
through Anthropic's feedback channel after physician review —
never a public issue tracker, never a preprint before review,
never a social-media thread. Physician-in-loop gate is enforced
by the adjudicator's physician_review field in verdict.json (code
never pre-signs it). Disclosure posture: docs/clinical-handling.md.
Physician-facing 60-second safeguards summary: docs/safeguards.md.
```sh
git clone https://github.com/GOATnote-Inc/prism42
cd prism42
make verify-all                      # offline tests green across 5 layers
make clinical-demo-artifacts-commit  # synthetic rubric cards (physician-review-required)
```

Live Managed Agents smoke (requires `ANTHROPIC_API_KEY` in `.env`; ~$0.15 per run):
```sh
PRISM_SMOKE_SESSION_COMMIT=1 python scripts/smoke_session.py --commit
```

| Path | Purpose |
|---|---|
| `CLAUDE.md` | Operating contract (§8 for Managed Agents specifics) |
| `docs/clinical-extension-spec.md` | Normative clinical-rail contract (frozen path) |
| `docs/clinical-roadmap.md` | Task DAG + dispatch protocol |
| `docs/sota-portfolio.md` | Technique portfolio + benchmark grammar |
| `docs/safeguards.md` | Physician-facing 60-second safeguards page |
| `docs/opus47-baseline-card.md` | Every quoted Opus 4.7 medical benchmark, with fetch-date |
| `docs/clinical-handling.md` | Clinical-finding disclosure posture (physician-gated) |
| `docs/kernel-research-posture.md` | Kernel-research disclosure posture (private channels) |
| `docs/runaway-ai-kb/` | AI-control literature mapped onto Prism's dialectic |
| `docs/anthropic-elevenlabs-agent-bp-*.md` | ElevenLabs + Opus 4.7 voice stack reference |
| `agents/*.yaml` | 6 agent configs (coordinator + 5 sub-agents) |
| `agents/manifest.yaml` | Live Anthropic IDs after `register_agents.py --commit` |
| `environments/prism-standard-env.yaml` | BetaCloudConfigParams body (limited networking, 4-host allowlist) |
| `scripts/register_agents.py` | Double-gated; writes manifest on success |
| `scripts/smoke_session.py` | Live session smoke (event-channel proof) |
| `scripts/smoke_delegation.py` | Live delegation smoke (gate-aware) |
| `scripts/generate_clinical_demo_artifacts.py` | Clinical demo artifact generator |
| `scripts/check_sdk_containment.py` | AST guard: SDK import only inside `do_commit()` |
| `scripts/check_pipeline_invariants.py` | Model pins, role-filename, egress, mounts, manifest, schemas |
| `corpus/clinical-demo/CLN-DEMO-*/` | Synthetic clinical fixtures (not PHI) |
| `corpus/golden-cases/KERNEL-GOLDEN/` | Synthetic kernel regression fixture |
| `corpus/golden-cases/HBH-CLN-SYNTH/` | Synthetic clinical golden case |
| `corpus/mla/` | MLA oracle + reference implementations |
| `findings/*.md` | Evidence + smoke reports (no embargoed material) |
| `mla/` | Evolutionary MLA/NVFP4 kernel package (validator + runners + evolve loop) |
| `tests/` | pytest suite; offline verification green in CI |
| `music-video/` | Subtree; Opus 4.7 video editor |
| `mvp/` | PSAP / 911 console research prototypes |
Five offline layers gate every commit. Matches .github/workflows/verify.yml.
| Layer | Command | Proves |
|---|---|---|
| L1 schema | `scripts/validate_artifacts.py` | Every case-dir artifact matches its JSON Schema 2020-12 |
| L2 agent | per-agent output validation | Agent emitted a parseable, schema-aligned verdict |
| L3 regression | `make validate-golden` | KERNEL-GOLDEN and HBH-CLN-SYNTH fixtures still pass |
| L4 invariants | `scripts/check_pipeline_invariants.py` | Model pins, role↔filename, egress allowlist, no-secret-mount, manifest shape, schemas compile |
| L5 CI | `.github/workflows/verify.yml` | Offline-green on every push |
| T3 umbrella | `make verify-all` | All of the above + generator dry-runs |
No commit ships without make verify-all green. SDK containment is
AST-verified by scripts/check_sdk_containment.py across gated scripts —
the anthropic SDK may only be imported inside do_commit(), never at
module scope.
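A minimal sketch of the containment idea (the real check lives in `scripts/check_sdk_containment.py`; the function name and exact traversal here are illustrative): walk the AST, collect every node inside `do_commit()`, then reject any `anthropic` import that falls outside that set.

```python
import ast

def sdk_contained(source: str, sdk: str = "anthropic", allowed_fn: str = "do_commit") -> bool:
    """True iff every import of `sdk` sits inside the body of `allowed_fn`."""
    tree = ast.parse(source)
    # Collect the identity of every node nested inside the allowed function.
    allowed_ids: set[int] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == allowed_fn:
            allowed_ids.update(id(inner) for inner in ast.walk(node))
    # Any import of the SDK outside that set is a containment violation.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        if sdk in roots and id(node) not in allowed_ids:
            return False
    return True
```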
- Every action ends with a verification step whose exit code proves the claim.
- Any script that spends money or calls an external LLM is gated by TWO
  independent signals: the `--commit` flag plus the `PRISM_<COMPONENT>_COMMIT=1`
  env var. Either alone stays dry-run.
- No technique ships without a measured delta on a Phase B scorer.
- Frozen paths (`docs/clinical-extension-spec.md`, `.env`, `.state/`) are read-only.
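The double-gate can be sketched as follows; the flag and env-var names follow the convention above, but the helper itself is an illustrative assumption, not the repo's implementation:

```python
import argparse
import os

def is_commit_run(component: str, argv: list[str]) -> bool:
    """Dry-run unless BOTH signals agree: the --commit flag AND
    PRISM_<COMPONENT>_COMMIT=1 in the environment."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--commit", action="store_true")
    args, _unknown = parser.parse_known_args(argv)
    env_ok = os.environ.get(f"PRISM_{component.upper()}_COMMIT") == "1"
    return args.commit and env_ok  # either signal alone stays dry-run
```

Requiring two independent signals means a copy-pasted command line or a stale shell environment cannot, by itself, trigger a paid live run.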
Prism performs kernel-correctness research against public open-source
code, running on hardware we rent by the hour. Kernel findings that reach
the threshold for disclosure route through private channels, never
through this repo; see docs/kernel-research-posture.md for the
contract. This repo intentionally carries no embargoed material, no
target-specific naming, and no reproduction fingerprints.
- Claude Opus 4.7 — the auditor and the audited.
- OpenAI `simple-evals` (Apache 2.0) — HealthBench Hard rubric grader.
- Anthropic Managed Agents — research-preview multi-agent framework.
- GOATnote Emergency Dispatch Protocol (GEDP) v0.1 — developed under the direction of Brandon Dent, MD (emergency medicine). Author: GOATnote Inc. MIT-licensed. Grounded in AHA BLS 2025, the NHTSA EMS Scope of Practice Model, peer-reviewed EMS literature, and publicly published US PSAP materials. No IAED-licensed content.
MIT. See LICENSE. Third-party code under third_party/ retains its
upstream license; attribution in NOTICE.