A full-stack trust-and-performance pipeline for high-stakes voice AI. Find correctness failures. Optimize the compute path. Prove clinical-reasoning lift. Deploy the agent stack. Package it all into a 911 call-center demo anyone can interact with.
Prism42 proves three things before the demo runs: the agents are
correct, they are fast, and they are clinically safer than
baseline. The pipeline is four stages composing into one deployable
system — see docs/pipeline-narrative.md
for the full thesis.
No speculative findings. No benchmark numbers we didn't measure ourselves. No AI-slop. Every claim on the landing page traces to a session ID; every session ID reproduces with a shell command; every agent the public talks to has an auditor running the same dialectic that found the kernel bugs.
- Find correctness failures — kernel layer. Five-role adversarial dialectic (defender / attacker / synthesizer / executor / adjudicator, coordinated) running as Anthropic Managed Agents on Claude Opus 4.7. Every finding compiles and runs on real GPU hardware before shipping. See `mla/` + `scripts/`.
- Optimize the compute path — inference layer. Clean-process measurement rubric: fresh subprocess per run, 200 CUDA-event samples, 3 replicates, full p10/p50/p90/p99 distribution. Six benchmark-gaming detectors (`mla/prism/gaming_patterns.py`). Six mechanisms counter the "AI-slop benchmark number" pattern.
- Prove clinical-reasoning lift — reasoning layer. HealthBench Hard (OpenAI `simple-evals`, Apache 2.0, vendored) as the primary rubric grader. First public Opus 4.7 HealthBench Hard baseline: 0.196 ± 0.068 (N = 3, 95% CI, 30-example subset). Canonical 1000-example parent set pinned at `corpus/pins/healthbench-hard-1000.yaml`. Paired-design harness delta gates on CI-excludes-zero.
- Deploy the agent stack — voice / product layer. ElevenLabs Conversational AI front end over the Managed Agents session layer. Live-call voice-facing agents phased by call stage (intake → triage → dispatch → PDI → handoff). In-session oversight agents (safety-monitor, OHCA-detector, intent-verifier, rubric-live) on every turn. Post-session auditor runs the dialectic over the call transcript to produce the physician-readable QI summary. Packaged as a public 911 call-center simulation at www.thegoatnote.com/prism42.
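The clean-process rubric in the inference stage can be sketched as below. This is an illustrative sketch only: the child-snippet protocol (the child prints a JSON list of per-sample timings) and both helper names are assumptions, and the real harness samples CUDA events rather than wall-clock time.

```python
import json
import statistics
import subprocess
import sys

def clean_process_samples(child_snippet: str, replicates: int = 3) -> list[float]:
    """Run each replicate in a fresh interpreter so no warm caches or JIT
    state leak between runs (the 'fresh subprocess per run' rule)."""
    samples: list[float] = []
    for _ in range(replicates):
        proc = subprocess.run(
            [sys.executable, "-c", child_snippet],
            capture_output=True, text=True, check=True,
        )
        # Illustrative protocol: the child emits a JSON list of timings.
        samples.extend(json.loads(proc.stdout))
    return samples

def report(samples: list[float]) -> dict[str, float]:
    """Report the full tail, not a single cherry-picked mean."""
    q = statistics.quantiles(sorted(samples), n=100)
    return {"p10": q[9], "p50": q[49], "p90": q[89], "p99": q[98]}
```

Reporting the whole p10/p50/p90/p99 distribution is itself one of the anti-gaming mechanisms: a harness that only looks fast at the median cannot hide a fat p99 tail.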
The agents visitors interact with at www.thegoatnote.com/prism42 are
the same ones whose correctness, performance, and clinical-reasoning
lift were measured in stages 1–3. No "benchmark agent" vs "demo agent"
bait-and-switch. Every public call produces a structured post-call
verdict from the same dialectic that audits the kernels.
That continuity is the credibility mechanism. See
docs/pipeline-narrative.md for how the
stages compose.
Defender asserts a rubric invariant, attacker perturbs the prompt,
synthesizer packages a candidate delta, executor runs stock Opus 4.7 vs.
harness-modified Opus 4.7 against the OpenAI simple-evals grader,
adjudicator scores. The harness does not grade itself.
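The turn order above can be sketched as a plain function pipeline. Every name here is illustrative, not the repo's Managed Agents API; the point is only that the verdict comes from the adjudicator, never from the harness under test.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    score: float

def run_dialectic(defender, attacker, synthesizer, executor, adjudicator) -> Verdict:
    invariant = defender()                     # rubric invariant to defend
    perturbed = attacker(invariant)            # adversarial prompt perturbation
    delta = synthesizer(invariant, perturbed)  # candidate harness delta
    stock, modified = executor(delta)          # stock vs harness-modified runs,
                                               # both scored by the external grader
    return adjudicator(stock, modified)        # only the adjudicator produces the verdict
```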
Every technique ships only after an external public-benchmark delta:
| Benchmark | Role | Grader |
|---|---|---|
| HealthBench Hard | primary clinical metric | simple-evals (Apache 2.0, vendored, pinned at `third_party/simple-evals/`) |
| MedQA (USMLE) | null-result control — \|Δ\| expected ≈ 0 | exact-match |
| PubMedQA | RAG validator — retrieval must lift ≥10 pp | exact-match |
| MMLU-Medical-6 | breadth / null-result | exact-match |
| MedAgentBench | agentic clinical (H1 epic) | side-effect verifier |
Anthropic's Opus 4.7 launch page publishes zero medical benchmarks. Prism
establishes the full public suite, one run at a time. Every row in
docs/opus47-baseline-card.md is either a direct quote from a named
source with a fetch-date, or pending.
Baseline + harness deltas follow the determinism-aware gate defined in
CLAUDE.md §4: aggregates reported as mean ± 95% CI over N ≥ 3 runs;
the paired harness-delta CI must exclude 0 at α=0.05 before any
technique ships.
Clinical findings are model-behavior observations that route privately
through Anthropic's feedback channel after physician review —
never a public issue tracker, never a preprint before review,
never a social-media thread. Physician-in-loop gate is enforced
by the adjudicator's physician_review field in verdict.json (code
never pre-signs it). Disclosure posture: docs/clinical-handling.md.
Physician-facing 60-second safeguards summary: docs/safeguards.md.
```sh
git clone https://github.com/GOATnote-Inc/prism42
cd prism42
make verify-all                      # offline tests green across 5 layers
make clinical-demo-artifacts-commit  # synthetic rubric cards (physician-review-required)
```

Live Managed Agents smoke (requires `ANTHROPIC_API_KEY` in `.env`; ~$0.15 per run):
```sh
PRISM_SMOKE_SESSION_COMMIT=1 python scripts/smoke_session.py --commit
```

| Path | Purpose |
|---|---|
| `CLAUDE.md` | Operating contract (§8 for Managed Agents specifics) |
| `docs/clinical-extension-spec.md` | Normative clinical-rail contract (frozen path) |
| `docs/clinical-roadmap.md` | Task DAG + dispatch protocol |
| `docs/sota-portfolio.md` | Technique portfolio + benchmark grammar |
| `docs/safeguards.md` | Physician-facing 60-second safeguards page |
| `docs/opus47-baseline-card.md` | Every quoted Opus 4.7 medical benchmark, with fetch-date |
| `docs/clinical-handling.md` | Clinical-finding disclosure posture (physician-gated) |
| `docs/kernel-research-posture.md` | Kernel-research disclosure posture (private channels) |
| `docs/runaway-ai-kb/` | AI-control literature mapped onto Prism's dialectic |
| `docs/anthropic-elevenlabs-agent-bp-*.md` | ElevenLabs + Opus 4.7 voice stack reference |
| `agents/*.yaml` | 6 agent configs (coordinator + 5 sub-agents) |
| `agents/manifest.yaml` | Live Anthropic IDs after `register_agents.py --commit` |
| `environments/prism-standard-env.yaml` | BetaCloudConfigParams body (limited networking, 4-host allowlist) |
| `scripts/register_agents.py` | Double-gated; writes manifest on success |
| `scripts/smoke_session.py` | Live session smoke (event-channel proof) |
| `scripts/smoke_delegation.py` | Live delegation smoke (gate-aware) |
| `scripts/generate_clinical_demo_artifacts.py` | Clinical demo artifact generator |
| `scripts/check_sdk_containment.py` | AST guard: SDK import only inside `do_commit()` |
| `scripts/check_pipeline_invariants.py` | Model pins, role-filename, egress, mounts, manifest, schemas |
| `corpus/clinical-demo/CLN-DEMO-*/` | Synthetic clinical fixtures (not PHI) |
| `corpus/golden-cases/KERNEL-GOLDEN/` | Synthetic kernel regression fixture |
| `corpus/golden-cases/HBH-CLN-SYNTH/` | Synthetic clinical golden case |
| `corpus/mla/` | MLA oracle + reference implementations |
| `findings/*.md` | Evidence + smoke reports (no embargoed material) |
| `mla/` | Evolutionary MLA/NVFP4 kernel package (validator + runners + evolve loop) |
| `tests/` | pytest suite; offline verification green in CI |
| `music-video/` | Subtree; Opus 4.7 video editor |
| `mvp/` | PSAP / 911 console research prototypes |
Five offline layers gate every commit. Matches .github/workflows/verify.yml.
| Layer | Command | Proves |
|---|---|---|
| L1 schema | `scripts/validate_artifacts.py` | Every case-dir artifact matches its JSON Schema 2020-12 |
| L2 agent | per-agent output validation | Agent emitted a parseable, schema-aligned verdict |
| L3 regression | `make validate-golden` | KERNEL-GOLDEN and HBH-CLN-SYNTH fixtures still pass |
| L4 invariants | `scripts/check_pipeline_invariants.py` | Model pins, role↔filename, egress allowlist, no-secret-mount, manifest shape, schemas compile |
| L5 CI | `.github/workflows/verify.yml` | Offline-green on every push |
| T3 umbrella | `make verify-all` | All of the above + generator dry-runs |
No commit ships without make verify-all green. SDK containment is
AST-verified by scripts/check_sdk_containment.py across gated scripts —
the anthropic SDK may only be imported inside do_commit(), never at
module scope.
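A minimal sketch of the containment idea (the real check lives in `scripts/check_sdk_containment.py`; the function name and exact traversal here are illustrative): walk the AST, collect every node inside `do_commit()`, then reject any `anthropic` import that falls outside that set.

```python
import ast

def sdk_contained(source: str, sdk: str = "anthropic", allowed_fn: str = "do_commit") -> bool:
    """True iff every import of `sdk` sits inside the body of `allowed_fn`."""
    tree = ast.parse(source)
    # Collect the identity of every node nested inside the allowed function.
    allowed_ids: set[int] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == allowed_fn:
            allowed_ids.update(id(inner) for inner in ast.walk(node))
    # Any import of the SDK outside that set is a containment violation.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        if sdk in roots and id(node) not in allowed_ids:
            return False
    return True
```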
- Every action ends with a verification step whose exit code proves the claim.
- Any script that spends money or calls an external LLM is gated by TWO
  independent signals: the `--commit` flag plus the `PRISM_<COMPONENT>_COMMIT=1`
  env var. Either alone stays dry-run.
- No technique ships without a measured delta on a Phase B scorer.
- Frozen paths (`docs/clinical-extension-spec.md`, `.env`, `.state/`) are read-only.
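The double-gate can be sketched as follows; the flag and env-var names follow the convention above, but the helper itself is an illustrative assumption, not the repo's implementation:

```python
import argparse
import os

def is_commit_run(component: str, argv: list[str]) -> bool:
    """Dry-run unless BOTH signals agree: the --commit flag AND
    PRISM_<COMPONENT>_COMMIT=1 in the environment."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--commit", action="store_true")
    args, _unknown = parser.parse_known_args(argv)
    env_ok = os.environ.get(f"PRISM_{component.upper()}_COMMIT") == "1"
    return args.commit and env_ok  # either signal alone stays dry-run
```

Requiring two independent signals means a copy-pasted command line or a stale shell environment cannot, by itself, trigger a paid live run.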
Prism performs kernel-correctness research against public open-source
code, running on hardware we rent by the hour. Kernel findings that reach
the threshold for disclosure route through private channels, never
through this repo; see docs/kernel-research-posture.md for the
contract. This repo intentionally carries no embargoed material, no
target-specific naming, and no reproduction fingerprints.
- Claude Opus 4.7 — the auditor and the audited.
- OpenAI `simple-evals` (Apache 2.0) — HealthBench Hard rubric grader.
- Anthropic Managed Agents — research-preview multi-agent framework.
- GOATnote Emergency Dispatch Protocol (GEDP) v0.1 — developed under the direction of Brandon Dent, MD (emergency medicine). Author: GOATnote Inc. MIT-licensed. Grounded in AHA BLS 2025, the NHTSA EMS Scope of Practice Model, peer-reviewed EMS literature, and publicly published US PSAP materials. No IAED-licensed content.
MIT. See LICENSE. Third-party code under third_party/ retains its
upstream license; attribution in NOTICE.