Prism42

A full-stack trust-and-performance pipeline for high-stakes voice AI. Find correctness failures. Optimize the compute path. Prove clinical-reasoning lift. Deploy the agent stack. Package it all into a 911 call-center demo anyone can interact with.

Prism42 proves three things before the demo runs: the agents are correct, they are fast, and they are clinically safer than baseline. The pipeline is four stages composing into one deployable system — see docs/pipeline-narrative.md for the full thesis.

No speculative findings. No benchmark numbers we didn't measure ourselves. No AI-slop. Every claim on the landing page traces to a session ID; every session ID reproduces with a shell command; every agent the public talks to has an auditor running the same dialectic that found the kernel bugs.

Four stages, one pipeline

  1. Find correctness failures — kernel layer. Five-role adversarial dialectic (defender / attacker / synthesizer / executor / adjudicator, coordinated) running as Anthropic Managed Agents on Claude Opus 4.7. Every finding compiles and runs on real GPU hardware before shipping. See mla/ + scripts/.
  2. Optimize the compute path — inference layer. Clean-process measurement rubric: fresh subprocess per run, 200 CUDA-event samples, 3 replicates, full p10/p50/p90/p99 distribution. Six benchmark-gaming detectors (mla/prism/gaming_patterns.py). Six mechanisms counter the "AI-slop benchmark number" pattern.
  3. Prove clinical-reasoning lift — reasoning layer. HealthBench Hard (OpenAI simple-evals, Apache 2.0, vendored) as the primary rubric grader. First public Opus 4.7 HealthBench Hard baseline: 0.196 ± 0.068 (N = 3, 95% CI, 30-example subset). Canonical 1000-example parent set pinned at corpus/pins/healthbench-hard-1000.yaml. Paired-design harness deltas ship only when the 95% CI excludes zero.
  4. Deploy the agent stack — voice / product layer. ElevenLabs Conversational AI front end over the Managed Agents session layer. Live-call voice-facing agents phased by call stage (intake → triage → dispatch → PDI → handoff). In-session oversight agents (safety-monitor, OHCA-detector, intent-verifier, rubric-live) on every turn. Post-session auditor runs the dialectic over the call transcript for the physician-readable QI summary. Packaged as a public 911 call-center simulation at www.thegoatnote.com/prism42.
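The full-distribution report in stage 2 reduces to a small summary per replicate. A minimal Python sketch over one replicate's timing samples — function name and return shape are illustrative, not the repo's API:

```python
import statistics

def summarize_latencies(samples_ms):
    """Summarize one replicate's timing samples (e.g. 200 CUDA-event
    measurements collected in a fresh subprocess) as p10/p50/p90/p99.
    """
    # statistics.quantiles with n=100 returns the 99 percentile cut
    # points p1..p99; index i holds p(i+1).
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p10": qs[9], "p50": qs[49], "p90": qs[89], "p99": qs[98]}
```

Under the rubric this would run once per replicate (three fresh subprocesses), and all three summaries are reported rather than a single cherry-picked mean.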

The continuity claim

The agents visitors interact with at www.thegoatnote.com/prism42 are the same ones whose correctness, performance, and clinical-reasoning lift were measured in stages 1–3. No "benchmark agent" vs "demo agent" bait-and-switch. Every public call produces a structured post-call verdict from the same dialectic that audits the kernels.

That continuity is the credibility mechanism. See docs/pipeline-narrative.md for how the stages compose.

Clinical rail — HealthBench Hard

Defender asserts a rubric invariant, attacker perturbs the prompt, synthesizer packages a candidate delta, executor runs stock Opus 4.7 vs. harness-modified Opus 4.7 against the OpenAI simple-evals grader, adjudicator scores. The harness does not grade itself.

Every technique ships only after an external public-benchmark delta:

Benchmark          Role                                          Grader
HealthBench Hard   primary clinical metric                       simple-evals (Apache 2.0, vendored, pinned @ third_party/simple-evals/)
MedQA (USMLE)      null-result control                           exact-match
PubMedQA           RAG validator (retrieval must lift ≥10 pp)    exact-match
MMLU-Medical-6     breadth / null-result                         exact-match
MedAgentBench      agentic clinical (H1 epic)                    side-effect verifier

Opus 4.7 baseline card

Anthropic's Opus 4.7 launch page publishes zero medical benchmarks. Prism establishes the full public suite, one run at a time. Every row in docs/opus47-baseline-card.md is either a direct quote from a named source with a fetch-date, or pending.

Baseline + harness deltas follow the determinism-aware gate defined in CLAUDE.md §4: aggregates reported as mean ± 95% CI over N ≥ 3 runs; the paired harness-delta CI must exclude 0 at α=0.05 before any technique ships.
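The gate can be sketched in a few lines of Python: a paired t-style interval over per-run deltas, with textbook two-sided 95% t critical values hard-coded for small N. Names and the exact statistic are illustrative; CLAUDE.md §4 is normative.

```python
import math
import statistics

# Two-sided 95% Student-t critical values for small df (standard tables).
T95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def paired_delta_gate(baseline, harness):
    """Mean paired delta with a 95% CI; ships only if the CI excludes 0."""
    assert len(baseline) == len(harness) >= 3, "need N >= 3 paired runs"
    deltas = [h - b for b, h in zip(baseline, harness)]
    n = len(deltas)
    mean = statistics.mean(deltas)
    half = T95[n - 1] * statistics.stdev(deltas) / math.sqrt(n)
    lo, hi = mean - half, mean + half
    return mean, lo, hi, (lo > 0 or hi < 0)  # ship only if CI excludes zero
```

For example, three paired runs with nearly identical per-run deltas yield a tight CI above zero and pass; identical baseline and harness scores yield a zero-width CI at zero and stay blocked.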

Clinical findings are not CVEs

Clinical findings are model-behavior observations that route privately through Anthropic's feedback channel after physician review — never a public issue tracker, never a preprint before review, never a social-media thread. Physician-in-loop gate is enforced by the adjudicator's physician_review field in verdict.json (code never pre-signs it). Disclosure posture: docs/clinical-handling.md. Physician-facing 60-second safeguards summary: docs/safeguards.md.
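The physician-in-loop gate reduces to a simple predicate. A sketch of the idea with an illustrative field shape — the real verdict.json schema in the repo is authoritative:

```python
def physician_signed(verdict: dict) -> bool:
    """True only when a human physician has filled physician_review.

    Code paths that produce verdicts leave this field empty (never
    pre-signed); routing a clinical finding onward requires this
    check to pass. Field names here are hypothetical.
    """
    review = verdict.get("physician_review") or {}
    return bool(review.get("reviewed_by")) and bool(review.get("approved"))
```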

Reproduce

git clone https://github.com/GOATnote-Inc/prism42
cd prism42
make verify-all                        # offline tests green across 5 layers
make clinical-demo-artifacts-commit    # synthetic rubric cards (physician-review-required)

Live Managed Agents smoke (requires ANTHROPIC_API_KEY in .env; ~$0.15 per run):

PRISM_SMOKE_SESSION_COMMIT=1 python scripts/smoke_session.py --commit

Repo map — where to look

CLAUDE.md                              Operating contract (§8 for Managed Agents specifics)
docs/clinical-extension-spec.md        Normative clinical-rail contract (frozen path)
docs/clinical-roadmap.md               Task DAG + dispatch protocol
docs/sota-portfolio.md                 Technique portfolio + benchmark grammar
docs/safeguards.md                     Physician-facing 60-second safeguards page
docs/opus47-baseline-card.md           Every quoted Opus 4.7 medical benchmark, with fetch-date
docs/clinical-handling.md              Clinical-finding disclosure posture (physician-gated)
docs/kernel-research-posture.md        Kernel-research disclosure posture (private channels)
docs/runaway-ai-kb/                    AI-control literature mapped onto Prism's dialectic
docs/anthropic-elevenlabs-agent-bp-*.md  ElevenLabs + Opus 4.7 voice stack reference
agents/*.yaml                          6 agent configs (coordinator + 5 sub-agents)
agents/manifest.yaml                   Live Anthropic IDs after register_agents.py --commit
environments/prism-standard-env.yaml   BetaCloudConfigParams body (limited networking, 4-host allowlist)
scripts/register_agents.py             Double-gated; writes manifest on success
scripts/smoke_session.py               Live session smoke (event-channel proof)
scripts/smoke_delegation.py            Live delegation smoke (gate-aware)
scripts/generate_clinical_demo_artifacts.py  Clinical demo artifact generator
scripts/check_sdk_containment.py       AST guard: SDK import only inside do_commit()
scripts/check_pipeline_invariants.py   Model pins, role-filename, egress, mounts, manifest, schemas
corpus/clinical-demo/CLN-DEMO-*/       Synthetic clinical fixtures (not PHI)
corpus/golden-cases/KERNEL-GOLDEN/     Synthetic kernel regression fixture
corpus/golden-cases/HBH-CLN-SYNTH/     Synthetic clinical golden case
corpus/mla/                            MLA oracle + reference implementations
findings/*.md                          Evidence + smoke reports (no embargoed material)
mla/                                   Evolutionary MLA/NVFP4 kernel package (validator + runners + evolve loop)
tests/                                 pytest suite; offline verification green in CI
music-video/                           Subtree; Opus-4.7 video editor
mvp/                                   PSAP / 911 console research prototypes

Verification discipline

Five offline layers gate every commit. Matches .github/workflows/verify.yml.

Layer          Command                                 Proves
L1 schema      scripts/validate_artifacts.py           Every case-dir artifact matches its JSON Schema 2020-12
L2 agent       per-agent output validation             Agent emitted a parseable, schema-aligned verdict
L3 regression  make validate-golden                    KERNEL-GOLDEN and HBH-CLN-SYNTH fixtures still pass
L4 invariants  scripts/check_pipeline_invariants.py    Model pins, role↔filename, egress allowlist, no-secret-mount, manifest shape, schemas compile
L5 CI          .github/workflows/verify.yml            Offline-green on every push
T3 umbrella    make verify-all                         All of the above + generator dry-runs

No commit ships without make verify-all green. SDK containment is AST-verified by scripts/check_sdk_containment.py across gated scripts — the anthropic SDK may only be imported inside do_commit(), never at module scope.
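The containment rule itself is a small AST walk. A minimal sketch of what such a guard checks — not the repo's actual implementation, which may check more:

```python
import ast

def sdk_contained(source: str) -> bool:
    """True iff `anthropic` is imported only inside a function named
    do_commit (never at module scope or in any other function)."""
    def imports_anthropic(node):
        if isinstance(node, ast.Import):
            return any(a.name.split(".")[0] == "anthropic" for a in node.names)
        if isinstance(node, ast.ImportFrom):
            return (node.module or "").split(".")[0] == "anthropic"
        return False

    def check(node, allowed):
        for child in ast.iter_child_nodes(node):
            if imports_anthropic(child) and not allowed:
                return False
            ok_here = allowed or (
                isinstance(child, ast.FunctionDef) and child.name == "do_commit")
            if not check(child, ok_here):
                return False
        return True

    return check(ast.parse(source), False)
```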

Hard rules (excerpted from CLAUDE.md)

  • Every action ends with a verification step whose exit code proves the claim.
  • Any script that spends money or calls an external LLM is gated by TWO independent signals: --commit flag + PRISM_<COMPONENT>_COMMIT=1 env var. Either alone stays dry-run.
  • No technique ships without a measured delta on a Phase B scorer.
  • Frozen paths (docs/clinical-extension-spec.md, .env, .state/) are read-only.
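The double gate in the second rule is trivially checkable. A sketch assuming the stated flag and env-var convention — actual scripts may parse arguments differently:

```python
import os

def commit_enabled(component: str, argv: list) -> bool:
    """Live (money-spending) mode requires BOTH independent signals:
    the --commit flag AND PRISM_<COMPONENT>_COMMIT=1. Either alone
    stays dry-run.
    """
    flag = "--commit" in argv
    env = os.environ.get(f"PRISM_{component.upper()}_COMMIT") == "1"
    return flag and env
```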

Research posture

Prism performs kernel-correctness research against public open-source code, running on hardware we rent by the hour. Kernel findings that reach the threshold for disclosure route through private channels, never through this repo; see docs/kernel-research-posture.md for the contract. This repo intentionally carries no embargoed material, no target-specific naming, and no reproduction fingerprints.

Credits

  • Claude Opus 4.7 — the auditor and the audited.
  • OpenAI simple-evals (Apache 2.0) — HealthBench Hard rubric grader.
  • Anthropic Managed Agents — research-preview multi-agent.
  • GOATnote Emergency Dispatch Protocol (GEDP) v0.1 — developed under direction of Brandon Dent, MD (emergency medicine). Author: GOATnote Inc. MIT-licensed. Grounded in AHA BLS 2025, NHTSA EMS Scope of Practice Model, peer-reviewed EMS literature, and publicly published US PSAP materials. No IAED-licensed content.

License

MIT. See LICENSE. Third-party code under third_party/ retains its upstream license; attribution in NOTICE.
