feat(prompt+bench): structural decision trigger + reproducible benchmark #133
Merged
Fullstop000 merged 7 commits into main on May 2, 2026
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the input-pattern enumeration in the Decision Inbox prompt section
(PR-review phrasing, "should I X or Y", config-knob examples) with a
four-property structural test: mutually-exclusive options + blocking +
material consequence + delegated picker. The trigger is the shape of the
agent's intended reply, not the asker's words. The PR-review case
becomes the canonical example, not the rule.
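As a rough illustration (not code from this PR; every name below is hypothetical), the four properties combine as a single conjunction:

```rust
/// Hypothetical sketch of the four-property structural test described above.
/// None of these names exist in the prompt or the codebase; they only show
/// how the properties combine.
struct StructuralTest {
    /// The reply would pick exactly one of several mutually exclusive options.
    mutually_exclusive_options: bool,
    /// The asker is blocked until an option is picked.
    blocking: bool,
    /// Picking wrong carries a material consequence, not a matter of taste.
    material_consequence: bool,
    /// The asker has delegated the pick to the agent.
    delegated_picker: bool,
}

impl StructuralTest {
    /// Only when all four hold should the reply route through
    /// dispatch_decision rather than send_message.
    fn is_decision(&self) -> bool {
        self.mutually_exclusive_options
            && self.blocking
            && self.material_consequence
            && self.delegated_picker
    }
}
```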
Why: the enumeration didn't scale. Verdict-shaped requests in triage,
hiring, time-boxing, and compliance use neutral phrasing ("tell me which
3 to fix", "walk me through whether we need X") and were falling
through to send_message. The structural rule generalizes to any new
workflow without re-listing phrasings.
Add bench/decision-trigger/ — a reproducible benchmark that spins up
one isolated claude/sonnet agent per case in parallel, dispatches a
DM, and classifies the response turn as decision (dispatch_decision) or
chat (send_message). 15 cases across 8 work domains (PR review, vendor
pick, architecture, status, triage, hiring, doc, compliance, time-box,
naming). Current score: 15/15.
The benchmark intentionally pauses non-bench agents during runs so the
bench cohort isn't drowned in #all welcome messages. Side-effect-free
prompts only — README documents the constraint.
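The classification step comes down to which tool the agent's response turn called. A hedged sketch of that mapping (the real logic lives in run.sh as shell over the server log; the Silent bucket only becomes relevant in the later matrix runs):

```rust
/// Hypothetical sketch of per-case classification: map the tool the agent
/// called in its response turn to a verdict. The actual benchmark does this
/// in shell by reading the server log.
#[derive(Debug, PartialEq)]
enum Verdict {
    Decision, // response turn called dispatch_decision
    Chat,     // response turn called send_message
    Silent,   // no qualifying tool call in the response turn
}

fn classify_response_turn(tool_calls: &[&str]) -> Verdict {
    if tool_calls.contains(&"dispatch_decision") {
        Verdict::Decision
    } else if tool_calls.contains(&"send_message") {
        Verdict::Chat
    } else {
        Verdict::Silent
    }
}
```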
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… flag
Two follow-up changes building on the structural-rule rewrite:
1) Whole-prompt injectability for benchmark/A-B convenience.
Adds CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE env var: when set to a readable
file, the file's contents become the system prompt verbatim. Also adds
PromptOptions.system_prompt_override for in-process tests/benches.
Programmatic override wins over env var. Tool names must be pre-resolved
in the override file (no template substitution). Lets the bench compare
prompt variants without rebuilding the binary.
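A minimal sketch of the intended precedence, assuming the resolution happens in one place inside the prompt builder (the function name and surrounding shape are illustrative, not the actual implementation):

```rust
use std::fs;

/// Illustrative precedence only: programmatic override beats the env var,
/// which beats the built-in template. The override file is used verbatim,
/// so tool names must already be resolved in it.
fn resolve_system_prompt(
    programmatic_override: Option<String>,
    build_default: impl FnOnce() -> String,
) -> String {
    if let Some(prompt) = programmatic_override {
        return prompt; // PromptOptions.system_prompt_override wins
    }
    if let Ok(path) = std::env::var("CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE") {
        if let Ok(contents) = fs::read_to_string(&path) {
            return contents; // file contents become the system prompt verbatim
        }
    }
    build_default() // normal templated prompt
}
```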
2) Drop include_stdin_notification_section + MessageNotificationStyle.
The flag toggled between two phrasings of the same message-delivery
contract — "you'll be restarted" vs "messages may arrive directly". The
LLM doesn't need to distinguish; it just needs to know not to poll. One
universal Message Notifications section now always emits, telling the
agent to call check_messages at natural breakpoints.
Updates all 5 driver call sites to use the simpler
PromptOptions {..Default::default()} pattern. Adds 4 prompt tests covering
both override paths and asserting the conditional notification branching is
gone.
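For illustration, the call-site pattern looks roughly like this; the PromptOptions stand-in below only mirrors the one field this PR names, and everything else is an assumption:

```rust
// Self-contained stand-in, not the crate's real PromptOptions: only
// system_prompt_override is taken from this PR.
#[derive(Default)]
struct PromptOptions {
    system_prompt_override: Option<String>,
    // ...other prompt knobs elided
}

fn main() {
    // Driver call sites set only what they need and default the rest.
    let options = PromptOptions {
        system_prompt_override: None, // benches/tests pass Some(prompt_text)
        ..Default::default()
    };
    assert!(options.system_prompt_override.is_none());
}
```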
bench/decision-trigger/README.md gains an A/B section showing how to use
the env var to compare prompt variants without recompiling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds harder benchmark cases and a multi-model matrix runner that exposes real differences between the structural-rule prompt and the model's own inference style.

Hard cases (cases-hard.tsv, 15 scenarios):
- Realistic narrative framings (P0 escalation, sprint capacity, vendor procurement, hiring under deadline, SOC2 compliance, time-box at sprint end, architecture review, VP briefing)
- No verdict-flavored phrasing — no "merge or hold", no "what's your call", no "X or Y". Decisions must be inferred from situational context
- Trap cases for chat (rhetorical frustration, retrospective, exploration, status update, info request, debug ask, facilitator role)

Multi-model matrix:
- models.tsv lists (runtime, model, tier, label) rows. Default ships with the two-per-family pattern: Anthropic best/efficiency, OpenAI best/efficiency
- run.sh now takes RUNTIME, MODEL, RUN_LABEL, CASES via env so it can be driven by the matrix runner
- run-matrix.sh sweeps all rows in models.tsv, runs the bench once per model, collates a side-by-side matrix.tsv

Baseline (cases-hard.tsv, structural-rule prompt):
- claude/opus: 9/15 (conservative — implicit delegation reads as chat)
- claude/sonnet: 15/15 (best — infers delegation from context)
- codex/gpt-5.5: 14/15 (one hiring miss)
- codex/gpt-5.4-mini: 13/15 (one mis-fire, one silent)

All 4 models score 7/7 on chat cases. The discriminator is property #4 (Delegated) — whether the model treats "we need X by Y" as an implicit delegation. Same prompt, same cases, 9-15/15 spread by model. BASELINE.md captures this and lays out the implications for the next prompt iteration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
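For reference, a hypothetical Rust rendering of what the matrix sweep does per models.tsv row (the real runner is run-matrix.sh; the column order follows the (runtime, model, tier, label) description above, and everything else here is an assumption):

```rust
use std::fs;

// Hypothetical illustration only; the actual sweep is run-matrix.sh.
fn main() -> std::io::Result<()> {
    let models = fs::read_to_string("models.tsv")?;
    for row in models
        .lines()
        .filter(|l| !l.trim().is_empty() && !l.starts_with('#'))
    {
        let cols: Vec<&str> = row.split('\t').collect();
        if let [runtime, model, _tier, label] = cols[..] {
            // run.sh is driven entirely through env vars, one bench run per
            // row; run-matrix.sh then collates the results into matrix.tsv.
            println!(
                "RUNTIME={runtime} MODEL={model} RUN_LABEL={label} \
                 CASES=cases-hard.tsv ./run.sh"
            );
        }
    }
    Ok(())
}
```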
Captures the actual head-to-head between the OLD prompt (input-pattern enumeration on main) and the NEW prompt (four-property structural test on this branch). Same 15 hard cases, same 4 models, parallel runner.

Headline scores (cases-hard.tsv):

  Model               Tier        OLD    NEW     Δ
  claude/opus         best        15/15   9/15  -6
  claude/sonnet       efficiency  14/15  15/15  +1
  codex/gpt-5.5       best        14/15  14/15   0
  codex/gpt-5.4-mini  efficiency  12/15  13/15  +1
  -------------------------------------------------
  average                         13.75  12.75  -1.0

Aggregate behavior:
- Decisions caught (32 max): OLD 30/32 (94%) vs NEW 23/32 (72%)
- Chat held back (28 max): OLD 25/28 (89%) vs NEW 28/28 (100%)

The structural rewrite is NOT a strict win. NEW closes the retrospective false-positive (case 10: "in hindsight, was that the right call?" — OLD over-fires on sonnet/gpt-5.5/gpt-5.4-mini, NEW correctly chats on all). But NEW costs Opus 6 implicit-delegation decisions because Opus reads property #4 (Delegated) strictly: "we need X by Y" doesn't count as delegation without an explicit "you pick" clause.

Sonnet, gpt-5.5, and gpt-5.4-mini are stable across both prompts — they infer delegation from situational context regardless of which rule is loaded. The Opus regression is model-specific.

BASELINE.md captures the full per-case matrix, named winners and losers, known failure modes (gpt-5.4-mini case 1 silent under NEW; gpt-5.5 case 5 flips), and three iteration paths for the next prompt revision.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- bench/decision-trigger/ — reproducible benchmark that runs each case in an isolated claude/sonnet agent in parallel, classifies the response turn, and scores match-rate against predictions.
Why
The enumerated triggers didn't scale. Verdict-shaped requests in triage, hiring, time-boxing, and compliance use neutral phrasing ("tell me which 3 to fix", "walk me through whether we need X", "status on the auth bug") and were falling through to send_message. The structural rule generalizes to any new workflow without re-listing phrasings.
How it was validated
bench/decision-trigger/run.sh spawns 15 isolated claude/sonnet agents in parallel (one per case), sends each its DM, classifies the response turn from the server log, and scores against predictions. The 15 cases span 8 work domains: PR review, code explain, vendor pick, architecture, status check, triage, hiring, doc proofread, compliance, time-box, naming. 9 decision predictions / 6 chat predictions, all match.
The benchmark itself is structured for re-runs — agent-name uniqueness via run-id, intentional pause of non-bench agents during runs (avoids #all welcome storms), strict side-effect-free prompts.
Test plan
- cargo test --lib drivers::prompt::tests — 11 prompt tests pass (8 existing + 1 new structural-property test, plus 2 modified for added assertions)
- bench/decision-trigger/run.sh — 15/15 match

🤖 Generated with Claude Code