feat(prompt+bench): structural decision trigger + reproducible benchmark #133
Merged
Fullstop000 merged 7 commits into main on May 2, 2026
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the input-pattern enumeration in the Decision Inbox prompt section
(PR-review phrasing, "should I X or Y", config-knob examples) with a
four-property structural test: mutually-exclusive options + blocking +
material consequence + delegated picker. The trigger is the shape of the
agent's intended reply, not the asker's words. The PR-review case
becomes the canonical example, not the rule.
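As a rough illustration (not code from this PR; every name below is hypothetical), the four properties combine as a single conjunction:

```rust
/// Hypothetical sketch of the four-property structural test described above.
/// None of these names exist in the prompt or the codebase; they only show
/// how the properties combine.
struct StructuralTest {
    /// The reply would pick exactly one of several mutually exclusive options.
    mutually_exclusive_options: bool,
    /// The asker is blocked until an option is picked.
    blocking: bool,
    /// Picking wrong carries a material consequence, not a matter of taste.
    material_consequence: bool,
    /// The asker has delegated the pick to the agent.
    delegated_picker: bool,
}

impl StructuralTest {
    /// Only when all four hold should the reply route through
    /// dispatch_decision rather than send_message.
    fn is_decision(&self) -> bool {
        self.mutually_exclusive_options
            && self.blocking
            && self.material_consequence
            && self.delegated_picker
    }
}
```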
Why: the enumeration didn't scale. Verdict-shaped requests in triage,
hiring, time-boxing, and compliance use neutral phrasing ("tell me which
3 to fix", "walk me through whether we need X") and were falling
through to send_message. The structural rule generalizes to any new
workflow without re-listing phrasings.
Add bench/decision-trigger/ — a reproducible benchmark that spins up
one isolated claude/sonnet agent per case in parallel, dispatches a
DM, and classifies the response turn as decision (dispatch_decision) or
chat (send_message). 15 cases across 8 work domains (PR review, vendor
pick, architecture, status, triage, hiring, doc, compliance, time-box,
naming). Current score: 15/15.
The benchmark intentionally pauses non-bench agents during runs so the
bench cohort isn't drowned in #all welcome messages. Side-effect-free
prompts only — README documents the constraint.
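The classification step comes down to which tool the agent's response turn called. A hedged sketch of that mapping (the real logic lives in run.sh as shell over the server log; the Silent bucket only becomes relevant in the later matrix runs):

```rust
/// Hypothetical sketch of per-case classification: map the tool the agent
/// called in its response turn to a verdict. The actual benchmark does this
/// in shell by reading the server log.
#[derive(Debug, PartialEq)]
enum Verdict {
    Decision, // response turn called dispatch_decision
    Chat,     // response turn called send_message
    Silent,   // no qualifying tool call in the response turn
}

fn classify_response_turn(tool_calls: &[&str]) -> Verdict {
    if tool_calls.contains(&"dispatch_decision") {
        Verdict::Decision
    } else if tool_calls.contains(&"send_message") {
        Verdict::Chat
    } else {
        Verdict::Silent
    }
}
```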
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… flag
Two follow-up changes building on the structural-rule rewrite:
1) Whole-prompt injectability for benchmark/A-B convenience.
Adds CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE env var: when set to a readable
file, the file's contents become the system prompt verbatim. Also adds
PromptOptions.system_prompt_override for in-process tests/benches.
Programmatic override wins over env var. Tool names must be pre-resolved
in the override file (no template substitution). Lets the bench compare
prompt variants without rebuilding the binary.
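A minimal sketch of the intended precedence, assuming the resolution happens in one place inside the prompt builder (the function name and surrounding shape are illustrative, not the actual implementation):

```rust
use std::fs;

/// Illustrative precedence only: programmatic override beats the env var,
/// which beats the built-in template. The override file is used verbatim,
/// so tool names must already be resolved in it.
fn resolve_system_prompt(
    programmatic_override: Option<String>,
    build_default: impl FnOnce() -> String,
) -> String {
    if let Some(prompt) = programmatic_override {
        return prompt; // PromptOptions.system_prompt_override wins
    }
    if let Ok(path) = std::env::var("CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE") {
        if let Ok(contents) = fs::read_to_string(&path) {
            return contents; // file contents become the system prompt verbatim
        }
    }
    build_default() // normal templated prompt
}
```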
2) Drop include_stdin_notification_section + MessageNotificationStyle.
The flag toggled between two phrasings of the same message-delivery
contract — "you'll be restarted" vs "messages may arrive directly". The
LLM doesn't need to distinguish; it just needs to know not to poll. One
universal Message Notifications section now always emits, telling the
agent to call check_messages at natural breakpoints.
Updates all 5 driver call sites to use the simpler
PromptOptions {..Default::default()} pattern. Adds 4 prompt tests covering
both override paths and asserting the conditional notification branching is
gone.
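For illustration, the call-site pattern looks roughly like this; the PromptOptions stand-in below only mirrors the one field this PR names, and everything else is an assumption:

```rust
// Self-contained stand-in, not the crate's real PromptOptions: only
// system_prompt_override is taken from this PR.
#[derive(Default)]
struct PromptOptions {
    system_prompt_override: Option<String>,
    // ...other prompt knobs elided
}

fn main() {
    // Driver call sites set only what they need and default the rest.
    let options = PromptOptions {
        system_prompt_override: None, // benches/tests pass Some(prompt_text)
        ..Default::default()
    };
    assert!(options.system_prompt_override.is_none());
}
```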
bench/decision-trigger/README.md gains an A/B section showing how to use
the env var to compare prompt variants without recompiling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds harder benchmark cases and a multi-model matrix runner that exposes real differences between the structural-rule prompt and the model's own inference style.

Hard cases (cases-hard.tsv, 15 scenarios):
- Realistic narrative framings (P0 escalation, sprint capacity, vendor procurement, hiring under deadline, SOC2 compliance, time-box at sprint end, architecture review, VP briefing)
- No verdict-flavored phrasing — no "merge or hold", no "what's your call", no "X or Y". Decisions must be inferred from situational context
- Trap cases for chat (rhetorical frustration, retrospective, exploration, status update, info request, debug ask, facilitator role)

Multi-model matrix:
- models.tsv lists (runtime, model, tier, label) rows. Default ships with the two-per-family pattern: Anthropic best/efficiency, OpenAI best/efficiency
- run.sh now takes RUNTIME, MODEL, RUN_LABEL, CASES via env so it can be driven by the matrix runner
- run-matrix.sh sweeps all rows in models.tsv, runs the bench once per model, collates a side-by-side matrix.tsv

Baseline (cases-hard.tsv, structural-rule prompt):
- claude/opus: 9/15 (conservative — implicit delegation reads as chat)
- claude/sonnet: 15/15 (best — infers delegation from context)
- codex/gpt-5.5: 14/15 (one hiring miss)
- codex/gpt-5.4-mini: 13/15 (one mis-fire, one silent)

All 4 models score 7/7 on chat cases. The discriminator is property #4 (Delegated) — whether the model treats "we need X by Y" as an implicit delegation. Same prompt, same cases, 9-15/15 spread by model. BASELINE.md captures this and lays out the implications for the next prompt iteration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
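For reference, a hypothetical Rust rendering of what the matrix sweep does per models.tsv row (the real runner is run-matrix.sh; the column order follows the (runtime, model, tier, label) description above, and everything else here is an assumption):

```rust
use std::fs;

// Hypothetical illustration only; the actual sweep is run-matrix.sh.
fn main() -> std::io::Result<()> {
    let models = fs::read_to_string("models.tsv")?;
    for row in models
        .lines()
        .filter(|l| !l.trim().is_empty() && !l.starts_with('#'))
    {
        let cols: Vec<&str> = row.split('\t').collect();
        if let [runtime, model, _tier, label] = cols[..] {
            // run.sh is driven entirely through env vars, one bench run per
            // row; run-matrix.sh then collates the results into matrix.tsv.
            println!(
                "RUNTIME={runtime} MODEL={model} RUN_LABEL={label} \
                 CASES=cases-hard.tsv ./run.sh"
            );
        }
    }
    Ok(())
}
```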
Captures the actual head-to-head between the OLD prompt (input-pattern enumeration on main) and the NEW prompt (four-property structural test on this branch). Same 15 hard cases, same 4 models, parallel runner.

Headline scores (cases-hard.tsv):

  Model               Tier        OLD    NEW     Δ
  claude/opus         best        15/15   9/15  -6
  claude/sonnet       efficiency  14/15  15/15  +1
  codex/gpt-5.5       best        14/15  14/15   0
  codex/gpt-5.4-mini  efficiency  12/15  13/15  +1
  -------------------------------------------------
  average                         13.75  12.75  -1.0

Aggregate behavior:
- Decisions caught (32 max): OLD 30/32 (94%) vs NEW 23/32 (72%)
- Chat held back (28 max): OLD 25/28 (89%) vs NEW 28/28 (100%)

The structural rewrite is NOT a strict win. NEW closes the retrospective false-positive (case 10: "in hindsight, was that the right call?" — OLD over-fires on sonnet/gpt-5.5/gpt-5.4-mini, NEW correctly chats on all). But NEW costs Opus 6 implicit-delegation decisions because Opus reads property #4 (Delegated) strictly: "we need X by Y" doesn't count as delegation without an explicit "you pick" clause.

Sonnet, gpt-5.5, and gpt-5.4-mini are stable across both prompts — they infer delegation from situational context regardless of which rule is loaded. The Opus regression is model-specific.

BASELINE.md captures the full per-case matrix, named winners and losers, known failure modes (gpt-5.4-mini case 1 silent under NEW; gpt-5.5 case 5 flips), and three iteration paths for the next prompt revision.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- bench/decision-trigger/ — reproducible benchmark that runs each case in an isolated claude/sonnet agent in parallel, classifies the response turn, and scores match-rate against predictions.
Why
The enumerated triggers didn't scale. Verdict-shaped requests in triage, hiring, time-boxing, and compliance use neutral phrasing ("tell me which 3 to fix", "walk me through whether we need X", "status on the auth bug") and were falling through to send_message. The structural rule generalizes to any new workflow without re-listing phrasings.
How it was validated
bench/decision-trigger/run.sh spawns 15 isolated claude/sonnet agents in parallel (one per case), sends each its DM, classifies the response turn from the server log, and scores against predictions. The 15 cases span 8 work domains: PR review, code explain, vendor pick, architecture, status check, triage, hiring, doc proofread, compliance, time-box, naming. 9 decision predictions / 6 chat predictions, all match.
The benchmark itself is structured for re-runs — agent-name uniqueness via run-id, intentional pause of non-bench agents during runs (avoids #all welcome storms), strict side-effect-free prompts.
Test plan
- cargo test --lib drivers::prompt::tests — 11 prompt tests pass (8 existing + 1 new structural-property test, plus 2 modified for added assertions)
- bench/decision-trigger/run.sh — 15/15 match

🤖 Generated with Claude Code