feat(workspace): add agent workspace explorer #4

Merged: Fullstop000 merged 3 commits into main from codex/agent-workspace-panel on Mar 22, 2026

Conversation

@Fullstop000
Owner

Summary

  • replace the placeholder workspace list with a split-pane agent workspace explorer
  • add workspace file preview API support with metadata for path, size, timestamp, and truncation
  • expand QA coverage with workspace server/e2e tests and tighten WRK-001 in qa/cases/agents.md
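
A minimal sketch of what a preview payload with the metadata named in the second bullet (path, size, timestamp, truncation) could look like. The struct, its field names, and `preview_file` are hypothetical; the PR does not show the real API shape, and this version only uses the standard library.

```rust
use std::fs;
use std::io::{self, Read};
use std::path::Path;
use std::time::SystemTime;

/// Hypothetical preview payload; field names are illustrative only.
#[derive(Debug)]
pub struct FilePreview {
    pub path: String,
    pub size: u64,                    // full size on disk, in bytes
    pub modified: Option<SystemTime>, // None if the platform cannot report it
    pub truncated: bool,              // true when `content` is shorter than the file
    pub content: String,
}

/// Reads at most `max_bytes` of a file and reports whether the preview was cut.
pub fn preview_file(path: &Path, max_bytes: u64) -> io::Result<FilePreview> {
    let meta = fs::metadata(path)?;
    let mut buf = Vec::new();
    fs::File::open(path)?.take(max_bytes).read_to_end(&mut buf)?;
    Ok(FilePreview {
        path: path.display().to_string(),
        size: meta.len(),
        modified: meta.modified().ok(),
        truncated: meta.len() > buf.len() as u64,
        // Lossy decoding keeps the sketch simple if the cut lands mid-character.
        content: String::from_utf8_lossy(&buf).into_owned(),
    })
}
```

Reporting both the full `size` and a `truncated` flag lets a client show header metadata for the whole file while rendering only the capped content.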

Verification

  • cargo test
  • live browser verification on local app:
    • opened agent alice
    • switched to Workspace
    • verified real workspace path and copy action
    • selected notes/work-log.md
    • verified selected-row highlight and header metadata
    • toggled Raw / Preview
    • refreshed workspace and confirmed no console errors, workspace endpoints returned 200

Fullstop000 merged commit 4ce92a8 into main on Mar 22, 2026
3 checks passed
Fullstop000 deleted the codex/agent-workspace-panel branch on March 22, 2026 at 10:11
Fullstop000 added a commit that referenced this pull request Apr 27, 2026
… pointer (#117)

Address self-review code smells flagged on PR #117:

#1 — Three `*_for_test` shims leaked module internals just to bridge the
sibling-module boundary between `acp_native::tests` and the files under
test. Replaced with two visibility tightenings and one test relocation:

- `AcpNativeHandle::alloc_id` is now `pub(super)`. Deleted
  `alloc_id_for_test`. Tests call `alloc_id()` directly.
- `reader::handle_response` is now `pub(super)`. Deleted
  `handle_response_for_test`. Tests call `handle_response(...)` directly.
- The three `close()` multi-session tests moved into `handle.rs::tests`
  as an inline `#[cfg(test)] mod tests` block. Inside the same module
  they construct `AcpNativeHandle` with private field access — no
  `set_session_for_test` setter shim required. Deleted that shim too.

To support tests in multiple files, factored shared fixtures
(`TEST_CFG`, `TEST_REGISTRY`, `test_spec`, `make_core`,
`open_test_session`, `fresh_shared`) into a new
`acp_native/test_fixtures.rs` gated on `#[cfg(test)]`. Both
`acp_native::tests` and `acp_native::handle::tests` import from it.
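
The pattern described in #1 can be sketched roughly as follows. `Handle` is a stand-in for `AcpNativeHandle`, and the module layout is flattened so that the parent of `handle` is the crate root rather than `acp_native`; everything here is illustrative, not the actual code.

```rust
mod handle {
    /// Stand-in for `AcpNativeHandle`; the field is private to this module.
    pub struct Handle {
        next_request_id: u64,
    }

    impl Handle {
        pub fn new() -> Self {
            Handle { next_request_id: 3 }
        }

        /// `pub(super)`: callable from the parent module and its descendants,
        /// so sibling test modules need no `*_for_test` wrapper.
        pub(super) fn alloc_id(&mut self) -> u64 {
            let id = self.next_request_id;
            self.next_request_id += 1;
            id
        }
    }

    // Inline test module: a child of `handle`, so it can build `Handle`
    // from its private field without any setter shim.
    #[cfg(test)]
    mod tests {
        use super::Handle;

        #[test]
        fn alloc_id_continues_from_private_state() {
            let mut h = Handle { next_request_id: 7 };
            assert_eq!(h.alloc_id(), 7);
            assert_eq!(h.alloc_id(), 8);
        }
    }
}

/// Code in the parent module can call the `pub(super)` method directly.
fn first_two_ids() -> (u64, u64) {
    let mut h = handle::Handle::new();
    (h.alloc_id(), h.alloc_id())
}
```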

#2 — `InitPromptStrategy::Deferred` was annotated `#[allow(dead_code)]`
"for future runtimes." YAGNI. Deleted the variant. The enum stays as a
single-variant enum (rather than collapsing to "always immediate"
behavior) so a future driver that genuinely needs to defer can extend
without a wire-shape breaking change. Doc-comment on the enum explains
why.

#4 — `AcpDriverConfig::registry` was `fn() -> &'static AgentRegistry<...>`
wrapping a function-local static. Hoisted each driver's static to module
level (`KIMI_REGISTRY`, `GEMINI_REGISTRY`, `OPENCODE_REGISTRY`,
`TEST_REGISTRY`) and changed the field type to
`&'static AgentRegistry<AcpNativeCore>`. Removed the `(cfg.registry)()`
call indirection at every use site. `AgentRegistry::new` is `const fn`
so this just works.
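
A rough illustration of the change in #4, assuming a registry whose constructor is a `const fn`. The `AgentRegistry` body, `DriverConfig`, and `KIMI_CFG` below are stand-ins; only the overall shape (module-level static plus a `&'static` field) comes from the commit message.

```rust
use std::sync::Mutex;

/// Stand-in for `AgentRegistry<T>`; the real type's contents are not shown in the source.
pub struct AgentRegistry<T> {
    entries: Mutex<Vec<T>>,
}

impl<T> AgentRegistry<T> {
    /// `const fn`, so the registry can initialise a `static` directly.
    pub const fn new() -> Self {
        AgentRegistry { entries: Mutex::new(Vec::new()) }
    }

    /// Adds an entry and returns the new entry count.
    pub fn register(&self, item: T) -> usize {
        let mut entries = self.entries.lock().unwrap();
        entries.push(item);
        entries.len()
    }
}

// Module-level static, replacing a function-local static hidden behind
// a `fn() -> &'static AgentRegistry<_>` pointer.
static KIMI_REGISTRY: AgentRegistry<&'static str> = AgentRegistry::new();

/// Hypothetical config: the field is a plain reference, so use sites read
/// `cfg.registry` instead of calling `(cfg.registry)()`.
pub struct DriverConfig {
    pub registry: &'static AgentRegistry<&'static str>,
}

static KIMI_CFG: DriverConfig = DriverConfig { registry: &KIMI_REGISTRY };

fn register_via_config(name: &'static str) -> usize {
    KIMI_CFG.registry.register(name)
}
```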

Verified: cargo fmt --check (clean), cargo test --lib (324 passed),
cargo clippy --lib --tests -- -D warnings (clean).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fullstop000 added a commit that referenced this pull request Apr 27, 2026
…an label test

- templates.rs: pass human.id and result.id instead of human.name/result.name
  to join_channel (Copilot review comments #2, #3)
- store_tests: add UUID-id human join assertion verifying label resolution
  (Copilot review comment #1)
- agents.rs auto-join path already fixed in prior refactor commit 9e10c5c
  (Copilot review comment #4)
Fullstop000 added a commit that referenced this pull request Apr 27, 2026
…encode (#111) (#117)

* refactor(drivers): extract shared acp_native base from gemini/kimi/opencode (#111)

Three ACP-native drivers (gemini, kimi, opencode) each carried ~1500-2700
lines of structurally near-identical code: reader loop, response routing,
session lifecycle, cancel/close, ensure_started semantics, EOF drain,
permission auto-approval. Bug fixes had to be applied in three places and
behaviors drifted apart at edges.

Move all of it into `src/agent/drivers/acp_native/`:
- `mod.rs` — `AcpDriverConfig` (struct of fn pointers + bools + enums) +
  `InitPromptStrategy` + shared `open_session` helper.
- `state.rs` — `SharedReaderState`, `PendingRequest`, `SessionState`.
- `core.rs` — `AcpNativeCore`, `ensure_started` (race-safe lazy spawn,
  non-sticky failure), `spawn_and_initialize`, `is_stale`, `Drop`.
- `handle.rs` — `AcpNativeHandle` + full `Session` impl: run, prompt,
  cancel, close (with `closed_emitted` race guard).
- `reader.rs` — `reader_loop`, `handle_response`, `handle_session_update`,
  `pick_session*`. Routes responses by JSON-RPC id through
  `pending_requests` map (avoids `acp_protocol::parse_line`'s id-bucketing
  that misclassifies `session/new` at id≥3 as PromptResponse).
- `tests.rs` — 19 generic tests with a `TestConfig`. Audit table at top
  maps each pre-migration per-driver test to its shared equivalent.
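
To make the `reader.rs` routing point concrete, here is a small sketch of id-keyed response routing next to a naive id-bucketing rule. The enum variants and `classify_by_bucket` are assumptions for contrast; the actual `parse_line` logic is not shown in this PR.

```rust
use std::collections::HashMap;

/// What kind of request is still waiting for a response (hypothetical variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingRequest {
    Initialize,
    SessionNew,
    Prompt,
}

/// How a response with a given id gets classified.
#[derive(Debug, PartialEq, Eq)]
pub enum Routed {
    Known(PendingRequest),
    Unmatched, // no request with this id is outstanding
}

/// Looks the id up in the pending map and removes it, so each response
/// is consumed at most once.
pub fn handle_response(pending: &mut HashMap<u64, PendingRequest>, id: u64) -> Routed {
    match pending.remove(&id) {
        Some(kind) => Routed::Known(kind),
        None => Routed::Unmatched,
    }
}

/// Assumed shape of an id-bucketing rule: it guesses the kind from the id
/// alone, so a `session/new` sent at id >= 3 looks like a prompt response.
pub fn classify_by_bucket(id: u64) -> PendingRequest {
    match id {
        1 => PendingRequest::Initialize,
        2 => PendingRequest::SessionNew,
        _ => PendingRequest::Prompt,
    }
}
```

With a pending map, the classification depends on what was actually sent under that id, not on where the id happens to fall.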

Per-runtime variation lives entirely in the static `&'static AcpDriverConfig`
each driver owns. No trait, no generics — three runtimes, single
instantiation, function-pointer dispatch.

Behavior preserved bit-for-bit:
- Cancel stays local-only.
- `stopReason` continues to be ignored; all completions emit Natural.
- `session/close` remains local-only (no RPC).
- No capability checking before `session/load`; no `session/resume`.
- HTTP MCP transport stays as-is.
The 8 ACP spec gaps catalogued in the plan are tracked as follow-up issues
— each becomes a 1-place fix now that the base is shared.

New shared test `ensure_started_concurrent` closes a coverage gap: drives
two concurrent `ensure_started` calls and asserts the slow path runs
exactly twice (each caller retries after its predecessor fails, proving
serialization + non-stickiness without needing a real runtime binary).
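
The property that test checks can be sketched with standard-library threads instead of tokio tasks. `Core`, its fields, and the failing spawn closure are hypothetical; the sketch only shows why two failing callers imply exactly two slow-path runs when startup is serialized and failures are not cached.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

struct Core {
    state: Mutex<Option<u32>>, // Some(handle) once startup has succeeded
    slow_path_runs: AtomicUsize,
}

impl Core {
    /// Serialized lazy start: the lock is held across `spawn`, and a failure
    /// leaves `state` empty so the next caller retries (non-sticky).
    fn ensure_started(&self, spawn: impl Fn() -> Result<u32, String>) -> Result<u32, String> {
        let mut guard = self.state.lock().unwrap();
        if let Some(handle) = *guard {
            return Ok(handle);
        }
        self.slow_path_runs.fetch_add(1, Ordering::SeqCst);
        let handle = spawn()?; // on error nothing is cached
        *guard = Some(handle);
        Ok(handle)
    }
}

/// Runs two callers whose spawn always fails and returns the slow-path count.
fn run_two_failing_callers() -> usize {
    let core = Arc::new(Core {
        state: Mutex::new(None),
        slow_path_runs: AtomicUsize::new(0),
    });
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let core = Arc::clone(&core);
            thread::spawn(move || core.ensure_started(|| Err("no runtime binary".to_string())))
        })
        .collect();
    for h in handles {
        // A panic in either caller surfaces here instead of being masked.
        let result = h.join().expect("caller thread panicked");
        assert!(result.is_err());
    }
    core.slow_path_runs.load(Ordering::SeqCst)
}
```

If failures were sticky, the second caller would return a cached error and the count would be one; if startup were not serialized, the count would still be two but the runs could overlap.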

Opencode shape conversion: collapse the `FactoryPath::Bootstrap | Secondary`
split into the unified handle. The race the bootstrap protected against
(deferred prompt id colliding with a racing secondary `new_session`)
cannot occur in the unified model — `ensure_started` serializes through
`start_in_progress`, and `alloc_id` runs only after that mutex is
released. Deletes `OpencodeAgentProcess`, `FactoryPath`,
`run_bootstrap*`, `send_deferred_bootstrap_prompt`, the local
`classify_line`/`dispatch_line`, and the bootstrap-only state fields
(`bootstrap_pending_prompt`, `bootstrap_session_id`,
`bootstrap_requested_session_id`).

Diff:
- gemini.rs   1718 → 423 (−75%)
- kimi.rs     2737 → 328 (−88%)
- opencode.rs 2834 → 339 (−88%)
- net: 7289 → 3650 lines (−50% across drivers + new shared module incl. tests)

Verified: cargo test (527 passed), cargo test --test e2e_tests (10 passed),
cargo clippy --lib --tests -- -D warnings (clean).

Plan: docs/plans/2026-04-27-acp-native-driver-unification-plan.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(acp_native): rustfmt + address Copilot PR review

CI fixes:
- cargo fmt across acp_native module + per-driver wrappers (rustfmt rules
  prefer multi-line struct/match args).

PR review (Copilot):
- mod.rs: scrap reference to gitignored docs/plans/* file; point readers
  at issue #111 + the PR description for spec gap context.
- mod.rs: rewrite InitPromptStrategy::Deferred docstring — opencode no
  longer needs it. Kept as a config knob for future runtimes that
  genuinely defer the first prompt.
- tests.rs: ensure_started_concurrent now calls `.expect()` on each tokio
  JoinHandle result so a panic in either task fails the test instead of
  being masked.
- tests.rs + handle.rs: add the missing
  `alloc_id_starts_at_3_after_spawn_and_initialize` shared test that the
  audit table claimed existed. It checks that, once ensure_started seeds
  next_request_id=3, the first allocated id is exactly 3 and that no id-3
  placeholder is pre-registered. Exposed via a #[cfg(test)] alloc_id_for_test
  shim on AcpNativeHandle.

Verified locally: cargo fmt --check (clean), cargo test --lib (324 passed,
+1 vs prior count for the new alloc_id test), cargo clippy --lib --tests
-- -D warnings (clean).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(acp_native): tighten API — drop test shims, dead variant, fn pointer (#117)

Address self-review code smells flagged on PR #117:

#1 — Three `*_for_test` shims leaked module internals just to bridge the
sibling-module boundary between `acp_native::tests` and the files under
test. Replaced with two visibility tightenings and one test relocation:

- `AcpNativeHandle::alloc_id` is now `pub(super)`. Deleted
  `alloc_id_for_test`. Tests call `alloc_id()` directly.
- `reader::handle_response` is now `pub(super)`. Deleted
  `handle_response_for_test`. Tests call `handle_response(...)` directly.
- The three `close()` multi-session tests moved into `handle.rs::tests`
  as an inline `#[cfg(test)] mod tests` block. Inside the same module
  they construct `AcpNativeHandle` with private field access — no
  `set_session_for_test` setter shim required. Deleted that shim too.

To support tests in multiple files, factored shared fixtures
(`TEST_CFG`, `TEST_REGISTRY`, `test_spec`, `make_core`,
`open_test_session`, `fresh_shared`) into a new
`acp_native/test_fixtures.rs` gated on `#[cfg(test)]`. Both
`acp_native::tests` and `acp_native::handle::tests` import from it.

#2 — `InitPromptStrategy::Deferred` was annotated `#[allow(dead_code)]`
"for future runtimes." YAGNI. Deleted the variant. The enum stays as a
single-variant enum (rather than collapsing to "always immediate"
behavior) so a future driver that genuinely needs to defer can extend
without a wire-shape breaking change. Doc-comment on the enum explains
why.

#4 — `AcpDriverConfig::registry` was `fn() -> &'static AgentRegistry<...>`
wrapping a function-local static. Hoisted each driver's static to module
level (`KIMI_REGISTRY`, `GEMINI_REGISTRY`, `OPENCODE_REGISTRY`,
`TEST_REGISTRY`) and changed the field type to
`&'static AgentRegistry<AcpNativeCore>`. Removed the `(cfg.registry)()`
call indirection at every use site. `AgentRegistry::new` is `const fn`
so this just works.

Verified: cargo fmt --check (clean), cargo test --lib (324 passed),
cargo clippy --lib --tests -- -D warnings (clean).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(acp_native): unify pick_session into pick_session_and_run (#117)

Self-review followup: the two pick_session* helpers had identical hint
resolution logic (lock, single-session fallback, warnings) and differed
only in whether the return value bundled the session's run_id. Six call
sites were split arbitrarily — Thinking/Text used pick_session_and_run,
ToolCall/ToolCallUpdate/ToolResult/TurnEnd used pick_session. ~30 lines
of duplicated logic for a single Option<Uuid> field lookup.

Folded into one function: `pick_session_and_run` returning
`(Option<String>, Option<RunId>)`. Callers that don't want run_id
destructure with `_`. Behavior preserved bit-for-bit — same lock policy,
same warn messages (renamed to driver-agnostic "session-update" since
they no longer name a specific helper).
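
A simplified sketch of the unified helper's shape. The resolution policy here (exact hint match, then single-session fallback) is an assumption based on the description above; locking and warnings are omitted, and `RunId` is a plain integer stand-in for the real type.

```rust
use std::collections::HashMap;

/// Stand-in for the real run id type.
pub type RunId = u64;

/// Resolves a session hint to `(session_id, run_id)`.
/// Assumed policy: exact hint match first, then fall back to the only open session.
pub fn pick_session_and_run(
    sessions: &HashMap<String, Option<RunId>>,
    hint: Option<&str>,
) -> (Option<String>, Option<RunId>) {
    if let Some(id) = hint {
        if let Some(run) = sessions.get(id) {
            return (Some(id.to_string()), *run);
        }
    }
    if sessions.len() == 1 {
        // Single-session fallback: an absent or unknown hint is unambiguous.
        let (id, run) = sessions.iter().next().unwrap();
        return (Some(id.clone()), *run);
    }
    (None, None)
}

/// A caller that only needs the session id discards the run id with `_`.
pub fn session_only(
    sessions: &HashMap<String, Option<RunId>>,
    hint: Option<&str>,
) -> Option<String> {
    let (session, _) = pick_session_and_run(sessions, hint);
    session
}
```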

Verified: cargo fmt --check (clean), cargo test --lib (324 passed),
cargo clippy --lib --tests -- -D warnings (clean).

Re acp_protocol.rs location: keep at drivers::acp_protocol. It's the
ACP wire-format layer (JSON-RPC parsing + frame builders), used by
acp_native AND by event_forwarder::strip_mcp_prefix outside acp_native.
Moving into acp_native would imply ownership it doesn't have.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fullstop000 added a commit that referenced this pull request Apr 28, 2026
…ng (#116)

* feat: system message when member joins a channel (#114)

When a member joins a channel via API handlers (creation, invite, team
assignment), post a server-authored system message into the channel so
the join is visible in chat history.

Backend:
- Added Store::resolve_member_label_tx to resolve human-readable labels
  (display_name for agents, name for humans)
- Added join_channel_by_id_with_system_message and
  join_channel_with_system_message: atomically insert membership row
  and create a system message, then emit both member_joined and
  message.created stream events. Idempotent — returns false and skips
  the system message when the member is already present.
- Updated all runtime API handlers to use the new methods:
  handle_create_channel, handle_invite_channel_member,
  handle_create_agent, handle_create_team, handle_add_team_member,
  handle_launch_trio
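
The idempotent contract described above can be sketched with an in-memory stand-in. The real methods run inside a database transaction and emit stream events; this `Store` is hypothetical and only models the "insert once, announce once" behaviour.

```rust
use std::collections::HashSet;

/// In-memory stand-in for the real store; not the actual schema.
#[derive(Default)]
pub struct Store {
    members: HashSet<(String, String)>, // (channel, member label)
    pub messages: Vec<String>,          // system messages, in order
}

impl Store {
    /// Returns true when the member was newly added. When the member is
    /// already present it returns false and posts no system message.
    pub fn join_channel_with_system_message(&mut self, channel: &str, label: &str) -> bool {
        let newly_added = self
            .members
            .insert((channel.to_string(), label.to_string()));
        if newly_added {
            self.messages.push(format!("{label} joined #{channel}"));
        }
        newly_added
    }
}
```

One consequence of this contract: any code path that inserts the membership row by other means first will suppress the system message on the later join call.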

Tests:
- Added test_join_channel_with_system_message_creates_notice_and_is_idempotent
  verifying human join, agent join with 'Agent' prefix, and idempotency

* fix: ensure system message on agent creation by moving auto-join out of inner helper

The inner agent-creation helper was inserting directly into
channel_members for the #all channel. When the system-message join
method was later called for auto-join channels, its INSERT OR IGNORE
returned rows=0 (already a member), so no system message was ever created.

Fix: remove the channel_members INSERT from the inner helper and have
its callers invoke the system-message join method instead. The connection
lock is dropped first to avoid deadlock with the method's own lock acquisition.

QA verified: creating an agent now shows 'Agent <name> joined #all' in chat.

* refactor: eliminate join_channel duplication, promote system-message variants to canonical API

The old silent join methods duplicated the INSERT logic and were only
used by tests. The `_with_system_message` variants were the actual
production API but had verbose names.

Changes:
- Removed the old silent join methods from the public API
- Renamed the two `_with_system_message` variants to the shorter canonical names
- The name-based variant delegates to the ID-based one after name resolution, eliminating duplication
- Added silent helpers for unit tests
- Added a separate helper for integration tests
- Updated all test files to use silent helpers where they assert on message counts
- Fixed test data bugs where an argument was passed by name instead of ID

* fix(copilot-review): use stable IDs in template handler, add UUID human label test

- templates.rs: pass human.id and result.id instead of human.name/result.name
  to join_channel (Copilot review comments #2, #3)
- store_tests: add UUID-id human join assertion verifying label resolution
  (Copilot review comment #1)
- agents.rs auto-join path already fixed in prior refactor commit 9e10c5c
  (Copilot review comment #4)

* style: cargo fmt

* refactor: unify system-message structured payloads

Rename `messages.notice` column to `messages.payload` and migrate task
events from JSON-in-content to the same payload column. Two roles, one
column:

  - `content` — always-readable English fallback
  - `payload` — kind-discriminated JSON (`{kind, audience?, ...}`)

Producers:
  - `member_joined` → payload `{kind, audience: "humans", actor, verb, target}`,
    content `"alice joined #planning"`
  - `task_event` → payload (existing camelCase shape) + English sentence in
    content via new `as_human_sentence()` (no `[task]` prefix)

Agent visibility filter is structural — `payload.audience != 'humans'`,
not a kind allowlist. Adding new ambient kinds = set audience humans.
Adding new operational kinds = omit audience (defaults to all). Honors
the project memory rule "no typed event allowlists."
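
As a sketch of the structural filter, assuming the payload has already been parsed into a struct: the type below and the rule that a message with no payload stays visible are assumptions, while the `audience != "humans"` check comes from the description above.

```rust
/// Hypothetical parsed form of the `payload` column.
pub struct MessagePayload {
    pub kind: String,
    pub audience: Option<String>, // omitted audience means "everyone"
}

/// Agents see every message except those explicitly addressed to humans.
/// The check never inspects `kind`, so new kinds need no allowlist entry.
pub fn visible_to_agents(payload: Option<&MessagePayload>) -> bool {
    match payload {
        Some(p) => p.audience.as_deref() != Some("humans"),
        None => true, // assumption: plain messages carry no payload
    }
}
```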

Frontend `Notice/NoticeActor/NoticeTarget` interfaces collapse to a loose
`MessagePayload` (`{kind, [k]: unknown}`); `SystemNotice` and
`parseTaskEvent` narrow at use time. `format_message_for_agent` deleted —
agents read `content` raw now that producers always write it.

No data migration. Existing dev DBs need to be reset on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.0.4.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: simplify v0.0.4.0 changelog entry

Drop implementation detail in favor of two user-facing bullets.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fullstop000 added a commit that referenced this pull request May 1, 2026
Adds harder benchmark cases and a multi-model matrix runner that exposes
real differences between the structural-rule prompt and the model's own
inference style.

Hard cases (cases-hard.tsv, 15 scenarios):
- Realistic narrative framings (P0 escalation, sprint capacity, vendor
  procurement, hiring under deadline, SOC2 compliance, time-box at sprint
  end, architecture review, VP briefing)
- No verdict-flavored phrasing — no "merge or hold", no "what's your
  call", no "X or Y". Decisions must be inferred from situational context
- Trap cases for chat (rhetorical frustration, retrospective, exploration,
  status update, info request, debug ask, facilitator role)

Multi-model matrix:
- models.tsv lists (runtime, model, tier, label) rows. Default ships with
  the two-per-family pattern: Anthropic best/efficiency, OpenAI best/
  efficiency
- run.sh now takes RUNTIME, MODEL, RUN_LABEL, CASES via env so it can be
  driven by the matrix runner
- run-matrix.sh sweeps all rows in models.tsv, runs the bench once per
  model, collates a side-by-side matrix.tsv

Baseline (cases-hard.tsv, structural-rule prompt):
- claude/opus:        9/15  (conservative — implicit delegation reads as chat)
- claude/sonnet:      15/15 (best — infers delegation from context)
- codex/gpt-5.5:      14/15 (one hiring miss)
- codex/gpt-5.4-mini: 13/15 (one mis-fire, one silent)

All 4 models score 7/7 on chat cases. The discriminator is property #4
(Delegated) — whether the model treats "we need X by Y" as an implicit
delegation. Same prompt, same cases, 9-15/15 spread by model.

BASELINE.md captures this and lays out the implications for the next
prompt iteration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fullstop000 added a commit that referenced this pull request May 1, 2026
Captures the actual head-to-head between the OLD prompt (input-pattern
enumeration on main) and the NEW prompt (four-property structural test
on this branch). Same 15 hard cases, same 4 models, parallel runner.

Headline scores (cases-hard.tsv):

  Model               Tier         OLD     NEW    Δ
  claude/opus         best         15/15   9/15   -6
  claude/sonnet       efficiency   14/15   15/15  +1
  codex/gpt-5.5       best         14/15   14/15   0
  codex/gpt-5.4-mini  efficiency   12/15   13/15  +1
  -------------------------------------------------
  average                          13.75   12.75  -1.0

Aggregate behavior:
  Decisions caught (32 max):  OLD 30/32 (94%) vs NEW 23/32 (72%)
  Chat held back (28 max):    OLD 25/28 (89%) vs NEW 28/28 (100%)
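
The averages and percentages above follow directly from the table; a short arithmetic check, using only the numbers already listed (the constant and function names are illustrative):

```rust
/// Per-model scores out of 15, in table order:
/// opus, sonnet, gpt-5.5, gpt-5.4-mini.
pub const OLD: [u32; 4] = [15, 14, 14, 12];
pub const NEW: [u32; 4] = [9, 15, 14, 13];

/// Mean score across models.
pub fn average(scores: &[u32]) -> f64 {
    scores.iter().sum::<u32>() as f64 / scores.len() as f64
}

/// Percentage rounded to the nearest whole number.
pub fn percent(hit: u32, max: u32) -> u32 {
    (100.0 * hit as f64 / max as f64).round() as u32
}
```

The aggregate rows are also consistent with the per-model totals: 30 + 25 = 55 correct calls under OLD and 23 + 28 = 51 under NEW, matching the column sums.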

The structural rewrite is NOT a strict win. NEW closes the retrospective
false-positive (case 10: "in hindsight, was that the right call?" — OLD
over-fires on sonnet/gpt-5.5/gpt-5.4-mini, NEW correctly chats on all).
But NEW costs Opus 6 implicit-delegation decisions because Opus reads
property #4 (Delegated) strictly: "we need X by Y" doesn't count as
delegation without an explicit "you pick" clause.

Sonnet, gpt-5.5, and gpt-5.4-mini are stable across both prompts —
they infer delegation from situational context regardless of which rule
is loaded. The Opus regression is model-specific.

BASELINE.md captures the full per-case matrix, named winners and losers,
known failure modes (gpt-5.4-mini case 1 silent under NEW; gpt-5.5 case 5
flips), and three iteration paths for the next prompt revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fullstop000 added a commit that referenced this pull request May 2, 2026
…ark (#133)

* feat(drivers/codex): add gpt-5.5 to model list

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(prompt+bench): structural decision trigger + reproducible benchmark

Replace the input-pattern enumeration in the Decision Inbox prompt section
(PR-review phrasing, "should I X or Y", config-knob examples) with a
four-property structural test: mutually-exclusive options + blocking +
material consequence + delegated picker. The trigger is the shape of the
agent's intended reply, not the asker's words. The PR-review case
becomes the canonical example, not the rule.
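
Read as a predicate, the four-property test is a conjunction. The sketch below is only an illustration of that logic: the real rule is prompt text interpreted by the model, not code, and the type and field names are hypothetical.

```rust
/// The shape of the reply the agent intends to give (hypothetical fields).
#[derive(Clone, Copy)]
pub struct ReplyShape {
    pub mutually_exclusive_options: bool,
    pub blocking: bool,
    pub material_consequence: bool,
    pub delegated_picker: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Tool {
    DispatchDecision,
    SendMessage,
}

/// All four properties must hold; otherwise the reply goes out as chat.
pub fn choose_tool(shape: ReplyShape) -> Tool {
    let is_decision = shape.mutually_exclusive_options
        && shape.blocking
        && shape.material_consequence
        && shape.delegated_picker;
    if is_decision {
        Tool::DispatchDecision
    } else {
        Tool::SendMessage
    }
}
```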

Why: the enumeration didn't scale. Verdict-shaped requests in triage,
hiring, time-boxing, and compliance use neutral phrasing ("tell me which
3 to fix", "walk me through whether we need X") and were falling
through to send_message. The structural rule generalizes to any new
workflow without re-listing phrasings.

Add bench/decision-trigger/ — a reproducible benchmark that spins up
one isolated claude/sonnet agent per case in parallel, dispatches a
DM, and classifies the response turn as decision (dispatch_decision) or
chat (send_message). 15 cases across 10 work domains (PR review, vendor
pick, architecture, status, triage, hiring, doc, compliance, time-box,
naming). Current score: 15/15.

The benchmark intentionally pauses non-bench agents during runs so the
bench cohort isn't drowned in #all welcome messages. Side-effect-free
prompts only — README documents the constraint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(prompt): whole-prompt override + drop vestigial notification flag

Two follow-up changes building on the structural-rule rewrite:

1) Whole-prompt injectability for benchmark/A-B convenience.
   Adds CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE env var: when set to a readable
   file, the file's contents become the system prompt verbatim. Also adds
   PromptOptions.system_prompt_override for in-process tests/benches.
   Programmatic override wins over env var. Tool names must be pre-resolved
   in the override file (no template substitution). Lets the bench compare
   prompt variants without rebuilding the binary.

2) Drop include_stdin_notification_section + MessageNotificationStyle.
   The flag toggled between two phrasings of the same message-delivery
   contract — "you'll be restarted" vs "messages may arrive directly". The
   LLM doesn't need to distinguish; it just needs to know not to poll. One
   universal Message Notifications section now always emits, telling the
   agent to call check_messages at natural breakpoints.
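
The override precedence from item 1 could look roughly like this. `resolve_system_prompt` and `resolve_from_env` are hypothetical helpers, and falling back to the built-in prompt when the file is unreadable is an assumption; the env var name and the `system_prompt_override` field come from the description above.

```rust
use std::fs;
use std::path::{Path, PathBuf};

pub const OVERRIDE_ENV_VAR: &str = "CHORUS_SYSTEM_PROMPT_OVERRIDE_FILE";

#[derive(Default)]
pub struct PromptOptions {
    pub system_prompt_override: Option<String>,
}

/// Precedence: in-process override, then a readable override file, then
/// the built-in prompt. File contents are used verbatim (no templating).
pub fn resolve_system_prompt(
    opts: &PromptOptions,
    env_file: Option<&Path>,
    built_in: &str,
) -> String {
    if let Some(text) = &opts.system_prompt_override {
        return text.clone();
    }
    if let Some(contents) = env_file.and_then(|p| fs::read_to_string(p).ok()) {
        return contents;
    }
    built_in.to_string()
}

/// Thin wrapper that reads the override path from the environment.
pub fn resolve_from_env(opts: &PromptOptions, built_in: &str) -> String {
    let env_file = std::env::var_os(OVERRIDE_ENV_VAR).map(PathBuf::from);
    resolve_system_prompt(opts, env_file.as_deref(), built_in)
}
```

Keeping the environment lookup in a separate wrapper makes the precedence logic testable without mutating process-wide state.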

Updates all 5 driver call sites to use the simpler
`PromptOptions {..Default::default()}` pattern. Adds 4 prompt tests covering
both override paths and asserting the conditional notification branching is gone.

bench/decision-trigger/README.md gains an A/B section showing how to use
the env var to compare prompt variants without recompiling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* bench(decision-trigger): hard cases + multi-model matrix sweep

Adds harder benchmark cases and a multi-model matrix runner that exposes
real differences between the structural-rule prompt and the model's own
inference style.

Hard cases (cases-hard.tsv, 15 scenarios):
- Realistic narrative framings (P0 escalation, sprint capacity, vendor
  procurement, hiring under deadline, SOC2 compliance, time-box at sprint
  end, architecture review, VP briefing)
- No verdict-flavored phrasing — no "merge or hold", no "what's your
  call", no "X or Y". Decisions must be inferred from situational context
- Trap cases for chat (rhetorical frustration, retrospective, exploration,
  status update, info request, debug ask, facilitator role)

Multi-model matrix:
- models.tsv lists (runtime, model, tier, label) rows. Default ships with
  the two-per-family pattern: Anthropic best/efficiency, OpenAI best/
  efficiency
- run.sh now takes RUNTIME, MODEL, RUN_LABEL, CASES via env so it can be
  driven by the matrix runner
- run-matrix.sh sweeps all rows in models.tsv, runs the bench once per
  model, collates a side-by-side matrix.tsv

Baseline (cases-hard.tsv, structural-rule prompt):
- claude/opus:        9/15  (conservative — implicit delegation reads as chat)
- claude/sonnet:      15/15 (best — infers delegation from context)
- codex/gpt-5.5:      14/15 (one hiring miss)
- codex/gpt-5.4-mini: 13/15 (one mis-fire, one silent)

All 4 models score 7/7 on chat cases. The discriminator is property #4
(Delegated) — whether the model treats "we need X by Y" as an implicit
delegation. Same prompt, same cases, 9-15/15 spread by model.

BASELINE.md captures this and lays out the implications for the next
prompt iteration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* bench(decision-trigger): A/B baseline OLD vs NEW prompt across 4 models

Captures the actual head-to-head between the OLD prompt (input-pattern
enumeration on main) and the NEW prompt (four-property structural test
on this branch). Same 15 hard cases, same 4 models, parallel runner.

Headline scores (cases-hard.tsv):

  Model               Tier         OLD     NEW    Δ
  claude/opus         best         15/15   9/15   -6
  claude/sonnet       efficiency   14/15   15/15  +1
  codex/gpt-5.5       best         14/15   14/15   0
  codex/gpt-5.4-mini  efficiency   12/15   13/15  +1
  -------------------------------------------------
  average                          13.75   12.75  -1.0

Aggregate behavior:
  Decisions caught (32 max):  OLD 30/32 (94%) vs NEW 23/32 (72%)
  Chat held back (28 max):    OLD 25/28 (89%) vs NEW 28/28 (100%)

The structural rewrite is NOT a strict win. NEW closes the retrospective
false-positive (case 10: "in hindsight, was that the right call?" — OLD
over-fires on sonnet/gpt-5.5/gpt-5.4-mini, NEW correctly chats on all).
But NEW costs Opus 6 implicit-delegation decisions because Opus reads
property #4 (Delegated) strictly: "we need X by Y" doesn't count as
delegation without an explicit "you pick" clause.

Sonnet, gpt-5.5, and gpt-5.4-mini are stable across both prompts —
they infer delegation from situational context regardless of which rule
is loaded. The Opus regression is model-specific.

BASELINE.md captures the full per-case matrix, named winners and losers,
known failure modes (gpt-5.4-mini case 1 silent under NEW; gpt-5.5 case 5
flips), and three iteration paths for the next prompt revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: cargo fmt prompt.rs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>