Skip to content

feat(ce-review): add per-finding judgment loop to Interactive mode#590

Merged
tmchow merged 21 commits intomainfrom
tmchow/ce-review-interactive
Apr 18, 2026
Merged

feat(ce-review): add per-finding judgment loop to Interactive mode#590
tmchow merged 21 commits intomainfrom
tmchow/ce-review-interactive

Conversation

@tmchow
Copy link
Copy Markdown
Collaborator

@tmchow tmchow commented Apr 18, 2026

Summary

Before this PR, ce:review's Interactive mode asked one bucket-level policy question covering every remaining gated_auto and manual finding. Users either approved the whole bucket, deferred the whole bucket, or scrolled a pipe-delimited table with no way to engage per-item. In practice it degraded into rubber-stamping or wholesale deferral — the gated_auto / manual routing tiers in the schema existed but were never exercised per-finding.

After this PR, a four-option routing question dispatches to focused interactions sized to the user's intent: a per-finding walk-through with plain-English framing for engaged review, an LFG path with a compact plan preview for trust-and-ship moments, a batch-file-tickets path for teams that triage in Linear / Jira / GitHub Issues, and a report-only exit. Defer actions route to the project's tracker via reasoning-based detection — never to the deprecated internal .context/ todo store. The redesign ships together with supporting changes: deterministic recommended-action tie-breaking so LFG is auditable, a shared-template framing-guidance upgrade so walk-through content comes from persona output directly, and a presentation split that keeps the question stem compact while the finding explanation renders with full markdown hierarchy.

Fixes #562

What's different

flowchart TD
    A[safe_auto fixes applied] --> B{Any gated/manual<br/>findings remain?}
    B -->|No| Z[Completion summary -> Step 5]
    B -->|Yes| C[Four-option<br/>routing question]

    C -->|A Review| W[Per-finding<br/>walk-through]
    C -->|B LFG| P1[Bulk preview]
    C -->|C File tickets| P2[Bulk preview]
    C -->|D Report only| E[Exit]

    W -->|Apply/Defer/Skip| W
    W -->|LFG the rest| P3[Bulk preview<br/>scoped to remaining]

    P1 -->|Proceed| X[Apply -> Step 3 fixer<br/>Defer -> tracker<br/>Skip/Acknowledge no-op]
    P2 -->|Proceed| Y[All findings -> tracker]
    P3 -->|Proceed| X

    X --> S{Any fixes<br/>landed?}
    Y --> E
    S -->|Yes| F[Step 5 final-next-steps]
    S -->|No| E
Loading

Routing

The current Review and approve specific gated fixes / Leave as residual work / Report only question is gone. In its place:

  • (A) Review each finding one by one — accept the recommendation or choose another action
  • (B) LFG. Apply the agent's best-judgment action per finding
  • (C) File a [TRACKER] ticket per finding without applying fixes
  • (D) Report only — take no further action

Distinguishing words (Review / LFG / File / Report) front-load so labels survive truncation in harnesses that hide option descriptions. The stem is third-person agent voice per plugins/compound-engineering/AGENTS.md:127. When tracker detection is low-confidence or the sink is unavailable, the tracker name falls back to generic wording or option C is omitted entirely with a one-line explanation.

Per-finding walk-through

Each finding renders in two surfaces: a terminal markdown block carries the explanation (heading, What's wrong, Proposed fix, Why it works); a compact two-line question stem carries the decision. The stem opens with Finding N of M — {severity} {short handle}. and frames the single recommended action as a yes/no question (Apply the format-validation guard?, Skip the fix since the fixture is being deleted?). The stem never enumerates alternatives — the option list carries those.

The four per-finding options are Apply / Defer / Skip / LFG the rest. Advisory-only findings substitute Acknowledge — mark as reviewed. N=1 suppresses LFG the rest. No-sink platforms omit Defer. The post-tie-break recommendation from Stage 5 step 7b is marked (recommended) on whichever option carries it — any of the four can be recommended, and option order is fixed so the marker alone carries the signal (no reordering to hoist the recommendation to position 1).

Proposed fix renders as intent-language prose, not quoted code. Examples: Throw on non-2xx response before parsing JSON., Replace `==` with `===` on line 42., Extract the request-building logic into a helper and call it from both sites. A code-span budget of 2 inline backtick spans per sentence caps complexity — each span a single identifier or operator, never a full statement or nested template literal. Raw code blocks are reserved for short (≤5 line) genuinely additive new code where no before-state exists. Diff blocks are banned — diffs are PR-review tooling that don't fit a decision surface.

Between findings, a one-line confirmation of the action taken (→ Applied. Fix staged at src/utils/api-client.ts:36-37., → Deferred. Ticket filed: <url>.) gives the user continuous feedback. Walk-through state is in-memory only; Apply decisions accumulate and dispatch once at end-of-loop through the existing one-fixer-consistent-tree mechanic.

Bulk preview

Before any bulk action (routing option B, option C, or walk-through LFG the rest) executes, a compact plan preview shows findings grouped by intended action — Applying (N):, Filing [TRACKER] tickets (N):, Skipping (N):, Acknowledging (N): — with one line of compressed framing per finding. Exactly two responses: Proceed or Cancel. Cancel from the walk-through's LFG the rest returns the user to the current finding, not to the routing question.

Recommended-action tie-breaking

Stage 5 gains a new sub-step 7b. When contributing reviewers imply different per-finding actions for the same merged finding, synthesis picks the most conservative using Skip > Defer > Apply > Acknowledge. Identical review artifacts now produce the same recommendation deterministically — LFG runs are reproducible and the walk-through's recommendation is stable across re-runs.

Framing quality in reviewer output

Sampling 15+ recent review artifacts surfaced two failure modes: adversarial-reviewer and api-contract-reviewer emitting why_it_matters: null (schema violation), and correctness-reviewer / maintainability-reviewer leading with code-structure-first prose (orders_controller.rb:42 has a missing authorization check) instead of observable behavior. The shared subagent template now includes a framing-guidance block — one file change applied universally to every persona via the existing dispatch path. Observable-behavior-first, why-the-fix-works grounding, 2-4 sentence budget, required-field reminder, and a positive/negative example pair. No synthesis-time rewrite pass, no per-persona file edits.

Tracker detection and defer

Detection is reasoning-based: the agent reads CLAUDE.md / AGENTS.md and whatever other documentation is obvious, then forms a { tracker_name, confidence, sink_available } tuple. Availability probes (gh auth status, MCP-tracker reachability) run at most once per session, only when the routing question is about to be asked. The fallback chain prefers durable external trackers: named tracker → GitHub Issues via gh → harness task primitive (TaskCreate / update_plan) with an explicit once-per-session durability confirmation. The internal .context/compound-engineering/todos/ system is never part of this fallback chain.

Platform question-tool loading (Claude Code)

In Claude Code, AskUserQuestion is a deferred tool — its schema is not available at session start. A new ### Interactive mode rules section at the top of SKILL.md directs the agent to load it via ToolSearch with query select:AskUserQuestion once at the start of the Interactive flow, not lazily at each invocation site. The numbered-list text fallback applies only when ToolSearch explicitly returns no match or the tool call errors — never as a path the agent takes because it skipped the load. On Codex (request_user_input) and Gemini (ask_user) this step is not required; the tools are loaded by default.

Step 5 conditioning

The existing push / PR / exit final-next-steps flow now runs only when at least one fix landed in the working tree. Skipped for routing options C and D, for LFG runs where every recommendation was Defer/Skip/Acknowledge, and for walk-throughs that completed with zero Apply decisions. Asking a user "push fixes?" after they deliberately picked Report only was incoherent.

Cohort color alignment

agent-native-reviewer frontmatter color moves from cyan (one-off) to blue, matching the structured-findings JSON reviewer cohort it shares a role with. Blue for structured JSON reviewers; red for adversarial-reviewer; yellow for the distinct-lens set (previous-comments-reviewer, cli-agent-readiness-reviewer).

Scope boundaries

  • No new ce:fix skill. The redesign lives inside ce:review.
  • No changes to the findings schema, merge/dedup routing beyond R15, or autofix-mode residual-todo creation.
  • No inline freeform fix authoring. The walk-through is a decision loop; "approve the intent but write a variant" is explicitly unsupported in v1 — users Skip and hand-edit.
  • Other modes (Autofix, Report-only, Headless) unchanged. Contract test asserts this continues to hold.
  • Internal .context/compound-engineering/todos/ phase-out is acknowledged as the long-term direction but out of scope for this PR.

Design decisions worth calling out

  • Shared subagent-template upgrade over persona upgrades or synthesis rewrite. Sampling evidence shaped this choice. Per-persona edits would inflate scope to 5+ files; a synthesis-time rewrite pass would add recurring per-review model cost and paper over the adversarial/api-contract schema violation rather than fix it. One file edit at the source fixes both gaps for every reviewer.
  • Split per-finding surface: terminal block for explanation, compact stem for decision. Packing the full attack scenario + proposed fix + reasoning into the AskUserQuestion stem produced a wall of text that couldn't render markdown. Splitting lets each surface do its job: terminal block uses markdown hierarchy; stem uses plain-text yes/no framing on a single recommendation.
  • Intent-over-syntax for Proposed fix. Showing code (raw or diffed) forces syntax-parsing during a decision loop. The walk-through's job is trust-the-action, not review-the-code; the fixer subagent owns syntax. Prose intent description with a 2-span code budget avoids nested-backtick rendering bugs observed in live testing, where the terminal's markdown renderer ate delimiters and ran adjacent words together.
  • (recommended) marker mandatory over optional. Previous wording said "may label (recommended)" and let the agent communicate the recommendation through option ordering instead. Stabilizing the order and making the marker required decouples recommendation signal from position — users don't have to re-orient when Skip is recommended instead of Apply.
  • Apply actions batch at end of walk-through, not per-finding. Preserves the existing "one fixer, consistent tree" guarantee. The fixer receives the full Apply set at once and can handle inter-fix dependencies (two Applies touching overlapping regions) in one pass. Trade-off: fix failures surface at the end of the walk-through rather than per-decision.
  • Walk-through state is in-memory only. No per-decision disk writes. An interrupted walk-through discards in-flight state; Defer actions that already executed remain in the tracker (external side effects). Avoids the complexity of state-file schema, external-edit staleness detection, and .context/ lifecycle management for a feature (inspectable partial state) with no consumer.
  • Hoisted pre-load directive over inline reminders for AskUserQuestion. Inline "load via ToolSearch" instructions at each invocation point proved unreliable in long-skill attention-decay conditions — the agent would emit the question as narrative text instead of firing the tool. A single directive in a new Interactive mode rules section (symmetric with the existing autofix / report-only / headless rules), plus a local checklist bullet before the routing question, and a gated fallback requiring confirmed ToolSearch failure, reliably fires the tool.
  • Walk-through, bulk preview, and tracker defer live in references/ files. SKILL.md was already 744 lines; inlining would add ~200 lines paid on every invocation. Per plugins/compound-engineering/AGENTS.md:109-114 rationale discipline.
  • Contract test uses structural assertions, not exact prose. Landmark tokens (file paths, required distinguishing words, mode-branch isolation) so wording improvements in future PRs do not break the test — regression guard, not authoring ossification.

Testing

  • bun test tests/review-skill-contract.test.ts — 10/10 pass. The documents policy-driven routing and residual handoff test was rewritten to cover the new structure: four-option distinguishing words (regex), third-person stem assertion, reference-file existence, subagent-template framing-guidance presence, Stage 5 tie-breaking order, internal-todos-forbidden assertion in tracker-defer.md, Step 5 fixes-applied gating, walk-through option labels, bulk-preview Proceed / Cancel contract. Structural assertions only; exact prose is not locked.
  • bun test full suite — 752 pass, 1 fail. The failing test is resolve-base.sh > prefers the PR base remote from gh metadata over origin, which fails on main too (pre-existing, environment-dependent, unrelated to this work).
  • bun run release:validate — in sync.
  • Live validation on a controlled diff (src/utils/api-client.ts seeded with 7 deliberate issues across severity and autofix-class tiers) surfaced the split-surface, intent-over-syntax, mandatory-recommended-marker, and hoisted-pre-load refinements that shipped as presentation-layer follow-ups. Scenario A (walk-through with mixed Apply/Defer/Skip decisions) now fires the question tool correctly and renders each finding's explanation in readable markdown.

Review

  • Brainstorm: docs/brainstorms/2026-04-17-ce-review-interactive-judgment-requirements.md — 28 requirements, refined via document-review (6 personas, 38 findings addressed).
  • Plan: docs/plans/2026-04-17-002-feat-ce-review-interactive-judgment-plan.md — 8 units with requirements trace, refined via document-review (6 personas, 54 findings addressed), status now completed.
  • Post-implementation self-review completed via ce:review mode:autofix equivalent. Zero P0–P2 findings. One safe_auto contract-test assertion-coverage fix applied mid-review. Run artifact at .context/compound-engineering/ce-review/20260417-215343-cde5b0e8/.
  • End-to-end pipeline learnings captured at docs/solutions/developer-experience/ce-pipeline-end-to-end-learnings-2026-04-17.md (300 lines) — the full brainstorm → plan → work → review → PR flow on this PR itself, including the research-subagent pipeline separation pattern that came out of it.

Post-deploy monitoring and validation

No additional operational monitoring required. This is a plugin skill change — no runtime infrastructure, no user data, no external API contracts. Users who invoke Interactive mode after this lands will see the new four-option routing question on their next review with gated_auto or manual findings. If UX feedback surfaces issues (wrong tracker inferred, unclear option wording, walk-through friction), iterate via prose edits in follow-up PRs.


Compound Engineering
Claude Code

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5947b8a8a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated
Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated
Comment thread plugins/compound-engineering/skills/ce-review/references/bulk-preview.md Outdated
tmchow added a commit that referenced this pull request Apr 18, 2026
- Rewrite Step 5 gating to track total fixes_applied_count across the run
  rather than listing routing options. The prior list implied options C
  and D always skip Step 5, but safe_auto fixes land BEFORE the routing
  question — picking C or D after safe_auto applied fixes should NOT
  skip push/PR handoff. The counter-based gate makes every routing path
  behave correctly. (Codex P1)
- Clarify the generic tracker label as a whole-label substitution, not a
  [TRACKER] token swap. "File an issue per finding without applying fixes"
  replaces the full "File a [TRACKER] ticket per finding without applying
  fixes" — literal token swap would have produced malformed text like
  "File a File a ticket ticket per finding". (Codex P2)
- Specify the no-sink Defer->Skip conversion as an explicit preview-time
  runtime step, not Stage 5 tie-breaking. Step 7b orders conflicting
  reviewer recommendations (Skip > Defer > Apply) and has no knowledge
  of sink_available. The downgrade happens before preview rendering and
  is surfaced to the user so they see why Defer counts moved to Skip.
  (Codex P2)
- Update contract test to match the revised Step 5 gate landmark
  (fixes_applied_count).
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 244bc3fd86

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-review/references/tracker-defer.md Outdated
tmchow added a commit that referenced this pull request Apr 18, 2026
- Split sink availability into named_sink_available and any_sink_available.
  The prior single sink_available flag conflated two distinct concepts:
  (a) whether the detected/named tracker can be invoked (drives label
      confidence — inline tracker naming needs this)
  (b) whether ANY tier in the fallback chain is reachable (drives whether
      Defer is offered at all — no-sink behavior needs this)

  Using one flag for both meant no-sink logic fired when the named tracker
  was unreachable even if `gh` would have worked, silently dropping Defer
  availability for runs without explicit tracker docs but with working
  GitHub Issues or harness task primitive fallback.

  Updated tracker-defer.md detection tuple, probe sequence, label logic,
  and failure-path cache-invalidation to use the two-flag model.
  Propagated the flag rename across SKILL.md option C rendering,
  walkthrough.md menu adaptations, and bulk-preview.md no-sink
  downgrade. (Codex P1)
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81de392dc2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated
Comment thread plugins/compound-engineering/skills/ce-review/references/walkthrough.md Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 57efb95538

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-review/references/tracker-defer.md Outdated
tmchow added a commit that referenced this pull request Apr 18, 2026
- Route option D through Step 5 when safe_auto fixes were applied
- Fix tracker placeholder to use whole-label substitution in walkthrough.md
- Replace evidence fallback with merge-tier compact fields in tracker-defer.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f620e1372e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md
Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb1c268ca6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated
Comment thread plugins/compound-engineering/skills/ce-review/references/walkthrough.md Outdated
tmchow added a commit that referenced this pull request Apr 18, 2026
- Align Stage 5 tie-break cross-reference in bulk-preview.md with SKILL.md's canonical `Skip > Defer > Apply > Acknowledge` ordering (was missing `Acknowledge`).
- Reword the zero-remaining completion summary so it doesn't imply advisory/pre-existing findings are cleared. Unqualified `All findings resolved` now applies only when no advisory/pre-existing remain; otherwise use the qualified `All actionable findings resolved ... (K advisory, J pre-existing remain in the report.)` form. Updated in SKILL.md Step 2 Interactive and walkthrough.md's zero-findings degenerate case.
- Scope the Step 2 `Verify question-tool pre-load (checklist)` bullet to Claude Code only. Codex (request_user_input) and Gemini (ask_user) don't need a ToolSearch pre-load step; the rephrased bullet says so explicitly so converted targets aren't blocked.
- Add an explicit no-sink Defer→Skip remap in walkthrough.md. When `any_sink_available: false` removes Defer from the menu, Stage 5 step 7b can still produce a Defer recommendation; the remap ensures the `(recommended)` marker always lands on a visible option. The downgrade is surfaced on the R15 conflict context line. Mirrors the shape bulk-preview.md already uses for LFG previews.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45a5424b7f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated
Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated
tmchow added a commit that referenced this pull request Apr 18, 2026
- Scope the Stage 2 intent-ambiguity pre-load step to Claude Code only, mirroring the Step 2 routing-question checklist scoping from the previous round. On Codex (request_user_input) and Gemini (ask_user) the platform-native question tool is loaded by default and no ToolSearch step applies.
- Remove out-of-tree `plugins/compound-engineering/AGENTS.md:127` reference from the routing-question stem per the skill file-reference rule (skills must only reference files within their own directory tree). Restate the third-person voice requirement locally instead.
@tmchow
Copy link
Copy Markdown
Collaborator Author

tmchow commented Apr 18, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d06d022527

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated
Comment thread plugins/compound-engineering/skills/ce-review/references/walkthrough.md Outdated
tmchow added a commit that referenced this pull request Apr 18, 2026
- Route Step 2 Interactive dispatch by option letter (A/B/C/D) rather than the rendered label text. Prevents label-variant mismatches (e.g., generic-fallback `File an issue per finding without applying fixes` vs named-tracker `File a Linear ticket ...`) from bypassing option C's dispatch. Canonical labels preserved inline for documentation; the letter is the stable dispatch signal.
- Fix contradictory `suggested_fix` render instruction in walkthrough.md template. The template line at :42 said "render as a code block for diffs/literal changes" but the substitution rules below (recently rewritten) mandate prose-first and ban diff blocks. Template placeholder now points at the substitution rules below instead of competing with them.
tmchow added 14 commits April 18, 2026 13:01
Design artifacts for upcoming ce:review Interactive mode redesign: per-finding
walk-through, compact bulk action preview before bulk actions, four-option
routing question replacing today's bucket policy question, defer-to-tracker
integration, and a small shared-template upgrade for persona framing quality.

Requirements document has 28 requirements across 7 groups; plan decomposes
into 8 implementation units with dependency ordering.
When contributing reviewers imply different per-finding actions for the same
merged finding (e.g., correctness says Apply via suggested_fix, testing says
Skip via low confidence), synthesis now picks the most conservative action
using Skip > Defer > Apply > Acknowledge.

This makes the walk-through's recommended action stable across re-runs and
makes LFG results auditable after the fact. The user can still override per
finding — the rule only determines what gets labeled "recommended."

Part of the Interactive judgment-loop redesign (Unit 1 of 8).
…late

Reviewer personas previously had no explicit guidance on how to write the
why_it_matters field — only the schema's one-line description steered them
toward observable behavior. Sampling recent review artifacts showed two
failure modes: adversarial and api-contract reviewers producing null
why_it_matters entirely, and correctness/maintainability leading with code
structure instead of observable behavior.

This change adds a dedicated framing-guidance block to the shared subagent
template. One file edit, applied universally to every persona via the
dispatch path. The block covers: lead-with-observable-behavior rule,
why-the-fix-works grounding, tight 2-4 sentence budget, required-field
reminder, and a positive/negative example pair for calibration.

Part of the Interactive judgment-loop redesign (Unit 2 of 8).
New reference file covering how Defer actions file tickets in the project's
tracker: reasoning-based detection from CLAUDE.md/AGENTS.md, probe timing
(once per session, only when the routing question is about to be asked,
cached for the run), label confidence logic, fallback chain (named tracker
-> GitHub Issues via gh -> harness task primitive; never internal todos),
no-sink behavior, failure path with Retry/Fallback/Skip, once-per-session
harness durability confirmation, and per-tracker execution behavior.

Loaded by SKILL.md when option C of the new routing question is a candidate
and when the walk-through's Defer option executes.

Part of the Interactive judgment-loop redesign (Unit 5 of 8).
New reference file defining the compact plan preview shown before every bulk
action: top-level LFG (routing option B), top-level File tickets (routing
option C), and walk-through LFG the rest. Single-screen summary grouped by
intended action (Applying / Filing tickets / Skipping / Acknowledging), one
line per finding in compressed framing-quality form, Proceed / Cancel
options only.

Cancel semantics differ by call site: routing-level Cancel returns to the
routing question; walk-through Cancel returns to the current finding.
Proceed dispatches the plan and delegates failure handling to
tracker-defer.md's Retry/Fallback/Skip path.

Part of the Interactive judgment-loop redesign (Unit 4 of 8).
New reference file defining the per-finding walk-through — the path users
enter by picking Review (option A) from the new routing question. Covers:

- Per-finding question format: mode+position indicator, persona-produced
  why_it_matters read from artifact via the same rule headless mode uses,
  proposed fix, R15 conflict-context surfacing
- Four per-finding options (Apply / Defer / Skip / LFG the rest)
- Advisory variant (Apply -> Acknowledge)
- N=1 and no-sink adaptations that suppress options without leaving
  ambiguity
- Override rule (preset actions only; no inline freeform fix authoring)
- In-memory state only — no per-decision disk writes
- End-of-walk-through dispatch: one fixer subagent for the accumulated
  Apply set, preserving the existing "one fixer, consistent tree" mechanic
- Unified completion report structure shared across every terminal path
  (walk-through / LFG / File tickets / zero findings)

Notes the small Step 3 fixer-prompt update needed for the heterogeneous
Apply queue (gated_auto + manual mix).

Part of the Interactive judgment-loop redesign (Unit 3 of 8).
Restructure the Interactive branch of After Review Step 2. The old
bucket-level policy question (Review and approve specific gated fixes /
Leave as residual work / Report only) is removed and replaced with a
four-option routing question that dispatches to focused reference files:

- (A) Review each finding one by one -> references/walkthrough.md
- (B) LFG. Apply the agent's best-judgment action per finding ->
      references/bulk-preview.md -> end-of-batch fixer + tracker-defer
- (C) File a [TRACKER] ticket per finding without applying fixes ->
      references/bulk-preview.md -> references/tracker-defer.md
- (D) Report only — take no further action -> stop

Stem is third-person ("What should the agent do...") per AGENTS.md:127.
Tracker name is substituted conditionally based on detection confidence
and sink availability; option C is omitted entirely when no sink is
available with a one-line explanation in the stem. Zero-remaining case
skips the routing question with a completion summary.

Autofix, Report-only, and Headless mode branches are unchanged.

Contract test updates to follow in Unit 8. Two assertions in the current
contract test reference the removed bucket-policy wording and will be
reworked with structural-landmark assertions instead.

Part of the Interactive judgment-loop redesign (Unit 6 of 8).
Step 5's push/PR/exit flow previously ran after every Interactive review.
With the new routing question, three of the four options (File tickets,
Report only, LFG with no Apply recommendations) can complete without any
fix landing in the working tree — running push/PR next-step prompts after
those is incoherent.

Step 5 now runs only when one or more fixes actually modified files this
run. Covers the four zero-apply cases explicitly: options C and D, LFG
that recommends only Defer/Skip/Acknowledge, and walk-through that
completes with no Apply decisions. safe_auto fixes still count — a
zero-remaining review that applied safe_auto fixes still proceeds to
Step 5.

Part of the Interactive judgment-loop redesign (Unit 7 of 8).
Rework the policy-driven-routing contract test to use structural-landmark
assertions (landmark tokens, file paths, mode-branch isolation) instead of
exact-prose matches. Wording improvements in future PRs should not break
the test.

Replaced:
- bucket-policy wording assertions (removed from SKILL.md in Unit 6) with
  four-option routing assertions using regex for each distinguishing
  word (Review / LFG / File / Report)
- policy-question skip assertion with structural "skip the routing
  question entirely" landmark

Added:
- assertions that routing dispatches to references/walkthrough.md,
  bulk-preview.md, tracker-defer.md
- assertion that the routing stem does NOT contain first-person "I" /
  "me" per AGENTS.md:127
- assertion for Stage 5 tie-breaking order (Skip > Defer > Apply)
- assertions that tracker-defer.md forbids the internal todos fallback
- assertions that subagent-template.md carries why_it_matters framing
  guidance (observable behavior, required field)
- assertion that Step 5 is gated on fixes-applied

Autofix / Report-only / Headless mode branches preserved — mode isolation
is unchanged.

Part of the Interactive judgment-loop redesign (Unit 8 of 8 — completes
the plan).
…preview contracts

Post-implementation self-review surfaced a coverage gap: the contract test
asserts the three new reference files EXIST but did not assert their key
contract surfaces. Add structural assertions — renaming a per-finding
option or the Proceed/Cancel labels now breaks the test.

Covers:
- walkthrough.md: four per-finding option labels (Apply / Defer / Skip /
  LFG the rest). Distinguishing words assertible even as prose is refined.
- bulk-preview.md: Proceed and Cancel option labels.

Applied as a safe_auto fix per ce:review mode:autofix self-review.
All 8 implementation units landed; self-review via ce:review mode:autofix
found no P0-P2 findings and the one safe_auto contract-test addition has
landed. Plan status transitions from active to completed.
- Rewrite Step 5 gating to track total fixes_applied_count across the run
  rather than listing routing options. The prior list implied options C
  and D always skip Step 5, but safe_auto fixes land BEFORE the routing
  question — picking C or D after safe_auto applied fixes should NOT
  skip push/PR handoff. The counter-based gate makes every routing path
  behave correctly. (Codex P1)
- Clarify the generic tracker label as a whole-label substitution, not a
  [TRACKER] token swap. "File an issue per finding without applying fixes"
  replaces the full "File a [TRACKER] ticket per finding without applying
  fixes" — literal token swap would have produced malformed text like
  "File a File a ticket ticket per finding". (Codex P2)
- Specify the no-sink Defer->Skip conversion as an explicit preview-time
  runtime step, not Stage 5 tie-breaking. Step 7b orders conflicting
  reviewer recommendations (Skip > Defer > Apply) and has no knowledge
  of sink_available. The downgrade happens before preview rendering and
  is surfaced to the user so they see why Defer counts moved to Skip.
  (Codex P2)
- Update contract test to match the revised Step 5 gate landmark
  (fixes_applied_count).
- Split sink availability into named_sink_available and any_sink_available.
  The prior single sink_available flag conflated two distinct concepts:
  (a) whether the detected/named tracker can be invoked (drives label
      confidence — inline tracker naming needs this)
  (b) whether ANY tier in the fallback chain is reachable (drives whether
      Defer is offered at all — no-sink behavior needs this)

  Using one flag for both meant no-sink logic fired when the named tracker
  was unreachable even if `gh` would have worked, silently dropping Defer
  availability for runs without explicit tracker docs but with working
  GitHub Issues or harness task primitive fallback.

  Updated tracker-defer.md detection tuple, probe sequence, label logic,
  and failure-path cache-invalidation to use the two-flag model.
  Propagated the flag rename across SKILL.md option C rendering,
  walkthrough.md menu adaptations, and bulk-preview.md no-sink
  downgrade. (Codex P1)
Document meta-learnings from running the full compound-engineering
pipeline (ce:brainstorm → document-review → ce:plan → document-review →
ce:work → ce:review → resolve-pr-feedback) end-to-end on a substantial
feature (ce:review Interactive mode redesign).

Ten guidance items plus five worked examples covering:
- sampling actual evidence before accepting research-agent claims
- running document-review after both brainstorm AND plan stages
- treating "trust the agent" UX as a rubber-stamp vector
- distinguishing bulk-preview vs per-item walk-through ergonomics
- tool/platform caps as structural constraints, not annoyances
- never conflating two semantic meanings in one flag
- contract tests assert structure, not prose
- keeping external plugin references out of durable artifacts
- skill bodies are product code
- each pipeline stage catches a different issue class

Synthesized net-new at the workflow level; cross-links to adjacent
prior docs in skill-design and workflow categories.
tmchow and others added 7 commits April 18, 2026 13:01
Remove the stale annotation on beta-skills-framework.md (the claim that
it references deepen-plan is incorrect — that reference does not exist
in the file). Add cross-link to the new end-to-end CE pipeline learnings
doc, which extends this doc's brainstorm/plan/work framing downstream to
document-review, ce:review, and resolve-pr-feedback.

beta-skills-framework.md itself reviewed and kept — still accurate for
its scope (beta-skills rollout pattern).
- Route option D through Step 5 when safe_auto fixes were applied
- Fix tracker placeholder to use whole-label substitution in walkthrough.md
- Replace evidence fallback with merge-tier compact fields in tracker-defer.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hort

Structured-findings JSON reviewers (correctness, testing, security, etc.)
use blue; adversarial uses red; a small set of distinct-lens reviewers
(previous-comments, cli-agent-readiness) use yellow. agent-native-reviewer
was cyan — a one-off that didn't match any cohort. Moved to blue to match
its role as a CE always-on reviewer in the structured cohort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live testing of the Interactive-mode walk-through surfaced multiple UX
gaps in how findings were presented, how the recommendation was signaled,
and how the question tool was invoked. This commit batches the resulting
redesign.

Presentation: split surfaces
- Each finding now renders in two parts: a terminal output block carrying
  the explanation (with proper markdown — heading, bold labels, blank
  lines between sections), and a compact question stem carrying the
  decision. The stem used to pack severity + location + attack scenario
  + full fix + reasoning into a wall of prose that couldn't render
  markdown; splitting lets each surface do its job.
- Between findings, emit a one-line action-taken confirmation
  (`→ Applied. Fix staged at ...`) so users get continuous feedback
  without overloading either surface.

Proposed fix: intent, not syntax
- The walk-through is a decision surface, not a code review surface.
  Users pick Apply / Defer / Skip / LFG based on trusting the action
  at an intent level. Showing code (raw or diffed) forces syntax-
  parsing during a decision loop, which doesn't serve the decision.
- Default: one sentence describing intent, with inline backticks only
  for short literal tokens. Code-span budget is 2 per sentence, each a
  single identifier or operator — never a full statement or template
  literal. Always space around backtick spans (terminal markdown eats
  delimiters and glues adjacent words together otherwise).
- Raw code blocks only for short (≤5 line) genuinely additive new code.
  Summary + artifact pointer for larger fixes. No diff blocks —
  diffs are PR-review tooling, not walk-through tooling.

Recommendation signaling
- The `(recommended)` marker on the post-tie-break option label is now
  mandatory, not optional. Any of the four options can carry it.
- Option order is fixed — never reorder to hoist the recommendation to
  position 1. The marker carries the signal; position does not.
- The stem frames the single recommended action as a yes/no question
  rather than enumerating alternatives. When the recommendation is
  close, surface the disagreement in the R15 conflict-context line
  below the stem, not as a multi-option stem.

AskUserQuestion tool loading (Claude Code)
- Added a new `### Interactive mode rules` section in SKILL.md with a
  hoisted pre-load directive: before any Interactive-mode question
  fires, load AskUserQuestion via ToolSearch("select:AskUserQuestion").
  Load eagerly at the start of the Interactive flow, not lazily at
  each site. The previous attempt to fix this by burying the
  instruction inline at each call site failed repeatedly on long-skill
  attention-decay during live runs — the hoisted rule plus a local
  "Verify pre-load" checklist bullet before the routing question
  finally worked on the third attempt.
- Gated the numbered-list text fallback. The previous wording ("When
  no blocking question tool is available") let the agent rationalize
  "I haven't loaded the tool, so it's unavailable" and skip to the
  text fallback. Every fallback site now reads "Only when ToolSearch
  explicitly returns no match or the tool call errors" — requires a
  confirmed failure, not the agent's preference.

Rationale discipline
- Trimmed prose across the walk-through reference that explained
  rationale without changing runtime behavior. Per AGENTS.md rationale
  discipline: every line loads on every invocation; rationale that
  doesn't change behavior is carrying cost for no benefit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Align Stage 5 tie-break cross-reference in bulk-preview.md with SKILL.md's canonical `Skip > Defer > Apply > Acknowledge` ordering (was missing `Acknowledge`).
- Reword the zero-remaining completion summary so it doesn't imply advisory/pre-existing findings are cleared. Unqualified `All findings resolved` now applies only when no advisory/pre-existing remain; otherwise use the qualified `All actionable findings resolved ... (K advisory, J pre-existing remain in the report.)` form. Updated in SKILL.md Step 2 Interactive and walkthrough.md's zero-findings degenerate case.
- Scope the Step 2 `Verify question-tool pre-load (checklist)` bullet to Claude Code only. Codex (request_user_input) and Gemini (ask_user) don't need a ToolSearch pre-load step; the rephrased bullet says so explicitly so converted targets aren't blocked.
- Add an explicit no-sink Defer→Skip remap in walkthrough.md. When `any_sink_available: false` removes Defer from the menu, Stage 5 step 7b can still produce a Defer recommendation; the remap ensures the `(recommended)` marker always lands on a visible option. The downgrade is surfaced on the R15 conflict context line. Mirrors the shape bulk-preview.md already uses for LFG previews.
- Scope the Stage 2 intent-ambiguity pre-load step to Claude Code only, mirroring the Step 2 routing-question checklist scoping from the previous round. On Codex (request_user_input) and Gemini (ask_user) the platform-native question tool is loaded by default and no ToolSearch step applies.
- Remove out-of-tree `plugins/compound-engineering/AGENTS.md:127` reference from the routing-question stem per the skill file-reference rule (skills must only reference files within their own directory tree). Restate the third-person voice requirement locally instead.
- Route Step 2 Interactive dispatch by option letter (A/B/C/D) rather than the rendered label text. Prevents label-variant mismatches (e.g., generic-fallback `File an issue per finding without applying fixes` vs named-tracker `File a Linear ticket ...`) from bypassing option C's dispatch. Canonical labels preserved inline for documentation; the letter is the stable dispatch signal.
- Fix contradictory `suggested_fix` render instruction in walkthrough.md template. The template line at :42 said "render as a code block for diffs/literal changes" but the substitution rules below (recently rewritten) mandate prose-first and ban diff blocks. Template placeholder now points at the substitution rules below instead of competing with them.
@tmchow tmchow force-pushed the tmchow/ce-review-interactive branch from cefef2a to 6505265 Compare April 18, 2026 20:02
@tmchow tmchow merged commit 27cbaf8 into main Apr 18, 2026
2 checks passed
@github-actions github-actions bot mentioned this pull request Apr 18, 2026
tmchow added a commit that referenced this pull request Apr 18, 2026
PR #590 landed new reference files (bulk-preview, tracker-defer, walkthrough)
and test assertions under the old ce-review path while this branch was renaming
to ce-code-review. Update the absorbed content to match: artifact path strings
inside tracker-defer.md and walkthrough.md, plus hardcoded skill paths in
review-skill-contract.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tmchow added a commit that referenced this pull request Apr 18, 2026
PR #590 landed new reference files (bulk-preview, tracker-defer, walkthrough)
and test assertions under the old ce-review path while this branch was renaming
to ce-code-review. Update the absorbed content to match: artifact path strings
inside tracker-defer.md and walkthrough.md, plus hardcoded skill paths in
review-skill-contract.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tmchow added a commit that referenced this pull request Apr 19, 2026
…and bulk preview

Ports ce-code-review's interactive flow (post-PR #590) into ce-doc-review
while adapting for document-specific semantics. Replaces the current
binary 'refine / complete' Phase 5 question with a four-option routing
question that dispatches to focused interactions sized to the user's
intent.

New references:

- walkthrough.md — the per-finding walk-through that fires when the user
  picks routing option A. Each finding renders in two surfaces: a
  terminal markdown block carrying the explanation (heading, What's
  wrong, Proposed fix, Why it works) and a compact two-line question
  stem carrying the decision. Four per-finding options: Apply / Defer
  (append to Open Questions) / Skip / LFG-the-rest. Post-tie-break
  recommendation marked with '(recommended)'. N=1 and no-append
  adaptations included. Unified completion report covers walk-through,
  LFG, Append-to-Open-Questions, and zero-findings terminal paths

- bulk-preview.md — the compact plan preview before every bulk action
  (routing B, routing C, walk-through LFG-the-rest). Grouped buckets
  (Applying / Appending to Open Questions / Skipping) with one-line
  plain-English summary per finding. Two options: Proceed or Cancel.
  Cancel semantics route back to the originating question (routing or
  per-finding). Proceed executes the plan and emits the unified
  completion report

Differences from ce-code-review's equivalents:

- No tracker-detection tuple or sink-availability logic — Defer is always
  an in-doc append, no external tracker
- No [TRACKER] label substitution — bucket is always 'Appending to Open
  Questions'
- No fixer-subagent batch dispatch — the orchestrator applies
  gated_auto/manual fixes inline (document edits are single-file
  markdown with no cross-file dependencies)
- No advisory variant with Acknowledge — advisory-style findings surface
  in the FYI subsection at the presentation layer, outside the
  walk-through
- No .context/ artifact-lookup paths — ce-doc-review personas don't
  write run artifacts
- Observable-consequence phrasing tuned for document surface (readers,
  implementers, downstream decisions) rather than ce-code-review's
  runtime-behavior framing

SKILL.md changes:

- Added 'Interactive mode rules' section at the top with the
  AskUserQuestion pre-load directive and numbered-list fallback rule.
  Mirrors ce-code-review's approach for reliable question-tool loading
  on Claude Code
- Updated the Phases 3-5 handoff to load walkthrough.md and
  bulk-preview.md after agent dispatch completes, lazily via backtick
  paths per the skill compliance checklist

Part of the ce-doc-review autofix and interaction overhaul plan. Unit 5
of 8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

What is the best workflow after ce:review - worth introducing a ce:fix?

1 participant