feat(ce-review): add per-finding judgment loop to Interactive mode#590
feat(ce-review): add per-finding judgment loop to Interactive mode#590
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e5947b8a8a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Rewrite Step 5 gating to track total fixes_applied_count across the run rather than listing routing options. The prior list implied options C and D always skip Step 5, but safe_auto fixes land BEFORE the routing question — picking C or D after safe_auto applied fixes should NOT skip push/PR handoff. The counter-based gate makes every routing path behave correctly. (Codex P1) - Clarify the generic tracker label as a whole-label substitution, not a [TRACKER] token swap. "File an issue per finding without applying fixes" replaces the full "File a [TRACKER] ticket per finding without applying fixes" — literal token swap would have produced malformed text like "File a File a ticket ticket per finding". (Codex P2) - Specify the no-sink Defer->Skip conversion as an explicit preview-time runtime step, not Stage 5 tie-breaking. Step 7b orders conflicting reviewer recommendations (Skip > Defer > Apply) and has no knowledge of sink_available. The downgrade happens before preview rendering and is surfaced to the user so they see why Defer counts moved to Skip. (Codex P2) - Update contract test to match the revised Step 5 gate landmark (fixes_applied_count).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 244bc3fd86
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Split sink availability into named_sink_available and any_sink_available.
The prior single sink_available flag conflated two distinct concepts:
(a) whether the detected/named tracker can be invoked (drives label
confidence — inline tracker naming needs this)
(b) whether ANY tier in the fallback chain is reachable (drives whether
Defer is offered at all — no-sink behavior needs this)
Using one flag for both meant no-sink logic fired when the named tracker
was unreachable even if `gh` would have worked, silently dropping Defer
availability for runs without explicit tracker docs but with working
GitHub Issues or harness task primitive fallback.
Updated tracker-defer.md detection tuple, probe sequence, label logic,
and failure-path cache-invalidation to use the two-flag model.
Propagated the flag rename across SKILL.md option C rendering,
walkthrough.md menu adaptations, and bulk-preview.md no-sink
downgrade. (Codex P1)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 81de392dc2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 57efb95538
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Route option D through Step 5 when safe_auto fixes were applied - Fix tracker placeholder to use whole-label substitution in walkthrough.md - Replace evidence fallback with merge-tier compact fields in tracker-defer.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f620e1372e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb1c268ca6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Align Stage 5 tie-break cross-reference in bulk-preview.md with SKILL.md's canonical `Skip > Defer > Apply > Acknowledge` ordering (was missing `Acknowledge`). - Reword the zero-remaining completion summary so it doesn't imply advisory/pre-existing findings are cleared. Unqualified `All findings resolved` now applies only when no advisory/pre-existing remain; otherwise use the qualified `All actionable findings resolved ... (K advisory, J pre-existing remain in the report.)` form. Updated in SKILL.md Step 2 Interactive and walkthrough.md's zero-findings degenerate case. - Scope the Step 2 `Verify question-tool pre-load (checklist)` bullet to Claude Code only. Codex (request_user_input) and Gemini (ask_user) don't need a ToolSearch pre-load step; the rephrased bullet says so explicitly so converted targets aren't blocked. - Add an explicit no-sink Defer→Skip remap in walkthrough.md. When `any_sink_available: false` removes Defer from the menu, Stage 5 step 7b can still produce a Defer recommendation; the remap ensures the `(recommended)` marker always lands on a visible option. The downgrade is surfaced on the R15 conflict context line. Mirrors the shape bulk-preview.md already uses for LFG previews.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 45a5424b7f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Scope the Stage 2 intent-ambiguity pre-load step to Claude Code only, mirroring the Step 2 routing-question checklist scoping from the previous round. On Codex (request_user_input) and Gemini (ask_user) the platform-native question tool is loaded by default and no ToolSearch step applies. - Remove out-of-tree `plugins/compound-engineering/AGENTS.md:127` reference from the routing-question stem per the skill file-reference rule (skills must only reference files within their own directory tree). Restate the third-person voice requirement locally instead.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d06d022527
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Route Step 2 Interactive dispatch by option letter (A/B/C/D) rather than the rendered label text. Prevents label-variant mismatches (e.g., generic-fallback `File an issue per finding without applying fixes` vs named-tracker `File a Linear ticket ...`) from bypassing option C's dispatch. Canonical labels preserved inline for documentation; the letter is the stable dispatch signal. - Fix contradictory `suggested_fix` render instruction in walkthrough.md template. The template line at :42 said "render as a code block for diffs/literal changes" but the substitution rules below (recently rewritten) mandate prose-first and ban diff blocks. Template placeholder now points at the substitution rules below instead of competing with them.
Design artifacts for upcoming ce:review Interactive mode redesign: per-finding walk-through, compact bulk action preview before bulk actions, four-option routing question replacing today's bucket policy question, defer-to-tracker integration, and a small shared-template upgrade for persona framing quality. Requirements document has 28 requirements across 7 groups; plan decomposes into 8 implementation units with dependency ordering.
When contributing reviewers imply different per-finding actions for the same merged finding (e.g., correctness says Apply via suggested_fix, testing says Skip via low confidence), synthesis now picks the most conservative action using Skip > Defer > Apply > Acknowledge. This makes the walk-through's recommended action stable across re-runs and makes LFG results auditable after the fact. The user can still override per finding — the rule only determines what gets labeled "recommended." Part of the Interactive judgment-loop redesign (Unit 1 of 8).
…late Reviewer personas previously had no explicit guidance on how to write the why_it_matters field — only the schema's one-line description steered them toward observable behavior. Sampling recent review artifacts showed two failure modes: adversarial and api-contract reviewers producing null why_it_matters entirely, and correctness/maintainability leading with code structure instead of observable behavior. This change adds a dedicated framing-guidance block to the shared subagent template. One file edit, applied universally to every persona via the dispatch path. The block covers: lead-with-observable-behavior rule, why-the-fix-works grounding, tight 2-4 sentence budget, required-field reminder, and a positive/negative example pair for calibration. Part of the Interactive judgment-loop redesign (Unit 2 of 8).
New reference file covering how Defer actions file tickets in the project's tracker: reasoning-based detection from CLAUDE.md/AGENTS.md, probe timing (once per session, only when the routing question is about to be asked, cached for the run), label confidence logic, fallback chain (named tracker -> GitHub Issues via gh -> harness task primitive; never internal todos), no-sink behavior, failure path with Retry/Fallback/Skip, once-per-session harness durability confirmation, and per-tracker execution behavior. Loaded by SKILL.md when option C of the new routing question is a candidate and when the walk-through's Defer option executes. Part of the Interactive judgment-loop redesign (Unit 5 of 8).
New reference file defining the compact plan preview shown before every bulk action: top-level LFG (routing option B), top-level File tickets (routing option C), and walk-through LFG the rest. Single-screen summary grouped by intended action (Applying / Filing tickets / Skipping / Acknowledging), one line per finding in compressed framing-quality form, Proceed / Cancel options only. Cancel semantics differ by call site: routing-level Cancel returns to the routing question; walk-through Cancel returns to the current finding. Proceed dispatches the plan and delegates failure handling to tracker-defer.md's Retry/Fallback/Skip path. Part of the Interactive judgment-loop redesign (Unit 4 of 8).
New reference file defining the per-finding walk-through — the path users enter by picking Review (option A) from the new routing question. Covers: - Per-finding question format: mode+position indicator, persona-produced why_it_matters read from artifact via the same rule headless mode uses, proposed fix, R15 conflict-context surfacing - Four per-finding options (Apply / Defer / Skip / LFG the rest) - Advisory variant (Apply -> Acknowledge) - N=1 and no-sink adaptations that suppress options without leaving ambiguity - Override rule (preset actions only; no inline freeform fix authoring) - In-memory state only — no per-decision disk writes - End-of-walk-through dispatch: one fixer subagent for the accumulated Apply set, preserving the existing "one fixer, consistent tree" mechanic - Unified completion report structure shared across every terminal path (walk-through / LFG / File tickets / zero findings) Notes the small Step 3 fixer-prompt update needed for the heterogeneous Apply queue (gated_auto + manual mix). Part of the Interactive judgment-loop redesign (Unit 3 of 8).
Restructure the Interactive branch of After Review Step 2. The old
bucket-level policy question (Review and approve specific gated fixes /
Leave as residual work / Report only) is removed and replaced with a
four-option routing question that dispatches to focused reference files:
- (A) Review each finding one by one -> references/walkthrough.md
- (B) LFG. Apply the agent's best-judgment action per finding ->
references/bulk-preview.md -> end-of-batch fixer + tracker-defer
- (C) File a [TRACKER] ticket per finding without applying fixes ->
references/bulk-preview.md -> references/tracker-defer.md
- (D) Report only — take no further action -> stop
Stem is third-person ("What should the agent do...") per AGENTS.md:127.
Tracker name is substituted conditionally based on detection confidence
and sink availability; option C is omitted entirely when no sink is
available with a one-line explanation in the stem. Zero-remaining case
skips the routing question with a completion summary.
Autofix, Report-only, and Headless mode branches are unchanged.
Contract test updates to follow in Unit 8. Two assertions in the current
contract test reference the removed bucket-policy wording and will be
reworked with structural-landmark assertions instead.
Part of the Interactive judgment-loop redesign (Unit 6 of 8).
Step 5's push/PR/exit flow previously ran after every Interactive review. With the new routing question, three of the four options (File tickets, Report only, LFG with no Apply recommendations) can complete without any fix landing in the working tree — running push/PR next-step prompts after those is incoherent. Step 5 now runs only when one or more fixes actually modified files this run. Covers the four zero-apply cases explicitly: options C and D, LFG that recommends only Defer/Skip/Acknowledge, and walk-through that completes with no Apply decisions. safe_auto fixes still count — a zero-remaining review that applied safe_auto fixes still proceeds to Step 5. Part of the Interactive judgment-loop redesign (Unit 7 of 8).
Rework the policy-driven-routing contract test to use structural-landmark assertions (landmark tokens, file paths, mode-branch isolation) instead of exact-prose matches. Wording improvements in future PRs should not break the test. Replaced: - bucket-policy wording assertions (removed from SKILL.md in Unit 6) with four-option routing assertions using regex for each distinguishing word (Review / LFG / File / Report) - policy-question skip assertion with structural "skip the routing question entirely" landmark Added: - assertions that routing dispatches to references/walkthrough.md, bulk-preview.md, tracker-defer.md - assertion that the routing stem does NOT contain first-person "I" / "me" per AGENTS.md:127 - assertion for Stage 5 tie-breaking order (Skip > Defer > Apply) - assertions that tracker-defer.md forbids the internal todos fallback - assertions that subagent-template.md carries why_it_matters framing guidance (observable behavior, required field) - assertion that Step 5 is gated on fixes-applied Autofix / Report-only / Headless mode branches preserved — mode isolation is unchanged. Part of the Interactive judgment-loop redesign (Unit 8 of 8 — completes the plan).
…preview contracts Post-implementation self-review surfaced a coverage gap: the contract test asserts the three new reference files EXIST but did not assert their key contract surfaces. Add structural assertions — renaming a per-finding option or the Proceed/Cancel labels now breaks the test. Covers: - walkthrough.md: four per-finding option labels (Apply / Defer / Skip / LFG the rest). Distinguishing words assertible even as prose is refined. - bulk-preview.md: Proceed and Cancel option labels. Applied as a safe_auto fix per ce:review mode:autofix self-review.
All 8 implementation units landed; self-review via ce:review mode:autofix found no P0-P2 findings and the one safe_auto contract-test addition has landed. Plan status transitions from active to completed.
- Rewrite Step 5 gating to track total fixes_applied_count across the run rather than listing routing options. The prior list implied options C and D always skip Step 5, but safe_auto fixes land BEFORE the routing question — picking C or D after safe_auto applied fixes should NOT skip push/PR handoff. The counter-based gate makes every routing path behave correctly. (Codex P1) - Clarify the generic tracker label as a whole-label substitution, not a [TRACKER] token swap. "File an issue per finding without applying fixes" replaces the full "File a [TRACKER] ticket per finding without applying fixes" — literal token swap would have produced malformed text like "File a File a ticket ticket per finding". (Codex P2) - Specify the no-sink Defer->Skip conversion as an explicit preview-time runtime step, not Stage 5 tie-breaking. Step 7b orders conflicting reviewer recommendations (Skip > Defer > Apply) and has no knowledge of sink_available. The downgrade happens before preview rendering and is surfaced to the user so they see why Defer counts moved to Skip. (Codex P2) - Update contract test to match the revised Step 5 gate landmark (fixes_applied_count).
- Split sink availability into named_sink_available and any_sink_available.
The prior single sink_available flag conflated two distinct concepts:
(a) whether the detected/named tracker can be invoked (drives label
confidence — inline tracker naming needs this)
(b) whether ANY tier in the fallback chain is reachable (drives whether
Defer is offered at all — no-sink behavior needs this)
Using one flag for both meant no-sink logic fired when the named tracker
was unreachable even if `gh` would have worked, silently dropping Defer
availability for runs without explicit tracker docs but with working
GitHub Issues or harness task primitive fallback.
Updated tracker-defer.md detection tuple, probe sequence, label logic,
and failure-path cache-invalidation to use the two-flag model.
Propagated the flag rename across SKILL.md option C rendering,
walkthrough.md menu adaptations, and bulk-preview.md no-sink
downgrade. (Codex P1)
Document meta-learnings from running the full compound-engineering pipeline (ce:brainstorm → document-review → ce:plan → document-review → ce:work → ce:review → resolve-pr-feedback) end-to-end on a substantial feature (ce:review Interactive mode redesign). Ten guidance items plus five worked examples covering: - sampling actual evidence before accepting research-agent claims - running document-review after both brainstorm AND plan stages - treating "trust the agent" UX as a rubber-stamp vector - distinguishing bulk-preview vs per-item walk-through ergonomics - tool/platform caps as structural constraints, not annoyances - never conflating two semantic meanings in one flag - contract tests assert structure, not prose - keeping external plugin references out of durable artifacts - skill bodies are product code - each pipeline stage catches a different issue class Synthesized net-new at the workflow level; cross-links to adjacent prior docs in skill-design and workflow categories.
Remove the stale annotation on beta-skills-framework.md (the claim that it references deepen-plan is incorrect — that reference does not exist in the file). Add cross-link to the new end-to-end CE pipeline learnings doc, which extends this doc's brainstorm/plan/work framing downstream to document-review, ce:review, and resolve-pr-feedback. beta-skills-framework.md itself reviewed and kept — still accurate for its scope (beta-skills rollout pattern).
- Route option D through Step 5 when safe_auto fixes were applied - Fix tracker placeholder to use whole-label substitution in walkthrough.md - Replace evidence fallback with merge-tier compact fields in tracker-defer.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hort Structured-findings JSON reviewers (correctness, testing, security, etc.) use blue; adversarial uses red; a small set of distinct-lens reviewers (previous-comments, cli-agent-readiness) use yellow. agent-native-reviewer was cyan — a one-off that didn't match any cohort. Moved to blue to match its role as a CE always-on reviewer in the structured cohort. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live testing of the Interactive-mode walk-through surfaced multiple UX
gaps in how findings were presented, how the recommendation was signaled,
and how the question tool was invoked. This commit batches the resulting
redesign.
Presentation: split surfaces
- Each finding now renders in two parts: a terminal output block carrying
the explanation (with proper markdown — heading, bold labels, blank
lines between sections), and a compact question stem carrying the
decision. The stem used to pack severity + location + attack scenario
+ full fix + reasoning into a wall of prose that couldn't render
markdown; splitting lets each surface do its job.
- Between findings, emit a one-line action-taken confirmation
(`→ Applied. Fix staged at ...`) so users get continuous feedback
without overloading either surface.
Proposed fix: intent, not syntax
- The walk-through is a decision surface, not a code review surface.
Users pick Apply / Defer / Skip / LFG based on trusting the action
at an intent level. Showing code (raw or diffed) forces syntax-
parsing during a decision loop, which doesn't serve the decision.
- Default: one sentence describing intent, with inline backticks only
for short literal tokens. Code-span budget is 2 per sentence, each a
single identifier or operator — never a full statement or template
literal. Always space around backtick spans (terminal markdown eats
delimiters and glues adjacent words together otherwise).
- Raw code blocks only for short (≤5 line) genuinely additive new code.
Summary + artifact pointer for larger fixes. No diff blocks —
diffs are PR-review tooling, not walk-through tooling.
Recommendation signaling
- The `(recommended)` marker on the post-tie-break option label is now
mandatory, not optional. Any of the four options can carry it.
- Option order is fixed — never reorder to hoist the recommendation to
position 1. The marker carries the signal; position does not.
- The stem frames the single recommended action as a yes/no question
rather than enumerating alternatives. When the recommendation is
close, surface the disagreement in the R15 conflict-context line
below the stem, not as a multi-option stem.
AskUserQuestion tool loading (Claude Code)
- Added a new `### Interactive mode rules` section in SKILL.md with a
hoisted pre-load directive: before any Interactive-mode question
fires, load AskUserQuestion via ToolSearch("select:AskUserQuestion").
Load eagerly at the start of the Interactive flow, not lazily at
each site. The previous attempt to fix this by burying the
instruction inline at each call site failed repeatedly on long-skill
attention-decay during live runs — the hoisted rule plus a local
"Verify pre-load" checklist bullet before the routing question
finally worked on the third attempt.
- Gated the numbered-list text fallback. The previous wording ("When
no blocking question tool is available") let the agent rationalize
"I haven't loaded the tool, so it's unavailable" and skip to the
text fallback. Every fallback site now reads "Only when ToolSearch
explicitly returns no match or the tool call errors" — requires a
confirmed failure, not the agent's preference.
Rationale discipline
- Trimmed prose across the walk-through reference that explained
rationale without changing runtime behavior. Per AGENTS.md rationale
discipline: every line loads on every invocation; rationale that
doesn't change behavior is carrying cost for no benefit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Align Stage 5 tie-break cross-reference in bulk-preview.md with SKILL.md's canonical `Skip > Defer > Apply > Acknowledge` ordering (was missing `Acknowledge`). - Reword the zero-remaining completion summary so it doesn't imply advisory/pre-existing findings are cleared. Unqualified `All findings resolved` now applies only when no advisory/pre-existing remain; otherwise use the qualified `All actionable findings resolved ... (K advisory, J pre-existing remain in the report.)` form. Updated in SKILL.md Step 2 Interactive and walkthrough.md's zero-findings degenerate case. - Scope the Step 2 `Verify question-tool pre-load (checklist)` bullet to Claude Code only. Codex (request_user_input) and Gemini (ask_user) don't need a ToolSearch pre-load step; the rephrased bullet says so explicitly so converted targets aren't blocked. - Add an explicit no-sink Defer→Skip remap in walkthrough.md. When `any_sink_available: false` removes Defer from the menu, Stage 5 step 7b can still produce a Defer recommendation; the remap ensures the `(recommended)` marker always lands on a visible option. The downgrade is surfaced on the R15 conflict context line. Mirrors the shape bulk-preview.md already uses for LFG previews.
- Scope the Stage 2 intent-ambiguity pre-load step to Claude Code only, mirroring the Step 2 routing-question checklist scoping from the previous round. On Codex (request_user_input) and Gemini (ask_user) the platform-native question tool is loaded by default and no ToolSearch step applies. - Remove out-of-tree `plugins/compound-engineering/AGENTS.md:127` reference from the routing-question stem per the skill file-reference rule (skills must only reference files within their own directory tree). Restate the third-person voice requirement locally instead.
- Route Step 2 Interactive dispatch by option letter (A/B/C/D) rather than the rendered label text. Prevents label-variant mismatches (e.g., generic-fallback `File an issue per finding without applying fixes` vs named-tracker `File a Linear ticket ...`) from bypassing option C's dispatch. Canonical labels preserved inline for documentation; the letter is the stable dispatch signal. - Fix contradictory `suggested_fix` render instruction in walkthrough.md template. The template line at :42 said "render as a code block for diffs/literal changes" but the substitution rules below (recently rewritten) mandate prose-first and ban diff blocks. Template placeholder now points at the substitution rules below instead of competing with them.
cefef2a to
6505265
Compare
PR #590 landed new reference files (bulk-preview, tracker-defer, walkthrough) and test assertions under the old ce-review path while this branch was renaming to ce-code-review. Update the absorbed content to match: artifact path strings inside tracker-defer.md and walkthrough.md, plus hardcoded skill paths in review-skill-contract.test.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #590 landed new reference files (bulk-preview, tracker-defer, walkthrough) and test assertions under the old ce-review path while this branch was renaming to ce-code-review. Update the absorbed content to match: artifact path strings inside tracker-defer.md and walkthrough.md, plus hardcoded skill paths in review-skill-contract.test.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…and bulk preview Ports ce-code-review's interactive flow (post-PR #590) into ce-doc-review while adapting for document-specific semantics. Replaces the current binary 'refine / complete' Phase 5 question with a four-option routing question that dispatches to focused interactions sized to the user's intent. New references: - walkthrough.md — the per-finding walk-through that fires when the user picks routing option A. Each finding renders in two surfaces: a terminal markdown block carrying the explanation (heading, What's wrong, Proposed fix, Why it works) and a compact two-line question stem carrying the decision. Four per-finding options: Apply / Defer (append to Open Questions) / Skip / LFG-the-rest. Post-tie-break recommendation marked with '(recommended)'. N=1 and no-append adaptations included. Unified completion report covers walk-through, LFG, Append-to-Open-Questions, and zero-findings terminal paths - bulk-preview.md — the compact plan preview before every bulk action (routing B, routing C, walk-through LFG-the-rest). Grouped buckets (Applying / Appending to Open Questions / Skipping) with one-line plain-English summary per finding. Two options: Proceed or Cancel. Cancel semantics route back to the originating question (routing or per-finding). Proceed executes the plan and emits the unified completion report Differences from ce-code-review's equivalents: - No tracker-detection tuple or sink-availability logic — Defer is always an in-doc append, no external tracker - No [TRACKER] label substitution — bucket is always 'Appending to Open Questions' - No fixer-subagent batch dispatch — the orchestrator applies gated_auto/manual fixes inline (document edits are single-file markdown with no cross-file dependencies) - No advisory variant with Acknowledge — advisory-style findings surface in the FYI subsection at the presentation layer, outside the walk-through - No .context/ artifact-lookup paths — ce-doc-review personas don't write run artifacts - Observable-consequence phrasing tuned for document surface (readers, implementers, downstream decisions) rather than ce-code-review's runtime-behavior framing SKILL.md changes: - Added 'Interactive mode rules' section at the top with the AskUserQuestion pre-load directive and numbered-list fallback rule. Mirrors ce-code-review's approach for reliable question-tool loading on Claude Code - Updated the Phases 3-5 handoff to load walkthrough.md and bulk-preview.md after agent dispatch completes, lazily via backtick paths per the skill compliance checklist Part of the ce-doc-review autofix and interaction overhaul plan. Unit 5 of 8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Before this PR,
ce:review's Interactive mode asked one bucket-level policy question covering every remaininggated_autoandmanualfinding. Users either approved the whole bucket, deferred the whole bucket, or scrolled a pipe-delimited table with no way to engage per-item. In practice it degraded into rubber-stamping or wholesale deferral — thegated_auto/manualrouting tiers in the schema existed but were never exercised per-finding.After this PR, a four-option routing question dispatches to focused interactions sized to the user's intent: a per-finding walk-through with plain-English framing for engaged review, an LFG path with a compact plan preview for trust-and-ship moments, a batch-file-tickets path for teams that triage in Linear / Jira / GitHub Issues, and a report-only exit. Defer actions route to the project's tracker via reasoning-based detection — never to the deprecated internal
.context/todo store. The redesign ships together with supporting changes: deterministic recommended-action tie-breaking so LFG is auditable, a shared-template framing-guidance upgrade so walk-through content comes from persona output directly, and a presentation split that keeps the question stem compact while the finding explanation renders with full markdown hierarchy.Fixes #562
What's different
flowchart TD A[safe_auto fixes applied] --> B{Any gated/manual<br/>findings remain?} B -->|No| Z[Completion summary -> Step 5] B -->|Yes| C[Four-option<br/>routing question] C -->|A Review| W[Per-finding<br/>walk-through] C -->|B LFG| P1[Bulk preview] C -->|C File tickets| P2[Bulk preview] C -->|D Report only| E[Exit] W -->|Apply/Defer/Skip| W W -->|LFG the rest| P3[Bulk preview<br/>scoped to remaining] P1 -->|Proceed| X[Apply -> Step 3 fixer<br/>Defer -> tracker<br/>Skip/Acknowledge no-op] P2 -->|Proceed| Y[All findings -> tracker] P3 -->|Proceed| X X --> S{Any fixes<br/>landed?} Y --> E S -->|Yes| F[Step 5 final-next-steps] S -->|No| ERouting
The current
Review and approve specific gated fixes / Leave as residual work / Report onlyquestion is gone. In its place:Review each finding one by one — accept the recommendation or choose another actionLFG. Apply the agent's best-judgment action per findingFile a [TRACKER] ticket per finding without applying fixesReport only — take no further actionDistinguishing words (
Review/LFG/File/Report) front-load so labels survive truncation in harnesses that hide option descriptions. The stem is third-person agent voice perplugins/compound-engineering/AGENTS.md:127. When tracker detection is low-confidence or the sink is unavailable, the tracker name falls back to generic wording or option C is omitted entirely with a one-line explanation.Per-finding walk-through
Each finding renders in two surfaces: a terminal markdown block carries the explanation (heading,
What's wrong,Proposed fix,Why it works); a compact two-line question stem carries the decision. The stem opens withFinding N of M — {severity} {short handle}.and frames the single recommended action as a yes/no question (Apply the format-validation guard?,Skip the fix since the fixture is being deleted?). The stem never enumerates alternatives — the option list carries those.The four per-finding options are Apply / Defer / Skip /
LFG the rest. Advisory-only findings substituteAcknowledge — mark as reviewed. N=1 suppressesLFG the rest. No-sink platforms omit Defer. The post-tie-break recommendation from Stage 5 step 7b is marked(recommended)on whichever option carries it — any of the four can be recommended, and option order is fixed so the marker alone carries the signal (no reordering to hoist the recommendation to position 1).Proposed fixrenders as intent-language prose, not quoted code. Examples:Throw on non-2xx response before parsing JSON.,Replace `==` with `===` on line 42.,Extract the request-building logic into a helper and call it from both sites.A code-span budget of 2 inline backtick spans per sentence caps complexity — each span a single identifier or operator, never a full statement or nested template literal. Raw code blocks are reserved for short (≤5 line) genuinely additive new code where no before-state exists. Diff blocks are banned — diffs are PR-review tooling that don't fit a decision surface.Between findings, a one-line confirmation of the action taken (
→ Applied. Fix staged at src/utils/api-client.ts:36-37.,→ Deferred. Ticket filed: <url>.) gives the user continuous feedback. Walk-through state is in-memory only; Apply decisions accumulate and dispatch once at end-of-loop through the existing one-fixer-consistent-tree mechanic.Bulk preview
Before any bulk action (routing option B, option C, or walk-through
LFG the rest) executes, a compact plan preview shows findings grouped by intended action —Applying (N):,Filing [TRACKER] tickets (N):,Skipping (N):,Acknowledging (N):— with one line of compressed framing per finding. Exactly two responses:ProceedorCancel. Cancel from the walk-through'sLFG the restreturns the user to the current finding, not to the routing question.Recommended-action tie-breaking
Stage 5 gains a new sub-step
7b. When contributing reviewers imply different per-finding actions for the same merged finding, synthesis picks the most conservative usingSkip > Defer > Apply > Acknowledge. Identical review artifacts now produce the same recommendation deterministically — LFG runs are reproducible and the walk-through's recommendation is stable across re-runs.Framing quality in reviewer output
Sampling 15+ recent review artifacts surfaced two failure modes:
adversarial-reviewerandapi-contract-revieweremittingwhy_it_matters: null(schema violation), andcorrectness-reviewer/maintainability-reviewerleading with code-structure-first prose (orders_controller.rb:42 has a missing authorization check) instead of observable behavior. The shared subagent template now includes a framing-guidance block — one file change applied universally to every persona via the existing dispatch path. Observable-behavior-first, why-the-fix-works grounding, 2-4 sentence budget, required-field reminder, and a positive/negative example pair. No synthesis-time rewrite pass, no per-persona file edits.Tracker detection and defer
Detection is reasoning-based: the agent reads
CLAUDE.md/AGENTS.mdand whatever other documentation is obvious, then forms a{ tracker_name, confidence, sink_available }tuple. Availability probes (gh auth status, MCP-tracker reachability) run at most once per session, only when the routing question is about to be asked. The fallback chain prefers durable external trackers: named tracker → GitHub Issues viagh→ harness task primitive (TaskCreate/update_plan) with an explicit once-per-session durability confirmation. The internal.context/compound-engineering/todos/system is never part of this fallback chain.Platform question-tool loading (Claude Code)
In Claude Code,
AskUserQuestionis a deferred tool — its schema is not available at session start. A new### Interactive mode rulessection at the top ofSKILL.mddirects the agent to load it viaToolSearchwith queryselect:AskUserQuestiononce at the start of the Interactive flow, not lazily at each invocation site. The numbered-list text fallback applies only whenToolSearchexplicitly returns no match or the tool call errors — never as a path the agent takes because it skipped the load. On Codex (request_user_input) and Gemini (ask_user) this step is not required; the tools are loaded by default.Step 5 conditioning
The existing
push / PR / exitfinal-next-steps flow now runs only when at least one fix landed in the working tree. Skipped for routing options C and D, for LFG runs where every recommendation was Defer/Skip/Acknowledge, and for walk-throughs that completed with zero Apply decisions. Asking a user "push fixes?" after they deliberately pickedReport onlywas incoherent.Cohort color alignment
agent-native-reviewerfrontmatter color moves fromcyan(one-off) toblue, matching the structured-findings JSON reviewer cohort it shares a role with. Blue for structured JSON reviewers; red foradversarial-reviewer; yellow for the distinct-lens set (previous-comments-reviewer,cli-agent-readiness-reviewer).Scope boundaries
ce:fixskill. The redesign lives insidece:review..context/compound-engineering/todos/phase-out is acknowledged as the long-term direction but out of scope for this PR.Design decisions worth calling out
Proposed fix. Showing code (raw or diffed) forces syntax-parsing during a decision loop. The walk-through's job is trust-the-action, not review-the-code; the fixer subagent owns syntax. Prose intent description with a 2-span code budget avoids nested-backtick rendering bugs observed in live testing, where the terminal's markdown renderer ate delimiters and ran adjacent words together.(recommended)marker mandatory over optional. Previous wording said "may label (recommended)" and let the agent communicate the recommendation through option ordering instead. Stabilizing the order and making the marker required decouples recommendation signal from position — users don't have to re-orient when Skip is recommended instead of Apply..context/lifecycle management for a feature (inspectable partial state) with no consumer.AskUserQuestion. Inline "load via ToolSearch" instructions at each invocation point proved unreliable in long-skill attention-decay conditions — the agent would emit the question as narrative text instead of firing the tool. A single directive in a newInteractive mode rulessection (symmetric with the existing autofix / report-only / headless rules), plus a local checklist bullet before the routing question, and a gated fallback requiring confirmedToolSearchfailure, reliably fires the tool.references/files. SKILL.md was already 744 lines; inlining would add ~200 lines paid on every invocation. Perplugins/compound-engineering/AGENTS.md:109-114rationale discipline.Testing
bun test tests/review-skill-contract.test.ts— 10/10 pass. Thedocuments policy-driven routing and residual handofftest was rewritten to cover the new structure: four-option distinguishing words (regex), third-person stem assertion, reference-file existence, subagent-template framing-guidance presence, Stage 5 tie-breaking order, internal-todos-forbidden assertion intracker-defer.md, Step 5 fixes-applied gating, walk-through option labels, bulk-previewProceed/Cancelcontract. Structural assertions only; exact prose is not locked.bun testfull suite — 752 pass, 1 fail. The failing test isresolve-base.sh > prefers the PR base remote from gh metadata over origin, which fails onmaintoo (pre-existing, environment-dependent, unrelated to this work).bun run release:validate— in sync.src/utils/api-client.tsseeded with 7 deliberate issues across severity and autofix-class tiers) surfaced the split-surface, intent-over-syntax, mandatory-recommended-marker, and hoisted-pre-load refinements that shipped as presentation-layer follow-ups. Scenario A (walk-through with mixed Apply/Defer/Skip decisions) now fires the question tool correctly and renders each finding's explanation in readable markdown.Review
docs/brainstorms/2026-04-17-ce-review-interactive-judgment-requirements.md— 28 requirements, refined viadocument-review(6 personas, 38 findings addressed).docs/plans/2026-04-17-002-feat-ce-review-interactive-judgment-plan.md— 8 units with requirements trace, refined viadocument-review(6 personas, 54 findings addressed), status nowcompleted.ce:review mode:autofixequivalent. Zero P0–P2 findings. One safe_auto contract-test assertion-coverage fix applied mid-review. Run artifact at.context/compound-engineering/ce-review/20260417-215343-cde5b0e8/.docs/solutions/developer-experience/ce-pipeline-end-to-end-learnings-2026-04-17.md(300 lines) — the full brainstorm → plan → work → review → PR flow on this PR itself, including the research-subagent pipeline separation pattern that came out of it.Post-deploy monitoring and validation
No additional operational monitoring required. This is a plugin skill change — no runtime infrastructure, no user data, no external API contracts. Users who invoke Interactive mode after this lands will see the new four-option routing question on their next review with
gated_autoormanualfindings. If UX feedback surfaces issues (wrong tracker inferred, unclear option wording, walk-through friction), iterate via prose edits in follow-up PRs.