feat(ce-review): add per-finding judgment loop to Interactive mode by tmchow · Pull Request #590 · EveryInc/compound-engineering-plugin

tmchow · 2026-04-18T05:00:12Z

Summary

Before this PR, ce:review's Interactive mode asked one bucket-level policy question covering every remaining gated_auto and manual finding. Users either approved the whole bucket, deferred the whole bucket, or scrolled a pipe-delimited table with no way to engage per-item. In practice it degraded into rubber-stamping or wholesale deferral — the gated_auto / manual routing tiers in the schema existed but were never exercised per-finding.

After this PR, a four-option routing question dispatches to focused interactions sized to the user's intent: a per-finding walk-through with plain-English framing for engaged review, an LFG path with a compact plan preview for trust-and-ship moments, a batch-file-tickets path for teams that triage in Linear / Jira / GitHub Issues, and a report-only exit. Defer actions route to the project's tracker via reasoning-based detection — never to the deprecated internal .context/ todo store. The redesign ships together with supporting changes: deterministic recommended-action tie-breaking so LFG is auditable, a shared-template framing-guidance upgrade so walk-through content comes from persona output directly, and a presentation split that keeps the question stem compact while the finding explanation renders with full markdown hierarchy.

Fixes #562

What's different

flowchart TD
    A[safe_auto fixes applied] --> B{Any gated/manual<br/>findings remain?}
    B -->|No| Z[Completion summary -> Step 5]
    B -->|Yes| C[Four-option<br/>routing question]

    C -->|A Review| W[Per-finding<br/>walk-through]
    C -->|B LFG| P1[Bulk preview]
    C -->|C File tickets| P2[Bulk preview]
    C -->|D Report only| E[Exit]

    W -->|Apply/Defer/Skip| W
    W -->|LFG the rest| P3[Bulk preview<br/>scoped to remaining]

    P1 -->|Proceed| X[Apply -> Step 3 fixer<br/>Defer -> tracker<br/>Skip/Acknowledge no-op]
    P2 -->|Proceed| Y[All findings -> tracker]
    P3 -->|Proceed| X

    X --> S{Any fixes<br/>landed?}
    Y --> E
    S -->|Yes| F[Step 5 final-next-steps]
    S -->|No| E

Routing

The current Review and approve specific gated fixes / Leave as residual work / Report only question is gone. In its place:

(A) Review each finding one by one — accept the recommendation or choose another action
(B) LFG. Apply the agent's best-judgment action per finding
(C) File a [TRACKER] ticket per finding without applying fixes
(D) Report only — take no further action

Distinguishing words (Review / LFG / File / Report) front-load so labels survive truncation in harnesses that hide option descriptions. The stem is third-person agent voice per plugins/compound-engineering/AGENTS.md:127. When tracker detection is low-confidence or the sink is unavailable, the tracker name falls back to generic wording or option C is omitted entirely with a one-line explanation.

Per-finding walk-through

Each finding renders in two surfaces: a terminal markdown block carries the explanation (heading, What's wrong, Proposed fix, Why it works); a compact two-line question stem carries the decision. The stem opens with Finding N of M — {severity} {short handle}. and frames the single recommended action as a yes/no question (Apply the format-validation guard?, Skip the fix since the fixture is being deleted?). The stem never enumerates alternatives — the option list carries those.

The four per-finding options are Apply / Defer / Skip / LFG the rest. Advisory-only findings substitute Acknowledge — mark as reviewed. N=1 suppresses LFG the rest. No-sink platforms omit Defer. The post-tie-break recommendation from Stage 5 step 7b is marked (recommended) on whichever option carries it — any of the four can be recommended, and option order is fixed so the marker alone carries the signal (no reordering to hoist the recommendation to position 1).

Proposed fix renders as intent-language prose, not quoted code. Examples: Throw on non-2xx response before parsing JSON., Replace `==` with `===` on line 42., Extract the request-building logic into a helper and call it from both sites. A code-span budget of 2 inline backtick spans per sentence caps complexity — each span a single identifier or operator, never a full statement or nested template literal. Raw code blocks are reserved for short (≤5 line) genuinely additive new code where no before-state exists. Diff blocks are banned — diffs are PR-review tooling that don't fit a decision surface.

Between findings, a one-line confirmation of the action taken (→ Applied. Fix staged at src/utils/api-client.ts:36-37., → Deferred. Ticket filed: <url>.) gives the user continuous feedback. Walk-through state is in-memory only; Apply decisions accumulate and dispatch once at end-of-loop through the existing one-fixer-consistent-tree mechanic.

Bulk preview

Before any bulk action (routing option B, option C, or walk-through LFG the rest) executes, a compact plan preview shows findings grouped by intended action — Applying (N):, Filing [TRACKER] tickets (N):, Skipping (N):, Acknowledging (N): — with one line of compressed framing per finding. Exactly two responses: Proceed or Cancel. Cancel from the walk-through's LFG the rest returns the user to the current finding, not to the routing question.

Recommended-action tie-breaking

Stage 5 gains a new sub-step 7b. When contributing reviewers imply different per-finding actions for the same merged finding, synthesis picks the most conservative using Skip > Defer > Apply > Acknowledge. Identical review artifacts now produce the same recommendation deterministically — LFG runs are reproducible and the walk-through's recommendation is stable across re-runs.

Framing quality in reviewer output

Sampling 15+ recent review artifacts surfaced two failure modes: adversarial-reviewer and api-contract-reviewer emitting why_it_matters: null (schema violation), and correctness-reviewer / maintainability-reviewer leading with code-structure-first prose (orders_controller.rb:42 has a missing authorization check) instead of observable behavior. The shared subagent template now includes a framing-guidance block — one file change applied universally to every persona via the existing dispatch path. Observable-behavior-first, why-the-fix-works grounding, 2-4 sentence budget, required-field reminder, and a positive/negative example pair. No synthesis-time rewrite pass, no per-persona file edits.

Tracker detection and defer

Detection is reasoning-based: the agent reads CLAUDE.md / AGENTS.md and whatever other documentation is obvious, then forms a { tracker_name, confidence, sink_available } tuple. Availability probes (gh auth status, MCP-tracker reachability) run at most once per session, only when the routing question is about to be asked. The fallback chain prefers durable external trackers: named tracker → GitHub Issues via gh → harness task primitive (TaskCreate / update_plan) with an explicit once-per-session durability confirmation. The internal .context/compound-engineering/todos/ system is never part of this fallback chain.

Platform question-tool loading (Claude Code)

In Claude Code, AskUserQuestion is a deferred tool — its schema is not available at session start. A new ### Interactive mode rules section at the top of SKILL.md directs the agent to load it via ToolSearch with query select:AskUserQuestion once at the start of the Interactive flow, not lazily at each invocation site. The numbered-list text fallback applies only when ToolSearch explicitly returns no match or the tool call errors — never as a path the agent takes because it skipped the load. On Codex (request_user_input) and Gemini (ask_user) this step is not required; the tools are loaded by default.

Step 5 conditioning

The existing push / PR / exit final-next-steps flow now runs only when at least one fix landed in the working tree. Skipped for routing options C and D, for LFG runs where every recommendation was Defer/Skip/Acknowledge, and for walk-throughs that completed with zero Apply decisions. Asking a user "push fixes?" after they deliberately picked Report only was incoherent.

Cohort color alignment

agent-native-reviewer frontmatter color moves from cyan (one-off) to blue, matching the structured-findings JSON reviewer cohort it shares a role with. Blue for structured JSON reviewers; red for adversarial-reviewer; yellow for the distinct-lens set (previous-comments-reviewer, cli-agent-readiness-reviewer).

Scope boundaries

No new ce:fix skill. The redesign lives inside ce:review.
No changes to the findings schema, merge/dedup routing beyond R15, or autofix-mode residual-todo creation.
No inline freeform fix authoring. The walk-through is a decision loop; "approve the intent but write a variant" is explicitly unsupported in v1 — users Skip and hand-edit.
Other modes (Autofix, Report-only, Headless) unchanged. Contract test asserts this continues to hold.
Internal .context/compound-engineering/todos/ phase-out is acknowledged as the long-term direction but out of scope for this PR.

Design decisions worth calling out

Shared subagent-template upgrade over persona upgrades or synthesis rewrite. Sampling evidence shaped this choice. Per-persona edits would inflate scope to 5+ files; a synthesis-time rewrite pass would add recurring per-review model cost and paper over the adversarial/api-contract schema violation rather than fix it. One file edit at the source fixes both gaps for every reviewer.
Split per-finding surface: terminal block for explanation, compact stem for decision. Packing the full attack scenario + proposed fix + reasoning into the AskUserQuestion stem produced a wall of text that couldn't render markdown. Splitting lets each surface do its job: terminal block uses markdown hierarchy; stem uses plain-text yes/no framing on a single recommendation.
Intent-over-syntax for Proposed fix. Showing code (raw or diffed) forces syntax-parsing during a decision loop. The walk-through's job is trust-the-action, not review-the-code; the fixer subagent owns syntax. Prose intent description with a 2-span code budget avoids nested-backtick rendering bugs observed in live testing, where the terminal's markdown renderer ate delimiters and ran adjacent words together.
(recommended) marker mandatory over optional. Previous wording said "may label (recommended)" and let the agent communicate the recommendation through option ordering instead. Stabilizing the order and making the marker required decouples recommendation signal from position — users don't have to re-orient when Skip is recommended instead of Apply.
Apply actions batch at end of walk-through, not per-finding. Preserves the existing "one fixer, consistent tree" guarantee. The fixer receives the full Apply set at once and can handle inter-fix dependencies (two Applies touching overlapping regions) in one pass. Trade-off: fix failures surface at the end of the walk-through rather than per-decision.
Walk-through state is in-memory only. No per-decision disk writes. An interrupted walk-through discards in-flight state; Defer actions that already executed remain in the tracker (external side effects). Avoids the complexity of state-file schema, external-edit staleness detection, and .context/ lifecycle management for a feature (inspectable partial state) with no consumer.
Hoisted pre-load directive over inline reminders for AskUserQuestion. Inline "load via ToolSearch" instructions at each invocation point proved unreliable in long-skill attention-decay conditions — the agent would emit the question as narrative text instead of firing the tool. A single directive in a new Interactive mode rules section (symmetric with the existing autofix / report-only / headless rules), plus a local checklist bullet before the routing question, and a gated fallback requiring confirmed ToolSearch failure, reliably fires the tool.
Walk-through, bulk preview, and tracker defer live in references/ files. SKILL.md was already 744 lines; inlining would add ~200 lines paid on every invocation. Per plugins/compound-engineering/AGENTS.md:109-114 rationale discipline.
Contract test uses structural assertions, not exact prose. Landmark tokens (file paths, required distinguishing words, mode-branch isolation) so wording improvements in future PRs do not break the test — regression guard, not authoring ossification.

Testing

bun test tests/review-skill-contract.test.ts — 10/10 pass. The documents policy-driven routing and residual handoff test was rewritten to cover the new structure: four-option distinguishing words (regex), third-person stem assertion, reference-file existence, subagent-template framing-guidance presence, Stage 5 tie-breaking order, internal-todos-forbidden assertion in tracker-defer.md, Step 5 fixes-applied gating, walk-through option labels, bulk-preview Proceed / Cancel contract. Structural assertions only; exact prose is not locked.
bun test full suite — 752 pass, 1 fail. The failing test is resolve-base.sh > prefers the PR base remote from gh metadata over origin, which fails on main too (pre-existing, environment-dependent, unrelated to this work).
bun run release:validate — in sync.
Live validation on a controlled diff (src/utils/api-client.ts seeded with 7 deliberate issues across severity and autofix-class tiers) surfaced the split-surface, intent-over-syntax, mandatory-recommended-marker, and hoisted-pre-load refinements that shipped as presentation-layer follow-ups. Scenario A (walk-through with mixed Apply/Defer/Skip decisions) now fires the question tool correctly and renders each finding's explanation in readable markdown.

Review

Brainstorm: docs/brainstorms/2026-04-17-ce-review-interactive-judgment-requirements.md — 28 requirements, refined via document-review (6 personas, 38 findings addressed).
Plan: docs/plans/2026-04-17-002-feat-ce-review-interactive-judgment-plan.md — 8 units with requirements trace, refined via document-review (6 personas, 54 findings addressed), status now completed.
Post-implementation self-review completed via ce:review mode:autofix equivalent. Zero P0–P2 findings. One safe_auto contract-test assertion-coverage fix applied mid-review. Run artifact at .context/compound-engineering/ce-review/20260417-215343-cde5b0e8/.
End-to-end pipeline learnings captured at docs/solutions/developer-experience/ce-pipeline-end-to-end-learnings-2026-04-17.md (300 lines) — the full brainstorm → plan → work → review → PR flow on this PR itself, including the research-subagent pipeline separation pattern that came out of it.

Post-deploy monitoring and validation

No additional operational monitoring required. This is a plugin skill change — no runtime infrastructure, no user data, no external API contracts. Users who invoke Interactive mode after this lands will see the new four-option routing question on their next review with gated_auto or manual findings. If UX feedback surfaces issues (wrong tracker inferred, unclear option wording, walk-through friction), iterate via prose edits in follow-up PRs.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5947b8a8a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Rewrite Step 5 gating to track total fixes_applied_count across the run rather than listing routing options. The prior list implied options C and D always skip Step 5, but safe_auto fixes land BEFORE the routing question — picking C or D after safe_auto applied fixes should NOT skip push/PR handoff. The counter-based gate makes every routing path behave correctly. (Codex P1) - Clarify the generic tracker label as a whole-label substitution, not a [TRACKER] token swap. "File an issue per finding without applying fixes" replaces the full "File a [TRACKER] ticket per finding without applying fixes" — literal token swap would have produced malformed text like "File a File a ticket ticket per finding". (Codex P2) - Specify the no-sink Defer->Skip conversion as an explicit preview-time runtime step, not Stage 5 tie-breaking. Step 7b orders conflicting reviewer recommendations (Skip > Defer > Apply) and has no knowledge of sink_available. The downgrade happens before preview rendering and is surfaced to the user so they see why Defer counts moved to Skip. (Codex P2) - Update contract test to match the revised Step 5 gate landmark (fixes_applied_count).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 244bc3fd86

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Split sink availability into named_sink_available and any_sink_available. The prior single sink_available flag conflated two distinct concepts: (a) whether the detected/named tracker can be invoked (drives label confidence — inline tracker naming needs this) (b) whether ANY tier in the fallback chain is reachable (drives whether Defer is offered at all — no-sink behavior needs this) Using one flag for both meant no-sink logic fired when the named tracker was unreachable even if `gh` would have worked, silently dropping Defer availability for runs without explicit tracker docs but with working GitHub Issues or harness task primitive fallback. Updated tracker-defer.md detection tuple, probe sequence, label logic, and failure-path cache-invalidation to use the two-flag model. Propagated the flag rename across SKILL.md option C rendering, walkthrough.md menu adaptations, and bulk-preview.md no-sink downgrade. (Codex P1)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81de392dc2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 57efb95538

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Route option D through Step 5 when safe_auto fixes were applied - Fix tracker placeholder to use whole-label substitution in walkthrough.md - Replace evidence fallback with merge-tier compact fields in tracker-defer.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f620e1372e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb1c268ca6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Align Stage 5 tie-break cross-reference in bulk-preview.md with SKILL.md's canonical `Skip > Defer > Apply > Acknowledge` ordering (was missing `Acknowledge`). - Reword the zero-remaining completion summary so it doesn't imply advisory/pre-existing findings are cleared. Unqualified `All findings resolved` now applies only when no advisory/pre-existing remain; otherwise use the qualified `All actionable findings resolved ... (K advisory, J pre-existing remain in the report.)` form. Updated in SKILL.md Step 2 Interactive and walkthrough.md's zero-findings degenerate case. - Scope the Step 2 `Verify question-tool pre-load (checklist)` bullet to Claude Code only. Codex (request_user_input) and Gemini (ask_user) don't need a ToolSearch pre-load step; the rephrased bullet says so explicitly so converted targets aren't blocked. - Add an explicit no-sink Defer→Skip remap in walkthrough.md. When `any_sink_available: false` removes Defer from the menu, Stage 5 step 7b can still produce a Defer recommendation; the remap ensures the `(recommended)` marker always lands on a visible option. The downgrade is surfaced on the R15 conflict context line. Mirrors the shape bulk-preview.md already uses for LFG previews.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45a5424b7f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Scope the Stage 2 intent-ambiguity pre-load step to Claude Code only, mirroring the Step 2 routing-question checklist scoping from the previous round. On Codex (request_user_input) and Gemini (ask_user) the platform-native question tool is loaded by default and no ToolSearch step applies. - Remove out-of-tree `plugins/compound-engineering/AGENTS.md:127` reference from the routing-question stem per the skill file-reference rule (skills must only reference files within their own directory tree). Restate the third-person voice requirement locally instead.

tmchow · 2026-04-18T17:15:58Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d06d022527

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Route Step 2 Interactive dispatch by option letter (A/B/C/D) rather than the rendered label text. Prevents label-variant mismatches (e.g., generic-fallback `File an issue per finding without applying fixes` vs named-tracker `File a Linear ticket ...`) from bypassing option C's dispatch. Canonical labels preserved inline for documentation; the letter is the stable dispatch signal. - Fix contradictory `suggested_fix` render instruction in walkthrough.md template. The template line at :42 said "render as a code block for diffs/literal changes" but the substitution rules below (recently rewritten) mandate prose-first and ban diff blocks. Template placeholder now points at the substitution rules below instead of competing with them.

Design artifacts for upcoming ce:review Interactive mode redesign: per-finding walk-through, compact bulk action preview before bulk actions, four-option routing question replacing today's bucket policy question, defer-to-tracker integration, and a small shared-template upgrade for persona framing quality. Requirements document has 28 requirements across 7 groups; plan decomposes into 8 implementation units with dependency ordering.

When contributing reviewers imply different per-finding actions for the same merged finding (e.g., correctness says Apply via suggested_fix, testing says Skip via low confidence), synthesis now picks the most conservative action using Skip > Defer > Apply > Acknowledge. This makes the walk-through's recommended action stable across re-runs and makes LFG results auditable after the fact. The user can still override per finding — the rule only determines what gets labeled "recommended." Part of the Interactive judgment-loop redesign (Unit 1 of 8).

…late Reviewer personas previously had no explicit guidance on how to write the why_it_matters field — only the schema's one-line description steered them toward observable behavior. Sampling recent review artifacts showed two failure modes: adversarial and api-contract reviewers producing null why_it_matters entirely, and correctness/maintainability leading with code structure instead of observable behavior. This change adds a dedicated framing-guidance block to the shared subagent template. One file edit, applied universally to every persona via the dispatch path. The block covers: lead-with-observable-behavior rule, why-the-fix-works grounding, tight 2-4 sentence budget, required-field reminder, and a positive/negative example pair for calibration. Part of the Interactive judgment-loop redesign (Unit 2 of 8).

New reference file covering how Defer actions file tickets in the project's tracker: reasoning-based detection from CLAUDE.md/AGENTS.md, probe timing (once per session, only when the routing question is about to be asked, cached for the run), label confidence logic, fallback chain (named tracker -> GitHub Issues via gh -> harness task primitive; never internal todos), no-sink behavior, failure path with Retry/Fallback/Skip, once-per-session harness durability confirmation, and per-tracker execution behavior. Loaded by SKILL.md when option C of the new routing question is a candidate and when the walk-through's Defer option executes. Part of the Interactive judgment-loop redesign (Unit 5 of 8).

New reference file defining the compact plan preview shown before every bulk action: top-level LFG (routing option B), top-level File tickets (routing option C), and walk-through LFG the rest. Single-screen summary grouped by intended action (Applying / Filing tickets / Skipping / Acknowledging), one line per finding in compressed framing-quality form, Proceed / Cancel options only. Cancel semantics differ by call site: routing-level Cancel returns to the routing question; walk-through Cancel returns to the current finding. Proceed dispatches the plan and delegates failure handling to tracker-defer.md's Retry/Fallback/Skip path. Part of the Interactive judgment-loop redesign (Unit 4 of 8).

New reference file defining the per-finding walk-through — the path users enter by picking Review (option A) from the new routing question. Covers: - Per-finding question format: mode+position indicator, persona-produced why_it_matters read from artifact via the same rule headless mode uses, proposed fix, R15 conflict-context surfacing - Four per-finding options (Apply / Defer / Skip / LFG the rest) - Advisory variant (Apply -> Acknowledge) - N=1 and no-sink adaptations that suppress options without leaving ambiguity - Override rule (preset actions only; no inline freeform fix authoring) - In-memory state only — no per-decision disk writes - End-of-walk-through dispatch: one fixer subagent for the accumulated Apply set, preserving the existing "one fixer, consistent tree" mechanic - Unified completion report structure shared across every terminal path (walk-through / LFG / File tickets / zero findings) Notes the small Step 3 fixer-prompt update needed for the heterogeneous Apply queue (gated_auto + manual mix). Part of the Interactive judgment-loop redesign (Unit 3 of 8).

Restructure the Interactive branch of After Review Step 2. The old bucket-level policy question (Review and approve specific gated fixes / Leave as residual work / Report only) is removed and replaced with a four-option routing question that dispatches to focused reference files: - (A) Review each finding one by one -> references/walkthrough.md - (B) LFG. Apply the agent's best-judgment action per finding -> references/bulk-preview.md -> end-of-batch fixer + tracker-defer - (C) File a [TRACKER] ticket per finding without applying fixes -> references/bulk-preview.md -> references/tracker-defer.md - (D) Report only — take no further action -> stop Stem is third-person ("What should the agent do...") per AGENTS.md:127. Tracker name is substituted conditionally based on detection confidence and sink availability; option C is omitted entirely when no sink is available with a one-line explanation in the stem. Zero-remaining case skips the routing question with a completion summary. Autofix, Report-only, and Headless mode branches are unchanged. Contract test updates to follow in Unit 8. Two assertions in the current contract test reference the removed bucket-policy wording and will be reworked with structural-landmark assertions instead. Part of the Interactive judgment-loop redesign (Unit 6 of 8).

Step 5's push/PR/exit flow previously ran after every Interactive review. With the new routing question, three of the four options (File tickets, Report only, LFG with no Apply recommendations) can complete without any fix landing in the working tree — running push/PR next-step prompts after those is incoherent. Step 5 now runs only when one or more fixes actually modified files this run. Covers the four zero-apply cases explicitly: options C and D, LFG that recommends only Defer/Skip/Acknowledge, and walk-through that completes with no Apply decisions. safe_auto fixes still count — a zero-remaining review that applied safe_auto fixes still proceeds to Step 5. Part of the Interactive judgment-loop redesign (Unit 7 of 8).

Rework the policy-driven-routing contract test to use structural-landmark assertions (landmark tokens, file paths, mode-branch isolation) instead of exact-prose matches. Wording improvements in future PRs should not break the test. Replaced: - bucket-policy wording assertions (removed from SKILL.md in Unit 6) with four-option routing assertions using regex for each distinguishing word (Review / LFG / File / Report) - policy-question skip assertion with structural "skip the routing question entirely" landmark Added: - assertions that routing dispatches to references/walkthrough.md, bulk-preview.md, tracker-defer.md - assertion that the routing stem does NOT contain first-person "I" / "me" per AGENTS.md:127 - assertion for Stage 5 tie-breaking order (Skip > Defer > Apply) - assertions that tracker-defer.md forbids the internal todos fallback - assertions that subagent-template.md carries why_it_matters framing guidance (observable behavior, required field) - assertion that Step 5 is gated on fixes-applied Autofix / Report-only / Headless mode branches preserved — mode isolation is unchanged. Part of the Interactive judgment-loop redesign (Unit 8 of 8 — completes the plan).

…preview contracts Post-implementation self-review surfaced a coverage gap: the contract test asserts the three new reference files EXIST but did not assert their key contract surfaces. Add structural assertions — renaming a per-finding option or the Proceed/Cancel labels now breaks the test. Covers: - walkthrough.md: four per-finding option labels (Apply / Defer / Skip / LFG the rest). Distinguishing words assertible even as prose is refined. - bulk-preview.md: Proceed and Cancel option labels. Applied as a safe_auto fix per ce:review mode:autofix self-review.

All 8 implementation units landed; self-review via ce:review mode:autofix found no P0-P2 findings and the one safe_auto contract-test addition has landed. Plan status transitions from active to completed.

- Rewrite Step 5 gating to track total fixes_applied_count across the run rather than listing routing options. The prior list implied options C and D always skip Step 5, but safe_auto fixes land BEFORE the routing question — picking C or D after safe_auto applied fixes should NOT skip push/PR handoff. The counter-based gate makes every routing path behave correctly. (Codex P1) - Clarify the generic tracker label as a whole-label substitution, not a [TRACKER] token swap. "File an issue per finding without applying fixes" replaces the full "File a [TRACKER] ticket per finding without applying fixes" — literal token swap would have produced malformed text like "File a File a ticket ticket per finding". (Codex P2) - Specify the no-sink Defer->Skip conversion as an explicit preview-time runtime step, not Stage 5 tie-breaking. Step 7b orders conflicting reviewer recommendations (Skip > Defer > Apply) and has no knowledge of sink_available. The downgrade happens before preview rendering and is surfaced to the user so they see why Defer counts moved to Skip. (Codex P2) - Update contract test to match the revised Step 5 gate landmark (fixes_applied_count).

- Split sink availability into named_sink_available and any_sink_available. The prior single sink_available flag conflated two distinct concepts: (a) whether the detected/named tracker can be invoked (drives label confidence — inline tracker naming needs this) (b) whether ANY tier in the fallback chain is reachable (drives whether Defer is offered at all — no-sink behavior needs this) Using one flag for both meant no-sink logic fired when the named tracker was unreachable even if `gh` would have worked, silently dropping Defer availability for runs without explicit tracker docs but with working GitHub Issues or harness task primitive fallback. Updated tracker-defer.md detection tuple, probe sequence, label logic, and failure-path cache-invalidation to use the two-flag model. Propagated the flag rename across SKILL.md option C rendering, walkthrough.md menu adaptations, and bulk-preview.md no-sink downgrade. (Codex P1)

Document meta-learnings from running the full compound-engineering pipeline (ce:brainstorm → document-review → ce:plan → document-review → ce:work → ce:review → resolve-pr-feedback) end-to-end on a substantial feature (ce:review Interactive mode redesign). Ten guidance items plus five worked examples covering: - sampling actual evidence before accepting research-agent claims - running document-review after both brainstorm AND plan stages - treating "trust the agent" UX as a rubber-stamp vector - distinguishing bulk-preview vs per-item walk-through ergonomics - tool/platform caps as structural constraints, not annoyances - never conflating two semantic meanings in one flag - contract tests assert structure, not prose - keeping external plugin references out of durable artifacts - skill bodies are product code - each pipeline stage catches a different issue class Synthesized net-new at the workflow level; cross-links to adjacent prior docs in skill-design and workflow categories.

Remove the stale annotation on beta-skills-framework.md (the claim that it references deepen-plan is incorrect — that reference does not exist in the file). Add cross-link to the new end-to-end CE pipeline learnings doc, which extends this doc's brainstorm/plan/work framing downstream to document-review, ce:review, and resolve-pr-feedback. beta-skills-framework.md itself reviewed and kept — still accurate for its scope (beta-skills rollout pattern).

- Route option D through Step 5 when safe_auto fixes were applied - Fix tracker placeholder to use whole-label substitution in walkthrough.md - Replace evidence fallback with merge-tier compact fields in tracker-defer.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…hort Structured-findings JSON reviewers (correctness, testing, security, etc.) use blue; adversarial uses red; a small set of distinct-lens reviewers (previous-comments, cli-agent-readiness) use yellow. agent-native-reviewer was cyan — a one-off that didn't match any cohort. Moved to blue to match its role as a CE always-on reviewer in the structured cohort. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Live testing of the Interactive-mode walk-through surfaced multiple UX gaps in how findings were presented, how the recommendation was signaled, and how the question tool was invoked. This commit batches the resulting redesign. Presentation: split surfaces - Each finding now renders in two parts: a terminal output block carrying the explanation (with proper markdown — heading, bold labels, blank lines between sections), and a compact question stem carrying the decision. The stem used to pack severity + location + attack scenario + full fix + reasoning into a wall of prose that couldn't render markdown; splitting lets each surface do its job. - Between findings, emit a one-line action-taken confirmation (`→ Applied. Fix staged at ...`) so users get continuous feedback without overloading either surface. Proposed fix: intent, not syntax - The walk-through is a decision surface, not a code review surface. Users pick Apply / Defer / Skip / LFG based on trusting the action at an intent level. Showing code (raw or diffed) forces syntax- parsing during a decision loop, which doesn't serve the decision. - Default: one sentence describing intent, with inline backticks only for short literal tokens. Code-span budget is 2 per sentence, each a single identifier or operator — never a full statement or template literal. Always space around backtick spans (terminal markdown eats delimiters and glues adjacent words together otherwise). - Raw code blocks only for short (≤5 line) genuinely additive new code. Summary + artifact pointer for larger fixes. No diff blocks — diffs are PR-review tooling, not walk-through tooling. Recommendation signaling - The `(recommended)` marker on the post-tie-break option label is now mandatory, not optional. Any of the four options can carry it. - Option order is fixed — never reorder to hoist the recommendation to position 1. The marker carries the signal; position does not. - The stem frames the single recommended action as a yes/no question rather than enumerating alternatives. When the recommendation is close, surface the disagreement in the R15 conflict-context line below the stem, not as a multi-option stem. AskUserQuestion tool loading (Claude Code) - Added a new `### Interactive mode rules` section in SKILL.md with a hoisted pre-load directive: before any Interactive-mode question fires, load AskUserQuestion via ToolSearch("select:AskUserQuestion"). Load eagerly at the start of the Interactive flow, not lazily at each site. The previous attempt to fix this by burying the instruction inline at each call site failed repeatedly on long-skill attention-decay during live runs — the hoisted rule plus a local "Verify pre-load" checklist bullet before the routing question finally worked on the third attempt. - Gated the numbered-list text fallback. The previous wording ("When no blocking question tool is available") let the agent rationalize "I haven't loaded the tool, so it's unavailable" and skip to the text fallback. Every fallback site now reads "Only when ToolSearch explicitly returns no match or the tool call errors" — requires a confirmed failure, not the agent's preference. Rationale discipline - Trimmed prose across the walk-through reference that explained rationale without changing runtime behavior. Per AGENTS.md rationale discipline: every line loads on every invocation; rationale that doesn't change behavior is carrying cost for no benefit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Align Stage 5 tie-break cross-reference in bulk-preview.md with SKILL.md's canonical `Skip > Defer > Apply > Acknowledge` ordering (was missing `Acknowledge`). - Reword the zero-remaining completion summary so it doesn't imply advisory/pre-existing findings are cleared. Unqualified `All findings resolved` now applies only when no advisory/pre-existing remain; otherwise use the qualified `All actionable findings resolved ... (K advisory, J pre-existing remain in the report.)` form. Updated in SKILL.md Step 2 Interactive and walkthrough.md's zero-findings degenerate case. - Scope the Step 2 `Verify question-tool pre-load (checklist)` bullet to Claude Code only. Codex (request_user_input) and Gemini (ask_user) don't need a ToolSearch pre-load step; the rephrased bullet says so explicitly so converted targets aren't blocked. - Add an explicit no-sink Defer→Skip remap in walkthrough.md. When `any_sink_available: false` removes Defer from the menu, Stage 5 step 7b can still produce a Defer recommendation; the remap ensures the `(recommended)` marker always lands on a visible option. The downgrade is surfaced on the R15 conflict context line. Mirrors the shape bulk-preview.md already uses for LFG previews.

- Scope the Stage 2 intent-ambiguity pre-load step to Claude Code only, mirroring the Step 2 routing-question checklist scoping from the previous round. On Codex (request_user_input) and Gemini (ask_user) the platform-native question tool is loaded by default and no ToolSearch step applies. - Remove out-of-tree `plugins/compound-engineering/AGENTS.md:127` reference from the routing-question stem per the skill file-reference rule (skills must only reference files within their own directory tree). Restate the third-person voice requirement locally instead.

- Route Step 2 Interactive dispatch by option letter (A/B/C/D) rather than the rendered label text. Prevents label-variant mismatches (e.g., generic-fallback `File an issue per finding without applying fixes` vs named-tracker `File a Linear ticket ...`) from bypassing option C's dispatch. Canonical labels preserved inline for documentation; the letter is the stable dispatch signal. - Fix contradictory `suggested_fix` render instruction in walkthrough.md template. The template line at :42 said "render as a code block for diffs/literal changes" but the substitution rules below (recently rewritten) mandate prose-first and ban diff blocks. Template placeholder now points at the substitution rules below instead of competing with them.

PR #590 landed new reference files (bulk-preview, tracker-defer, walkthrough) and test assertions under the old ce-review path while this branch was renaming to ce-code-review. Update the absorbed content to match: artifact path strings inside tracker-defer.md and walkthrough.md, plus hardcoded skill paths in review-skill-contract.test.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…and bulk preview Ports ce-code-review's interactive flow (post-PR #590) into ce-doc-review while adapting for document-specific semantics. Replaces the current binary 'refine / complete' Phase 5 question with a four-option routing question that dispatches to focused interactions sized to the user's intent. New references: - walkthrough.md — the per-finding walk-through that fires when the user picks routing option A. Each finding renders in two surfaces: a terminal markdown block carrying the explanation (heading, What's wrong, Proposed fix, Why it works) and a compact two-line question stem carrying the decision. Four per-finding options: Apply / Defer (append to Open Questions) / Skip / LFG-the-rest. Post-tie-break recommendation marked with '(recommended)'. N=1 and no-append adaptations included. Unified completion report covers walk-through, LFG, Append-to-Open-Questions, and zero-findings terminal paths - bulk-preview.md — the compact plan preview before every bulk action (routing B, routing C, walk-through LFG-the-rest). Grouped buckets (Applying / Appending to Open Questions / Skipping) with one-line plain-English summary per finding. Two options: Proceed or Cancel. Cancel semantics route back to the originating question (routing or per-finding). Proceed executes the plan and emits the unified completion report Differences from ce-code-review's equivalents: - No tracker-detection tuple or sink-availability logic — Defer is always an in-doc append, no external tracker - No [TRACKER] label substitution — bucket is always 'Appending to Open Questions' - No fixer-subagent batch dispatch — the orchestrator applies gated_auto/manual fixes inline (document edits are single-file markdown with no cross-file dependencies) - No advisory variant with Acknowledge — advisory-style findings surface in the FYI subsection at the presentation layer, outside the walk-through - No .context/ artifact-lookup paths — ce-doc-review personas don't write run artifacts - Observable-consequence phrasing tuned for document surface (readers, implementers, downstream decisions) rather than ce-code-review's runtime-behavior framing SKILL.md changes: - Added 'Interactive mode rules' section at the top with the AskUserQuestion pre-load directive and numbered-list fallback rule. Mirrors ce-code-review's approach for reliable question-tool loading on Claude Code - Updated the Phases 3-5 handoff to load walkthrough.md and bulk-preview.md after agent dispatch completes, lazily via backtick paths per the skill compliance checklist Part of the ce-doc-review autofix and interaction overhaul plan. Unit 5 of 8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated

Comment thread plugins/compound-engineering/skills/ce-review/references/bulk-preview.md Outdated

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

Comment thread plugins/compound-engineering/skills/ce-review/references/tracker-defer.md Outdated

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated

Comment thread plugins/compound-engineering/skills/ce-review/references/walkthrough.md Outdated

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

Comment thread plugins/compound-engineering/skills/ce-review/references/tracker-defer.md Outdated

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated

Comment thread plugins/compound-engineering/skills/ce-review/references/walkthrough.md Outdated

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

Comment thread plugins/compound-engineering/skills/ce-review/SKILL.md Outdated

Comment thread plugins/compound-engineering/skills/ce-review/references/walkthrough.md Outdated

tmchow mentioned this pull request Apr 18, 2026

What is the best workflow after ce:review - worth introducing a ce:fix? #562

Closed

tmchow added 14 commits April 18, 2026 13:01

docs(ce-review): mark interactive-judgment plan as completed

81bc2ec

All 8 implementation units landed; self-review via ce:review mode:autofix found no P0-P2 findings and the one safe_auto contract-test addition has landed. Plan status transitions from active to completed.

tmchow and others added 7 commits April 18, 2026 13:01

tmchow force-pushed the tmchow/ce-review-interactive branch from cefef2a to 6505265 Compare April 18, 2026 20:02

tmchow merged commit 27cbaf8 into main Apr 18, 2026
2 checks passed

github-actions bot mentioned this pull request Apr 18, 2026

chore: release main #596

Open

tmchow mentioned this pull request Apr 19, 2026

feat(ce-doc-review): autofix tiers, per-finding walk-through, multi-round memory #601

Open

Conversation

tmchow commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's different

Routing

Per-finding walk-through

Bulk preview

Recommended-action tie-breaking

Framing quality in reviewer output

Tracker detection and defer

Platform question-tool loading (Claude Code)

Step 5 conditioning

Cohort color alignment

Scope boundaries

Design decisions worth calling out

Testing

Review

Post-deploy monitoring and validation

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

tmchow commented Apr 18, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tmchow commented Apr 18, 2026 •

edited

Loading