feat(doc-review, learnings-researcher): tiers, chain grouping, rewrite#601
…ion overhaul

Capture requirements brainstorm (38 requirements across 11 sections) and implementation plan (8 units, 4 phases) for overhauling ce-doc-review:

- three-tier autofix classification (safe_auto / gated_auto / manual) aligned with ce-code-review names
- per-severity confidence gates
- ported walk-through + bulk-preview + routing-question interaction
- in-doc Open Questions deferral
- multi-round decision memory
- domain-agnostic learnings-researcher rewrite
- ce:compound problem_type enum expansion

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… / design_pattern / tooling_decision / convention

Adds four knowledge-track values to the ce:compound frontmatter schema to absorb the best_practice overflow that had been serving as a catch-all for architecture, design, tooling, and convention learnings. best_practice remains valid but is now documented as the fallback for entries that no narrower knowledge-track value covers.

Schema updates land in both ce-compound and ce-compound-refresh (duplicate files are kept in sync intentionally). Category mappings added for the four new values under docs/solutions/. Plugin AGENTS.md gets a new "Documented Solutions" discoverability section naming all knowledge-track and bug-track categories so agents reading project instructions know the full taxonomy.

Migrates 10 existing best_practice entries to narrower values and resolves one correctness-gap schema violation (workflow/todo-status-lifecycle.md -> workflow_issue). Remaining best_practice entries are genuinely broad enough to stay in that bucket (e.g., ce-pipeline-end-to-end-learnings).

bun test tests/frontmatter.test.ts passes. Part of the ce-doc-review autofix and interaction overhaul plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
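As a rough sketch of the taxonomy rule this commit describes (the enum values come from the commit message; the helper and its selection logic are hypothetical, not the skill's actual implementation):

```typescript
// Knowledge-track values added by this change (from the commit text).
const KNOWLEDGE_TRACK = [
  "architecture_pattern",
  "design_pattern",
  "tooling_decision",
  "convention",
] as const;

// best_practice stays valid; workflow_issue shown as one bug-track example.
const PROBLEM_TYPES = [...KNOWLEDGE_TRACK, "best_practice", "workflow_issue"] as const;
type ProblemType = (typeof PROBLEM_TYPES)[number];

// best_practice is the documented fallback: prefer a narrower
// knowledge-track value whenever one fits.
function narrowest(candidates: ProblemType[]): ProblemType {
  const narrow = candidates.find((c) =>
    (KNOWLEDGE_TRACK as readonly string[]).includes(c),
  );
  return narrow ?? "best_practice";
}
```

The fallback mirrors the migration described above: entries move out of best_practice only when a narrower value genuinely covers them.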
… knowledge

The learnings-researcher was shaped around bug-hunting — its taxonomy, category table, output framing, and DO/DON'T guidelines all assumed the caller was looking for past defects. But the team captures more than bugs: architecture patterns, design patterns, tooling decisions, conventions, and workflow learnings are all first-class in docs/solutions/ now. Authors were retreating to best_practice as a catch-all because the agent didn't reliably surface non-bug entries.

This rewrite treats all learning shapes equally:

- Identity framing dropped "preventing repeated mistakes" for "find and distill applicable learnings" — bug avoidance is one flavor, not the premise
- Accepts a structured <work-context> input (Activity / Concepts / Decisions / Domains) and still handles free-form text for callers that don't use the structured form
- Replaced the hardcoded Feature-Type-to-Directory table with a dynamic probe of docs/solutions/ at invocation time so new subdirectories (architecture-patterns/, design-patterns/, conventions/, etc.) are picked up automatically
- Expanded keyword extraction from 4 dimensions (Module, Technical terms, Problem indicators, Component) to 8 by adding Concepts, Decisions, Approaches, Domains — so design-pattern and convention queries have something to match on
- Made Step 3b (critical-patterns.md read) conditional on file existence — the file does not exist in this repo and demanding it was emitting warnings
- Reframed output to "Applicable Past Learnings" with a Type field so knowledge-track entries surface as themselves (architecture_pattern, design_pattern, etc.) instead of being flattened to a bug shape
- Reworked Integration Points to drop the ce-doc-review reference (it explicitly does not dispatch this agent) and stop framing planning-time as the agent's primary home
- Updated DO/DON'T guidelines: don't discard candidates for missing bug-shaped fields, treat non-bug learnings as first-class, don't assume critical-patterns.md exists

Preserved the grep-first filtering strategy intact (it was already efficient) and the overall 7-step flow. File grew from 246 to 283 lines due to the expanded taxonomy and structured input documentation.

Part of the ce-doc-review autofix and interaction overhaul plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion, strawman rule, framing guidance

The persona subagent template was producing output that pushed too much into 'present' (requires user judgment) because the classification rule was permissive about what counted as a 'reasonable alternative' — (a) real fix + (b) 'accept the defect' strawman + (c) 'document in release notes' strawman looked like three valid approaches, inflating the manual/present bucket. This upgrade sharpens classification and improves persona output quality across all seven document-review personas via one file edit.

Changes:

- autofix_class enum renamed and expanded: [auto, present] -> [safe_auto, gated_auto, manual]. Aligns names with ce-code-review's first three tiers so cross-skill vocabulary is consistent. Drops the advisory tier in favor of a presentation-layer FYI subsection handled by synthesis
- Three-tier routing spelled out with one-sentence descriptions and examples: safe_auto for single-option fixes, gated_auto for concrete fixes that touch meaning/scope (the new default for 'I know the fix but the author should sign off'), manual for genuine judgment calls
- Strawman-aware classification rule added: a 'do nothing / accept the defect' option is not a real alternative — it is the failure state the finding describes. Same for 'document in release notes' and similar sidesteps. Listed alternatives must be ones a competent implementer would genuinely weigh. Positive and negative example pair included
- Strawman safeguard on safe_auto: if safe_auto is chosen via strawman-dismissal, the dismissed alternatives must be named in why_it_matters. Any genuine non-strawman alternative downgrades the finding to gated_auto (user confirms) rather than silent apply
- Auto-promotion patterns consolidated into an explicit rule set: factually incorrect behavior, missing standard security/reliability controls, codebase-pattern-resolved fixes (with cited pattern), framework-native-API substitutions, mechanically-implied completeness
- Framing guidance block added (ported from ce-code-review's subagent-template): observable-consequence-first phrasing, why-the-fix-works grounding, 2-4 sentence budget, weak/strong illustrative pair. Tuned for document-review's surface — 'observable consequence' rather than 'observable behavior' since doc problems surface as disagreements, misreads, and wrong decisions rather than runtime behavior
- Persona exclusion rule added: personas must exclude '## Deferred / Open Questions' sections and their timestamped subsections from review scope. Prevents round-2 feedback loops where deferred entries get flagged as new findings or quoted as evidence for suppression checks. Unit 6 is going to create these sections as the document-native defer mechanic

Template grows from 52 to 95 lines, still under the 150-line @ inclusion threshold so SKILL.md continues to use @-inclusion. findings-schema.json autofix_class enum and description updated to match.

Part of the ce-doc-review autofix and interaction overhaul plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
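The strawman-aware routing above can be sketched as follows — a simplified illustration, not the template's actual rule text; the strawman phrases are the examples named in the commit, everything else is an assumption:

```typescript
type Tier = "safe_auto" | "gated_auto" | "manual";

// "Alternatives" that are really the failure state, not options.
const STRAWMEN = [/do nothing/i, /accept the defect/i, /document in release notes/i];

function classify(alternativesToFix: string[]): Tier {
  // Strawmen don't count as real alternatives to the proposed fix.
  const genuine = alternativesToFix.filter(
    (a) => !STRAWMEN.some((rx) => rx.test(a)),
  );
  if (genuine.length === 0) return "safe_auto"; // single-option fix
  if (genuine.length === 1) return "gated_auto"; // author signs off
  return "manual"; // genuine judgment call
}
```

Per the safeguard, even one genuine alternative is enough to pull a finding out of silent-apply territory.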
…s and FYI subsection

Rewrites the synthesis pipeline to route findings by the three new tiers (safe_auto / gated_auto / manual) with per-severity confidence gates and a distinct FYI subsection for low-confidence manual findings. This is the load-bearing synthesis change that enables the interaction model (Unit 5) and multi-round memory (Unit 7) to follow.

Pipeline changes:

- Step 3.2: replace flat 0.50 gate with per-severity table. P0 0.50, P1 0.60, P2 0.65, P3 0.75. FYI floor at 0.40: below-gate manual findings above the floor survive in an FYI subsection rather than being dropped — preserves observational value without forcing decisions
- Step 3.4: delete residual-concern promotion. Replace with cross-persona agreement boost (+0.10, capped at 1.0) applied to merged findings that multiple personas flagged. Mirrors ce-code-review stage 5 step 4. Residual concerns surface in Coverage only; below-gate findings are not promoted back into the review surface
- Step 3.5: tier language updated for three tiers. Contradictions route to manual with both perspectives
- Step 3.6: auto-promotion patterns consolidated into an explicit rule set: codebase-pattern-resolved, factually incorrect behavior, missing standard security/reliability controls, framework-native-API substitutions, mechanically-implied completeness additions. Promotion targets safe_auto or gated_auto depending on whether the addition is substantive. Strawman-downgrade safeguard: if safe_auto was chosen via strawman-dismissal per the subagent template, synthesis verifies the dismissed alternatives were genuinely strawmen and downgrades to gated_auto if any were plausible design choices
- Step 3.7: routing rewritten for three tiers + FYI. safe_auto applies silently in Phase 4. gated_auto enters walk-through with Apply recommended. manual enters walk-through with user-judgment framing. FYI-subsection routes outside the walk-through entirely
- State-machine narration added to the phase header so each transition (Raised -> Classified -> one of four buckets) is a named step in synthesis prose, not an implied carry-forward
- R30 fix-landed matching predicate specified explicitly: fingerprint + evidence-substring overlap >50%. Section renames count as different locations so a rename-caused re-raise is treated as new, not as 'fix did not land'

Phase 4 updates:

- Headless envelope extended with distinct 'Gated-auto findings' and 'FYI observations' sections alongside existing 'Applied N safe_auto fixes' and 'Manual findings' sections. Backward-compatible: callers that only read the old sections continue to work; new sections surface alongside, not instead of, the existing buckets. Landing the envelope change here (Unit 4) rather than Unit 8 means ce-plan 5.3.8 headless invocations never observe an interstitial state where new tiers are produced but the envelope can't carry them
- Interactive presentation split by tier and finding type. FYI observations get their own subsection. Applied safe_auto fixes list every change with section + reviewer attribution
- Phase 5 simplified: handoff to Unit 5 walkthrough.md for routing question and per-finding interaction. Terminal question (Unit 7's three-option structure) fires after walk-through or bulk action

Creates tests/fixtures/ce-doc-review/seeded-plan.md — a test fixture plan doc with known issues planted across tier shapes (3 safe_auto, 3 gated_auto, 5 manual, 5 FYI candidates, 3 drop-worthy P3s). Used for the validation gate: run the new pipeline against this fixture and verify bucket distribution matches expectations before shipping Phase 2.

Part of the ce-doc-review autofix and interaction overhaul plan. Unit 4 of 8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
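A minimal sketch of the step 3.2 gate table, FYI floor, and step 3.4 agreement boost, using the numbers quoted above; the function names and routing shape are illustrative assumptions:

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

// Per-severity confidence gates from step 3.2.
const GATE: Record<Severity, number> = { P0: 0.5, P1: 0.6, P2: 0.65, P3: 0.75 };
const FYI_FLOOR = 0.4;

// Step 3.4: +0.10 when multiple personas independently flagged it, capped at 1.0.
function boosted(confidence: number, personaCount: number): number {
  return personaCount > 1 ? Math.min(1.0, confidence + 0.1) : confidence;
}

type Bucket = "review" | "fyi" | "dropped";

function route(sev: Severity, confidence: number, personaCount: number, tier: string): Bucket {
  const c = boosted(confidence, personaCount);
  if (c >= GATE[sev]) return "review";
  // Below-gate manual findings above the floor survive as FYI observations.
  if (tier === "manual" && c >= FYI_FLOOR) return "fyi";
  return "dropped";
}
```

The floor only applies to manual findings, matching the description that below-gate safe_auto/gated_auto candidates are simply dropped.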
…and bulk preview

Ports ce-code-review's interactive flow (post-PR #590) into ce-doc-review while adapting for document-specific semantics. Replaces the current binary 'refine / complete' Phase 5 question with a four-option routing question that dispatches to focused interactions sized to the user's intent.

New references:

- walkthrough.md — the per-finding walk-through that fires when the user picks routing option A. Each finding renders in two surfaces: a terminal markdown block carrying the explanation (heading, What's wrong, Proposed fix, Why it works) and a compact two-line question stem carrying the decision. Four per-finding options: Apply / Defer (append to Open Questions) / Skip / LFG-the-rest. Post-tie-break recommendation marked with '(recommended)'. N=1 and no-append adaptations included. Unified completion report covers walk-through, LFG, Append-to-Open-Questions, and zero-findings terminal paths
- bulk-preview.md — the compact plan preview before every bulk action (routing B, routing C, walk-through LFG-the-rest). Grouped buckets (Applying / Appending to Open Questions / Skipping) with one-line plain-English summary per finding. Two options: Proceed or Cancel. Cancel semantics route back to the originating question (routing or per-finding). Proceed executes the plan and emits the unified completion report

Differences from ce-code-review's equivalents:

- No tracker-detection tuple or sink-availability logic — Defer is always an in-doc append, no external tracker
- No [TRACKER] label substitution — bucket is always 'Appending to Open Questions'
- No fixer-subagent batch dispatch — the orchestrator applies gated_auto/manual fixes inline (document edits are single-file markdown with no cross-file dependencies)
- No advisory variant with Acknowledge — advisory-style findings surface in the FYI subsection at the presentation layer, outside the walk-through
- No .context/ artifact-lookup paths — ce-doc-review personas don't write run artifacts
- Observable-consequence phrasing tuned for document surface (readers, implementers, downstream decisions) rather than ce-code-review's runtime-behavior framing

SKILL.md changes:

- Added 'Interactive mode rules' section at the top with the AskUserQuestion pre-load directive and numbered-list fallback rule. Mirrors ce-code-review's approach for reliable question-tool loading on Claude Code
- Updated the Phases 3-5 handoff to load walkthrough.md and bulk-preview.md after agent dispatch completes, lazily via backtick paths per the skill compliance checklist

Part of the ce-doc-review autofix and interaction overhaul plan. Unit 5 of 8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
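One way a per-finding walk-through could render under this two-surface format (the finding, wording, and labels here are entirely hypothetical — the canonical format lives in walkthrough.md):

```markdown
### P1 · Migration section omits a rollback step — scope-guardian, confidence 0.72

**What's wrong:** The cutover plan describes forward migration only; a failed
deploy leaves no documented path back to the old schema.

**Proposed fix:** Add a "Rollback" subsection mirroring the cutover steps.

**Why it works:** Implementers executing under incident pressure get a
pre-agreed reversal path instead of improvising one.

---
Apply this fix? (recommended)
Apply / Defer (append to Open Questions) / Skip / LFG the rest
```

The explanation block carries the reasoning; the two-line stem below the rule carries the decision.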
Implements the Defer action's document-native analogue to ce-code-review's tracker-ticket mechanic. When the user chooses Defer on a finding (from the walk-through or from the bulk-preview Append-to-Open-Questions path), an entry appends to a '## Deferred / Open Questions' section at the end of the document under review.

Design decisions:

- In-doc append instead of sibling follow-up file or external tracker: deferred findings stay attached to the document they came from, naturally discoverable by anyone reading the doc, no new infrastructure required. Trade-off: deferred findings visibly mutate the doc — but that's the point, an Open Questions section is a canonical place for 'worth remembering but not acting on now' in planning docs
- Section location handling covers three cases: end-of-doc heading (most common — append inside), mid-doc heading (user positioned deliberately — still append at that location, don't duplicate at end), absent (create at end, or above trailing separators/footers, or after frontmatter for frontmatter-only docs)
- Timestamped subsections (### From YYYY-MM-DD review) group multiple defers per review and keep sequential reviews distinguishable. Same-day reviews within a session share one subsection; cross-day reviews get distinct subsections
- Entry format: title, severity, reviewer attribution, confidence, and the why_it_matters framing. Intentionally excludes suggested_fix and evidence — those live in the run artifact when applicable; the in-doc entry is a concern summary for the reader returning months later, not a full decision packet
- Idempotence on title collisions within the same subsection prevents duplicate entries from LFG-the-rest re-routing or orchestrator retry paths
- Concurrent-edit safety: re-read document from disk before every append to reduce collision window with user-in-editor writes. No persistent lock — interactive review needs observation-before-write, not coordination
- Failure path via blocking sub-question (Retry / Record-only / Convert-to-Skip) ensures no silent failures. Default-to-Record-only when the sub-question goes unanswered keeps in-memory state consistent
- Upstream availability signal cached at Phase 4 start — known-unwritable documents suppress the Defer option in the walk-through menu and routing question; mid-flow failures route through the per-finding failure path without flipping the session signal

Referenced by references/walkthrough.md (per-finding Defer option) and references/bulk-preview.md (routing option C Proceed). Includes a worked example showing the appended-content format in context.

Part of the ce-doc-review autofix and interaction overhaul plan. Unit 6 of 8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
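A hypothetical example of the appended section's shape (entry content invented for illustration; the skill's own worked example is authoritative):

```markdown
## Deferred / Open Questions

### From 2025-01-15 review

- **Cache TTL value is a provisional choice** — P2, consistency-checker,
  confidence 0.68. The hard-coded TTL may not suit long-running batch
  readers; worth revisiting before the next capacity review.
```

Note the entry carries the why_it_matters framing only — no suggested_fix or evidence, per the design decision above.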
…nd decision primer
Replaces the current binary 'refine / complete' Phase 5 question with a
three-option terminal question that separates 'apply decisions' from
're-review' — the most common user frustration the brainstorm surfaced
was that the old wording forced a re-review even when the user wanted
to apply fixes and move on.
Terminal question:
- 'Apply decisions and proceed to <next stage>' — default when fixes
were applied or decisions were made. <next stage> substitutes to
ce-plan for requirements docs and ce-work for plan docs
- 'Apply decisions and re-review' — opt-in re-review when the user
believes their edits warrant another pass
- 'Exit without further action' — user wants to stop for now
When fixes_applied_count == 0 (zero-actionable case or all-Skip walk-
through), the 're-review' option drops (re-review is useless when
nothing changed) and the 'Apply decisions and' prefix drops from the
primary option's label so it matches what the system is doing.
Caller-context handling is implicit rather than requiring a formal
nested:true argument. The agent interprets 'Proceed to <next stage>'
contextually from the visible conversation state — when invoked from
inside another skill's flow, it returns control to the caller; when
standalone, it dispatches the next stage skill. No caller-side change
to ce-brainstorm or ce-plan. If implicit handling proves unreliable,
an explicit nested:true flag can be added as a follow-up.
Multi-round decision memory:
- SKILL.md Phase 2 adds a new {decision_primer} template variable that
accumulates prior-round decisions across the session. Round 1 renders
an empty <prior-decisions> block; round 2+ renders applied and
rejected entries by round with section, title, reviewer, confidence,
and rejection reason. Skip and Defer both count as 'rejected' for
suppression purposes. Cross-session persistence is out of scope —
fresh invocations start a fresh round 1
- synthesis-and-presentation.md adds an R29 suppression step. For each
current-round finding, compare against the primer's rejected list
using the same fingerprint + evidence-substring predicate as R30.
On match, drop the current-round finding and log the suppression in
Coverage. Materially-different exception: if the section was edited
enough that the prior-round evidence quote no longer appears in the
current document, treat as new. Synthesis is the authoritative gate;
the persona instruction below is a soft hint
- subagent-template.md adds a <decision-primer-rules> block that
instructs personas to honor the primer: don't re-raise a finding whose
title and evidence pattern-match a prior-round rejected entry unless
materially different. Prior-round Applied findings are informational
— the orchestrator verifies those landed via R30. Round 1 runs with
no primer constraints
Part of the ce-doc-review autofix and interaction overhaul plan. Unit 7
of 8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
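The shared R29/R30 matching predicate (fingerprint match plus >50% evidence-substring overlap) might be sketched like this — overlap is approximated by word containment here, which is an assumption; the text above does not pin down the exact overlap metric:

```typescript
interface PriorEntry { fingerprint: string; evidence: string; }
interface Finding { fingerprint: string; evidence: string; }

// Fraction of the shorter string's words that appear in the longer one.
function overlapRatio(a: string, b: string): number {
  const [short, long] = a.length <= b.length ? [a, b] : [b, a];
  const words = short.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return 0;
  const hits = words.filter((w) => long.toLowerCase().includes(w)).length;
  return hits / words.length;
}

// R29 suppression and R30 fix-landed verification use the same predicate:
// identical fingerprint AND evidence overlap strictly greater than 50%.
function matchesPrior(f: Finding, prior: PriorEntry): boolean {
  return (
    f.fingerprint === prior.fingerprint &&
    overlapRatio(f.evidence, prior.evidence) > 0.5
  );
}
```

The materially-different exception falls out naturally: heavy edits shrink the overlap below the threshold, so the finding is treated as new.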
…ugh, defer, enum expansion
Extends pipeline-review-contract.test.ts with structural assertions for
the ce-doc-review overhaul. Uses pattern matching on key tokens rather
than exact prose so future wording improvements don't ossify the tests.
New test groups:
- ce-doc-review contract (8 tests):
- findings-schema autofix_class enum uses [safe_auto, gated_auto,
manual] — aligned with ce-code-review's first three tier names,
no advisory tier, no legacy auto/present values
- subagent template carries framing guidance, strawman rule, strawman
safeguard, Deferred/Open-Questions exclusion, and decision-primer
slot + rules
- synthesis pipeline has per-severity gates (P0 0.50 / P1 0.60 /
P2 0.65 / P3 0.75), FYI floor at 0.40, cross-persona agreement
boost (+0.10), three-tier routing, R29 rejected-finding suppression,
R30 fix-landed matching predicate
- headless envelope surfaces new tiers distinctly (Applied N safe_auto
fixes / Gated-auto findings / Manual findings / FYI observations)
with preserved 'Review complete' terminal signal
- terminal question is three-option by default with label adaptation
in the zero-actionable case; next-stage substitution rules
documented
- SKILL.md has Interactive mode rules with AskUserQuestion pre-load
via ToolSearch, decision_primer variable in dispatch, lazy
references to walkthrough.md and bulk-preview.md
- walkthrough has routing question with front-loaded distinguishing
words, four per-finding options, (recommended) marker, no advisory
variant, no tracker-detection machinery; bulk-preview has Proceed/
Cancel with three bucket labels, no Acknowledging bucket
- open-questions-defer implements the append mechanic with timestamped
subsections, field-level entry format, failure-path sub-question
(Retry / Record-only / Convert-to-Skip), no tracker-detection logic
- ce-compound frontmatter schema expansion contract (3 tests):
- problem_type enum includes architecture_pattern, design_pattern,
tooling_decision, convention — with best_practice retained as
fallback
- ce-compound-refresh schema is byte-identical to canonical
ce-compound schema (kept in sync per AGENTS.md duplicate convention)
- yaml-schema.md documents category mappings for the four new values
- ce-learnings-researcher domain-agnostic contract (1 test):
- Agent prompt frames as domain-agnostic (not bug-focused), names
Architecture / Design / Tooling / Conventions as first-class
learning shapes, accepts structured <work-context> input, replaces
the hardcoded category table with a dynamic probe, reads
critical-patterns.md conditionally, and drops ce-doc-review from
Integration Points (agent is ce-plan-owned)
All 13 new tests pass. Full suite: 798 pass, 0 fail. bun run
release:validate: in sync.
Part of the ce-doc-review autofix and interaction overhaul plan. Unit 8
of 8 — final unit; this ships the contract tests that lock in the
three-tier / multi-round / interaction-model changes shipped in Units 3
through 7.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
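The token-matching style these contract tests use can be illustrated with a toy sketch (the document text and tokens below are invented, not the suite's actual fixtures):

```typescript
// Match on key tokens rather than exact prose, so wording improvements
// don't ossify the tests.
const synthesisDoc = `Step 3.2 gates: P0 0.50, P1 0.60, P2 0.65, P3 0.75. FYI floor at 0.40.`;

function carriesTokens(doc: string, tokens: string[]): boolean {
  return tokens.every((t) => doc.includes(t));
}
```

An assertion then checks that required tokens are present and legacy vocabulary is absent, without fixing the surrounding sentence.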
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d897823f52
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Preserve evidence snippets in decision primer entries (thread r3106059451, P1)

  R29 rejected-finding suppression and R30 fix-landed verification both use fingerprint + evidence-substring overlap (>50%) as their matching predicate. The primer format in SKILL.md previously stored only section/title/reviewer/confidence/reason, leaving the orchestrator without prior-round evidence needed to compute the overlap check. Adds an 'Evidence:' line to each primer entry containing the first evidence quote truncated to ~120 chars (word-boundary-preserving, internal quotes escaped). Documented inline why the field is required.

- Align review-output-template.md with new three-tier taxonomy (thread r3106059452, P2)

  The interactive presentation template still referenced the legacy 'auto/present' contract — summary line, 'Auto-fixes Applied' heading, 'Auto | Present' coverage columns, and rules text all leaked stale bucket semantics into user-facing output. Rewrites the template for the new taxonomy: 'Applied N safe_auto fixes' summary line, a Tier column in severity tables, a dedicated 'FYI Observations' section, and Coverage columns for SafeAuto | GatedAuto | Manual | FYI | Residual. Residual Concerns section clarified: no longer promoted via residual-concern promotion (dropped in Unit 4) — listed for transparency only. Adds a note up-front that this template covers Phase 4 interactive presentation specifically; the headless envelope lives in synthesis-and-presentation.md.

All contract tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
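The ~120-char, word-boundary-preserving truncation described for the Evidence line might look like this sketch (the function name and ellipsis handling are assumptions):

```typescript
// Truncate an evidence quote to ~`max` chars without splitting a word,
// escaping internal double quotes so the line stays machine-parseable.
function evidenceSlice(quote: string, max = 120): string {
  const escaped = quote.replace(/"/g, '\\"');
  if (escaped.length <= max) return escaped;
  const cut = escaped.slice(0, max);
  const lastSpace = cut.lastIndexOf(" ");
  // Back off to the last word boundary rather than splitting a word.
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + "…";
}
```

Because the same slice seeds the dedup fingerprint later in the thread, keeping it deterministic matters more than keeping it pretty.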
💡 Codex Review
Reviewed commit: df79db5015
Both findings fall in the same pattern the prior round surfaced: high-level rules written without the deterministic mechanics the orchestrator needs to actually implement them. Cross-invocation signal from resolve-pr-feedback flagged the theme.

- Add deterministic recommended-action tie-break to synthesis (thread r3106064985, P1)

  walkthrough.md said 'recommended action has already been normalized by synthesis step 3.6' but step 3.6 only handled auto-promotion; there was no rule for picking Apply/Defer/Skip when contributing personas implied different actions. Without a tie-break, LFG outcomes are non-deterministic across runs and walk-through '(recommended)' marks drift.

  Adds synthesis step 3.5b 'Deterministic Recommended-Action Tie-Break' implementing 'Skip > Defer > Apply' (most conservative first). Scans contributing personas in that order; first match wins. Documents the persona-to-action mapping (safe_auto/gated_auto -> Apply, manual tradeoffs -> Defer, residual-suppression-eligible -> Skip, contradiction-set with 'keep as-is' -> Skip). Silent personas fall back to Apply. Records a conflict-context string on the merged finding for the walk-through's R15 conflict-context line. Updates walkthrough.md's reference from 'synthesis step 3.6' to the correct '3.5b'.

  Invariant: walk-through and bulk-preview read recommended_action and never recompute. LFG-the-rest and routing option B execute the recorded recommendation in bulk, so the same review artifact always produces the same bulk plan.

- Strengthen dedup key in open-questions-defer.md (thread r3106064987, P2)

  Title-only idempotence silently drops legitimate deferred entries when two different concerns share a short title under the same review date, losing user-visible backlog context.

  The compound key is now 'normalize(section) + normalize(title) + evidence_fingerprint', mirroring the R29/R30 matching predicate (fingerprint + evidence-substring overlap) so cross-round and intra-round dedup behave consistently. The evidence fingerprint is the first ~120 chars of the finding's first evidence quote (same slice used in the decision primer). Falls back to section+title when evidence is unavailable.

All contract tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
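The step 3.5b tie-break reduces to a precedence scan; this sketch assumes per-persona implied actions have already been mapped per the table described above:

```typescript
type Action = "Apply" | "Defer" | "Skip";

// Most conservative first: Skip > Defer > Apply.
const PRECEDENCE: Action[] = ["Skip", "Defer", "Apply"];

function recommendedAction(personaActions: Action[]): Action {
  // Scan in precedence order; first match wins.
  for (const a of PRECEDENCE) {
    if (personaActions.includes(a)) return a;
  }
  return "Apply"; // silent-personas fallback
}
```

Because the walk-through and bulk preview only read the recorded result, the same finding set always yields the same bulk plan.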
💡 Codex Review
Reviewed commit: afc699da8f
Both findings continue the pattern: specification gaps in deterministic orchestration mechanics. Different gaps each round — Codex keeps finding adjacent spots where high-level rules need concrete implementation detail.

- Drop routing-level recommendation labeling (thread r3106084407, P2)

  walkthrough.md's routing question said any of A/B/C/D could be marked '(recommended)' based on synthesis, but synthesis only defines per-finding recommended_action — there was no rule for picking a routing-level recommendation. Rather than invent one (a mapping from finding-set shape to routing recommendation would pressure users toward automated paths in ways that conflict with the user-intent framing of the four options), remove the routing-level (recommended) labeling. Per-finding (recommended) inside the walk-through and bulk preview still applies — that's deterministic via step 3.5b. The routing choice itself is user-intent: engage / trust / triage / skim.

- Persist dedup-key fields in the appended entry (thread r3106084409, P2)

  Previous round added a compound dedup key (section + title + evidence_fingerprint) but the deferred-entry format on disk only stored title/severity/reviewer/confidence/why_it_matters — so retries or same-day reruns couldn't reliably reconstruct the key from persisted data. Duplicates could still slip through.

  Adds an HTML-comment line on each appended entry carrying the machine-readable dedup fields: <!-- dedup-key: section="..." title="..." evidence="..." -->. Keeps the visible content reader-friendly (section now surfaces in the header line alongside severity/reviewer/confidence; the HTML comment stays hidden in rendered markdown) while persisting exactly the fields Step 4's compound-key check needs. Legacy entries without the comment fall back to title-only comparison — imperfect but strictly better than duplicate-appending.

All contract tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
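Putting the pieces together, an appended entry carrying the hidden dedup-key comment could look like this (all field values hypothetical):

```markdown
### From 2025-01-15 review

- **Cache TTL hard-coded** — P2 · consistency-checker · confidence 0.68 ·
  section "Caching strategy". Readers returning later have no signal that
  the TTL was a provisional choice.
  <!-- dedup-key: section="caching strategy" title="cache ttl hard-coded" evidence="the ttl is fixed at 60s" -->
```

The visible line stays reader-friendly while the comment persists exactly the fields the compound-key check reconstructs on rerun.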
💡 Codex Review
Reviewed commit: f65830da9b
Thanks for the thorough review — these are all good catches. Summary of what I plan to address:

P1s (will fix before merge):

P2s:

Will push a follow-up commit shortly.
…ting

Second seeded fixture alongside seeded-plan.md, covering the auth-migration domain with two deliberately distinct premise roots (managed-auth migration, custom policy-layer warranted?) at different scopes so the synthesis multi-root path gets exercised.

Includes the same tier distribution as the rename fixture: 3 safe_auto, 3 gated_auto, roots with dependents, root-independent manual findings, 4 FYI, 3 drop-worthy P3s.

Run in parallel with seeded-plan.md to compare calibration across domains and catch domain-specific persona bias.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d design-lens coverage

Third seeded fixture alongside seeded-plan.md (rename/infra) and seeded-auth-plan.md (auth migration). Designed to exercise three paths the other fixtures do not cover:

1. design-lens persona activation — the fixture contains UI components, user flows, visual hierarchy, and interaction descriptions so the design-lens reviewer triggers and can be observed.
2. Zero-root chain path — every seeded finding is an independent execution detail; no premise-level challenges exist. Synthesis should skip chain grouping entirely (report 'Chains: 0 roots' or omit the line). If a chain appears, a reviewer has over-elevated an execution finding to premise status — a calibration signal.
3. Small-document / proportional-review path — ~130 lines of content (vs ~210 for the infra fixtures) so the adversarial reviewer runs Quick mode (≤3 findings) or does not activate, and scope-guardian may stay inactive given no priority tiers.

Run alongside seeded-plan.md and seeded-auth-plan.md to catch design-lens regressions, zero-root bugs, and domain bias when reviewing user-facing feature plans rather than infrastructure plans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…im legacy debris
Tighten the agent for domain-agnostic institutional knowledge retrieval:
- Add runtime guardrails: output bound (up to 5 findings), fallibility
  caveat (flag when a learning conflicts with observable code; include
  the entry date so the caller can judge supersession), gap signal
  (suggest /ce-compound capture when no matches).
- Drop ~40 lines of Rails-specific component/root_cause enum debris
  duplicated from the authoritative yaml-schema.md reference. The
  inline duplication undermined the single-source-of-truth contract and
  carried irrelevant vocabulary (rails_model, thread_violation,
  missing_index) into every invocation in a TypeScript/Bun plugin repo.
- Realign Step 3 example queries to domain-agnostic shapes so the
  grep-pre-filter examples don't quietly re-prime bug-shaped search.
- Collapse 4x repetition of the bug-parity rule to a single canonical
  statement; disambiguate the DO/DON'T scope-vs-depth rule (reading
  frontmatter of search-matched candidates is fine; reading frontmatter
  of every file in docs/solutions/ is the anti-pattern).
- Drop the unenforced "30 seconds" performance claim.
- Expand integration points to document all callers: /ce-plan,
  /ce-code-review, /ce-optimize, /ce-ideate, plus standalone invocation.
Net: -39 lines with material quality gain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ot and cascade
Cut user engagement burden on reviews with premise-level fanout: a
single premise challenge ("is this work justified?") often generates
downstream findings that all evaporate if the premise is rejected.
Surfacing each as an independent decision forces the user to
re-litigate the same root question N times.
Synthesis changes (synthesis-and-presentation.md):
- Add step 3.5c Premise-Dependency Chain Linking. Identify P0/P1
manual findings challenging a foundational premise as roots; link
dependent findings whose suggested_fix becomes moot if the root's
fix is accepted. Annotate with depends_on / dependents fields
(purely annotative — no reclassification).
- Multi-root elevation: elevate ALL candidates matching the root
criteria independently. A single-root default strands the second
root's dependents as top-level manual findings — the exact UX
problem chains are meant to solve. Sanity diagnostic at >3 without
a hard cap (criteria are the filter).
- Peer-vs-nested mechanical test: roots are peers when accepting
one's fix does not resolve the other's concern. Nested when one's
fix moots the other.
- Surviving-root selection under nested: scope dominates confidence.
When nested, the surviving root is the one whose fix moots the
other, not the higher-confidence candidate. Confidence is for
tie-breaking among peers, not for picking dominant root.
- Independence safeguard: findings identifying problems that exist
regardless of the premise (rollback plans, error handling, test
coverage, security gaps) stay standalone; safe_auto and gated_auto
findings never link.
- Count invariant: the dependents array populated in Step 4 is the
single source of truth for both coverage counting and rendering.
Prevents coverage/rendering drift (e.g., "6 dependents" reported
but 5 nested in output).
- Two worked examples in different domains (rename-shape, auth-shape)
paired with the lesson "same shape, different vocabulary" to
prevent example-vocabulary overfitting.
Walkthrough changes (walkthrough.md):
- Root-first iteration regardless of severity order within a chain.
- Cascade on Skip/Defer of a root: prompt with Cascade
(apply same action to all dependents) or Decide each individually.
Apply on root does NOT cascade — the premise held, dependents each
need their own decision.
- Orphaned dependents (root suppressed in prior round per R29): treat
as standalone findings with no chain context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
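The linking and cascade rules above can be sketched as follows. This is an illustrative TypeScript model, not the plugin's actual API: the names (Finding, survivingRoots, cascadeTargets) and the moots callback are assumptions for the sketch.

```typescript
type Action = "apply" | "skip" | "defer";

interface Finding {
  id: string;
  severity: "P0" | "P1" | "P2" | "P3";
  autofix_class: "safe_auto" | "gated_auto" | "manual";
  depends_on?: string;   // purely annotative; no reclassification
  dependents: string[];
}

// Peer-vs-nested mechanical test: roots are peers when accepting one's
// fix does not resolve the other's concern; nested when one's fix moots
// the other. `moots` is the caller's judgment, not computed here.
function survivingRoots(
  a: Finding,
  b: Finding,
  moots: (x: Finding, y: Finding) => boolean,
): Finding[] {
  if (moots(a, b)) return [a]; // nested: a dominates by scope
  if (moots(b, a)) return [b];
  return [a, b];               // peers: elevate both roots
}

// Cascade is offered only on Skip/Defer of a root. Apply does NOT
// cascade: the premise held, so dependents each need their own decision.
function cascadeTargets(root: Finding, action: Action): string[] {
  return action === "apply" ? [] : root.dependents;
}
```

Note that scope dominance (not confidence) picks the surviving root in the nested case; confidence only tie-breaks among peers.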
…ce, and safe_auto resilience
Populate the FYI bucket, stabilize schema conformance across long
persona prompts, and prevent safe_auto from silently collapsing under
over-charitable re-interpretation.
Persona calibration bands (7 personas):
- Add LOW (0.40-0.59) Advisory band tailored to each persona's scope
(coherence, feasibility, product-lens, design-lens, security-lens,
scope-guardian, adversarial). Drop suppress floor from 0.50 to 0.40
so advisory observations without articulable consequence can land
in FYI rather than either being promoted above gate or silently
dropped. Without this band, advisory findings (filename asymmetry,
speculative future-work, stylistic preference) either clutter
manual or disappear entirely — the FYI bucket stayed empty across
initial calibration runs.
Subagent template (subagent-template.md):
- Add Schema conformance block at the top of the output contract
listing exact enum values inline (severity P0/P1/P2/P3, finding_type
error/omission, autofix_class safe_auto/gated_auto/manual, evidence
as array). Schema injection via {schema} alone was insufficient —
longer personas (adversarial 89 lines, scope-guardian 54 lines)
pushed schema attention down, producing "high/medium/low" severity
and singleton string evidence. Inline enum anchoring at the top of
the contract survives longer prompts. Add a severity translation
rule so informal priority language in persona rubrics maps to P0-P3
at emit time.
- Add Advisory routing rule using the "what actually breaks if we
don't fix this?" heuristic to anchor confidence scoring into the
0.40-0.59 band. Per-persona bands without this template rule
produce inconsistent results across personas; the template rule
without per-persona bands has nothing to calibrate against. Both
are needed.
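The advisory routing rule can be sketched as a confidence-band router. Thresholds match the commit text (suppress below 0.40; LOW advisory band 0.40-0.59); the bucket names and function shape are assumptions for illustration, and the real pipeline layers per-severity gates on top:

```typescript
type Bucket = "suppressed" | "fyi" | "surfaced";

// "What actually breaks if we don't fix this?" anchors the LOW band:
// advisory observations without an articulable consequence land in FYI
// rather than being promoted above the gate or silently dropped.
function routeAdvisory(confidence: number, breaksSomething: boolean): Bucket {
  if (confidence < 0.4) return "suppressed";              // below the new floor
  if (confidence < 0.6 && !breaksSomething) return "fyi"; // LOW advisory band
  return "surfaced"; // clears the band; proceeds to normal tier routing
}
```

This is why both halves are required: per-persona bands calibrate the score, and the template-level heuristic decides which band a borderline observation belongs to.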
Coherence persona (ce-coherence-reviewer.agent.md):
- Add Safe_auto patterns you own section naming the three high-stakes
patterns (header/body count mismatch, cross-reference to a named
section that does not exist, terminology drift between
interchangeable synonyms) with strawman-resistance guidance.
Addresses observed regression where coherence demoted wrong-count
findings from safe_auto to P1 manual on the hypothetical that a
missing R6 might exist. The new guidance instructs: "If the only
reason to not pick safe_auto is that the author might have meant
something else, consider whether that interpretation is actually
plausible vs a strawman." Let synthesis's strawman-downgrade
safeguard catch genuine strawmen; do not pre-demote at the persona
level.
- Remove stale autofix_class: auto reference (old binary value, now
split into the 3-tier enum).
Calibration patterns documented (docs/solutions/skill-design/):
- Capture nine durable calibration patterns from the calibration
runs so future contributors don't "fix" intentional behavior:
external stale cross-refs route to gated_auto (not safe_auto);
multi-surface terminology drift routes to gated_auto; peer-vs-nested
decision test; surviving-root scope dominance; multi-root explicit
elevation; FYI band + template anchoring both required; inline enum
callouts vs schema injection; coverage/rendering single source of
truth; reviewer variance is inherent (validate across multiple
runs).
- Placed in docs/solutions/skill-design/ so persona agents don't read
it during dispatch (ce-doc-review does not dispatch
ce-learnings-researcher per the plan's scope cut, so the doc stays
outside persona runtime context).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c56c64c3f5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Close four spec gaps the Codex reviewer flagged on the downstream
implications of recently-added mechanisms:
- 3.5b actionless default — when contributing personas are all silent
  on action, gate the Apply default on suggested_fix presence; fall
  back to Defer when no fix exists. Adds a cross-branch downgrade: if
  the tie-break winner is Apply but the merged finding has no
  suggested_fix after 3.6/3.7 run, demote to Defer. Prevents LFG and
  bulk-preview from scheduling non-executable Apply actions.
- Routing question option C — suppress when open-questions-defer.md's
  cached append_available is false. Drops the menu to three options
  (A/B/D) and appends a one-line explanation, mirroring the per-finding
  option B suppression already documented under Adaptations.
- Cascade persistence — every cascaded decision lands in the in-memory
  decision list (the canonical state-model home for all answered
  findings), with action-specific side effects layered on top: cascaded
  Apply also joins the Apply set, cascaded Defer also invokes the
  open-questions append flow (with a per-finding failure path on append
  error), cascaded Skip is decision-list-only. Aligns the cascade with
  the per-finding routing rules and the state model.
- evidence_fingerprint sanitization — collapse any whitespace run
  (newlines/CRs/tabs) to single spaces, strip HTML-comment-breaking
  sequences, escape quotes, trim — applied before the 120-char slice so
  the dedup-key HTML comment stays on one parseable line. Step 4
  reconstruction updated to reference the sanitization and fall through
  to the legacy-fallback path on malformed entries rather than
  attempting a partial parse.
All 38 contract tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
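The fingerprint sanitization steps described above can be sketched directly. A minimal TypeScript version, assuming the same step order as the commit (the function name is illustrative, not the plugin's real identifier):

```typescript
function sanitizeFingerprint(raw: string): string {
  return raw
    .replace(/\s+/g, " ")      // collapse newlines/CRs/tabs to single spaces
    .replace(/-->|<!--/g, "")  // strip sequences that would break the HTML comment
    .replace(/"/g, '\\"')      // escape quotes
    .trim()
    .slice(0, 120);            // slice LAST, so the dedup key stays one parseable line
}
```

Slicing last matters: a newline or `-->` surviving past the 120-char cut would split or terminate the dedup-key HTML comment and defeat Step 4 reconstruction.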
💡 Codex Review
Reviewed commit: 21fb160adf
…ply on no-fix findings
Close two spec gaps flagged by Codex review:
- 3.5c chain predicate mismatch — the linking rule defined dependents
as findings that "become moot when the root's fix is accepted", but
the walkthrough and worked examples cascade dependents on root
Skip/Defer (rejection). Rewrite the Step 2 predicate and
substitution check around the rejection branch so linking is
semantically consistent with cascade execution. Worked examples A
and B already read naturally under the new framing; independence
safeguard is branch-neutral and unchanged.
- Apply execution on no-fix findings — the prior 3.5b fix demoted
Apply to Defer for the automated default, but users can still
manually pick Apply on a manual finding with no suggested_fix
during walk-through. Add two-layer guards:
* Decision-time: if user picks Apply on a no-fix finding, fire a
sub-question with Defer-to-Open-Questions (recommended) / Skip
/ Acknowledge-without-applying. Includes append-availability
adaptation and cascading-root handling.
* Execution-time: defensive check before dispatching each
Apply-set edit — if suggested_fix is missing, skip with a
recorded warning in the completion report, continue the batch.
Completion report gains an Acknowledged bucket (omitted when zero).
All 38 contract tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
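The execution-time layer of the guard can be sketched as below. This is a hedged illustration of the defensive check, with assumed names (ManualFinding, executeApplySet) rather than the plugin's actual types:

```typescript
interface ManualFinding {
  id: string;
  suggested_fix?: string;
}

interface ApplyResult {
  applied: string[];
  warnings: string[];
}

// Defensive check before dispatching each Apply-set edit: a finding
// with no suggested_fix is skipped with a recorded warning for the
// completion report, and the batch continues.
function executeApplySet(findings: ManualFinding[]): ApplyResult {
  const result: ApplyResult = { applied: [], warnings: [] };
  for (const f of findings) {
    if (!f.suggested_fix) {
      result.warnings.push(`${f.id}: Apply requested but no suggested_fix; skipped`);
      continue;
    }
    result.applied.push(f.id); // the real implementation dispatches the edit here
  }
  return result;
}
```

The decision-time guard should make this path rare; the execution-time check exists so a no-fix finding that slips through never aborts the whole batch.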
💡 Codex Review
Reviewed commit: 236de00ab5
…ded) to A/B/C
Two small spec gaps flagged by Codex review:
- Acknowledged decisions were not persisted across rounds. R29/primer
  carried forward Skip and Defer but not Acknowledged, leaving the
  no-fix-Apply guard case re-surfaceable on round 2. Extended the
  primer carry-forward and R29 rule to treat Acknowledged as a
  rejected-class decision, semantically equivalent to Skip for
  round-to-round suppression.
- The (recommended) marker on per-finding options said "any of the
  four" when synthesis only emits Apply/Defer/Skip — A/B/C. Corrected
  to exclude D (LFG the rest), since that is a workflow shortcut for
  bulk execution, not a finding-level resolution action.
All 38 contract tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 2712a2feb9
…teractive template
Two bundled concerns:
1. Separate internal schema vocabulary from user-facing rendered text.
Schema enum values (safe_auto / gated_auto / manual) are correct
for the persona output contract and synthesis routing logic, but
leak into user-facing output as jargon. Rename in rendered text:
- "Applied N safe_auto fixes" → "Applied N fixes"
- "Gated-auto findings (concrete fix, requires user
confirmation)" → "Proposed fixes (confirm each)"
- "Manual findings (requires judgment)" → "Decisions (requires
your judgment)"
- FYI kept as-is (already natural English)
- Coverage columns SafeAuto/GatedAuto/Manual → Auto/Proposed/
Decisions
Schema definitions, persona output contract, synthesis pipeline
prose, and Tier-column enum values stay — those are internal.
2. Add chain-rendering rules to review-output-template.md
(interactive mode) to mirror the headless envelope. Dependents
render only under their root as a Dependents (N) sub-block; MUST
NOT appear at their own severity position. Count invariant:
Findings = Auto + Proposed + Decisions + FYI, each counted once in
its assigned bucket. Optional Chains: N root(s) with M dependents
line for parity with the headless envelope.
Contract test `headless envelope surfaces new tiers distinctly`
updated to match the new literal substrings (semantic check — bucket
headers exist — preserved).
Verified no caller parses the old substrings: grep against
ce-brainstorm and ce-plan returned zero matches for the legacy
substrings the spec claimed "backward compatibility" on. Dropped
that claim.
All 38 contract tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
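The count invariant above can be expressed as a check. A hypothetical sketch (the template itself is prose; Buckets and countInvariantHolds are invented names for illustration):

```typescript
interface Buckets {
  auto: string[];      // safe_auto, rendered as "Applied N fixes"
  proposed: string[];  // gated_auto, rendered as "Proposed fixes"
  decisions: string[]; // manual, rendered as "Decisions"
  fyi: string[];
}

// Findings = Auto + Proposed + Decisions + FYI, each counted once in
// its assigned bucket. A chain dependent rendered both under its root
// and at its own severity position would show up here as a duplicate.
function countInvariantHolds(totalFindings: number, b: Buckets): boolean {
  const all = [...b.auto, ...b.proposed, ...b.decisions, ...b.fyi];
  const unique = new Set(all);
  return all.length === unique.size && all.length === totalFindings;
}
```

A check of this shape is what the coverage/rendering single-source-of-truth rule is protecting: the same dependents array drives both the counts and the nested rendering.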
💡 Codex Review
Reviewed commit: 5afe103e17
walkthrough.md:197 said the completion report groups acknowledged
entries alongside the Skipped bucket, but lines 249-256 require a
dedicated Acknowledged bucket with its own count, its own per-finding
action label, and its own position in the report ordering (Applied /
Deferred / Skipped / Acknowledged). Two distinct concerns had gotten
tangled in one sentence:
- Report display: Acknowledged is its own bucket (distinct audit trail;
  users can see what was consciously acknowledged vs what was skipped
  as irrelevant).
- R29 suppression: Acknowledged is treated as rejected-class,
  equivalent to Skip (for suppressing re-raises in round N+1).
Rewrote the Acknowledge routing bullet to state both explicitly and
point at the completion-report spec sections below for the full
contract.
All 38 contract tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
ce-doc-review previously surfaced 14+ findings per run, all routed to
manual, forcing reviewers to re-litigate the same premise through many
rephrasings. This PR cuts engagement to ~4-6 real decisions per review
while preserving coverage quality. Six compounding mechanisms deliver
the drop:
- safe_auto (silent) / gated_auto (one-click) / manual (judgment)
  replaces the binary auto/manual split.
Ships alongside:
- ce-learnings-researcher rewritten for domain-agnostic institutional
  knowledge (architecture patterns, design patterns, tooling decisions,
  conventions — not just bugs);
- ce-compound frontmatter enum expanded with four knowledge-track
  values; 10 existing entries migrated.
Docs:
- docs/brainstorms/2026-04-18-ce-doc-review-autofix-and-interaction-requirements.md
- docs/plans/2026-04-18-001-feat-ce-doc-review-autofix-and-interaction-overhaul-plan.md
- docs/solutions/skill-design/ce-doc-review-calibration-patterns-2026-04-19.md
Key design decisions
A reviewer should scrutinize these specifically; the rest of the diff follows from them.
- Premise-dependency chains: synthesis-and-presentation.md §3.5c.
- Inline enum anchoring over {schema} injection. The JSON schema was
  insufficient on its own for longer personas (adversarial at 89
  lines) — attention got pushed down. Inline enum anchoring at the top
  of the contract held where injection alone didn't.
- safe_auto resilience got strawman-resistance guidance. Without it,
  coherence silently demoted wrong-count and stale-cross-ref findings
  to manual on over-charitable re-interpretation. The persona-level
  patterns section is short and specific — three patterns, explicit
  strawman test per pattern.
- ce-doc-review does NOT dispatch ce-learnings-researcher, per the
  pipeline-separation learning
  (docs/solutions/skill-design/research-agent-pipeline-separation-2026-04-05.md).
  The learnings-researcher refactor benefits ce-plan / ce-code-review /
  ce-optimize / ce-ideate callers, not doc-review.
Testing
- bun test: 798 pass, 0 fail. 13 new contract tests cover the
  three-tier enum, premise-chain mechanics, FYI floor, cross-persona
  boost, R29/R30 suppression, headless envelope tiers,
  terminal-question structure, and schema expansion.
- Fixtures in tests/fixtures/ce-doc-review/ exercise complementary
  shapes: rename/infra (nested chains), auth migration (peer multi-root
  + security persona), UI feature (zero-root + design-lens +
  proportional review). Seed maps are documented in each fixture's
  header comment. Variance is expected; validate trends across runs,
  not single samples.
Scope boundaries
Fixes #562 spiritually (document-review analog of the ce-review interactive judgment overhaul).