Summary
Make every reviewable pipeline step run a disciplined identify → revise → re-review convergence cycle driven by that step's review panel, with honest non-convergence handling (kickback + provenance), and make every step robust to LLM context overflow. The design is two small reusable SSoT primitives plus thin per-step adapters: wide reach, low complexity. It also replaces the point system with unanimous-acceptance convergence and folds in fixes for a batch of real wiring bugs found during a full pipeline audit.
Design doc (living SSoT, will land on the spec branch): docs/superpowers/specs/2026-05-27-pipeline-convergence-protocol.md.
Motivation
A code-grounded audit of all ~17 pipeline steps found:
Most steps have no review at all — an agent emits an artifact and the project advances (only the formal research/paper review panels + the Tasker analyze loop review anything).
The Tasker analyze loop never converges (spec-014 finding): the analyzer surfaces fresh findings every round, is_clean requires the literal string "CLEAN", so it always hits the cap and advances "best-effort" while reporting passed — masking non-convergence.
Context overflow is pervasive and unhandled (qwen3.5-122b): planner, tasker (worst — re-sends full spec+plan+tasks+all reviews every round), specifier, paper_specifier (full research spec+plan+tasks untruncated), etc. Only paper_reviewer summarizes, and it's single-pass (no recursion) with a truncation fallback.
Gating is internally inconsistent: research-review uses a point threshold (≥5.0) and all-accept; paper-review uses all-accept only (PAPER_ACCEPT_THRESHOLD defined but unused).
Real wiring bugs (see "Discrepancies" below), incl. the publisher is not wired into the graph and the research implementer's prompt is actually the paper-revision LaTeX prompt.
Goal
A reasonable idea has a convergent path to publication within a bounded budget (kickbacks allowed); genuinely poor work does not; and this holds across all 9 domains (LIBRARIAN_DEFAULT_FIELDS).
1. The convergence protocol (per reviewable step)
R1 — identify: each reviewer in the step's panel raises structured critical concerns (id, reviewer, severity, artifact, location, text).
R2 — revise: the step's author/reviser addresses every concern, runs a self-consistency pass, and emits a structured response + change-log per concern.
R3 — re-review: each reviewer re-judges, anchored to its own R1 concerns + the R2 change-log: resolved without introducing new problems? → pass / fail(+new concerns). An R1-accepter re-reviews only if R2 changed an artifact relevant to its lens.
Iterate [R2 → R3] until all reviewers pass or the 3-round cap is reached. On non-convergence, adaptive kickback applies: the worst unresolved severity routes the project to the appropriate prior stage, carrying a kickback record (unresolved concerns + links to all artifacts/reviews + a plain-language "why it failed to converge") for the next worker.
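The kickback rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's implementation: the severity scale (`minor`/`major`/`critical`), the stage names in `KICKBACK_ROUTING`, and the `build_kickback` helper are hypothetical. Only the `Concern` fields come from the protocol.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical severity scale; the protocol does not name the levels.
SEVERITY_RANK = {"minor": 0, "major": 1, "critical": 2}

# Hypothetical routing table: worst unresolved severity -> prior stage.
KICKBACK_ROUTING = {"minor": "tasks", "major": "plan", "critical": "spec"}


@dataclass(frozen=True)
class Concern:
    """One structured R1 concern, using the fields listed in the protocol."""
    id: str
    reviewer: str
    severity: str
    artifact: str
    location: str
    text: str


def build_kickback(unresolved: list[Concern], why: str) -> dict:
    """Build a kickback record routed by the worst unresolved severity."""
    if not unresolved:
        raise ValueError("kickback requires at least one unresolved concern")
    worst = max(unresolved, key=lambda c: SEVERITY_RANK[c.severity])
    return {
        "target_stage": KICKBACK_ROUTING[worst.severity],
        "unresolved_concerns": [c.id for c in unresolved],
        "artifacts": sorted({c.artifact for c in unresolved}),
        "why_not_converged": why,  # plain-language explanation for the next worker
    }
```

A single critical concern dominates any number of minor ones, so the record always points at the earliest stage that the worst problem requires.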
2. Review model — replaces the point system
Point system REMOVED (no accumulated 0.5/1.0 review points).
Gate = unanimous LLM-panel acceptance within the 3-round cap; else kickback. Unifies research-review (had a threshold) with paper-review (already all-accept).
Human + simulated-personality reviews are ADVISORY INPUTS, not gates. Each submitted review goes through review-intake triage (a new shared, stage-aware SSoT agent): (a) a quality filter (evidence-based, specific, relevant; otherwise the review is ignored); (b) aspect-mapping: if the review matches a lens that an LLM reviewer covers, that reviewer receives it as additional input. Reviews that pass the quality, safety, and on-topic checks are preserved in the project folder and included in the publication's review log; unsafe, poor, or off-topic ones are excluded.
Living document: triaged comments that arrive after a project reaches posted append to the project log and (batched) trigger a recompile that adds/updates a Discussion section, plus a new Zenodo version DOI when the PDF materially changes.
Consequence: the public status model (README/about-page Backlog→Ready→Done) must be re-expressed in convergence terms; status_reporter updated accordingly.
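The review-intake triage described in this section could look roughly like the sketch below. In the design it is an LLM agent; the word-count and keyword heuristics here are deterministic stand-ins so the example runs offline. The lens keyword sets, `TriageResult`, and `triage` are hypothetical names, and the safety check is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical lens -> keyword map; a real triage agent would use an LLM.
LENS_KEYWORDS = {
    "methodology": {"method", "baseline", "control"},
    "data_resources": {"dataset", "data", "sample"},
}


@dataclass
class TriageResult:
    keep: bool                                           # preserved in the review log
    routed_to: list[str] = field(default_factory=list)   # lenses that receive it as input


def triage(review_text: str, min_words: int = 8) -> TriageResult:
    """Filter a submitted review, then map it to matching reviewer lenses."""
    tokens = review_text.split()
    # (a) Quality filter stand-in: very short reviews count as non-specific.
    if len(tokens) < min_words:
        return TriageResult(keep=False)
    words = {t.strip(".,;:!?()").lower() for t in tokens}
    # (b) Aspect mapping: route to every lens whose keywords appear.
    lenses = sorted(lens for lens, kws in LENS_KEYWORDS.items() if words & kws)
    return TriageResult(keep=True, routed_to=lenses)
```

The function never returns a verdict: a kept review only becomes extra input for the matching LLM reviewer, which keeps the gate itself in the panel's hands.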
3. Two SSoT primitives
3a. Recursive task-preserving summarizer: src/llmxive/tools/summarize.py (new).
summarize_to_budget(content, *, goal, model, token_budget=None) -> str.
The goal is a preservation contract: what MUST survive, possibly verbatim (e.g. link-check → preserve every URL/DOI verbatim; logic-check → preserve the argument chain; claim-check → preserve numbers). Discrete-element checks use deterministic extraction, not summarization.
Pipeline: boundary-aware chunking → goal-targeted per-chunk summary → recurse on joined summaries → last-resort truncate with a NOTICE. Critical elements are carried verbatim through every recursion level.
Generalizes paper_reviewer's chunk+cache logic (which then calls it).
Edge cases enumerated + tested with real examples (atomic-unit splitting across chunks; cross-chunk references/logic; quantitative claims; ordering; output cut-off; recursion-loss compounding). Validated FIRST.
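A minimal sketch of the chunk → summarize → recurse → truncate flow, assuming a link-check style goal. Several things here are assumptions rather than the planned implementation: `model` is a callable `(chunk, goal) -> summary` so the example runs without an LLM, tokens are approximated by whitespace-separated words, the default budget of 200 is arbitrary, and `max_depth`/`_depth` are illustrative extras on top of the stated signature.

```python
from __future__ import annotations

import re
from typing import Callable

URL_RE = re.compile(r"https?://[^\s)>\]]+")
NOTICE = "[NOTICE: truncated to fit token budget]"


def count_tokens(text: str) -> int:
    """Crude tokenizer stand-in: whitespace-separated words."""
    return len(text.split())


def chunk_on_paragraphs(text: str, budget: int) -> list[str]:
    """Boundary-aware chunking: pack whole paragraphs up to the budget.

    A single paragraph larger than the budget becomes its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for para in text.split("\n\n"):
        if current and count_tokens("\n\n".join(current + [para])) > budget:
            chunks.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def summarize_to_budget(content: str, *, goal: str,
                        model: Callable[[str, str], str],
                        token_budget: int | None = None,
                        max_depth: int = 4, _depth: int = 0) -> str:
    budget = 200 if token_budget is None else token_budget
    if count_tokens(content) <= budget:
        return content
    # Deterministic extraction: URLs are re-collected at every level.
    urls = list(dict.fromkeys(URL_RE.findall(content)))
    if _depth >= max_depth:
        # Last resort: truncate, flag it, and still append URLs verbatim
        # (so the result may exceed the budget by the URL list).
        kept = " ".join(content.split()[:budget])
        return "\n".join([kept, NOTICE, *urls])
    summaries = [model(chunk, goal) for chunk in chunk_on_paragraphs(content, budget)]
    joined = "\n\n".join(summaries)
    missing = [u for u in urls if u not in joined]
    if missing:
        joined += "\n\n" + "\n".join(missing)
    return summarize_to_budget(joined, goal=goal, model=model, token_budget=budget,
                               max_depth=max_depth, _depth=_depth + 1)
```

Re-extracting the URLs at each recursion level, instead of trusting the per-chunk summaries, is what keeps recursion loss from compounding for discrete elements.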
3b. Generic review-convergence engine — src/llmxive/convergence.py (new).
Parameterized by a per-step ReviewSpec(artifacts, panel, reviser, kickback_routing, constitution, max_rounds=3). Types: Reviewer, Concern, ConcernResponse, Verdict, ConvergenceResult. The engine owns the round loop, concern tracking, the unanimous-accept test, kickback emission, and the persisted inspection trail (replaces the bespoke tasker_rounds/inspection). Overflowing inputs route through 3a. The per-project constitution.md is a standard input to every panel + the identify phase (from specified onward).
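The round loop the engine owns might be sketched like this. The type names and `ReviewSpec` fields follow the text above; everything else is an assumption: reviewers and the reviser are plain callables standing in for LLM agents, `Concern` is reduced to a dict, `run_convergence` is a hypothetical entry point, and kickback emission is left out. As a simplification, every reviewer re-reviews every round.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

Concern = dict  # stand-in for the structured concern record


@dataclass
class Verdict:
    reviewer: str
    passed: bool
    concerns: list[Concern] = field(default_factory=list)


# Reviewer: (artifacts, change_log, constitution) -> Verdict
# Reviser:  (artifacts, open_concerns) -> (new_artifacts, change_log)
Reviewer = Callable[[dict, list, str], Verdict]
Reviser = Callable[[dict, list], tuple]


@dataclass
class ReviewSpec:
    artifacts: dict
    panel: dict[str, Reviewer]
    reviser: Reviser
    kickback_routing: dict[str, str]
    constitution: str = ""
    max_rounds: int = 3


@dataclass
class ConvergenceResult:
    converged: bool
    rounds: int                    # number of [R2 -> R3] rounds executed
    trail: list[list[Verdict]]     # inspection trail: one entry per review pass
    unresolved: list[Concern] = field(default_factory=list)


def _review(spec: ReviewSpec, artifacts: dict, change_log: list) -> list[Verdict]:
    return [rev(artifacts, change_log, spec.constitution) for rev in spec.panel.values()]


def run_convergence(spec: ReviewSpec) -> ConvergenceResult:
    artifacts: dict = dict(spec.artifacts)
    verdicts = _review(spec, artifacts, [])            # R1: identify
    trail = [verdicts]
    rounds = 0
    # Unanimous-accept gate, bounded by the round cap.
    while not all(v.passed for v in verdicts) and rounds < spec.max_rounds:
        open_concerns = [c for v in verdicts if not v.passed for c in v.concerns]
        artifacts, change_log = spec.reviser(artifacts, open_concerns)   # R2: revise
        verdicts = _review(spec, artifacts, change_log)                  # R3: re-review
        trail.append(verdicts)
        rounds += 1
    unresolved = [c for v in verdicts if not v.passed for c in v.concerns]
    return ConvergenceResult(all(v.passed for v in verdicts), rounds, trail, unresolved)
```

A non-converged result reports `converged=False` together with the unresolved concerns, so a caller cannot mistake a capped run for a pass.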
4. Per-step integration (audit-grounded; full table in the design doc §5/§6)
Spec (specifier+clarifier collapsed into ONE unit; new 4-lens panel: requirements_coverage/internal_consistency/testability/scope).
Plan (planner; new 4-lens panel: methodology/spec_coverage/data_resources/plan_consistency; deterministic guards as pre-filter).
Tasks (tasker; Mode-A/B refactored INTO the engine; new 4-lens panel: coverage/ordering/executability/constraint_preservation; spec-014 honest-reporting + FR-021 per-round budget fold in here).
Research unit (implementer + EXISTING 8-panel; new implement↔review loop replaces immediate kickback; adaptive kickback).
Publisher wired into the graph (paper_accepted → publisher → posted).
5. Calibration & validation (the high-impact part)
Anti-circularity is the core constraint (reviewers are LLMs). Two grounded label sources:
Negatives (must be REJECTED): inject specific known flaws into good artifacts (trivial/circular RQ, FR with no task, gutted requirement, fabricated data, nonexistent citation, plan↔tasks contradiction) → objective labels; the right lens must catch each.
Positives (must be ACCEPTABLE): ≥1 real human-peer-reviewed published paper per domain (9 fields), reverse-engineered to ideas; + HF top-5 daily papers + a sample of the real brainstorm backlog (the practical distribution).
Co-evaluation (the real paper is the domain-expertise anchor; human spot-checks a sample). Two granularities: per-step unit calibration (labeled good/flawed sets; 100% recall on injected critical flaws, low over-flag rate) + end-to-end traversal (golden projects reach posted; weak projects rejected; K runs for noise-robustness). Domain-generality validated on a held-out field. Summarizer validated first. PROJ-261/262 are e2e smoke tests, not the quality bar.
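The per-step unit calibration bar (100% recall on injected critical flaws, low over-flag rate) can be computed with a few lines. This is a sketch under assumptions: `calibration_metrics` and `passes_bar` are hypothetical helpers, and the `max_over_flag=0.2` threshold is a placeholder, since the text does not quantify "low".

```python
from __future__ import annotations


def calibration_metrics(labels: dict[str, bool], flagged: set[str]) -> dict[str, float]:
    """Score a panel against a labeled set.

    labels:  artifact id -> True if a critical flaw was injected (must be rejected).
    flagged: artifact ids the panel rejected.
    """
    flawed = {a for a, bad in labels.items() if bad}
    good = set(labels) - flawed
    recall = len(flawed & flagged) / len(flawed) if flawed else 1.0
    over_flag = len(good & flagged) / len(good) if good else 0.0
    return {"recall": recall, "over_flag_rate": over_flag}


def passes_bar(metrics: dict[str, float], max_over_flag: float = 0.2) -> bool:
    """Every injected flaw caught, and few good artifacts rejected.

    max_over_flag is an assumed placeholder threshold.
    """
    return metrics["recall"] == 1.0 and metrics["over_flag_rate"] <= max_over_flag
```

Because the labels come from injected flaws and published papers rather than from another LLM's opinion, the two numbers stay meaningful even though the reviewers being scored are LLMs.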
6. Discrepancies / bugs to fix (found in audit)
Research implementer registry prompt is the paper-revision LaTeX prompt (no research-code prompt exists).
Publisher not wired into the graph (paper_accepted→posted skips it — no DOI/compile/publication.yaml).
No recursive summarizer; pervasive unhandled overflow.
analyze_cmd.ANALYZE_SYSTEM_PROMPT_PATH dead; paper analyze loop reuses the research prompt; analyzer omits the constitution.
Dead escalation paths (clarifier.attempts_so_far=0; paper_clarifier never branches on escalate).
Two parallel paper-revision routing schemes (graph vs advancement.py).
research_reviewer._produced_by self-review-prevention is a stub.
Stale prompt stage-headers vs graph wiring.
PAPER_ACCEPT_THRESHOLD defined but unused.
paper_specifier/paper_clarifier prompts advertise code_summary/data_summary inputs never supplied.
7. Scope & sequencing
Recursive summarizer (Primitive 1) — TDD on real large artifacts + real qwen calls; edge-case suite. First.
ReviewSpec adapters (generalize/formalize existing panels; add early-step panels).
Folded-in standalone fix: arXiv resilience (theoremsearch degrades gracefully on transient 429/503).
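One way to read "degrades gracefully on transient 429/503" is retry with exponential backoff, then return an empty result flagged as degraded instead of raising. The sketch below assumes that reading. `search_with_degradation`, the `fetch` callable returning `(status, results)`, and the retry/delay defaults are all hypothetical and not taken from theoremsearch.

```python
from __future__ import annotations

import time
from typing import Callable

TRANSIENT_STATUSES = {429, 503}


def search_with_degradation(
    fetch: Callable[[], tuple[int, list[str]]],
    *,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[list[str], bool]:
    """Return (results, degraded).

    Transient statuses are retried with exponential backoff. If every
    attempt is transient, return an empty result with degraded=True so
    the caller can continue without citations from this source.
    """
    for attempt in range(retries):
        status, results = fetch()
        if status == 200:
            return results, False
        if status not in TRANSIENT_STATUSES:
            raise RuntimeError(f"unexpected HTTP status {status}")
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return [], True
```

Injecting `sleep` keeps the backoff testable without real waiting, and non-transient statuses still raise so genuine failures are not silently swallowed.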
8. Affected issues
#107 (pipeline map), #51 (advancement/verdict routing → replaced by engine), #50 (8-panel) & #56 (12-panel → become engine R1/R3), #49 (implementer bug → implement↔review loop + filesystem re-verification), #216 (high quality from small models → recursive summarizer), #112 (non-deterministic relevance judge → calibration adjacency), #58 (publisher wiring). Per-phase #47–#58 and per-agent #63–#106 gain a ReviewSpec as implemented.
🤖 Generated with Claude Code