[Spec] Pipeline-wide convergence protocol (identify→revise→re-review) + recursive summarizer + review-model overhaul #239

@jeremymanning

Summary

Make every reviewable pipeline step run a disciplined identify → revise → re-review convergence cycle driven by that step's review panel, with honest non-convergence handling (kickback + provenance), and make every step robust to LLM context overflow. The design is two small reusable single-source-of-truth (SSoT) primitives plus thin per-step adapters: wide reach, low complexity. This also removes the point system in favor of unanimous-acceptance convergence, and folds in fixes for a batch of real wiring bugs found during a full pipeline audit.

Design doc (living SSoT, will land on the spec branch): docs/superpowers/specs/2026-05-27-pipeline-convergence-protocol.md.

Motivation

A code-grounded audit of all ~17 pipeline steps found:

  • Most steps have no review at all — an agent emits an artifact and the project advances (only the formal research/paper review panels + the Tasker analyze loop review anything).
  • The Tasker analyze loop never converges (spec-014 finding): the analyzer surfaces fresh findings every round, is_clean requires the literal string "CLEAN", so it always hits the cap and advances "best-effort" while reporting passed — masking non-convergence.
  • Context overflow is pervasive and unhandled (qwen3.5-122b): planner, tasker (worst — re-sends full spec+plan+tasks+all reviews every round), specifier, paper_specifier (full research spec+plan+tasks untruncated), etc. Only paper_reviewer summarizes, and it's single-pass (no recursion) with a truncation fallback.
  • Gating is internally inconsistent: research-review uses a point threshold (≥5.0) and all-accept; paper-review uses all-accept only (PAPER_ACCEPT_THRESHOLD defined but unused).
  • Real wiring bugs (see "Discrepancies" below): for example, the publisher is not wired into the graph, and the research implementer's prompt is actually the paper-revision LaTeX prompt.

Goal

A reasonable idea has a convergent path to publication within a bounded budget (kickbacks allowed); genuinely poor work does not; and this holds across all 9 domains (LIBRARIAN_DEFAULT_FIELDS).

1. The convergence protocol (per reviewable step)

  • R1 — identify: each reviewer in the step's panel raises structured critical concerns (id, reviewer, severity, artifact, location, text).
  • R2 — revise: the step's author/reviser addresses every concern, runs a self-consistency pass, and emits a structured response + change-log per concern.
  • R3 — re-review: each reviewer re-judges, anchored to its own R1 concerns + the R2 change-log: resolved without introducing new problems? → pass / fail(+new concerns). An R1-accepter re-reviews only if R2 changed an artifact relevant to its lens.
  • Iterate [R2 → R3] until all reviewers pass or the 3-round cap is reached. On non-convergence, adaptive kickback applies: the worst unresolved severity routes the project to the appropriate prior stage, carrying a kickback record (unresolved concerns + links to all artifacts/reviews + plain-language "why it failed to converge") for the next worker.
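The kickback rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine's code: the severity scale, the stage names in `KICKBACK_ROUTING`, and the `KickbackRecord` field names are all hypothetical; only the concern fields (id, reviewer, severity, artifact, location, text) come from the protocol description.

```python
from dataclasses import dataclass

# Hypothetical severity scale, ordered from least to most severe.
SEVERITY_ORDER = ["minor", "major", "critical"]

# Hypothetical routing table: worst unresolved severity -> prior stage.
KICKBACK_ROUTING = {"minor": "tasks", "major": "plan", "critical": "spec"}


@dataclass
class Concern:
    id: str
    reviewer: str
    severity: str
    artifact: str
    location: str
    text: str


@dataclass
class KickbackRecord:
    target_stage: str
    unresolved: list
    artifact_links: list
    why_not_converged: str


def build_kickback(unresolved, artifact_links, rounds_used):
    """Route on the worst unresolved severity and explain the failure."""
    worst = max(unresolved, key=lambda c: SEVERITY_ORDER.index(c.severity))
    return KickbackRecord(
        target_stage=KICKBACK_ROUTING[worst.severity],
        unresolved=list(unresolved),
        artifact_links=list(artifact_links),
        why_not_converged=(
            f"{len(unresolved)} concern(s) still open after {rounds_used} "
            f"rounds; worst severity is {worst.severity} ({worst.id})."
        ),
    )
```

Keeping the routing table as data rather than branching logic would let each step supply its own mapping.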

2. Review model — replaces the point system

  • Point system REMOVED (no accumulated 0.5/1.0 review points).
  • Gate = unanimous LLM-panel acceptance within the 3-round cap; else kickback. Unifies research-review (had a threshold) with paper-review (already all-accept).
  • Human + simulated-personality reviews are ADVISORY INPUTS, not gates. Each submitted review goes through review-intake triage (a new shared, stage-aware SSoT agent): (a) a quality filter (evidence-based / specific / relevant; otherwise ignored); (b) aspect-mapping: if the review matches a lens an LLM reviewer covers, that reviewer receives it as additional input. Reviews that pass the quality, safety, and on-topic checks are preserved in the project folder and included in the publication's review log; unsafe, poor, or off-topic ones are excluded.
  • Living document: triaged comments that arrive after a project is posted append to the project log and (batched) trigger a recompile that adds/updates a Discussion section and produces a new Zenodo version DOI when the PDF materially changes.
  • Consequence: the public status model (README/about-page Backlog→Ready→Done) must be re-expressed in convergence terms; status_reporter updated accordingly.
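The two triage stages (quality filter, then aspect-mapping) might look something like the sketch below. The real agent is described as an LLM agent; here the quality signals are passed in as plain booleans and the lens matching is a keyword lookup, both purely as stand-ins. `LENS_KEYWORDS`, `TriageResult`, and `triage_review` are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical lens -> keyword map; a stand-in for LLM-based aspect-mapping.
LENS_KEYWORDS = {
    "methodology": ("method", "control", "sample"),
    "novelty": ("novel", "prior work", "already published"),
}


@dataclass
class TriageResult:
    accepted: bool   # kept in the project folder and the review log
    reason: str
    routed_to: list  # lenses whose LLM reviewer receives this review


def triage_review(text, cites_evidence, on_topic, safe):
    """Quality filter first, then map the review onto reviewer lenses."""
    if not safe:
        return TriageResult(False, "unsafe", [])
    if not on_topic:
        return TriageResult(False, "off-topic", [])
    if not cites_evidence:
        return TriageResult(False, "not evidence-based", [])
    lowered = text.lower()
    lenses = [
        lens
        for lens, words in LENS_KEYWORDS.items()
        if any(word in lowered for word in words)
    ]
    return TriageResult(True, "ok", lenses)
```

In this sketch a review that passes the filter but matches no lens is still accepted (and so preserved in the log); it is simply not routed to any reviewer.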

3. Two SSoT primitives

3a. Recursive task-preserving summarizer: src/llmxive/tools/summarize.py (new).
Signature: summarize_to_budget(content, *, goal, model, token_budget=None) -> str.

  • The goal is a preservation contract: what MUST survive, possibly verbatim (e.g. link-check → preserve every URL/DOI verbatim; logic-check → preserve the argument chain; claim-check → preserve numbers).
  • Discrete-element checks use deterministic extraction, not summarization.
  • Pipeline: boundary-aware chunking → goal-targeted per-chunk summary → recurse on joined summaries → last-resort truncate with a NOTICE. Critical elements are carried verbatim through every recursion level.
  • Generalizes paper_reviewer's chunk+cache logic (which then calls it).
  • Edge cases are enumerated and tested with real examples: atomic-unit splitting across chunks; cross-chunk references/logic; quantitative claims; ordering; output cut-off; recursion-loss compounding.
  • Validated FIRST.
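A minimal sketch of the chunk → summarize → recurse → truncate flow, using the signature given above. Several things here are assumptions made for illustration: `model` is treated as a callable `(goal, chunk) -> str`, tokens are approximated by whitespace-separated words, and `DEFAULT_BUDGET`, `MAX_DEPTH`, `count_tokens`, `chunk_on_paragraphs`, and `extract_urls` are hypothetical helpers. Verbatim carry-through of critical elements is not implemented; `extract_urls` only shows what deterministic extraction of a discrete element could look like.

```python
import re

NOTICE = "[NOTICE: truncated to fit budget]"
DEFAULT_BUDGET = 200  # hypothetical default when token_budget is None
MAX_DEPTH = 3         # hypothetical recursion cap before truncating


def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def extract_urls(content):
    """Deterministic extraction for discrete elements (no summarization)."""
    found = (m.rstrip(".,;)") for m in re.findall(r"https?://\S+", content))
    return list(dict.fromkeys(found))  # de-duplicate, keep first-seen order


def chunk_on_paragraphs(text, budget):
    """Pack whole paragraphs into chunks; never split inside a paragraph."""
    chunks, current = [], []
    for para in text.split("\n\n"):
        if current and count_tokens(" ".join(current + [para])) > budget:
            chunks.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def summarize_to_budget(content, *, goal, model, token_budget=None):
    budget = token_budget or DEFAULT_BUDGET

    def _reduce(text, depth):
        if count_tokens(text) <= budget:
            return text
        if depth >= MAX_DEPTH:
            # Last resort: hard truncate and say so.
            keep = max(0, budget - count_tokens(NOTICE))
            return " ".join(text.split()[:keep] + [NOTICE])
        chunks = chunk_on_paragraphs(text, budget)
        summaries = [model(goal, chunk) for chunk in chunks]
        return _reduce("\n\n".join(summaries), depth + 1)

    return _reduce(content, 0)
```

A single paragraph larger than the budget stays in one oversized chunk here; a fuller version would need a secondary split (for example on sentence boundaries), which is one of the atomic-unit edge cases listed above.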

3b. Generic review-convergence engine: src/llmxive/convergence.py (new).
Parameterized by a per-step ReviewSpec(artifacts, panel, reviser, kickback_routing, constitution, max_rounds=3). Types: Reviewer, Concern, ConcernResponse, Verdict, ConvergenceResult. The engine owns the round loop, concern tracking, the unanimous-accept test, kickback emission, and the persisted inspection trail (replaces the bespoke tasker_rounds/inspection). Overflowing inputs route through 3a. The per-project constitution.md is a standard input to every panel + the identify phase (from specified onward).
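As a rough illustration of how a ReviewSpec-driven round loop could fit together, here is a minimal sketch. The type names and the ReviewSpec parameters come from the description above, but every field inside `Verdict` and `ConvergenceResult`, the reviewer and reviser call signatures, and the `run_convergence` function name are assumptions. Kickback emission, concern tracking by id, the "R1-accepter re-reviews only if relevant" rule, and persistence are all omitted.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    reviewer: str
    passed: bool
    concerns: list = field(default_factory=list)


@dataclass
class ReviewSpec:
    artifacts: dict
    panel: list            # callables: (artifacts, constitution) -> Verdict
    reviser: Callable      # (artifacts, concerns) -> revised artifacts
    kickback_routing: dict  # unused in this sketch
    constitution: str
    max_rounds: int = 3


@dataclass
class ConvergenceResult:
    converged: bool
    rounds_used: int
    artifacts: dict
    trail: list  # one list of verdicts per review pass


def run_convergence(spec):
    artifacts = dict(spec.artifacts)
    # R1: every reviewer in the panel identifies concerns.
    verdicts = [review(artifacts, spec.constitution) for review in spec.panel]
    trail = [verdicts]
    for round_no in range(1, spec.max_rounds + 1):
        if all(v.passed for v in verdicts):  # unanimous-accept test
            return ConvergenceResult(True, round_no - 1, artifacts, trail)
        open_concerns = [c for v in verdicts for c in v.concerns]
        # R2: the reviser addresses every open concern.
        artifacts = spec.reviser(artifacts, open_concerns)
        # R3: the panel re-judges the revised artifacts.
        verdicts = [review(artifacts, spec.constitution) for review in spec.panel]
        trail.append(verdicts)
    converged = all(v.passed for v in verdicts)
    return ConvergenceResult(converged, spec.max_rounds, artifacts, trail)
```

When `converged` is False after `max_rounds`, the caller would hand the last round's open concerns to whatever builds the kickback record.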

4. Per-step integration (audit-grounded; full table in the design doc §5/§6)

  • Idea (flesh_out + validator-derived multi-lens panel: rq_validity/novelty/feasibility) · kickback → brainstormed.
  • project_initializer — EXEMPT (mechanical).
  • Spec (specifier+clarifier collapsed into ONE unit; new 4-lens panel: requirements_coverage/internal_consistency/testability/scope).
  • Plan (planner; new 4-lens panel: methodology/spec_coverage/data_resources/plan_consistency; deterministic guards as pre-filter).
  • Tasks (tasker; Mode-A/B refactored INTO the engine; new 4-lens panel: coverage/ordering/executability/constraint_preservation; spec-014 honest-reporting + FR-021 per-round budget fold in here).
  • Research unit (implementer + EXISTING 8-panel; new implement↔review loop replaces immediate kickback; adaptive kickback).
  • Advancement/verdict routing ([Phase 7] Advancement & Verdict Routing #51) — replaced by the engine's converge/kickback outcome (unifies the two routing schemes).
  • Paper track mirrors the research track (paper-spec unit, paper-plan, paper-tasks, paper-implement + EXISTING 12-panel reviewing the assembled paper).
  • Publisher — EXEMPT but must be wired (paper_accepted → publisher → posted).
  • Cross-cutting — task_atomizer/joiner = mechanical transforms invoked during tasks revision; status_reporter/repository_hygiene = exempt maintenance.

5. Calibration & validation (the high-impact part)

Anti-circularity is the core constraint (reviewers are LLMs). Two grounded label sources:

  • Negatives (must be REJECTED): inject specific known flaws into good artifacts (trivial/circular RQ, FR with no task, gutted requirement, fabricated data, nonexistent citation, plan↔tasks contradiction) → objective labels; the right lens must catch each.
  • Positives (must be ACCEPTABLE): ≥1 real human-peer-reviewed published paper per domain (9 fields), reverse-engineered to ideas; + HF top-5 daily papers + a sample of the real brainstorm backlog (the practical distribution).

Co-evaluation (the real paper is the domain-expertise anchor; human spot-checks a sample). Two granularities: per-step unit calibration (labeled good/flawed sets; 100% recall on injected critical flaws, low over-flag rate) + end-to-end traversal (golden projects reach posted; weak projects rejected; K runs for noise-robustness). Domain-generality validated on a held-out field. Summarizer validated first. PROJ-261/262 are e2e smoke tests, not the quality bar.
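The per-step calibration targets (100% recall on injected critical flaws, low over-flag rate) could be computed along these lines. This is a sketch only: the `(label, flagged)` input shape, the function names, and the `max_over_flag=0.1` threshold are assumptions; the spec itself only says the over-flag rate should be "low".

```python
def calibration_metrics(results):
    """results: iterable of (label, flagged), label is 'flawed' or 'good'."""
    flawed = [flagged for label, flagged in results if label == "flawed"]
    good = [flagged for label, flagged in results if label == "good"]
    # Recall: share of injected-flaw artifacts the panel flagged.
    recall = sum(flawed) / len(flawed) if flawed else None
    # Over-flag rate: share of known-good artifacts the panel flagged.
    over_flag_rate = sum(good) / len(good) if good else None
    return recall, over_flag_rate


def meets_targets(recall, over_flag_rate, max_over_flag=0.1):
    # max_over_flag is a hypothetical threshold, not a value from the spec.
    return (
        recall == 1.0
        and over_flag_rate is not None
        and over_flag_rate <= max_over_flag
    )
```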

6. Discrepancies / bugs to fix (found in audit)

  1. Research implementer registry prompt is the paper-revision LaTeX prompt (no research-code prompt exists).
  2. Publisher not wired into the graph (paper_accepted→posted skips it — no DOI/compile/publication.yaml).
  3. No recursive summarizer; pervasive unhandled overflow.
  4. analyze_cmd.ANALYZE_SYSTEM_PROMPT_PATH dead; paper analyze loop reuses the research prompt; analyzer omits the constitution.
  5. Dead escalation paths (clarifier.attempts_so_far=0; paper_clarifier never branches on escalate).
  6. Two parallel paper-revision routing schemes (graph vs advancement.py).
  7. research_reviewer._produced_by self-review-prevention is a stub.
  8. Stale prompt stage-headers vs graph wiring.
  9. PAPER_ACCEPT_THRESHOLD defined but unused.
  10. paper_specifier/paper_clarifier prompts advertise code_summary/data_summary inputs never supplied.

7. Scope & sequencing

  1. Recursive summarizer (Primitive 1) — TDD on real large artifacts + real qwen calls; edge-case suite. First.
  2. Convergence engine + kickback (Primitive 2) + review-intake triage — TDD.
  3. Per-step ReviewSpec adapters (generalize/formalize existing panels; add early-step panels).
  4. Reviewer calibration (per-step labeled sets) until targets met, domain-general.
  5. End-to-end: push real projects through the entire pipeline; scrutinize every output for truncation / missing artifacts / broken tools / poor quality.

Folded-in standalone fix: arXiv resilience (theoremsearch degrades gracefully on transient 429/503).
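One way graceful degradation on transient 429/503 responses might be structured is retry with exponential backoff, then an explicitly degraded empty result instead of a crash. Everything named here (`HTTPStatusError`, `search_with_degradation`, the `fetch` callable, the result dict shape, the retry count and delays) is hypothetical and not taken from theoremsearch.

```python
import time

TRANSIENT_STATUSES = {429, 503}


class HTTPStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def search_with_degradation(fetch, query, retries=3, base_delay=1.0,
                            sleep=time.sleep):
    """Retry transient failures with backoff; degrade instead of raising."""
    for attempt in range(retries):
        try:
            return {"results": fetch(query), "degraded": False}
        except HTTPStatusError as err:
            if err.status not in TRANSIENT_STATUSES:
                raise  # non-transient errors still surface
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    # All attempts hit a transient error: report an empty, flagged result.
    return {"results": [], "degraded": True}
```

The `degraded` flag lets downstream steps distinguish "no matches" from "search unavailable".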

8. Affected issues

#107 (pipeline map), #51 (advancement/verdict routing → replaced by engine), #50 (8-panel) & #56 (12-panel) → become engine R1/R3, #49 (implementer bug → implement↔review loop + filesystem re-verification), #216 (high quality from small models → recursive summarizer), #112 (non-deterministic relevance judge → calibration adjacency), #58 (publisher wiring). Per-phase #47–#58 and per-agent #63–#106 gain a ReviewSpec as implemented.

🤖 Generated with Claude Code


Labels: enhancement
