[Spec] Pipeline-wide convergence protocol (identify→revise→re-review) + recursive summarizer + review-model overhaul

## Summary

Make **every reviewable pipeline step** run a disciplined **identify → revise → re-review** convergence cycle driven by that step's review panel, with honest non-convergence handling (kickback + provenance), and make every step robust to LLM context overflow. Two small reusable SSoT primitives + thin per-step adapters — wide reach, low complexity. This also **removes the point system** in favor of unanimous-acceptance convergence, and folds in a batch of real wiring bugs found during a full pipeline audit.

Design doc (living SSoT, will land on the spec branch): `docs/superpowers/specs/2026-05-27-pipeline-convergence-protocol.md`.

## Motivation

A code-grounded audit of all ~17 pipeline steps found:
- **Most steps have no review at all** — an agent emits an artifact and the project advances (only the formal research/paper review panels + the Tasker analyze loop review anything).
- **The Tasker analyze loop never converges** (spec-014 finding): the analyzer surfaces fresh findings every round, `is_clean` requires the literal string "CLEAN", so it always hits the cap and advances "best-effort" while reporting `passed` — masking non-convergence.
- **Context overflow is pervasive and unhandled** (qwen3.5-122b): `planner`, `tasker` (worst — re-sends full spec+plan+tasks+all reviews every round), `specifier`, `paper_specifier` (full research spec+plan+tasks untruncated), etc. Only `paper_reviewer` summarizes, and it's single-pass (no recursion) with a truncation fallback.
- **Gating is internally inconsistent:** research-review uses a point threshold (`≥5.0`) **and** all-accept; paper-review uses all-accept only (`PAPER_ACCEPT_THRESHOLD` defined but unused).
- **Real wiring bugs** (see "Discrepancies" below), incl. the **publisher is not wired into the graph** and the research `implementer`'s prompt is actually the paper-revision LaTeX prompt.

## Goal

> A reasonable idea has a **convergent path** to publication within a bounded budget (kickbacks allowed); genuinely poor work does not; and this holds across all 9 domains (`LIBRARIAN_DEFAULT_FIELDS`).

## 1. The convergence protocol (per reviewable step)

- **R1 — identify:** each reviewer in the step's panel raises structured *critical* concerns (`id, reviewer, severity, artifact, location, text`).
- **R2 — revise:** the step's author/reviser addresses **every** concern, runs a self-consistency pass, and emits a structured **response + change-log per concern**.
- **R3 — re-review:** each reviewer re-judges, anchored to **its own** R1 concerns + the R2 change-log: *resolved without introducing new problems?* → pass / fail(+new concerns). An R1-accepter re-reviews only if R2 changed an artifact relevant to its lens.
- Iterate `[R2 → R3]` until **all reviewers pass** or **3 rounds** → **adaptive kickback**: the worst unresolved severity routes the project to the appropriate prior stage, carrying a **kickback record** (unresolved concerns + links to all artifacts/reviews + plain-language "why it failed to converge") for the next worker.

## 2. Review model — replaces the point system

- **Point system REMOVED** (no accumulated 0.5/1.0 review points).
- **Gate = unanimous LLM-panel acceptance** within the 3-round cap; else kickback. Unifies research-review (had a threshold) with paper-review (already all-accept).
- **Human + simulated-personality reviews are ADVISORY INPUTS, not gates.** Each submitted review → **review-intake triage** (a new shared, stage-aware SSoT agent): (a) quality filter (evidence-based / specific / relevant — else ignored); (b) aspect-mapping → if it matches a lens an LLM reviewer covers, that reviewer receives it as additional input. Quality+safety+on-topic reviews are preserved in the project folder and included in the publication's review log; unsafe/poor/off-topic ones are excluded.
- **Living document:** post-`posted` triaged comments append to the project log and (batched) trigger a recompile that adds/updates a **Discussion** section + a new Zenodo version DOI when the PDF materially changes.
- Consequence: the public status model (README/about-page Backlog→Ready→Done) must be re-expressed in convergence terms; `status_reporter` updated accordingly.

## 3. Two SSoT primitives

**3a. Recursive task-preserving summarizer** — `src/llmxive/tools/summarize.py` (new).
`summarize_to_budget(content, *, goal, model, token_budget=None) -> str`. The `goal` is a **preservation contract** (what MUST survive, possibly verbatim — e.g. link-check → preserve every URL/DOI verbatim; logic-check → preserve the argument chain; claim-check → preserve numbers). Discrete-element checks use deterministic **extraction**, not summarization. Boundary-aware chunking → goal-targeted per-chunk summary → **recurse** on joined summaries → last-resort truncate with a NOTICE. Critical elements carried verbatim through every recursion level. Generalizes `paper_reviewer`'s chunk+cache logic (which then calls it). **Edge cases enumerated + tested with real examples** (atomic-unit splitting across chunks; cross-chunk references/logic; quantitative claims; ordering; output cut-off; recursion-loss compounding). Validated FIRST.

**3b. Generic review-convergence engine** — `src/llmxive/convergence.py` (new).
Parameterized by a per-step `ReviewSpec(artifacts, panel, reviser, kickback_routing, constitution, max_rounds=3)`. Types: `Reviewer`, `Concern`, `ConcernResponse`, `Verdict`, `ConvergenceResult`. The engine owns the round loop, concern tracking, the unanimous-accept test, kickback emission, and the persisted inspection trail (replaces the bespoke `tasker_rounds`/inspection). Overflowing inputs route through 3a. The **per-project `constitution.md` is a standard input to every panel + the identify phase** (from `specified` onward).

## 4. Per-step integration (audit-grounded; full table in the design doc §5/§6)

- **Idea** (flesh_out + validator-derived multi-lens panel: rq_validity/novelty/feasibility) · kickback → brainstormed.
- **project_initializer** — EXEMPT (mechanical).
- **Spec** (specifier+clarifier collapsed into ONE unit; new 4-lens panel: requirements_coverage/internal_consistency/testability/scope).
- **Plan** (planner; new 4-lens panel: methodology/spec_coverage/data_resources/plan_consistency; deterministic guards as pre-filter).
- **Tasks** (tasker; Mode-A/B refactored INTO the engine; new 4-lens panel: coverage/ordering/executability/constraint_preservation; spec-014 honest-reporting + FR-021 per-round budget fold in here).
- **Research unit** (implementer + EXISTING 8-panel; new implement↔review loop replaces immediate kickback; adaptive kickback).
- **Advancement/verdict routing (#51)** — replaced by the engine's converge/kickback outcome (unifies the two routing schemes).
- **Paper track** mirrors the research track (paper-spec unit, paper-plan, paper-tasks, paper-implement + EXISTING 12-panel reviewing the assembled paper).
- **Publisher** — EXEMPT but **must be wired** (`paper_accepted → publisher → posted`).
- **Cross-cutting** — task_atomizer/joiner = mechanical transforms invoked during tasks revision; status_reporter/repository_hygiene = exempt maintenance.

## 5. Calibration & validation (the high-impact part)

Anti-circularity is the core constraint (reviewers are LLMs). Two grounded label sources:
- **Negatives (must be REJECTED):** inject specific known flaws into good artifacts (trivial/circular RQ, FR with no task, gutted requirement, fabricated data, nonexistent citation, plan↔tasks contradiction) → objective labels; the right lens must catch each.
- **Positives (must be ACCEPTABLE):** ≥1 real human-peer-reviewed published paper **per domain** (9 fields), reverse-engineered to ideas; + HF top-5 daily papers + a sample of the real brainstorm backlog (the practical distribution).

Co-evaluation (the real paper is the domain-expertise anchor; human spot-checks a sample). Two granularities: per-step unit calibration (labeled good/flawed sets; 100% recall on injected critical flaws, low over-flag rate) + end-to-end traversal (golden projects reach `posted`; weak projects rejected; K runs for noise-robustness). Domain-generality validated on a held-out field. **Summarizer validated first.** PROJ-261/262 are e2e smoke tests, not the quality bar.

## 6. Discrepancies / bugs to fix (found in audit)

1. Research `implementer` registry prompt is the paper-revision LaTeX prompt (no research-code prompt exists).
2. **Publisher not wired into the graph** (`paper_accepted→posted` skips it — no DOI/compile/publication.yaml).
3. No recursive summarizer; pervasive unhandled overflow.
4. `analyze_cmd.ANALYZE_SYSTEM_PROMPT_PATH` dead; paper analyze loop reuses the research prompt; analyzer omits the constitution.
5. Dead escalation paths (`clarifier.attempts_so_far=0`; `paper_clarifier` never branches on `escalate`).
6. Two parallel paper-revision routing schemes (graph vs advancement.py).
7. `research_reviewer._produced_by` self-review-prevention is a stub.
8. Stale prompt stage-headers vs graph wiring.
9. `PAPER_ACCEPT_THRESHOLD` defined but unused.
10. `paper_specifier`/`paper_clarifier` prompts advertise `code_summary`/`data_summary` inputs never supplied.

## 7. Scope & sequencing

1. Recursive summarizer (Primitive 1) — TDD on real large artifacts + real qwen calls; edge-case suite. **First.**
2. Convergence engine + kickback (Primitive 2) + review-intake triage — TDD.
3. Per-step `ReviewSpec` adapters (generalize/formalize existing panels; add early-step panels).
4. Reviewer calibration (per-step labeled sets) until targets met, domain-general.
5. End-to-end: push real projects through the **entire** pipeline; scrutinize every output for truncation / missing artifacts / broken tools / poor quality.

Folded-in standalone fix: arXiv resilience (theoremsearch degrades gracefully on transient 429/503).

## 8. Affected issues

#107 (pipeline map), #51 (advancement/verdict routing → replaced by engine), #50 (8-panel) & #56 (12-panel → become engine R1/R3), #49 (implementer bug → implement↔review loop + filesystem re-verification), #216 (high quality from small models → recursive summarizer), #112 (non-deterministic relevance judge → calibration adjacency), #58 (publisher wiring). Per-phase #47–#58 and per-agent #63–#106 gain a `ReviewSpec` as implemented.

🤖 Generated with [Claude Code](https://claude.com/claude-code)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Spec] Pipeline-wide convergence protocol (identify→revise→re-review) + recursive summarizer + review-model overhaul #239

Summary

Motivation

Goal

1. The convergence protocol (per reviewable step)

2. Review model — replaces the point system

3. Two SSoT primitives

4. Per-step integration (audit-grounded; full table in the design doc §5/§6)

5. Calibration & validation (the high-impact part)

6. Discrepancies / bugs to fix (found in audit)

7. Scope & sequencing

8. Affected issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Spec] Pipeline-wide convergence protocol (identify→revise→re-review) + recursive summarizer + review-model overhaul #239

Description

Summary

Motivation

Goal

1. The convergence protocol (per reviewable step)

2. Review model — replaces the point system

3. Two SSoT primitives

4. Per-step integration (audit-grounded; full table in the design doc §5/§6)

5. Calibration & validation (the high-impact part)

6. Discrepancies / bugs to fix (found in audit)

7. Scope & sequencing

8. Affected issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions