spec 015: pipeline convergence protocol (closes #239) #250

Open
jeremymanning wants to merge 163 commits into main from 015-pipeline-convergence-protocol


@jeremymanning

Summary

Implements spec 015, the Pipeline Convergence Protocol (issue #239). Replaces the legacy accumulated-review-points model (≥10 LLMs / ≥5 humans, 0.5/1.0 points) with a convergence-based gate: each reviewable stage runs identify → revise → re-review with its LLM panel and advances only on unanimous panel acceptance within a 3-round cap; otherwise it takes an adaptive kickback to the prior stage with full provenance. Human/personality reviews are advisory only and route through stage-aware triage.

Key behavior (selected FRs)

  • Convergence engine: R1 identify → R2 revise → R3 re-review; unanimous-acceptance gate; honest converged reporting (FR-016).
  • FR-012 selective re-review: dissenters always re-review; R1-accepters re-review only when R2 changed a lens-relevant artifact.
  • FR-011 reviser self-consistency: a second code-level audit call + one corrective re-pass, exception-guarded.
  • FR-048 living-document batched recompile: render Discussion → sha256 material-change → FR-054 sign-off gate → version DOI; cron auto-triggers but never auto-mints.
  • HF Inference-API backend removed — HF models run locally via transformers; backend chain is dartmouth → local.
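The FR-012 selection rule above can be sketched in a few lines. This is a minimal illustration, not the PR's code: `R1Verdict`, `select_rereviewers`, and the `lens_artifacts` field are hypothetical names, and the sketch assumes each lens declares the artifact paths it depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class R1Verdict:
    """Hypothetical record of one reviewer's round-1 outcome."""
    reviewer: str
    accepted: bool
    lens_artifacts: frozenset  # artifact paths this lens depends on (assumed)


def select_rereviewers(verdicts, changed_artifacts):
    """Return the reviewers that must run again after the R2 revision."""
    selected = []
    for v in verdicts:
        if not v.accepted:
            # Dissenters always re-review.
            selected.append(v.reviewer)
        elif v.lens_artifacts & changed_artifacts:
            # R1-accepters re-review only if R2 touched an artifact in their lens.
            selected.append(v.reviewer)
    return selected


verdicts = [
    R1Verdict("methodology", accepted=False, lens_artifacts=frozenset({"plan.md"})),
    R1Verdict("scope", accepted=True, lens_artifacts=frozenset({"spec.md"})),
    R1Verdict("consistency", accepted=True, lens_artifacts=frozenset({"plan.md", "spec.md"})),
]
rereview = select_rereviewers(verdicts, changed_artifacts={"plan.md"})
```

Here `scope` accepted in R1 and its lens was not touched, so it is left out of the re-review set.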

Hardening in this PR

  • Fixed 2 latent finally: return bugs (implementer/publisher) that double-appended run-log entries on the skip path and swallowed re-raises.
  • Fixed a real NameError in agents/librarian.py (loop var/body mismatch on the marginal-fallback path), surfaced by the mypy pass.
  • Introduced LLMXIVE_REPO_ROOT repo-root override (centralized ~60 __file__ climbs) and de-rotted the Phase-3 real-call e2e so it runs hermetically against a synthetic repo (verified: real Specifier+Clarifier run, 95s).
  • (str, Enum) → StrEnum migration; mypy strict: 213 → 0; ruff check .: clean (repo-wide); offline suite 1232 passed.
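For readers unfamiliar with the `finally: return` hazard mentioned above, the sketch below reproduces the general Python behavior: a `return` inside `finally` discards any in-flight exception and overrides the normal return value. The function names and log strings are hypothetical; this is not the implementer/publisher code.

```python
def buggy_step(fail: bool, log: list) -> str:
    try:
        if fail:
            raise RuntimeError("agent failed")
        return "ok"
    finally:
        log.append("finished")
        # A return here replaces the value returned above AND silently
        # discards the RuntimeError, so callers never see the failure.
        return "skipped"


def fixed_step(fail: bool, log: list) -> str:
    try:
        if fail:
            raise RuntimeError("agent failed")
        return "ok"
    finally:
        # Cleanup only: no return, so exceptions and return values pass through.
        log.append("finished")
```

Calling `buggy_step(True, [])` returns normally even though the body raised, which is the "swallowed re-raise" pattern; `fixed_step(True, [])` propagates the exception.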

Verification

  • ruff check . → All checks passed
  • mypy src/llmxive → 0 errors (154 files)
  • offline suite → 1232 passed, 1 skipped, 2 deselected
  • Phase-3 e2e (real-call) → passes (95s); prompts-check → OK (53 agents)

Note: part 7 of #239 (full sequential end-to-end pipeline run with per-step artifact-quality review) is in progress as follow-up work on this branch.

🤖 Generated with Claude Code

jeremymanning and others added 30 commits May 27, 2026 20:08
…+ review-model overhaul (#239)

Comprehensive Spec Kit specification for umbrella issue #239, grounded in the
2026-05-27 design doc SSoT and a code-verified audit. Covers: the inode-table
summarize/desummarize primitive (no silent loss of check-critical elements),
the generic identify->revise->re-review convergence engine + adaptive kickback,
removal of the point system for unanimous-panel acceptance + advisory triage,
per-step ReviewSpec adapters across the whole research + paper track, reviewer
calibration (9 domains, held-out generality), end-to-end traversal proof,
living-document discussion board, and all 10 audit bug fixes + arXiv resilience.

Three scope decisions resolved with maintainer up front (living-doc=full;
point cutover=migrate-forward; overflow floor=inode-table pointers).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five clarifications integrated into the spec (Clarifications + FRs/SCs/scenarios/assumptions):
- Publish target: real public Zenodo/GitHub/site, but a MANDATORY manual
  maintainer sign-off before every DOI mint for the duration of this spec
  (new FR-054, SC-014; FR-036/FR-048 updated).
- E2E coverage: all 9 domains traverse end-to-end to posted (FR-045, SC-007).
- Calibration: differential clean-vs-injected test + manual adjudication +
  adaptive sensitivity tuning (no fixed over-flag % / K) (FR-042, FR-044, SC-005).
- Kickback budget: NO global cap; monotonic-improvement-until-convergence;
  per-step 3-round cap retained (FR-017, edge case, assumptions).
- Cutover: no posted/done projects exist -> migration applies to in-flight only (FR-025).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
plan.md (Constitution Check: points-removal + no-global-cap tracked as
authorized deviations -> constitution amendment task), research.md (10 grounded
technical decisions incl. inode-table summarizer format, engine-as-callables,
adaptive kickback, manual DOI sign-off, differential calibration), data-model.md
(pydantic entities), quickstart.md, and 6 contracts (summarize-api,
convergence-engine, reviewspec-registry, review-intake-triage, kickback-record,
publisher-signoff). CLAUDE.md SPECKIT ref -> 015.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Organized by user story (US1-US8) with Setup/Foundational/Polish. TDD + real-call
+ manual-QC tasks included per spec. Dependency chain: summarizer first ->
engine -> bug fixes -> review model -> per-step panels -> calibration (9 domains)
-> e2e to posted (9 domains, manual DOI sign-off) -> living-doc -> polish.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closed 4 coverage/underspecification findings from /speckit-analyze (0 remain):
- C1 (HIGH): FR-006 authoring-side overflow routing + paper twins -> T054-T057
- C2 (MED): FR-026 repository_hygiene line-count/gitignore -> T043
- U1 (MED): FR-053 convergence principle encoding -> T007
- U2 (LOW): FR-017 ProgressRecord emission -> T026
Constitution point-conflict (CRITICAL) resolved by explicit amendment task T007.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- T001: new package dirs (convergence/, calibration/, agents/prompts/panels/)
- T002: STATUS.md living progress doc (FR-052)
- T003: Stage.AWAITING_PUBLICATION_SIGNOFF; config CONVERGENCE_MAX_ROUNDS=3 +
  CONVERGENCE_PER_ROUND_BUDGET_SECONDS=600. Imports verified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New SSoT primitive src/llmxive/tools/summarize.py: summarize()/desummarize() with
on-disk inode-table pointer hierarchy. Deterministic no-loss guarantee (URLs/DOIs/
arXiv/citations/FR-SC-task ids/numbers preserved verbatim; full content on disk,
recursively paged in). 12 tests pass (7 edge cases + core no-loss + manifest
contract + no-dangling-pointer); ruff + mypy clean.
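The no-loss idea can be illustrated with a deliberately simplified, single-level sketch: the full text goes to disk under a content hash, and the reduced block carries a pointer plus every critical element verbatim. Everything here is an assumption made for illustration (the regex, the `[[pointer:...]]` syntax, the character budget, and the function signatures); the real `summarize()`/`desummarize()` build a recursive pointer hierarchy that this sketch does not attempt.

```python
import hashlib
import re
import tempfile
from pathlib import Path

# Assumed patterns for "check-critical" elements: URLs, DOIs, FR/SC ids.
CRITICAL = re.compile(r"https?://\S+|\b10\.\d{4,9}/\S+|\bFR-\d+\b|\bSC-\d+\b")
POINTER = re.compile(r"\[\[pointer:([0-9a-f]{64})\]\]")


def summarize(text: str, store: Path, budget_chars: int) -> str:
    """Return text unchanged if it fits; otherwise a pointer block."""
    if len(text) <= budget_chars:
        return text
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    (store / f"{digest}.txt").write_text(text, encoding="utf-8")  # full content on disk
    critical = list(dict.fromkeys(CRITICAL.findall(text)))  # dedupe, keep order
    lines = [f"[[pointer:{digest}]]", "critical:"]
    lines += [f"- {c}" for c in critical]
    lines.append("excerpt: " + text[:budget_chars])
    return "\n".join(lines)


def desummarize(block: str, store: Path) -> str:
    """Follow the pointer (if any) back to the full on-disk content."""
    match = POINTER.search(block)
    if match is None:
        return block
    return (store / f"{match.group(1)}.txt").read_text(encoding="utf-8")


store = Path(tempfile.mkdtemp())
doc = "See https://example.org/a and FR-012. " + "x" * 500
block = summarize(doc, store, budget_chars=40)
```

The excerpt is the only part bounded by the budget; the critical list is emitted in full, and `desummarize(block, store)` recovers the original document.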

Remaining for US1: T009 real-call fidelity, T017 re-point paper_reviewer (SSoT),
T018 real-call verification. See STATUS.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_build_corpus_with_summaries now delegates context reduction to
tools/summarize.summarize() (inode-table, no silent truncation), preserving the
1-arg summarize_fn contract + _cached_summarize memoization. Supersedes the old
truncate-with-notice fallback (Const. I SSoT). Updated the 2 coupled unit tests
to the new behavior (full source recoverable via desummarize); _chunk_corpus +
its 3 tests untouched. 24 paper_reviewer + 12 summarizer tests pass; mypy-clean
for the changed function.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tests/real_call/test_summarize_fidelity.py: real qwen3.5-122b summarize_fn over an
over-budget doc; desummarize recovers EVERY critical element verbatim (no loss
through a real-LLM reduction). PASSED in 334s. US1 (summarizer) fully done &
verified: 12 offline + 1 real-call, ruff clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion (#239)

- T004/T005: convergence/types.py — Severity (ordered + legacy mapping) and the
  Concern/ConcernResponse/Verdict/ProgressRecord/ConvergenceResult/KickbackRecord/
  TriageRecord pydantic models + Reviewer/Reviser Protocols + ReviewSpec dataclass.
- T006: tests/contract/test_convergence_types.py (7 pass; ruff + mypy clean).
- T007: constitution -> v1.1.0; added Principle VI (Convergent Review,
  NON-NEGOTIABLE), replaced the point-based Review-thresholds gate with
  unanimous-panel convergence + advisory triage, Sync Impact Report updated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
convergence/engine.py run_convergence: identify->revise->re-review loop with
honest converged reporting (FR-016), 3-round cap, self-review/producer exclusion
+ stale-never-passes (FR-018), per-round wall-clock budget (FR-013), and overflow
inputs routed through tools/summarize (FR-006). convergence/kickback.py
route_kickback (adaptive worst-severity->stage, full-provenance KickbackRecord)
+ progress_record (FR-017). 15 unit tests pass; ruff + mypy clean.
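A minimal sketch of the loop shape described above, under stated assumptions: reviewers and the reviser are plain callables, a "round" is counted as one panel review pass, and kickback routing, budgets, and producer exclusion are omitted. `Result` and this `run_convergence` signature are illustrative only and may differ from the actual engine.

```python
from dataclasses import dataclass

MAX_ROUNDS = 3  # mirrors CONVERGENCE_MAX_ROUNDS=3 from this PR


@dataclass
class Result:
    converged: bool
    rounds: int
    artifact: str
    open_concerns: list


def run_convergence(artifact, panel, revise):
    """identify -> revise -> re-review until the panel is unanimous or the cap is hit."""
    concerns = []
    for round_no in range(1, MAX_ROUNDS + 1):
        # Every reviewer returns a list of concerns; an empty list means "accept".
        concerns = [c for review in panel for c in review(artifact)]
        if not concerns:
            return Result(True, round_no, artifact, [])
        if round_no < MAX_ROUNDS:
            artifact = revise(artifact, concerns)
    # Honest reporting: concerns remain, so this is NOT reported as converged.
    return Result(False, MAX_ROUNDS, artifact, concerns)


def flags_todo(artifact):
    return ["unresolved TODO"] if "TODO" in artifact else []


def flags_empty(artifact):
    return [] if artifact.strip() else ["artifact is empty"]


def drop_todo(artifact, concerns):
    return artifact.replace("TODO", "done")


result = run_convergence("plan: TODO", [flags_todo, flags_empty], drop_todo)
stuck = run_convergence("plan: TODO", [flags_todo], lambda a, c: a)
```

In the second call the reviser never fixes the concern, so the result carries `converged=False` and the open concern list that a kickback record would need.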

US2 remaining (coupled to US4/US3): T021 real-project integration, T025
advancement.py _produced_by stub, T027 tasker Mode-A/B refactor into the engine.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addressed the tech debt I had flagged (per "fix issues as you notice them"):
- types-PyYAML dev dep -> yaml stubs resolve under `python -m mypy` (clears yaml
  errors codebase-wide).
- ReviewRecord.score: invalid Literal[float] -> float + field_validator (PEP 586;
  identical {0.0,0.5,1.0} constraint).
- paper_reviewer: list[dict]->list[dict[str,Any]]; text coerced to str.
- removed 2 unused PaperReviewerAgent imports in test.
- FIX: T003 added Stage.AWAITING_PUBLICATION_SIGNOFF but not the project-state
  schema enum -> contract test failed; added it (single SSoT schema).
- FIX: T001 panels dir was under src/llmxive/agents/prompts/ but prompts live at
  repo-root agents/prompts/ -> relocated; corrected 7 path refs in tasks.md.

Finding (STATUS.md): project does NOT gate on ruff/mypy (no config, no CI step;
gates = pytest + checks.*). ~273 legacy mypy errors are pre-existing, out of #239.
Focused regression: 92 passed (all contract + score/paper_reviewer/convergence).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…239)

New agents/prompts/implementer_research.md: instructs the research speckit
implementer to emit the artifacts/verdict YAML the parser expects (write real
runnable code/data, no stubs/diffs, fail-loud verdicts). implement_cmd.py now
renders it instead of the paper-revision LaTeX implementer.md (which stays for
the separate paper-revision agent). Also fixed 2 pre-existing ruff nits in
implement_cmd.py (I001 import sort, F541) since I touched the file.
tests/integration/test_audit_bugfixes.py verifies the fix (2 pass).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
theoremsearch.search() now retries transient failures (429/500/502/503/504 +
RequestException/timeout) with exponential backoff (MAX_TRANSIENT_RETRIES=3),
then degrades via TransientBackendError (the librarian wrapper already treats
that as "optional source unavailable"). Non-transient 4xx are not retried.
retry_backoff_base_seconds is injectable (tests pass 0). 4 unit tests; ruff+mypy clean.
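The retry policy can be sketched without any HTTP library by injecting the fetch step. In this hedged illustration `search_with_retry` and the `fetch` callable are hypothetical, the non-transient case raises a plain `ValueError` as a stand-in, and it is an assumption that `MAX_TRANSIENT_RETRIES=3` means three retries after the first attempt.

```python
import time

MAX_TRANSIENT_RETRIES = 3
TRANSIENT_STATUS = {429, 500, 502, 503, 504}


class TransientBackendError(RuntimeError):
    """Raised once retries are exhausted; callers may treat the source as unavailable."""


def search_with_retry(fetch, retry_backoff_base_seconds=1.0):
    """Call fetch() -> (status, body), retrying transient statuses with backoff."""
    for attempt in range(MAX_TRANSIENT_RETRIES + 1):
        status, body = fetch()
        if status == 200:
            return body
        if status not in TRANSIENT_STATUS:
            # Non-transient 4xx: retrying would not help, so fail immediately.
            raise ValueError(f"non-transient status {status}")
        if attempt < MAX_TRANSIENT_RETRIES:
            # Exponential backoff: base, 2*base, 4*base, ...
            time.sleep(retry_backoff_base_seconds * 2 ** attempt)
    raise TransientBackendError("transient failures exhausted all retries")


def scripted(responses):
    """Test helper: a fetch() that replays canned responses and counts calls."""
    calls = {"n": 0}

    def fetch():
        calls["n"] += 1
        return responses[min(calls["n"], len(responses)) - 1]

    return fetch, calls


fetch_ok, calls_ok = scripted([(503, ""), (429, ""), (200, "hits")])
body = search_with_retry(fetch_ok, retry_backoff_base_seconds=0)
```

Passing a backoff base of `0` keeps the example instant, which is the same reason the commit makes the base injectable for tests.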

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
#239)

Full offline suite verified green: tests/contract + 599 tests/unit (7.45s) +
real-call summarize_fidelity. Flagged pre-existing live-PDF test in tests/unit
(not CI-gated, hangs offline) for separate gating.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…yze (#239)

Discrepancy #4 fix: ANALYZE_SYSTEM_PROMPT_PATH was defined but unused (inline
prompt hardcoded; paper reused research tasker.md). Now there are TWO real
analyze prompts that ARE used via render_prompt:
- agents/prompts/analyze.md (research): requirements_coverage / internal_consistency /
  testability / scope / constitution_alignment lenses (same vocabulary as the
  US4 Tasks panel).
- agents/prompts/paper_analyze.md (paper): reader_scenario_coverage /
  claims_supported / required_sections_figures / scope_vs_research /
  internal_consistency / constitution_alignment.

run_analyze() gains kind={"research","paper"} + constitution_text kwargs.
paper_tasks_cmd passes kind="paper" + paper constitution; tasks_cmd passes
research constitution (FR-030: constitution is a standard analyze input from
`specified` onward). 6 audit-bugfix tests + 38 phase4 integration tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
clarifier.attempts_so_far was hardcoded to 0 (escalation unreachable) and
paper_clarifier never branched on verdict=escalate AND silently substituted a
"Resolved by default" stub on missing patches — a no-silent-shortcuts violation.

Fixes:
- New shared _clarify_attempts.py: persists per-project attempt count under
  .specify/memory/clarifier_attempts.yaml; bump/read/reset + write_human_input_needed.
- Both clarifiers now read REAL attempts and pass them to the prompt.
- Both branch on verdict=escalate -> write human_input_needed.yaml + raise.
- Both escalate at TASKER_MAX_REVISION_ROUNDS (=5) -> write human_input_needed.yaml + raise.
- paper_clarifier no longer substitutes the silent "Resolved by default" stub
  (matches research clarifier's loud failure behavior).
- Also removed 2 pre-existing F841 dead locals in clarify_cmd._spec_path.

29 tests pass (audit + phase3 integration); ruff clean for touched files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…239)

paper_specifier.md advertised `code_summary` / `data_summary` inputs that the
code never supplied (silent drift between prompt and reality). paper_specify_cmd
now injects both blocks into the user message, reusing research_reviewer's
_summarize_tree() as the SSoT tree-summary helper — Const. I (share, don't fork).
The advertised inputs ARE now present, grounding the paper-spec generation in
the project's actual code/ and data/ trees.

11 audit-bugfix tests pass; ruff clean for touched files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…R-054) (#239)

Discrepancy #2 fix (FR-036): graph._decide_next_stage no longer shortcuts
PAPER_ACCEPTED -> POSTED. It now routes paper_accepted -> AWAITING_PUBLICATION_SIGNOFF,
then AWAITING_PUBLICATION_SIGNOFF -> POSTED ONLY when the maintainer sign-off record
exists. The PaperPublisher itself enforces the same gate (defense-in-depth) — at
PAPER_ACCEPTED or AWAITING_PUBLICATION_SIGNOFF with NO signoff record it SKIPs with a
clear "awaiting manual maintainer DOI sign-off (FR-054)" reason. No Zenodo DOI is
minted without recorded approval.

New surface:
- src/llmxive/speckit/_publication_signoff.py: read/write/has/clear_signoff
  persistence under <project>/.specify/memory/publication_signoff.yaml; FR-054
  who/when/what record (kinds "initial" / "version").
- `llmxive project publish-approve <PROJ-ID> --who X --what Y [--kind initial|version]`
  CLI command writes the sign-off record.
- 6 new audit-bugfix tests + 27 publisher/graph regression tests pass.

Also fixed 38 pre-existing ruff issues in touched files (auto-fix).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Discrepancy #7 fix (FR-018): advancement._produced_by was a stub returning None.
It now scans state/run-log/<YYYY-MM>/*.jsonl for the latest entry whose outputs
list contains the artifact path and returns that entry's agent_name. Exact +
suffix path matching tolerates relative-vs-absolute bookkeeping. A repo_root
kwarg keeps the production call (no repo_root) working while making tests
hermetic. Defensive: returns None on missing run-log instead of raising.
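The lookup described above might look roughly like the sketch below. It is a simplified stand-in, not the real `_produced_by`: it takes the run-log directory directly, and it assumes month directories and file names sort chronologically and that lines within a file are appended in time order.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def _same_artifact(recorded: str, wanted: str) -> bool:
    # Exact match, or one path is a suffix of the other (relative vs absolute).
    return (
        recorded == wanted
        or recorded.endswith("/" + wanted)
        or wanted.endswith("/" + recorded)
    )


def produced_by(artifact_path: str, run_log_dir: Path) -> Optional[str]:
    """Return agent_name of the latest run-log entry listing the artifact."""
    if not run_log_dir.is_dir():
        return None  # defensive: a missing run-log is not an error
    latest = None
    for log_file in sorted(run_log_dir.glob("*/*.jsonl")):
        for line in log_file.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            entry = json.loads(line)
            if any(_same_artifact(out, artifact_path) for out in entry.get("outputs", [])):
                latest = entry.get("agent_name")  # later entries overwrite earlier ones
    return latest


root = Path(tempfile.mkdtemp())
month = root / "2026-05"
month.mkdir()
(month / "a.jsonl").write_text(
    json.dumps({"agent_name": "specifier", "outputs": ["projects/P1/spec.md"]}) + "\n"
    + json.dumps({"agent_name": "clarifier", "outputs": ["/abs/repo/projects/P1/spec.md"]}) + "\n",
    encoding="utf-8",
)
producer = produced_by("projects/P1/spec.md", root)
```

The second entry recorded an absolute path, and suffix matching is what lets it still count as the latest producer of the relative path.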

T029: the audit-bugfix test file (now 18 tests) verifies T030/T031/T032/T033/
T034/T035/T025 fixes. 38 tests pass (audit + advancement regression).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…to US3 (#239)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New convergence/triage.py — stage-aware triage for submitted human + simulated-
personality reviews. Three filters: quality (length + evidence-indicator regex
sweep — FR/SC/T ids, citations, URLs, DOIs, quoted phrases, code fences,
scientific topic vocab), safety + on-topic (rule-based stop-list + stage/lens
vocabulary overlap), and aspect-mapping to LLM reviewer lenses (preserved but
mapped_lenses=[] when no match -> routes to the step's generic reviewer per
FR-022). Injectable judge_fn for the real-LLM path (US4 wiring); rule-based
default keeps unit tests offline.
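Two of the rule-based pieces, the quality filter and the lens mapping, can be sketched as follows. The threshold, the specific regexes, and the vocabulary dictionaries are assumptions made for illustration; only the overall shape (length plus an evidence-indicator sweep, then vocabulary overlap with an empty list meaning "route to the generic reviewer") comes from the description above.

```python
import re

MIN_CHARS = 80  # hypothetical length threshold

EVIDENCE_PATTERNS = [
    re.compile(r"\b(?:FR|SC|T)-?\d+\b"),   # FR/SC/T ids
    re.compile(r"https?://\S+"),            # URLs
    re.compile(r"\b10\.\d{4,9}/\S+"),       # DOIs
    re.compile(r"\"[^\"]{8,}\""),           # quoted phrases
    re.compile(r"`{3}"),                    # code fences
]


def passes_quality(review_text: str) -> bool:
    """Long enough AND contains at least one evidence indicator."""
    if len(review_text.strip()) < MIN_CHARS:
        return False
    return any(p.search(review_text) for p in EVIDENCE_PATTERNS)


def map_lenses(review_text: str, lens_vocab: dict) -> list:
    """Return lenses whose vocabulary overlaps the review; [] means generic routing."""
    words = set(re.findall(r"[a-z][a-z_-]*", review_text.lower()))
    return sorted(lens for lens, vocab in lens_vocab.items() if words & vocab)


vocab = {"methodology": {"baseline", "control"}, "scope": {"scope"}}
grounded = (
    "The plan does not cover FR-012: the selective re-review path is never "
    "exercised by any listed task."
)
vague = (
    "I think this is generally fine but it could probably be improved in a "
    "number of different ways overall."
)
```

A review such as `vague` is long enough but cites nothing, so it fails the quality filter; `grounded` passes because it names a requirement id.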

tests/integration/test_triage.py: 8 tests covering quality pass/fail, safety
exclusion, off-topic exclusion, lens mapping, unmapped-but-preserved, record
provenance, and the judge_fn injection override. All pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
)

Rewrote the user-facing status-model descriptions in README + web/index.html +
docs/index.html (HTML mirror copy) to convergence semantics: identify -> revise
-> re-review; unanimous panel acceptance within a 3-round cap; advisory
triage for human + simulated-personality reviews; no accumulated points.
Replaces 6 stale "points threshold" / "Human reviews count double" passages.

status_reporter.py + repository_hygiene.py needed no change for the new
status model — their FR-026 duties (projects.json regen, GitHub issue
comment/close on POSTED, line-count delta, gitignore assertions) are not
point-dependent and remain in force unchanged. The points_research_total /
points_paper_total fields the web JS displays will be removed in a follow-up
(part of T041 point-system removal).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…239)

Discrepancy #9 + Const. I cleanup: the accumulated review-point system is gone
from the advancement decision path. Unanimous LLM-panel acceptance is now the
sole gate everywhere (research + paper both).

advancement.py:
  - Research-review gate no longer reads `accept_total` / `RESEARCH_ACCEPT_THRESHOLD`.
    It now uses `_all_specialists_accept(records, required)` with a defensive
    backstop (require ≥1 accept AND zero non-accept records when the registry
    isn't loaded) — mirroring the paper-side default.
  - Paper-review gate's `_award_review_points` call removed (the all-specialists-
    accept-most-recent check was already the real decision).
  - `_award_review_points` definition DELETED (no remaining callers).
  - `RESEARCH_ACCEPT_THRESHOLD` import dropped; replaced with an FR-019 comment.

config.py:
  - `RESEARCH_ACCEPT_THRESHOLD` and `PAPER_ACCEPT_THRESHOLD` constants kept for
    back-compat with `web/about.html` mirror consumers, but VALUES set to 0.0 and
    no advancement code reads them.

T038 tests (`tests/integration/test_no_points.py`, 3 tests): grep guard +
behavioral assertion that no-accept records cannot trip the gate.
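The gate logic described for `advancement.py` can be approximated as below. This is a sketch under assumptions, not the real `_all_specialists_accept`: records are reduced to a mapping from specialist name to its most recent verdict string, and `required` is `None` when the registry is not loaded.

```python
from typing import Optional


def all_specialists_accept(latest: dict, required: Optional[set]) -> bool:
    """Unanimous-acceptance gate over each specialist's most recent verdict."""
    if required:
        # Every required specialist must have a most-recent verdict of "accept".
        return all(latest.get(name) == "accept" for name in required)
    # Backstop when the registry isn't loaded:
    # at least one accept AND zero non-accept records.
    return bool(latest) and all(v == "accept" for v in latest.values())
```

With no records at all the backstop returns `False`, which is the behavior the T038 tests assert: an empty or no-accept record set cannot trip the gate.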

T044: per clarify Q3 there are no posted/done projects to grandfather; the gate
change applies on next tick automatically — no data-migration logic needed.

Broad regression: 784 passed, 1 skipped (was 781 — three new T038 tests added).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
src/llmxive/convergence/reviewspecs.py: reviewspec_for(stage) -> ReviewSpec | None.
9 stage entries (idea + 4 research + 4 paper) matching contracts/reviewspec-
registry.md; EXEMPT_STAGES frozenset of 7 mechanical steps. Constitution input
is True for every spec from `specified` onward (FR-030); idea-stage opts out
(no constitution yet). Kickback routing per the contract's worst-severity ->
prior-stage table.

Stages whose panel prompts (T049-T053) or wiring (T054-T059) haven't landed yet
get _TodoReviewer / _TodoReviser placeholders that conform to the Protocol but
raise NotImplementedError with a clear pointer to the follow-up task -- fail-loud
SSoT structure, no silent empty verdicts. 15 contract tests pass; ruff clean.
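The fail-loud placeholder pattern can be illustrated with `typing.Protocol`. The names below are simplified stand-ins: the real `Reviewer` Protocol and `_TodoReviewer` live in the convergence package and may have different signatures, and the task id passed in is only an example.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Reviewer(Protocol):
    """Assumed minimal shape of a panel reviewer."""

    def review(self, artifacts: dict) -> list: ...


class TodoReviewer:
    """Conforms structurally to Reviewer but refuses to produce a verdict."""

    def __init__(self, lens: str, task_id: str) -> None:
        self.lens = lens
        self.task_id = task_id

    def review(self, artifacts: dict) -> list:
        # Fail loud: an unwired panel must never look like "no concerns".
        raise NotImplementedError(
            f"panel for lens {self.lens!r} is not wired yet; see {self.task_id}"
        )


placeholder = TodoReviewer("methodology", "T054")
```

Returning an empty list here would be indistinguishable from unanimous acceptance, which is why the placeholder raises instead.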

Also marked T060 (constitution-as-analyze-input, done in T031) and T061 (publisher
wired into graph, done in T035) as already complete.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
US4 panel-prompt authoring: 27 lens prompts + 1 SSoT shared block + a contract
test that catches future registry/file-name drift.

agents/prompts/_shared/panel_review_block.md
- SSoT (Constitution Principle I) for the panel R1/R3 output contract.
  Severity vocabulary matches the spec-015 Severity enum (trivial → fatal);
  identify and re-review phases both defined.

agents/prompts/panels/ — 27 files total
  T049: panel_idea_{rq_validity,novelty,feasibility,idea_quality}.md
  T050: panel_spec_{requirements_coverage,internal_consistency,testability,scope}.md
  T051: panel_plan_{methodology,spec_coverage,data_resources,consistency}.md
  T052: panel_tasks_{coverage,ordering,executability,constraint_preservation}.md
  T053: panel_paper_spec_* (4) + panel_paper_plan_* (3) + panel_paper_tasks_* (4)

Each per-lens file is thin: lens + scope ("what NOT to flag") + inputs
(constitution from `specified` onward per FR-030) + per-severity-class
guidance + reference to the SSoT block. T054-T059 wiring will concatenate
lens-prompt + SSoT-block at render time.

tests/contract/test_panel_prompts.py (16 tests)
- Every lens in the ReviewSpec registry resolves to a real prompt file.
- Every panel file references the SSoT block (Principle I drift guard).
- Every panel file has `## Lens` and `## Output format` sections.
- Reuse-stages (research_review/paper_review) map to existing specialist
  files, with the _research/_paper suffix convention preserved.
- The SSoT block enumerates every Severity enum value + defines R1 and R3.

Tech debt fixed inline (surfaced by ruff+mypy installation in venv):
- reviewspecs.py: _todo_reviewers now returns list[Reviewer] (list is
  invariant). Removed an unused `# type: ignore`.
- triage.py: JudgeFn return-type narrowed to dict[str, object]; the
  mapped_lenses access narrowed with isinstance(list|tuple) at the
  callsite — honest about the contract boundary rather than ignore.

Verification:
- ruff check src/llmxive/convergence + summarize.py: All checks passed
- mypy src/llmxive/convergence + summarize.py: 0 errors (7 source files)
- pytest tests/contract: 43 passed
- pytest 4 conv-related unit files: 27 passed
- pytest 3 spec-015 integration files: 29 passed
- llmxive.checks.prompts: OK (53 agents)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Spec convergence unit: the new SpecReviser implements the Reviser Protocol
and folds BOTH `[NEEDS CLARIFICATION]` marker resolution AND every panel
concern into ONE LLM round. This is the spec-015 "collapse" — the previous
two-step author + refine flow becomes one R2 call that produces a fully-
revised spec.md plus a per-concern change-log.

src/llmxive/convergence/revisers/spec_reviser.py
- `SpecReviser` class (Reviser-protocol-conformant): constructed with
  (backend, repo_root, project_id, model?, token_budget?, cache_dir?).
- `.revise(artifacts, concerns)`:
  - Picks the spec.md artifact (suffix match; excludes paper-side spec).
  - Gathers idea text from artifacts (`idea/` keys).
  - Overflow routing (FR-006): when bundle approx-tokens > budget, routes
    idea + comments_block through `tools.summarize.summarize` with a
    preservation goal that pins FR/SC ids verbatim. spec.md itself is
    NEVER summarized — the reviser must see what it's editing.
  - Composes a system (clarifier.md SSoT) + user (current spec + concerns
    + remaining markers + comments) prompt asking for ONE JSON document
    with `new_spec_md` + `responses[]`.
  - Honest failure modes: missing `new_spec_md` raises; non-JSON raises;
    fewer responses than concerns → padded with `<missing>` entries
    (Constitution Principle II: no silent omission).
- `_scan_markers` + `_strip_json_fences` helpers (testable in isolation).
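The reply-handling rules listed above (strip fences, reject non-JSON, require `new_spec_md`, pad missing responses) can be sketched as follows. The `concern_id`/`response` field names and the `parse_revision` helper are assumptions; only `new_spec_md`, `responses[]`, the `<missing>` padding, and the `RuntimeError` failure modes come from the commit text.

```python
import json
import re

# Matches a reply wrapped in a Markdown code fence, optionally tagged "json".
_FENCE = re.compile(r"^\s*`{3}(?:json)?\s*\n(.*?)\n\s*`{3}\s*$", re.DOTALL)


def strip_json_fences(reply: str) -> str:
    match = _FENCE.match(reply)
    return match.group(1) if match else reply.strip()


def parse_revision(reply: str, concern_ids: list):
    """Return (new_spec_md, {concern_id: response}) or raise on a bad reply."""
    try:
        doc = json.loads(strip_json_fences(reply))
    except json.JSONDecodeError as exc:
        raise RuntimeError("reviser reply is not JSON") from exc
    if not isinstance(doc, dict) or "new_spec_md" not in doc:
        raise RuntimeError("reviser reply has no new_spec_md")
    by_id = {r["concern_id"]: r["response"] for r in doc.get("responses", [])}
    # Pad rather than drop: every concern gets an entry, answered or not.
    responses = {cid: by_id.get(cid, "<missing>") for cid in concern_ids}
    return doc["new_spec_md"], responses


fence = "`" * 3
payload = {"new_spec_md": "# Spec v2", "responses": [{"concern_id": "C1", "response": "fixed"}]}
reply = fence + "json\n" + json.dumps(payload) + "\n" + fence
new_spec, responses = parse_revision(reply, ["C1", "C2"])
```

The backend answered only `C1`, so `C2` is kept with a `<missing>` marker instead of disappearing from the change-log.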

src/llmxive/convergence/revisers/__init__.py
- Package docstring documenting the build_*_reviewspec pattern.

src/llmxive/convergence/reviewspecs.py
- New `build_spec_reviewspec(backend=, repo_root=, project_id=, model=?)`
  returns a LIVE ReviewSpec for the spec stage with the SpecReviser bound
  as `.reviser`. Static `reviewspec_for("clarified")` still returns the
  TodoReviser placeholder; the build_* path is the live wiring (T058 will
  add reviewer-side wiring for the panel).
- Local import of SpecReviser keeps the static-registry import graph
  clean for callers that never touch the live path.

tests/integration/test_spec_reviser.py (8 tests)
- `_scan_markers` handles bracket + bold marker forms; returns empty
  on clean specs.
- `_strip_json_fences` handles fenced + bare JSON.
- End-to-end revise: backend called with system+user; new spec text
  written; markers resolved; ConcernResponse per concern.
- Padded missing responses: backend omits one concern → `<missing>`
  marker preserved (honest no-silent-omission).
- Missing `new_spec_md` → RuntimeError.
- Non-JSON reply → RuntimeError.
- No spec.md in artifacts → ValueError (engine misuse).

Verification
- ruff check src/llmxive/convergence + tests: All checks passed
- mypy src/llmxive/convergence + summarize.py: 0 errors (9 source files)
- pytest tests/integration/test_spec_reviser.py + tests/contract: 51 passed
- pytest broader unit + integration suite: 52 passed (no regressions)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jeremymanning and others added 29 commits May 29, 2026 14:39
#58)

The audit found the publisher fix was fake: PaperPublisher was referenced only
by its own class + tests + the EXEMPT list — NEVER invoked by the live pipeline.
_decide_next_stage just flipped PAPER_ACCEPTED -> AWAITING_PUBLICATION_SIGNOFF ->
POSTED with a comment *claiming* 'the publisher assembles the manuscript during
this transition', but no code ran it. So no DOI, no final compile, no
publication.yaml was ever produced.

Wire it for real:
- STAGE_TO_AGENT[AWAITING_PUBLICATION_SIGNOFF] = 'paper_publisher'; register
  PaperPublisher in _NON_SPECKIT_AGENTS + import it. run_one_step now dispatches
  the publisher at AWAITING; it self-gates on the FR-054 maintainer sign-off
  (no sign-off -> no-op stays awaiting; sign-off -> compile + Zenodo DOI +
  publication.yaml -> POSTED).
- Make the publisher the SOLE driver of -> POSTED: _decide_next_stage(AWAITING)
  no longer auto-advances to POSTED on has_signoff. Previously, if the publisher
  hadn't run (it never did) but a sign-off existed, the graph would mark the
  project POSTED with no DOI/publication.yaml. Now only the publisher's own
  successful self-transition reaches POSTED.
- Fix the issue-close hook: it fired only on a graph-detected transition, but the
  publisher self-transitions (project sees no next_stage change). Capture
  entry_stage and fire the hook whenever the project REACHES POSTED this step.
- Update the brittle source-grep test to assert the real wiring (STAGE_TO_AGENT
  mapping + no graph-side auto-POSTED) instead of a since-removed import.
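The division of responsibility after this change can be sketched as two small functions: the graph only moves a project into the awaiting stage, and the publisher alone can reach POSTED. This is an illustrative model with hypothetical signatures; the enum string values other than `paper_accepted` are assumed, and `mint_doi` stands in for the compile plus Zenodo step.

```python
from enum import Enum


class Stage(Enum):
    PAPER_ACCEPTED = "paper_accepted"
    AWAITING_PUBLICATION_SIGNOFF = "awaiting_publication_signoff"  # assumed value
    POSTED = "posted"  # assumed value


def decide_next_stage(current: Stage) -> Stage:
    """Graph side: never auto-advances AWAITING -> POSTED, even with a sign-off."""
    if current is Stage.PAPER_ACCEPTED:
        return Stage.AWAITING_PUBLICATION_SIGNOFF
    return current


def run_publisher(current: Stage, has_signoff: bool, mint_doi):
    """Publisher side: self-gates on the sign-off; sole driver of -> POSTED."""
    if current is not Stage.AWAITING_PUBLICATION_SIGNOFF or not has_signoff:
        return current, None  # no-op: stays put, no DOI minted
    doi = mint_doi()
    return Stage.POSTED, doi


minted = []


def fake_mint():
    minted.append("doi")
    return "10.0000/example"  # placeholder value, not a real DOI


waiting = run_publisher(Stage.AWAITING_PUBLICATION_SIGNOFF, False, fake_mint)
posted = run_publisher(Stage.AWAITING_PUBLICATION_SIGNOFF, True, fake_mint)
```

Because `decide_next_stage` has no branch that returns `POSTED`, a recorded sign-off without a successful publisher run cannot mark the project as posted.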

Verified: ruff+mypy clean; offline suite 1232 passed / 1 skipped. Direct
end-to-end verification (publisher mints a real DOI gated on sign-off) is part 7.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ew (FR-025)

Spec 015 deleted the transient revision stages but the FR-025/T044 in-flight
migration never reached 8 real projects left at removed stages:
  paper_minor_revision (7): PROJ-564/565/566/568/570/571/576
  ready_for_implementation (1): PROJ-578
Neither value is in the Stage enum or the project-state schema, so
project_store.load() (validate -> model_validate) RAISED on them — the
projects were unloadable: invisible to the pipeline, web_data, and
status_reporter (surfaced by tests/phase2/test_web_data_blocks).

Verified these states are unreachable under the new architecture (the user's
condition for a direct fix): not in the enum/schema, and no code assigns
current_stage to either (the lone 'ready_for_implementation' literal is a
revision-round final_outcome LABEL, not a stage). So a one-time data migration
is correct — no load-time shim needed.

All 8 are paper-track (have paper/ + completed 12-panel reviews, no research
specs/), so per FR-025 'migrate forward + re-evaluate under unanimous
convergence on next tick' they re-enter at paper_review (the paper convergence
unit re-runs; the engine kicks back to paper revision if concerns remain).
Only the current_stage line changed (file formatting + dead-but-present legacy
point fields preserved, per research.md). All 639 project states now load.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The differential calibration driver (run_calibration.py) omitted the 'idea'
stage, so '--stage idea' was rejected and '--stage all' skipped it — leaving
the mandated circular-RQ negative (one of the 6 required flaws -> rq_validity)
with no real-call differential wiring, even though its on-disk labeled set
(calibration/idea/negative_circular_rq.md), the build_idea_reviewspec panel
(rq_validity/novelty/feasibility/idea_quality), and the data-layer unit test
already existed.

Wires the idea entry: _STAGES['idea'] + build_idea_reviewspec import;
_artifact_key_for_stage('idea') = a /idea/...md path (FleshOutReviser requires
the idea md under such a key); _supporting_artifacts_for_stage('idea') supplies
only __comments_block__ (idea is the earliest stage — constitution_input=False,
so no constitution is injected). All 6 mandated flaws are now end-to-end
calibration-runnable. ruff clean; argparse now accepts --stage idea.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ESTS

Pre-existing bug noticed during the e2e investigation: tests/phase1, phase2,
and e2e contained real network/LLM/browser-call tests gated ONLY on
credential presence (e.g. skipif(not HAS_DM_KEY), both_keys_required) — NOT on
the repo's real-call opt-in LLMXIVE_REAL_TESTS. In any env with keys but no/
slow network (the default dev sandbox), they EXECUTED and HUNG forever on a
network socket (0% CPU) — e.g. test_librarian_cross_domain[biology],
test_site.py (browser). This made `pytest tests/phase2` / tests/e2e
un-runnable offline.

Gates every such test (AND-ed with existing key gates) behind
`_REAL = os.environ.get('LLMXIVE_REAL_TESTS') == '1'`, matching the repo's
established convention (test_math_classifier, test_submission_*, the
tests/real_call conftest). Offline tests in mixed files stay ungated.
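The gating condition can be expressed as a small helper. The helper names are hypothetical; the environment variable and the `== '1'` comparison come from the commit text, and the result would typically be passed as the condition of `pytest.mark.skipif`.

```python
import os
from typing import Mapping


def real_tests_enabled(environ: Mapping = os.environ) -> bool:
    # Opt-in must be explicit: only the exact string "1" enables real calls.
    return environ.get("LLMXIVE_REAL_TESTS") == "1"


def should_skip(has_key: bool, environ: Mapping = os.environ) -> bool:
    """Skip unless BOTH the credential is present AND real tests are opted in."""
    return not (has_key and real_tests_enabled(environ))
```

AND-ing the opt-in with the existing key gate means a developer who merely has credentials configured no longer triggers network calls by default.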

Verified: collection clean (262, no sockets); default `pytest tests/phase2`
= 165 passed / 45 skipped, NO HANG; the previously-hanging cross_domain test
now skips in 0.05s; phase1/e2e no hang; ruff clean. No src changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…alization

Regression from this PR's repo-root refactor (bf94af4a): ProjectInitializerAgent
now resolves its repo via llmxive.config.repo_root() instead of climbing from
its own module __file__. The 3 fake-repo idempotency tests isolated by
monkeypatching pi_mod.__file__ — now INERT — so the agent wrote
projects/<id>/.specify/memory/constitution.md into the REAL repo: this both
failed test_project_initializer_writes_on_first_invocation (it asserted against
the tmp fake_repo) AND polluted the real projects/ tree with PROJ-test-* dirs.
Undetected because tests/phase1 is outside the default contract+integration+unit
suite scope.

Fix: point repo_root() at the fake repo via monkeypatch.setenv('LLMXIVE_REPO_ROOT',
fake_repo) in all 3 tests (the mechanism the refactor introduced). Agent now
writes to tmp; no real-repo pollution. Removed the stray PROJ-test-* dirs +
two run-log entries that the buggy runs had left next to committed files.
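The override mechanism behind this fix can be sketched as a resolver that prefers the environment variable and only falls back to climbing the directory tree. This is an assumed shape, not `llmxive.config.repo_root()` itself: the explicit `start` argument and the `pyproject.toml` marker are illustrative choices.

```python
import os
from pathlib import Path
from typing import Mapping


def repo_root(start: Path, environ: Mapping = os.environ) -> Path:
    """Resolve the repo root: env override first, then climb from `start`."""
    override = environ.get("LLMXIVE_REPO_ROOT")
    if override:
        return Path(override).resolve()
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / "pyproject.toml").exists():  # assumed root marker
            return candidate
    raise FileNotFoundError(f"no repo root found above {start}")
```

In a pytest test, `monkeypatch.setenv("LLMXIVE_REPO_ROOT", str(fake_repo))` would redirect every caller of such a resolver to the temporary tree, whereas patching a single module's `__file__` has no effect once the climb is centralized.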

Verified: tests/phase1/test_idempotency.py -> 4 passed; no PROJ-test-* created
in the real repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…leanup + runlog hardening

Three audit findings + a latent bug surfaced fixing the first:

- FIX 3 (discrepancy #7/#49): research_reviewer/paper_reviewer passed
  produced_by_agent=None (self-review-prevention stub). Added
  runlog.producer_of_artifact(project_id, artifact_path) — resolves the agent
  that actually recorded the artifact in its run-log outputs (newest-first,
  suffix path-match) — and wired it into both reviewers. (personality.py left
  as None by design: a '(simulated)' persona never authors an artifact.)

- LATENT BUG hardened while doing FIX 3: run-log .jsonl files also hold FOREIGN
  records (personality-activity rows written by personality.py: action/
  personality_slug/... no run_id). read_entries / latest_for_project /
  producer_of_artifact did RunLogEntry.model_validate_json with no guard ->
  crash on the first such line (latest_for_project only dodged it by reverse-
  scan early-return luck). Added _parse_run_log_entry() that skips non-
  RunLogEntry lines, catching PYDANTIC's ValidationError (runlog had imported
  jsonschema's ValidationError, so even the new guard wouldn't have caught the
  pydantic one — fixed). Regression test in test_runlog_producer.py.

- FIX 4b (summarizer §3a): _render_pointer_block inlined critical elements
  PER CHUNK and broke out on budget overflow, so under a tight budget a
  reviewer's block could contain only some — or zero — critical elements
  (they survived on disk but not in what the panel reads). Now inlines the
  FULL deduped critical-element list FIRST and unconditionally (prose is what's
  bounded); 500-URL/budget-300 repro -> 500/500 in the block. Test updated to
  the correct contract (block <= budget + critical-list).

- FIX 4a (discrepancy #9): deleted the unused RESEARCH_ACCEPT_THRESHOLD /
  PAPER_ACCEPT_THRESHOLD constants (DEFAULTS + module vars + __all__) — no
  consumer anywhere (the 'back-compat' comment was stale).
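
The foreign-record guard and the newest-first producer lookup described in the FIX 3 / LATENT BUG items can be sketched as follows. This is a stdlib-only illustration: the real code validates with pydantic, and the `RunLogEntry` fields, the `agent` key, and the function signatures here are assumptions made for the example:

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class RunLogEntry:
    # Hypothetical minimal shape; the real model has more fields.
    run_id: str
    agent: str
    outputs: Tuple[str, ...]


def parse_run_log_entry(line: str) -> Optional[RunLogEntry]:
    """Return a RunLogEntry, or None for foreign or malformed lines."""
    try:
        raw = json.loads(line)
        return RunLogEntry(
            run_id=str(raw["run_id"]),
            agent=str(raw["agent"]),
            outputs=tuple(raw.get("outputs", ())),
        )
    except (ValueError, KeyError, TypeError, AttributeError):
        # Personality-activity rows have no run_id, so they land here.
        return None


def producer_of_artifact(lines: Iterable[str], artifact_path: str) -> Optional[str]:
    """Newest-first scan; match when one path is a suffix of the other."""
    entries = [e for e in map(parse_run_log_entry, lines) if e is not None]
    for entry in reversed(entries):
        for out in entry.outputs:
            if out and (out.endswith(artifact_path) or artifact_path.endswith(out)):
                return entry.agent
    return None
```

The important property is that the except clause names the exception type the parser actually raises; the original bug was a guard keyed to the wrong library's `ValidationError`.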

Verified: ruff (whole repo) clean; mypy src 0; offline suite 1235 passed /
1 skipped / 2 deselected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…FR-027/028 / #239 core)

Completes the central #239 deliverable left as placeholders: the per-step
multi-lens convergence panels are now INVOKED in production for every
reviewable doc-stage (previously the agent ran + advanced with no panel).

- New shared helper src/llmxive/speckit/_stage_panel.py::run_stage_panel:
  in-cmd engine path (mirrors paper_implement_cmd) via run_engine_for_project
  — converged -> advance; kickback -> write human_input_needed.yaml + raise
  StagePanelKickback (no advance); engine exception -> escalate
  (StagePanelEscalation). Sources each stage's required __X__ extras from the
  REAL project artifacts (idea md, comments, spec/plan, constitution,
  templates); empty string for legitimately-absent inputs (FR-049).
- Wired into write_artifacts of: clarify_cmd (spec), plan_cmd (plan),
  paper_clarify_cmd (paper_spec), paper_plan_cmd (paper_plan), paper_tasks_cmd
  (paper_tasks). Each guards backend-None (offline) -> skip gracefully (the
  agent already produced the artifact; never crash the stage).
- tasks: _tasker_engine_bridge no longer OVERWRITES the panel with a single
  analyze reviewer — it now runs the live 4-lens build_tasks_reviewspec panel
  ALONGSIDE the analyze-derived reviewer (keeps spec-014 honest-reporting AND
  the LLM lenses; placeholder reviewers filtered when backend is None).
- 11 new integration tests (fake backend): per doc-stage, panel-invoked +
  advance on all-accept, and kickback marker written + no-advance on a fail
  verdict; + a tasks-bridge test proving the live panel runs with analyze.
- Corrected 2 tests written against the bug (panel bypassed) to be panel-aware
  fake backends WITHOUT weakening their disk-state/convergence assertions.

Verified: ruff src clean; mypy src 0 (155 files); offline suite 1246 passed /
1 skipped / 2 deselected (1235 + 11). No real LLM calls; PROJ-552 untouched by
this change. Real-call verification on PROJ-552 follows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…y calls (F-13)

Found by the real-call e2e (PROJ-552 spec panel): the convergence panel
(LLMReviewer), the revisers, and the FR-011 self-consistency audit call
backend.chat DIRECTLY (not via the router), and invoke_reviser_backend +
self_consistency_pass passed NO max_tokens. Dartmouth then omits the field, so
the API applies its OpenAI-shaped 512-token default. qwen3.5-122b is a
*reasoning* model — its hidden chain-of-thought consumed all 512 tokens before
emitting any answer -> empty content + finish_reason=length ->
TransientBackendError -> the stage panel escalated to human_input_needed. This
broke EVERY reasoning-model panel/reviser/self-consistency call in production.

Fix: pass a reasoning-safe budget (131072, matching chat_with_fallback's
default; qwen's 256K context leaves ample input room) on these direct
backend.chat calls, via a shared _chat_reasoning_safe() helper that degrades
gracefully for backends / test fakes whose signature omits the kwargs
(TypeError -> retry without max_tokens, then bare). LLMReviewer's default
max_tokens 8192 -> 131072 for consistency + safety on complex lenses.
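
A minimal sketch of the graceful-degradation wrapper, assuming the backend is any callable that takes a message list. The name `chat_reasoning_safe` and its signature are illustrative, not the real `_chat_reasoning_safe()` helper:

```python
from typing import Any, Callable, List

REASONING_SAFE_MAX_TOKENS = 131072  # budget quoted in this commit message


def chat_reasoning_safe(chat: Callable[..., Any], messages: List[Any], **kwargs: Any) -> Any:
    """Call `chat` with a large token budget, degrading for narrower signatures."""
    try:
        # Preferred path: explicit budget so a reasoning model cannot be
        # cut off by a small server-side default.
        return chat(messages, max_tokens=REASONING_SAFE_MAX_TOKENS, **kwargs)
    except TypeError:
        pass
    try:
        # Backend does not accept max_tokens: retry with the other kwargs.
        return chat(messages, **kwargs)
    except TypeError:
        # Test fakes that accept only the messages argument.
        return chat(messages)
```

One caveat of this shape: a `TypeError` raised inside the backend for an unrelated reason also triggers the retries, so the fallback can repeat a call that already started.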

Verified live: after the fix the spec panel RUNS the full 4-lens x 3-round loop
(R1 produced 10 substantive, well-calibrated concerns) instead of failing at
512. Offline suite 1246 passed / 1 skipped / 2 deselected; ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Real-call e2e bug: every convergence reviser instructed the LLM to embed the
full revised document(s) as JSON STRING value(s) (new_*_md) then bare
json.loads'd them. A ~19K-char spec/plan/paper doc full of quotes/$/backslashes
made the model emit invalid JSON (one unescaped quote ends the string early ->
"Expecting , delimiter ... char 19455"), crashing R2 of the convergence loop on
EVERY reviewable stage.

Fix - new shared src/llmxive/convergence/revisers/_reviser_response.py:
- RESPONSE_FORMAT_BLOCK: a SMALL fenced json change-log, then each full
  artifact VERBATIM between ===BEGIN_ARTIFACT <repo-rel-path>=== /
  ===END_ARTIFACT=== markers (raw - quotes/$/backslashes need no escaping).
- parse_reviser_response(text, expected_artifacts) -> (artifacts_by_path,
  responses): regex-extracts delimited blocks (no unescaping, CRLF-tolerant);
  parses the change-log leniently (reuses clarify _escape_newlines + YAML
  fallback); BACKWARD-COMPAT fallback to legacy new_*_md/updated_artifacts JSON;
  fail-loud RuntimeError on total failure.
- build_concern_responses: shared one-per-concern padding (<missing>/<empty>).
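
The delimiter-based extraction can be sketched with a single regex. This is an illustration under assumptions: `extract_artifacts` is a hypothetical name, the pattern is a guess at the marker grammar quoted above, and the change-log parsing and legacy fallback are omitted:

```python
import re
from typing import Dict

# Path is everything between the marker keyword and the closing "===";
# the body is taken verbatim, so quotes, "$" and backslashes need no escaping.
_ARTIFACT_RE = re.compile(
    r"===BEGIN_ARTIFACT (?P<path>[^\r\n=]+?)===\r?\n"
    r"(?P<body>.*?)\r?\n===END_ARTIFACT===",
    re.DOTALL,
)


def extract_artifacts(text: str) -> Dict[str, str]:
    """Map each repo-relative path to its raw, unescaped artifact body."""
    return {m.group("path").strip(): m.group("body") for m in _ARTIFACT_RE.finditer(text)}
```

Because the body never passes through a JSON string, a stray quote in a 19K-character document can no longer end the payload early.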

Migrated all 9 reviser classes (single-doc spec/paper_spec/tasks/paper_tasks/
flesh_out; multi-doc plan/paper_plan; code implementer/paper_implement) - prompt
+ _parse_response - preserving each one's exact error messages, path-validation,
empty-map tolerance (impl #49), and dispatch prefixes. Legacy single-doc key
selection is target-suffix-aware (tasks.md->new_tasks_md) - fixed a
test_tasker_production_cutover regression.

Verified: ruff src clean; mypy src 0 (156 files); offline suite 1260 passed /
1 skipped / 2 deselected (+14 new). REAL-CALL (PROJ-552 spec, qwen3.5-122b, one
isolated round): qwen reliably produced the delimited format; parser extracted a
complete 16,981-char revised spec.md + 2 well-formed responses.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fs (F-18)

Part-7 e2e finding: PROJ-552's spec.md attached a FABRICATED citation
("Lee et al. 2024, arXiv:2402.13") to a (correct) knot count. The malformed
arXiv id slipped past extraction (regex required \d{4}\.\d{4,5}), and when the
convergence panel correctly flagged "verify this citation" the reviser
"resolved" it by fabricating a *different* wrong number + a second fake
citation. Violates Constitution Principle II (no fabricated references).

General fix — a citation-verification "strip/flag" pass that resolves every
external reference in produced docs and rewrites unresolvable ones in-place as
`[UNVERIFIED: <ref> — <reason>]` (explicit + greppable; never silently deleted):

- NEW src/llmxive/agents/citation_guard.py: apply_citation_verdicts (pure,
  idempotent rewriter), verify_and_clean (network orchestrator).
- reference_validator.extract_citations: also capture MALFORMED arXiv refs
  (e.g. arXiv:2402.13) so fabricated-malformed ids are flagged, not ignored.
- Hooked at BOTH production points: stage-doc write (slash_command
  _validate_artifact_citations writes cleaned text back) AND the shared reviser
  chokepoint (_self_consistency) so reviser-introduced fakes are caught too,
  BEFORE the panel re-reviews (prevents the fabrication cascade).
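
To illustrate the extraction change, here is a hedged sketch that captures every `arXiv:` reference and separately records whether the id is well formed, so a malformed id is surfaced rather than skipped. The function name and the loose capture pattern are assumptions, and only new-style ids are handled:

```python
import re
from typing import List, Tuple

# Loose capture: anything id-like that follows "arXiv:" and starts with a digit.
_ARXIV_ANY = re.compile(r"arXiv:\s*(?P<id>[0-9][0-9A-Za-z./-]*)", re.IGNORECASE)
# Strict shape: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
_ARXIV_WELL_FORMED = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")


def extract_arxiv_refs(text: str) -> List[Tuple[str, bool]]:
    """Return (arxiv_id, is_well_formed) for every arXiv reference found."""
    refs = []
    for match in _ARXIV_ANY.finditer(text):
        arxiv_id = match.group("id").rstrip(".")  # drop sentence-final period
        refs.append((arxiv_id, bool(_ARXIV_WELL_FORMED.fullmatch(arxiv_id))))
    return refs
```

Splitting capture from validation is the design point: the strict pattern alone cannot see `arXiv:2402.13` at all, which is how the fabricated reference escaped.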

Resolution is REGISTRAR-AGNOSTIC (requirement: support Zenodo/bioRxiv/PsyArXiv/
medRxiv/OSF + all URLs). New public verify.resolve_reference(kind, value)
resolves DOIs via https://doi.org/<doi> redirect (works for Crossref AND
DataCite) instead of the old Crossref-only path that would FALSE-FLAG every
DataCite DOI (Zenodo 10.5281, PsyArXiv/OSF 10.31234). arXiv→arxiv.org/abs,
URL→HEAD+GET. Paywall/403-after-redirect = PRESENT (not flagged); 404/DNS/
malformed = flagged. Drops the FR-022-forbidden fetch_citation caller.

Real-call verified: real Zenodo/PsyArXiv/OSF/bioRxiv/arXiv/URL all resolve
PRESENT; fabricated DOI/URL + malformed arXiv:2402.13 all flagged. ruff clean;
mypy 0 (157 files); offline gate 1267 passed (+pure-logic guard tests);
real_call + FR-022 no-duplicate-caller tests pass live.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…18c)

Per user decision: a document containing any citation-guard
`[UNVERIFIED: <ref> — <reason>]` marker (a reference that could not be
resolved to a live primary source) MUST NOT advance through the pipeline.
Wired generally at the three gate sites:

- convergence engine (universal gate for the 6 doc-stages): run_convergence
  scans the FINAL produced-doc artifacts (skipping __sentinel__ context keys)
  BEFORE declaring convergence; each artifact still carrying a marker yields a
  synthesized Severity.SCIENCE concern naming the artifact + verbatim marker
  bodies, appended to open_concerns so converged->False and route_kickback
  carries the reason (SCIENCE routes the factual defect to an earlier content
  stage, not an in-loop re-edit). Clean artifacts converge exactly as before.
- advancement evaluator: research-accept and paper-accept now block when the
  project's governing artifacts contain markers, OR-combined with the existing
  _has_blocking_citations status gate.
- paper_complete gate (graph.py): blocks paper_in_progress->paper_complete when
  paper artifacts contain markers (cheap short-circuit before the LaTeX build).
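
A minimal sketch of the marker scan shared by the three gates, assuming artifacts arrive as a path-to-text mapping and that context keys are wrapped in double underscores. `blocking_artifacts` is a hypothetical name; the two helper names mirror the ones listed below but their bodies are guesses:

```python
import re
from typing import Dict, List

UNVERIFIED_MARKER_PREFIX = "[UNVERIFIED:"
_MARKER_RE = re.compile(re.escape(UNVERIFIED_MARKER_PREFIX) + r"\s*(?P<body>[^\]]*)\]")


def has_unverified_markers(text: str) -> bool:
    return UNVERIFIED_MARKER_PREFIX in text


def find_unverified_markers(text: str) -> List[str]:
    """Return the verbatim body of each marker, without the brackets."""
    return [m.group("body").strip() for m in _MARKER_RE.finditer(text)]


def blocking_artifacts(artifacts: Dict[str, str]) -> Dict[str, List[str]]:
    """Produced docs that still carry markers; __sentinel__ keys are skipped."""
    return {
        path: find_unverified_markers(body)
        for path, body in artifacts.items()
        if not (path.startswith("__") and path.endswith("__"))
        and has_unverified_markers(body)
    }
```

A non-empty result is what would be turned into a synthesized concern, so that convergence is reported as false and the kickback reason names the offending artifact.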

New citation_guard helpers: UNVERIFIED_MARKER_PREFIX (single source of truth),
has_unverified_markers, find_unverified_markers, project_unverified_markers,
project_artifacts_have_markers. kickback.route_kickback reason surfaces markers.

Also: graph.py two telemetry print() -> logger.warning (noticed in passing).

Tests +10 (guard helpers, engine converged->kickback on marker, advancement
non-advance, paper-complete block). ruff clean; mypy 0 (157 files); offline
gate 1277 passed / 1 skipped / 2 deselected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brainstormed design for verifying that a cited claim is substantiated by the
FULL TEXT of the source (numbers match, concept conveyed accurately), not just
that the reference exists (F-18) or that the abstract overlaps (F-19 v1).

Maintainer-confirmed decisions: hybrid passage-location + LLM entailment;
open-access-first retrieval cascade (arXiv / Unpaywall / Semantic Scholar
openAccessPdf / preprint patterns / direct URL) with abstract fallback;
reviser-chokepoint each round + persistent (source,claim) cache; flag on
unreadable/unresolvable/free-text; standalone llmxive.grounding service reusing
pdf_sample/verify helpers (librarian untouched). UNPAYWALL_EMAIL=llmxive@gmail.com.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
9 TDD tasks: config, RetrievedDoc + PDF/HTML extractors, OA-first retrieval
cascade (arXiv/Unpaywall/S2/preprint/URL), passage location + LLM entailment,
persistent caches, service orchestrator + policy, wire into F-19 guard,
real-call e2e + gates, tracker. Reuses pdf_sample/verify/librarian.cache.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + reviser hook)

Abstract-only/arXiv-only grounding baseline. Becomes the extraction + rewriter
front-end for the full-text grounding service (F-19 v2, see
docs/superpowers/plans/2026-05-29-full-text-claim-grounding.md); its arXiv-only
_fetch_source_text/ground_claim internals are replaced there. Env-gated
LLMXIVE_GROUNDING_GUARD (on in cli.run, off in offline gate). offline 1290.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Add offline test that runs pypdf on a hand-authored minimal PDF byte
  string asserting "Grounding" and "12345" round-trip through
  extract_pdf_text (issue 1).
- Move `import pypdf` outside the bare except so ImportError propagates
  instead of being swallowed silently (issue 2).
- Re-export RetrievedDoc, extract_pdf_text, html_to_text from
  grounding/__init__.py with __all__ (issue 3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reprint/URL)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two JSON caches under state/grounding-cache/{fulltext,verdict}/,
keyed by SHA-256. Verdict key includes normalized claim + number so
different numbers yield independent cache entries. max_age_s<0 always
expires. TTL defaults: 90d full-text, 30d verdict.
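
The keying and expiry rules can be sketched as below. The field separator, the normalization (lowercase plus whitespace collapse), and the function names are assumptions for illustration; only the SHA-256 keying, the number-in-key rule, the negative-age rule, and the TTL values come from the text above:

```python
import hashlib
import time
from typing import Optional

FULLTEXT_TTL_S = 90 * 24 * 3600  # 90 days
VERDICT_TTL_S = 30 * 24 * 3600   # 30 days


def verdict_key(source_id: str, claim: str, number: Optional[str]) -> str:
    """SHA-256 over source, normalized claim and number."""
    normalized = " ".join(claim.lower().split())
    # Unit-separator keeps fields from running together ("ab"+"c" vs "a"+"bc").
    payload = "\x1f".join([source_id, normalized, number or ""])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_expired(stored_at: float, max_age_s: float, now: Optional[float] = None) -> bool:
    """A negative max_age_s always expires, which forces a refetch."""
    if max_age_s < 0:
        return True
    current = time.time() if now is None else now
    return (current - stored_at) > max_age_s
```

Including the number in the key means a claim re-cited with a different figure is re-checked rather than served a stale verdict.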

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…JSON on concurrent writes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace manual os.write/os.close pair with os.fdopen context manager so
the file descriptor is closed exactly once. On failure, unlink the temp
file safely (ignoring OSError) so the original exception is not masked.
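
A sketch of the write pattern described here, assuming the cache writes JSON via a temp file in the target directory followed by an atomic rename. `atomic_write_json` is an illustrative name:

```python
import json
import os
import tempfile
from typing import Any


def atomic_write_json(path: str, payload: Any) -> None:
    """Write JSON to a temp file, then atomically replace the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # os.fdopen takes ownership of fd; the with-block closes it exactly once.
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except OSError:
            pass  # cleanup failure must not mask the original exception
        raise
```

Readers either see the old file or the complete new one, never a partially written document, because `os.replace` swaps the name in a single step on the same filesystem.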

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ground_claim now keeps only the free-text short-circuit and delegates
resolvable-source grounding to grounding.service.ground_cited_claim via a
function-local _service_ground seam (avoids the import cycle: service +
entailment import names from grounding_guard at module top). Deletes the dead
abstract-only grounding body and _fetch_source_text. Threads repo_root into
ground_claim from verify_grounding_and_clean. Updates the real_call test to the
new signature + full-text service reason strings (verified live: 5 passed
against arXiv + Dartmouth).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-call assertion

Remove the unused `timeout=30.0` parameter from `ground_claim` — the
underlying service takes no timeout arg so it was never forwarded.
Confirmed that no callers pass it.

Strengthen `test_number_not_in_cited_source_is_flagged`: in addition to
asserting `ok is False`, assert the reason text contains at least one of
"not found", "contradict", or "unreadable" (case-insensitive), ensuring
the service vocabulary is reflected in the flag reason.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
F-19 v2 Task 8. Adds tests/real_call/test_grounding_end_to_end.py:
a fabricated cited number on a real arXiv paper (1706.03762) is flagged
[UNVERIFIED]; a correctly cited number is not.

Also fixes prompt-block resolution so the extraction and entailment
prompt blocks load from the real repo root (config.repo_root()) when not
found under the per-run cache repo_root (which may be a tmp dir for
isolation). Without this, passing repo_root=tmp_path silently skipped
extraction/entailment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r unreadable grounding sources

F-19 full-text claim-grounding holistic-review fixes:

Fix 1 (number gate, design §5): ground_cited_claim now overrides an LLM
"grounded" verdict to a FLAG when the claim's number is absent from the
retrieved source text (number_substantiated() pure helper, unit-tested
offline + proven end-to-end to flip grounded->flagged).

Fix 2 (Tier-5 URL): _fetch_url_text restricts to http/https, streams the
body with a 50MB cap (shared PDF_MAX_BYTES), bounded timeout+redirects;
non-http schemes yield no text.

Fix 3: unreadable sources are no longer written to either cache, so a
transient retrieval failure self-heals next round.

Fix 4 (doc): design §9 records the reviser-chokepoint-only and v1-only
preprint limitations.
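
The number gate from Fix 1 can be sketched as a pure function plus an override step. The token regex, the comma normalization, and the verdict strings are assumptions; only the rule itself (a "grounded" verdict becomes a flag when the number is absent from the source text) is taken from the description above:

```python
import re

_NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def _normalize(token: str) -> str:
    return token.replace(",", "")


def number_substantiated(number: str, source_text: str) -> bool:
    """True when the claim's number appears as a whole numeric token."""
    target = _normalize(number.strip())
    if not target:
        return True  # nothing numeric to check
    found = {_normalize(tok) for tok in _NUMBER_RE.findall(source_text)}
    return target in found


def apply_number_gate(llm_verdict: str, number: str, source_text: str) -> str:
    """Override an LLM 'grounded' verdict when the number is missing."""
    if llm_verdict == "grounded" and number and not number_substantiated(number, source_text):
        return "flagged"
    return llm_verdict
```

Matching whole tokens rather than substrings avoids accepting "28.4" just because the source contains "128.45".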

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/F-20 B)

Panel non-convergence now writes a generic convergence_kickback.yaml record
(to_stage/worst_severity/reason/unresolved_concerns/stage) instead of
human_input_needed.yaml; the graph consumes it and auto-kicks-back to the
content stage, bounded by a per-stage kickback cap
(CONVERGENCE_KICKBACK_CAP=3) that escalates to human_input_needed and resets
on clean advancement. human_input_needed.yaml is reserved for genuine human
escalation (engine exception + cap-hit). Adds an on_round inspection trail
persisted under .specify/memory/convergence_trail/<stage>-NNN.jsonl, and the
missing backward kickback transitions in ALLOWED_TRANSITIONS.
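
A minimal sketch of the bounded auto-kickback, assuming a per-stage counter held in a plain dict. Whether escalation fires on the kickback that reaches the cap or on the one after it is not stated above; this sketch assumes three automatic kickbacks are allowed and the fourth escalates. `next_action` and the returned action strings are hypothetical:

```python
from typing import Dict

CONVERGENCE_KICKBACK_CAP = 3


def next_action(kickback_counts: Dict[str, int], stage: str, converged: bool) -> str:
    """Decide what the graph does after a stage panel finishes."""
    if converged:
        # Clean advancement resets the stage's counter.
        kickback_counts.pop(stage, None)
        return "advance"
    count = kickback_counts.get(stage, 0) + 1
    kickback_counts[stage] = count
    if count <= CONVERGENCE_KICKBACK_CAP:
        return "kickback"
    # Cap exceeded: hand over to a human instead of looping again.
    return "escalate_to_human"
```

The cap is what keeps an auto-kickback loop from cycling indefinitely between a content stage and its panel.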

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>