docs(specs): RFC for async WM→SWM promote queue#643
Conversation
PR #602 (Graphify importer) showed that the synchronous POST /api/assertion/<name>/promote step dominates wall-clock time on multi-thousand-partition imports. Each promote blocks the importer through the daemon's quad-subtraction, WM→SWM transfer, and graph- manager bookkeeping; for a 10k-file import, the cumulative blocking latency is the import's bottleneck. The existing AsyncLiftPublisher (packages/publisher) already solves the analogous problem for the next memory tier (SWM→VM): enqueue, durable retry, structured status. This RFC proposes a direct analogue — AsyncPromoteQueue — for WM→SWM, with: - POST /api/assertion/<name>/promote-async (enqueue) - GET /api/assertion/promote-async/<jobId> (status) - DELETE / POST .../recover (cancel / requeue) - In-process worker with bounded concurrency, exponential backoff, crash-recovery via "running > timeout → queued" sweep on startup. The spec deliberately leaves the sync /promote route untouched — small interactive importers keep their answer-or-error shape; bulk importers opt into the async path. The queue is persisted via the same assertion API the WM uses, so #640's durability story covers it. Implementation is out of scope for this PR; sign-off on the surface + internals first. Independent of #640 (WM persistence fix), #641 (ADRs), and #642 (importer helpers), but depends on #640 having landed before any implementation work begins. Related to #596. Co-authored-by: Cursor <cursoragent@cursor.com>
| } | ||
| ``` | ||
|
|
||
| `state` transitions: `queued → running → (succeeded | failed)`. A failed |
There was a problem hiding this comment.
🔴 Bug: failed is defined here as a state the daemon may still retry (lastError.retryable: true), but the importer example in §5 stops polling and marks the partition failed as soon as it sees failed. That means a transient promote error can be reported as terminal even while the worker is still backing off and retrying. Reserve failed for exhausted-retry terminal failures, or add a separate retrying/backoff state and document that polling clients should keep waiting.
There was a problem hiding this comment.
Addressed on this branch: failed_retrying is now the transient retry state, failed is terminal, and the importer example keeps polling retryable failures.
| } | ||
| ``` | ||
|
|
||
| Persisted in the daemon's `meta` sub-graph as RDF triples under a |
There was a problem hiding this comment.
🔴 Bug: Persisting queue state in the context graph's meta sub-graph via the assertion API creates a circular dependency: the queue's own bookkeeping now depends on the same WM/assertion path it is supposed to recover when promote calls fail. It also mixes internal control-plane state into graph data. AsyncLiftPublisher stores jobs in dedicated local control graphs instead; this spec should follow that pattern rather than routing queue persistence through assertions.
There was a problem hiding this comment.
Addressed on this branch: the queue is now persisted in a dedicated control graph, not in a project meta sub-graph through the assertion API.
|
|
||
| - **Daemon crash mid-job**: the job is still `running` in the persisted | ||
| store. On startup, the worker scans for `running` jobs older than | ||
| `runningTimeoutMs` (default 60s) and transitions them back to `queued` |
There was a problem hiding this comment.
🔴 Bug: Re-queuing any running job older than 60s will duplicate healthy long-running promotes. Large assertions can legitimately spend longer than a minute in subtraction/gossip/backpressure, so this timeout can schedule a second worker against the same assertion while the first one is still mutating it. Use a renewable lease/heartbeat (like the async publisher's claim lock) or only reclaim running jobs after a confirmed process restart.
There was a problem hiding this comment.
Addressed on this branch: fixed reclaim semantics to use a renewable lease/heartbeat rather than a fixed age timeout for running jobs.
| } | ||
| ``` | ||
|
|
||
| `entities` is bounded by the same `ROOT_CHUNK = 1000` budget as the sync |
There was a problem hiding this comment.
🟡 Issue: This claims the async route has the same ROOT_CHUNK = 1000 contract as sync /promote, but the current sync API accepts entities: "all" and does not enforce a 1000-root cap. That makes the later 'SAME code path' claim inaccurate and creates an avoidable API mismatch. Please either align async with the existing promote contract or call out the intentional incompatibility explicitly.
There was a problem hiding this comment.
Addressed in 33817c8: async now explicitly follows the sync promote contract, including entities: all or explicit arrays, lenient root filtering, and the SMALL_BODY_BYTES budget.
Four fixes to the proposal: - State machine: split terminal `failed` from transient `failed_retrying`. Importer polling clients now have a documented signal for "daemon will retry on its own — keep waiting"; the canonical poll example was treating any non-`succeeded` non-`failed` state correctly anyway but the spec did not say so explicitly. - Persistence: move queue state out of the project `meta` sub-graph and into a dedicated `urn:dkg:promote-queue:control-plane` control graph (mirroring AsyncLiftPublisher's `urn:dkg:publisher:control-plane`). Avoids the circular dependency of routing queue bookkeeping through the same assertion-write path the queue is meant to recover, and keeps daemon control-plane state out of user queries. - Reclaim: replace the fixed 60s "running" age threshold with a renewable lease (claimToken + claimLeaseExpiresAt), same pattern AsyncLiftPublisher uses today. Long-but-healthy promotes keep renewing and stay claimed; only truly stalled jobs (process gone, GC pause longer than lease TTL) become reclaimable. - Root-cap accuracy: the sync `/promote` route does not enforce a server-side ROOT_CHUNK=1000 cap today (it accepts `entities: "all"` and anything under MAX_BODY_BYTES). Reworded §3.1 so the async route inherits the same lenient acceptance instead of inventing a contract that sync does not honour. Co-authored-by: Cursor <cursoragent@cursor.com>
| assertionName: string; | ||
| subGraphName: string; | ||
| entities: string[]; | ||
| state: 'queued' | 'running' | 'succeeded' | 'failed'; |
There was a problem hiding this comment.
🔴 Bug: This public state enum omits failed_retrying, but §3.2 and the importer polling example both rely on that transient state. If implementers follow this sketch, valid status payloads become unrepresentable and clients may treat retryable failures as terminal. Add failed_retrying everywhere the API/model defines job states.
There was a problem hiding this comment.
Addressed on this branch: the public PromoteJobState enum now includes failed_retrying everywhere the API/model lists states.
| A single `AsyncPromoteWorker` runs in-process inside the daemon (no | ||
| separate worker thread — promote is not CPU-bound). It: | ||
|
|
||
| 1. Polls the control graph for jobs in `queued` state, ordered by |
There was a problem hiding this comment.
🔴 Bug: The worker only polls queued jobs here, but step 4 moves retryable failures to failed_retrying and never describes a transition back to queued. As written, a job that fails once with a retryable error will never run again. Either have the scheduler pick up failed_retrying jobs whose nextRetryAt <= now, or explicitly flip them back to queued when the backoff expires.
There was a problem hiding this comment.
Addressed on this branch: runnable jobs now include failed_retrying rows whose nextRetryAt has expired, or an equivalent tick can flip them back to queued.
| 5. On `attempts >= maxAttempts` (default 5) or non-retryable error, | ||
| transitions to terminal `failed`. Requires explicit `recover` to | ||
| retry. | ||
| 6. Concurrency: at most N (default 4) running jobs per daemon, matching |
There was a problem hiding this comment.
🔴 Bug: This says promotions stay 1 per assertion at a time, but the design only describes a claim CAS on an individual job row. If two promote-async requests target the same (contextGraphId, assertionName), different workers can still claim both jobs and run concurrent promotes for the same assertion. Add a per-assertion lock/uniqueness rule, or reject duplicate enqueues for the same assertion while one job is queued or running.
There was a problem hiding this comment.
Addressed in 33817c8: concurrency and dedupe now use the full assertion identity, contextGraphId plus subGraphName plus assertionName, and block concurrent jobs for the same assertion.
Three correctness gaps Codex flagged on the revised RFC: 1. The TypeScript PromoteJob interface omitted `failed_retrying` from `state` even though §3.2 and the importer polling example both rely on it. Hoisted the state list into a dedicated `PromoteJobState` type alias and added a note that every public surface MUST reference it; updated the JSON example state list accordingly. 2. The worker loop only polled `queued` jobs but §4.2 step 4 transitions retryable failures to `failed_retrying` and never describes how they get back. Reworded step 1 to make a job "runnable" iff `state = queued` OR (`state = failed_retrying` AND `nextRetryAt <= now`). Step 4 now explicitly clears the claim so the rescheduler picks the job up. Documents two equivalent implementation options (direct schedule vs flip-back-to-queued tick task). 3. The "1 promote per assertion at a time" claim was only enforced per-job by the claim CAS, so two concurrent `promote-async` calls against the same `(contextGraphId, assertionName)` could end up with two workers running. Added defence-in-depth: enqueue-time 409 Conflict if a job is already queued/running/failed_retrying for the same pair, AND a claim-time exclusion in the runnable query. Matches the sync `/promote` route's behaviour today. Co-authored-by: Cursor <cursoragent@cursor.com>
| `(contextGraphId, assertionName)`. Multi-CG bulk promotes are out of | ||
| scope (importers that want them can enqueue many jobs). | ||
| - **Promote-side validation changes.** The async path applies the same | ||
| validation as the sync path; no relaxation of "root URI must exist in |
There was a problem hiding this comment.
🔴 Bug: This says the async path keeps the sync contract and then defines a stricter root-validation rule than the current /promote implementation. Today assertion.promote does not fail when a requested root is missing from WM; it filters the quads and can return promotedCount: 0. If the async queue starts rejecting those requests, retries and resumptions that are currently harmless become breaking failures. Please align the spec with the existing permissive behavior, or call out the stricter validation as an intentional contract change.
There was a problem hiding this comment.
Addressed in 33817c8: the spec now preserves sync promote permissive missing-root behavior instead of adding stricter async validation.
| `ROOT_CHUNK ≤ 1000` — for symmetry with the sync route's request-size | ||
| budget. Note that the **sync** `/promote` route does not currently | ||
| enforce a per-call root cap server-side (it accepts `entities: "all"` | ||
| and any explicit array that fits under `MAX_BODY_BYTES`); the async |
There was a problem hiding this comment.
🔴 Bug: The sync /promote route is not actually bounded by MAX_BODY_BYTES today; the daemon reads that endpoint with SMALL_BODY_BYTES (256 KB). Basing the async design on the larger limit will mis-spec request sizing and queue/backoff behavior. Please either reference the real current limit here or explicitly say the sync route must be widened before this async contract claims parity.
There was a problem hiding this comment.
Addressed in 33817c8: request sizing now references the real sync promote SMALL_BODY_BYTES limit instead of MAX_BODY_BYTES.
| 6. **Concurrency caps**: | ||
| - At most N (default 4) running jobs per daemon, matching ADR 0002's | ||
| "max concurrent assertions" guidance. | ||
| - **At most 1 active job per `(contextGraphId, assertionName)` pair.** |
There was a problem hiding this comment.
🔴 Bug: Uniqueness keyed only by (contextGraphId, assertionName) is too coarse. Assertions are also scoped by subGraphName in the current implementation, so two different assertions with the same name in different sub-graphs would incorrectly 409 each other or be serialized behind one queue slot. Use the full assertion identity for dedupe/claiming (at least contextGraphId + subGraphName + assertionName).
There was a problem hiding this comment.
Addressed in 33817c8: dedupe and claim locking now key on contextGraphId plus subGraphName plus assertionName, not contextGraphId plus assertionName only.
| 2. **Drain semantics on shutdown.** Two options: | ||
| - **Wait for running jobs to finish.** Cleanest, but a 5-min running | ||
| job blocks `dkg stop` for 5 min. Bad for ops. | ||
| - **Mark running → queued + exit.** Some jobs run twice. The promote |
There was a problem hiding this comment.
🔴 Bug: Mark running → queued + exit is not safe with the current promote flow. assertionPromote mutates SWM, WM cleanup, lifecycle metadata, and workspace-operation metadata in separate steps, so a shutdown after the SWM insert is not a pure no-op rerun. Requeueing that job on stop can duplicate metadata/gossip or re-run against a partially moved assertion. This section needs the same lease-expiry recovery story as crash handling, or an explicit idempotency/commit marker, before recommending this shutdown behavior.
There was a problem hiding this comment.
Addressed in 33817c8: shutdown no longer rewrites running jobs to queued. Recovery now requires lease expiry plus an idempotency or commit marker, with ambiguous partial promotes parked instead of blindly rerun.
Appends §10 with three concrete data points pulled from importing a real codebase graph (Graphify: 26,960 nodes / 63,003 edges) against an rc.10 sandbox daemon: - Cascading write degradation: write latency drops ~200x from baseline (5-7 ms/batch -> 1.2 s/batch and worse) once cumulative SWM grows past ~100k triples, because sync /promote shares a worker with write ingestion. Per-partition table included. - 10 MB gossip cap failures: 4 of 17 partitions exceeded the cap with entities:"all". Total wall-clock in halve-and-retry: ~15 minutes on top of legitimate write/promote work. - Production reliability: a single fetch-failed socket error during a multi-minute promote kills the importer with no recovery in the sync path. Also adds §10.4 explaining why "in front of" (async queue) is the right layer instead of "underneath" (sync /promote with internal batching) — the importer's pain is wall-clock per partition, not request shape. The empirical numbers strengthen the case made by every existing section but are quoted in one appendix so the body remains an RFC, not a benchmark report. Co-authored-by: Cursor <cursoragent@cursor.com>
…mport Adds a short "Empirical motivation" section between the Consequences and Rejected alternatives lists, with concrete data from importing this repo's Graphify output (26,960 nodes / 63,003 edges) on rc.10: - Hand-rolled importer from dkg-node SKILL.md alone: ~330 LOC, half of which is cap-discovery ceremony. - Same workload using rc.10's existing scripts/lib/dkg-daemon.mjs DkgClient: ~half the LOC, gaps remaining are the ones #642 and #643 close. - Four of 17 partitions exceeded the 10 MB gossip cap; total wall- clock in halve-and-retry: ~15 minutes on top of legitimate work. - Write latency degrades ~200x from baseline once SWM grows past ~100k triples; that cost is what motivates #643 but requires the chunking discipline this ADR formalises. Keeps the headlines so the contract's "why these numbers" question has a concrete answer for reviewers; the full per-partition data lives in the rc.10 test artifacts referenced in the PR description. Co-authored-by: Cursor <cursoragent@cursor.com>
Adds §1.1 to dkg-importer/SKILL.md with the three hard caps an agent will hit at scale, plus the verbatim daemon error text for each one: - /write: MAX_BODY_BYTES = 10 MB body cap (HTTP 413) - /promote: SMALL_BODY_BYTES = 256 KB body cap (HTTP 413) — same status code as /write but a different limit, easy to confuse - /promote: 10 MB gossip-message cap (HTTP 500 "too large for gossip"), independent of the body cap and triggerable even with a single root URI Expands §5 Error handling so each of the three failure modes has its own recipe instead of one generic "halve and retry on 413" block. Calls out that 413-on-promote and 500-gossip are orthogonal recovery paths, both of which an rc.10 importer needs until #643's async promote queue lands. The cap numbers and error strings were validated empirically during the rc.10 Graphify import. Quoting them inline turns "discover via 413" into "look up before you start", which is the difference between a 5-minute import and a 20-minute import for non-trivial graphs. Co-authored-by: Cursor <cursoragent@cursor.com>
Brings in §10 'Empirical motivation (rc.10 Graphify import)' with: - §10.1 cascading write degradation table (~200x slowdown past 100k SWM) - §10.2 the 4 partitions that exceeded the 10 MB gossip cap (~15 min total wall-clock in halve-and-retry) - §10.3 production reliability (single fetch failed -> crash with no retry classification on sync /promote) - §10.4 why 'in front of' (async queue) beats 'underneath' (sync with internal batching) as a fix layer
Adds a "What changed in round 3" section to INTEGRATION_NOTES_GRAPHIFY.md summarising the per-PR updates folded into integration after running the stack against a real Graphify codebase import on rc.10: - #641: empirical motivation in ADR 0002 + portable scanner ref + dead-link cleanup - #642: §1.1 caps reference with verbatim error strings + manifest hardening (idempotent create, ms tie-breaker, loud empty bindings) - #643: SMALL_BODY_BYTES correction + full assertion identity for uniqueness + lease-expiry shutdown + §10 empirical appendix Notes that rc.10 import test artifacts are kept local, not committed. Co-authored-by: Cursor <cursoragent@cursor.com>
Adds §1.1 to dkg-importer/SKILL.md with the three hard caps an agent will hit at scale, plus the verbatim daemon error text for each one: - /write: MAX_BODY_BYTES = 10 MB body cap (HTTP 413) - /promote: SMALL_BODY_BYTES = 256 KB body cap (HTTP 413) — same status code as /write but a different limit, easy to confuse - /promote: 10 MB gossip-message cap (HTTP 500 "too large for gossip"), independent of the body cap and triggerable even with a single root URI Expands §5 Error handling so each of the three failure modes has its own recipe instead of one generic "halve and retry on 413" block. Calls out that 413-on-promote and 500-gossip are orthogonal recovery paths, both of which an rc.10 importer needs until #643's async promote queue lands. The cap numbers and error strings were validated empirically during the rc.10 Graphify import. Quoting them inline turns "discover via 413" into "look up before you start", which is the difference between a 5-minute import and a 20-minute import for non-trivial graphs. Co-authored-by: Cursor <cursoragent@cursor.com>
… via HTTP Two PR #642 follow-ups + the originally-scheduled SKILL.md update for the async promote queue (the bulk of PR #4's documentation work, deferred until #642 merged because that's where dkg-importer/SKILL.md first landed). 1. packages/cli/skills/dkg-importer/SKILL.md — new §6 "Async promote queue" - Route inventory (POST/GET/DELETE /api/assertion/.../promote-async + recover) - Async write loop (drop-in for §2's synchronous promote step) - Failure-classification table mapping attempt.lastError.classification (transient | cap_exceeded | fatal) to importer actions - Migration guidance vs. the sync /promote route (still supported) - Inspecting in-flight jobs alongside the manifest in §3 New cheat-sheet variant under §8 mirroring the async loop. §5's forward reference to "PR #643's async promote queue ... until that lands" rewritten to point at the now-shipping §6. 2. New endpoint GET /.well-known/skill-importer.md (Codex PR #642 follow-up) The cross-link from dkg-node/SKILL.md to dkg-importer/SKILL.md was unreachable for agents installed via the setup flow because only the former was served from packages/cli/src/daemon/manifest.ts's loadSkillTemplate. Adds a sibling loadImporterSkillTemplate + /.well-known/skill-importer.md route in status.ts (same auth-public, ETag-cacheable shape as /.well-known/skill.md), plus PUBLIC_GET_PATHS / PUBLIC_HEAD_PATHS / isLoopbackRateLimitExemptPath allowlist entries. Cross-link in dkg-node/SKILL.md now points at the runtime URL. 4 new tests (auth public path + 4 dkg-importer skill content pins). 3. scripts/lib/manifest.mjs — partitionDeclared helper (Codex PR #642 quadratic-validation follow-up). markPartitionStatus used to call loadImportManifest on every status write, materialising the full manifest from SWM. For a 10k-partition import marking each twice (in_progress + done), that's 20k full SWM round-trips. Replaced with a single-row SPARQL ASK: one query, one Boolean, same error semantics when a partition isn't declared. 4 new tests including a regression pin asserting markPartitionStatus issues exactly one query. Tests: 27/27 (scripts/lib manifest), 32/32 (publisher async-promote-queue), 190/190 (CLI unit suites incl. new skill-endpoint cases, async-promote routes/worker/E2E, auth allowlist). tsc green. Zero lints. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
PR #602 (Graphify importer) showed that the synchronous `POST /api/assertion//promote` step dominates wall-clock time on multi-thousand-partition imports. The existing `AsyncLiftPublisher` already solves the analogous problem for the next memory tier (SWM → VM); this RFC proposes a direct analogue — `AsyncPromoteQueue` — for WM → SWM.
Surface:
Internals: in-process worker, bounded concurrency (default 4), exponential backoff, crash-recovery via "running > timeout → queued" sweep on startup. Queue persisted via the assertion API in the `meta` sub-graph (durability inherited from #640).
The sync `/promote` route stays unchanged — small interactive importers keep their answer-or-error shape; bulk importers opt into the async path.
Status
RFC only — no implementation in this PR. Sign-off on the surface + internals first, then implementation lands in a sibling `packages/publisher/src/async-promote-queue-impl.ts` of similar size to the existing `async-lift-publisher-impl.ts`.
Sequencing
Independent of #640 (WM persistence fix), #641 (ADRs), and #642 (importer helpers), but depends on #640 having landed before any implementation work begins — the queue's durability rests on WM-store persistence.
Test plan
Related: #596.
Made with Cursor
Update — round 3 (2026-05-25): empirical motivation appendix
Adds §10 — Empirical motivation (rc.10 Graphify import) to the RFC with concrete data from importing this repository's Graphify output (26,960 nodes / 63,003 edges across 17 partitions) on an rc.10 sandbox daemon:
/promoteshares a worker with write ingestion. The 17th partition (packages-cli, 72k triples) took 3 minutes for the write step alone.entities: "all"(24,599 KB, 11,993 KB, 10,788 KB, 20,686 KB). Total wall-clock in halve-and-retry: ~15 minutes on top of legitimate work.fetch failedsocket error during a multi-minute promote kills the importer. The sync path has no retry classification; §3.3failed_retrying+ §4.4 attempt commit marker move that responsibility into the daemon, where it has the lease information and persistence to do it correctly./promote: the importer's pain is wall-clock per partition, not request shape. A 5-minute sync call that quietly chunks is still a 5-minute sync call — the importer can't pipeline the next partition's write, can't show progress, and a TCP reset still kills the operation.Also resolves round-3 codex line comments in the same commit series:
/promote(no stricter validation in async).MAX_BODY_BYTES→SMALL_BODY_BYTES— §3.1 cites the actual 256 KB limit the daemon enforces on/promotetoday, not the 10 MB/writelimit. This was caught by codex and validated empirically.(contextGraphId, subGraphName, assertionName)identity in §2 (Non-goals) and §4.5 (Concurrency caps); previously was(contextGraphId, assertionName)which would have incorrectly serialized different sub-graphs' jobs.