Skip to content

docs(specs): RFC for async WM→SWM promote queue#643

Merged
branarakic merged 5 commits into
mainfrom
docs/spec-async-promote-queue
May 25, 2026
Merged

docs(specs): RFC for async WM→SWM promote queue#643
branarakic merged 5 commits into
mainfrom
docs/spec-async-promote-queue

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

@branarakic branarakic commented May 25, 2026

Summary

PR #602 (Graphify importer) showed that the synchronous `POST /api/assertion//promote` step dominates wall-clock time on multi-thousand-partition imports. The existing `AsyncLiftPublisher` already solves the analogous problem for the next memory tier (SWM → VM); this RFC proposes a direct analogue — `AsyncPromoteQueue` — for WM → SWM.

Surface:

  • `POST /api/assertion//promote-async` — enqueue
  • `GET /api/assertion/promote-async/` — status
  • `DELETE` / `POST .../recover` — cancel / requeue

Internals: in-process worker, bounded concurrency (default 4), exponential backoff, crash-recovery via "running > timeout → queued" sweep on startup. Queue persisted via the assertion API in the `meta` sub-graph (durability inherited from #640).

The sync `/promote` route stays unchanged — small interactive importers keep their answer-or-error shape; bulk importers opt into the async path.

Status

RFC only — no implementation in this PR. Sign-off on the surface + internals first, then implementation lands in a sibling `packages/publisher/src/async-promote-queue-impl.ts` of similar size to the existing `async-lift-publisher-impl.ts`.

Sequencing

Independent of #640 (WM persistence fix), #641 (ADRs), and #642 (importer helpers), but depends on #640 having landed before any implementation work begins — the queue's durability rests on WM-store persistence.

Test plan

  • RFC reads coherently; cross-references resolve
  • §7 "Rejected alternatives" anticipates the obvious counter-proposals
  • Reviewer: confirm the §3 surface is the right shape (HTTP verb / path / payload choices)
  • Reviewer: confirm "drain semantics on shutdown" (§6.2) — running → queued + exit is the proposed answer, but it's the most reversible decision and worth a fresh look
  • Reviewer: confirm sync `/promote` is NOT being deprecated

Related: #596.

Made with Cursor

Update — round 3 (2026-05-25): empirical motivation appendix

Adds §10 — Empirical motivation (rc.10 Graphify import) to the RFC with concrete data from importing this repository's Graphify output (26,960 nodes / 63,003 edges across 17 partitions) on an rc.10 sandbox daemon:

  • §10.1 cascading write degradation — per-partition table showing write latency degrade ~200× from baseline (5-7 ms/batch → 1.2 s/batch and worse) once cumulative SWM grows past ~100k triples, because sync /promote shares a worker with write ingestion. The 17th partition (packages-cli, 72k triples) took 3 minutes for the write step alone.
  • §10.2 10 MB gossip-cap failures — 4 of 17 partitions exceeded the cap when promoted with entities: "all" (24,599 KB, 11,993 KB, 10,788 KB, 20,686 KB). Total wall-clock in halve-and-retry: ~15 minutes on top of legitimate work.
  • §10.3 production reliability — single fetch failed socket error during a multi-minute promote kills the importer. The sync path has no retry classification; §3.3 failed_retrying + §4.4 attempt commit marker move that responsibility into the daemon, where it has the lease information and persistence to do it correctly.
  • §10.4 "in front of" vs "underneath" — explains why the async queue is the right layer instead of an internally-batched sync /promote: the importer's pain is wall-clock per partition, not request shape. A 5-minute sync call that quietly chunks is still a 5-minute sync call — the importer can't pipeline the next partition's write, can't show progress, and a TCP reset still kills the operation.

Also resolves round-3 codex line comments in the same commit series:

  • L87 root validation — Non-goals section now explicitly states permissive root handling matching sync /promote (no stricter validation in async).
  • L118 MAX_BODY_BYTESSMALL_BODY_BYTES — §3.1 cites the actual 256 KB limit the daemon enforces on /promote today, not the 10 MB /write limit. This was caught by codex and validated empirically.
  • L300 uniqueness key — Now uses the full (contextGraphId, subGraphName, assertionName) identity in §2 (Non-goals) and §4.5 (Concurrency caps); previously was (contextGraphId, assertionName) which would have incorrectly serialized different sub-graphs' jobs.
  • L422 shutdown safety — Drain semantics no longer says "mark running → queued + exit" (which could double-execute multi-step promotes). Now uses lease-expiry recovery (§4.4) as the only automatic recovery path.

PR #602 (Graphify importer) showed that the synchronous
POST /api/assertion/<name>/promote step dominates wall-clock time on
multi-thousand-partition imports. Each promote blocks the importer
through the daemon's quad-subtraction, WM→SWM transfer, and graph-
manager bookkeeping; for a 10k-file import, the cumulative blocking
latency is the import's bottleneck.

The existing AsyncLiftPublisher (packages/publisher) already solves the
analogous problem for the next memory tier (SWM→VM): enqueue, durable
retry, structured status. This RFC proposes a direct analogue —
AsyncPromoteQueue — for WM→SWM, with:

- POST /api/assertion/<name>/promote-async (enqueue)
- GET  /api/assertion/promote-async/<jobId> (status)
- DELETE / POST .../recover (cancel / requeue)
- In-process worker with bounded concurrency, exponential backoff,
  crash-recovery via "running > timeout → queued" sweep on startup.

The spec deliberately leaves the sync /promote route untouched — small
interactive importers keep their answer-or-error shape; bulk importers
opt into the async path. The queue is persisted via the same assertion
API the WM uses, so #640's durability story covers it.

Implementation is out of scope for this PR; sign-off on the surface +
internals first.

Independent of #640 (WM persistence fix), #641 (ADRs), and #642
(importer helpers), but depends on #640 having landed before any
implementation work begins.

Related to #596.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated
}
```

`state` transitions: `queued → running → (succeeded | failed)`. A failed
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: failed is defined here as a state the daemon may still retry (lastError.retryable: true), but the importer example in §5 stops polling and marks the partition failed as soon as it sees failed. That means a transient promote error can be reported as terminal even while the worker is still backing off and retrying. Reserve failed for exhausted-retry terminal failures, or add a separate retrying/backoff state and document that polling clients should keep waiting.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: failed_retrying is now the transient retry state, failed is terminal, and the importer example keeps polling retryable failures.

Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated
}
```

Persisted in the daemon's `meta` sub-graph as RDF triples under a
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: Persisting queue state in the context graph's meta sub-graph via the assertion API creates a circular dependency: the queue's own bookkeeping now depends on the same WM/assertion path it is supposed to recover when promote calls fail. It also mixes internal control-plane state into graph data. AsyncLiftPublisher stores jobs in dedicated local control graphs instead; this spec should follow that pattern rather than routing queue persistence through assertions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: the queue is now persisted in a dedicated control graph, not in a project meta sub-graph through the assertion API.

Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated

- **Daemon crash mid-job**: the job is still `running` in the persisted
store. On startup, the worker scans for `running` jobs older than
`runningTimeoutMs` (default 60s) and transitions them back to `queued`
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: Re-queuing any running job older than 60s will duplicate healthy long-running promotes. Large assertions can legitimately spend longer than a minute in subtraction/gossip/backpressure, so this timeout can schedule a second worker against the same assertion while the first one is still mutating it. Use a renewable lease/heartbeat (like the async publisher's claim lock) or only reclaim running jobs after a confirmed process restart.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: fixed reclaim semantics to use a renewable lease/heartbeat rather than a fixed age timeout for running jobs.

Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated
}
```

`entities` is bounded by the same `ROOT_CHUNK = 1000` budget as the sync
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: This claims the async route has the same ROOT_CHUNK = 1000 contract as sync /promote, but the current sync API accepts entities: "all" and does not enforce a 1000-root cap. That makes the later 'SAME code path' claim inaccurate and creates an avoidable API mismatch. Please either align async with the existing promote contract or call out the intentional incompatibility explicitly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 33817c8: async now explicitly follows the sync promote contract, including entities: all or explicit arrays, lenient root filtering, and the SMALL_BODY_BYTES budget.

Four fixes to the proposal:

- State machine: split terminal `failed` from transient `failed_retrying`.
  Importer polling clients now have a documented signal for "daemon will
  retry on its own — keep waiting"; the canonical poll example was
  treating any non-`succeeded` non-`failed` state correctly anyway but
  the spec did not say so explicitly.

- Persistence: move queue state out of the project `meta` sub-graph and
  into a dedicated `urn:dkg:promote-queue:control-plane` control graph
  (mirroring AsyncLiftPublisher's `urn:dkg:publisher:control-plane`).
  Avoids the circular dependency of routing queue bookkeeping through
  the same assertion-write path the queue is meant to recover, and
  keeps daemon control-plane state out of user queries.

- Reclaim: replace the fixed 60s "running" age threshold with a
  renewable lease (claimToken + claimLeaseExpiresAt), same pattern
  AsyncLiftPublisher uses today. Long-but-healthy promotes keep
  renewing and stay claimed; only truly stalled jobs (process gone,
  GC pause longer than lease TTL) become reclaimable.

- Root-cap accuracy: the sync `/promote` route does not enforce a
  server-side ROOT_CHUNK=1000 cap today (it accepts `entities: "all"`
  and anything under MAX_BODY_BYTES). Reworded §3.1 so the async route
  inherits the same lenient acceptance instead of inventing a
  contract that sync does not honour.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated
assertionName: string;
subGraphName: string;
entities: string[];
state: 'queued' | 'running' | 'succeeded' | 'failed';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This public state enum omits failed_retrying, but §3.2 and the importer polling example both rely on that transient state. If implementers follow this sketch, valid status payloads become unrepresentable and clients may treat retryable failures as terminal. Add failed_retrying everywhere the API/model defines job states.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: the public PromoteJobState enum now includes failed_retrying everywhere the API/model lists states.

Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated
A single `AsyncPromoteWorker` runs in-process inside the daemon (no
separate worker thread — promote is not CPU-bound). It:

1. Polls the control graph for jobs in `queued` state, ordered by
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: The worker only polls queued jobs here, but step 4 moves retryable failures to failed_retrying and never describes a transition back to queued. As written, a job that fails once with a retryable error will never run again. Either have the scheduler pick up failed_retrying jobs whose nextRetryAt <= now, or explicitly flip them back to queued when the backoff expires.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: runnable jobs now include failed_retrying rows whose nextRetryAt has expired, or an equivalent tick can flip them back to queued.

Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated
5. On `attempts >= maxAttempts` (default 5) or non-retryable error,
transitions to terminal `failed`. Requires explicit `recover` to
retry.
6. Concurrency: at most N (default 4) running jobs per daemon, matching
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This says promotions stay 1 per assertion at a time, but the design only describes a claim CAS on an individual job row. If two promote-async requests target the same (contextGraphId, assertionName), different workers can still claim both jobs and run concurrent promotes for the same assertion. Add a per-assertion lock/uniqueness rule, or reject duplicate enqueues for the same assertion while one job is queued or running.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 33817c8: concurrency and dedupe now use the full assertion identity, contextGraphId plus subGraphName plus assertionName, and block concurrent jobs for the same assertion.

Three correctness gaps Codex flagged on the revised RFC:

1. The TypeScript PromoteJob interface omitted `failed_retrying` from
   `state` even though §3.2 and the importer polling example both rely
   on it. Hoisted the state list into a dedicated `PromoteJobState`
   type alias and added a note that every public surface MUST
   reference it; updated the JSON example state list accordingly.

2. The worker loop only polled `queued` jobs but §4.2 step 4
   transitions retryable failures to `failed_retrying` and never
   describes how they get back. Reworded step 1 to make a job
   "runnable" iff `state = queued` OR
   (`state = failed_retrying` AND `nextRetryAt <= now`). Step 4 now
   explicitly clears the claim so the rescheduler picks the job up.
   Documents two equivalent implementation options (direct schedule
   vs flip-back-to-queued tick task).

3. The "1 promote per assertion at a time" claim was only enforced
   per-job by the claim CAS, so two concurrent `promote-async` calls
   against the same `(contextGraphId, assertionName)` could end up
   with two workers running. Added defence-in-depth: enqueue-time
   409 Conflict if a job is already queued/running/failed_retrying for
   the same pair, AND a claim-time exclusion in the runnable query.
   Matches the sync `/promote` route's behaviour today.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated
`(contextGraphId, assertionName)`. Multi-CG bulk promotes are out of
scope (importers that want them can enqueue many jobs).
- **Promote-side validation changes.** The async path applies the same
validation as the sync path; no relaxation of "root URI must exist in
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This says the async path keeps the sync contract and then defines a stricter root-validation rule than the current /promote implementation. Today assertion.promote does not fail when a requested root is missing from WM; it filters the quads and can return promotedCount: 0. If the async queue starts rejecting those requests, retries and resumptions that are currently harmless become breaking failures. Please align the spec with the existing permissive behavior, or call out the stricter validation as an intentional contract change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 33817c8: the spec now preserves sync promote permissive missing-root behavior instead of adding stricter async validation.

Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated
`ROOT_CHUNK ≤ 1000` — for symmetry with the sync route's request-size
budget. Note that the **sync** `/promote` route does not currently
enforce a per-call root cap server-side (it accepts `entities: "all"`
and any explicit array that fits under `MAX_BODY_BYTES`); the async
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: The sync /promote route is not actually bounded by MAX_BODY_BYTES today; the daemon reads that endpoint with SMALL_BODY_BYTES (256 KB). Basing the async design on the larger limit will mis-spec request sizing and queue/backoff behavior. Please either reference the real current limit here or explicitly say the sync route must be widened before this async contract claims parity.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 33817c8: request sizing now references the real sync promote SMALL_BODY_BYTES limit instead of MAX_BODY_BYTES.

Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated
6. **Concurrency caps**:
- At most N (default 4) running jobs per daemon, matching ADR 0002's
"max concurrent assertions" guidance.
- **At most 1 active job per `(contextGraphId, assertionName)` pair.**
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: Uniqueness keyed only by (contextGraphId, assertionName) is too coarse. Assertions are also scoped by subGraphName in the current implementation, so two different assertions with the same name in different sub-graphs would incorrectly 409 each other or be serialized behind one queue slot. Use the full assertion identity for dedupe/claiming (at least contextGraphId + subGraphName + assertionName).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 33817c8: dedupe and claim locking now key on contextGraphId plus subGraphName plus assertionName, not contextGraphId plus assertionName only.

Comment thread docs/specs/SPEC_ASYNC_PROMOTE_QUEUE.md Outdated
2. **Drain semantics on shutdown.** Two options:
- **Wait for running jobs to finish.** Cleanest, but a 5-min running
job blocks `dkg stop` for 5 min. Bad for ops.
- **Mark running → queued + exit.** Some jobs run twice. The promote
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: Mark running → queued + exit is not safe with the current promote flow. assertionPromote mutates SWM, WM cleanup, lifecycle metadata, and workspace-operation metadata in separate steps, so a shutdown after the SWM insert is not a pure no-op rerun. Requeueing that job on stop can duplicate metadata/gossip or re-run against a partially moved assertion. This section needs the same lease-expiry recovery story as crash handling, or an explicit idempotency/commit marker, before recommending this shutdown behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 33817c8: shutdown no longer rewrites running jobs to queued. Recovery now requires lease expiry plus an idempotency or commit marker, with ambiguous partial promotes parked instead of blindly rerun.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review skipped: filtered diff is 33093 lines (cap: 5,000). Please consider splitting this into smaller PRs for reviewability.

Appends §10 with three concrete data points pulled from importing a
real codebase graph (Graphify: 26,960 nodes / 63,003 edges) against
an rc.10 sandbox daemon:

- Cascading write degradation: write latency drops ~200x from baseline
  (5-7 ms/batch -> 1.2 s/batch and worse) once cumulative SWM grows
  past ~100k triples, because sync /promote shares a worker with
  write ingestion. Per-partition table included.
- 10 MB gossip cap failures: 4 of 17 partitions exceeded the cap
  with entities:"all". Total wall-clock in halve-and-retry: ~15
  minutes on top of legitimate write/promote work.
- Production reliability: a single fetch-failed socket error during
  a multi-minute promote kills the importer with no recovery in the
  sync path.

Also adds §10.4 explaining why "in front of" (async queue) is the
right layer instead of "underneath" (sync /promote with internal
batching) — the importer's pain is wall-clock per partition, not
request shape.

The empirical numbers strengthen the case made by every existing
section but are quoted in one appendix so the body remains an RFC,
not a benchmark report.

Co-authored-by: Cursor <cursoragent@cursor.com>
branarakic pushed a commit that referenced this pull request May 25, 2026
…mport

Adds a short "Empirical motivation" section between the Consequences
and Rejected alternatives lists, with concrete data from importing
this repo's Graphify output (26,960 nodes / 63,003 edges) on rc.10:

- Hand-rolled importer from dkg-node SKILL.md alone: ~330 LOC, half
  of which is cap-discovery ceremony.
- Same workload using rc.10's existing scripts/lib/dkg-daemon.mjs
  DkgClient: ~half the LOC, gaps remaining are the ones #642 and #643
  close.
- Four of 17 partitions exceeded the 10 MB gossip cap; total wall-
  clock in halve-and-retry: ~15 minutes on top of legitimate work.
- Write latency degrades ~200x from baseline once SWM grows past
  ~100k triples; that cost is what motivates #643 but requires the
  chunking discipline this ADR formalises.

Keeps the headlines so the contract's "why these numbers" question
has a concrete answer for reviewers; the full per-partition data
lives in the rc.10 test artifacts referenced in the PR description.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review skipped: filtered diff is 33180 lines (cap: 5,000). Please consider splitting this into smaller PRs for reviewability.

branarakic pushed a commit that referenced this pull request May 25, 2026
Adds §1.1 to dkg-importer/SKILL.md with the three hard caps an
agent will hit at scale, plus the verbatim daemon error text for
each one:

- /write: MAX_BODY_BYTES = 10 MB body cap (HTTP 413)
- /promote: SMALL_BODY_BYTES = 256 KB body cap (HTTP 413) — same
  status code as /write but a different limit, easy to confuse
- /promote: 10 MB gossip-message cap (HTTP 500 "too large for
  gossip"), independent of the body cap and triggerable even with
  a single root URI

Expands §5 Error handling so each of the three failure modes has
its own recipe instead of one generic "halve and retry on 413"
block. Calls out that 413-on-promote and 500-gossip are orthogonal
recovery paths, both of which an rc.10 importer needs until #643's
async promote queue lands.

The cap numbers and error strings were validated empirically during
the rc.10 Graphify import. Quoting them inline turns "discover via
413" into "look up before you start", which is the difference
between a 5-minute import and a 20-minute import for non-trivial
graphs.

Co-authored-by: Cursor <cursoragent@cursor.com>
branarakic pushed a commit that referenced this pull request May 25, 2026
Brings in §10 'Empirical motivation (rc.10 Graphify import)' with:
- §10.1 cascading write degradation table (~200x slowdown past 100k SWM)
- §10.2 the 4 partitions that exceeded the 10 MB gossip cap (~15 min
  total wall-clock in halve-and-retry)
- §10.3 production reliability (single fetch failed -> crash with no
  retry classification on sync /promote)
- §10.4 why 'in front of' (async queue) beats 'underneath' (sync with
  internal batching) as a fix layer
branarakic pushed a commit that referenced this pull request May 25, 2026
Adds a "What changed in round 3" section to INTEGRATION_NOTES_GRAPHIFY.md
summarising the per-PR updates folded into integration after running the
stack against a real Graphify codebase import on rc.10:

- #641: empirical motivation in ADR 0002 + portable scanner ref +
  dead-link cleanup
- #642: §1.1 caps reference with verbatim error strings + manifest
  hardening (idempotent create, ms tie-breaker, loud empty bindings)
- #643: SMALL_BODY_BYTES correction + full assertion identity for
  uniqueness + lease-expiry shutdown + §10 empirical appendix

Notes that rc.10 import test artifacts are kept local, not committed.

Co-authored-by: Cursor <cursoragent@cursor.com>
branarakic pushed a commit that referenced this pull request May 25, 2026
Adds §1.1 to dkg-importer/SKILL.md with the three hard caps an
agent will hit at scale, plus the verbatim daemon error text for
each one:

- /write: MAX_BODY_BYTES = 10 MB body cap (HTTP 413)
- /promote: SMALL_BODY_BYTES = 256 KB body cap (HTTP 413) — same
  status code as /write but a different limit, easy to confuse
- /promote: 10 MB gossip-message cap (HTTP 500 "too large for
  gossip"), independent of the body cap and triggerable even with
  a single root URI

Expands §5 Error handling so each of the three failure modes has
its own recipe instead of one generic "halve and retry on 413"
block. Calls out that 413-on-promote and 500-gossip are orthogonal
recovery paths, both of which an rc.10 importer needs until #643's
async promote queue lands.

The cap numbers and error strings were validated empirically during
the rc.10 Graphify import. Quoting them inline turns "discover via
413" into "look up before you start", which is the difference
between a 5-minute import and a 20-minute import for non-trivial
graphs.

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic branarakic merged commit ad3abb9 into main May 25, 2026
35 checks passed
branarakic pushed a commit that referenced this pull request May 25, 2026
… via HTTP

Two PR #642 follow-ups + the originally-scheduled SKILL.md update for the
async promote queue (the bulk of PR #4's documentation work, deferred until
#642 merged because that's where dkg-importer/SKILL.md first landed).

1. packages/cli/skills/dkg-importer/SKILL.md — new §6 "Async promote queue"
   - Route inventory (POST/GET/DELETE /api/assertion/.../promote-async + recover)
   - Async write loop (drop-in for §2's synchronous promote step)
   - Failure-classification table mapping attempt.lastError.classification
     (transient | cap_exceeded | fatal) to importer actions
   - Migration guidance vs. the sync /promote route (still supported)
   - Inspecting in-flight jobs alongside the manifest in §3
   New cheat-sheet variant under §8 mirroring the async loop.
   §5's forward reference to "PR #643's async promote queue ... until that
   lands" rewritten to point at the now-shipping §6.

2. New endpoint GET /.well-known/skill-importer.md (Codex PR #642 follow-up)
   The cross-link from dkg-node/SKILL.md to dkg-importer/SKILL.md was
   unreachable for agents installed via the setup flow because only the
   former was served from packages/cli/src/daemon/manifest.ts's
   loadSkillTemplate. Adds a sibling loadImporterSkillTemplate +
   /.well-known/skill-importer.md route in status.ts (same auth-public,
   ETag-cacheable shape as /.well-known/skill.md), plus PUBLIC_GET_PATHS /
   PUBLIC_HEAD_PATHS / isLoopbackRateLimitExemptPath allowlist entries.
   Cross-link in dkg-node/SKILL.md now points at the runtime URL.
   4 new tests (auth public path + 4 dkg-importer skill content pins).

3. scripts/lib/manifest.mjs — partitionDeclared helper (Codex PR #642
   quadratic-validation follow-up). markPartitionStatus used to call
   loadImportManifest on every status write, materialising the full
   manifest from SWM. For a 10k-partition import marking each twice
   (in_progress + done), that's 20k full SWM round-trips. Replaced with a
   single-row SPARQL ASK: one query, one Boolean, same error semantics
   when a partition isn't declared. 4 new tests including a regression
   pin asserting markPartitionStatus issues exactly one query.

Tests: 27/27 (scripts/lib manifest), 32/32 (publisher async-promote-queue),
190/190 (CLI unit suites incl. new skill-endpoint cases, async-promote
routes/worker/E2E, auth allowlist). tsc green. Zero lints.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant