Skip to content

feat(importer): manifest library + dkg-importer SKILL.md#642

Merged
branarakic merged 11 commits into
mainfrom
feat/dkg-importer-helpers
May 25, 2026
Merged

feat(importer): manifest library + dkg-importer SKILL.md#642
branarakic merged 11 commits into
mainfrom
feat/dkg-importer-helpers

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

@branarakic branarakic commented May 25, 2026

Summary

Implements the resumable-import pattern from ADR 0002 (#641) and gives every agent / human importer a single canonical reference for chunked bulk writes against a DKG node.

`scripts/lib/manifest.mjs` — Resumable-import manifest helpers

  • `createImportManifest` writes the import as an RDF assertion in the project's `meta` sub-graph.
  • `markPartitionStatus` appends status events (append-only — latest event wins on read).
  • `loadImportManifest` + `pendingPartitions` resolve current per-partition status in one SPARQL round-trip, so an interrupted import resumes from the next pending partition.
  • Itself respects the ADR 0002 chunking contract by reusing `DkgClient.writeAssertion`.

`scripts/lib/tests/manifest.test.mjs` — Six unit tests for pure helpers (URI encoding, ontology constants, triple building, status filtering). Daemon-roundtrip coverage stays in the repro suite.

`packages/cli/skills/dkg-importer/SKILL.md` (NEW) — Agent-readable manual codifying:

  • Chunking contract from ADR 0002 (CHUNK=5000, ROOT_CHUNK=1000)
  • Worked examples in TypeScript + Python
  • Manifest pattern walk-through
  • Canonical URIs from ADR 0003
  • Anti-patterns + HTTP error handling
  • One-page cheat sheet

`packages/cli/skills/dkg-node/SKILL.md` — One-line cross-reference under the assertion-write tool table, pointing agents to `dkg-importer/SKILL.md` when pushing >5k quads in one go.

Test plan

  • `node --test scripts/lib/tests/manifest.test.mjs` → 6/6 pass
  • No new linter errors
  • Reviewer: confirm `DkgClient.writeAssertion`'s 500-quad default `batchSize` is the right conservative default for the docs to advertise
  • Reviewer: confirm the manifest's append-only status pattern is acceptable (SPARQL DELETE/INSERT would be tidier but the daemon doesn't expose it on assertions)

Stack

This PR is stacked on #641 (ADRs). Once #641 lands, this rebases against main.

Related: #596, #636, #640.

Made with Cursor

Update — round 3 (2026-05-25): known caps + manifest hardening

dkg-importer/SKILL.md — Known caps with exact error strings (§1.1)

Adds a new §1.1 reference table calling out the three independent daemon caps an importer must handle, with the verbatim daemon error text for each:

Endpoint Cap Constant Error
POST /api/assertion/<name>/write 10 MB request body MAX_BODY_BYTES HTTP 413 "Request body too large (>10485760 bytes)"
POST /api/assertion/<name>/promote 256 KB request body SMALL_BODY_BYTES HTTP 413 "Request body too large (>262144 bytes)"
POST /api/assertion/<name>/promote 10 MB gossip message hard-coded in gossipsub publish HTTP 500 "Promoted assertion too large for gossip (XXXX KB, limit 10 MB)"

The two /promote caps are independent — same status code (413) for the body cap, but a different limit from /write's — and the 10 MB gossip cap can trip even with a single root URI if that root's transitive triple set is large enough. §5 (Error handling) is rewritten so each failure mode has its own recipe instead of one generic "halve on 413" block.

Numbers and error strings validated empirically during the rc.10 Graphify import. Quoting them inline turns "discover via 413" into "look up before you start", which is the difference between a 5-minute import and a 20-minute import for non-trivial graphs.

scripts/lib/manifest.mjs — round-3 hardening

Three round-3 codex line comments fixed (each with a new unit test, 13/13 pass):

  • L222 — createImportManifest not idempotent on already exists: now catches the daemon's "already exists" error from /api/assertion/create and treats it as successful reuse, so a retry / second-process resume of the same import doesn't crash before reading progress.
  • L369 — same-millisecond status events: the SPARQL "latest event per partition" filter now includes a deterministic IRI tie-breaker (?ts2 = ?recordedAt && STR(?ev2) > STR(?ev)) so two writes within one ms produce a deterministic ordering instead of nondeterministic status selection.
  • L375 — empty bindings ≠ complete: loadImportManifest now throws loudly when the SPARQL returns zero bindings instead of silently returning {partitions: []} (which previously made resume callers think the import was complete when the manifest was actually missing / not yet visible in SWM).

Update — round 4 (2026-05-25 22:50): post-empirical-motivation Codex review

Round-4 review against the round-3 push surfaced five issues — three real bugs that would have shipped, one CodeQL finding, and one runnable-example fix. All addressed inline with new unit-test coverage (18/18 pass).

CRITICAL — loadImportManifest was reading the bare data graph, not SWM (L391)

/api/query with subGraphName alone routes to the bare data graph did:dkg:context-graph:<cg>/<sub>, NOT the SWM tier …/_shared_memory. Since the manifest is created via the assertion API (create → write → promote), it ONLY exists in SWM. So a peer/restart resume calling loadImportManifest would silently see zero bindings even on a healthy import — and the §L375 "loud failure" fix would then throw "manifest is missing" for a manifest that actually exists.

Fix: plumb graphSuffix (and view for completeness) through DkgClient.query. loadImportManifest now passes graphSuffix: '_shared_memory' so the query hits the right graph. The mock client in tests now also routes by graphSuffix, so a regression where the SWM hint is dropped fails the test suite immediately. A new dedicated test pins the call shape.

CRITICAL — statusEventUri same-millisecond ordering was non-deterministic (L84)

The previous shape <ts>-<6-char random> meant two events in the same millisecond would tie on recordedAt AND tie on a random token, leaving SPARQL's lexicographic tie-breaker non-deterministic.

Fix: new shape <ts>-<12-digit monotonic counter>-<6-char random>. The counter is process-local and zero-padded so lexicographic ordering matches call order for same-process same-ms events. Cross-process same-ms collisions fall back to the random component — no worse than before. New regression test asserts strictly-increasing IRIs across 50 back-to-back calls.

CodeQL #90 — double-unescape risk in unquote (L138)

The previous chain of four .replace() passes had a textbook double-unescape pitfall: \\\\n (escaped backslash + literal n) could collapse to a newline because the second pass would re-interpret the \ left behind by the first pass.

Fix: single-pass regex .replace(/\\(["\\nr])/g, replacer) so each escape sequence matches exactly once. New regression test round-trips literal \n, \\, \", mixed strings through lit() / unquote() byte-for-byte.

importId sanitization for default assertion name (L222)

importUri() / partitionUri() percent-encode any importId, but the daemon's validateAssertionName rejects /, whitespace, and other IRI-unsafe characters. A caller passing importId="corpus/2026-q1" would get valid URIs back but /api/assertion/create would 400.

Fix: new exported defaultManifestAssertionName() that maps unsafe character runs to -, length-clamps to the 256-char daemon limit, and throws a descriptive error if nothing valid remains (instead of crashing later with a cryptic 400). Two regression tests: parameterised sanitization assertions + an end-to-end "create/load with an unsafe importId works" check.

Python sample fix (L127)

The Python snippet in dkg-importer/SKILL.md was missing import os and used an undefined port. Fixed: added the import, defaulted port to int(os.environ.get('DKG_PORT', '9200')), switched to a with block for the token file.

Test coverage

5 new tests added (18/18 total pass) — one per fix, all targeted at the exact failure mode codex flagged, so a future regression on any of these would fail CI immediately.

Comment thread scripts/lib/manifest.mjs Outdated
contextGraphId: cgId,
assertionName: assertion,
subGraphName,
entities: [importUri(importId)],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: only the import root is promoted here. The partition subjects (urn:dkg:import:...#part:...) stay in WM, so after the promote loadImportManifest() cannot reliably read imp:key / imp:initialStatus from SWM and another node cannot resume the import. Promote the partition URIs alongside the import URI (chunked if needed).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: createImportManifest now promotes the import root and every partition URI, chunked by ROOT_CHUNK.

Comment thread scripts/lib/manifest.mjs
{ subject: evIri, predicate: IMPORT_P.status, object: lit(status) },
{ subject: evIri, predicate: IMPORT_P.recordedAt, object: dt(nowIso) },
];
await client.writeAssertion({
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this appends the status event to the manifest assertion but never promotes it. That leaves progress updates in WM only, so peers will still see the old status and a resume from another node can restart already-finished partitions. Promote the partition/event roots after each status write, or use the direct SWM path if that's the intended visibility.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: markPartitionStatus writes the status event and promotes it so progress becomes visible from SWM.

Comment thread scripts/lib/manifest.mjs Outdated
imp:initialStatus ?initial .
OPTIONAL {
{
SELECT ?part (SAMPLE(?status) AS ?latestStatus) (MAX(?recordedAt) AS ?latestRecordedAt) WHERE {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: SAMPLE(?status) is not correlated with MAX(?recordedAt). Once a partition has multiple events, this subquery can return the newest timestamp with an older status, so loadImportManifest() may report pending/failed for a partition that is actually done. Select the row with the max timestamp (for example via ORDER BY DESC(?recordedAt) LIMIT 1 per partition) instead of mixing independent aggregates.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: loadImportManifest uses a latest-row selection pattern instead of SAMPLE plus MAX, so status and timestamp stay correlated.

Comment thread scripts/lib/manifest.mjs
const bindings = res?.result?.bindings ?? res?.bindings ?? [];

const partitions = bindings.map((row) => {
const key = unquote(row.key);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: /api/query returns SPARQL-JSON binding cells ({ value, type, ... }) on the normal path, but this code treats row.key, row.part, row.latestStatus, etc. as raw strings. The returned manifest state will contain objects instead of strings and the sort/comparisons will misbehave. Normalize bindings through .value/.id first, the same way the other daemon clients do.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: query bindings are normalized from SPARQL JSON cells through .value before sorting/comparing.

Comment thread scripts/lib/manifest.mjs Outdated
contextGraphId: cgId,
assertionName: assertion,
subGraphName,
entities: [evIri],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: promoting only evIri does not publish the partIri -> imp:statusEvent -> evIri triple you just wrote on line 266. assertion.promote({ entities: [...] }) only moves quads whose subject is one of those roots, so after a restart or a peer-side resume loadImportManifest() still sees no status events and treats every partition as pending. Promote both partIri and evIri here, or flip the edge so the event points at the partition.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: markPartitionStatus promotes both partIri and evIri so the partition-to-event edge is visible to loadImportManifest.

/**
* Unit tests for the pure helpers in `scripts/lib/manifest.mjs`.
*
* Daemon-roundtrip behaviour (createImportManifest / markPartitionStatus /
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: this test file only exercises pure helpers, but the risky behavior in this PR is the WM→SWM round-trip through markPartitionStatus()/loadImportManifest(). A regression in promote root selection will pass CI unnoticed. Please add an integration-style test that creates a manifest, marks a partition done, promotes it, and then verifies loadImportManifest() reads done back from shared memory.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on this branch: added mock WM/SWM round-trip tests covering create, mark done, promote, and reload from shared memory.

branarakic pushed a commit that referenced this pull request May 25, 2026
…ration tests

Addresses Codex round-2 review on PR #642 (12:15Z):

1. markPartitionStatus now promotes BOTH `partIri` and `evIri`.
   Codex correctly flagged that the `partIri imp:statusEvent evIri`
   triple has subject `partIri` — `assertion.promote({entities})` only
   moves quads whose subject is in `entities`, so promoting only `evIri`
   left the partition→event edge in WM and a peer-side
   `loadImportManifest` couldn't see the new status. Re-promoting
   `partIri` is idempotent for its prior triples and ships the new edge.

2. `unquote()` now collapses SPARQL 1.1 results-JSON cells
   (`{value, type, datatype?, "xml:lang"?}`) to their `.value` before
   the literal-unquoting pass. The daemon's `bindingValue` helper
   already handles both shapes; this library was assuming only the
   flat-string shape Oxigraph emits. Added `bareUri()` for URI
   bindings and switched `loadImportManifest` to use it for `?part`.

3. New integration-style tests in
   scripts/lib/__tests__/manifest.test.mjs cover:
   - full WM→SWM round-trip with flat bindings (Oxigraph shape)
   - same round-trip with SPARQL-JSON cell bindings (sparql-http shape)
   - latest-event-wins across multiple status updates on one partition
   A mock client tracks the WM/SWM split as two in-memory sets and
   serves SPARQL queries from SWM, exactly matching the daemon's
   promote contract. Confirmed all three tests FAIL when
   markPartitionStatus reverts to promoting only `evIri` — that's the
   regression coverage Codex asked for.

4. SKILL.md updated to document the dual-promotion behaviour so
   downstream importers / agents implementing this contract don't
   re-introduce the bug.

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic branarakic force-pushed the feat/dkg-importer-helpers branch from 673dfca to c5e6d51 Compare May 25, 2026 13:07
Comment thread scripts/lib/manifest.mjs Outdated
}
const cgId = client.cgId;
const assertion = assertionName ?? `import-manifest-${importId}`;
await client.request('POST', '/api/assertion/create', {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this helper is meant for resumable imports, but /api/assertion/create returns already exists on retries. A transient failure after the create succeeds, or a second process resuming the same import, will now fail before it can load progress. Please treat an existing manifest assertion as a successful reuse (or add an explicit ensure flow) so createImportManifest() is idempotent.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3b04c99: createImportManifest now treats an existing manifest assertion as idempotent reuse and still runs the additive write/promote healing path.

Comment thread scripts/lib/manifest.mjs Outdated
FILTER NOT EXISTS {
?part imp:statusEvent ?ev2 .
?ev2 imp:recordedAt ?ts2 .
FILTER (?ts2 > ?latestRecordedAt)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this "latest event" filter only excludes strictly later timestamps. recordedAt is only millisecond precision, so two status writes in the same ms both survive and you can return duplicate rows for one partition with nondeterministic status selection. Add a deterministic tie-breaker for equal timestamps (for example compare ?ev as a secondary key) or use a monotonic sequence instead of raw ISO timestamps alone.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3b04c99: latest-event selection now uses the event IRI as a deterministic tie-breaker for equal millisecond timestamps, with a same-ms regression test.

Comment thread scripts/lib/manifest.mjs
}
`;
const res = await client.query({ sparql, contextGraphId: cgId, subGraphName });
const bindings = res?.result?.bindings ?? res?.bindings ?? [];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: createImportManifest() rejects empty partition lists, so an empty binding set here means the manifest was missing or not visible in SWM, not that the import is complete. Returning { partitions: [] } will make resume callers skip all remaining work on a typo / wrong sub-graph / failed promote. Please fail loudly when no manifest rows are found, or explicitly verify that the import root exists before treating this as success.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3b04c99: loadImportManifest now throws when no manifest rows are visible instead of returning an empty partition list, with a regression test.

Comment thread scripts/lib/manifest.mjs Outdated
}
}
`;
const res = await client.query({ sparql, contextGraphId: cgId, subGraphName });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: client.query() defaults to working-memory when no view/graph suffix is supplied, so this read will miss the manifest/status triples you just promoted to SWM. On a restart or peer resume, loadImportManifest() will report the manifest as missing even though it exists in shared memory. Plumb view: 'shared-working-memory' (or graphSuffix: '_shared_memory') through DkgClient.query and use it here, then update the mock/test to match the daemon's real default.

Comment thread scripts/lib/manifest.mjs Outdated
/** Stable-ish URI for one StatusEvent. Exported so callers can promote it. */
export function statusEventUri(importId, key) {
// Random-ish suffix keeps multiple events on the same partition unique.
const suffix = `${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: the same-millisecond tie-break in loadImportManifest() orders by event IRI, but this suffix is random, not monotonic. If two status updates for the same partition land in one millisecond, the older one can win nondeterministically and regress the visible status. Use an order-preserving suffix (for example timestamp + per-process counter) or persist an explicit sequence field and sort by that instead of a random token.


```python
import requests
token = open(os.path.expanduser('~/.dkg/auth.token')).read().strip()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: This copy-paste Python example is not runnable as written: it uses os.path.expanduser(...) without importing os, and it relies on port below without defining it. Add the missing setup (import os, a default port = 9200 or a function parameter) so the sample works when readers try it.

branarakic pushed a commit that referenced this pull request May 25, 2026
…mport

Adds a short "Empirical motivation" section between the Consequences
and Rejected alternatives lists, with concrete data from importing
this repo's Graphify output (26,960 nodes / 63,003 edges) on rc.10:

- Hand-rolled importer from dkg-node SKILL.md alone: ~330 LOC, half
  of which is cap-discovery ceremony.
- Same workload using rc.10's existing scripts/lib/dkg-daemon.mjs
  DkgClient: ~half the LOC, gaps remaining are the ones #642 and #643
  close.
- Four of 17 partitions exceeded the 10 MB gossip cap; total wall-
  clock in halve-and-retry: ~15 minutes on top of legitimate work.
- Write latency degrades ~200x from baseline once SWM grows past
  ~100k triples; that cost is what motivates #643 but requires the
  chunking discipline this ADR formalises.

Keeps the headlines so the contract's "why these numbers" question
has a concrete answer for reviewers; the full per-partition data
lives in the rc.10 test artifacts referenced in the PR description.

Co-authored-by: Cursor <cursoragent@cursor.com>
branarakic pushed a commit that referenced this pull request May 25, 2026
…ration tests

Addresses Codex round-2 review on PR #642 (12:15Z):

1. markPartitionStatus now promotes BOTH `partIri` and `evIri`.
   Codex correctly flagged that the `partIri imp:statusEvent evIri`
   triple has subject `partIri` — `assertion.promote({entities})` only
   moves quads whose subject is in `entities`, so promoting only `evIri`
   left the partition→event edge in WM and a peer-side
   `loadImportManifest` couldn't see the new status. Re-promoting
   `partIri` is idempotent for its prior triples and ships the new edge.

2. `unquote()` now collapses SPARQL 1.1 results-JSON cells
   (`{value, type, datatype?, "xml:lang"?}`) to their `.value` before
   the literal-unquoting pass. The daemon's `bindingValue` helper
   already handles both shapes; this library was assuming only the
   flat-string shape Oxigraph emits. Added `bareUri()` for URI
   bindings and switched `loadImportManifest` to use it for `?part`.

3. New integration-style tests in
   scripts/lib/__tests__/manifest.test.mjs cover:
   - full WM→SWM round-trip with flat bindings (Oxigraph shape)
   - same round-trip with SPARQL-JSON cell bindings (sparql-http shape)
   - latest-event-wins across multiple status updates on one partition
   A mock client tracks the WM/SWM split as two in-memory sets and
   serves SPARQL queries from SWM, exactly matching the daemon's
   promote contract. Confirmed all three tests FAIL when
   markPartitionStatus reverts to promoting only `evIri` — that's the
   regression coverage Codex asked for.

4. SKILL.md updated to document the dual-promotion behaviour so
   downstream importers / agents implementing this contract don't
   re-introduce the bug.

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic branarakic force-pushed the feat/dkg-importer-helpers branch from 3b04c99 to bb8b16e Compare May 25, 2026 20:37
branarakic pushed a commit that referenced this pull request May 25, 2026
…ivation

Brings in:
- #641: round-3 codex (portable scanner path; importer skill/manifest
  framed as future work) + empirical motivation subsection in ADR 0002
  (~330 LOC hand-rolled importer, ~15 min wall-clock in halve-and-retry).
- #642: round-3 codex (createImportManifest idempotency on 'already
  exists'; timestamp tie-breaker on equal-ms; loud failure on empty
  manifest bindings) + new §1.1 in dkg-importer/SKILL.md documenting
  the three hard caps and their verbatim error strings (10 MB
  /write body, 256 KB /promote body, 10 MB gossip).

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	packages/cli/skills/dkg-importer/SKILL.md
#	scripts/lib/__tests__/manifest.test.mjs
#	scripts/lib/manifest.mjs
Comment thread scripts/lib/manifest.mjs Outdated
}
}
`;
const res = await client.query({ sparql, contextGraphId: cgId, subGraphName });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: client.query({ sparql, contextGraphId, subGraphName }) uses the legacy data-graph route, which targets did:dkg:context-graph:<cg>/<subGraph> rather than the sub-graph's shared-memory graph. Since this manifest is only written via assertion WM→SWM promotion, a real daemon query here will come back empty even after a successful promote. Route this through the SWM graph explicitly (graphSuffix: '_shared_memory' or equivalent) and add a regression test that matches the real DkgClient.query semantics.

Comment thread scripts/lib/manifest.mjs Outdated
/** Stable-ish URI for one StatusEvent. Exported so callers can promote it. */
export function statusEventUri(importId, key) {
// Random-ish suffix keeps multiple events on the same partition unique.
const suffix = `${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: the same-millisecond tie-break in loadImportManifest() relies on later events sorting lexicographically higher by IRI, but this suffix is random. Two markPartitionStatus() calls in the same millisecond can therefore resolve to either status arbitrarily. Use a monotonic suffix/counter (or persist an explicit sequence field) so equal-timestamp events preserve write order.

Comment thread scripts/lib/manifest.mjs Outdated
throw new Error('createImportManifest requires `subGraphName` (typically "meta").');
}
const cgId = client.cgId;
const assertion = assertionName ?? `import-manifest-${importId}`;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: the default assertion name embeds raw importId, but assertion names reject /, whitespace, and other unsafe characters. importUri()/partitionUri() already accept arbitrary IDs via percent-encoding, so a caller can pass an importId that produces valid manifest URIs but makes /api/assertion/create fail here. Either sanitize the default assertion name or validate importId up front and throw a clear error.

branarakic pushed a commit that referenced this pull request May 25, 2026
Adds a "What changed in round 3" section to INTEGRATION_NOTES_GRAPHIFY.md
summarising the per-PR updates folded into integration after running the
stack against a real Graphify codebase import on rc.10:

- #641: empirical motivation in ADR 0002 + portable scanner ref +
  dead-link cleanup
- #642: §1.1 caps reference with verbatim error strings + manifest
  hardening (idempotent create, ms tie-breaker, loud empty bindings)
- #643: SMALL_BODY_BYTES correction + full assertion identity for
  uniqueness + lease-expiry shutdown + §10 empirical appendix

Notes that rc.10 import test artifacts are kept local, not committed.

Co-authored-by: Cursor <cursoragent@cursor.com>
Branimir Rakic and others added 5 commits May 25, 2026 22:42
Implements the resumable-import pattern from ADR 0002 and gives every
agent / human importer a single canonical reference for chunked bulk
writes against a DKG node.

scripts/lib/manifest.mjs:
- buildInitialManifestTriples / createImportManifest write the import
  graph as an RDF assertion in the project's `meta` sub-graph.
- markPartitionStatus appends status events (append-only — no
  DELETE/INSERT needed; latest event wins on read).
- loadImportManifest + pendingPartitions resolve the current per-
  partition status in one SPARQL round-trip, so an interrupted import
  can resume from the next pending partition without re-doing work.
- Itself respects the ADR 0002 chunking contract by reusing
  DkgClient.writeAssertion.

scripts/lib/__tests__/manifest.test.mjs:
- Six unit tests covering URI encoding, ontology constants, and the
  buildInitialManifestTriples / pendingPartitions pure helpers. The
  daemon-roundtrip behaviour stays covered by the repro suite in
  scripts/repro/.

packages/cli/skills/dkg-importer/SKILL.md (NEW):
- The agent-readable manual for bulk imports. Codifies the chunking
  contract from ADR 0002 (CHUNK=5000, ROOT_CHUNK=1000), gives
  worked examples in TypeScript + Python, walks through the manifest
  pattern, and links to ADR 0003 for canonical URIs.
- Includes anti-patterns ("don't bump MAX_BODY_BYTES to 1GB",
  "don't invent a graphify:* URI scheme"), HTTP error handling,
  and a one-page cheat sheet.

packages/cli/skills/dkg-node/SKILL.md:
- One-line cross-reference under the assertion-write tool table
  pointing agents to dkg-importer/SKILL.md when they're about to
  push >5k quads in one go.

Stacked on docs/importer-adrs (#641). Related to #596, #636, #640.

Co-authored-by: Cursor <cursoragent@cursor.com>
Three real correctness bugs Codex flagged on the manifest library, plus
one documentation note clarifying a binding-format concern that turned
out to be a false positive (the daemon's /api/query route returns flat
JSON-encoded literal strings, not SPARQL-1.1 results-JSON cells — same
shape the existing seed/drain scripts already consume via bareIri()).

- createImportManifest now promotes the import root AND every partition
  URI (chunked at ROOT_CHUNK=1000 per ADR 0002). Promoting only the root
  left partition subjects in WM, so any peer reading the manifest back
  from SWM could not see imp:key / imp:initialStatus, breaking resume
  across nodes.

- markPartitionStatus now promotes the new StatusEvent root after
  writing it. Previously the event sat in WM only, so SWM-visible reads
  of the manifest still reported the prior status and a resume on
  another node could restart an already-finished partition.

- loadImportManifest's "latest event per partition" sub-query replaced
  SAMPLE+MAX (decorrelated — the status could come from one row and
  the timestamp from another) with the standard "max row" idiom
  (`FILTER NOT EXISTS { ... ?ts2 > ?latestRecordedAt }`). The status
  now genuinely corresponds to the chosen timestamp.

- statusEventUri was renamed from a private eventUri to a public export
  so the promote step has a stable handle to the just-written event.
  Added a unit test covering its shape + uniqueness; existing 6 tests
  still green.

- Added a doc note above `unquote()` explaining the daemon's flat
  binding format, with cross-refs to the other scripts that depend on
  it. This is to prevent a future reader (or a future Codex pass) from
  trying to swap in SPARQL-1.1-results-JSON parsing and silently
  breaking everything.

- Updated dkg-importer SKILL.md to match the new promote-on-status
  behaviour and to call out the SAMPLE/MAX foot-gun for future
  importers writing similar queries.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ration tests

Addresses Codex round-2 review on PR #642 (12:15Z):

1. markPartitionStatus now promotes BOTH `partIri` and `evIri`.
   Codex correctly flagged that the `partIri imp:statusEvent evIri`
   triple has subject `partIri` — `assertion.promote({entities})` only
   moves quads whose subject is in `entities`, so promoting only `evIri`
   left the partition→event edge in WM and a peer-side
   `loadImportManifest` couldn't see the new status. Re-promoting
   `partIri` is idempotent for its prior triples and ships the new edge.

2. `unquote()` now collapses SPARQL 1.1 results-JSON cells
   (`{value, type, datatype?, "xml:lang"?}`) to their `.value` before
   the literal-unquoting pass. The daemon's `bindingValue` helper
   already handles both shapes; this library was assuming only the
   flat-string shape Oxigraph emits. Added `bareUri()` for URI
   bindings and switched `loadImportManifest` to use it for `?part`.

3. New integration-style tests in
   scripts/lib/__tests__/manifest.test.mjs cover:
   - full WM→SWM round-trip with flat bindings (Oxigraph shape)
   - same round-trip with SPARQL-JSON cell bindings (sparql-http shape)
   - latest-event-wins across multiple status updates on one partition
   A mock client tracks the WM/SWM split as two in-memory sets and
   serves SPARQL queries from SWM, exactly matching the daemon's
   promote contract. Confirmed all three tests FAIL when
   markPartitionStatus reverts to promoting only `evIri` — that's the
   regression coverage Codex asked for.

4. SKILL.md updated to document the dual-promotion behaviour so
   downstream importers / agents implementing this contract don't
   re-introduce the bug.

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds §1.1 to dkg-importer/SKILL.md with the three hard caps an
agent will hit at scale, plus the verbatim daemon error text for
each one:

- /write: MAX_BODY_BYTES = 10 MB body cap (HTTP 413)
- /promote: SMALL_BODY_BYTES = 256 KB body cap (HTTP 413) — same
  status code as /write but a different limit, easy to confuse
- /promote: 10 MB gossip-message cap (HTTP 500 "too large for
  gossip"), independent of the body cap and triggerable even with
  a single root URI

Expands §5 Error handling so each of the three failure modes has
its own recipe instead of one generic "halve and retry on 413"
block. Calls out that 413-on-promote and 500-gossip are orthogonal
recovery paths, both of which an rc.10 importer needs until #643's
async promote queue lands.

The cap numbers and error strings were validated empirically during
the rc.10 Graphify import. Quoting them inline turns "discover via
413" into "look up before you start", which is the difference
between a 5-minute import and a 20-minute import for non-trivial
graphs.

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic branarakic changed the base branch from docs/importer-adrs to main May 25, 2026 20:42
@branarakic branarakic force-pushed the feat/dkg-importer-helpers branch from bb8b16e to 36e4c22 Compare May 25, 2026 20:42
Comment thread scripts/lib/manifest.mjs Fixed
Five fixes from the round-4 review against the round-3 push:

1. SWM query routing (codex #642 L391, CRITICAL): the daemon's
   /api/query default routing for `subGraphName` alone reads the
   bare data graph, not the SWM tier. Since the manifest is created
   via the assertion API (create -> write -> promote), it only
   lives in SWM. Plumb `graphSuffix` (and `view`) through
   DkgClient.query, then have loadImportManifest pass
   `graphSuffix: '_shared_memory'`. Without this, a peer/restart
   resume would silently see zero manifest rows even on a healthy
   import. The mock client now routes by graphSuffix too, and a
   new regression test pins the call shape.

2. Monotonic statusEventUri suffix (codex L84, CRITICAL):
   `statusEventUri()` was `<ts>-<random>`. Two writes in the same
   millisecond therefore tied on both `recordedAt` AND a random
   token, so SPARQL's lexicographic tie-breaker picked a
   non-deterministic winner. New shape is `<ts>-<12-digit
   counter>-<random>` so within-process same-ms events sort by
   call order. Cross-process collisions still fall back to the
   random component, which is no worse than before. New
   regression test asserts strictly-increasing IRIs across 50
   back-to-back calls.

3. Single-pass unquote() (CodeQL #90, L138): the previous four
   chained .replace() calls had a textbook double-unescape
   pitfall (`\\\\n` could collapse to a newline). Replaced with a
   single regex + replacer that matches each escape exactly once.
   New regression test round-trips literal `\n`, `\\`, `\"`, etc.
   through lit() / unquote() byte-for-byte.

4. importId sanitization for default assertion name (codex L222):
   `importUri()` / `partitionUri()` percent-encode any importId,
   but the daemon's `validateAssertionName` rejects `/`,
   whitespace, and other IRI-unsafe chars. The default assertion
   name now goes through a new exported `defaultManifestAssertionName()`
   that maps unsafe runs to `-`, length-clamps to the 256-char
   limit, and throws a descriptive error if nothing valid remains
   (instead of crashing /api/assertion/create with a cryptic 400).
   New regression tests cover sanitization, length cap, and the
   "all-unsafe" error case.

5. Python example fixes (codex L127): the README Python snippet
   was missing `import os` and referenced an undefined `port`.
   Added the import, defaulted port to `int(os.environ.get('DKG_PORT', '9200'))`,
   and used a `with` block for the token file.

All 18 unit tests pass; 5 new tests cover each of the five fixes
above.

Co-authored-by: Cursor <cursoragent@cursor.com>
branarakic pushed a commit that referenced this pull request May 25, 2026
Round-4 codex review against the round-3 push surfaced five issues,
all addressed inline:

1. CRITICAL: loadImportManifest queried the bare data graph instead of
   SWM, so resume from a peer/restart would silently see zero manifest
   rows on a healthy import. Plumb graphSuffix='_shared_memory' through
   DkgClient.query.
2. CRITICAL: same-millisecond statusEventUri suffixes used pure random;
   replaced with monotonic per-process counter so SPARQL tie-breaker
   matches call order.
3. CodeQL #90: unquote()'s chained .replace() had a textbook double-
   unescape pitfall; replaced with a single regex + replacer.
4. importId sanitization for the default assertion name (rejects /,
   whitespace, IRI-unsafe chars per validateAssertionName).
5. Python example: missing 'import os' + undefined 'port'.

5 new unit tests cover each fix; 18/18 pass.

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	packages/cli/skills/dkg-importer/SKILL.md
#	scripts/lib/__tests__/manifest.test.mjs
#	scripts/lib/manifest.mjs
Comment thread scripts/lib/manifest.mjs Outdated
throw err;
}
}
const triples = buildInitialManifestTriples(importId, partitions, new Date().toISOString());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: retries/resumes for the same importId will append a fresh imp:startedAt every time, so one manifest can end up with multiple conflicting start timestamps. Because this helper explicitly treats already exists as success, that retry path is normal. Preserve the original start time on reuse (or make it caller-supplied/stable) instead of regenerating it here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 59532a3. On assertion reuse, createImportManifest now detects the 409 and suppresses a fresh imp:startedAt while still healing the root/partition triples. Added a regression that reruns create against an existing assertion and asserts the original startedAt triple is unchanged.

Comment thread scripts/lib/manifest.mjs Outdated
// `imp:initialStatus` and resume can't recover. Chunk by `ROOT_CHUNK` so
// very-large imports stay within the daemon's body-size budget — see
// ADR 0002.
const ROOT_CHUNK = 1000;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: fixed ROOT_CHUNK = 1000 batches can still exceed the daemon's /promote 256 KB body cap when partition URIs are long (deep paths, percent-encoding, GitHub-style keys). That makes manifest creation fail on large imports even though the new skill documents 413-aware splitting as part of the contract. This promote loop should be size-aware or halve-and-retry on 413.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 59532a3. Promote now batches roots by estimated JSON body size under the daemon cap and recursively splits/retries on 413, so long percent-encoded partition URIs do not blow up a nominal 1000-root chunk. Added a mock 413 regression for long partition keys.

Comment thread scripts/lib/__tests__/manifest.test.mjs Outdated
partitions: ['only.ts'],
subGraphName: 'meta',
});
// The mock records all `request()` calls; the create call should have
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: this regression test never asserts the sanitized assertion name it claims to cover. makeMockClient.request() drops the request body, so the test would still pass if createImportManifest sent the raw importId or omitted name entirely. Record the create payload in the mock and assert name === "import-manifest-corpus-2026-q1" so the regression is actually pinned.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 59532a3. The mock client now records request() payloads, and the sanitized-name regression asserts the create body sends import-manifest-corpus-2026-q1.

Preserve the original imp:startedAt triple when createImportManifest reuses an existing assertion, and batch manifest promote roots by estimated JSON body size with recursive 413 split/retry.

Extend the manifest mock to record create request bodies and simulate promote payload limits, covering sanitized assertion names and long partition URI promote batches.

Co-authored-by: Codex <codex@openai.com>
Comment thread scripts/lib/manifest.mjs
const prefix = 'import-manifest-';
const maxSlugLen = 256 - prefix.length;
const slug = importId
.replace(/[^A-Za-z0-9._-]+/g, '-')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This sanitization is no longer one-to-one. Different importIds such as a/b vs a b (and long IDs that only differ after the 240-char cutoff) both collapse to the same assertion name, so the later import will silently reuse the earlier manifest assertion and mix their progress/lifecycle state. Preserve uniqueness here, for example by appending a stable hash/encoded suffix whenever sanitization or truncation changes the original id.

Comment thread scripts/lib/manifest.mjs
return `${partitionUri(importId, key)}/event/${ts}-${seq}-${rand}`;
}

function lit(value) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: The new literal codec only escapes \\, ", \n, and \r. Partition keys/status values containing other valid RDF control escapes such as tabs are emitted raw here, which can make the manifest write invalid or read back corrupted. Please reuse the existing shared literal helper or extend both the encode and decode paths to handle the full N-Triples escape set consistently.

Comment thread scripts/lib/manifest.mjs
const evIri = statusEventUri(importId, partitionKey);
const nowIso = new Date().toISOString();
const triples = [
{ subject: partIri, predicate: IMPORT_P.statusEvent, object: evIri },
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: markPartitionStatus() never checks that partitionKey was declared in the manifest. If a caller passes a typo or a new key, this write/promote succeeds, but loadImportManifest() will never surface that status because it only walks <import> imp:partition ?part edges created by createImportManifest(). That leaves the real partition looking permanently pending with no obvious error. Please validate membership before writing, or backfill/promote the partition declaration when it is missing.

Comment thread scripts/lib/manifest.mjs
createdFresh = false;
}
const importIri = importUri(importId);
const triples = buildInitialManifestTriples(importId, partitions, new Date().toISOString())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: on the already exists path we blindly append whatever partitions array the caller supplies into the existing assertion. Reusing an importId with a different partition set will silently produce the union of old and new partitions, so resume logic can keep retrying stale entries from a previous run. Consider loading the existing manifest on reuse and failing fast when the partition set differs, or make additive extension an explicit opt-in.

Comment thread packages/cli/skills/dkg-node/SKILL.md Outdated

P2P tools fail gracefully when the peer is offline. `dkg_publish` (fresh quads + write + publish, two HTTP calls) and `dkg_shared_memory_publish` (publish existing SWM, one HTTP call) differ in intent: use the two-call helper for "I have quads, publish now"; use the canonical finalizer as step 4 of the stepwise write → promote → publish flow. `dkg_share` is a direct SWM convenience helper for quick team-visible notes, not a replacement for assertion lifecycle tracking.

**Bulk imports (>5,000 quads in one logical operation):** the per-call `dkg_assertion_*` loop IS the chunked-write API; there is no `/api/import/bulk`. If you're pushing a code graph, a corpus, or any multi-thousand-quad import, read [`dkg-importer/SKILL.md`](../dkg-importer/SKILL.md) first — it codifies the chunking budgets, HTTP 413 recovery, resumability manifest, and canonical-URI rules so your importer converges with the rest of the workspace.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: this cross-link is not reachable for installed users right now. The setup/install flows and the public skill endpoint still hardcode only skills/dkg-node/SKILL.md, so agents running via dkg mcp/openclaw/hermes setup will not have ../dkg-importer/SKILL.md available. Either ship the importer skill through those paths too, or keep the bulk-import guidance inline here until the extra skill is actually distributed.

try {
await client.writeOne(slice);
} catch (err) {
if (err.statusCode !== 413) throw err;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: These retry examples check err.statusCode, but scripts/lib/dkg-daemon.mjs sets err.status on failed requests. Copied verbatim, the 413 branch never runs and the sample rethrows instead of halving. Please switch this snippet (and the similar one below) to the actual error shape or document a wrapper that normalizes it.

### Connection errors / 5xx

Standard retry with exponential backoff. The daemon does not implement
idempotency tokens, but `assertion/create` returns 200 for both first and
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: The raw daemon route does not return 200 on duplicate POST /api/assertion/create; packages/cli/src/daemon/routes/assertion.ts maps already exists to HTTP 400. Only higher-level clients normalize that into an idempotent success. As written, this tells importer authors the wrong wire contract and can lead to broken retry handling.

Comment thread scripts/lib/manifest.mjs
if (!subGraphName) throw new Error('markPartitionStatus requires `subGraphName`.');

const cgId = client.cgId;
const assertion = assertionName ?? defaultManifestAssertionName(importId);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: markPartitionStatus() recomputes the assertion name from importId on every write. If the manifest was originally created with a custom assertionName, a resumed importer that only knows importId will target a different assertion here and /api/assertion/:name/write will fail with not found. Because loadImportManifest() doesn't persist or return the actual assertion name, callers have no way to recover it later. Please either persist the chosen assertion name in the manifest and resolve it here, or remove the public assertionName override so the name is always derivable.

Comment thread scripts/lib/manifest.mjs

const cgId = client.cgId;
const assertion = assertionName ?? defaultManifestAssertionName(importId);
const manifest = await loadImportManifest({ client, importId, subGraphName });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: this validates the partition key by loading and sorting the entire manifest on every status update. For the bulk imports this helper targets, that makes status tracking quadratic (in_progress + done across 10k partitions means 20k full SWM queries). Consider caching the declared partition set from createImportManifest()/loadImportManifest(), or replacing this with a narrower existence check / optional validation flag.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review skipped: filtered diff is 11370 lines (cap: 5,000). Please consider splitting this into smaller PRs for reviewability.

@branarakic branarakic merged commit 6dc410c into main May 25, 2026
36 checks passed
branarakic pushed a commit that referenced this pull request May 25, 2026
… via HTTP

Two PR #642 follow-ups + the originally-scheduled SKILL.md update for the
async promote queue (the bulk of PR #4's documentation work, deferred until
#642 merged because that's where dkg-importer/SKILL.md first landed).

1. packages/cli/skills/dkg-importer/SKILL.md — new §6 "Async promote queue"
   - Route inventory (POST/GET/DELETE /api/assertion/.../promote-async + recover)
   - Async write loop (drop-in for §2's synchronous promote step)
   - Failure-classification table mapping attempt.lastError.classification
     (transient | cap_exceeded | fatal) to importer actions
   - Migration guidance vs. the sync /promote route (still supported)
   - Inspecting in-flight jobs alongside the manifest in §3
   New cheat-sheet variant under §8 mirroring the async loop.
   §5's forward reference to "PR #643's async promote queue ... until that
   lands" rewritten to point at the now-shipping §6.

2. New endpoint GET /.well-known/skill-importer.md (Codex PR #642 follow-up)
   The cross-link from dkg-node/SKILL.md to dkg-importer/SKILL.md was
   unreachable for agents installed via the setup flow because only the
   former was served from packages/cli/src/daemon/manifest.ts's
   loadSkillTemplate. Adds a sibling loadImporterSkillTemplate +
   /.well-known/skill-importer.md route in status.ts (same auth-public,
   ETag-cacheable shape as /.well-known/skill.md), plus PUBLIC_GET_PATHS /
   PUBLIC_HEAD_PATHS / isLoopbackRateLimitExemptPath allowlist entries.
   Cross-link in dkg-node/SKILL.md now points at the runtime URL.
   4 new tests (auth public path + 4 dkg-importer skill content pins).

3. scripts/lib/manifest.mjs — partitionDeclared helper (Codex PR #642
   quadratic-validation follow-up). markPartitionStatus used to call
   loadImportManifest on every status write, materialising the full
   manifest from SWM. For a 10k-partition import marking each twice
   (in_progress + done), that's 20k full SWM round-trips. Replaced with a
   single-row SPARQL ASK: one query, one Boolean, same error semantics
   when a partition isn't declared. 4 new tests including a regression
   pin asserting markPartitionStatus issues exactly one query.

Tests: 27/27 (scripts/lib manifest), 32/32 (publisher async-promote-queue),
190/190 (CLI unit suites incl. new skill-endpoint cases, async-promote
routes/worker/E2E, auth allowlist). tsc green. Zero lints.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants