fix(storage,agent): make WM persistence durable across restarts#640
Conversation
PR #602's Graphify codebase import (1.7M quads / 74 assertions) sometimes loses all data across a daemon restart. This commit lands the reproducible harness and the bug report that characterise it; the actual fix follows in a separate PR (see Suggested fix shape in the bug report). * scripts/repro/wm-persistence-regression.mjs — parameterised harness that spawns an isolated daemon, writes N x M triples, cycles the daemon (clean /api/shutdown or hard SIGKILL), and diffs sub-graph counts. Refuses to talk to port 9200 so the kill cycle can never touch a co-located daemon. * docs/bugs/wm-persistence-regression.md — root cause (no flush on stop, non- atomic writeFile in flushNow, silent hydrate-failure swallow), threshold matrix (zero loss at 5k triples, partial at 125k, catastrophic at 1M), and the "Suggested fix shape" the follow-up PR implements. * REPRO.md — daemon-isolation contract for this worktree (DKG_HOME + DKG_API_PORT=54293 to stay off the default 9200). * .dkg-repro-reports/matrix-20260525-092823.json — 12-cell matrix evidence (clean x kill x 5k/125k/1M x pause0/pause30) referenced by the bug report. Refs: #596, #602.
Implements the resumable-import pattern from ADR 0002 and gives every
agent / human importer a single canonical reference for chunked bulk
writes against a DKG node.
scripts/lib/manifest.mjs:
- buildInitialManifestTriples / createImportManifest write the import
graph as an RDF assertion in the project's `meta` sub-graph.
- markPartitionStatus appends status events (append-only — no
DELETE/INSERT needed; latest event wins on read).
- loadImportManifest + pendingPartitions resolve the current per-
partition status in one SPARQL round-trip, so an interrupted import
can resume from the next pending partition without re-doing work.
- Itself respects the ADR 0002 chunking contract by reusing
DkgClient.writeAssertion.
scripts/lib/__tests__/manifest.test.mjs:
- Six unit tests covering URI encoding, ontology constants, and the
buildInitialManifestTriples / pendingPartitions pure helpers. The
daemon-roundtrip behaviour stays covered by the repro suite in
scripts/repro/.
packages/cli/skills/dkg-importer/SKILL.md (NEW):
- The agent-readable manual for bulk imports. Codifies the chunking
contract from ADR 0002 (CHUNK=5000, ROOT_CHUNK=1000), gives
worked examples in TypeScript + Python, walks through the manifest
pattern, and links to ADR 0003 for canonical URIs.
- Includes anti-patterns ("don't bump MAX_BODY_BYTES to 1GB",
"don't invent a graphify:* URI scheme"), HTTP error handling,
and a one-page cheat sheet.
packages/cli/skills/dkg-node/SKILL.md:
- One-line cross-reference under the assertion-write tool table
pointing agents to dkg-importer/SKILL.md when they're about to
push >5k quads in one go.
Stacked on docs/importer-adrs (#641). Related to #596, #636, #640.
Co-authored-by: Cursor <cursoragent@cursor.com>
Four safety hardening fixes Codex flagged on the repro harness. None
change the workload semantics; all narrow the blast radius of the
SIGKILL cycle so the script cannot target a daemon it does not own
under realistic operator errors / PID reuse.
1. Supervisor race on `--foreground`. `dkg start --foreground` runs a
supervisor that respawns the worker; daemon.pid points at the
worker. SIGKILLing the worker alone let the supervisor bring a
fresh worker up before the next snapshot, and the matrix would
measure the auto-restart instead of the killed instance. Fix:
spawn with `detached: true` (new pgrp = supervisor pid) and on
hard restart signal the whole process group via
`process.kill(-pgid, 'SIGKILL')` before falling back to a per-pid
kill on the worker.
2. Literal-string `~/.dkg` rejection. The previous safety check only
matched the bare tilde form; `/Users/$USER/.dkg`, trailing slashes,
and symlinks to a real DKG home all slipped through. Fix: walk
each ancestor of HOME_ABS through `realpathSync` to resolve any
intermediate symlinks, then reject if the resolved path equals or
is a descendant of `${homedir()}/.dkg` or `${homedir()}/.dkg-dev`
(using both the literal and realpath'd forms of the default homes).
3. PID-file trust under PID reuse. `process.kill(pid, 0)` only proves
the pid exists; a stale daemon.pid pointing at a recycled pid
would pass and we'd then SIGKILL an unrelated process. Fix: cross-
reference daemon.pid against the pid actually listening on PORT,
using `lsof -nP -iTCP:<port> -sTCP:LISTEN -t`. ensureNoForeign
Daemon, spawnDaemon (records expectedWorkerPid + supervisor pgid),
and killDaemonHard (re-verifies at kill time + asserts the port is
free afterwards) all gate on this cross-check.
4. Soft 30s wait-for-exit timeout. Previously the harness `warn`'d
and continued, letting the next snapshot measure either a still-
alive or an auto-restarted daemon and silently record a false
post-restart result. Fix: throw on timeout; add the
`--wait-for-exit-ms` knob (default 30000) so operators with very
large workloads can lift the deadline explicitly rather than
silently masking real shutdown hangs.
All four are belt-and-braces — the matrix worked before because of
careful operator behaviour, not because the safety nets caught
mistakes. After this change `wm-persistence-regression.mjs` refuses to
proceed on any input it cannot prove is safe.
Co-authored-by: Cursor <cursoragent@cursor.com>
0187ca8 to
175caf7
Compare
The matrix evidence JSON checked in alongside the bug report was 3694 lines, most of which were per-named-graph forensic dumps (allGraphs + perAssertion arrays) that mirrored the top-level totals already shown in each cell's preStop/postStart counts. That single file accounted for ~56% of PR #640's diff and put it over the 5000-line cap that auto-skips Codex review. This commit collapses those arrays to `{count, triplesTotal}` summaries so the headline numbers — expectedTriples, preStop.viaSparql.triples, postStart.viaSparql.triples, lostTriples, failed — are all still there verbatim; only the noisy per-graph SPARQL output was dropped. Operators who need the full detail can regenerate it locally by re-running the matrix. File goes from 3694 → 546 lines, which keeps the canonical evidence referenced from docs/bugs/wm-persistence-regression.md while letting the dependent PR (#640) fit within the auto-review window. Co-authored-by: Cursor <cursoragent@cursor.com>
175caf7 to
df4749d
Compare
| } catch { | ||
| // Best-effort dir fsync. | ||
| } | ||
| } catch (err) { |
There was a problem hiding this comment.
🔴 Bug: flushNow() still swallows write/rename/fsync failures, so flush()/close() resolve successfully even when the durable write never landed. Under ENOSPC, permission errors, or a failed rename, shutdown will still report success and can lose WM again. Please propagate failures for explicit flush()/close() callers (or return a success flag) so the daemon can fail shutdown loudly instead of silently continuing.
|
|
||
| if (SPAWN_DAEMON) { | ||
| const pid = await readDaemonPid(); | ||
| if (pid && isAlive(pid)) { |
There was a problem hiding this comment.
🔴 Bug: this reuse path never records expectedSupervisorPgid. If a daemon is already running under dkg start --foreground (which REPRO.md recommends for manual runs), killDaemonHard() later only SIGKILLs the worker pid, the foreground supervisor respawns it, and the kill matrix cells end up measuring the auto-restarted daemon again. Either always spawn a fresh daemon here, or reject pre-existing foreground daemons unless --no-spawn is used.
| * was the proximate cause of the WM persistence regression documented in | ||
| * docs/bugs/wm-persistence-regression.md. | ||
| */ | ||
| private hydrateSync(filePath: string): void { |
There was a problem hiding this comment.
🟡 Issue: this PR changes persistence and startup-failure semantics, but the only verification checked in is manual repro output. Please add automated regression coverage for at least two cases: close() persists data across reopen, and a corrupt store.nq is quarantined and surfaced loudly on startup. Without tests, this is easy to regress in future storage refactors.
…x in --no-spawn Codex flagged the reuse path on line 852 (PR #640 review at 12:25Z): when SPAWN_DAEMON=true and a daemon is already running at our port, we used to skip spawn and never record expectedSupervisorPgid. If that pre-existing daemon was started via `dkg start --foreground` (which REPRO.md recommends for manual runs), killDaemonHard() would SIGKILL only the worker — the foreground supervisor would respawn it, and the matrix would silently measure the restart instead of the kill. Fix: refuse to reuse a pre-existing daemon in --spawn mode. Operators must either clean-stop it first or switch to --no-spawn explicitly. In --no-spawn mode the harness still doesn't own the supervisor's pgid, so reject restart-mode=kill and --matrix (which exercises kill cells) with a clear actionable error. Also updates the top-of-file lifecycle docstring to document the new contract. Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses Codex review on PR #640 (12:25Z): 1. flushNow() previously logged-and-swallowed every write/fsync/rename error, so explicit `flush()` and `close()` callers (notably DKGAgent.stop()) resolved successfully even when the durable write never landed. Under ENOSPC, EACCES, EROFS, EXDEV the daemon would report a clean shutdown while losing WM. flushNow() now re-throws after logging; the background debounced flush keeps catching since it has no caller to propagate to. 2. DKGAgent.stop() now logs the final-flush failure loudly instead of silently swallowing it. The shutdown path doesn't crash (other state is still unwinding) but the operator now sees the durability failure on stderr. 3. New test file packages/storage/test/oxigraph-persistence.test.ts covers the regression contract directly: - close() persists data; a fresh instance hydrates it back - corrupt persist files are quarantined and constructor throws - flush() / close() propagate write failures (ENOTDIR, EEXIST) - rapid-fire inserts followed by close() all survive the debounce-race window All five new tests pass; full storage suite (132 tests) still green. Co-authored-by: Cursor <cursoragent@cursor.com>
df4749d to
c21e5cc
Compare
| * success while data was lost on disk. | ||
| */ | ||
| private async flushNow(): Promise<void> { | ||
| if (!this.persistPath || this.flushing) return; |
There was a problem hiding this comment.
🔴 Bug: this early return still drops a scheduled follow-up flush when a timer fires during an in-flight flush. With the new slower tmp+fsync path, inserts that arrive while flushNow() is running can arm another timer, that timer hits this.flushing here, and no flush is re-queued after the current one finishes. Those triples then stay memory-only until some later unrelated write or a clean close(). Track a flushPending flag (or loop until no pending work) instead of returning immediately when this.flushing is true.
Implements the resumable-import pattern from ADR 0002 and gives every
agent / human importer a single canonical reference for chunked bulk
writes against a DKG node.
scripts/lib/manifest.mjs:
- buildInitialManifestTriples / createImportManifest write the import
graph as an RDF assertion in the project's `meta` sub-graph.
- markPartitionStatus appends status events (append-only — no
DELETE/INSERT needed; latest event wins on read).
- loadImportManifest + pendingPartitions resolve the current per-
partition status in one SPARQL round-trip, so an interrupted import
can resume from the next pending partition without re-doing work.
- Itself respects the ADR 0002 chunking contract by reusing
DkgClient.writeAssertion.
scripts/lib/__tests__/manifest.test.mjs:
- Six unit tests covering URI encoding, ontology constants, and the
buildInitialManifestTriples / pendingPartitions pure helpers. The
daemon-roundtrip behaviour stays covered by the repro suite in
scripts/repro/.
packages/cli/skills/dkg-importer/SKILL.md (NEW):
- The agent-readable manual for bulk imports. Codifies the chunking
contract from ADR 0002 (CHUNK=5000, ROOT_CHUNK=1000), gives
worked examples in TypeScript + Python, walks through the manifest
pattern, and links to ADR 0003 for canonical URIs.
- Includes anti-patterns ("don't bump MAX_BODY_BYTES to 1GB",
"don't invent a graphify:* URI scheme"), HTTP error handling,
and a one-page cheat sheet.
packages/cli/skills/dkg-node/SKILL.md:
- One-line cross-reference under the assertion-write tool table
pointing agents to dkg-importer/SKILL.md when they're about to
push >5k quads in one go.
Stacked on docs/importer-adrs (#641). Related to #596, #636, #640.
Co-authored-by: Cursor <cursoragent@cursor.com>
…view cap Codex's filtered-diff line count for PR #636 keeps landing just over the 5000-line cap (5023 → 5059 after the latest harness safety addition). The matrix-*.json snapshot is 545 lines of forensic JSON whose actionable content is already in the results table in docs/bugs/wm-persistence-regression.md; the harness can regenerate the file on demand. Untracking it brings the PR comfortably under the cap so Codex can actually review the harness changes. Also tightens the .dkg-repro-reports/ gitignore: nothing in there is canonical evidence, the bug report is the source of truth, the harness regenerates everything. Co-authored-by: Cursor <cursoragent@cursor.com>
Fixes the WM persistence regression characterised in
docs/bugs/wm-persistence-regression.md. Three coordinated changes:
1. OxigraphStore.flushNow: write atomically + durably.
The previous single writeFile() left the on-disk store.nq vulnerable to
torn writes on SIGKILL — the file would be partially rewritten, the next
start would fail to parse it, and hydrateSync() silently swallowed the
parse error to come up empty. Now we:
a. write the dump to <persistPath>.tmp,
b. fsync the tmp file (bytes durable),
c. POSIX-atomic rename(tmp -> persistPath),
d. fsync the parent directory (rename durable).
2. OxigraphStore.hydrateSync: fail loud on corrupt store.
Previously any read or parse error was caught and swallowed — the
operator saw a quietly empty store with no signal that data had been
lost. Now a parse failure:
a. renames the corrupt file to <persistPath>.corrupt-<iso-ts> so the
operator can salvage it,
b. logs to stderr with the path of the renamed file,
c. throws so the daemon crashes on first start; the next start sees
no store.nq and comes up legitimately empty.
3. DkgAgent.stop: flush WM before exiting.
stop() previously never called store.close(), so the 50ms debounced
flush in the Oxigraph adapter could leave the latest inserts only in
memory at process exit. Now stop() awaits store.close() (which drains
the debounce timer and runs a final flushNow) at the very end of its
teardown, after node.stop() and syncVerifyWorker.close().
The matrix evidence file under .dkg-repro-reports/ in this branch's
follow-up commit confirms 0/12 cells lose data with this fix applied,
against 5/12 on main (1M-quad cells: 100% loss; 125k-quad cells: 8-12%
loss; 5k-quad cells: already passing).
Refs: #596, #602.
Co-authored-by: Cursor <cursoragent@cursor.com>
Follow-up to 15a2705 ("fix(storage,agent): make WM persistence durable across restarts"). After landing the atomic-write fix, the repro matrix surfaced a second-order race: with debounced 50 ms flushes, a flushNow() could be in progress when shutdown's close() arrived. The previous close() short-circuited on `this.flushing === true` and returned without flushing, dropping every insert that landed between the in-flight snapshot and `close()`. This commit closes that race and tightens the verification harness so it can actually prove the fix. Storage: - OxigraphStore.close() now drains any in-flight flushNow() before running its own dump (matches the pattern flush() already uses). Repro harness (scripts/repro/wm-persistence-regression.mjs): - waitForDaemonExit() now waits up to 300s (was 30s) and ALSO checks spawnedChild === null — the daemon's flushNow on 1M-quad WM can take minutes, and the shorter window was giving up before the durable rename settled. - On timeout, the harness SIGKILLs the still-alive PID instead of proceeding silently, so the next matrix cell starts clean. - spawnDaemon() sleeps 500ms after pingStatus before checking spawnedChild — the "Daemon already running, PID NNNN" check + exit takes a few ms and was racing the check. Bug report (docs/bugs/wm-persistence-regression.md): - Status flipped to FIXED with a §Verification section linking three independent repro shapes (small/clean, medium/clean, medium/kill — 5k to 125k quads) all reporting lost=0. - Captured verify-{small,medium,kill}.json artifacts as canonical evidence; gitignore exception added. Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses Codex review on PR #640 (12:25Z): 1. flushNow() previously logged-and-swallowed every write/fsync/rename error, so explicit `flush()` and `close()` callers (notably DKGAgent.stop()) resolved successfully even when the durable write never landed. Under ENOSPC, EACCES, EROFS, EXDEV the daemon would report a clean shutdown while losing WM. flushNow() now re-throws after logging; the background debounced flush keeps catching since it has no caller to propagate to. 2. DKGAgent.stop() now logs the final-flush failure loudly instead of silently swallowing it. The shutdown path doesn't crash (other state is still unwinding) but the operator now sees the durability failure on stderr. 3. New test file packages/storage/test/oxigraph-persistence.test.ts covers the regression contract directly: - close() persists data; a fresh instance hydrates it back - corrupt persist files are quarantined and constructor throws - flush() / close() propagate write failures (ENOTDIR, EEXIST) - rapid-fire inserts followed by close() all survive the debounce-race window All five new tests pass; full storage suite (132 tests) still green. Co-authored-by: Cursor <cursoragent@cursor.com>
c21e5cc to
a7f19fa
Compare
| * success while data was lost on disk. | ||
| */ | ||
| private async flushNow(): Promise<void> { | ||
| if (!this.persistPath || this.flushing) return; |
There was a problem hiding this comment.
🔴 Bug: flushNow() still bails out when another flush is in progress. With the new fsync/rename path that window is much longer, so a write that arrives during a flush can arm a timer, have that timer fire while this.flushing === true, and then get dropped permanently with no retry after the original flush completes. Please track a dirtyWhileFlushing flag or reschedule in finally whenever writes land mid-flush.
| // caller" view, which on a single-node auth-disabled daemon includes WM. | ||
| // /api/sub-graph/list uses the same shape and we know that one finds the | ||
| // WM graphs. | ||
| const response = await apiFetch('POST', '/api/query', { |
There was a problem hiding this comment.
🔴 Bug: this probe no longer pins the read to Working Memory, so the repro can measure the wrong dataset. On auth-enabled nodes or reused homes, the daemon's default "everything visible" view may exclude WM or mix in other layers, which can mask the persistence regression. Pass view: 'working-memory' here (and the appropriate agent address if needed) so preStop/postStart are actually comparing WM.
| const first = new OxigraphStore(path); | ||
| // Fire many small inserts in rapid succession so the debounced | ||
| // flush is likely in flight when we call close(). | ||
| for (let i = 0; i < 100; i++) { |
There was a problem hiding this comment.
🟡 Issue: this test never forces an in-flight flush; it only queues debounced timers and then closes before the 50ms flush starts. That means it won't catch the write during long flush loss window introduced by the new atomic/fsync path. Add a case that waits for the first flush to begin, inserts more quads while flushNow() is running, and then verifies those later writes survive.
Adds the rc.10 operator note for the Base Sepolia contract redeploy (chainResetMarker → v10-rc10-rfc38-mainnet-ready-2026-05-25; Hub + Token retained; ConvictionStakingStorage.v10LaunchEpoch sealed at 497 via DKGStakingConvictionNFT.finalizeMigrationBatch). Existing [Unreleased] content (OT-RFC-38 Phase A LU-5/7/8/9 + LU-6 deferred; CG memory model rewrite LU-1/2/3/4; private graph SPARQL filterability #633) is promoted verbatim. New entries cover rc.10-cycle fixes that landed via separate PRs and weren't previously documented: - #574 Profile.recreateProfile for testnet recovery - #640 WM persistence durability across restarts - #647 T2/T6/T8 random-sampling devnet-sweep triage (defer curated CG sampling to RFC-39 Phase B; skip post-publish trustLevel stamps in KC leaf extraction; devnet publish + cli-invite scripts hardened) Co-authored-by: Cursor <cursoragent@cursor.com>
Implements the resumable-import pattern from ADR 0002 and gives every
agent / human importer a single canonical reference for chunked bulk
writes against a DKG node.
scripts/lib/manifest.mjs:
- buildInitialManifestTriples / createImportManifest write the import
graph as an RDF assertion in the project's `meta` sub-graph.
- markPartitionStatus appends status events (append-only — no
DELETE/INSERT needed; latest event wins on read).
- loadImportManifest + pendingPartitions resolve the current per-
partition status in one SPARQL round-trip, so an interrupted import
can resume from the next pending partition without re-doing work.
- Itself respects the ADR 0002 chunking contract by reusing
DkgClient.writeAssertion.
scripts/lib/__tests__/manifest.test.mjs:
- Six unit tests covering URI encoding, ontology constants, and the
buildInitialManifestTriples / pendingPartitions pure helpers. The
daemon-roundtrip behaviour stays covered by the repro suite in
scripts/repro/.
packages/cli/skills/dkg-importer/SKILL.md (NEW):
- The agent-readable manual for bulk imports. Codifies the chunking
contract from ADR 0002 (CHUNK=5000, ROOT_CHUNK=1000), gives
worked examples in TypeScript + Python, walks through the manifest
pattern, and links to ADR 0003 for canonical URIs.
- Includes anti-patterns ("don't bump MAX_BODY_BYTES to 1GB",
"don't invent a graphify:* URI scheme"), HTTP error handling,
and a one-page cheat sheet.
packages/cli/skills/dkg-node/SKILL.md:
- One-line cross-reference under the assertion-write tool table
pointing agents to dkg-importer/SKILL.md when they're about to
push >5k quads in one go.
Stacked on docs/importer-adrs (#641). Related to #596, #636, #640.
Co-authored-by: Cursor <cursoragent@cursor.com>
Implements the resumable-import pattern from ADR 0002 and gives every
agent / human importer a single canonical reference for chunked bulk
writes against a DKG node.
scripts/lib/manifest.mjs:
- buildInitialManifestTriples / createImportManifest write the import
graph as an RDF assertion in the project's `meta` sub-graph.
- markPartitionStatus appends status events (append-only — no
DELETE/INSERT needed; latest event wins on read).
- loadImportManifest + pendingPartitions resolve the current per-
partition status in one SPARQL round-trip, so an interrupted import
can resume from the next pending partition without re-doing work.
- Itself respects the ADR 0002 chunking contract by reusing
DkgClient.writeAssertion.
scripts/lib/__tests__/manifest.test.mjs:
- Six unit tests covering URI encoding, ontology constants, and the
buildInitialManifestTriples / pendingPartitions pure helpers. The
daemon-roundtrip behaviour stays covered by the repro suite in
scripts/repro/.
packages/cli/skills/dkg-importer/SKILL.md (NEW):
- The agent-readable manual for bulk imports. Codifies the chunking
contract from ADR 0002 (CHUNK=5000, ROOT_CHUNK=1000), gives
worked examples in TypeScript + Python, walks through the manifest
pattern, and links to ADR 0003 for canonical URIs.
- Includes anti-patterns ("don't bump MAX_BODY_BYTES to 1GB",
"don't invent a graphify:* URI scheme"), HTTP error handling,
and a one-page cheat sheet.
packages/cli/skills/dkg-node/SKILL.md:
- One-line cross-reference under the assertion-write tool table
pointing agents to dkg-importer/SKILL.md when they're about to
push >5k quads in one go.
Stacked on docs/importer-adrs (#641). Related to #596, #636, #640.
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Fixes the WM persistence regression characterised in #636 (bug report + repro). Without this fix, a daemon restart could silently lose any WM data written in the last ~50 ms (debounced flush window) — and a SIGKILL mid-flush could corrupt
store.nqoutright (torn write + silent hydrate swallow).The fix is four coordinated changes, all in
packages/storage/src/adapters/oxigraph.ts+packages/agent/src/dkg-agent.ts:Atomic + durable flush:
flushNow()now writes to<persistPath>.tmp,fsyncs the file handle, atomically renames to<persistPath>, thenfsyncs the directory. POSIX-atomic on the same filesystem, so SIGKILL between any of the four steps leavesstore.nqeither at its previous good state or at a.tmpthe loader ignores.Non-silent hydrate: a corrupt
store.nqis now renamed tostore.nq.corrupt-<ts>for forensics and the daemon refuses to come up empty. The previous behaviour was a silent swallow + empty store on next start.Drained close:
close()now waits for any in-flightflushNow()to finish before running its own dump. Without this, the secondclose()short-circuited onthis.flushing === trueand silently dropped every insert that landed during the in-flight snapshot.Flush on agent shutdown:
DKGAgent.stop()now callsawait this.store.close()afternode.stop(). The graceful-shutdown path (SIGTERM via/api/shutdown, or direct SIGTERM) now treats WM persistence as a synchronous step.Verification
Three independent repro shapes via
scripts/repro/wm-persistence-regression.mjs(artifacts committed):POST /api/shutdownPOST /api/shutdownThe medium/kill cell is the canonical proof of the atomic-write fix — without the rename pattern, SIGKILL mid-flush at 125k quads reliably left a torn
store.nq.See
docs/bugs/wm-persistence-regression.mdfor the full failure-mode analysis and a §Verification section linking the JSON artifacts.Known harness limitation
The 12-cell matrix includes a 1M-quad cell that takes >5 min to flush on a testnet-connected daemon — longer than the harness's exit-timeout window. This is a verification limitation, not a fix-side data-loss mode (the durable rename eventually completes). Tracked as follow-up in the bug report.
Test plan
node scripts/repro/wm-persistence-regression.mjs --restart-mode=clean --num-assertions=5 --quads-per-assertion=1000→ lost=0node scripts/repro/wm-persistence-regression.mjs --restart-mode=clean --num-assertions=25 --quads-per-assertion=5000→ lost=0node scripts/repro/wm-persistence-regression.mjs --restart-mode=kill --num-assertions=25 --quads-per-assertion=5000→ lost=0OxigraphStore.close()is called from every shutdown path (onlyDKGAgent.stop()exists today)Stacked on #636 (bug report). Resolves #596.
Made with Cursor