OT-RFC-38 LU-6 B3 — persist host-only designation across restart#620
branarakic wants to merge 2 commits into
… restart
Pre-fix, host-only cores re-derived their hosting set on every
restart from two sources only:
- chain `ContextGraphCreated` events (within poller lookback)
- curator-broadcast discovery beacons (5-min cadence)
CGs registered before the lookback window AND whose curator was
offline during the post-restart re-announce tick were silently
lost — the gossip handler never rewired and incoming envelopes
went to /dev/null.
This PR persists the host-mode designation in the per-CG `.meta`
file (next to `seqno`/`registered`) and restores it at startup
BEFORE the chain-event poller starts. The chain-event + beacon
paths remain the primary derivation; restoration is the "we
already knew about this" shortcut.
Store API additions:
* `markHostModeSubscribed(cgId)`: the wire path persists the flag
* `markHostModeUnsubscribed(cgId)`: the unwire path clears it
* `listHostModeSubscribedCgs()`: startup enumerates restorable CGs
* `CgMetaState.hostModeSubscribed?: boolean`: new optional field
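A minimal sketch of how these three store methods could behave, assuming an in-memory map stands in for the per-CG `.meta` files. `SketchHostModeStore`, `FakeDisk`, and the `append` helper are hypothetical names for illustration; only the three method names and the `hostModeSubscribed` field come from the list above.

```typescript
// Hypothetical shape of the per-CG metadata. Only `hostModeSubscribed`
// is described as new in this PR; the other fields mirror the prose.
interface CgMetaState {
  seqno: number;
  registered: boolean;
  hostModeSubscribed?: boolean;
}

// Assumption: stands in for the on-disk `.meta` files, one JSON string per CG id.
type FakeDisk = Map<string, string>;

class SketchHostModeStore {
  constructor(private readonly disk: FakeDisk) {}

  private read(cgId: string): CgMetaState {
    const raw = this.disk.get(cgId);
    return raw ? (JSON.parse(raw) as CgMetaState) : { seqno: 0, registered: false };
  }

  private write(cgId: string, meta: CgMetaState): void {
    this.disk.set(cgId, JSON.stringify(meta));
  }

  // Idempotent: marking twice leaves the same state.
  async markHostModeSubscribed(cgId: string): Promise<void> {
    this.write(cgId, { ...this.read(cgId), hostModeSubscribed: true });
  }

  // Idempotent: clearing an absent flag is a no-op on the other fields.
  async markHostModeUnsubscribed(cgId: string): Promise<void> {
    const meta = this.read(cgId);
    delete meta.hostModeSubscribed;
    this.write(cgId, meta);
  }

  // Startup enumerates only the CGs whose flag is set.
  async listHostModeSubscribedCgs(): Promise<string[]> {
    return Array.from(this.disk.keys()).filter(
      (id) => this.read(id).hostModeSubscribed === true,
    );
  }

  // Hypothetical append path: bumps seqno and must not drop the flag.
  async append(cgId: string): Promise<void> {
    const meta = this.read(cgId);
    this.write(cgId, { ...meta, seqno: meta.seqno + 1 });
  }
}
```

Because every write re-reads and merges the existing metadata, a second store instance built over the same backing map sees the flag, which is the property the restart path relies on.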
Agent wiring:
* `wireSwmHostModeHandler()` calls `markHostModeSubscribed` after
  subscribing (best-effort; never fails the wire)
* `unwireSwmHostModeHandler()` calls `markHostModeUnsubscribed`
* `setupSwmHostMode()` restores persisted subscriptions after
  store init, calling `wireSwmHostModeHandler` directly (skipping
  the curated-check gate: we trust the previous decision, and the
  chain-anchored envelope-ingest authority check still catches
  revocations)
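The restore step described above can be sketched roughly as follows. This is an illustration under assumptions, not the agent's code: `RestorableStore`, `RestoreHooks`, and `restoreThenStartPoller` are hypothetical names, `wire` stands in for `wireSwmHostModeHandler`, and `startPoller` stands in for whatever starts the chain-event poller.

```typescript
// Hypothetical minimal interfaces; only `listHostModeSubscribedCgs`
// is a name taken from the PR text.
interface RestorableStore {
  listHostModeSubscribedCgs(): Promise<string[]>;
}

interface RestoreHooks {
  wire(cgId: string): void; // stands in for wireSwmHostModeHandler
  startPoller(): void; // stands in for starting the chain-event poller
  onRestoreError(cgId: string, err: unknown): void;
}

// Restores every persisted host-mode subscription, then starts the poller.
// A failure on one CG is reported and does not abort the remaining CGs.
async function restoreThenStartPoller(
  store: RestorableStore,
  hooks: RestoreHooks,
): Promise<string[]> {
  const restored: string[] = [];
  for (const cgId of await store.listHostModeSubscribedCgs()) {
    try {
      hooks.wire(cgId);
      restored.push(cgId);
    } catch (err) {
      hooks.onRestoreError(cgId, err);
    }
  }
  // Restoration finishes BEFORE the poller starts, so envelopes that arrive
  // while the poller is still catching up already have a handler.
  hooks.startPoller();
  return restored;
}
```

Keeping the per-CG `try`/`catch` inside the loop matches the best-effort intent: one corrupt or stale entry should not prevent the other CGs from being re-wired.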
Tests (added to `vitest.unit.config.ts` allowlist, 5 new cases):
* persists hostModeSubscribed across new store instances
* listHostModeSubscribedCgs returns only flagged CGs
* unmark clears the flag (unwire path)
* survives interleaved append-driven seqno updates
* mark/unmark are idempotent
Co-authored-by: Cursor <cursoragent@cursor.com>
    // authority check on every envelope ingest still catches
    // revocations even if curator state has changed since.
    try {
      this.wireSwmHostModeHandler(cgId);
🔴 Bug: this restore path only re-wires the gossip handler; it never re-runs maybeMarkRegisteredForHostMode(). For host-only CGs, the periodic reconciler will not heal this later because GraphManager.listContextGraphs() only sees local store graphs, not opaque host-mode entries. If a CG was registered while this node was offline, the store stays on the 1 MiB / 6h pre-registration limits after restart and can prune valid ciphertext permanently. Restore through reconcileSwmHostModeSubscription() or explicitly call maybeMarkRegisteredForHostMode(cgId) here.
    // B3: clear the persisted host-mode designation so a restart
    // does NOT re-engage. Best-effort.
    if (this.swmHostModeStore) {
      this.swmHostModeStore.markHostModeUnsubscribed(contextGraphId).catch((err: unknown) => {
🔴 Bug: persisting the host-mode flag as a fire-and-forget promise makes restart state nondeterministic. wireSwmHostModeHandler() and unwireSwmHostModeHandler() can run back-to-back during membership changes; if the earlier markHostModeSubscribed() write lands after this markHostModeUnsubscribed() write, the .meta file is left at true and startup re-subscribes a CG that was already torn down. Because B3 relies on this flag for correctness, the mark/unmark writes need to be serialized and awaited through one lifecycle path rather than logged and ignored.
…stration on restore

Addresses both Codex bugs flagged on PR #620:

1. Restore path didn't re-run maybeMarkRegisteredForHostMode (dkg-agent.ts:8925)

`setupSwmHostMode` restored host-only subscriptions by calling `wireSwmHostModeHandler(cgId)` directly. That skipped `maybeMarkRegisteredForHostMode`, so any host-only CG that was registered on chain while this node was offline would stay on the 1MiB / 6h pre-registration limits after restart and prune valid ciphertext permanently. The periodic reconciler couldn't heal this later because `GraphManager.listContextGraphs()` only sees local store graphs, not opaque host-mode entries; host-only CGs are invisible to it.

Fix: explicitly call `maybeMarkRegisteredForHostMode(cgId)` for each restored CG. The same three-way registration probe runs that would otherwise gate the first envelope ingest, flipping the per-CG byte cap to 64MiB if the chain says the CG is registered.

2. Fire-and-forget mark/unmark made restart state nondeterministic (dkg-agent.ts:9090)

`wireSwmHostModeHandler()` and `unwireSwmHostModeHandler()` can run back-to-back during membership changes. With the prior `.catch(()=>{})` pattern, the earlier `markHostModeSubscribed()` write could land AFTER a later `markHostModeUnsubscribed()`, leaving `.meta.hostModeSubscribed = true` for a CG that was already torn down. Startup would then re-subscribe to it.

Fix: per-CG `hostModePersistenceQueues` chains every mark/unmark write through `Promise.then`, so disk writes always land in invocation order. The wire/unwire callers stay synchronous; persistence is awaited only by the chain itself. Queue entries self-evict when their tail settles.

Co-authored-by: Cursor <cursoragent@cursor.com>
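The per-CG ordering described in fix 2 can be illustrated with a small promise-chain queue. This is a sketch under assumptions: `HostModePersistenceQueue`, `PersistFn`, and `pendingCount` are hypothetical names, and the real `hostModePersistenceQueues` and `enqueueHostModePersistence` may be structured differently.

```typescript
// Assumption: `persist` stands in for the store's mark/unmark disk write.
type PersistFn = (cgId: string, subscribed: boolean) => Promise<void>;

class HostModePersistenceQueue {
  // One tail promise per CG; each new write is chained onto the previous tail.
  private readonly tails = new Map<string, Promise<void>>();

  constructor(
    private readonly persist: PersistFn,
    private readonly onError: (cgId: string, err: unknown) => void = () => {},
  ) {}

  // Callers need not await the result; ordering is enforced by the chain.
  enqueue(cgId: string, subscribed: boolean): Promise<void> {
    const prev = this.tails.get(cgId) ?? Promise.resolve();
    const next = prev
      .then(() => this.persist(cgId, subscribed))
      // Best-effort: a failed write is reported and does not poison later writes.
      .catch((err: unknown) => this.onError(cgId, err));
    this.tails.set(cgId, next);
    // Self-evict once this write settles, unless a newer write was chained after it.
    void next.then(() => {
      if (this.tails.get(cgId) === next) this.tails.delete(cgId);
    });
    return next;
  }

  pendingCount(): number {
    return this.tails.size;
  }
}
```

With this shape, a `true` write followed immediately by a `false` write cannot be reordered: the second `persist` call does not start until the first has settled, so the last enqueued intent is the one left on disk.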
     * mode). Persisted so a restart can re-engage the gossip handler
     * before the chain-event poller catches up. Idempotent.
     */
    async markHostModeSubscribed(contextGraphId: string): Promise<void> {
🔴 Bug: markHostModeSubscribed()/markHostModeUnsubscribed() write the same .meta file outside the store's per-CG write lock. That makes them race with markRegistered()/append() on a cold CG, and the last writer can drop the other field (for example, wireSwmHostModeHandler() queues markHostModeSubscribed() and then maybeMarkRegisteredForHostMode() can overwrite it with a meta object that has no hostModeSubscribed). Route these through the same per-CG serialization used for other metadata writes, or re-read/merge before persisting.
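One way to read the suggestion above is a single read-merge-write helper that runs under a per-CG lock. The sketch below is illustrative only: `LockedMetaStore`, `withLock`, `patch`, and `peek` are hypothetical names, an in-memory map stands in for the `.meta` file, and the store's actual locking mechanism is not shown in this thread.

```typescript
interface CgMetaState {
  seqno: number;
  registered: boolean;
  hostModeSubscribed?: boolean;
}

class LockedMetaStore {
  // Assumption: in-memory stand-in for the per-CG `.meta` files.
  private readonly disk = new Map<string, CgMetaState>();
  // One promise chain per CG acts as a simple async mutex (never evicted here).
  private readonly locks = new Map<string, Promise<void>>();

  private withLock<T>(cgId: string, fn: () => Promise<T>): Promise<T> {
    const prev = this.locks.get(cgId) ?? Promise.resolve();
    const run = prev.then(fn);
    // Keep the chain alive even if `fn` rejects.
    this.locks.set(cgId, run.then(() => undefined, () => undefined));
    return run;
  }

  // Every metadata writer goes through this one path: re-read, merge, persist.
  private patch(cgId: string, change: Partial<CgMetaState>): Promise<void> {
    return this.withLock(cgId, async () => {
      const current = this.disk.get(cgId) ?? { seqno: 0, registered: false };
      await Promise.resolve(); // stands in for file I/O latency
      this.disk.set(cgId, { ...current, ...change });
    });
  }

  markRegistered(cgId: string): Promise<void> {
    return this.patch(cgId, { registered: true });
  }

  markHostModeSubscribed(cgId: string): Promise<void> {
    return this.patch(cgId, { hostModeSubscribed: true });
  }

  markHostModeUnsubscribed(cgId: string): Promise<void> {
    return this.patch(cgId, { hostModeSubscribed: false });
  }

  peek(cgId: string): CgMetaState | undefined {
    return this.disk.get(cgId);
  }
}
```

Without the lock, two writers that both read before either writes would each persist a copy missing the other's field; serializing the whole read-merge-write step is what prevents that lost update.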
    // in invocation order. Without this, back-to-back mark→unmark
    // could write `true` after `false` and a restart would re-subscribe
    // a torn-down CG. Still non-blocking at the wire level.
    this.enqueueHostModePersistence(contextGraphId, true);
🔴 Bug: persisting hostModeSubscribed=true here assumes every teardown path clears the flag via unwireSwmHostModeHandler(), but reconcileSharedMemoryGossipSubscription() still has a branch that deletes the in-memory host-mode state directly after gossip.unsubscribe(...). If that branch runs for a CG that is no longer eligible for host mode, the persisted flag stays true and the new startup restore will subscribe again on the next restart. Please funnel that path through unwireSwmHostModeHandler() or explicitly enqueue the false write there as well.
…apter compat + mock signer

Three more production bugs from the closed-PR codex review (#618, #620, #637): the remaining hard-coded fail-paths that escaped the rc.10 integration merge. Companion to f85d2a3, which fixed the host-mode-store data-loss / lock-race set.

T1b #1: MockChainAdapter.signMessage (PR #618 c3). The mock returned zero-byte `{r, vs}`, so every test that exercised `mintSignedCatchupRequest` recovered a garbage signer on the responder side and host catchup was effectively dead under MockChainAdapter. Existing test files worked around this by monkey-patching `signMessage` on the adapter instance (see cg-discovery-integration.test.ts:114). The mock now delegates to `mockACKSigner` when a wallet has been registered via `setMockACKSigner`, so any test that already wires the ACK signer gets a real EIP-191 signature on `signMessage` too. Tests that don't configure the ACK signer keep getting zeros, so there is no behavioural change for unit tests that don't exercise the signed-catchup path.

T1b #2: gossip-teardown persistence leak (PR #620 c4). When `reconcileSharedMemoryGossipSubscription` discovers the local node lost membership for a curated CG, it `gossip.unsubscribe(swmTopic)`s the whole topic and clears the in-memory `swmHostModeSubscribed` / `swmHostModeHandlers` maps. The persisted `hostModeSubscribed=true` flag in `_meta` was NOT cleared, so the B3 startup-restore loop in `initializeSwmHostModeStore` would happily re-subscribe to the same CG on next boot, exactly the CG this branch had just torn down for authorization reasons. Added an `enqueueHostModePersistence(cgId, false)` call alongside the in-memory deletes. If the immediate `reconcileSwmHostModeSubscription` re-engages, it does so through `wireSwmHostModeHandler`, which enqueues `true` again; the per-CG queue's serialisation guarantees the final state always reflects the final intent.

T1b #3: backwards-compatible access-policy probe (PR #637 c1+c3). `_resolveEncryptInlinePayload`'s probe path was returning `null` (UNKNOWN) when `chain.getContextGraphAccessPolicy` was undefined, and `null` made the throw at the bottom of the helper hard-fail publish for any numeric CG. Optional adapter method → mandatory in practice. External / custom adapters that support V10 publish but haven't adopted the access-policy getter would 500 on every publish they routed through here. Fix distinguishes "method not implemented" (falls back to PUBLIC: v9-era behaviour, restores compat) from "method threw" (still returns null, still fails closed; that's an actual RPC failure, and refusing to pick plaintext-vs-encrypted without a verified policy is the right call there). Local-meta probe above runs unchanged and would still return `true` for any curated CG the local node created or joined with policy metadata.

No behaviour change for clients that:
- use the EVM adapter (always implements the getter)
- run tests that already configure mockACKSigner
- hit a curated CG with local policy metadata

Builds clean (chain + agent).

Co-authored-by: Cursor <cursoragent@cursor.com>
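The T1b #3 distinction between "method not implemented" and "method threw" can be sketched as below. The names `ChainAdapterLike`, `AccessPolicy`, and `probeEncryptInline` are hypothetical, and the getter's real return type is not shown in this PR, so the string union here is an assumption; only `getContextGraphAccessPolicy` and the true / false / `null` outcomes come from the commit message.

```typescript
// Assumption: the policy is modelled as a two-value string union for illustration.
type AccessPolicy = "public" | "curated";

interface ChainAdapterLike {
  // Optional on purpose: older or custom adapters may not implement it.
  getContextGraphAccessPolicy?: (cgId: string) => Promise<AccessPolicy>;
}

// Returns true (encrypt), false (plaintext), or null (unknown; caller fails closed).
async function probeEncryptInline(
  chain: ChainAdapterLike,
  cgId: string,
): Promise<boolean | null> {
  const getter = chain.getContextGraphAccessPolicy;
  if (typeof getter !== "function") {
    // Method not implemented: fall back to PUBLIC for backwards compatibility.
    return false;
  }
  try {
    const policy = await getter.call(chain, cgId);
    return policy === "curated";
  } catch {
    // Method exists but threw: treat as an RPC failure and stay UNKNOWN,
    // so the caller refuses to guess between plaintext and encrypted.
    return null;
  }
}
```

The key design point is that absence of the method and failure of the method are different signals: the first says the adapter predates the policy getter, the second says the policy could not be verified this time.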
Summary
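Pre-fix, host-only cores re-derived their hosting set on every restart from two sources only:
- chain `ContextGraphCreated` events (within poller lookback)
- curator-broadcast discovery beacons (5-min cadence)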
CGs registered before the lookback window AND whose curator was offline during the post-restart re-announce tick were silently lost — the gossip handler never rewired and incoming envelopes went to /dev/null.
This PR persists the host-mode designation in the per-CG `.meta` file (next to `seqno` / `registered`) and restores it at startup BEFORE the chain-event poller starts. Chain events + beacons remain the primary derivation; restoration is the "we already knew about this CG" shortcut.
Stacked on #610.