
OT-RFC-38 LU-6 B3 — persist host-only designation across restart#620

Closed
branarakic wants to merge 2 commits into feat/ot-rfc-38-lu6-host-mode from feat/lu6-followup-b3-host-only-persistence

Conversation

@branarakic
Contributor

Summary

Pre-fix, host-only cores re-derived their hosting set on every restart from two sources only:

  • chain `ContextGraphCreated` events (within poller lookback)
  • curator-broadcast discovery beacons (5-min cadence)

CGs registered before the lookback window AND whose curator was offline during the post-restart re-announce tick were silently lost: the gossip handler was never rewired, so incoming envelopes were dropped.

This PR persists the host-mode designation in the per-CG `.meta` file (next to `seqno` / `registered`) and restores it at startup BEFORE the chain-event poller starts. Chain events + beacons remain the primary derivation; restoration is the "we already knew about this CG" shortcut.

Stacked on #610.

Store API additions

  • `markHostModeSubscribed(cgId)` — wire path persists the flag
  • `markHostModeUnsubscribed(cgId)` — unwire path clears it
  • `listHostModeSubscribedCgs()` — startup enumerates restorable CGs
  • `CgMetaState.hostModeSubscribed?: boolean` — new optional field (existing CGs read as undefined → treated as not-subscribed)
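A minimal sketch of how the optional flag could sit next to `seqno` / `registered` and survive unrelated metadata updates. The `MetaStoreSketch` class, its in-memory `files` map (standing in for the per-CG `.meta` files), and `bumpSeqno` are assumptions made for illustration; the PR's store methods are async, while these are synchronous to keep the example short.

```typescript
// Sketch only: `files` stands in for the per-CG `.meta` files on disk, and
// the class name and `bumpSeqno` helper are hypothetical.
interface CgMetaState {
  seqno: number;
  registered: boolean;
  hostModeSubscribed?: boolean; // absent on metadata written before this PR
}

class MetaStoreSketch {
  private files = new Map<string, string>(); // cgId -> serialized .meta

  private read(cgId: string): CgMetaState {
    const raw = this.files.get(cgId);
    return raw ? (JSON.parse(raw) as CgMetaState) : { seqno: 0, registered: false };
  }

  // Read-merge-write so that updating one field never drops the others.
  private update(cgId: string, patch: Partial<CgMetaState>): void {
    this.files.set(cgId, JSON.stringify({ ...this.read(cgId), ...patch }));
  }

  bumpSeqno(cgId: string): void {
    this.update(cgId, { seqno: this.read(cgId).seqno + 1 });
  }

  markHostModeSubscribed(cgId: string): void {
    this.update(cgId, { hostModeSubscribed: true }); // idempotent
  }

  markHostModeUnsubscribed(cgId: string): void {
    this.update(cgId, { hostModeSubscribed: false }); // idempotent
  }

  listHostModeSubscribedCgs(): string[] {
    // `undefined` (legacy metadata) and `false` both count as not subscribed.
    return Array.from(this.files.keys()).filter(
      (cgId) => this.read(cgId).hostModeSubscribed === true,
    );
  }
}
```

Treating only a strict `true` as subscribed is what lets existing CGs, whose metadata has no such field, read as not-subscribed without a migration.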

Agent wiring

  • `wireSwmHostModeHandler()` calls `markHostModeSubscribed` after subscribing (best-effort; persistence failure never fails the wire)
  • `unwireSwmHostModeHandler()` calls `markHostModeUnsubscribed`
  • `setupSwmHostMode()` restores persisted subscriptions after store init, calling `wireSwmHostModeHandler` directly (skipping the curated-check gate — we trust the previous decision; the chain-anchored envelope-ingest authority check still catches revocations)
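The restore step described in the last bullet can be sketched roughly as follows. Both interfaces and the `restoreHostModeSubscriptions` helper are hypothetical stand-ins for the store and the agent; only the method names come from the description above, and the log-and-continue handling of a failed CG is an assumption.

```typescript
// Sketch only: both interfaces are hypothetical stand-ins for the store and
// the agent; only the method names are taken from the PR description.
interface HostModeStoreLike {
  listHostModeSubscribedCgs(): Promise<string[]>;
}

interface HostModeWiringLike {
  wireSwmHostModeHandler(cgId: string): void;
}

// Re-engages every persisted subscription before the chain-event poller
// starts. A failure on one CG is logged and skipped (an assumption) so it
// cannot block the remaining CGs.
async function restoreHostModeSubscriptions(
  store: HostModeStoreLike,
  agent: HostModeWiringLike,
): Promise<{ restored: string[]; failed: string[] }> {
  const restored: string[] = [];
  const failed: string[] = [];
  for (const cgId of await store.listHostModeSubscribedCgs()) {
    try {
      // Wired directly: the curated-check gate is intentionally skipped.
      agent.wireSwmHostModeHandler(cgId);
      restored.push(cgId);
    } catch (err: unknown) {
      console.warn(`host-mode restore failed for ${cgId}`, err);
      failed.push(cgId);
    }
  }
  return { restored, failed };
}
```

Returning the two lists is only there to make the outcome observable in a test; the real startup path may simply log.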

Test plan

  • 5 new `host-mode-store.test.ts` cases, added to `vitest.unit.config.ts` (33 tests passing)
  • persists hostModeSubscribed across new store instances
  • listHostModeSubscribedCgs returns only flagged CGs
  • unmark clears the flag (unwire path)
  • survives interleaved append-driven seqno updates
  • mark/unmark are idempotent

Made with Cursor

… restart

Pre-fix, host-only cores re-derived their hosting set on every
restart from two sources only:
  - chain `ContextGraphCreated` events (within poller lookback)
  - curator-broadcast discovery beacons (5-min cadence)

CGs registered before the lookback window AND whose curator was
offline during the post-restart re-announce tick were silently
lost — the gossip handler never rewired and incoming envelopes
went to /dev/null.

This PR persists the host-mode designation in the per-CG `.meta`
file (next to `seqno`/`registered`) and restores it at startup
BEFORE the chain-event poller starts. The chain-event + beacon
paths remain the primary derivation; restoration is the "we
already knew about this" shortcut.

Store API additions:
  * `markHostModeSubscribed(cgId)` — wire path persists the flag
  * `markHostModeUnsubscribed(cgId)` — unwire path clears it
  * `listHostModeSubscribedCgs()` — startup enumerates restorable

  * `CgMetaState.hostModeSubscribed?: boolean` — new optional field

Agent wiring:
  * `wireSwmHostModeHandler()` calls `markHostModeSubscribed` after
    subscribing (best-effort; never fails the wire)
  * `unwireSwmHostModeHandler()` calls `markHostModeUnsubscribed`
  * `setupSwmHostMode()` restores persisted subscriptions after
    store init, calling `wireSwmHostModeHandler` directly (skipping
    the curated-check gate — we trust the previous decision; the
    chain-anchored envelope-ingest authority check still catches
    revocations)

Tests (added to `vitest.unit.config.ts` allowlist, 5 new cases):
  * persists hostModeSubscribed across new store instances
  * listHostModeSubscribedCgs returns only flagged CGs
  * unmark clears the flag (unwire path)
  * survives interleaved append-driven seqno updates
  * mark/unmark are idempotent

Co-authored-by: Cursor <cursoragent@cursor.com>
// authority check on every envelope ingest still catches
// revocations even if curator state has changed since.
try {
  this.wireSwmHostModeHandler(cgId);

🔴 Bug: this restore path only re-wires the gossip handler; it never re-runs `maybeMarkRegisteredForHostMode()`. For host-only CGs, the periodic reconciler will not heal this later because `GraphManager.listContextGraphs()` only sees local store graphs, not opaque host-mode entries. If a CG was registered while this node was offline, the store stays on the 1 MiB / 6h pre-registration limits after restart and can prune valid ciphertext permanently. Restore through `reconcileSwmHostModeSubscription()` or explicitly call `maybeMarkRegisteredForHostMode(cgId)` here.
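One way to read the second suggestion, as a hedged sketch: the restore loop wires the handler and then runs the registration probe for the same CG. The `HostModeRestoreSketch` class, the `isRegisteredOnChain` callback, and `byteCap` are hypothetical; the 1 MiB figure is the pre-registration limit quoted in the comment, and the 64 MiB registered cap is taken from the follow-up commit message on this PR. The real probe is described as three-way; it is collapsed into a single callback here.

```typescript
// Sketch only: a single callback stands in for the real registration probe,
// and the class and helper names are hypothetical.
const MIB = 1024 * 1024;
const PRE_REGISTRATION_BYTE_CAP = 1 * MIB;
const REGISTERED_BYTE_CAP = 64 * MIB;

class HostModeRestoreSketch {
  private wired = new Set<string>();
  private registered = new Set<string>();
  private isRegisteredOnChain: (cgId: string) => Promise<boolean>;

  constructor(isRegisteredOnChain: (cgId: string) => Promise<boolean>) {
    this.isRegisteredOnChain = isRegisteredOnChain;
  }

  wireSwmHostModeHandler(cgId: string): void {
    this.wired.add(cgId);
  }

  async maybeMarkRegisteredForHostMode(cgId: string): Promise<void> {
    if (await this.isRegisteredOnChain(cgId)) this.registered.add(cgId);
  }

  byteCap(cgId: string): number {
    return this.registered.has(cgId) ? REGISTERED_BYTE_CAP : PRE_REGISTRATION_BYTE_CAP;
  }

  // Restore = re-wire AND re-probe, so a CG registered while the node was
  // offline does not stay on the pre-registration cap.
  async restore(cgIds: string[]): Promise<void> {
    for (const cgId of cgIds) {
      this.wireSwmHostModeHandler(cgId);
      await this.maybeMarkRegisteredForHostMode(cgId);
    }
  }
}
```

Without the probe call inside `restore`, every restored CG in this model would report the 1 MiB cap regardless of its on-chain state, which is the failure mode the comment describes.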

Comment thread packages/agent/src/dkg-agent.ts Outdated
// B3: clear the persisted host-mode designation so a restart
// does NOT re-engage. Best-effort.
if (this.swmHostModeStore) {
  this.swmHostModeStore.markHostModeUnsubscribed(contextGraphId).catch((err: unknown) => {

🔴 Bug: persisting the host-mode flag as a fire-and-forget promise makes restart state nondeterministic. `wireSwmHostModeHandler()` and `unwireSwmHostModeHandler()` can run back-to-back during membership changes; if the earlier `markHostModeSubscribed()` write lands after this `markHostModeUnsubscribed()` write, the `.meta` file is left at `true` and startup re-subscribes a CG that was already torn down. Because B3 relies on this flag for correctness, the mark/unmark writes need to be serialized and awaited through one lifecycle path rather than logged and ignored.

…stration on restore

Addresses both Codex bugs flagged on PR #620:

1. Restore path didn't re-run maybeMarkRegisteredForHostMode (dkg-agent.ts:8925)

   `setupSwmHostMode` restored host-only subscriptions by calling
   `wireSwmHostModeHandler(cgId)` directly. That skipped
   `maybeMarkRegisteredForHostMode`, so any host-only CG that was
   registered on chain while this node was offline would stay on
   the 1MiB / 6h pre-registration limits after restart and prune
   valid ciphertext permanently.

   The periodic reconciler couldn't heal this later because
   `GraphManager.listContextGraphs()` only sees local store graphs,
   not opaque host-mode entries — host-only CGs are invisible to
   it.

   Fix: explicitly call `maybeMarkRegisteredForHostMode(cgId)` for
   each restored CG. The same three-way registration probe runs
   that would otherwise gate the first envelope ingest, flipping
   the per-CG byte cap to 64MiB if the chain says the CG is
   registered.

2. Fire-and-forget mark/unmark made restart state nondeterministic
   (dkg-agent.ts:9090)

   `wireSwmHostModeHandler()` and `unwireSwmHostModeHandler()` can
   run back-to-back during membership changes. With the prior
   `.catch(()=>{})` pattern, the earlier `markHostModeSubscribed()`
   write could land AFTER a later `markHostModeUnsubscribed()`,
   leaving `.meta.hostModeSubscribed = true` for a CG that was
   already torn down. Startup would then re-subscribe to it.

   Fix: per-CG `hostModePersistenceQueues` chains every mark/unmark
   write through `Promise.then`, so disk writes always land in
   invocation order. The wire/unwire callers stay synchronous;
   persistence is awaited only by the chain itself. Queue entries
   self-evict when their tail settles.

Co-authored-by: Cursor <cursoragent@cursor.com>
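The per-CG queue from fix 2 in the commit above can be sketched as a promise chain keyed by CG id. This is an illustration under assumptions, not the PR's code: the class name and the injected `persist` callback are hypothetical, while `hostModePersistenceQueues` and `enqueueHostModePersistence` are the names the PR uses.

```typescript
// Sketch only: `persist` stands in for the store's mark/unmark writes.
type PersistHostModeFlag = (cgId: string, subscribed: boolean) => Promise<void>;

class HostModePersistenceQueueSketch {
  private hostModePersistenceQueues = new Map<string, Promise<void>>();
  private persist: PersistHostModeFlag;

  constructor(persist: PersistHostModeFlag) {
    this.persist = persist;
  }

  // Chains the write behind any earlier write for the same CG, so writes
  // land in invocation order. Callers may ignore the returned promise.
  enqueueHostModePersistence(cgId: string, subscribed: boolean): Promise<void> {
    const prev = this.hostModePersistenceQueues.get(cgId) ?? Promise.resolve();
    const tail: Promise<void> = prev
      .then(() => this.persist(cgId, subscribed))
      .catch((err: unknown) => {
        // Best-effort: a failed write is logged and must not break the chain.
        console.warn(`host-mode persistence failed for ${cgId}`, err);
      })
      .then(() => {
        // Self-evict, but only if no later write was queued behind this one.
        if (this.hostModePersistenceQueues.get(cgId) === tail) {
          this.hostModePersistenceQueues.delete(cgId);
        }
      });
    this.hostModePersistenceQueues.set(cgId, tail);
    return tail;
  }

  get pendingQueues(): number {
    return this.hostModePersistenceQueues.size;
  }
}
```

Because each write starts only after the previous one for that CG has settled, a slow `true` write can no longer overtake a later `false` write; different CGs still persist independently of each other.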
 * mode). Persisted so a restart can re-engage the gossip handler
 * before the chain-event poller catches up. Idempotent.
 */
async markHostModeSubscribed(contextGraphId: string): Promise<void> {

🔴 Bug: `markHostModeSubscribed()`/`markHostModeUnsubscribed()` write the same `.meta` file outside the store's per-CG write lock. That makes them race with `markRegistered()`/`append()` on a cold CG, and the last writer can drop the other field (for example, `wireSwmHostModeHandler()` queues `markHostModeSubscribed()` and then `maybeMarkRegisteredForHostMode()` can overwrite it with a meta object that has no `hostModeSubscribed`). Route these through the same per-CG serialization used for other metadata writes, or re-read/merge before persisting.
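A hedged sketch of the first remedy: every metadata write for a CG goes through one per-CG lock, and each write re-reads and merges inside that lock. The `LockedMetaSketch` class, `withCgLock`, and the in-memory `disk` map are hypothetical; the `await Promise.resolve()` only simulates the I/O gap between reading and writing the file.

```typescript
// Sketch only: `disk` stands in for the `.meta` files, and the lock helper
// is a hypothetical name, not the store's real serialization primitive.
interface MetaFields {
  registered?: boolean;
  hostModeSubscribed?: boolean;
}

class LockedMetaSketch {
  private disk = new Map<string, MetaFields>();
  private locks = new Map<string, Promise<void>>();

  // Runs `fn` only after every previously queued operation for this CG.
  private withCgLock(cgId: string, fn: () => Promise<void>): Promise<void> {
    const prev = this.locks.get(cgId) ?? Promise.resolve();
    const run = prev.then(fn);
    // Store a never-rejecting tail so one failure cannot poison the lock.
    this.locks.set(cgId, run.catch(() => {}));
    return run;
  }

  private async readMergeWrite(cgId: string, patch: MetaFields): Promise<void> {
    const current = { ...(this.disk.get(cgId) ?? {}) }; // read
    await Promise.resolve(); // simulated I/O gap where a race could occur
    this.disk.set(cgId, { ...current, ...patch }); // merge, then write
  }

  markRegistered(cgId: string): Promise<void> {
    return this.withCgLock(cgId, () => this.readMergeWrite(cgId, { registered: true }));
  }

  markHostModeSubscribed(cgId: string): Promise<void> {
    return this.withCgLock(cgId, () =>
      this.readMergeWrite(cgId, { hostModeSubscribed: true }),
    );
  }

  meta(cgId: string): MetaFields {
    return this.disk.get(cgId) ?? {};
  }
}
```

If the two `mark*` methods called `readMergeWrite` directly, both would read the same empty snapshot on a cold CG and the later write would drop the other field. This sketch never evicts lock entries; a real store would want to.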

// in invocation order. Without this, back-to-back mark→unmark
// could write `true` after `false` and a restart would re-subscribe
// a torn-down CG. Still non-blocking at the wire level.
this.enqueueHostModePersistence(contextGraphId, true);

🔴 Bug: persisting `hostModeSubscribed=true` here assumes every teardown path clears the flag via `unwireSwmHostModeHandler()`, but `reconcileSharedMemoryGossipSubscription()` still has a branch that deletes the in-memory host-mode state directly after `gossip.unsubscribe(...)`. If that branch runs for a CG that is no longer eligible for host mode, the persisted flag stays `true` and the new startup restore will subscribe again on the next restart. Please funnel that path through `unwireSwmHostModeHandler()` or explicitly enqueue the `false` write there as well.

@branarakic
Contributor Author

Superseded by PR #649 (release: rc.10 testnet-ready cut). All commits from this PR are now on main via #649. Unaddressed Codex review feedback (B3 restart-reconcile + per-CG lock races) is being tracked + fixed in a dedicated post-rc.10 followup PR.

@branarakic branarakic closed this May 25, 2026
branarakic pushed a commit that referenced this pull request May 25, 2026
…apter compat + mock signer

Three more production bugs from the closed-PR codex review (#618, #620,
#637) — the remaining hard-coded fail-paths that escaped the rc.10
integration merge. Companion to f85d2a3 which fixed the host-mode-store
data-loss / lock-race set.

T1b #1 — MockChainAdapter.signMessage (PR #618 c3) — the mock returned
zero-byte `{r, vs}`, so every test that exercised
`mintSignedCatchupRequest` recovered a garbage signer on the responder
side and host catchup was effectively dead under MockChainAdapter.
Existing test files worked around this by monkey-patching `signMessage`
on the adapter instance (see cg-discovery-integration.test.ts:114). The
mock now delegates to `mockACKSigner` when a wallet has been registered
via `setMockACKSigner`, so any test that already wires the ACK signer
gets a real EIP-191 signature on `signMessage` too. Tests that don't
configure the ACK signer keep getting zeros — no behavioural change
for unit tests that don't exercise the signed-catchup path.
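The delegation described for T1b #1 might look roughly like this. It is a sketch under assumptions: the `{ r, vs }` field types, the 32-byte zero buffers, and a signer passed as a plain function are all hypothetical (the commit says a wallet is registered via `setMockACKSigner`), and no real EIP-191 signing is performed here.

```typescript
// Sketch only: the signature shape and the function-typed signer are
// assumptions; the real mock registers a wallet, not a callback.
interface CompactSignatureSketch {
  r: Uint8Array;
  vs: Uint8Array;
}

type MessageSignerSketch = (message: Uint8Array) => Promise<CompactSignatureSketch>;

class MockChainAdapterSketch {
  private mockACKSigner: MessageSignerSketch | undefined;

  setMockACKSigner(signer: MessageSignerSketch): void {
    this.mockACKSigner = signer;
  }

  async signMessage(message: Uint8Array): Promise<CompactSignatureSketch> {
    // Delegate when an ACK signer is configured, so tests that wire it get
    // a recoverable signature from signMessage as well.
    if (this.mockACKSigner) return this.mockACKSigner(message);
    // Otherwise keep the previous all-zero placeholder behaviour.
    return { r: new Uint8Array(32), vs: new Uint8Array(32) };
  }
}
```

Keeping the zero fallback is what preserves behaviour for unit tests that never exercise the signed-catchup path.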

T1b #2 — gossip-teardown persistence leak (PR #620 c4) — when
`reconcileSharedMemoryGossipSubscription` discovers the local node
lost membership for a curated CG, it calls `gossip.unsubscribe(swmTopic)`
on the whole topic and clears the in-memory `swmHostModeSubscribed` /
`swmHostModeHandlers` maps. The persisted `hostModeSubscribed=true`
flag in `_meta` was NOT cleared, so the B3 startup-restore loop in
`initializeSwmHostModeStore` would happily re-subscribe to the same
CG on next boot — exactly the CG this branch had just torn down for
authorization reasons. Added an `enqueueHostModePersistence(cgId,
false)` call alongside the in-memory deletes. If the immediate
`reconcileSwmHostModeSubscription` re-engages, it does so through
`wireSwmHostModeHandler` which enqueues `true` again; the per-CG
queue's serialisation guarantees the final state always reflects
the final intent.

T1b #3 — backwards-compatible access-policy probe (PR #637 c1+c3)
— `_resolveEncryptInlinePayload`'s probe path was returning `null`
(UNKNOWN) when `chain.getContextGraphAccessPolicy` was undefined,
and `null` made the throw at the bottom of the helper hard-fail
publish for any numeric CG. Optional adapter method → mandatory in
practice. External / custom adapters that support V10 publish but
haven't adopted the access-policy getter would 500 on every publish
they routed through here. Fix distinguishes "method not implemented"
(falls back to PUBLIC — v9-era behaviour, restores compat) from
"method threw" (still returns null, still fails closed — that's an
actual RPC failure, refusing to pick plaintext-vs-encrypted without
a verified policy is the right call there). Local-meta probe above
runs unchanged and would still return `true` for any curated CG the
local node created or joined with policy metadata.
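The distinction drawn in T1b #3 can be illustrated with a small probe. This is a sketch under assumptions: the `"public" | "curated"` return shape of `getContextGraphAccessPolicy`, the string CG id, and the `probeEncryptInlineFromChain` helper name are all hypothetical; only the three outcomes (fall back to PUBLIC, use the verified policy, return `null` and fail closed) come from the commit message.

```typescript
// Sketch only: the policy values and helper name are assumptions.
type AccessPolicySketch = "public" | "curated";

interface ChainAdapterLike {
  // Optional: older or custom adapters may not implement the getter.
  getContextGraphAccessPolicy?: (cgId: string) => Promise<AccessPolicySketch>;
}

// Returns true (encrypt), false (plaintext), or null (unknown; the caller
// is expected to fail closed).
async function probeEncryptInlineFromChain(
  chain: ChainAdapterLike,
  cgId: string,
): Promise<boolean | null> {
  const getPolicy = chain.getContextGraphAccessPolicy;
  if (typeof getPolicy !== "function") {
    return false; // method not implemented: fall back to PUBLIC
  }
  try {
    return (await getPolicy.call(chain, cgId)) === "curated";
  } catch {
    return null; // the call itself failed: policy is unverified
  }
}
```

The missing-method check sits outside the `try` on purpose, so an absent getter can never be mistaken for an RPC failure or the other way round.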

No behaviour change for clients that:
  - use the EVM adapter (always implements the getter)
  - run tests that already configure mockACKSigner
  - hit a curated CG with local policy metadata

Builds clean (chain + agent).

Co-authored-by: Cursor <cursoragent@cursor.com>