Skip to content

fix: post-rc.10 codex follow-up sweep + libp2p nodeVersion broadcast#653

Merged
branarakic merged 7 commits into
mainfrom
fix/post-rc10-codex-followup
May 26, 2026
Merged

fix: post-rc.10 codex follow-up sweep + libp2p nodeVersion broadcast#653
branarakic merged 7 commits into
mainfrom
fix/post-rc10-codex-followup

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

@branarakic branarakic commented May 25, 2026

Summary

This PR closes the 47 unaddressed Codex inline review comments from
ten PRs merged into rc.10's integration branch (#618, #619, #620, #621,
#622, #623, #624, #625, #637, #638), plus pulls in one small rc.11
networking feature that landed on the same working branch.

The codex sweep is intentionally one PR rather than ten — every fix
is small, related, and the alternative was ten near-trivial PRs each
needing their own review cycle. Commits are scoped tightly enough that
a reviewer can read them in isolation.

What's on this branch (oldest first)

# Commit Scope
1 f85d2a37 fix(agent/swm): tier-1 codex followup — host-mode-store + catchup-sign hardening
2 30eb38c7 feat(core): broadcast DKG release on the libp2p wire via nodeInfo.userAgent
3 69c2d583 fix(agent,chain): tier-1b codex followup — host-mode persistence + adapter compat + mock signer
4 19dc13c8 test(devnet): tier-2 codex followup — close false-PASS gaps in 4 RFC-38 scripts
5 8096f104 docs: tier-3 codex followup — RFC-38 two-laptop runbook accuracy + CHANGELOG fix
6 8b354632 test(agent/swm): update host-mode-store reconcile tests for split corruptMetasRemoved counter

Coverage matrix (closed PR → comments → resolution)

Closed PR Codex comments Status
#618 (signed catchup) c1, c2 Already addressed in rc.10 integration (layered auth + signer derived from recovered EOA)
#618 c3 MockChainAdapter zero-byte signature Fixed in 69c2d583 — delegates to mockACKSigner when registered
#619 orphan-log reconcile + report shape Fixed in f85d2a37 (I/O vs JSON-parse split, corruptMetasRemoved counter). Tests updated in 8b354632
#620 c2 fire-and-forget mark/unmark persistence Already addressed in rc.10 integration via per-CG enqueueHostModePersistence queue
#620 c4 gossip-teardown drops in-memory state without clearing persisted flag Fixed in 69c2d583enqueueHostModePersistence(cgId, false) added
#621 revocation script — 4 false-PASS gaps + idempotent pre-create Most addressed in rc.10 integration; c4/c5/c6 fixed in 19dc13c8
#622 curator-offline script — late-joiner + cleanup gaps Most addressed in rc.10 integration; c5/c7 (sleep→wait, EXIT-trap readiness) fixed in 19dc13c8
#623 byte-cap script — vacuous precondition + pidfile fallback Most addressed in rc.10 integration; c5/c6 fixed in 19dc13c8
#624 unclean-restart script — partial-catchup precondition + peerId pin + response capture Most addressed in rc.10 integration; c4/c5/c6 fixed in 19dc13c8
#625 two-laptop runbook — 5 wrong HTTP/CLI surfaces All 5 fixed in 8096f104 (dkg show, cat auth.token, /list filter, /list route, catchup peerId pin)
#637 c1+c3 getContextGraphAccessPolicy mandatory regression Fixed in 69c2d583 — missing method → public (v9-era compat); method-throws → null (fail-closed)
#637 c2 vite devnet fallback Already addressed in playwright.config.ts (cross-env DEVNET_NODE=…)
#638 c2 verify-batch HTTP 400 assertion in lu8.sh Already addressed in rc.10 integration (api_call_with_status helper, line 153 hard-asserts 400)
#638 c1+c3 verify-batch quads-required Intentional design — daemon cannot safely identify a single batch's leaves inside a CG-wide store. CHANGELOG corrected in 8096f104; codex comments to be dismissed with this rationale
CHANGELOG LU-8 entry stale (promised reconstruction path that doesn't exist) Fixed in 8096f104

What's NOT in this PR

  • No release/tagging machinery touched
  • No packages/evm-module/deploy/ changes
  • No package version bumps (@origintrail-official/* stays at 10.0.0-rc.10)
  • No main mutations

libp2p nodeVersion (commit 30eb38c7, rc.11 follow-up)

Bundled here because it was already on the branch when the codex sweep
started. Independent of the codex fixes but reviewable as a single
self-contained commit.

Closes the version-on-the-wire gap surfaced during the v10.0.0-rc.10
testnet rollout — /api/peer-info could tell operators which protocols
a peer spoke but not which DKG release they were running, leaving "did
the network pick up the upgrade?" answerable only by SSHing into each
Core node. After this lands:

TOKEN=$(grep -v '^#' ~/.dkg/auth.token | tr -d '[:space:]')
curl -s -H "Authorization: Bearer $TOKEN" \
  "http://127.0.0.1:9200/api/peer-info?peerId=<peerId>" \
  | jq '.peerStore.nodeVersion'
# → "dkg/10.0.0-rc.11"   (pre-rc.11 peers fall through to libp2p's default)

Wire-format impact: none. libp2p's identify protocol already ships an
agentVersion PB field on every connection — we're just populating it.
Pre-rc.11 peers continue to advertise libp2p's default. No protocol-id
bump, no chainResetMarker bump, no migration.

Test plan

  • pnpm --filter @origintrail-official/dkg-chain build — clean
  • pnpm --filter @origintrail-official/dkg-agent build — clean
  • pnpm --filter @origintrail-official/dkg build (CLI) — clean
  • Full agent vitest sweep — 852/853 pass (1 pre-existing flake in swm-snapshot-sync.test.ts, commit 8d7d34fd, unrelated)
  • Targeted: test/swm/host-catchup-sign.test.ts — 16/16
  • Targeted: test/swm/host-catchup-wire.test.ts — 13/13
  • Targeted: test/swm/host-mode-store.test.ts — 25/25
  • Targeted: test/dkg-agent-diagnostics.test.ts — passes (rc.11 nodeVersion tests included)
  • bash -n syntax-check on all 4 modified devnet scripts
  • Devnet smoke run of all 4 RFC-38 scripts on a 7-node devnet (manual; CI doesn't have a devnet)
  • Two-laptop runbook walk-through on rc.11 testnet (manual)

Codex comments to dismiss with rationale

After merge, the following codex comments should be marked resolved
with "intentional design" replies (preserved here for the dismissal
PR-review action):

  • fix(cli): require explicit quads for batch verification #638 c1+c3 (verify-batch should reconstruct from local SWM when
    quads omitted): rejected — the local SWM / data graph can contain
    triples from multiple batches in the same CG, and there is no
    daemon-side way to identify which leaves belong to the batch being
    verified. The route stays caller-supplied-plaintext only; the
    CHANGELOG inconsistency that motivated the comments has been
    corrected in 8096f104.

Made with Cursor

Branimir Rakic and others added 2 commits May 25, 2026 19:56
…n hardening

Three production bugs from Codex review on closed PRs (subsumed by #649)
that warrant fixing before mainnet handoff:

T1 #1 (host-catchup-sign.ts: evictStale early-break) — the loop over the
seen-nonce map broke on the first entry whose timestamp was >= threshold,
which assumed strict ascending insertion order. Future-skewed clocks on
a single signer could leave stale entries that never get evicted. Drop
the break; full pass is O(n) on a bounded map.

T1 #2 (host-mode-store.ts: I/O vs JSON corruption) — reconcileOrphanLogs
treated fs.readFile EIO/EBUSY/etc the same as JSON.parse SyntaxError,
deleting healthy .meta+.log pairs on transient I/O. Split the try/catch:
I/O failures now park the key in ioSkippedMetaKeys so neither file is
removed; only genuinely unparseable metas count toward corruptMetasRemoved.
SwmHostModeStartupReconcileReport gains the corruptMetasRemoved field.

T1 #4 (host-mode-store.ts: per-CG meta race) — markHostModeSubscribed,
markHostModeUnsubscribed, markRegistered, markUnregistered, and append
all do load-modify-persist on the same .meta file but lacked any
serialisation. Concurrent calls (e.g. invite ACK + restart reconcile)
could lose updates. Added withCgWriteLock helper that serialises all
meta mutations per contextGraphId via a Map<id, Promise> chain.

No behaviour change for clients; all fixes are internal correctness.

Builds clean (pnpm --filter dkg-agent build). Tests follow in a later
commit as part of the broader followup PR.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rAgent

Closes the version-on-the-wire gap surfaced during the v10.0.0-rc.10
testnet rollout: today /api/peer-info can tell operators which
protocols a peer speaks but not which DKG release it's running,
leaving "did the network pick up the upgrade?" answerable only by
SSHing into each Core node.

This adds DKGNodeConfig.nodeVersion (forwarded through DKGAgent into
libp2p's createLibp2p({ nodeInfo: { userAgent } })) and surfaces the
remote-peer side via PeerDiagnostics.peerStore.nodeVersion. The
daemon sets it to `dkg/<semver>` from the running CLI's package.json,
so /api/peer-info on any rc.11+ node can census peer versions with
one curl loop.

Naming: libp2p's wire field is `AgentVersion`, but inside DKG that
collides with `DKGAgent`. We translate at the read boundary
(`peer.metadata.AgentVersion` → `peerStore.nodeVersion`) and at the
write boundary (`DKGNodeConfig.nodeVersion` → `nodeInfo.userAgent`);
libp2p's name only appears in doc comments. `protocolVersion` keeps
its libp2p name because it really is libp2p's protocol version
(`ipfs/0.1.0`), unrelated to DKG.

Defensive on the read side: peer.metadata can be a Map (libp2p >=2.x),
a plain object (some serialised paths), or undefined (cold cache /
identify not yet completed). Three new test cases lock all three
shapes in; one existing test gained the rc.11 nodeVersion: null
assertion for the cold-cache path.

Pre-rc.11 peers fall through to libp2p's default
`js-libp2p/<version>`, which is itself a useful "peer hasn't adopted
the version-advertisement convention yet" signal.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/agent/src/swm/host-mode-store.ts Outdated
Comment thread packages/agent/src/swm/host-mode-store.ts
Branimir Rakic and others added 4 commits May 25, 2026 20:28
…apter compat + mock signer

Three more production bugs from the closed-PR codex review (#618, #620,
#637) — the remaining hard-coded fail-paths that escaped the rc.10
integration merge. Companion to f85d2a3 which fixed the host-mode-store
data-loss / lock-race set.

T1b #1 — MockChainAdapter.signMessage (PR #618 c3) — the mock returned
zero-byte `{r, vs}`, so every test that exercised
`mintSignedCatchupRequest` recovered a garbage signer on the responder
side and host catchup was effectively dead under MockChainAdapter.
Existing test files worked around this by monkey-patching `signMessage`
on the adapter instance (see cg-discovery-integration.test.ts:114). The
mock now delegates to `mockACKSigner` when a wallet has been registered
via `setMockACKSigner`, so any test that already wires the ACK signer
gets a real EIP-191 signature on `signMessage` too. Tests that don't
configure the ACK signer keep getting zeros — no behavioural change
for unit tests that don't exercise the signed-catchup path.

T1b #2 — gossip-teardown persistence leak (PR #620 c4) — when
`reconcileSharedMemoryGossipSubscription` discovers the local node
lost membership for a curated CG, it `gossip.unsubscribe(swmTopic)`s
the whole topic and clears the in-memory `swmHostModeSubscribed` /
`swmHostModeHandlers` maps. The persisted `hostModeSubscribed=true`
flag in `_meta` was NOT cleared, so the B3 startup-restore loop in
`initializeSwmHostModeStore` would happily re-subscribe to the same
CG on next boot — exactly the CG this branch had just torn down for
authorization reasons. Added an `enqueueHostModePersistence(cgId,
false)` call alongside the in-memory deletes. If the immediate
`reconcileSwmHostModeSubscription` re-engages, it does so through
`wireSwmHostModeHandler` which enqueues `true` again; the per-CG
queue's serialisation guarantees the final state always reflects
the final intent.

T1b #3 — backwards-compatible access-policy probe (PR #637 c1+c3)
— `_resolveEncryptInlinePayload`'s probe path was returning `null`
(UNKNOWN) when `chain.getContextGraphAccessPolicy` was undefined,
and `null` made the throw at the bottom of the helper hard-fail
publish for any numeric CG. Optional adapter method → mandatory in
practice. External / custom adapters that support V10 publish but
haven't adopted the access-policy getter would 500 on every publish
they routed through here. Fix distinguishes "method not implemented"
(falls back to PUBLIC — v9-era behaviour, restores compat) from
"method threw" (still returns null, still fails closed — that's an
actual RPC failure, refusing to pick plaintext-vs-encrypted without
a verified policy is the right call there). Local-meta probe above
runs unchanged and would still return `true` for any curated CG the
local node created or joined with policy metadata.

No behaviour change for clients that:
  - use the EVM adapter (always implements the getter)
  - run tests that already configure mockACKSigner
  - hit a curated CG with local policy metadata

Builds clean (chain + agent).

Co-authored-by: Cursor <cursoragent@cursor.com>
…38 scripts

The four LU-6 devnet scripts (#621/#622/#623/#624) shipped with several
control-flow gaps that let regressions silently sneak through:
errors swallowed with `|| true`, fixed sleeps where bounded retries
were needed, and assertions that passed even when the scenario under
test never happened. Codex's review on the closed PRs flagged 24
specific items; the rc.10 integration merge fixed most of them. This
commit closes the remaining ones — the ones that materially affect
whether a PASS is meaningful.

devnet-test-rfc38-revocation.sh (#621):
 - Member pre-creates (M1, M2) no longer swallow EVERY error with
   `|| true`. The new `member_pre_create` helper captures the response
   and tolerates ONLY the idempotent "already exists" signal; any
   other failure (wrong auth, malformed body, daemon down) now fails
   the script immediately with the actual error visible, instead of
   surfacing later as an opaque catchup timeout.
 - Phase 5's single "sleep 3 + one-shot read" replaced with a 30s
   bounded-retry loop (`wait_for_count_or_steady`) — gossip /
   catchup latency between the post-revocation write and a member's
   final triple count is variable, and the one-shot snapshot was
   reporting M1's partial state as a regression.
 - Added the forward-only-rotation lower bound: `M2_FINAL >= M2_PRE`.
   The pre-existing `<= 3` check alone would pass if revocation also
   wiped M2's previously-decryptable triples — violating the contract
   in the script header that the kicked member RETAINS what they
   could already decrypt; they just stop learning anything new.

devnet-test-rfc38-curator-offline-midbatch.sh (#622):
 - Phase 1's `sleep 3` after member pre-create replaced with an
   explicit `wait_for_m1_onchain_id` poll. Phase 5's non-curator
   publish requires M1 to have observed the CG's `onChainId`,
   otherwise it bounces with "Context graph ... is not registered
   on-chain" — gossip lag for the `ContextGraphCreated` event can
   easily exceed 3s under devnet load.
 - EXIT trap now waits up to 60s for `/api/status` to respond
   before declaring the curator healthy again. `devnet.sh restart-node`
   returns after spawning the daemon, not after it's actually ready
   to serve, so CI runners that chain another scenario immediately
   were inheriting a half-started devnet.

devnet-test-rfc38-prereg-bytecap-stress.sh (#623):
 - Added a precondition that the burst's SUBMITTED_BYTES actually
   exceeds the configured cap. With the default 80 KiB payload size
   anything up to 12 writes stayed under the 1 MiB cap, so the
   downstream clamp assertion was vacuously satisfied if anyone
   misconfigured WRITES_COUNT/WRITE_PAYLOAD_BYTES — a TEST bug
   passing as a daemon PASS.
 - Phase 6's liveness check no longer silently skips when the
   pidfile is missing/empty. Falls back to a `/api/status` HTTP
   probe so containerised devnets (where pidfiles aren't written)
   still get a meaningful liveness check; hard-fails when both
   pidfile AND status are unreachable.

devnet-test-rfc38-unclean-restart.sh (#624):
 - Captures the killed core's `peerId` from `/api/status` BEFORE
   the SIGKILL. The post-restart catchup calls in phase 6 now pin
   to that peerId so we're explicitly exercising recovery from the
   restarted node — previously the catchup fanned out to any
   connected peer and could pass by pulling data from the curator
   (still online), validating nothing about the unclean-restart
   contract.
 - Phase 3 now waits for STRICT mid-batch state (`0 < M1_PARTIAL <
   WRITES_COUNT`) before the SIGKILL. The previous one-shot snapshot
   accepted any partial value including 0 or already-complete; both
   meant the kill below never actually exercised the
   `lastHostCatchupSeqno` resume path this test claims to cover.
 - Catchup responses are captured (not discarded to `/dev/null`)
   and asserted free of `error` / `swmError` / `durableError` via
   the new `assert_catchup_clean` helper. HTTP 500s, auth denials,
   host-catchup failures, etc. were previously invisible — the
   final triple-count check would still go green if data arrived
   via background gossip.

Bash-syntax-checked (`bash -n` on all four). No behaviour change
when the underlying scenarios are healthy; the changes only convert
silent false-positives into loud failures.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ANGELOG fix

PR #625's runbook documented HTTP and CLI surfaces that don't exist on
the rc.10 daemon — operators following it from step 1 would 404 / 401
their way through every section. PR #638's LU-8 CHANGELOG entry promised
a server-side reconstruction path for `verify-batch` that the actual
route refuses. Both are doc bugs only, no code change.

docs/RFC38_LU6_TWO_LAPTOP_TESTNET_RUNBOOK.md:
 - Replaces `dkg show` (never a real top-level command) with the
   actual surface: `dkg status` (peerId/multiaddrs/role), `dkg auth
   show` (token, stripped of comments), and `GET /api/agent/identity`
   for the agent EOA. Same for `dkg show-cg` (compute the wire id as
   `keccak256(<cgId>)` since there's no CLI shortcut) and `dkg
   shared-memory host-mode stats` (use `GET /api/shared-memory/
   host-mode/stats` directly).
 - Every `$(cat ~/.dkg/auth.token)` interpolation now uses `dkg auth
   show` instead. The token file has a commented-header preamble by
   default, so the literal `cat` would inject `#\n<token>` into the
   `Authorization` header and 401 even on a healthy node. `dkg auth
   show` strips comments + blank lines, matching what
   `packages/node-ui/vite.config.ts` does for the same reason.
 - `/api/context-graph/list` filter corrected — the response is
   `{ contextGraphs: [...] }` (envelope, not bare array) and items
   expose `accessPolicy` (numeric: 0=public, 1=curated), not `access`.
   The old `jq '.[] | select(.access=="curated")'` filter would have
   silently matched nothing.
 - Member catchup now pins `peerId` to the core's libp2p identity.
   Without `peerId`, `/api/shared-memory/catchup` fans out to every
   connected peer — which in a two-laptop+core topology can hit
   laptop A directly or no peer at all, and won't reliably validate
   the "via the core" host-catchup path the runbook claims to verify.
   Showed how to grab the peer id from `/api/status` first.
 - Replaced the `/api/shared-memory/list?contextGraphId=...` curl
   (404, no such route) with a SPARQL `SELECT (COUNT(*) AS ?n)` via
   `/api/query` against the `_shared_memory` graph suffix — the same
   shape every devnet script uses for the equivalent assertion.

CHANGELOG.md:
 - LU-8 entry no longer claims "when `quads` is omitted the route
   reconstructs from the local SWM or post-publish CG data graph".
   That hasn't been true since the safety hardening in the published
   route (`packages/cli/src/daemon/routes/memory.ts:1042-1049`):
   `quads` is **required**, HTTP 400 on omission, because the daemon
   can't safely identify a single published batch's leaves inside a
   CG-wide store. Devnet scenario `scripts/devnet-test-rfc38-lu8.sh`
   SCENARIO 1 pins this contract with a HTTP-400 assertion. The
   CHANGELOG now matches the route's actual behaviour + rationale.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ruptMetasRemoved counter

Tier 1 commit f85d2a3 deliberately split `SwmHostModeStartupReconcileReport`
into `orphanLogsRemoved` (lonely .log without meta — data lost before
reconcile ran) + a new `corruptMetasRemoved` (meta-only casualty —
.meta failed to parse, matching log already reaped via the orphan-logs
pass) so operators could distinguish the two failure modes from a single
report. The change was driven by Codex PR #619 R3's pushback that lumping
both into one counter hid the meta-only signal.

The four affected tests still asserted against the OLD lumped shape:
  - `removes a .log file with no matching .meta on init`
  - `preserves .meta without matching .log (markRegistered + never-written CG case)`
  - `handles mixed dirs: removes orphan logs, leaves healthy pairs + lonely metas alone`
  - `reaps .log paired with a .meta that fails to parse as JSON (crashed mid-persist)`

Updated each to assert the new shape:
  - The three "lonely .log" cases now assert `corruptMetasRemoved: 0`
    alongside their existing counts — they have no corrupt metas.
  - The `crashed mid-persist` case now asserts the split: the lonely
    log contributes to `orphanLogsRemoved=1` and the unparseable meta
    contributes to `corruptMetasRemoved=1`. Total reaped files is
    still 2 (their sum); previously the single counter conflated them.

The `SwmHostModeStartupReconcileReport` type is now imported from the
store module so the local `reportSeen` typing matches the production
shape (the prior inline literal type would have masked future field
additions).

All 25 tests in host-mode-store.test.ts pass. No production code change.

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic branarakic changed the title feat(core): broadcast DKG release on the libp2p wire (rc.11 follow-up) + tier-1 SWM codex hardening fix: post-rc.10 codex follow-up sweep + libp2p nodeVersion broadcast May 25, 2026
Comment thread packages/agent/src/dkg-agent.ts
Comment thread scripts/devnet-test-rfc38-unclean-restart.sh Outdated
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review skipped: filtered diff is 5201 lines (cap: 5,000). Please consider splitting this into smaller PRs for reviewability.

@branarakic branarakic merged commit 4202233 into main May 26, 2026
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant