fix: post-rc.10 codex follow-up sweep + libp2p nodeVersion broadcast#653
Merged
Conversation
…n hardening Three production bugs from Codex review on closed PRs (subsumed by #649) that warrant fixing before mainnet handoff: T1 #1 (host-catchup-sign.ts: evictStale early-break) — the loop over the seen-nonce map broke on the first entry whose timestamp was >= threshold, which assumed strict ascending insertion order. Future-skewed clocks on a single signer could leave stale entries that never get evicted. Drop the break; full pass is O(n) on a bounded map. T1 #2 (host-mode-store.ts: I/O vs JSON corruption) — reconcileOrphanLogs treated fs.readFile EIO/EBUSY/etc the same as JSON.parse SyntaxError, deleting healthy .meta+.log pairs on transient I/O. Split the try/catch: I/O failures now park the key in ioSkippedMetaKeys so neither file is removed; only genuinely unparseable metas count toward corruptMetasRemoved. SwmHostModeStartupReconcileReport gains the corruptMetasRemoved field. T1 #4 (host-mode-store.ts: per-CG meta race) — markHostModeSubscribed, markHostModeUnsubscribed, markRegistered, markUnregistered, and append all do load-modify-persist on the same .meta file but lacked any serialisation. Concurrent calls (e.g. invite ACK + restart reconcile) could lose updates. Added withCgWriteLock helper that serialises all meta mutations per contextGraphId via a Map<id, Promise> chain. No behaviour change for clients; all fixes are internal correctness. Builds clean (pnpm --filter dkg-agent build). Tests follow in a later commit as part of the broader followup PR. Co-authored-by: Cursor <cursoragent@cursor.com>
…rAgent
Closes the version-on-the-wire gap surfaced during the v10.0.0-rc.10
testnet rollout: today /api/peer-info can tell operators which
protocols a peer speaks but not which DKG release it's running,
leaving "did the network pick up the upgrade?" answerable only by
SSHing into each Core node.
This adds DKGNodeConfig.nodeVersion (forwarded through DKGAgent into
libp2p's createLibp2p({ nodeInfo: { userAgent } })) and surfaces the
remote-peer side via PeerDiagnostics.peerStore.nodeVersion. The
daemon sets it to `dkg/<semver>` from the running CLI's package.json,
so /api/peer-info on any rc.11+ node can census peer versions with
one curl loop.
Naming: libp2p's wire field is `AgentVersion`, but inside DKG that
collides with `DKGAgent`. We translate at the read boundary
(`peer.metadata.AgentVersion` → `peerStore.nodeVersion`) and at the
write boundary (`DKGNodeConfig.nodeVersion` → `nodeInfo.userAgent`);
libp2p's name only appears in doc comments. `protocolVersion` keeps
its libp2p name because it really is libp2p's protocol version
(`ipfs/0.1.0`), unrelated to DKG.
Defensive on the read side: peer.metadata can be a Map (libp2p >=2.x),
a plain object (some serialised paths), or undefined (cold cache /
identify not yet completed). Three new test cases lock all three
shapes in; one existing test gained the rc.11 nodeVersion: null
assertion for the cold-cache path.
Pre-rc.11 peers fall through to libp2p's default
`js-libp2p/<version>`, which is itself a useful "peer hasn't adopted
the version-advertisement convention yet" signal.
Co-authored-by: Cursor <cursoragent@cursor.com>
…apter compat + mock signer Three more production bugs from the closed-PR codex review (#618, #620, #637) — the remaining hard-coded fail-paths that escaped the rc.10 integration merge. Companion to f85d2a3 which fixed the host-mode-store data-loss / lock-race set. T1b #1 — MockChainAdapter.signMessage (PR #618 c3) — the mock returned zero-byte `{r, vs}`, so every test that exercised `mintSignedCatchupRequest` recovered a garbage signer on the responder side and host catchup was effectively dead under MockChainAdapter. Existing test files worked around this by monkey-patching `signMessage` on the adapter instance (see cg-discovery-integration.test.ts:114). The mock now delegates to `mockACKSigner` when a wallet has been registered via `setMockACKSigner`, so any test that already wires the ACK signer gets a real EIP-191 signature on `signMessage` too. Tests that don't configure the ACK signer keep getting zeros — no behavioural change for unit tests that don't exercise the signed-catchup path. T1b #2 — gossip-teardown persistence leak (PR #620 c4) — when `reconcileSharedMemoryGossipSubscription` discovers the local node lost membership for a curated CG, it `gossip.unsubscribe(swmTopic)`s the whole topic and clears the in-memory `swmHostModeSubscribed` / `swmHostModeHandlers` maps. The persisted `hostModeSubscribed=true` flag in `_meta` was NOT cleared, so the B3 startup-restore loop in `initializeSwmHostModeStore` would happily re-subscribe to the same CG on next boot — exactly the CG this branch had just torn down for authorization reasons. Added an `enqueueHostModePersistence(cgId, false)` call alongside the in-memory deletes. If the immediate `reconcileSwmHostModeSubscription` re-engages, it does so through `wireSwmHostModeHandler` which enqueues `true` again; the per-CG queue's serialisation guarantees the final state always reflects the final intent. T1b #3 — backwards-compatible access-policy probe (PR #637 c1+c3) — `_resolveEncryptInlinePayload`'s probe path was returning `null` (UNKNOWN) when `chain.getContextGraphAccessPolicy` was undefined, and `null` made the throw at the bottom of the helper hard-fail publish for any numeric CG. Optional adapter method → mandatory in practice. External / custom adapters that support V10 publish but haven't adopted the access-policy getter would 500 on every publish they routed through here. Fix distinguishes "method not implemented" (falls back to PUBLIC — v9-era behaviour, restores compat) from "method threw" (still returns null, still fails closed — that's an actual RPC failure, refusing to pick plaintext-vs-encrypted without a verified policy is the right call there). Local-meta probe above runs unchanged and would still return `true` for any curated CG the local node created or joined with policy metadata. No behaviour change for clients that: - use the EVM adapter (always implements the getter) - run tests that already configure mockACKSigner - hit a curated CG with local policy metadata Builds clean (chain + agent). Co-authored-by: Cursor <cursoragent@cursor.com>
…38 scripts The four LU-6 devnet scripts (#621/#622/#623/#624) shipped with several control-flow gaps that let regressions silently sneak through: errors swallowed with `|| true`, fixed sleeps where bounded retries were needed, and assertions that passed even when the scenario under test never happened. Codex's review on the closed PRs flagged 24 specific items; the rc.10 integration merge fixed most of them. This commit closes the remaining ones — the ones that materially affect whether a PASS is meaningful. devnet-test-rfc38-revocation.sh (#621): - Member pre-creates (M1, M2) no longer swallow EVERY error with `|| true`. The new `member_pre_create` helper captures the response and tolerates ONLY the idempotent "already exists" signal; any other failure (wrong auth, malformed body, daemon down) now fails the script immediately with the actual error visible, instead of surfacing later as an opaque catchup timeout. - Phase 5's single "sleep 3 + one-shot read" replaced with a 30s bounded-retry loop (`wait_for_count_or_steady`) — gossip / catchup latency between the post-revocation write and a member's final triple count is variable, and the one-shot snapshot was reporting M1's partial state as a regression. - Added the forward-only-rotation lower bound: `M2_FINAL >= M2_PRE`. The pre-existing `<= 3` check alone would pass if revocation also wiped M2's previously-decryptable triples — violating the contract in the script header that the kicked member RETAINS what they could already decrypt; they just stop learning anything new. devnet-test-rfc38-curator-offline-midbatch.sh (#622): - Phase 1's `sleep 3` after member pre-create replaced with an explicit `wait_for_m1_onchain_id` poll. Phase 5's non-curator publish requires M1 to have observed the CG's `onChainId`, otherwise it bounces with "Context graph ... is not registered on-chain" — gossip lag for the `ContextGraphCreated` event can easily exceed 3s under devnet load. - EXIT trap now waits up to 60s for `/api/status` to respond before declaring the curator healthy again. `devnet.sh restart-node` returns after spawning the daemon, not after it's actually ready to serve, so CI runners that chain another scenario immediately were inheriting a half-started devnet. devnet-test-rfc38-prereg-bytecap-stress.sh (#623): - Added a precondition that the burst's SUBMITTED_BYTES actually exceeds the configured cap. With the default 80 KiB payload size anything up to 12 writes stayed under the 1 MiB cap, so the downstream clamp assertion was vacuously satisfied if anyone misconfigured WRITES_COUNT/WRITE_PAYLOAD_BYTES — a TEST bug passing as a daemon PASS. - Phase 6's liveness check no longer silently skips when the pidfile is missing/empty. Falls back to a `/api/status` HTTP probe so containerised devnets (where pidfiles aren't written) still get a meaningful liveness check; hard-fails when both pidfile AND status are unreachable. devnet-test-rfc38-unclean-restart.sh (#624): - Captures the killed core's `peerId` from `/api/status` BEFORE the SIGKILL. The post-restart catchup calls in phase 6 now pin to that peerId so we're explicitly exercising recovery from the restarted node — previously the catchup fanned out to any connected peer and could pass by pulling data from the curator (still online), validating nothing about the unclean-restart contract. - Phase 3 now waits for STRICT mid-batch state (`0 < M1_PARTIAL < WRITES_COUNT`) before the SIGKILL. The previous one-shot snapshot accepted any partial value including 0 or already-complete; both meant the kill below never actually exercised the `lastHostCatchupSeqno` resume path this test claims to cover. - Catchup responses are captured (not discarded to `/dev/null`) and asserted free of `error` / `swmError` / `durableError` via the new `assert_catchup_clean` helper. HTTP 500s, auth denials, host-catchup failures, etc. were previously invisible — the final triple-count check would still go green if data arrived via background gossip. Bash-syntax-checked (`bash -n` on all four). No behaviour change when the underlying scenarios are healthy; the changes only convert silent false-positives into loud failures. Co-authored-by: Cursor <cursoragent@cursor.com>
…ANGELOG fix PR #625's runbook documented HTTP and CLI surfaces that don't exist on the rc.10 daemon — operators following it from step 1 would 404 / 401 their way through every section. PR #638's LU-8 CHANGELOG entry promised a server-side reconstruction path for `verify-batch` that the actual route refuses. Both are doc bugs only, no code change. docs/RFC38_LU6_TWO_LAPTOP_TESTNET_RUNBOOK.md: - Replaces `dkg show` (never a real top-level command) with the actual surface: `dkg status` (peerId/multiaddrs/role), `dkg auth show` (token, stripped of comments), and `GET /api/agent/identity` for the agent EOA. Same for `dkg show-cg` (compute the wire id as `keccak256(<cgId>)` since there's no CLI shortcut) and `dkg shared-memory host-mode stats` (use `GET /api/shared-memory/ host-mode/stats` directly). - Every `$(cat ~/.dkg/auth.token)` interpolation now uses `dkg auth show` instead. The token file has a commented-header preamble by default, so the literal `cat` would inject `#\n<token>` into the `Authorization` header and 401 even on a healthy node. `dkg auth show` strips comments + blank lines, matching what `packages/node-ui/vite.config.ts` does for the same reason. - `/api/context-graph/list` filter corrected — the response is `{ contextGraphs: [...] }` (envelope, not bare array) and items expose `accessPolicy` (numeric: 0=public, 1=curated), not `access`. The old `jq '.[] | select(.access=="curated")'` filter would have silently matched nothing. - Member catchup now pins `peerId` to the core's libp2p identity. Without `peerId`, `/api/shared-memory/catchup` fans out to every connected peer — which in a two-laptop+core topology can hit laptop A directly or no peer at all, and won't reliably validate the "via the core" host-catchup path the runbook claims to verify. Showed how to grab the peer id from `/api/status` first. - Replaced the `/api/shared-memory/list?contextGraphId=...` curl (404, no such route) with a SPARQL `SELECT (COUNT(*) AS ?n)` via `/api/query` against the `_shared_memory` graph suffix — the same shape every devnet script uses for the equivalent assertion. CHANGELOG.md: - LU-8 entry no longer claims "when `quads` is omitted the route reconstructs from the local SWM or post-publish CG data graph". That hasn't been true since the safety hardening in the published route (`packages/cli/src/daemon/routes/memory.ts:1042-1049`): `quads` is **required**, HTTP 400 on omission, because the daemon can't safely identify a single published batch's leaves inside a CG-wide store. Devnet scenario `scripts/devnet-test-rfc38-lu8.sh` SCENARIO 1 pins this contract with a HTTP-400 assertion. The CHANGELOG now matches the route's actual behaviour + rationale. Co-authored-by: Cursor <cursoragent@cursor.com>
…ruptMetasRemoved counter Tier 1 commit f85d2a3 deliberately split `SwmHostModeStartupReconcileReport` into `orphanLogsRemoved` (lonely .log without meta — data lost before reconcile ran) + a new `corruptMetasRemoved` (meta-only casualty — .meta failed to parse, matching log already reaped via the orphan-logs pass) so operators could distinguish the two failure modes from a single report. The change was driven by Codex PR #619 R3's pushback that lumping both into one counter hid the meta-only signal. The four affected tests still asserted against the OLD lumped shape: - `removes a .log file with no matching .meta on init` - `preserves .meta without matching .log (markRegistered + never-written CG case)` - `handles mixed dirs: removes orphan logs, leaves healthy pairs + lonely metas alone` - `reaps .log paired with a .meta that fails to parse as JSON (crashed mid-persist)` Updated each to assert the new shape: - The three "lonely .log" cases now assert `corruptMetasRemoved: 0` alongside their existing counts — they have no corrupt metas. - The `crashed mid-persist` case now asserts the split: the lonely log contributes to `orphanLogsRemoved=1` and the unparseable meta contributes to `corruptMetasRemoved=1`. Total reaped files is still 2 (their sum); previously the single counter conflated them. The `SwmHostModeStartupReconcileReport` type is now imported from the store module so the local `reportSeen` typing matches the production shape (the prior inline literal type would have masked future field additions). All 25 tests in host-mode-store.test.ts pass. No production code change. Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR closes the 47 unaddressed Codex inline review comments from
ten PRs merged into rc.10's integration branch (#618, #619, #620, #621,
#622, #623, #624, #625, #637, #638), plus pulls in one small rc.11
networking feature that landed on the same working branch.
The codex sweep is intentionally one PR rather than ten — every fix
is small, related, and the alternative was ten near-trivial PRs each
needing their own review cycle. Commits are scoped tightly enough that
a reviewer can read them in isolation.
What's on this branch (oldest first)
f85d2a3730eb38c7nodeInfo.userAgent69c2d58319dc13c88096f1048b354632corruptMetasRemovedcounterCoverage matrix (closed PR → comments → resolution)
69c2d583— delegates tomockACKSignerwhen registeredf85d2a37(I/O vs JSON-parse split,corruptMetasRemovedcounter). Tests updated in8b354632enqueueHostModePersistencequeue69c2d583—enqueueHostModePersistence(cgId, false)added19dc13c819dc13c819dc13c819dc13c88096f104(dkg show,cat auth.token,/listfilter,/listroute, catchup peerId pin)getContextGraphAccessPolicymandatory regression69c2d583— missing method → public (v9-era compat); method-throws → null (fail-closed)playwright.config.ts(cross-env DEVNET_NODE=…)verify-batchHTTP 400 assertion in lu8.shapi_call_with_statushelper, line 153 hard-asserts 400)verify-batchquads-required8096f104; codex comments to be dismissed with this rationale8096f104What's NOT in this PR
packages/evm-module/deploy/changes@origintrail-official/*stays at10.0.0-rc.10)mainmutationslibp2p nodeVersion (commit
30eb38c7, rc.11 follow-up)Bundled here because it was already on the branch when the codex sweep
started. Independent of the codex fixes but reviewable as a single
self-contained commit.
Closes the version-on-the-wire gap surfaced during the v10.0.0-rc.10
testnet rollout —
/api/peer-infocould tell operators which protocolsa peer spoke but not which DKG release they were running, leaving "did
the network pick up the upgrade?" answerable only by SSHing into each
Core node. After this lands:
Wire-format impact: none. libp2p's identify protocol already ships an
agentVersionPB field on every connection — we're just populating it.Pre-rc.11 peers continue to advertise libp2p's default. No protocol-id
bump, no
chainResetMarkerbump, no migration.Test plan
pnpm --filter @origintrail-official/dkg-chain build— cleanpnpm --filter @origintrail-official/dkg-agent build— cleanpnpm --filter @origintrail-official/dkg build(CLI) — cleanswm-snapshot-sync.test.ts, commit8d7d34fd, unrelated)test/swm/host-catchup-sign.test.ts— 16/16test/swm/host-catchup-wire.test.ts— 13/13test/swm/host-mode-store.test.ts— 25/25test/dkg-agent-diagnostics.test.ts— passes (rc.11 nodeVersion tests included)bash -nsyntax-check on all 4 modified devnet scriptsCodex comments to dismiss with rationale
After merge, the following codex comments should be marked resolved
with "intentional design" replies (preserved here for the dismissal
PR-review action):
verify-batchshould reconstruct from local SWM whenquadsomitted): rejected — the local SWM / data graph can containtriples from multiple batches in the same CG, and there is no
daemon-side way to identify which leaves belong to the batch being
verified. The route stays caller-supplied-plaintext only; the
CHANGELOG inconsistency that motivated the comments has been
corrected in
8096f104.Made with Cursor