fix: post-rc.10 codex follow-up sweep + libp2p nodeVersion broadcast by branarakic · Pull Request #653 · OriginTrail/dkg

branarakic · 2026-05-25T17:57:43Z

Summary

This PR closes the 47 unaddressed Codex inline review comments from
ten PRs merged into rc.10's integration branch (#618, #619, #620, #621,
#622, #623, #624, #625, #637, #638), plus pulls in one small rc.11
networking feature that landed on the same working branch.

The codex sweep is intentionally one PR rather than ten — every fix
is small, related, and the alternative was ten near-trivial PRs each
needing their own review cycle. Commits are scoped tightly enough that
a reviewer can read them in isolation.

What's on this branch (oldest first)

#	Commit	Scope
1	`f85d2a37`	fix(agent/swm): tier-1 codex followup — host-mode-store + catchup-sign hardening
2	`30eb38c7`	feat(core): broadcast DKG release on the libp2p wire via `nodeInfo.userAgent`
3	`69c2d583`	fix(agent,chain): tier-1b codex followup — host-mode persistence + adapter compat + mock signer
4	`19dc13c8`	test(devnet): tier-2 codex followup — close false-PASS gaps in 4 RFC-38 scripts
5	`8096f104`	docs: tier-3 codex followup — RFC-38 two-laptop runbook accuracy + CHANGELOG fix
6	`8b354632`	test(agent/swm): update host-mode-store reconcile tests for split `corruptMetasRemoved` counter

Coverage matrix (closed PR → comments → resolution)

Closed PR	Codex comments	Status
#618 (signed catchup)	c1, c2	Already addressed in rc.10 integration (layered auth + signer derived from recovered EOA)
#618 c3	MockChainAdapter zero-byte signature	Fixed in `69c2d583` — delegates to `mockACKSigner` when registered
#619	orphan-log reconcile + report shape	Fixed in `f85d2a37` (I/O vs JSON-parse split, `corruptMetasRemoved` counter). Tests updated in `8b354632`
#620 c2	fire-and-forget mark/unmark persistence	Already addressed in rc.10 integration via per-CG `enqueueHostModePersistence` queue
#620 c4	gossip-teardown drops in-memory state without clearing persisted flag	Fixed in `69c2d583` — `enqueueHostModePersistence(cgId, false)` added
#621	revocation script — 4 false-PASS gaps + idempotent pre-create	Most addressed in rc.10 integration; c4/c5/c6 fixed in `19dc13c8`
#622	curator-offline script — late-joiner + cleanup gaps	Most addressed in rc.10 integration; c5/c7 (sleep→wait, EXIT-trap readiness) fixed in `19dc13c8`
#623	byte-cap script — vacuous precondition + pidfile fallback	Most addressed in rc.10 integration; c5/c6 fixed in `19dc13c8`
#624	unclean-restart script — partial-catchup precondition + peerId pin + response capture	Most addressed in rc.10 integration; c4/c5/c6 fixed in `19dc13c8`
#625	two-laptop runbook — 5 wrong HTTP/CLI surfaces	All 5 fixed in `8096f104` (`dkg show`, `cat auth.token`, `/list` filter, `/list` route, catchup peerId pin)
#637 c1+c3	`getContextGraphAccessPolicy` mandatory regression	Fixed in `69c2d583` — missing method → public (v9-era compat); method-throws → null (fail-closed)
#637 c2	vite devnet fallback	Already addressed in `playwright.config.ts` (`cross-env DEVNET_NODE=…`)
#638 c2	`verify-batch` HTTP 400 assertion in lu8.sh	Already addressed in rc.10 integration (`api_call_with_status` helper, line 153 hard-asserts 400)
#638 c1+c3	`verify-batch` `quads`-required	Intentional design — daemon cannot safely identify a single batch's leaves inside a CG-wide store. CHANGELOG corrected in `8096f104`; codex comments to be dismissed with this rationale
CHANGELOG	LU-8 entry stale (promised reconstruction path that doesn't exist)	Fixed in `8096f104`

What's NOT in this PR

No release/tagging machinery touched
No packages/evm-module/deploy/ changes
No package version bumps (@origintrail-official/* stays at 10.0.0-rc.10)
No main mutations

libp2p nodeVersion (commit `30eb38c7`, rc.11 follow-up)

Bundled here because it was already on the branch when the codex sweep
started. Independent of the codex fixes but reviewable as a single
self-contained commit.

Closes the version-on-the-wire gap surfaced during the v10.0.0-rc.10
testnet rollout — /api/peer-info could tell operators which protocols
a peer spoke but not which DKG release they were running, leaving "did
the network pick up the upgrade?" answerable only by SSHing into each
Core node. After this lands:

TOKEN=$(grep -v '^#' ~/.dkg/auth.token | tr -d '[:space:]')
curl -s -H "Authorization: Bearer $TOKEN" \
  "http://127.0.0.1:9200/api/peer-info?peerId=<peerId>" \
  | jq '.peerStore.nodeVersion'
# → "dkg/10.0.0-rc.11"   (pre-rc.11 peers fall through to libp2p's default)

Wire-format impact: none. libp2p's identify protocol already ships an
agentVersion PB field on every connection — we're just populating it.
Pre-rc.11 peers continue to advertise libp2p's default. No protocol-id
bump, no chainResetMarker bump, no migration.

Test plan

Codex comments to dismiss with rationale

After merge, the following codex comments should be marked resolved
with "intentional design" replies (preserved here for the dismissal
PR-review action):

fix(cli): require explicit quads for batch verification #638 c1+c3 (verify-batch should reconstruct from local SWM when
quads omitted): rejected — the local SWM / data graph can contain
triples from multiple batches in the same CG, and there is no
daemon-side way to identify which leaves belong to the batch being
verified. The route stays caller-supplied-plaintext only; the
CHANGELOG inconsistency that motivated the comments has been
corrected in 8096f104.

Made with Cursor

…n hardening Three production bugs from Codex review on closed PRs (subsumed by #649) that warrant fixing before mainnet handoff: T1 #1 (host-catchup-sign.ts: evictStale early-break) — the loop over the seen-nonce map broke on the first entry whose timestamp was >= threshold, which assumed strict ascending insertion order. Future-skewed clocks on a single signer could leave stale entries that never get evicted. Drop the break; full pass is O(n) on a bounded map. T1 #2 (host-mode-store.ts: I/O vs JSON corruption) — reconcileOrphanLogs treated fs.readFile EIO/EBUSY/etc the same as JSON.parse SyntaxError, deleting healthy .meta+.log pairs on transient I/O. Split the try/catch: I/O failures now park the key in ioSkippedMetaKeys so neither file is removed; only genuinely unparseable metas count toward corruptMetasRemoved. SwmHostModeStartupReconcileReport gains the corruptMetasRemoved field. T1 #4 (host-mode-store.ts: per-CG meta race) — markHostModeSubscribed, markHostModeUnsubscribed, markRegistered, markUnregistered, and append all do load-modify-persist on the same .meta file but lacked any serialisation. Concurrent calls (e.g. invite ACK + restart reconcile) could lose updates. Added withCgWriteLock helper that serialises all meta mutations per contextGraphId via a Map<id, Promise> chain. No behaviour change for clients; all fixes are internal correctness. Builds clean (pnpm --filter dkg-agent build). Tests follow in a later commit as part of the broader followup PR. Co-authored-by: Cursor <cursoragent@cursor.com>

…rAgent Closes the version-on-the-wire gap surfaced during the v10.0.0-rc.10 testnet rollout: today /api/peer-info can tell operators which protocols a peer speaks but not which DKG release it's running, leaving "did the network pick up the upgrade?" answerable only by SSHing into each Core node. This adds DKGNodeConfig.nodeVersion (forwarded through DKGAgent into libp2p's createLibp2p({ nodeInfo: { userAgent } })) and surfaces the remote-peer side via PeerDiagnostics.peerStore.nodeVersion. The daemon sets it to `dkg/<semver>` from the running CLI's package.json, so /api/peer-info on any rc.11+ node can census peer versions with one curl loop. Naming: libp2p's wire field is `AgentVersion`, but inside DKG that collides with `DKGAgent`. We translate at the read boundary (`peer.metadata.AgentVersion` → `peerStore.nodeVersion`) and at the write boundary (`DKGNodeConfig.nodeVersion` → `nodeInfo.userAgent`); libp2p's name only appears in doc comments. `protocolVersion` keeps its libp2p name because it really is libp2p's protocol version (`ipfs/0.1.0`), unrelated to DKG. Defensive on the read side: peer.metadata can be a Map (libp2p >=2.x), a plain object (some serialised paths), or undefined (cold cache / identify not yet completed). Three new test cases lock all three shapes in; one existing test gained the rc.11 nodeVersion: null assertion for the cold-cache path. Pre-rc.11 peers fall through to libp2p's default `js-libp2p/<version>`, which is itself a useful "peer hasn't adopted the version-advertisement convention yet" signal. Co-authored-by: Cursor <cursoragent@cursor.com>

…apter compat + mock signer Three more production bugs from the closed-PR codex review (#618, #620, #637) — the remaining hard-coded fail-paths that escaped the rc.10 integration merge. Companion to f85d2a3 which fixed the host-mode-store data-loss / lock-race set. T1b #1 — MockChainAdapter.signMessage (PR #618 c3) — the mock returned zero-byte `{r, vs}`, so every test that exercised `mintSignedCatchupRequest` recovered a garbage signer on the responder side and host catchup was effectively dead under MockChainAdapter. Existing test files worked around this by monkey-patching `signMessage` on the adapter instance (see cg-discovery-integration.test.ts:114). The mock now delegates to `mockACKSigner` when a wallet has been registered via `setMockACKSigner`, so any test that already wires the ACK signer gets a real EIP-191 signature on `signMessage` too. Tests that don't configure the ACK signer keep getting zeros — no behavioural change for unit tests that don't exercise the signed-catchup path. T1b #2 — gossip-teardown persistence leak (PR #620 c4) — when `reconcileSharedMemoryGossipSubscription` discovers the local node lost membership for a curated CG, it `gossip.unsubscribe(swmTopic)`s the whole topic and clears the in-memory `swmHostModeSubscribed` / `swmHostModeHandlers` maps. The persisted `hostModeSubscribed=true` flag in `_meta` was NOT cleared, so the B3 startup-restore loop in `initializeSwmHostModeStore` would happily re-subscribe to the same CG on next boot — exactly the CG this branch had just torn down for authorization reasons. Added an `enqueueHostModePersistence(cgId, false)` call alongside the in-memory deletes. If the immediate `reconcileSwmHostModeSubscription` re-engages, it does so through `wireSwmHostModeHandler` which enqueues `true` again; the per-CG queue's serialisation guarantees the final state always reflects the final intent. T1b #3 — backwards-compatible access-policy probe (PR #637 c1+c3) — `_resolveEncryptInlinePayload`'s probe path was returning `null` (UNKNOWN) when `chain.getContextGraphAccessPolicy` was undefined, and `null` made the throw at the bottom of the helper hard-fail publish for any numeric CG. Optional adapter method → mandatory in practice. External / custom adapters that support V10 publish but haven't adopted the access-policy getter would 500 on every publish they routed through here. Fix distinguishes "method not implemented" (falls back to PUBLIC — v9-era behaviour, restores compat) from "method threw" (still returns null, still fails closed — that's an actual RPC failure, refusing to pick plaintext-vs-encrypted without a verified policy is the right call there). Local-meta probe above runs unchanged and would still return `true` for any curated CG the local node created or joined with policy metadata. No behaviour change for clients that: - use the EVM adapter (always implements the getter) - run tests that already configure mockACKSigner - hit a curated CG with local policy metadata Builds clean (chain + agent). Co-authored-by: Cursor <cursoragent@cursor.com>

…38 scripts The four LU-6 devnet scripts (#621/#622/#623/#624) shipped with several control-flow gaps that let regressions silently sneak through: errors swallowed with `|| true`, fixed sleeps where bounded retries were needed, and assertions that passed even when the scenario under test never happened. Codex's review on the closed PRs flagged 24 specific items; the rc.10 integration merge fixed most of them. This commit closes the remaining ones — the ones that materially affect whether a PASS is meaningful. devnet-test-rfc38-revocation.sh (#621): - Member pre-creates (M1, M2) no longer swallow EVERY error with `|| true`. The new `member_pre_create` helper captures the response and tolerates ONLY the idempotent "already exists" signal; any other failure (wrong auth, malformed body, daemon down) now fails the script immediately with the actual error visible, instead of surfacing later as an opaque catchup timeout. - Phase 5's single "sleep 3 + one-shot read" replaced with a 30s bounded-retry loop (`wait_for_count_or_steady`) — gossip / catchup latency between the post-revocation write and a member's final triple count is variable, and the one-shot snapshot was reporting M1's partial state as a regression. - Added the forward-only-rotation lower bound: `M2_FINAL >= M2_PRE`. The pre-existing `<= 3` check alone would pass if revocation also wiped M2's previously-decryptable triples — violating the contract in the script header that the kicked member RETAINS what they could already decrypt; they just stop learning anything new. devnet-test-rfc38-curator-offline-midbatch.sh (#622): - Phase 1's `sleep 3` after member pre-create replaced with an explicit `wait_for_m1_onchain_id` poll. Phase 5's non-curator publish requires M1 to have observed the CG's `onChainId`, otherwise it bounces with "Context graph ... is not registered on-chain" — gossip lag for the `ContextGraphCreated` event can easily exceed 3s under devnet load. - EXIT trap now waits up to 60s for `/api/status` to respond before declaring the curator healthy again. `devnet.sh restart-node` returns after spawning the daemon, not after it's actually ready to serve, so CI runners that chain another scenario immediately were inheriting a half-started devnet. devnet-test-rfc38-prereg-bytecap-stress.sh (#623): - Added a precondition that the burst's SUBMITTED_BYTES actually exceeds the configured cap. With the default 80 KiB payload size anything up to 12 writes stayed under the 1 MiB cap, so the downstream clamp assertion was vacuously satisfied if anyone misconfigured WRITES_COUNT/WRITE_PAYLOAD_BYTES — a TEST bug passing as a daemon PASS. - Phase 6's liveness check no longer silently skips when the pidfile is missing/empty. Falls back to a `/api/status` HTTP probe so containerised devnets (where pidfiles aren't written) still get a meaningful liveness check; hard-fails when both pidfile AND status are unreachable. devnet-test-rfc38-unclean-restart.sh (#624): - Captures the killed core's `peerId` from `/api/status` BEFORE the SIGKILL. The post-restart catchup calls in phase 6 now pin to that peerId so we're explicitly exercising recovery from the restarted node — previously the catchup fanned out to any connected peer and could pass by pulling data from the curator (still online), validating nothing about the unclean-restart contract. - Phase 3 now waits for STRICT mid-batch state (`0 < M1_PARTIAL < WRITES_COUNT`) before the SIGKILL. The previous one-shot snapshot accepted any partial value including 0 or already-complete; both meant the kill below never actually exercised the `lastHostCatchupSeqno` resume path this test claims to cover. - Catchup responses are captured (not discarded to `/dev/null`) and asserted free of `error` / `swmError` / `durableError` via the new `assert_catchup_clean` helper. HTTP 500s, auth denials, host-catchup failures, etc. were previously invisible — the final triple-count check would still go green if data arrived via background gossip. Bash-syntax-checked (`bash -n` on all four). No behaviour change when the underlying scenarios are healthy; the changes only convert silent false-positives into loud failures. Co-authored-by: Cursor <cursoragent@cursor.com>

…ANGELOG fix PR #625's runbook documented HTTP and CLI surfaces that don't exist on the rc.10 daemon — operators following it from step 1 would 404 / 401 their way through every section. PR #638's LU-8 CHANGELOG entry promised a server-side reconstruction path for `verify-batch` that the actual route refuses. Both are doc bugs only, no code change. docs/RFC38_LU6_TWO_LAPTOP_TESTNET_RUNBOOK.md: - Replaces `dkg show` (never a real top-level command) with the actual surface: `dkg status` (peerId/multiaddrs/role), `dkg auth show` (token, stripped of comments), and `GET /api/agent/identity` for the agent EOA. Same for `dkg show-cg` (compute the wire id as `keccak256(<cgId>)` since there's no CLI shortcut) and `dkg shared-memory host-mode stats` (use `GET /api/shared-memory/ host-mode/stats` directly). - Every `$(cat ~/.dkg/auth.token)` interpolation now uses `dkg auth show` instead. The token file has a commented-header preamble by default, so the literal `cat` would inject `#\n<token>` into the `Authorization` header and 401 even on a healthy node. `dkg auth show` strips comments + blank lines, matching what `packages/node-ui/vite.config.ts` does for the same reason. - `/api/context-graph/list` filter corrected — the response is `{ contextGraphs: [...] }` (envelope, not bare array) and items expose `accessPolicy` (numeric: 0=public, 1=curated), not `access`. The old `jq '.[] | select(.access=="curated")'` filter would have silently matched nothing. - Member catchup now pins `peerId` to the core's libp2p identity. Without `peerId`, `/api/shared-memory/catchup` fans out to every connected peer — which in a two-laptop+core topology can hit laptop A directly or no peer at all, and won't reliably validate the "via the core" host-catchup path the runbook claims to verify. Showed how to grab the peer id from `/api/status` first. - Replaced the `/api/shared-memory/list?contextGraphId=...` curl (404, no such route) with a SPARQL `SELECT (COUNT(*) AS ?n)` via `/api/query` against the `_shared_memory` graph suffix — the same shape every devnet script uses for the equivalent assertion. CHANGELOG.md: - LU-8 entry no longer claims "when `quads` is omitted the route reconstructs from the local SWM or post-publish CG data graph". That hasn't been true since the safety hardening in the published route (`packages/cli/src/daemon/routes/memory.ts:1042-1049`): `quads` is **required**, HTTP 400 on omission, because the daemon can't safely identify a single published batch's leaves inside a CG-wide store. Devnet scenario `scripts/devnet-test-rfc38-lu8.sh` SCENARIO 1 pins this contract with a HTTP-400 assertion. The CHANGELOG now matches the route's actual behaviour + rationale. Co-authored-by: Cursor <cursoragent@cursor.com>

…ruptMetasRemoved counter Tier 1 commit f85d2a3 deliberately split `SwmHostModeStartupReconcileReport` into `orphanLogsRemoved` (lonely .log without meta — data lost before reconcile ran) + a new `corruptMetasRemoved` (meta-only casualty — .meta failed to parse, matching log already reaped via the orphan-logs pass) so operators could distinguish the two failure modes from a single report. The change was driven by Codex PR #619 R3's pushback that lumping both into one counter hid the meta-only signal. The four affected tests still asserted against the OLD lumped shape: - `removes a .log file with no matching .meta on init` - `preserves .meta without matching .log (markRegistered + never-written CG case)` - `handles mixed dirs: removes orphan logs, leaves healthy pairs + lonely metas alone` - `reaps .log paired with a .meta that fails to parse as JSON (crashed mid-persist)` Updated each to assert the new shape: - The three "lonely .log" cases now assert `corruptMetasRemoved: 0` alongside their existing counts — they have no corrupt metas. - The `crashed mid-persist` case now asserts the split: the lonely log contributes to `orphanLogsRemoved=1` and the unparseable meta contributes to `corruptMetasRemoved=1`. Total reaped files is still 2 (their sum); previously the single counter conflated them. The `SwmHostModeStartupReconcileReport` type is now imported from the store module so the local `reportSeen` typing matches the production shape (the prior inline literal type would have masked future field additions). All 25 tests in host-mode-store.test.ts pass. No production code change. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions

Codex review skipped: filtered diff is 5201 lines (cap: 5,000). Please consider splitting this into smaller PRs for reviewability.

Branimir Rakic and others added 2 commits May 25, 2026 19:56

github-actions Bot reviewed May 25, 2026

View reviewed changes

Comment thread packages/agent/src/swm/host-mode-store.ts Outdated

Comment thread packages/agent/src/swm/host-mode-store.ts

Branimir Rakic and others added 4 commits May 25, 2026 20:28

branarakic requested review from Jurij89 and zsculac as code owners May 25, 2026 18:44

branarakic changed the title ~~feat(core): broadcast DKG release on the libp2p wire (rc.11 follow-up) + tier-1 SWM codex hardening~~ fix: post-rc.10 codex follow-up sweep + libp2p nodeVersion broadcast May 25, 2026

github-actions Bot reviewed May 25, 2026

View reviewed changes

Comment thread packages/agent/src/dkg-agent.ts

Comment thread scripts/devnet-test-rfc38-unclean-restart.sh Outdated

fix(agent): address post-rc10 review follow-ups

ad27c1f

github-actions Bot reviewed May 25, 2026

View reviewed changes

branarakic merged commit 4202233 into main May 26, 2026
38 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: post-rc.10 codex follow-up sweep + libp2p nodeVersion broadcast#653

fix: post-rc.10 codex follow-up sweep + libp2p nodeVersion broadcast#653
branarakic merged 7 commits into
mainfrom
fix/post-rc10-codex-followup

branarakic commented May 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

branarakic commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's on this branch (oldest first)

Coverage matrix (closed PR → comments → resolution)

What's NOT in this PR

libp2p nodeVersion (commit 30eb38c7, rc.11 follow-up)

Test plan

Codex comments to dismiss with rationale

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

branarakic commented May 25, 2026 •

edited

Loading

libp2p nodeVersion (commit `30eb38c7`, rc.11 follow-up)