fix: rc.11 follow-up — address 11 unaddressed Codex findings from the #680 integration sweep by branarakic · Pull Request #685 · OriginTrail/dkg

branarakic · 2026-05-26T12:13:50Z

Summary

Follow-up to the rc.11 integration PR (#680). Closes the highest-severity Codex review findings that the rc.11 cut deferred so they could be addressed in a single reviewable PR after merge instead of holding the release.

9 critical 🔴 findings + 2 medium 🟡 findings, across 7 source files and 6 test files. All addressed with regression test coverage where the surface is unit-testable.

PRs whose findings are addressed here:

PR	Finding	File
#661	IPv6 zero-padded `2001:0db8::` slips past TEST-NET-2 check (🔴)	`core-prereq-check.ts:262`
#661	`isReservedDnsName` trailing-root-dot bypass (🟡)	`core-prereq-check.ts:309`
#661	`lifecycle.ts` hides `prereq.indeterminate` verdict (🟡)	`lifecycle.ts:1276`
#664	Watchdog disarm via `api.port` deletion is permanent (🔴)	`supervisor-liveness.ts`
#665	`swmInserted` marker write failure → re-promote risk (🔴)	`async-promote-worker.ts:262`
#666	`dkgHomeNow` derives from CLI install mode, not target checkout (🔴)	`cli.ts:4082` / `migrate-to-npm.ts`
#668	First `self:peer:update` locks in `private` before AutoNAT verifies (🔴)	`nat-status.ts:273`
#669	`stopSignal` not cached at handler entry; `stream.close()` not abortable (🔴)	`protocol-router.ts:355`
#670	`vitest.unit.config.ts` hardcoded include drops new unit tests (🟡)	`chain/vitest.unit.config.ts`
#670	`JSON.stringify` throws on bigint in ethers payloads (🟡)	`filter-error-silencer.ts:127`
#673	SIGKILL hits supervisor pid only; daemon worker survives (🔴)	`devnet-test-rc11-promote-crash-recovery.sh:238`
#673	`exit 0` on false-green path (worker drained before kill) (🔴)	`devnet-test-rc11-promote-crash-recovery.sh:209`
#673	SPARQL `selection` shape rejected by `/api/shared-memory/publish` (🔴)	`devnet-test-rc11-shutdown-mid-publish.sh:162`

Each commit references the specific Codex discussion_r* ID it addresses.

Notable design decisions

feat(cli): supervisor positive-liveness probe (PR-9) #664 watchdog disarm: replaced "disarm forever" with a bounded shutdownGraceMs (default 30s = 2× SHUTDOWN_HARD_TIMEOUT_MS). Graceful shutdown gets the full window; wedged teardowns still get SIGKILLed. shutdownGraceMs < 0 preserves the legacy "never SIGKILL during graceful shutdown" semantic for operators who explicitly want it.
feat(daemon): PR #3 — async-promote worker supervisor + lifecycle wiring #665 swmInserted marker: introduces a new partial_promote_ambiguity outcome that does NOT call queue.fail() — the job stays in running state, and recoverOnStartup() correctly routes it into the "abandoned partial promote" bucket on next boot. This prevents /promote-async/{jobId}/recover from blindly re-running an already-completed promote. A loud PARTIAL-PROMOTE-AMBIGUITY: jobId=… log surfaces the condition for operator alerting.
feat(cli): 'dkg migrate-to-npm' subcommand + operator runbook (PR-7) #666 dkgHomeNow: extracted selectMigrationDkgHome() that probes BOTH ~/.dkg-dev and ~/.dkg for a live daemon. Falls back to install-mode-based resolution only when neither home has a running daemon (greenfield migration). Surfaces recoveredGlobalCliInCheckout so the CLI logs an explicit line when it overrides the install-mode default.
feat(daemon): AutoNAT-driven NAT-status watcher for core nodes (PR-5) #668 NAT-status: only the FIRST event-driven non-public reclassification is treated as the bound-address baseline (same as the initial snapshot). Once an event has fired once, subsequent private events are treated as real verdicts — we don't permanently suppress legitimate transitions.

Findings NOT in this PR (parked for #676 rc.12 backlog)

The remaining 7 medium 🟡 findings are tracked in #676 (rc.12 follow-ups):

feat(cli/daemon): surface relay metrics in /api/status #662 /api/status HEAD path / duplicate RelayStatusResponse shape (HEAD path actually ADDRESSED on main, only the duplicate shape remains)
feat(daemon): PR #3 — async-promote worker supervisor + lifecycle wiring #665 supervisor stop() race with start() resolution
feat(daemon): PR #3 — async-promote worker supervisor + lifecycle wiring #665 PromoteCommitMarker.promoteStarted source-compat (internal type)
feat(daemon): PR #4 — async-promote queue config knobs + E2E test + SKILL.md docs #667 heartbeatIntervalMs vs actual queue lease (partial guard exists for default lease)
feat(daemon): AutoNAT-driven NAT-status watcher for core nodes (PR-5) #668 third independent address classifier in nat-status.ts:107 (engineering hygiene)
feat(rc.11) PR-A: delete self-signed ACK fallback + delete tentative-VM concept #671 hardhat-harness.ts:432 quorum=1 weakens ACK validation
feat(rc.11) PR-B: typed ACK errors + RPC caching + LU-6 runbook + provenance telemetry #672 evm-adapter.ts:390 1h TTL caches mutable reads (invalidatePublishPreflightCache() exists as explicit-flush hook)

Findings actually ALREADY ADDRESSED on main (verified by file-level inspection during triage):

feat(cli/daemon): surface relay metrics in /api/status #662 status.ts HEAD fast-path (line 494 returns early before relay walk) ✅
feat(daemon): PR #4 — async-promote queue config knobs + E2E test + SKILL.md docs #667 /promote-async 503-gate when enabled: false (assertion.ts:533) ✅
feat(daemon): PR #4 — async-promote queue config knobs + E2E test + SKILL.md docs #667 SKILL.md state query param docs match searchParams.get() ✅
feat(rc.11) PR-A: delete self-signed ACK fallback + delete tentative-VM concept #671 verified-memory root graph restored by a63c8669 (in main via release: v10.0.0-rc.11 integration branch #680) ✅
fix(evm-module): use ERC-721 _safeMint with CEI ordering on every mint path #663 (entire PR) — reverted and superseded by refactor(evm-module): apply CEI mint-last ordering on every ERC-721 mint path #681 in the integration branch ✅

Closing redundant PRs that #680 folded in (no commits ahead of main):

feat(daemon): PR #3 — async-promote worker supervisor + lifecycle wiring #665 (async-promote worker supervisor) — closed
feat(daemon): PR #4 — async-promote queue config knobs + E2E test + SKILL.md docs #667 (async-promote queue config + e2e) — closed
feat(rc.11) PR-B: typed ACK errors + RPC caching + LU-6 runbook + provenance telemetry #672 (typed errors + RPC caching + LU-6 runbook + provenance) — closed

Verification

pnpm --filter '@origintrail-official/dkg' test:unit — 416/416 passing (up from 403; +13 new regression cases)
pnpm --filter '@origintrail-official/dkg-chain' test:unit — 68/68 passing (up from 59; +8 newly-discovered hub-resolution-cache + 1 new bigint case)
pnpm --filter '@origintrail-official/dkg-core' exec vitest run test/protocol-router-abort.test.ts test/protocol-router.test.ts test/protocol-router-pool.test.ts — 70/70 passing
pnpm -r build clean
bash -n scripts/devnet-test-rc11-*.sh — syntax OK

Test plan

Build clean (pnpm -r build)
Touched-package unit tests green (416 CLI, 68 chain, 70 core protocol-router)
New regression cases for every fix where the surface is unit-testable (IPv6, trailing-dot DNS, indeterminate branch, watchdog re-arm + legacy opt-out, partial-promote-ambiguity, probe-both-homes, NAT first-event-not-definitive, bigint serializer)
Re-run the rc11-targeted devnet scenarios on a fresh devnet to confirm the script fixes
CI green

Made with Cursor

PR #673 follow-up — three false-green / wrong-target issues Codex flagged in the final review pass that the rc.11 cut deferred: 1. **promote-crash-recovery.sh** SIGKILL hit only the supervisor pid (#673#discussion_r3302023868). `devnet.pid` is the foreground supervisor process — `daemon-foreground-worker` keeps the API port and SQLite store open after `kill -9 <supervisor>`, so the restart path doesn't reproduce a real crash. Now reads BOTH `devnet.pid` (supervisor) and `daemon.pid` (worker) and kills both, with the alive-after-SIGKILL gate checking both pids. 2. **promote-crash-recovery.sh** exit code on the inconclusive path (#673#discussion_r3302023872). When the worker drains the job before the kill could land, the script used to `exit 0`, hiding the fact that the crash-recovery path was never exercised. Now `exit 2` so CI can distinguish a real pass from a missed window. 3. **shutdown-mid-publish.sh** SPARQL-shaped selection silently rejected (#673#discussion_r3302023873). `/api/shared-memory/publish` accepts `selection: "all"` or a root- entity string array — not `{ kind: "sparql", query: ... }`. The surrounding `|| true` was hiding the failure, so the script ended up timing an idle shutdown instead of shutdown under StorageACK pressure. Now passes the 8 generated root entities directly. Co-authored-by: Cursor <cursoragent@cursor.com>

PR #661 follow-up — three classifier holes Codex flagged in the final review pass that the rc.11 cut deferred: 1. **IPv6 documentation range zero-padded bypass** (#661#discussion_r3302752877). The strict `ipv6Class()` check used `startsWith('2001:db8:')` against the lowercased string, which matches `2001:db8::1` but misses zero-padded spellings like `2001:0db8::1` and `2001:0db8:0000:0000::1`. Normalize the first two hextets numerically before classifying. Adds two regression cases to the `classifyMultiaddr` table. 2. **`isReservedDnsName()` trailing-root-dot bypass** (#661#discussion_r3302752890). FQDNs may carry a trailing root dot; `localhost.`, `relay.test.`, `svc.cluster.local.` all fell through as usable DNS and could rescue a degraded listener. Strip the terminal `.` before the suffix checks. Adds a regression case exercising the rescue path for all three. 3. **`lifecycle.ts` hides `prereq.indeterminate` verdict** (#661#discussion_r3302752893). The post-start logger only branched on `looksDegraded`; the warn-only DNS-rescue path hit the unconditional `OK: N public-class listen addresses bound` line, even though the checker had just computed `indeterminate=true` with non-empty `prereq.reasons`. Add an `indeterminate` branch that surfaces those reasons so operators see why the sweep neither passed strictly nor failed. Verification: `pnpm exec vitest run test/core-prereq-check.test.ts` — 72 passing (3 new regression cases + 69 existing). Co-authored-by: Cursor <cursoragent@cursor.com>

PR #664 follow-up. Codex (#664#discussion_r3302432762) flagged that the previous implementation permanently disarmed the watchdog as soon as `isShuttingDown()` returned true. The wiring uses `api.port` file deletion as the shutdown signal, and the worker removes it BEFORE the slow cleanup tail (`agent.stop()`, DB close, …). If any of those later awaits hangs, the supervisor could never SIGKILL or respawn the worker because the watcher had already set `stopped=true`. Fix: introduce `shutdownGraceMs` (default 30s = 2× SHUTDOWN_HARD_ TIMEOUT_MS). When shutdown is first observed, the watcher records the timestamp and suppresses failure counting for that window — long enough for a healthy graceful shutdown to finish, including the daemon's own self-force-exit deadline. After the window expires the SIGKILL path re-arms so wedged teardowns still get force-killed. `shutdownGraceMs < 0` preserves the legacy "disarm forever" behavior for operators who explicitly want it. Verification: `pnpm exec vitest run test/supervisor-liveness.test.ts` — 32 passing, including 2 new regression cases for the re-arm and the legacy-opt-out paths. Co-authored-by: Cursor <cursoragent@cursor.com>

…ing fails PR #665 follow-up. Codex (#665#discussion_r3302646439) flagged that if `assertion.promote()` has ALREADY returned successfully and one of the two subsequent writes (`recordCommitMarker('swmInserted')` or `queue.succeed()`) throws — store hiccup, lost lease, transient FS error — the outer worker catch parked the job as `failed` with retryable=false. That left it eligible for `/promote-async/{jobId}/ recover`, which blindly re-queues failed jobs and would re-run a promote that already mutated SWM and emitted gossip. Fix: wrap the two post-promote writes in a dedicated try/catch that returns a new `partial_promote_ambiguity` outcome instead of throwing. This outcome is counted in `PromoteWorkerCounters.partialPromoteAmbiguity` but DOES NOT call `queue.fail()`, so the job stays in `running` state. On next daemon boot, `recoverOnStartup()` sees `promoteStarted=true && swmInserted=false` and correctly routes it into the "abandoned partial promote" bucket per `PromoteRecoveryResult.abandoned` — exactly the path the type docs describe as "operator action required". A loud `PARTIAL-PROMOTE-AMBIGUITY: jobId=…` log line surfaces the condition for grep-able operator alerting. Verification: `pnpm exec vitest run test/async-promote-worker.test.ts` — 25 passing, including a new regression case that simulates a store hiccup on the `swmInserted` marker write and asserts the job stays in `running` state with `promoteStarted=true, swmInserted=false`. Co-authored-by: Cursor <cursoragent@cursor.com>

PR #666 follow-up. Codex (#666#discussion_r3302712591) flagged that `dkgHomeNow` was derived purely from the LIVE CLI's install mode via `repoDir()`. When an operator runs a globally installed `dkg` (standalone install-mode → `repoDir() === null`) from inside an unmigrated git checkout, the local daemon is still using `~/.dkg-dev` but the previous code resolved to `~/.dkg`. That: - missed the live daemon in the orphan-state blocker - read `currentAutoUpdateSource` from the wrong config - landed the `autoUpdate.source = "npm"` pin in the wrong config while the daemon kept reading the original ~/.dkg-dev/config.json Fix: introduce `selectMigrationDkgHome()` which probes BOTH the monorepo-candidate home (~/.dkg-dev) and the standalone home (~/.dkg) for an active daemon. Picks whichever has a live pid; falls back to the install-mode-based `resolveMigrationDkgHome()` only when no daemon is running (greenfield migration). Surfaces `recoveredGlobalCliInCheckout` so the CLI logs an explicit line when it overrides the install-mode default — operators see why the home selection diverged from `repoDir()`. Verification: `pnpm exec vitest run test/migrate-to-npm.test.ts` — 39 passing, including 5 new cases for the global-CLI-in-checkout, greenfield, monorepo-matches-monorepo, standalone-matches-standalone, and dead-pid-fallback paths. Co-authored-by: Cursor <cursoragent@cursor.com>

…rdict PR #668 follow-up. Codex (#668#discussion_r3302734688) flagged that the very first `self:peer:update` after listen also fires for the initial post-listen peer record AutoNAT publishes. If that record contains only RFC1918/CGNAT multiaddrs (common during a cold boot before AutoNAT has verified external reach), the previous logic marked `private` as DEFINITIVE immediately — disabling the soft timeout and locking the verdict until a later address-update event happened (which may never come on a stable private-only host). Fix: treat the FIRST non-public event-driven reclassification the same as the initial bound-address snapshot. The cached status updates so `/api/status` reflects the current state, but the classification is NOT marked definitive — the soft timeout still arms, and AutoNAT's later verification of an external address can still flip the verdict to `public` cleanly. Once an event has fired once, subsequent `private` events are treated as the real verdict so we don't permanently suppress legitimate transitions. Verification: `pnpm exec vitest run test/nat-status.test.ts` — 42 passing, including 2 new regression cases for the first-event-not-definitive and second-event-IS-definitive paths. Co-authored-by: Cursor <cursoragent@cursor.com>

PR #669 follow-up. Codex (#669#discussion_r3302188320) flagged that the inbound handler in `register()` re-read `this.node.stopSignal` three times — once for the initial `readAllWithSignal()`, once in the catch-path abort check, and (implicitly) when calling `stream.close()` without any signal at all. If `DKGNode.stop()` fires after the request has been read but before the response path finishes, the controller is cleared in the node's `finally`, so the catch-path can see `undefined` and misclassify the shutdown as a real handler error. `stream.close()` could also hang on a silent peer because no abort signal was threaded into it. Fix: cache `const stopSignal = this.node.stopSignal` once at handler entry. Pass it to `readAllWithSignal()`, to `stream.close({ signal })`, and use the cached reference for the catch-path `aborted` check. Aligns the inbound lifecycle with the existing outbound paths in `dialProtocol()` / pooled-conn close that already pass `{ signal }`. Verification: `pnpm exec vitest run test/protocol-router-abort.test.ts test/protocol-router.test.ts` — 54 passing (no behavioral change for the helper-level unit tests; the handler-level abort lifecycle is exercised at integration time). Co-authored-by: Cursor <cursoragent@cursor.com>

PR #670 follow-up. Two medium-severity Codex findings: 1. **vitest.unit.config.ts hardcoded include silently drops new tests** (#670#discussion_r3301775307). The original include list omitted `hub-resolution-cache.unit.test.ts` even though it had existed for several sprints — `pnpm --filter ... test:unit` reported green without ever running it. Switch to a glob (`test/**/*.unit.test.ts`) so unit coverage is auto-discovered; keep the explicit `filter-error-silencer.test.ts` entry since it's pure-logic but doesn't follow the `.unit.test.ts` naming convention. 2. **`safeJson()` throws on bigint, dropping ethers error diagnostics** (#670#discussion_r3301775310). Ethers error payloads commonly carry `bigint` fields (transaction `value`, `chainId`, `gasLimit`, …) inside `info.error.data`. The previous `JSON.stringify` call threw on these and the catch-fallback returned `String(value)` → `"[object Object]"`, throwing away the structured diagnostics this path is supposed to preserve. Use a replacer that coerces bigints to strings. Verification: `pnpm exec vitest run --config vitest.unit.config.ts` — 67 passing (up from 59; +8 newly-discovered `hub-resolution-cache` cases). Added a regression case for the bigint serializer to filter-error-silencer.test.ts. Co-authored-by: Cursor <cursoragent@cursor.com>

branarakic · 2026-05-31T22:19:33Z

Already merged into release/rc.12 in ee6589a; reaches main via #716. Closing as superseded — the open state against main is a side effect of merging into a non-base branch.

…se/rc.12 Address 11 Codex findings from the OriginTrail#680 release/rc.11 integration review that were tagged 'critical' but landed in OriginTrail#683 without being addressed. rc.12 follow-up.

branarakic and others added 8 commits May 26, 2026 13:49

branarakic requested review from Jurij89 and zsculac as code owners May 26, 2026 12:13

branarakic mentioned this pull request May 26, 2026

rc.12 follow-ups from rc.11 review feedback #676

Open

29 tasks

branarakic commented May 26, 2026

View reviewed changes

Comment thread packages/cli/src/daemon/nat-status.ts

branarakic commented May 26, 2026

View reviewed changes

Comment thread packages/cli/src/daemon/worker/async-promote-worker.ts

fix: address rc11 follow-up review gaps

81c4782

github-actions Bot reviewed May 26, 2026

View reviewed changes

Comment thread packages/cli/src/migrate-to-npm.ts Outdated

Comment thread packages/cli/src/daemon/supervisor-liveness.ts

Comment thread scripts/devnet-test-rc11-promote-crash-recovery.sh

test(publisher): confirm private subtraction fixtures

e57da07

github-actions Bot reviewed May 26, 2026

View reviewed changes

Comment thread packages/cli/src/daemon/worker/async-promote-worker.ts

fix(core): decode 0x-prefixed base64url keys

0fc5547

github-actions Bot reviewed May 26, 2026

View reviewed changes

Comment thread packages/cli/src/migrate-to-npm.ts Outdated

Comment thread packages/cli/src/migrate-to-npm.ts Outdated

Comment thread packages/cli/src/daemon/supervisor-liveness.ts

fix(cli): close rc11 recovery edge cases

2a75781

github-actions Bot reviewed May 26, 2026

View reviewed changes

Comment thread packages/cli/src/migrate-to-npm.ts

branarakic mentioned this pull request May 26, 2026

release: v10.0.0-rc.12 integration branch #716

Merged

12 tasks

branarakic closed this May 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: rc.11 follow-up — address 11 unaddressed Codex findings from the #680 integration sweep#685

fix: rc.11 follow-up — address 11 unaddressed Codex findings from the #680 integration sweep#685
branarakic wants to merge 12 commits into
mainfrom
fix/rc11-followup-codex-critical

branarakic commented May 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

branarakic commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

branarakic commented May 26, 2026

Summary

Notable design decisions

Findings NOT in this PR (parked for #676 rc.12 backlog)

Verification

Test plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

branarakic commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant