Skip to content

fix: rc.11 follow-up — address 11 unaddressed Codex findings from the #680 integration sweep#685

Closed
branarakic wants to merge 12 commits into
mainfrom
fix/rc11-followup-codex-critical
Closed

fix: rc.11 follow-up — address 11 unaddressed Codex findings from the #680 integration sweep#685
branarakic wants to merge 12 commits into
mainfrom
fix/rc11-followup-codex-critical

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

Summary

Follow-up to the rc.11 integration PR (#680). Closes the highest-severity Codex review findings that the rc.11 cut deferred so they could be addressed in a single reviewable PR after merge instead of holding the release.

9 critical 🔴 findings + 2 medium 🟡 findings, across 7 source files and 6 test files. All addressed with regression test coverage where the surface is unit-testable.

PRs whose findings are addressed here:

PR Finding File
#661 IPv6 zero-padded 2001:0db8:: slips past TEST-NET-2 check (🔴) core-prereq-check.ts:262
#661 isReservedDnsName trailing-root-dot bypass (🟡) core-prereq-check.ts:309
#661 lifecycle.ts hides prereq.indeterminate verdict (🟡) lifecycle.ts:1276
#664 Watchdog disarm via api.port deletion is permanent (🔴) supervisor-liveness.ts
#665 swmInserted marker write failure → re-promote risk (🔴) async-promote-worker.ts:262
#666 dkgHomeNow derives from CLI install mode, not target checkout (🔴) cli.ts:4082 / migrate-to-npm.ts
#668 First self:peer:update locks in private before AutoNAT verifies (🔴) nat-status.ts:273
#669 stopSignal not cached at handler entry; stream.close() not abortable (🔴) protocol-router.ts:355
#670 vitest.unit.config.ts hardcoded include drops new unit tests (🟡) chain/vitest.unit.config.ts
#670 JSON.stringify throws on bigint in ethers payloads (🟡) filter-error-silencer.ts:127
#673 SIGKILL hits supervisor pid only; daemon worker survives (🔴) devnet-test-rc11-promote-crash-recovery.sh:238
#673 exit 0 on false-green path (worker drained before kill) (🔴) devnet-test-rc11-promote-crash-recovery.sh:209
#673 SPARQL selection shape rejected by /api/shared-memory/publish (🔴) devnet-test-rc11-shutdown-mid-publish.sh:162

Each commit references the specific Codex discussion_r* ID it addresses.

Notable design decisions

  • feat(cli): supervisor positive-liveness probe (PR-9) #664 watchdog disarm: replaced "disarm forever" with a bounded shutdownGraceMs (default 30s = 2× SHUTDOWN_HARD_TIMEOUT_MS). Graceful shutdown gets the full window; wedged teardowns still get SIGKILLed. shutdownGraceMs < 0 preserves the legacy "never SIGKILL during graceful shutdown" semantic for operators who explicitly want it.

  • feat(daemon): PR #3 — async-promote worker supervisor + lifecycle wiring #665 swmInserted marker: introduces a new partial_promote_ambiguity outcome that does NOT call queue.fail() — the job stays in running state, and recoverOnStartup() correctly routes it into the "abandoned partial promote" bucket on next boot. This prevents /promote-async/{jobId}/recover from blindly re-running an already-completed promote. A loud PARTIAL-PROMOTE-AMBIGUITY: jobId=… log surfaces the condition for operator alerting.

  • feat(cli): 'dkg migrate-to-npm' subcommand + operator runbook (PR-7) #666 dkgHomeNow: extracted selectMigrationDkgHome() that probes BOTH ~/.dkg-dev and ~/.dkg for a live daemon. Falls back to install-mode-based resolution only when neither home has a running daemon (greenfield migration). Surfaces recoveredGlobalCliInCheckout so the CLI logs an explicit line when it overrides the install-mode default.

  • feat(daemon): AutoNAT-driven NAT-status watcher for core nodes (PR-5) #668 NAT-status: only the FIRST event-driven non-public reclassification is treated as the bound-address baseline (same as the initial snapshot). Once an event has fired once, subsequent private events are treated as real verdicts — we don't permanently suppress legitimate transitions.

Findings NOT in this PR (parked for #676 rc.12 backlog)

The remaining 7 medium 🟡 findings are tracked in #676 (rc.12 follow-ups):

Findings actually ALREADY ADDRESSED on main (verified by file-level inspection during triage):

Closing redundant PRs that #680 folded in (no commits ahead of main):

Verification

  • pnpm --filter '@origintrail-official/dkg' test:unit416/416 passing (up from 403; +13 new regression cases)
  • pnpm --filter '@origintrail-official/dkg-chain' test:unit68/68 passing (up from 59; +8 newly-discovered hub-resolution-cache + 1 new bigint case)
  • pnpm --filter '@origintrail-official/dkg-core' exec vitest run test/protocol-router-abort.test.ts test/protocol-router.test.ts test/protocol-router-pool.test.ts70/70 passing
  • pnpm -r build clean
  • bash -n scripts/devnet-test-rc11-*.sh — syntax OK

Test plan

  • Build clean (pnpm -r build)
  • Touched-package unit tests green (416 CLI, 68 chain, 70 core protocol-router)
  • New regression cases for every fix where the surface is unit-testable (IPv6, trailing-dot DNS, indeterminate branch, watchdog re-arm + legacy opt-out, partial-promote-ambiguity, probe-both-homes, NAT first-event-not-definitive, bigint serializer)
  • Re-run the rc11-targeted devnet scenarios on a fresh devnet to confirm the script fixes
  • CI green

Made with Cursor

branarakic and others added 8 commits May 26, 2026 13:49
PR #673 follow-up — three false-green / wrong-target issues Codex flagged
in the final review pass that the rc.11 cut deferred:

1. **promote-crash-recovery.sh** SIGKILL hit only the supervisor pid
   (#673#discussion_r3302023868). `devnet.pid` is the foreground
   supervisor process — `daemon-foreground-worker` keeps the API port
   and SQLite store open after `kill -9 <supervisor>`, so the restart
   path doesn't reproduce a real crash. Now reads BOTH `devnet.pid`
   (supervisor) and `daemon.pid` (worker) and kills both, with the
   alive-after-SIGKILL gate checking both pids.

2. **promote-crash-recovery.sh** exit code on the inconclusive path
   (#673#discussion_r3302023872). When the worker drains the job
   before the kill could land, the script used to `exit 0`, hiding
   the fact that the crash-recovery path was never exercised. Now
   `exit 2` so CI can distinguish a real pass from a missed window.

3. **shutdown-mid-publish.sh** SPARQL-shaped selection silently
   rejected (#673#discussion_r3302023873).
   `/api/shared-memory/publish` accepts `selection: "all"` or a root-
   entity string array — not `{ kind: "sparql", query: ... }`. The
   surrounding `|| true` was hiding the failure, so the script ended
   up timing an idle shutdown instead of shutdown under StorageACK
   pressure. Now passes the 8 generated root entities directly.

Co-authored-by: Cursor <cursoragent@cursor.com>
PR #661 follow-up — three classifier holes Codex flagged in the final
review pass that the rc.11 cut deferred:

1. **IPv6 documentation range zero-padded bypass**
   (#661#discussion_r3302752877). The strict `ipv6Class()` check used
   `startsWith('2001:db8:')` against the lowercased string, which
   matches `2001:db8::1` but misses zero-padded spellings like
   `2001:0db8::1` and `2001:0db8:0000:0000::1`. Normalize the first
   two hextets numerically before classifying. Adds two regression
   cases to the `classifyMultiaddr` table.

2. **`isReservedDnsName()` trailing-root-dot bypass**
   (#661#discussion_r3302752890). FQDNs may carry a trailing root
   dot; `localhost.`, `relay.test.`, `svc.cluster.local.` all fell
   through as usable DNS and could rescue a degraded listener. Strip
   the terminal `.` before the suffix checks. Adds a regression case
   exercising the rescue path for all three.

3. **`lifecycle.ts` hides `prereq.indeterminate` verdict**
   (#661#discussion_r3302752893). The post-start logger only branched
   on `looksDegraded`; the warn-only DNS-rescue path hit the
   unconditional `OK: N public-class listen addresses bound` line,
   even though the checker had just computed `indeterminate=true`
   with non-empty `prereq.reasons`. Add an `indeterminate` branch
   that surfaces those reasons so operators see why the sweep
   neither passed strictly nor failed.

Verification: `pnpm exec vitest run test/core-prereq-check.test.ts` —
72 passing (3 new regression cases + 69 existing).

Co-authored-by: Cursor <cursoragent@cursor.com>
PR #664 follow-up. Codex (#664#discussion_r3302432762) flagged that
the previous implementation permanently disarmed the watchdog as soon
as `isShuttingDown()` returned true. The wiring uses `api.port` file
deletion as the shutdown signal, and the worker removes it BEFORE the
slow cleanup tail (`agent.stop()`, DB close, …). If any of those later
awaits hangs, the supervisor could never SIGKILL or respawn the worker
because the watcher had already set `stopped=true`.

Fix: introduce `shutdownGraceMs` (default 30s = 2× SHUTDOWN_HARD_
TIMEOUT_MS). When shutdown is first observed, the watcher records the
timestamp and suppresses failure counting for that window — long
enough for a healthy graceful shutdown to finish, including the
daemon's own self-force-exit deadline. After the window expires the
SIGKILL path re-arms so wedged teardowns still get force-killed.

`shutdownGraceMs < 0` preserves the legacy "disarm forever" behavior
for operators who explicitly want it.

Verification: `pnpm exec vitest run test/supervisor-liveness.test.ts` —
32 passing, including 2 new regression cases for the re-arm and the
legacy-opt-out paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ing fails

PR #665 follow-up. Codex (#665#discussion_r3302646439) flagged that if
`assertion.promote()` has ALREADY returned successfully and one of the
two subsequent writes (`recordCommitMarker('swmInserted')` or
`queue.succeed()`) throws — store hiccup, lost lease, transient FS
error — the outer worker catch parked the job as `failed` with
retryable=false. That left it eligible for `/promote-async/{jobId}/
recover`, which blindly re-queues failed jobs and would re-run a
promote that already mutated SWM and emitted gossip.

Fix: wrap the two post-promote writes in a dedicated try/catch that
returns a new `partial_promote_ambiguity` outcome instead of throwing.
This outcome is counted in `PromoteWorkerCounters.partialPromoteAmbiguity`
but DOES NOT call `queue.fail()`, so the job stays in `running` state.
On next daemon boot, `recoverOnStartup()` sees `promoteStarted=true
&& swmInserted=false` and correctly routes it into the "abandoned
partial promote" bucket per `PromoteRecoveryResult.abandoned` — exactly
the path the type docs describe as "operator action required".

A loud `PARTIAL-PROMOTE-AMBIGUITY: jobId=…` log line surfaces the
condition for grep-able operator alerting.

Verification: `pnpm exec vitest run test/async-promote-worker.test.ts` —
25 passing, including a new regression case that simulates a store
hiccup on the `swmInserted` marker write and asserts the job stays in
`running` state with `promoteStarted=true, swmInserted=false`.

Co-authored-by: Cursor <cursoragent@cursor.com>
PR #666 follow-up. Codex (#666#discussion_r3302712591) flagged that
`dkgHomeNow` was derived purely from the LIVE CLI's install mode via
`repoDir()`. When an operator runs a globally installed `dkg`
(standalone install-mode → `repoDir() === null`) from inside an
unmigrated git checkout, the local daemon is still using `~/.dkg-dev`
but the previous code resolved to `~/.dkg`. That:
  - missed the live daemon in the orphan-state blocker
  - read `currentAutoUpdateSource` from the wrong config
  - landed the `autoUpdate.source = "npm"` pin in the wrong config
    while the daemon kept reading the original ~/.dkg-dev/config.json

Fix: introduce `selectMigrationDkgHome()` which probes BOTH the
monorepo-candidate home (~/.dkg-dev) and the standalone home (~/.dkg)
for an active daemon. Picks whichever has a live pid; falls back to
the install-mode-based `resolveMigrationDkgHome()` only when no daemon
is running (greenfield migration). Surfaces
`recoveredGlobalCliInCheckout` so the CLI logs an explicit line when
it overrides the install-mode default — operators see why the home
selection diverged from `repoDir()`.

Verification: `pnpm exec vitest run test/migrate-to-npm.test.ts` —
39 passing, including 5 new cases for the global-CLI-in-checkout,
greenfield, monorepo-matches-monorepo, standalone-matches-standalone,
and dead-pid-fallback paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rdict

PR #668 follow-up. Codex (#668#discussion_r3302734688) flagged that
the very first `self:peer:update` after listen also fires for the
initial post-listen peer record AutoNAT publishes. If that record
contains only RFC1918/CGNAT multiaddrs (common during a cold boot
before AutoNAT has verified external reach), the previous logic
marked `private` as DEFINITIVE immediately — disabling the soft
timeout and locking the verdict until a later address-update event
happened (which may never come on a stable private-only host).

Fix: treat the FIRST non-public event-driven reclassification the
same as the initial bound-address snapshot. The cached status
updates so `/api/status` reflects the current state, but the
classification is NOT marked definitive — the soft timeout still
arms, and AutoNAT's later verification of an external address can
still flip the verdict to `public` cleanly. Once an event has fired
once, subsequent `private` events are treated as the real verdict so
we don't permanently suppress legitimate transitions.

Verification: `pnpm exec vitest run test/nat-status.test.ts` —
42 passing, including 2 new regression cases for the
first-event-not-definitive and second-event-IS-definitive paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
PR #669 follow-up. Codex (#669#discussion_r3302188320) flagged that
the inbound handler in `register()` re-read `this.node.stopSignal`
three times — once for the initial `readAllWithSignal()`, once in
the catch-path abort check, and (implicitly) when calling
`stream.close()` without any signal at all. If `DKGNode.stop()` fires
after the request has been read but before the response path
finishes, the controller is cleared in the node's `finally`, so the
catch-path can see `undefined` and misclassify the shutdown as a
real handler error. `stream.close()` could also hang on a silent
peer because no abort signal was threaded into it.

Fix: cache `const stopSignal = this.node.stopSignal` once at handler
entry. Pass it to `readAllWithSignal()`, to `stream.close({ signal })`,
and use the cached reference for the catch-path `aborted` check.
Aligns the inbound lifecycle with the existing outbound paths in
`dialProtocol()` / pooled-conn close that already pass `{ signal }`.

Verification: `pnpm exec vitest run test/protocol-router-abort.test.ts
test/protocol-router.test.ts` — 54 passing (no behavioral change for
the helper-level unit tests; the handler-level abort lifecycle is
exercised at integration time).

Co-authored-by: Cursor <cursoragent@cursor.com>
PR #670 follow-up. Two medium-severity Codex findings:

1. **vitest.unit.config.ts hardcoded include silently drops new tests**
   (#670#discussion_r3301775307). The original include list omitted
   `hub-resolution-cache.unit.test.ts` even though it had existed for
   several sprints — `pnpm --filter ... test:unit` reported green
   without ever running it. Switch to a glob (`test/**/*.unit.test.ts`)
   so unit coverage is auto-discovered; keep the explicit
   `filter-error-silencer.test.ts` entry since it's pure-logic but
   doesn't follow the `.unit.test.ts` naming convention.

2. **`safeJson()` throws on bigint, dropping ethers error diagnostics**
   (#670#discussion_r3301775310). Ethers error payloads commonly carry
   `bigint` fields (transaction `value`, `chainId`, `gasLimit`, …)
   inside `info.error.data`. The previous `JSON.stringify` call threw
   on these and the catch-fallback returned `String(value)` →
   `"[object Object]"`, throwing away the structured diagnostics this
   path is supposed to preserve. Use a replacer that coerces bigints
   to strings.

Verification: `pnpm exec vitest run --config vitest.unit.config.ts` —
67 passing (up from 59; +8 newly-discovered `hub-resolution-cache`
cases). Added a regression case for the bigint serializer to
filter-error-silencer.test.ts.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/cli/src/daemon/nat-status.ts
Comment thread packages/cli/src/daemon/worker/async-promote-worker.ts
Comment thread packages/cli/src/migrate-to-npm.ts Outdated
Comment thread packages/cli/src/daemon/supervisor-liveness.ts
Comment thread scripts/devnet-test-rc11-promote-crash-recovery.sh
Comment thread packages/cli/src/daemon/worker/async-promote-worker.ts
Comment thread packages/cli/src/migrate-to-npm.ts Outdated
Comment thread packages/cli/src/migrate-to-npm.ts Outdated
Comment thread packages/cli/src/daemon/supervisor-liveness.ts
Comment thread packages/cli/src/migrate-to-npm.ts
@branarakic
Copy link
Copy Markdown
Contributor Author

Already merged into release/rc.12 in ee6589a; reaches main via #716. Closing as superseded — the open state against main is a side effect of merging into a non-base branch.

@branarakic branarakic closed this May 31, 2026
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
…se/rc.12

Address 11 Codex findings from the OriginTrail#680 release/rc.11 integration
review that were tagged 'critical' but landed in OriginTrail#683 without being
addressed. rc.12 follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant