fix: rc.11 follow-up — address 11 unaddressed Codex findings from the #680 integration sweep#685
Closed
branarakic wants to merge 12 commits into
Closed
fix: rc.11 follow-up — address 11 unaddressed Codex findings from the #680 integration sweep#685branarakic wants to merge 12 commits into
branarakic wants to merge 12 commits into
Conversation
PR #673 follow-up — three false-green / wrong-target issues Codex flagged in the final review pass that the rc.11 cut deferred: 1. **promote-crash-recovery.sh** SIGKILL hit only the supervisor pid (#673#discussion_r3302023868). `devnet.pid` is the foreground supervisor process — `daemon-foreground-worker` keeps the API port and SQLite store open after `kill -9 <supervisor>`, so the restart path doesn't reproduce a real crash. Now reads BOTH `devnet.pid` (supervisor) and `daemon.pid` (worker) and kills both, with the alive-after-SIGKILL gate checking both pids. 2. **promote-crash-recovery.sh** exit code on the inconclusive path (#673#discussion_r3302023872). When the worker drains the job before the kill could land, the script used to `exit 0`, hiding the fact that the crash-recovery path was never exercised. Now `exit 2` so CI can distinguish a real pass from a missed window. 3. **shutdown-mid-publish.sh** SPARQL-shaped selection silently rejected (#673#discussion_r3302023873). `/api/shared-memory/publish` accepts `selection: "all"` or a root- entity string array — not `{ kind: "sparql", query: ... }`. The surrounding `|| true` was hiding the failure, so the script ended up timing an idle shutdown instead of shutdown under StorageACK pressure. Now passes the 8 generated root entities directly. Co-authored-by: Cursor <cursoragent@cursor.com>
PR #661 follow-up — three classifier holes Codex flagged in the final review pass that the rc.11 cut deferred: 1. **IPv6 documentation range zero-padded bypass** (#661#discussion_r3302752877). The strict `ipv6Class()` check used `startsWith('2001:db8:')` against the lowercased string, which matches `2001:db8::1` but misses zero-padded spellings like `2001:0db8::1` and `2001:0db8:0000:0000::1`. Normalize the first two hextets numerically before classifying. Adds two regression cases to the `classifyMultiaddr` table. 2. **`isReservedDnsName()` trailing-root-dot bypass** (#661#discussion_r3302752890). FQDNs may carry a trailing root dot; `localhost.`, `relay.test.`, `svc.cluster.local.` all fell through as usable DNS and could rescue a degraded listener. Strip the terminal `.` before the suffix checks. Adds a regression case exercising the rescue path for all three. 3. **`lifecycle.ts` hides `prereq.indeterminate` verdict** (#661#discussion_r3302752893). The post-start logger only branched on `looksDegraded`; the warn-only DNS-rescue path hit the unconditional `OK: N public-class listen addresses bound` line, even though the checker had just computed `indeterminate=true` with non-empty `prereq.reasons`. Add an `indeterminate` branch that surfaces those reasons so operators see why the sweep neither passed strictly nor failed. Verification: `pnpm exec vitest run test/core-prereq-check.test.ts` — 72 passing (3 new regression cases + 69 existing). Co-authored-by: Cursor <cursoragent@cursor.com>
PR #664 follow-up. Codex (#664#discussion_r3302432762) flagged that the previous implementation permanently disarmed the watchdog as soon as `isShuttingDown()` returned true. The wiring uses `api.port` file deletion as the shutdown signal, and the worker removes it BEFORE the slow cleanup tail (`agent.stop()`, DB close, …). If any of those later awaits hangs, the supervisor could never SIGKILL or respawn the worker because the watcher had already set `stopped=true`. Fix: introduce `shutdownGraceMs` (default 30s = 2× SHUTDOWN_HARD_ TIMEOUT_MS). When shutdown is first observed, the watcher records the timestamp and suppresses failure counting for that window — long enough for a healthy graceful shutdown to finish, including the daemon's own self-force-exit deadline. After the window expires the SIGKILL path re-arms so wedged teardowns still get force-killed. `shutdownGraceMs < 0` preserves the legacy "disarm forever" behavior for operators who explicitly want it. Verification: `pnpm exec vitest run test/supervisor-liveness.test.ts` — 32 passing, including 2 new regression cases for the re-arm and the legacy-opt-out paths. Co-authored-by: Cursor <cursoragent@cursor.com>
…ing fails PR #665 follow-up. Codex (#665#discussion_r3302646439) flagged that if `assertion.promote()` has ALREADY returned successfully and one of the two subsequent writes (`recordCommitMarker('swmInserted')` or `queue.succeed()`) throws — store hiccup, lost lease, transient FS error — the outer worker catch parked the job as `failed` with retryable=false. That left it eligible for `/promote-async/{jobId}/ recover`, which blindly re-queues failed jobs and would re-run a promote that already mutated SWM and emitted gossip. Fix: wrap the two post-promote writes in a dedicated try/catch that returns a new `partial_promote_ambiguity` outcome instead of throwing. This outcome is counted in `PromoteWorkerCounters.partialPromoteAmbiguity` but DOES NOT call `queue.fail()`, so the job stays in `running` state. On next daemon boot, `recoverOnStartup()` sees `promoteStarted=true && swmInserted=false` and correctly routes it into the "abandoned partial promote" bucket per `PromoteRecoveryResult.abandoned` — exactly the path the type docs describe as "operator action required". A loud `PARTIAL-PROMOTE-AMBIGUITY: jobId=…` log line surfaces the condition for grep-able operator alerting. Verification: `pnpm exec vitest run test/async-promote-worker.test.ts` — 25 passing, including a new regression case that simulates a store hiccup on the `swmInserted` marker write and asserts the job stays in `running` state with `promoteStarted=true, swmInserted=false`. Co-authored-by: Cursor <cursoragent@cursor.com>
PR #666 follow-up. Codex (#666#discussion_r3302712591) flagged that `dkgHomeNow` was derived purely from the LIVE CLI's install mode via `repoDir()`. When an operator runs a globally installed `dkg` (standalone install-mode → `repoDir() === null`) from inside an unmigrated git checkout, the local daemon is still using `~/.dkg-dev` but the previous code resolved to `~/.dkg`. That: - missed the live daemon in the orphan-state blocker - read `currentAutoUpdateSource` from the wrong config - landed the `autoUpdate.source = "npm"` pin in the wrong config while the daemon kept reading the original ~/.dkg-dev/config.json Fix: introduce `selectMigrationDkgHome()` which probes BOTH the monorepo-candidate home (~/.dkg-dev) and the standalone home (~/.dkg) for an active daemon. Picks whichever has a live pid; falls back to the install-mode-based `resolveMigrationDkgHome()` only when no daemon is running (greenfield migration). Surfaces `recoveredGlobalCliInCheckout` so the CLI logs an explicit line when it overrides the install-mode default — operators see why the home selection diverged from `repoDir()`. Verification: `pnpm exec vitest run test/migrate-to-npm.test.ts` — 39 passing, including 5 new cases for the global-CLI-in-checkout, greenfield, monorepo-matches-monorepo, standalone-matches-standalone, and dead-pid-fallback paths. Co-authored-by: Cursor <cursoragent@cursor.com>
…rdict PR #668 follow-up. Codex (#668#discussion_r3302734688) flagged that the very first `self:peer:update` after listen also fires for the initial post-listen peer record AutoNAT publishes. If that record contains only RFC1918/CGNAT multiaddrs (common during a cold boot before AutoNAT has verified external reach), the previous logic marked `private` as DEFINITIVE immediately — disabling the soft timeout and locking the verdict until a later address-update event happened (which may never come on a stable private-only host). Fix: treat the FIRST non-public event-driven reclassification the same as the initial bound-address snapshot. The cached status updates so `/api/status` reflects the current state, but the classification is NOT marked definitive — the soft timeout still arms, and AutoNAT's later verification of an external address can still flip the verdict to `public` cleanly. Once an event has fired once, subsequent `private` events are treated as the real verdict so we don't permanently suppress legitimate transitions. Verification: `pnpm exec vitest run test/nat-status.test.ts` — 42 passing, including 2 new regression cases for the first-event-not-definitive and second-event-IS-definitive paths. Co-authored-by: Cursor <cursoragent@cursor.com>
PR #669 follow-up. Codex (#669#discussion_r3302188320) flagged that the inbound handler in `register()` re-read `this.node.stopSignal` three times — once for the initial `readAllWithSignal()`, once in the catch-path abort check, and (implicitly) when calling `stream.close()` without any signal at all. If `DKGNode.stop()` fires after the request has been read but before the response path finishes, the controller is cleared in the node's `finally`, so the catch-path can see `undefined` and misclassify the shutdown as a real handler error. `stream.close()` could also hang on a silent peer because no abort signal was threaded into it. Fix: cache `const stopSignal = this.node.stopSignal` once at handler entry. Pass it to `readAllWithSignal()`, to `stream.close({ signal })`, and use the cached reference for the catch-path `aborted` check. Aligns the inbound lifecycle with the existing outbound paths in `dialProtocol()` / pooled-conn close that already pass `{ signal }`. Verification: `pnpm exec vitest run test/protocol-router-abort.test.ts test/protocol-router.test.ts` — 54 passing (no behavioral change for the helper-level unit tests; the handler-level abort lifecycle is exercised at integration time). Co-authored-by: Cursor <cursoragent@cursor.com>
PR #670 follow-up. Two medium-severity Codex findings: 1. **vitest.unit.config.ts hardcoded include silently drops new tests** (#670#discussion_r3301775307). The original include list omitted `hub-resolution-cache.unit.test.ts` even though it had existed for several sprints — `pnpm --filter ... test:unit` reported green without ever running it. Switch to a glob (`test/**/*.unit.test.ts`) so unit coverage is auto-discovered; keep the explicit `filter-error-silencer.test.ts` entry since it's pure-logic but doesn't follow the `.unit.test.ts` naming convention. 2. **`safeJson()` throws on bigint, dropping ethers error diagnostics** (#670#discussion_r3301775310). Ethers error payloads commonly carry `bigint` fields (transaction `value`, `chainId`, `gasLimit`, …) inside `info.error.data`. The previous `JSON.stringify` call threw on these and the catch-fallback returned `String(value)` → `"[object Object]"`, throwing away the structured diagnostics this path is supposed to preserve. Use a replacer that coerces bigints to strings. Verification: `pnpm exec vitest run --config vitest.unit.config.ts` — 67 passing (up from 59; +8 newly-discovered `hub-resolution-cache` cases). Added a regression case for the bigint serializer to filter-error-silencer.test.ts. Co-authored-by: Cursor <cursoragent@cursor.com>
29 tasks
branarakic
commented
May 26, 2026
branarakic
commented
May 26, 2026
12 tasks
Contributor
Author
matic031
pushed a commit
to KilianTrunk/dkg
that referenced
this pull request
Jun 2, 2026
…se/rc.12 Address 11 Codex findings from the OriginTrail#680 release/rc.11 integration review that were tagged 'critical' but landed in OriginTrail#683 without being addressed. rc.12 follow-up.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to the rc.11 integration PR (#680). Closes the highest-severity Codex review findings that the rc.11 cut deferred so they could be addressed in a single reviewable PR after merge instead of holding the release.
9 critical 🔴 findings + 2 medium 🟡 findings, across 7 source files and 6 test files. All addressed with regression test coverage where the surface is unit-testable.
PRs whose findings are addressed here:
2001:0db8::slips past TEST-NET-2 check (🔴)core-prereq-check.ts:262isReservedDnsNametrailing-root-dot bypass (🟡)core-prereq-check.ts:309lifecycle.tshidesprereq.indeterminateverdict (🟡)lifecycle.ts:1276api.portdeletion is permanent (🔴)supervisor-liveness.tsswmInsertedmarker write failure → re-promote risk (🔴)async-promote-worker.ts:262dkgHomeNowderives from CLI install mode, not target checkout (🔴)cli.ts:4082/migrate-to-npm.tsself:peer:updatelocks inprivatebefore AutoNAT verifies (🔴)nat-status.ts:273stopSignalnot cached at handler entry;stream.close()not abortable (🔴)protocol-router.ts:355vitest.unit.config.tshardcoded include drops new unit tests (🟡)chain/vitest.unit.config.tsJSON.stringifythrows on bigint in ethers payloads (🟡)filter-error-silencer.ts:127devnet-test-rc11-promote-crash-recovery.sh:238exit 0on false-green path (worker drained before kill) (🔴)devnet-test-rc11-promote-crash-recovery.sh:209selectionshape rejected by/api/shared-memory/publish(🔴)devnet-test-rc11-shutdown-mid-publish.sh:162Each commit references the specific Codex
discussion_r*ID it addresses.Notable design decisions
feat(cli): supervisor positive-liveness probe (PR-9) #664 watchdog disarm: replaced "disarm forever" with a bounded
shutdownGraceMs(default 30s = 2×SHUTDOWN_HARD_TIMEOUT_MS). Graceful shutdown gets the full window; wedged teardowns still get SIGKILLed.shutdownGraceMs < 0preserves the legacy "never SIGKILL during graceful shutdown" semantic for operators who explicitly want it.feat(daemon): PR #3 — async-promote worker supervisor + lifecycle wiring #665 swmInserted marker: introduces a new
partial_promote_ambiguityoutcome that does NOT callqueue.fail()— the job stays inrunningstate, andrecoverOnStartup()correctly routes it into the "abandoned partial promote" bucket on next boot. This prevents/promote-async/{jobId}/recoverfrom blindly re-running an already-completed promote. A loudPARTIAL-PROMOTE-AMBIGUITY: jobId=…log surfaces the condition for operator alerting.feat(cli): 'dkg migrate-to-npm' subcommand + operator runbook (PR-7) #666 dkgHomeNow: extracted
selectMigrationDkgHome()that probes BOTH~/.dkg-devand~/.dkgfor a live daemon. Falls back to install-mode-based resolution only when neither home has a running daemon (greenfield migration). SurfacesrecoveredGlobalCliInCheckoutso the CLI logs an explicit line when it overrides the install-mode default.feat(daemon): AutoNAT-driven NAT-status watcher for core nodes (PR-5) #668 NAT-status: only the FIRST event-driven non-public reclassification is treated as the bound-address baseline (same as the initial snapshot). Once an event has fired once, subsequent
privateevents are treated as real verdicts — we don't permanently suppress legitimate transitions.Findings NOT in this PR (parked for #676 rc.12 backlog)
The remaining 7 medium 🟡 findings are tracked in #676 (rc.12 follow-ups):
/api/statusHEAD path / duplicateRelayStatusResponseshape (HEAD path actually ADDRESSED on main, only the duplicate shape remains)PromoteCommitMarker.promoteStartedsource-compat (internal type)heartbeatIntervalMsvs actual queue lease (partial guard exists for default lease)nat-status.ts:107(engineering hygiene)hardhat-harness.ts:432quorum=1 weakens ACK validationevm-adapter.ts:3901h TTL caches mutable reads (invalidatePublishPreflightCache()exists as explicit-flush hook)Findings actually ALREADY ADDRESSED on
main(verified by file-level inspection during triage):/promote-async503-gate whenenabled: false(assertion.ts:533) ✅statequery param docs matchsearchParams.get()✅a63c8669(in main via release: v10.0.0-rc.11 integration branch #680) ✅Closing redundant PRs that #680 folded in (no commits ahead of main):
Verification
pnpm --filter '@origintrail-official/dkg' test:unit— 416/416 passing (up from 403; +13 new regression cases)pnpm --filter '@origintrail-official/dkg-chain' test:unit— 68/68 passing (up from 59; +8 newly-discoveredhub-resolution-cache+ 1 new bigint case)pnpm --filter '@origintrail-official/dkg-core' exec vitest run test/protocol-router-abort.test.ts test/protocol-router.test.ts test/protocol-router-pool.test.ts— 70/70 passingpnpm -r buildcleanbash -n scripts/devnet-test-rc11-*.sh— syntax OKTest plan
pnpm -r build)Made with Cursor