
feat(core,cli,agent): edge multi-reservation tuning (libp2p reachability hardening PR3/3) #526

Merged
branarakic merged 14 commits into main from feat/libp2p-multi-reservation
May 15, 2026

Conversation

@branarakic
Contributor

Summary

Final PR in the libp2p reachability hardening series (stacks on #525, which stacks on #524). Defaults edge nodes behind NAT to 3 simultaneous circuit-relay reservations instead of the legacy 1, so at the default an edge tolerates two concurrent relay blackouts. Operator-tunable via relayReservationCount (range 1-16).

The original PR3 plan also included watchdog-driven proactive renewal, but source-reading the libp2p reservation store revealed it's already done by libp2p itself (5-min-before-TTL refresh in @libp2p/circuit-relay-v2/transport/reservation-store.js). So this PR is just the multi-reservation half.

Changes

  • DKGNodeConfig.relayReservationCount (default 3, capped at 16):
    • Pushes N duplicate /p2p-circuit entries into addresses.listen → N reservation slots in libp2p's transport ReservationStore.reserveRelay().
    • Sets reservationConcurrency: N on circuitRelayTransport() so all N slots are attempted in parallel (the upstream default of 1 would serialize them and collapse the behavior back to a single reservation). Both pieces are sketched after this list.
  • validateRelayReservationCount(input): rejects 0/neg/NaN/Infinity/fractional/non-numbers + over-cap (matching the validation surface introduced for relayServerCapacity in #524, PR1/3 of this series).
  • Plumbed end-to-end: DkgConfig (cli) → DKGAgentConfig (agent) → DKGNodeConfig (core).
  • cli/README.md gains an "Edge tuning → Multi-reservation" section with the new knob, defaults, cap, and an explicit note that renewal is automatic.
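
The sketch below shows how the knob maps onto those two libp2p options. It is illustrative only: the helper name multiReservationOptions and the hasRelayPeers parameter are assumptions, and every other libp2p option the node needs (TCP transport, encrypters, muxers, and so on) is elided.

```ts
import { circuitRelayTransport } from '@libp2p/circuit-relay-v2';
import type { Libp2pOptions } from 'libp2p';

// Sketch only; not the actual DKGNode.start() code.
function multiReservationOptions(
  relayReservationCount: number, // assumed already validated and clamped
  hasRelayPeers: boolean,
): Partial<Libp2pOptions> {
  const listen: string[] = ['/ip4/0.0.0.0/tcp/0'];
  if (hasRelayPeers) {
    // Each duplicate '/p2p-circuit' entry becomes one reservation slot in the
    // circuit-relay transport's ReservationStore.
    for (let i = 0; i < relayReservationCount; i++) {
      listen.push('/p2p-circuit');
    }
  }
  return {
    addresses: { listen },
    transports: [
      circuitRelayTransport({
        // Attempt all N slots in parallel; the upstream default of 1 would
        // serialize them and collapse multi-reservation back to one slot.
        reservationConcurrency: relayReservationCount,
      }),
    ],
  };
}
```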

Why no proactive renewal?

The libp2p reservation store registers a setTimeout per successful reservation that fires addRelay(peerId, type) 5 minutes before expiry (REFRESH_TIMEOUT constant). With our 2h TTL that's a refresh every 1h55m. Adding application-level renewal on top would be redundant at best, contention-inducing at worst (duplicate HOP RESERVE round-trips on the same connection).
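
In sketch form, the timing described above works out as follows (conceptual only, not the @libp2p/circuit-relay-v2 source; the constant names echo the paragraph above):

```ts
// Conceptual sketch of libp2p's built-in refresh timing, per the paragraph above.
const RESERVATION_TTL_MS = 2 * 60 * 60 * 1000; // our relay-side TTL: 2h
const REFRESH_TIMEOUT_MS = 5 * 60 * 1000;      // libp2p refreshes 5 min before expiry

// Per successful reservation, libp2p queues (roughly):
//   setTimeout(() => addRelay(peerId, type), RESERVATION_TTL_MS - REFRESH_TIMEOUT_MS)
// i.e. a refresh every 1h55m with our TTL, which is why an application-level
// renewal loop on top would only duplicate HOP RESERVE round-trips.
```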

The local watchdog stays in place because it handles the harder failure mode auto-renewal can't recover from: a fully-dropped relay connection (TCP RST, NAT pinhole expiry, ISP routing flap). When the underlying connection dies, libp2p's setTimeout-based refresh has nothing to refresh through; the watchdog's redial loop is what brings the reservation back.

Test plan

  • Unit (validation): relay-reservation-count.test.ts (9 tests) — pins constants + full validation surface (0/neg/NaN/Infinity/fractional/non-number/over-cap, all rejected with type-appropriate reason strings).
  • Integration (end-to-end): relay.test.ts — new test "edge with relayReservationCount=2 and 2 relays reserves on both" spins up 2 relay servers + 1 edge with count=2, polls until both relay PeerIds appear in the edge's /p2p-circuit self-addresses (rough shape sketched after this list). Passes locally in ~360ms (well inside the 20s timeout).
  • CLI config: 2 new round-trip tests in config.test.ts (operator override + unset default).
  • Full suites: core 747/747, cli config 28/28. The single pre-existing cli failure (auto-update-versioned-e2e.test.ts, EDITOR-unset env issue, also fails on PR2 tip) is unaffected.
  • Manual verification on Miles (post-merge): observe 3 distinct /p2p-circuit self-addresses in dkg status and 3 reservations in the relay's /api/relay/stats (added in #525, PR2/3 of this series) for a Miles edge.
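
A rough shape of the new relay.test.ts case referenced above. The relay/edge factory helpers (createRelayServer, createEdgeNode) are assumptions standing in for the project's actual test utilities; the polling shape mirrors what the test plan describes.

```ts
import { expect, it } from 'vitest';

it('edge with relayReservationCount=2 and 2 relays reserves on both', async () => {
  const relayA = await createRelayServer();   // hypothetical helper
  const relayB = await createRelayServer();   // hypothetical helper
  const edge = await createEdgeNode({         // hypothetical helper
    relayPeers: [relayA.multiaddr, relayB.multiaddr],
    relayReservationCount: 2,
  });

  // Poll until both relay PeerIds appear in the edge's /p2p-circuit self-addresses.
  const bothReserved = () => {
    const circuitAddrs = edge.libp2p.getMultiaddrs()
      .map(ma => ma.toString())
      .filter(a => a.includes('/p2p-circuit'));
    return [relayA.peerId, relayB.peerId].every(pid => circuitAddrs.some(a => a.includes(pid)));
  };

  const deadline = Date.now() + 20_000;
  while (!bothReserved() && Date.now() < deadline) {
    await new Promise(resolve => setTimeout(resolve, 50));
  }
  expect(bothReserved()).toBe(true);
}, 20_000);
```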

Stacking

Branched off feat/libp2p-relay-observability (#525). Merge order: #524 → #525 → this.

Made with Cursor

Comment thread packages/core/src/node.ts
// the edge N-(N-1) tolerance to relay failures.
if (this.config.relayPeers?.length) {
listenAddrs.push('/p2p-circuit');
for (let i = 0; i < relayReservationCount; i++) {

🔴 Bug: This only creates the extra reservation slots at startup. The existing watchdog logic still treats "any /p2p-circuit self address exists" as healthy, so if one of N reservations drops while another remains the node silently degrades to N-1 and never restores the missing slot until restart. Please update the watchdog to track the configured target count (or per-relay reservation presence) and redial missing relays until relayReservationCount is re-established.

Contributor Author

Fixed in 17e5313.

The watchdog now tracks the configured target via a new relayReservationCountTarget instance field set in start(). The semantics shift cleanly between single- and multi-reservation deployments:

  • Single reservation (target=1): legacy "any reservation = healthy" gate is preserved. libp2p historically only holds one reservation slot at a time, so other configured relays sitting connected-but-idle is normal steady state and shouldn't be torn down.

  • Multi reservation (target>1): per-relay reservation presence is now required. Without this, your scenario would play out exactly: lose reservation N → other N-1 still keep haveAnyReservation=true → watchdog skips → silent degradation forever until restart.

Specifically, the drop+redial branch now fires when EITHER (a) zero reservations exist anywhere (legacy condition) OR (b) target > 1 and this specific relay holds no reservation (new condition). For multi-reservation the post-redial "stop after one attempt per tick" guard is also lifted — each relay has an independent per-relay reservation, so restoring relay A doesn't change relay B's status; the snapshot is refreshed so the next iteration sees up-to-date state.

This sits naturally alongside libp2p's auto-renewal shortly before TTL expiry (the happy path; we do nothing) and our watchdog's existing transport-down redial (handles dropped TCP connections). The new branch covers the gap between them: alive transport, but a renewal returned an error and the slot got removed.

README updated with the watchdog semantics spelled out.

Comment thread packages/core/src/node.ts
// the edge N-(N-1) tolerance to relay failures.
if (this.config.relayPeers?.length) {
listenAddrs.push('/p2p-circuit');
for (let i = 0; i < relayReservationCount; i++) {

🟡 Issue: relayReservationCount is used blindly even when it exceeds relayPeers.length. In that case the code cannot deliver the documented multi-relay fault tolerance, and may spend extra reservation attempts on duplicate relays or churn against an impossible target. Clamp the effective count to the number of configured relay peers, or at least warn operators when the requested count is unattainable.

Contributor Author

Fixed in 17e5313.

start() now clamps relayReservationCount to relayPeers.length after the validation step:

[dkg-core] relayReservationCount=3 exceeds configured relayPeers.length=1; clamping to 1

A new integration test (clamps relayReservationCount to relayPeers.length and warns) pins the contract: a 3-requested + 1-relay edge produces exactly one distinct /p2p-circuit self-addr (not 3 attempts queued forever) and the clamp warning fires.
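
A minimal sketch of the clamp as it lands in this round (later rounds change the comparison to distinct and then usable relay peers, and gate the warning on explicit overrides; names follow the log line above):

```ts
// Round-2 shape only: clamp the requested count to the configured relay list.
let effectiveCount = relayReservationCount;
const configuredRelays = this.config.relayPeers?.length ?? 0;
if (effectiveCount > configuredRelays) {
  console.warn(
    `[dkg-core] relayReservationCount=${effectiveCount} exceeds configured ` +
    `relayPeers.length=${configuredRelays}; clamping to ${configuredRelays}`,
  );
  effectiveCount = configuredRelays;
}
```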

Comment thread packages/core/src/node.ts Outdated
// so pushing N entries → N reservations on N (preferentially distinct)
// relays. Combined with `reservationConcurrency` above, this gives
// the edge N-(N-1) tolerance to relay failures.
if (this.config.relayPeers?.length) {
Contributor Author

This applies the new multi-reservation default to any node with relayPeers, but the daemon supplies network relays by default regardless of role (runDaemonInner falls back to network.relays unless relay: "none"). So a normal core node started with nodeRole: "core" and no explicit relay config will now also push three /p2p-circuit listen addresses and reserve slots on other relays. That contradicts the PR/docs framing (“edge nodes behind NAT”, “public node doesn’t need relay reservations”) and multiplies relay-slot consumption across the core fleet. Please gate the multi-reservation listen addresses on the edge/behind-NAT path, or make the CLI fallback avoid passing relay peers to core nodes unless the operator explicitly opts in.

Contributor Author

Fixed in 17e5313 — you were right, the daemon-fallback scenario was the right framing.

Multi-reservation amplification is now strictly edge-only. New behaviour by role:

  • Edge with relayPeers (the documented case): apply relayReservationCount (default 3, clamped to relayPeers.length per Codex #3247940512).
  • Core / relay-server with relayPeers (the daemon-fallback scenario you flagged): preserve the legacy single /p2p-circuit listen addr; ignore relayReservationCount with a warning if explicitly set ('relayReservationCount=N set but this node runs a relay server; relay servers don't multi-reserve through other relays; value ignored').
  • No relayPeers: skip entirely (unchanged).

Implementation keys all four touch points off a single isEdgeWithRelays flag — listen-addr loop, reservationConcurrency on circuitRelayTransport(), validation/warning branches, and the watchdog target — so the gate stays consistent.
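
A minimal sketch of that single-flag gate as described above (variable names follow the code excerpts quoted later in this review; explicitRelayReservationCount is a hypothetical stand-in for "the operator set the knob explicitly", and later rounds swap the predicate to usable relay candidates):

```ts
// Sketch only, per the behaviour described above.
const isEdgeWithRelays = !enableRelay && (this.config.relayPeers?.length ?? 0) > 0;

if (isEdgeWithRelays) {
  // Documented edge case: N (clamped) '/p2p-circuit' entries, matching
  // reservationConcurrency, and the watchdog target set to N.
  for (let i = 0; i < relayReservationCount; i++) {
    listenAddrs.push('/p2p-circuit');
  }
  this.relayReservationCountTarget = relayReservationCount;
} else if (this.config.relayPeers?.length) {
  // Daemon-fallback case (core / relay-server node handed network relays):
  // keep the legacy single listener and ignore the knob, warning if it was set.
  listenAddrs.push('/p2p-circuit');
  if (explicitRelayReservationCount !== undefined) {
    console.warn(
      `relayReservationCount=${explicitRelayReservationCount} set but this node runs a ` +
      `relay server; relay servers don't multi-reserve through other relays; value ignored`,
    );
  }
}
// No relayPeers at all: neither branch runs (unchanged).
```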

New integration test (skips multi-reservation amplification on relay-server (core) nodes with relayPeers) explicitly exercises a nodeRole: 'core' + enableRelayServer: true + relayPeers: [...] + relayReservationCount: 3 config and asserts the ignore warning fires. README also updated with the edge-only contract spelled out.

11/11 relay tests pass; 757/757 core tests pass.

Branimir Rakic and others added 2 commits May 15, 2026 13:56
…hability hardening PR3/3)

Default edge nodes behind NAT to **3 simultaneous circuit-relay
reservations** instead of the legacy 1. Three reservations on three
distinct relays give N-1 tolerance: two relays can blink concurrently
and incoming dialers still find a working circuit. The previous
single-reservation default was the root cause of the May 2026
NO_RESERVATION blackout pair (Miles ↔ Lex) that motivated this PR
series.

Implementation
==============

`DKGNodeConfig.relayReservationCount` (default 3, capped at 16) drives
two libp2p configuration changes:

  1. N duplicate `/p2p-circuit` entries are pushed into
     `addresses.listen`. Each entry triggers a separate reservation
     slot in libp2p's transport `ReservationStore.reserveRelay()`
     (see `@libp2p/circuit-relay-v2/transport/reservation-store.js`),
     so N entries → N pending reservation IDs → N reservations on
     N distinct discovered relays.

  2. `reservationConcurrency: N` is set on `circuitRelayTransport()`.
     Without this the upstream default (1) serializes the N pending
     slots through a single-concurrency PeerQueue, collapsing the
     multi-reservation behavior back to single-reservation behavior.

The knob is plumbed end-to-end:

  DkgConfig (cli/config.ts)
    → DKGAgentConfig (agent/dkg-agent.ts)
    → DKGNodeConfig (core/types.ts)
    → DKGNode.start() (core/node.ts)

Validation matches the existing `relayServerCapacity` shape: rejects
0/negatives/NaN/Infinity/fractional/non-numbers, plus a hard upper
bound of 16 to prevent O(N_public_relays) reservation storms. Invalid
values warn and fall back to the default.

Why no proactive renewal?
=========================

The original PR3 plan included watchdog-driven proactive renewal at
TTL/2. **Source-reading the libp2p reservation store revealed this is
already done by libp2p itself**: each successful reservation queues a
setTimeout that fires `addRelay(peerId, type)` 5 minutes before
expiry (`REFRESH_TIMEOUT` in
`@libp2p/circuit-relay-v2/transport/reservation-store.js`). With our
2h TTL that's a refresh every 1h55m. Adding application-level
renewal on top would be redundant at best, contention-inducing at
worst (duplicate HOP RESERVE round-trips on the same connection).

The local watchdog is left in place because it handles the harder
failure mode auto-renewal can't recover from: a fully-dropped relay
connection (TCP RST, NAT pinhole expiry, ISP routing flap). When the
underlying connection dies, libp2p's setTimeout-based refresh has
nothing to refresh through; the watchdog's redial loop is what brings
the reservation back.

Tests
=====

  - `relay-reservation-count.test.ts` (9 unit tests): pins the
    default (3) and cap (16) constants, plus full validation surface
    (0/neg/NaN/Infinity/fractional/non-number/over-cap, all
    rejected with type-appropriate reason strings).
  - `relay.test.ts`: new integration test "edge with
    relayReservationCount=2 and 2 relays reserves on both" — spins up
    2 relay servers + 1 edge with count=2, polls until both relay
    PeerIds appear in the edge's `/p2p-circuit` self-addresses.
    Passes locally in 357ms, well inside the 20s timeout.
  - `config.test.ts`: 2 new round-trip tests for
    `relayReservationCount` through `saveConfig`/`loadConfig`
    (operator override + unset default), modelled on the existing
    `relayServerCapacity` round-trip tests.

Full suites: core 747/747, cli config 28/28. The single pre-existing
cli failure (`auto-update-versioned-e2e.test.ts`, EDITOR-unset env
issue) is unaffected.

Operator docs (cli/README.md) gain an "Edge tuning → Multi-reservation"
section with the new knob, defaults, cap, and explicit note that
renewal is automatic.

Co-authored-by: Cursor <cursoragent@cursor.com>
Three fixes from the review on PR #526 (libp2p multi-reservation
tuning). One is a behavior bug flagged by branarakic, one a config
hygiene gap from Codex, one a watchdog-correctness gap.

1. **Edge-only multi-reservation gating** — branarakic #3247951666.
   The daemon's CLI fallback supplies `network.relays` to BOTH core
   and edge nodes by default (`runDaemonInner` falls back to
   `network.relays` unless `relay: "none"`). Without a role gate,
   PR #526's amplification meant every default-config core node would
   also push 3 `/p2p-circuit` listen addrs and try to reserve slots
   on other relays — multiplying relay-slot consumption across the
   core fleet for a fault-tolerance benefit that doesn't apply
   (public nodes don't need relay reservations).

   New behaviour:
     - Edge with relayPeers (the documented case): apply
       `relayReservationCount` (default 3, clamped per #2 below).
     - Core with relayPeers (the daemon-fallback scenario): preserve
       the legacy single `/p2p-circuit` listen addr; ignore
       `relayReservationCount` with a warning if explicitly set.
     - Any node without relayPeers: skip entirely (unchanged).

   Driven by an `isEdgeWithRelays` flag — the listen-addr loop, the
   `reservationConcurrency` value on `circuitRelayTransport()`, and
   the validation/warning branches all key off the same flag, so the
   gate stays consistent across all four touch points.

2. **Clamp `relayReservationCount` to `relayPeers.length`** — Codex
   #3247940512. Requesting 3 reservations when 1 relay is configured
   queues an unattainable target. The clamp emits a warning so the
   misconfig is visible (`relayReservationCount=3 exceeds configured
   relayPeers.length=1; clamping to 1`).

3. **Watchdog: detect per-relay reservation loss when target > 1** —
   Codex #3247940507. The pre-PR3 watchdog treated "any
   `/p2p-circuit` self-addr exists" as healthy, on the assumption
   that libp2p only holds one reservation at a time. With
   multi-reservation that assumption no longer holds: if reservation
   N is lost (e.g. libp2p's auto-renewal near TTL expiry returned an error
   and removed the slot), the other N-1 reservations keep
   `haveAnyReservation=true` and the watchdog would silently let the
   node degrade to N-1 forever.

   Fix: track `relayReservationCountTarget` as an instance field,
   and:
     - Happy-path gate becomes per-relay when target > 1 (the legacy
       any-reservation gate is preserved when target = 1, so
       single-reservation deployments keep their stable behaviour).
     - The drop+redial branch fires when EITHER (a) zero reservations
       exist anywhere (legacy condition) OR (b) target > 1 and this
       specific relay holds no reservation (new condition).
     - For multi-reservation, the post-redial "stop after one
       attempt per tick" guard is lifted (each relay has an
       independent per-relay reservation, so restoring relay A
       doesn't change relay B's status); the snapshot is refreshed
       so the next iteration sees up-to-date state. For single-
       reservation deployments the legacy break-after-one is kept.

Tests:
  - New `clamps relayReservationCount to relayPeers.length and warns`
    integration test: 3-requested + 1-relay edge produces exactly
    one distinct `/p2p-circuit` self-addr and the clamp warning
    fires.
  - New `skips multi-reservation amplification on relay-server (core)
    nodes with relayPeers` integration test: a core node with
    `enableRelayServer: true` + `relayPeers` + `relayReservationCount: 3`
    sees the ignore warning fire and (by transitive code-path
    review) does not push the multi-reservation listen addrs.
  - All 11 relay tests pass; full core suite 757/757; cli config
    28/28.

README updated with the edge-only / clamp / watchdog semantics so
the operator-facing contract matches the implementation.

Co-authored-by: Cursor <cursoragent@cursor.com>
branarakic force-pushed the feat/libp2p-multi-reservation branch from 523fe05 to 17e5313 on May 15, 2026 12:05
Comment thread packages/core/src/node.ts
// this as the silent-degradation bug. Falling through to the
// "no reservation" branch redials this specific relay and
// restores the slot.
const reservationGateSatisfied =

🔴 Bug: This makes every configured relayPeer effectively mandatory whenever relayReservationCountTarget > 1. If an edge has 5 relayPeers but relayReservationCount: 3, the watchdog will treat the 2 intentionally idle relays as unhealthy and keep redialing them until the node ends up holding 5 reservations, defeating the configured cap and consuming extra relay slots network-wide. The recovery check needs to enforce the total reservation count against relayReservationCountTarget (or track which relays are in the active target set), not require a reservation on every entry in this.relayTargets.

Comment thread packages/core/src/node.ts Outdated
transportUp &&
(
!haveAnyReservation ||
(this.relayReservationCountTarget > 1 && !thisRelayHasReservation)

🔴 Bug: This per-relay recovery path assumes every configured relayPeer must hold a reservation whenever relayReservationCountTarget > 1. That breaks valid configs like relayPeers.length = 3 with relayReservationCount = 2: the node is healthy with any two reservations, but the unreserved third peer will be torn down and redialed on every watchdog tick because !thisRelayHasReservation stays true forever. The health check here needs to use the total number of distinct reserved relays (or only enforce per-relay presence when the target equals relayPeers.length), and this needs a regression test for the 2-of-3 case.

Contributor Author

Fixed in 956e4bb — you're right, this was a real bug introduced by the round-2 fix to the silent-degradation problem.

Tracing the failure: round-2's gate required thisRelayHasReservation for every peer when target > 1. For relayPeers.length=3, relayReservationCount=2, the third peer never holds a reservation (target is 2, libp2p only reserves on 2 of 3). So !thisRelayHasReservation stayed true forever and needsForcedReservationRedial = transportUp && (target > 1 && !thisRelayHasReservation) fired every grace-window expiry. Worst case: the redial succeeded, count went to 3 (above target), libp2p displaced one of the original 2 to keep within the listen-addrs budget → permanent churn. Best case: the redial failed and we burned dial budget for nothing.

Round-3 fix introduces reservedRelayCount — distinct count of configured relays currently holding a reservation — and changes the gate to "this peer holds OR reservedRelayCount >= target". Three regimes:

  • target == relayPeers.length (the existing 2-of-2 test case): identical behavior to round 2 — every peer must reserve to satisfy the gate (because reservedRelayCount can't reach target without all peers reserving).
  • target < relayPeers.length (your flagged case, e.g. 2-of-3): once enough total reservations exist, remaining peers' gates pass on the count check and the watchdog leaves them alone. No churn, no over-reservation.
  • target == 1 single-reservation: identical behavior to before — the count-check passes via reservedRelayCount >= 1 whenever any reservation exists, equivalent to the previous haveAnyReservation branch.

The recovery branch likewise uses reservedRelayCount < target && !thisRelayHasReservation instead of target > 1 && !thisRelayHasReservation, so we never drop+redial a peer when we already have enough total reservations. The refreshReservationSnapshot helper recomputes reservedRelayCount after each successful mid-tick redial so subsequent peers' gates flip to "satisfied" naturally — no second redial, no transient over-shoot.
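
Sketched, the round-3 gate and recovery condition look roughly like this (reservedPeerIds stands in for whatever snapshot the real computeReservedRelayCount filters against; this is not the exact source):

```ts
// Sketch of the round-3 watchdog gate.
const computeReservedRelayCount = () =>
  this.relayTargets.filter(({ peerId }) => reservedPeerIds.has(peerId)).length;

let reservedRelayCount = computeReservedRelayCount();
const target = this.relayReservationCountTarget;

for (const relay of this.relayTargets) {
  const thisRelayHasReservation = reservedPeerIds.has(relay.peerId);

  // Healthy when this peer holds a reservation OR enough distinct reservations
  // already exist overall (so a 2-of-3 config never churns the idle third peer).
  const reservationGateSatisfied =
    thisRelayHasReservation || reservedRelayCount >= target;
  if (reservationGateSatisfied) continue;

  // Below target and this peer is missing: drop + redial this specific relay,
  // then refresh the snapshot so later peers in the tick see the new count.
  if (transportUp && reservedRelayCount < target && !thisRelayHasReservation) {
    // ...existing drop+redial logic...
    reservedRelayCount = computeReservedRelayCount();
  }
}
```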

Regression test: 3 relays + 1 edge with relayReservationCount=2. Wait for 2 reservations to settle, capture the initial reserved peer-IDs, manually invoke watchdogTick(), then assert (a) no churn-indicating log line ("this relay missing", "to force reserve", "reservation-redial") was emitted and (b) the same 2 peer-IDs still hold reservations afterward. Test fails under the round-2 logic (the unreserved third peer triggers a redial log line) and passes under round-3.

Branimir Rakic and others added 2 commits May 15, 2026 17:15
Codex review on PR #526 round 3 caught a real bug introduced by the
round-2 silent-degradation fix: the per-relay gate required EVERY
configured peer to hold a reservation when target>1. That breaks
valid configs like `relayPeers.length=3` with `relayReservationCount=2`,
where the node is healthy with any 2-of-3 reservations: the
unreserved third peer's `!thisRelayHasReservation` stayed true
forever, triggering forced drop+redial on every grace-window expiry.
Worst case the redial succeeded and bumped count above target,
displacing one of the existing 2 reservations and producing
permanent churn. Best case it failed and we wasted dial budget.

Round-3 fix: introduce `reservedRelayCount` — the count of distinct
configured relays currently holding a reservation. The gate becomes
"this peer holds OR `reservedRelayCount >= target`":

- For configs with `target == relayPeers.length` (the existing 2-of-2
  test case): identical behavior to round 2 — every peer must reserve
  to satisfy the gate.
- For configs with `target < relayPeers.length` (the broken case):
  once enough total reservations exist, remaining peers' gates pass
  on the count check and the watchdog leaves them alone.
- For target=1 single-reservation deployments: identical behavior —
  the count-check still passes via `reservedRelayCount >= 1` whenever
  any reservation exists, matching the previous `haveAnyReservation`
  branch.

The `refreshReservationSnapshot` helper recomputes `reservedRelayCount`
alongside `circuitSelfAddrs` so a successful mid-tick redial flips
remaining peers' gates from "needs recovery" to "satisfied" cleanly.

The recovery branch now triggers on `reservedRelayCount < target &&
!thisRelayHasReservation` (instead of just `target > 1 &&
!thisRelayHasReservation`), so we only drop+redial peers when we're
actually below target — never over-reserving.

Regression test: 3 relays + edge with count=2, wait for 2 reservations
to settle, manually invoke watchdogTick, assert (a) no churn log
lines appear and (b) the same 2 reservations are still held with
unchanged peer-IDs. Test exercises the exact scenario Codex flagged.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts
// `circuitSelfAddrs` so a successful mid-tick redial that brings
// the count up to target flips remaining peers' gates from
// "needs recovery" to "satisfied" without further churn.
const computeReservedRelayCount = () => this.relayTargets.filter(({ peerId }) =>

🔴 Bug: reservedRelayCount is derived from the raw relayTargets array, so duplicate relayPeers entries for the same relay peerId get counted multiple times. In a config like [relayA, relayA] with relayReservationCount: 2, the watchdog can think it has 2 healthy reservations when there is only 1 actual relay, which defeats the redundancy guarantee. Deduplicate relay peers by peerId before clamping the target and before counting reserved relays.

Contributor Author

Fixed in 22213e4 — real bug, especially because the [relayA, relayA] config is exactly the kind of copy-paste mistake operators would actually make (and the watchdog gate would silently report "2 reservations held" from a single underlying reservation).

Two-site fix:

  1. Clamp comparison in start(): build a Set of distinct peerIds from the parsed multiaddrs and use .size instead of relayPeers.length. Warns operator when distinct < raw.length so the misconfig is visible, then warns again on the actual clamp if relayReservationCount exceeds the distinct count.
  2. relayTargets push loop: maintain a seenRelayPeerIds set and skip any entry whose peerId we've already pushed. Same dedup the clamp uses, applied at the structure that the watchdog actually iterates — so reservedRelayCount (the round-3 fix) stays a real distinct count even with duplicate config.

Together these guarantee that "N reservations" really means "N distinct relays" both at config-validation time and at runtime gating.

Regression test in relay.test.ts (dedupes duplicate relayPeers entries by peerId): edge configured with [relayAddr, relayAddr] and relayReservationCount: 2. Asserts (a) the dedup warning fires (2 entries but only 1 distinct peerIds), (b) the clamp warning fires (clamping to 1), and (c) the edge ends up with exactly one distinct /p2p-circuit self-addr — never two duplicate entries that would falsely satisfy the watchdog.

763/763 relay+core tests pass (was 762 + 1 new dedup case).

Branimir Rakic and others added 2 commits May 15, 2026 17:35
…in update) into feat/libp2p-multi-reservation
Codex review on PR #526 round 4 caught that `reservedRelayCount` (the
distinct-reservations check that round 3 introduced) was derived from
the raw `relayTargets` array, which itself was built from the raw
`config.relayPeers` list with no peerId-level deduplication. A
duplicate config like `[relayA, relayA]` with `relayReservationCount: 2`
would push two entries to `relayTargets`, so the watchdog would
iterate the same underlying relay twice and `reservedRelayCount`
would tick to 2 from a single actual reservation — making the gate
think the target is met when only one real relay is reserved. That
defeats the multi-reservation redundancy guarantee precisely when
operators are most likely to misconfigure (copy-paste in
`relayPeers`).

Two-site fix:

1. Clamp comparison (start()): build a Set of distinct peerIds from
   the parsed multiaddrs and use `.size` instead of `relayPeers.length`.
   Warns operator when `distinct < raw.length` so the misconfig is
   visible, then warns again on the actual clamp if relayReservationCount
   exceeds the distinct count.

2. relayTargets push loop: maintain a `seenRelayPeerIds` set and skip
   any entry whose peerId we've already pushed. Same dedup the clamp
   uses, applied at the structure that the watchdog actually iterates.

Together these guarantee that "N reservations" really means
"N distinct relays" both at config-validation time and at runtime
gating.

Regression test: edge configured with `[relayAddr, relayAddr]` and
`relayReservationCount: 2`. Asserts (a) the dedup warning fires
("2 entries but only 1 distinct peerIds"), (b) the clamp warning
fires ("clamping to 1"), and (c) the edge ends up with exactly one
distinct `/p2p-circuit` self-addr — never two duplicate entries that
would falsely satisfy the watchdog.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts Outdated
`[dkg-core] relayReservationCount=${relayReservationCount} exceeds ` +
`distinct relay peers=${distinctRelayPeerCount}; clamping to ${distinctRelayPeerCount}`,
);
relayReservationCount = distinctRelayPeerCount;

🔴 Bug: distinctRelayPeerCount can be 0 when every configured relay multiaddr parses but lacks a /p2p/<peerId> component. In that case we clamp relayReservationCount to 0, and that value later propagates into reservationConcurrency and relayReservationCountTarget. That gives this code an invalid target instead of falling back cleanly. Treat the zero-distinct case as "no usable relay peers" and keep the reservation target/concurrency at a valid minimum (or ignore the setting with a warning) rather than assigning 0 here.

nodes.push(core);
await core.start();

const ignoreWarn = warnSpy.mock.calls.find(call =>

🟡 Issue: This test only checks for the warning. A regression that still advertises multiple /p2p-circuit listen addrs on relay-server nodes would still pass as long as the warning is emitted. Add an assertion on the started core node's circuit self-addrs/listen count so the test actually guards the behavior this PR is trying to preserve.

const edge = new DKGNode({
listenAddresses: ['/ip4/127.0.0.1/tcp/0'],
enableMdns: false,
relayPeers: [relayAddr, relayAddr],

🟡 Issue: Using the exact same relayAddr twice does not really prove peerId-based dedupe. This test would also pass if the implementation deduped by raw address string, which is weaker than the bug described in the comment. Use two different multiaddrs that resolve to the same peer ID so the regression coverage matches the production failure mode.

branarakic pushed a commit that referenced this pull request May 15, 2026
Comment thread packages/core/src/node.ts Outdated
// Skipped; will also be dropped at relayTargets push.
}
}
const distinctRelayPeerCount = distinctRelayPeerIds.size;

🔴 Bug: this clamp is based on the raw parsed peer IDs, but the actual target set is filtered later (peerId.equals(this.node.peerId), missing /p2p, parse failures). That means configs like [self, relay] or valid multiaddrs without a /p2p component can still set relayReservationCount/reservationConcurrency to an unattainable value (or even 0), so the node silently runs with fewer reservations than requested. Please derive the clamp from the same usable relay-target set you actually keep, and treat an empty usable set as "no relay peers" instead of passing 0 through.

// to clamp + warn at start(). We verify the warn fires and the
// edge actually only ends up with 1 /p2p-circuit self-addr (not
// 3 attempts queued forever).
const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});

🟡 Issue: the new console.warn spy is only restored on the happy path here (and in the similar tests below). If start() or any assertion throws, the spy leaks into later tests and can mask real warning behavior, which makes failures harder to diagnose. Wrap these warning-spy tests in try/finally or restore the spy from afterEach.

branarakic changed the base branch from feat/libp2p-relay-observability to main on May 15, 2026 18:10
Comment thread packages/core/src/node.ts Outdated
`multi-reservation sizing`,
);
}
if (relayReservationCount > distinctRelayPeerCount) {

🔴 Bug: distinctRelayPeerCount here still differs from the number of usable relay targets. It includes entries that are dropped later (for example this node's own peer ID), and it can be 0 when every relayPeers entry is malformed or missing /p2p. That means a previously validated count can be clamped to 0, or to a value larger than the real target set, which then feeds reservationConcurrency / relayReservationCountTarget and creates an unattainable reservation goal. Clamp against the filtered, usable relay target count instead, and never let the effective reservation count fall to 0 unless there are truly no usable relays.

Contributor Author

Real bug — confirmed and fixed in e71bb2c.

Root cause: the round-4 distinct-peerId clamp parsed relayPeers independently of the later relayTargets push, so the two had a different filter chain — clamp counted self-peerId entries that the push later dropped, and the clamp was 0 when every entry was malformed (which silently disabled multi-reservation by clamping relayReservationCount to 0). Either way the runtime relayReservationCountTarget was unattainable.

Fix: build a single canonical usableRelayCandidates: RelayTarget[] at the top of start() that's the source of truth for both the clamp AND the relayTargets push. Filter chain: parse multiaddr → drop malformed → drop self → dedup by peerId. Self-peerId derivation uses peerIdFromPrivateKey(privateKey) since libp2p hasn't been created yet (deterministic from the same key createLibp2p() receives). One combined warning summarises each rejection category.
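
A minimal sketch of that canonical filter chain (the RelayTarget shape and the import path for peerIdFromPrivateKey are assumptions; round 5c later extends this to aggregate alternate multiaddrs per peer):

```ts
import { multiaddr, type Multiaddr } from '@multiformats/multiaddr';
import { peerIdFromPrivateKey } from '@libp2p/peer-id';

interface RelayTarget { peerId: string; addr: Multiaddr }   // assumed shape

// Sketch only: parse, drop malformed, drop self, dedup by peerId.
const selfPeerId = peerIdFromPrivateKey(privateKey).toString();
const seenPeerIds = new Set<string>();
const usableRelayCandidates: RelayTarget[] = [];

for (const raw of this.config.relayPeers ?? []) {
  let ma: Multiaddr;
  try { ma = multiaddr(raw); } catch { continue; }          // drop malformed
  const pid = ma.getPeerId();
  if (!pid || pid === selfPeerId) continue;                 // drop missing /p2p and self
  if (seenPeerIds.has(pid)) continue;                       // dedup by peerId
  seenPeerIds.add(pid);
  usableRelayCandidates.push({ peerId: pid, addr: ma });
}
```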

Also: isEdgeWithRelays now gates on usableRelayCandidates.length > 0 instead of raw relayPeers.length, so a config like [malformed, self] falls back cleanly to the no-relays path (no /p2p-circuit listen addrs, no watchdog, no reservationConcurrency override).

Regression test (drops self-peerId entries from the clamp + relayTargets): derives a private key, computes its peerId, configures an edge with [realRelayAddr, selfMultiaddr] and relayReservationCount: 2. Asserts the self-filter warning fires, clamp warning shows usable relay peers=1; clamping to 1, edge ends up with exactly 1 reservation. 776/776 core tests pass.

Comment thread packages/core/src/node.ts
// up-to-date reservation set.
if (redialed) {
if (this.relayReservationCountTarget <= 1) break;
refreshReservationSnapshot();

🔴 Bug: after a successful node.dial(addr) we immediately recompute reservedRelayCount and keep iterating. A completed dial does not guarantee the new /p2p-circuit self-address is already advertised, so during recovery from 1/2 reservations in a 3 peers, target=2 setup this can still see reservedRelayCount === 1 and redial the third relay as well. That reintroduces the churn this PR is trying to avoid and can overshoot the requested reservation count. Bound the number of forced redials per tick to the number of missing slots at tick start, or wait until the reservation actually appears before continuing.

Contributor Author

Real bug — confirmed and fixed in e71bb2c.

You're right that a completed node.dial() doesn't guarantee the new /p2p-circuit self-addr is advertised. With round-4's per-redial refreshReservationSnapshot(), the post-dial state in a 1/2 → 2/2 recovery in a 3 peers, target=2 setup would still see reservedRelayCount === 1 and the loop would redial the third relay too — overshooting + churning exactly what this PR is trying to avoid.

Fix: per-tick budget for forced redials, capped at the missing-slot count computed at TICK START. Specifically:

```ts
const missingSlotsAtTickStart = Math.max(
  0, this.relayReservationCountTarget - reservedRelayCount,
);
let forcedRedialsThisTick = 0;
// ...inside loop, redial branch:
if (
  this.relayReservationCountTarget > 1 &&
  missingSlotsAtTickStart > 0 &&
  forcedRedialsThisTick >= missingSlotsAtTickStart
) {
  allHealthy = false;
  continue; // benign "still recovering" — no exponential backoff
}
forcedRedialsThisTick += 1;
// existing redial logic...
```

The cap is deliberately NOT recomputed mid-tick — if a redial happens to propagate fast enough that reservationGateSatisfied flips for the next peer, the happy path skips it naturally; the cap only matters when reservations haven't propagated yet (the bug case).

Regression test (caps forced reservation-redials per tick at the missing-slot count): 3 relays, edge with target=2. Waits for 2 reservations (so missingSlots === 0), triggers 3 back-to-back watchdogTick() calls. Asserts 0 to force reserve log lines across all three ticks — even if a snapshot transiently misreports a peer as missing, the budget cap holds total churn at 0 because the deficit at tick start is 0.

776/776 core tests pass.

Two Codex review bugs on PR #526 round 5:

1. (line 551) `distinctRelayPeerCount` used for the
   `relayReservationCount` clamp could disagree with the actual
   `relayTargets.length` the watchdog iterates: it counted entries
   pointing at this node's own peerId (filtered out at the later
   relayTargets push), and it could be 0 when every entry was
   malformed (which then silently clamped the request to 0,
   disabling multi-reservation). Either way the runtime
   `relayReservationCountTarget` ended up unattainable.

2. (line 1009) After a successful `node.dial(addr)` the watchdog
   called `refreshReservationSnapshot()` and kept iterating. A
   completed dial does NOT guarantee the new `/p2p-circuit`
   self-addr is advertised yet, so during recovery from `1/2`
   reservations in a `3 peers, target=2` setup, the post-dial
   snapshot would still see `reservedRelayCount === 1` and the loop
   would redial the third relay too — overshooting the target and
   re-introducing the churn this PR is meant to avoid.

Fix 1 — single canonical "usable relay candidates" list:
- Derive `selfPeerIdEarly` from `privateKey` via
  `peerIdFromPrivateKey()` (libp2p hasn't been created yet, but the
  derivation is deterministic from the same key `createLibp2p()`
  receives).
- Build `usableRelayCandidates: RelayTarget[]` ONCE at the top of
  `start()`: parse multiaddr, drop malformed, drop self, dedup by
  peerId. Single combined warning summarising each rejection
  category for operator visibility.
- `isEdgeWithRelays` now gates on `usableRelayCandidates.length > 0`
  instead of raw `relayPeers.length`, so a config like
  `[malformed, self]` falls back to the no-relays path (no
  `/p2p-circuit` listen addrs, no watchdog).
- The clamp uses `usableRelayCandidates.length`. The relayTargets
  push iterates the SAME canonical list, so
  `this.relayReservationCountTarget` always matches
  `this.relayTargets.length` and the gate
  `reservedRelayCount >= target` is achievable.

Fix 2 — per-tick forced-redial budget:
- Compute `missingSlotsAtTickStart = max(0, target - reservedRelayCount)`
  ONCE at watchdogTick start. Track `forcedRedialsThisTick`.
- In the redial branch: when `target > 1` and we've already
  dispatched `missingSlotsAtTickStart` forced redials this tick,
  skip remaining peers with `allHealthy = false; continue;` (treat
  as benign "still recovering", don't apply exponential backoff —
  the next tick reassesses).
- Deliberately don't recompute `missingSlotsAtTickStart` mid-tick.
  If a redial happens to propagate fast, the happy path
  (`reservedRelayCount >= target`) skips remaining peers
  naturally; the cap only matters when reservations haven't
  propagated yet, which is exactly the bug case.

Two new regression tests in `relay.test.ts`:

- "drops self-peerId entries from the clamp + relayTargets":
  derives a private key, computes its peerId, configures an edge
  with `[realRelayAddr, selfMultiaddr]` and `relayReservationCount: 2`.
  Asserts the self-filter warning fires, the clamp warning shows
  `usable relay peers=1; clamping to 1`, and the edge ends up with
  exactly 1 distinct `/p2p-circuit` self-addr.

- "caps forced reservation-redials per tick at the missing-slot
  count": 3 relays, edge with target=2. Waits for 2 reservations
  (so `missingSlotsAtTickStart === 0`), then triggers 3
  back-to-back `watchdogTick()` calls. Asserts 0 "to force
  reserve" log lines across all three ticks — even if a snapshot
  transiently misreports a peer as missing, the budget cap holds
  total churn at 0 because the deficit at tick start is 0.

Updated the round-4 dedup test to match the new combined warning
text (`"2 entries supplied, 1 usable (skipped 1 duplicate peerIds)"`).

Full core suite: 776/776 (was 774, +2 new round-5 cases).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts Outdated
listenAddrs.push('/p2p-circuit');
}
this.relayReservationCountTarget = relayReservationCount;
} else if (this.config.relayPeers?.length) {

🔴 Bug: This fallback also matches edge nodes whose relayPeers were all filtered out above as malformed/self/duplicate. In that case we warned that the value is ignored, but we still add a legacy /p2p-circuit listener here and leave the node on the raw-relayPeers path instead of the documented "no-relays" path. That can leave an edge node half-configured and unreachable. Gate this branch on enableRelay (or at least usableRelayCandidates.length > 0) so unusable relay lists are truly ignored.

Contributor Author

Real bug — confirmed and fixed in the follow-up round-5b commit.

You're right: the round-5 refactor changed isEdgeWithRelays to gate on usableRelayCandidates.length > 0, but I left the legacy fallback's else if (this.config.relayPeers?.length) unchanged. So an edge config like [malformed, malformed] got the "value ignored" warning at the top, then later fell into the legacy core-node fallback and ended up with a /p2p-circuit listener with nothing to reserve against — exactly the half-configured state you describe.

Fix: gate the fallback on enableRelay && relayPeers?.length so only core / relay-server nodes with explicit relayPeers get the legacy single /p2p-circuit. Edge nodes with no usable relays now truly hit the no-relays path.

Regression test (falls back to no-relays path when every relayPeers entry is unusable): edge with relayPeers: ['/not-a-multiaddr', 'also-malformed']. Asserts (a) the "0 usable / malformed" warning fires, (b) no /p2p-circuit ever appears in the edge's listen or self-addresses. 16/16 relay tests pass.

Codex review on PR #526 round 5b caught that the
`else if (this.config.relayPeers?.length)` fallback in start()
matched not only the intended core-node case but also edge nodes
whose relayPeers entries had ALL been filtered out as
malformed/self/duplicate. In that case the round-5 commit warned
that the value was ignored, but this branch still added a legacy
`/p2p-circuit` listener with nothing to reserve against — leaving
the edge node half-configured.

Fix: gate on `enableRelay && relayPeers?.length` so only core /
relay-server nodes with explicit relayPeers get the legacy single
`/p2p-circuit`. Edge nodes with no usable relays now truly hit
the no-relays path, matching the "value ignored" warning.

Regression test: edge with `relayPeers: ['/not-a-multiaddr',
'also-malformed']`. Asserts (a) the "0 usable / malformed" warning
fires, (b) no `/p2p-circuit` ever appears in the edge's listen or
self-addresses. 16/16 relay tests pass (was 15, +1 round-5b case).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts Outdated
continue;
}
seenPids.add(p2p);
usableRelayCandidates.push({ peerId: pid, addr: ma });

🔴 Bug: Treating same-peerId entries as duplicates here drops alternate multiaddrs for the same relay. Before this PR we merged and dialed every supplied address; now a config like [relayA-tcp, relayA-ws] keeps only the first one, so if that address is stale the relay becomes unreachable and the node clamps itself down to fewer usable relays than were actually configured. Keep the peer-level dedupe for reservation-count math, but preserve all addresses per peer in the peer store / retry path.

Contributor Author

Real bug — confirmed and fixed in 2b3bfa7.

You're right: the round-5 dedup-by-peerId regressed the alt-transport case. A config like [relayA-tcp, relayA-ws] would collapse to one address; if the surviving transport went stale the relay became unreachable.

Fix: dedup is now by full multiaddr STRING. Same peerId + different multiaddr aggregates into one RelayTarget whose addrs carries both. Per-peer reservation-count math is unchanged — one RelayTarget per peerId, watchdog still iterates each peer once. peerStore.merge() receives all addrs; node.dial(addrs) (libp2p accepts arrays) tries each in order so a stale transport doesn't take the relay out of rotation.

Warning text now distinguishes:

  • N exact-duplicate multiaddrs — true dups (e.g. copy-paste of same string)
  • N alternate addrs merged into existing relay peers — informational; preserved as fallbacks

Regression test (preserves alternate multiaddrs for the same peerId as fallbacks): real relay + edge configured with [realAddr, fakeAlternateAddr] (same peerId, different transport endpoint pointing at port 9). Asserts (a) the "alternate addrs merged" warning fires, (b) distinct relay peer count = 1, (c) edge actually establishes a reservation via the working address even though the alternate is unreachable. 17/17 relay tests pass.
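
Roughly, the aggregation replaces the peerId dedup from the earlier sketch like this (field names follow the later code excerpt in this review; still illustrative):

```ts
// Round-5c shape: dedup on the full multiaddr string, but group all addresses
// for the same peerId into a single RelayTarget so alternates stay as fallbacks.
const seenAddrStrings = new Set<string>();
const candidateByPid = new Map<string, { peerId: string; addrs: Multiaddr[] }>();

for (const raw of this.config.relayPeers ?? []) {
  let ma: Multiaddr;
  try { ma = multiaddr(raw); } catch { continue; }
  const pid = ma.getPeerId();
  if (!pid || pid === selfPeerId) continue;

  const maStr = ma.toString();
  if (seenAddrStrings.has(maStr)) continue;        // exact-duplicate multiaddr
  seenAddrStrings.add(maStr);

  const existing = candidateByPid.get(pid);
  if (existing) {
    existing.addrs.push(ma);                       // alternate addr, kept as a fallback
  } else {
    candidateByPid.set(pid, { peerId: pid, addrs: [ma] });
  }
}
// Per-peer reservation math uses candidateByPid.size; dialing passes all addrs.
```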

Comment thread packages/core/src/node.ts
// 0 usable, so the node should fall back to the no-relays path
// (no `/p2p-circuit` listen addrs, no watchdog, no
// reservationConcurrency override). Codex PR #526 round 5.
const isEdgeWithRelays = !enableRelay && usableRelayCandidates.length > 0;

🟡 Issue: This new "no-relays" gate is based on usableRelayCandidates, but the AutoNAT decision below still keys off raw relayPeers.length. When every configured relay is filtered out as malformed/self/duplicate, the node now falls back to the no-relays path while still disabling AutoNAT, so it has neither relay reachability nor NAT probing. Reuse usableRelayCandidates.length/isEdgeWithRelays for the AutoNAT gate so the fallback behaves like the true no-relay case.

Contributor Author

Fixed in 2b3bfa7 — fair point, AutoNAT and the "no-relays" listen-addr decision should share the same predicate.

Changed:
```ts
const useAutoNAT = this.config.enableAutoNAT ??
  !(usableRelayCandidates.length > 0 || enableRelay);
```

So when every configured relay is filtered out as malformed/self/duplicate, the node now BOTH falls back to the no-relays listen path AND re-enables AutoNAT — the all-filtered case behaves like a node that was never given relayPeers in the first place. Operator can still hard-pin via enableAutoNAT: true|false (the ?? keeps that escape hatch).

Comment thread packages/core/src/node.ts Outdated
continue;
}

forcedRedialsThisTick += 1;

🔴 Bug: The per-tick budget is consumed before we know whether this forced redial actually restored a missing slot. In a target=2, reserved=1 case, if the first missing relay's node.dial(addr) fails, forcedRedialsThisTick is still incremented and the watchdog skips the next eligible relay for the rest of the tick, even though that other relay could have satisfied the missing reservation. Only count successful redials against the budget, or refund the slot on failure.

Contributor Author

Real bug — fixed in 2b3bfa7.

You're right: counting the budget BEFORE the dial means a failed node.dial() permanently locks out the next eligible relay this tick, even though the failed dial didn't actually consume a reservation slot.

Fix: moved forcedRedialsThisTick += 1 inside the try block, after redialed = true. So only successful dials count against the budget; failures don't waste a slot and the next eligible relay can still be tried in the same tick. The budget check at the top of the redial branch (forcedRedialsThisTick >= missingSlotsAtTickStart) still gates entry, so the cap holds for the success-only counter.

Net behavior: in target=2, reserved=1 with relay-A-failing/relay-B-working, we now (a) attempt redial of relay A, (b) attempt redial of relay B (because the relay-A failure didn't increment), (c) relay B succeeds, increment to 1, budget reached, stop. Previously: (a) relay A attempt, increment to 1, budget reached, skip relay B — leaving the missing slot unfilled until the next tick.
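
In sketch form, the corrected redial branch looks like this (building on the snippet earlier in this thread; an illustrative fragment of the watchdog's per-relay loop, not the exact source):

```ts
// Budget gate unchanged: skip once enough slots have already been restored this tick.
if (
  this.relayReservationCountTarget > 1 &&
  missingSlotsAtTickStart > 0 &&
  forcedRedialsThisTick >= missingSlotsAtTickStart
) {
  allHealthy = false;
  continue;
}

try {
  await this.node.dial(relay.addrs);   // may throw; a failed dial costs no budget
  redialed = true;
  forcedRedialsThisTick += 1;          // count only successful redials
  refreshReservationSnapshot();
} catch {
  // Failure: the budget is untouched, so the next eligible relay can still be
  // tried within the same tick.
  allHealthy = false;
}
```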

778/778 core tests pass.

…-budget on success only

Three Codex bugs on PR #526 round 5c:

1. (line 563 🔴) Round-5 dedup-by-peerId dropped alternate
   multiaddrs for the same relay (e.g. `[relayA-tcp, relayA-ws]`
   collapsed to one address). If the surviving address went stale
   the relay became unreachable and the node clamped itself to
   fewer reservations than configured.

   Fix: dedup is now by full multiaddr STRING. Same-peerId entries
   with different multiaddrs aggregate into one `RelayTarget` whose
   `addrs` carries both. `peerStore.merge()` receives all addrs;
   `node.dial(addrs)` (libp2p accepts arrays) tries each in order.
   Per-peer reservation-count math is unchanged — one `RelayTarget`
   per peerId, watchdog still iterates each peer once. Warning
   distinguishes "exact-duplicate multiaddrs" (true dups) from
   "alternate addrs merged into existing relay peers" (info-level).

2. (line 583 🟡) AutoNAT gate keyed off raw `relayPeers.length`,
   not `usableRelayCandidates.length`. When every configured relay
   was filtered out as malformed/self/duplicate, the node fell
   back to no-relays AND silently disabled AutoNAT — neither
   relay reachability nor NAT probing.

   Fix: `useAutoNAT = enableAutoNAT ?? !(usableRelayCandidates.length > 0
   || enableRelay)`. Same predicate as `isEdgeWithRelays` so the
   two gates stay in sync.

3. (line 1061 🔴) Per-tick forced-redial budget was consumed
   BEFORE the dial — a failed `node.dial()` still incremented
   `forcedRedialsThisTick` and locked the watchdog out of trying
   the next eligible relay this tick.

   Fix: move `forcedRedialsThisTick += 1` into the `try` block
   after `redialed = true`. Only successful dials count against
   the budget; failures don't waste a slot.

Two new regression tests:

- "preserves alternate multiaddrs for the same peerId as
  fallbacks": real relay + edge configured with `[realAddr,
  fakeAlternateAddr]` (same peerId, different transport endpoint).
  Asserts the "alternate addrs merged" warning, distinct relay peer
  count = 1, AND the edge actually establishes a reservation via
  the working address even though the alternate is unreachable.

(The AutoNAT gate fix is structural — the existing "no-relays
fallback" test already exercises the all-filtered case; we just
extended it implicitly by ensuring AutoNAT now flips on for that
path.)

Updated text in two existing tests to match the renamed warnings
("distinct relay peers usable", "exact-duplicate multiaddrs"). All
17 relay tests green. Full core suite: 778/778 (was 776, +2 new).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts Outdated
// gate `reservedRelayCount >= target` is achievable.
const usableCount = usableRelayCandidates.length;
if (relayReservationCount > usableCount) {
console.warn(

🟡 Issue: This now warns for the default path as well. On any edge node with a single configured relay, relayReservationCount defaults to 3, gets clamped to 1, and emits a startup warning even though the operator did nothing wrong. That will make healthy one-relay deployments look misconfigured and bury real relay problems in log noise. Consider warning only for explicit overrides that exceed the usable relay count, and silently or info-level clamp the implicit default.

Contributor Author

Fixed in 3d0b701 — fair point, log noise on a healthy default config is exactly what bites operators when they're triaging real problems.

Track isExplicitOverride = verdict?.ok === true. Only warn on the clamp when the operator EXPLICITLY supplied a count > usable; default-driven clamps (verdict null) and post-fallback clamps (verdict not-ok already warned) now silently adjust. Net result for a single-relay edge node: zero startup warnings if the count comes from defaults.
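
As a sketch (the verdict shape comes from the validation helper described earlier in the PR, usableCount matches the clamp excerpt above, and the names are otherwise illustrative):

```ts
// Warn on the clamp only when the operator explicitly supplied a valid count.
const verdict = this.config.relayReservationCount !== undefined
  ? validateRelayReservationCount(this.config.relayReservationCount)
  : null;
const isExplicitOverride = verdict?.ok === true;

if (relayReservationCount > usableCount) {
  if (isExplicitOverride) {
    console.warn(
      `[dkg-core] relayReservationCount=${relayReservationCount} exceeds usable ` +
      `relay peers=${usableCount}; clamping to ${usableCount}`,
    );
  }
  relayReservationCount = usableCount;  // default-driven clamps adjust silently
}
```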

&& call[0].includes('relayReservationCount=3')
&& call[0].includes('relay servers don\'t multi-reserve'),
);
expect(

🟡 Issue: This test only proves that the ignore warning is logged; it never asserts the behavior it describes (core nodes should not amplify into multiple /p2p-circuit reservations). A regression that still emitted the warning but added extra circuit listeners would pass here. Please assert the resulting circuit self-address count or listen-address count after startup so the non-amplification contract is actually pinned.

Contributor Author

Fixed in 3d0b701 — extended the test to assert the actual non-amplification contract:

```ts
const coreCircuitSelfAddrs = core.libp2p.getMultiaddrs()
  .map(ma => ma.toString())
  .filter(a => a.includes('/p2p-circuit'));
expect(coreCircuitSelfAddrs.length).toBeLessThanOrEqual(1);
```

So a regression that still logs the "ignored" warning but actually amplifies into multiple /p2p-circuit reservations would now fail this test directly.

Comment thread packages/core/src/types.ts Outdated
* Default: 3. Capped at 16 (above which the failure-tolerance
* benefit is marginal but the per-relay cost is real).
*
* IGNORED when no relayPeers are configured (i.e. the node isn't

🔵 Nit: The runtime also ignores this option when the node is itself running a relay server, but this exported DKGNodeConfig doc only mentions the "no relayPeers" case. That mismatch makes the public config contract misleading for package consumers. Update the JSDoc here to match the actual gating, and mirror the same wording in the copied CLI/agent descriptions.

Contributor Author

Fixed in 3d0b701 — updated the JSDoc in all three places to enumerate BOTH ignore cases:

  • packages/core/src/types.ts (the source of truth)
  • packages/cli/src/config.ts (CLI config; the wording mirrors core)
  • packages/agent/src/dkg-agent.ts (agent config; same wording)

Each now reads "Ignored (with a warning when set explicitly) when no relayPeers are configured or when the node itself runs a relay server — relay servers don't multi-reserve through other relays." That matches the runtime gating exactly.
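For illustration, a sketch of what the resulting core-side declaration might look like (the surrounding interface members are omitted and the exact field layout is an assumption):

```ts
// Sketch of the documented option on the core config type.
export interface DKGNodeConfig {
  /**
   * Number of simultaneous circuit-relay reservations an edge node maintains.
   * Default: 3. Capped at 16 (above which the failure-tolerance benefit is
   * marginal but the per-relay cost is real).
   *
   * Ignored (with a warning when set explicitly) when no relayPeers are
   * configured or when the node itself runs a relay server, since relay
   * servers don't multi-reserve through other relays.
   */
  relayReservationCount?: number;
}
```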

…+ JSDoc parity

Three Codex findings on PR #526 round 5d:

1. (line 633 🟡) Default `relayReservationCount=3` clamping to 1 on
   any healthy single-relay edge node was emitting a startup
   warning, even though the operator did nothing wrong. Buries real
   problems in log noise.

   Fix: track `isExplicitOverride = verdict?.ok === true` and only
   warn on the clamp when the operator EXPLICITLY supplied a count
   that exceeds the usable peers. Default-driven clamps (and clamps
   after invalid-input fallback, which already warned) now silently
   adjust the count.

2. (line 457 🟡) The "skips multi-reservation amplification on
   relay-server (core) nodes" test only asserted the warning, not
   the actual non-amplification contract. A regression that still
   logged the warning but added extra `/p2p-circuit` listeners would
   pass.

   Fix: extended the test to also assert
   `coreCircuitSelfAddrs.length <= 1` after startup, pinning the
   "core nodes never amplify into multiple `/p2p-circuit`
   reservations regardless of relayReservationCount" invariant
   directly.

3. (line 96 🔵) JSDoc on `relayReservationCount` mentioned the
   no-relayPeers ignore case but not the relay-server ignore case
   (the `enableRelayServer: true` / `nodeRole: 'core'` path that
   the runtime also explicitly skips). Mismatch between public
   contract and runtime behaviour.

   Fix: updated the JSDoc in three places (`packages/core/src/types.ts`,
   `packages/cli/src/config.ts`, `packages/agent/src/dkg-agent.ts`) to
   list both ignore cases — the warning-when-set-explicitly behaviour
   matches the warnings the runtime actually emits.

Full core suite: 778/778. A full `pnpm -r build` completes cleanly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts Outdated
continue;
}
seenAddrStrings.add(maStr);
const existing = candidateByPid.get(p2p);

🔴 Bug: candidateByPid is keyed by the raw /p2p string, not the parsed peer id. The same relay can be written in multiple valid peer-id encodings, so a mixed config would bypass this dedupe, inflate usableRelayCandidates.length, and make the clamp/target logic think there are more distinct relays than actually exist. Key the map by pid.toString() instead.

Contributor Author

Real bug — fixed in ac3952f.

You're right: peerId encodings (base58btc, base32, ...) are interchangeable for the same identity, and keying off the raw /p2p/<value> string would fail to dedupe a mixed-encoding config.

Fix: candidateByPid is now keyed by pid.toString() (canonical encoding). The duplicate-multiaddr detector also normalises — seenAddrKeys is keyed by ${transportPrefix}/p2p/${pid.toString()} rather than the raw multiaddr string — so encoding-mismatched duplicates of the same address are also caught.
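A minimal sketch of that encoding-independent dedupe, with illustrative helper and variable names (the real node.ts differs in detail):

```ts
// Sketch only: key relay candidates by the canonical peerId encoding so
// base58btc / base32 spellings of the same relay collapse to one entry.
import { multiaddr } from '@multiformats/multiaddr';
import { peerIdFromString } from '@libp2p/peer-id';

interface RelayCandidate {
  pidKey: string;   // canonical pid.toString()
  addrs: string[];  // every configured multiaddr for this relay
}

function collectRelayCandidates(relayPeers: string[]): RelayCandidate[] {
  const candidateByPid = new Map<string, RelayCandidate>();
  const seenAddrKeys = new Set<string>();

  for (const raw of relayPeers) {
    let p2p: string | null = null;
    try {
      p2p = multiaddr(raw).getPeerId();
    } catch {
      // malformed multiaddr; counted as `malformed` elsewhere
    }
    if (p2p == null) continue;

    const pidKey = peerIdFromString(p2p).toString(); // canonical encoding
    const transportPrefix = raw.slice(0, raw.lastIndexOf('/p2p/'));
    const addrKey = `${transportPrefix}/p2p/${pidKey}`;

    if (seenAddrKeys.has(addrKey)) continue; // duplicate address, regardless of peerId spelling
    seenAddrKeys.add(addrKey);

    const existing = candidateByPid.get(pidKey);
    if (existing) {
      existing.addrs.push(raw); // alternate addr merged into the same relay peer
    } else {
      candidateByPid.set(pidKey, { pidKey, addrs: [raw] });
    }
  }

  return [...candidateByPid.values()];
}
```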

Comment thread packages/core/src/node.ts
if (selfHits) reasons.push(`${selfHits} pointing at this node's own peerId`);
if (dupAddrs) reasons.push(`${dupAddrs} exact-duplicate multiaddrs`);
if (altAddrs) reasons.push(`${altAddrs} alternate addrs merged into existing relay peers`);
console.warn(

🟡 Issue: altAddrs is a healthy fallback case, but this always logs at warning level when an operator provides multiple transports for the same relay. That means a supported config generates a warning on every startup, which makes the real malformed/self/duplicate warnings here easier to ignore. Consider warning only for invalid entries and downgrading the alternate-address message to info/debug.

Contributor Author

Fixed in ac3952f — agree, alt-addrs aggregation is a healthy supported config and a warn-level log on every startup is exactly the kind of noise that makes operators tune out the real warnings.

Split the emission:

  • If any of malformed / self-hits / dup-addrs is non-zero → console.warn (operator has actual problems; alt-addrs goes in the reasons list as context).
  • If only alt-addrs is non-zero → console.log with rephrased text: "N distinct relay peers (M alternate addrs merged)" (no "usable" / "skipped" wording that implies error).

Updated the round-5c "preserves alternate multiaddrs" test to assert (a) the info log fires AND (b) no warn-level log mentions "alternate addrs" — pinning the new contract directly.
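A minimal sketch of the split emission; counter names follow this thread, and the message wording is only an approximation of the real node.ts strings:

```ts
// Sketch only: warn when there are real problems, info-log when the only
// "finding" is alternate addrs for an already-known relay.
function logRelayCandidateSummary(
  usable: number,
  malformed: number,
  selfHits: number,
  dupAddrs: number,
  altAddrs: number,
): void {
  const reasons: string[] = [];
  if (malformed) reasons.push(`${malformed} malformed`);
  if (selfHits) reasons.push(`${selfHits} pointing at this node's own peerId`);
  if (dupAddrs) reasons.push(`${dupAddrs} exact-duplicate multiaddrs`);

  if (reasons.length > 0) {
    // Operator has actual problems; alt-addrs is included only as context.
    if (altAddrs) reasons.push(`${altAddrs} alternate addrs merged into existing relay peers`);
    console.warn(`relayPeers: ${usable} distinct relay peers usable (${reasons.join(', ')})`);
  } else if (altAddrs) {
    // Healthy supported config: multiple transports for the same relay.
    console.log(`relayPeers: ${usable} distinct relay peers (${altAddrs} alternate addrs merged)`);
  }
}
```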

// to clamp + warn at start(). We verify the warn fires and the
// edge actually only ends up with 1 /p2p-circuit self-addr (not
// 3 attempts queued forever).
const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});

🟡 Issue: this spy is only restored at the bottom of the test. If any assertion or start() call above throws, console.warn stays mocked and can corrupt later tests with misleading failures. Wrap each new console.warn/console.log spy block in try/finally (or use using) so cleanup always runs; the same pattern repeats in the other added tests.

Contributor Author

Fixed in ac3952f — fair test-hygiene point.

Rather than wrapping each test in try/finally individually (8 spies in the file), added vi.restoreAllMocks() to the existing afterEach:

```ts
afterEach(async () => {
  vi.restoreAllMocks(); // unconditional cleanup for every spy
  for (const n of nodes) {
    try { await n.stop(); } catch (err) { ... }
  }
  nodes.length = 0;
});
```

One-line global fix that covers every spy in the file (existing inline mockRestore() calls remain harmlessly idempotent; future spies are auto-cleaned).

Three Codex findings on PR #526 round 5e:

1. (line 583 🔴) `candidateByPid` was keyed by the raw `/p2p/<value>`
   string a user supplied. The same peerId can be written in
   multiple valid encodings (base58btc, base32, ...), so a
   mixed-encoding config like `[/.../p2p/<base58>, /.../p2p/<base32>]`
   for the same relay would bypass the dedupe — inflating
   `usableRelayCandidates.length` and making the clamp / target
   logic think there are more distinct relays than actually exist.

   Fix: key `candidateByPid` by `pid.toString()` (the canonical
   encoding), and build `seenAddrKeys` from the transport prefix
   plus the canonical peerId so duplicate-address detection is
   also encoding-independent.

2. (line 599 🟡) Aggregating alternate multiaddrs for the same
   relay is a healthy supported config, but the previous code
   logged it at `console.warn` level — every startup of an
   alt-transport edge would generate a warning and bury real
   problems in noise.

   Fix: split the log emission. If any of malformed/self/duplicate
   is non-zero, the warn-level log includes alt-addrs in the
   reasons list (operator already has problems to look at). If only
   alt-addrs is non-zero, log at info level (`console.log`) with
   wording that doesn't imply error: "N distinct relay peers (M
   alternate addrs merged)".

3. (line 365 🟡) Per-test `console.warn` / `console.log` spies were
   restored inline at the bottom of each test body. A thrown
   assertion or a `start()` error left them mocked and corrupted
   later tests.

   Fix: added `vi.restoreAllMocks()` to the existing `afterEach` so
   every spy is unconditionally restored regardless of how the
   test exited. One-line global cleanup that covers every
   spy in the file (existing and future).

Updated the round-5c "preserves alternate multiaddrs" test to
spy on `console.log` for the alt-addrs message AND assert that
NO `console.warn` call mentioning "alternate addrs" was emitted —
pinning the new "alt-addrs is info-level on healthy startup"
contract directly.
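A minimal sketch of that assertion shape (vitest); the spy setup and message substrings are assumptions, not verbatim relay.test.ts code:

```ts
// Sketch only: alt-addrs must surface at info level and never at warn level.
const logSpy = vi.spyOn(console, 'log').mockImplementation(() => {});
const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});

// ... start the edge with [realAddr, fakeAlternateAddr] here ...

// (a) the info-level alt-addrs message fires
expect(
  logSpy.mock.calls.some(call => String(call[0]).includes('alternate addrs merged')),
).toBe(true);

// (b) no warn-level log mentions alternate addrs on this healthy startup
expect(
  warnSpy.mock.calls.some(call => String(call[0]).includes('alternate addrs')),
).toBe(false);
```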

17/17 relay tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions github-actions Bot left a comment

Codex review completed — no issues found.

@branarakic branarakic merged commit 143ce22 into main May 15, 2026
36 checks passed
@branarakic branarakic deleted the feat/libp2p-multi-reservation branch May 15, 2026 21:31