
feat(core,cli,agent): edge multi-reservation tuning (libp2p reachability hardening PR3/3) #526

Merged
branarakic merged 14 commits into main from feat/libp2p-multi-reservation
May 15, 2026

Conversation

@branarakic
Contributor

Summary

Final PR in the libp2p reachability hardening series (stacks on #525, which stacks on #524). Defaults edge nodes behind NAT to 3 simultaneous circuit-relay reservations instead of the legacy 1, so at the default an edge tolerates two concurrent relay blackouts. Operator-tunable via relayReservationCount (range 1-16).

The original PR3 plan also included watchdog-driven proactive renewal, but source-reading the libp2p reservation store revealed it's already done by libp2p itself (5-min-before-TTL refresh in @libp2p/circuit-relay-v2/transport/reservation-store.js). So this PR is just the multi-reservation half.

Changes

  • DKGNodeConfig.relayReservationCount (default 3, capped at 16):
    • Pushes N duplicate /p2p-circuit entries into addresses.listen → N reservation slots in libp2p's transport ReservationStore.reserveRelay().
    • Sets reservationConcurrency: N on circuitRelayTransport() so all N slots are attempted in parallel (the upstream default of 1 would serialize them and collapse the behavior back to a single reservation). Both pieces are sketched after this list.
  • validateRelayReservationCount(input): rejects 0/neg/NaN/Infinity/fractional/non-numbers + over-cap (matching the validation surface introduced for relayServerCapacity in #524, PR1/3 of this series).
  • Plumbed end-to-end: DkgConfig (cli) → DKGAgentConfig (agent) → DKGNodeConfig (core).
  • cli/README.md gains an "Edge tuning → Multi-reservation" section with the new knob, defaults, cap, and an explicit note that renewal is automatic.
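
The sketch below shows how the knob maps onto those two libp2p options. It is illustrative only: the helper name multiReservationOptions and the hasRelayPeers parameter are assumptions, and every other libp2p option the node needs (TCP transport, encrypters, muxers, and so on) is elided.

```ts
import { circuitRelayTransport } from '@libp2p/circuit-relay-v2';
import type { Libp2pOptions } from 'libp2p';

// Sketch only; not the actual DKGNode.start() code.
function multiReservationOptions(
  relayReservationCount: number, // assumed already validated and clamped
  hasRelayPeers: boolean,
): Partial<Libp2pOptions> {
  const listen: string[] = ['/ip4/0.0.0.0/tcp/0'];
  if (hasRelayPeers) {
    // Each duplicate '/p2p-circuit' entry becomes one reservation slot in the
    // circuit-relay transport's ReservationStore.
    for (let i = 0; i < relayReservationCount; i++) {
      listen.push('/p2p-circuit');
    }
  }
  return {
    addresses: { listen },
    transports: [
      circuitRelayTransport({
        // Attempt all N slots in parallel; the upstream default of 1 would
        // serialize them and collapse multi-reservation back to one slot.
        reservationConcurrency: relayReservationCount,
      }),
    ],
  };
}
```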

Why no proactive renewal?

The libp2p reservation store registers a setTimeout per successful reservation that fires addRelay(peerId, type) 5 minutes before expiry (REFRESH_TIMEOUT constant). With our 2h TTL that's a refresh every 1h55m. Adding application-level renewal on top would be redundant at best, contention-inducing at worst (duplicate HOP RESERVE round-trips on the same connection).
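
In sketch form, the timing described above works out as follows (conceptual only, not the @libp2p/circuit-relay-v2 source; the constant names echo the paragraph above):

```ts
// Conceptual sketch of libp2p's built-in refresh timing, per the paragraph above.
const RESERVATION_TTL_MS = 2 * 60 * 60 * 1000; // our relay-side TTL: 2h
const REFRESH_TIMEOUT_MS = 5 * 60 * 1000;      // libp2p refreshes 5 min before expiry

// Per successful reservation, libp2p queues (roughly):
//   setTimeout(() => addRelay(peerId, type), RESERVATION_TTL_MS - REFRESH_TIMEOUT_MS)
// i.e. a refresh every 1h55m with our TTL, which is why an application-level
// renewal loop on top would only duplicate HOP RESERVE round-trips.
```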

The local watchdog stays in place because it handles the harder failure mode auto-renewal can't recover from: a fully-dropped relay connection (TCP RST, NAT pinhole expiry, ISP routing flap). When the underlying connection dies, libp2p's setTimeout-based refresh has nothing to refresh through; the watchdog's redial loop is what brings the reservation back.

Test plan

  • Unit (validation): relay-reservation-count.test.ts (9 tests) — pins constants + full validation surface (0/neg/NaN/Infinity/fractional/non-number/over-cap, all rejected with type-appropriate reason strings).
  • Integration (end-to-end): relay.test.ts — new test "edge with relayReservationCount=2 and 2 relays reserves on both" spins up 2 relay servers + 1 edge with count=2, polls until both relay PeerIds appear in the edge's /p2p-circuit self-addresses (rough shape sketched after this list). Passes locally in ~360ms (well inside the 20s timeout).
  • CLI config: 2 new round-trip tests in config.test.ts (operator override + unset default).
  • Full suites: core 747/747, cli config 28/28. The single pre-existing cli failure (auto-update-versioned-e2e.test.ts, EDITOR-unset env issue, also fails on PR2 tip) is unaffected.
  • Manual verification on Miles (post-merge): observe 3 distinct /p2p-circuit self-addresses in dkg status and 3 reservations in the relay's /api/relay/stats (added in #525, PR2/3 of this series) for a Miles edge.
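
A rough shape of the new relay.test.ts case referenced above. The relay/edge factory helpers (createRelayServer, createEdgeNode) are assumptions standing in for the project's actual test utilities; the polling shape mirrors what the test plan describes.

```ts
import { expect, it } from 'vitest';

it('edge with relayReservationCount=2 and 2 relays reserves on both', async () => {
  const relayA = await createRelayServer();   // hypothetical helper
  const relayB = await createRelayServer();   // hypothetical helper
  const edge = await createEdgeNode({         // hypothetical helper
    relayPeers: [relayA.multiaddr, relayB.multiaddr],
    relayReservationCount: 2,
  });

  // Poll until both relay PeerIds appear in the edge's /p2p-circuit self-addresses.
  const bothReserved = () => {
    const circuitAddrs = edge.libp2p.getMultiaddrs()
      .map(ma => ma.toString())
      .filter(a => a.includes('/p2p-circuit'));
    return [relayA.peerId, relayB.peerId].every(pid => circuitAddrs.some(a => a.includes(pid)));
  };

  const deadline = Date.now() + 20_000;
  while (!bothReserved() && Date.now() < deadline) {
    await new Promise(resolve => setTimeout(resolve, 50));
  }
  expect(bothReserved()).toBe(true);
}, 20_000);
```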

Stacking

Branched off feat/libp2p-relay-observability (#525). Merge order: #524 → #525 → this.

Made with Cursor

Comment thread packages/core/src/node.ts
// the edge N-(N-1) tolerance to relay failures.
if (this.config.relayPeers?.length) {
listenAddrs.push('/p2p-circuit');
for (let i = 0; i < relayReservationCount; i++) {

🔴 Bug: This only creates the extra reservation slots at startup. The existing watchdog logic still treats "any /p2p-circuit self address exists" as healthy, so if one of N reservations drops while another remains the node silently degrades to N-1 and never restores the missing slot until restart. Please update the watchdog to track the configured target count (or per-relay reservation presence) and redial missing relays until relayReservationCount is re-established.

Contributor Author

Fixed in 17e5313.

The watchdog now tracks the configured target via a new relayReservationCountTarget instance field set in start(). The semantics shift cleanly between single- and multi-reservation deployments:

  • Single reservation (target=1): legacy "any reservation = healthy" gate is preserved. libp2p historically only holds one reservation slot at a time, so other configured relays sitting connected-but-idle is normal steady state and shouldn't be torn down.

  • Multi reservation (target>1): per-relay reservation presence is now required. Without this, your scenario would play out exactly: lose reservation N → other N-1 still keep haveAnyReservation=true → watchdog skips → silent degradation forever until restart.

Specifically, the drop+redial branch now fires when EITHER (a) zero reservations exist anywhere (legacy condition) OR (b) target > 1 and this specific relay holds no reservation (new condition). For multi-reservation the post-redial "stop after one attempt per tick" guard is also lifted — each relay has an independent per-relay reservation, so restoring relay A doesn't change relay B's status; the snapshot is refreshed so the next iteration sees up-to-date state.

This sits naturally alongside libp2p's auto-renewal shortly before TTL expiry (the happy path; we do nothing) and our watchdog's existing transport-down redial (handles dropped TCP connections). The new branch covers the gap between them: alive transport, but a renewal returned an error and the slot got removed.

README updated with the watchdog semantics spelled out.

Comment thread packages/core/src/node.ts
// the edge N-(N-1) tolerance to relay failures.
if (this.config.relayPeers?.length) {
listenAddrs.push('/p2p-circuit');
for (let i = 0; i < relayReservationCount; i++) {

🟡 Issue: relayReservationCount is used blindly even when it exceeds relayPeers.length. In that case the code cannot deliver the documented multi-relay fault tolerance, and may spend extra reservation attempts on duplicate relays or churn against an impossible target. Clamp the effective count to the number of configured relay peers, or at least warn operators when the requested count is unattainable.

Contributor Author

Fixed in 17e5313.

start() now clamps relayReservationCount to relayPeers.length after the validation step:

[dkg-core] relayReservationCount=3 exceeds configured relayPeers.length=1; clamping to 1

A new integration test (clamps relayReservationCount to relayPeers.length and warns) pins the contract: a 3-requested + 1-relay edge produces exactly one distinct /p2p-circuit self-addr (not 3 attempts queued forever) and the clamp warning fires.
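
A minimal sketch of the clamp as it lands in this round (later rounds change the comparison to distinct and then usable relay peers, and gate the warning on explicit overrides; names follow the log line above):

```ts
// Round-2 shape only: clamp the requested count to the configured relay list.
let effectiveCount = relayReservationCount;
const configuredRelays = this.config.relayPeers?.length ?? 0;
if (effectiveCount > configuredRelays) {
  console.warn(
    `[dkg-core] relayReservationCount=${effectiveCount} exceeds configured ` +
    `relayPeers.length=${configuredRelays}; clamping to ${configuredRelays}`,
  );
  effectiveCount = configuredRelays;
}
```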

Comment thread packages/core/src/node.ts Outdated
// so pushing N entries → N reservations on N (preferentially distinct)
// relays. Combined with `reservationConcurrency` above, this gives
// the edge N-(N-1) tolerance to relay failures.
if (this.config.relayPeers?.length) {
Contributor Author

This applies the new multi-reservation default to any node with relayPeers, but the daemon supplies network relays by default regardless of role (runDaemonInner falls back to network.relays unless relay: "none"). So a normal core node started with nodeRole: "core" and no explicit relay config will now also push three /p2p-circuit listen addresses and reserve slots on other relays. That contradicts the PR/docs framing (“edge nodes behind NAT”, “public node doesn’t need relay reservations”) and multiplies relay-slot consumption across the core fleet. Please gate the multi-reservation listen addresses on the edge/behind-NAT path, or make the CLI fallback avoid passing relay peers to core nodes unless the operator explicitly opts in.

Contributor Author

Fixed in 17e5313 — you were right, the daemon-fallback scenario was the right framing.

Multi-reservation amplification is now strictly edge-only. New behaviour by role:

  • Edge with relayPeers (the documented case): apply relayReservationCount (default 3, clamped to relayPeers.length per Codex #3247940512).
  • Core / relay-server with relayPeers (the daemon-fallback scenario you flagged): preserve the legacy single /p2p-circuit listen addr; ignore relayReservationCount with a warning if explicitly set ('relayReservationCount=N set but this node runs a relay server; relay servers don't multi-reserve through other relays; value ignored').
  • No relayPeers: skip entirely (unchanged).

Implementation keys all four touch points off a single isEdgeWithRelays flag — listen-addr loop, reservationConcurrency on circuitRelayTransport(), validation/warning branches, and the watchdog target — so the gate stays consistent.
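
A minimal sketch of that single-flag gate as described above (variable names follow the code excerpts quoted later in this review; explicitRelayReservationCount is a hypothetical stand-in for "the operator set the knob explicitly", and later rounds swap the predicate to usable relay candidates):

```ts
// Sketch only, per the behaviour described above.
const isEdgeWithRelays = !enableRelay && (this.config.relayPeers?.length ?? 0) > 0;

if (isEdgeWithRelays) {
  // Documented edge case: N (clamped) '/p2p-circuit' entries, matching
  // reservationConcurrency, and the watchdog target set to N.
  for (let i = 0; i < relayReservationCount; i++) {
    listenAddrs.push('/p2p-circuit');
  }
  this.relayReservationCountTarget = relayReservationCount;
} else if (this.config.relayPeers?.length) {
  // Daemon-fallback case (core / relay-server node handed network relays):
  // keep the legacy single listener and ignore the knob, warning if it was set.
  listenAddrs.push('/p2p-circuit');
  if (explicitRelayReservationCount !== undefined) {
    console.warn(
      `relayReservationCount=${explicitRelayReservationCount} set but this node runs a ` +
      `relay server; relay servers don't multi-reserve through other relays; value ignored`,
    );
  }
}
// No relayPeers at all: neither branch runs (unchanged).
```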

New integration test (skips multi-reservation amplification on relay-server (core) nodes with relayPeers) explicitly exercises a nodeRole: 'core' + enableRelayServer: true + relayPeers: [...] + relayReservationCount: 3 config and asserts the ignore warning fires. README also updated with the edge-only contract spelled out.

11/11 relay tests pass; 757/757 core tests pass.

Branimir Rakic and others added 2 commits May 15, 2026 13:56
…hability hardening PR3/3)

Default edge nodes behind NAT to **3 simultaneous circuit-relay
reservations** instead of the legacy 1. Three reservations on three
distinct relays give N-1 tolerance: two relays can blink concurrently
and incoming dialers still find a working circuit. The previous
single-reservation default was the root cause of the May 2026
NO_RESERVATION blackout pair (Miles ↔ Lex) that motivated this PR
series.

Implementation
==============

`DKGNodeConfig.relayReservationCount` (default 3, capped at 16) drives
two libp2p configuration changes:

  1. N duplicate `/p2p-circuit` entries are pushed into
     `addresses.listen`. Each entry triggers a separate reservation
     slot in libp2p's transport `ReservationStore.reserveRelay()`
     (see `@libp2p/circuit-relay-v2/transport/reservation-store.js`),
     so N entries → N pending reservation IDs → N reservations on
     N distinct discovered relays.

  2. `reservationConcurrency: N` is set on `circuitRelayTransport()`.
     Without this the upstream default (1) serializes the N pending
     slots through a single-concurrency PeerQueue, collapsing the
     multi-reservation behavior back to single-reservation behavior.

The knob is plumbed end-to-end:

  DkgConfig (cli/config.ts)
    → DKGAgentConfig (agent/dkg-agent.ts)
    → DKGNodeConfig (core/types.ts)
    → DKGNode.start() (core/node.ts)

Validation matches the existing `relayServerCapacity` shape: rejects
0/negatives/NaN/Infinity/fractional/non-numbers, plus a hard upper
bound of 16 to prevent O(N_public_relays) reservation storms. Invalid
values warn and fall back to the default.

Why no proactive renewal?
=========================

The original PR3 plan included watchdog-driven proactive renewal at
TTL/2. **Source-reading the libp2p reservation store revealed this is
already done by libp2p itself**: each successful reservation queues a
setTimeout that fires `addRelay(peerId, type)` 5 minutes before
expiry (`REFRESH_TIMEOUT` in
`@libp2p/circuit-relay-v2/transport/reservation-store.js`). With our
2h TTL that's a refresh every 1h55m. Adding application-level
renewal on top would be redundant at best, contention-inducing at
worst (duplicate HOP RESERVE round-trips on the same connection).

The local watchdog is left in place because it handles the harder
failure mode auto-renewal can't recover from: a fully-dropped relay
connection (TCP RST, NAT pinhole expiry, ISP routing flap). When the
underlying connection dies, libp2p's setTimeout-based refresh has
nothing to refresh through; the watchdog's redial loop is what brings
the reservation back.

Tests
=====

  - `relay-reservation-count.test.ts` (9 unit tests): pins the
    default (3) and cap (16) constants, plus full validation surface
    (0/neg/NaN/Infinity/fractional/non-number/over-cap, all
    rejected with type-appropriate reason strings).
  - `relay.test.ts`: new integration test "edge with
    relayReservationCount=2 and 2 relays reserves on both" — spins up
    2 relay servers + 1 edge with count=2, polls until both relay
    PeerIds appear in the edge's `/p2p-circuit` self-addresses.
    Passes locally in 357ms, well inside the 20s timeout.
  - `config.test.ts`: 2 new round-trip tests for
    `relayReservationCount` through `saveConfig`/`loadConfig`
    (operator override + unset default), modelled on the existing
    `relayServerCapacity` round-trip tests.

Full suites: core 747/747, cli config 28/28. The single pre-existing
cli failure (`auto-update-versioned-e2e.test.ts`, EDITOR-unset env
issue) is unaffected.

Operator docs (cli/README.md) gain an "Edge tuning → Multi-reservation"
section with the new knob, defaults, cap, and explicit note that
renewal is automatic.

Co-authored-by: Cursor <cursoragent@cursor.com>
Three fixes from the review on PR #526 (libp2p multi-reservation
tuning). One is a behavior bug flagged by branarakic, one a config
hygiene gap from Codex, one a watchdog-correctness gap.

1. **Edge-only multi-reservation gating** — branarakic #3247951666.
   The daemon's CLI fallback supplies `network.relays` to BOTH core
   and edge nodes by default (`runDaemonInner` falls back to
   `network.relays` unless `relay: "none"`). Without a role gate,
   PR #526's amplification meant every default-config core node would
   also push 3 `/p2p-circuit` listen addrs and try to reserve slots
   on other relays — multiplying relay-slot consumption across the
   core fleet for a fault-tolerance benefit that doesn't apply
   (public nodes don't need relay reservations).

   New behaviour:
     - Edge with relayPeers (the documented case): apply
       `relayReservationCount` (default 3, clamped per #2 below).
     - Core with relayPeers (the daemon-fallback scenario): preserve
       the legacy single `/p2p-circuit` listen addr; ignore
       `relayReservationCount` with a warning if explicitly set.
     - Any node without relayPeers: skip entirely (unchanged).

   Driven by an `isEdgeWithRelays` flag — the listen-addr loop, the
   `reservationConcurrency` value on `circuitRelayTransport()`, and
   the validation/warning branches all key off the same flag, so the
   gate stays consistent across all four touch points.

2. **Clamp `relayReservationCount` to `relayPeers.length`** — Codex
   #3247940512. Requesting 3 reservations when 1 relay is configured
   queues an unattainable target. The clamp emits a warning so the
   misconfig is visible (`relayReservationCount=3 exceeds configured
   relayPeers.length=1; clamping to 1`).

3. **Watchdog: detect per-relay reservation loss when target > 1** —
   Codex #3247940507. The pre-PR3 watchdog treated "any
   `/p2p-circuit` self-addr exists" as healthy, on the assumption
   that libp2p only holds one reservation at a time. With
   multi-reservation that assumption no longer holds: if reservation
   N is lost (e.g. libp2p's auto-renewal near TTL expiry returned an error
   and removed the slot), the other N-1 reservations keep
   `haveAnyReservation=true` and the watchdog would silently let the
   node degrade to N-1 forever.

   Fix: track `relayReservationCountTarget` as an instance field,
   and:
     - Happy-path gate becomes per-relay when target > 1 (the legacy
       any-reservation gate is preserved when target = 1, so
       single-reservation deployments keep their stable behaviour).
     - The drop+redial branch fires when EITHER (a) zero reservations
       exist anywhere (legacy condition) OR (b) target > 1 and this
       specific relay holds no reservation (new condition).
     - For multi-reservation, the post-redial "stop after one
       attempt per tick" guard is lifted (each relay has an
       independent per-relay reservation, so restoring relay A
       doesn't change relay B's status); the snapshot is refreshed
       so the next iteration sees up-to-date state. For single-
       reservation deployments the legacy break-after-one is kept.

Tests:
  - New `clamps relayReservationCount to relayPeers.length and warns`
    integration test: 3-requested + 1-relay edge produces exactly
    one distinct `/p2p-circuit` self-addr and the clamp warning
    fires.
  - New `skips multi-reservation amplification on relay-server (core)
    nodes with relayPeers` integration test: a core node with
    `enableRelayServer: true` + `relayPeers` + `relayReservationCount: 3`
    sees the ignore warning fire and (by transitive code-path
    review) does not push the multi-reservation listen addrs.
  - All 11 relay tests pass; full core suite 757/757; cli config
    28/28.

README updated with the edge-only / clamp / watchdog semantics so
the operator-facing contract matches the implementation.

Co-authored-by: Cursor <cursoragent@cursor.com>
branarakic force-pushed the feat/libp2p-multi-reservation branch from 523fe05 to 17e5313 on May 15, 2026 12:05
Comment thread packages/core/src/node.ts
// this as the silent-degradation bug. Falling through to the
// "no reservation" branch redials this specific relay and
// restores the slot.
const reservationGateSatisfied =

🔴 Bug: This makes every configured relayPeer effectively mandatory whenever relayReservationCountTarget > 1. If an edge has 5 relayPeers but relayReservationCount: 3, the watchdog will treat the 2 intentionally idle relays as unhealthy and keep redialing them until the node ends up holding 5 reservations, defeating the configured cap and consuming extra relay slots network-wide. The recovery check needs to enforce the total reservation count against relayReservationCountTarget (or track which relays are in the active target set), not require a reservation on every entry in this.relayTargets.

Comment thread packages/core/src/node.ts Outdated
transportUp &&
(
!haveAnyReservation ||
(this.relayReservationCountTarget > 1 && !thisRelayHasReservation)

🔴 Bug: This per-relay recovery path assumes every configured relayPeer must hold a reservation whenever relayReservationCountTarget > 1. That breaks valid configs like relayPeers.length = 3 with relayReservationCount = 2: the node is healthy with any two reservations, but the unreserved third peer will be torn down and redialed on every watchdog tick because !thisRelayHasReservation stays true forever. The health check here needs to use the total number of distinct reserved relays (or only enforce per-relay presence when the target equals relayPeers.length), and this needs a regression test for the 2-of-3 case.

Contributor Author

Fixed in 956e4bb — you're right, this was a real bug introduced by the round-2 fix to the silent-degradation problem.

Tracing the failure: round-2's gate required thisRelayHasReservation for every peer when target > 1. For relayPeers.length=3, relayReservationCount=2, the third peer never holds a reservation (target is 2, libp2p only reserves on 2 of 3). So !thisRelayHasReservation stayed true forever and needsForcedReservationRedial = transportUp && (target > 1 && !thisRelayHasReservation) fired every grace-window expiry. Worst case: the redial succeeded, count went to 3 (above target), libp2p displaced one of the original 2 to keep within the listen-addrs budget → permanent churn. Best case: the redial failed and we burned dial budget for nothing.

Round-3 fix introduces reservedRelayCount — distinct count of configured relays currently holding a reservation — and changes the gate to "this peer holds OR reservedRelayCount >= target". Three regimes:

  • target == relayPeers.length (the existing 2-of-2 test case): identical behavior to round 2 — every peer must reserve to satisfy the gate (because reservedRelayCount can't reach target without all peers reserving).
  • target < relayPeers.length (your flagged case, e.g. 2-of-3): once enough total reservations exist, remaining peers' gates pass on the count check and the watchdog leaves them alone. No churn, no over-reservation.
  • target == 1 single-reservation: identical behavior to before — the count-check passes via reservedRelayCount >= 1 whenever any reservation exists, equivalent to the previous haveAnyReservation branch.

The recovery branch likewise uses reservedRelayCount < target && !thisRelayHasReservation instead of target > 1 && !thisRelayHasReservation, so we never drop+redial a peer when we already have enough total reservations. The refreshReservationSnapshot helper recomputes reservedRelayCount after each successful mid-tick redial so subsequent peers' gates flip to "satisfied" naturally — no second redial, no transient over-shoot.
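
Sketched, the round-3 gate and recovery condition look roughly like this (reservedPeerIds stands in for whatever snapshot the real computeReservedRelayCount filters against; this is not the exact source):

```ts
// Sketch of the round-3 watchdog gate.
const computeReservedRelayCount = () =>
  this.relayTargets.filter(({ peerId }) => reservedPeerIds.has(peerId)).length;

let reservedRelayCount = computeReservedRelayCount();
const target = this.relayReservationCountTarget;

for (const relay of this.relayTargets) {
  const thisRelayHasReservation = reservedPeerIds.has(relay.peerId);

  // Healthy when this peer holds a reservation OR enough distinct reservations
  // already exist overall (so a 2-of-3 config never churns the idle third peer).
  const reservationGateSatisfied =
    thisRelayHasReservation || reservedRelayCount >= target;
  if (reservationGateSatisfied) continue;

  // Below target and this peer is missing: drop + redial this specific relay,
  // then refresh the snapshot so later peers in the tick see the new count.
  if (transportUp && reservedRelayCount < target && !thisRelayHasReservation) {
    // ...existing drop+redial logic...
    reservedRelayCount = computeReservedRelayCount();
  }
}
```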

Regression test: 3 relays + 1 edge with relayReservationCount=2. Wait for 2 reservations to settle, capture the initial reserved peer-IDs, manually invoke watchdogTick(), then assert (a) no churn-indicating log line ("this relay missing", "to force reserve", "reservation-redial") was emitted and (b) the same 2 peer-IDs still hold reservations afterward. Test fails under the round-2 logic (the unreserved third peer triggers a redial log line) and passes under round-3.

Branimir Rakic and others added 2 commits May 15, 2026 17:15
Codex review on PR #526 round 3 caught a real bug introduced by the
round-2 silent-degradation fix: the per-relay gate required EVERY
configured peer to hold a reservation when target>1. That breaks
valid configs like `relayPeers.length=3` with `relayReservationCount=2`,
where the node is healthy with any 2-of-3 reservations: the
unreserved third peer's `!thisRelayHasReservation` stayed true
forever, triggering forced drop+redial on every grace-window expiry.
Worst case the redial succeeded and bumped count above target,
displacing one of the existing 2 reservations and producing
permanent churn. Best case it failed and we wasted dial budget.

Round-3 fix: introduce `reservedRelayCount` — the count of distinct
configured relays currently holding a reservation. The gate becomes
"this peer holds OR `reservedRelayCount >= target`":

- For configs with `target == relayPeers.length` (the existing 2-of-2
  test case): identical behavior to round 2 — every peer must reserve
  to satisfy the gate.
- For configs with `target < relayPeers.length` (the broken case):
  once enough total reservations exist, remaining peers' gates pass
  on the count check and the watchdog leaves them alone.
- For target=1 single-reservation deployments: identical behavior —
  the count-check still passes via `reservedRelayCount >= 1` whenever
  any reservation exists, matching the previous `haveAnyReservation`
  branch.

The `refreshReservationSnapshot` helper recomputes `reservedRelayCount`
alongside `circuitSelfAddrs` so a successful mid-tick redial flips
remaining peers' gates from "needs recovery" to "satisfied" cleanly.

The recovery branch now triggers on `reservedRelayCount < target &&
!thisRelayHasReservation` (instead of just `target > 1 &&
!thisRelayHasReservation`), so we only drop+redial peers when we're
actually below target — never over-reserving.

Regression test: 3 relays + edge with count=2, wait for 2 reservations
to settle, manually invoke watchdogTick, assert (a) no churn log
lines appear and (b) the same 2 reservations are still held with
unchanged peer-IDs. Test exercises the exact scenario Codex flagged.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts
// `circuitSelfAddrs` so a successful mid-tick redial that brings
// the count up to target flips remaining peers' gates from
// "needs recovery" to "satisfied" without further churn.
const computeReservedRelayCount = () => this.relayTargets.filter(({ peerId }) =>

🔴 Bug: reservedRelayCount is derived from the raw relayTargets array, so duplicate relayPeers entries for the same relay peerId get counted multiple times. In a config like [relayA, relayA] with relayReservationCount: 2, the watchdog can think it has 2 healthy reservations when there is only 1 actual relay, which defeats the redundancy guarantee. Deduplicate relay peers by peerId before clamping the target and before counting reserved relays.

Contributor Author

Fixed in 22213e4 — real bug, especially because the [relayA, relayA] config is exactly the kind of copy-paste mistake operators would actually make (and the watchdog gate would silently report "2 reservations held" from a single underlying reservation).

Two-site fix:

  1. Clamp comparison in start(): build a Set of distinct peerIds from the parsed multiaddrs and use .size instead of relayPeers.length. Warns operator when distinct < raw.length so the misconfig is visible, then warns again on the actual clamp if relayReservationCount exceeds the distinct count.
  2. relayTargets push loop: maintain a seenRelayPeerIds set and skip any entry whose peerId we've already pushed. Same dedup the clamp uses, applied at the structure that the watchdog actually iterates — so reservedRelayCount (the round-3 fix) stays a real distinct count even with duplicate config.

Together these guarantee that "N reservations" really means "N distinct relays" both at config-validation time and at runtime gating.

Regression test in relay.test.ts (dedupes duplicate relayPeers entries by peerId): edge configured with [relayAddr, relayAddr] and relayReservationCount: 2. Asserts (a) the dedup warning fires (2 entries but only 1 distinct peerIds), (b) the clamp warning fires (clamping to 1), and (c) the edge ends up with exactly one distinct /p2p-circuit self-addr — never two duplicate entries that would falsely satisfy the watchdog.

763/763 relay+core tests pass (was 762 + 1 new dedup case).

Branimir Rakic and others added 2 commits May 15, 2026 17:35
…in update) into feat/libp2p-multi-reservation
Codex review on PR #526 round 4 caught that `reservedRelayCount` (the
distinct-reservations check that round 3 introduced) was derived from
the raw `relayTargets` array, which itself was built from the raw
`config.relayPeers` list with no peerId-level deduplication. A
duplicate config like `[relayA, relayA]` with `relayReservationCount: 2`
would push two entries to `relayTargets`, so the watchdog would
iterate the same underlying relay twice and `reservedRelayCount`
would tick to 2 from a single actual reservation — making the gate
think the target is met when only one real relay is reserved. That
defeats the multi-reservation redundancy guarantee precisely when
operators are most likely to misconfigure (copy-paste in
`relayPeers`).

Two-site fix:

1. Clamp comparison (start()): build a Set of distinct peerIds from
   the parsed multiaddrs and use `.size` instead of `relayPeers.length`.
   Warns operator when `distinct < raw.length` so the misconfig is
   visible, then warns again on the actual clamp if relayReservationCount
   exceeds the distinct count.

2. relayTargets push loop: maintain a `seenRelayPeerIds` set and skip
   any entry whose peerId we've already pushed. Same dedup the clamp
   uses, applied at the structure that the watchdog actually iterates.

Together these guarantee that "N reservations" really means
"N distinct relays" both at config-validation time and at runtime
gating.

Regression test: edge configured with `[relayAddr, relayAddr]` and
`relayReservationCount: 2`. Asserts (a) the dedup warning fires
("2 entries but only 1 distinct peerIds"), (b) the clamp warning
fires ("clamping to 1"), and (c) the edge ends up with exactly one
distinct `/p2p-circuit` self-addr — never two duplicate entries that
would falsely satisfy the watchdog.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts Outdated
`[dkg-core] relayReservationCount=${relayReservationCount} exceeds ` +
`distinct relay peers=${distinctRelayPeerCount}; clamping to ${distinctRelayPeerCount}`,
);
relayReservationCount = distinctRelayPeerCount;

🔴 Bug: distinctRelayPeerCount can be 0 when every configured relay multiaddr parses but lacks a /p2p/<peerId> component. In that case we clamp relayReservationCount to 0, and that value later propagates into reservationConcurrency and relayReservationCountTarget. That gives this code an invalid target instead of falling back cleanly. Treat the zero-distinct case as "no usable relay peers" and keep the reservation target/concurrency at a valid minimum (or ignore the setting with a warning) rather than assigning 0 here.

nodes.push(core);
await core.start();

const ignoreWarn = warnSpy.mock.calls.find(call =>

🟡 Issue: This test only checks for the warning. A regression that still advertises multiple /p2p-circuit listen addrs on relay-server nodes would still pass as long as the warning is emitted. Add an assertion on the started core node's circuit self-addrs/listen count so the test actually guards the behavior this PR is trying to preserve.

const edge = new DKGNode({
listenAddresses: ['/ip4/127.0.0.1/tcp/0'],
enableMdns: false,
relayPeers: [relayAddr, relayAddr],

🟡 Issue: Using the exact same relayAddr twice does not really prove peerId-based dedupe. This test would also pass if the implementation deduped by raw address string, which is weaker than the bug described in the comment. Use two different multiaddrs that resolve to the same peer ID so the regression coverage matches the production failure mode.

branarakic pushed a commit that referenced this pull request May 15, 2026
Comment thread packages/core/src/node.ts Outdated
// Skipped; will also be dropped at relayTargets push.
}
}
const distinctRelayPeerCount = distinctRelayPeerIds.size;

🔴 Bug: this clamp is based on the raw parsed peer IDs, but the actual target set is filtered later (peerId.equals(this.node.peerId), missing /p2p, parse failures). That means configs like [self, relay] or valid multiaddrs without a /p2p component can still set relayReservationCount/reservationConcurrency to an unattainable value (or even 0), so the node silently runs with fewer reservations than requested. Please derive the clamp from the same usable relay-target set you actually keep, and treat an empty usable set as "no relay peers" instead of passing 0 through.

// to clamp + warn at start(). We verify the warn fires and the
// edge actually only ends up with 1 /p2p-circuit self-addr (not
// 3 attempts queued forever).
const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});

🟡 Issue: the new console.warn spy is only restored on the happy path here (and in the similar tests below). If start() or any assertion throws, the spy leaks into later tests and can mask real warning behavior, which makes failures harder to diagnose. Wrap these warning-spy tests in try/finally or restore the spy from afterEach.

branarakic changed the base branch from feat/libp2p-relay-observability to main on May 15, 2026 18:10
Comment thread packages/core/src/node.ts Outdated
`multi-reservation sizing`,
);
}
if (relayReservationCount > distinctRelayPeerCount) {

🔴 Bug: distinctRelayPeerCount here still differs from the number of usable relay targets. It includes entries that are dropped later (for example this node's own peer ID), and it can be 0 when every relayPeers entry is malformed or missing /p2p. That means a previously validated count can be clamped to 0, or to a value larger than the real target set, which then feeds reservationConcurrency / relayReservationCountTarget and creates an unattainable reservation goal. Clamp against the filtered, usable relay target count instead, and never let the effective reservation count fall to 0 unless there are truly no usable relays.

Contributor Author

Real bug — confirmed and fixed in e71bb2c.

Root cause: the round-4 distinct-peerId clamp parsed relayPeers independently of the later relayTargets push, so the two had a different filter chain — clamp counted self-peerId entries that the push later dropped, and the clamp was 0 when every entry was malformed (which silently disabled multi-reservation by clamping relayReservationCount to 0). Either way the runtime relayReservationCountTarget was unattainable.

Fix: build a single canonical usableRelayCandidates: RelayTarget[] at the top of start() that's the source of truth for both the clamp AND the relayTargets push. Filter chain: parse multiaddr → drop malformed → drop self → dedup by peerId. Self-peerId derivation uses peerIdFromPrivateKey(privateKey) since libp2p hasn't been created yet (deterministic from the same key createLibp2p() receives). One combined warning summarises each rejection category.
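
A minimal sketch of that canonical filter chain (the RelayTarget shape and the import path for peerIdFromPrivateKey are assumptions; round 5c later extends this to aggregate alternate multiaddrs per peer):

```ts
import { multiaddr, type Multiaddr } from '@multiformats/multiaddr';
import { peerIdFromPrivateKey } from '@libp2p/peer-id';

interface RelayTarget { peerId: string; addr: Multiaddr }   // assumed shape

// Sketch only: parse, drop malformed, drop self, dedup by peerId.
const selfPeerId = peerIdFromPrivateKey(privateKey).toString();
const seenPeerIds = new Set<string>();
const usableRelayCandidates: RelayTarget[] = [];

for (const raw of this.config.relayPeers ?? []) {
  let ma: Multiaddr;
  try { ma = multiaddr(raw); } catch { continue; }          // drop malformed
  const pid = ma.getPeerId();
  if (!pid || pid === selfPeerId) continue;                 // drop missing /p2p and self
  if (seenPeerIds.has(pid)) continue;                       // dedup by peerId
  seenPeerIds.add(pid);
  usableRelayCandidates.push({ peerId: pid, addr: ma });
}
```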

Also: isEdgeWithRelays now gates on usableRelayCandidates.length > 0 instead of raw relayPeers.length, so a config like [malformed, self] falls back cleanly to the no-relays path (no /p2p-circuit listen addrs, no watchdog, no reservationConcurrency override).

Regression test (drops self-peerId entries from the clamp + relayTargets): derives a private key, computes its peerId, configures an edge with [realRelayAddr, selfMultiaddr] and relayReservationCount: 2. Asserts the self-filter warning fires, clamp warning shows usable relay peers=1; clamping to 1, edge ends up with exactly 1 reservation. 776/776 core tests pass.

Comment thread packages/core/src/node.ts
// up-to-date reservation set.
if (redialed) {
if (this.relayReservationCountTarget <= 1) break;
refreshReservationSnapshot();

🔴 Bug: after a successful node.dial(addr) we immediately recompute reservedRelayCount and keep iterating. A completed dial does not guarantee the new /p2p-circuit self-address is already advertised, so during recovery from 1/2 reservations in a 3 peers, target=2 setup this can still see reservedRelayCount === 1 and redial the third relay as well. That reintroduces the churn this PR is trying to avoid and can overshoot the requested reservation count. Bound the number of forced redials per tick to the number of missing slots at tick start, or wait until the reservation actually appears before continuing.

Contributor Author

Real bug — confirmed and fixed in e71bb2c.

You're right that a completed node.dial() doesn't guarantee the new /p2p-circuit self-addr is advertised. With round-4's per-redial refreshReservationSnapshot(), the post-dial state in a 1/2 → 2/2 recovery in a 3 peers, target=2 setup would still see reservedRelayCount === 1 and the loop would redial the third relay too — overshooting + churning exactly what this PR is trying to avoid.

Fix: per-tick budget for forced redials, capped at the missing-slot count computed at TICK START. Specifically:

```ts
const missingSlotsAtTickStart = Math.max(
  0, this.relayReservationCountTarget - reservedRelayCount,
);
let forcedRedialsThisTick = 0;
// ...inside loop, redial branch:
if (
  this.relayReservationCountTarget > 1 &&
  missingSlotsAtTickStart > 0 &&
  forcedRedialsThisTick >= missingSlotsAtTickStart
) {
  allHealthy = false;
  continue; // benign "still recovering" — no exponential backoff
}
forcedRedialsThisTick += 1;
// existing redial logic...
```

The cap is deliberately NOT recomputed mid-tick — if a redial happens to propagate fast enough that reservationGateSatisfied flips for the next peer, the happy path skips it naturally; the cap only matters when reservations haven't propagated yet (the bug case).

Regression test (caps forced reservation-redials per tick at the missing-slot count): 3 relays, edge with target=2. Waits for 2 reservations (so missingSlots === 0), triggers 3 back-to-back watchdogTick() calls. Asserts 0 to force reserve log lines across all three ticks — even if a snapshot transiently misreports a peer as missing, the budget cap holds total churn at 0 because the deficit at tick start is 0.

776/776 core tests pass.

Two Codex review bugs on PR #526 round 5:

1. (line 551) `distinctRelayPeerCount` used for the
   `relayReservationCount` clamp could disagree with the actual
   `relayTargets.length` the watchdog iterates: it counted entries
   pointing at this node's own peerId (filtered out at the later
   relayTargets push), and it could be 0 when every entry was
   malformed (which then silently clamped the request to 0,
   disabling multi-reservation). Either way the runtime
   `relayReservationCountTarget` ended up unattainable.

2. (line 1009) After a successful `node.dial(addr)` the watchdog
   called `refreshReservationSnapshot()` and kept iterating. A
   completed dial does NOT guarantee the new `/p2p-circuit`
   self-addr is advertised yet, so during recovery from `1/2`
   reservations in a `3 peers, target=2` setup, the post-dial
   snapshot would still see `reservedRelayCount === 1` and the loop
   would redial the third relay too — overshooting the target and
   re-introducing the churn this PR is meant to avoid.

Fix 1 — single canonical "usable relay candidates" list:
- Derive `selfPeerIdEarly` from `privateKey` via
  `peerIdFromPrivateKey()` (libp2p hasn't been created yet, but the
  derivation is deterministic from the same key `createLibp2p()`
  receives).
- Build `usableRelayCandidates: RelayTarget[]` ONCE at the top of
  `start()`: parse multiaddr, drop malformed, drop self, dedup by
  peerId. Single combined warning summarising each rejection
  category for operator visibility.
- `isEdgeWithRelays` now gates on `usableRelayCandidates.length > 0`
  instead of raw `relayPeers.length`, so a config like
  `[malformed, self]` falls back to the no-relays path (no
  `/p2p-circuit` listen addrs, no watchdog).
- The clamp uses `usableRelayCandidates.length`. The relayTargets
  push iterates the SAME canonical list, so
  `this.relayReservationCountTarget` always matches
  `this.relayTargets.length` and the gate
  `reservedRelayCount >= target` is achievable.

Fix 2 — per-tick forced-redial budget:
- Compute `missingSlotsAtTickStart = max(0, target - reservedRelayCount)`
  ONCE at watchdogTick start. Track `forcedRedialsThisTick`.
- In the redial branch: when `target > 1` and we've already
  dispatched `missingSlotsAtTickStart` forced redials this tick,
  skip remaining peers with `allHealthy = false; continue;` (treat
  as benign "still recovering", don't apply exponential backoff —
  the next tick reassesses).
- Deliberately don't recompute `missingSlotsAtTickStart` mid-tick.
  If a redial happens to propagate fast, the happy path
  (`reservedRelayCount >= target`) skips remaining peers
  naturally; the cap only matters when reservations haven't
  propagated yet, which is exactly the bug case.

Two new regression tests in `relay.test.ts`:

- "drops self-peerId entries from the clamp + relayTargets":
  derives a private key, computes its peerId, configures an edge
  with `[realRelayAddr, selfMultiaddr]` and `relayReservationCount: 2`.
  Asserts the self-filter warning fires, the clamp warning shows
  `usable relay peers=1; clamping to 1`, and the edge ends up with
  exactly 1 distinct `/p2p-circuit` self-addr.

- "caps forced reservation-redials per tick at the missing-slot
  count": 3 relays, edge with target=2. Waits for 2 reservations
  (so `missingSlotsAtTickStart === 0`), then triggers 3
  back-to-back `watchdogTick()` calls. Asserts 0 "to force
  reserve" log lines across all three ticks — even if a snapshot
  transiently misreports a peer as missing, the budget cap holds
  total churn at 0 because the deficit at tick start is 0.

Updated the round-4 dedup test to match the new combined warning
text (`"2 entries supplied, 1 usable (skipped 1 duplicate peerIds)"`).

Full core suite: 776/776 (was 774, +2 new round-5 cases).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts Outdated
listenAddrs.push('/p2p-circuit');
}
this.relayReservationCountTarget = relayReservationCount;
} else if (this.config.relayPeers?.length) {

🔴 Bug: This fallback also matches edge nodes whose relayPeers were all filtered out above as malformed/self/duplicate. In that case we warned that the value is ignored, but we still add a legacy /p2p-circuit listener here and leave the node on the raw-relayPeers path instead of the documented "no-relays" path. That can leave an edge node half-configured and unreachable. Gate this branch on enableRelay (or at least usableRelayCandidates.length > 0) so unusable relay lists are truly ignored.

Contributor Author

Real bug — confirmed and fixed in the follow-up round-5b commit.

You're right: the round-5 refactor changed isEdgeWithRelays to gate on usableRelayCandidates.length > 0, but I left the legacy fallback's else if (this.config.relayPeers?.length) unchanged. So an edge config like [malformed, malformed] got the "value ignored" warning at the top, then later fell into the legacy core-node fallback and ended up with a /p2p-circuit listener with nothing to reserve against — exactly the half-configured state you describe.

Fix: gate the fallback on enableRelay && relayPeers?.length so only core / relay-server nodes with explicit relayPeers get the legacy single /p2p-circuit. Edge nodes with no usable relays now truly hit the no-relays path.

Regression test (falls back to no-relays path when every relayPeers entry is unusable): edge with relayPeers: ['/not-a-multiaddr', 'also-malformed']. Asserts (a) the "0 usable / malformed" warning fires, (b) no /p2p-circuit ever appears in the edge's listen or self-addresses. 16/16 relay tests pass.

Codex review on PR #526 round 5b caught that the
`else if (this.config.relayPeers?.length)` fallback in start()
matched not only the intended core-node case but also edge nodes
whose relayPeers entries had ALL been filtered out as
malformed/self/duplicate. In that case the round-5 commit warned
that the value was ignored, but this branch still added a legacy
`/p2p-circuit` listener with nothing to reserve against — leaving
the edge node half-configured.

Fix: gate on `enableRelay && relayPeers?.length` so only core /
relay-server nodes with explicit relayPeers get the legacy single
`/p2p-circuit`. Edge nodes with no usable relays now truly hit
the no-relays path, matching the "value ignored" warning.

Regression test: edge with `relayPeers: ['/not-a-multiaddr',
'also-malformed']`. Asserts (a) the "0 usable / malformed" warning
fires, (b) no `/p2p-circuit` ever appears in the edge's listen or
self-addresses. 16/16 relay tests pass (was 15, +1 round-5b case).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts Outdated
continue;
}
seenPids.add(p2p);
usableRelayCandidates.push({ peerId: pid, addr: ma });

🔴 Bug: Treating same-peerId entries as duplicates here drops alternate multiaddrs for the same relay. Before this PR we merged and dialed every supplied address; now a config like [relayA-tcp, relayA-ws] keeps only the first one, so if that address is stale the relay becomes unreachable and the node clamps itself down to fewer usable relays than were actually configured. Keep the peer-level dedupe for reservation-count math, but preserve all addresses per peer in the peer store / retry path.

Contributor Author

Real bug — confirmed and fixed in 2b3bfa7.

You're right: the round-5 dedup-by-peerId regressed the alt-transport case. A config like [relayA-tcp, relayA-ws] would collapse to one address; if the surviving transport went stale the relay became unreachable.

Fix: dedup is now by full multiaddr STRING. Same peerId + different multiaddr aggregates into one RelayTarget whose addrs carries both. Per-peer reservation-count math is unchanged — one RelayTarget per peerId, watchdog still iterates each peer once. peerStore.merge() receives all addrs; node.dial(addrs) (libp2p accepts arrays) tries each in order so a stale transport doesn't take the relay out of rotation.

Warning text now distinguishes:

  • N exact-duplicate multiaddrs — true dups (e.g. copy-paste of same string)
  • N alternate addrs merged into existing relay peers — informational; preserved as fallbacks

Regression test (preserves alternate multiaddrs for the same peerId as fallbacks): real relay + edge configured with [realAddr, fakeAlternateAddr] (same peerId, different transport endpoint pointing at port 9). Asserts (a) the "alternate addrs merged" warning fires, (b) distinct relay peer count = 1, (c) edge actually establishes a reservation via the working address even though the alternate is unreachable. 17/17 relay tests pass.
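
Roughly, the aggregation replaces the peerId dedup from the earlier sketch like this (field names follow the later code excerpt in this review; still illustrative):

```ts
// Round-5c shape: dedup on the full multiaddr string, but group all addresses
// for the same peerId into a single RelayTarget so alternates stay as fallbacks.
const seenAddrStrings = new Set<string>();
const candidateByPid = new Map<string, { peerId: string; addrs: Multiaddr[] }>();

for (const raw of this.config.relayPeers ?? []) {
  let ma: Multiaddr;
  try { ma = multiaddr(raw); } catch { continue; }
  const pid = ma.getPeerId();
  if (!pid || pid === selfPeerId) continue;

  const maStr = ma.toString();
  if (seenAddrStrings.has(maStr)) continue;        // exact-duplicate multiaddr
  seenAddrStrings.add(maStr);

  const existing = candidateByPid.get(pid);
  if (existing) {
    existing.addrs.push(ma);                       // alternate addr, kept as a fallback
  } else {
    candidateByPid.set(pid, { peerId: pid, addrs: [ma] });
  }
}
// Per-peer reservation math uses candidateByPid.size; dialing passes all addrs.
```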

Comment thread packages/core/src/node.ts
// 0 usable, so the node should fall back to the no-relays path
// (no `/p2p-circuit` listen addrs, no watchdog, no
// reservationConcurrency override). Codex PR #526 round 5.
const isEdgeWithRelays = !enableRelay && usableRelayCandidates.length > 0;

🟡 Issue: This new "no-relays" gate is based on usableRelayCandidates, but the AutoNAT decision below still keys off raw relayPeers.length. When every configured relay is filtered out as malformed/self/duplicate, the node now falls back to the no-relays path while still disabling AutoNAT, so it has neither relay reachability nor NAT probing. Reuse usableRelayCandidates.length/isEdgeWithRelays for the AutoNAT gate so the fallback behaves like the true no-relay case.

Contributor Author

Fixed in 2b3bfa7 — fair point, AutoNAT and the "no-relays" listen-addr decision should share the same predicate.

Changed:
```ts
const useAutoNAT = this.config.enableAutoNAT ??
  !(usableRelayCandidates.length > 0 || enableRelay);
```

So when every configured relay is filtered out as malformed/self/duplicate, the node now BOTH falls back to the no-relays listen path AND re-enables AutoNAT — the all-filtered case behaves like a node that was never given relayPeers in the first place. Operator can still hard-pin via enableAutoNAT: true|false (the ?? keeps that escape hatch).

Comment thread packages/core/src/node.ts Outdated
continue;
}

forcedRedialsThisTick += 1;

🔴 Bug: The per-tick budget is consumed before we know whether this forced redial actually restored a missing slot. In a target=2, reserved=1 case, if the first missing relay's node.dial(addr) fails, forcedRedialsThisTick is still incremented and the watchdog skips the next eligible relay for the rest of the tick, even though that other relay could have satisfied the missing reservation. Only count successful redials against the budget, or refund the slot on failure.

Contributor Author

Real bug — fixed in 2b3bfa7.

You're right: counting the budget BEFORE the dial means a failed node.dial() permanently locks out the next eligible relay this tick, even though the failed dial didn't actually consume a reservation slot.

Fix: moved forcedRedialsThisTick += 1 inside the try block, after redialed = true. So only successful dials count against the budget; failures don't waste a slot and the next eligible relay can still be tried in the same tick. The budget check at the top of the redial branch (forcedRedialsThisTick >= missingSlotsAtTickStart) still gates entry, so the cap holds for the success-only counter.

Net behavior: in target=2, reserved=1 with relay-A-failing/relay-B-working, we now (a) attempt redial of relay A, (b) attempt redial of relay B (because the relay-A failure didn't increment), (c) relay B succeeds, increment to 1, budget reached, stop. Previously: (a) relay A attempt, increment to 1, budget reached, skip relay B — leaving the missing slot unfilled until the next tick.
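
In sketch form, the corrected redial branch looks like this (building on the snippet earlier in this thread; an illustrative fragment of the watchdog's per-relay loop, not the exact source):

```ts
// Budget gate unchanged: skip once enough slots have already been restored this tick.
if (
  this.relayReservationCountTarget > 1 &&
  missingSlotsAtTickStart > 0 &&
  forcedRedialsThisTick >= missingSlotsAtTickStart
) {
  allHealthy = false;
  continue;
}

try {
  await this.node.dial(relay.addrs);   // may throw; a failed dial costs no budget
  redialed = true;
  forcedRedialsThisTick += 1;          // count only successful redials
  refreshReservationSnapshot();
} catch {
  // Failure: the budget is untouched, so the next eligible relay can still be
  // tried within the same tick.
  allHealthy = false;
}
```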

778/778 core tests pass.

…-budget on success only

Three Codex bugs on PR #526 round 5c:

1. (line 563 🔴) Round-5 dedup-by-peerId dropped alternate
   multiaddrs for the same relay (e.g. `[relayA-tcp, relayA-ws]`
   collapsed to one address). If the surviving address went stale
   the relay became unreachable and the node clamped itself to
   fewer reservations than configured.

   Fix: dedup is now by full multiaddr STRING. Same-peerId entries
   with different multiaddrs aggregate into one `RelayTarget` whose
   `addrs` carries both. `peerStore.merge()` receives all addrs;
   `node.dial(addrs)` (libp2p accepts arrays) tries each in order.
   Per-peer reservation-count math is unchanged — one `RelayTarget`
   per peerId, watchdog still iterates each peer once. Warning
   distinguishes "exact-duplicate multiaddrs" (true dups) from
   "alternate addrs merged into existing relay peers" (info-level).

2. (line 583 🟡) AutoNAT gate keyed off raw `relayPeers.length`,
   not `usableRelayCandidates.length`. When every configured relay
   was filtered out as malformed/self/duplicate, the node fell
   back to no-relays AND silently disabled AutoNAT — neither
   relay reachability nor NAT probing.

   Fix: `useAutoNAT = enableAutoNAT ?? !(usableRelayCandidates.length > 0
   || enableRelay)`. Same predicate as `isEdgeWithRelays` so the
   two gates stay in sync.

3. (line 1061 🔴) Per-tick forced-redial budget was consumed
   BEFORE the dial — a failed `node.dial()` still incremented
   `forcedRedialsThisTick` and locked the watchdog out of trying
   the next eligible relay this tick.

   Fix: move `forcedRedialsThisTick += 1` into the `try` block
   after `redialed = true`. Only successful dials count against
   the budget; failures don't waste a slot.

Two new regression tests:

- "preserves alternate multiaddrs for the same peerId as
  fallbacks": real relay + edge configured with `[realAddr,
  fakeAlternateAddr]` (same peerId, different transport endpoint).
  Asserts the "alternate addrs merged" warning, distinct relay peer
  count = 1, AND the edge actually establishes a reservation via
  the working address even though the alternate is unreachable.

(The AutoNAT gate fix is structural — the existing "no-relays
fallback" test already exercises the all-filtered case; we just
extended it implicitly by ensuring AutoNAT now flips on for that
path.)

Updated text in two existing tests to match the renamed warnings
("distinct relay peers usable", "exact-duplicate multiaddrs"). All
17 relay tests green. Full core suite: 778/778 (was 776, +2 new).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts Outdated
// gate `reservedRelayCount >= target` is achievable.
const usableCount = usableRelayCandidates.length;
if (relayReservationCount > usableCount) {
console.warn(

🟡 Issue: This now warns for the default path as well. On any edge node with a single configured relay, relayReservationCount defaults to 3, gets clamped to 1, and emits a startup warning even though the operator did nothing wrong. That will make healthy one-relay deployments look misconfigured and bury real relay problems in log noise. Consider warning only for explicit overrides that exceed the usable relay count, and silently or info-level clamp the implicit default.

Contributor Author

Fixed in 3d0b701 — fair point, log noise on a healthy default config is exactly what bites operators when they're triaging real problems.

Track isExplicitOverride = verdict?.ok === true. Only warn on the clamp when the operator EXPLICITLY supplied a count > usable; default-driven clamps (verdict null) and post-fallback clamps (verdict not-ok already warned) now silently adjust. Net result for a single-relay edge node: zero startup warnings if the count comes from defaults.
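
As a sketch (the verdict shape comes from the validation helper described earlier in the PR, usableCount matches the clamp excerpt above, and the names are otherwise illustrative):

```ts
// Warn on the clamp only when the operator explicitly supplied a valid count.
const verdict = this.config.relayReservationCount !== undefined
  ? validateRelayReservationCount(this.config.relayReservationCount)
  : null;
const isExplicitOverride = verdict?.ok === true;

if (relayReservationCount > usableCount) {
  if (isExplicitOverride) {
    console.warn(
      `[dkg-core] relayReservationCount=${relayReservationCount} exceeds usable ` +
      `relay peers=${usableCount}; clamping to ${usableCount}`,
    );
  }
  relayReservationCount = usableCount;  // default-driven clamps adjust silently
}
```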

&& call[0].includes('relayReservationCount=3')
&& call[0].includes('relay servers don\'t multi-reserve'),
);
expect(

🟡 Issue: This test only proves that the ignore warning is logged; it never asserts the behavior it describes (core nodes should not amplify into multiple /p2p-circuit reservations). A regression that still emitted the warning but added extra circuit listeners would pass here. Please assert the resulting circuit self-address count or listen-address count after startup so the non-amplification contract is actually pinned.

Contributor Author

Fixed in 3d0b701 — extended the test to assert the actual non-amplification contract:

```ts
const coreCircuitSelfAddrs = core.libp2p.getMultiaddrs()
  .map(ma => ma.toString())
  .filter(a => a.includes('/p2p-circuit'));
expect(coreCircuitSelfAddrs.length).toBeLessThanOrEqual(1);
```

So a regression that still logs the "ignored" warning but actually amplifies into multiple /p2p-circuit reservations would now fail this test directly.

Comment thread packages/core/src/types.ts Outdated
* Default: 3. Capped at 16 (above which the failure-tolerance
* benefit is marginal but the per-relay cost is real).
*
* IGNORED when no relayPeers are configured (i.e. the node isn't

🔵 Nit: The runtime also ignores this option when the node is itself running a relay server, but this exported DKGNodeConfig doc only mentions the "no relayPeers" case. That mismatch makes the public config contract misleading for package consumers. Update the JSDoc here to match the actual gating, and mirror the same wording in the copied CLI/agent descriptions.

Contributor Author

Fixed in 3d0b701 — updated the JSDoc in all three places to enumerate BOTH ignore cases:

  • packages/core/src/types.ts (the source of truth)
  • packages/cli/src/config.ts (CLI config; the wording mirrors core)
  • packages/agent/src/dkg-agent.ts (agent config; same wording)

Each now reads "Ignored (with a warning when set explicitly) when no relayPeers are configured or when the node itself runs a relay server — relay servers don't multi-reserve through other relays." That matches the runtime gating exactly.
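For illustration, a sketch of what the resulting core-side declaration might look like (the surrounding interface members are omitted and the exact field layout is an assumption):

```ts
// Sketch of the documented option on the core config type.
export interface DKGNodeConfig {
  /**
   * Number of simultaneous circuit-relay reservations an edge node maintains.
   * Default: 3. Capped at 16 (above which the failure-tolerance benefit is
   * marginal but the per-relay cost is real).
   *
   * Ignored (with a warning when set explicitly) when no relayPeers are
   * configured or when the node itself runs a relay server, since relay
   * servers don't multi-reserve through other relays.
   */
  relayReservationCount?: number;
}
```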

…+ JSDoc parity

Three Codex findings on PR #526 round 5d:

1. (line 633 🟡) Default `relayReservationCount=3` clamping to 1 on
   any healthy single-relay edge node was emitting a startup
   warning, even though the operator did nothing wrong. Buries real
   problems in log noise.

   Fix: track `isExplicitOverride = verdict?.ok === true` and only
   warn on the clamp when the operator EXPLICITLY supplied a count
   that exceeds the usable peers. Default-driven clamps (and clamps
   after invalid-input fallback, which already warned) now silently
   adjust the count.

2. (line 457 🟡) The "skips multi-reservation amplification on
   relay-server (core) nodes" test only asserted the warning, not
   the actual non-amplification contract. A regression that still
   logged the warning but added extra `/p2p-circuit` listeners would
   pass.

   Fix: extended the test to also assert
   `coreCircuitSelfAddrs.length <= 1` after startup, pinning the
   "core nodes never amplify into multiple `/p2p-circuit`
   reservations regardless of relayReservationCount" invariant
   directly.

3. (line 96 🔵) JSDoc on `relayReservationCount` mentioned the
   no-relayPeers ignore case but not the relay-server ignore case
   (the `enableRelayServer: true` / `nodeRole: 'core'` path that
   the runtime also explicitly skips). Mismatch between public
   contract and runtime behaviour.

   Fix: updated the JSDoc in three places (`packages/core/src/types.ts`,
   `packages/cli/src/config.ts`, `packages/agent/src/dkg-agent.ts`) to
   list both ignore cases — the warning-when-set-explicitly behaviour
   matches the warnings the runtime actually emits.

Full core suite: 778/778. A full `pnpm -r build` completes cleanly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/core/src/node.ts Outdated
continue;
}
seenAddrStrings.add(maStr);
const existing = candidateByPid.get(p2p);

🔴 Bug: candidateByPid is keyed by the raw /p2p string, not the parsed peer id. The same relay can be written in multiple valid peer-id encodings, so a mixed config would bypass this dedupe, inflate usableRelayCandidates.length, and make the clamp/target logic think there are more distinct relays than actually exist. Key the map by pid.toString() instead.

Contributor Author

Real bug — fixed in ac3952f.

You're right: peerId encodings (base58btc, base32, ...) are interchangeable for the same identity, and keying off the raw /p2p/<value> string would fail to dedupe a mixed-encoding config.

Fix: candidateByPid is now keyed by pid.toString() (canonical encoding). The duplicate-multiaddr detector also normalises — seenAddrKeys is keyed by ${transportPrefix}/p2p/${pid.toString()} rather than the raw multiaddr string — so encoding-mismatched duplicates of the same address are also caught.
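A minimal sketch of that encoding-independent dedupe, with illustrative helper and variable names (the real node.ts differs in detail):

```ts
// Sketch only: key relay candidates by the canonical peerId encoding so
// base58btc / base32 spellings of the same relay collapse to one entry.
import { multiaddr } from '@multiformats/multiaddr';
import { peerIdFromString } from '@libp2p/peer-id';

interface RelayCandidate {
  pidKey: string;   // canonical pid.toString()
  addrs: string[];  // every configured multiaddr for this relay
}

function collectRelayCandidates(relayPeers: string[]): RelayCandidate[] {
  const candidateByPid = new Map<string, RelayCandidate>();
  const seenAddrKeys = new Set<string>();

  for (const raw of relayPeers) {
    let p2p: string | null = null;
    try {
      p2p = multiaddr(raw).getPeerId();
    } catch {
      // malformed multiaddr; counted as `malformed` elsewhere
    }
    if (p2p == null) continue;

    const pidKey = peerIdFromString(p2p).toString(); // canonical encoding
    const transportPrefix = raw.slice(0, raw.lastIndexOf('/p2p/'));
    const addrKey = `${transportPrefix}/p2p/${pidKey}`;

    if (seenAddrKeys.has(addrKey)) continue; // duplicate address, regardless of peerId spelling
    seenAddrKeys.add(addrKey);

    const existing = candidateByPid.get(pidKey);
    if (existing) {
      existing.addrs.push(raw); // alternate addr merged into the same relay peer
    } else {
      candidateByPid.set(pidKey, { pidKey, addrs: [raw] });
    }
  }

  return [...candidateByPid.values()];
}
```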

Comment thread packages/core/src/node.ts
if (selfHits) reasons.push(`${selfHits} pointing at this node's own peerId`);
if (dupAddrs) reasons.push(`${dupAddrs} exact-duplicate multiaddrs`);
if (altAddrs) reasons.push(`${altAddrs} alternate addrs merged into existing relay peers`);
console.warn(

🟡 Issue: altAddrs is a healthy fallback case, but this always logs at warning level when an operator provides multiple transports for the same relay. That means a supported config generates a warning on every startup, which makes the real malformed/self/duplicate warnings here easier to ignore. Consider warning only for invalid entries and downgrading the alternate-address message to info/debug.

Contributor Author

Fixed in ac3952f — agree, alt-addrs aggregation is a healthy supported config and a warn-level log on every startup is exactly the kind of noise that makes operators tune out the real warnings.

Split the emission:

  • If any of malformed / self-hits / dup-addrs is non-zero → console.warn (operator has actual problems; alt-addrs goes in the reasons list as context).
  • If only alt-addrs is non-zero → console.log with rephrased text: "N distinct relay peers (M alternate addrs merged)" (no "usable" / "skipped" wording that implies error).

Updated the round-5c "preserves alternate multiaddrs" test to assert (a) the info log fires AND (b) no warn-level log mentions "alternate addrs" — pinning the new contract directly.
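A minimal sketch of the split emission; counter names follow this thread, and the message wording is only an approximation of the real node.ts strings:

```ts
// Sketch only: warn when there are real problems, info-log when the only
// "finding" is alternate addrs for an already-known relay.
function logRelayCandidateSummary(
  usable: number,
  malformed: number,
  selfHits: number,
  dupAddrs: number,
  altAddrs: number,
): void {
  const reasons: string[] = [];
  if (malformed) reasons.push(`${malformed} malformed`);
  if (selfHits) reasons.push(`${selfHits} pointing at this node's own peerId`);
  if (dupAddrs) reasons.push(`${dupAddrs} exact-duplicate multiaddrs`);

  if (reasons.length > 0) {
    // Operator has actual problems; alt-addrs is included only as context.
    if (altAddrs) reasons.push(`${altAddrs} alternate addrs merged into existing relay peers`);
    console.warn(`relayPeers: ${usable} distinct relay peers usable (${reasons.join(', ')})`);
  } else if (altAddrs) {
    // Healthy supported config: multiple transports for the same relay.
    console.log(`relayPeers: ${usable} distinct relay peers (${altAddrs} alternate addrs merged)`);
  }
}
```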

// to clamp + warn at start(). We verify the warn fires and the
// edge actually only ends up with 1 /p2p-circuit self-addr (not
// 3 attempts queued forever).
const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});

🟡 Issue: this spy is only restored at the bottom of the test. If any assertion or start() call above throws, console.warn stays mocked and can corrupt later tests with misleading failures. Wrap each new console.warn/console.log spy block in try/finally (or use using) so cleanup always runs; the same pattern repeats in the other added tests.

Contributor Author

Fixed in ac3952f — fair test-hygiene point.

Rather than wrapping each test in try/finally individually (8 spies in the file), added vi.restoreAllMocks() to the existing afterEach:

```ts
afterEach(async () => {
  vi.restoreAllMocks(); // unconditional cleanup for every spy
  for (const n of nodes) {
    try { await n.stop(); } catch (err) { ... }
  }
  nodes.length = 0;
});
```

One-line global fix that covers every spy in the file (existing inline mockRestore() calls remain harmlessly idempotent; future spies are auto-cleaned).

Three Codex findings on PR #526 round 5e:

1. (line 583 🔴) `candidateByPid` was keyed by the raw `/p2p/<value>`
   string a user supplied. The same peerId can be written in
   multiple valid encodings (base58btc, base32, ...), so a
   mixed-encoding config like `[/.../p2p/<base58>, /.../p2p/<base32>]`
   for the same relay would bypass the dedupe — inflating
   `usableRelayCandidates.length` and making the clamp / target
   logic think there are more distinct relays than actually exist.

   Fix: key `candidateByPid` by `pid.toString()` (the canonical
   encoding), and build `seenAddrKeys` from the transport prefix
   plus the canonical peerId so duplicate-address detection is
   also encoding-independent.

2. (line 599 🟡) Aggregating alternate multiaddrs for the same
   relay is a healthy supported config, but the previous code
   logged it at `console.warn` level — every startup of an
   alt-transport edge would generate a warning and bury real
   problems in noise.

   Fix: split the log emission. If any of malformed/self/duplicate
   is non-zero, the warn-level log includes alt-addrs in the
   reasons list (operator already has problems to look at). If only
   alt-addrs is non-zero, log at info level (`console.log`) with
   wording that doesn't imply error: "N distinct relay peers (M
   alternate addrs merged)".

3. (line 365 🟡) Per-test `console.warn` / `console.log` spies were
   restored inline at the bottom of each test body. A thrown
   assertion or a `start()` error left them mocked and corrupted
   later tests.

   Fix: added `vi.restoreAllMocks()` to the existing `afterEach` so
   every spy is unconditionally restored regardless of how the
   test exited. One-line global cleanup that covers every
   spy in the file (existing and future).

Updated the round-5c "preserves alternate multiaddrs" test to
spy on `console.log` for the alt-addrs message AND assert that
NO `console.warn` call mentioning "alternate addrs" was emitted —
pinning the new "alt-addrs is info-level on healthy startup"
contract directly.
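A minimal sketch of that assertion shape (vitest); the spy setup and message substrings are assumptions, not verbatim relay.test.ts code:

```ts
// Sketch only: alt-addrs must surface at info level and never at warn level.
const logSpy = vi.spyOn(console, 'log').mockImplementation(() => {});
const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {});

// ... start the edge with [realAddr, fakeAlternateAddr] here ...

// (a) the info-level alt-addrs message fires
expect(
  logSpy.mock.calls.some(call => String(call[0]).includes('alternate addrs merged')),
).toBe(true);

// (b) no warn-level log mentions alternate addrs on this healthy startup
expect(
  warnSpy.mock.calls.some(call => String(call[0]).includes('alternate addrs')),
).toBe(false);
```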

17/17 relay tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions github-actions Bot left a comment

Codex review completed — no issues found.

@branarakic branarakic merged commit 143ce22 into main May 15, 2026
36 checks passed
@branarakic branarakic deleted the feat/libp2p-multi-reservation branch May 15, 2026 21:31