Skip to content

fix(daemon): route auto-handshake through HandshakeSendRequest wrapper (follow-up to #208)#209

Merged
TeoSlayer merged 1 commit into
mainfrom
fix/auto-handshake-dedup-recursion
May 31, 2026
Merged

fix(daemon): route auto-handshake through HandshakeSendRequest wrapper (follow-up to #208)#209
TeoSlayer merged 1 commit into
mainfrom
fix/auto-handshake-dedup-recursion

Conversation

@TeoSlayer
Copy link
Copy Markdown
Owner

Summary

Follow-up to #208. That PR added per-peer in-flight dedup at the daemon's IPC entry point (`HandshakeSendRequest`) and the IPC-side burst symptom is gone — 0 "ephemeral ports exhausted" errors in the local repro post-merge.

But the same repro surfaced a second pathway: the proactive auto-handshake fired from inside `DialConnection` (`daemon.go:3125`) calls `d.handshakes.SendRequest` directly, bypassing the dedup wrapper. That call re-enters `DialConnection` on port 444 for the same peer, which re-checks the same gate, fires another `go SendRequest`, and the chain self-fuels until the dial budget cuts it.

In the local repro one user-triggered handshake to a `trustedagents` peer (crossref-funders, node 41451) produced ~12,279 "direct handshake failed" log entries within ~800 ms before the dial timeout cleared it.

Fix

One-line: route the auto-handshake through `d.HandshakeSendRequest` instead of `d.handshakes.SendRequest`. The dedup map now catches the recursive call — second entry sees the in-flight slot, short-circuits with `ErrHandshakeInFlight`, the goroutine returns, fanout doesn't happen.

The error return is discarded — the goroutine is fire-and-forget, same as before. `ErrHandshakeInFlight` is the dedup success case.

Why this is safe

Test plan

  • `go test -race -count=1 -run 'TestHandshakeInFlight|TestHandshakeSendRequestNoServiceErr' ./pkg/daemon/` — existing dedup regression tests still pass
  • `go build ./...` clean
  • Post-merge: rerun the 7-distinct-peer + 10-same-peer handshake stress, verify no "direct handshake failed" storm

…keSendRequest

Follow-up to PR #208. That PR added per-peer in-flight dedup at
the daemon entry point (HandshakeSendRequest). Field-verified
2026-05-31: with the fix in place the *IPC-side* burst symptom is
gone (0 "ephemeral ports exhausted"), but a second pathway was
exposed — the proactive auto-handshake at line 3125 calls the
plugin's SendRequest *directly*, bypassing the dedup wrapper.

The recursive trip:
  DialConnection
    → checks trustedagents.IsTrusted → fires `go SendRequest`
      → sendMessage → DialAndSend → DialConnection
        → same check fires → another `go SendRequest`
          → ... (self-driving fanout)

In the local repro, one user-triggered handshake to a
trusted-agents peer (crossref-funders, node 41451) produced
~12,279 "direct handshake failed" log entries within ~800 ms
before the dial budget timed out the inner chain. The dedup
wrapper from PR #208 catches the *first* recursive call: it
short-circuits with ErrHandshakeInFlight, the goroutine returns
quietly, and the fanout doesn't happen.

One-line change: replace the direct plugin call with the
daemon-level wrapper. Error return is discarded (the goroutine
is fire-and-forget, same as before); ErrHandshakeInFlight is
the dedup success case.

Wire / IPC / persistence contracts unchanged.
@TeoSlayer TeoSlayer requested a review from Alexgodoroja as a code owner May 31, 2026 20:19
@TeoSlayer TeoSlayer merged commit 273bb0f into main May 31, 2026
6 of 8 checks passed
@hank-pilot
Copy link
Copy Markdown
Collaborator

🤖 Hank — CI status

Classification: real
Run: https://github.com/TeoSlayer/pilotprotocol/actions/runs/26723403068
At commit: 2aa0eeb

The build/test failure is a genuine code defect:

--- FAIL: TestConcurrentDialEncryptDecrypt (98.91s)
    zz_concurrent_dial_encrypt_decrypt_stress_test.go:146: dial group made zero successful dials
FAIL	github.com/TeoSlayer/pilotprotocol/tests	98.920s

@matthew-pilot — fix or comment.

Auto-classified at 2026-06-01T10:26:00Z. Re-runs on next push or check completion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants