Skip to content

#75 - Fix Socket UDP --mode both; error spam impacts startup#85

Closed
RamyaGuru wants to merge 1 commit into
mainfrom
fix/issue-75-socket-udp-deadlock
Closed

#75 - Fix Socket UDP --mode both; error spam impacts startup#85
RamyaGuru wants to merge 1 commit into
mainfrom
fix/issue-75-socket-udp-deadlock

Conversation

@RamyaGuru
Copy link
Copy Markdown
Collaborator

@RamyaGuru RamyaGuru commented May 15, 2026

Fixes #75.

NOTE TO REVIEWER: the original Git issue implies that this caused a throughput cliff in benchmarks for socket UDP. In reality, this was a red herring and it was an issue with the benchmark test. So, this is more of a cosmetic change to fix some error-spamming that also leads some sluggishness on startup (see below).

The server thread, when invoked in --mode both, tries to send before any inbound packet has arrived. With no learned peer, each send_udp_burst returns false after logging UDP server has no learned peer yet; cannot transmit at
ERROR. The spam stops once the first inbound packet teaches the server its peer (typically within ~30 attempts), but each ERROR write through spdlog→stderr is a synchronous syscall that holds state_mutex_ and contends with the
client's first packets.

Two changes:

  1. examples/socket_bench.cpp — gate the server worker's TX path on having received at least one packet (only when receive is enabled), so the client transmits eagerly and the server learns the peer before it tries to send.
  2. src/managers/socket/daqiri_socket_mgr.{h,cpp} — throttle the "no learned peer" log to fire once per missing-peer episode and demote it from ERROR to DEBUG. A new udp_peer_missing_logged latch on EndpointState is cleared
    when the RX loop learns/relearns a peer.

In pathological cases where the server never receives an inbound packet (one-way flow, mis-bound RX queue), the original code would spam ERROR forever; the latch makes those scenarios survivable too.

Test plan

UDP loopback (127.0.0.1), --seconds 30 --mode both, message_size=1024, iterations bumped to 100M so the wall-clock cap dominates. bench-results/ runs on a DGX Spark in container build.

  • Revert + reproduce — checked out main's versions of the three modified files, rebuilt, ran the bench. Confirmed 33 no learned peer ERROR lines on the broken side.

  • Re-apply + verify symptom gone — restored the fix, rebuilt, re-ran. ERROR count: 33 → 0.

  • N=3 throughput, total tx_pkts / 30s, both endpoints:

    Trial 1 Trial 2 Trial 3 Min Max Mean
    Broken 17.67M 16.57M 17.44M 16.57M 17.67M 17.23M
    Fixed 18.89M 18.47M 18.50M 18.47M 18.89M 18.62M

    Mean delta: +1.39M pkts / 30s (+8.1%). Ranges do not overlap — slowest fixed trial still beats fastest broken trial.

  • Code audit of udp_peer_valid in main's daqiri_socket_mgr.cpp: set true once in the RX loop after the first recvmmsg, never cleared — so the broken-side spam is bounded by time-to-first-inbound-packet, not run length.
    Matches the observed ~33 ERROR count being identical in iter=1000 and iter=100M runs.

In bidirectional UDP runs (`--mode both`), the server thread spun
trying to send before any inbound packet had arrived, so the server
had no learned peer and every send attempt logged
`UDP server has no learned peer yet; cannot transmit` at ERROR. The
log spam back-pressured the client and capped throughput at ~1000
packets / 30s.

- examples/socket_bench.cpp: gate the server worker's TX path on
  having received at least one packet (only when receive is enabled),
  so the client transmits eagerly and the server learns the peer
  before it tries to send.
- src/managers/socket/daqiri_socket_mgr.{h,cpp}: throttle the
  "no learned peer" log to fire once per missing-peer episode and
  downgrade it from ERROR to DEBUG. A new `udp_peer_missing_logged`
  latch on EndpointState is cleared when the RX loop learns/relearns
  a peer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramya Gurunathan <rgurunathan@nvidia.com>
@RamyaGuru RamyaGuru marked this pull request as ready for review May 15, 2026 16:36
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 15, 2026

Greptile Summary

Fixes the --mode both UDP startup issue where the server thread repeatedly tried to send before learning any peer address, causing ERROR log spam that held state_mutex_ and throttled the client's first packets.

  • examples/socket_bench.cpp: adds a waiting_for_peer guard that blocks the server worker's TX path until stats.received_packets > 0 (only when receive is enabled), preventing the allocation/free cycle and eliminating the contention window entirely.
  • src/managers/socket/daqiri_socket_mgr.{h,cpp}: introduces a udp_peer_missing_logged latch on EndpointState (read/written under state_mutex_) that demotes the "no learned peer" diagnostic from ERROR to DEBUG and suppresses it after the first occurrence per episode; the latch resets when the RX loop learns or re-learns a peer.

Confidence Score: 4/5

Safe to merge after addressing the doc update — the logic changes are correct and well-bounded.

Both the waiting_for_peer guard in the bench and the udp_peer_missing_logged latch in the manager are correctly scoped and synchronized. send_tx_burst always frees the burst on both success and failure paths, so BurstParams ownership is unaffected. The only open item is that src/managers/socket/ changes require corresponding doc updates per the project’s doc-sync policy, which are missing from this PR.

The doc-sync gap in src/managers/socket/daqiri_socket_mgr.cpp and .h — the corresponding backend docs have not been updated.

Important Files Changed

Filename Overview
examples/socket_bench.cpp Adds waiting_for_peer guard that blocks the server worker's TX path until at least one RX packet has arrived; logic is correct and thread-safe (only the owning thread reads stats).
src/managers/socket/daqiri_socket_mgr.cpp Converts the no-peer log from always-ERROR to a once-per-episode DEBUG latch; both latch set and reset occur under state_mutex_, so the access pattern is correctly synchronized.
src/managers/socket/daqiri_socket_mgr.h Adds udp_peer_missing_logged boolean (default false) to EndpointState; placement and initialization are correct and consistent with existing fields.

Sequence Diagram

sequenceDiagram
    participant C as ClientWorker
    participant SM as SocketMgr (UDP)
    participant SW as ServerWorker

    Note over SW: waiting_for_peer=true
    C->>SM: send_tx_burst(msg)
    SM->>SM: sendmmsg to server
    SM-->>C: SUCCESS

    SM->>SW: udp_rx_loop receives packet
    SM->>SM: "udp_peer_valid = true"

    Note over SW: received_packets > 0
    SW->>SM: send_tx_burst(msg)
    SM->>SM: send_udp_burst to learned peer
    SM-->>SW: SUCCESS
Loading

Reviews (1): Last reviewed commit: "#75 - Fix Socket UDP --mode both deadloc..." | Re-trigger Greptile

Comment on lines +743 to +745
DAQIRI_LOG_DEBUG("UDP server has no learned peer yet; cannot transmit");
ep.udp_peer_missing_logged = true;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Doc-sync required for src/managers/socket/ changes

Per the doc-sync rule, any change under src/managers/*/ must be accompanied by updates to docs/getting-started.md, docs/configuration.md, docs/tutorials/configuration-walkthrough.md, README.md (Backends section), and AGENTS.md. The demoted log level (ERROR → DEBUG) and the new udp_peer_missing_logged latch are observable by operators who monitor spdlog output — a misconfigured one-way server will now emit a single DEBUG line rather than repeated ERROR lines. Please update the relevant docs in this PR.

Rule Used: DAQIRI has no automated doc-sync gate beyond mkdoc... (source)

@RamyaGuru
Copy link
Copy Markdown
Collaborator Author

Closing this — issue #75 turned out to be a red herring. There's no actual deadlock in Socket UDP --mode both; the benchmark just stops after 1000 packets (the default iterations cap), which made the brief peer-learning warmup look like a hang.

What the branch actually fixed was cosmetic: the server side eagerly calls send_udp_burst before the first RX packet arrives, so udp_peer_valid is false for a handful of iterations and each failed send logs an ERROR. Once one client
packet lands, the peer is learned and sending proceeds normally.

The two-line log cleanup (downgrade ERROR → DEBUG and rate-limit to once) is worth keeping but doesn't justify its own PR — I'll fold it into #72 if it's still relevant there. Closing this PR and issue #75.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Socket UDP --mode both deadlocks on peer learning; only ~1000 packets / 30 s trickle through

1 participant