#75 - Fix Socket UDP --mode both; error spam impacts startup#85
#75 - Fix Socket UDP --mode both; error spam impacts startup#85RamyaGuru wants to merge 1 commit into
Conversation
In bidirectional UDP runs (`--mode both`), the server thread spun
trying to send before any inbound packet had arrived, so the server
had no learned peer and every send attempt logged
`UDP server has no learned peer yet; cannot transmit` at ERROR. The
log spam back-pressured the client and capped throughput at ~1000
packets / 30s.
- examples/socket_bench.cpp: gate the server worker's TX path on
having received at least one packet (only when receive is enabled),
so the client transmits eagerly and the server learns the peer
before it tries to send.
- src/managers/socket/daqiri_socket_mgr.{h,cpp}: throttle the
"no learned peer" log to fire once per missing-peer episode and
downgrade it from ERROR to DEBUG. A new `udp_peer_missing_logged`
latch on EndpointState is cleared when the RX loop learns/relearns
a peer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramya Gurunathan <rgurunathan@nvidia.com>
|
| Filename | Overview |
|---|---|
| examples/socket_bench.cpp | Adds waiting_for_peer guard that blocks the server worker's TX path until at least one RX packet has arrived; logic is correct and thread-safe (only the owning thread reads stats). |
| src/managers/socket/daqiri_socket_mgr.cpp | Converts the no-peer log from always-ERROR to a once-per-episode DEBUG latch; both latch set and reset occur under state_mutex_, so the access pattern is correctly synchronized. |
| src/managers/socket/daqiri_socket_mgr.h | Adds udp_peer_missing_logged boolean (default false) to EndpointState; placement and initialization are correct and consistent with existing fields. |
Sequence Diagram
sequenceDiagram
participant C as ClientWorker
participant SM as SocketMgr (UDP)
participant SW as ServerWorker
Note over SW: waiting_for_peer=true
C->>SM: send_tx_burst(msg)
SM->>SM: sendmmsg to server
SM-->>C: SUCCESS
SM->>SW: udp_rx_loop receives packet
SM->>SM: "udp_peer_valid = true"
Note over SW: received_packets > 0
SW->>SM: send_tx_burst(msg)
SM->>SM: send_udp_burst to learned peer
SM-->>SW: SUCCESS
Reviews (1): Last reviewed commit: "#75 - Fix Socket UDP --mode both deadloc..." | Re-trigger Greptile
| DAQIRI_LOG_DEBUG("UDP server has no learned peer yet; cannot transmit"); | ||
| ep.udp_peer_missing_logged = true; | ||
| } |
There was a problem hiding this comment.
Doc-sync required for
src/managers/socket/ changes
Per the doc-sync rule, any change under src/managers/*/ must be accompanied by updates to docs/getting-started.md, docs/configuration.md, docs/tutorials/configuration-walkthrough.md, README.md (Backends section), and AGENTS.md. The demoted log level (ERROR → DEBUG) and the new udp_peer_missing_logged latch are observable by operators who monitor spdlog output — a misconfigured one-way server will now emit a single DEBUG line rather than repeated ERROR lines. Please update the relevant docs in this PR.
Rule Used: DAQIRI has no automated doc-sync gate beyond mkdoc... (source)
|
Closing this — issue #75 turned out to be a red herring. There's no actual deadlock in Socket UDP --mode both; the benchmark just stops after 1000 packets (the default iterations cap), which made the brief peer-learning warmup look like a hang. What the branch actually fixed was cosmetic: the server side eagerly calls send_udp_burst before the first RX packet arrives, so udp_peer_valid is false for a handful of iterations and each failed send logs an ERROR. Once one client The two-line log cleanup (downgrade ERROR → DEBUG and rate-limit to once) is worth keeping but doesn't justify its own PR — I'll fold it into #72 if it's still relevant there. Closing this PR and issue #75. |
Fixes #75.
NOTE TO REVIEWER: the original Git issue implies that this caused a throughput cliff in benchmarks for socket UDP. In reality, this was a red herring and it was an issue with the benchmark test. So, this is more of a cosmetic change to fix some error-spamming that also leads some sluggishness on startup (see below).
The server thread, when invoked in
--mode both, tries to send before any inbound packet has arrived. With no learned peer, eachsend_udp_burstreturns false after loggingUDP server has no learned peer yet; cannot transmitatERROR. The spam stops once the first inbound packet teaches the server its peer (typically within ~30 attempts), but each ERROR write through spdlog→stderr is a synchronous syscall that holds
state_mutex_and contends with theclient's first packets.
Two changes:
examples/socket_bench.cpp— gate the server worker's TX path on having received at least one packet (only whenreceiveis enabled), so the client transmits eagerly and the server learns the peer before it tries to send.src/managers/socket/daqiri_socket_mgr.{h,cpp}— throttle the "no learned peer" log to fire once per missing-peer episode and demote it from ERROR to DEBUG. A newudp_peer_missing_loggedlatch onEndpointStateis clearedwhen the RX loop learns/relearns a peer.
In pathological cases where the server never receives an inbound packet (one-way flow, mis-bound RX queue), the original code would spam ERROR forever; the latch makes those scenarios survivable too.
Test plan
UDP loopback (
127.0.0.1),--seconds 30 --mode both, message_size=1024, iterations bumped to 100M so the wall-clock cap dominates.bench-results/runs on a DGX Spark in container build.Revert + reproduce — checked out
main's versions of the three modified files, rebuilt, ran the bench. Confirmed 33no learned peerERROR lines on the broken side.Re-apply + verify symptom gone — restored the fix, rebuilt, re-ran. ERROR count: 33 → 0.
N=3 throughput, total tx_pkts / 30s, both endpoints:
Mean delta: +1.39M pkts / 30s (+8.1%). Ranges do not overlap — slowest fixed trial still beats fastest broken trial.
Code audit of
udp_peer_validinmain'sdaqiri_socket_mgr.cpp: set true once in the RX loop after the firstrecvmmsg, never cleared — so the broken-side spam is bounded by time-to-first-inbound-packet, not run length.Matches the observed ~33 ERROR count being identical in iter=1000 and iter=100M runs.