
Warn on partial UDP send failure, raise SendError when all fail #23

Merged
oskgu360 merged 7 commits into main from skip-unreachable-udp-addrs
May 6, 2026
Conversation

@oskgu360
Contributor

@oskgu360 oskgu360 commented May 5, 2026

Summary

  • A host that resolves to both A and AAAA records on a network with no route for one family used to log Sparoid error sending #<...> for every failed address, even when the other family succeeded. At shard-in-app scale (15k+ servers) this floods log aggregators with lines that monitoring classifies as ERROR-level.
  • New behavior:
    • At least one address succeeded: emit one Log.warn { "skip <host> (<ip>): <reason>" } per failed addr, classified as WARN by any properly configured backend. Apps embedding the shard can filter sparoid.client specifically. Returns normally.
    • Every address failed: raise Sparoid::Client::SendError with all per-address details. The CLI's existing rescue ex (src/client-cli.cr:62) prints Sparoid error: <msg> and exits 1.
  • Sparoid::Client::Log = ::Log.for(self) declares a dedicated log source.
  • CLI explicitly routes Log to STDERR via Log::IOBackend.new(STDERR) because in :connect mode STDOUT is the unix-domain socket used for SCM_RIGHTS FD passing — free-form text there would queue ahead of the sendmsg and confuse SSH's recvmsg.
  • Crystal fdpass already handles per-addr connect failures correctly (each attempt runs in its own fiber, first to connect wins, rest silent), so only udp_send needed the change.
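The partial-versus-total behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from src/client.cr: the module name SketchSparoid, the send_all method, the block standing in for the real UDP send, and the exact error wording are all assumptions made for the example.

```crystal
require "log"

# Hypothetical stand-in for the real client; only the warn-vs-raise
# decision is modelled here.
module SketchSparoid
  class Client
    # Dedicated log source. Log.for(self) derives the source name from the
    # type name, so here it is "sketch_sparoid.client".
    Log = ::Log.for(self)

    class SendError < Exception; end

    # Calls the block once per address. The block is a stand-in for the
    # real UDP send and is expected to raise on failure.
    def self.send_all(host : String, addrs : Array(String), &send : String ->) : Nil
      failures = [] of String
      addrs.each do |addr|
        begin
          send.call(addr)
        rescue ex
          failures << "#{host} (#{addr}): #{ex.message}"
        end
      end

      # Every address failed: surface one exception with all details.
      if !addrs.empty? && failures.size == addrs.size
        raise SendError.new("failed to send to any address: #{failures.join("; ")}")
      end

      # At least one address succeeded: one WARN line per failed address.
      failures.each { |failure| Log.warn { "skip #{failure}" } }
    end
  end
end
```

With one reachable and one unreachable address the call returns normally after logging a single warning; with only unreachable addresses it raises SendError.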

Context

  • Reported in #engineering (Slack thread) — sparoid running on v4-only networks was spitting EHOSTUNREACH errors into the log even though the v4 send worked. The shard is used at scale to send to thousands of brokers.
  • Mirrors sparoid.rb#20 in spirit (Ruby client made the analogous fix in fdpass).

Test plan

  • crystal spec spec/sparoid_spec.cr — covers format_send_errors (used to build the raised message). The unrelated failure of the "client can send another IP" spec also exists on main, for the same network-setup reason; on Linux CI it still passes.
  • crystal tool format --check — clean
  • bin/ameba — clean

oskgu360 added 3 commits May 5, 2026 12:59
A host with both A and AAAA records on a network with no route for one
family (e.g. v4-only network with AAAA record) will fail the send for
the unreachable family while the other family succeeds. The previous
"Sparoid error sending ..." wording made the partial failure look
fatal; reword to "Sparoid warn: skip <ip>: <reason>" to make clear it
is recoverable, mirroring sparoid.rb#20.

When a host resolves to both A and AAAA addresses but the network only
routes one family, the unreachable family's send raises EHOSTUNREACH.
Previously every per-addr failure was logged, which is noisy when the
other family's send succeeded and the call as a whole worked. Collect
errors and only log them if every address failed; include the original
hostname alongside the IP for easier triage.
@oskgu360 oskgu360 changed the title from "Log UDP send failures as warnings, not errors" to "Suppress per-address UDP send errors when one family succeeds" May 5, 2026
oskgu360 added 2 commits May 5, 2026 13:14
The previous integration tests relied on "send to 0.0.0.0 fails with
EHOSTUNREACH" to exercise the all-failed branch, which is true on
macOS but not on the Linux CI runners where the kernel accepts the
send. Extract the error-reporting decision into a pure helper that can
be unit-tested with synthetic inputs, independent of OS networking
behavior.

- Per-address failure with at least one success: STDERR warn line
  ("Sparoid warn: skip <host> (<ip>): <reason>") so v4-only networks
  don't make AAAA-resolved hosts log catastrophically.
- Every address failed: raise Sparoid::Client::SendError with all
  per-address details. The CLI already catches and exits 1, so this
  surfaces a clear actionable failure instead of returning quietly and
  failing later in fdpass.
@oskgu360 oskgu360 changed the title from "Suppress per-address UDP send errors when one family succeeds" to "Warn on partial UDP send failure, raise SendError when all fail" May 5, 2026
Apps embedding sparoid as a shard at scale (15k+ servers) need partial
UDP send failures classified as WARN-level by their monitoring, not as
ERROR-level (which most aggregators infer from STDERR by default).

- Sparoid::Client::Log = ::Log.for(self) declares a log source so apps
  can filter sparoid output specifically.
- udp_send emits per-address skips via Log.warn { ... }.
- client-cli configures Log.setup_from_env with an STDERR backend (in
  :connect mode STDOUT is the unix-domain FD-passing channel and isn't
  safe for free-form output).
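A minimal sketch of the CLI-side logging setup described in this commit, assuming only Crystal's standard Log module. The source name "sparoid.client" is written out by hand here to match what Log.for on Sparoid::Client would produce; the message text is illustrative.

```crystal
require "log"

# Route all log output to STDERR so that STDOUT stays free for the
# FD-passing channel. LOG_LEVEL in the environment still selects the
# severity (the default is info).
Log.setup_from_env(backend: Log::IOBackend.new(STDERR))

# Illustrative logger using the same source name as the shard's client.
log = Log.for("sparoid.client")
log.warn { "skip example.test (192.0.2.1): No route to host" }
```

Because the backend is passed explicitly, nothing in this setup writes to STDOUT.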
@oskgu360 oskgu360 marked this pull request as ready for review May 5, 2026 15:21
@oskgu360 oskgu360 requested a review from a team as a code owner May 5, 2026 15:21
@walro walro requested a review from Copilot May 5, 2026 15:41

Copilot AI left a comment


Pull request overview

This PR changes the client-side UDP send path so mixed-family delivery failures are downgraded from unconditional STDERR errors to structured warnings, while escalating the fully-failed case into a Sparoid::Client::SendError. In the broader codebase, this is aimed at making sparoid's client behavior less noisy in partial-success environments while preserving a hard failure when no address can be reached.

Changes:

  • Add per-address error collection in Sparoid::Client, warning on partial send failures and raising SendError when all sends fail.
  • Configure the CLI logger to write to STDERR so connect mode does not contaminate the FD-passing STDOUT channel.
  • Add a new spec for the error-formatting helper used to build the raised exception message.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File | Description
src/client.cr | Adds aggregated UDP send failure handling, warning logs, and SendError formatting.
src/client-cli.cr | Initializes logging to STDERR for CLI execution, especially connect mode.
spec/sparoid_spec.cr | Adds unit coverage for the new send-error message formatting helper.


4 comment threads on src/client.cr (outdated)
- SendError no longer prepends 'Sparoid:' to the message — the CLI's
  rescue clause already prefixes 'Sparoid error:', so the previous
  format produced 'Sparoid error: Sparoid: failed to send...'.
- Replace the format_send_errors formatting helper with a
  process_send_results helper that owns the partial-vs-total decision.
  This raises SendError directly when every send failed and returns
  the per-address partial-failure errors otherwise.
- Specs now cover both branches plus the all-succeeded and
  empty-input cases, exercising the actual control flow rather than
  just message formatting.
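A pure helper of the kind described here might look like the sketch below. The real signature of process_send_results is not shown in this thread, so the SendResult record, the argument shape, the error wording, and the choice to return an empty list for empty input are all assumptions.

```crystal
# Hypothetical shapes for illustration only.
class SendError < Exception; end

# One entry per attempted address; error is nil when the send succeeded.
record SendResult, host : String, addr : String, error : String? = nil

# Raises SendError when every send failed; otherwise returns the failed
# entries so the caller can log them as warnings.
def process_send_results(results : Array(SendResult)) : Array(SendResult)
  failures = results.select(&.error)
  if !results.empty? && failures.size == results.size
    details = failures.map { |f| "#{f.host} (#{f.addr}): #{f.error}" }.join("; ")
    raise SendError.new("failed to send to any address: #{details}")
  end
  failures
end
```

Since the helper takes plain values and performs no I/O, the partial, total, all-succeeded, and empty-input cases can each be exercised with synthetic inputs, independent of how the OS treats a given destination address.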
@oskgu360
Contributor Author

oskgu360 commented May 6, 2026

Ran metrics-shovel with this patch; the output is now logged at warn level at least.

at=warn source=sparoid.client skip dev-cold-ram.lmq.dev.cloudamqp.com ([2a05:d016:10c:5a00:4bfc:39ea:4c98:2d86]:8484): Error sending datagram to [2a05:d016:10c:5a00:4bfc:39ea:4c98:2d86]:8484: No route to host  

It can also be silenced with:

Log.builder.bind "sparoid.*", Log::Severity::Error, log_backend
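For context, a small self-contained sketch of what that binding does in an embedding app. Log::MemoryBackend stands in for the app's own log_backend purely so the effect can be inspected; the "app" source and the messages are illustrative.

```crystal
require "log"

# Stand-in for the app's log_backend; entries are kept in memory.
log_backend = Log::MemoryBackend.new

# App-wide default: everything at info and above.
Log.setup("*", Log::Severity::Info, log_backend)

# Raise the threshold for sparoid sources only.
Log.builder.bind "sparoid.*", Log::Severity::Error, log_backend

sparoid_log = Log.for("sparoid.client")
app_log = Log.for("app")

sparoid_log.warn { "skip example.test (192.0.2.1): No route to host" } # filtered out
sparoid_log.error { "still delivered" }
app_log.warn { "unaffected" }

log_backend.entries.each { |e| puts "#{e.severity} #{e.source} #{e.message}" }
```

Only the sparoid warning is dropped; sparoid errors and the app's own warnings still reach the backend.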

@oskgu360 oskgu360 merged commit 9aa4d2a into main May 6, 2026
23 checks passed
@oskgu360 oskgu360 deleted the skip-unreachable-udp-addrs branch May 6, 2026 07:07
@oskgu360 oskgu360 mentioned this pull request May 6, 2026

2 participants