Description
We're running a heavily loaded Node.js app on Fargate using the dd-trace Node library, with the Datadog Agent running in a sidecar, and sending metrics and traces to it over UDS. "Heavily loaded" means tens of thousands of websocket connections, with event loop delays sporadically spiking over 1,000 ms during traffic bursts, but otherwise healthy in the normal case.
We sometimes see a top-level uncaught EPIPE exception that crashes the app. Unfortunately, with Node it has been impossible to capture the full stack trace. But when this happens, we see this log in the Datadog Agent a few hundred milliseconds before the app crashes:
agent log:
2025-02-19T23:25:30.916-08:00 - 2025-02-20 07:25:30 UTC | TRACE | ERROR | (pkg/trace/api/api.go:618 in handleTraces) | Cannot decode v0.4 traces payload: read unix /var/run/datadog/apm.socket->@: i/o timeout
app log:
2025-02-19T23:25:31.325-08:00 Uncaught Exception: [Error] [uncaughtException] write EPIPE: undefined
If we set DD_TRACE_PARTIAL_FLUSH_MIN_SPANS to 100 (from the default of 1000), the frequency of these errors drops to *almost* zero, but not completely: it still happens once in a while across a cluster handling millions of connections.
I think this is happening for the following reasons:
The agent: `datadog-agent/pkg/trace/api/api.go`, line 585 (commit 69867d9): `case r.recvsem <- struct{}{}:`
And a race window in NodeJS itself:
nodejs/node#55952
In handleTraces(), the agent does a select, waiting on a semaphore send or a time.After, but it does not also select on req.Context().Done(), even though the request context is a WithTimeout context initialized with a different timeout by timeoutMiddleware(), which is set up in buildMux().
So my guess is that when the context deadline fires first, the agent closes the connection ungracefully rather than sending the graceful response from the case <-time.After branch.
And this ungraceful UDS close hits the small race window within Node.js itself (linked above), producing the EPIPE.