Skip to content

UDS timeout appears to be triggering a NodeJS crash from unhandled EPIPE exception #34461

@bataras

Description

@bataras

We're running a heavily loaded NodeJS app in Fargate using the dd-trace node lib, running datadog agent in a sidecar, sending metrics and traces to it via UDS. "Heavily loaded" means 10s of K websocket connections with event loop delays sporadically spiking over 1,000ms with bursts of traffic, but otherwise good in the normal case.

We sometimes see a top level uncaught exception EPIPE error that crashes the app. Unfortunately, with Node it's been impossible to see the full stack. But when this happens, we see this log in the datadog agent a few hundred ms before the app:

agent log:
2025-02-19T23:25:30.916-08:00 - 2025-02-20 07:25:30 UTC | TRACE | ERROR | (pkg/trace/api/api.go:618 in handleTraces) | Cannot decode v0.4 traces payload: read unix /var/run/datadog/apm.socket->@: i/o timeout

app log:
2025-02-19T23:25:31.325-08:00 Uncaught Exception: [Error] [uncaughtException] write EPIPE: undefined

If we set DD_TRACE_PARTIAL_FLUSH_MIN_SPANS to 100 (from the default of 1000), the frequency of these errors drops to -almost- 0. But not completely. It still happens once in a while across a cluster handling millions of connections.

I think this is happening for the following reasons:

The agent:

case r.recvsem <- struct{}{}:

And a race window in NodeJS itself:
nodejs/node#55952

In handleTraces(), the agent is doing a select, waiting for a semaphore or a "time.After", but it's not also selecting on req.Context().Done() which is a context WithTimeout that is initialized with a different timeout by timeoutMiddleware(), initialized in buildMux()

So my guess is the agent is ungracefully closing the connection rather than sending a graceful response as from case <-time.After...

And this ungraceful UDS close is hitting the small race window within NodeJS itself (linked above)

Metadata

Metadata

Assignees

Labels

oss/2External contributions priority 2team/agent-apmtrace-agent

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions