Possibly unhandled RetryError #3954

Closed
daaain opened this issue Jul 14, 2022 · 5 comments · Fixed by #4692

Comments

@daaain

daaain commented Jul 14, 2022

Which version of dd-trace-py are you using?

1.1.4

Which version of pip are you using?

22.1.2 with Python 3.7.12

Which version of the libraries are you using?

These are all the libraries used in the Flask web service affected.

Package | Version
--- | ---
alembic | 1.5.4
alembic-autogenerate-enums | 0.0.2
alembic-verify | 0.1.4
alphashape | 1.3.1
async-timeout | 4.0.2
attrs | 20.3.0
basis-libraries | 1.0.0
blinker | 1.4
cachelib | 0.7.0
cachetools | 4.2.1
certifi | 2020.12.5
cffi | 1.14.5
chardet | 4.0.0
click | 7.1.2
click-log | 0.3.2
cryptography | 3.4.4
ddsketch | 2.0.3
ddtrace | 1.1.4
deprecated | 1.2.13
entrypoints | 0.3
flake8 | 3.7.9
flask | 1.1.2
flask-caching | 1.11.1
flask-marshmallow | 0.14.0
flask-migrate | 2.6.0
flask-script | 2.0.6
flask-sqlalchemy | 2.5.1
funcsigs | 1.0.2
future | 0.18.2
geoalchemy2 | 0.8.4
geojson | 2.5.0
google-api-core | 1.31.5
google-api-python-client | 1.8.4
google-auth | 1.35.0
google-auth-httplib2 | 0.0.4
google-auth-oauthlib | 0.5.1
google-cloud-bigquery | 1.20.0
google-cloud-core | 1.7.2
google-cloud-iot | 1.0.0
google-cloud-pubsub | 2.9.0
google-cloud-storage | 2.2.1
google-crc32c | 1.3.0
google-resumable-media | 2.3.3
googleapis-common-protos | 1.52.0
grpc-google-iam-v1 | 0.12.3
grpcio | 1.44.0
grpcio-status | 1.44.0
gunicorn | 20.0.4
httplib2 | 0.19.0
idna | 2.10
importlib-metadata | 3.4.0
itsdangerous | 1.1.0
jinja2 | 2.11.3
latlon23 | 1.0.7
libcst | 0.4.1
mako | 1.1.4
markupsafe | 1.1.1
marshmallow | 3.10.0
marshmallow-enum | 1.5.1
marshmallow-sqlalchemy | 0.28.0
mccabe | 0.6.1
mixpanel | 4.9.0
more-itertools | 8.7.0
mypy-extensions | 0.4.3
networkx | 2.6.3
numpy | 1.21.0
oauthlib | 3.1.0
packaging | 20.9
pip | 22.1.2
proto-plus | 1.19.6
protobuf | 3.15.0
psycopg2-binary | 2.9.3
pyasn1 | 0.4.8
pyasn1-modules | 0.2.8
pycodestyle | 2.5.0
pycparser | 2.20
pydash | 4.9.3
pyflakes | 2.1.1
pyjwt | 2.4.0
pyparsing | 2.4.7
pyproj | 3.0.0.post1
python-dateutil | 2.8.1
python-editor | 1.0.4
python-json-logger | 2.0.1
pytz | 2022.1
pyyaml | 6.0
redis | 4.3.3
requests | 2.25.1
requests-oauthlib | 1.3.0
rsa | 4.7.2
rtree | 0.9.7
scipy | 1.6.0
sentry-sdk | 1.5.12
setuptools | 57.5.0
shapely | 1.8.2
simplejson | 3.17.2
six | 1.15.0
sqlalchemy | 1.3.23
sqlalchemy-citext | 1.7.0
sqlalchemy-diff | 0.1.3
sqlalchemy-utils | 0.38.2
str2bool | 1.1
tenacity | 8.0.1
transitions | 0.7.2
trimesh | 3.9.42
typing-extensions | 3.7.4.3
typing-inspect | 0.7.1
uritemplate | 3.0.1
urllib3 | 1.26.9
werkzeug | 1.0.1
wheel | 0.37.1
wrapt | 1.13.3
zipp | 3.4.0

How can we reproduce your problem?

Not quite sure, it happens a few times every day in our GKE environment. Possibly related to pods being stopped and recreated.

What is the result that you get?

We're getting exceptions raised that trigger Sentry, from what appear to be unhandled retry errors.

Looking at the stack trace (full trace pasted below), it seems that flush_queue in ddtrace/internal/writer.py at line 560 should handle RetryErrors, but somehow this one is slipping through. (A minimal tenacity sketch of this failure mode follows the tracebacks below.)

RetryError[<Future at 0x7fb7681f7390 state=finished raised ConnectionRefusedError>]

ConnectionRefusedError: [Errno 111] Connection refused
  File "__init__.py", line 407, in __call__
    result = fn(*args, **kwargs)
  File "ddtrace/internal/writer.py", line 446, in _send_payload
    response = self._put(payload, headers)
  File "ddtrace/internal/writer.py", line 398, in _put
    self._conn.request("PUT", self._endpoint, data, headers)
  File "http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "http/client.py", line 1036, in _send_output
    self.send(msg)
  File "http/client.py", line 976, in send
    self.connect()
  File "http/client.py", line 948, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "socket.py", line 728, in create_connection
    raise err
  File "socket.py", line 716, in create_connection
    sock.connect(sa)

RetryError: RetryError[<Future at 0x7fb7681f7390 state=finished raised ConnectionRefusedError>]
  File "ddtrace/internal/writer.py", line 559, in flush_queue
    self._retry_upload(self._send_payload, encoded, n_traces)
  File "__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "__init__.py", line 361, in iter
    raise retry_exc from fut.exception()
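
For reference, a minimal sketch of how tenacity produces this kind of RetryError once its attempts are exhausted (the function and retry settings below are illustrative stand-ins for the writer's internals, not actual ddtrace code):

```python
from tenacity import Retrying, RetryError, stop_after_attempt, wait_fixed

def _send_payload():
    # Stand-in for the writer's upload call: the agent is unreachable.
    raise ConnectionRefusedError(111, "Connection refused")

retry = Retrying(stop=stop_after_attempt(3), wait=wait_fixed(0.1))
try:
    retry(_send_payload)
except RetryError as exc:
    # After the last attempt fails, tenacity wraps the final failure in a
    # RetryError; if nothing catches it, it escapes the thread it runs in
    # and gets reported as an unhandled exception.
    print("gave up after retries:", exc.last_attempt.exception())
```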

What is the result that you expected?

Retry errors should be handled inside the library in a watertight way and not leak into the application context.

This didn't seem to happen before the version update; we were previously on 0.60.6.

@Kyle-Verhoog
Member

hey @daaain, thanks for the report!

So the retry exception will bubble up in our writer thread if the number of retries is exceeded. Since it happens in a separate thread that ddtrace starts, the exception just gets logged; I believe Sentry simply picks up all uncaught exceptions regardless of thread. The exception isn't very clear about this, though, so we should make it more obvious that it's the result of exhausted retries. We could also catch it and log it ourselves instead of letting it bubble all the way up.
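
Roughly what I have in mind, as a sketch rather than the actual writer code (the wrapper name and arguments here are hypothetical):

```python
import logging

from tenacity import RetryError

log = logging.getLogger(__name__)

def flush_queue_safely(retry_upload, send_payload, encoded, n_traces):
    # Hypothetical wrapper around the existing retry call in
    # ddtrace/internal/writer.py: catch the exhausted-retries error and
    # log it instead of letting it escape the writer thread.
    try:
        retry_upload(send_payload, encoded, n_traces)
    except RetryError as e:
        log.error(
            "failed to send %d traces after all retries were exhausted: %s",
            n_traces,
            e.last_attempt.exception(),
        )
```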

Just to confirm: you're not seeing this affect any of your application code, right?

@daaain
Author

daaain commented Jul 15, 2022

Hey @Kyle-Verhoog, no app impact, just false alarms in Sentry which would be good to stop (got almost 600 since the deployment with the version update 25 days ago).

As a principle, I'd definitely prefer not to have any unhandled exceptions in the agent, unless it's a catastrophic configuration or startup issue that would prevent it from running and so definitely needs to surface.

Having said that, I actually realised this is trying to connect to a DD collector in the same cluster:

failed to send traces to Datadog Agent at http://10.154.0.111:8126/v0.4/traces

So this might have just surfaced an issue with some component in the DD Helm chart!
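
In the meantime, we're considering filtering these events out on the Sentry side with a before_send hook, along the lines of the sketch below (the DSN is a placeholder, and matching on RetryError is just our assumption about how to identify these events):

```python
import sentry_sdk
from tenacity import RetryError

def drop_retry_errors(event, hint):
    # Interim workaround, not a fix: drop uncaught RetryErrors (in our case
    # these come from the ddtrace writer thread) so they stop raising false
    # alarms in Sentry.
    exc_info = hint.get("exc_info")
    if exc_info is not None and isinstance(exc_info[1], RetryError):
        return None
    return event

sentry_sdk.init(
    dsn="https://publicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    before_send=drop_retry_errors,
)
```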

github-actions bot added the stale label Aug 15, 2022
@therc

therc commented Nov 23, 2022

This is a billing issue. Over the past week, this resulted in hundreds of millions of log entries in our cluster. And it's very difficult to write an exclusion filter to drop all the stack traces without also causing collateral damage (real stack traces from user code).

@therc

therc commented Nov 23, 2022

Something odd I observed: I wrote the contender for the ugliest log exclusion filter in history, tweaking it until I made all the tracebacks disappear. Then I turned it off and... nothing happened. I was expecting the errors to show up again. Either the problematic workloads disappeared all at once (we were failing to send traces roughly 1M times per hour) or, perhaps, dumping the stack traces makes things even worse, causing the client, in this case a Kubernetes pod, to get into a spiral. I haven't seen the code, but I thought I'd mention this.

Going through logs, I found one instance where we had 63 failures logged within a window of just 30ms (no inline timestamps, so the batching was done by the Docker runtime and/or the local DD agent, perhaps). This looks pretty bad...

And yes, we use a collector, too, but we don't use Helm... we've been managing DD in Terraform since before the charts existed. We're on ddtrace 1.3.0. We'll try a newer version, as well as increasing the connect() timeout to 5s, but it's going to take a while to get all our users to pick up the changes.

github-actions bot removed the stale label Nov 24, 2022
@Kyle-Verhoog linked a pull request Dec 1, 2022 that will close this issue
@therc

therc commented Dec 6, 2022

Thank you so much!
