Possibly unhandled RetryError #3954

Closed
daaain opened this issue Jul 14, 2022 · 5 comments · Fixed by #4692

Comments

@daaain

daaain commented Jul 14, 2022

Which version of dd-trace-py are you using?

1.1.4

Which version of pip are you using?

22.1.2 with Python 3.7.12

Which version of the libraries are you using?

These are all the libraries used in the Flask web service affected.

Package | Version
--- | ---
alembic | 1.5.4
alembic-autogenerate-enums | 0.0.2
alembic-verify | 0.1.4
alphashape | 1.3.1
async-timeout | 4.0.2
attrs | 20.3.0
basis-libraries | 1.0.0
blinker | 1.4
cachelib | 0.7.0
cachetools | 4.2.1
certifi | 2020.12.5
cffi | 1.14.5
chardet | 4.0.0
click | 7.1.2
click-log | 0.3.2
cryptography | 3.4.4
ddsketch | 2.0.3
ddtrace | 1.1.4
deprecated | 1.2.13
entrypoints | 0.3
flake8 | 3.7.9
flask | 1.1.2
flask-caching | 1.11.1
flask-marshmallow | 0.14.0
flask-migrate | 2.6.0
flask-script | 2.0.6
flask-sqlalchemy | 2.5.1
funcsigs | 1.0.2
future | 0.18.2
geoalchemy2 | 0.8.4
geojson | 2.5.0
google-api-core | 1.31.5
google-api-python-client | 1.8.4
google-auth | 1.35.0
google-auth-httplib2 | 0.0.4
google-auth-oauthlib | 0.5.1
google-cloud-bigquery | 1.20.0
google-cloud-core | 1.7.2
google-cloud-iot | 1.0.0
google-cloud-pubsub | 2.9.0
google-cloud-storage | 2.2.1
google-crc32c | 1.3.0
google-resumable-media | 2.3.3
googleapis-common-protos | 1.52.0
grpc-google-iam-v1 | 0.12.3
grpcio | 1.44.0
grpcio-status | 1.44.0
gunicorn | 20.0.4
httplib2 | 0.19.0
idna | 2.10
importlib-metadata | 3.4.0
itsdangerous | 1.1.0
jinja2 | 2.11.3
latlon23 | 1.0.7
libcst | 0.4.1
mako | 1.1.4
markupsafe | 1.1.1
marshmallow | 3.10.0
marshmallow-enum | 1.5.1
marshmallow-sqlalchemy | 0.28.0
mccabe | 0.6.1
mixpanel | 4.9.0
more-itertools | 8.7.0
mypy-extensions | 0.4.3
networkx | 2.6.3
numpy | 1.21.0
oauthlib | 3.1.0
packaging | 20.9
pip | 22.1.2
proto-plus | 1.19.6
protobuf | 3.15.0
psycopg2-binary | 2.9.3
pyasn1 | 0.4.8
pyasn1-modules | 0.2.8
pycodestyle | 2.5.0
pycparser | 2.20
pydash | 4.9.3
pyflakes | 2.1.1
pyjwt | 2.4.0
pyparsing | 2.4.7
pyproj | 3.0.0.post1
python-dateutil | 2.8.1
python-editor | 1.0.4
python-json-logger | 2.0.1
pytz | 2022.1
pyyaml | 6.0
redis | 4.3.3
requests | 2.25.1
requests-oauthlib | 1.3.0
rsa | 4.7.2
rtree | 0.9.7
scipy | 1.6.0
sentry-sdk | 1.5.12
setuptools | 57.5.0
shapely | 1.8.2
simplejson | 3.17.2
six | 1.15.0
sqlalchemy | 1.3.23
sqlalchemy-citext | 1.7.0
sqlalchemy-diff | 0.1.3
sqlalchemy-utils | 0.38.2
str2bool | 1.1
tenacity | 8.0.1
transitions | 0.7.2
trimesh | 3.9.42
typing-extensions | 3.7.4.3
typing-inspect | 0.7.1
uritemplate | 3.0.1
urllib3 | 1.26.9
werkzeug | 1.0.1
wheel | 0.37.1
wrapt | 1.13.3
zipp | 3.4.0

How can we reproduce your problem?

Not quite sure, it happens a few times every day in our GKE environment. Possibly related to pods being stopped and recreated.

What is the result that you get?

We're getting exceptions raised that trigger Sentry, from what appear to be unhandled retry errors.

Looking at the stack trace (full trace pasted below), it seems that flush_queue in ddtrace/internal/writer.py at line 560 should handle RetryErrors, but somehow this one is slipping through. (A minimal tenacity sketch of this failure mode follows the tracebacks below.)

RetryError[<Future at 0x7fb7681f7390 state=finished raised ConnectionRefusedError>]

ConnectionRefusedError: [Errno 111] Connection refused
  File "__init__.py", line 407, in __call__
    result = fn(*args, **kwargs)
  File "ddtrace/internal/writer.py", line 446, in _send_payload
    response = self._put(payload, headers)
  File "ddtrace/internal/writer.py", line 398, in _put
    self._conn.request("PUT", self._endpoint, data, headers)
  File "http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "http/client.py", line 1036, in _send_output
    self.send(msg)
  File "http/client.py", line 976, in send
    self.connect()
  File "http/client.py", line 948, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "socket.py", line 728, in create_connection
    raise err
  File "socket.py", line 716, in create_connection
    sock.connect(sa)

RetryError: RetryError[<Future at 0x7fb7681f7390 state=finished raised ConnectionRefusedError>]
  File "ddtrace/internal/writer.py", line 559, in flush_queue
    self._retry_upload(self._send_payload, encoded, n_traces)
  File "__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "__init__.py", line 361, in iter
    raise retry_exc from fut.exception()
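
For reference, a minimal sketch of how tenacity produces this kind of RetryError once its attempts are exhausted (the function and retry settings below are illustrative stand-ins for the writer's internals, not actual ddtrace code):

```python
from tenacity import Retrying, RetryError, stop_after_attempt, wait_fixed

def _send_payload():
    # Stand-in for the writer's upload call: the agent is unreachable.
    raise ConnectionRefusedError(111, "Connection refused")

retry = Retrying(stop=stop_after_attempt(3), wait=wait_fixed(0.1))
try:
    retry(_send_payload)
except RetryError as exc:
    # After the last attempt fails, tenacity wraps the final failure in a
    # RetryError; if nothing catches it, it escapes the thread it runs in
    # and gets reported as an unhandled exception.
    print("gave up after retries:", exc.last_attempt.exception())
```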

What is the result that you expected?

Retry errors should be handled inside the library in a watertight way and not leak into the application context.

This didn't seem to happen before the version update; we were previously on 0.60.6.

@Kyle-Verhoog
Member

hey @daaain, thanks for the report!

So the retry exception will bubble up in our writer thread if the number of retries is exceeded. Since it happens in a separate thread that ddtrace starts, the exception just gets logged; I believe Sentry simply picks up all uncaught exceptions regardless of thread. The exception isn't very clear about this, though, so we should make it more obvious that it's the result of exhausted retries. We could also catch it and log it ourselves instead of letting it bubble all the way up.
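
Roughly what I have in mind, as a sketch rather than the actual writer code (the wrapper name and arguments here are hypothetical):

```python
import logging

from tenacity import RetryError

log = logging.getLogger(__name__)

def flush_queue_safely(retry_upload, send_payload, encoded, n_traces):
    # Hypothetical wrapper around the existing retry call in
    # ddtrace/internal/writer.py: catch the exhausted-retries error and
    # log it instead of letting it escape the writer thread.
    try:
        retry_upload(send_payload, encoded, n_traces)
    except RetryError as e:
        log.error(
            "failed to send %d traces after all retries were exhausted: %s",
            n_traces,
            e.last_attempt.exception(),
        )
```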

Just to confirm: you're not seeing this affect any of your application code, right?

@daaain
Author

daaain commented Jul 15, 2022

Hey @Kyle-Verhoog, no app impact, just false alarms in Sentry which would be good to stop (got almost 600 since the deployment with the version update 25 days ago).

As a principle, I'd definitely prefer not to have any unhandled exceptions in the agent, unless it's a catastrophic configuration or startup issue that would prevent it from running and so definitely needs to surface.

Having said that, I actually realised this is trying to connect to a DD collector in the same cluster:

failed to send traces to Datadog Agent at http://10.154.0.111:8126/v0.4/traces

So this might have just surfaced an issue with some component in the DD Helm chart!
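
In the meantime, we're considering filtering these events out on the Sentry side with a before_send hook, along the lines of the sketch below (the DSN is a placeholder, and matching on RetryError is just our assumption about how to identify these events):

```python
import sentry_sdk
from tenacity import RetryError

def drop_retry_errors(event, hint):
    # Interim workaround, not a fix: drop uncaught RetryErrors (in our case
    # these come from the ddtrace writer thread) so they stop raising false
    # alarms in Sentry.
    exc_info = hint.get("exc_info")
    if exc_info is not None and isinstance(exc_info[1], RetryError):
        return None
    return event

sentry_sdk.init(
    dsn="https://publicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    before_send=drop_retry_errors,
)
```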

github-actions bot added the stale label Aug 15, 2022
@therc

therc commented Nov 23, 2022

This is a billing issue. Over the past week, this resulted in hundreds of millions of log entries in our cluster. And it's very difficult to write an exclusion filter to drop all the stack traces without also causing collateral damage (real stack traces from user code).

@therc

therc commented Nov 23, 2022

Something odd I observed: I wrote the contender for the ugliest log exclusion filter in history, tweaking it until I made all the tracebacks disappear. Then I turned it off and... nothing happened. I was expecting the errors to show up again. Either the problematic workloads disappeared all at once (we were failing to send traces roughly 1M times per hour) or, perhaps, dumping the stack traces makes things even worse, causing the client, in this case a Kubernetes pod, to get into a spiral. I haven't seen the code, but I thought I'd mention this.

Going through logs, I found one instance where we had 63 failures logged within a window of just 30ms (no inline timestamps, so the batching was done by the Docker runtime and/or the local DD agent, perhaps). This looks pretty bad...

And yes, we use a collector, too, but we don't use Helm... we've been managing DD in Terraform since before the charts existed. We're on ddtrace 1.3.0. We'll try a newer version, as well as increasing the connect() timeout to 5s, but it's going to take a while to get all our users to pick up the changes.

github-actions bot removed the stale label Nov 24, 2022
@Kyle-Verhoog linked a pull request Dec 1, 2022 that will close this issue
@therc

therc commented Dec 6, 2022

Thank you so much!
