False "Failed to send traces to Datadog Agent" errors #1155

Closed
dreki opened this issue Dec 2, 2019 · 18 comments

@dreki

dreki commented Dec 2, 2019

Which version of dd-trace-py are you using?

0.31.0

Which version of the libraries are you using?

contentful==1.12.3
ddtrace==0.31.0
ipython==7.1.1
python-box==3.4.0
python-json-logger==0.1.11
tornado==6.0.3
validators==0.12.4
tenacity==6.0.0

How can we reproduce your problem?

Here's our init code:

import os

import ddtrace

# this must be called before `tornado` packages are imported
if os.environ.get('DD_AGENT_HOST'): # noqa
    ddtrace.patch(tornado=True)  # noqa

import tornado.ioloop
import tornado.locks
import tornado.log
import tornado.options
import tornado.web

And our tornado application settings:

            datadog_trace={
                'default_service': self.name,
                'tags': {'debug': DEBUG},
                'analytics_enabled': True,
                'agent_hostname': os.environ.get('HOSTNAME', 'localhost'),
            },

What is the result that you get?

Every 7-15 minutes or so (unpredictable rate of incidence) we see a "Failed to send traces to Datadog Agent" entry in our Datadog logs, as an error.

What is the result that you expected?

If there is an error, more cause information. If it's a false negative (which we suspect), prevention of this false negative.

@dreki
Author

dreki commented Dec 2, 2019

Here's a sample log message:

Time | Host
-----------------------------
14:45:47 UTC | gke-hostname-hidden
Failed to send traces to Datadog Agent at http://ourservice-deployment-xx-zz:8129:
  [Errno 104] Connection reset by peer
-----------------------------

@majorgreys
Collaborator

majorgreys commented Dec 6, 2019

This log record is created when the request to send traces throws an exception, which we log along with its exception message.

@dreki Why do you think this is a false negative?

It could be a networking issue in your k8s environment or perhaps the agent restarting. Logs from either might shed light on what is happening when the tracer gets this exception upon a put request to the agent.
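One quick way to narrow down the networking theory is to probe the agent's address from inside the affected pod. The following is a minimal sketch using only the standard library; `agent_reachable` is a hypothetical helper name, not part of ddtrace, and the host and port must be whatever your tracer is actually configured with (8126 is the agent's default trace port, while the log above shows 8129).

```python
import socket


def agent_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it only checks that something is
    listening, not that it is a healthy Datadog Agent.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False
```

For example, `agent_reachable("localhost", 8126)` can be run from a shell in the pod. A successful connect does not rule out resets in the middle of a request, so this complements the agent and tracer logs rather than replacing them.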

@dreki
Author

dreki commented Dec 9, 2019

@majorgreys Discussion online made it sound like it was an acknowledged negative. If this is legit, that's great; would it be possible to log further exception information? We've found no other error messages anywhere to hint at what this might be.

@JordanP

JordanP commented Dec 11, 2019

We have also been seeing this for a long time (it is not specific to 0.31), so much so that we had to put

logging.getLogger("ddtrace.writer").setLevel(logging.CRITICAL)

somewhere in our code.

It would be great to log more info here to help you debug this.
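A narrower alternative to raising the level on the whole logger is a filter that drops only this one message, so other errors from the same logger still get through. This is a sketch under assumptions: `DropMessageFilter` is a hypothetical class, and the logger name `ddtrace.writer` is taken from the call above and may differ between ddtrace versions.

```python
import logging


class DropMessageFilter(logging.Filter):
    """Drop records whose formatted message contains a given substring."""

    def __init__(self, needle: str) -> None:
        super().__init__()
        self.needle = needle

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; everything else passes through.
        return self.needle not in record.getMessage()


# Logger name assumed from the setLevel call above; adjust for your version.
logging.getLogger("ddtrace.writer").addFilter(
    DropMessageFilter("Failed to send traces to Datadog Agent")
)
```

Note that a filter attached to a logger applies only to records logged on that exact logger, not to its child loggers.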

@majorgreys
Collaborator

@dreki and @JordanP You might want to consider rate limiting the library logger using the DD_LOGGING_RATE_LIMIT environment variable, as described in internal/logger.py. This is not currently covered in our library documentation, but we plan to add it in an upcoming update.

I would encourage you to reach out to Datadog support, as we will need more specific information from your deployment of the agent and tracer to identify why these networking errors are occurring.
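To illustrate what rate limiting a logger means in general, here is a minimal standard-library sketch. It is not ddtrace's implementation and does not read DD_LOGGING_RATE_LIMIT; `RateLimitFilter`, its `interval` parameter, and the injectable `clock` are hypothetical names chosen for the example.

```python
import logging
import time
from typing import Callable, Dict, Tuple


class RateLimitFilter(logging.Filter):
    """Allow at most one record per call site and level per `interval` seconds."""

    def __init__(
        self, interval: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        super().__init__()
        self.interval = interval
        self.clock = clock
        # Maps (file, line, level) to the time a record was last let through.
        self._last_emit: Dict[Tuple[str, int, int], float] = {}

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.pathname, record.lineno, record.levelno)
        now = self.clock()
        last = self._last_emit.get(key)
        if last is not None and now - last < self.interval:
            return False  # still inside the quiet window: drop the record
        self._last_emit[key] = now
        return True
```

Attached with `logger.addFilter(RateLimitFilter(60.0))`, this would let the first "Failed to send traces" record through and drop repeats from the same call site for the next minute, which reduces noise without hiding the error entirely.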

@JerryVerhoef

> It could be a networking issue in your k8s environment or perhaps the agent restarting. Logs from either might shed light on what is happening when the tracer gets this exception upon a put request to the agent.

We are also experiencing the same problem. However, in our k8s environment we are also using the PHP client, and that client is not reporting any errors at all. So either the PHP client is buggy in reporting the error, or this is not a networking issue in the k8s environment.

@dreki
Author

dreki commented Apr 13, 2020

Just a note to say that this is still affecting us daily.

@majorgreys
Collaborator

@dreki @JerryVerhoef @JordanP

The v0.37.0 release included a fix that potentially resolves a scenario where you might see such log events. Can you confirm whether this still is happening?

majorgreys self-assigned this May 28, 2020

@lu911

lu911 commented Jun 6, 2020

@majorgreys This is still happening.

dd-trace-py: 0.38.0
message: Failed to send traces to Datadog Agent at <ddtrace.api.API object at 0x7fa6e133f250>: ConnectionResetError(104, 'Connection reset by peer')

@dreki
Author

dreki commented Jun 11, 2020

Thank you for checking, @YukSeungChan. Thank you for the attention @majorgreys.

FYI @peter-bertuglia, looks like this is getting more attention.

@peter-bertuglia

With v0.37.1 we're still seeing the error with high frequency.

@kamyar

kamyar commented Jun 11, 2020

We are also seeing this error. Can you give a timeline for when this will be fixed?

@Kyle-Verhoog
Member

Kyle-Verhoog commented Jun 11, 2020

👋 Hi all, we suspect this is an issue with the agent: by default it has a limit on the number of connections it can handle. Please see: https://docs.datadoghq.com/tracing/troubleshooting/agent_rate_limits/

Let us know if you're seeing the error in your agents logs and if bumping the limit helps at all. 🙂

@peter-bertuglia

After upgrading to 0.42.0 and fixing an egregious mistake in our ddtrace_settings (wrong env var for DD_AGENT_HOST), we're now properly sending traces and we no longer receive the "Failed to send traces to Datadog Agent" error while services are running. However, the new startup logs added in 0.42.0 still complain that the agent is not reachable: "DATADOG TRACER DIAGNOSTIC - Agent not reachable. Exception raised: [Errno 111] Connection refused". This is logged on every startup without exception. Since we're no longer getting logs about failed tracing and all of our data is arriving in Datadog, it seems we can ignore this, but we'd still like to understand why it gets logged.
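The corrected settings are not shown in this thread. One plausible reading, given that the original settings took `agent_hostname` from `HOSTNAME` while the init code checks `DD_AGENT_HOST`, is that the agent host should come from `DD_AGENT_HOST`. The sketch below is that assumption only; `resolve_agent_hostname` is a hypothetical helper, not a ddtrace function.

```python
import os
from typing import Mapping


def resolve_agent_hostname(environ: Mapping[str, str] = os.environ) -> str:
    """Pick the agent host from DD_AGENT_HOST, falling back to localhost.

    Assumption for illustration: HOSTNAME inside a container is typically the
    container's (or pod's) own name, so it is deliberately not consulted here.
    """
    return environ.get("DD_AGENT_HOST") or "localhost"
```

With this helper, the settings entry would read `'agent_hostname': resolve_agent_hostname()`. The hostname in the earlier sample log (`ourservice-deployment-xx-zz`) looks like the service's own pod name, which is consistent with this reading but does not prove it.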

@paulkarayan

@Kyle-Verhoog that link isn't going where I think it should, though maybe I'm wrong. Should it be https://docs.datadoghq.com/tracing/troubleshooting/agent_rate_limits/ instead?

@Kyle-Verhoog
Member

Kyle-Verhoog commented Oct 14, 2020

@peter-bertuglia yes, we're aware of this. It's a diagnostic log that we print on startup to help with onboarding. However, the check is run against the client's default hostname/port configuration, so if your agent is located somewhere else it creates noise. We're working to address this in #1715 and #1717.

@paulkarayan hmm yeah looks like my link is dead now. Thanks for posting the updated one!

@dreki
Author

dreki commented Oct 15, 2020

@peter-bertuglia has been able to prevent this issue from occurring in our infrastructure. So this issue is 'resolved' from our end, in a manner of speaking. @Kyle-Verhoog, would y'all prefer to have this issue be closed, or leave it open since other folks may have a similar but different issue?

@Kyle-Verhoog
Member

@dreki since the initial issue is resolved let's close it for now. We can always re-open or open a new issue (since I think most folks have a similar but different issue). Thanks for following up! 🙂
