False "Failed to send traces to Datadog Agent" errors #1155
Comments
Here's a sample log message:
|
This log record is created when the request to send traces throws an exception, which we log along with its exception message. @dreki Why do you think this is a false negative? It could be a networking issue in your k8s environment, or perhaps the agent restarting. Logs from either might shed light on what is happening when the tracer gets this exception on a PUT request to the agent. |
@majorgreys Discussion online made it sound like it was an acknowledged false negative. If this is legit, that's great; would it be possible to log further exception information? We've found no other error messages anywhere to hint at what this might be. |
We have also been seeing this for a long time (it is not specific to 0.31), so much so that we had to put logging.getLogger("ddtrace.writer").setLevel(logging.CRITICAL) somewhere in our code. It would be great to log more info here to help you debug this. |
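For readers who want to try the workaround from the comment above, here is a minimal sketch using only the standard `logging` module. The logger name "ddtrace.writer" is taken from that comment and may differ in other dd-trace-py versions. Note that this hides the error rather than fixing its cause.

```python
import logging

# Logger name as quoted in the comment above; treat it as an assumption
# for any ddtrace version other than the one discussed in this thread.
writer_logger = logging.getLogger("ddtrace.writer")

# Raise the threshold so ERROR records (such as "Failed to send traces
# to Datadog Agent") are dropped, while CRITICAL records still pass.
writer_logger.setLevel(logging.CRITICAL)

print(writer_logger.isEnabledFor(logging.ERROR))     # False
print(writer_logger.isEnabledFor(logging.CRITICAL))  # True
```

Because this silences every ERROR from that logger, genuine delivery failures will go unnoticed as well.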
@dreki and @JordanP You might want to consider rate limiting the library logger using the [...]. I would encourage you to reach out to Datadog support, as we will need more specific information from your deployment of the agent and tracer to identify why these networking errors are occurring. |
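The comment above does not show which mechanism was meant for rate limiting, so the following is only a generic sketch of the idea using the standard `logging` module, not a dd-trace-py feature. `RateLimitFilter`, its `interval` and `clock` parameters, and the 60-second window are all hypothetical choices made for illustration; the logger name "ddtrace.writer" comes from an earlier comment in this thread.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Hypothetical helper: let at most one record through per `interval` seconds."""

    def __init__(self, interval=60.0, clock=time.monotonic):
        super().__init__()
        self.interval = interval
        self.clock = clock          # injectable so the filter can be tested
        self._last_emit = None      # time of the last record that was allowed

    def filter(self, record):
        now = self.clock()
        if self._last_emit is None or now - self._last_emit >= self.interval:
            self._last_emit = now
            return True             # allow this record
        return False                # drop records inside the quiet window


# Attach the filter to the logger named earlier in the thread (an assumption
# for other ddtrace versions). Logger-level filters only see records logged
# directly on this logger, not on its child loggers.
logging.getLogger("ddtrace.writer").addFilter(RateLimitFilter(interval=60.0))
```

Compared with raising the log level, this keeps some evidence of the error in the logs while cutting the volume.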
We are also experiencing the same problem. In our k8s environment we are also using the PHP client, and that client is not reporting any errors at all. So either the PHP client is buggy in reporting the error, or this is not a networking issue of the k8s environment. |
Just a note to say that this is still affecting us daily. |
The v0.37.0 release included a fix that potentially resolves a scenario where you might see such log events. Can you confirm whether this still is happening? |
@majorgreys This still happens with dd-trace-py 0.38.0. |
Thank you for checking, @YukSeungChan. Thank you for the attention @majorgreys. FYI @peter-bertuglia, looks like this is getting more attention. |
With v0.37.1 we're still seeing the error with high frequency |
We are also seeing this error. Can you give a timeline for when this will be fixed? |
👋 Hi all, we suspect this is an issue with the agent: it has a limit on the number of connections it can handle by default. Please see: https://docs.datadoghq.com/tracing/troubleshooting/agent_rate_limits/ Let us know if you're seeing the error in your agent's logs and if bumping the limit helps at all. 🙂 |
After upgrading to |
@Kyle-Verhoog - that link isn't going where I think it should. Maybe I'm wrong. Is it: https://docs.datadoghq.com/tracing/troubleshooting/agent_rate_limits/ |
@peter-bertuglia Yes, we're aware of this. It's a diagnostic log that we print on startup to help with onboarding. However, it is produced for the default hostname/port configuration of the client, so if your agent is located somewhere else it will create noise. We're working to address this in #1715 and #1717. @paulkarayan Hmm, yeah, looks like my link is dead now. Thanks for posting the updated one! |
@peter-bertuglia has been able to prevent this issue from occurring in our infrastructure. So this issue is 'resolved' from our end, in a manner of speaking. @Kyle-Verhoog, would y'all prefer to have this issue be closed, or leave it open since other folks may have a similar but different issue? |
@dreki since the initial issue is resolved let's close it for now. We can always re-open or open a new issue (since I think most folks have a similar but different issue). Thanks for following up! 🙂 |
Which version of dd-trace-py are you using?
0.31.0
Which version of the libraries are you using?
How can we reproduce your problem?
Here's our init code:
And our tornado application settings:
What is the result that you get?
Every 7-15 minutes or so (unpredictable rate of incidence) we see a "Failed to send traces to Datadog Agent" entry in our Datadog logs, as an error.
What is the result that you expected?
If there is an error, more cause information. If it's a false negative (which we suspect), prevention of this false negative.