watchdog anomalies after upgrade to 0.35.1 #1030

Closed
gingerlime opened this issue May 7, 2020 · 12 comments · Fixed by #1033
Comments

@gingerlime
Contributor

[Screenshot 2020-05-07 at 09:29:20]

This correlates with a deploy which included the ddtrace upgrade from 0.34.2 to 0.35.1.

Was there a change to sampling rates or something?

@delner
Contributor

delner commented May 7, 2020

Definitely suspicious, but no, we did not adjust sampling.

0.35.0 did shuffle around how trace components are configured, though, and we introduced payload chunking (which uses Net::HTTP), so maybe there's something related there. What's the name of this metric that spiked?

@gingerlime
Contributor Author

Thanks @delner

What's the name of this metric that spiked?

Hit/error count on the net/http service (the latency also changed, but I suppose that's a result of the increased hit count).

@marcotc
Member

marcotc commented May 7, 2020

Thank you @gingerlime!
From what I understand, you are now seeing far more net/http traces than before, following the upgrade to 0.35.1.

I compared all of the changes between 0.34.2 and 0.35.1 and nothing there indicates that ddtrace should create more net/http traces. The main areas affected were our Sinatra instrumentation and the splitting of large trace payloads into chunks.

Let us know if you are using Sinatra.

In this scenario, though, trace splitting seems more plausible, but the only way I can imagine it causing an increase in traces is if, for some reason, the code is sending the same traces multiple times. I'm not sure this is feasible, but are you able to notice if the increase is being caused by duplicate spans, or if they are mostly unique?

If we want to continue investigating the chunking logic, I suggest you enable "diagnostic health metrics", which will tell us exactly how many times chunking happened.
If you add c.diagnostics.health_metrics.enabled = true to your Datadog.configure {} block, you should start seeing new metrics in your dashboard, one of them being datadog.tracer.transport.chunked, which will tell us how many times chunking was necessary in a given time frame.

If we are seeing greater than zero numbers for datadog.tracer.transport.chunked for the affected service, then we have a candidate for the culprit.
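
For reference, a minimal sketch of that configuration (assuming the rest of your existing Datadog.configure block stays as it is):

Datadog.configure do |c|
  # Emit diagnostic health metrics (sent as Statsd metrics over UDP to the agent)
  c.diagnostics.health_metrics.enabled = true
end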

@gingerlime
Contributor Author

Thanks @marcotc !

Not using Sinatra; it's a Rails app.

I've had a bit of a history of false alarms (OK, one so far), so I'm getting a bit nervous :) but I checked the other commits in the same deploy and couldn't spot anything that could cause an increase in HTTP requests of this magnitude.

are you able to notice if the increase is being caused by duplicate spans, or if they are mostly unique?

I'm not sure I follow you completely. Any pointers on what to look at more specifically?

What I did try, however, was looking at traces. Given that the hit count increased and the latency decreased, I went looking for fast traces, under 1 ms, and I can see plenty of these:

May 07 06:50:37.214    716 μs    POST /v0.4/traces

Do these ring a bell with you guys? If I look before the deploy/spike, I can't see any of these URLs in the traces.

If we want to continue investigating the chunking logic, I suggest you enable "diagnostic health metrics", which will tell us exactly how many times chunking happened

Is it safe to turn on diagnostics in production? I'm a bit hesitant to add extra load to the system. Besides the increased hit count, everything seems to run fine, and I'm not sure we could easily spot a difference in hits on our staging environment, which is generally much quieter anyway. I'd be happy to try it out if it's safe, though.

@delner
Contributor

delner commented May 8, 2020

If you're able to reproduce this in a test environment or a canary, I'd suggest starting there when using health metrics. Health metrics are not the same thing as "debug" mode: they emit Statsd metrics over UDP to the agent (assuming your agent is running Statsd) but should not produce any additional log messages. In that sense they should be production safe, but it's always a good idea to try this out in the least sensitive setting you can manage.

POST /v0.4/traces is definitely ours; it's the HTTP call the trace library makes to the agent to submit traces. It should only happen every second or so, but it could happen in bursts if your trace volume is large enough: this is the new "chunk" behavior we introduced in 0.35.0, which is meant to prevent payloads from growing too large and being rejected by the trace agent.

It's very weird that you're seeing lots of these, and that they're under 1 ms (unless the payload is small and the agent is co-located on the same host/container). We'll have to look deeper into this... I have some possible ideas.

@delner
Contributor

delner commented May 8, 2020

Okay, one of my tests picked up a problem with the HTTP instrumentation that I think is causing this. While I work on confirming the cause, I would recommend either disabling c.use :http or rolling back to 0.34.2.
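
For example, a minimal sketch of the rollback option, pinning the previous version in the Gemfile (other entries unchanged):

# Gemfile: pin ddtrace to the last version before the regression
gem 'ddtrace', '0.34.2'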

Will keep you posted with any updates.

@delner
Contributor

delner commented May 8, 2020

Okay, I think I may have found the cause: the HTTP circuit breaker wasn't short-circuiting the HTTP instrumentation for the transport's own requests, so those requests were generating traces.
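
To illustrate the failure mode, here is a hypothetical, simplified sketch (not ddtrace's actual code); the AGENT_TRANSPORT_PATH constant and trace_http_request helper are illustrative names only:

# The tracer's own POST /v0.4/traces calls to the agent should be short-circuited,
# otherwise every flush of the trace buffer shows up as a net/http hit.
AGENT_TRANSPORT_PATH = '/v0.4/traces'.freeze

def trace_http_request(method, path)
  # Short-circuit: skip instrumentation for the transport's own request.
  return yield if path == AGENT_TRANSPORT_PATH

  started = Time.now
  result = yield
  puts "net/http span: #{method} #{path} (#{((Time.now - started) * 1000).round(2)} ms)"
  result
end

trace_http_request('GET', '/users') { :app_request }          # traced as usual
trace_http_request('POST', AGENT_TRANSPORT_PATH) { :flush }   # not traced; this was the missing case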

@gingerlime Can you give #1033 a try? You can also try this pre-release gem if that's more suitable for you:

# Gemfile: pull the pre-release fix build from Datadog's prerelease gem source
source 'http://gems.datadoghq.com/prerelease' do
  gem 'ddtrace', '0.35.1.fix.http.circuit.breaker.miss.63821'
end

@gingerlime
Contributor Author

Thanks again @delner !

I deployed the branch on our staging environment, and it definitely looks like the number of hits dropped after the deploy:

[Screenshot 2020-05-08 at 08:58:20]

@delner
Contributor

delner commented May 8, 2020

Okay, great, glad to see this is effective, @gingerlime. We're going to try to deploy this as a bugfix today; I'll keep you posted.

@delner
Contributor

delner commented May 8, 2020

Alright, we merged the PR to fix this. We'll deploy it shortly as 0.35.2. Thanks for the report, @gingerlime. Please always feel free to report anything suspicious, and don't worry too much about any false alarms :) I'm glad we were able to find and fix this.

@marcotc marcotc added the bug Involves a bug label May 8, 2020
@marcotc marcotc added this to the 0.35.2 milestone May 8, 2020
@marcotc
Member

marcotc commented May 11, 2020

@gingerlime thank you again for this issue report!
We've just released 0.35.2, which includes a fix for this issue.
Please give it a try and let us know if the problem persists.
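
If you were testing the pre-release build above, a minimal sketch of the corresponding Gemfile change back to the regular release:

# Gemfile: replace the prerelease source block with the released fix
gem 'ddtrace', '0.35.2'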

@gingerlime
Contributor Author

Thank you both for the quick turnaround time and for keeping me posted. I really appreciate it!

@delner delner added community Was opened by a community member integrations Involves tracing integrations labels May 12, 2020
@delner delner self-assigned this May 12, 2020
@delner delner added this to Resolved/Closed in Active work Apr 21, 2021