watchdog anomalies after upgrade to 0.35.1 #1030

Closed
gingerlime opened this issue May 7, 2020 · 12 comments · Fixed by #1033
Comments

@gingerlime
Contributor

[Screenshot 2020-05-07 at 09:29:20]

This correlates with a deploy which included the ddtrace upgrade from 0.34.2 to 0.35.1.

Was there a change to sampling rates or something?

@delner
Contributor

delner commented May 7, 2020

Definitely suspicious, but no, we did not adjust sampling.

0.35.0 did shuffle around how trace components are configured, though, and we introduced payload chunking (which uses Net::HTTP), so maybe there's something related there. What's the name of this metric that spiked?

@gingerlime
Contributor Author

Thanks @delner

What's the name of this metric that spiked?

Hit/error count on the net/http service (the latency also changed, but I suppose that's a result of the increased hit count).

@marcotc
Member

marcotc commented May 7, 2020

Thank you @gingerlime!
From what I understand, you are now seeing far more net/http traces than before, following the upgrade to 0.35.1.

I compared all of the changes between 0.34.2 and 0.35.1 and nothing there indicates that ddtrace should create more net/http traces. The main areas affected were our Sinatra instrumentation and the splitting of large trace payloads into chunks.

Let us know if you are using Sinatra.

In this scenario, though, trace splitting seems more plausible, but the only way I can imagine it causing an increase in traces is if, for some reason, the code is sending the same traces multiple times. I'm not sure this is feasible, but are you able to notice if the increase is being caused by duplicate spans, or if they are mostly unique?

If we want to continue investigating the chunking logic, I suggest you enable "diagnostic health metrics", which will tell us exactly how many times chunking happened.
If you add c.diagnostics.health_metrics.enabled = true to your Datadog.configure {} block, you should start seeing new metrics in your dashboard, one of them being datadog.tracer.transport.chunked, which will tell us how many times chunking was necessary in a given time frame.

If we are seeing greater than zero numbers for datadog.tracer.transport.chunked for the affected service, then we have a candidate for the culprit.
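
For reference, a minimal sketch of that configuration (assuming the rest of your existing Datadog.configure block stays as it is):

Datadog.configure do |c|
  # Emit diagnostic health metrics (sent as Statsd metrics over UDP to the agent)
  c.diagnostics.health_metrics.enabled = true
end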

@gingerlime
Contributor Author

Thanks @marcotc !

Not using Sinatra; it's a Rails app.

I've had a bit of a history of false alarms (OK, one so far), so I'm getting a bit nervous :) but I checked the other commits in the same deploy and couldn't spot anything that could cause an increase in HTTP requests of this magnitude.

are you able to notice if the increase is being caused by duplicate spans, or if they are mostly unique?

I'm not sure I follow you completely. Any pointers on what to look at more specifically?

What I did try, however, was looking at traces. Given that the hit count increased and the latency decreased, I went looking for fast traces, under 1 ms, and I can see plenty of these:

May 07 06:50:37.214    716 μs    POST /v0.4/traces

Do these ring a bell with you guys? If I look before the deploy/spike, I can't see any of these URLs in the traces.

If we want to continue investigating the chunking logic, I suggest you enable "diagnostic health metrics", which will tell us exactly how many times chunking happened

Is it safe to turn on diagnostics in production? I'm a bit hesitant to add extra load to the system. Besides the increased hit count, everything seems to run fine, and I'm not sure we could easily spot a difference in hits on our staging environment, which is generally much quieter anyway. I'd be happy to try it out if it's safe, though.

@delner
Contributor

delner commented May 8, 2020

If you're able to reproduce this in a test environment or a canary, I'd suggest starting there when using health metrics. Health metrics are not the same thing as "debug" mode: they emit Statsd metrics over UDP to the agent (assuming your agent is running Statsd) but should not produce any additional log messages. In that sense they should be production safe, but it's always a good idea to try this out in the least sensitive setting you can manage.

POST /v0.4/traces is definitely ours; it's the HTTP call the trace library makes to the agent to submit traces. It should only happen every second or so, but it could happen in bursts if your trace volume is large enough: this is the new "chunk" behavior we introduced in 0.35.0, which is meant to prevent payloads from growing too large and being rejected by the trace agent.

It's very weird that you're seeing lots of these, and that they're under 1 ms (unless the payload is small and the agent is co-located on the same host/container). We'll have to look deeper into this... I have some possible ideas.

@delner
Contributor

delner commented May 8, 2020

Okay, one of my tests picked up a problem with the HTTP instrumentation that I think is causing this. While I work on confirming the cause, I would recommend either disabling c.use :http or rolling back to 0.34.2.
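
For example, a minimal sketch of the rollback option, pinning the previous version in the Gemfile (other entries unchanged):

# Gemfile: pin ddtrace to the last version before the regression
gem 'ddtrace', '0.34.2'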

Will keep you posted with any updates.

@delner
Contributor

delner commented May 8, 2020

Okay, I think I may have found the cause: the HTTP circuit breaker wasn't short-circuiting the HTTP instrumentation for the transport's own requests, so those requests were generating traces.
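
To illustrate the failure mode, here is a hypothetical, simplified sketch (not ddtrace's actual code); the AGENT_TRANSPORT_PATH constant and trace_http_request helper are illustrative names only:

# The tracer's own POST /v0.4/traces calls to the agent should be short-circuited,
# otherwise every flush of the trace buffer shows up as a net/http hit.
AGENT_TRANSPORT_PATH = '/v0.4/traces'.freeze

def trace_http_request(method, path)
  # Short-circuit: skip instrumentation for the transport's own request.
  return yield if path == AGENT_TRANSPORT_PATH

  started = Time.now
  result = yield
  puts "net/http span: #{method} #{path} (#{((Time.now - started) * 1000).round(2)} ms)"
  result
end

trace_http_request('GET', '/users') { :app_request }          # traced as usual
trace_http_request('POST', AGENT_TRANSPORT_PATH) { :flush }   # not traced; this was the missing case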

@gingerlime Can you give #1033 a try? You can also try this pre-release gem if that's more suitable for you:

# Gemfile: pull the pre-release fix build from Datadog's prerelease gem source
source 'http://gems.datadoghq.com/prerelease' do
  gem 'ddtrace', '0.35.1.fix.http.circuit.breaker.miss.63821'
end

@gingerlime
Contributor Author

Thanks again @delner !

I deployed the branch on our staging environment, and it definitely looks like the number of hits dropped after the deploy:

[Screenshot 2020-05-08 at 08:58:20]

@delner
Contributor

delner commented May 8, 2020

Okay, great, glad to see this is effective, @gingerlime. We're going to try to deploy this as a bugfix today; I'll keep you posted.

@delner
Contributor

delner commented May 8, 2020

Alright, we merged the PR to fix this. We'll deploy it shortly as 0.35.2. Thanks for the report, @gingerlime. Please always feel free to report anything suspicious, and don't worry too much about any false alarms :) I'm glad we were able to find and fix this.

@marcotc marcotc added the bug Involves a bug label May 8, 2020
@marcotc marcotc added this to the 0.35.2 milestone May 8, 2020
@marcotc
Member

marcotc commented May 11, 2020

@gingerlime thank you again for this issue report!
We've just released 0.35.2, which includes a fix for this issue.
Please give it a try and let us know if the problem persists.
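
If you were testing the pre-release build above, a minimal sketch of the corresponding Gemfile change back to the regular release:

# Gemfile: replace the prerelease source block with the released fix
gem 'ddtrace', '0.35.2'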

@gingerlime
Contributor Author

Thank you both for the quick turnaround time and for keeping me posted. I really appreciate it!

@delner delner added community Was opened by a community member integrations Involves tracing integrations labels May 12, 2020
@delner delner self-assigned this May 12, 2020
@delner delner added this to Resolved/Closed in Active work Apr 21, 2021