Cannot decode traces payload, read limit reached #818

Closed
SpyMachine opened this issue Sep 17, 2019 · 19 comments · Fixed by #840
@SpyMachine commented Sep 17, 2019

I am receiving the following error in my trace agent logs.

2019-09-17 19:34:45 UTC | TRACE | ERROR | (pkg/trace/api/api.go:372 in handleTraces) | Cannot decode v0.4 traces payload: read limit reached

and my app is showing:

E, [2019-09-17T15:34:46.160421 #82656] ERROR -- ddtrace: [ddtrace] (/Users/me/.rvm/gems/ruby-2.3.6/gems/ddtrace-0.27.0/lib/ddtrace/transport/http/client.rb:46:in `rescue in send_request') Internal error during HTTP transport request. Cause: Broken pipe Location: /Users/me/.rvm/rubies/ruby-2.3.6/lib/ruby/2.3.0/net/protocol.rb:229:in `write'

I know my trace logs are rather large but didn't realize this would be a problem. Is there a way to get around this?

@delner (Contributor) commented Sep 18, 2019

Yeah, looks like an issue on the trace agent end; we'll want to figure out what factors on the Ruby end might be contributing to this. Can you share a little more context about what kinds of traces your Ruby application is sending?

  • Web service or job processor? Or something else?
  • Typically, how large are your traces (i.e., # of spans)? Usually the longer the trace, the bigger the payload.
  • Is this behavior very consistent or intermittent? If the latter, are there any patterns you observe?

If I had to make a very rough first guess, it might have something to do with payload size when transmitting traces to the agent. Although it's premature to assert this is the cause or prescribe solutions, it might be possible to play with a few levers to see if they have any effect.

On the Ruby side we could play with the buffer size to make flushes smaller, or, if your network setup permits it, use UDS to transmit traces more frequently. I won't say either of these is likely to be a silver bullet, though, as the restriction appears to be largely driven by the agent. It might be a good idea to see if there are any configurations that can be used there, although I'm not familiar enough to recommend any.
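
(A rough sketch of the UDS idea, for illustration only: the transport_options proc, :unix adapter, and socket path below are assumptions based on the ddtrace 0.x docs, so double-check them against the version you're running.)

Datadog.configure do |c|
  # Assumed socket path; point this at wherever your agent exposes its APM socket.
  c.tracer transport_options: proc { |t|
    t.adapter :unix, '/var/run/datadog/apm.socket'
  }
end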

Any debug information you can provide about what payloads your tracer is submitting to the agent will help us develop a more accurate assessment, and a quicker resolution.

@SpyMachine (Author) commented:

Yeah, so it's a very large batch job. Jobs can take many, many hours. We're looking to do tracing to see where we'd get the best bang for our buck in terms of improving performance. While our traces themselves may be very large, we don't expect to have nearly as many as a web service would.

Span count is up in the millions.

It appears to be consistent.

@delner (Contributor) commented Sep 18, 2019

Gotcha. Although the agent can handle such large numbers of spans, it doesn't handle them well when they're all in one trace; on the Ruby side, we typically only flush spans to the agent after the trace completes. There is a buffer limit which prevents memory usage from overflowing, but any spans that overflow this limit are dropped. It's very possible that it's filling the buffer entirely, and the payload then sent to the agent is too large for it to handle in a single HTTP request.

We do have a feature called partial flushing that might alleviate this; however, it has seldom been used and is still largely experimental. If you're interested, you can try it by configuring the tracer:

Datadog.configure do |c|
  c.tracer partial_flush_timeout: 3, min_spans_before_partial_flush: 1000, max_spans_before_partial_flush: 10_000
end

If you play with those settings, you might find a balance that works. In any case, please do let us know if this helps at all.

@SpyMachine (Author) commented:

Hmm, ok, so good news and bad news... The good news is I can see all the data being passed over the wire on a long job, and I can see most of the data in the analytics view (though a decent chunk appears to be missing). However, the trace seems totally messed up when I use partial flush: it shows the entire trace as one random span over a few ms, as opposed to the same job being run without the partial flush options, which has 4k spans for a short job. I'm assuming something's getting mangled somewhere =/

@delner (Contributor) commented Sep 18, 2019

Okay, gotcha, bummer on the partial flush. Is there a screenshot you can share for that? It would be helpful as we try to fix this feature.

I'll consult the team to see if there's a better way to handle long-running jobs, or if there's a way to stream spans up for unfinished traces.

@allaud commented Nov 19, 2019

Same problem here: I was trying to find my worker traces in Datadog and found nothing.

This is what I found in my Sidekiq logs:

E, [2019-11-19T14:02:58.874305 #1] ERROR -- ddtrace: [ddtrace] (/usr/local/bundle/gems/ddtrace-0.28.0/lib/ddtrace/transport/http/client.rb:46:in `rescue in send_request') Internal error during HTTP transport request. Cause: Broken pipe Location: /usr/local/lib/ruby/2.6.0/socket.rb:456:in `__write_nonblock'

@usamaSaddiq commented Feb 26, 2020

I think I have the same problem. I am also using Sidekiq.

ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/2.5.0/gems/ddtrace-0.26.0/lib/ddtrace/transport/http/client.rb:46:in `rescue in send_request') Internal error during HTTP transport request. Cause: Broken pipe Location: /usr/local/lib/ruby/2.5.0/net/protocol.rb:233:in `write'

@marcotc marcotc linked a pull request Apr 27, 2020 that will close this issue
@marcotc marcotc removed a link to a pull request Apr 27, 2020
@marcotc marcotc linked a pull request Apr 29, 2020 that will close this issue
@marcotc marcotc added this to the 0.35.0 milestone Apr 30, 2020
@marcotc (Member) commented May 4, 2020

Thank you for all the reports @SpyMachine, @allaud, @usamaSaddiq!

v0.35.0 has been released, which adds dynamic payload chunking for large flush operations performed by ddtrace's writer (#840).

Our preliminary investigation tells us that most, if not all, Broken pipe errors were indeed caused by a payload that was too large to be flushed. This new release breaks down flushes that were previously too large into multiple smaller flush requests.
This is especially important for high-throughput applications.

If this error persists in your application after this new release, it is likely because a single trace is being generated that is too large to be flushed; in that case, I recommend trying to enable Partial Flushing (c.tracer.partial_flush.enabled = true) in your application.
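
A minimal sketch of that setting in context (only the partial_flush.enabled flag named above; the surrounding block is just your existing Datadog.configure call):

Datadog.configure do |c|
  # Flush completed spans of long-running traces in chunks
  # instead of holding the whole trace for one oversized payload.
  c.tracer.partial_flush.enabled = true
end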

In any case, let us know of any further errors that still occur from now on.

@marcotc (Member) commented Oct 7, 2020

Just a friendly ping to see if anyone in this GH issue is still seeing the "read limit reached" payload issue with newer versions of the tracer.

@tegon commented Oct 13, 2020

@marcotc I'm seeing the issue with version 0.40.0. I'll try to enable partial flush to see if that fixes it.

@tegon commented Oct 13, 2020

@marcotc Not sure if this helps, but I just noticed that this error started to happen more frequently once we increased the size of our EC2 instances in AWS. Before that change, it only happened a couple of times (we didn't even notice), but after the change it is happening pretty constantly.

@SpyMachine (Author) commented:

Hi, sorry for the late reply; I just got around to looking at this again. It appears to be working much better for me now!

@gc-dev1 commented Oct 26, 2020

@SpyMachine what tracer version are you using now?

I'm using 1.19.1 and still have this issue

Cannot decode v0.4 traces payload: read limit reached

@marcotc (Member) commented Oct 29, 2020

@SpyMachine I'm glad you saw improvements!

@tegon @gc-dev1 I think that at this point, we should follow up with each of you individually, with the full context of your environments.
Because this issue happens right at the edge of the tracer library and the agent, and is also highly influenced by your infrastructure topology, it's more productive for us to provide full account support on this issue. This will allow us to work directly with the data you report to our backend and have the full context.
With that in mind, could you please open a new support ticket with us? https://docs.datadoghq.com/help/
Please include a link to this GH issue for context for our support team.

@gc-dev1 I noticed you are seeing a similar issue with .NET. If you already have a support ticket open for that issue, I'd suggest investigating that one first, as the underlying cause might be the same (maybe it's the agent). That might save you an extra support thread with us.

@delner delner added bug Involves a bug community Was opened by a community member core Involves Datadog core libraries labels Apr 21, 2021
@delner delner added this to Needs triage in Planning via automation Apr 21, 2021
@usamaSaddiq commented May 11, 2021

Hey @marcotc

Sorry for the late reply. I've encountered this issue again with tracer.partial_flush.enabled = true:

03:43:56 sidekiq.1  | E, [2021-05-11T03:43:49.839689 #13] ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/2.6.0/gems/ddtrace-0.37.0/lib/ddtrace/transport/http/client.rb:35:in `rescue in send_request') Internal error during HTTP transport request. Cause: end of file reached Location: /usr/local/lib/ruby/2.6.0/net/protocol.rb:225:in `rbuf_fill'

I'll open a support ticket with DD as suggested in the above comment.

@ivoanjo (Member) commented Jul 13, 2022

Hey 👋. I'm going to go ahead and close this issue for now. As Marco said, we're very interested in knowing if you are still seeing issues; please get in contact and do let us know!

@ivoanjo ivoanjo closed this as completed Jul 13, 2022
Planning automation moved this from Needs triage to Resolved (without changes) Jul 13, 2022
@benjaminwood commented:

This may not be the same issue, but we saw ddtrace errors like:

E, [2022-08-02T23:48:52.204538 #1] ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/2.4.0/gems/ddtrace-0.54.2/lib/ddtrace/transport/http/client.rb:37:in `rescue in send_request') Internal error during HTTP transport request. Cause: Broken pipe Location: /usr/local/lib/ruby/2.4.0/net/protocol.rb:229:in `write'

And in trace-agent a corresponding:

2022-08-03 22:32:41 UTC | TRACE | ERROR | (run.go:268 in Errorf) | Cannot decode v0.4 traces payload: read tcp 10.0.50.75:8126->10.0.50.108:60054: i/o timeout

If you're using the net_http adapter and followed the docs when configuring it, you may have specified a 1s timeout.

This was the source of our problem, and after adjusting it to 5s, everything is working just fine.
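
(For reference, a rough sketch of the timeout bump; the net_http adapter arguments follow the 0.x docs example as I remember it, so the host, port, and 5s value are assumptions to verify against your version's documentation.)

Datadog.configure do |c|
  # Assumed agent host/port; raise the docs' 1s example timeout to something less aggressive.
  c.tracer transport_options: proc { |t|
    t.adapter :net_http, '127.0.0.1', 8126, timeout: 5
  }
end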

@ivoanjo is a timeout of 1s in the docs misleading? Seems a bit short to me.

Screenshot of the docs for historical purposes if it changes:

[screenshot of the docs' net_http timeout example omitted]

ivoanjo added a commit that referenced this issue Aug 4, 2022
As raised in #818, the default `timeout` example is rather far away
from the actual default being used, which may lead customers that
copy-paste the example to have a bad experience.

I also discovered we had an incorrect constant with the default
timeout which was not even being used anywhere (you can grep for it
 to confirm!).

Thanks @benjaminwood for the awesome suggestion!
@ivoanjo (Member) commented Aug 4, 2022

Good point! Opened #2205 to improve on this; having an example that we would nowadays not recommend to customers is definitely not a good idea.

@benjaminwood commented:

Awesome, thank you @ivoanjo! I'm sure this will prevent a few headaches for customers down the road.
