Cannot decode traces payload, read limit reached #818

Closed
SpyMachine opened this issue Sep 17, 2019 · 19 comments · Fixed by #840
@SpyMachine commented Sep 17, 2019

I am receiving the following error in my trace agent logs.

2019-09-17 19:34:45 UTC | TRACE | ERROR | (pkg/trace/api/api.go:372 in handleTraces) | Cannot decode v0.4 traces payload: read limit reached

and my app is showing:

E, [2019-09-17T15:34:46.160421 #82656] ERROR -- ddtrace: [ddtrace] (/Users/me/.rvm/gems/ruby-2.3.6/gems/ddtrace-0.27.0/lib/ddtrace/transport/http/client.rb:46:in `rescue in send_request') Internal error during HTTP transport request. Cause: Broken pipe Location: /Users/me/.rvm/rubies/ruby-2.3.6/lib/ruby/2.3.0/net/protocol.rb:229:in `write'

I know my trace logs are rather large but didn't realize this would be a problem. Is there a way to get around this?

@delner (Contributor) commented Sep 18, 2019

Yeah, looks like an issue on the trace agent end; we'll want to figure out what factors on the Ruby end might be contributing to this. Can you share a little more context about what kinds of traces your Ruby application is sending?

  • Web service or job processor? Or something else?
  • Typically, how large are your traces (i.e., # of spans)? Usually the longer the trace, the bigger the payload.
  • Is this behavior very consistent or intermittent? If the latter, are there any patterns you observe?

If I had to make a very rough first guess, it might have something to do with payload size when transmitting traces to the agent. Although it's premature to assert this is the cause or prescribe solutions, it might be possible to play with a few levers to see if they have any effect.

On the Ruby side we could play with the buffer size to make flushes smaller, or, if your network setup permits it, use UDS to transmit traces more frequently. I won't say either of these is likely to be a silver bullet, though, as the restriction appears to be largely driven by the agent. It might be a good idea to see if there are any configurations that can be used there, although I'm not familiar enough to recommend any.
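
(A rough sketch of the UDS idea, for illustration only: the transport_options proc, :unix adapter, and socket path below are assumptions based on the ddtrace 0.x docs, so double-check them against the version you're running.)

Datadog.configure do |c|
  # Assumed socket path; point this at wherever your agent exposes its APM socket.
  c.tracer transport_options: proc { |t|
    t.adapter :unix, '/var/run/datadog/apm.socket'
  }
end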

Any debug information you can provide about what payloads your tracer is submitting to the agent will help us develop a more accurate assessment, and a quicker resolution.

@SpyMachine (Author) commented:

Yeah, so it's a very large batch job. Jobs can take many, many hours. We're looking to do tracing to see where we'd get the best bang for our buck in terms of improving performance. While our traces themselves may be very large, we don't expect to have nearly as many as a web service would.

Span count is up in the millions.

It appears to be consistent.

@delner (Contributor) commented Sep 18, 2019

Gotcha. Although the agent can handle such large numbers of spans, it doesn't handle them well when they're all in one trace; on the Ruby side, we typically only flush spans to the agent after the trace completes. There is a buffer limit which prevents memory usage from overflowing, but any spans that overflow this limit are dropped. It's very possible that it's filling the buffer entirely, and the payload then sent to the agent is too large for it to handle in a single HTTP request.

We do have a feature called partial flushing that might alleviate this; however, it has seldom been used and is still largely experimental. If you're interested, you can try it by configuring the tracer:

Datadog.configure do |c|
  c.tracer partial_flush_timeout: 3, min_spans_before_partial_flush: 1000, max_spans_before_partial_flush: 10_000
end

If you play with those settings, you might find a balance that works. In any case, please do let us know if this helps at all.

@SpyMachine (Author) commented:

Hmm, ok, so good news and bad news... The good news is I can see all the data being passed over the wire on a long job, and I can see most of the data in the analytics view (though a decent chunk appears to be missing). However, the trace seems totally messed up when I use partial flush: it shows the entire trace as one random span over a few ms, as opposed to the same job being run without the partial flush options, which has 4k spans for a short job. I'm assuming something's getting mangled somewhere =/

@delner (Contributor) commented Sep 18, 2019

Okay, gotcha, bummer on the partial flush. Is there a screenshot you can share for that? It would be helpful as we try to fix this feature.

I'll consult the team to see if there's a better way to handle long-running jobs, or if there's a way to stream spans up for unfinished traces.

@allaud commented Nov 19, 2019

Same problem here: I was trying to find my worker traces in Datadog and found nothing.

This is what I found in my Sidekiq logs:

E, [2019-11-19T14:02:58.874305 #1] ERROR -- ddtrace: [ddtrace] (/usr/local/bundle/gems/ddtrace-0.28.0/lib/ddtrace/transport/http/client.rb:46:in `rescue in send_request') Internal error during HTTP transport request. Cause: Broken pipe Location: /usr/local/lib/ruby/2.6.0/socket.rb:456:in `__write_nonblock'

@usamaSaddiq commented Feb 26, 2020

I think I have the same problem. I am also using Sidekiq.

ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/2.5.0/gems/ddtrace-0.26.0/lib/ddtrace/transport/http/client.rb:46:in `rescue in send_request') Internal error during HTTP transport request. Cause: Broken pipe Location: /usr/local/lib/ruby/2.5.0/net/protocol.rb:233:in `write'

@marcotc marcotc linked a pull request Apr 27, 2020 that will close this issue
@marcotc marcotc removed a link to a pull request Apr 27, 2020
@marcotc marcotc linked a pull request Apr 29, 2020 that will close this issue
@marcotc marcotc added this to the 0.35.0 milestone Apr 30, 2020
@marcotc (Member) commented May 4, 2020

Thank you for all the reports @SpyMachine, @allaud, @usamaSaddiq!

v0.35.0 has been released, which adds dynamic payload chunking for large flush operations performed by ddtrace's writer (#840).

Our preliminary investigation tells us that most, if not all, Broken pipe errors were indeed caused by a payload that was too large to be flushed. This new release breaks down flushes that were previously too large into multiple smaller flush requests.
This is especially important for high-throughput applications.

If this error persists in your application after this new release, it is likely because a single trace is being generated that is too large to be flushed; in that case, I recommend trying to enable Partial Flushing (c.tracer.partial_flush.enabled = true) in your application.
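
A minimal sketch of that setting in context (only the partial_flush.enabled flag named above; the surrounding block is just your existing Datadog.configure call):

Datadog.configure do |c|
  # Flush completed spans of long-running traces in chunks
  # instead of holding the whole trace for one oversized payload.
  c.tracer.partial_flush.enabled = true
end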

In any case, let us know of any further errors that still occur from now on.

@marcotc (Member) commented Oct 7, 2020

Just a friendly ping to see if anyone in this GH issue is still seeing the "read limit reached" payload issue with newer versions of the tracer.

@tegon commented Oct 13, 2020

@marcotc I'm seeing the issue with version 0.40.0. I'll try to enable partial flush to see if that fixes it.

@tegon commented Oct 13, 2020

@marcotc Not sure if this helps, but I just noticed that this error started to happen more frequently once we increased the size of our EC2 instances in AWS. Before that change, it only happened a couple of times (we didn't even notice), but after the change it is happening pretty constantly.

@SpyMachine (Author) commented:

Hi, sorry for the late reply; I just got around to looking at this again. It appears to be working much better for me now!

@gc-dev1 commented Oct 26, 2020

@SpyMachine what tracer version are you using now?

I'm using 1.19.1 and still have this issue

Cannot decode v0.4 traces payload: read limit reached

@marcotc (Member) commented Oct 29, 2020

@SpyMachine I'm glad you saw improvements!

@tegon @gc-dev1 I think that at this point, we should follow up with each of you individually, with the full context of your environments.
Because this issue happens right at the edge of the tracer library and the agent, and is also highly influenced by your infrastructure topology, it's more productive for us to provide full account support on this issue. This will allow us to work directly with the data you report to our backend and have the full context.
With that in mind, could you please open a new support ticket with us? https://docs.datadoghq.com/help/
Please include a link to this GH issue for context for our support team.

@gc-dev1 I noticed you are seeing a similar issue with .NET. If you already have a support ticket open for that issue, I'd suggest investigating that one first, as the underlying cause might be the same (maybe it's the agent). That might save you an extra support thread with us.

@delner delner added bug Involves a bug community Was opened by a community member core Involves Datadog core libraries labels Apr 21, 2021
@delner delner added this to Needs triage in Planning via automation Apr 21, 2021
@usamaSaddiq commented May 11, 2021

Hey @marcotc

Sorry for the late reply. I've encountered this issue again with tracer.partial_flush.enabled = true:

03:43:56 sidekiq.1  | E, [2021-05-11T03:43:49.839689 #13] ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/2.6.0/gems/ddtrace-0.37.0/lib/ddtrace/transport/http/client.rb:35:in `rescue in send_request') Internal error during HTTP transport request. Cause: end of file reached Location: /usr/local/lib/ruby/2.6.0/net/protocol.rb:225:in `rbuf_fill'

I'll open a support ticket with DD as suggested in the above comment.

@ivoanjo (Member) commented Jul 13, 2022

Hey 👋. I'm going to go ahead and close this issue for now. As Marco said, we're very interested in knowing if you are still seeing issues; please get in contact and do let us know!

@ivoanjo ivoanjo closed this as completed Jul 13, 2022
Planning automation moved this from Needs triage to Resolved (without changes) Jul 13, 2022
@benjaminwood commented:

This may not be the same issue, but we saw ddtrace errors like:

E, [2022-08-02T23:48:52.204538 #1] ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/2.4.0/gems/ddtrace-0.54.2/lib/ddtrace/transport/http/client.rb:37:in `rescue in send_request') Internal error during HTTP transport request. Cause: Broken pipe Location: /usr/local/lib/ruby/2.4.0/net/protocol.rb:229:in `write'

And in trace-agent a corresponding:

2022-08-03 22:32:41 UTC | TRACE | ERROR | (run.go:268 in Errorf) | Cannot decode v0.4 traces payload: read tcp 10.0.50.75:8126->10.0.50.108:60054: i/o timeout

If you're using the net_http adapter and followed the docs when configuring it, you may have specified a 1s timeout.

This was the source of our problem, and after adjusting it to 5s, everything is working just fine.
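
(For reference, a rough sketch of the timeout bump; the net_http adapter arguments follow the 0.x docs example as I remember it, so the host, port, and 5s value are assumptions to verify against your version's documentation.)

Datadog.configure do |c|
  # Assumed agent host/port; raise the docs' 1s example timeout to something less aggressive.
  c.tracer transport_options: proc { |t|
    t.adapter :net_http, '127.0.0.1', 8126, timeout: 5
  }
end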

@ivoanjo is a timeout of 1s in the docs misleading? Seems a bit short to me.

Screenshot of the docs for historical purposes if it changes:

[screenshot of the docs' net_http timeout example omitted]

ivoanjo added a commit that referenced this issue Aug 4, 2022
As raised in #818, the default `timeout` example is rather far away
from the actual default being used, which may lead customers that
copy-paste the example to have a bad experience.

I also discovered we had an incorrect constant with the default
timeout which was not even being used anywhere (you can grep for it
 to confirm!).

Thanks @benjaminwood for the awesome suggestion!
@ivoanjo (Member) commented Aug 4, 2022

Good point! Opened #2205 to improve on this; having an example that we would nowadays not recommend to customers is definitely not a good idea.

@benjaminwood commented:

Awesome, thank you @ivoanjo! I'm sure this will prevent a few headaches for customers down the road.
