
Trace payload chunking #840

Merged: marcotc merged 24 commits into master from feat/subdivide-payloads on Apr 29, 2020

Conversation

@marcotc (Member) commented Oct 16, 2019

During serialization, we now break large collections of traces into smaller batches.

This is necessary because sending overly large payloads can cause the receiving server to reject them.
Currently, we send traces to the Datadog agent, which has a 10 MiB limit per payload (as of agent v6.14.1).

We therefore break the traces down into chunks smaller than that limit.
We also discard any single trace that exceeds the limit on its own, as such a trace cannot be broken down any further.
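
A minimal sketch of the idea (hypothetical names, assuming a generic `encoder#encode`; not the exact code merged in this PR):

```ruby
# A minimal sketch: encode traces one by one and group them into payloads
# that stay under the agent's size limit; traces that are too big on their
# own are dropped, since they cannot be subdivided.

DEFAULT_MAX_PAYLOAD_SIZE = 10 * 1024 * 1024 # 10 MiB

def chunk_traces(traces, encoder, max_size: DEFAULT_MAX_PAYLOAD_SIZE)
  chunks = []
  current = []
  current_size = 0

  traces.each do |trace|
    encoded = encoder.encode(trace)

    # A single trace larger than the limit cannot be sent at all.
    next if encoded.bytesize > max_size

    # Start a new chunk if adding this trace would exceed the limit.
    if current_size + encoded.bytesize > max_size && !current.empty?
      chunks << current
      current = []
      current_size = 0
    end

    current << encoded
    current_size += encoded.bytesize
  end

  chunks << current unless current.empty?
  chunks
end
```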

@marcotc marcotc requested a review from a team October 16, 2019 20:49
@marcotc marcotc self-assigned this Oct 16, 2019
@marcotc marcotc added the core Involves Datadog core libraries label Oct 16, 2019
lib/ddtrace/encoding.rb: review comments (outdated, resolved)
@delner previously requested changes Oct 17, 2019
- # Get response from API
- response = yield(current_api, env)
+ # Get responses from API
+ responses = yield(current_api, env)
Contributor:

Is there a way to call the client multiple times for each request in the batch instead of changing the client to do batching itself? I think it's unwise to make the client handle multiple requests simultaneously as it will greatly increase the complexity of the transport code.

It will also make this batching feature more brittle and tightly coupled to how HTTP works instead of being agnostic to the means of transport, which will make it difficult (if not impossible) to adopt new means of transport in the future.

Member Author:

We need to know what encoder we are using in order to break down the traces being flushed into multiple chunks.

The client is currently responsible for such information, in the form of client.current_api.spec.traces.encoder.
Also, when downgrading, the encoder might change. Downgrading is currently handled by the client.

I tried to prototype a different approach just now, moving the chunking logic as far up the call chain as I believe makes sense: feat/subdivide-payloads...tmp-feat/subdivide-payloads

I still don't like this one, too many layers are mixed together.

The main issue is that chunking, in a perfect scenario, would happen before we start calling the client. But the fact that we need the encoder, which is two levels down (inside the current Spec instance), and that the encoder can change if we need to downgrade the API, makes this quite tricky.

Next, I'm going to try to move current_api into the transport instance, and handle API versioning there, including the downgrading logic.

I'll report back on those results.
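
For context, a rough sketch of the nesting being described (hypothetical class shapes, only to illustrate why the encoder is "two levels down"):

```ruby
# Hypothetical shapes, only to illustrate the access chain mentioned above:
# the encoder hangs off the traces spec, which hangs off the current API,
# which the client owns.
Encoder    = Struct.new(:content_type)
TracesSpec = Struct.new(:encoder)
Spec       = Struct.new(:traces)
API        = Struct.new(:spec)

class Client
  attr_reader :current_api

  def initialize(api)
    @current_api = api
  end
end

client = Client.new(API.new(Spec.new(TracesSpec.new(Encoder.new('application/msgpack')))))
client.current_api.spec.traces.encoder.content_type # => "application/msgpack"
```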

Contributor:

Okay, that context is helpful, thanks for that explanation.

The existing design made certain assumptions about encoding, which is why it was buried lower down in the transport: it was considered a detail of the current API, and I think that still holds true to a great extent.

I think it brings up some legitimate questions about how the design could change to accommodate batching though. Some possible paradigms I can think of might be:

  1. Expose the encoder, wrap the client with some kind of Batcher, then have the batcher encapsulate this logic entirely, and use the client to drive individual requests. Batching could be its own module Batching that can be composed into the existing HTTP::Client.
  2. Assert that encoding requests is a detail of the API and that it's acceptable for the API to split requests on the client's behalf. Consequently, you'd make the API spec responsible for batching and splitting one large request into smaller ones (which is what I think you were effectively doing).

There might be more ways of handling this, but the key difference between these is that option 1 is explicit in Client usage (one request, one response), while option 2 is auto-magic: "don't worry about the details, we'll figure it out."

Personally, I'm in favor of option 1, because it keeps the responsibilities of the API/Client as small as possible (less complexity), and doesn't get us into weird scenarios where we have to handle a request that was forked into multiple requests in code that isn't concerned with batching (e.g. Client#send_request). Instead, we can keep all this batching code (hopefully) in a neat little module that knows how to deal with multiple requests and extends the capability of the Client in a compartmentalized way; a sketch of what that could look like follows below. (We could even go a step further and extract the "retry" functionality into a similar module for consistency, something I might want to undertake anyway.)
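
A minimal sketch of option 1 (all names here are hypothetical, not from the actual PR): a Batching module prepended onto a client, so one logical send becomes several requests while the underlying client keeps its one-request/one-response contract.

```ruby
# Hypothetical Batching module that can be composed into a client.
module Batching
  def send_traces(traces)
    # Group traces into request-sized batches, then drive the wrapped
    # client once per batch and collect the individual responses.
    batches_of(traces).map { |batch| super(batch) }
  end

  private

  # Naive batching by trace count; a real implementation would batch
  # by encoded payload size instead.
  def batches_of(traces, per_batch = 100)
    traces.each_slice(per_batch).to_a
  end
end

class SimpleClient
  def send_traces(traces)
    # Stand-in for a single HTTP request carrying the given traces.
    "sent #{traces.size} traces"
  end

  prepend Batching
end

SimpleClient.new.send_traces(Array.new(250) { Object.new })
# => ["sent 100 traces", "sent 100 traces", "sent 50 traces"]
```

The prepend keeps the client's own send method untouched; only the composition point knows that one logical send may fan out into several requests.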

Let me know your thoughts or if you have some alternative paradigms to suggest!

@marcotc marcotc requested a review from delner October 21, 2019 22:31
@marcotc (Member Author) commented Oct 22, 2019

Successfully tested locally with a few real example applications.

@delner delner added this to In review in Active work Nov 21, 2019
@marcotc marcotc force-pushed the feat/subdivide-payloads branch 2 times, most recently from 4829e20 to 85e7609 on December 19, 2019 23:17
@marcotc (Member Author) commented Dec 20, 2019

Updated with master, ready for review. (Don't mind Rails 5.2.1.4 breaking the Ruby 2.2 build; CI has since been fixed.)

@marcotc marcotc dismissed delner’s stale review December 20, 2019 22:25

Request addressed.

@marcotc (Member Author) commented Apr 27, 2020

There were non-trivial changes to Datadog::Transport::IO that I did not expect to have to make in order to adapt it to the interfaces that changed.

That said, no part of this PR was touched only for the sake of refactoring: every component touched required changes for correctness.

lib/ddtrace/chunker.rb: review comment (outdated, resolved)
@marcotc marcotc requested a review from delner April 27, 2020 21:16
@marcotc marcotc removed the do-not-merge/WIP Not ready for merge label Apr 27, 2020
@marcotc marcotc added this to the 0.35.0 milestone Apr 27, 2020
@marcotc marcotc linked an issue Apr 27, 2020 that may be closed by this pull request
@gbbr commented Apr 28, 2020

I didn't read all of this, but note that if you split a trace in two it will break stats. I suspect the "payload too big" problem happens with big traces anyway, so splitting them would become necessary. An implementation like that will complicate tracer code a lot, I suspect, and it would affect not only Ruby but other languages too.

Instead, I think it might be a better idea to explore "span batching" in the agent again, to add an endpoint which receives a set of random spans, and reconstructs traces on the agent side. This brings many benefits:

  • Solves the payload size problem for all languages
  • Significantly simplifies span buffering / trace batching in all clients, reducing the memory and CPU footprint in the host application
  • Allows supporting open standards like OpenTelemetry, Zipkin, etc.

This will of course move the memory problem into the trace-agent, and not from just one client but from multiple ones that may be sending to the same endpoint. It should be acceptable, but it's bound to bring new problems and complications, and needs to be explored (again).

@brettlangdon (Member) commented:
@gbbr this particular change breaks payloads into smaller ones by separating out whole traces. It is not for breaking individual traces into smaller pieces.

For example, if one flush interval passes and we have 10 MB of traces, we'll send 2 payloads of traces to the agent instead of trying to do it in one.

This is what we do in a few of the languages now.

Span streaming is a great idea, but will require significant development/coordination between the tracers and the agent. This change should unblock us for now while we schedule span streaming investigation/work.

@gbbr commented Apr 28, 2020

Alright, carry on :) Never mind me then.

lib/ddtrace/transport/traces.rb: review comment (outdated, resolved)
return send_traces(traces.lazy)
end
end
end.force
Contributor:

What's force?

Member Author:

It forces a lazy enumerator to eagerly resolve: https://ruby-doc.org/core-2.6.1/Enumerator/Lazy.html#method-i-to_a

I could use #to_a here too (#force is an alias of #to_a), but the #force method only exists for lazy enumerators, which makes it more explicit that we don't want to simply call #to_a on a plain Array here, as that would not accomplish our goal of streaming requests.
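
A small illustration of the pattern (not the PR's actual code): the lazy pipeline produces one response per chunk, and #force at the end drives it.

```ruby
# Each chunk is only turned into a "response" when the pipeline is
# resolved; #force (an alias of #to_a on Enumerator::Lazy) does that.
chunks = [[1, 2], [3, 4], [5]]

responses = chunks
  .lazy
  .map { |chunk| "response for #{chunk.inspect}" } # stand-in for one request per chunk
  .force

p responses
# => ["response for [1, 2]", "response for [3, 4]", "response for [5]"]
```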

lib/ddtrace/transport/traces.rb: review comment (outdated, resolved)
data.length
attr_reader :trace_count

def initialize(data, trace_count)
Contributor:

I don't think this makes sense with the current design, and altering this has a lot of side effects.

Traces::Parcel in the current design is supposed to be a protocol-agnostic package of trace data, to be created by something that doesn't have knowledge of how the transport works. By requiring the Parcel to be given already-encoded data like this (along with its trace count), it implicitly requires knowledge of the transport and its current API state to construct properly, which makes it an object strongly coupled to internal transport behavior.

That said, I think there's an argument to be made that we should change the design, and that to support encoding in chunks, it might also require some different kind of construct.

Short term, maybe we can remedy this by leaving the existing Traces::Parcel as is, but creating a new Traces::EncodedParcel which results from encoding traces from a parcel during chunking.

Long term, I think perhaps it shouldn't create parcels at all; the transport should only receive generic requests and return generic responses. Any trace-specific behavior should live in some kind of Traces::Transport that wraps a generic transport (HTTP/IO/UDS, etc.). Then we wouldn't need any parcels.
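
A rough sketch of the short-term suggestion (illustrative shapes only; the PR's actual classes may differ): keep the protocol-agnostic Traces::Parcel as-is and introduce an EncodedParcel produced during chunking.

```ruby
module Traces
  # Protocol-agnostic container of trace objects, as in the current design.
  class Parcel
    attr_reader :traces

    def initialize(traces)
      @traces = traces
    end
  end

  # Result of encoding (a chunk of) a parcel's traces: it carries the
  # encoded bytes plus the number of traces they represent, which the
  # transport needs when building a request.
  class EncodedParcel
    attr_reader :data, :trace_count

    def initialize(data, trace_count)
      @data = data
      @trace_count = trace_count
    end
  end
end
```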

Member Author:

The reason I changed this Parcel is that it is actually a subtype, Traces::Parcel: it inherits behaviour from the generic Transport::Parcel, which is agnostic.

I did have the additional information I needed (trace_count) under another carrier object, which I believe was Traces::Request, but the parcel seemed like a better carrier for it. To be fair, I'm not 100% sure of the role of Parcel after these changes, so I'm very much open to changing this design.

@delner (Contributor) left a comment

👍

@marcotc marcotc merged commit 821d023 into master Apr 29, 2020
Active work automation moved this from In review to Merged & awaiting release Apr 29, 2020
@marcotc marcotc deleted the feat/subdivide-payloads branch April 29, 2020 19:59
@marcotc marcotc linked an issue Apr 29, 2020 that may be closed by this pull request
@marcotc marcotc moved this from Merged & awaiting release to Released in Active work May 4, 2020
@delner delner mentioned this pull request May 8, 2020
Labels: core (Involves Datadog core libraries)
Projects: Active work (Released)

Successfully merging this pull request may close these issues:
Cannot decode traces payload, read limit reached

4 participants