Split TraceConsumer into two different disruptors by tylerbenson · Pull Request #1161 · DataDog/dd-trace-java

tylerbenson · 2020-01-06T21:13:14Z

First disruptor (TraceProcessingDisruptor) does processing, which is currently limited to serialization, but in the future can do other processing such as TraceInterceptor invocation.
Second disruptor (BatchWritingDisruptor) takes serialized traces and batches them into groups and flushes them periodically based on size and time.
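
As a rough illustration of the two stages described above, here is a minimal synchronous sketch. It does not use the Disruptor library, and every name in it (PipelineSketch, Batcher, the string-joining "serializer", the injected clock values) is hypothetical rather than taken from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the two stages; plain method calls stand in for the disruptors. */
final class PipelineSketch {
  /** Stage 1: processing, here just a stand-in serialization of span names. */
  static byte[] process(List<String> trace) {
    return String.join(",", trace).getBytes(StandardCharsets.UTF_8);
  }

  /** Stage 2: collects serialized traces and flushes on a size or time threshold. */
  static final class Batcher {
    private final int maxBytes;
    private final long flushIntervalNanos;
    private final List<byte[]> batch = new ArrayList<>();
    private int batchBytes = 0;
    private long lastFlushNanos;
    /** Number of traces in each flushed batch; stands in for sending to the agent. */
    final List<Integer> flushedBatchSizes = new ArrayList<>();

    Batcher(int maxBytes, long flushIntervalNanos, long nowNanos) {
      this.maxBytes = maxBytes;
      this.flushIntervalNanos = flushIntervalNanos;
      this.lastFlushNanos = nowNanos;
    }

    /** Adds one serialized trace, then flushes if the batch is big enough or old enough. */
    void add(byte[] serialized, long nowNanos) {
      batch.add(serialized);
      batchBytes += serialized.length;
      if (batchBytes >= maxBytes || nowNanos - lastFlushNanos >= flushIntervalNanos) {
        flush(nowNanos);
      }
    }

    /** Records the batch as flushed and resets the size and time counters. */
    void flush(long nowNanos) {
      if (!batch.isEmpty()) {
        flushedBatchSizes.add(batch.size());
      }
      batch.clear();
      batchBytes = 0;
      lastFlushNanos = nowNanos;
    }
  }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a real consumer would presumably read it from System.nanoTime().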

dougqh · 2020-01-07T14:42:28Z

-  /** Old signature (pre-Monitor) used in tests */
-  private DDAgentWriter(final DDAgentApi api) {
-    this(api, new Monitor.Noop());
+    batchWritingDisruptor =


I think it would be fine to pass the Spec object down to the disruptors.

Currently not all of the DDAgentWriter constructors use the Spec.

dougqh · 2020-01-07T14:48:01Z

+
+    if (0 < flushFrequencySeconds) {
+      // This provides a steady stream of events to enable flushing with a low throughput.
+      final Runnable heartbeat =


I'm not crazy about adding an extra executor for this.
Requesting a flush on timeout seems cleaner and lighter weight to me.

Not sure what you mean by "on time out". If we don't have any events in the queue, the handler will never be called to trigger a flush.

If you look at the approach in the experimental branch that I did, it doesn't have a scheduled timer.

Instead the sender thread does something akin to this pseudo-code...
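while (!Thread.currentThread().isInterrupted()) {
  try {
    // poll returns null if nothing arrives within the flush interval
    DDApi.Request request = queue.poll(flushFrequencySecs, TimeUnit.SECONDS);
    if (request != null) {
      send(request);
    } else {
      // timed out: request flush
      flush();
    }
  } catch (InterruptedException e) {
    // restore the interrupt flag so the loop condition ends the thread
    Thread.currentThread().interrupt();
  }
}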

I prefer this because it is one fewer thread, but also because it is easier to have the sender back off its schedule.
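
To make the back-off point concrete: with a timed poll, the timeout itself can serve as the schedule. The sketch below is only an assumption about how that could look; BackoffSchedule and its doubling policy are hypothetical and not part of this PR or the experimental branch.

```java
/** Hypothetical sketch: tracks the wait the sender passes to its timed poll. */
final class BackoffSchedule {
  private final long baseMillis;
  private final long maxMillis;
  private long currentMillis;

  BackoffSchedule(long baseMillis, long maxMillis) {
    this.baseMillis = baseMillis;
    this.maxMillis = maxMillis;
    this.currentMillis = baseMillis;
  }

  /** Timeout to use for the next timed poll. */
  long currentMillis() {
    return currentMillis;
  }

  /** Doubles the wait after a failed send, capped at maxMillis. */
  void onFailure() {
    currentMillis = Math.min(maxMillis, currentMillis * 2);
  }

  /** Returns to the base rate after a successful send. */
  void onSuccess() {
    currentMillis = baseMillis;
  }
}
```

The sender loop would then call something like queue.poll(schedule.currentMillis(), TimeUnit.MILLISECONDS) and report each send result back to the schedule.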

To clarify, the heartbeat only ensures a minimum level of events to enable timely reporting in case no traces are sent in a given time window. The actual sending frequency can be adjusted in scheduleNextFlush().

dougqh · 2020-01-07T14:51:18Z

-  public volatile boolean shouldFlush = false;
  public volatile T data = null;
+  public volatile int representativeCount = 0;
+  public volatile CountDownLatch flushLatch = null;


I think having a latch per batch is a big improvement in the flush semantics.

Yes, I also like this much better than the previous phaser approach.

A CountDownLatch might be overkill. We don't really need to wait for all the flushers to arrive before unblocking the others, but I don't think it is a big deal.

What would you suggest using instead?
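
For readers following the latch discussion, here is a minimal sketch of latch-per-flush semantics. The data and flushLatch field names mirror the diff above; the BatchEvent and BatchConsumer classes and the counter standing in for an agent send are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Hypothetical event type; only the two field names come from the diff. */
final class BatchEvent {
  byte[] data;               // serialized trace, or null for a pure flush request
  CountDownLatch flushLatch; // non-null when a caller is waiting on this flush
}

/** Hypothetical single-threaded consumer that batches events and honors flush requests. */
final class BatchConsumer {
  private final List<byte[]> batch = new ArrayList<>();
  int flushedTraces = 0;

  void onEvent(BatchEvent event) {
    if (event.data != null) {
      batch.add(event.data);
    }
    if (event.flushLatch != null) {
      try {
        flushedTraces += batch.size(); // stand-in for sending the batch to the agent
        batch.clear();
      } finally {
        // Release the waiting caller whether or not the send succeeded.
        event.flushLatch.countDown();
      }
    }
  }
}
```

A caller that wants a synchronous flush would publish an event with a new CountDownLatch(1) and then await it with a timeout, so each flush request is released by exactly the batch it asked for.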

dougqh · 2020-01-07T14:54:25Z

+        if (event.data != null) {
+          try {
+            final byte[] serializedTrace = api.serializeTrace(event.data);
+            monitor.onSerialize(writer, event.data, serializedTrace);


I think there's a bug here. We shouldn't be calling onSerialize before we know if the publishing was successful.

I admit, I had a hard time understanding how to translate the monitor calls. That aspect of this change warrants a thorough review.

The monitor callbacks follow a couple of rules:
1 - The callback happens after something is complete.
2 - The success and failure cases are split, to force thinking carefully about failure.

So in general, I'd expect the callback to be at the end of a try block.
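
The two rules above could be sketched as follows. SketchMonitor is a deliberately simplified, hypothetical interface (the real Monitor callbacks take different arguments), and Publisher.Sink is a stand-in for whatever operation is being monitored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified monitor; not the signatures used in this PR. */
interface SketchMonitor {
  void onPublish(int payloadSize);

  void onFailedPublish(int payloadSize);
}

/** Records callbacks in order so the behavior is easy to inspect. */
final class RecordingMonitor implements SketchMonitor {
  final List<String> calls = new ArrayList<>();

  public void onPublish(int payloadSize) {
    calls.add("publish:" + payloadSize);
  }

  public void onFailedPublish(int payloadSize) {
    calls.add("failed:" + payloadSize);
  }
}

final class Publisher {
  /** Hypothetical operation being monitored; may fail with any exception. */
  interface Sink {
    void accept(byte[] payload) throws Exception;
  }

  static void publish(byte[] payload, Sink sink, SketchMonitor monitor) {
    try {
      sink.accept(payload);
      // Rule 1: last statement of the try block, so it only runs once the work is done.
      monitor.onPublish(payload.length);
    } catch (Exception e) {
      // Rule 2: failure has its own callback instead of a flag on the success one.
      monitor.onFailedPublish(payload.length);
    }
  }
}
```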

I changed the order. Let me know if I'm missing anything else.

dougqh · 2020-01-07T14:57:02Z

+    private final Monitor monitor;
+    private final DDAgentWriter writer;
+    private final List<byte[]> serializedTraces = new ArrayList<>();
+    private int representativeCount = 0;


It could track not just traces but also spans. We wanted to include that in health metrics, but it wasn't terribly easy in the prior design.

Do you mean the number of total spans? What's the benefit there?

dougqh · 2020-01-07T15:00:11Z

+import lombok.extern.slf4j.Slf4j;
+
+@Slf4j
+public class BatchWritingDisruptor extends AbstractDisruptor<byte[]> {


I have some concerns with this. I'd actually like to see us get away from producing many tiny byte[].

I'd prefer to see us build up one big byte[] instead to reduce the amount of allocation.
DDApi.Request from the experimental branch was built with that in mind. I don't quite see how we do that with this design.

We talked and came up with a good solution. Use a byte[] on the event as a reusable buffer that grows to the needed size, and copy its contents into a large buffer when batching.

This requires moving off Jackson though, so will be done in a separate PR.
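
A minimal sketch of the reusable-buffer idea, under the assumption that the serializer can write into a caller-supplied array. ReusableBuffer and its growth policy are hypothetical; set() copies from a source array here only to keep the example self-contained.

```java
/** Hypothetical per-event scratch buffer; grows on demand and is never shrunk. */
final class ReusableBuffer {
  private byte[] bytes = new byte[16];
  private int length = 0;

  /** Overwrites the buffer with new content, reallocating only when it is too small. */
  void set(byte[] src, int srcLength) {
    if (srcLength > bytes.length) {
      bytes = new byte[Math.max(srcLength, bytes.length * 2)];
    }
    System.arraycopy(src, 0, bytes, 0, srcLength);
    length = srcLength;
  }

  int length() {
    return length;
  }

  int capacity() {
    return bytes.length;
  }

  /** Copies the valid bytes into dest at destOffset and returns the next free offset. */
  int copyTo(byte[] dest, int destOffset) {
    System.arraycopy(bytes, 0, dest, destOffset, length);
    return destOffset + length;
  }
}
```

Once the buffer has grown to fit the largest trace seen so far, later traces reuse the same backing array, so the only per-trace work is the copy into the batch buffer.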

dougqh · 2020-01-07T15:19:43Z

+              }
+            }
+          };
+      heartbeatExecutor.scheduleAtFixedRate(heartbeat, 100, 100, TimeUnit.MILLISECONDS);


To have more meaningful back-pressure, we probably also need to be able to back off the rate at which we are sending. How would that work with the heartbeatExecutor?

The heartbeatExecutor only adds an event to the queue if the queue is empty, and it doesn't influence the frequency of flushing. What it does influence is the maximum delay (beyond the flush frequency) before a flush occurs when the queue is empty (i.e., a flush will be at most 100 ms late from the 1/sec rate).
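
The behavior described above can be sketched like this. A plain java.util.Queue stands in for the disruptor ring buffer, and the Heartbeat class and HEARTBEAT marker are hypothetical names, not code from this PR.

```java
import java.util.Queue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; a plain queue stands in for the disruptor ring buffer. */
final class Heartbeat {
  /** Empty marker event whose only purpose is to wake the consumer. */
  static final Object HEARTBEAT = new Object();

  /** Publishes a marker only when the queue is idle, so a busy queue is left alone. */
  static Runnable heartbeatFor(Queue<Object> queue) {
    return () -> {
      if (queue.isEmpty()) {
        queue.offer(HEARTBEAT);
      }
    };
  }

  /** Runs the heartbeat every periodMillis; the caller must shut the executor down. */
  static ScheduledExecutorService start(Queue<Object> queue, long periodMillis) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleAtFixedRate(
        heartbeatFor(queue), periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    return executor;
  }
}
```

Because the marker only gives the consumer a chance to check its own flush deadline, the heartbeat period bounds how late a flush can be without setting how often flushes happen.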

dougqh · 2020-01-07T15:26:13Z

I have a few concerns about how future changes will fit into this...
1 - I'd like to get away from producing many small byte[] -- and basing the second Disruptor on byte[] runs counter to that
2 - I'd prefer to have fewer threads, so I'd like to avoid the separate heartbeat if possible
3 - I'd like the sending rate to be able to back off and back on depending on whether we're successfully communicating with the agent. It isn't clear how that would work in this design.

Finally, it would help to have a comment / diagram that describes the overall publishing pipeline. That would make the code easier to follow in the future.


… metrics when traces are sampled.

tylerbenson · 2020-01-17T00:45:02Z

+            // attempt to have agent scale the metrics properly
+            ((DDSpan) event.data.get(0).getLocalRootSpan())
+                .context()
+                .setMetric("_sample_rate", 1d / event.representativeCount);


@gbbr does this look like a legit way of getting our _sample_rate scaling done by the agent to be accurate?

The agent doesn't do any scaling, it's the backend, so I wouldn't be able to tell. _sample_rate is expected to hold the rate that a local client sampler (the one that doesn't send stuff to the agent at all) is using IIRC. @furmmon is our expert for answering any questions around sampling, maybe you can confirm.
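
For reference, the arithmetic in the diff above works out as follows. This is only a sketch of the 1d / representativeCount expression; the SampleRate helper is hypothetical, and whether the backend scales by this value is exactly the open question in this thread.

```java
/** Hypothetical helper mirroring the 1d / representativeCount expression in the diff. */
final class SampleRate {
  /** Rate implied when one published trace stands in for representativeCount traces. */
  static double fromRepresentativeCount(int representativeCount) {
    if (representativeCount < 1) {
      throw new IllegalArgumentException("representativeCount must be at least 1");
    }
    return 1d / representativeCount;
  }
}
```

For example, a trace that represents itself plus three others that were not published has a representativeCount of 4 and a rate of 0.25; a count of 1 gives a rate of 1.0.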

tylerbenson · 2020-01-17T01:35:47Z

    if (traceProcessingDisruptor.running) {
-      final int representativeCount = traceCount.getAndSet(0) + 1;
+      final int representativeCount;
+      if (trace.isEmpty() || !(trace.get(0).isRootSpan())) {


This might not work if the last span reported isn't the root span. This might be an issue for async traces and for partial flush traces. Any better ideas?

Also rename the builder class on DDTracer to default name generated by Lombok.

dougqh · 2020-01-17T21:09:00Z

+      this.writer = writer;
+    }
+
+    // TODO: reduce byte[] garbage by keeping the byte[] on the event and copy before returning.


So reducing byte[] garbage remains to be done. I think that's fine for now; we can revisit it after ripping out Jackson.

randomanderson

I ran the performance tests and got nearly identical results (~5,300 traces/s) from my local laptop on master and this PR

tylerbenson requested a review from dougqh January 6, 2020 21:13

tylerbenson requested a review from a team as a code owner January 6, 2020 21:13

tylerbenson force-pushed the tyler/disruptor-agent branch from 88baa9e to e3054cb January 6, 2020 22:32