Exploring alternate approach to queueing & flushing by dougqh · Pull Request #1127 · DataDog/dd-trace-java

dougqh · 2019-12-05T04:06:04Z

This is a work-in-progress branch of how we might change the internals of the DDAgentWriter.
I don't intend to merge this pull request as is. Rather I'd like to get opinions on parts of the approach contained within -- and then I'll break this into smaller pull requests.

This pull request contains a few experimental improvements...

Introduces DDApi.Request & Builder similar to the previously introduced DDApi.Response
Changes DDAgentWriter's internal threading and queueing
Introduces retries when DDApi fails to communicate with the agent
Introduces an exponential back-off when communication fails -- including slowing the flush rate

The DDApi.Request.Builder replaces the List<byte[]> used to build up a payload for the trace_agent -- although it still contains a List<byte[]> internally.

The intention here is to...

carry additional meta-info useful for health metrics
eventually replace List<byte[]> with a single ByteBuffer to reduce allocation

DDAgentWriter's internal threading and queues...
This change keeps the Disruptor but the serializer/Disruptor thread but now builds up a payload using a DDApi.Request.Builder (mentioned above).
The ScheduledExecutor is replaced with a sender thread and an ArrayBlockingQueue<DDApi.Request>. The primary motivation for replacing the ScheduledExecutor is that internally it uses an unbounded queue.

The disruptor still enqueues a request (now DDApi.Request) when the the request becomes too large or a flush is requested.

However, flushes no longer happen on a strict timer.
Instead the sender thread -- does a blocking dequeue with a specified timeout.
Whenever the sender thread times out, it requests a flush from the Disruptor and then attempts to dequeue again.
The overall effect is that the sender effectively operates on the same schedule as before -- except...

The sender thread now performs retries to the agent in the event of a failure.
In the event of a failure, the sender will now exponentially back-off before retrying.
But the sender thread also slows down the rate at which it requests a flush from the Disruptor.

One part not included here, but that I feel also needs to change is the semantics/implementation of DDAgentWriter.flush.

flush currently cancels any flush that is already in-progress and then waits for the next flush to finish. While I believe this method is only used in tests, the current implementation is inherently problematic.

Multiple threads calling flush at the same time inhibit each other's progress by extending the time until the next flush is complete. A better semantic would be for each thread to only wait until the last span created in its thread is sent to the agent. This could be done by associating a Future with the DDApi.Request being built and then waiting for the Future to resolve.

dougqh · 2019-12-05T04:06:35Z

@@ -1,34 +1,21 @@
 package datadog.trace.common.writer;

-import static datadog.trace.api.Config.DEFAULT_AGENT_HOST;


Please ignore the imports, I'll clean this up in the final pull request.

dougqh · 2019-12-05T04:07:39Z


-  private static final EventTranslatorOneArg<Event<List<DDSpan>>, List<DDSpan>> TRANSLATOR =
-      new EventTranslatorOneArg<Event<List<DDSpan>>, List<DDSpan>>() {
+  private static final EventTranslatorOneArg<TraceEvent, List<DDSpan>> TRANSLATOR =


This should probably be factor out into an NFC PR.
Our Event class is only used for a single List and only makes sense for List, so the parameterization is unnecessary.

dougqh · 2019-12-05T04:09:10Z


  @Override
  public void incrementTraceCount() {
-    traceCount.incrementAndGet();


I haven't track down how incrementTraceCount is used exactly. Is it just for testing?
My hope is to move the representive count and other metadata into the DDApi.Request.

dougqh · 2019-12-05T04:09:54Z

    disruptor.start();
+    senderThread.start();
    running = true;
-    scheduleFlush();


The senderThread now requests the flushes, so there's no need to schedule once the senderThread is started.

dougqh · 2019-12-05T04:11:21Z

-  private final AtomicReference<ScheduledFuture<?>> flushSchedule = new AtomicReference<>();
+  private final Disruptor<TraceEvent> disruptor;
+
+  private final BlockingQueue<DDApi.Request> requestQueue = new ArrayBlockingQueue<>(16);


ArrayBlockingQueue effectively replaces the internal work queue of the ScheduledExecutorService.

This was done because the ScheduledExecutorService work queue is unbounded.
If we want to do a less invasive change, we can introduce a counting semaphore around the ScheduledExecutorService.

dougqh · 2019-12-05T04:12:21Z

 import datadog.opentracing.DDSpan;
 import datadog.opentracing.DDTraceOTInfo;
 import datadog.trace.common.writer.unixdomainsockets.UnixDomainSocketFactory;
+import lombok.extern.slf4j.Slf4j;


Again, please ignore the imports for now.

dougqh · 2019-12-05T04:13:30Z

    @Override
    public void onPublish(final DDAgentWriter agentWriter, final List<DDSpan> trace) {
-      statsd.incrementCounter("queue.accepted");
+      statsd.count("queue.accepted", 1);


The intention here is to always set the negative indicators to 0, so they stay visible in the UI.
This should definitely be separated into another PR.

dougqh · 2019-12-05T04:17:46Z

-    apiPhaser.register(); // Register on behalf of the scheduled executor thread.
+
+    sender = new Sender(flushFrequencySeconds, 5);
+    sender.register();


sender registering onto the apiPhaser.

Although, as mentioned in the PR description, I think we need to change the implementation of flush and away from a single shared Phaser.

tylerbenson · 2020-01-14T00:47:17Z

Potentially superseded by #1161.

Exploring alternate approach to queueing & flushing

3e98d8f

dougqh added the tag: do not merge Do not merge changes label Dec 5, 2019

dougqh requested a review from a team as a code owner December 5, 2019 04:06

dougqh commented Dec 5, 2019

View reviewed changes

dougqh closed this Mar 10, 2020

tylerbenson added this to the Closed milestone Apr 10, 2020

tylerbenson deleted the dougqh/ddagentwriter-refactor branch June 4, 2020 18:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exploring alternate approach to queueing & flushing#1127

Exploring alternate approach to queueing & flushing#1127
dougqh wants to merge 1 commit into
masterfrom
dougqh/ddagentwriter-refactor

dougqh commented Dec 5, 2019 •

edited

Loading

Uh oh!

dougqh Dec 5, 2019

Uh oh!

dougqh Dec 5, 2019

Uh oh!

dougqh Dec 5, 2019

Uh oh!

dougqh Dec 5, 2019

Uh oh!

dougqh Dec 5, 2019

Uh oh!

dougqh Dec 5, 2019

Uh oh!

dougqh Dec 5, 2019

Uh oh!

dougqh Dec 5, 2019

Uh oh!

tylerbenson commented Jan 14, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -1,34 +1,21 @@
		package datadog.trace.common.writer;

		import static datadog.trace.api.Config.DEFAULT_AGENT_HOST;

Conversation

dougqh commented Dec 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tylerbenson commented Jan 14, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dougqh commented Dec 5, 2019 •

edited

Loading