@deejgregor deejgregor commented Oct 14, 2025

What Does This Do

This change does a few things:

  • Resets the index in collectTraces when the data field is replaced (and marks the index field as volatile). This should prevent the issue from happening.
  • In case the situation still happens, a stand-in CommandElement is returned to avoid returning null. A warning message is also logged.
  • The existing "testing tracer flare dump with multiple traces" test case is expanded to exercise the problem.
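
A rough sketch of the first two points, under stated assumptions. This is not the actual DumpDrain code: the class name DrainFixSketch, the String elements, the STAND_IN constant, and the use of java.util.logging are placeholders chosen for illustration. Only the 'data' and 'index' fields and the collectTraces and get methods are named after the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical, simplified model of the drain (not the real class).
class DrainFixSketch {
  private static final Logger LOG = Logger.getLogger(DrainFixSketch.class.getName());

  // Placeholder for the stand-in CommandElement mentioned above.
  static final String STAND_IN = "stand-in";

  private List<String> data = new ArrayList<>();
  private volatile int index = 0; // volatile, as in the first point

  // Collects an element drained from the queue.
  void accept(String element) {
    data.add(element);
  }

  // Hands collected elements back one at a time; never returns null.
  String get() {
    List<String> current = data;
    int i = index;
    if (i < current.size()) {
      index = i + 1;
      return current.get(i);
    }
    // Fallback for the case that should not happen: warn and return a harmless element.
    LOG.warning("Index " + i + " is out of bounds for " + current.size() + " collected elements");
    return STAND_IN;
  }

  // Replaces the list and resets the index in the same step.
  List<String> collectTraces() {
    List<String> collected = data;
    data = new ArrayList<>();
    index = 0;
    return collected;
  }

  public static void main(String[] args) {
    DrainFixSketch drain = new DrainFixSketch();
    drain.accept("a");
    drain.get();
    drain.collectTraces();
    drain.accept("b");
    System.out.println(drain.get()); // "b", because the index was reset
    System.out.println(drain.get()); // the stand-in, after a logged warning
  }
}
```

Resetting the index is the real fix in this sketch. The stand-in only keeps a consumer from ever seeing null if the index and the list get out of step again.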

Motivation

In DumpDrain, the collectTraces method replaces the 'data' field with an empty ArrayList but does not reset the 'index' field. If another dump is performed later, the get method reaches the 'return null' statement, and as the comment there states, this can (and does) break the queue.
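
To illustrate the stale index, here is a minimal, single-threaded sketch. It is not the actual DumpDrain code: the class name StaleIndexSketch, the String elements, the resetIndex flag, and the accept/get/collectTraces call order in the helper method are assumptions made for the illustration. Only the 'data' and 'index' fields and the collectTraces and get methods are named after the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the drain (not the real class).
class StaleIndexSketch {
  private final boolean resetIndex; // illustration-only switch: false = old behavior, true = fixed
  private List<String> data = new ArrayList<>();
  private int index = 0;

  StaleIndexSketch(boolean resetIndex) {
    this.resetIndex = resetIndex;
  }

  // Collects an element drained from the queue.
  void accept(String element) {
    data.add(element);
  }

  // Hands collected elements back one at a time.
  String get() {
    if (index < data.size()) {
      return data.get(index++);
    }
    return null; // the branch that must never be reached
  }

  // Swaps in a fresh list; without the reset, index keeps its value from the old list.
  List<String> collectTraces() {
    List<String> collected = data;
    data = new ArrayList<>();
    if (resetIndex) {
      index = 0;
    }
    return collected;
  }

  // Runs two consecutive dumps and returns what get() yields first during the second one.
  static String firstElementOfSecondDump(boolean resetIndex) {
    StaleIndexSketch drain = new StaleIndexSketch(resetIndex);
    drain.accept("a");
    drain.accept("b");
    drain.get();
    drain.get();           // index is now 2
    drain.collectTraces(); // data is replaced by an empty list
    drain.accept("c");
    drain.accept("d");
    return drain.get();
  }

  public static void main(String[] args) {
    System.out.println(firstElementOfSecondDump(false)); // null: index (2) is not below data.size() (2)
    System.out.println(firstElementOfSecondDump(true));  // c
  }
}
```

In this model the second dump fails whenever it collects no more elements than the index left over from the first dump, which matches the two-trace test case described here.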

I noticed this while working on a separate enhancement to dump the pending and long-running traces over JMX: my unit test would reliably hang if it ran after the existing tracer flare dump unit test.

Additional Notes

Here is an example stack trace when the hang happens:

"dd-trace-monitor" #38 daemon prio=5 os_prio=31 tid=0x0000000110e6e000 nid=0x7617 runnable [0x0000000171032000]
   java.lang.Thread.State: RUNNABLE
    at org.jctools.queues.MpscBlockingConsumerArrayQueue.spinWaitForElement(MpscBlockingConsumerArrayQueue.java:634)
    at org.jctools.queues.MpscBlockingConsumerArrayQueue.parkUntilNext(MpscBlockingConsumerArrayQueue.java:566)
    at org.jctools.queues.MpscBlockingConsumerArrayQueue.take(MpscBlockingConsumerArrayQueue.java:482)
    at datadog.trace.core.PendingTraceBuffer$DelayingPendingTraceBuffer$Worker.run(PendingTraceBuffer.java:317)
    at java.lang.Thread.run(Thread.java:750)

Contributor Checklist

Jira ticket: [PROJ-IDENT]

@deejgregor deejgregor requested a review from a team as a code owner October 14, 2025 17:40
@deejgregor deejgregor requested a review from mhlidd October 14, 2025 17:40

@bric3 bric3 left a comment

I believe this is good, but don't take my word on it. Another pair of eyes might be needed.

entries2.size() == 1
def pendingTraceText2 = entries2["pending_traces.txt"] as String
def parsedTraces2 = pendingTraceText2.split('\n').collect { new JsonSlurper().parseText(it) }.flatten()
parsedTraces2.size() == 2

question: Would it be better to assert on the trace IDs as well?

@deejgregor deejgregor Oct 15, 2025

The specific trace IDs don't really matter, so I think it is better to leave them out and not make the test over-specific. We mainly want to make sure:

  1. The dump didn't hang.
  2. We got the expected number of traces (and I'm not even sure that matters that much, but it was there in the original test).

I only generate a second set of traces because the first set is gone from the pending queue by the time the second dump is performed (well, they are still there, actually, but the traces have no pending spans left--they've been written within 500ms).

@PerfectSlayer PerfectSlayer added the "type: enhancement" and "comp: core" labels Oct 16, 2025