-
Notifications
You must be signed in to change notification settings - Fork 314
Avoid pending queue wedge if tracer flare is generated multiple times #9757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Avoid pending queue wedge if tracer flare is generated multiple times #9757
Conversation
In DumpDrain, the collectTraces method replaces the 'data' field with an empty ArrayList, but at the same time, it does not also reset the 'index' field. If another dump is performed later, this leads the get method reaching the 'return null' statement, and as the comment states, this can (and does) break the queue. This change does a few things: - Resets the index in collectTraces when the data field is replaced (and marks the index field as volatile). This should prevent the above situation from happening. - In case the situation still happens, a stand-in CommandElement is returned to avoid returning null. A warning message is also logged. - The existing "testing tracer flare dump with multiple traces" test case is expanded to exercise problem. Here is an example stack trace when the hang happens: "dd-trace-monitor" DataDog#38 daemon prio=5 os_prio=31 tid=0x0000000110e6e000 nid=0x7617 runnable [0x0000000171032000] java.lang.Thread.State: RUNNABLE at org.jctools.queues.MpscBlockingConsumerArrayQueue.spinWaitForElement(MpscBlockingConsumerArrayQueue.java:634) at org.jctools.queues.MpscBlockingConsumerArrayQueue.parkUntilNext(MpscBlockingConsumerArrayQueue.java:566) at org.jctools.queues.MpscBlockingConsumerArrayQueue.take(MpscBlockingConsumerArrayQueue.java:482) at datadog.trace.core.PendingTraceBuffer$DelayingPendingTraceBuffer$Worker.run(PendingTraceBuffer.java:317) at java.lang.Thread.run(Thread.java:750)
dd-trace-core/src/test/groovy/datadog/trace/core/PendingTraceBufferTest.groovy
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is good, but don't take my word on it. Another pair of eyes might be needed.
entries2.size() == 1 | ||
def pendingTraceText2 = entries2["pending_traces.txt"] as String | ||
def parsedTraces2 = pendingTraceText2.split('\n').collect { new JsonSlurper().parseText(it) }.flatten() | ||
parsedTraces2.size() == 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: would it be better to assert on the trace ids as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specific trace IDs don't really matter, so I think it is better to leave them out and not make the test over-specific. We mainly want to make sure:
- The dump didn't hang.
- We got the expected number of traces (and I'm not even sure that matters that much, but it was there in the original test).
I only generate a second set of traces because the first set is gone from the pending queue by the time the second dump is performed (well, they are still there, actually, but the traces have no pending spans left--they've been written within 500ms).
What Does This Do
This change does a few things:
Motivation
In DumpDrain, the collectTraces method replaces the 'data' field with an empty ArrayList, but at the same time, it does not also reset the 'index' field. If another dump is performed later, this leads the get method reaching the 'return null' statement, and as the comment states, this can (and does) break the queue.
I noticed this while working on a separate enhancement to dump the pending and long running traces over JMX and my unit test would reliably hang if it was run after the existing tracer flare dump unit test.
Additional Notes
Here is an example stack trace when the hang happens:
Contributor Checklist
type:
and (comp:
orinst:
) labels in addition to any useful labelsclose
,fix
or any linking keywords when referencing an issue.Use
solves
instead, and assign the PR milestone to the issueJira ticket: [PROJ-IDENT]