tracer: Fix race in spanContext.setSamplingPriority #2271
This function read c.trace.priority without holding a lock, then called functions that modify it, which causes the race detector to report the race below. This race is causing flaky tests in a program that uses dd-trace-go. It is triggered when two goroutines both set ext.ManualKeep on spans that share a root. I think it might plausibly cause a real bug, where c.updated gets set to true even though the priority was not changed.

My attempt to fix it is to move the logic that detects whether the priority was modified into the setSamplingPriority function itself. This might not be the right fix since I don't really understand what this does, but it seems to pass the tests both with and without race detection.

The race detector report without this change:

WARNING: DATA RACE
Write at 0x00c0019a0a88 by goroutine 2435:
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*trace).setSamplingPriorityLocked()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:327 +0xc4
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*trace).push()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:359 +0x23c
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newSpanContext()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:143 +0x904
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*tracer).StartSpan()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/tracer.go:516 +0xba0
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*tracer).newChildSpan()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/tracer_test.go:52 +0x94
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.TestTraceManualKeepRace.func1.1()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span_test.go:441 +0x84

Previous read at 0x00c0019a0a88 by goroutine 2438:
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*spanContext).setSamplingPriority()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:185 +0x1bc
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).setSamplingPriorityLocked()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span.go:263 +0x8c
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).setTagBool()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span.go:388 +0xdc
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).SetTag()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span.go:127 +0x134
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.TestTraceManualKeepRace.func1.1()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span_test.go:442 +0xa8
t.mu.Lock()
defer t.mu.Unlock()
Reviewing this, it seems to me that the original issue comes from checking c.trace.priority != nil && *c.trace.priority != float64(p) outside of the highlighted mutex.
In general, the change seems safe.
That is correct: this fix ensures we only read t.priority while holding the lock, which required returning the boolean from this function.
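For readers following along, here is a minimal sketch of the pattern being discussed: the comparison and the write both happen under the mutex, and the function reports whether the value changed so the caller never reads the field unlocked. This is not the actual dd-trace-go code. The types, the `userKeep` constant, and the exact rule for what counts as "changed" are simplified assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// userKeep is a hypothetical stand-in for a "manual keep" priority value.
const userKeep = 2

// trace is a hypothetical, simplified stand-in for a trace shared by spans.
type trace struct {
	mu       sync.Mutex
	priority *float64 // guarded by mu
}

// setSamplingPriority stores p and reports whether the stored value changed.
// The comparison and the write both happen while mu is held.
// Assumption: the first assignment also counts as a change.
func (t *trace) setSamplingPriority(p int) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	v := float64(p)
	if t.priority != nil && *t.priority == v {
		return false
	}
	t.priority = &v
	return true
}

// spanContext is a hypothetical, simplified stand-in for a span's context.
type spanContext struct {
	trace   *trace
	updated bool // only touched by the goroutine that owns this context
}

// setSamplingPriority relies on the boolean returned by the trace instead of
// reading trace.priority itself.
func (c *spanContext) setSamplingPriority(p int) {
	if c.trace.setSamplingPriority(p) {
		c.updated = true
	}
}

func main() {
	shared := &trace{}
	ctxs := []*spanContext{{trace: shared}, {trace: shared}}

	var wg sync.WaitGroup
	for _, c := range ctxs {
		wg.Add(1)
		go func(c *spanContext) {
			defer wg.Done()
			c.setSamplingPriority(userKeep)
		}(c)
	}
	wg.Wait()

	// Whichever goroutine took the lock first observed the change.
	fmt.Println(ctxs[0].updated != ctxs[1].updated)
}
```

Running a sketch like this with `go run -race` is one way to check that no unlocked read of the shared field remains.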
Thanks for the review. I have applied your suggestions!
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This was flagged stale but I believe this probably can be merged? @darccio or others, is there anything else I can do to help here?
@evanj I think we can merge. Let me update the branch, run the tests and check everything goes fine.
Motivation
Fix flaky tests in programs that use dd-trace-go.