Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracer: Fix race in spanContext.setSamplingPriority #2271

Merged
merged 3 commits into from Nov 15, 2023
Merged

Conversation

evanj
Copy link
Contributor

@evanj evanj commented Oct 16, 2023

What does this PR do?

This function read c.trace.priority without holding a lock, then called functions that modify it. This causes the race detector to detect the following race. This race bug is causing flaky tests in a program that uses dd-trace-go. The race is triggered when two goroutines both set the ext.ManualKeep on spans that share a root. I think it might plausibly cause a real bug bug, where c.updated gets set to true, even though it the priority was not changed.

My attempt to fix it is to move the logic to detect if the priority was modified into the setSamplingPriority function itself. This might not be the right fix since I don't really understand what this does, but it seems to pass the tests both with and without race detection.

The race detector report without this change:

WARNING: DATA RACE
Write at 0x00c0019a0a88 by goroutine 2435:
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*trace).setSamplingPriorityLocked()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:327 +0xc4
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*trace).push()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:359 +0x23c
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newSpanContext()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:143 +0x904
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*tracer).StartSpan()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/tracer.go:516 +0xba0
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*tracer).newChildSpan()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/tracer_test.go:52 +0x94
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.TestTraceManualKeepRace.func1.1()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span_test.go:441 +0x84

Previous read at 0x00c0019a0a88 by goroutine 2438:
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*spanContext).setSamplingPriority()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:185 +0x1bc
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).setSamplingPriorityLocked()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span.go:263 +0x8c
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).setTagBool()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span.go:388 +0xdc
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).SetTag()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span.go:127 +0x134
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.TestTraceManualKeepRace.func1.1()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span_test.go:442 +0xa8

Motivation

Fix flaky tests in programs that use dd-trace-go.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.

For Datadog employees:

  • If this PR touches code that handles credentials of any kind, such as Datadog API keys, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

This function read c.trace.priority without holding a lock, then
called functions that modify it. This causes the race detector to
detect the following race. This race bug is causing flaky tests in a
program that uses dd-trace-go. The race is triggered when two
goroutines both set the ext.ManualKeep on spans that share a root.
I think it might plausibly cause a real bug bug, where c.updated gets
set to true, even though it the priority was not changed.

My attempt to fix it is to move the logic to detect if the priority
was modified into the setSamplingPriority function itself. This might
not be the right fix since I don't really understand what this does,
but it seems to pass the tests both with and without race detection.

The race detector report without this change:

WARNING: DATA RACE
Write at 0x00c0019a0a88 by goroutine 2435:
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*trace).setSamplingPriorityLocked()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:327 +0xc4
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*trace).push()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:359 +0x23c
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.newSpanContext()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:143 +0x904
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*tracer).StartSpan()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/tracer.go:516 +0xba0
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*tracer).newChildSpan()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/tracer_test.go:52 +0x94
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.TestTraceManualKeepRace.func1.1()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span_test.go:441 +0x84

Previous read at 0x00c0019a0a88 by goroutine 2438:
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*spanContext).setSamplingPriority()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/spancontext.go:185 +0x1bc
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).setSamplingPriorityLocked()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span.go:263 +0x8c
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).setTagBool()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span.go:388 +0xdc
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.(*span).SetTag()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span.go:127 +0x134
  gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer.TestTraceManualKeepRace.func1.1()
      /Users/evan.jones/dd/dd-trace-go/ddtrace/tracer/span_test.go:442 +0xa8
@evanj evanj requested a review from a team October 16, 2023 11:26
@pr-commenter
Copy link

pr-commenter bot commented Oct 16, 2023

Benchmarks

Benchmark execution time: 2023-11-15 09:04:27

Comparing candidate commit 5c9520f in PR branch evan.jones/keep-race with baseline commit d14f56f in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 39 metrics, 2 unstable metrics.

ddtrace/tracer/span_test.go Outdated Show resolved Hide resolved
Comment on lines 295 to 296
t.mu.Lock()
defer t.mu.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing this it seems to me that the original issue comes from checking c.trace.priority != nil && *c.trace.priority != float64(p) outside of the highlighted mutex.

In general, the change seems safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct: this fix ensures we only read t.priority while holding the lock, which required returning the boolean from this function.

Copy link
Contributor Author

@evanj evanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I have applied your suggestions!

ddtrace/tracer/span_test.go Outdated Show resolved Hide resolved
Comment on lines 295 to 296
t.mu.Lock()
defer t.mu.Unlock()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct: this fix ensures we only read t.priority while holding the lock, which required returning the boolean from this function.

Copy link

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale Stuck for more than 1 month label Nov 13, 2023
@evanj
Copy link
Contributor Author

evanj commented Nov 14, 2023

This was flagged stale but I believe this probably can be merged? @darccio or others, is there anything else I can do to help here?

@github-actions github-actions bot removed the stale Stuck for more than 1 month label Nov 15, 2023
@darccio
Copy link
Contributor

darccio commented Nov 15, 2023

@evanj I think we can merge. Let me update the branch, run the tests and check everything goes fine.

@darccio darccio merged commit bd5446a into main Nov 15, 2023
51 checks passed
@darccio darccio deleted the evan.jones/keep-race branch November 15, 2023 09:23
@darccio darccio restored the evan.jones/keep-race branch November 16, 2023 10:26
@darccio darccio deleted the evan.jones/keep-race branch November 16, 2023 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants