
Fix object_id usage as thread local key #2096

Merged
@marcotc merged 5 commits into master from fix-object-id-reuse-for-thread-key on Jun 28, 2022

Conversation

@marcotc (Member) commented Jun 17, 2022

I stumbled upon this flaky error: https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/6331/workflows/02b56e62-53d7-4d50-ac12-f3e84194bf8f/jobs/234310/parallel-runs/1

Failures:

  1) Datadog::Tracing::FiberLocalContext#initialize create one fiber-local variable
     Failure/Error: expect { fiber_local_context }.to change { Thread.current.keys.size }.by(1)
       expected `Thread.current.keys.size` to have changed by 1, but was changed by 0
     # ./spec/datadog/tracing/context_provider_spec.rb:100:in `block (3 levels) in <top (required)>'
     # ./spec/spec_helper.rb:216:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:108:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.13.0/lib/webmock/rspec.rb:37:in `block (2 levels) in <top (required)>'

It happens because we use object_id as a thread-local key, and object_id can be reused after garbage collection, causing a collision between a recently garbage-collected instance of FiberLocalContext and a new one.

This PR addresses the issue in the two places where we use object_id as a thread-local key, replacing it with a thread-safe counter that assigns a unique identifier to each instance.
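
As a rough illustration (not the exact diff in this PR), the idea is to generate per-instance keys from a process-wide, mutex-protected counter instead of object_id; the class, constant, and key format below are made up for the example:

# Minimal sketch of the idea (illustrative names, not the PR's exact code):
# generate keys from a monotonically increasing counter that is never reused,
# instead of from object_id, which can be reused after garbage collection.
class FiberLocalContextSketch
  UNIQUE_KEY_MUTEX = Mutex.new
  @unique_key_counter = 0

  # Thread-safe: each new instance gets a fresh integer.
  def self.next_instance_key
    UNIQUE_KEY_MUTEX.synchronize { @unique_key_counter += 1 }
  end

  def initialize
    # e.g. :datadog_context_1, :datadog_context_2, ...
    @key = "datadog_context_#{self.class.next_instance_key}".to_sym
  end

  # Thread#[] is fiber-local storage; one value per Thread/Fiber under @key.
  def local(thread = Thread.current)
    thread[@key] ||= Object.new # stand-in for building a real Context
  end
end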

If there's an easier/cheaper way to get a unique identifier please let me know!

@marcotc marcotc added the bug Involves a bug label Jun 17, 2022
@marcotc marcotc self-assigned this Jun 17, 2022
@marcotc marcotc requested a review from a team June 17, 2022 22:26
@ivoanjo (Member) commented Jun 28, 2022

If there's an easier/cheaper way to get a unique identifier please let me know!

I spent a bit of time thinking about this. There are quite a few abstraction layers here, but here's my current understanding:

  • A given Datadog::Tracing::Tracer gets initialized with a context_provider, which defaults to Datadog::Tracing::DefaultContextProvider. So there's a 1:1 relationship here.
  • Then the DefaultContextProvider keeps a @context which is a FiberLocalContext, again a 1:1 relationship.
  • Each FiberLocalContext wants to have a unique key (@key), which it was (incorrectly) deriving from its object_id.
  • Inside FiberLocalContext, we use this unique key as a key to Thread.current, and use it to store a Datadog::Tracing::Context instance that is unique for every Thread/Fiber.

Flattening the layers a bit, it appears to me that the intention is effectively to have a unique key for each instance of Datadog::Tracing::Tracer.

But since this is a unique key per tracer instance, and we don't expect many tracer instances to exist in a given application -- if we did, we'd need to start thinking about cleaning up leftover contexts, because otherwise they would lead to a memory leak -- I don't think we need a cheaper (performance-wise) way of doing this. In the common case we'll only do this once, or at most a handful of times.

In terms of an easier way, I don't think there is one: https://bugs.ruby-lang.org/issues/12607 (an atomic integer) is what we'd actually want, but it hasn't been accepted yet.
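
To make that layering concrete, here's a rough, self-contained sketch of the relationships described above (class names simplified and bodies made up; not the actual dd-trace-rb source):

# Rough sketch of the layering (simplified; not the real dd-trace-rb source).
Context = Class.new # stand-in for Datadog::Tracing::Context

class FiberLocalContext
  def initialize
    # Pre-PR behaviour: key derived from object_id, which is what this PR fixes.
    @key = "datadog_context_#{object_id}".to_sym
  end

  # One Context per Thread/Fiber, stored under this instance's key.
  # Thread#[] / Thread#[]= are fiber-local storage in Ruby.
  def local(thread = Thread.current)
    thread[@key] ||= Context.new
  end
end

class DefaultContextProvider
  def initialize
    @context = FiberLocalContext.new # 1:1 with the provider
  end
end

class Tracer
  def initialize(context_provider: DefaultContextProvider.new)
    @context_provider = context_provider # 1:1 with the tracer
  end
end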

@ivoanjo (Member) commented Jun 28, 2022

Also, this is not in the part of the code you changed, but is this a potential bug?

# Return the local context.
def context(key = nil)
  current_context = key.nil? ? @context.local : @context.local(key)

  # Rebuild/reset context after a fork
  #
  # We don't want forked processes to copy and retransmit spans
  # that were generated from the parent process. Reset it such
  # that it acts like a distributed trace.
  current_context.after_fork! do
    current_context = self.context = current_context.fork_clone
  end

  current_context
end

In particular, when key is another thread, it seems weird to then say "whatever trace that other thread was running, I'm also running it now".

Should we only run the after_fork! block when getting the context of the current thread? (e.g. key == nil || key == Thread.current?)
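
For illustration, that guard could look something like this (a sketch of the suggestion, not the change that was merged):

# Sketch of the suggested guard: only adopt the forked clone when fetching
# the current thread's own context, not when inspecting another thread's.
def context(key = nil)
  current_context = key.nil? ? @context.local : @context.local(key)

  if key.nil? || key == Thread.current
    current_context.after_fork! do
      current_context = self.context = current_context.fork_clone
    end
  end

  current_context
end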

@ivoanjo (Member) left a review comment:

Seems reasonable to me!

Review comment on lib/datadog/tracing/context_provider.rb (outdated, resolved)
Co-authored-by: Ivo Anjo <ivo.anjo@datadoghq.com>
@marcotc (Member, Author) commented Jun 28, 2022

Should we only run the after_fork! block when getting the context of the current thread? (e.g. key == nil || key == Thread.current?)

This seems reasonable. I'll add a comment for this.

@codecov-commenter

Codecov Report

Merging #2096 (7475b93) into master (39441fb) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #2096   +/-   ##
=======================================
  Coverage   97.50%   97.50%           
=======================================
  Files        1033     1033           
  Lines       53260    53276   +16     
=======================================
+ Hits        51933    51949   +16     
  Misses       1327     1327           
Impacted Files Coverage Δ
...b/datadog/opentracer/thread_local_scope_manager.rb 100.00% <100.00%> (ø)
lib/datadog/tracing/context_provider.rb 96.29% <100.00%> (+1.05%) ⬆️


@marcotc marcotc merged commit dc8cd58 into master Jun 28, 2022
@marcotc marcotc deleted the fix-object-id-reuse-for-thread-key branch June 28, 2022 21:50
@github-actions github-actions bot added this to the 1.2.0 milestone Jun 28, 2022