Backport synchronization of @components creation and access to master #1418

Merged
ivoanjo merged 10 commits from ivoanjo/backport-components-initialization-fix-v2 into master on Mar 19, 2021

Conversation


@ivoanjo commented Mar 19, 2021

This PR backports the work done in #1329, #1387 and #1411 in the context of the feature/profiling branch to the master branch.

I've explicitly kept the merge commits in the history because I think the PR discussions on all three were very relevant. If someone is looking at this in the future, #1373 (never merged) also contains a lot of relevant discussion.

Hopefully the final diff looks reasonably understandable and we can close this chapter for a while.

While experimenting with the Ruby profiler, I noticed that sometimes more
than one profiler was being started.

Upon closer investigation, I identified this as a race between the
profiler's sampling thread and the thread that was initializing Datadog APM:
if initialization of the components was not "fast enough", the
`Datadog::Profiling::Collectors::Stack` would run and try to access
`Datadog.tracer` (stack.rb:140), which would trigger a new
initialization of the components, including starting a second profiler.

This can be triggered reliably by injecting an artificial slowdown
into component initialization -- change configuration.rb from

```ruby
    def components
      @components ||= build_components(configuration)
    end
```

to

```ruby
    def components
      @components ||= build_components(configuration).tap { sleep 0.1 }
    end
```

and then by running a trivial application:

```console
$ DD_PROFILING_ENABLED=true bundle exec bin/ddtracerb exec ruby -e 'sleep 1; puts Thread.list.count'
203
$ DD_PROFILING_ENABLED=true bundle exec bin/ddtracerb exec ruby -e 'sleep 1; puts Thread.list.count'
195
```

Every initialization starts a new profiler, which starts two new
threads, and due to the race, many more threads get started than
expected -- and the value is not deterministic.

To fix this, I've added a mutex to protect all access to
`@components`. Thus, adding the same `.tap { sleep 0.1 }` now
yields the three expected threads (one for user code, two for the
profiler):

```console
$ DD_PROFILING_ENABLED=true bundle exec bin/ddtracerb exec ruby -e 'sleep 1; puts Thread.list.count'
3
```
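
For illustration, the fix boils down to something like the following minimal sketch, reusing the `components`/`build_components` names from the snippets above (the actual configuration.rb grew extra handling, described in the commits that follow):

```ruby
    COMPONENTS_LOCK = Mutex.new

    def components
      # Only one thread at a time can build (or see a half-built) @components.
      COMPONENTS_LOCK.synchronize do
        @components ||= build_components(configuration)
      end
    end
```
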
As discussed during PR review, it's undesirable for a bug in ddtrace to
cause an exception (in this case, a ThreadError) during initialization.

This commit introduces #safely_synchronize, which both grabs the
COMPONENTS_LOCK and safely catches any ThreadError caused by it,
logging it so that users can report issues to us.
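
Roughly, per the description above, it looks like this (a sketch only; `logger_without_components` is a made-up name for whatever fallback logger is used, and the exact message/rescue details differ in the real code):

```ruby
    def safely_synchronize
      COMPONENTS_LOCK.synchronize do
        yield
      end
    rescue ThreadError => e
      # Re-acquiring COMPONENTS_LOCK from the same thread raises a ThreadError
      # ("deadlock; recursive locking"); log it instead of letting a ddtrace
      # bug crash the host application.
      # logger_without_components: placeholder for a fallback logger.
      logger_without_components.warn(
        'Detected deadlock during ddtrace initialization. ' \
        "Please report this at https://github.com/DataDog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug\n" \
        "Source:\n#{Array(e.backtrace).join("\n")}"
      )
      nil
    end
```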

[PROF-2711] Fix multiple profilers starting in the same process

In #1329 we added locking to component initialization and made sure
that the logger method would:

a) not trigger component initialization
b) grab the lock to access the components after components were
   initialized

Unfortunately, b) has a corner case -- what happens when components are
being replaced (e.g. during reconfiguration) if a component being
replaced tries to log?

The answer is -- a deadlock:

```
E, [2021-03-01T09:43:26.970633 #18619] ERROR -- ddtrace: [ddtrace]
(lib/ddtrace/configuration.rb:113:in `rescue in block in
safely_synchronize') Detected deadlock during ddtrace initialization.
Please report this at https://github.com/DataDog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug
  Source:
  lib/ddtrace/configuration.rb:109:in `synchronize' # <-- boom
  lib/ddtrace/configuration.rb:109:in `safely_synchronize'
  lib/ddtrace/configuration.rb:101:in `components'
  lib/ddtrace/configuration.rb:61:in `logger' # <-- new component tries to log
  lib/ddtrace/workers/async.rb:124:in `start_worker'
  lib/ddtrace/workers/async.rb:115:in `block in start_async'
  lib/ddtrace/workers/async.rb:105:in `synchronize'
  lib/ddtrace/workers/async.rb:105:in `start_async'
  lib/ddtrace/workers/async.rb:21:in `perform'
  lib/ddtrace/workers/polling.rb:19:in `perform'
  lib/ddtrace/profiling/collectors/stack.rb:47:in `start'
  lib/ddtrace/profiling/profiler.rb:14:in `each'
  lib/ddtrace/profiling/profiler.rb:14:in `start'
  lib/ddtrace/configuration/components.rb:187:in `startup!'
  lib/ddtrace/configuration.rb:138:in `replace_components!' # <-- components are being replaced
  lib/ddtrace/configuration.rb:40:in `block in configure'
  lib/ddtrace/configuration.rb:111:in `block in safely_synchronize'
  lib/ddtrace/configuration.rb:109:in `synchronize'
  lib/ddtrace/configuration.rb:109:in `safely_synchronize'
  lib/ddtrace/configuration.rb:36:in `configure'
  lib/ddtrace/contrib/extensions.rb:26:in `configure' # <-- configuration requested
```

To avoid this issue, we detect when the logger is being requested
by a thread that owns the COMPONENTS_LOCK (and thus is in the process
of modifying the components) and in that case we skip locking and
directly grab the existing components and logger.
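
A sketch of that ownership check (`Mutex#owned?` is a standard library method; `fallback_logger` is a placeholder name, not the real ddtrace method):

```ruby
    def logger
      if COMPONENTS_LOCK.owned?
        # This thread is in the middle of building/replacing the components
        # (it already holds the lock), so read @components directly instead
        # of deadlocking on a lock we already own.
        @components ? @components.logger : fallback_logger # fallback_logger: placeholder
      else
        components.logger # normal path: #components takes the lock itself
      end
    end
```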

Fix deadlock when trying to log during component replacement

In #1329, after running into issues where background threads caused the
`@components` to be initialized multiple times, we decided to add
locking to prevent this from happening.

By design, we forced every access to `@components` to go through a
lock (even reads) with the intent that:

1. during configuration (i.e. `@components` was previously `nil`) other
   threads would block until the `@components` were available

2. during reconfiguration (i.e. `@components` was not `nil`) other
   threads would still block until the new `@components` were available

The first issue we ran into was #1387, where the thread doing
reconfiguration could try to access the `@components` (e.g. to log)
and run into a deadlock.

And then in
<#1373 (comment)>
we discovered that due to the way the `@components` can be accessed in
background threads, we could cause deadlocks if we blocked those
background threads during reconfiguration (case 2 above).

Hence, we arrive at the third attempt to get this correct. We change
the semantics so that threads that are competing to write the
`@components` still block and those operations are sequential, but
we allow concurrent reads of existing `@components` to proceed.

This means that **during reconfiguration** (case 2 above) background
threads can still access all the components as needed (these
components may be in the process of shutting down, but that was always
a possibility).

This avoids the deadlock observed in
<#1373 (comment)>
because the background thread would be able to get the
`Datadog.health_metrics` and finish its work.
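
A sketch of those semantics, reusing the names from the earlier snippets and backtrace (`COMPONENTS_WRITE_LOCK` and `shutdown!` are illustrative names, not necessarily the real ones):

```ruby
    COMPONENTS_WRITE_LOCK = Mutex.new

    def components
      current_components = @components
      return current_components if current_components # concurrent reads need no lock

      # Writers (the initial build) are still serialized behind the lock.
      COMPONENTS_WRITE_LOCK.synchronize do
        @components ||= build_components(configuration)
      end
    end

    def replace_components!(new_configuration)
      COMPONENTS_WRITE_LOCK.synchronize do
        old_components = @components
        @components = build_components(new_configuration)
        # Readers may still be using old_components while it shuts down, but as
        # noted above, that was always a possibility.
        old_components.shutdown! if old_components
      end
    end
```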

Note that there are still two possible issues lurking here, regarding
case 1 above: because there is nothing to read when `@components` was
previously `nil`, we still make threads block on the write lock.

This means that:
1. If the thread that is initializing the components tries to access
   a component, we're still going to fail. Basically, the same issue
   we ran into in #1387 can happen with any other component, e.g.
   trying to use `health_metrics` during initialization.

2. If for some reason during initialization there's more than one
   thread trying to access the `@components`, and there are multiple
   locks involved, we can still deadlock. (For instance, consider a
   background thread that got initialized, grabbed a lock of its own,
   and then tries to grab `health_metrics`, while the main thread is
   still doing initialization and tries to grab the lock that the
   background thread holds; see the sketch after this list.)
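
To make scenario 2 concrete, here is a minimal standalone reproduction of that kind of lock-ordering deadlock (nothing here is ddtrace code; the names are made up):

```ruby
# Running this hangs both threads; MRI then aborts with its built-in
# "deadlock detected" fatal error, since no thread can make progress.
components_lock = Mutex.new
worker_lock = Mutex.new

background = Thread.new do
  worker_lock.synchronize do          # background thread grabs its own lock...
    sleep 0.1
    components_lock.synchronize { }   # ...then waits for the "components" lock
  end
end

components_lock.synchronize do        # main thread is "initializing"...
  sleep 0.1
  worker_lock.synchronize { }         # ...and waits for the background thread's lock
end

background.join
```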

To solve the issues above, I think we'd need to go in a different
direction: either have a lock per component or move towards dependency
injection.

Correctly synchronize @components creation and access, third try
@ivoanjo requested a review from a team March 19, 2021 12:00
@ivoanjo commented Mar 19, 2021

@marcotc left a comment

🥇

@ivoanjo commented Mar 19, 2021

https://youtu.be/Xu3FTEmN-eg?t=119

@ivoanjo merged commit c0762e7 into master Mar 19, 2021
@ivoanjo deleted the ivoanjo/backport-components-initialization-fix-v2 branch March 19, 2021 17:20
@github-actions bot added this to the 0.47.0 milestone Mar 19, 2021
@marcotc added the core (Involves Datadog core libraries) label Mar 22, 2021