Backport synchronization of @components creation and access to master #1418

Merged
ivoanjo merged 10 commits from ivoanjo/backport-components-initialization-fix-v2 into master on Mar 19, 2021

Conversation


@ivoanjo commented Mar 19, 2021

This PR backports the work done in #1329, #1387 and #1411 in the context of the feature/profiling branch to the master branch.

I've explicitly kept the merge commits in the history because I think the PR discussions on all three were very relevant. If someone is looking at this in the future, #1373 (never merged) also contains a lot of relevant discussion.

Hopefully the final diff looks reasonably understandable and we can close this chapter for a while.

While experimenting with the Ruby profiler, I noticed that sometimes more
than one profiler was being started.

Upon closer investigation, I identified this as a race between the
profiler's sampling thread and the thread that was initializing Datadog APM:
if initialization of the components was not "fast enough", the
`Datadog::Profiling::Collectors::Stack` would run and try to access
`Datadog.tracer` (stack.rb:140), which would trigger a new
initialization of the components, including starting a second profiler.

This can be triggered reliably by injecting an artificial slowdown
into component initialization -- change configuration.rb from

```ruby
    def components
      @components ||= build_components(configuration)
    end
```

to

```ruby
    def components
      @components ||= build_components(configuration).tap { sleep 0.1 }
    end
```

and then by running a trivial application:

```console
$ DD_PROFILING_ENABLED=true bundle exec bin/ddtracerb exec ruby -e 'sleep 1; puts Thread.list.count'
203
$ DD_PROFILING_ENABLED=true bundle exec bin/ddtracerb exec ruby -e 'sleep 1; puts Thread.list.count'
195
```

Every initialization starts a new profiler, which starts two new
threads, and due to the race, many more threads get started than
expected -- and the value is not deterministic.

To fix this, I've added a mutex to protect all access to
`@components`. Thus, adding the same `.tap { sleep 0.1 }` now
yields the three expected threads (one for user code, two for the
profiler):

```console
$ DD_PROFILING_ENABLED=true bundle exec bin/ddtracerb exec ruby -e 'sleep 1; puts Thread.list.count'
3
```
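
For illustration, the fix boils down to something like the following minimal sketch, reusing the `components`/`build_components` names from the snippets above (the actual configuration.rb grew extra handling, described in the commits that follow):

```ruby
    COMPONENTS_LOCK = Mutex.new

    def components
      # Only one thread at a time can build (or see a half-built) @components.
      COMPONENTS_LOCK.synchronize do
        @components ||= build_components(configuration)
      end
    end
```
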
As discussed during PR review, it's undesirable for a bug in ddtrace to
cause an exception (in this case, a ThreadError) during initialization.

This commit introduces #safely_synchronize, which both grabs the
COMPONENTS_LOCK and safely catches any ThreadError caused by it,
logging it so that users can report issues to us.
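
Roughly, per the description above, it looks like this (a sketch only; `logger_without_components` is a made-up name for whatever fallback logger is used, and the exact message/rescue details differ in the real code):

```ruby
    def safely_synchronize
      COMPONENTS_LOCK.synchronize do
        yield
      end
    rescue ThreadError => e
      # Re-acquiring COMPONENTS_LOCK from the same thread raises a ThreadError
      # ("deadlock; recursive locking"); log it instead of letting a ddtrace
      # bug crash the host application.
      # logger_without_components: placeholder for a fallback logger.
      logger_without_components.warn(
        'Detected deadlock during ddtrace initialization. ' \
        "Please report this at https://github.com/DataDog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug\n" \
        "Source:\n#{Array(e.backtrace).join("\n")}"
      )
      nil
    end
```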

[PROF-2711] Fix multiple profilers starting in the same process

In #1329 we added locking to component initialization and made sure
that the logger method would:

a) not trigger component initialization
b) grab the lock to access the components after components were
   initialized

Unfortunately, b) has a corner case -- what happens when components are
being replaced (e.g. during reconfiguration) if a component being
replaced tries to log?

The answer is -- a deadlock:

```
E, [2021-03-01T09:43:26.970633 #18619] ERROR -- ddtrace: [ddtrace]
(lib/ddtrace/configuration.rb:113:in `rescue in block in
safely_synchronize') Detected deadlock during ddtrace initialization.
Please report this at https://github.com/DataDog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug
  Source:
  lib/ddtrace/configuration.rb:109:in `synchronize' # <-- boom
  lib/ddtrace/configuration.rb:109:in `safely_synchronize'
  lib/ddtrace/configuration.rb:101:in `components'
  lib/ddtrace/configuration.rb:61:in `logger' # <-- new component tries to log
  lib/ddtrace/workers/async.rb:124:in `start_worker'
  lib/ddtrace/workers/async.rb:115:in `block in start_async'
  lib/ddtrace/workers/async.rb:105:in `synchronize'
  lib/ddtrace/workers/async.rb:105:in `start_async'
  lib/ddtrace/workers/async.rb:21:in `perform'
  lib/ddtrace/workers/polling.rb:19:in `perform'
  lib/ddtrace/profiling/collectors/stack.rb:47:in `start'
  lib/ddtrace/profiling/profiler.rb:14:in `each'
  lib/ddtrace/profiling/profiler.rb:14:in `start'
  lib/ddtrace/configuration/components.rb:187:in `startup!'
  lib/ddtrace/configuration.rb:138:in `replace_components!' # <-- components are being replaced
  lib/ddtrace/configuration.rb:40:in `block in configure'
  lib/ddtrace/configuration.rb:111:in `block in safely_synchronize'
  lib/ddtrace/configuration.rb:109:in `synchronize'
  lib/ddtrace/configuration.rb:109:in `safely_synchronize'
  lib/ddtrace/configuration.rb:36:in `configure'
  lib/ddtrace/contrib/extensions.rb:26:in `configure' # <-- configuration requested
```

To avoid this issue, we detect when the logger is being requested
by a thread that owns the COMPONENTS_LOCK (and thus is in the process
of modifying the components) and in that case we skip locking and
directly grab the existing components and logger.
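
A sketch of that ownership check (`Mutex#owned?` is a standard library method; `fallback_logger` is a placeholder name, not the real ddtrace method):

```ruby
    def logger
      if COMPONENTS_LOCK.owned?
        # This thread is in the middle of building/replacing the components
        # (it already holds the lock), so read @components directly instead
        # of deadlocking on a lock we already own.
        @components ? @components.logger : fallback_logger # fallback_logger: placeholder
      else
        components.logger # normal path: #components takes the lock itself
      end
    end
```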

Fix deadlock when trying to log during component replacement

In #1329, after running into issues where background threads caused the
`@components` to be initialized multiple times, we decided to add
locking to prevent this from happening.

By design, we forced every access to `@components` to go through a
lock (even reads) with the intent that:

1. during configuration (i.e. `@components` was previously `nil`) other
   threads would block until the `@components` were available

2. during reconfiguration (i.e. `@components` was not `nil`) other
   threads would still block until the new `@components` were available

The first issue we ran into was #1387, where the thread doing
reconfiguration could try to access the `@components` (e.g. to log)
and run into a deadlock.

And then in
<#1373 (comment)>
we discovered that due to the way the `@components` can be accessed in
background threads, we could cause deadlocks if we blocked those
background threads during reconfiguration (case 2 above).

Hence, we arrive at the third attempt to get this correct. We change
the semantics so that threads that are competing to write the
`@components` still block and those operations are sequential, but
we allow concurrent reads of existing `@components` to proceed.

This means that **during reconfiguration** (case 2 above) background
threads can still access all the components as needed (these
components may be in the process of shutting down, but that was always
a possibility).

This avoids the deadlock observed in
<#1373 (comment)>
because the background thread would be able to get the
`Datadog.health_metrics` and finish its work.
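
A sketch of those semantics, reusing the names from the earlier snippets and backtrace (`COMPONENTS_WRITE_LOCK` and `shutdown!` are illustrative names, not necessarily the real ones):

```ruby
    COMPONENTS_WRITE_LOCK = Mutex.new

    def components
      current_components = @components
      return current_components if current_components # concurrent reads need no lock

      # Writers (the initial build) are still serialized behind the lock.
      COMPONENTS_WRITE_LOCK.synchronize do
        @components ||= build_components(configuration)
      end
    end

    def replace_components!(new_configuration)
      COMPONENTS_WRITE_LOCK.synchronize do
        old_components = @components
        @components = build_components(new_configuration)
        # Readers may still be using old_components while it shuts down, but as
        # noted above, that was always a possibility.
        old_components.shutdown! if old_components
      end
    end
```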

Note that there are still two possible issues lurking here, regarding
case 1 above: because there is nothing to read when `@components` was
previously `nil`, we still make threads block on the write lock.

This means that:
1. If the thread that is initializing the components tries to access
   a component, we're still going to fail. Basically, the same issue
   we ran into in #1387 can happen with any other component, e.g.
   trying to use `health_metrics` during initialization.

2. If for some reason during initialization there's more than one
   thread trying to access the `@components`, and there are multiple
   locks involved, we can still deadlock. (For instance, consider a
   background thread that got initialized, grabbed a lock of its own,
   and then tries to grab `health_metrics`, while the main thread is
   still doing initialization and tries to grab the lock that the
   background thread holds; see the sketch after this list.)
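
To make scenario 2 concrete, here is a minimal standalone reproduction of that kind of lock-ordering deadlock (nothing here is ddtrace code; the names are made up):

```ruby
# Running this hangs both threads; MRI then aborts with its built-in
# "deadlock detected" fatal error, since no thread can make progress.
components_lock = Mutex.new
worker_lock = Mutex.new

background = Thread.new do
  worker_lock.synchronize do          # background thread grabs its own lock...
    sleep 0.1
    components_lock.synchronize { }   # ...then waits for the "components" lock
  end
end

components_lock.synchronize do        # main thread is "initializing"...
  sleep 0.1
  worker_lock.synchronize { }         # ...and waits for the background thread's lock
end

background.join
```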

To solve the issues above, I think we'd need to go in a different
direction: either have a lock per component or move towards dependency
injection.

Correctly synchronize @components creation and access, third try
@ivoanjo requested a review from a team March 19, 2021 12:00
@ivoanjo commented Mar 19, 2021

@marcotc left a comment

🥇

@ivoanjo commented Mar 19, 2021

https://youtu.be/Xu3FTEmN-eg?t=119

@ivoanjo merged commit c0762e7 into master Mar 19, 2021
@ivoanjo deleted the ivoanjo/backport-components-initialization-fix-v2 branch March 19, 2021 17:20
@github-actions bot added this to the 0.47.0 milestone Mar 19, 2021
@marcotc added the core (Involves Datadog core libraries) label Mar 22, 2021