Backport synchronization of @components creation and access to master #1418

Merged: ivoanjo merged 10 commits into master from ivoanjo/backport-components-initialization-fix-v2 on Mar 19, 2021
Conversation
While experimenting with the Ruby profiler, I noticed that sometimes more than one profiler was being started. Upon closer investigation, I identified this as a race between the profiler's sampling thread and the thread that was initializing the Datadog APM: if initialization of components was not "fast enough", the `Datadog::Profiling::Collectors::Stack` would run and try to access `Datadog.tracer` (stack.rb:140), which would trigger a new initialization of the components, including starting a second profiler.

This can be triggered reliably by injecting an artificial slowdown into component initialization -- change configuration.rb from

```ruby
def components
  @components ||= build_components(configuration)
end
```

to

```ruby
def components
  @components ||= build_components(configuration).tap { sleep 0.1 }
end
```

and then by running a trivial application:

```
$ DD_PROFILING_ENABLED=true bundle exec bin/ddtracerb exec ruby -e 'sleep 1; puts Thread.list.count'
203
$ DD_PROFILING_ENABLED=true bundle exec bin/ddtracerb exec ruby -e 'sleep 1; puts Thread.list.count'
195
```

Every initialization starts a new profiler, which starts two new threads, and due to the race, many more threads get started than expected -- and the count is not deterministic.

To fix this, I've added a mutex to protect all access to `@components`. Thus, adding the same `.tap { sleep 0.1 }` now yields the three expected threads (one for user code, two for the profiler):

```
$ DD_PROFILING_ENABLED=true bundle exec bin/ddtracerb exec ruby -e 'sleep 1; puts Thread.list.count'
3
```
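The fix described above can be sketched as follows. This is a minimal, hypothetical reconstruction (class and method names are illustrative, not the actual ddtrace code): wrapping the memoized read in a `Mutex` ensures the build runs at most once, even when many threads race to call `#components` concurrently.

```ruby
# Sketch of mutex-protected lazy initialization. Without @lock, the
# `@components ||=` check-then-set is racy and can build twice.
class Configuration
  attr_reader :build_count

  def initialize
    @lock = Mutex.new
    @build_count = 0
  end

  def components
    @lock.synchronize do
      @components ||= build_components
    end
  end

  private

  def build_components
    @build_count += 1
    sleep 0.1 # simulate the artificial slowdown from the reproduction above
    Object.new
  end
end

config = Configuration.new
results = 10.times.map { Thread.new { config.components } }.map(&:value)
puts results.uniq.size # prints 1: all threads saw the same components object
```

With the lock removed, the same driver typically reports several distinct objects and a `build_count` greater than one, mirroring the extra-profiler symptom above.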
As discussed during PR review, it's undesirable for a bug in ddtrace to cause an exception (in this case, a ThreadError) during initialization. This commit introduces #safely_synchronize, which both grabs the COMPONENTS_LOCK and catches any ThreadError raised while doing so, logging the error so that users can report the issue to us.
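A sketch of what `#safely_synchronize` might look like (assumed shape only; the constant and message mirror the commit text, but this is not the exact ddtrace implementation): take the lock, and turn a `ThreadError` (which Ruby's `Mutex#synchronize` raises on recursive locking) into a logged error instead of an exception in the host application.

```ruby
require 'logger'

COMPONENTS_LOCK = Mutex.new
FALLBACK_LOGGER = Logger.new($stderr)

# Grab COMPONENTS_LOCK; if a bug causes recursive locking, log the
# ThreadError and return nil rather than crashing initialization.
def safely_synchronize
  COMPONENTS_LOCK.synchronize { yield }
rescue ThreadError => e
  FALLBACK_LOGGER.error(
    'Detected deadlock during ddtrace initialization. ' \
    "Please report this issue. Error: #{e.message}"
  )
  nil
end

# Normal use works as expected:
ok = safely_synchronize { :ok }

# A buggy recursive call logs an error and degrades to nil instead of raising:
result = safely_synchronize { safely_synchronize { :never_reached } }
```

Degrading to `nil` is a deliberate trade-off: a broken tracer setup is preferable to an exception propagating into user code during boot.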
[PROF-2711] Fix multiple profilers starting in the same process
In #1329 we added locking to component initialization and made sure that the logger method would:

a) not trigger component initialization
b) grab the lock to access the components after components were initialized

Unfortunately, b) has a corner case -- what happens when components are being replaced (e.g. during reconfiguration) if a component being replaced tries to log? The answer is -- a deadlock:

```
E, [2021-03-01T09:43:26.970633 #18619] ERROR -- ddtrace: [ddtrace] (lib/ddtrace/configuration.rb:113:in `rescue in block in safely_synchronize') Detected deadlock during ddtrace initialization. Please report this at https://github.com/DataDog/dd-trace-rb/blob/master/CONTRIBUTING.md#found-a-bug
Source:
lib/ddtrace/configuration.rb:109:in `synchronize'           # <-- boom
lib/ddtrace/configuration.rb:109:in `safely_synchronize'
lib/ddtrace/configuration.rb:101:in `components'
lib/ddtrace/configuration.rb:61:in `logger'                  # <-- new component tries to log
lib/ddtrace/workers/async.rb:124:in `start_worker'
lib/ddtrace/workers/async.rb:115:in `block in start_async'
lib/ddtrace/workers/async.rb:105:in `synchronize'
lib/ddtrace/workers/async.rb:105:in `start_async'
lib/ddtrace/workers/async.rb:21:in `perform'
lib/ddtrace/workers/polling.rb:19:in `perform'
lib/ddtrace/profiling/collectors/stack.rb:47:in `start'
lib/ddtrace/profiling/profiler.rb:14:in `each'
lib/ddtrace/profiling/profiler.rb:14:in `start'
lib/ddtrace/configuration/components.rb:187:in `startup!'
lib/ddtrace/configuration.rb:138:in `replace_components!'    # <-- components are being replaced
lib/ddtrace/configuration.rb:40:in `block in configure'
lib/ddtrace/configuration.rb:111:in `block in safely_synchronize'
lib/ddtrace/configuration.rb:109:in `synchronize'
lib/ddtrace/configuration.rb:109:in `safely_synchronize'
lib/ddtrace/configuration.rb:36:in `configure'
lib/ddtrace/contrib/extensions.rb:26:in `configure'          # <-- configuration requested
```

To avoid this issue, we detect when the logger is being requested by a thread that already owns the COMPONENTS_LOCK (and thus is in the process of modifying the components) and in that case we skip locking and directly grab the existing components and logger.
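The owner-detection described above can be sketched with `Mutex#owned?` (a simplified, hypothetical reconstruction; `ComponentSet` and the fallback logger are stand-ins, not ddtrace's real structures):

```ruby
require 'logger'

COMPONENTS_LOCK = Mutex.new
ComponentSet = Struct.new(:logger) # stand-in for ddtrace's component set

def fallback_logger
  @fallback_logger ||= Logger.new($stderr)
end

def logger
  if COMPONENTS_LOCK.owned?
    # This thread is mid-way through building/replacing components; locking
    # again would raise ThreadError (recursive lock). Read directly instead.
    current = @components
    current ? current.logger : fallback_logger
  else
    COMPONENTS_LOCK.synchronize do
      (@components ||= ComponentSet.new(Logger.new($stderr))).logger
    end
  end
end

# A component that logs during its own startup no longer deadlocks:
COMPONENTS_LOCK.synchronize do
  @components = ComponentSet.new(Logger.new($stderr))
  logger.info('component starting up') # takes the owned? branch
end
```

Replaying the stack trace above against this sketch: `replace_components!` holds the lock when `start_worker` calls `logger`, so the `owned?` branch short-circuits the recursive `synchronize` that previously went boom.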
…-during-component-replacement Fix deadlock when trying to log during component replacement
In #1329, after running into issues where background threads caused the `@components` to be initialized multiple times, we decided to add locking to prevent this from happening. By design, we forced every access to `@components` to go through a lock (even reads) with the intent that:

1. during configuration (e.g. when `@components` was previously `nil`), other threads would block until the `@components` were available
2. during reconfiguration (e.g. when `@components` was not `nil`), other threads would still block until the new `@components` were available

The first issue we ran into was #1387, where the thread doing reconfiguration could try to access the `@components` (e.g. to log) and ran into a deadlock. And then in #1373 (comment) we discovered that, due to the way the `@components` can be accessed in background threads, we could cause deadlocks if we blocked those background threads during reconfiguration (case 2 above).

Hence, we arrive at the third attempt to get this correct. We change the semantics so that threads competing to write the `@components` still block and those operations remain sequential, but we allow concurrent reads of existing `@components` to proceed. This means that **during reconfiguration** (case 2 above) background threads can still access all the components as needed (these components may be in the process of shutting down, but that was always a possibility). This avoids the deadlock observed in #1373 (comment), because the background thread would be able to get the `Datadog.health_metrics` and finish its work.

Note that there are still two possible issues lying in hiding here, regarding case 1 above. Because there is nothing to read when `@components` was previously `nil`, we're still making threads block on the write lock. This means that:

1. If the thread that is initializing the components tries to access a component, we're still going to fail. Basically, the same issue we ran into in #1387 can happen with any other component, e.g. trying to use `health_metrics` during initialization.
2. If for some reason during initialization there's more than one thread trying to access the `@components`, and there are multiple locks involved, we can still fail. (For instance, consider that a background thread got initialized, grabbed a lock on itself, and then tries to grab the `health_metrics`, while the main thread is still doing initialization and tries to grab the lock that the background thread holds.)

To solve the issues above, I think we'd need to go in a separate direction: either have a lock-per-component OR go in the dependency injection direction.
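The "sequential writes, concurrent reads" semantics of this third attempt can be sketched as follows (a simplified, hypothetical reconstruction; `Components`, `build_components`, and the lock name are illustrative):

```ruby
COMPONENTS_WRITE_LOCK = Mutex.new

Components = Struct.new(:id) do
  def shutdown!; end # stand-in for stopping workers, flushing buffers, etc.
end

def build_components
  Components.new(rand(1_000_000))
end

def components
  # Concurrent read path: once components exist, readers never take the
  # lock, so background threads are not blocked during reconfiguration.
  current = @components
  return current if current

  # Case 1 (first initialization): nothing to read yet, so block on the
  # write lock -- this is where the two residual issues above still lurk.
  COMPONENTS_WRITE_LOCK.synchronize { @components ||= build_components }
end

def replace_components!
  COMPONENTS_WRITE_LOCK.synchronize do
    old = @components
    @components = build_components
    # Readers may briefly keep using `old` while it shuts down; as noted
    # above, that was always a possibility.
    old && old.shutdown!
  end
end

first = components
replace_components!
second = components # readers now observe the replacement, without locking
```

Reading a plain instance variable without the lock is safe here because an already-published `@components` reference is only ever swapped, never mutated in place, by the serialized writers.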
…reation-and-access Correctly synchronize @components creation and access, third try
marcotc approved these changes on Mar 19, 2021
🥇
This PR backports the work done in #1329, #1387 and #1411 in the context of the feature/profiling branch to the master branch.

I've explicitly kept the merge commits in the history because I think the PR discussions on all three were very relevant. If someone is looking at this in the future, #1373 (never merged) also contains a lot of relevant discussion.
Hopefully the final diff looks reasonably understandable and we can close this chapter for a while.