
[PROF-9136] Fix rare remote configuration worker thread leak #3519

Merged: @ivoanjo merged 4 commits into master on Mar 13, 2024

Conversation

@ivoanjo (Member) commented on Mar 12, 2024:

**What does this PR do?**

This PR fixes the remote configuration worker thread leak that is discussed in more detail in the internal
"dd-trace-rb Remote::Worker thread leak investigation" doc.

In a nutshell, there's a potential race between threads that call `Datadog::Core::Remote::Tie.boot` and threads that call `Datadog.configure` that can, in really unlucky situations, mean that the `Remote::Worker` thread gets leaked during library reconfiguration.

The likelihood of hitting this rare issue is O(number of times `Datadog.configure` gets called) while requests are being processed in a web app.

I was only able to reproduce it with the weird reproducer below, and by adding `sleep` to some parts of our code, so while we have observed it in production, we expect it to be rare.

More specifically, this issue comes down to a race between the remote configuration thread being shut down and other parts of the code base trying to restart it.

In practice, the remote configuration would previously always get started by calls to `Datadog::Core::Remote.active_remote.barrier(:once)`. During dd-trace-rb reconfiguration, it’s possible for the remote configuration thread to be shut down and, before the new configuration kicks in, for any other thread to call `Datadog::Core::Remote.active_remote.barrier(:once)`, restarting the remote configuration thread that belonged to the old configuration.

This PR fixes this issue by introducing a `@stop_requested` variable that flags a "terminal state" for the remote configuration worker.

Thus, the default behavior of `Remote::Worker#start` is changed from always starting the thread to not starting the thread if the same worker was stopped before.
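
As a rough sketch of the idea (simplified for illustration, not the actual `Datadog::Core::Remote::Worker` implementation), the guard looks conceptually like this:

```ruby
# Simplified sketch of the guard described above; not the real worker code.
class Worker
  def initialize(&block)
    @mutex = Mutex.new
    @started = false
    @stop_requested = false
    @block = block
  end

  def start
    @mutex.synchronize do
      if @stop_requested
        # Terminal state: a worker that was stopped once is never restarted,
        # so a worker belonging to an old configuration cannot leak its thread.
        Datadog.logger.debug { 'remote worker: refusing to restart after previous stop' }
        next
      end
      next if @started

      @started = true
      @thread = Thread.new(&@block)
    end
  end

  def stop
    @mutex.synchronize do
      @stop_requested = true
      @thread&.kill
      @thread&.join
      @started = false
      @thread = nil
    end
  end
end
```

With this shape, a late call to `start` on the worker from the old configuration becomes a no-op instead of resurrecting its thread.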

**Motivation:**

Fix issue.

**Additional Notes:**

We are also working to remove cases where `Datadog.configure` gets called during request handling, e.g. #3515.

**How to test the change?**

I've added unit test coverage verifying that a stopped worker does not restart itself.
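
For reference, the shape of such a spec could look roughly like the sketch below (illustrative only, not the exact spec added in this PR; the constructor arguments and the thread-count assertion are assumptions based on the snippets quoted in the review):

```ruby
require 'ddtrace' # assumes the gem from this repository is available

# Illustrative sketch, not the actual spec from this PR.
RSpec.describe Datadog::Core::Remote::Worker do
  it 'refuses to restart after having been stopped' do
    worker = described_class.new(interval: 1) { } # assumed constructor signature

    worker.start
    worker.stop

    # Starting again should not spawn a new background thread.
    expect { worker.start }.to_not(change { Thread.list.size })
  end
end
```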

To validate that this fixes the worker thread leak, the following reproducer can be used:

```diff
diff --git a/lib/datadog/core/configuration.rb b/lib/datadog/core/configuration.rb
index 7a20444784..fe0ca8ba0f 100644
--- a/lib/datadog/core/configuration.rb
+++ b/lib/datadog/core/configuration.rb
@@ -254,6 +254,12 @@ module Datadog
         components = Components.new(settings)

         old.shutdown!(components)
+        puts "After shutdown, sleeping!"
+
+        sleep 1
+
+        puts "Woke up after shutdown!"
+
         components.startup!(settings)
         components
       end
diff --git a/remote-worker-thread-leak.rb b/remote-worker-thread-leak.rb
new file mode 100644
index 0000000000..3552c654b5
--- /dev/null
+++ b/remote-worker-thread-leak.rb
@@ -0,0 +1,22 @@
+require 'ddtrace'
+
+configure_thread = Thread.new do
+  15.times {
+    Datadog.configure { Thread.pass }
+    Thread.pass
+  }
+end
+
+trigger_rc_thread = Thread.new do
+  loop {
+    sleep 0.5
+    Datadog::Core::Remote.active_remote.barrier(:once)
+    Thread.pass
+  }
+end
+
+configure_thread.join
+trigger_rc_thread.kill
+trigger_rc_thread.join
+
+puts Thread.list
```

Before the fix, the list of threads printed shows many leaked remote configuration worker threads. With the fix, the threads no longer leak, and if you enable debug logging you will see that the restart is being prevented and the debug log I added shows up:

```
D, [2024-03-12T16:27:02.951950 #742185] DEBUG -- ddtrace: [ddtrace]
(dd-trace-rb/lib/datadog/core/remote/worker.rb:28:in `start')
remote worker: refusing to restart after previous stop
```

This debug message is harmless and shows the fix is working fine.

**For Datadog employees:**

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@ivoanjo requested a review from a team as a code owner on March 12, 2024 16:33
@github-actions bot added the core label (Involves Datadog core libraries) on Mar 12, 2024
@ivoanjo requested reviews from lloeki, p-datadog and a team on March 12, 2024 16:33
@lloeki (Contributor) left a comment:

Thanks for tackling that one!

I seem to remember we talked about protecting concurrent access with a lock inside `Datadog::Some::Thing.active_remote`. @ivoanjo, do you feel this PR is instead enough to solve the issue?
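
(Illustrative only, not existing dd-trace-rb code: the lock idea would look roughly like the sketch below, where `RemoteAccess.active_remote` merely stands in for the real accessor.)

```ruby
# Sketch of serializing access to the remote component behind a lock;
# module and method names here are placeholders, not the real API.
module RemoteAccess
  LOCK = Mutex.new

  def self.active_remote
    LOCK.synchronize do
      # Memoizing under the mutex means concurrent callers cannot each
      # boot (or restart) their own copy of the remote worker.
      @active_remote ||= build_active_remote
    end
  end

  def self.build_active_remote
    # placeholder for whatever constructs the remote configuration client
  end
end
```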


```diff
   @interval = interval
   raise ArgumentError, 'can not initialize a worker without a block' unless block

   @block = block
 end

-def start
+def start(allow_restart: false)
```
@lloeki (Contributor) commented on Mar 13, 2024:

Is there a situation where we want to allow restart?

My memory is a bit fuzzy but it seemed to me that once a remote worker has started in practice its only subsequent states are "stopping" and "stopped", but maybe I missed something?

(IOW is state 5. disallowed and maybe should be guarded against?)

@ivoanjo (Member, Author) replied:

Removed in 760e084

```diff
 @started = false
+@stop_requested = false
```
@lloeki (Contributor) commented:

Not fond of what the naming implies here ^^', I feel it introduces incidental complexity.

Here's my train of thought.

This sequence diagram shows how I view the state machine†:

```
{ ¬started, ¬starting, ¬stopping }     # 1.
           |
           |
        #start
           |
           V
{ ¬started, starting, ¬stopping }      # 2.
           |
           |
        success
           |
           V
{ started, ¬starting, ¬stopping }      # 3.
           |
           |
         #stop
           |
           V
{ started, ¬starting, stopping }       # 4.
           |
           |
         success
           |
           V
{ ¬started, ¬starting, ¬stopping }     # == 1.
```

In that context, `stop_requested` captures a transition (IOW, `#stop` has been called) instead of representing a state. In effect this leads to a representational issue: `@stop_requested` will be forever true even when a restartable worker has been restarted and no `#stop` has been issued since that second `#start`.

AIUI what `stop_requested` intends to introduce is:

  • a distinction between the first and last state of the diagram, turning the last one into a 5., then making 5. equivalent to 1. via a conditional on `@allow_restart`
  • a distinction between 3. and 4., which became identical due to the removal of `@stopping`.

As a consequence I feel introducing a stopped state would better represent all possible states:

```
{ ¬started, ¬starting, ¬stopping, ¬stopped }     # 1.
           |
           |
        #start
           |
           V
{ ¬started, starting, ¬stopping, ¬stopped }      # 2.
           |
           |
        success
           |
           V
{ started, ¬starting, ¬stopping, ¬stopped }      # 3.
           |
           |
         #stop
           |
           V
{ started, ¬starting, stopping, ¬stopped }       # 4.
           |
           |
         success
           |
           V
{ ¬started, ¬starting, ¬stopping, stopped }      # 5.
           |
           |
        #start
           |
           V
{ ¬started, starting, ¬stopping, stopped }       # 6.
           |
           |
        success
           |
           V
{ started, ¬starting, ¬stopping, ¬stopped }      # == 3.
```

It then follows that:

  • `@allow_restart` is only effective when stopped (and not effective when ¬stopped)
  • the current implementation of `stop_requested` is equivalent to stopping OR stopped
    • ... except when the worker is started; in that case I feel a stop count (or maybe better, a start count) could be useful
  • the ambiguity about `stop_requested` on a restartable worker is resolved and fully captured by stopping

I also feel it makes it easier to implement helper methods from these fundamental state primitives (e.g. synthesising `#stop_requested` if needed).

† success is actually a diamond with an error branch, elided so as not to overload the sequence diagram, and also because it happens via exceptional code flow.
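
A compact way to picture the proposed flags in code (sketch only, not a proposed patch; names are made up for illustration):

```ruby
# Sketch of the four-flag state set from the diagram above.
class WorkerState
  def initialize
    @started  = false
    @starting = false
    @stopping = false
    @stopped  = false
  end

  # States 1. and 5.: starting is allowed from "never started", and from
  # "stopped" only when restart is explicitly permitted.
  def startable?(allow_restart: false)
    return false if @started || @starting || @stopping
    return allow_restart if @stopped

    true
  end

  def started! # 2./6. -> 3.: a (re)started worker is no longer "stopped"
    @starting = false
    @started = true
    @stopped = false
  end

  def stopped! # 4. -> 5.: terminal unless restart is allowed
    @stopping = false
    @started = false
    @stopped = true
  end
end
```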

@ivoanjo (Member, Author) replied:

Changed @stop_requested to @stopped in 22f3330

I initially added this thinking it may make it easier for testing
and/or experimentation, but after feedback from review, let's go without
it for now -- we can always re-add later.
@codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.23%. Comparing base (77f8b0f) to head (e20cbaf).

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #3519      +/-   ##
==========================================
- Coverage   98.24%   98.23%   -0.01%     
==========================================
  Files        1275     1275              
  Lines       75273    75281       +8     
  Branches     3554     3556       +2     
==========================================
+ Hits        73949    73956       +7     
- Misses       1324     1325       +1
```


@ivoanjo merged commit f76fd72 into master on Mar 13, 2024
219 checks passed
@ivoanjo deleted the ivoanjo/prof-9136-remote-worker-thread-leak branch on March 13, 2024 10:41
@github-actions bot added this to the 2.0 milestone on Mar 13, 2024
@ivoanjo modified the milestones: 2.0, 1.21.0 on Mar 14, 2024