
[PROF-9136] Fix rare remote configuration worker thread leak #3519

Merged: @ivoanjo merged 4 commits into master on Mar 13, 2024

Conversation

@ivoanjo (Member) commented on Mar 12, 2024:

**What does this PR do?**

This PR fixes the remote configuration worker thread leak that is discussed in more detail in the internal
"dd-trace-rb Remote::Worker thread leak investigation" doc.

In a nutshell, there's a potential race between threads that call `Datadog::Core::Remote::Tie.boot` and threads that call `Datadog.configure` that can, in really unlucky situations, mean that the `Remote::Worker` thread gets leaked during library reconfiguration.

The likelihood of hitting this rare issue is O(number of times `Datadog.configure` gets called) while requests are being processed in a web app.

I was only able to reproduce it with the weird reproducer below, and by adding `sleep` to some parts of our code, so while we have observed it in production, we expect it to be rare.

More specifically, this issue comes down to a race between the remote configuration thread being shut down and other parts of the code base trying to restart it.

In practice, the remote configuration would previously always get started by calls to `Datadog::Core::Remote.active_remote.barrier(:once)`. During dd-trace-rb reconfiguration, it’s possible for the remote configuration thread to be shut down and, before the new configuration kicks in, for any other thread to call `Datadog::Core::Remote.active_remote.barrier(:once)`, restarting the remote configuration thread that belonged to the old configuration.

This PR fixes this issue by introducing a `@stop_requested` variable that flags a "terminal state" for the remote configuration worker.

Thus, the default behavior of `Remote::Worker#start` is changed from always starting the thread to not starting the thread if the same worker was stopped before.
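
As a rough sketch of the idea (simplified for illustration, not the actual `Datadog::Core::Remote::Worker` implementation), the guard looks conceptually like this:

```ruby
# Simplified sketch of the guard described above; not the real worker code.
class Worker
  def initialize(&block)
    @mutex = Mutex.new
    @started = false
    @stop_requested = false
    @block = block
  end

  def start
    @mutex.synchronize do
      if @stop_requested
        # Terminal state: a worker that was stopped once is never restarted,
        # so a worker belonging to an old configuration cannot leak its thread.
        Datadog.logger.debug { 'remote worker: refusing to restart after previous stop' }
        next
      end
      next if @started

      @started = true
      @thread = Thread.new(&@block)
    end
  end

  def stop
    @mutex.synchronize do
      @stop_requested = true
      @thread&.kill
      @thread&.join
      @started = false
      @thread = nil
    end
  end
end
```

With this shape, a late call to `start` on the worker from the old configuration becomes a no-op instead of resurrecting its thread.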

**Motivation:**

Fix issue.

**Additional Notes:**

We are also working to remove cases where `Datadog.configure` gets called during request handling, e.g. #3515.

**How to test the change?**

I've added unit test coverage verifying that a stopped worker does not restart itself.
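
For reference, the shape of such a spec could look roughly like the sketch below (illustrative only, not the exact spec added in this PR; the constructor arguments and the thread-count assertion are assumptions based on the snippets quoted in the review):

```ruby
require 'ddtrace' # assumes the gem from this repository is available

# Illustrative sketch, not the actual spec from this PR.
RSpec.describe Datadog::Core::Remote::Worker do
  it 'refuses to restart after having been stopped' do
    worker = described_class.new(interval: 1) { } # assumed constructor signature

    worker.start
    worker.stop

    # Starting again should not spawn a new background thread.
    expect { worker.start }.to_not(change { Thread.list.size })
  end
end
```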

To validate that this fixes the worker thread leak, the following reproducer can be used:

```diff
diff --git a/lib/datadog/core/configuration.rb b/lib/datadog/core/configuration.rb
index 7a20444784..fe0ca8ba0f 100644
--- a/lib/datadog/core/configuration.rb
+++ b/lib/datadog/core/configuration.rb
@@ -254,6 +254,12 @@ module Datadog
         components = Components.new(settings)

         old.shutdown!(components)
+        puts "After shutdown, sleeping!"
+
+        sleep 1
+
+        puts "Woke up after shutdown!"
+
         components.startup!(settings)
         components
       end
diff --git a/remote-worker-thread-leak.rb b/remote-worker-thread-leak.rb
new file mode 100644
index 0000000000..3552c654b5
--- /dev/null
+++ b/remote-worker-thread-leak.rb
@@ -0,0 +1,22 @@
+require 'ddtrace'
+
+configure_thread = Thread.new do
+  15.times {
+    Datadog.configure { Thread.pass }
+    Thread.pass
+  }
+end
+
+trigger_rc_thread = Thread.new do
+  loop {
+    sleep 0.5
+    Datadog::Core::Remote.active_remote.barrier(:once)
+    Thread.pass
+  }
+end
+
+configure_thread.join
+trigger_rc_thread.kill
+trigger_rc_thread.join
+
+puts Thread.list
```

Before the fix, the list of threads printed shows many leaked remote configuration worker threads. With the fix, the threads no longer leak, and if you enable debug logging you will see that the restart is being prevented and the debug log I added shows up:

```
D, [2024-03-12T16:27:02.951950 #742185] DEBUG -- ddtrace: [ddtrace]
(dd-trace-rb/lib/datadog/core/remote/worker.rb:28:in `start')
remote worker: refusing to restart after previous stop
```

This debug message is harmless and shows the fix is working fine.

**For Datadog employees:**

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@ivoanjo requested a review from a team as a code owner on March 12, 2024 16:33
@github-actions bot added the core label (Involves Datadog core libraries) on Mar 12, 2024
@ivoanjo requested reviews from lloeki, p-datadog and a team on March 12, 2024 16:33
@lloeki (Contributor) left a comment:

Thanks for tackling that one!

I seem to remember we talked about protecting concurrent access with a lock inside `Datadog::Some::Thing.active_remote`. @ivoanjo, do you feel this PR is instead enough to solve the issue?
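
(Illustrative only, not existing dd-trace-rb code: the lock idea would look roughly like the sketch below, where `RemoteAccess.active_remote` merely stands in for the real accessor.)

```ruby
# Sketch of serializing access to the remote component behind a lock;
# module and method names here are placeholders, not the real API.
module RemoteAccess
  LOCK = Mutex.new

  def self.active_remote
    LOCK.synchronize do
      # Memoizing under the mutex means concurrent callers cannot each
      # boot (or restart) their own copy of the remote worker.
      @active_remote ||= build_active_remote
    end
  end

  def self.build_active_remote
    # placeholder for whatever constructs the remote configuration client
  end
end
```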


```diff
   @interval = interval
   raise ArgumentError, 'can not initialize a worker without a block' unless block

   @block = block
 end

-def start
+def start(allow_restart: false)
```
@lloeki (Contributor) commented on Mar 13, 2024:

Is there a situation where we want to allow restart?

My memory is a bit fuzzy but it seemed to me that once a remote worker has started in practice its only subsequent states are "stopping" and "stopped", but maybe I missed something?

(IOW is state 5. disallowed and maybe should be guarded against?)

@ivoanjo (Member, Author) replied:

Removed in 760e084

```diff
 @started = false
+@stop_requested = false
```
@lloeki (Contributor) commented:

Not fond of what the naming implies here ^^', I feel it introduces incidental complexity.

Here's my train of thought.

This sequence diagram shows how I view the state machine†:

```
{ ¬started, ¬starting, ¬stopping }     # 1.
           |
           |
        #start
           |
           V
{ ¬started, starting, ¬stopping }      # 2.
           |
           |
        success
           |
           V
{ started, ¬starting, ¬stopping }      # 3.
           |
           |
         #stop
           |
           V
{ started, ¬starting, stopping }       # 4.
           |
           |
         success
           |
           V
{ ¬started, ¬starting, ¬stopping }     # == 1.
```

In that context, `stop_requested` captures a transition (IOW, `#stop` has been called) instead of representing a state. In effect this leads to a representational issue: `@stop_requested` will be forever true even when a restartable worker has been restarted and no `#stop` has been issued since that second `#start`.

AIUI what `stop_requested` intends to introduce is:

  • a distinction between the first and last state of the diagram, turning the last one into a 5., then making 5. equivalent to 1. via a conditional on `@allow_restart`
  • a distinction between 3. and 4., which became identical due to the removal of `@stopping`.

As a consequence I feel introducing a stopped state would better represent all possible states:

```
{ ¬started, ¬starting, ¬stopping, ¬stopped }     # 1.
           |
           |
        #start
           |
           V
{ ¬started, starting, ¬stopping, ¬stopped }      # 2.
           |
           |
        success
           |
           V
{ started, ¬starting, ¬stopping, ¬stopped }      # 3.
           |
           |
         #stop
           |
           V
{ started, ¬starting, stopping, ¬stopped }       # 4.
           |
           |
         success
           |
           V
{ ¬started, ¬starting, ¬stopping, stopped }      # 5.
           |
           |
        #start
           |
           V
{ ¬started, starting, ¬stopping, stopped }       # 6.
           |
           |
        success
           |
           V
{ started, ¬starting, ¬stopping, ¬stopped }      # == 3.
```

It then follows that:

  • `@allow_restart` is only effective when stopped (and not effective when ¬stopped)
  • the current implementation of `stop_requested` is equivalent to stopping OR stopped
    • ... except when the worker is started; in that case I feel a stop count (or maybe better, a start count) could be useful
  • the ambiguity about `stop_requested` on a restartable worker is resolved and fully captured by stopping

I also feel it makes it easier to implement helper methods from these fundamental state primitives (e.g. synthesising `#stop_requested` if needed).

† success is actually a diamond with an error branch, elided so as not to overload the sequence diagram, and also because it happens via exceptional code flow.
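
A compact way to picture the proposed flags in code (sketch only, not a proposed patch; names are made up for illustration):

```ruby
# Sketch of the four-flag state set from the diagram above.
class WorkerState
  def initialize
    @started  = false
    @starting = false
    @stopping = false
    @stopped  = false
  end

  # States 1. and 5.: starting is allowed from "never started", and from
  # "stopped" only when restart is explicitly permitted.
  def startable?(allow_restart: false)
    return false if @started || @starting || @stopping
    return allow_restart if @stopped

    true
  end

  def started! # 2./6. -> 3.: a (re)started worker is no longer "stopped"
    @starting = false
    @started = true
    @stopped = false
  end

  def stopped! # 4. -> 5.: terminal unless restart is allowed
    @stopping = false
    @started = false
    @stopped = true
  end
end
```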

@ivoanjo (Member, Author) replied:

Changed @stop_requested to @stopped in 22f3330

I initially added this thinking it may make it easier for testing
and/or experimentation, but after feedback from review, let's go without
it for now -- we can always re-add later.
@codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.23%. Comparing base (77f8b0f) to head (e20cbaf).

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #3519      +/-   ##
==========================================
- Coverage   98.24%   98.23%   -0.01%     
==========================================
  Files        1275     1275              
  Lines       75273    75281       +8     
  Branches     3554     3556       +2     
==========================================
+ Hits        73949    73956       +7     
- Misses       1324     1325       +1
```


@ivoanjo merged commit f76fd72 into master on Mar 13, 2024
219 checks passed
@ivoanjo deleted the ivoanjo/prof-9136-remote-worker-thread-leak branch on March 13, 2024 10:41
@github-actions bot added this to the 2.0 milestone on Mar 13, 2024
@ivoanjo modified the milestones: 2.0, 1.21.0 on Mar 14, 2024