[PROF-9136] Fix rare remote configuration worker thread leak #3519

**What does this PR do?** This PR fixes the remote configuration worker thread leak that is discussed in more detail in the internal "dd-trace-rb Remote::Worker thread leak investigation" doc. In a nutshell, there's a potential race between threads that call `Datadog::Core::Remote::Tie.boot` and threads that call `Datadog.configure` that can, in really unlucky situations, mean that the `Remote::Worker thread gets leaked during library reconfiguration. This rare issue is O(number of times Datadog.configure gets called) while requests are being processed in a web-app. I was only able to reproduce it with the weird reproducer below, and by adding `sleep` to some parts of our code, so while we have observed it in production, we expect it to be rare. Being more specific, this issue is related to a race between the remote configuration thread being shut down, and other parts of the code base trying to restart it. In practice, the remote configuration would previously always get started by calls to `Datadog::Core::Remote.active_remote.barrier(:once)`. During dd-trace-rb reconfiguration, it’s possible for the remote configuration thread to be shut down, and before the new configuration kicks in, for any other thread to call `Datadog::Core::Remote.active_remote.barrier(:once)`, restarting the remote configuration thread that belonged to the old configuration. This PR fixes this issue introducing a `@stop_requested` variable that flags a "terminal state" for the remote configuration worker. Thus, the default behavior of `Remote::Worker.start` is changed from always starting the thread to not starting the thread if the same worker was stopped before. **Motivation:** Fix issue. **Additional Notes:** We are also working to remove cases where `Datadog.configure` gets called during request handling, e.g. #3515 . **How to test the change?** I've added unit test coverage for the worker not restarting itself again. To validate that this fixes the worker thread leak, the following reproducer can be used: ```diff diff --git a/lib/datadog/core/configuration.rb b/lib/datadog/core/configuration.rb index 7a20444784..fe0ca8ba0f 100644 --- a/lib/datadog/core/configuration.rb +++ b/lib/datadog/core/configuration.rb @@ -254,6 +254,12 @@ module Datadog components = Components.new(settings) old.shutdown!(components) + puts "After shutdown, sleeping!" + + sleep 1 + + puts "Woke up after shutdown!" + components.startup!(settings) components end diff --git a/remote-worker-thread-leak.rb b/remote-worker-thread-leak.rb new file mode 100644 index 0000000000..3552c654b5 --- /dev/null +++ b/remote-worker-thread-leak.rb @@ -0,0 +1,22 @@ +require 'ddtrace' + +configure_thread = Thread.new do + 15.times { + Datadog.configure { Thread.pass } + Thread.pass + } +end + +trigger_rc_thread = Thread.new do + loop { + sleep 0.5 + Datadog::Core::Remote.active_remote.barrier(:once) + Thread.pass + } +end + +configure_thread.join +trigger_rc_thread.kill +trigger_rc_thread.join + +puts Thread.list ``` Before the fix, the list of threads printed would show a lot of leaked remote worker configuration threads. With the fix, the threads no longer leak, and if you enable debug logging you will see that the restart is being prevented and the debug log I added will show up ``` D, [2024-03-12T16:27:02.951950 #742185] DEBUG -- ddtrace: [ddtrace] (dd-trace-rb/lib/datadog/core/remote/worker.rb:28:in `start') remote worker: refusing to restart after previous stop ``` This debug message is harmless and shows the fix is working fine.

I initially added this thinking it may make it easier for testing and/or experimentation, but after feedback from review, let's go without it for now -- we can always re-add later.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PROF-9136] Fix rare remote configuration worker thread leak #3519

[PROF-9136] Fix rare remote configuration worker thread leak #3519

Commits on Mar 12, 2024

Commits on Mar 13, 2024