**What does this PR do?**
This PR fixes the remote configuration worker thread leak that is
discussed in more detail in the internal
"dd-trace-rb Remote::Worker thread leak investigation" doc.
In a nutshell, there's a potential race between threads that call
`Datadog::Core::Remote::Tie.boot` and threads that call
`Datadog.configure` that can, in really unlucky situations,
cause the `Remote::Worker` thread to be leaked during library
reconfiguration.
This rare issue is O(number of times `Datadog.configure` gets called)
while requests are being processed in a web app.
I was only able to reproduce it with the weird reproducer below, and
by adding `sleep` to some parts of our code, so while we have observed
it in production, we expect it to be rare.
Being more specific, this issue is related to a race between the remote
configuration thread being shut down, and other parts of the code base
trying to restart it.
In practice, the remote configuration worker thread would previously
always get started by calls to
`Datadog::Core::Remote.active_remote.barrier(:once)`.
During dd-trace-rb reconfiguration, it’s possible for the remote
configuration thread to be shut down, and before the new configuration
kicks in, for any other thread to call
`Datadog::Core::Remote.active_remote.barrier(:once)`, restarting the
remote configuration thread that belonged to the old configuration.
This PR fixes this issue by introducing a `@stop_requested` variable
that flags a "terminal state" for the remote configuration worker.
Thus, the default behavior of `Remote::Worker.start` is changed
from always starting the thread to not starting the thread if the
same worker was stopped before.
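For illustration, here's a minimal sketch of the idea. Only the
`@stop_requested` flag and the "refusing to restart" debug message come
from this PR; the class layout and the other names below are simplified
stand-ins, not the actual `Remote::Worker` code:
```ruby
# Minimal sketch of the guard, using a simplified stand-in worker.
class Worker
  def initialize(&task)
    @mutex = Mutex.new
    @thread = nil
    @task = task
    @stop_requested = false # terminal state: once set, #start becomes a no-op
  end

  def start
    @mutex.synchronize do
      # The new guard: a worker that was stopped once is never restarted,
      # even if a stale reference to the old configuration calls #start again.
      if @stop_requested
        puts 'remote worker: refusing to restart after previous stop'
        return
      end

      @thread ||= Thread.new(&@task)
    end
  end

  def stop
    @mutex.synchronize do
      @stop_requested = true
      @thread&.kill
      @thread = nil
    end
  end
end

# A stale reference calling #start after #stop no longer leaks a thread:
worker = Worker.new { sleep }
worker.start
worker.stop
worker.start # prints "refusing to restart"; no new thread is created
```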
**Motivation:**
Fix the remote configuration worker thread leak.
**Additional Notes:**
We are also working to remove cases where `Datadog.configure` gets
called during request handling, e.g. #3515.
**How to test the change?**
I've added unit test coverage for the worker refusing to restart
after it has been stopped.
To validate that this fixes the worker thread leak, the following
reproducer can be used:
```diff
diff --git a/lib/datadog/core/configuration.rb b/lib/datadog/core/configuration.rb
index 7a20444784..fe0ca8ba0f 100644
--- a/lib/datadog/core/configuration.rb
+++ b/lib/datadog/core/configuration.rb
@@ -254,6 +254,12 @@ module Datadog
components = Components.new(settings)
old.shutdown!(components)
+ puts "After shutdown, sleeping!"
+
+ sleep 1
+
+ puts "Woke up after shutdown!"
+
components.startup!(settings)
components
end
diff --git a/remote-worker-thread-leak.rb b/remote-worker-thread-leak.rb
new file mode 100644
index 0000000000..3552c654b5
--- /dev/null
+++ b/remote-worker-thread-leak.rb
@@ -0,0 +1,22 @@
+require 'ddtrace'
+
+configure_thread = Thread.new do
+ 15.times {
+ Datadog.configure { Thread.pass }
+ Thread.pass
+ }
+end
+
+trigger_rc_thread = Thread.new do
+ loop {
+ sleep 0.5
+ Datadog::Core::Remote.active_remote.barrier(:once)
+ Thread.pass
+ }
+end
+
+configure_thread.join
+trigger_rc_thread.kill
+trigger_rc_thread.join
+
+puts Thread.list
```
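After applying the diff (it adds the `sleep` during reconfiguration and
creates `remote-worker-thread-leak.rb` at the repository root), the
reproducer can be run from a dd-trace-rb checkout with, for example,
`bundle exec ruby remote-worker-thread-leak.rb`.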
Before the fix, the list of threads printed would show a lot of
leaked remote configuration worker threads. With the fix, the
threads no longer leak, and if you enable debug logging you will
see that the restart is being prevented and that the debug log I
added shows up:
```
D, [2024-03-12T16:27:02.951950 #742185] DEBUG -- ddtrace: [ddtrace]
(dd-trace-rb/lib/datadog/core/remote/worker.rb:28:in `start')
remote worker: refusing to restart after previous stop
```
This debug message is harmless and shows the fix is working as intended.