Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-9136] Fix rare remote configuration worker thread leak #3519

Merged
merged 4 commits into from
Mar 13, 2024

Commits on Mar 12, 2024

  1. [PROF-9136] Fix rare remote configuration worker thread leak

    **What does this PR do?**
    
    This PR fixes the remote configuration worker thread leak that is
    discussed in more detail in the internal
    "dd-trace-rb Remote::Worker thread leak investigation" doc.
    
    In a nutshell, there's a potential race between threads that call
    `Datadog::Core::Remote::Tie.boot` and threads that call
    `Datadog.configure` that can, in really unlucky situations,
    mean that the `Remote::Worker thread gets leaked during library
    reconfiguration.
    
    This rare issue is O(number of times Datadog.configure gets called)
    while requests are being processed in a web-app.
    
    I was only able to reproduce it with the weird reproducer below, and
    by adding `sleep` to some parts of our code, so while we have observed
    it in production, we expect it to be rare.
    
    Being more specific, this issue is related to a race between the remote
    configuration thread being shut down, and other parts of the code base
    trying to restart it.
    
    In practice, the remote configuration would previously always get
    started by calls to `Datadog::Core::Remote.active_remote.barrier(:once)`.
    During dd-trace-rb reconfiguration, it’s possible for the remote
    configuration thread to be shut down, and before the new configuration
    kicks in, for any other thread to call
    `Datadog::Core::Remote.active_remote.barrier(:once)`, restarting the
    remote configuration thread that belonged to the old configuration.
    
    This PR fixes this issue introducing a `@stop_requested` variable that
    flags a "terminal state" for the remote configuration worker.
    
    Thus, the default behavior of `Remote::Worker.start` is changed
    from always starting the thread to not starting the thread if the
    same worker was stopped before.
    
    **Motivation:**
    
    Fix issue.
    
    **Additional Notes:**
    
    We are also working to remove cases where `Datadog.configure` gets
    called during request handling, e.g. #3515 .
    
    **How to test the change?**
    
    I've added unit test coverage for the worker not restarting itself
    again.
    
    To validate that this fixes the worker thread leak, the following
    reproducer can be used:
    
    ```diff
    diff --git a/lib/datadog/core/configuration.rb b/lib/datadog/core/configuration.rb
    index 7a20444784..fe0ca8ba0f 100644
    --- a/lib/datadog/core/configuration.rb
    +++ b/lib/datadog/core/configuration.rb
    @@ -254,6 +254,12 @@ module Datadog
             components = Components.new(settings)
    
             old.shutdown!(components)
    +        puts "After shutdown, sleeping!"
    +
    +        sleep 1
    +
    +        puts "Woke up after shutdown!"
    +
             components.startup!(settings)
             components
           end
    diff --git a/remote-worker-thread-leak.rb b/remote-worker-thread-leak.rb
    new file mode 100644
    index 0000000000..3552c654b5
    --- /dev/null
    +++ b/remote-worker-thread-leak.rb
    @@ -0,0 +1,22 @@
    +require 'ddtrace'
    +
    +configure_thread = Thread.new do
    +  15.times {
    +    Datadog.configure { Thread.pass }
    +    Thread.pass
    +  }
    +end
    +
    +trigger_rc_thread = Thread.new do
    +  loop {
    +    sleep 0.5
    +    Datadog::Core::Remote.active_remote.barrier(:once)
    +    Thread.pass
    +  }
    +end
    +
    +configure_thread.join
    +trigger_rc_thread.kill
    +trigger_rc_thread.join
    +
    +puts Thread.list
    ```
    
    Before the fix, the list of threads printed would show a lot of
    leaked remote worker configuration threads. With the fix, the
    threads no longer leak, and if you enable debug logging you will
    see that the restart is being prevented and the debug log I
    added will show up
    
    ```
    D, [2024-03-12T16:27:02.951950 #742185] DEBUG -- ddtrace: [ddtrace]
    (dd-trace-rb/lib/datadog/core/remote/worker.rb:28:in `start')
    remote worker: refusing to restart after previous stop
    ```
    
    This debug message is harmless and shows the fix is working fine.
    ivoanjo committed Mar 12, 2024
    Configuration menu
    Copy the full SHA
    7df5858 View commit details
    Browse the repository at this point in the history

Commits on Mar 13, 2024

  1. Remove allow_restart argument

    I initially added this thinking it may make it easier for testing
    and/or experimentation, but after feedback from review, let's go without
    it for now -- we can always re-add later.
    ivoanjo committed Mar 13, 2024
    Configuration menu
    Copy the full SHA
    760e084 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    22f3330 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    e20cbaf View commit details
    Browse the repository at this point in the history