Ensure tests don't leak threads #1371

Merged
merged 37 commits into master from fix-all-thread-mess-wip on Feb 25, 2021

Conversation

marcotc
Member

@marcotc marcotc commented Feb 17, 2021

This PR cleans up our test suite, fixing cases where threads are created during test execution but never cleaned up. Before this change, rake spec:main finishes its execution with over 40 threads still running.

This causes issues when performing integration testing for global tracer resource usage.

Also, it's good hygiene to clean up the process correctly between test executions, so as to avoid noise. To this day, we still have sporadic test failures caused by background threads from neighboring test executions interacting with Datadog.health_metrics.* while we are trying to assert on it.

This PR introduces a global check that runs after every RSpec context. When triggered, the check currently warns loudly in CI and fails locally. The reason it's not hard-failing in CI in this PR is that I wasn't able to completely clean up all sporadic failures (ones that take 3+ CI runs to trigger), and after working on this for 2 weeks straight, I think it's enough work for now in this area :)
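
For reference, a minimal sketch of the kind of check being described (the hook placement, environment flag, and messages here are illustrative assumptions, not the exact implementation):

# Illustrative sketch only -- not the exact hook added in this PR.
RSpec.configure do |config|
  config.after(:all) do
    # Any non-main thread still alive after the example group is
    # treated as a potential leak.
    leaked = Thread.list.reject { |t| t == Thread.main }

    next if leaked.empty?

    message = "Leaked threads detected: #{leaked.map(&:inspect).join(', ')}"
    if ENV['CI'] == 'true'
      warn message  # warn loudly in CI
    else
      raise message # fail locally
    end
  end
end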

@marcotc marcotc added the dev/testing Involves testing processes (e.g. RSpec) label Feb 17, 2021
@marcotc marcotc self-assigned this Feb 17, 2021
@marcotc marcotc requested a review from a team February 17, 2021 23:05
return false unless run_loop?
# Ensure we always inform that a stop was requested.
# This guarantees correct shutdown even if the loop
# has not asynchronously started yet.
Member Author

This was a race condition when perform_loop (just below) is about to be started, but hasn't yet. In this case, run_loop? will return false, and we'll never mark this worker as stopped.
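
In other words, the fix is to record the stop request before bailing out on run_loop?. Roughly (a simplified sketch, not the exact diff):

# Simplified sketch of the ordering fix (not the exact implementation).
def stop_loop
  # Always record that a stop was requested, even if perform_loop
  # hasn't started yet; the loop checks this flag before iterating,
  # so it will shut down as soon as it does start.
  @run_loop = false

  true
end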

@@ -199,7 +199,7 @@ def write(trace)
# reconfigure the worker accordingly.
# NOTE: It's important we do this before queuing or it will drop the current trace,
# because #after_fork resets the buffer.
perform
perform if async?
Member Author

This was creating a background thread for non-async writers.

Member Author

Creating a background thread for non-async writers is still not correct, but the solution was a little bit more involved, given perform can change the result of async? calls.

@@ -3,12 +3,13 @@
require 'ddtrace/contrib/analytics_examples'
require 'ddtrace'
require 'ddtrace/contrib/httpclient/instrumentation'
require 'http'
require 'httpclient'
Member Author

Wrong library being required here.

@@ -64,7 +67,7 @@

context 'when the worker is not running' do
before { worker.stop_loop }
it { is_expected.to be false }
it { is_expected.to be true }
Member Author

#stop_loop now always returns true, as we can't know for sure if the stop signal has already been consumed by the main loop or not.

Member

Minor: Since we changed the spec on this method to always return true, perhaps we should just move this test to the parent context -- since it always applies, regardless of the state of the worker?

@@ -201,12 +204,12 @@
expect { loop_back_off! }
.to change { worker.loop_wait_time }
.from(described_class::BASE_INTERVAL)
.to(described_class::BACK_OFF_RATIO)
.to(described_class::BASE_INTERVAL * described_class::BACK_OFF_RATIO)
Member Author

These tests were incorrect, but they just so happened to work because BASE_INTERVAL was 1, so the backoff calculation was trivially 1 * BACK_OFF_RATIO.
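
To make the fix concrete with hypothetical numbers (illustration only; the real constants live in the loop worker):

# Hypothetical values, for illustration only.
base_interval  = 0.5
back_off_ratio = 1.2

# loop_back_off! multiplies the wait time by the ratio:
expected_after_back_off = base_interval * back_off_ratio # => 0.6, not 1.2

# With base_interval == 1 the two expectations coincide (1 * 1.2 == 1.2),
# which is why the old, incorrect assertion still passed.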

@@ -13,7 +13,7 @@

describe '#perform' do
subject(:perform) { worker.perform }
after { worker.stop(true, 0) }
after { worker.stop(true, 5) }
Member Author
@marcotc marcotc Feb 17, 2021

0 means "don't wait any time" (aka non-blocking) here. But we actually want to wait for the thread to stop before this test is considered finished.

The execution should be close to immediate here, so I used our house-favourite 5-second timeout, just to be safe.

Comment on lines 112 to 117
"#{t} (#{t.class.name})",
' == Caller ==',
caller,
' == Backtrace ==',
t.backtrace,
"\n"
Member Author

If we're lucky and the thread was initialized after DatadogThreadDebugger was injected, we get a very nice call stack pointing to the line that created the offending thread.
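
The gist of the injection is roughly this (a sketch; the attribute name and exact wiring are assumptions):

# Sketch: prepend a module onto Thread so every thread created after
# injection remembers the call stack of its creator.
module DatadogThreadDebugger
  def initialize(*args, &block)
    @creator_backtrace = caller # who spawned this thread (assumed name)
    super
  end

  attr_reader :creator_backtrace
end

Thread.prepend(DatadogThreadDebugger)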

module ThreadHelpers
module_function

# Isolate threads created in a block in their own `ThreadGroup`.
Member Author

I recommend a high-level read about ThreadGroups, if you're not familiar with them: https://ruby-doc.org/core-2.5.0/ThreadGroup.html
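
The short version of how the helper can use them (a sketch with an assumed helper name; the key fact is that new threads are born into the ThreadGroup of the thread that created them):

# Sketch only; the helper name is illustrative.
module ThreadHelpers
  module_function

  def with_isolated_thread_group
    previous_group = Thread.current.group
    # Move the current thread into a fresh group, so any threads the
    # block spawns are enlisted into that group instead of the default.
    ThreadGroup.new.add(Thread.current)
    yield
  ensure
    previous_group.add(Thread.current)
  end
end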

Member
@ivoanjo ivoanjo left a comment

Thanks a lot for looking at this! I suspect I've run into at least a couple of issues that are being fixed here.

I had to context switch for a bit and I'm not sure I'll be able to come back to this today, so here sir take my partial review, and I'll come back soon :)

lib/ddtrace/workers/loop.rb (outdated, resolved)
@@ -75,7 +77,10 @@ def mutex
private
Member

⬆️ GitHub still doesn't allow me to comment on unchanged lines, but looking at that mutex initialization in #mutex, /me sees a race in mutex creation. Especially now that we have stop_loop and perform_loop trying to grab the mutex at almost the same time (e.g. in tests), I'm guessing we may see that issue from time to time.

I'll admit I'm not a big fan of the current approach to modularity in the workers -- I've had experience with quite a few of these kinds of approaches to assembling a more complex system from parts, and my experience has been that breaking things into modules tends to just hide the complexity by spreading it across multiple files; the assembled system is still very complex.

My suggestion would be to introduce an initialize method that takes care of the needed non-lazy initialization.
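
For illustration, the lazy pattern in question versus the suggested non-lazy alternative (simplified; the module and class names below are assumptions, not the real ones):

# The lazy pattern being discussed: two threads calling #mutex at the
# same time can both observe @mutex as nil and each build their own
# Mutex -- the race in question.
module LazyMutex
  def mutex
    @mutex ||= Mutex.new
  end
end

# The suggested non-lazy alternative: the object creates its mutex once,
# in #initialize, before any other thread can touch it.
class EagerWorker
  def initialize
    @mutex = Mutex.new
  end

  private

  attr_reader :mutex
end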

Member

Also it occurred to me that because it's common for users of this module to combine it with ::Async, then this mutex is shared amongst all of them. 🤕

Comment on lines 202 to 205
perform
perform if async?

# Queue the trace if running asynchronously, otherwise short-circuit and write it directly.
async? ? enqueue(trace) : write_traces([trace])
Member

Minor: Consider simplifying this so the if async? check is done only once.
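
Roughly, the suggested shape (a sketch; as discussed just below, perform can itself change async?, which is why this wasn't a drop-in change):

# Sketch of the suggested single-check form inside #write.
def write(trace)
  if async?
    perform               # make sure the background worker is running
    enqueue(trace)        # queue the trace for asynchronous flushing
  else
    write_traces([trace]) # short-circuit and write synchronously
  end
end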

Member Author

Turns out the perform/async? dance was not trivial.

perform starts whatever process is required to successfully process traces: that could be a thread, or in the case of an after-fork call, we might revert away from threads to a synchronous writer.
When this reverting happens, the Writer becomes synchronous, thus changing the result of async?.

Turns out there was more confusion under the hood, as there were two flags for async: @async in the writer and @run_async in the Async mixin.

They were kept independent, but I believe this is not the correct behaviour here. I ended up collapsing them into a single source of truth, and now the behaviour is as expected: only async executions create threads, synchronous ones never do.
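
A rough sketch of what "single source of truth" means here (simplified; the real mixin has more state than this):

# Simplified sketch -- not the exact implementation.
module Async
  def async?
    @async == true # one flag; the separate @run_async no longer exists
  end

  def stop_fork
    @pid = Process.pid
    @async = false # a forked child reverts to synchronous execution
  end
end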

spec/ddtrace/buffer_spec.rb (outdated, resolved)
spec/ddtrace/context_spec.rb (outdated, resolved)
Member Author
@marcotc marcotc left a comment

Thank you for the review, @ivoanjo! I've addressed all the points you brought up; let me know if there's anything you still think needs to be modified.

spec/ddtrace/buffer_spec.rb (outdated, resolved)
spec/ddtrace/context_spec.rb (outdated, resolved)
spec/ddtrace/contrib/elasticsearch/patcher_spec.rb (outdated, resolved)
spec/ddtrace/contrib/elasticsearch/transport_spec.rb (outdated, resolved)
@marcotc marcotc requested a review from ivoanjo February 18, 2021 20:30
@codecov-io

codecov-io commented Feb 18, 2021

Codecov Report

Merging #1371 (2a0d2d0) into master (6ef1ce5) will increase coverage by 0.03%.
The diff coverage is 97.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1371      +/-   ##
==========================================
+ Coverage   98.11%   98.14%   +0.03%     
==========================================
  Files         755      757       +2     
  Lines       35945    36011      +66     
==========================================
+ Hits        35268    35344      +76     
+ Misses        677      667      -10     
Impacted Files Coverage Δ
lib/ddtrace/workers/trace_writer.rb 95.34% <ø> (-0.21%) ⬇️
...dtrace/propagation/propagation_integration_spec.rb 100.00% <ø> (ø)
spec/ddtrace/workers/runtime_metrics_spec.rb 92.81% <75.00%> (ø)
lib/ddtrace/workers/async.rb 84.70% <90.00%> (+0.18%) ⬆️
spec/ddtrace/workers/async_spec.rb 94.42% <94.73%> (-0.05%) ⬇️
lib/ddtrace/workers/loop.rb 98.11% <100.00%> (+0.03%) ⬆️
spec/ddtrace/buffer_spec.rb 98.23% <100.00%> (+3.18%) ⬆️
spec/ddtrace/context_spec.rb 99.64% <100.00%> (-0.01%) ⬇️
...trace/contrib/action_cable/instrumentation_spec.rb 97.14% <100.00%> (+0.36%) ⬆️
...e/contrib/concurrent_ruby/integration_test_spec.rb 100.00% <100.00%> (ø)
... and 31 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ef1ce5...5d1faa7. Read the comment docs.

Member
@ivoanjo ivoanjo left a comment

Finally went through the full diff :)

I'm not approving yet as I'm concerned about the changes in async (e.g. I'm concerned that it's complex enough that I'm not confident I can tell their impact) and want a bit more information to make the decision.

Other than that, it's in pretty good shape, I left a few nitpicks and ideas here and there.

@@ -146,7 +148,7 @@ def stop_fork

# Reset and turn off
@pid = Process.pid
@run_async = false
@async = false
end
end
end
Member

I'll be honest -- I don't feel comfortable enough with what this module is supposed to do and how it gets used to review these changes now.

Here's what's going through my mind:

  • What is the purpose of @async? In particular, how is it supposed to be different from #running?
  • When and why are users of this module supposed to change @async?
  • Given that #stop_fork and #restart_after_fork are both called with mutex grabbed, what does mutex_after_fork add?
  • I wish this wasn't a module (lol). The behaviors are already quite complex and having them as a class would clarify the interfaces and how interaction happens.

@delner -- Can you help out here?

def after_fork
# Do nothing by default
end

private

attr_reader \
:pid
:mutex, :pid
Member

There's a def worker below that overrides this one. I would suggest adding `@worker = nil` to `initialize` and removing the other one.

Comment on lines 192 to 193
#
# `perform` can change the result of `async?` during a fork.
Member

Nice call-out. I think this is a bit of a code smell showing that this setup isn't great.

Comment on lines 312 to 313
subject(:push) { threads.each(&:join) }
subject(:push) { wait_for_threads }
let(:wait_for_threads) { threads.each { |t| t.join(5000) } }
Member

I'm curious about the join(5000) change since this can mean that we leave threads around after the fact. Was it just a copy from #concat below? I'm curious why we need the timeout on that one as well.

Member Author

I normally add a timeout of 5 seconds to anything in our test suite that should be immediate, but technically has to wait. I do this so that tests don't hang forever.

Member
@ivoanjo ivoanjo Feb 22, 2021

Perhaps we could use raise "..." unless t.join(5000) to raise on timeout? If for some reason the threads don't finish, rather than having the test suite silently chug along, possibly leaking threads that make later tests flaky, we could raise an exception here, which would immediately be spotted as an issue.

(If you agree, is it a big change to do this everywhere in the thread suite? 😉 )
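
For reference, the pattern being suggested would look something like this (a sketch; note that Thread#join takes a timeout in seconds and returns nil when it expires):

let(:wait_for_threads) do
  threads.each do |t|
    raise "Thread #{t.inspect} did not finish in time" unless t.join(5)
  end
end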

Member

⬆️ I'm not sure if you saw this comment. It may be a bit too much to do in this PR (or you may think it's not worth it), but just wanted to call out so that we make an explicit decision :)

Member Author

I did miss this one, thanks for calling this out.

spec/ddtrace/workers_integration_spec.rb (resolved)
@@ -81,7 +81,80 @@
# (e.g. via a command-line flag).
config.default_formatter = 'doc'
end

# Check for leaky test resources
config.after(:all) do
Member

We already spoke about this work before, and I really like it -- it's mega valuable, but I'm slightly concerned about the user experience for new contributors to the gem.

Getting a big stack trace just by running rspec, or even an error without a stack trace (for things our patch can't catch) can be a bit confusing, especially because this is custom to our setup.

So here's what I suggest we can do about it:

  1. Make this behavior non-breaking by default. The current version already includes a CI flag to decide whether to break or not, but I suggest making the default non-breaking -- that way we inform, but don't scare.

  2. Add a note about what this is to the DevelopmentGuide.md AND mention in the error message that we provide some tips there.

  3. Indent the output a bit, so it's clear what belongs where :)

Member Author

Added all suggested changes. Output looks like this now:
[screenshot: Screen Shot 2021-02-19 at 12 28 31 PM]
... after many, many lines, a screenshot of the last section:
[screenshot: Screen Shot 2021-02-19 at 12 28 38 PM]

Comment on lines 121 to 123
# We cannot fail tests gracefully in an `after(:all)` block.
# The test results have already been decided by RSpec.
# We resort to a more "blunt" approach.
Member

Have you considered running the background-thread check after each individual test instead?

Member Author

This is what I wanted, but I could not find a way to run code after all of RSpec's after callbacks have run. Creating an after callback inserts it at the "end" of the current callback list, but it gets placed before some of the host test's after callbacks, depending on declaration order.

I debugged the test runs, stepping into RSpec code, and I don't see any place where RSpec would be able to call our code after all after blocks are done. We could "inject" (read: monkey-patch) these changes into RSpec, but there's no supported way to do it 😢

spec/spec_helper.rb (resolved)
ivoanjo added a commit to DataDog/ffi that referenced this pull request Feb 19, 2021
I found this useful while reviewing DataDog/dd-trace-rb#1371 -- we were
searching for specs that leaked threads, and saw a "mysterious" thread
with no stack, and we couldn't exactly point out what created it or
why it existed.

By setting a name on the callback thread, it becomes immediately
obvious when looking at `Thread.list` what this thread is and where
it comes from.

(Thread naming has been around
[since Ruby 2.3](https://github.com/ruby/ruby/blob/v2_3_0/NEWS) which
means we can do it safely for every supported version)
@marcotc marcotc requested a review from ivoanjo February 19, 2021 21:49
tata406
tata406 previously approved these changes Feb 20, 2021
ivoanjo
ivoanjo previously approved these changes Feb 22, 2021
Member
@ivoanjo ivoanjo left a comment

I've left a final round of suggestions, but I think this is pretty much good to go.

Thanks a lot for this. Let's bring those flaky failures down to zero!

docs/DevelopmentGuide.md (outdated, resolved)
Comment on lines 15 to 23
after do
count = Thread.list.size

SuckerPunch::RUNNING.make_false
SuckerPunch::Queue.all.each(&:shutdown)
SuckerPunch::Queue.clear

try_wait_until { Thread.list.size < count }
end
Member

I suspect we can use wait_for_termination to make this a bit cleaner

  after do
    SuckerPunch::RUNNING.make_false
    queues = SuckerPunch::Queue.all
    queues.each(&:shutdown)
    SuckerPunch::Queue.clear
    queues.each { |it| raise 'timeout...' unless it.wait_for_termination(5) }
  end

Member Author

I gave this suggestion a try, but wait_for_termination only waits for the termination request to be received, not for all threads to be terminated.

The thread that is "leaking" here is this one.

And unfortunately, the interface to interact with it is to #kill it, which does not wait for a complete shutdown.

Overall the ThreadPoolExecutor doesn't have a strong interface for waiting for complete termination. This is something we can follow up with the concurrent-ruby team on.

Member

That's a shame. Thanks for taking a stab at it. Maybe it would make sense to leave a comment here mentioning that we didn't find a nicer way, because of the issue you mentioned?

spec/spec_helper.rb (outdated, resolved)
spec/spec_helper.rb (outdated, resolved)
spec/spec_helper.rb (outdated, resolved)
@delner
Contributor

delner commented Feb 23, 2021

@marcotc Can we round up on this one? Specifically I'd like to cover how this intersects with workers/async.rb and associated tests.

@marcotc marcotc dismissed stale reviews from ivoanjo and tata406 via 3891fd5 February 23, 2021 20:11
@marcotc
Member Author

marcotc commented Feb 23, 2021

Ready for another review 🙇.

ivoanjo
ivoanjo previously approved these changes Feb 24, 2021
Member
@ivoanjo ivoanjo left a comment

LGTM 👍

@marcotc marcotc requested a review from ivoanjo February 24, 2021 20:17
Member
@ivoanjo ivoanjo left a comment

Merge

@ivoanjo
Member

ivoanjo commented Feb 25, 2021

Since Marco is off, he asked me to take over and merge if this was ready, and it very much amazingly is! Off it goes :)

@ivoanjo ivoanjo merged commit a9d9603 into master Feb 25, 2021
@ivoanjo ivoanjo deleted the fix-all-thread-mess-wip branch February 25, 2021 08:59
@github-actions github-actions bot added this to the 0.46.0 milestone Feb 25, 2021