Ensure tests don't leak threads #1371

Merged
merged 37 commits into master from fix-all-thread-mess-wip on Feb 25, 2021

Conversation

marcotc
Member

@marcotc marcotc commented Feb 17, 2021

This PR cleans up our test suite, fixing cases where threads are created during test execution but never cleaned up. Before this change, rake spec:main finishes its execution with over 40 threads still running.

This causes issues when performing integration testing for global tracer resource usage.

Also, it's good hygiene to clean up the process correctly between test executions, so as to avoid noise. To this day, we still have sporadic test failures caused by background threads from neighboring test executions interacting with Datadog.health_metrics.* while we are trying to assert on it.

This PR introduces a global check that runs after every RSpec context. When triggered, the check currently warns loudly in CI and fails locally. The reason it's not hard-failing in CI in this PR is that I wasn't able to completely clean up all sporadic failures (ones that take 3+ CI runs to trigger), and after working on this for 2 weeks straight, I think it's enough work for now in this area :)
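
For reference, a minimal sketch of the kind of check being described (the hook placement, environment flag, and messages here are illustrative assumptions, not the exact implementation):

# Illustrative sketch only -- not the exact hook added in this PR.
RSpec.configure do |config|
  config.after(:all) do
    # Any non-main thread still alive after the example group is
    # treated as a potential leak.
    leaked = Thread.list.reject { |t| t == Thread.main }

    next if leaked.empty?

    message = "Leaked threads detected: #{leaked.map(&:inspect).join(', ')}"
    if ENV['CI'] == 'true'
      warn message  # warn loudly in CI
    else
      raise message # fail locally
    end
  end
end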

@marcotc marcotc added the dev/testing Involves testing processes (e.g. RSpec) label Feb 17, 2021
@marcotc marcotc self-assigned this Feb 17, 2021
@marcotc marcotc requested a review from a team February 17, 2021 23:05
return false unless run_loop?
# Ensure we always inform that a stop was requested.
# This guarantees correct shutdown even if the loop
# has not asynchronously started yet.
Member Author

This was a race condition when perform_loop (just below) is about to be started, but hasn't yet. In this case, run_loop? will return false, and we'll never mark this worker as stopped.
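
In other words, the fix is to record the stop request before bailing out on run_loop?. Roughly (a simplified sketch, not the exact diff):

# Simplified sketch of the ordering fix (not the exact implementation).
def stop_loop
  # Always record that a stop was requested, even if perform_loop
  # hasn't started yet; the loop checks this flag before iterating,
  # so it will shut down as soon as it does start.
  @run_loop = false

  true
end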

@@ -199,7 +199,7 @@ def write(trace)
# reconfigure the worker accordingly.
# NOTE: It's important we do this before queuing or it will drop the current trace,
# because #after_fork resets the buffer.
perform
perform if async?
Member Author

This was creating a background thread for non-async writers.

Member Author

Creating a background thread for non-async writers is still not correct, but the solution was a little bit more involved, given perform can change the result of async? calls.

@@ -3,12 +3,13 @@
require 'ddtrace/contrib/analytics_examples'
require 'ddtrace'
require 'ddtrace/contrib/httpclient/instrumentation'
require 'http'
require 'httpclient'
Member Author

Wrong library being required here.

@@ -64,7 +67,7 @@

context 'when the worker is not running' do
before { worker.stop_loop }
it { is_expected.to be false }
it { is_expected.to be true }
Member Author

#stop_loop now always returns true, as we can't know for sure if the stop signal has already been consumed by the main loop or not.

Member

Minor: Since we changed the spec on this method to always return true, perhaps we should just move this test to the parent context -- since it always applies, regardless of the state of the worker?

@@ -201,12 +204,12 @@
expect { loop_back_off! }
.to change { worker.loop_wait_time }
.from(described_class::BASE_INTERVAL)
.to(described_class::BACK_OFF_RATIO)
.to(described_class::BASE_INTERVAL * described_class::BACK_OFF_RATIO)
Member Author

These tests were incorrect, but they just so happened to work because BASE_INTERVAL was 1, so the backoff calculation was trivially 1 * BACK_OFF_RATIO.
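
To make the fix concrete with hypothetical numbers (illustration only; the real constants live in the loop worker):

# Hypothetical values, for illustration only.
base_interval  = 0.5
back_off_ratio = 1.2

# loop_back_off! multiplies the wait time by the ratio:
expected_after_back_off = base_interval * back_off_ratio # => 0.6, not 1.2

# With base_interval == 1 the two expectations coincide (1 * 1.2 == 1.2),
# which is why the old, incorrect assertion still passed.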

@@ -13,7 +13,7 @@

describe '#perform' do
subject(:perform) { worker.perform }
after { worker.stop(true, 0) }
after { worker.stop(true, 5) }
Member Author
@marcotc marcotc Feb 17, 2021

0 means "don't wait any time" (aka non-blocking) here. But we actually want to wait for the thread to stop before this test is considered finished.

The execution should be close to immediate here, so I used our house-favourite 5-second timeout, just to be safe.

Comment on lines 112 to 117
"#{t} (#{t.class.name})",
' == Caller ==',
caller,
' == Backtrace ==',
t.backtrace,
"\n"
Member Author

If we're lucky and the thread was initialized after DatadogThreadDebugger was injected, we get a very nice call stack pointing to the line that created the offending thread.
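
The gist of the injection is roughly this (a sketch; the attribute name and exact wiring are assumptions):

# Sketch: prepend a module onto Thread so every thread created after
# injection remembers the call stack of its creator.
module DatadogThreadDebugger
  def initialize(*args, &block)
    @creator_backtrace = caller # who spawned this thread (assumed name)
    super
  end

  attr_reader :creator_backtrace
end

Thread.prepend(DatadogThreadDebugger)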

module ThreadHelpers
module_function

# Isolate threads created in a block in their own `ThreadGroup`.
Member Author

I recommend a high-level read about ThreadGroups, if you're not familiar with them: https://ruby-doc.org/core-2.5.0/ThreadGroup.html
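
The short version of how the helper can use them (a sketch with an assumed helper name; the key fact is that new threads are born into the ThreadGroup of the thread that created them):

# Sketch only; the helper name is illustrative.
module ThreadHelpers
  module_function

  def with_isolated_thread_group
    previous_group = Thread.current.group
    # Move the current thread into a fresh group, so any threads the
    # block spawns are enlisted into that group instead of the default.
    ThreadGroup.new.add(Thread.current)
    yield
  ensure
    previous_group.add(Thread.current)
  end
end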

Member
@ivoanjo ivoanjo left a comment

Thanks a lot for looking at this! I suspect I've run into at least a couple of issues that are being fixed here.

I had to context switch for a bit and I'm not sure I'll be able to come back to this today, so here sir take my partial review, and I'll come back soon :)

lib/ddtrace/workers/loop.rb (outdated, resolved)
@@ -75,7 +77,10 @@ def mutex
private
Member

⬆️ GitHub still doesn't allow me to comment on unchanged lines, but looking at that mutex initialization in #mutex, /me sees a race in mutex creation. Especially now that we have stop_loop and perform_loop trying to grab the mutex at almost the same time (e.g. in tests), I'm guessing we may see that issue from time to time.

I'll admit I'm not a big fan of the current approach to modularity in the workers -- I've had experience with quite a few of these kinds of approaches to assembling a more complex system from parts, and my experience has been that breaking things into modules tends to just hide the complexity by spreading it across multiple files; the assembled system is still very complex.

My suggestion would be to introduce an initialize method that takes care of the needed non-lazy initialization.
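
For illustration, the lazy pattern in question versus the suggested non-lazy alternative (simplified; the module and class names below are assumptions, not the real ones):

# The lazy pattern being discussed: two threads calling #mutex at the
# same time can both observe @mutex as nil and each build their own
# Mutex -- the race in question.
module LazyMutex
  def mutex
    @mutex ||= Mutex.new
  end
end

# The suggested non-lazy alternative: the object creates its mutex once,
# in #initialize, before any other thread can touch it.
class EagerWorker
  def initialize
    @mutex = Mutex.new
  end

  private

  attr_reader :mutex
end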

Member

Also it occurred to me that because it's common for users of this module to combine it with ::Async, then this mutex is shared amongst all of them. 🤕

Comment on lines 202 to 205
perform
perform if async?

# Queue the trace if running asynchronously, otherwise short-circuit and write it directly.
async? ? enqueue(trace) : write_traces([trace])
Member

Minor: Consider simplifying this so the if async? check is done only once.
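
Roughly, the suggested shape (a sketch; as discussed just below, perform can itself change async?, which is why this wasn't a drop-in change):

# Sketch of the suggested single-check form inside #write.
def write(trace)
  if async?
    perform               # make sure the background worker is running
    enqueue(trace)        # queue the trace for asynchronous flushing
  else
    write_traces([trace]) # short-circuit and write synchronously
  end
end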

Member Author

Turns out the perform/async? dance was not trivial.

perform starts whatever process is required to successfully process traces: that could be a thread, or in the case of an after-fork call, we might revert away from threads to a synchronous writer.
When this reverting happens, the Writer becomes synchronous, thus changing the result of async?.

Turns out there was more confusion under the hood, as there were two flags for async: @async in the writer and @run_async in the Async mixin.

They were kept independent, but I believe this is not the correct behaviour here. I ended up collapsing them into a single source of truth, and now the behaviour is as expected: only async executions create threads, synchronous ones never do.
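
A rough sketch of what "single source of truth" means here (simplified; the real mixin has more state than this):

# Simplified sketch -- not the exact implementation.
module Async
  def async?
    @async == true # one flag; the separate @run_async no longer exists
  end

  def stop_fork
    @pid = Process.pid
    @async = false # a forked child reverts to synchronous execution
  end
end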

spec/ddtrace/buffer_spec.rb (outdated, resolved)
spec/ddtrace/context_spec.rb (outdated, resolved)
Member Author
@marcotc marcotc left a comment

Thank you for the review, @ivoanjo! I've addressed all the points you brought up; let me know if there's anything you still think needs to be modified.

spec/ddtrace/buffer_spec.rb (outdated, resolved)
spec/ddtrace/context_spec.rb (outdated, resolved)
spec/ddtrace/contrib/elasticsearch/patcher_spec.rb (outdated, resolved)
spec/ddtrace/contrib/elasticsearch/transport_spec.rb (outdated, resolved)
@marcotc marcotc requested a review from ivoanjo February 18, 2021 20:30
@codecov-io

codecov-io commented Feb 18, 2021

Codecov Report

Merging #1371 (2a0d2d0) into master (6ef1ce5) will increase coverage by 0.03%.
The diff coverage is 97.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1371      +/-   ##
==========================================
+ Coverage   98.11%   98.14%   +0.03%     
==========================================
  Files         755      757       +2     
  Lines       35945    36011      +66     
==========================================
+ Hits        35268    35344      +76     
+ Misses        677      667      -10     
Impacted Files Coverage Δ
lib/ddtrace/workers/trace_writer.rb 95.34% <ø> (-0.21%) ⬇️
...dtrace/propagation/propagation_integration_spec.rb 100.00% <ø> (ø)
spec/ddtrace/workers/runtime_metrics_spec.rb 92.81% <75.00%> (ø)
lib/ddtrace/workers/async.rb 84.70% <90.00%> (+0.18%) ⬆️
spec/ddtrace/workers/async_spec.rb 94.42% <94.73%> (-0.05%) ⬇️
lib/ddtrace/workers/loop.rb 98.11% <100.00%> (+0.03%) ⬆️
spec/ddtrace/buffer_spec.rb 98.23% <100.00%> (+3.18%) ⬆️
spec/ddtrace/context_spec.rb 99.64% <100.00%> (-0.01%) ⬇️
...trace/contrib/action_cable/instrumentation_spec.rb 97.14% <100.00%> (+0.36%) ⬆️
...e/contrib/concurrent_ruby/integration_test_spec.rb 100.00% <100.00%> (ø)
... and 31 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ef1ce5...5d1faa7. Read the comment docs.

Member
@ivoanjo ivoanjo left a comment

Finally went through the full diff :)

I'm not approving yet as I'm concerned about the changes in async (e.g. I'm concerned that it's complex enough that I'm not confident I can tell their impact) and want a bit more information to make the decision.

Other than that, it's in pretty good shape, I left a few nitpicks and ideas here and there.

@@ -146,7 +148,7 @@ def stop_fork

# Reset and turn off
@pid = Process.pid
@run_async = false
@async = false
end
end
end
Member

I'll be honest -- I don't feel comfortable enough with what this module is supposed to do and how it gets used to review these changes now.

Here's what's going through my mind:

  • What is the purpose of @async? In particular, how is it supposed to be different from #running?
  • When and why are users of this module supposed to change @async?
  • Given that #stop_fork and #restart_after_fork are both called with mutex grabbed, what does mutex_after_fork add?
  • I wish this wasn't a module (lol). The behaviors are already quite complex and having them as a class would clarify the interfaces and how interaction happens.

@delner -- Can you help out here?

def after_fork
# Do nothing by default
end

private

attr_reader \
:pid
:mutex, :pid
Member

There's a def worker below that overrides this one. I would suggest adding `@worker = nil` to `initialize` and removing the other one.

Comment on lines 192 to 193
#
# `perform` can change the result of `async?` during a fork.
Member

Nice call-out. I think this is a bit of a code smell showing that this setup isn't great.

Comment on lines 312 to 313
subject(:push) { threads.each(&:join) }
subject(:push) { wait_for_threads }
let(:wait_for_threads) { threads.each { |t| t.join(5000) } }
Member

I'm curious about the join(5000) change since this can mean that we leave threads around after the fact. Was it just a copy from #concat below? I'm curious why we need the timeout on that one as well.

Member Author

I normally add a timeout of 5 seconds to anything in our test suite that should be immediate, but technically has to wait. I do this so that tests don't hang forever.

Member
@ivoanjo ivoanjo Feb 22, 2021

Perhaps we could use raise "..." unless t.join(5000) to raise on timeout? If for some reason the threads don't finish, rather than having the test suite silently chug along, possibly leaking threads that make later tests flaky, we could raise an exception here, which would immediately be spotted as an issue.

(If you agree, is it a big change to do this everywhere in the thread suite? 😉 )
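
For reference, the pattern being suggested would look something like this (a sketch; note that Thread#join takes a timeout in seconds and returns nil when it expires):

let(:wait_for_threads) do
  threads.each do |t|
    raise "Thread #{t.inspect} did not finish in time" unless t.join(5)
  end
end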

Member

⬆️ I'm not sure if you saw this comment. It may be a bit too much to do in this PR (or you may think it's not worth it), but just wanted to call out so that we make an explicit decision :)

Member Author

I did miss this one, thanks for calling this out.

spec/ddtrace/workers_integration_spec.rb (resolved)
@@ -81,7 +81,80 @@
# (e.g. via a command-line flag).
config.default_formatter = 'doc'
end

# Check for leaky test resources
config.after(:all) do
Member

We already spoke about this work before, and I really like it -- it's mega valuable, but I'm slightly concerned about the user experience for new contributors to the gem.

Getting a big stack trace just by running rspec, or even an error without a stack trace (for things our patch can't catch) can be a bit confusing, especially because this is custom to our setup.

So here's what I suggest we can do about it:

  1. Make this behavior non-breaking by default. The current version already includes a CI flag to decide whether to break or not, but I suggest making the default non-breaking -- that way we inform, but don't scare.

  2. Add a note about what this is to the DevelopmentGuide.md AND mention in the error message that we provide some tips there.

  3. Indent the output a bit, so it's clear what belongs where :)

Member Author

Added all suggested changes. Output looks like this now:
[screenshot: Screen Shot 2021-02-19 at 12 28 31 PM]
... after many, many lines, a screenshot of the last section:
[screenshot: Screen Shot 2021-02-19 at 12 28 38 PM]

Comment on lines 121 to 123
# We cannot fail tests gracefully in an `after(:all)` block.
# The test results have already been decided by RSpec.
# We resort to a more "blunt" approach.
Member

Have you considered running the background-thread check after each individual test instead?

Member Author

This is what I wanted, but I could not find a way to run code after all of RSpec's after callbacks have run. Creating an after callback inserts it at the "end" of the current callback list, but it gets placed before some of the host test's after callbacks, depending on declaration order.

I debugged the test runs, stepping into RSpec code, and I don't see any place where RSpec would be able to call our code after all after blocks are done. We could "inject" (read: monkey-patch) these changes into RSpec, but there's no supported way to do it 😢

spec/spec_helper.rb (resolved)
ivoanjo added a commit to DataDog/ffi that referenced this pull request Feb 19, 2021
I found this useful while reviewing DataDog/dd-trace-rb#1371 -- we were
searching for specs that leaked threads, and saw a "mysterious" thread
with no stack, and we couldn't exactly point out what created it or
why it existed.

By setting a name on the callback thread, it becomes immediately
obvious when looking at `Thread.list` what this thread is and where
it comes from.

(Thread naming has been around
[since Ruby 2.3](https://github.com/ruby/ruby/blob/v2_3_0/NEWS) which
means we can do it safely for every supported version)
@marcotc marcotc requested a review from ivoanjo February 19, 2021 21:49
tata406
tata406 previously approved these changes Feb 20, 2021
ivoanjo
ivoanjo previously approved these changes Feb 22, 2021
Member
@ivoanjo ivoanjo left a comment

I've left a final round of suggestions, but I think this is pretty much good to go.

Thanks a lot for this. Let's bring those flaky failures down to zero!

docs/DevelopmentGuide.md (outdated, resolved)
Comment on lines 15 to 23
after do
count = Thread.list.size

SuckerPunch::RUNNING.make_false
SuckerPunch::Queue.all.each(&:shutdown)
SuckerPunch::Queue.clear

try_wait_until { Thread.list.size < count }
end
Member

I suspect we can use wait_for_termination to make this a bit cleaner

  after do
    SuckerPunch::RUNNING.make_false
    queues = SuckerPunch::Queue.all
    queues.each(&:shutdown)
    SuckerPunch::Queue.clear
    queues.each { |it| raise 'timeout...' unless it.wait_for_termination(5) }
  end

Member Author

I gave this suggestion a try, but wait_for_termination only waits for the termination request to be received, not for all threads to be terminated.

The thread that is "leaking" here is this one.

And unfortunately, the interface to interact with it is to #kill it, which does not wait for a complete shutdown.

Overall the ThreadPoolExecutor doesn't have a strong interface for waiting for complete termination. This is something we can follow up with the concurrent-ruby team on.

Member

That's a shame. Thanks for taking a stab at it. Maybe it would make sense to leave a comment here mentioning that we didn't find a nicer way, because of the issue you mentioned?

spec/spec_helper.rb (outdated, resolved)
spec/spec_helper.rb (outdated, resolved)
spec/spec_helper.rb (outdated, resolved)
@delner
Contributor

delner commented Feb 23, 2021

@marcotc Can we round up on this one? Specifically I'd like to cover how this intersects with workers/async.rb and associated tests.

@marcotc marcotc dismissed stale reviews from ivoanjo and tata406 via 3891fd5 February 23, 2021 20:11
@marcotc
Member Author

marcotc commented Feb 23, 2021

Ready for another review 🙇.

ivoanjo
ivoanjo previously approved these changes Feb 24, 2021
Member
@ivoanjo ivoanjo left a comment

LGTM 👍

@marcotc marcotc requested a review from ivoanjo February 24, 2021 20:17
Member
@ivoanjo ivoanjo left a comment

Merge

@ivoanjo
Member

ivoanjo commented Feb 25, 2021

Since Marco is off, he asked me to take over and merge if this was ready, and it very much amazingly is! Off it goes :)

@ivoanjo ivoanjo merged commit a9d9603 into master Feb 25, 2021
@ivoanjo ivoanjo deleted the fix-all-thread-mess-wip branch February 25, 2021 08:59
@github-actions github-actions bot added this to the 0.46.0 milestone Feb 25, 2021