Ensure tests don't leak threads #1371
Conversation
lib/ddtrace/workers/loop.rb
Outdated
return false unless run_loop?
# Ensure we always inform that a stop was requested.
# This guarantees correct shutdown even if the loop
# has asynchronously not started yet.
This was a race condition: when `perform_loop` (just below) is about to be started but hasn't yet, `run_loop?` will return false, and we'll never mark this worker as stopped.
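A minimal sketch of the idea behind the fix, using hypothetical names (`LoopSketch`, `@stop_requested`, and the method bodies are illustrative, not the actual ddtrace internals): recording the stop request unconditionally means a stop that arrives before the loop has started is still honored.

```ruby
# Illustrative sketch only: class and variable names are made up.
class LoopSketch
  def initialize
    @mutex = Mutex.new
    @run = false
    @stop_requested = false
  end

  # Always record that a stop was requested, even if the loop
  # hasn't asynchronously started yet.
  def stop_loop
    @mutex.synchronize { @stop_requested = true }
    true
  end

  def perform_loop
    @mutex.synchronize do
      return if @stop_requested # honor a stop that arrived early
      @run = true
    end
  end

  def running?
    @run
  end
end

worker = LoopSketch.new
worker.stop_loop    # stop arrives before the loop has started
worker.perform_loop # the pending stop is seen; the loop never runs
```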
lib/ddtrace/workers/trace_writer.rb
Outdated
@@ -199,7 +199,7 @@ def write(trace)
# reconfigure the worker accordingly.
# NOTE: It's important we do this before queuing or it will drop the current trace,
# because #after_fork resets the buffer.
- perform
+ perform if async?
This was creating a background thread for non-async writers.
Creating a background thread for non-async writers is still not correct, but the solution was a little bit more involved, given `perform` can change the result of `async?` calls.
@@ -3,12 +3,13 @@
require 'ddtrace/contrib/analytics_examples'
require 'ddtrace'
require 'ddtrace/contrib/httpclient/instrumentation'
- require 'http'
+ require 'httpclient'
Wrong library being required here.
spec/ddtrace/workers/loop_spec.rb
Outdated
@@ -64,7 +67,7 @@
context 'when the worker is not running' do
before { worker.stop_loop }
- it { is_expected.to be false }
+ it { is_expected.to be true }
`#stop_loop` now always returns `true`, as we can't know for sure whether the stop signal has already been consumed by the main loop or not.
Minor: Since we changed the spec on this method to always return true, perhaps we should just move this test to the parent context -- since it always applies, regardless of the state of the worker?
@@ -201,12 +204,12 @@
expect { loop_back_off! }
  .to change { worker.loop_wait_time }
  .from(described_class::BASE_INTERVAL)
-  .to(described_class::BACK_OFF_RATIO)
+  .to(described_class::BASE_INTERVAL * described_class::BACK_OFF_RATIO)
These tests were incorrect, but they just so happened to work because `BASE_INTERVAL` was 1, so the backoff calculation was trivially `1 * BACK_OFF_RATIO`.
@@ -13,7 +13,7 @@
describe '#perform' do
subject(:perform) { worker.perform }
- after { worker.stop(true, 0) }
+ after { worker.stop(true, 5) }
0 means "don't wait any time" (aka non-blocking) here. But we actually want to wait for the thread to stop before this test is considered finished.
The execution should be close to immediate here, so I used our house-favourite 5-second timeout, just to be safe.
spec/spec_helper.rb
Outdated
"#{t} (#{t.class.name})",
' == Caller ==',
caller,
' == Backtrace ==',
t.backtrace,
"\n"
If we're lucky and the thread was initialized after `DatadogThreadDebugger` was injected, we get a very nice call stack with the offending thread-creator line.
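A rough sketch of the mechanism being described (the module name and details here are assumptions, not the actual spec_helper code): prepending a module on `Thread` lets every new thread capture the call stack of the code that created it.

```ruby
# Hypothetical sketch: prepend on Thread so each new thread records
# the backtrace of its creator.
module ThreadCreationDebugger
  attr_reader :creation_caller

  def initialize(*args, &block)
    @creation_caller = caller # call stack of the thread creator
    super
  end
end

Thread.prepend(ThreadCreationDebugger)

t = Thread.new { :work }
t.join
# t.creation_caller now points at the line that spawned the thread
```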
module ThreadHelpers
  module_function

  # Isolate threads created in a block in their own `ThreadGroup`.
I recommend a high-level read about ThreadGroups, if you're not familiar: https://ruby-doc.org/core-2.5.0/ThreadGroup.html
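For illustration, a minimal sketch of the isolation idea (the helper name `with_isolated_thread_group` is made up): threads spawned inside the block are moved into a dedicated `ThreadGroup` so they can be inspected, and cleaned up, separately from the rest of the process.

```ruby
# Hypothetical helper: collect threads created during the block
# into their own ThreadGroup.
def with_isolated_thread_group
  group = ThreadGroup.new
  before = Thread.list
  yield
  (Thread.list - before).each { |t| group.add(t) }
  group
end

queue = Queue.new
group = with_isolated_thread_group do
  Thread.new { queue.pop } # blocks until released below
end

isolated_count = group.list.size # only the thread from the block
queue << :done # release the worker so it can finish
```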
Thanks a lot for looking at this! I suspect I've run into at least a couple of the issues that are being fixed here.
I had to context switch for a bit and I'm not sure I'll be able to come back to this today, so here, sir, take my partial review, and I'll come back soon :)
lib/ddtrace/workers/loop.rb
Outdated
@@ -75,7 +77,10 @@ def mutex
private
⬆️ GitHub still doesn't allow me to comment on unchanged lines, but looking at that mutex initialization in `#mutex`, /me sees a race in mutex creation. Especially now that we have `stop_loop` and `perform_loop` trying to grab the mutex at almost the same time (e.g. in tests), I'm guessing we may see that issue from time to time.
I'll admit I'm not a big fan of the current approach to modularity in the workers -- I've had experience with quite a few of these kinds of approaches to tackling a more complex system from parts, and my experience has been that breaking into modules tends to just hide the complexity by splitting it across multiple files, while the assembled system is still very complex.
My suggestion would be to perhaps introduce an `initialize` that would take care of the needed non-lazy initialization.
Also it occurred to me that, because it's common for users of this module to combine it with `::Async`, this mutex is shared amongst all of them. 🤕
lib/ddtrace/workers/trace_writer.rb
Outdated
- perform
+ perform if async?

# Queue the trace if running asynchronously, otherwise short-circuit and write it directly.
async? ? enqueue(trace) : write_traces([trace])
Minor: Consider simplifying this to check `if async?` only once.
Turns out the `perform`/`async?` dance was not trivial.
`perform` starts whatever process is required to successfully process traces: that could be a thread, or, in the case of an after-fork call, we might revert away from threads to a synchronous writer.
When this reverting happens, the Writer becomes synchronous, thus changing the result of `async?`.
Turns out there was more confusion under the hood, as there were two flags for async: `@async` in the writer and `@run_async` in the Async mixin.
They were kept independent, but I believe this is not the correct behaviour here. I ended up collapsing them into a single source of truth, and now the behaviour is as expected: only async executions create threads; synchronous ones never do.
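The shape of this fix can be sketched roughly as follows (class and method names here are illustrative, not the actual ddtrace code): a single flag backs both `async?` and the write path, so reverting to synchronous mode after a fork changes both consistently.

```ruby
# Illustrative sketch only: one @async flag is the single source of
# truth for both #async? and how #write behaves.
class WriterSketch
  def initialize(async: true)
    @async = async
  end

  def async?
    @async == true
  end

  # After a fork, revert to synchronous writing.
  def after_fork!
    @async = false
  end

  def write(trace)
    async? ? enqueue(trace) : write_traces([trace])
  end

  private

  def enqueue(_trace)
    :queued # would hand off to a background thread
  end

  def write_traces(_traces)
    :written # writes inline; no thread is created
  end
end

writer = WriterSketch.new(async: true)
first = writer.write(:trace)  # queued while async
writer.after_fork!
second = writer.write(:trace) # written synchronously after the fork
```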
Co-authored-by: Ivo Anjo <ivo.anjo@datadoghq.com>
Thank you for the review, @ivoanjo! I've addressed all the points you brought up; let me know if there's anything you still think needs to be modified.
Codecov Report
@@ Coverage Diff @@
## master #1371 +/- ##
==========================================
+ Coverage 98.11% 98.14% +0.03%
==========================================
Files 755 757 +2
Lines 35945 36011 +66
==========================================
+ Hits 35268 35344 +76
+ Misses 677 667 -10
Continue to review full report at Codecov.
Finally went through the full diff :)
I'm not approving yet, as I'm concerned about the changes in async (e.g. I'm concerned that it's complex enough that I'm not confident I can tell its impact) and want a bit more information to make the decision.
Other than that, it's in pretty good shape, I left a few nitpicks and ideas here and there.
lib/ddtrace/workers/async.rb
Outdated
@@ -146,7 +148,7 @@ def stop_fork

  # Reset and turn off
  @pid = Process.pid
- @run_async = false
+ @async = false
end
end
end
I'll be honest -- I don't feel comfortable enough with what this module is supposed to do and how it gets used to review these changes now.
Here's what's going through my mind:
- What is the purpose of `@async`? In particular, how is it supposed to be different from `#running?`?
- When and why are users of this module supposed to change `@async`?
- Given that `#stop_fork` and `#restart_after_fork` are both called with `mutex` grabbed, what does `mutex_after_fork` add?
- I wish this wasn't a module (lol). The behaviors are already quite complex, and having them as a class would clarify the interfaces and how interaction happens.

@delner -- Can you help out here?
lib/ddtrace/workers/async.rb
Outdated
def after_fork
  # Do nothing by default
end

private

attr_reader \
-   :pid
+   :mutex, :pid
There's a `def worker` below that overrides this one. I would suggest adding `@worker = nil` to `initialize` and removing the other one.
lib/ddtrace/workers/trace_writer.rb
Outdated
#
# `perform` can change the result of `async?` during a fork.
Nice call out. I think this is a bit of a code smell that this setup isn't great.
spec/ddtrace/buffer_spec.rb
Outdated
- subject(:push) { threads.each(&:join) }
+ subject(:push) { wait_for_threads }
+ let(:wait_for_threads) { threads.each { |t| t.join(5000) } }
I'm curious about the `join(5000)` change, since this can mean that we leave threads around after the fact. Was it just a copy from `#concat` below? I'm curious why we need the timeout on that one as well.
I normally add a timeout of 5 seconds to anything in our test suite that should be immediate, but technically has to wait. I do this so that tests don't hang forever.
Perhaps we could use `raise "..." unless t.join(5000)` to raise on timeout? If for some reason the threads don't finish, rather than having the test suite silently chug along, possibly leaking threads that make later tests flaky, we could raise an exception here, which would immediately be spotted as an issue.
(If you agree, is it a big change to do this everywhere in the thread suite? 😉 )
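The suggested pattern, sketched with a trivial thread (the 5-second value mirrors the suite's convention): `Thread#join` with a timeout returns `nil` when the thread is still alive, so a silent leak becomes a loud failure.

```ruby
threads = [Thread.new { 1 + 1 }]

threads.each do |t|
  # Thread#join(limit) returns the thread on success and nil on timeout.
  raise 'thread did not terminate within 5 seconds' unless t.join(5)
end
```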
⬆️ I'm not sure if you saw this comment. It may be a bit too much to do in this PR (or you may think it's not worth it), but just wanted to call out so that we make an explicit decision :)
I did miss this one, thanks for calling this out.
spec/spec_helper.rb
Outdated
@@ -81,7 +81,80 @@
# (e.g. via a command-line flag).
config.default_formatter = 'doc'
end

# Check for leaky test resources
config.after(:all) do
We already spoke about this work before, and I really like it -- it's mega valuable, but I'm slightly concerned about the user experience for new contributors to the gem.
Getting a big stack trace just by running rspec, or even an error without a stack trace (for things our patch can't catch) can be a bit confusing, especially because this is custom to our setup.
So here's what I suggest we can do about it:
- Make this behavior non-breaking by default. The current version already includes a `CI` flag to decide whether to break or not, but I suggest making the default non-breaking -- that way we inform, but don't scare.
- Add a note about what this is to the `DevelopmentGuide.md` AND mention in the error message that we provide some tips there.
- Indent the output a bit, so it's clear what belongs where :)
spec/spec_helper.rb
Outdated
# We cannot fail tests gracefully in an `after(:all)` block.
# The test results have already been decided by RSpec.
# We resort to a more "blunt" approach.
Have you considered running the check for background threads and checking it after each individual test?
This is what I wanted, but I could not find a way to run code after all `after` RSpec callbacks are run. Creating an `after` callback inserts it at the "end" of the current callback list, but it gets placed before some of the host test's `after` callbacks, depending on declaration order.
I debugged the test runs, stepping into RSpec code, and I don't see any place where RSpec would be able to call our code after all `after` blocks are done. We could "inject" (read: monkey patch) these changes into RSpec, but there's no supported way to do it 😢
I found this useful while reviewing DataDog/dd-trace-rb#1371 -- we were searching for specs that leaked threads, and saw a "mysterious" thread with no stack that we couldn't exactly point out what created it or why it existed. By setting a name on the callback thread, it becomes immediately obvious when looking at `Thread.list` what this thread is and where it comes from. (Thread naming has been around [since Ruby 2.3](https://github.com/ruby/ruby/blob/v2_3_0/NEWS) which means we can do it safely for every supported version)
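As a quick illustration of the naming idea (the thread name string here is arbitrary): a named thread is immediately identifiable when inspecting `Thread.list`.

```ruby
# Naming a thread makes it self-describing in Thread.list output.
queue = Queue.new
t = Thread.new { queue.pop }
t.name = 'example-callback-thread' # Thread#name= exists since Ruby 2.3
queue << :done
t.join
```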
Co-authored-by: Ivo Anjo <ivo.anjo@datadoghq.com>
I've left a final round of suggestions, but I think this is pretty much good to go.
Thanks a lot for this. Let's bring those flaky failures down to zero!
after do
  count = Thread.list.size

  SuckerPunch::RUNNING.make_false
  SuckerPunch::Queue.all.each(&:shutdown)
  SuckerPunch::Queue.clear

  try_wait_until { Thread.list.size < count }
end
I suspect we can use `wait_for_termination` to make this a bit cleaner:
after do
  SuckerPunch::RUNNING.make_false
  queues = SuckerPunch::Queue.all
  queues.each(&:shutdown)
  SuckerPunch::Queue.clear
  queues.each { |it| raise 'timeout...' unless it.wait_for_termination(5) }
end
I gave this suggestion a try, but `wait_for_termination` only waits for the request for termination to be received, not for all threads to be terminated.
The thread that is "leaking" here is this one.
And unfortunately, the interface to interact with it is to #kill
it which does not wait for a complete shutdown.
Overall the ThreadPoolExecutor doesn't have a strong interface for waiting for complete termination. This is something we can follow up with the concurrent-ruby team on.
That's a shame. Thanks for taking a stab at it. Maybe it would make sense to leave here a comment mentioning that we didn't find a nicer way because of the issue you mentioned?
@marcotc Can we round up on this one? Specifically I'd like to cover how this intersects with
Ready for another review 🙇.
LGTM 👍
Since Marco is off, he asked me to take over and merge if this was ready, and it very much amazingly is! Off it goes :)
This PR cleans up our test suite of cases where threads are created during test execution but not cleaned up. Before this change, `rake spec:main` finishes its execution with over 40 threads executing. This causes issues when performing integration testing for global tracer resource usage.
Also, it's good hygiene to clean up the process correctly between each test execution, to avoid noise. To this day, we still have sporadic test failures caused by background threads from neighboring test executions interacting with `Datadog.health_metrics.*` while we are trying to assert on it.
A global check is introduced in this PR that runs after every RSpec context. The check currently warns loudly in CI and fails locally when triggered. The reason it's not hard-failing in CI in this PR is because I wasn't able to completely clean up all sporadic failures (ones that take 3+ CI runs to trigger), and after working on this for 2 weeks straight, I think it's enough work for now in this area :)