-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correctly close all StatsD clients #1429
Conversation
|
The build failure is now due to DataDog/dogstatsd-ruby#175 |
spec/spec_helper.rb
Outdated
@@ -91,6 +91,10 @@ | |||
# with the test execution. | |||
config.around do |example| | |||
example.run.tap do | |||
# Background thread leak checking is currently too verbose for CI, | |||
# slowing down builds to the point of causing CircleCI timeouts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JRuby builds are timing out: https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/3161/workflows/2c837108-5272-42dc-841a-694d76e30b60/jobs/126109
And build console output is over 400MiB 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
400mb of feedback is clearly not helpful. Rather than skipping it entirely in CI, could we count how many errors we've emitted, and after some number, we just stop reporting them, and just include a line at the end of the report with how many errors?
Famously, valgrind does the same:
More than 10000000 total errors detected. I'm not reporting any more.
Final error counts will be inaccurate. Go fix your program!
@tracer.shutdown! | ||
@tracer = nil | ||
end | ||
|
||
Datadog.send(:reset!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gemfile
Outdated
gem 'dogstatsd-ruby', '>= 3.3.0' | ||
|
||
# Unpin when https://github.com/DataDog/dogstatsd-ruby/pull/175 is released | ||
gem 'dogstatsd-ruby', git: 'https://github.com/marcotc/dogstatsd-ruby.git', ref: 'dd7ef5dfe4ff9336ec27dc3736e0bbd2acede767' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have push access to DataDog/dogstatsd-ruby
, so my fork it is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is reasonable, given that it's only used for testing.
I decided to make this PR pass our build, instead of breaking down all requirements to have a green CI over 3 CI failing PRs. The first commit contains the main changes. The next two are separate issues that are required for CI to pass. I'm trusting that git history will be enough to help us separate these changes if we need to look back into this changeset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this 🙇 . I'm a bit concerned with the potential race during close, and hence as much as it pains me, I'll hold off on pressing the merge button >_>
Gemfile
Outdated
gem 'dogstatsd-ruby', '>= 3.3.0' | ||
|
||
# Unpin when https://github.com/DataDog/dogstatsd-ruby/pull/175 is released | ||
gem 'dogstatsd-ruby', git: 'https://github.com/marcotc/dogstatsd-ruby.git', ref: 'dd7ef5dfe4ff9336ec27dc3736e0bbd2acede767' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is reasonable, given that it's only used for testing.
runtime_metrics.enabled = false | ||
runtime_metrics.stop(true) | ||
runtime_metrics.stop(true, close_metrics: false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Is this enabled = false
needed? Wouldn't stop work even if enabled
was still true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like it's strictly required. This enabled
is part of the worker pooling/threading/loop pattern, so it's tricky to be sure, but as far as I can reason, it's not actually needed to properly shut down runtime_metrics
here.
def stop(*args, close_metrics: true) | ||
@metrics.close if close_metrics | ||
super(*args) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this safe? We're calling .close
on something that is shared with a background thread, so if Runtime::Metrics
or Datadog::Statsd
are not thread-safe, this can be problematic.
I was thinking that perhaps we can have an easy fix by moving the close after the call to super -- then we know the background activity has stopped, so it should be a safer time to close the resource?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion here.
It's rather hard to reason about this worker class, as there are a few layers of pooling/threading/loop that interweave with each other... I always get lost in the process. It seems like something we'd want to "flatten" in the future.
I'll give it a shot and see if works correctly with close after super.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please! I think the module approach right now makes the combined objects have a lot of behavior and weird cases. I'd really like to move to using composition for these workers, e.g. two objects which communicate with clear APIs, rather than the current approach of having a lot of behavior injected into the class via a combination of include and prepend.
spec/ddtrace_integration_spec.rb
Outdated
it 'does not error on reporting health metrics' do | ||
expect(Datadog.health_metrics.queue_accepted(1)).to be_a(Integer) | ||
expect(Datadog.health_metrics.queue_accepted(1)).to be(true) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Perhaps to be_truthy
would be a bit more flexible here?
spec/spec_helper.rb
Outdated
@@ -91,6 +91,10 @@ | |||
# with the test execution. | |||
config.around do |example| | |||
example.run.tap do | |||
# Background thread leak checking is currently too verbose for CI, | |||
# slowing down builds to the point of causing CircleCI timeouts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
400mb of feedback is clearly not helpful. Rather than skipping it entirely in CI, could we count how many errors we've emitted, and after some number, we just stop reporting them, and just include a line at the end of the report with how many errors?
Famously, valgrind does the same:
More than 10000000 total errors detected. I'm not reporting any more.
Final error counts will be inaccurate. Go fix your program!
# DEV: we have to use an explicit `block`, argument | ||
# instead of the implicit `yield` call, as calling | ||
# `yield` here crashes the Ruby VM in Ruby < 2.2. | ||
def initialize(*args, &block) | ||
caller_ = caller | ||
wrapped = lambda do |*thread_args| | ||
Thread.current.instance_variable_set(:@caller, caller_) | ||
yield(*thread_args) | ||
block.call(*thread_args) # rubocop:disable Performance/RedundantBlockCall | ||
end | ||
wrapped.ruby2_keywords if wrapped.respond_to?(:ruby2_keywords, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this method is like the history of Ruby bugs and features. "Let's just wrap it" turns into "oh but this crashes on old rubies, and we need to use correct delegation on modern rubies...". Ahh computers...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. 👍
unless $background_thread_leak_warned ||= false | ||
warn RSpec::Core::Formatters::ConsoleCodes.wrap( | ||
"Too many leaky thread reports! Suppressing further reports.\n" \ | ||
'Consider addressing the previously reported leaks before proceeding.', | ||
:red | ||
) | ||
|
||
$background_thread_leak_warned = true | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth adding a note at the end of the test suite execution too, otherwise if you ignore the output of running a lot of tests (which to be honest, is what I do with such a noisy output as we have right now), you may miss that we still have a bunch of those to fix :)
The new version of
dogstatsd-ruby
, v4.9.0, creates a background thread, which needs to be properly closed by calling the previously existing method#close
.In a few parts of our tracer and test suite we fail to call
#close
on all instances of the StatsD client. This PR addresses that.The configuration of Runtime Metric and Health Metrics components reuse the existing
statsd
client from previous configurations, thus directly manipulating the lifecycle ofstatsd
. This doesn't seem like a natural relationship, as thestatsd
client is nested inside the Runtime Metrics Worker object, and trying to manipulate its lifecycle from outside creates a few complications. These are documented inline. This also gives us another reason to "flatten" our internal tracer components, in this case thestatsd
client, as that will restore the natural ownership of its lifecycle to the global Components manager.