Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly close all StatsD clients #1429

Merged
merged 9 commits into from
Apr 9, 2021
Merged

Correctly close all StatsD clients #1429

merged 9 commits into from
Apr 9, 2021

Conversation

marcotc
Copy link
Member

@marcotc marcotc commented Mar 24, 2021

The new version of dogstatsd-ruby, v4.9.0, creates a background thread, which needs to be properly closed by calling the previously existing method #close.

In a few parts of our tracer and test suite we fail to call #close on all instances of the StatsD client. This PR addresses that.

The configuration of Runtime Metric and Health Metrics components reuse the existing statsd client from previous configurations, thus directly manipulating the lifecycle of statsd. This doesn't seem like a natural relationship, as the statsd client is nested inside the Runtime Metrics Worker object, and trying to manipulate its lifecycle from outside creates a few complications. These are documented inline. This also gives us another reason to "flatten" our internal tracer components, in this case the statsd client, as that will restore the natural ownership of its lifecycle to the global Components manager.

@marcotc marcotc self-assigned this Mar 25, 2021
@marcotc marcotc added dev/refactor Involves refactoring existing components core Involves Datadog core libraries labels Mar 25, 2021
@marcotc
Copy link
Member Author

marcotc commented Mar 26, 2021

dogstatsd-ruby 4.9.0 was recently removed, but 5.0.0 will be released shortly with the change changes (just different version number, due to breaking API change).
Even though it's a major release, the breaking changes do not affect ddtrace's statsd calls.

@marcotc marcotc marked this pull request as ready for review April 7, 2021 23:44
@marcotc marcotc requested a review from a team April 7, 2021 23:44
@marcotc
Copy link
Member Author

marcotc commented Apr 7, 2021

The build failure is now due to DataDog/dogstatsd-ruby#175

@@ -91,6 +91,10 @@
# with the test execution.
config.around do |example|
example.run.tap do
# Background thread leak checking is currently too verbose for CI,
# slowing down builds to the point of causing CircleCI timeouts.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

400mb of feedback is clearly not helpful. Rather than skipping it entirely in CI, could we count how many errors we've emitted, and after some number, we just stop reporting them, and just include a line at the end of the report with how many errors?

Famously, valgrind does the same:

More than 10000000 total errors detected.  I'm not reporting any more.
Final error counts will be inaccurate.  Go fix your program!

@tracer.shutdown!
@tracer = nil
end

Datadog.send(:reset!)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gemfile Outdated
gem 'dogstatsd-ruby', '>= 3.3.0'

# Unpin when https://github.com/DataDog/dogstatsd-ruby/pull/175 is released
gem 'dogstatsd-ruby', git: 'https://github.com/marcotc/dogstatsd-ruby.git', ref: 'dd7ef5dfe4ff9336ec27dc3736e0bbd2acede767'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have push access to DataDog/dogstatsd-ruby, so my fork it is for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is reasonable, given that it's only used for testing.

@marcotc
Copy link
Member Author

marcotc commented Apr 8, 2021

I decided to make this PR pass our build, instead of breaking down all requirements to have a green CI over 3 CI failing PRs.

The first commit contains the main changes. The next two are separate issues that are required for CI to pass. I'm trusting that git history will be enough to help us separate these changes if we need to look back into this changeset.

Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this 🙇 . I'm a bit concerned with the potential race during close, and hence as much as it pains me, I'll hold off on pressing the merge button >_>

Gemfile Outdated
gem 'dogstatsd-ruby', '>= 3.3.0'

# Unpin when https://github.com/DataDog/dogstatsd-ruby/pull/175 is released
gem 'dogstatsd-ruby', git: 'https://github.com/marcotc/dogstatsd-ruby.git', ref: 'dd7ef5dfe4ff9336ec27dc3736e0bbd2acede767'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is reasonable, given that it's only used for testing.

Comment on lines 127 to 128
runtime_metrics.enabled = false
runtime_metrics.stop(true)
runtime_metrics.stop(true, close_metrics: false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Is this enabled = false needed? Wouldn't stop work even if enabled was still true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like it's strictly required. This enabled is part of the worker pooling/threading/loop pattern, so it's tricky to be sure, but as far as I can reason, it's not actually needed to properly shut down runtime_metrics here.

Comment on lines 52 to 55
def stop(*args, close_metrics: true)
@metrics.close if close_metrics
super(*args)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe? We're calling .close on something that is shared with a background thread, so if Runtime::Metrics or Datadog::Statsd are not thread-safe, this can be problematic.

I was thinking that perhaps we can have an easy fix by moving the close after the call to super -- then we know the background activity has stopped, so it should be a safer time to close the resource?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion here.
It's rather hard to reason about this worker class, as there are a few layers of pooling/threading/loop that interweave with each other... I always get lost in the process. It seems like something we'd want to "flatten" in the future.

I'll give it a shot and see if works correctly with close after super.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please! I think the module approach right now makes the combined objects have a lot of behavior and weird cases. I'd really like to move to using composition for these workers, e.g. two objects which communicate with clear APIs, rather than the current approach of having a lot of behavior injected into the class via a combination of include and prepend.

Comment on lines 115 to 117
it 'does not error on reporting health metrics' do
expect(Datadog.health_metrics.queue_accepted(1)).to be_a(Integer)
expect(Datadog.health_metrics.queue_accepted(1)).to be(true)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Perhaps to be_truthy would be a bit more flexible here?

@@ -91,6 +91,10 @@
# with the test execution.
config.around do |example|
example.run.tap do
# Background thread leak checking is currently too verbose for CI,
# slowing down builds to the point of causing CircleCI timeouts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

400mb of feedback is clearly not helpful. Rather than skipping it entirely in CI, could we count how many errors we've emitted, and after some number, we just stop reporting them, and just include a line at the end of the report with how many errors?

Famously, valgrind does the same:

More than 10000000 total errors detected.  I'm not reporting any more.
Final error counts will be inaccurate.  Go fix your program!

Comment on lines +180 to 189
# DEV: we have to use an explicit `block`, argument
# instead of the implicit `yield` call, as calling
# `yield` here crashes the Ruby VM in Ruby < 2.2.
def initialize(*args, &block)
caller_ = caller
wrapped = lambda do |*thread_args|
Thread.current.instance_variable_set(:@caller, caller_)
yield(*thread_args)
block.call(*thread_args) # rubocop:disable Performance/RedundantBlockCall
end
wrapped.ruby2_keywords if wrapped.respond_to?(:ruby2_keywords, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this method is like the history of Ruby bugs and features. "Let's just wrap it" turns into "oh but this crashes on old rubies, and we need to use correct delegation on modern rubies...". Ahh computers...

@marcotc marcotc requested review from ivoanjo and delner April 9, 2021 18:56
Copy link
Contributor

@delner delner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. 👍

@marcotc marcotc merged commit b1a159f into master Apr 9, 2021
@marcotc marcotc deleted the fix-statsd branch April 9, 2021 20:46
@github-actions github-actions bot added this to the 0.48.0 milestone Apr 9, 2021
Comment on lines +101 to +109
unless $background_thread_leak_warned ||= false
warn RSpec::Core::Formatters::ConsoleCodes.wrap(
"Too many leaky thread reports! Suppressing further reports.\n" \
'Consider addressing the previously reported leaks before proceeding.',
:red
)

$background_thread_leak_warned = true
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth adding a note at the end of the test suite execution too, otherwise if you ignore the output of running a lot of tests (which to be honest, is what I do with such a noisy output as we have right now), you may miss that we still have a bunch of those to fix :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries dev/refactor Involves refactoring existing components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants