Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tracer health metrics #838

Merged
merged 21 commits into from
Nov 20, 2019
Merged

Add tracer health metrics #838

merged 21 commits into from
Nov 20, 2019

Conversation

delner
Copy link
Contributor

@delner delner commented Oct 15, 2019

To monitor the health and performance of the tracer, this pull request adds health metrics to the tracer and other core components, which are sent as Statsd metrics under the datadog.tracer.* prefix. These metrics can then be graphed to evaluate the health, stability and impact of the trace library within a Ruby application.

By default, these metrics are enabled, if dogstatsd-ruby >= 3.3.0 is available. They can be disabled by setting the DD_DEBUG_HEALTH_METRICS_ENABLED ENV var to 0.

List of metrics added with this pull request:

datadog.tracer.api.errors
datadog.tracer.api.requests
datadog.tracer.api.responses
datadog.tracer.queue.accepted
datadog.tracer.queue.accepted_lengths
datadog.tracer.queue.accepted_size
datadog.tracer.queue.dropped
datadog.tracer.queue.length
datadog.tracer.queue.max_length
datadog.tracer.queue.size
datadog.tracer.queue.spans
datadog.tracer.traces.filtered # Added but not implemented
datadog.tracer.writer.cpu_time # Added but not implemented

@delner delner added core Involves Datadog core libraries do-not-merge/WIP Not ready for merge feature Involves a product feature labels Oct 15, 2019
@delner delner self-assigned this Oct 15, 2019
@delner
Copy link
Contributor Author

delner commented Oct 15, 2019

This pull request is still a work in progress; need to add the actual instrumentation into the tracer code and tests to verify its behavior. Should also run this in a sandbox application to verify it works end-to-end.

Copy link
Member

@marcotc marcotc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 It looks very good!

Copy link
Member

@brettlangdon brettlangdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a good start

lib/ddtrace/configuration/settings.rb Outdated Show resolved Hide resolved
lib/ddtrace/debug/health.rb Outdated Show resolved Hide resolved
@delner delner force-pushed the feature/debug_metrics branch 3 times, most recently from 4c7c636 to f56a39c Compare October 22, 2019 22:11
@delner delner marked this pull request as ready for review October 22, 2019 22:13
@delner delner requested a review from a team October 22, 2019 22:13
@delner
Copy link
Contributor Author

delner commented Oct 22, 2019

Implemented some queue metrics, too.

Right now it's pretty aggressive, and will send stats with each trace push. This might be too much ultimately, but I want to measure the upper bounds on what we can send, because if we can send for each push, we can get some nice distribution metrics about things like "whats the distribution of spans per trace."

If we can't, then we can apply sample rates to Statsd, or rework these metrics to only compute/submit metric aggregations on flush, which would significantly reduce the number of stat calls, but also the granularity at which the data can be explored; a trade off we might have to take for sake of scale.

Consider this a first iteration, to experiment with and explore the upper bounds of what could be done with stats, but with the caveat that it might have to be reworked slightly to scale it back.

@delner
Copy link
Contributor Author

delner commented Oct 23, 2019

Also opened this additional branch, which implements a more conservative "aggregate at flush time" approach; its on the other end of the scale in relation to this original implementation. It definitely has flaws of its own (e.g. queue.accepted_* and queue.* metrics are identical/redundant) but it also is much less verbose.

Feel free to compare the differences here and open it as a PR if necessary: https://github.com/DataDog/dd-trace-rb/compare/feature/debug_metrics...refactor/aggregated_queue_metrics?expand=1

Comment on lines +14 to +15
METRIC_QUEUE_ACCEPTED_LENGTHS = 'datadog.tracer.queue.accepted_lengths'.freeze
METRIC_QUEUE_ACCEPTED_SIZE = 'datadog.tracer.queue.accepted_size'.freeze
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these two sub-metrics of datadog.tracer.queue.accepted, would it make sense to have the metric name separator like so: datadog.tracer.queue.accepted.lengths datadog.tracer.queue.accepted.size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that could make sense; something we'd want to reconcile with our standards though.

@marcotc marcotc force-pushed the feature/debug_metrics branch 2 times, most recently from f56a39c to 3d7c7ff Compare October 24, 2019 18:21
@marcotc
Copy link
Member

marcotc commented Oct 24, 2019

Pushed a rebase with latest master.


# Rough calculation of bytesize; not very accurate.
object.instance_variables.inject(::ObjectSpace.memsize_of(object)) do |sum, var|
sum + ::ObjectSpace.memsize_of(object.instance_variable_get(var))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seem like a good compromise between a very shallow estimate (only ::ObjectSpace.memsize_of(object)) and a full recursive memory measurement. 👍

@delner delner merged commit 84bdb44 into master Nov 20, 2019
@delner delner deleted the feature/debug_metrics branch November 20, 2019 21:48
@delner delner added this to the 0.29.0 milestone Nov 20, 2019
@delner delner added this to Released in Active work Nov 21, 2019
@marcotc marcotc removed the do-not-merge/WIP Not ready for merge label Dec 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries feature Involves a product feature
Projects
Active work
  
Released
Development

Successfully merging this pull request may close these issues.

None yet

3 participants