Add tracer health metrics #838

delner · 2019-10-15T21:30:17Z

To monitor the health and performance of the tracer, this pull request adds health metrics to the tracer and other core components, which are sent as Statsd metrics under the datadog.tracer.* prefix. These metrics can then be graphed to evaluate the health, stability and impact of the trace library within a Ruby application.

These metrics are enabled by default if dogstatsd-ruby >= 3.3.0 is available. To disable them, set the DD_DEBUG_HEALTH_METRICS_ENABLED environment variable to 0.
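
As a rough illustration of the toggle described above, here is a minimal sketch of reading such an environment variable. The helper name health_metrics_enabled? is hypothetical and not part of the library; the actual setting lives in lib/ddtrace/configuration/settings.rb and may parse the value differently. The dogstatsd-ruby version check is omitted here.

```ruby
# Hypothetical helper (not the library's API): the feature counts as enabled
# unless the variable is explicitly set to a falsy-looking value.
def health_metrics_enabled?(env = ENV)
  value = env['DD_DEBUG_HEALTH_METRICS_ENABLED']
  return true if value.nil? # variable unset: fall back to the default (enabled)

  !%w[0 false].include?(value.strip.downcase)
end
```

Passing the environment as an argument keeps the helper easy to test without mutating the real ENV.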

List of metrics added with this pull request:

datadog.tracer.api.errors
datadog.tracer.api.requests
datadog.tracer.api.responses
datadog.tracer.queue.accepted
datadog.tracer.queue.accepted_lengths
datadog.tracer.queue.accepted_size
datadog.tracer.queue.dropped
datadog.tracer.queue.length
datadog.tracer.queue.max_length
datadog.tracer.queue.size
datadog.tracer.queue.spans
datadog.tracer.traces.filtered # Added but not implemented
datadog.tracer.writer.cpu_time # Added but not implemented

delner · 2019-10-15T21:36:43Z

This pull request is still a work in progress: I need to add the actual instrumentation to the tracer code, plus tests to verify its behavior. I should also run this in a sandbox application to verify it works end-to-end.

marcotc

🎉 It looks very good!

brettlangdon

looks like a good start

lib/ddtrace/configuration/settings.rb

lib/ddtrace/debug/health.rb

delner · 2019-10-22T22:22:33Z

Implemented some queue metrics, too.

Right now it's pretty aggressive, and will send stats with each trace push. This might be too much ultimately, but I want to measure the upper bounds on what we can send, because if we can send for each push, we can get some nice distribution metrics about things like "what's the distribution of spans per trace."

If we can't, then we can apply sample rates to Statsd, or rework these metrics to only compute/submit metric aggregations on flush. That would significantly reduce the number of stat calls, but also the granularity at which the data can be explored; a trade-off we might have to make for the sake of scale.
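
To make the "aggregate on flush" alternative concrete, here is a minimal sketch. It is not the code in this PR or in the comparison branch: RecordingStatsd and AggregatingQueueStats are hypothetical names, the recording client stands in for a real Statsd client, and treating datadog.tracer.queue.spans as a running span count is an assumption.

```ruby
# Stand-in for a Statsd client: records calls instead of sending UDP packets.
class RecordingStatsd
  attr_reader :calls

  def initialize
    @calls = []
  end

  def count(name, value)
    @calls << [:count, name, value]
  end
end

# Hypothetical aggregator: accumulates totals on push, emits once per flush.
class AggregatingQueueStats
  def initialize(statsd)
    @statsd = statsd
    @mutex = Mutex.new
    reset!
  end

  # Called for every pushed trace (an array of spans); performs no stat calls.
  def record_push(trace)
    @mutex.synchronize do
      @accepted += 1
      @spans += trace.length
    end
  end

  # Called once per flush: one stat call per metric, however many pushes occurred.
  def flush
    accepted, spans = @mutex.synchronize do
      snapshot = [@accepted, @spans]
      reset!
      snapshot
    end
    @statsd.count('datadog.tracer.queue.accepted', accepted)
    @statsd.count('datadog.tracer.queue.spans', spans)
  end

  private

  def reset!
    @accepted = 0
    @spans = 0
  end
end
```

With this shape, the per-trace distribution (for example, spans per trace) is lost, because only the totals survive until flush. That is the granularity cost described above.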

Consider this a first iteration, to experiment with and explore the upper bounds of what could be done with stats, but with the caveat that it might have to be reworked slightly to scale it back.

delner · 2019-10-23T04:13:07Z

Also opened this additional branch, which implements a more conservative "aggregate at flush time" approach; it's on the other end of the scale in relation to this original implementation. It definitely has flaws of its own (e.g. queue.accepted_* and queue.* metrics are identical/redundant) but it is also much less verbose.

Feel free to compare the differences here and open it as a PR if necessary: https://github.com/DataDog/dd-trace-rb/compare/feature/debug_metrics...refactor/aggregated_queue_metrics?expand=1

marcotc · 2019-10-23T18:13:37Z

lib/ddtrace/ext/debug.rb

+          METRIC_QUEUE_ACCEPTED_LENGTHS = 'datadog.tracer.queue.accepted_lengths'.freeze
+          METRIC_QUEUE_ACCEPTED_SIZE = 'datadog.tracer.queue.accepted_size'.freeze


For these two sub-metrics of datadog.tracer.queue.accepted, would it make sense to use the metric name separator, like so: datadog.tracer.queue.accepted.lengths and datadog.tracer.queue.accepted.size?

Yeah, I think that could make sense; it's something we'd want to reconcile with our standards, though.

lib/ddtrace/buffer.rb

lib/ddtrace/configuration/settings.rb

marcotc · 2019-10-24T18:22:54Z

Pushed a rebase with latest master.

marcotc · 2019-10-28T16:15:46Z

lib/ddtrace/runtime/object_space.rb

+
+        # Rough calculation of bytesize; not very accurate.
+        object.instance_variables.inject(::ObjectSpace.memsize_of(object)) do |sum, var|
+          sum + ::ObjectSpace.memsize_of(object.instance_variable_get(var))


This seems like a good compromise between a very shallow estimate (only ::ObjectSpace.memsize_of(object)) and a full recursive memory measurement. 👍
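
For comparison, here is a standalone sketch of the one-level estimate from the diff excerpt above, next to the shallow measurement. ExampleSpan is a hypothetical class used only for illustration, and the method is written as a top-level function rather than as part of the library's Runtime::ObjectSpace module.

```ruby
require 'objspace'

# One-level estimate: the object's own size plus the size of each object
# referenced directly by its instance variables. Deeper references are ignored,
# so this remains a rough figure.
def estimate_bytesize(object)
  object.instance_variables.inject(ObjectSpace.memsize_of(object)) do |sum, var|
    sum + ObjectSpace.memsize_of(object.instance_variable_get(var))
  end
end

# Hypothetical object with two instance variables.
class ExampleSpan
  def initialize(name, tags)
    @name = name
    @tags = tags
  end
end

span = ExampleSpan.new('web.request' * 10, { 'http.method' => 'GET' })
shallow = ObjectSpace.memsize_of(span) # the object itself only
one_level = estimate_bytesize(span)    # also counts @name and @tags
```

The exact byte counts depend on the Ruby version and platform, so they are best read as relative indicators rather than precise sizes.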

delner added the core, do-not-merge/WIP, and feature labels Oct 15, 2019

delner requested review from marcotc and brettlangdon October 15, 2019 21:30

delner self-assigned this Oct 15, 2019

delner force-pushed the feature/debug_metrics branch from 0612cb6 to 937c45d October 15, 2019 21:35

marcotc reviewed Oct 15, 2019

brettlangdon reviewed Oct 17, 2019

lib/ddtrace/configuration/settings.rb (outdated)

lib/ddtrace/debug/health.rb (outdated)

delner force-pushed the feature/debug_metrics branch 3 times, most recently from 4c7c636 to f56a39c October 22, 2019 22:11

delner marked this pull request as ready for review October 22, 2019 22:13

delner requested a review from a team October 22, 2019 22:13

marcotc reviewed Oct 23, 2019

lib/ddtrace/buffer.rb

marcotc reviewed Oct 23, 2019

lib/ddtrace/configuration/settings.rb

marcotc force-pushed the feature/debug_metrics branch 2 times, most recently from f56a39c to 3d7c7ff October 24, 2019 18:21

marcotc reviewed Oct 28, 2019

delner force-pushed the feature/debug_metrics branch from 1e5b2ce to ba75081 October 28, 2019 17:21

delner added 5 commits October 31, 2019 18:17

Added: Helpers to Datadog::Metrics

d7edbe6

Added: #send_metrics to Metrics.

354daa6

Added: #count to Datadog::Metrics

01a4153

Added: Status code to HTTP::Response

af86446

Added: Debug::Health::Metrics class

58afa31

delner added 16 commits October 31, 2019 18:17

Added: Debug health metrics configuration setting

1107e4e

Added: Health metrics to Transport statistics.

99695a3

Added: HTTP statistics and additional metric tags

142fffe

Refactored: HTTP Client to use HTTP statistics

576a1ad

Changed: Health metrics from distribution to count

df7355d

Refactored: TraceBuffer minitest to RSpec

306267f

Added: Health metrics to TraceBuffer

d4404fe

Added: Lazy evaluation for Datadog::Metrics.

1e26906

Added: Runtime::ObjectSpace#estimate_bytesize

fe56442

Added: HealthMetricHelpers to RSpec

e6d91e3

Changed: Buffer queue size metric to lazy evaluation.

86c8aec

Refactored: Buffer spec to use traces.

68c45fb

Added: Datadog::Metrics logging adapter.

ae76661

Changed: Trace size estimate for health metric.

0ec721e

Changed: Buffer to send accept event in addition to drop.

f293bae

Changed: Buffer to measure queue max length on push/pop.

84bdb44

delner force-pushed the feature/debug_metrics branch from 95bef86 to 84bdb44 October 31, 2019 22:18

delner mentioned this pull request Nov 19, 2019

Aggregated tracer health metrics #859

Merged

delner merged commit 84bdb44 into master Nov 20, 2019

delner deleted the feature/debug_metrics branch November 20, 2019 21:48

delner added this to the 0.29.0 milestone Nov 20, 2019

delner added this to Released in Active work Nov 21, 2019

marcotc removed the do-not-merge/WIP label Dec 6, 2019
