
Single Span Sampling #2128

Merged: 51 commits merged into master from feature-single-span-sampling on Sep 6, 2022

Conversation

Member

@marcotc marcotc commented Jul 5, 2022

This PR adds support for single span sampling to the tracer.

Single Span Sampling allows you to configure sampling rules that keep individual spans even when their respective traces are dropped by trace-level sampling.

It is configured through the documented environment variables: DD_SPAN_SAMPLING_RULES, DD_SPAN_SAMPLING_RULES_FILE.
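
As an example (illustrative values only; the field names follow the documented span sampling rule format, while the service and operation names are placeholders), a single rule could look like:

DD_SPAN_SAMPLING_RULES='[{"service": "my-service", "name": "http.request", "sample_rate": 1.0, "max_per_second": 50}]'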

All changes in this feature branch have been individually reviewed.

@marcotc marcotc added the feature Involves a product feature label Jul 5, 2022
@marcotc marcotc self-assigned this Jul 5, 2022
@marcotc marcotc marked this pull request as ready for review July 29, 2022 21:09
@marcotc marcotc requested a review from a team July 29, 2022 21:09
Contributor

@delner delner left a comment

Overall looking good. Some questions.

Only hard ask here is to do some performance testing. I'm particularly interested in:

100+ spans per trace, all traces dropped, rule configured to keep one span

vs

100+ spans per trace, all traces kept, no rules evaluated

This should give us the relative cost increase in a worst case.
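
Something along these lines would work as a rough skeleton for that comparison (a sketch only, using benchmark-ips; build_trace and flush_trace are stand-ins for the tracer's real trace construction and flush path, not part of this PR):

require 'benchmark/ips'

# Stand-in helpers: the real benchmark should build and flush traces through
# the tracer itself; these exist only to make the skeleton runnable.
def build_trace(span_count, keep:)
  { spans: Array.new(span_count) { |i| "span.#{i}" }, keep: keep }
end

def flush_trace(trace)
  trace[:spans].each { |_span| } # sampling + flush work would happen here
end

Benchmark.ips do |x|
  x.report('100 spans, trace dropped, rule keeps one span') do
    flush_trace(build_trace(100, keep: false))
  end

  x.report('100 spans, trace kept, no rules configured') do
    flush_trace(build_trace(100, keep: true))
  end

  x.compare!
end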

@@ -75,6 +77,7 @@ def build_tracer(settings, agent_settings)
enabled: settings.tracing.enabled,
trace_flush: trace_flush,
sampler: sampler,
span_sampler: build_span_sampler(settings),
Contributor

Don't love that span_sampler is its own component within the tracer, but it seems better to have smaller, simpler responsibilities than to have one sampler that's complex.

# These rules allow a span to be kept when its encompassing trace is dropped.
#
# The syntax for single span sampling rules can be found here:
# TODO: <Single Span Sampling documentation URL here>
Contributor

Good call out: let's update this when we can.

  def get_trace(trace_op)
-   trace_op.flush!
+   trace_op.flush! do |spans|
+     spans.select! { |span| single_sampled?(span) } unless trace_op.sampled?
Contributor

Seems like this will add some cost: having to iterate over each span in each dropped trace. Off the top of my head, I'm not sure how we avoid this, but it's worth noting with regard to possible performance impact.

Fine for now; let's just measure the performance of this before merging, if possible.

Member Author

Benchmarks show it's not measurably slower unless single span sampling actually matches a span, which triggers 3 set_tag operations; those are a bit slow.

Benchmarks have been added, and the results are attached to this PR as a comment.
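
For reference, the matching path boils down to something like the sketch below. The tag names follow the span sampling conventions used across Datadog tracers; the class and accessor names here (SpanSampler, rule.sample_rate, rule.rate_limit) are illustrative rather than the exact API in this PR:

class SpanSampler # illustrative name, not necessarily the class in this PR
  def initialize(rules)
    @rules = rules
  end

  # Find the first matching rule; if it samples the span, tag it as kept.
  def sample!(span)
    rule = @rules.find { |r| r.match?(span) }
    return false unless rule && rule.sample?(span)

    # The "3 set_tag operations" mentioned above (mechanism 8 = single span sampling).
    span.set_tag('_dd.span_sampling.mechanism', 8)
    span.set_tag('_dd.span_sampling.rule_rate', rule.sample_rate)
    span.set_tag('_dd.span_sampling.max_per_second', rule.rate_limit)
    true
  end
end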

@@ -20,6 +20,9 @@ class RateSampler < Sampler
# * +sample_rate+: the sample rate as a {Float} between 0.0 and 1.0. 0.0
# means that no trace will be sampled; 1.0 means that all traces will be
# sampled.
#
# DEV-2.0: Allow a `sample_rate` of zero (drop all). This eases
# DEV-2.0: usage for many consumers of the {RateSampler} class.
Contributor

Interesting. Can you clarify?

Member Author

I expanded the comment in the PR, but here's the gist:
All internal users of RateSampler (RuleSampler and now Single Span Sampling) want sample_rate == 0 to mean "drop all", but they can't do that because of the validation that happens in the RateSampler initializer.

The way they get around it is to not set the value in the initializer, but call:

sampler = RateSampler.new
sampler.sample_rate = sample_rate # There's no validation here

This bypasses the validation. Ideally, the RateSampler would respect any rate between 0.0 and 1.0.
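
For context, the shape of the workaround is roughly this (a simplified sketch of the behaviour described above, not a verbatim copy of the class):

class RateSampler
  attr_reader :sample_rate

  # The initializer rejects rates outside (0.0, 1.0] and falls back to 1.0,
  # so a sample_rate of 0.0 ("drop all") cannot be passed here.
  def initialize(sample_rate = 1.0)
    sample_rate = 1.0 unless sample_rate > 0.0 && sample_rate <= 1.0
    self.sample_rate = sample_rate
  end

  # The attribute writer performs no validation, which is the loophole
  # RuleSampler and Single Span Sampling rely on.
  def sample_rate=(rate)
    @sample_rate = rate
  end
end

sampler = RateSampler.new
sampler.sample_rate = 0.0 # accepted: the writer skips validation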

@@ -54,6 +56,13 @@ def match?(span)
end
end

def ==(other)
Contributor

Is a span really the same as another span if their service and name are the same? For sampling purposes?

Can you explain the logic behind this a little more?

Member Author

@marcotc marcotc Aug 9, 2022

This equality method refers to the Matcher class, not to spans.

If two matchers have exactly the same instance variables, which is the only state they can hold, they are the same.
Currently, name and service are the only instance variables a Matcher has, so two matchers with the same name and service are effectively the same.
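
In other words, something to this effect (a simplified sketch of the matcher, not the exact code in this PR):

class Matcher
  attr_reader :name, :service

  def initialize(name:, service:)
    @name = name
    @service = service
  end

  # Matchers are value objects: equal when their entire state
  # (currently only name and service) is equal.
  def ==(other)
    other.is_a?(Matcher) && name == other.name && service == other.service
  end
end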

Member Author

marcotc commented Aug 8, 2022

Here are the benchmark results.

Memory usage does not change significantly. The following benchmarks measure execution time.

The numbers 1:, 10:, and 100: are the trace size under test, in number of spans.

TraceOperation is kept by trace-level sampling:

1. And no single span sampling is configured (baseline):

This code path does not consult the single span sampler.

                   1:    17532.0 i/s
                  10:     3640.9 i/s - 4.82x  (± 0.00) slower
                 100:      443.4 i/s - 39.54x  (± 0.00) slower

TraceOperation is rejected by trace-level sampling:

2. And no single span sampling is configured:

This code path does not consult the single span sampler.

                   1:    20786.5 i/s
                  10:     4217.3 i/s - 4.93x  (± 0.00) slower
                 100:      474.6 i/s - 43.80x  (± 0.00) slower

3. Single span sampling is configured and all spans are rejected:

The difference between this benchmark and the previous one is the cost to consult single span rules.

                   1:    19565.4 i/s
                  10:     3926.7 i/s - 4.98x  (± 0.00) slower
                 100:      436.3 i/s - 44.85x  (± 0.00) slower

4. Single span sampling is configured and all spans are kept:

One side effect of single span sampling is that 3 tags are added to each span that is successfully single-sampled, so more overhead is expected.

                   1:    15104.6 i/s
                  10:     3080.9 i/s - 4.90x  (± 0.00) slower
                 100:      365.4 i/s - 41.34x  (± 0.00) slower

Conclusions

The only code path with meaningful performance impact is 4, which can be attributed to the extra tags added to each single-sampled span, as well as the time it takes to match each span in the trace against the configured rules.

The rules by themselves are not very expensive: the difference between 3 and 2 is effectively the cost to consult single span rules for all spans in a trace.
In fact, the performance of the baseline (1) and 3 is very closely matched: this means that "keeping all spans" is just as expensive as "dropping the trace plus consulting single span sampling rules".

Maybe not surprising, but dropped traces (2) are cheaper than sampled traces (1), likely because the PrioritySampler and RuleSampler don't have to be consulted. Neither 1 nor 2 has Single Span Sampling configured, so this is trace-level sampling overhead that existed beforehand.

Contributor

@delner delner left a comment

Performance looks acceptable for a worst case. Tests are passing (minus one annoying macOS test), so I think this should be good to merge as soon as it's rebased.

Nice work!

@marcotc marcotc merged commit eaa1469 into master Sep 6, 2022
@marcotc marcotc deleted the feature-single-span-sampling branch September 6, 2022 18:46
@github-actions github-actions bot added this to the 1.5.0 milestone Sep 6, 2022