Rule sampler #854

Merged: 32 commits merged into master from feat/new-sampler, Nov 29, 2019

Conversation

@marcotc marcotc (Member) commented Nov 7, 2019

This PR adds support for rule-based sampling of traces.

In its most complete form, here's an example configuration:

Datadog.configure do |c|
  c.tracer sampler: Datadog::PrioritySampler.new(
    post_sampler: Datadog::Sampling::RuleSampler.new(
      [
        Datadog::Sampling::SimpleRule.new(name: 'operation.name', sample_rate: 0.9),
        Datadog::Sampling::SimpleRule.new(service: 'service-1', sample_rate: 0.5),
        Datadog::Sampling::SimpleRule.new(service: /service-.*/, name: 'db.select', sample_rate: 0.7),
      ],
      default_sampler: Datadog::RateSampler.new(1.0),
      rate_limiter: Datadog::Sampling::TokenBucket.new(1000),
    )
  )
end

Or for 100% sampling with 100 traces/sec limit:

Datadog.configure do |c|
  c.tracer sampler: Datadog::PrioritySampler.new(
    post_sampler: Datadog::Sampling::RuleSampler.new
  )
end

@marcotc marcotc added the core (Involves Datadog core libraries) and feature (Involves a product feature) labels Nov 7, 2019
@marcotc marcotc self-assigned this Nov 7, 2019
@delner delner added this to In progress in Active work Nov 21, 2019
class SimpleMatcher < Matcher
  # Returns `true` for case equality (===) with any object
  MATCH_ALL = Class.new do
    # DEV: A class that implements `#===` is ~20% faster than
Member:
have a link? this is neat.

@marcotc marcotc (Member Author), Nov 25, 2019:
I tested it locally while developing this class; I can upload a gist with a benchmark snippet.

Contributor:
Interesting idea; I could benchmark it and post results.
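
For reference, a minimal benchmark sketch of the idea discussed here, assuming the comparison is against a proc-based matcher (names and numbers are illustrative, not taken from this PR):

require 'benchmark'

# An object that matches anything via case equality (#===),
# similar in spirit to the MATCH_ALL constant above.
match_all_class = Class.new do
  def ===(_other)
    true
  end
end.new

# A proc also responds to #===, invoking itself with the argument.
match_all_proc = proc { |_other| true }

n = 1_000_000
Benchmark.bm(12) do |x|
  x.report('class #===') { n.times { match_all_class === 'operation.name' } }
  x.report('proc #===')  { n.times { match_all_proc === 'operation.name' } }
end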

lib/ddtrace/sampling/rule.rb (resolved conversation, outdated)

attr_reader :rules, :rate_limiter, :priority_sampler

def initialize(rules, rate_limiter, priority_sampler = Datadog::RateByServiceSampler.new)
Member:
we would need to make sure this priority sampler gets its rates updated from responses from the agent.

@marcotc marcotc (Member Author), Nov 25, 2019:
That's taken care of by delegating :update to that sampler, like so, further down this class:

def_delegators :@priority_sampler, :update

This is similar to how Datadog::PrioritySampler accomplishes the same goal.
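
For illustration, a minimal sketch of that delegation pattern, assuming Forwardable (simplified, not the full class from this PR):

require 'forwardable'

class RuleSampler
  extend Forwardable

  def initialize(rules, rate_limiter, priority_sampler)
    @rules = rules
    @rate_limiter = rate_limiter
    @priority_sampler = priority_sampler
  end

  # Agent-provided rate updates are forwarded to the underlying
  # priority sampler, which knows how to apply them.
  def_delegators :@priority_sampler, :update
end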

lib/ddtrace/sampling/rule_sampler.rb (resolved conversation, outdated)
draft.rb (outdated):
Datadog::PrioritySampler.new(
  post_sampler: Datadog::RateByServiceSampler.new(
    1.0,
    env: proc { Datadog.tracer.tags[:env] } # TODO: how do I provide `tracer.tags`? Seems like a circular reference here.
@marcotc marcotc (Member Author):
Initializing a fallback PrioritySampler is not trivial.
Maybe we could use an interface in our tracer that allows us to append or prepend a sampler to a sampling chain, so that we could prepend our RuleSampler before the existing PrioritySampler, and avoid having all this complex initialization information outside of the tracer.
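
Something like the following hypothetical interface is the idea being floated; prepend_sampler does not exist in the tracer today, it is purely a sketch:

# Hypothetical sketch only: this API does not exist in ddtrace.
class Tracer
  def initialize(sampler: Datadog::PrioritySampler.new)
    @sampler_chain = [sampler]
  end

  # Prepend a sampler so it runs before the existing chain,
  # keeping the complex fallback initialization inside the tracer.
  def prepend_sampler(sampler)
    @sampler_chain.unshift(sampler)
  end
end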


# TODO: This class name is so bad, yet so good.
# [Class documentation]
class UnlimitedLimiter < RateLimiter
@delner delner (Contributor), Nov 25, 2019:
Haha, great name. If you want a different name my suggestions are just as quirky: NonLimiter, NoLimiter, NeverLimiter. But honestly, UnlimitedLimiter works just fine.

I might suggest though that this behavior for a limiter shouldn't be a Limiter but a composable module for a Limiter (e.g. RateLimiter.new.extend(UnlimitedRate)), because an unlimited limiter isn't a distinct species of limiter, just one that behaves a particular way.

In the same line of thinking, your default initializer for RuleSampler could be rate_limiter = default_rate_limiter, then you could define:

def default_rate_limiter
  Datadog::Sampling::RateLimiter.new.tap do |limiter|
    limiter.extend(Datadog::Sampling::RateLimiter::UnlimitedRate)
  end
end

Maybe there's a more elegant way of expressing the same thing, but just a general thought.

@marcotc marcotc (Member Author):
I do see the point you are making, especially if we had something like a DebugRateLimiter that prints all operations to stdout: we would be able to add this functionality to an existing rate limiter as a mixin.

But for the case of UnlimitedLimiter, I think that being a subclass of RateLimiter makes more sense, as it pretty much overrides all behaviour of RateLimiter and replaces it with its own. I see it less as a composable feature-set and more as a whole type of limiter in itself. When overlaying it with the token bucket implementation, for example, it removes all token bucket functionality and replaces it with its own implementation.
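
A minimal sketch of the subclass approach being argued for here (the method names below are assumptions for illustration, not necessarily the ones in this PR):

class RateLimiter
  # Checks whether a message of the given size conforms to the rate limit.
  def allow?(size)
    raise NotImplementedError
  end

  # The effective rate at which messages are currently allowed through.
  def effective_rate
    raise NotImplementedError
  end
end

# A limiter that never limits: every message is allowed and the
# effective rate is always 100%.
class UnlimitedLimiter < RateLimiter
  def allow?(_size)
    true
  end

  def effective_rate
    1.0
  end
end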

Contributor:
Alright, it's probably fine using subclasses here, too. Just wanted to consider our options, but I'm cool with this.

private

def sample_span(span)
  sampled, sample_rate = @rules.find do |rule|
Contributor:
This loop is a little awkward because you have to find and retrieve two different values from the rule. It may be possible to eliminate Rule#sample entirely, and simplify things. If we were to change Rule to delegate to Matcher#match?, Sampler#sample?, Sampler#sample_rate, then you could rewrite this function as:

def sample_span(span)
  rule = @rules.find do |rule|
    rule.matches?(span)
  end

  return yield(span) unless rule

  sampled = rule.sample?(span)
  sample_rate = rule.sample_rate(span)
  [sampled && rate_limiter.allow?(1), sample_rate]
end

As a general rule of thumb, it might be better for components like Rule to have very simple methods that do as little as possible and have more complex, compositional objects such as this RuleSampler orchestrate and drive those smaller components.

There might be even better ways of doing this than what I've suggested, but just some food for thought.

@marcotc marcotc (Member Author):
There is one trade-off with this approach which is that custom rules might want to provide a coupled result between the sampling decision (sample?) and sample_rate. In this case we are making it harder to implement such rules:

class CustomRule < Rule
  def sample?(span)
    case span.service
    when 'service-1'
      span.name != 'ignore.op'
    when 'service-2'
      span.start_time.hour > 7
    end
  end

  def sample_rate(span)
    case span.service
    when 'service-1'
      0.8
    when 'service-2'
      0.9
    end
  end
end

Instead of returning a "response" payload with both together:

Rule.new do |span|
  case span.service
  when 'service-1'
    [span.name != 'ignore.op', 0.8]
  when 'service-2'
    [span.start_time.hour > 7, 0.9]
  end
end

After reviewing this, it seems like an ergonomic decision that we wouldn't prioritize to the detriment of the maintainability of RuleSampler. Thus I made changes that mostly follow your suggestion.

lib/ddtrace/sampler.rb (resolved conversation, outdated)
lib/ddtrace/sampler.rb (resolved conversation, outdated)
lib/ddtrace/sampling/rule_sampler.rb (resolved conversation, outdated)
lib/ddtrace/sampling/rule_sampler.rb (resolved conversation)
sampled = sample_span(span) { |s| @fallback_sampler.sample!(s) }

sampled.tap do
  span.sampled = sampled
Member:
for here, we always want span.sampled = true; we want to do:

span.sampled = true
span.context.sampling_priority = sampled ? Datadog::Ext::Priority::AUTO_KEEP : Datadog::Ext::Priority::AUTO_REJECT

@marcotc marcotc (Member Author):
Our samplers can be used as stand-alone samplers, as well as composed together.
When using RuleSampler as a stand-alone sampler, it will make hard decisions.

When using it alongside the PrioritySampler, the sampling_priority and final sampling decision are taken care of by the priority sampler, which will enforce the conditions you mentioned.

PrioritySampler is currently the only sampler in our code base that is aware of sampling_priority. All other samplers only make a boolean decision and are not concerned with details around what flags to set in the span.
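
As a concrete illustration of the composed usage described here, mirroring the configuration example at the top of this PR:

# RuleSampler makes the boolean decision; PrioritySampler translates it
# into a sampling_priority on the trace context.
Datadog.configure do |c|
  c.tracer sampler: Datadog::PrioritySampler.new(
    post_sampler: Datadog::Sampling::RuleSampler.new(
      [Datadog::Sampling::SimpleRule.new(service: 'service-1', sample_rate: 0.5)]
    )
  )
end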

end
end

def_delegators :@fallback_sampler, :update
Member:
we can no-op :update; we don't need to consider agent-returned sample rates at all here.

@marcotc marcotc (Member Author):
This is effectively a no-op, as most sampler implementations we delegate this call to are no-ops.

I think we should keep this delegation, as we currently support RateByServiceSampler in our app. We could remove it if we drop support for that sampler in the future.

lib/ddtrace/sampling/token_bucket.rb (resolved conversation, outdated)
-def priority_sample(span)
-  @priority_sampler.sample?(span)
+def priority_sample!(span)
+  @priority_sampler.sample!(span)
@marcotc marcotc (Member Author), Nov 26, 2019:
We change this to use sample! instead of sample? to allow @priority_sampler to add tags to the span.
In our case, we want to set '_dd.rule_psr' and '_dd.limit_psr' when sampling happens.

Contributor:
Gotcha, makes sense.

@marcotc marcotc changed the title from "[WIP] New sampler" to "Rule sampler" Nov 26, 2019
@marcotc marcotc marked this pull request as ready for review November 27, 2019 00:08
@marcotc marcotc requested a review from a team November 27, 2019 00:08
@marcotc marcotc requested a review from delner November 27, 2019 17:06
  span.set_metric(SAMPLE_RATE_METRIC_KEY, pre_sample_rate_metric) # Restore true sampling metric
else
  span.clear_metric(SAMPLE_RATE_METRIC_KEY)
end
@marcotc marcotc (Member Author):
This class is growing in complexity in order to support the RuleSampler as a post_sampler.
It seems nice to keep all priority sampling concerns here, and not have any other sampler worry about it, but I can see an argument for moving away from having the current PrioritySampler try to do it all.

I'm open to suggestions for different approaches that could simplify this.

The current implementation in this PR works, but trying to maintain compatibility with all existing samplers, alongside the new rule sampler, has created an arms-race for the PrioritySampler class. I believe this implementation will be much simpler if we move away from having an agent one day.

Contributor:
In the interest of time, we might want to keep what we have here if it's working, and refactor in the future. I would just make sure you have sufficient tests to verify all these edge cases and conditions introduced by this complexity such that when we do refactor later, we won't introduce any bugs to the sampling logic. If we can do that, then I think it's okay to let this be.

delner previously approved these changes Nov 28, 2019

@delner delner (Contributor) left a comment:
Looks good @marcotc! The logic for all these layered sampling rules is a bit complicated, but I think it was expressed fairly cleanly and minimally in your implementation given what the rules are.

My only other suggestion is making sure we have enough tests that verify all the different scenarios/rules we can reasonably anticipate, such that if we do need to modify this code again in the near future, it is easier to refactor and less likely to introduce any bugs/incorrect behavior. Some basic integration/feature tests might be good to that end.

This is probably the most important point, but I leave it to your judgment if we've met that bar here. Once you feel like you have, feel free to merge this one. Nice job overall!

lib/ddtrace/sampling/matcher.rb (resolved conversation, outdated)

attr_reader :matcher, :sampler

# @param [Matcher] matcher A matcher to verify span conformity against
Contributor:
I'm loving these comments you added to these functions! 💯


@delner delner moved this from In progress to In review in Active work Nov 28, 2019
@marcotc marcotc (Member Author) commented Nov 28, 2019:
@delner I added integration tests to this PR recently; they are under spec/ddtrace/integration_spec.rb.
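
For a sense of shape only, a rough sketch of what such an integration test could look like (hypothetical example, not the actual contents of integration_spec.rb):

require 'spec_helper'

RSpec.describe 'rule-based sampling' do
  it 'keeps traces that match a rule with a 100% sample rate' do
    Datadog.configure do |c|
      c.tracer sampler: Datadog::PrioritySampler.new(
        post_sampler: Datadog::Sampling::RuleSampler.new(
          [Datadog::Sampling::SimpleRule.new(name: 'test.op', sample_rate: 1.0)]
        )
      )
    end

    span = Datadog.tracer.trace('test.op')
    span.finish

    expect(span.context.sampling_priority).to eq(Datadog::Ext::Priority::AUTO_KEEP)
  end
end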

@marcotc marcotc (Member Author) commented Nov 28, 2019:
@delner, I added test coverage for the sampler.rb changes. Everything introduced in this PR should be covered now.

I added a few extra tests to cover Sampler#sampling_rate that were missing for existing functionality.

@palazzem palazzem (Contributor) left a comment:
Great job! Thank you also for describing the API. I think this is a good start. If we need to improve the API to make it simpler (in the future), we have all the building blocks to hide the complexity.

Thank you very much @delner @marcotc @brettlangdon

@marcotc marcotc merged commit 94b47a6 into master Nov 29, 2019
Active work automation moved this from In review to Merged & awaiting release Nov 29, 2019
@marcotc marcotc deleted the feat/new-sampler branch November 29, 2019 15:47
elsif rate_limit
  Datadog::Sampling::TokenBucket.new(rate_limit)
else
  Datadog::Sampling::TokenBucket.new(100)
Member:
The default here should be no rate limiter.

span.sampled = true
if pre_sample_rate_metric
  # Restore true sampling metric, as only the @pre_sampler can reject traces
  span.set_metric(SAMPLE_RATE_METRIC_KEY, pre_sample_rate_metric)
Member:
This shouldn't be unset by any sampler, right?

Member:
It also shouldn't be set by any other sampler either

Member:
This is only needed for samplers which cause traces to not be sent to the agent (span.sampled = false).

@marcotc marcotc (Member Author):
The default "post_sampler", RateByServiceSampler:

@priority_sampler = opts[:post_sampler] || RateByServiceSampler.new

ultimately delegates the sampling decision to an instance of RateSampler. That sampler sets this metric whenever it successfully samples:
(span.sampled = sample?(span)).tap do |sampled|
  span.set_metric(SAMPLE_RATE_METRIC_KEY, @sample_rate) if sampled
end

In this case we restore it to the true sampling metric value, stored under pre_sample_rate_metric.
No sampler today unsets this metric, but the code is in place for completeness.

Backstory

Ultimately, the current contract interface we have for samplers (sample? for no side-effects, and sample! for true sampling decision with side-effects) does not correctly allow the samplers to be used as both:

  1. Top-level sampler: which ultimately makes the sampling decision, possibly setting metrics and tags.
  2. Chained sampler: provide a decision, but without side-effects. One example is the "post_sampler" in PrioritySampler above.

We have cases where we want "some" side-effects, like in the RuleSampler when used in a sampler chain: it needs to set rule-specific metrics (e.g. rule sampling ratio metric), but not actually set the sampling decision. It's a "middle-ground" between the sample! and sample? of today.
I chose to go with an implementation that reduces changes to the existing samplers because:

  • I did not have a clear vision on how to have a clean contract to represent all these requirements I mentioned above in a concise package.
  • I did not want to perform a major refactoring on the existing samplers, introducing risk to users of the existing samplers.
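
For illustration, the sample?/sample! contract described above as a minimal sketch (the metric key and rate here are placeholders, not values from this PR):

class ExampleSampler
  # sample?: pure decision, no side effects on the span.
  def sample?(span)
    rand < 0.5
  end

  # sample!: makes the decision and applies side effects
  # (flags, metrics) to the span.
  def sample!(span)
    sampled = sample?(span)
    span.sampled = sampled
    span.set_metric('_dd.example_psr', 0.5) if sampled
    sampled
  end
end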

@marcotc marcotc added this to the 0.30.0 milestone Dec 4, 2019
@delner delner moved this from Merged & awaiting release to Released in Active work Mar 5, 2020