
dnsdist: Add prometheus metrics for top Dynamic Blocks entries #9756

Merged (7 commits), Dec 3, 2020

Conversation

@rgacogne (Member) commented Nov 20, 2020

Short description

This PR is a work in progress, but the idea is to export the top 20 entries for each reason, for both SMT and NMG dynamic blocks. Since scanning every single entry might take a long time, these metrics are cached for 60s, which is currently hardcoded but could easily be made configurable.

The current metrics look like this:

# HELP dnsdist_dynblocks_nmg_top_offenders Top offenders blocked by Dynamic Blocks (netmasks)
# TYPE dnsdist_dynblocks_nmg_top_offenders gauge
dnsdist_dynblocks_nmg_top_offenders{reason="Exceeded query rate",netmask="127.0.0.1/32"} 6
# HELP dnsdist_dynblocks_smt_top_offenders Top offenders blocked by Dynamic Blocks (suffixes)
# TYPE dnsdist_dynblocks_smt_top_offenders gauge
dnsdist_dynblocks_smt_top_offenders{reason="Pseudo-Random Subdomain Attack",suffix="prsd.powerdns.com."} 0

@SrX @wojas I'm pretty sure I'm holding the prometheus metrics wrong, feedback welcome :-)

We could also expose them via the API. It currently allows requesting all dynamic block entries, which is very slow, so exposing only the top N would likely make sense there as well.

Checklist

I have:

  • read the CONTRIBUTING.md document
  • compiled this code
  • tested this code
  • included documentation (including possible behaviour changes)
  • documented the code
  • added or modified regression test(s)
  • added or modified unit test(s)
  • checked that this code was merged to master

@wojas (Member) commented Nov 23, 2020

These are very useful metrics!

Let me see if I understand why you picked gauge as the metric type instead of counter. A counter would in general be a better match, because it allows you to decide on the Prometheus side how often you want to poll this information, and derive the gauge as required. In this case the gauge solves two issues with that approach:

  1. For performance and simplicity reasons you want to derive these metrics at fixed intervals instead of on-demand.
  2. Using a gauge prevents us from having to keep track of historical counters; we just need to consider the current top.

One concern with metrics that contain a remote IP address is the cardinality of the dataset, but you limited this by a) only returning the top N entries and b) only doing so once per minute (or some other time period).

The fact that you mention that you cache the values for 60s instead of collecting and resetting them every 60s confuses me. Caching would only work if this is a counter and not a gauge and you never reset them. But in this case new threats will take a while to reach the top offenders list. I think that it is better to reset every interval and let the metrics collection system do any necessary aggregation.

As for the naming, it's better to make the name clearer by adding the unit: dnsdist_dynblocks_nmg_top_offenders -> dnsdist_dynblocks_nmg_top_offender_hits_per_second. I intentionally picked 'per second' here, so that the interpretation of the metric does not depend on any configured interval. Divide the number of hits by the seconds in your interval, 60 if collected every minute. When displaying the metric in a dashboard you can always scale this to per minute values if desired.
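
As a tiny illustration of that conversion (a hypothetical helper for this sketch, not part of this PR):

    def hits_per_second(hits_in_interval: int, interval_seconds: float = 60.0) -> float:
        # e.g. 300 hits observed over a 60s collection interval are exported as 5.0 hits/s
        return hits_in_interval / interval_seconds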

It could be useful to make the collection interval and the maximum number of entries configurable, and perhaps to allow a configurable threshold under which entries are not reported at all, to prevent filling the metrics database with rare events from random IPs.

@rgacogne (Member, Author) commented

Thanks for the feedback!

The fact that you mention that you cache the values for 60s instead of collecting and resetting them every 60s confuses me. Caching would only work if this is a counter and not a gauge and you never reset them. But in this case new threats will take a while to reach the top offenders list. I think that it is better to reset every interval and let the metrics collection system do any necessary aggregation.

You are right, my current approach does not work well. I have two issues that arise from the way our dynamic blocks work. The first one is that we might easily have hundreds of thousands of entries in there, but that's OK as long as we only scan them at a reasonable interval. The second is that the entries are short-lived: if an entry is not renewed before it expires it gets deleted, and if we insert that same entry again later its counter starts from zero again. I can't reset the counters when collecting the metrics since I need them to stay consistent in the CLI and the API.
I can derive the difference between two polls by subtracting the previous value from the current one, but I see two issues with that:

  • the counter might have started from zero again. I can detect that when the new value is lower than the previous one, but if it has already grown past the previous value I have no way to detect it;
  • if an entry was not in the top N during the last poll I no longer have its previous value, so I only have the current, absolute value, which is wrong.

I'm not sure how to progress from there. I guess I could keep a full snapshot of the previous entries cached, to be able to get the previous values for the current top N. It will use quite some memory but I don't see right away how to avoid that. And perhaps add some tag to each entry to be able to detect that an entry has been deleted and then added again.
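
For what it's worth, a rough sketch of that snapshot idea (names and structure are hypothetical, this is not the actual dnsdist code): keep the previous full scan in memory, compute per-entry deltas, and treat a decrease as a counter that restarted from zero:

    # Hypothetical sketch of the "keep a snapshot of the previous scan" idea.
    # `current` and `previous` map an entry key (netmask or suffix) to the
    # cumulative hit counter read from the dynamic block structures.
    def compute_deltas(current: dict[str, int], previous: dict[str, int]) -> dict[str, int]:
        deltas = {}
        for key, value in current.items():
            prev = previous.get(key, 0)
            if value >= prev:
                # Normal case: the counter grew (or the entry is new) since the last scan.
                deltas[key] = value - prev
            else:
                # The counter is lower than before: the entry expired and was
                # re-inserted, so it restarted from zero. We only count the hits
                # seen since the re-insertion; anything in between is lost.
                deltas[key] = value
        return deltas

    def top_n(deltas: dict[str, int], n: int = 20) -> list:
        # Sort by the number of hits in the last interval and keep the top N.
        return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:n]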

@wojas (Member) commented Nov 23, 2020

That makes it a bit tricky indeed...

If the operation is expensive, it does not really matter if it is run once per minute or once every 10s. The main concern is that it either a) locks resources causing latency to spike or b) steals available CPU resources from processing packets. Is locking an issue here? If the issue is CPU, is it possible to throttle the number of entries we process?

You mention that the entries are short-lived. Is there a minimum period they are guaranteed to be kept around since the last hit? If, for example, they are guaranteed to be kept around for at least one minute since the last hit, we could collect these statistics every minute and store the last collected counter value for every entry. This way we always know the number of hits in the last minute, and we do not need to worry that data disappears before collection.

If this period is shorter than one minute, we could collect these statistics at shorter intervals, for example every 10s. To prevent missing spikes, we could keep 6 periods of data and sum these to produce a rolling 1 minute window of average per-second hit rates. This allows gathering these metrics once per minute without missing periodic spikes.

Note that if you want to aggregate top N results, you need to oversample when collecting the data, i.e. keep the top 5N for each time bucket before aggregation and then cap at N.
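
A minimal sketch of that rolling window, under the same assumptions (hypothetical names, not the actual dnsdist code): every 10s, keep the top 5N per-entry deltas for that bucket, retain the last 6 buckets, and sum them to publish a 1-minute top N as per-second rates:

    from collections import Counter, deque

    N = 20           # entries exposed to Prometheus
    OVERSAMPLE = 5   # keep 5N entries per bucket so aggregation does not miss any
    BUCKETS = 6      # 6 x 10s buckets = a rolling 1-minute window
    WINDOW_SECONDS = 60.0

    buckets: deque = deque(maxlen=BUCKETS)  # oldest bucket is dropped automatically

    def add_bucket(deltas: dict) -> None:
        # Keep only the top 5N entries of this 10s period (oversampling).
        top = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[: N * OVERSAMPLE]
        buckets.append(Counter(dict(top)))

    def rolling_top_n() -> list:
        total = Counter()
        for bucket in buckets:
            total.update(bucket)  # sum the per-bucket hit counts
        # Cap at N only after summing the buckets, and export per-second rates.
        return [(key, hits / WINDOW_SECONDS) for key, hits in total.most_common(N)]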

@rgacogne (Member, Author) commented

If the operation is expensive, it does not really matter if it is run once per minute or once every 10s. The main concern is that it either a) locks resources causing latency to spike or b) steals available CPU resources from processing packets. Is locking an issue here? If the issue is CPU, is it possible to throttle the number of entries we process?

The good news is that locking is not much of an issue: we use a sort of RCU mechanism that allows the list to be updated while we are still accessing a previous version. So we mostly care about not stealing too many CPU resources indeed. As for throttling the number of entries we process, I don't think we can do that and still provide meaningful numbers: the structures we use are optimized for keyed lookups, so we need to scan them all to find the ones with the highest number of hits.

You mention that the entries are short-lived. Is there a minimum period they are guaranteed to be kept around since the last hit?

I'm afraid there isn't: the duration of a dynamic block is configured by the administrator (a typical value is 60s) and starts when the block is inserted (or renewed), not when a hit occurs. I'm quite reluctant to change that, for example by storing a last-hit timestamp, mostly because we want the fast path to be as efficient as possible in case of a hit.

I'm wondering if this is a real issue, and whether we should accept that these metrics will be some kind of best-effort anyway. So perhaps do what you suggest, keep 6 periods of over-sampled data and sum, detecting when a counter is lower than the previous value but not worrying too much about losing some hits here and there.

@wojas (Member) commented Nov 23, 2020

So we mostly care about not stealing too many CPU resources indeed.

If this has a significant negative impact on performance during collection, this is almost just as big a problem when it occurs once per minute as when it occurs once per 10s.

Any idea how much impact one run would have on a typical instance?

I'm wondering if this is a real issue, and whether we should accept that these metrics will be some kind of best-effort anyway. So perhaps do what you suggest, keep 6 periods of over-sampled data and sum, detecting when a counter is lower than the previous value but not worrying too much about losing some hits here and there.

This sounds good enough. At least we will not completely miss an entry this way, just potentially miscount it a bit.

Do we have any control over when the cleanups occur? If so, perhaps the cleanup could run immediately after the metric collection, to minimize the number of hits that get dropped without being counted?

@rgacogne (Member, Author) commented

If this has a significant negative impact on performance during collection, this is almost just as big a problem when it occurs once per minute as when it occurs once per 10s.

Any idea how much impact one run would have on a typical instance?

Agreed. We have a few benchmarks in our unit tests, showing that scanning 1M NMG dynamic blocks takes roughly 60 ms on my desktop. 1M blocks is likely a very high number; I don't expect we will have more than that in production.
Scanning 1M SMT entries takes ~750 ms, which is a bit more problematic, but I doubt we will ever reach even 100k entries there; even 10k seems doubtful. So I guess retrieving them every 10s could work.

Do we have any control over when the cleanups occur? If so, perhaps the cleanup could run immediately after the metric collection, to minimize the number of hits that get dropped without being counted?

Until now the cleanup was done in the maintenance hook, so we don't have any control over it, but since we will need a new thread to collect the metrics I guess I could move the cleanup to that thread as well.

@wojas (Member) commented Nov 25, 2020

That sounds fine. We can never run a server at a consistent 100% CPU capacity anyway, because there is always something that occasionally wants attention.

@rgacogne marked this pull request as ready for review November 26, 2020 10:57
@rgacogne (Member, Author) commented

I pushed a new version that should address most of the issues. I will add more tests and also make the cleaning interval configurable.

@rgacogne (Member, Author) commented

Rebased to fix a conflict, plus fixed the oversampling. Also fixed a refresh issue for SMT entries that was not caused by this PR: we now properly skip entries that are too old when scanning responses, so we need to set a long enough period that the responses are still valid when we compute the ratio.