Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dnsdist: expose all metrics in API (including servers, pools and frontends) #6002

Closed
wojas opened this issue Nov 27, 2017 · 6 comments
Closed

Comments

@wojas
Copy link
Member

wojas commented Nov 27, 2017

  • Program: dnsdist
  • Issue type: Feature request

Short description

Currently dnsdist does not export all metrics that are available in Carbon to the API.

The ones I am aware of are:

  • dnsdist.$instance.main.servers.*.$metric
  • dnsdist.$instance.main.pools.*.$metric
  • dnsdist.$instance.main.frontends.*.$metric

Usecase

Gathering all statistics without using Carbon.

Description

It would be hard to expose those in a clean way in the current /api/v1/servers/localhost/statistics API.

I propose adding a new /api/v1/servers/localhost/metrics API that would include these missing statistics, and structure them in a way that would make conversion to Prometheus metrics easy (see also #4947 ). The result data for this API would be in JSON, so that users that do not use Prometheus can still easily access them.

Example JSON:

[
  {"name": "queries", "value": 123},
  {"name": "cache-hits", "value": 12},
  {"name": "server-queries", "labels": {"server": "foo"}, "value": 123},
  {"name": "server-drops", "labels": {"server": "foo"}, "value": 2}, 
  {"name": "pool-cache-hits", "labels": {"pool": "bar"}, "value": 2},
  {"name": "latency-avg", "labels": {"window": "100"}, "value": 2.1},
  {"name": "latency-avg", "labels": {"window": "1000"}, "value": 1.7},
  {"name": "latency", "labels": {"bucket": "0-1"}, "value": 114},
  {"name": "latency", "labels": {"bucket": "1-10"}, "value": 123},
  {"name": "latency", "labels": {"bucket": "slow"}, "value": 13},
  {"name": "rule", "labels": {"action": "nxdomain"}, "value": 7},
]

Note that these names do not fully comply to Prometheus metric naming recommendations. Since the current statistic names are well documented, I think it's better to stick as close as possible to current names instead of renaming all of them. The few exceptions that I propose are those where a label would be clearer and more flexible.

If we decide to actually add a Prometheus endpoint, we would need to map '-' to '_' and prefix all of them. Example conversion:

acl-drops                      pdns_dnsdist_acl_drops
cache-hits                     pdns_dnsdist_cache_hits
cache-misses                   pdns_dnsdist_cache_misses
cpu-sys-msec                   pdns_dnsdist_cpu_sys_msec
cpu-user-msec                  pdns_dnsdist_cpu_user_msec
downstream-send-errors         pdns_dnsdist_downstream_send_errors
downstream-timeouts            pdns_dnsdist_downstream_timeouts
dyn-block-nmg-size             pdns_dnsdist_dyn_block_nmg_size
dyn-blocked                    pdns_dnsdist_dyn_blocked
empty-queries                  pdns_dnsdist_empty_queries
fd-usage                       pdns_dnsdist_fd_usage
latency-avg100                 pdns_dnsdist_latency_avg{window="100"}
latency-avg1000                pdns_dnsdist_latency_avg{window="1000"}
latency-avg10000               pdns_dnsdist_latency_avg{window="10000"}
latency-avg1000000             pdns_dnsdist_latency_avg{window="1000000"}
latency0-1                     pdns_dnsdist_latency{bucket="0-1"}
latency1-10                    pdns_dnsdist_latency{bucket="1-10"}
latency10-50                   pdns_dnsdist_latency{bucket="10-50"}
latency100-1000                pdns_dnsdist_latency{bucket="100-1000"}
latency50-100                  pdns_dnsdist_latency{bucket="50-100"}
latency-slow                   pdns_dnsdist_latency{bucket="slow"}
no-policy                      pdns_dnsdist_no_policy
noncompliant-queries           pdns_dnsdist_noncompliant_queries
noncompliant-responses         pdns_dnsdist_noncompliant_responses
queries                        pdns_dnsdist_queries
rdqueries                      pdns_dnsdist_rdqueries
real-memory-usage              pdns_dnsdist_real_memory_usage
responses                      pdns_dnsdist_responses
rule-drop                      pdns_dnsdist_rule{action="drop"}
rule-nxdomain                  pdns_dnsdist_rule{action="nxdomain"}
rule-refused                   pdns_dnsdist_rule{action="refused"}
self-answered                  pdns_dnsdist_self_answered
servfail-responses             pdns_dnsdist_servfail_responses
trunc-failures                 pdns_dnsdist_trunc_failures
uptime                         pdns_dnsdist_uptime
@wojas
Copy link
Member Author

wojas commented Nov 27, 2017

One item that would be useful to convert according to https://prometheus.io/docs/concepts/metric_types/#histogram if possible is latency:

pdns_dnsdist_latency_bucket{le="1"}
pdns_dnsdist_latency_bucket{le="10"}
pdns_dnsdist_latency_bucket{le="50"}
pdns_dnsdist_latency_bucket{le="100"}
pdns_dnsdist_latency_bucket{le="1000"}
pdns_dnsdist_latency_bucket{le="+Inf"}
pdns_dnsdist_latency_sum
pdns_dnsdist_latency_count

This would enable the use of https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile

Note that the bucket values are cumulative in contrast to the current buckets. This makes it easy to add additional buckets without breaking dashboards.

@johnhtodd
Copy link

I noticed when you did your mapping that you did not create labels for metrics that only have two possible choices, even if they are mutually exclusive measurements of the same metric type. (Example: cache-hits and cache-misses remain separate metrics, instead of becoming pdns_dnsdist_cache{result="miss"} and pdns_dnsdist_cache{result="hit"} ) Is this a best practice with just two possible results? I know now from experience that any use of labels on the same metric type makes life much, much easier so I'd always lean towards using labels where possible if a metric type has to be aggregated across types.

@wojas
Copy link
Member Author

wojas commented Dec 8, 2017

@johnhtodd I initially tried to map recursor metric names into 'proper' labeled names, including the hit/miss metrics.

I found that things that appear to belong together at first glance (based on names and type of information) are often not drawn together when you build graphs that are actually insightful (you can check the Metronome graphs). I ended up with various graphs where some metrics access information using a label and others do not. Some of the others maybe should have a label, or maybe not. Some metrics would conflict in name with others if you move part into a label, and you would have to rename one of them.

It is pretty hard to map a set of metrics to proper labeled entries, if you do not do so from the start and think it through carefully. The current unlabeled metric names at least are known and used by many users, documented (there is room for improvement there).

Keeping relabeling to the minimum makes it easier to build dashboards for arbitrary metric storage systems based on the documentation, because you do not need to guess if a specific metric was renamed or not.

@johnhtodd
Copy link

@wojas Thanks for the run-through. We just did a one-for-one mapping to import into Prometheus, and it's been a real challenge to create reasonable queries because we should have used labels where we just mapped to metric names.

@filippog
Copy link

re: full Prometheus histograms support, I've opened #6088 to expose the _sum part of the histogram since AFAICT it isn't presented as a metric now.

@rgacogne
Copy link
Member

rgacogne commented Apr 5, 2019

Unless I'm mistaken, we know expose all metrics via the API, and via the prometheus endpoint as well.

@rgacogne rgacogne closed this as completed Apr 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants