metrics for downstream state changes and total downtime #12113

appliedprivacy · 2022-10-21T09:14:59Z

Program: dnsdist
Issue type: Feature request

Short description

A new Prometheus metric: a counter of how often the status of a resolver has changed.

Usecase

For some reason we have a flapping resolver. The logs show:

Oct 21 10:48:09 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:10 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:17 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:19 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:34 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:36 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:57 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:58 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'

Since each outage usually lasts just 1-2 seconds, it remains largely invisible when monitoring dnsdist_server_status, which is only sampled at scrape time.
We therefore propose adding two new counters to dnsdist's Prometheus metrics to make these issues visible to monitoring.
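
To illustrate why a sampled status gauge misses this kind of flapping, here is a minimal sketch that replays the down intervals from the log above against a scrape schedule. The 15-second scrape interval, the start of the window at 10:48:00, and the helper names are assumptions made for this illustration; they are not taken from dnsdist or from our setup.

```python
# Down intervals taken from the log above, as seconds after 10:48:00.
# Each pair is (marked 'down', marked 'up').
DOWN_INTERVALS = [(9, 10), (17, 19), (34, 36), (57, 58)]

# Assumption: Prometheus scrapes every 15 s, starting at 10:48:00.
SCRAPE_INTERVAL = 15


def status_at(second):
    """Return 1 (up) or 0 (down), mimicking a status gauge at one instant."""
    is_down = any(start <= second < end for start, end in DOWN_INTERVALS)
    return 0 if is_down else 1


def scrape_gauge(duration=60):
    """Sample the gauge at each scrape instant within the window."""
    return [status_at(t) for t in range(0, duration + 1, SCRAPE_INTERVAL)]


samples = scrape_gauge()
# One 'down' and one 'up' transition per interval.
changes = 2 * len(DOWN_INTERVALS)
print("gauge samples:", samples)
print("status changes a counter would have recorded:", changes)
```

Under these assumptions every scrape sees the resolver as up, while a monotonically increasing counter would still reflect all eight transitions at the next scrape.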

Description

Given these events:

Oct 21 10:48:09 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:10 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:17 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:18 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'

the new metrics would contain:

dnsdist_server_status_changes_total{server="109_70_100_136:53"} 4
dnsdist_server_status_down_seconds_total{server="109_70_100_136:53"} 2
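
As a rough sketch of the intended semantics, the two values can be derived from the log lines themselves. This is not how dnsdist would implement the counters internally; the function name `summarize`, the regular expression, and the fixed year are assumptions made for this example. It counts every logged 'down' or 'up' line as one status change, and it keys results by the plain address rather than by the Prometheus `server` label.

```python
import re
from datetime import datetime

LOG = """\
Oct 21 10:48:09 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:10 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:17 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:18 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
"""

# Captures the syslog timestamp, the downstream address and the new state.
PATTERN = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*Marking downstream (\S+) as '(up|down)'$"
)


def summarize(log_text, year=2022):
    """Return (changes, down_seconds) dictionaries keyed by downstream address.

    Assumption: syslog timestamps carry no year, so one is supplied here.
    """
    changes = {}
    down_seconds = {}
    down_since = {}
    for line in log_text.splitlines():
        match = PATTERN.match(line)
        if not match:
            continue
        stamp, server, state = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        # Every logged transition counts as one status change.
        changes[server] = changes.get(server, 0) + 1
        if state == "down":
            down_since[server] = when
        elif server in down_since:
            # Close the open down interval and add its length to the total.
            elapsed = (when - down_since.pop(server)).total_seconds()
            down_seconds[server] = down_seconds.get(server, 0) + int(elapsed)
    return changes, down_seconds


changes, down_seconds = summarize(LOG)
print(changes, down_seconds)
```

With the four events above, this counting yields 4 status changes and 2 seconds of total downtime for 109.70.100.136:53.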

rgacogne · 2022-10-21T09:21:19Z

That sounds like a very good idea, thanks! I have put this in the 1.9 milestone as we are (hopefully) near the first alpha release of 1.8 and I'm afraid I will not have time to actually implement that change before the first beta (after which we are in "bug fixes only" until the final release), but I will gladly merge a pull request before the beta if someone else feels up to it :)

rgacogne · 2023-08-14T12:44:49Z

#13009 added a counter for the number of health-check failures, which should mostly cover the first need. I'll ponder the "total downtime" one.

rgacogne added the "feature request" and "dnsdist" labels Oct 21, 2022

rgacogne added this to the dnsdist-1.9 milestone Oct 21, 2022

rgacogne modified the milestones: dnsdist-1.9, dnsdist-1.10 Jan 22, 2024
