Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics for downstream state changes and total downtime #12113

Open
appliedprivacy opened this issue Oct 21, 2022 · 2 comments
Open

metrics for downstream state changes and total downtime #12113

appliedprivacy opened this issue Oct 21, 2022 · 2 comments

Comments

@appliedprivacy
Copy link
Contributor

  • Program: dnsdist
  • Issue type: Feature request

Short description

new prometheus metric showing a counter how often the status of a resolver changed.

Usecase

For some reason we have a flapping resolver. The logs show:

Oct 21 10:48:09 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:10 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:17 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:19 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:34 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:36 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:57 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:58 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'

Since the outage is usually lasts just 1-2 seconds it remains largely invisible when monitoring dnsdist_server_status,
therefore we would propose to add two new counters to dnsdist's prometheus metrics to make these issues visible to monitoring.

Description

Given these events:

Oct 21 10:48:09 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:10 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'
Oct 21 10:48:17 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'down'
Oct 21 10:48:18 bender-dpriv1 dnsdist[24782]: Marking downstream 109.70.100.136:53 as 'up'

the new metrics would contain:

dnsdist_server_status_changes_total{server="109_70_100_136:53"} 3
dnsdist_server_status_down_seconds_total{server="109_70_100_136:53"}  2
@rgacogne
Copy link
Member

That sounds like a very good idea, thanks! I have put this in the 1.9 milestone as we are (hopefully) near the first alpha release of 1.8 and I'm afraid I will not have to actually implement that change before the first beta (after which we are in "bug fixes only" until the final release), but I will gladly merge a pull request before the beta if someone else feels up to it :)

@rgacogne
Copy link
Member

#13009 added a counter for the number of health-check failures, which should mostly cover the first need. I'll ponder the "total downtime" one.

@rgacogne rgacogne modified the milestones: dnsdist-1.9, dnsdist-1.10 Jan 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants