Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dnsdist: Do not report latency metrics of down upstream servers #10508

Merged
merged 2 commits into from Jun 29, 2021

Conversation

hhoffstaette
Copy link
Contributor

@hhoffstaette hhoffstaette commented Jun 19, 2021

Short description

This patch suppresses output of metrics of dead upstream servers, as reported and discussed
in bug #10500. Tested manually by curling /metrics, setting a server to down and verifying
again that the output no longer contains this server's metrics except for dnsdist_server_status.

Checklist

I have:

  • read the CONTRIBUTING.md document
  • compiled this code
  • tested this code
  • included documentation (including possible behaviour changes)
  • documented the code
  • added or modified regression test(s)
  • added or modified unit test(s)

@hhoffstaette
Copy link
Contributor Author

hhoffstaette commented Jun 19, 2021

The seond part of the behaviour change discussed in #10500 - zeroing stats when a server goes down - will come
in a separate PR.
Nevermind, easy enough.

Copy link
Member

@rgacogne rgacogne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks! I'm wondering if it makes sense to hide all values, or if hiding only the latency would not be better, because the other values might still be useful to investigate what is going on.

@hhoffstaette
Copy link
Contributor Author

hhoffstaette commented Jun 21, 2021

Looks good, thanks! I'm wondering if it makes sense to hide all values, or if hiding only the latency would not be better, because the other values might still be useful to investigate what is going on.

Hmm..not sure either, but if I start clearing stats on setDown() I should know how much to reset. :)
Personally I don't care, but if there's something you want to see preverved across state changes just let me know. Maybe others have some opinion on this too.
Also it would be good to know whether server status change zeroes out the same stats as e.g. an explicit reset, as requested in #3459. I can't say whether they should be the same or whether automatic setDown() should only clear a subset.

@rgacogne
Copy link
Member

I would prefer to only hide (and automatically clear) the latency for now, since I personally would find it a weird to reset the other metrics.
I would be OK with adding a DownstreamState::resetMetrics() binding (or something like that) to clear all the metrics for a backend, but I'd really prefer to keep it separate. I have seen servers going up and down from time to time in production and I would not want to have the metrics reset automatically in that case.

@hhoffstaette
Copy link
Contributor Author

hhoffstaette commented Jun 21, 2021

We now only omit latency of a down server, as discussed.

@hhoffstaette hhoffstaette changed the title dnsdist: Do not report metrics about down upstream servers dnsdist: Do not report latency metrics of down upstream servers Jun 21, 2021
@hhoffstaette
Copy link
Contributor Author

Second commit resets the latency on change of availability; tested manually, works fine (not exactly rocket surgery).
I hope hooking into setDown() is enough; I didn't really understand the exact interaction with setAuto().

@hhoffstaette
Copy link
Contributor Author

hhoffstaette commented Jun 21, 2021

A pretty picture!

state-change

Violet is a bit slow, so disable it at 19:40 and bring it back up at 19:50. Latency was reset and is slowly ramping up again. There are not too many requests going on so it takes a while, but the general behaviour should be clear.

@hhoffstaette
Copy link
Contributor Author

hhoffstaette commented Jun 22, 2021

Looks like I'm missing something wrt, the state management. The latency is reset when I set a server down manually, but not when it fails the liveliness check. I just had a server fail by itself and when it came back the latency resumed with the previous measurement. Apparently there is a second path where a server is marked "down".
Edit: found it, "upStatus" is a second state variable that is updated outside of setDown().

@rgacogne
Copy link
Member

found it, "upStatus" is a second state variable that is updated outside of setDown().

We currently have two different variables, availability which is either:

  • Up if setUp() has been called, forcing the server in the up state, disabling health-checking ;
  • Down if setDown() has been called, forcing the server in the down state, disabling health-checking as well ;
  • Auto, enabling health-checking.

and upStatus which is updated by the health-checks and is true is the server is up, false otherwise. The health-checking code directly updates that value, but it might be cleaner to define a method to do that, allowing you to plug yourself into that new method.

@hhoffstaette
Copy link
Contributor Author

hhoffstaette commented Jun 26, 2021

Updated second commit to also introduce a new setUpStatus() for all upStatus updates and also reset the latency there.

  • Pulled the cable on my switch
  • Waited for liveliness check to kick in: latency gets reset
  • Plugged back in: latency ramps up again

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants