Expose server reachability #75

Closed
hhoffstaette opened this issue Apr 19, 2024 · 5 comments · Fixed by #76

Comments

@hhoffstaette
Contributor

hhoffstaette commented Apr 19, 2024

Based on a recent mailing list thread I'd like to propose the exposure of an additional metric.

The original chrony client shows the reachability of upstream servers and this can be used to detect changes in network topology without having to rely on time/clock drift for failure detection.

Having this field exposed would make it easy to create e.g. an alertmanager rule.

@SuperQ
Owner

SuperQ commented Apr 19, 2024

From the chrony docs

This shows the source’s reachability register printed as an octal number. The register has 8 bits and is updated on every received or missed packet from the source. A value of 377 indicates that a valid reply was received for all from the last eight transmissions.

The facebook/time library returns this value as a uint16.

While there are some interesting things we could do with the bits, the most useful thing that comes to mind would be to compute the fraction of 1 bits in the value. So if all probes fail, the metric is 0.0. If all probes are passing, 1.0.
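As a rough illustration of the ratio idea, the sketch below counts the set bits in the low 8 bits of the register and divides by 8. The function name `reachRatio` is hypothetical and this is not the exporter's actual code. The `uint16` parameter only mirrors the type mentioned above.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reachRatio (hypothetical helper) returns the fraction of set bits in the
// 8-bit reachability register: 0.0 when none of the last eight polls got a
// valid reply, 1.0 when all of them did. Only the low 8 bits are used.
func reachRatio(reach uint16) float64 {
	return float64(bits.OnesCount8(uint8(reach))) / 8.0
}

func main() {
	// Values are written in octal, the way chronyc prints the register.
	for _, r := range []uint16{0o000, 0o017, 0o377} {
		fmt.Printf("reach=%03o ratio=%.3f\n", r, reachRatio(r))
	}
}
```

A ratio loses the ordering of the bits, so a register with four old failures looks the same as one with four recent failures. That may or may not matter for alerting.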

Another option would be to expose only the current bit. Since Prometheus typically polls faster than NTP packets are sent, we could represent the "last reach success" as a simple bool.

The question is, how is that register updated, shift left? shift right?
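A minimal sketch of one possible answer, assuming the conventional NTP behaviour in which the register is shifted left on every poll and the newest result lands in the least significant (rightmost) bit. That assumption should be checked against the chrony sources. The names `updateReach` and `lastReachSuccess` are hypothetical.

```go
package main

import "fmt"

// updateReach models the assumed register update: shift left by one and
// record the newest poll result in the least significant bit. The oldest
// result falls off the top of the 8-bit register.
func updateReach(reach uint8, replied bool) uint8 {
	reach <<= 1
	if replied {
		reach |= 1
	}
	return reach
}

// lastReachSuccess reports whether the most recent poll got a valid reply,
// under the same shift-left assumption.
func lastReachSuccess(reach uint8) bool {
	return reach&1 == 1
}

func main() {
	var reach uint8
	for _, replied := range []bool{true, true, false, true} {
		reach = updateReach(reach, replied)
		fmt.Printf("reach=%03o last_success=%t\n", reach, lastReachSuccess(reach))
	}
}
```

If the register were instead shifted right, the "current" bit would be the most significant one, and the mask in `lastReachSuccess` would have to change to `0x80`.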

The next useful option would be to expose the bits directly as a state set. While this would provide the full bit detail, it's a bit high cardinality.

As for the easy option, exposing the raw byte directly as a value, this seems less useful for monitoring, as you would have to interpret the bits in PromQL for the alert to be useful. I would say this is better mapped in the exporter's code.

@hhoffstaette
Contributor Author

hhoffstaette commented Apr 19, 2024

The reachability is available in the SourceData.
However, I now wonder whether this is more useful than simply alerting on chrony_sources_state_info, which already exposes the source state in a human-readable way.

Edit: a simple chrony_sources_state_info{source_state != "sync"} (for say >=10m) alert would have been enough to prevent the problem reported on the mailing list.

@hhoffstaette
Contributor Author

Flattening the value to a binary last_reach_success is probably too fragile against network hiccups: packets can and do get lost even on in-house networks (resyncs, reboots and whatnot).

Hmm.. maybe this wasn't such a useful idea at all 😅

@SuperQ
Owner

SuperQ commented Apr 19, 2024

Chrony's default minpoll is 6, which is 2^6 seconds (64 seconds) with a maxpoll of 10 (2^10 = 1024 seconds). So even if you have a scrape interval of 60s, you'll always catch the last reach bit.

NTP is a very low packet count protocol.

SuperQ added a commit that referenced this issue Apr 19, 2024
Compute two reachability metrics from the "Reachability" bitmask.
* Count the number of 1s in the bitmask as the polling success ratio.
* Expose the rightmost bit as the "last reach success".

Fixes: #75

Signed-off-by: SuperQ <superq@gmail.com>

@SuperQ
Owner

SuperQ commented Apr 19, 2024

Did some local testing.

  • The iburst means there are a number of packets on restart, so we can't catch individual early bits.
  • The ratio of course starts out at 0 on restart, so you can't alert on the success ratio until the source has been up long enough to fill all of the bits.
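The warm-up effect in the second point can be illustrated with a small simulation. It assumes an empty register after restart and a shift-left update with the newest result in the rightmost bit. `warmupRatios` is a hypothetical helper, not part of the exporter.

```go
package main

import (
	"fmt"
	"math/bits"
)

// warmupRatios (hypothetical helper) simulates n consecutive successful
// polls starting from an empty register and returns the success ratio
// after each poll. n must not be negative.
func warmupRatios(n int) []float64 {
	var reach uint8
	ratios := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		reach = reach<<1 | 1 // assumed update: shift left, set newest bit
		ratios = append(ratios, float64(bits.OnesCount8(reach))/8.0)
	}
	return ratios
}

func main() {
	for i, r := range warmupRatios(8) {
		fmt.Printf("poll %d: ratio=%.3f\n", i+1, r)
	}
}
```

Even a perfectly healthy source reports a ratio below 1.0 until eight polls have been recorded, so an alert on the ratio would likely need a grace period after restart.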
