Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dnsdist: latency-count is broken #11239

Closed
rgacogne opened this issue Jan 26, 2022 · 10 comments · Fixed by #11323
Closed

dnsdist: latency-count is broken #11239

rgacogne opened this issue Jan 26, 2022 · 10 comments · Fixed by #11323
Assignees
Milestone

Comments

@rgacogne
Copy link
Member

  • Program: dnsdist
  • Issue type: Bug report

Short description

On master, I get the following stats:

latency-avg100                              97490075.9 tcp-listen-overflows                               0
latency-avg1000                             21377171.3 tcp-query-pipe-full                                0
latency-avg10000                            2323640.8  trunc-failures                                     0
latency-avg1000000                            23451.7  udp-in-csum-errors                                 0
latency-count                                   297    udp-in-errors                                      0
latency-slow                                     19    udp-noport-errors                                  0
latency-sum                                23453901    udp-recvbuf-errors                                 0
latency0-1                                      137    udp-sndbuf-errors                                  0
latency1-10                                       0    udp6-in-csum-errors                                0
latency10-50                                      0    udp6-in-errors                                     0
latency100-1000                                   0    udp6-noport-errors                                 0
latency50-100                                     0    udp6-recvbuf-errors                                0

Latency count is clearly not a sum of the other metrics. The value seems to come from:

g_stats.responses + g_stats.selfAnswered + g_stats.cacheHits

which is wrong since the latency is only computed for UDP queries, while these values are also updated for TCP/DoT/DoH queries, and it also takes self-answered responses which, as far as I can tell, are not taken into account for the latency stats.

@appliedprivacy
Copy link
Contributor

We upgraded from dnsdist 1.6.1 to 1.7.0 (on FreeBSD) and get hit by this issue.
You can clearly see when we upgraded in this graph:

dnsdist-latency-bucket-1 7 0

We realized it is likely an issue in the metric dnsdist_latency_bucket and the numbers are not reflecting reality
because - as you also mentioned - the number do not add up when compared to dnsdist_frontend_queries.

@appliedprivacy
Copy link
Contributor

https://mailman.powerdns.com/pipermail/dnsdist/2023-April/001331.html

dnsdist 1.8.0 on FreeBSD 13.1 dnsdist_latency_bucket output:

# HELP dnsdist_latency Histogram of responses by latency (in milliseconds)
# TYPE dnsdist_latency histogram
dnsdist_latency_bucket{le="1"} 7136
dnsdist_latency_bucket{le="10"} 7136
dnsdist_latency_bucket{le="50"} 7136
dnsdist_latency_bucket{le="100"} 7136
dnsdist_latency_bucket{le="1000"} 7136
dnsdist_latency_bucket{le="+Inf"} 7136
dnsdist_latency_sum 0
dnsdist_latency_count 7136

a bit later:

# HELP dnsdist_latency Histogram of responses by latency (in milliseconds)
# TYPE dnsdist_latency histogram
dnsdist_latency_bucket{le="1"} 7157
dnsdist_latency_bucket{le="10"} 7157
dnsdist_latency_bucket{le="50"} 7157
dnsdist_latency_bucket{le="100"} 7157
dnsdist_latency_bucket{le="1000"} 7157
dnsdist_latency_bucket{le="+Inf"} 7157
dnsdist_latency_sum 0
dnsdist_latency_count 7157

@azadi
Copy link
Contributor

azadi commented Sep 5, 2023

Adding another data-point where we hit the same issue with the dnsdist_latency_bucket metric, both for 1.7.2 and 1.8.0, after upgrading from 1.6.1.

dnsdist_latency_bucket{le="1"} 298782
dnsdist_latency_bucket{le="10"} 298782
dnsdist_latency_bucket{le="50"} 298782
dnsdist_latency_bucket{le="100"} 298782
dnsdist_latency_bucket{le="1000"} 298782
dnsdist_latency_bucket{le="+Inf"} 298782
dnsdist_latency_sum 0
dnsdist_latency_count 298782

Linux (Debian) in this case and observed across multiple hosts (but with the same configuration as expected).

@rgacogne
Copy link
Member Author

rgacogne commented Sep 8, 2023

Re-opening this issue since apparently it's still a problem in 1.8.0 (and likely in 1.8.1 as well).

@omoerbeek
Copy link
Member

Looking at the values above, they suggest that all values recorded by doLatencyStats() are 0.0.

@rgacogne
Copy link
Member Author

rgacogne commented Sep 8, 2023

So to get metric values like these, the only option I see would be that all queries had a latency of 0. Checking if that's possible and how.

@omoerbeek
Copy link
Member

Correction: this can also happen if the values of udiff are all < 1000. The udiff / 1000 in the expression dnsdist::metrics::g_stats.latencySum += udiff / 1000; in doLatencyStats() will then be between 0 and 1, and truncated to 0.
I have one dnsdist talking to an auth on the same machine, and there's I see similar numbers, all latencies are below 1ms.

@rgacogne
Copy link
Member Author

rgacogne commented Sep 8, 2023

Looking at latency0_1 should make that clear, right?

@omoerbeek
Copy link
Member

Looking at latency0_1 should make that clear, right?

Indeed. having both the Prometheus and the dumpStats() output can help.

@rgacogne
Copy link
Member Author

I'm going to close this again since based on what Otto and I observed it does look correct in 1.8.0, please let me know otherwise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants