Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsd 4.8 stats with implausible spikes #313

Closed
and0x000 opened this issue Jan 25, 2024 · 8 comments
Closed

nsd 4.8 stats with implausible spikes #313

and0x000 opened this issue Jan 25, 2024 · 8 comments
Assignees

Comments

@and0x000
Copy link

and0x000 commented Jan 25, 2024

With nsd 4.8 we see unexplainable spikes from the num.rcode and num.type metrics.

The spikes seem to appear at a fixed frequency. The data is implausible in the context of other networking metrics for the system (such as the overall number of packets). Also, the spikes do not show up for nsd 4.7 on a comparable system.

image

(Graph from graphana, data comes from a custom crawler that queries nsd-control stats every 30 seconds or so)

I'd assume that this has something to do with the changes introduced in #305 but I lack the understanding of C to pinpoint anything.

NSD version 4.8.0
Written by NLnet Labs.

Configure line: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-configdir=/etc/nsd --with-nsd_conf_file=/etc/nsd/nsd.conf --with-pidfile=/run/nsd/nsd.pid --with-dbfile=/var/lib/nsd/nsd.db --with-zonesdir=/etc/nsd --with-xfrdfile=/var/lib/nsd/xfrd.state --disable-largefile --disable-recvmmsg --enable-root-server --enable-mmap --enable-ratelimit --enable-zone-stats --enable-systemd --enable-checking --enable-dnstap
Event loop: libevent 2.1.12-stable (uses epoll)
Linked with OpenSSL 1.1.1w  11 Sep 2023

Copyright (C) 2001-2020 NLnet Labs.  This is free software.
There is NO warranty; not even for MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.
@k0ekk0ek k0ekk0ek self-assigned this Jan 26, 2024
@k0ekk0ek
Copy link
Contributor

Hi @and0x000. It's certainly suspicious that these started appearing with 4.8.0, but I'll have to investigate this first before I can say anything useful. I'll have a look.

@k0ekk0ek
Copy link
Contributor

@and0x000, I trust you're using the exact same scripts with NSD 4.7? Would it be possible to post the correcsponding graph for the number of queries? It would be useful to know if this problem is limited to the type and rcode counters.

@and0x000
Copy link
Author

Hi @k0ekk0ek , thank you for looking into it.

Yes, the crawler script is identical for both NSD 4.7 and 4.8. So is the rest of the system. Only the NSD packet was updated.

It doesn't seem to be limited to the type and rcode counters, they were only the ones for which I had immediate graphs at hand.
From what I can see, the following metrics are affected:

  • num.queries
  • server<#>.queries
  • num.truncated
  • num.dropped
  • num.type
  • num.rcode
  • num.class
  • num.[udp|tcp|tls]

@k0ekk0ek
Copy link
Contributor

My hypothesis is that the copying of statistics is not an atomic operation. So when memcpy'ing the statistics for reporting an update might occur that corrupts the values in memory. I could reimplement the logic, I'd need to do that anyway, but it might be worth it to implement a stress-test. That doesn't guarantee we'll hit the issue though... I need to give this some more thought

@k0ekk0ek
Copy link
Contributor

@and0x000, would it be possible to collect the raw numbers for a couple of days in text files? I've been through the code, the logic isn't all that complicated, so if the numbers really don't add up (as in, they wrap around or similar) I don't see how it's anything other than my non-atomic hypothesis and we'll have to add somekind of synchronization mechanism. It's just that we'll have to insert extra logic for each query that's processed, which is not desirable(?)

@and0x000
Copy link
Author

@and0x000, would it be possible to collect the raw numbers for a couple of days in text files?

Yea, I'll write the raw stats output to a file. With the frequency of the spikes occurring, I believe there should be enough data to compare against within a few hours. Is there a way that I can provide you with the raw data? 'Cause I don't necessarily want it to end up public.

@k0ekk0ek
Copy link
Contributor

Thanks! Yes, you may send it via email to jeroen -at- nlnetlabs -dot- nl. Alternatively, if it's to much data for email, we could use some file sharing service.

@wcawijngaards
Copy link
Member

The commit fixes it, thanks for the report! The detailed logs revealed that the statistic values themselves are fine, and do not show the spikes. But the time.elapsed printout is wrong. It turns out that the change in 4.8 causes the stats_noreset lookup to reset the timer for the time.elapsed printout. This was not the behaviour before that. When the elapsed time is very small, because another stats lookup just happened, then the numqueries divided by time becomes the very large spike value. The commit fixes it, the other stats lookup does not reset the time, and thus the elapsed time should be printed out correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants