nsd 4.8 stats with implausible spikes #313

and0x000 · 2024-01-25T15:20:28Z

With nsd 4.8 we see unexplainable spikes from the num.rcode and num.type metrics.

The spikes seem to appear at a fixed frequency. The data is implausible in the context of other networking metrics for the system (such as the overall number of packets). Also, the spikes do not show up for nsd 4.7 on a comparable system.

(Graph from graphana, data comes from a custom crawler that queries nsd-control stats every 30 seconds or so)

I'd assume that this has something to do with the changes introduced in #305 but I lack the understanding of C to pinpoint anything.

NSD version 4.8.0
Written by NLnet Labs.

Configure line: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-configdir=/etc/nsd --with-nsd_conf_file=/etc/nsd/nsd.conf --with-pidfile=/run/nsd/nsd.pid --with-dbfile=/var/lib/nsd/nsd.db --with-zonesdir=/etc/nsd --with-xfrdfile=/var/lib/nsd/xfrd.state --disable-largefile --disable-recvmmsg --enable-root-server --enable-mmap --enable-ratelimit --enable-zone-stats --enable-systemd --enable-checking --enable-dnstap
Event loop: libevent 2.1.12-stable (uses epoll)
Linked with OpenSSL 1.1.1w  11 Sep 2023

Copyright (C) 2001-2020 NLnet Labs.  This is free software.
There is NO warranty; not even for MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.

The text was updated successfully, but these errors were encountered:

k0ekk0ek · 2024-01-26T08:20:49Z

Hi @and0x000. It's certainly suspicious that these started appearing with 4.8.0, but I'll have to investigate this first before I can say anything useful. I'll have a look.

k0ekk0ek · 2024-01-26T10:46:16Z

@and0x000, I trust you're using the exact same scripts with NSD 4.7? Would it be possible to post the correcsponding graph for the number of queries? It would be useful to know if this problem is limited to the type and rcode counters.

and0x000 · 2024-01-26T11:17:39Z

Hi @k0ekk0ek , thank you for looking into it.

Yes, the crawler script is identical for both NSD 4.7 and 4.8. So is the rest of the system. Only the NSD packet was updated.

It doesn't seem to be limited to the type and rcode counters, they were only the ones for which I had immediate graphs at hand.
From what I can see, the following metrics are affected:

num.queries
server<#>.queries
num.truncated
num.dropped
num.type
num.rcode
num.class
num.[udp|tcp|tls]

k0ekk0ek · 2024-01-29T11:30:00Z

My hypothesis is that the copying of statistics is not an atomic operation. So when memcpy'ing the statistics for reporting an update might occur that corrupts the values in memory. I could reimplement the logic, I'd need to do that anyway, but it might be worth it to implement a stress-test. That doesn't guarantee we'll hit the issue though... I need to give this some more thought

k0ekk0ek · 2024-01-30T14:25:48Z

@and0x000, would it be possible to collect the raw numbers for a couple of days in text files? I've been through the code, the logic isn't all that complicated, so if the numbers really don't add up (as in, they wrap around or similar) I don't see how it's anything other than my non-atomic hypothesis and we'll have to add somekind of synchronization mechanism. It's just that we'll have to insert extra logic for each query that's processed, which is not desirable(?)

and0x000 · 2024-01-31T17:27:33Z

@and0x000, would it be possible to collect the raw numbers for a couple of days in text files?

Yea, I'll write the raw stats output to a file. With the frequency of the spikes occurring, I believe there should be enough data to compare against within a few hours. Is there a way that I can provide you with the raw data? 'Cause I don't necessarily want it to end up public.

k0ekk0ek · 2024-01-31T18:23:17Z

Thanks! Yes, you may send it via email to jeroen -at- nlnetlabs -dot- nl. Alternatively, if it's to much data for email, we could use some file sharing service.

wcawijngaards · 2024-02-06T10:49:05Z

The commit fixes it, thanks for the report! The detailed logs revealed that the statistic values themselves are fine, and do not show the spikes. But the time.elapsed printout is wrong. It turns out that the change in 4.8 causes the stats_noreset lookup to reset the timer for the time.elapsed printout. This was not the behaviour before that. When the elapsed time is very small, because another stats lookup just happened, then the numqueries divided by time becomes the very large spike value. The commit fixes it, the other stats lookup does not reset the time, and thus the elapsed time should be printed out correctly.

k0ekk0ek self-assigned this Jan 26, 2024

wcawijngaards closed this as completed in ccc1bbc Feb 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nsd 4.8 stats with implausible spikes #313

nsd 4.8 stats with implausible spikes #313

and0x000 commented Jan 25, 2024 •

edited

k0ekk0ek commented Jan 26, 2024

k0ekk0ek commented Jan 26, 2024

and0x000 commented Jan 26, 2024

k0ekk0ek commented Jan 29, 2024

k0ekk0ek commented Jan 30, 2024

and0x000 commented Jan 31, 2024

k0ekk0ek commented Jan 31, 2024

wcawijngaards commented Feb 6, 2024

nsd 4.8 stats with implausible spikes #313

nsd 4.8 stats with implausible spikes #313

Comments

and0x000 commented Jan 25, 2024 • edited

k0ekk0ek commented Jan 26, 2024

k0ekk0ek commented Jan 26, 2024

and0x000 commented Jan 26, 2024

k0ekk0ek commented Jan 29, 2024

k0ekk0ek commented Jan 30, 2024

and0x000 commented Jan 31, 2024

k0ekk0ek commented Jan 31, 2024

wcawijngaards commented Feb 6, 2024

and0x000 commented Jan 25, 2024 •

edited