dnstap logging significantly affects unbound performance (regression in 1.11) #305
Comments
If you're on Linux, it would be interesting if you could use the |
I cannot provide system-wide perf records, but here are perf CPU flame graphs for the unbound process for all combinations (1.11.0 and 1.9.6, each with and without the dnstap service consuming the logs). It's pretty obvious that 1.11 spends much more CPU sending dnstap logs. (I had to put the SVGs in a ZIP, as GitHub doesn't allow uploading SVGs for some reason.) |
Hi, Alexander: I'm the original author of It looks like it's doing a lot of Lines 167 to 226 in 753487f
My guess is the problem is here: Lines 201 to 203 in 753487f
If I'm understanding the code correctly, it looks like individual dnstap log payloads are being queued, and when the queue size changes from 0 → 1, a wakeup message is sent, which causes a I ran into this problem when developing If my understanding is correct, my suggestion to the Unbound developers would be to implement a similar "low watermark" threshold for sending a wakeup to drain the queue, combined with a timeout flush. That should significantly reduce the number of syscalls caused by dnstap logging when the server is under load. |
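To make the suggestion above concrete, here is a rough back-of-the-envelope model, not Unbound's actual code. It assumes a consumer fast enough to drain the whole queue on every wakeup, and that each wakeup costs one syscall on the producer side. The function name `count_wakeups` and the parameter `wake_threshold` are hypothetical.

```python
def count_wakeups(num_messages: int, wake_threshold: int) -> int:
    """Count wakeups sent while queuing num_messages log payloads.

    Assumption: the consumer drains the queue completely each time it is
    woken, so the queue is empty again before the next payload arrives.
    """
    queued = 0
    wakeups = 0
    for _ in range(num_messages):
        queued += 1
        if queued >= wake_threshold:
            wakeups += 1  # would be one wakeup syscall in a real server
            queued = 0    # fast consumer drains everything immediately
    if queued:
        wakeups += 1      # leftover payloads are flushed by the timeout
    return wakeups


# Waking on every 0 -> 1 transition is the same as a threshold of 1.
print(count_wakeups(10_000, 1))   # 10000
print(count_wakeups(10_000, 32))  # 313
```

Under these assumptions, waking on every 0 → 1 transition costs one wakeup per message, while a low watermark of 32 cuts that to roughly 1/32 of the messages plus one timeout flush. It would also be consistent with the report that a faster consumer made things worse: the faster the queue is emptied, the more often it goes back to zero.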
Your analysis is excellent, thank you, and thanks for the CPU flame graphs (that wakeup routine sure looks expensive). I want to wait for the event handling fix from the previous issue (#304) before tackling this. |
Thank you Robert, excellent analysis! |
The commit should fix the issue. It is hard to reproduce the performance numbers, but it implements the threshold wakeup solution. That means the threads wake up when there are 32 messages, or it is almost full (90% of 1Mb), or after 1 second has passed. |
And a fix for that commit, so that it wakes up only once when it reaches the threshold instead of waking up again for every message after that. |
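The wakeup rule described in the two comments above can be sketched as follows. This is an illustration only, not the code from the commit: the class `WakeupPolicy`, its method names, and the constants are all hypothetical, and "1Mb" is assumed here to mean a 1 MiB buffer limit.

```python
from dataclasses import dataclass

MSG_THRESHOLD = 32        # wake once this many messages are queued
BUF_LIMIT = 1024 * 1024   # assumed meaning of "1Mb" in the comment
FULL_FRACTION = 0.9       # "almost full" means 90% of the limit
FLUSH_INTERVAL = 1.0      # seconds before a timeout flush


@dataclass
class WakeupPolicy:
    queued_msgs: int = 0
    queued_bytes: int = 0
    oldest: float = 0.0         # time the oldest queued message arrived
    wake_pending: bool = False  # a wakeup was sent and not yet handled

    def on_enqueue(self, size: int, now: float) -> bool:
        """Record one queued message; return True if a wakeup should be sent."""
        if self.queued_msgs == 0:
            self.oldest = now
        self.queued_msgs += 1
        self.queued_bytes += size
        return self._maybe_wake(now)

    def on_timer(self, now: float) -> bool:
        """Periodic check that implements the timeout flush."""
        return self._maybe_wake(now)

    def on_drained(self) -> None:
        """Called after the I/O thread has emptied the queue."""
        self.queued_msgs = 0
        self.queued_bytes = 0
        self.wake_pending = False

    def _maybe_wake(self, now: float) -> bool:
        # Wake only once per batch: stay quiet while a wakeup is pending.
        if self.wake_pending or self.queued_msgs == 0:
            return False
        due = (
            self.queued_msgs >= MSG_THRESHOLD
            or self.queued_bytes >= FULL_FRACTION * BUF_LIMIT
            or now - self.oldest >= FLUSH_INTERVAL
        )
        if due:
            self.wake_pending = True
        return due
```

In this sketch a worker thread would call `on_enqueue` for each log payload and write to the wakeup channel only when it returns `True`. The `wake_pending` flag models the follow-up fix: once the threshold is crossed, further messages do not trigger additional wakeups until the queue has been drained.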
With unbound 1.11 being the first release that switched away from libfstrm (#164, #264), we are observing a number of regressions with dnstap logging over a unix socket (the first one was reported as #304). This is quite a big one.
In our production environment we've noticed a direct correlation between dnstap logs being consumed and unbound performance with 1.11.
Strangely enough, the quicker the dnstap stream was consumed by the dnstap service, the more CPU unbound used.
We first noticed it on a few machines with especially powerful hardware, where it went absolutely crazy: when we sped up the dnstap service ~1.5 times by simply skipping half of the samples, the CPU used by unbound went up 10x, from 200% to 2000%.
While investigating the issue with synthetic load on different hardware, I can observe that simply having the dnstap socket consumed by a dnstap process significantly increases the CPU used by unbound.
For example:
On the same server under the same load, unbound 1.9.6 has stable CPU usage of around 80%, with or without the dnstap service running and consuming logs (as one would expect).
I understand that dnstap uses a bidirectional protocol, and that when there is no consumer running, unbound doesn't send any dnstap samples.
But before 1.11 sending samples had no significant impact on Unbound itself, and we've been using dnstap logging with unbound for a few years now.