dnsdist: 10x increased CPU usage with 1.7.1 even when idle #11576

Closed
hhoffstaette opened this issue Apr 25, 2022 · 18 comments

@hhoffstaette
Contributor

hhoffstaette commented Apr 25, 2022

  • Program: dnsdist
  • Issue type: Bug report

Short description

After updating from 1.7.0 to 1.7.1 I noticed a ~10x increase in user CPU time, even when no queries were being made (or only occasional background traffic): from ~0.2-0.3ms to ~5ms on average.

Environment

  • Operating system: Gentoo, kernel 5.15.35
  • Software version: dnsdist 1.7.1
  • Software source: self-compiled (I maintain my own package)

Steps to reproduce

  1. observe 1.7.0 working fine, with very low CPU usage
  2. update package to 1.7.1 with precisely the same CFLAGS etc.
  3. observe higher user CPU metrics e.g. via Grafana

Expected behaviour

Same user CPU usage as 1.7.0

[screenshot: Grafana user CPU graph]

Perf analysis:
[screenshot: perf profile]

Looks like "something" changed. Any ideas welcome.

@hhoffstaette
Contributor Author

hhoffstaette commented Apr 25, 2022

Btw, all my upstreams are plain UDP - no DoT, DoH, etc.

@rgacogne
Member

That's terrible and I would really like to know what's going on quickly, but unfortunately I see no changes at all in that area between 1.7.0 and 1.7.1 :-/ I also cannot reproduce this on Arch.
Would you mind sharing your configuration? And, any chance you could temporarily downgrade to 1.7.0 and confirm that the CPU usage drops again?

@hhoffstaette
Contributor Author

hhoffstaette commented Apr 25, 2022

Downgraded to 1.7.0 (i.e. renamed the package script back to 1.7.0) and rebuilt with the same compiler (gcc 11.3) and the same flags (-O2 plus a few selected optimizations from -O3) -- CPU went back to <0.5ms user time right away, even while serving queries. Running fine again.

My config is as pedestrian as it can be for a home user:

  • packet cache 4096 entries
  • cacheCleaningDelay: 300s
  • a bunch of Lua extensions to add/block hosts, increase TTL etc. via Spoof/RCode/LuaResponseAction
  • 8 upstreams to the outer planets (Cloudflare, Google, NextDNS)

Has been running 24/7 without hiccup since forever.
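
For readers who want the shape of it, here is a rough, hypothetical sketch of such a setup in dnsdist's Lua configuration (addresses and rule details are placeholders, not the actual file; only the cache size, cleaning delay and upstream count are taken from the description above):

```lua
-- Rough sketch only; values mirror the description above, the rest is illustrative.
pc = newPacketCache(4096, {maxTTL=86400})   -- ~4096-entry packet cache
getPool(""):setCache(pc)
setCacheCleaningDelay(300)                  -- clean the cache every 300s

-- example of a Lua-driven rule; the real setup uses Spoof/RCode/LuaResponseAction
addAction("ads.example.com.", RCodeAction(DNSRCode.NXDOMAIN))

-- a couple of the ~8 UDP upstreams (Cloudflare, Google, NextDNS)
newServer({address="1.1.1.1:53"})
newServer({address="8.8.8.8:53"})
```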

The new perf profile looks as expected: the highest load contribution comes from YAHttp serving metrics every minute; otherwise the kernel is just casually slinging a few packets here and there.
[screenshot: perf profile on 1.7.0]

@hhoffstaette
Contributor Author

hhoffstaette commented Apr 25, 2022

I guess the biggest problem is that everything is inlined into processQuery(), so I can't really see which function is actually contributing. I will try a non-optimized debug build, and one without my Lua extensions, but that will take a bit.

hhoffstaette added a commit to hhoffstaette/portage that referenced this issue Apr 25, 2022
@hhoffstaette
Contributor Author

hhoffstaette commented Apr 25, 2022

Rebuilt with -O1 and without stripping:
[screenshot: perf profile, -O1 build]

I use checkInterval=600 in my newServer definitions, but neither removing nor changing the value makes a difference. Looking forward to hearing theories. 🧐
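
(For context, that setting lives on the backend definition in the Lua config; a minimal, hypothetical example with a placeholder address:)

```lua
-- illustrative only: one backend with the 600s health-check interval mentioned above
newServer({address="1.1.1.1:53", checkInterval=600})
```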

@hhoffstaette
Contributor Author

I reverted 671f2b1 and everything is back to normal, to the point where almost the only activity is the kernel playing with itself:
[screenshot: perf profile with 671f2b1 reverted]

So it's not the compiler, build flags, dependencies, my config, my Lua extensions, my custom CPU scheduler, or anything else. :)

@Habbie
Member

Habbie commented Apr 25, 2022

Excellent debugging!

@rgacogne
Member

Thanks for tracking that issue, much appreciated!

checkInterval=600

OK, that actually makes sense now. Before 1.7.1 we only scanned the vector holding the UDP states for a given backend, looking for timeouts, when we were running the health check for that backend. That was not really expected, since we should be looking for timeouts roughly once per second, so I made that scanning independent of the check interval.
Of course that means that in your case we now do that 600 times more often than we used to in 1.7.0.
I'm not sure what the right fix would be in that case. In 1.8.0 we have a new mode that is better suited to that kind of workload, via setRandomizedIdsOverUDP, where we only scan the in-flight queries.
In 1.7.1, I guess the more sensible fix would be to lower the number of UDP states via setMaxUDPOutstanding, since at the level of QPS you are dealing with you likely need fewer than 100 UDP states.
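
A hedged sketch of that suggestion in the Lua config (values are illustrative; as far as I know setMaxUDPOutstanding only applies to backends declared after it, so it belongs near the top of the configuration):

```lua
-- 1.7.x mitigation sketch (illustrative values): shrink the per-backend UDP
-- state table so the once-per-second timeout scan walks far fewer buckets.
-- Assumed to go before the newServer() declarations it should apply to.
setMaxUDPOutstanding(128)
newServer({address="1.1.1.1:53", checkInterval=600})

-- From 1.8.0 on, the alternative mode mentioned above:
-- setRandomizedIdsOverUDP(true)
```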

@hhoffstaette
Contributor Author

I'm not sure I'm following. As I said before, I also tried running without the explicit checkInterval in my newServer definitions, and it still used much more CPU. I don't want to send root-server requests to upstream servers once every second (that's just rude and unnecessary - admittedly not in the case where your authoritative backend is local), and I don't want to use the new randomized-IDs mode either (I read the 1.8 docs). The old checkInterval worked exactly as one would expect and makes sense. I understand that checking for timeouts has nothing to do with running the health check for a backend and should be decoupled (I always wondered why lost outgoing packets lingered in the console for so long before being marked as "drops"), but surely there has to be a way to detect timeouts/lost packets in a more CPU-friendly way?

@hhoffstaette
Contributor Author

OK, I think I understand the problem better now. I was curious where the 1s interval you mentioned was defined and found it at the beginning of healthChecksThread() - it turns out this is just the thread's sleep interval, and in 1.7.0 it would go right back to sleep until the health-check interval was reached. I'll experiment with setMaxUDPOutstanding, but this still feels dodgy since it doesn't really solve the problem of conflating timeouts with liveness checks.

@rgacogne
Member

I'll experiment with setMaxUDPOutstanding, but this still feels dodgy since it doesn't really solve the problem of conflating timeouts with liveness checks.

Well, unless I'm mistaken, that's exactly what 1.7.1 is doing differently compared to 1.7.0: the timeout checks are no longer tied to the health-check interval.
I believe checking for timeouts once per second makes sense, otherwise we would detect these events long after they actually occurred, which is a bit pointless. The issue in your case is that checking for timeouts by scanning a large number of buckets while you usually have very few in-flight queries is a waste of resources. The new setRandomizedIdsOverUDP option in 1.8 will prevent that by dynamically allocating the states for that kind of setup, meaning we only need to scan the ones that are actually in use. In the meantime, reducing the number of buckets will make the regular scanning much less costly, and should not have any drawback since you are dealing with very few QPS.
I'm also wondering if we could try a small optimization in the general case, and only scan the buckets when the count of outstanding queries for a given backend is non-zero.

@hhoffstaette
Contributor Author

So lowering setMaxUDPOutstanding to 128 and increasing the sleep interval (instead of reverting 671f2b1) works for me; CPU is back to the previous level and everything still works fine. I can live with this for now.

@hhoffstaette
Contributor Author

I'm also wondering if we could try a small optimization in the general case, and only scan the buckets when the count of outstanding queries for a given backend is non-zero.

Had the same thought! This would be great since it's a cheap shortcut for the idle case without downside for a busy system (which I understand is the regular case).

@rgacogne
Member

Had the same thought! This would be great since it's a cheap shortcut for the idle case without downside for a busy system

#11579 should do exactly that, if by any chance you have the opportunity to test it :)

(which I understand is the regular case)

To be fair we are seeing more and more people using dnsdist in "low-qps" environments and we are making improvements in that direction in the upcoming 1.8, so your feedback is very welcome! :)

@rgacogne
Member

#11580 might help a bit as well in your case, although it might not be as significant.

@hhoffstaette
Contributor Author

With vanilla 1.7.1 and both the short-circuit and allocation patches applied, processQuery() now only shows up during periods of activity; otherwise dnsdist doesn't even show up in the first ten entries. 😄
Looking forward to 1.7.2. 🚀

@hhoffstaette
Contributor Author

hhoffstaette commented Apr 26, 2022

Btw, I noticed another, much more drastic improvement from using a smaller setMaxUDPOutstanding:
[screenshot: resource usage graph]

Edit: whoops - this is more likely from the ring buffer, which I reduced from 8192 to 1024 entries. Still a huge reduction.
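
(For reference, the ring buffer size is also a one-liner in the Lua config; a hypothetical example matching the numbers above:)

```lua
-- shrink the query/response ring buffers from 8192 to 1024 entries
setRingBuffersSize(1024)
```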

@rgacogne
Member

We just released 1.7.2, which hopefully solves this :)
