Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of active idle loops #11422

Closed
Glandos opened this issue Mar 16, 2022 · 8 comments
Closed

Get rid of active idle loops #11422

Glandos opened this issue Mar 16, 2022 · 8 comments

Comments

@Glandos
Copy link

Glandos commented Mar 16, 2022

  • Program: dnsdist
  • Issue type: Feature request

Short description

In some threads, there are active idle loops, for example in health checks :

static void healthChecksThread()

In TCP client, there is an active loop of 500 ms.

Usecase

I have deployed dnsdist successfully on low end computers, but a lot of CPU is used (wasted?) in just doing nothing every second, in the case of health checks.
I didn't take the time to inspect if something is done within TCP client loop.

Description

#7142 was implemented in a way not to change the sleep interval. This should have been done with a sleep adapted to the minimal needed sleep interval.
In the same way, the run method of the multiplexer has fixed timeout of 500ms, which should be configurable.

Globally, the request is about thinking of doing things when they are required, instead of regularly, just in case. The active idle loop is somewhat simpler, but does not scale nor adapts to low end computers.

@rgacogne
Copy link
Member

This should have been done with a sleep adapted to the minimal needed sleep interval.

The healthcheck thread has to wake up every second to update the query load and drop rate of downstream servers, and will go back to sleep right away if there is nothing else to do but that has to be done every second.

The TCP client loop handles timeouts, cleans up the downstream TCP connections cache, and check whether a dump of the current state has been requested.

the request is about thinking of doing things when they are required, instead of regularly, just in case

That sounds good. I'm afraid it's not clear to me how you suggest doing that, though. I would love to see some actual measurements, perhaps with perf, because it might very well be that we are missing some possible optimizations but without facts that's hard to track down.

@rgacogne rgacogne added this to the dnsdist-helpneeded milestone Mar 16, 2022
@Glandos
Copy link
Author

Glandos commented Mar 16, 2022

Thanks for the quick reply!

The healthcheck thread has to wake up every second to update the query load and drop rate of downstream servers, and will go back to sleep right away if there is nothing else to do but that has to be done every second.
I can't handle all the consequences of a code modifications here, but I'm just guessing that those metrics can be computed less frequently, with an adaptation to delta variable? Of course, this setting can be set to 1 second, for backward compatibility.

The TCP client loop handles timeouts, cleans up the downstream TCP connections cache, and check whether a dump of the current state has been requested.
The same thoughts can apply here. Of course, it's important to react quickly for many people, but my own servers are limited in CPU, and I tolerate that their response time is a bit longer in exchange of a lower CPU usage.

the request is about thinking of doing things when they are required, instead of regularly, just in case

That sounds good. I'm afraid it's not clear to me how you suggest doing that, though. I would love to see some actual measurements, perhaps with perf, because it might very well be that we are missing some possible optimizations but without facts that's hard to track down.

I only have some basic figures, reported by htop:

  • dnsdist started 1 week ago
  • dnsdist/healtC thread consumed 1h02 CPU time
  • carbon thread consumed 13m21 CPU time
  • each dnsdist/tcpClien and dnsdist/dohClien thread consumed 56s CPU time
  • main thread consumed 27s.
  • other threads are not significant.

So in fact, this is mainly the health check that consumed resources, but I have to check with perf (or something similar) if it's linked to the loop or to some rarely occurring operations, like a real health check.
EDIT: all my downstreams are set with checkInterval=60.

@rgacogne
Copy link
Member

dnsdist/healtC thread consumed 1h02 CPU time

That seems to be a lot indeed, depending on how many downstreams you have and whether they are Do53, DoT or DoH ones. On a fairly idle setup that has been running for 8 days I have 6 minutes 42 seconds of CPU time taken by the healtC thread for a single Do53 downstream, with the default 1s checkInterval.

@rgacogne
Copy link
Member

A small improvement is available in #11437 but I suspect a much better one can be reached by setting setRandomizedIdsOverUDP(true) 1 if you are running master.

@Glandos
Copy link
Author

Glandos commented Mar 27, 2022

No, I don't have time to switch to master, but I'll be patient and don't forget this.

@rgacogne
Copy link
Member

We just released 1.7.2 which should improve things significantly for your use-case! We will welcome any feedback!

@Glandos
Copy link
Author

Glandos commented Jun 25, 2022

Indeed, the healthCheck thread has a much much lower impact now. Thanks for the fix!
I'll let you close this issue. In my opinion, active loops should be avoided, but if it means to re-architecture the whole software, it can be a non-goal.

@rgacogne
Copy link
Member

rgacogne commented Feb 2, 2024

I'm closing this for now since most, if not all, of the active loops have been removed. Please let us know if you see something weird!

@rgacogne rgacogne closed this as completed Feb 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants