Drop ApiListener#m_RelayQueue #9855
base: master
Conversation
Is there more than the local […]? What is the CPU utilization in your tests, both for the complete system and for just the icinga2 process, with and without your curl loops running? These requests shouldn't put any additional items into the RelayQueue, so they should only make it tip over if they take away resources that could otherwise be used for processing the queue. That makes me wonder: can you reproduce the same behavior by limiting the number of CPU cores available to icinga2 to a value that's too low (i.e. just less than what it would use if it wasn't limited, so that it fully uses the limit)? Also, when you observe checks per second, does this change? I think your PR might fix the problem by applying back pressure there instead of overflowing a queue.
+1
This is the idea behind it. A DDoSed app should take the best of all available (bad) ways; in this case that's back pressure.
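To make the mechanism concrete, here is a minimal, hypothetical C++ sketch (not Icinga 2 code; all names are made up) of the difference between an unbounded work queue and a direct call:

```cpp
#include <deque>
#include <functional>
#include <mutex>

// Hypothetical sketch, not Icinga 2 code. A single worker thread (omitted
// here) drains m_Tasks; producers only append.
class UnboundedWorkQueue {
public:
    // Never blocks: each call costs a heap allocation plus a mutex
    // round-trip, and m_Tasks has no upper bound. If producers are faster
    // than the worker, the backlog and the memory usage grow indefinitely.
    void Enqueue(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Tasks.push_back(std::move(task));
    }

private:
    std::mutex m_Mutex;
    std::deque<std::function<void()>> m_Tasks; // unbounded backlog
};

// With the queue dropped, the producer runs the work itself and is
// throttled to the speed of that work: back pressure instead of an
// ever-growing backlog.
inline void RelayDirectly(const std::function<void()>& task) {
    task();
}
```

The direct call can't be outrun: a flood of producers just makes each producer slower, which is exactly the back pressure meant above.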
I assume you mean v2.14.
I don't quite understand the question, especially given the two different situations (under load / not under load) above. Please specify an exact number of cores. In theory: less parallelism = less processing per unit of time = more pile-up per unit of time with the same config.
v2.14 or this PR? Forget the question; I don't get a complete answer beyond what's shown below anyway.

W/o curl:

W/ curl:
Do I interpret this correctly that 34.8% of 56 cores were idle, i.e. plenty more CPU resources available?
As the process used 10.71 cores without curl running, I'd start with a limit of 10 cores and see if this makes the process use 1000% CPU constantly; if not, go for fewer cores. So the idea is to deliberately provide too little CPU and see if this results in the same issue, pretty much in an effort to isolate it and figure out whether those curl loops actually contribute anything more than CPU load.
Both; that would have been a test for that theory. So the picture-perfect result would have been something like:
Ah, finally!
+1
OK, now with this PR:

W/o curl:

W/ curl:
Those numbers don't help if you don't say how and when you obtained them. For instance, how long had the process been running when you got the "W/o curl" numbers, and for how long had your curl loops been running when you obtained the "W/ curl" numbers? Otherwise I don't even know which of the […]

Anyway, the config you shared has 750 000 hosts and 7 500 000 services, i.e. a total of 8 250 000 checkables. That doesn't come close to any of the numbers you've shared, so I have no idea how to interpret them. But again, I don't know how they were obtained, but given […]
I think we have figured out what the problem is: https://community.icinga.com/t/memory-leak-despite-2-14-upgrade-how-to-debug/12380 Forget the memory-leak part, but IMAO the queue is still useless.
Instead run the previously enqueued function directly. The latter doesn't really do anything more than this:
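From memory of the ApiListener code, the enqueued item is just a thin wrapper around SyncRelayMessage(); treat the following as a paraphrase, not a verbatim quote of the source:

```cpp
// Paraphrased from ApiListener::RelayMessage() (names and signature from
// memory, not the exact source): the queued work is only this forwarding
// lambda, so the queue adds an allocation and a hand-off per message
// without doing any work of its own.
m_RelayQueue.Enqueue([this, origin, secobj, message, log]() {
    SyncRelayMessage(origin, secobj, message, log);
}, PriorityNormal);
```

Dropping the queue therefore amounts to calling SyncRelayMessage(origin, secobj, message, log) directly at the former Enqueue() call site.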
But this queue isn't just useless. It's also "malicious", even leaving aside the needless malloc(3) calls and mutex locks:
Icinga just works as long as I don't start this (30 parallel loops hammering the /v1/objects/services endpoint with random filters):

```
for i in {1..30}; do (while curl -ksSo /dev/null -d '{"filter":"service.name==string('$(($RANDOM % 10))')&&host.name==string('$RANDOM')","pretty":1}' -X GET -u root:a97ccaf00cbaf91a 'https://127.0.0.1:5665/v1/objects/services'; do true; done) & done
```
Then the queue dropped by this PR just grows and grows, and the memory usage grows with it, of course.
This PR fixes the memory leak.
Maybe it slows Icinga down a bit in critical situations where I'm literally DDoSing it, but that's better than an OOM crash.
Behaviour in Icinga 2.10
Behaviour in Icinga 2.2
Actual sync writing replacing the relay queue:
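A sketch of what the synchronous path looks like with this PR, assuming the RelayMessage()/SyncRelayMessage() names from the ApiListener code above (the actual diff may differ in detail):

```cpp
// Sketch, not the verbatim diff: the caller now blocks until the message
// has actually been relayed, so API floods slow the producers down
// instead of growing an unbounded backlog.
void ApiListener::RelayMessage(const MessageOrigin::Ptr& origin,
    const ConfigObject::Ptr& secobj, const Dictionary::Ptr& message, bool log)
{
    // Before: m_RelayQueue.Enqueue([=]() { SyncRelayMessage(...); }, ...);
    SyncRelayMessage(origin, secobj, message, log);
}
```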