[dev.icinga.com #10963] high load and memory consumption on icinga2 agent v2.4.1 #3835
This issue has been migrated from Redmine: https://dev.icinga.com/issues/10963
Created by elabedzki on 2016-01-13 15:36:54 +00:00
We noticed a high load and memory consumption problem with some Icinga 2 agents (version 2.4.1); it isn't really clear what is going on behind the scenes.
One of our customers has a huge setup, described as follows...
Has anyone seen similar problems in their setup?
The CPU load is extremely high, along with the memory utilization: Icinga2 eats up to 70% of 2 GB RAM and generates a load of 12 on a single-core system.
At the same time we noticed in the log file on the masters that all agents are trying to reconnect, as you can see:
[2016-01-13 11:42:59 +0100] warning/ApiListener: Removing API client for endpoint 'mon-icingamaster-02.lxprod.obsfucated.customer.domain'. 106 API clients left.
Does anyone have any ideas what's going on here?
2016-01-15 09:11:52 +00:00 by jflach cb70d97
2016-01-19 14:24:17 +00:00 by (unknown) d50c8e1
2016-01-19 15:24:07 +00:00 by (unknown) b1aa6cc
2016-01-19 15:24:12 +00:00 by (unknown) e4b7111
2016-01-19 15:43:46 +00:00 by (unknown) db0c6ef
2016-01-19 16:25:28 +00:00 by (unknown) 55f0c58
2016-01-20 13:07:07 +00:00 by (unknown) e48ed33
2016-01-21 09:37:47 +00:00 by (unknown) 72c3b6d
2016-01-21 12:02:53 +00:00 by (unknown) 6d88d90
2016-01-21 15:37:52 +00:00 by (unknown) 6ca054e
2016-02-12 13:15:24 +00:00 by mfriedrich 04a4049
2016-02-16 12:08:21 +00:00 by (unknown) 9e9298f
2016-02-23 08:57:40 +00:00 by jflach e80b335
2016-02-23 08:57:49 +00:00 by (unknown) abfacd9
2016-02-23 09:46:13 +00:00 by (unknown) badeea7
2016-02-23 09:46:17 +00:00 by (unknown) b227dc7
2016-02-23 09:46:17 +00:00 by (unknown) 087ad3f
2016-02-23 09:46:17 +00:00 by (unknown) 3cfa871
2016-02-23 09:46:18 +00:00 by (unknown) 80fdccc
2016-02-23 09:46:18 +00:00 by (unknown) 7985e93
2016-02-23 09:46:18 +00:00 by (unknown) c415dd3
2016-02-23 09:46:18 +00:00 by (unknown) fc90265
2016-02-23 09:46:19 +00:00 by mfriedrich f6378c9
2016-02-23 09:46:19 +00:00 by (unknown) c998665
Updated by mfriedrich on 2016-01-14 20:00:09 +00:00
Keep in mind that 1) the leak exists in 2.4.1 stable (that git commit is from master) and 2) Base64 is only used for REST API auth, which isn't enabled on clients.
I guess there are more possible leaks; Valgrind will hopefully unveil them.
Updated by tgelf on 2016-01-14 23:13:17 +00:00
Didn't know that we cultivate more of them ;) What about lib/base/tlsutility.cpp? RandomString seems to be missing a delete [] bytes if RAND_bytes succeeds. Not sure whether this happens often enough to result in a serious leak...
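For illustration, here is a simplified sketch of the pattern being described (my own reconstruction, not the verbatim lib/base/tlsutility.cpp source): the temporary buffer is freed on the RAND_bytes error path but never on the success path, so every successful call leaks `length` bytes.

```cpp
// Simplified sketch of the suspected leak pattern, NOT the upstream source.
#include <openssl/rand.h>
#include <cstdio>
#include <stdexcept>
#include <string>

std::string RandomStringSketch(int length)
{
	unsigned char *bytes = new unsigned char[length];

	if (!RAND_bytes(bytes, length)) {
		delete [] bytes;               // freed only on the failure path
		throw std::runtime_error("RAND_bytes() failed");
	}

	char *output = new char[length * 2 + 1];
	for (int i = 0; i < length; i++)
		std::sprintf(output + 2 * i, "%02x", bytes[i]);

	std::string result = output;
	delete [] output;
	// missing here: delete [] bytes;  <-- leaked on every successful call

	return result;
}
```

The obvious fix would be a matching delete [] bytes after the hex output has been built (or replacing the raw allocations with std::vector<unsigned char>), so both paths release the buffer.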
Updated by tobiasvdk on 2016-01-15 15:56:46 +00:00
I think there is still a memory leak. Here is a diff between two
Updated by gbeutner on 2016-01-18 07:11:41 +00:00
While these leaks are definitely bugs, RandomString isn't used in any code paths that are reachable via the 'daemon' CLI command. Also, the changes to the Base64 functions weren't introduced until after 2.4.1 was released.
Updated by itbess on 2016-01-18 20:18:14 +00:00
We are running into the same problem on 2.3.11.
In pmap, over time it generates more and more of these:
Updated by tgelf on 2016-01-20 12:45:28 +00:00
Could you please give the latest snapshot (version: v2.4.1-116-g55f0c58, commit: 55f0c58) a try? We are not sure why, but that one mitigated the problems for us - no more memory leak, much less CPU load.
Updated by tobiasvdk on 2016-01-21 15:24:53 +00:00
Still leaking with r2.4.1-123-g72c3b6d:
Updated by tobiasvdk on 2016-01-22 08:53:20 +00:00
Updated by tobiasvdk on 2016-02-04 12:21:56 +00:00
@Shroud: should I run some gdb commands?
Updated by tobiasvdk on 2016-02-04 13:04:43 +00:00
Maybe it's because the database currently cannot handle the load:
I will deactivate the IDO feature and test again.
Updated by tobiasvdk on 2016-02-04 20:29:16 +00:00
But the queue has a length of 500,000, which was already reached. Where are the other results being held? I need to have a look into the code.
Updated by gbeutner on 2016-02-10 07:10:33 +00:00
tobiasvdk: Once the WorkQueue's size limit is reached, the Enqueue() method blocks - which generally means other parts of Icinga become unresponsive. I'm not really happy with this behavior, but there really are only a few options:
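For context, a minimal sketch of what such a bounded, blocking queue looks like (my own illustration with assumed names, not the actual Icinga 2 WorkQueue code): once the configured limit is reached, producers calling Enqueue() are suspended until the consumer has drained some items, which is exactly why a slow IDO backend makes the rest of the daemon appear unresponsive.

```cpp
// Minimal illustration of a bounded work queue with a blocking Enqueue();
// names and structure are assumptions, not the Icinga 2 implementation.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>

class BoundedWorkQueue {
public:
	explicit BoundedWorkQueue(std::size_t maxItems) : m_MaxItems(maxItems) { }

	// Blocks the calling thread while the queue is full, so a slow consumer
	// stalls every producer that tries to enqueue work.
	void Enqueue(std::function<void()> task)
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		m_NotFull.wait(lock, [this] { return m_Items.size() < m_MaxItems; });
		m_Items.push_back(std::move(task));
		m_NotEmpty.notify_one();
	}

	// Called by the consumer thread; blocks while the queue is empty.
	std::function<void()> Dequeue()
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		m_NotEmpty.wait(lock, [this] { return !m_Items.empty(); });
		std::function<void()> task = std::move(m_Items.front());
		m_Items.pop_front();
		m_NotFull.notify_one();
		return task;
	}

private:
	std::size_t m_MaxItems;
	std::deque<std::function<void()>> m_Items;
	std::mutex m_Mutex;
	std::condition_variable m_NotFull;
	std::condition_variable m_NotEmpty;
};
```

(In general, the alternatives to blocking - dropping new items or letting the queue grow without a limit - simply trade unresponsiveness for data loss or memory usage.)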
Updated by tobiasvdk on 2016-02-10 15:23:37 +00:00
It would also be good to allow multiple connections (#10953) ;)
Updated by vytenis on 2016-02-17 14:50:29 +00:00
We also noticed very bad behaviour with the IDO queue and had to bump it to make our setup work, as 500k was not nearly enough (SSDs + MySQL tuning alone are not sufficient for 100k+ object setups) - see #10731. While blocking on Enqueue() no longer leads to a hard freeze as it did back in 2.4.0, it will still happen eventually if the DB cannot keep up beyond the initial query load. Naturally, the queued queries take up a LOT more RAM than Icinga itself requires. :) TBH, the IDO could be a lot more efficient - there are roughly 10 queries per monitored object that have to be executed. The recent changes in git master really reduced the runtime load, though, especially if you do not care about history.