Load explodes after every reload/restart of Icinga 2 #5465
Do you have a specific performance analysis, including graphs from work queues, enabled features, checks, etc.? It is hard to tell what exactly could cause this without more insight. https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#analyze-your-environment
Here is a snapshot of some of our graphs: https://snapshot.raintank.io/dashboard/snapshot/XX5gtmp2yf4nnIXJ1oIpWt1FA263Um2S?orgId=2
Thanks for the graphs. The logs look fine, nothing spectacular. This change in CPU load could come from the recent work queue additions for all features, i.e. graphite. (For future reference: I modified the URL with render/ and added the screenshot here.)
Does this happen with 2.8 as well? There were certain improvements for the cluster in this release. Cheers,
In my setup, yes. Single master:
Our instance contains 512 hosts and 5368 services. 5 of them are Icinga 2 clients; the remaining hosts are checked by NRPE.
In our setup it happens, too. A year ago it was so bad that the master cluster could not catch up anymore, so we had to expand our cluster with some satellites.
How many of these services actually invoke check_nwc_health? What's the average execution time and latency for these checks?
@dnsmichi: this is a real issue; the root cause is our scheduling code in Icinga 2. Have a look at how Icinga 1.x tried to avoid such problems. It was far from perfect, and the 1.x scheduling code is a mess, but its basics were well thought out.

This issue is a combination of checks being rescheduled at restart (that's a bug, it shouldn't happen) with a rescheduling/splaying logic where every checkable (and not a central scheduler) decides on its own when to run the next check. There is a related issue in our Icinga Support issue tracker. You'll find it there together with a debug log, hints on how to filter the log, and a link pointing to the place in our code responsible for the "checks are not being executed for a long time" effect as explained by @pgress. Perhaps talk to @gunnarbeutner, he should be aware of this issue; we discussed it quite some time ago.

The "reschedule everything when starting" issue should be easy to fix. Most of the heavy spikes shown by @uffsalot would then no longer appear. As you can see, he has an environment with not too many monitored objects and not the greatest and latest, but absolutely sufficient, hardware. Especially given that most of his checks are NRPE-based, he should never experience what his graphs are showing. Cheers,

NB: Sooner or later we should consider implementing a scheduler logic that takes the number of active checks, their intervals and their average execution times into account. It should try to fairly distribute new checks while respecting and preserving the current schedule for existing ones.
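To illustrate the splaying idea from the NB above: instead of rescheduling every checkable at restart, a scheduler can derive a stable per-checkable offset inside the check interval, so a reload never bunches all checks onto the same instant. This is a minimal sketch in Python; the hash-based splay, the function name and its signature are hypothetical illustrations, not Icinga 2's actual scheduling code.

```python
import hashlib

def initial_next_check(now, check_interval, checkable_name):
    """Spread initial check times across one interval by hashing the
    checkable's name, so a restart does not bunch all checks together.
    (Hypothetical sketch, not Icinga 2 code.)"""
    # Stable per-checkable offset in [0, check_interval)
    digest = hashlib.sha256(checkable_name.encode()).digest()
    offset = int.from_bytes(digest[:8], "big") % check_interval
    return now + offset

# A reload at t=0 with a 300 s interval spreads 1000 checks over [0, 300)
times = [initial_next_check(0, 300, f"host{i}!ping") for i in range(1000)]
```

Because the offset is derived from the checkable's name rather than from the wall clock, the schedule survives a restart unchanged, which is exactly the property the per-checkable splaying lacks.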
There are some changes related to this in current git master and the snapshot packages, which will go into 2.9. This is scheduled for June.
@dnsmichi oh to reduce load? Which changes? :) |
To influence check scheduling upon reload. Snapshot packages should be available for testing already. |
I have the same problem in a customer's setup. I'll try to get some tests / feedback from them as well.
I'm guessing 1a9c159 is the fix.
1a9c159 is not about this issue, although it might redistribute the load of early checks.
Here we have the problem that we do not know much about the checks and can't make many guesses based on that information. A high execution time does not mean high load; a check with a long check interval might have been given that interval because it is a heavy check; and the number of active checks tells us nothing concrete either. The only thing that comes to my mind is randomizing the execution of checks with similar check intervals better.
I'd suggest testing the snapshot packages and reporting back whether the problem persists.
Anycast ping: is anyone still experiencing this issue with a recent Icinga 2 version?
Might also be related to a setup where the master actively connects to all clients, see #6517.
We have experienced less load since using 2.9. We made sure only one check is run at a time (this is with nagios-nrpe-server / check_nrpe); we don't use the Icinga 2 client.
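As an aside, the number of checks Icinga 2 runs in parallel can also be capped in the core itself via the MaxConcurrentChecks constant, whose default is set in constants.conf. A minimal sketch, with the value chosen purely for illustration:

```
/* /etc/icinga2/constants.conf -- cap parallel check execution.
 * The value below is illustrative; tune it to your hardware. */
const MaxConcurrentChecks = 128
```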
Any chance you'll try the snapshot packages?
Unfortunately not right at the moment, as I don't have an identical dev cluster right now. But if I can help with any further info (total checks, plugins, etc.), I would love to do it :)
I believe that the load is caused by the reconnect timer, or by many incoming connections with many separate threads being spawned. A full analysis is available in #6517.
Ok, then @Thomas-Gelf was right about the scheduler. I was just guessing from the recent changes, and it is good to know that the possible areas are narrowed down with a recent version, thanks.
Anything we can do to help resolve this issue for the next release?
I don't have this issue anymore with the latest Icinga 2 version.
Hi all, still have this issue. I have two satellites with 8 CPUs and 16 GB each; in the zone I have ~250 hosts and ~4000 services.
I found a lot of big log files in the api/log folder; after cleaning them up, the satellites cooled down.
We were able to fix this issue with the latest version and by changing some default config values.
Which ones?
By default we had disabled the replay logs only on the command endpoints, but now we have also disabled the replay logs on the monitoring server for all command endpoints.
Colleagues, don't we recommend doing exactly that?
Yes, we do. It's disabled in all our example configs using command endpoint agents in our distributed monitoring docs, and we even have a dedicated section: https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#disable-log-duration-for-command-endpoints
@pgress Please try this.
Anyone else who also got this fixed via #5465 (comment)?
@Al2Klimov did you add log_duration = 0 to the agents or to the satellites?
You should set that on all Endpoint objects that represent a connection from or to an agent; see also https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#disable-log-duration-for-command-endpoints
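Following the documentation linked above, a minimal sketch of what that looks like in zones.conf on the parent node (the endpoint and host names are placeholders):

```
// One of these per agent endpoint, on both sides of the connection:
object Endpoint "agent1.example.com" {
  host = "agent1.example.com"
  log_duration = 0  // don't keep a replay log for this command endpoint
}
```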
I've applied this solution; it seems it did not help in my case, but that was maybe a year ago. I had 2 HA masters and 2x4 HA satellites. I added this parameter for the satellites plus one big zone and it began to work well, but after a few redeployments of the master config (Ansible removes /etc/icinga2 and creates it anew from the repository), the problem was back for me.
Hey, I don't work for the company anymore, so I can't reproduce this issue anymore. I will therefore unassign this issue.
Let me try.
Hi all,
We have a 2-master cluster which gets very high load when the core is reloaded.
The two nodes have 8 vCPUs and 16 GB RAM each, so the power shouldn't be a problem at all.
We now have about 700 hosts with about 7000 services.
We have already debugged this problem a little and found out that no check is done until 5 to 6 minutes have passed. After this duration all checks start together, which results in the high load. When we use a single node instead of a cluster, we don't have the problem: the checks start immediately after the reload.
icinga2 --version: r2.7.0-1
icinga2 feature list: api checker command graphite ido-mysql mainlog notification

Modules:
cube       | 1.0.0
graphite   | 0.0.0.5
monitoring | 2.4.1

icinga2 daemon -C: no problems