[dev.icinga.com #11196] High load when pinning command endpoint on HA cluster #3954
This issue has been migrated from Redmine: https://dev.icinga.com/issues/11196
Created by peckel on 2016-02-22 09:53:43 +00:00
I wrote a note about this a couple of weeks ago (Dec 7th) in the mailing list, but at that time I could not exactly pin down the problem. After some additional testing, I found what is causing the issue.
The basic setup is that I need to build a new monitoring environment for a customer who has several security zones in which monitoring must be kept local. So the general idea is to set an HA cluster with Icinga2, Graphite, Grafana and the IDO DB in the central zone, and one satellite cluster within each of the security zones that does the data collection, which works perfectly in general.
The problem is that the checks on the cluster nodes themselves can't be pinned to the individual nodes. This is perfectly OK for HTTP checks, pingers and the standard SSH check, but the trouble starts with disk, memory, CPU and the other local stuff: when the checks run on a cluster, it can't be determined which node the alarms or perfdata belong to. This is not really an option. I've also reviewed feature requests 10977, 10679 and 10040 on this issue. As the monitoring infrastructure is very critical to the customer, we need to monitor the local resources on the cluster nodes as well.
My naive approach was to pin the command endpoint to the local node for the local resource checks on the cluster nodes. But as soon as I do that, the icinga2 process on the cluster where this is configured shows a substantial increase in memory and CPU load. From the Graphite logs I can see that the satellite cluster with the pinned command endpoints starts sending the same perfdata over and over again, both for the local checks and for the checks on all connected Icinga 2 clients.
The only change in the configuration is the pinned command endpoint in the service definitions on the satellite cluster:
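The original snippet did not survive the migration; a minimal sketch of the kind of pinning described might look like the following (the check command, host variables and the convention that endpoint names match host names are assumptions, not taken from the report):

```
// Hypothetical sketch: pin local-resource checks to the node they describe,
// assuming each cluster node's Endpoint name equals its Host name.
apply Service "disk" {
  check_command = "disk"
  // Execute on the host's own endpoint instead of letting the HA zone
  // distribute the check between the two satellites:
  command_endpoint = host.name
  assign where host.vars.cluster_node == true
}
```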
The graphs for Graphite updateOperations vs. metricsReceived show nicely that the number of metrics coming in is increasing dramatically when the above setup is activated, but there are no real updates as the datapoints are essentially all the same.
As soon as command endpoint pinning is disabled, the load immediately drops back to normal (compare the log entries to the Graphite stats, they correlate nicely). This can now be reproduced at will, so if you need any data to help fix this, I can provide it.
Updated by mfriedrich on 2016-03-18 11:22:27 +00:00
Pinning a check inside an HA zone to a specific endpoint is currently neither supported nor implemented. You may run into undefined behaviour, with one node being responsible for scheduling the check, another one acting as the command_endpoint and executing it, and the check result then being synced back to all involved nodes.
There is a feature request to allow such behaviour, but for now I'd suggest to re-think your zone design. If you want to run specific checks on defined nodes, assign them their own zone, and make that the third level below the satellite zone.
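The suggested zone design could be sketched in zones.conf terms roughly as follows (all names are illustrative placeholders, not from this thread): each satellite gets its own single-endpoint zone below the shared satellite zone, so local checks can be assigned to exactly one node.

```
// Illustrative sketch of the "third level" design; names are placeholders.
object Endpoint "satellite-a" { }
object Endpoint "satellite-b" { }

// Shared HA zone for the remote checks in the security zone:
object Zone "satellite" {
  endpoints = [ "satellite-a", "satellite-b" ]
  parent = "master"
}

// Dedicated per-node zones below the satellite zone, so that
// local-resource checks never fail over to the other node:
object Zone "satellite-a-local" {
  endpoints = [ "satellite-a" ]
  parent = "satellite"
}
object Zone "satellite-b-local" {
  endpoints = [ "satellite-b" ]
  parent = "satellite"
}
```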
Updated by peckel on 2016-03-18 11:32:46 +00:00
Thanks for your input. However, there is a certain catch-22:
I need to have a redundant pair of satellites to execute remote checks in the specific zone. No real problem here.
To ensure the availability of the monitoring setup, I also need to monitor those satellites themselves for the usual system resources, e.g. disk, memory, CPU. All works fine (at least that's what it looks like) when both satellites are up. But when one of them goes down, the other one not only takes over the remote checks (which is desired) but also the checks for local resources (which most definitely isn't, as it then starts monitoring its own disks and recording perfdata for them instead of the failed node's).
Putting the satellite node in its own zone isn't an option as it is already a member of the satellite cluster zone, and it can't be in two zones at the same time.
I have no idea how to catch that issue with a modified zone design. Or am I missing something here?
Thanks and best regards,
Updated by gbeutner on 2016-07-25 07:45:04 +00:00
Can you please test whether this problem still occurs with the current snapshot packages? As far as I can see this should have been fixed as part of #12179.
Updated by peckel on 2016-07-26 17:49:30 +00:00
Thanks for the update.
I've just upgraded my test environment to today's snapshot, and the problem seems to have disappeared. Pinning the endpoint for certain services (e.g. local disks, processes etc.) to a particular cluster node instead of having them fail over now works as well.
Great news, thanks!