[dev.icinga.com #11196] High load when pinning command endpoint on HA cluster #3954

Closed
icinga-migration opened this issue Feb 22, 2016 · 12 comments

@icinga-migration

commented Feb 22, 2016

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11196

Created by peckel on 2016-02-22 09:53:43 +00:00

Assignee: (none)
Status: Closed (closed on 2016-08-04 16:07:08 +00:00)
Target Version: 2.5.0
Last Update: 2016-08-04 16:07:17 +00:00 (in Redmine)

Icinga Version: 2.4.1
Backport?: Not yet backported
Include in Changelog: 1

I wrote a note about this a couple of weeks ago (Dec 7th) in the mailing list, but at that time I could not exactly pin down the problem. After some additional testing, I found what is causing the issue.

The basic setup is that I need to build a new monitoring environment for a customer who has several security zones in which monitoring must be kept local. So the general idea is to set up an HA cluster with Icinga2, Graphite, Grafana and the IDO DB in the central zone, and one satellite cluster within each of the security zones that does the data collection, which works perfectly in general.

The problem is that the checks on the cluster nodes themselves can't be pinned to the individual nodes. This is perfectly OK for HTTP checks, pingers and the standard SSH check, but the trouble starts with disk, memory, CPU and the other local stuff: when the checks are running on a cluster, it can't be determined which node the alarms or perfdata belong to. This is not really an option. I've also reviewed feature requests 10977, 10679 and 10040 on this issue. As the monitoring infrastructure is very critical to the customer, we need to monitor the local resources on the cluster nodes as well.

My naive approach was to pin the command endpoint to the local node for the local resource checks on the cluster nodes. But as soon as I do that, the icinga2 process on the cluster where this is configured shows substantially increased memory and CPU load. From the Graphite logs I can see that the satellite cluster with the pinned command endpoints starts sending the same perfdata over and over again, both for the local checks and for the checks on all connected Icinga 2 clients.

The only change in the configuration is the setting for the pinned command endpoint on the satellite cluster:

object Host "icinga2-satellite1.demo.hindenburgring.com" {
    import "generic-host"
    address = "icinga2-satellite1.demo.hindenburgring.com"
    vars.os = "Linux"

    vars.command_endpoint = "icinga2-satellite1.demo.hindenburgring.com"
}

object Host "icinga2-satellite2.demo.hindenburgring.com" {
    import "generic-host"
    address = "icinga2-satellite2.demo.hindenburgring.com"
    vars.os = "Linux"

    vars.command_endpoint = "icinga2-satellite2.demo.hindenburgring.com"
}

[...]

In the service definitions I have the following configuration:

apply Service "load" {
    import "generic-service"

    check_command = "load"

    command_endpoint = host.vars.command_endpoint
    assign where host.vars.command_endpoint
}

apply Service "procs" {
    import "generic-service"

    check_command = "procs"

    command_endpoint = host.vars.command_endpoint
    assign where host.vars.command_endpoint
}

[...]

The graphs for Graphite updateOperations vs. metricsReceived show nicely that the number of metrics coming in is increasing dramatically when the above setup is activated, but there are no real updates as the datapoints are essentially all the same.

As soon as command endpoint pinning is disabled, the load immediately drops to normal again (compare the log entries to the Graphite stats, they correlate nicely). This can now be reproduced at will, so if you need any data to help fix this, I can now provide it.

@icinga-migration

commented Feb 23, 2016

Updated by peckel on 2016-02-23 18:39:32 +00:00

Update:

The issue persists in 2.4.2.

@icinga-migration

commented Feb 25, 2016

Updated by peckel on 2016-02-25 09:22:35 +00:00

Update: The issue is still showing in 2.4.3.

Additionally, it looks very similar to the one in #11041.

@icinga-migration

commented Mar 4, 2016

Updated by mfriedrich on 2016-03-04 15:54:22 +00:00

  • Parent Id set to 11313
@icinga-migration

commented Mar 18, 2016

Updated by mfriedrich on 2016-03-18 11:19:45 +00:00

  • Relates set to 11041
@icinga-migration

commented Mar 18, 2016

Updated by mfriedrich on 2016-03-18 11:22:27 +00:00

Pinning a check on a node inside an HA zone to a specific endpoint is currently neither supported nor implemented. You may run into undefined behaviour, with one node being responsible for the check and another one being the command_endpoint that executes it, after which the check result gets synced back to all involved nodes.

There is a feature request to allow such behaviour, but for now I'd suggest rethinking your zone design. If you want to run specific checks on defined nodes, assign them their own zone, and make that the third level below the satellite zone.
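
A rough sketch of what such a three-level layout could look like in zones.conf is given here for illustration only. The satellite hostnames are taken from the report; the master endpoints, the zone names "master" and "satellite", and the node "agent1.example.com" are assumptions. The third-level node is shown as a separate host because an endpoint can only be a member of one zone.

```
// Top level: central HA zone (endpoint and zone names are hypothetical).
object Endpoint "icinga2-master1.example.com" { }
object Endpoint "icinga2-master2.example.com" { }

object Zone "master" {
    endpoints = [ "icinga2-master1.example.com", "icinga2-master2.example.com" ]
}

// Second level: the HA satellite zone from the report (zone name assumed).
object Endpoint "icinga2-satellite1.demo.hindenburgring.com" { }
object Endpoint "icinga2-satellite2.demo.hindenburgring.com" { }

object Zone "satellite" {
    endpoints = [
        "icinga2-satellite1.demo.hindenburgring.com",
        "icinga2-satellite2.demo.hindenburgring.com"
    ]
    parent = "master"
}

// Third level: a single-endpoint zone for a node whose local checks
// must not fail over to another node (hypothetical node name).
object Endpoint "agent1.example.com" { }

object Zone "agent1.example.com" {
    endpoints = [ "agent1.example.com" ]
    parent = "satellite"
}
```

With only one endpoint in the third-level zone, there is no second node that could take over its checks, so local resource checks stay on that node.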

@icinga-migration

commented Mar 18, 2016

Updated by peckel on 2016-03-18 11:32:46 +00:00

Hi Michael,

thanks for your input. However, there is a certain catch-22:

I need to have a redundant pair of satellites to execute remote checks in the specific zone. No real problem here.

To ensure the availability of the monitoring setup, I also need to monitor those satellites themselves for the usual system resources, e.g. disk, memory, CPU - the standard stuff. All works fine (at least that's what it looks like) when both satellites are up, but when one of them goes down the other one not only takes over the remote checks (which is desired) but also the checks for local resources (which most definitely isn't, as it then starts monitoring its own disks and recording perfdata for them instead of the failed node).

Putting the satellite node in its own zone isn't an option as it is already a member of the satellite cluster zone, and it can't be in two zones at the same time.

I have no idea how to catch that issue with a modified zone design. Or am I missing something here?

Thanks and best regards,

Peter.

@icinga-migration

commented Jul 25, 2016

Updated by gbeutner on 2016-07-25 07:45:04 +00:00

  • Status changed from New to Assigned
  • Assigned to set to peckel

Can you please test whether this problem still occurs with the current snapshot packages? As far as I can see this should have been fixed as part of #12179.

@icinga-migration

commented Jul 25, 2016

Updated by gbeutner on 2016-07-25 07:45:12 +00:00

  • Status changed from Assigned to Feedback
@icinga-migration

commented Jul 25, 2016

Updated by gbeutner on 2016-07-25 07:45:29 +00:00

  • Duplicates set to 12179
@icinga-migration

commented Jul 26, 2016

Updated by peckel on 2016-07-26 17:49:30 +00:00

Hi Gunnar,

thanks for the update.

I've just upgraded my test environment to today's snapshot, and it seems the problem has disappeared. Pinning the endpoint for certain services (e.g. local disks, processes etc.) to a particular cluster node instead of having them fail over is working now as well.

Great news, thanks!

Best regards,

Peter.

@icinga-migration

commented Aug 4, 2016

Updated by mfriedrich on 2016-08-04 16:07:09 +00:00

  • Status changed from Feedback to Closed
  • Assigned to deleted peckel
  • Target Version set to 2.5.0
  • Done % changed from 0 to 100

Thanks for the kind feedback.

@icinga-migration

commented Aug 4, 2016

Updated by mfriedrich on 2016-08-04 16:07:18 +00:00

  • Parent Id deleted 11313