[dev.icinga.com #8137] Maximum concurrent service checks #2468

Closed
icinga-migration opened this issue Dec 20, 2014 · 14 comments

@icinga-migration
Member

commented Dec 20, 2014

This issue has been migrated from Redmine: https://dev.icinga.com/issues/8137

Created by JayNewman on 2014-12-20 18:11:34 +00:00

Assignee: gbeutner
Status: Resolved (closed on 2016-05-10 09:30:03 +00:00)
Target Version: 2.4.8
Last Update: 2016-05-11 07:40:46 +00:00 (in Redmine)

Backport?: Not yet backported
Include in Changelog: 1

From older examples, it appears that at one point in time it was possible to specify "max_concurrent_checks", but apparently this is no longer the case.
My environment had a problem where the check_interval was set too small on some service checks, and the satellite nodes were each running over 9,000 processes trying to keep up. This not only caused instability but also flooded the DNS server.

Can you please implement some form of limitation on concurrent checks, to act as a safety net? An error in configuration should ideally cause a backlog rather than overwhelming the infrastructure.

If you need more details, I discussed and demonstrated this with Cornelius Wachinger.

Attachments

Changesets

2016-05-10 09:26:55 +00:00 by gbeutner f6f3bd1

Implement support for limiting the number of concurrent checks

fixes #8137

2016-05-12 09:08:21 +00:00 by gbeutner f08d378

Implement support for limiting the number of concurrent checks

fixes #8137

2016-05-12 11:47:32 +00:00 by gbeutner 97a5091

Fix incorrect re-scheduling behavior for command_endpoint checks

refs #8137

2016-05-12 12:06:47 +00:00 by gbeutner 01e58b4

Fix incorrect re-scheduling behavior for command_endpoint checks

refs #8137

Relations:

@icinga-migration

Member Author

commented Jan 7, 2015

Updated by mfriedrich on 2015-01-07 14:30:58 +00:00

  • Category set to Checker

"max_concurrent_checks" is an Icinga 1.x configuration setting that has not been ported to Icinga 2. I am not entirely sure it would really solve the issue, since the skipped checks would just fill yet another queue of checks to be run later. Monitoring the instances and their health state sounds more reasonable to me, especially when resource upgrades are needed on demand.

@icinga-migration

Member Author

commented Jan 8, 2015

Updated by JayNewman on 2015-01-08 23:59:50 +00:00

The issue I am trying to convey is that it is better to end up with a large backlog of service checks, and have time to correct the configuration, than to have an insane number of concurrent checks running, which crashes the server (I measured a load level over 500 at times) and also floods the DNS servers that are queried with each service check. The latter means that we not only have a problem with the Icinga monitoring, but have also seriously impacted a production environment in which core services rely on DNS.

We do not want the risk of a monitoring tool being the cause of a production outage; it is supposed to help us avoid outages.
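
The trade-off described above can be put into rough numbers. The following is a minimal sketch of a simplified steady-state model, not a description of how Icinga 2 schedules checks: it assumes every check takes the same time and each service has at most one check in flight. The function name `steady_state` and the example figures are hypothetical.

```python
def steady_state(services, interval_s, duration_s, limit=None):
    """Return (concurrent_checks, effective_interval_s) in steady state.

    Simplified model: each service has at most one check in flight and
    every check takes duration_s seconds. `limit` is an assumed cap on
    concurrent checks (None means no cap).
    """
    # A check cannot repeat faster than it takes to run.
    wanted_interval = max(interval_s, duration_s)
    # Average number in flight = checks started per second * duration.
    wanted_concurrency = services * duration_s / wanted_interval
    if limit is None or wanted_concurrency <= limit:
        return wanted_concurrency, wanted_interval
    # With a cap, only limit / duration_s checks finish per second,
    # so one full round over all services takes this long.
    return float(limit), services * duration_s / limit


if __name__ == "__main__":
    # Hypothetical figures: 9000 services, 60 s interval, 5 s per check.
    print(steady_state(9000, 60, 5))             # no cap
    print(steady_state(9000, 60, 5, limit=500))  # capped at 500
```

In this model a cap does not drop checks. It stretches the effective interval, so results arrive late while the process count stays bounded.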

@icinga-migration

Member Author

commented Feb 25, 2016

Updated by mfriedrich on 2016-02-25 00:29:48 +00:00

  • Target Version set to Backlog

@icinga-migration

Member Author

commented Mar 4, 2016

Updated by mfriedrich on 2016-03-04 15:50:06 +00:00

  • Parent Id set to 11310

@icinga-migration

Member Author

commented Mar 30, 2016

Updated by ricardo on 2016-03-30 14:46:17 +00:00

  • File added icinga.png

Hi,

I have a similar issue: after a crash, the Icinga instance runs all checks at once.

It would be really handy to limit concurrent checks to 500 or 1000 by default.

See attachment.

icinga.png

@icinga-migration

Member Author

commented Mar 31, 2016

Updated by mfriedrich on 2016-03-31 10:35:03 +00:00

  • Priority changed from Normal to High

@icinga-migration

Member Author

commented Apr 1, 2016

Updated by kowalskimn on 2016-04-01 07:56:36 +00:00

Same problem here. I cut my instance down from 6 GB of RAM to 4 GB, and Icinga started crashing because some checks are more RAM-hungry than others and too many of them were running at the same time. I had to scale it back up to 6 GB to stop the crashes.

The OOM killer ended up killing icinga2 itself, since it was still using more RAM than any single check process.

If I add more hosts to the setup, this will not scale unless I spread the checks across a few separate machines, which is pretty much the same solution as adding more RAM to one.

Setting a maximum number of concurrent checks sounds like a good idea. More sophisticated solutions, such as making icinga2 adjust the number of checks based on RAM/CPU utilization, sound overly complex.

@icinga-migration

Member Author

commented Apr 18, 2016

Updated by mfriedrich on 2016-04-18 08:36:37 +00:00

  • Target Version changed from Backlog to 2.5.0

@icinga-migration

Member Author

commented May 10, 2016

Updated by gbeutner on 2016-05-10 09:26:11 +00:00

  • Status changed from New to Assigned
  • Assigned to set to gbeutner
  • Target Version changed from 2.5.0 to 2.4.8

@icinga-migration

Member Author

commented May 10, 2016

Updated by gbeutner on 2016-05-10 09:30:04 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

Applied in changeset f6f3bd1.

@icinga-migration

Member Author

commented May 10, 2016

Updated by mfriedrich on 2016-05-10 10:25:26 +00:00

  • Backport? changed from __ to Not yet backported
  • Include in Changelog changed from __ to 1

Tests

mbmif /usr/local/icinga2/etc/icinga2/tests (master) # cat ../features-available/checker.conf
/**
 * The checker component takes care of executing service checks.
 */

library "checker"

object CheckerComponent "checker" {
  concurrent_checks = 100
}

mbmif /usr/local/icinga2/etc/icinga2/tests (master) # cat 8137.conf
object CheckCommand "sleep" {
  import "plugin-check-command"

  command = "/bin/sleep 30"
}

for (i in range(300)) {
  object Service "s" + i {
    host_name = NodeName
    check_command = "sleep"
    check_interval = 10s
  }
}
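
To illustrate what the test configuration above is meant to exercise, here is a scaled-down Python sketch that uses a bounded worker pool as a stand-in for a concurrency cap. It is an assumption-laden illustration, not how Icinga 2 implements `concurrent_checks`: the helper `run_capped` is hypothetical, and a short `time.sleep` replaces the `/bin/sleep 30` check command.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_capped(num_checks, limit, duration_s):
    """Run num_checks fake checks with at most `limit` at once; return the peak."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def fake_check():
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(duration_s)  # stands in for the "/bin/sleep 30" plugin
        with lock:
            running -= 1

    # The pool's worker count acts as the cap; extra checks wait in its queue.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        for _ in range(num_checks):
            pool.submit(fake_check)
    # Leaving the "with" block waits for every queued check to finish.
    return peak


if __name__ == "__main__":
    # Same proportions as the test above: 300 checks, cap of 100.
    peak = run_capped(num_checks=300, limit=100, duration_s=0.03)
    print("peak concurrent checks:", peak)
```

The observed peak should never exceed the cap, and the remaining checks simply wait their turn, which is the backlog behavior requested in this issue.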

@icinga-migration

Member Author

commented May 11, 2016

Updated by mfriedrich on 2016-05-11 07:40:47 +00:00

  • Parent Id deleted 11310

@icinga-migration

Member Author

commented May 11, 2016

Updated by mfriedrich on 2016-05-11 12:40:49 +00:00

  • Relates set to 11714

@icinga-migration

Member Author

commented May 18, 2016

Updated by gbeutner on 2016-05-18 12:02:40 +00:00

  • Relates set to 11806