[dev.icinga.com #8141] Optimized Freshness Checking #1537

Closed
icinga-migration opened this Issue Dec 24, 2014 · 3 comments

@icinga-migration
Member

icinga-migration commented Dec 24, 2014

This issue has been migrated from Redmine: https://dev.icinga.com/issues/8141

Created by jrhunt on 2014-12-24 21:54:06 +00:00

Assignee: (none)
Status: New
Target Version: Backlog
Last Update: 2015-05-18 12:18:15 +00:00 (in Redmine)


I have attached a patch that optimizes host and service freshness checking.

My team and I run Icinga 1.x in a very large environment (~60k service checks on our largest node, 95% of which are passive submissions arriving every 5 minutes). Using a custom event broker to keep up with the load of inbound check results, we have seen very few scaling problems. However, when a large swath of the infrastructure fails, such as when a hypervisor dies, we see a large influx of freshness checks, which causes the Icinga process to thrash as it tries (in vain) to schedule tens of thousands of check_dummy active checks just to report that the services are stale.

To remedy this, we patched Icinga to recognize two new host and service attributes:

  • freshness_status - Numeric status code to use when the check is determined to be stale, with the usual meanings (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN)
  • freshness_message - A description explaining the nature of the freshness failure.

For example, for passive services that are normally fed by our monitoring agent software, we define these two attributes as follows:

    check_freshness 1
    freshness_threshold 900
    freshness_status 1
    freshness_message "No result from monitoring agent in over 15 minutes"

We then modified the Icinga check_host_result_freshness() and check_service_result_freshness() functions to bypass the normal schedule-an-active-check-run behavior when these attributes are present, and instead synthesize a check result and inject it into the check results list.
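As a rough illustration of that bypass (this is not the actual patch code; the struct fields and helper name below are hypothetical, and the real functions operate on Icinga's internal host/service objects and check-result queue):

```c
#include <stdio.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified model of an Icinga 1.x service object. */
typedef struct {
    int    check_freshness;        /* freshness checking enabled? */
    int    freshness_threshold;    /* seconds before a result goes stale */
    time_t last_check;             /* time of last received result */
    int    freshness_status;       /* new attribute: status to synthesize (-1 = unset) */
    const char *freshness_message; /* new attribute: message to report (NULL = unset) */
} service_t;

typedef struct {
    int  status;                   /* 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN */
    char output[128];
    int  synthesized;              /* 1 if injected without running an active check */
} check_result_t;

/* Returns 1 if a stale result was synthesized (to be injected into the
 * check-result list), 0 if the traditional schedule-an-active-check path
 * should run instead. */
int handle_stale_service(const service_t *svc, time_t now, check_result_t *out)
{
    if (!svc->check_freshness)
        return 0;
    if (now - svc->last_check <= svc->freshness_threshold)
        return 0;  /* still fresh */

    /* New behavior: if the admin configured a synthetic status, skip the
     * expensive check_dummy scheduling and inject a result directly. */
    if (svc->freshness_status >= 0 && svc->freshness_message != NULL) {
        out->status = svc->freshness_status;
        snprintf(out->output, sizeof(out->output), "%s", svc->freshness_message);
        out->synthesized = 1;
        return 1;
    }
    return 0;  /* fall back: schedule an active check as before */
}
```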

Upon exercising this code in our testbed environment, we noticed that recovery from large-scale outages (~20k stale checks at a time) would cause Icinga to thrash: first marking all the stale services as stale, then processing all of the inbound results from our event broker, then marking everything as stale again, and so on. We found it prudent to reap all of the check results on every 1000th stale check, to keep this particular undesired behavior from occurring.
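The reap cadence boils down to a trivial predicate (illustrative only; the actual patch hooks into Icinga's existing check-result reaper):

```c
/* Illustrative only: while sweeping stale services, decide whether it is
 * time to drain the pending check-result queue. Reaping every 1000th
 * synthesized stale result lets inbound passive results from the event
 * broker be processed between sweeps, avoiding the mark-stale / catch-up /
 * mark-stale-again thrash described above. */
#define STALE_REAP_INTERVAL 1000L

int should_reap_now(long stale_results_synthesized)
{
    return stale_results_synthesized > 0
        && stale_results_synthesized % STALE_REAP_INTERVAL == 0;
}
```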

Note that for configurations that don't specify these attributes, the current schedule-an-active-check-run behavior persists, to preserve backwards compatibility.

Also note that earlier versions of this patch (as discussed in #icinga-devel and between dnsmichi and myself, iamjameshunt, on twitter) made problematic changes to the add_host() and add_service() functions. This version of the patch does not suffer from this problem, at the expense of not allowing event brokers to pass the freshness_status / freshness_message attributes via those functions.

@icinga-migration


icinga-migration commented Jan 24, 2015

Updated by mfriedrich on 2015-01-24 12:59:37 +00:00

Sorry for the delayed answer, January does not seem to be a good month.

While I like the initial idea, I'm holding off on adding new configuration and state options to Icinga 1.x. As you already mentioned, the event broker modules won't receive these attributes and their values, but they probably should, for example to write them to the idoutils DB and represent them in Icinga Web, and so on. Classic UI has a similar issue. Livestatus is out of scope, as it is not maintained by Icinga for 1.x.

We still have engineering to do on Icinga 2 around passive checks and freshness (#7071), combined with API ideas for feeding passive checks into the core. I will take your ideas and patch design into account once we get there.

Curious what others think though :)

@icinga-migration


icinga-migration commented Apr 3, 2015

Updated by mfriedrich on 2015-04-03 16:30:09 +00:00

  • Relates set to 7071
@icinga-migration


icinga-migration commented May 18, 2015

Updated by berk on 2015-05-18 12:18:15 +00:00

  • Target Version set to Backlog

@icinga-migration icinga-migration added this to the Backlog milestone Jan 17, 2017

@dnsmichi dnsmichi removed this from the Backlog milestone Dec 19, 2017

@dnsmichi dnsmichi added the wontfix label Dec 19, 2017

@dnsmichi dnsmichi closed this Dec 19, 2017
