[dev.icinga.com #8141] Optimized Freshness Checking #1537
This issue has been migrated from Redmine: https://dev.icinga.com/issues/8141
Created by jrhunt on 2014-12-24 21:54:06 +00:00
I have attached a patch that optimizes host and service freshness checking.
My team and I run Icinga 1.x in a very large environment (~60k service checks on our largest node, 95% of which are passive submissions, every 5 minutes). Using a custom event broker to keep up with the load of inbound check results, we have noticed very few scaling problems. However, when large swaths of the infrastructure fail, like when a hypervisor dies, we see a large influx of freshness checks, which causes the Icinga process to thrash as it tries (in vain) to schedule tens of thousands of check_dummy active checks to report that the services are stale.
To remedy this, we patched Icinga to recognize two new host and service attributes:
For example, for passive services that are usually fed by our monitoring agent software, we have defined these two attributes as such:
We then modified the Icinga check_*_service_result_freshness() functions to bypass the normal schedule-an-active-check-run behavior if these attributes are present, and instead synthesize a check result and inject it into the check results list.
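The decision described above can be modelled roughly as follows. This is a minimal Python sketch of the control flow only, not the patch itself (Icinga 1.x is written in C). The attribute names `synthetic_state` and `synthetic_output` are hypothetical placeholders, since the real attribute names are not given in this thread, and the two lists stand in for Icinga's check result list and active check scheduler.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Service:
    name: str
    freshness_threshold: int  # seconds a result may age before it is stale
    last_check: int           # timestamp of the last received result
    # Hypothetical stand-ins for the two new attributes added by the patch.
    synthetic_state: Optional[int] = None
    synthetic_output: Optional[str] = None


@dataclass
class CheckResult:
    service: str
    return_code: int
    output: str


def handle_freshness(svc: Service, now: int,
                     result_list: List[CheckResult],
                     active_queue: List[str]) -> str:
    """Decide what to do with one service during a freshness pass."""
    if now - svc.last_check <= svc.freshness_threshold:
        return "fresh"
    if svc.synthetic_state is not None and svc.synthetic_output is not None:
        # Both attributes present: synthesize a result and inject it
        # directly, without scheduling an active check.
        result_list.append(
            CheckResult(svc.name, svc.synthetic_state, svc.synthetic_output))
        return "synthesized"
    # Attributes absent: keep the existing behaviour (schedule an active check).
    active_queue.append(svc.name)
    return "scheduled"
```

In this model a stale service with both attributes set never reaches the active check queue, which is the load the patch is trying to avoid.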
Upon exercising this code in our testbed environment, we noticed that recovery from large-scale outages (~20k stale checks at a time) would cause Icinga to thrash: first marking all the stale services as stale, then processing all of the inbound results from our event broker, then marking everything as stale again, and so on. We found it prudent to reap all of the check results on every 1000th stale check, to keep this particular undesired behavior from occurring.
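A rough model of that interleaving, again as a Python sketch rather than the actual C patch: the function and callback names here are hypothetical, and only the interval of 1000 comes from the description above.

```python
from typing import Callable, Iterable

REAP_INTERVAL = 1000  # from the description: reap on every 1000th stale check


def mark_stale_with_periodic_reap(stale_services: Iterable[str],
                                  inject_stale_result: Callable[[str], None],
                                  reap_results: Callable[[], None],
                                  reap_interval: int = REAP_INTERVAL) -> int:
    """Inject a stale result per service, reaping every `reap_interval` items.

    Returns the number of reap passes triggered inside the loop.
    """
    reaps = 0
    for count, svc in enumerate(stale_services, start=1):
        inject_stale_result(svc)
        if count % reap_interval == 0:
            # Drain pending results (synthesized and inbound) before marking
            # more services stale, so fresh results are not starved.
            reap_results()
            reaps += 1
    return reaps
```

With 20,000 stale services this model triggers 20 reap passes during the freshness run instead of leaving all processing until the end.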
Note that for configurations that don't specify these attributes, the current schedule-an-active-check-run behavior persists, to preserve backwards compatibility.
Also note that earlier versions of this patch (as discussed in #icinga-devel and between
Updated by mfriedrich on 2015-01-24 12:59:37 +00:00
Sorry for the delayed answer, January does not seem to be a good month.
While I like the initial idea, I'm holding off on adding new configuration options and state to Icinga 1.x. As you already mentioned, the event broker modules won't receive these attributes and their values, but they probably should, so that they can be written to the idoutils database and represented in Icinga Web and so on. Classic UI has a similar issue. Livestatus is out of scope, since it is not maintained by Icinga in 1.x.
We do have engineering work to do in Icinga 2 on passive checks and freshness (#7071), combined with API ideas for feeding passive checks into the core. I will take your ideas and patch design into account once we are there.
Curious what others think though :)