Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.Sign up
[dev.icinga.com #7287] Re-checks scheduling w/ retry_interval #2009
This issue has been migrated from Redmine: https://dev.icinga.com/issues/7287
Created by wimmer on 2014-09-20 15:16:39 +00:00
I found strange behavior of re-checks scheduling.
When some service changes its status, Icinga1 works as expected:
But Icinga2 uses strange Attempt numbering [there is no 1/3 (SOFT state)]
2015-03-11 15:33:36 +00:00 by mfriedrich 05c237c
2015-08-21 08:24:49 +00:00 by mfriedrich 6f252bb
2016-01-20 15:29:01 +00:00 by mfriedrich a51e647
2016-01-22 17:40:14 +00:00 by mfriedrich 2a11b27
2016-02-23 08:24:57 +00:00 by mfriedrich 9ca7245
2016-03-05 17:15:03 +00:00 by mfriedrich b8e3d61
2016-03-05 17:16:49 +00:00 by mfriedrich ef532f2
2016-03-11 14:55:03 +00:00 by mfriedrich 8344f74
2016-03-11 14:55:14 +00:00 by mfriedrich f99feab
Updated by wimmer on 2014-11-23 13:33:59 +00:00
After 2 months I tried scheduling again. I use Debian packages version 2.2.
Please can somebody confirm this bug?
I would like to use Icinga2 on my production systems, but is impossible with this strange behavior.
Updated by schirrmie on 2015-03-05 07:25:22 +00:00
Your patch does not change the behaviour of the retry_interval.
Updated by mfriedrich on 2015-05-18 13:11:00 +00:00
I'll need to dig into this a little deeper. I'm not sure if this might happen for 2.4.0 - feel free to submit any patches helping resolve the issue.
Updated by zearan on 2015-05-26 12:03:02 +00:00
The problem seems to lie in file libs/icinga/checkable-check.cpp. The function Checkable::ExecuteCheck() (line 521) executes the calculation of the next check before running the check itself:
The time of the next check does not get re-evaluated by Checkable::ProcessCheckResult() (line 237) when the result is available. I have tried to fix it by a one-liner but I cannot say if this causes any unwanted side effects. At least it did not crash anything when I tried it.
Updated by wimmer on 2015-06-19 17:34:18 +00:00
I applied patch to 2.3.5 and it works great.
Here is my setup:
Updated by mfriedrich on 2015-08-18 15:01:14 +00:00
Generally speaking I'd like to split this issue, as it consists of two issues:
Updated by mfriedrich on 2015-08-18 16:48:33 +00:00
Apparently the fix does not work this way, host.next_check still stays at the old scheduled timestamp.
Updated by akrus on 2015-12-22 07:21:51 +00:00
I'm also experiencing the same issue here.
This results in checks being executes every 24 hours (retry interval doesn't work, while Icinga Web 2 shows it correctly).
Updated by Peter_Story on 2016-01-14 16:02:47 +00:00
I'm also encountering this bug. I have a service with:
After the first soft failure, the subsequent check is scheduled to run after 10 minutes, instead of 30 seconds.
Updated by mfriedrich on 2016-01-20 15:22:12 +00:00
It fairly should but we're all normal people, not machines.
I found another issue while testing the original patch which took me quite a while to figure out why it was happening.
Tests using the external command pipe
Now process check results using Icinga Web 2, from DOWN to UP and vice versa.
Calling UpdateNextCheck() works as it should, it contains the correct next GetRetryInterval() values and also sends that to the database.
There's one subsequent SetNextCheck() call afterwards which then overrides the value inside the core (and so the API and external interfaces).
We do ensure to reschedule a check when processing a check result from the external command pipe or api actions. This is due to the reason we require freshness checks using the check_interval (see #7071 for better proposals) for incoming passive check results.
So not a good method for tests anyways. Temporarily removing the code blocks from externalcommandlistener.cpp proves the patch working.
Manual recheck will cause ProcessCheckResult() to immediately update the next check using the retry interval.
Updated by mfriedrich on 2016-01-22 17:37:02 +00:00
There's still one issue with passive check results - calling UpdateNextCheck() will first set a probably smaller next check time, and after invoking ProcessCheckResult() a separate call to SetNextCheck() is required to ensure the check_interval is used for passively received checks and their freshness.
This merely influences all checks where we do not care about them whether being active or passive.
It is also worse for active checks - the next check time is updated to check_interval if sent in as passive check result via API / UI.
Moving that task inside ProcessCheckResult() ensures that subsequent calls to SetNextCheck() happen with the correct check interval for passive check results. The active checks are then rescheduled based on their current state, especially if changed from sending a check result enforcing a state change, e.g. HARD OK, or SOFT NOT-OK with retry_interval being required.
Furthermore updating the next check time before sending a DB IDO update for OnNewCheckResult will also help reduce duplicated next_check queries (see linked #11019).