[dev.icinga.com #2536] scheduled_downtime_depth falsely incremented if in flexible downtime with duration < end-starttime window #944
This issue has been migrated from Redmine: https://dev.icinga.com/issues/2536
Created by Wolfgang on 2012-04-22 13:47:24 +00:00
When a flexible downtime does not end within the defined duration, the counter scheduled_downtime_depth is falsely incremented at the interval defined as the duration.
The following shell script was run every minute via crontab to get the current values:
The result is:
The counter is incremented every three minutes (the duration of the flexible downtime) although no other downtime is planned for the host. The counter is decremented once at the end of the downtime period and then keeps its value.
2012-04-22 18:08:41 +00:00 by mfriedrich 4bbd7e8
2012-04-22 20:02:28 +00:00 by mfriedrich f03dbcd
2012-04-22 20:52:41 +00:00 by mfriedrich 8d315d0
2012-04-23 11:37:19 +00:00 by mfriedrich 2bfc1d4
2012-04-28 08:49:12 +00:00 by mfriedrich 95e0400
2012-04-28 08:52:20 +00:00 by mfriedrich 51997db
2012-04-28 08:53:29 +00:00 by mfriedrich dc1569b
2012-04-28 08:56:48 +00:00 by mfriedrich 8422f24
Updated by Wolfgang on 2012-04-22 14:51:31 +00:00
Attached a debug file using debug_level=520 (events and downtime), debug_verbosity=2.
Updated by mfriedrich on 2012-04-22 17:24:36 +00:00
hmmm, my guess is that the first location where it detects an ending downtime may be wrong.
if that section is hit, the downtime depth gets decremented as needed.
if the if condition does not match, you fall into the else tree, where the downtime depth gets incremented.
that's also where ricardo's startup fixes come in, which may be why the depth gets cleared and reported correctly after a restart.
depth is incremented, in_effect is set.
given the times
Sun, 22 Apr 2012 16:21:36 GMT+1
the first match is ok, but the rest fails. since there's a debug log entry missing, i'll add one while checking why the if condition fails. maybe this requires diverging the "if" a bit, or changing the "else" condition as well.
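The if/else tree being described can be sketched like this (a simplified model for illustration only; the struct and function names are assumptions, not the actual core source):

```c
#include <assert.h>
#include <time.h>

/* Simplified model of the downtime event handler discussed above.
 * Returns -1 when the "ending downtime" branch matches (depth is
 * decremented) and +1 when the "else" tree runs (depth is incremented
 * and in_effect is set). Names are assumed, not the real core code. */
typedef struct {
    int is_in_effect;   /* downtime currently active? */
    time_t end_time;    /* scheduled end of the downtime window */
} sched_downtime_t;

int handle_downtime_event(const sched_downtime_t *dt, time_t now) {
    if (dt->is_in_effect && now >= dt->end_time)
        return -1;      /* ending downtime: decrement depth */
    return +1;          /* else tree: increment depth, set in_effect */
}
```

If the expiration event for a flexible downtime keeps firing before end_time, the first branch never matches, which would produce exactly the incrementing behaviour Wolfgang observed.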
Updated by mfriedrich on 2012-04-22 17:35:16 +00:00
now let's go to the web interface again, disable active checks for the host, and submit a passive check result.
if i am right, the next expiration event for the scheduled downtime is scheduled at now+3 minutes, when the next event (not check!) will happen and tell us a bit more.
Updated by mfriedrich on 2012-04-22 17:52:07 +00:00
argh, my mistake - 3h instead of 3min.
1335116473 = Sun, 22 Apr 2012 19:41:13 GMT+2
this is wrong, which means the downtime got rescheduled.
current_time = 1335116653
the downtime is in_effect, and the current time is not greater than the end_time, so the "else" tree matches, redoing the downtime.
the actual end of a flexible downtime is sort of precalculated - it takes the current time plus the duration.
so what we know now: a flexible downtime with duration < (end - start) can never hit the expiry condition (in_effect && current > end).
need to enhance the debug log for entry_time.
Updated by mfriedrich on 2012-04-22 17:59:57 +00:00
another summary of my thinking.
basically this means that there are now two cases for flexible downtimes.
if you hit 1), the expiration event is scheduled just like for a normal fixed downtime, firing a bit after the end time - which won't trigger the else tree.
if you hit 2) because of a small duration window, the scheduled expiration event is at current_time + duration, not yet reaching the set end time. that leads into the "else" tree, incrementing the counter until the actual end time is finally hit somewhere in the future.
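The two cases can be checked numerically (a sketch; the rule "the expiration event fires at trigger time + duration" is taken from the analysis above, and the function name is an assumption):

```c
#include <assert.h>
#include <time.h>

/* Returns 1 when the expiration event lands at/after end_time
 * (case 1: clean expiry, decrement branch matches) and 0 when it
 * lands before it (case 2: the else tree re-runs and the depth is
 * incremented once per duration interval until end_time is reached). */
int expires_cleanly(time_t trigger_time, time_t duration, time_t end_time) {
    time_t expiration_event = trigger_time + duration;
    return expiration_event >= end_time;
}
```

With Wolfgang's numbers - a 3 minute duration inside a longer window - case 2 applies on every expiration event until the window's end time.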
Updated by mfriedrich on 2012-04-22 18:25:24 +00:00
and this leads to another problem. entry_time is NOT what you'd expect: it's actually the time the command for scheduling the downtime was sent, NOT the time the downtime was actually triggered.
so the core doesn't know when a flexible downtime was triggered; it just assumes that somewhere within the start/end window it will run for the given duration, and the latest possible downtime cancel is end_time.
with that concept, this bug cannot be fixed.
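To make the distinction concrete (hypothetical timestamps and assumed names, purely for illustration):

```c
#include <assert.h>
#include <time.h>

/* entry_time is when the scheduling command was received; a flexible
 * downtime may trigger much later, anywhere inside [start_time,
 * end_time]. Without recording that trigger moment, the only safe
 * upper bound the core has is end_time itself. */
time_t latest_possible_cancel(time_t end_time) {
    return end_time;                /* all the core can guarantee */
}

time_t desired_cancel(time_t trigger_time, time_t duration) {
    return trigger_time + duration; /* needs the missing trigger_time */
}
```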
Updated by mfriedrich on 2012-04-22 19:32:26 +00:00
while working on adding the trigger_time in #2537 i've been reading the docs on flexible downtimes. they clearly say "lasts duration", not "lasts forever".
Updated by mfriedrich on 2012-04-22 19:41:20 +00:00
once the flexible downtime is triggered, we check whether the current time is greater than or equal to trigger_time (the time the flex downtime started) plus the duration it lasts. that way we can be sure it lasts exactly 1x duration and can safely expire the downtime.
so in order to fix this issue, we must implement the change from #2537.
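The fix described above reduces to a single comparison once trigger_time is recorded (a sketch under that assumption; the function name is hypothetical, not the actual patch):

```c
#include <assert.h>
#include <time.h>

/* With trigger_time recorded at the moment the flexible downtime
 * starts (#2537), expiry becomes a direct check: the downtime lasts
 * exactly one duration after it triggered, never longer. */
int flex_downtime_expired(time_t now, time_t trigger_time, time_t duration) {
    return now >= trigger_time + duration;
}
```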
Updated by mfriedrich on 2012-04-22 19:43:10 +00:00
basic test with a 5minute fixed downtime.
Updated by mfriedrich on 2012-04-22 19:44:15 +00:00
test with a 3min flexible downtime in a 20min window.
Updated by mfriedrich on 2012-04-22 19:53:18 +00:00
test with a 5min flexible downtime in a 5min window (which matches the gui's default layout, except the gui suggests 2h).
Updated by mfriedrich on 2012-04-22 20:09:32 +00:00
Updated by melle on 2012-04-26 13:08:00 +00:00
Seems to work for me, though I'm not sure if the results are what the "downtime experts" expect them to be.
Results in detail:
Updated by Frankstar on 2012-04-27 09:44:48 +00:00
also the gui displayed everything correctly.
next step: fixed downtime.
Updated by Frankstar on 2012-04-27 10:06:30 +00:00
Downtime test, fixed downtime
seems to work fine.
Updated by Frankstar on 2012-04-27 11:11:19 +00:00
flexible downtime test: 1h window, 7min duration, with up/down/up/down simulation.
send up: 12:18
worked fine. no more testing from my side.