Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.Sign up
[dev.icinga.com #1782] Icinga forgets to schedule non-24x7 checks #712
This issue has been migrated from Redmine: https://dev.icinga.com/issues/1782
Created by mcp on 2011-08-04 10:34:46 +00:00
I'm using Icinga v1.4.2, but had the same problems also with v1.4.1 and v1.4.0,
Icinga forgets to schedule checks which are not 24x7.
I have a bunch of checks doing checks from 7am to 10pm and another bunch
all those checks have "Next Scheduled Check: N/A".
When I restart Icinga after 7am but before 8am all checks which should start
I'm using the same config as I used for Nagios 3 the past years and didn't
2011-10-22 18:09:37 +00:00 by mfriedrich 348a795
Updated by mcp on 2011-08-04 10:59:16 +00:00
config example attached.
icingastats output attached.
I don't have graphs for icingastats, some others attached.
Updated by mcp on 2011-08-05 12:21:41 +00:00
well, the problem isn't a performance problem at all.
I wasn't aware of the problem with icinga 1.3.x but I think
and please notice, its just for service !24x7. the other ones
Updated by mfriedrich on 2011-08-05 13:45:22 +00:00
so to speak a timeperiods problem. performance would have been a guess in regards of "forgetting things over time", meaning no space for calculations.
which are the system defaults - UTC, something else? did this happen now in DST?
and just a wild guess from the example
turn on debugging and check if the event for rescheduling your service. it might also be interesting, which state the current service is in ...
just for the record - i've just looked into base/events.c where the logic happens.
so given the todo i would expect that this is exactly the case for you.
git blame tells that
so let's have a look into ...
Updated by mfriedrich on 2011-08-05 15:27:04 +00:00
ok, to trigger the reschedule (if this really happens), the run_event must be set to false. this is the case if
it would be generally interesting from your debuglog if you can catchup with these messages
(set the debug level to match both "checks" and "events")
Updated by mfriedrich on 2011-08-05 15:42:08 +00:00
after having analyzed this, more event output could be the lead to the topic (in init_timing_loop which is called before starting the event_loop for monitoring things)
either on of those 2
Updated by mcp on 2011-08-06 08:26:13 +00:00
Preferred Time is Invalid In Timeperiod '07-22 Uhr - x7': 1312603714 --> Sat Aug 6 06:08:34 2011
but I think that is because of it should start at 7 am.
at startup I see some with "Service check should not be scheduled" but that's OK because
at 06.08.2011 06:02:02:
[1312603322.977948] [008.2] [pid=31568] Service 'Updates' on host 'loghost'
and if I understand this correctly, it is totally wrong, because that
and for the Update Service on host loghost:
Last Check Time: 2011-08-05 18:12:06
for the other missing 450 for today yet it's the same.
Updated by mfriedrich on 2011-08-06 10:56:11 +00:00
hmm that sounds like a DST problem again.
so the core calculates a check time based on the assumption that DST is not active, which then results in wrong check times which are not executed because their check_period is different.
that actually requires debugging ...
Updated by mfriedrich on 2011-08-06 12:06:18 +00:00
the next check time gets calculated from various factors,
first, the CIB (current interleave block) plus IBI (interleave block index) multiplied by TIB (total interleave blocks)
creating the mult factor, which is being used to calculate the next_check
which then results in your output. the below code block checks, if the preferred check time is a valid time within the timeperiods.
given your example, it fails and spits out the error msg in debug log.
note: those two entries do not match! but the calculation is wrong either way.
actually it doesn't because the next_check time stays the same afterwards.
so conclusion for this analysis - the calculated next_check time does not fit within the timeperiod throws an error, but the retry of getting a valid check time does something wrong, and probably ignores the timeperiod setting (and/or has DST problems). otherwise the Actual Check Time would be corrected.
Updated by mfriedrich on 2011-08-06 12:29:40 +00:00
and now for the worse part, reading on nagios bug tracker and lists...
seems like we got a winner on an old and long lasting bug :(
Updated by mfriedrich on 2011-08-06 12:45:07 +00:00
one last question if we can build a proposed patch on that ... which exact version of icinga is installed, 1.4.2 release tarball or something different? testing will require a recompiled source install.
Updated by mfriedrich on 2011-08-08 06:50:37 +00:00
ok. as i do think that this hurts your production, and i currently can't reproduce the overall problem, i'd propose a workaround for that problem
regarding the problem, i'd love to see the full debuglog (with level=-1 and verbosity=2) from -1 to +1 in the time window). if it contains not-accurate-for-public info, targz it and send it to my mailbox michael.friedrich (at) univie.ac.at
Updated by mfriedrich on 2011-08-08 09:14:25 +00:00
i've created a test config, and i can't reproduce it. checks are rescheduled like i want them to be due to the timperiods being set.
Updated by mfriedrich on 2011-08-08 13:38:32 +00:00
this thread is very interesting, also the proposed patch.
currently, if the next_check time is not a valid time and not in the timeperiod, the check becomes reschedule for the next week (does this happen in here?).
by changing that to happen at the next check cycle, the core could retry that once more, when hitting the changed timeperiod / changed time window.
Updated by mfriedrich on 2011-09-16 12:35:29 +00:00
possible this valuable patch from andreas could resolve the rescheduling issue with check periods on (changed) timeperiods.
Updated by mfriedrich on 2011-10-22 15:29:32 +00:00
host check viability when running an asynchronous host check, is_valid_time gets set by that, determining if the check should be run and if time_is_valid is true or false.
time_is_valid is used in here
/* run a scheduled host check asynchronously */
/* attempt to run the check */
/* an error occurred, so reschedule the check */
/* the host could not be rescheduled properly - set the next check time for next week */
time_is_valid is set to FALSE if the check could not be executed. so this is not the actual check but a timeperiod check before.
preferred_time is then calculated too, being matched against as next valid time
the question is, why is next_valid_time == preferred_time and not having a pre-checked valid time from the check attempt to set a reschedule time 1 week in the future.
for the services, this is a bit different.
so over here, time_is_valid is FALSE too if the check could not happen, but the next_valid_time calculated from previous get_next_valid_time does not fit in check_time_against_period so it still gets rescheduled.
i would suspect a bug in that function, because is_valid_time is always FALSE (so TRUE on the condition, and unless the timeperiod does not match the next_valid_time, the condition is always matched and therefore checks are rescheduled next week,
but as a matter of fact, this only occurs if using non 24x7 timeperiods.
deeper analysis on both needed
Updated by mfriedrich on 2011-10-22 15:31:35 +00:00
Updated by mfriedrich on 2011-10-22 15:48:59 +00:00
which could be true too - the timeperiod checks are not valid and therefore causing within the check viability an error, and afterwards on a reschedule retry they are immediately rescheduled away once more because of probably the same bug. this happens within milliseconds, not to say nanoseconds, to determine
Updated by mfriedrich on 2011-10-22 17:48:44 +00:00
get_next_valid_time contains a nested check with check_time_against_period and preferred_time against tperiod.
0 causing get_earliest_time to use
and then taking the timeperiod exclusions into account. i expect that no timeperiod exclusion is used currently?
if so this would be called recursively, incrementing the level and also hitting the else tree. why the heck is valid_time +1 over there?
anyhow, the actual logic happens within _get_next_valid_time_per_timeperiod
this is where a lot of is_dst=-1 are happening, but that should be set by the system and then being checked and compared by the application, like done with shift?
Updated by mfriedrich on 2011-10-22 17:58:04 +00:00
just checked ...
isdst should be determined with
which is actually done within
even calculating the shift then.
but as a matter of fact, not used in each place where new times are calculated. could be a problem too.
Updated by mcp on 2011-10-23 09:50:14 +00:00
root@lokalhorst:/root/ # ./dst
Updated by mfriedrich on 2012-09-12 17:32:35 +00:00
we could test a patch, which was originally done on the nagios tree.