Checks with large check_intervals scheduled outside of short time periods #647
The issue seems to originate in 62af867, where the previous maintainer was trying to reduce the load caused by scheduling checks from outside their timeperiods. He did this by 'rescheduling' the checks to some random amount of time after the start of the timeperiod. However, this code doesn't ensure that the new time is still inside the timeperiod; the random offset is bounded only by the check_interval or retry_interval of the host/service.
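To make the failure mode concrete, here is a minimal sketch in Python rather than the actual Nagios Core C code. It assumes the random offset is drawn uniformly between 0 and check_interval; `naive_reschedule` and `in_period` are hypothetical names, and the one-hour window and 24-hour interval are made-up numbers chosen only for illustration.

```python
import random

def naive_reschedule(period_start, check_interval, rng=random):
    """Add a random offset bounded only by check_interval (assumed behaviour)."""
    return period_start + rng.uniform(0, check_interval)

def in_period(t, period_start, period_end):
    """True if t lies inside the half-open window [period_start, period_end)."""
    return period_start <= t < period_end

# Hypothetical values, in seconds: a 1-hour timeperiod and a 24-hour check_interval.
period_start, period_end = 0, 3600
check_interval = 24 * 3600

rng = random.Random(0)  # seeded so the run is repeatable
samples = [naive_reschedule(period_start, check_interval, rng) for _ in range(1000)]
missed = sum(not in_period(t, period_start, period_end) for t in samples)
print(f"{missed} of {len(samples)} scheduled times fall outside the timeperiod")
```

Under these assumptions only about one draw in 24 lands inside the window, so most checks would be scheduled for a time at which they are not allowed to run.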
* Create reschedule_within_timeperiod(), which handles the ranged_urand() rescheduling more correctly
* fix error in reschedule_within_timeperiod, replace ranged_urand() calls where applicable
* Fix initial scheduling of service checks
Hi @Madlohe, I see in change log:
We're still on 4.4.3, so we haven't tried the previous fix yet. Should we expect a partial solution? Thanks
Originally (for 4.4.4) I took all instances of check scheduling and ensured that they checked the scheduled time against the timeperiod to make sure the check would run. The problem was that this is a somewhat expensive operation, so on startup we'd see CPU load issues from scheduling hundreds or thousands of checks this way simultaneously. For 4.4.5 the timeperiod logic is only used when rescheduling. Rescheduling occurs after each check, but also in the several minutes before each check. So, in your specific case, I think the changes should still work. If you upgrade and still have issues with this, do let me know.
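As a rough illustration of rescheduling that respects the timeperiod, the sketch below caps the random offset by whichever is shorter, the interval or the window itself. It is not the body of reschedule_within_timeperiod() from the fix: `reschedule_within_window` is a hypothetical name, and the timeperiod is simplified to a single contiguous `[period_start, period_end)` range, whereas real Nagios timeperiods can contain several ranges and exclusions.

```python
import random

def reschedule_within_window(period_start, period_end, interval, rng=random):
    """Pick a random time in [period_start, period_end), spread over at most `interval` seconds.

    Hypothetical helper: assumes one contiguous timeperiod, times in seconds.
    """
    if period_end <= period_start:
        raise ValueError("timeperiod is empty")
    # Never spread further than the window is long.
    span = min(interval, period_end - period_start)
    t = period_start + rng.uniform(0, span)
    # uniform() can return its upper bound; fall back to the window start if so.
    return t if t < period_end else period_start

# Hypothetical values: 1-hour window, 24-hour interval (seconds).
rng = random.Random(0)
times = [reschedule_within_window(0, 3600, 24 * 3600, rng) for _ in range(1000)]
print(all(0 <= t < 3600 for t in times))
```

The cap is cheap to compute, but it only covers the single-range case; validating a time against a full timeperiod definition is the more expensive operation described above.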
…ods (NagiosEnterprises#649) * Create reschedule_within_timeperiod(), which handles the ranged_urand() rescheduling more correctly * fix error in reschedule_within_timeperiod, replace ranged_urand() calls where applicable * Fix initial scheduling of service checks
See here for context
In short, when a service has a check_interval much longer than the total length of the check_period, the check may be scheduled after the end of the time period. When this happens, the check isn't run and the service remains PENDING.