[dev.icinga.com #2136] freshness checks are generating stale alerts, even if result was received in time #802

Closed
icinga-migration opened this Issue Dec 1, 2011 · 9 comments

@icinga-migration
Member

This issue has been migrated from Redmine: https://dev.icinga.com/issues/2136

Created by mfriedrich on 2011-12-01 19:12:21 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2011-12-02 15:55:45 +00:00)
Target Version: 1.6.1
Last Update: 2011-12-02 15:55:45 +00:00 (in Redmine)


the recent patch #2027 for freshness checking on startup heavily affects the normal scheduled run under certain circumstances: results are reported as stale although they were received in time.

[1322755814] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.636009
[1322755815] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756114] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.635003
[1322756115] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756414] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.635272
[1322756415] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756475] Warning: The results of service 'fuckup' on host 'localhost' are stale by 0d 0h 1m 0s (threshold=0d 0h 20m 0s).  I'm forcing an immediate check of the service.
[1322756478] SERVICE ALERT: localhost;fuckup;CRITICAL;HARD;1;CRITICAL: whohooo failure
[1322756478] SERVICE NOTIFICATION: demo;localhost;fuckup;CRITICAL;service-notify-by-email;CRITICAL: whohooo failure

as remarked in the #2027 description, the patch could most likely fail.

a tested revert in the OMD 0.52 nightly is working, so it needs to be pushed and released in 1.6.1.

Attachments

Changesets

2011-12-01 20:02:26 +00:00 by mfriedrich 2208b71

remove wrong expiration calculation on freshness checks #2136

this is a revert, and needs a proper fix, but will do for now.

refs #2136

Relations:

Member

Updated by mfriedrich on 2011-12-01 19:16:32 +00:00

  • Target Version set to 1.6.1
Member

Updated by mfriedrich on 2011-12-01 20:06:04 +00:00

looking into the snippet, the 60 seconds delay on the alert might have something to do with the hardcoded 60 seconds in the addition (missing brackets, a different AND operator?).

-       if (temp_service->check_type == SERVICE_CHECK_PASSIVE) {
-               if (event_start < program_start + 60 && event_start - last_program_stop < (freshness_threshold * 0.618)) {
-                       expiration_time = event_start + freshness_threshold;
-               }
-       }

requires further investigation though. the patch for removal will be in r1.6
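
On the question of missing brackets: in C, `+` and `-` bind tighter than `<`, and `<` binds tighter than `&&`, so the removed condition already groups the way it reads. The following is a minimal sketch to check that, not code from Icinga. The function names `cond_as_written` and `cond_parenthesized` are hypothetical, and plain parameters stand in for the variables used in the snippet.

```c
#include <time.h>

/* Condition exactly as written in the removed snippet.
 * Hypothetical helper: parameters replace the variables of the snippet. */
int cond_as_written(time_t event_start, time_t program_start,
                    time_t last_program_stop, int freshness_threshold)
{
    return event_start < program_start + 60 && event_start - last_program_stop < (freshness_threshold * 0.618);
}

/* Same condition with every grouping spelled out explicitly. */
int cond_parenthesized(time_t event_start, time_t program_start,
                       time_t last_program_stop, int freshness_threshold)
{
    return (event_start < (program_start + 60)) &&
           ((event_start - last_program_stop) < (freshness_threshold * 0.618));
}
```

Both functions return the same result for any input, which suggests that operator precedence is not the cause and that the hardcoded 60 seconds itself deserves the attention.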

Member

Updated by mfriedrich on 2011-12-01 22:16:52 +00:00

quickfix/revert here

https://git.icinga.org/?p=icinga-core.git;a=commit;h=2208b714cc3a3e9257d474fd444000c6cc3c82f1

Member

Updated by mfriedrich on 2011-12-02 08:58:09 +00:00

  • File added 0001-remove-wrong-expiration-calculation-on-freshness-che.patch
Member

Updated by mfriedrich on 2011-12-02 13:01:59 +00:00

  • Subject changed from freshness checks are generating stale alerts after 60 seconds, even if result was received to freshness checks are generating stale alerts, even if result was received
  • Priority changed from High to Urgent
Member

Updated by mfriedrich on 2011-12-02 13:44:14 +00:00

  • Subject changed from freshness checks are generating stale alerts, even if result was received to freshness checks are generating stale alerts, even if result was received in time

further debug logs (private details removed)

[1322813318.089879] [016.2] [pid=17065] Checking freshness of service 'foo' on host 'bar'...
[1322813318.089883] [016.2] [pid=17065] Freshness thresholds: service=300, use=300
[1322813318.089887] [016.2] [pid=17065] HBC: 1, PS: 1322812951, ES: 1322812958, LC: 1322813213, CT: 1322813318, ET: 1322813258
[1322813318.089966] [016.1] [pid=17065] Check results for service 'foo' on host 'bar' are stale by 0d 0h 1m 0s (threshold=0d 0h 5m 0s).  Forcing an immediate check of the service...
[1322813318.090017] [016.0] [pid=17065] Scheduling a forced, active check of service 'foo' on host 'bar' @ Fri Dec  2 09:08:38 2011
[1322813318.090059] [016.2] [pid=17065] Scheduling new service check event.

has_been_checked: 1
program_start: 1322812951
event_start: 1322812958
last_check: 1322813213
current_time: 1322813318
expiration_time: 1322813258
Freshness thresholds: service=300, use=300

from the new calculation algorithm

-               if (event_start < program_start + 60 && event_start - last_program_stop < (freshness_threshold * 0.618)) {
-                       expiration_time = event_start + freshness_threshold;
-               }

if (1322812958 < 1322812951 + 60 && 1322812958 - x < 300 * 0.618)
expiration_time = 1322812958 + 300

if (expiration_time < current_time)

if (1322813258 < 1322813318)
this is true

analyze the above

if (1322812958 < 1322813011 && 1322812958-x < 185.4)
=>
1322812958-x < 185.4
=>
1322812958-185.4 < x
=>
x > 1322812772.6
=>
x > Fri, 02 Dec 2011 07:59:32 +0000
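
The arithmetic above can be replayed in a few lines of C. This is a sketch only: the helper names `startup_expiration`, `stale_by` and `min_last_program_stop` are hypothetical, and the inputs are the values from the debug log.

```c
#include <time.h>

/* Expiration time as assigned by the removed startup branch:
 * expiration_time = event_start + freshness_threshold. */
time_t startup_expiration(time_t event_start, int freshness_threshold)
{
    return event_start + freshness_threshold;
}

/* Seconds by which a result counts as stale at current_time.
 * A value of 0 or less means the result is still fresh. */
long stale_by(time_t expiration_time, time_t current_time)
{
    return (long)(current_time - expiration_time);
}

/* Lower bound on last_program_stop (called x above) for which
 * event_start - x < freshness_threshold * 0.618 holds. */
double min_last_program_stop(time_t event_start, int freshness_threshold)
{
    return (double)event_start - freshness_threshold * 0.618;
}
```

With the logged values (event_start 1322812958, threshold 300, current_time 1322813318) this gives an expiration time of 1322813258 and a staleness of 60 seconds, matching the "stale by 0d 0h 1m 0s" line, and a lower bound of roughly 1322812772.6 for last_program_stop.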

the overall conclusion is that the freshness calculation based on the retention.dat creation date, i.e. the comparison between event_start and 61.8 percent of the freshness threshold, will cause an error if both of the following hold:

  • event_start is within program_start+60
  • last_program_stop happened less than 300s*61.8% (185.4s) before event_start, so not so long ago

actually, the problem is not last_program_stop but the condition on program_start and the 60 seconds. the 60 seconds value is just an assumption and therefore not safe to rely on.
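
To illustrate why the startup branch produces the false alert, the sketch below contrasts the two ways of deriving expiration_time for the logged values. It rests on one assumption that this thread does not confirm: that without the removed branch, the expiration time is derived from last_check plus the freshness threshold. All function names are hypothetical.

```c
#include <time.h>

/* Expiration as set by the removed startup branch: tied to event_start,
 * which in the debug log does not move with current_time. */
time_t expiration_with_branch(time_t event_start, int freshness_threshold)
{
    return event_start + freshness_threshold;
}

/* ASSUMPTION (not shown in this thread): without the branch, expiration
 * follows the most recent result, i.e. last_check + freshness_threshold. */
time_t expiration_from_last_check(time_t last_check, int freshness_threshold)
{
    return last_check + freshness_threshold;
}

/* Mirrors the check "if (expiration_time < current_time)" quoted above.
 * Returns 1 if the result is considered stale, 0 otherwise. */
int is_stale(time_t expiration_time, time_t current_time)
{
    return expiration_time < current_time;
}
```

For the logged values, the branch yields 1322812958 + 300 = 1322813258, which is before current_time 1322813318, so the result is flagged as stale. Under the assumption above, last_check 1322813213 + 300 = 1322813513 lies after current_time, so the same result would still count as fresh.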

Member

Updated by ABauer on 2011-12-02 14:14:08 +00:00

Hi michi,

thanks for figuring this out, in detail!

I ran into the same problem today, after updating to 1.6.

I applied your patch, now everything is fine and running smoothly.

Member

Updated by mfriedrich on 2011-12-02 14:49:30 +00:00

thanks for the feedback!

1.6.1 has been put onto SourceForge, and packagers got a separate mail for that.

Member

Updated by mfriedrich on 2011-12-02 15:55:45 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100
@icinga-migration icinga-migration added this to the 1.6.1 milestone Jan 17, 2017