Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

[dev.icinga.com #2993] duplicate events when scheduling forced host|service check #1079

Closed
icinga-migration opened this Issue Aug 18, 2012 · 23 comments

Comments

Projects
None yet
1 participant
Member

icinga-migration commented Aug 18, 2012

This issue has been migrated from Redmine: https://dev.icinga.com/issues/2993

Created by imriz on 2012-08-18 13:21:14 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2012-08-27 12:49:54 +00:00)
Target Version: 1.7.2
Last Update: 2012-08-27 12:50:34 +00:00 (in Redmine)

Icinga Version: 1.7.1
OS Version: Centos 6

Hi,

When you submit a forced service check, temp_event is always NULL, and therefore the previous event is not removed, resulting with multiple events for the same service, running at different schedules.

Attachments

Changesets

2012-08-18 14:37:52 +00:00 by mfriedrich 379b712

* core: fix duplicated events on check scheduling logic for new events (Andreas Ericsson) #2676 #2993

previously, the logic on scheduling a new event was changed using the
new_event attribute. the decision for actually scheduling a new event
now happens generalized after having decided to actually do so.
furthermore next_check_event is correctly assigned to that new event for
the host|service check (which is a bug in previous versions, causing
duplicate events under different circumstances).

refs #2676
refs #2993

Conflicts:

	Changelog

2012-08-19 17:09:21 +00:00 by mfriedrich ec9c5e3

core: avoid duplicate events when scheduling forced host|service check (Imri Zvik) #2993

previously, we had introduced a hash-like implementation of
host|service->next_check_event directly pointing to the next
scheduled event. this algorithm is being used within
schedule_host|service_check, determing wether to use the
already assigned event, or scheduling a new event. Since we
did not populate the event_data (host or service object) when
adding a new event to the scheduler, this became insame, always
rescheduling a new event, but not discarding the old one.

This has been partly fixed in #2676 with refactoring that detection
and saving the next_check_event accordingly. But on already scheduled
events which were forced (overridden), another bug was unveiled.

Now that we add the reverse pointer from the host|service event_data
back to the newly created event when forcing a check, we will make sure
that those events are checked correctly, and executed/discarded in the
first place, and not always creating a new event, seperated from the rest.

basically, using the previous implementation (with and without the fix
from #2676) we've created an event bomb under various circumstances,
especially when future events would have been overridden (forced checks).
as events usually result in checks, which can result into perfdata, this
could possibly the root cause for #2924 as well, as other users reported
on the portal as well.

http://www.monitoring-portal.org/wbb/index.php?page=Thread&threadID=26352

With the kind patch provided by Imri Zvik, this now works like expected.
Adapted the debug output a bit myself, so with debug_level=272 it will be
easy to see what happens on events scheduling - and not the non-telling
mess before.

kudos to Imri Zvik for the patch.

refs #2993
refs #2676
refs #2182
refs #2924

2012-08-19 17:29:57 +00:00 by mfriedrich f32fbf8

core: avoid duplicate events when scheduling forced host|service check (Imri Zvik) #2993

previously, we had introduced a hash-like implementation of
host|service->next_check_event directly pointing to the next
scheduled event. this algorithm is being used within
schedule_host|service_check, determing wether to use the
already assigned event, or scheduling a new event. Since we
did not populate the event_data (host or service object) when
adding a new event to the scheduler, this became insame, always
rescheduling a new event, but not discarding the old one.

This has been partly fixed in #2676 with refactoring that detection
and saving the next_check_event accordingly. But on already scheduled
events which were forced (overridden), another bug was unveiled.

Now that we add the reverse pointer from the host|service event_data
back to the newly created event when forcing a check, we will make sure
that those events are checked correctly, and executed/discarded in the
first place, and not always creating a new event, seperated from the rest.

basically, using the previous implementation (with and without the fix
from #2676) we've created an event bomb under various circumstances,
especially when future events would have been overridden (forced checks).
as events usually result in checks, which can result into perfdata, this
could possibly the root cause for #2924 as well, as other users reported
on the portal as well.

http://www.monitoring-portal.org/wbb/index.php?page=Thread&threadID=26352

With the kind patch provided by Imri Zvik, this now works like expected.
Adapted the debug output a bit myself, so with debug_level=272 it will be
easy to see what happens on events scheduling - and not the non-telling
mess before.

kudos to Imri Zvik for the patch.

refs #2993
refs #2676
refs #2182
refs #2924

Conflicts:
	Changelog

Relations:

Member

icinga-migration commented Aug 18, 2012

Updated by imriz on 2012-08-18 13:24:04 +00:00

I am referring to the schedule_service_check function in check.c, of course.

Also, options & CHECK_OPTION_FORCE_EXECUTION never evaluates as true - should be options == CHECK_OPTION_FORCE_EXECUTION ?

Member

icinga-migration commented Aug 18, 2012

Updated by imriz on 2012-08-18 14:21:27 +00:00

Because temp_event is NULL, the code never reach the point where it assign new_event to svc->next_check_event.

svc->next_check_event = new_event; should be moved to the place where the new_event is assigned with values (later down the function).

Member

icinga-migration commented Aug 18, 2012

Updated by imriz on 2012-08-18 14:30:29 +00:00

Just saw https://dev.icinga.org/issues/2676

This fix should probably be released before 1.8, as with busy configurations, with many people and commands from the UI, this issue could be a real problem.

Member

icinga-migration commented Aug 18, 2012

Updated by mfriedrich on 2012-08-18 14:50:10 +00:00

well i've seen some reports on the portal as well, but still i haven't been able to reproduce the bug itsself, only the fix, sourcing from another problem, remains logical.

it might be related to #2964 as well, so i'd be happy if you could test the fix under your conditions and report back.

i've prepared the possible fix as on-top 1.7.1 in git r1.7
https://git.icinga.org/?p=icinga-core.git;a=commit;h=379b71295b4846195590350ccb309b3ec79212da

Member

icinga-migration commented Aug 18, 2012

Updated by imriz on 2012-08-18 14:59:16 +00:00

Hi,
Thanks for the quick response!
The patch looks good - I will try it tomorrow and report back.

What about the second issue? shouldn't the bitwise AND be something like

(options & CHECK_OPTION_FORCE_EXECUTION) == 1
or
options == CHECK_OPTION_FORCE_EXECUTION

The current if statement never evalutes as true (easy to reprocude - just enable debug logs, and see that it always logged as non-forced).

Member

icinga-migration commented Aug 18, 2012

Updated by mfriedrich on 2012-08-18 15:36:32 +00:00

that's a bitwise comparison against those options, yes.

#define CHECK_OPTION_NONE               0       /* no check options */
#define CHECK_OPTION_FORCE_EXECUTION    1       /* force execution of a check (ignores disabled services/hosts, invalid timeperiods) */
#define CHECK_OPTION_FRESHNESS_CHECK    2       /* this is a freshness check */
#define CHECK_OPTION_ORPHAN_CHECK       4       /* this is an orphan check */

why would

options & CHECK_OPTION_FORCE_EXECUTION

never result in true, if the bit was set accordingly? rather than checking the debug log, i would check if the gui (cmd.cgi logging, external commands) did send the correct options.

btw

if((options & CHECK_OPTION_FORCE_EXECUTION) == 1)

is the long version of

if(options & CHECK_OPTION_FORCE_EXECUTION)

and

options == CHECK_OPTION_FORCE_EXECUTION

is wrong, as it compares only direct values. what if options contain other than the forced one, i.e. orphaned?

Member

icinga-migration commented Aug 18, 2012

Updated by imriz on 2012-08-18 21:50:55 +00:00

  • File added checks.patch

Hi,

I've just tested the patch, and it doesn't solve the bug (maybe svc->next_check_event = new_event; is too early?).
The attached patch does partially solves it - partially, because if you send the forced event after you (re)started icinga, and before the service check ran at least once, the next_check_event is not populated, and therefore, you get duplicate event. If you wait for the service check to be ran at least once before you send the force command, the logic works.

As for the bitwise AND, you're of course right - I mixed some debug output I've added :)

Member

icinga-migration commented Aug 18, 2012

Updated by imriz on 2012-08-18 22:22:33 +00:00

  • File added events.patch

When icinga initially starts, nor the init_timing_loop() or schedule_new_event logic deals with populating the next_check_event field.
The attached patch populates the next_check_event when Icinga starts - this seems to solve the issue completely.
Please notice that I've dealt with service checks (still needs to do host checks), as a POC - I am sure you can find a better way to implement this solution.

Member

icinga-migration commented Aug 19, 2012

Updated by imriz on 2012-08-19 07:20:15 +00:00

After re-testing, the patch at https://git.icinga.org/?p=icinga-core.git;a=commit;h=379b71295b4846195590350ccb309b3ec79212da works well, but we still need to deal with schedule_new_event() in events.c

Member

icinga-migration commented Aug 19, 2012

Updated by mfriedrich on 2012-08-19 10:12:37 +00:00

  • Category set to Scheduler
  • Status changed from New to Assigned
  • Assigned to set to mfriedrich
  • Target Version set to 1.8
Member

icinga-migration commented Aug 19, 2012

Updated by mfriedrich on 2012-08-19 14:58:28 +00:00

ok, to summarize that a bit.

  • your checks.c is valid, as it moves the saving of the new_event into the service object at the last time possible. though, the refactorization with the patch shown above already does it quite the same way, so we are safe to leave it that way

  • the events.c patch is good thinking, but does not change the temp_service object, as nothing is returned / referenced by the schedule_new_event() function. that stays a local change only and does not affect the host/service object list stored in memory.

that problem is somewhat complicated.

Member

icinga-migration commented Aug 19, 2012

Updated by imriz on 2012-08-19 15:08:02 +00:00

dnsmichi wrote:

  • the events.c patch is good thinking, but does not change the temp_service object, as nothing is returned / referenced by the schedule_new_event() function. that stays a local change only and does not affect the host/service object list stored in memory.

After applying the change to events.c I can successfully force a service check before its first run.
Shouldn't it work, as we pass the pointer to schedule_new_event()?

Member

icinga-migration commented Aug 19, 2012

Updated by imriz on 2012-08-19 15:22:10 +00:00

[1345389657.812786] [2048.1] [pid=20073] Processing: 'INITIAL SERVICE STATE: isp-radius-1;CPU Load;$SERVICESTATE$;$SERVICESTATETYPE$;$SERVICEATTEMPT$;OK - load average: 0.00, 0.00, 0.00
[1345389657.813122] [2048.1] [pid=20073] Done. Final output: 'INITIAL SERVICE STATE: isp-radius-1;CPU Load;OK;HARD;1;OK - load average: 0.00, 0.00, 0.00
[1345389665.158311] [128.2] [pid=20073] Raw command entry: [1345389664] SCHEDULE_FORCED_SVC_CHECK;isp-radius-1;CPU Load;1345389663
[1345389665.158743] [016.0] [pid=20073] Scheduling a forced, active check of service 'CPU Load' on host 'isp-radius-1' @ Sun Aug 19 18:21:03 2012
[1345389665.158800] [016.2] [pid=20073] Found another service check event for this service 'CPU Load' on host 'isp-radius-1' @ Sun Aug 19 18:24:22 2012

The last line is my edited version of the log_debug_info in schedule_service_check() just after the if (temp_event != NULL) line.

Member

icinga-migration commented Aug 19, 2012

Updated by mfriedrich on 2012-08-19 16:31:33 +00:00

yep, i fell into the pointer to event_data, which is only casts to a service, setting an attribute. lemme see how this can be cleaned up.

though, you're right on getting the value back into the event list, the problem to understand is the reverse logic - put the created new_event ptr onto (service *)event_data->next_check_event.

i need to do a little more testing on that, will post later on. thanks for the help!

Member

icinga-migration commented Aug 19, 2012

Updated by mfriedrich on 2012-08-19 16:32:58 +00:00

  • Target Version changed from 1.8 to 1.7.2
Member

icinga-migration commented Aug 19, 2012

Updated by mfriedrich on 2012-08-19 17:00:45 +00:00

patch as is works like a charm, also for hosts. only the debug output is pretty annoying. will include an enhanced version then.

Member

icinga-migration commented Aug 19, 2012

Updated by mfriedrich on 2012-08-19 17:02:42 +00:00

[1345395697.005115] [016.0] [pid=21007] Scheduling a forced, active check of service 'DISEÑOS' on host '1300localhost' @ Sun Aug 19 19:01:30 2012
[1345395697.005130] [016.2] [pid=21007] Found another service check event for service 'DISEÑOS' on host '1300localhost' @ Sun Aug 19 19:02:28 2012
[1345395697.005134] [016.2] [pid=21007] New service check event is forced, so it will be used instead of the existing event.
[1345395697.005139] [016.2] [pid=21007] Scheduling new service check event for 'DISEÑOS' on host '1300localhost' @ Sun Aug 19 19:01:30 2012

[1345395697.005145] [016.2] [pid=21007] Removing service check event for service 'DISEÑOS' on host '1300localhost' @ Sun Aug 19 19:02:28 2012
[1345395697.005174] [024.1] [pid=21007] Run a few checks before executing a service check for 'DISEÑOS'.
[1345395697.005181] [016.0] [pid=21007] Attempting to run scheduled check of service 'DISEÑOS' on host '1300localhost': check options=1, latency=7.005000
[1345395697.005192] [016.0] [pid=21007] Checking service 'DISEÑOS' on host '1300localhost'...
[1345395697.005211] [2320.2] [pid=21007] Raw Command Input: /usr/lib/nagios/plugins/check_dummy $ARG1$
[1345395697.005235] [2320.2] [pid=21007] Expanded Command Output: /usr/lib/nagios/plugins/check_dummy $ARG1$
Member

icinga-migration commented Aug 19, 2012

Updated by mfriedrich on 2012-08-19 17:07:51 +00:00

  • Subject changed from Duplicate events when forcing service check to duplicate events when scheduling forced host|service check
Member

icinga-migration commented Aug 19, 2012

Updated by mfriedrich on 2012-08-19 17:27:29 +00:00

it's in my dev branch for 1.8 but i will cherry-pick it into r1.7 then. 1 week testing for 1.7.2 should be sufficient imho.

https://git.icinga.org/?p=icinga-core.git;a=commit;h=ec9c5e3

Member

icinga-migration commented Aug 19, 2012

Updated by mfriedrich on 2012-08-19 17:54:48 +00:00

  • Status changed from Assigned to 7
Member

icinga-migration commented Aug 25, 2012

Updated by mfriedrich on 2012-08-25 13:36:10 +00:00

  • Done % changed from 0 to 100
Member

icinga-migration commented Aug 27, 2012

Updated by mfriedrich on 2012-08-27 12:49:54 +00:00

  • Status changed from 7 to Resolved

thanks for your awesome help, release 1.7.2 will be out soon.

@icinga-migration icinga-migration added this to the 1.7.2 milestone Jan 17, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment