[dev.icinga.com #3099] only reap results when checkresult_list is not overloaded (max_check_result_list_items) #1104

Closed
icinga-migration opened this Issue Sep 10, 2012 · 15 comments


icinga-migration commented Sep 10, 2012

This issue has been migrated from Redmine: https://dev.icinga.com/issues/3099

Created by tgelf on 2012-09-10 08:11:22 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2012-09-23 09:21:25 +00:00)
Target Version: 1.8
Last Update: 2012-09-23 09:21:25 +00:00 (in Redmine)


We have often stumbled over an annoying problem in distributed setups when IDO is involved. If the reaper runs into its own timeout, the check result linked list starts growing. The core gets slower and slower and processes fewer and fewer events until the timeout hits.

If you reach the point where events come in faster than they are processed, the game is over: the core will never be able to catch up. The only option left is killing the daemon, but doing so you lose all in-memory results. And there will be MANY of them.

Here is a quick fix for this issue (fixes process_check_result_queue in utils.c):

/* processes files in the check result queue directory */
int process_check_result_queue(char *dirname) {
        char file[MAX_FILENAME_LENGTH];
        DIR *dirp = NULL;
        struct dirent *dirfile = NULL;
        register int x = 0;
        struct stat stat_buf;
        struct stat ok_stat_buf;
        char *temp_buffer = NULL;
+       int result = OK;
+
+       int list_length = 0;
+       check_result *current = check_result_list;
+
+       while (current != NULL) {
+           list_length++;
+           current = current->next;
+       }
+       log_debug_info(DEBUGL_CHECKS, 1, "check_result_list has %d items\n", list_length);
+       if (list_length > 1000) {
+               log_debug_info(DEBUGL_CHECKS, 1, "Skipping check result queue as there is still work for the reaper\n");
+               return 0;
+       }

        /* make sure we have what we need */
        if (dirname == NULL) {

Well... of course this could be improved, but at least it works. Instead of locking up at 100% CPU usage, Icinga will manage to reap a grown queue and continue its work. Please note that adding a counter to avoid counting single elements will not work, as there are NEBs out there directly modifying the checkresults list.

Regards,
Thomas Gelf

Changesets

2012-09-10 15:12:59 +00:00 by mfriedrich 3f3840b

core: only reap results when checkresult_list is not overloaded (max_check_result_list_items) (thx Thomas Gelf) #3099 - MF

when the check result reaper event gets called in order to reap new
check result files from disk, there may still be unprocessed results
left on the checkresult list. this is especially the case when
there are performance issues on the core itself, not leaving enough
time to actually process the checkresult list, but rather reaping
too many results onto it. the larger the list grows, the worse O(n)
gets for e.g. determining the size of the list.

on various larger setups this could lead to a long in-memory list
slowing down everything. once you kill the core, you will lose all
results from memory, including those already reaped. rather than
reaping *everything* on the scheduled reaping interval, we should
only reap a specific number, until the checkresult list is "full".
max_check_result_list_items takes care of that - once set to
a value greater than 0, the reaper will only put checkresults onto
the list as long as the item count does not exceed the maximum.
this allows us to process smaller chunks of checkresult files,
wait for them to be processed, and then reap the rest.
one might set the max file age a bit higher, but that should not be
an issue with the default values.

in order to stay safe, this feature is disabled by default (same as
setting max_check_result_list_items=0 in icinga.cfg). the default
and smaller setups won't need it anyway.

since the checkresult list is not threadsafe in any way, and neb
modules such as mod_gearman or dnx fiddle with the checkresult list
in memory in order to stash their own checkresult queue into the core,
we cannot just add our own counter as e.g. a global variable, as those
addons would not know how to update it. so the patch reads the
checkresult list length before deciding whether to bail early - make
sure to find the best value yourself (reports say 1024 is good
enough). setting this value too large might double up the performance
issues you had already before - therefore this config item is tagged
'experimental'.

the better solution - have a clean api for stashing checkresults
into the core, rather than letting neb modules fiddle with inner
core structures.

thanks to Thomas Gelf for the initial patch.

refs #3099

2012-09-10 15:46:48 +00:00 by mfriedrich 46cfbb3

core: fix broken tests for new delete_scheduled_event() and max_check_result_list_items

refs #3048
refs #3099

2012-09-10 15:56:51 +00:00 by mfriedrich a3f2994

core: set max_check_result_list_items var to a default value

sorry, missed it on the previous commit while testing stuff.

refs #3099

Relations:


icinga-migration commented Sep 10, 2012

Updated by mfriedrich on 2012-09-10 08:25:35 +00:00

we discussed that last week in nuremberg - what if one of the neb modules (like mod_gearman, dnx) fiddles with the check_result_list in memory? then your counter might be invalid, as the check_result_list is not threadsafe.

the chance of hitting the while loop at the wrong moment is rather small, but it is still possible.


icinga-migration commented Sep 10, 2012

Updated by tgelf on 2012-09-10 09:13:59 +00:00

Hi Michi,

that's what I wrote in my comment, isn't it? However, the patch as shown should work fine with such NEBs, as it counts all elements on every single run. Doing so doesn't hurt, as there are usually zero elements in the list at this point.

Cheers,
Thomas


icinga-migration commented Sep 10, 2012

Updated by mfriedrich on 2012-09-10 09:41:37 +00:00

i meant the time frame before the while loop and afterwards - the question is whether the neb modules could fiddle with the list within that window.

if they don't (and only do that on special core events), it should be fine. though, i do not really remember how they do it, so before applying this, it is mandatory to check that.

besides, how about a config option for the max items in queue value? hardcoding 1000 might be too much/little?


icinga-migration commented Sep 10, 2012

Updated by mfriedrich on 2012-09-10 09:41:53 +00:00

  • Category set to Check Results
  • Status changed from New to Assigned
  • Assigned to set to mfriedrich
  • Target Version set to 1.8

icinga-migration commented Sep 10, 2012

Updated by tgelf on 2012-09-10 10:09:26 +00:00

dnsmichi wrote:

it was meant the time frame before the while loop and afterwards - question is if the neb modules could fiddle with the list in between that windows.

Afaik they don't; they (should) act only on callbacks. I only checked mod_gearman, and it looks fine - but I didn't try it out. If a NEB modified the list while process_check_result_queue is adding elements to it, far worse things could happen - it would probably have trouble with add_check_result_to_list :p

if they are not (only doing that on special core events), it should be fine. though, i do not really remember how they do it, so before applying this, it will be mandatory to check that.

besides, how about a config option for the max items in queue value? hardcoding 1000 might be too much/little?

I agree on this, something like max_checkresults_list_length would be great. Usually there will be 0 elements; I have chosen 1000 to speed things up when being stressed - otherwise it would first have to reap everything before reading new data. process_check_result_queue is still able to write tens of thousands of results to memory (everything in the checkresults folder), but it will then stop doing so until all of them have been reaped.


icinga-migration commented Sep 10, 2012

Updated by mfriedrich on 2012-09-10 13:55:11 +00:00

something like that.

# MAX CHECK RESULT LIST ITEMS !!EXPERIMENTAL!!
# This experimental option allows you to set the max number of items
# the checkresult reaper will put onto the checkresult list for further
# processing by the core. If there are too many, the reaping will be
# terminated early, allowing the core to process the results sooner.
# On larger setups, that list might grow too much, and decrease
# performance on processing. You might experiment with that value, the
# default is set to 1024 items at max.

#max_check_result_list_items=1024

icinga-migration commented Sep 10, 2012

Updated by tgelf on 2012-09-10 13:57:16 +00:00

Looks great. No setting or default setting (0) means former behaviour, no action.


icinga-migration commented Sep 10, 2012

Updated by mfriedrich on 2012-09-10 14:02:11 +00:00

good idea. i was not sure how to handle that with existing setups.

setting the default to 0, and checking that.

+       if (list_length > max_check_result_list_items && max_check_result_list_items != 0) {
+               log_debug_info(DEBUGL_CHECKS, 1, "Skipping check result queue as there are still work items (%d) for the reaper\n", max_check_result_list_items);
+               return OK;
+       }

icinga-migration commented Sep 10, 2012

Updated by tgelf on 2012-09-10 14:14:58 +00:00

I'd opt for putting everything in the if block, doing so completely avoids counting list items for those not using this feature:

+   int list_length = 0;
+   check_result *current = check_result_list;
+
+   if (max_check_result_list_items != 0) {
+       while (current != NULL) {
+           list_length++;
+           current = current->next;
+       }
+       if (list_length > max_check_result_list_items) {
+               log_debug_info(DEBUGL_CHECKS, 1, "Skipping check result queue as there are %d results not handled by the reaper\n", list_length);
+               return 0;
+       }
+   }

icinga-migration commented Sep 10, 2012

Updated by mfriedrich on 2012-09-10 14:50:44 +00:00

yep, good. i do not really have a "check result generator" right now - do you have one?


icinga-migration commented Sep 10, 2012

Updated by mfriedrich on 2012-09-10 14:53:47 +00:00

  • Tracker changed from Bug to Feature
  • Subject changed from Fix for core being busy with himself once overloaded to only reap results when checkresult_list is not overloaded (max_check_result_list_items)

icinga-migration commented Sep 10, 2012

Updated by tgelf on 2012-09-10 14:57:12 +00:00

dnsmichi wrote:

yep good. i do not really have a "check result generator" right on, do you have such?

Not really. "All" I have are live environments hammering check result data from distributed slaves to the master.


icinga-migration commented Sep 10, 2012

Updated by mfriedrich on 2012-09-10 15:03:01 +00:00

ok, so something you could test for me then :) i will push a revision to git soon.


icinga-migration commented Sep 10, 2012

Updated by mfriedrich on 2012-09-10 15:39:24 +00:00

mod_gearman merges the lists here
https://github.com/sni/mod_gearman/blob/master/neb_module/mod_gearman.c#L295

which is not wrapped in a lock, but is only called when the core signals that the checkresult reaper is about to run.

https://github.com/sni/mod_gearman/blob/master/neb_module/mod_gearman.c#L255

/* handle timed events */
static int handle_timed_events( int event_type, void *data ) {
    nebstruct_timed_event_data * ted = (nebstruct_timed_event_data *)data;

    /* sanity checks */
    if (event_type != NEBCALLBACK_TIMED_EVENT_DATA || ted == 0)
        return NEB_ERROR;

    /* we only care about REAPER events */
    if (ted->event_type != EVENT_CHECK_REAPER)
        return NEB_OK;

    gm_log( GM_LOG_TRACE, "handle_timed_events(%i, data)\n", event_type, ted->event_type );

    move_results_to_core();

    return NEB_OK;
}

/* handles a timed event */
int handle_timed_event(timed_event *event) {
        host *temp_host = NULL;
        service *temp_service = NULL;
        void (*userfunc)(void *);
        struct timeval tv;
        double latency = 0.0;
        /* make sure gcc3 won't hit here */
#ifndef GCCTOOOLD
        struct timeval start;
        gettimeofday(&start, NULL);
#endif 

        log_debug_info(DEBUGL_FUNCTIONS, 0, "handle_timed_event() start\n");

#ifdef USE_EVENT_BROKER
        /* send event data to broker */
        broker_timed_event(NEBTYPE_TIMEDEVENT_EXECUTE, NEBFLAG_NONE, NEBATTR_NONE, event, NULL);
#endif 

so this is safe, as it's a serialized event waiting for the callback itself, not fiddling with memory in between.

for dnx i cannot tell, but i guess it's quite the same mechanism, as anything else would break the core entirely.


icinga-migration commented Sep 23, 2012

Updated by mfriedrich on 2012-09-23 09:21:25 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

it's tagged experimental, so everyone needs to take care of it themselves, and they might provide feedback then.

@icinga-migration icinga-migration added this to the 1.8 milestone Jan 17, 2017
