This repository has been archived by the owner. It is now read-only.

[dev.icinga.com #1259] implement circular buffer for idoutils #563

Closed
icinga-migration opened this Issue Mar 2, 2011 · 10 comments

Comments

@icinga-migration
Member

icinga-migration commented Mar 2, 2011

This issue has been migrated from Redmine: https://dev.icinga.com/issues/1259

Created by mfriedrich on 2011-03-02 08:52:04 +00:00

Assignee: (none)
Status: Rejected (closed on 2011-10-24 17:19:35 +00:00)
Target Version: (none)
Last Update: 2014-12-08 14:34:51 +00:00 (in Redmine)


the current approach of locking the dbuf, reading/writing the range and then doing memmove is a quick&dirty hack, not a desirable solution for a final release (see #866).

it comes to mind that the existing solution for external commands should be reused, as the algorithm is basically the same.

socket -> reader -> circular_buffer <- process_data -> rdbms

see the following examples. it might make sense to use the direct implementation from icinga itself, but i think we cannot prevent some code copies into ido2db.c or better, a new utils.c over there too.

include/icinga.h

/* CIRCULAR_BUFFER structure - used by worker threads */
typedef struct circular_buffer_struct{
        void            **buffer;
        int             tail;
        int             head;
        int             items;
        int             high;           /* highest number of items that has ever been stored in buffer */
        unsigned long   overflow;
        pthread_mutex_t buffer_lock;
        }circular_buffer;


/* DBUF structure - dynamic string storage */
typedef struct dbuf_struct{
        char *buf;
        unsigned long used_size;
        unsigned long allocated_size;
        unsigned long chunk_size;
        }dbuf;


/* slots in circular buffers */
#define DEFAULT_EXTERNAL_COMMAND_BUFFER_SLOTS     4096

base/commands.c

this is the function that processes the data in the buffer, inside its while loop.

int check_for_external_commands(void){

maybe allow the same setting as the core: -1 = check as often as possible.

base/events.c

                        /* check for external commands if we're supposed to check as often as possible */
                        if(command_check_interval==-1)
                                check_for_external_commands();

the socketreader

base/utils.c

basically adapt the main initialisation to create and clear the circular buffer.

int init_command_file_worker_thread(void){
int shutdown_command_file_worker_thread(void){
void cleanup_command_file_worker_thread(void *arg){

this is not a worker thread in ido2db, but the main thread processing the data from the socket (in the core case a named pipe)

void * command_file_worker_thread(void *arg){
  • loop if no data
  • get buffer items from circular buffer
  • if no buffer items, sleep a bit before polling again (so the cpu doesn't get hammered by polling the socket with no sleep in between - this is the main problem in #866 too)
  • loop through data
    • try to submit the received data and check if the circular buffer is full
      • if not successful, sleep and retry (loop)
    • if the buffer is now full, bail out of the socket reading loop
  • if the buffer is full, sleep a bit and then resume the loop

submitting the data is basically an insert into the circular buffer.

int submit_external_command(char *cmd, int *buffer_items){
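the sleep-and-retry part of the loop above can be sketched independently of the buffer itself. submit_with_retry and demo_submit are invented names for this example; the real submit_external_command() inserts into the circular buffer directly.

```c
#include <stddef.h>
#include <time.h>

/* sketch of the sleep-and-retry policy described above; the names
   submit_with_retry and demo_submit are invented for the example */

/* short pause between retries so the cpu is not hammered (the #866 problem) */
static void brief_sleep(void) {
        struct timespec ts = { 0, 50 * 1000 * 1000 };   /* 50 ms */
        nanosleep(&ts, NULL);
}

/* submit_fn returns 0 on success, -1 when the circular buffer is full */
int submit_with_retry(int (*submit_fn)(void *ctx, const char *line),
                      void *ctx, const char *line, int max_retries) {
        int tries;
        for (tries = 0; tries <= max_retries; tries++) {
                if (submit_fn(ctx, line) == 0)
                        return 0;
                brief_sleep();          /* buffer full - back off, don't spin */
        }
        return -1;                      /* give up; caller logs an overflow */
}

/* demo stub: pretends the buffer is full on the first two attempts */
static int demo_calls = 0;
int demo_submit(void *ctx, const char *line) {
        (void)ctx;
        (void)line;
        return (demo_calls++ < 2) ? -1 : 0;
}
```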

===

so handling the data from the socket the same way as the command pipe (only s/cmdfile/socket/) will do the trick, and not increase cpu load as it does now in #866

Changesets

2011-09-25 00:12:03 +00:00 by mfriedrich a34a899

* idoutils: idomod: change output_buffer to asynchronous circular buffer between core and data sink, removing core blocking #1259

** main thread writes circular buffer, protected by a mutex
** client thread is started after event loop is starting, reading from circular buffer
** the config dump starts when the event loop starts, minimizing the overall startup delay
** on event loop start, idomod will now tell how many items to flush - please adjust this value in idomod.cfg accordingly
** default minimal output_buffer_items is now 50000 (see idomod.cfg)
** core blocking is removed, even if the database backend is slow / ido2db not responding

for more please read https://dev.icinga.org/issues/1259

refs #1259
refs #866

2011-09-27 15:56:09 +00:00 by mfriedrich 412e220

idoutils: make idomod debug logging threadsafe with a mutex

refs #1259

2011-09-27 23:30:19 +00:00 by mfriedrich b419560

idoutils: add threaded circular buffer to ido2db, same as idomod #1259

main problem - the overall dump is now not synchronous but asynchronous.
this requires finetuning on both sides, but the overall core blocking
seems to be removed all over.

furthermore, all functions will need to be made thread safe.
and the ido2db code needs to be cleaned of the old attempts.

and there shouldn't be retries on the buffer if full anymore,
the 2 buffers should just go into blocking mode.

refs #1259

@icinga-migration

Member Author

icinga-migration commented Apr 27, 2011

Updated by mfriedrich on 2011-04-27 16:57:58 +00:00

  • Target Version changed from 1.4 to 1.5

second attempt was not working either, so postponing that stuff for 1.5

@icinga-migration

Member Author

icinga-migration commented May 27, 2011

Updated by mfriedrich on 2011-05-27 16:33:33 +00:00

  • Status changed from Assigned to Feedback
  • Assigned to deleted mfriedrich
  • Target Version deleted 1.5
@icinga-migration

Member Author

icinga-migration commented Sep 20, 2011

Updated by mfriedrich on 2011-09-20 06:45:18 +00:00

  • File added ndoutils-async-v1.1.patch

idea for a message queue. the actual problem with that: if the kernel can't handle more, the msg is lost and the kernel complains, meaning there's no actual buffering on disk. furthermore this solution attempts to fork another child for the asynchronous processing, which would be dangerous as the existing housekeeping thread would then be forked too - there should be only one thread run by the main process for data processing.

another idea would be 0mq as a new dependency, removing the low-level msg queue layer and keeping an eye on the buffering.

Patch adds asynchronous database queries to the ndoutils tool. Tested against version 1.4b9 with 5k service checks per minute.
It is recommended to remove all message queues owned by user nagios before ndoutils startup.

This can be achieved by script:
for i in `ipcs -q | grep nagios | awk '{print $2}'`; do ipcrm -q $i; done



It is strongly recommended to tune the message queue parameters:

kernel.msgmnb

kernel.msgmax

kernel.msgmni

Errors of the asynchronous process are logged via syslog. E.g.

ndo2db-3x: Error: queue send error. - means that the process cannot put a message into the queue.
@icinga-migration

Member Author

icinga-migration commented Sep 25, 2011

Updated by mfriedrich on 2011-09-25 00:09:17 +00:00

  • Subject changed from implement circular buffer for ido2db to implement circular buffer for idoutils

another idea would be a circular buffer within idomod - the synchronous buffer is already there. the overall construct of the data sink with reconnecting, flushing etc. should be considered deprecated once the main thread acts as producer and a dedicated thread as consumer on that buffer, secured by a mutex lock.

1/ main problem - the core daemonizes by default, and threads started through idomod_init will die silently. solution - register a callback on PROCESS_DATA and catch NEBTYPE_PROCESS_EVENTLOOPSTART, where we start the queue thread for consuming data
2/ this invokes another problem - the initial buffer must be large enough to hold a full dump
3/ the write-to-sink retries should be set so they don't fully block the core on waiting, but bail out and warn. currently a retry count of 5 is set, but mostly it should be infinite either way.
4/ all push and pop functions on the buffer are not threadsafe, proper mutex locking is required
5/ the flush-on-reconnect functionality needs to be completely removed, as the buffer is used differently now (meaning - always, because it's multithreaded)
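point 1/ can be sketched as a guard in the PROCESS_DATA callback. DEMO_EVENTLOOPSTART and the handler name below are stand-ins, not the real NEB API constants or signatures:

```c
#include <pthread.h>

/* sketch of point 1/ - start the consumer thread only once the event loop
   is running, so daemonizing cannot kill it silently. DEMO_EVENTLOOPSTART
   stands in for NEBTYPE_PROCESS_EVENTLOOPSTART and handle_process_data for
   the real broker callback. */
#define DEMO_EVENTLOOPSTART 1

static pthread_t queue_thread;
static int queue_thread_running = 0;

static void *queue_thread_body(void *arg) {
        (void)arg;      /* would loop here, popping the circular buffer */
        return NULL;
}

/* called from the PROCESS_DATA broker callback; returns 1 once the
   consumer thread is up */
int handle_process_data(int event_type) {
        if (event_type == DEMO_EVENTLOOPSTART && !queue_thread_running) {
                if (pthread_create(&queue_thread, NULL,
                                   queue_thread_body, NULL) == 0)
                        queue_thread_running = 1;
        }
        return queue_thread_running;
}
```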

config
testconfig, 200 hosts, 4k services, lots of groups. objects.cache line count: 149767

screen1

  1. service icinga stop; rm -f /var/nagios/idomod.tmp ; rm -f /data/icinga/idomod.debug ; service icinga start

screen2

  1. tail -f /data/icinga/idomod.debug | grep 'buffer items'
[1316903557.156929] [001.2] [pid=32055] idomod_write_to_sink_queue() buffer items: 8725/65535
[1316903557.156935] [001.2] [pid=32055] idomod_sink_buffer_push() start
[1316903557.156941] [001.2] [pid=32055] idomod_sink_buffer_push() end
[1316903557.156947] [001.2] [pid=32055] idomod_write_to_sink_queue() success
[1316903557.159681] [001.2] [pid=32055] idomod_broker_data() end
[1316903557.159982] [001.2] [pid=32055] idomod_broker_data() handle event
[1316903557.159993] [001.2] [pid=32055] idomod_init_event_loop_start() start
[1316903557.160020] [001.2] [pid=32055] idomod_init() Successfully created queue thread with thread id 1117038912
[1316903557.160041] [001.2] [pid=32055] idomod_write_to_sink_queue() start
[1316903557.160048] [001.2] [pid=32055] idomod_write_to_sink_queue() buf: 
200:
1=105
2=0
3=0
4=1316903557.159708
105=Icinga
107=1.5.0
104=08-24-2011
102=32055
999
[1316903557.160054] [001.2] [pid=32055] idomod_write_to_sink_queue() buffer items: 8726/65535
[1316903557.160060] [001.2] [pid=32055] idomod_sink_buffer_push() start
[1316903557.160066] [001.2] [pid=32055] idomod_sink_buffer_push() end
[1316903557.160071] [001.2] [pid=32055] idomod_write_to_sink_queue() success
[1316903557.160270] [001.2] [pid=32055] idomod_read_from_sink_queue() started with thread id 1117038912
[1316903557.160284] [001.2] [pid=32055] idomod_read_from_sink_queue() buffer items: 8727/65535
[1316903557.160293] [001.2] [pid=32055] idomod_sink_buffer_pop() start
[1316903557.160299] [001.2] [pid=32055] idomod_sink_buffer_pop() end
[1316903557.160304] [001.2] [pid=32055] idomod_read_from_sink_queue() buffer: 

[1316903557.160312] [001.2] [pid=32055] idomod_write_to_sink() start
[1316903557.160317] [001.2] [pid=32055] idomod_write_to_sink(
)
[1316903557.160325] [001.2] [pid=32055] idomod_open_sink() start
[1316903557.160377] [001.2] [pid=32055] idomod_open_sink() end
[1316903557.160395] [001.2] [pid=32055] idomod_hello_sink() start
[1316903557.160408] [001.2] [pid=32055] idomod_write_to_sink_queue() start
[1316903557.160414] [001.2] [pid=32055] idomod_write_to_sink_queue() buf: 

HELLO
PROTOCOL: 2
AGENT: IDOMOD
AGENTVERSION: 1.5.0
STARTTIME: 1316903557
DISPOSITION: REALTIME
CONNECTION: UNIXSOCKET
CONNECTTYPE: INITIAL
INSTANCENAME: nagios-domentw
STARTDATADUMP


[1316903557.160420] [001.2] [pid=32055] idomod_write_to_sink_queue() buffer items: 8726/65535
[1316903557.160425] [001.2] [pid=32055] idomod_sink_buffer_push() start
[1316903557.160431] [001.2] [pid=32055] idomod_sink_buffer_push() end
[1316903557.160436] [001.2] [pid=32055] idomod_write_to_sink_queue() success
[1316903557.162706] [001.2] [pid=32055] idomod_write_runtime_variables() start
[1316903557.162735] [001.2] [pid=32055] idomod_write_to_sink_queue() start
[1316903557.162745] [001.2] [pid=32055] idomod_write_to_sink_queue() buf: 
303:
4=1316903557.162730

[1316903557.162755] [001.2] [pid=32055] idomod_write_to_sink_queue() buffer items: 8727/65535
[1316903557.162764] [001.2] [pid=32055] idomod_sink_buffer_push() start
[1316903557.162774] [001.2] [pid=32055] idomod_sink_buffer_push() end
[1316903557.162783] [001.2] [pid=32055] idomod_write_to_sink_queue() success
[1316903557.162797] [001.2] [pid=32055] idomod_hello_sink() end
[1316903557.162932] [001.2] [pid=32055] idomod_broker_data() handle event
[1316903557.162951] [001.2] [pid=32055] idomod_write_to_sink_queue() start
[1316903557.162961] [001.2] [pid=32055] idomod_write_to_sink_queue() buf: 
202:
1=300
2=0
3=0
4=1316903557.162904
73=1316903557
74=262144
72=idomod: Successfully connected to data sink.  8726 queued items to flush.
999

[1316903557.167712] [001.2] [pid=32055] idomod_read_from_sink_queue() buffer: 
202:
1=300
2=0
3=0
4=1316903541.987050
73=1316903541
74=262144
72=Event broker module '/usr/bin/idomod.o' initialized successfully.
999


[1316903557.167719] [001.2] [pid=32055] idomod_write_to_sink() start
[1316903557.167724] [001.2] [pid=32055] idomod_write_to_sink(
202:
1=300
2=0
3=0
4=1316903541.987050
73=1316903541
74=262144
72=Event broker module '/usr/bin/idomod.o' initialized successfully.
999

)

the basic difference is now that idomod does not connect to the socket immediately, but on event loop start (then flushing the buffer onto the socket while asynchronously adding more data to it).

from the syslog - TIMEPERIOD TRANSITION is output by mklivestatus, indicating that the event loop has started.

Sep 25 00:32:21 xxx icinga: Icinga 1.5.0 starting... (PID=32053)
Sep 25 00:32:21 xxx icinga: Local time is Sun Sep 25 00:32:21 CEST 2011
Sep 25 00:32:21 xxx icinga: LOG VERSION: 2.0 
Sep 25 00:32:21 xxx icinga: livestatus: Livestatus 1.1.9i8 by Mathias Kettner. Socket: '/var/nagios/rw/live'
Sep 25 00:32:21 xxx icinga: Event broker module '/usr/lib/mk-livestatus/livestatus.o' initialized successfully.
Sep 25 00:32:21 xxx icinga: idomod: IDOMOD 1.5.0 (08-24-2011) Copyright (c) 2005-2008 Ethan Galstad (nagios@nagios.org), Copyright (c) 2009-2011 Icinga Development Team (http://www.icinga.org))
Sep 25 00:32:21 xxx icinga: Event broker module '/usr/bin/idomod.o' initialized successfully.
Sep 25 00:32:22 xxx icinga: Finished daemonizing... (New PID=32055) 
Sep 25 00:32:30 xxx icinga: TIMEPERIOD TRANSITION: 24x7;-1;1
Sep 25 00:32:30 xxx icinga: TIMEPERIOD TRANSITION: notification_period;-1;1
Sep 25 00:32:30 xxx icinga: TIMEPERIOD TRANSITION: test_timeperiod;-1;1
Sep 25 00:32:30 xxx icinga: TIMEPERIOD TRANSITION: workhours;-1;0
Sep 25 00:32:37 xxx ido2db: Client connected, data available. 
Sep 25 00:32:37 xxx ido2db: Handling client connection... 
Sep 25 00:32:37 xxx icinga: idomod: Successfully connected to data sink.  8726 queued items to flush.
Sep 25 00:32:37 xxx ido2db: Successfully connected to mysql database

so 16 seconds startup and a ready-to-start-checking core on a vmware dualcore with 2gb ram ;-)

(btw, this includes data_processing_options = 1, i.e. also timed events, which kill everything)

dropping timedevents, it's even less

Sep 25 00:53:21 xxx icinga: Icinga 1.5.0 starting... (PID=5936)

Sep 25 00:53:22 xxx icinga: Finished daemonizing... (New PID=5938)

Sep 25 00:53:32 xxx icinga: TIMEPERIOD TRANSITION: 24x7;-1;1
Sep 25 00:53:32 xxx icinga: TIMEPERIOD TRANSITION: notification_period;-1;1
Sep 25 00:53:32 xxx icinga: TIMEPERIOD TRANSITION: test_timeperiod;-1;1
Sep 25 00:53:32 xxx icinga: TIMEPERIOD TRANSITION: workhours;-1;0

Sep 25 00:53:32 xxx icinga: idomod: Successfully connected to data sink.  4509 queued items to flush.

so 11 seconds all over. it could be less, if the daemonizing would happen differently.

the problem with that is of course - a full buffer requires a lot of work from ido2db to handle that data properly. so the final data in the database will take its time.

still, one more problem. it wouldn't be idoutils otherwise.

starting the queue thread after the config dump would cause problems. so the overall config dump (which is nothing more than looping through lists loaded in memory and/or reading from file if demanded) won't be started when the config is loaded, but only when the queue thread is started, as a "post processing" callback. the separation of which config gets processed happens within the called functions.

        /* POST PROCESSING... */

        switch (event_type) {

        case NEBCALLBACK_PROCESS_DATA:

                procdata = (nebstruct_process_data *)data;

                /* dump everything on event loop start if applicable */
                if (procdata->type == NEBTYPE_PROCESS_EVENTLOOPSTART) {
                        idomod_write_config_files();
                        idomod_write_config(IDOMOD_CONFIG_DUMP_ORIGINAL);
                        idomod_write_config(IDOMOD_CONFIG_DUMP_RETAINED);
                        idomod_write_runtime_variables();
                }

                break;

benefit once more - moving the config dump after the event loop actually removes more blocking on startup, the delay is now ~2sec.

Sep 25 01:52:00 xxx icinga: Icinga 1.5.0 starting... (PID=14837)

Sep 25 01:52:00 xxx icinga: idomod: IDOMOD 1.5.0 (08-24-2011) Copyright (c) 2005-2008 Ethan Galstad (nagios@nagios.org), Copyright (c) 2009-2011 Icinga Development Team (http://www.icinga.org))
Sep 25 01:52:00 xxx icinga: Event broker module '/usr/bin/idomod.o' initialized successfully.

Sep 25 01:52:02 xxx icinga: TIMEPERIOD TRANSITION: 24x7;-1;1
Sep 25 01:52:02 xxx icinga: TIMEPERIOD TRANSITION: notification_period;-1;1
Sep 25 01:52:02 xxx icinga: TIMEPERIOD TRANSITION: test_timeperiod;-1;1
Sep 25 01:52:02 xxx icinga: TIMEPERIOD TRANSITION: workhours;-1;0

Sep 25 01:52:02 xxx ido2db: Client connected, data available. 
Sep 25 01:52:02 xxx ido2db: Handling client connection... 
Sep 25 01:52:02 xxx icinga: idomod: Successfully connected to data sink.  100 queued items to flush.
Sep 25 01:52:02 xxx ido2db: Successfully connected to mysql database

the data in the database is now verified ok :)

known bug: if not all retained states are dumped to the database, a check needs to pass the actual state into the database (seen on the first startup only; afterwards the status tables remain filled). this is a problem with dumping retained states to the database.

* idoutils: idomod: change output_buffer to asynchronous circular buffer between core and data sink, removing core blocking #1259
** main thread writes circular buffer, protected by a mutex
** client thread is started after event loop is starting, reading from circular buffer
** the config dump starts when the event loop starts, minimizing the overall startup delay
** on event loop start, idomod will now tell how many items to flush - please adjust this value in idomod.cfg accordingly
** default minimal output_buffer_items is now 50000 (see idomod.cfg)
** core blocking is removed, even if the database backend is slow / ido2db not responding

will be pushed to mfriedrich/ido + dev/ido

> test, test, test!

@icinga-migration

Member Author

icinga-migration commented Sep 25, 2011

Updated by mfriedrich on 2011-09-25 00:09:31 +00:00

  • Status changed from Feedback to Assigned
  • Assigned to set to mfriedrich
  • Target Version set to 1.6
  • Done % changed from 0 to 80
@icinga-migration

Member Author

icinga-migration commented Sep 27, 2011

Updated by mfriedrich on 2011-09-27 21:49:33 +00:00

by reports from testers, this needs to be enhanced somehow. the core does not block anymore now, but the socket will, because ido2db / the rdbms is not fast enough. so an end-to-end buffer would be needed: one buffering on the core side if the connection dies, and one on the ido2db side, buffering data between the socket and the database. with 2 asynchronous buffers, the lock (and wait) can be established again, and the core blocking would be minor, with 2 locations to tune for performance. but beware, threading and circular buffers will make debugging harder.
furthermore, various functionality needs to be cached in the ido2db daemon instead of querying the database over and over.
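such buffers can block instead of retrying (the direction also noted in the changesets above) by pairing the mutex with condition variables. a minimal sketch, with all names invented for the example - not the actual idomod/ido2db code:

```c
#include <pthread.h>
#include <stdlib.h>

/* sketch: a bounded buffer that blocks the producer when full and the
   consumer when empty, instead of sleep-and-retry loops */
typedef struct {
        void **slots;
        int size, head, tail, items;
        pthread_mutex_t lock;
        pthread_cond_t not_full, not_empty;
} blocking_buffer;

int bb_init(blocking_buffer *b, int size) {
        b->slots = calloc(size, sizeof(void *));
        if (b->slots == NULL)
                return -1;
        b->size = size;
        b->head = b->tail = b->items = 0;
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->not_full, NULL);
        pthread_cond_init(&b->not_empty, NULL);
        return 0;
}

void bb_push(blocking_buffer *b, void *item) {
        pthread_mutex_lock(&b->lock);
        while (b->items == b->size)             /* full: block the producer */
                pthread_cond_wait(&b->not_full, &b->lock);
        b->slots[b->head] = item;
        b->head = (b->head + 1) % b->size;
        b->items++;
        pthread_cond_signal(&b->not_empty);
        pthread_mutex_unlock(&b->lock);
}

void *bb_pop(blocking_buffer *b) {
        void *item;
        pthread_mutex_lock(&b->lock);
        while (b->items == 0)                   /* empty: block the consumer */
                pthread_cond_wait(&b->not_empty, &b->lock);
        item = b->slots[b->tail];
        b->tail = (b->tail + 1) % b->size;
        b->items--;
        pthread_cond_signal(&b->not_full);
        pthread_mutex_unlock(&b->lock);
        return item;
}
```

note that a blocking producer reintroduces core blocking when ido2db stalls, which is exactly the trade-off discussed in this comment.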

@icinga-migration

Member Author

icinga-migration commented Sep 27, 2011

Updated by mfriedrich on 2011-09-27 21:49:58 +00:00

  • Done % changed from 80 to 50
@icinga-migration

Member Author

icinga-migration commented Sep 27, 2011

Updated by mfriedrich on 2011-09-27 23:28:43 +00:00

# rm -f /var/icinga/idomod.tmp ; rm -f /data/icinga/idomod.debug /data/icinga/ido2db.debug ; service ido2db restart ; service icinga restart

the circular buffer requires new config options in ido2db.cfg (defaults will be 50k items and /tmp/ido2db.tmp).

looking good. but buffering also shows that the config dump itself takes ages. probably don't drop the config on restart/reload, remove the checks on is_active, and fake the web status data from the old status instead.

@icinga-migration

Member Author

icinga-migration commented Oct 24, 2011

Updated by mfriedrich on 2011-10-24 17:19:35 +00:00

  • Status changed from Assigned to Rejected
  • Assigned to deleted mfriedrich
  • Target Version deleted 1.6

see #1934 for a deeper analysis of the data dumping and a proof of concept.

it just does NOT work with this design. multiple workers or circular buffers won't make it happen when there are so many things to be done on an insert or update, and when you are using singlethreaded db libraries.

=> the proof of concept confirms it => rejected.

@icinga-migration

Member Author

icinga-migration commented Dec 8, 2014

Updated by mfriedrich on 2014-12-08 14:34:51 +00:00

  • Project changed from 18 to Core, Classic UI, IDOUtils
  • Category changed from 25 to IDOUtils