[dev.icinga.com #4427] Persistent ido2db process after an ido2db service restart #1312
This issue has been migrated from Redmine: https://dev.icinga.com/issues/4427
Created by tontonitch on 2013-07-18 07:52:09 +00:00
Since upgrading from Icinga 1.8 to 1.9 (currently 1.9.3), one ido2db process is sometimes not stopped correctly after an ido2db restart. The problem first appeared with Icinga 1.9.0.
Consequently, sometimes there are 3 ido2db processes running:
Even if I stop the ido2db service, one process remains and I have to kill it (kill -9 52898).
This does not happen on every ido2db restart. I have tried to reproduce the problem with debugging enabled, but without success so far.
2014-01-03 19:46:41 +00:00 by (unknown) 9516b8c
2014-01-03 19:59:04 +00:00 by (unknown) 238aa46
2014-01-03 20:01:38 +00:00 by (unknown) b1ed17b
2014-01-09 22:28:36 +00:00 by (unknown) 5164ca6
2014-01-23 15:15:33 +00:00 by (unknown) 144a0b7
Updated by bigon on 2013-12-17 11:23:56 +00:00
I'm hitting this issue quite often on my infrastructure when the database is busy. It can lead to the new ido2db process getting stuck, which then blocks everything in the core.
I looked at the code and, IMHO, the problem is in the ido2db_parent_sighandler() function, which is racy. If the child is busy writing to the database, it might miss the kill signal. The parent then never receives SIGCHLD and so never waits for the child to die. In that case the rest of the function still runs, and ido2db_cleanup_socket() removes both the socket and the pidfile. Most init scripts rely on the pidfile to check whether the process has exited properly, and otherwise try to kill -9 it. That no longer works, because the pidfile is already gone.
IMHO, wait()/waitpid() should be called right after kill(), and the parent should keep waiting until all the children have died.
Edit: The same code seems to be present in the Nagios codebase.
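To illustrate the proposal above, here is a minimal sketch of "kill, then wait" for a single child. It is not the ido2db source: terminate_and_reap() and demo_terminate_and_reap() are hypothetical names, and the sleeping child is only a stand-in for a worker that is busy writing to the database.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not the ido2db source): send SIGTERM to one child,
 * then block until that child has been reaped.
 * Returns 0 once the child is gone, -1 on an unexpected error. */
int terminate_and_reap(pid_t child)
{
    /* ESRCH means the process is already gone; still try to reap it. */
    if (kill(child, SIGTERM) == -1 && errno != ESRCH)
        return -1;

    for (;;) {
        if (waitpid(child, NULL, 0) == child)
            return 0;                     /* child exited and was reaped */
        if (errno == EINTR)
            continue;                     /* interrupted by a signal: retry */
        return errno == ECHILD ? 0 : -1;  /* ECHILD: nothing left to reap */
    }
}

/* Usage example: fork a child, terminate it, and confirm it was reaped.
 * Returns 1 on success, 0 or -1 otherwise. */
int demo_terminate_and_reap(void)
{
    pid_t child = fork();
    if (child == -1)
        return -1;
    if (child == 0) {
        sleep(30);   /* stand-in for a child busy with database writes */
        _exit(0);
    }
    if (terminate_and_reap(child) != 0)
        return -1;
    /* A second wait fails with ECHILD only if the child was really reaped. */
    return waitpid(child, NULL, WNOHANG) == -1 && errno == ECHILD;
}
```

With this ordering, cleanup such as removing the socket and the pidfile would only run after the helper returns, so the pidfile never disappears while a child is still alive. Both kill() and waitpid() are async-signal-safe in POSIX, although blocking inside a signal handler remains a design trade-off.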
Updated by mfriedrich on 2014-01-03 20:03:18 +00:00
Regarding waitpid(), you're right: the parent process should wait for all child processes to exit properly before terminating itself (and return early if there are no children). I cannot reproduce the issue easily, but I've pushed your proposed fix to the current development tree.
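The "wait for all children, return early if there are none" behaviour can be sketched as follows. This is an illustration under assumptions, not the committed fix: reap_all_children() and demo_reap_all_children() are hypothetical names. The early return relies on waitpid() failing with ECHILD when the caller has no children.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not the committed fix): block until every child of
 * the calling process has been reaped.
 * Returns the number of children reaped, or -1 on an unexpected error.
 * With no children, the first waitpid() fails with ECHILD and the
 * function returns 0 immediately. */
int reap_all_children(void)
{
    int reaped = 0;

    for (;;) {
        pid_t pid = waitpid(-1, NULL, 0);   /* -1: wait for any child */
        if (pid > 0) {
            reaped++;
            continue;
        }
        if (errno == EINTR)
            continue;                       /* interrupted by a signal: retry */
        return errno == ECHILD ? reaped : -1;
    }
}

/* Usage example: returns 1 if three forked children are all reaped and a
 * second call returns early with 0. */
int demo_reap_all_children(void)
{
    /* Default SIGCHLD handling keeps exited children waitable. */
    signal(SIGCHLD, SIG_DFL);

    for (int i = 0; i < 3; i++) {
        pid_t pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0)
            _exit(0);                       /* child: exit immediately */
    }

    int first = reap_all_children();        /* waits for the three children */
    int second = reap_all_children();       /* no children left: early return */
    return first == 3 && second == 0;
}
```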