[dev.icinga.com #10410] OpenBSD: hang during ConfigItem::ActivateItems() in daemon startup #3517
This issue has been migrated from Redmine: https://dev.icinga.com/issues/10410
Created by sthen on 2015-10-20 14:07:15 +00:00
icinga2 startup has been broken on OpenBSD for a while. It gets to ConfigItem::ActivateItems() but hangs after the DynamicObject::RestoreObjects() call. I eventually tracked down the commit that introduced it; reverting the fix for https://dev.icinga.org/issues/7769 allows startup to complete correctly. I don't know how to track it down further, but please let me know if any particular information would be useful or if there's anything you'd like me to try.
2015-10-20 20:55:16 +00:00 by (unknown) 9002f80
2015-10-20 21:02:11 +00:00 by (unknown) 3c6f0e3
2015-10-21 05:02:49 +00:00 by (unknown) cb230a0
2015-10-21 07:18:52 +00:00 by (unknown) e93dd3c
2015-11-09 19:39:26 +00:00 by (unknown) 9ea51aa
2015-11-10 10:41:21 +00:00 by (unknown) 0a6505c
Updated by gbeutner on 2015-10-20 19:33:56 +00:00
Can you attach gdb to the icinga2 process and show me a stacktrace (thread apply all bt full)?
FWIW I've got the support/2.3 branch working on OpenBSD and it doesn't seem to be hanging (it's executing checks just fine):
Updated by gbeutner on 2015-10-20 20:57:00 +00:00
Alright, I can definitely see how 7d93788 would cause this. The problem is we're holding a lock on a mutex while calling fork(). Now that I think about it I'm fairly certain I've seen a similar issue on AIX.
This doesn't seem to quite fix everything yet, though: the child process dies after a few seconds (SIGBUS). Still looking into that one. :)
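For readers following along, here is a minimal sketch of the general pattern of releasing a mutex before calling fork(). It is not the patch from 9002f80 or 3c6f0e3 and not Icinga 2 code; `g_state_mutex`, `g_counter` and `ForkWithoutLock` are hypothetical names made up for illustration, and the sketch assumes a single-threaded caller.

```cpp
#include <mutex>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical shared state; a stand-in, not Icinga 2's actual mutex.
static std::mutex g_state_mutex;
static int g_counter = 0;

// Updates shared state under the lock, releases the lock, and only then forks.
// Returns the child's exit status, or -1 on failure.
int ForkWithoutLock()
{
    int snapshot;
    {
        std::lock_guard<std::mutex> lock(g_state_mutex);
        snapshot = ++g_counter;
    } // lock released here, before fork()

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        // Child: the mutex was unlocked when the process was copied,
        // so taking it again cannot hang.
        std::lock_guard<std::mutex> lock(g_state_mutex);
        _exit(snapshot == g_counter ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `ForkWithoutLock()` should return 0 when the child could take the lock and saw the same counter value. In a process with several threads this ordering alone is not sufficient, because another thread may hold the lock at the moment of the fork.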
Updated by gbeutner on 2015-10-20 21:03:20 +00:00
Ok, the second version of the patch should be working now (3c6f0e3). Can you try whether this fixes your problem? :)
Updated by sthen on 2015-10-20 21:32:49 +00:00
This fixes the hang, but I think something's still not quite right. Nothing is written to main-log until I shut down i2, and then the following is logged (everything appears in the file at the same time):
Compare this with the case where 7d93788 is backed out completely; there I get this written to the log immediately at startup
and then at shutdown
Updated by gbeutner on 2015-10-21 09:03:34 +00:00
Btw, on an unrelated note: You can use -DFLEX_EXECUTABLE=/usr/local/bin/gflex, i.e. you don't need to create symlinks. :P
Updated by sthen on 2015-10-21 20:01:43 +00:00
Possibly related to this, so I've kept it in this ticket for now but I can move it to a separate ticket if preferred. I'm preparing a port of icingaweb2 so I've enabled the command pipe to test this (and work out how to handle our default chroot jail for PHP). When restarting the daemon after enabling this, it results in a similar problem, i.e. 'icinga2 daemon -d' fails to finish daemonizing.
It has created the pipe at this point, and leaves me with two processes running:
pid 1931 is here:
(gdb) quit
I haven't been able to get a backtrace from the second process; gdb hangs when I try to attach to it (in top, the process's wait channel is 'fdlock'). I'll have a poke and see if I can come up with some more information, but I wonder if you have any ideas?
Updated by mfriedrich on 2015-11-09 14:57:40 +00:00
The ExternalCommandListener feature spawns a new thread which tries to open a FIFO file. Ordinarily this open() call should only block that thread until another process opens the FIFO file for writing. However, apparently other unrelated open() calls in other threads (e.g. main-log as file logger feature) are also blocked.
As far as I understand this would appear to be a bug in the open() syscall on OpenBSD.
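The ordinary FIFO behaviour described above (a read-only open() waits until a writer opens the other end) can be observed with a small sketch. This is not Icinga 2 code: `ReaderBlocksUntilWriter` and the temporary path are hypothetical, and a forked child stands in for the listener thread so the example builds without linking a thread library. It shows the expected semantics only and does not reproduce the OpenBSD problem.

```cpp
#include <cstdlib>
#include <string>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if a read-only open() on a FIFO stayed blocked until a
// writer opened the same FIFO.
bool ReaderBlocksUntilWriter()
{
    char dirTemplate[] = "/tmp/fifo-demo-XXXXXX";
    if (mkdtemp(dirTemplate) == nullptr)
        return false;
    const std::string path = std::string(dirTemplate) + "/cmd.fifo";
    if (mkfifo(path.c_str(), 0600) != 0) {
        rmdir(dirTemplate);
        return false;
    }

    pid_t reader = fork();
    if (reader == 0) {
        // Child: blocks here because nobody has the FIFO open for writing yet.
        int fd = open(path.c_str(), O_RDONLY);
        _exit(fd >= 0 ? 0 : 1);
    }

    bool result = false;
    if (reader > 0) {
        usleep(200 * 1000); // give the child time to reach open()
        int status = 0;
        const bool stillBlocked = waitpid(reader, &status, WNOHANG) == 0;
        if (stillBlocked) {
            // Opening the write end lets the child's open() return.
            int wfd = open(path.c_str(), O_WRONLY);
            const bool finished = waitpid(reader, &status, 0) == reader
                && WIFEXITED(status) && WEXITSTATUS(status) == 0;
            if (wfd >= 0)
                close(wfd);
            result = finished;
        }
    }

    unlink(path.c_str());
    rmdir(dirTemplate);
    return result;
}
```

On a system with the expected semantics, the child is still waiting after the short pause and exits cleanly once the parent opens the write end.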
Updated by sthen on 2015-11-10 11:23:37 +00:00
I asked around about this. Turns out the syscall bug in OpenBSD is known. There was an attempt at fixing it in the past (kern/kern_descrip.c 1.105, kern/vfs_syscalls.c 1.203, sys/filedesc.h 1.26), but there were problems with the attempted fix (commit log for the revert was "causes all new processes to get stuck after a while").
An alternative workaround, changing O_RDONLY to O_RDWR, was suggested - so though it's dirty, you could pull out the commit and do something like
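The snippet that followed is not preserved in this copy. Purely to illustrate the O_RDWR idea, and not as the suggested change itself, here is a hedged sketch; `OpenFifoReadWrite` and `OpenReturnsWithoutWriter` are hypothetical helpers. Note that POSIX leaves O_RDWR on a FIFO undefined, so this relies on platform behaviour (Linux and the BSDs accept it).

```cpp
#include <cstdlib>
#include <string>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Opens a FIFO for both reading and writing, so open() does not have to
// wait for another process. Returns the descriptor, or -1 on error.
int OpenFifoReadWrite(const std::string& path)
{
    return open(path.c_str(), O_RDWR);
}

// Creates a throwaway FIFO and checks that the read/write open succeeds
// even though no other process has the FIFO open.
bool OpenReturnsWithoutWriter()
{
    char dirTemplate[] = "/tmp/fifo-rdwr-XXXXXX";
    if (mkdtemp(dirTemplate) == nullptr)
        return false;
    const std::string path = std::string(dirTemplate) + "/cmd.fifo";
    if (mkfifo(path.c_str(), 0600) != 0) {
        rmdir(dirTemplate);
        return false;
    }

    const int fd = OpenFifoReadWrite(path);
    const bool ok = fd >= 0;
    if (ok)
        close(fd);

    unlink(path.c_str());
    rmdir(dirTemplate);
    return ok;
}
```

One side effect of this approach: because the process holds a write end itself, a later read() on the descriptor never reports end-of-file when external writers close the FIFO.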
Updated by gbeutner on 2015-11-10 11:26:51 +00:00
I think I may have already found a portable fix for the "100% cpu" problem on Linux:
I'll need to test this a bit more on OS X/Linux - but I think this is the cleanest solution for this problem.