libunbound: subprocess reap causes parent process reap to hang #775
Libunbound uses file descriptors to communicate with the created thread. Also, when a process is used, a file descriptor pipe is used to communicate with it. When fork is called I am not sure which threads get duplicated, but the file descriptors come along, and thus the context in one process connects to the original thread, or both threads if that thread got duplicated too. This is what causes the issues, I believe. Perhaps a code change to help with that would be to set the close-on-exec flag for the threaded pipe file descriptors in libunbound, so that after an exec following the fork the file descriptors are closed. This probably causes the worker thread to exit because the command channel has closed, and it may then print errors from functions that call it, or something along these lines. But then the command channel is not duplicated across the forked processes.
That wouldn’t address the specific breakage I’ve seen, which doesn’t involve an exec. Is there an intended behaviour when a ub_ctx gets used/closed from a different process than the one that created it? I’m wondering if this is undefined behaviour that has just “happened” not to hang for us until now.
No, there is no intended behaviour involving different processes. Since the command channel refers to another process, I would want it closed, instead of sending messages or waiting for messages on it. The close-on-exec flag then does not do what is needed here. Since the other process has a copy, it is nearly as if there is a forked process context, but not really; it has flags internally. For such a context, I guess it could even be used by one or the other process, but not both, after fork.
The fix adds a check for the pid. If the pid has changed, there was a fork operation, and then the cleanup process does not mess up the running thread in the other process. With that fix the example code works, and it stops neatly. |
Thank you, @wcawijngaards! |
Thanks @wcawijngaards &amp; @FGasper, I confirm that this patch solves the issue we are seeing on AlmaLinux 9 when the forked process still holds the socket open (but does not use it).
* nlnet/master:
  - Fix for the ignore of tcp events for closed comm points, preserve the use after free protection features.
  - Ignore expired error responses.
  - Fix NLnetLabs#779: [doc] Missing documentation in ub_resolve_event() for callback parameter was_ratelimited.
  - Complementary fix for distutils.sysconfig deprecation in Python 3.10 to commit 62c5039.
  - iana portlist update.
  - Fix NLnetLabs#775: libunbound: subprocess reap causes parent process reap to hang.
  - Fix to make sure to not read again after a tcp comm point is closed.
  - Fix to ignore tcp events for closed comm points.
  - Fix use after free when WSACreateEvent() fails.
Describe the bug
The following causes a hung process on AlmaLinux 9 and Ubuntu 22:
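A minimal sketch consistent with the report and the discussion (create a ub_ctx, fork, let the child reap it first, then reap it in the parent) might look like this. It is a reconstruction, not the reporter's exact code, and the async query used to start the background worker is an assumption:

```c
/* Hedged reconstruction of the reproducer: both the forked child and the
 * parent call ub_ctx_delete() on the same context. On affected systems the
 * parent's ub_ctx_delete() hangs. Sketch only, not the original code. */
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
#include <unbound.h>

/* Callback for the async query; the query itself is only here (as an
 * assumption) to make sure the background worker has been started. */
static void dummy_cb(void *mydata, int err, struct ub_result *result) {
    (void)mydata; (void)err;
    if (result) ub_resolve_free(result);
}

int main(void) {
    struct ub_ctx *ctx = ub_ctx_create();
    if (!ctx) return 1;
    ub_ctx_async(ctx, 1);                 /* use a background thread */
    int async_id = 0;
    ub_resolve_async(ctx, "example.com", 1 /* A */, 1 /* IN */,
                     NULL, dummy_cb, &async_id);

    pid_t child = fork();
    if (child == 0) {
        ub_ctx_delete(ctx);   /* child reaps first and gets the response */
        _exit(0);
    }
    waitpid(child, NULL, 0);
    ub_ctx_delete(ctx);       /* parent's delete blocks on a read() here */
    return 0;
}
```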
To reproduce
Steps to reproduce the behavior:
Run the above code, compiled against unbound.h and linked to libunbound.

Expected behavior
The process should finish. (It does on AlmaLinux 8.)
System:
unbound -V output: (Not relevant; we only use libunbound, not the daemon.)

Additional information
The 2nd ub_ctx_delete is doing a read that hangs. It sends \4\0\0\0\0\0\0\0 and expects a response; that response comes in the child process (the first one to do the reap), but the parent process doesn’t get a response.

I don’t see anything in the documentation that suggests this shouldn’t work, at least insofar as not causing a hang. Maybe some sort of timeout in ub_ctx_delete would be worthwhile?

Contexts where I see this bug:
Contexts where I don’t see this bug:
Given the above, the issue would seem to be in user space.