Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libunbound: subprocess reap causes parent process reap to hang #775

Closed
FGasper opened this issue Nov 4, 2022 · 6 comments
Closed

libunbound: subprocess reap causes parent process reap to hang #775

FGasper opened this issue Nov 4, 2022 · 6 comments

Comments

@FGasper
Copy link
Contributor

FGasper commented Nov 4, 2022

Describe the bug
The following causes a hung process on AlmaLinux 9 and Ubuntu 22:

#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#include "unbound.h"

void my_cb(void* my_arg, int err, struct ub_result* result) {
	printf("callback called\n");
}

int main() {
	struct ub_ctx* ub = ub_ctx_create();
	ub_ctx_async(ub, 1);

	ub_resolve_async(ub, "google.com", 1, 1, NULL, my_cb, NULL);
	ub_wait(ub);

	pid_t pid = fork();
	if (!pid) {
		printf("child reaping\n");
		ub_ctx_delete(ub);
		printf("child reaped\n");
		exit(0);
	}

	int status;
	wait(&status);

	printf("parent reaping\n");
	ub_ctx_delete(ub);    // hang is here
	printf("parent reaped\n");

	return 0;
}

To reproduce
Steps to reproduce the behavior:

  1. On AlmaLinux 9: Compile the above, modifying as appropriate to find unbound.h and linked to libunbound.
  2. Run the resulting binary.

Expected behavior
The process should finish. (It does on AlmaLinux 8.)

System:

  • Unbound version: 1.17.0
  • OS: AlmaLinux 9
  • unbound -V output: (Not relevant; we only use libunbound, not the daemon.)

Additional information
The 2nd ub_ctx_delete is doing a read that hangs. It sends \4\0\0\0\0\0\0\0 and expects a response; that response comes in the child process (the first one to do the reap), but the parent process doesn’t get a response.

I don’t see anything in the documentation that suggests this shouldn’t work at least insofar as not causing a hang. Maybe some sort of timeout in ub_ctx_delete would be worthwhile?

Contexts where I see this bug:

  • AlmaLinux 9
  • Ubuntu 22.04 (containerized from AL9)

Contexts where I don’t see this bug:

  • AlmaLinux 8 (virtualized)
  • AlmaLinux 8 (containerized from AL9)
  • Ubuntu 20.04 (virtualized)

Given the above, the issue would seem to be in user space.

@FGasper FGasper changed the title AlmaLinux 9, libunbound: subprocess reap causes parent process reap to hang libunbound: subprocess reap causes parent process reap to hang Nov 5, 2022
@wcawijngaards
Copy link
Member

Libunbound uses file descriptors to communicate with the created thread. Also when a process is used, a file descriptor pipe is used to communicate with it. When the fork is called, I am not sure which threads get duplicated, but the file descriptors come along, and thus the context in one process connects to the original thread, or both threads if that thread got duplicated too. This is what causes the issues, I believe.

Perhaps a code change to help with that would be to set the close-on-exec flag for the threaded pipe file descriptors in libunbound. So that after the fork, the file descriptors are closed. This probably causes the worker thread to exit because the command channel has closed, and may then print errors from functions that call it, or something along these lines. But then does not have the command channel duplicated across the forked processes.

@FGasper
Copy link
Contributor Author

FGasper commented Nov 7, 2022

set the close-on-exec flag for the threaded pipe file descriptors in libunbound. So that after the fork, the file descriptors are closed.

That wouldn’t address the specific breakage I’ve seen, which doesn’t involve an exec.

Is there an intended behaviour when a ub_ctx gets used/closed from a different process than the one that created it? I’m wondering if this is undefined behaviour that has just “happened” not to hang for us until now.

@wcawijngaards
Copy link
Member

No, there is no intended behaviour involving different processes. Since the command channel refers to another process, I would want it closed, instead of sending messages or waiting for messages on it.

The close on exec flag then does not do what is needed here.

Since the other process has a copy, it is nearly as if there is a forked process context. But not really, it has flags internally. For such a context, I guess it could even be used by one or the other process, but not both, after fork.

@wcawijngaards
Copy link
Member

The fix adds a check for the pid. If the pid has changed, there was a fork operation, and then the cleanup process does not mess up the running thread in the other process.

With that fix the example code works, and it stops neatly.

@FGasper
Copy link
Contributor Author

FGasper commented Nov 8, 2022

Thank you, @wcawijngaards!

@atoomic
Copy link

atoomic commented Nov 8, 2022

Thanks @wcawijngaards & @FGasper, I confirm that this patch is solving the issue we are viewing on AlmaLinux9 while fork process still hold the socket open (but do not use it)

jedisct1 added a commit to jedisct1/unbound that referenced this issue Nov 29, 2022
* nlnet/master:
  - Fix for the ignore of tcp events for closed comm points, preserve   the use after free protection features.
  - Ignore expired error responses.
  - Fix NLnetLabs#779: [doc] Missing documention in ub_resolve_event() for   callback parameter was_ratelimited.
  - Complementary fix for distutils.sysconfig deprecation in Python 3.10   to commit 62c5039.
  - iana portlist update.
  - Fix NLnetLabs#775: libunbound: subprocess reap causes parent process reap   to hang.
  - Fix to make sure to not read again after a tcp comm point is closed.
  - Fix to ignore tcp events for closed comm points.
  fix use after free when WSACreateEvent() fails
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants