
Use closefrom and remove our fds max limit. #4872

Merged — 3 commits into ElementsProject:master, Oct 22, 2021

Conversation

ZmnSCPxj
Copy link
Collaborator

@ZmnSCPxj ZmnSCPxj commented Oct 19, 2021

Fixes #4868

This also inadvertently fixes a latent bug: before this patch, in the
subd function in lightningd/subd.c, we would close execfail[1]
before doing an exec.
We use an EOF on execfail[1] as a signal that exec succeeded (the
fd is marked CLOEXEC), and otherwise use it to pump errno to the
parent.
The intent is that this fd should be kept open until exec, at which
point CLOEXEC triggers, closing that fd and sending the EOF, or, if
exec fails, we can send the errno to the parent process via that
pipe-end.

However, in the previous version, we end up closing that fd before
reaching exec, either in the loop which dup2s passed-in fds (by
overwriting execfail[1] with a dup2) or in the "close everything"
loop, which does not guard against execfail[1], only
dev_disconnect_fd.

Ping @whitslack.

Also ping @NicolasDorier and @cdecker --- on Linux < 5.9, this new code uses /proc/$$/fd to see the open file descriptors (and avoid looping all the way to 1048576), but I do not know if /proc is limited or inaccessible within Docker, and the original #2977 was triggered by Docker on root. Cursory search suggests it should be accessible but I would prefer expert opinion.

We could argue for just "kicking the can" and bumping the max limit up to 4096, but at some point @whitslack is going to have 10,000 channels all by himself, and besides, closefrom is "the future" and should be implementable on more systems at some point, maybe. More to the point: if we have N channels, the limit has to be at least N, and we need to launch N subdaemons (one for each channel); iterating from 0 to N takes N steps, which we do once for each of the N subdaemons, so kicking the can and just raising the limit is O(N^2).

Created as draft for now; in particular the code is only minimally tested on Linux 5.11 (and I tried disabling HAVE_NR_CLOSE_RANGE to test the /proc scanning, which should work on a lot of systems that are not running lightningd in a chroot jail). It should be tested on more systems and container types, and while I can try to figure out how to get the tests running on FreeBSD in a VM, I do not know if I can legally run a MacOS VM if I refuse to purchase anything from Apple.

@ZmnSCPxj ZmnSCPxj force-pushed the use-closefrom branch 2 times, most recently from 99b0ce6 to 5682ea0 Compare October 19, 2021 05:03
ccan/ccan/closefrom/_info — review thread (outdated, resolved)
@rustyrussell
Copy link
Contributor

I think increasing to 4k in the worst case is also the Right Thing, which means this won't make things worse.

I say move it out of draft (with the simple fixes preferably) and we use the -rcs for further testing...

@rustyrussell rustyrussell added this to the v0.10.2 milestone Oct 20, 2021
@ZmnSCPxj ZmnSCPxj removed this from the v0.10.2 milestone Oct 20, 2021
@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Oct 20, 2021

Removed from 0.10.2 milestone for now: the CI FreeBSD smoke test shows this line: https://github.com/ElementsProject/lightning/pull/4872/checks?check_run_id=3935023800#step:3:4379

   checking for closefrom() offered by system... no

FreeBSD should support closefrom, so this should be "yes", meaning something is broken in the configurator entry for this. Lemme debug the FreeBSD bit.

@ZmnSCPxj
Copy link
Collaborator Author

Right, right: closefrom returns void on FreeBSD, and I was doing `int res = closefrom(STDERR_FILENO + 1);` in the configurator, which fails to compile there. Hope the FreeBSD smoke test shows closefrom is detected now!

Also did the closefrom_limit as suggested, and use bool everywhere it made sense. closefrom_may_be_slow is still exposed so we can print something to logs.

@ZmnSCPxj
Copy link
Collaborator Author

Configurator test on FreeBSD now works! https://github.com/ElementsProject/lightning/pull/4872/checks?check_run_id=3958492505#step:3:4385

Will bring out of draft.

@ZmnSCPxj ZmnSCPxj marked this pull request as ready for review October 21, 2021 01:43
@ZmnSCPxj ZmnSCPxj added this to the v0.10.2 milestone Oct 21, 2021
Signed-off-by: ZmnSCPxj jxPCSnmZ <ZmnSCPxj@protonmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This also inadvertently fixes a latent bug: before this patch, in the
`subd` function in `lightningd/subd.c`, we would close `execfail[1]`
*before* doing an `exec`.
We use an EOF on `execfail[1]` as a signal that `exec` succeeded (the
fd is marked CLOEXEC), and otherwise use it to pump `errno` to the
parent.
The intent is that this fd should be kept open until `exec`, at which
point CLOEXEC triggers, closing that fd and sending the EOF, *or*, if
`exec` fails, we can send the `errno` to the parent process via that
pipe-end.

However, in the previous version, we end up closing that fd *before*
reaching `exec`, either in the loop which `dup2`s passed-in fds (by
overwriting `execfail[1]` with a `dup2`) or in the "close everything"
loop, which does not guard against `execfail[1]`, only
`dev_disconnect_fd`.
Fixes: ElementsProject#4868

ChangeLog-Fixed: We now no longer self-limit the number of file descriptors (which limits the number of channels) in sufficiently modern systems, or where we can access `/proc` or `/dev/fd`.  We still self-limit on old systems where we cannot find the list of open files on `/proc` or `/dev/fd`, so if you need > ~4000 channels, upgrade or mount `/proc`.
@rustyrussell
Copy link
Contributor

Included in upstream ccan/; trivial rebase.

@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Oct 21, 2021

This PR currently still fails an edge case (which the existing version would also fail at):

  • Suppose we want to pass two fds to the subdaemon.
  • By complete chance, the second fd being passed happens to be fd 3.
  • In both the old code and the current PR, fd 3 will be overwritten (by the first fd), and then moved to fd 4 (when our loop reaches the second fd).
    • The subdaemon will then see fd 3 as closed, and fd 4 as the file descriptor that was passed first instead of second.

I think the PR can be pushed as-is, since it does no worse than the existing code (and maybe we have some protection already against the above...? Do we pass more than one fd to a subdaemon?)

~~The correct solution would be:~~ No, it is still incorrect, darn it!

  • Loop over the passed in fds.
    • For each one, open /dev/null and associate the passed-in fds with the newly opened /dev/null fd.
  • Loop over the passed in fds again (note! this must be a separate loop!):
    • For each one, move_fd the original fd to the /dev/null we allocated for it and overwrite it with the new fd
  • Loop over the destination /dev/null fds.
    • For each one, move_fd to fdnum++.

The above is very technical and complex and stuff, so I think it should be a different PR; this PR, as noted, does no worse than the current code anyway (and we might already avoid, deliberately or accidentally, the mentioned edge case), and is strictly an improvement. WDYT @rustyrussell ?

--

Edit: This is the actual correct solution

  • Loop over the passed in fds:
    • Check if fdnum == execfail[1] or dev_disconnect_fd, if so move_fd_any them out of the way (i.e. this is the current code).
    • va_copy the current ap, then loop over the copy:
      • Check if fdnum == `*va_read(ap_copy)`; if so, `move_fd_any` it out of the way -- fortunately the passed-in fds are `int *` so we can actually use the passed-in storage.

@ZmnSCPxj
Copy link
Collaborator Author

CI failure seems unrelated, it is pointing at hsmtool, which should not be affected by this (does hsmtool use pipecmd?). Is there a way to restart just one test? The only button I can find seems to restart all of them.

I have code now to implement the fix to the really edgy edge case in the previous comment, but am wary of adding it currently, because I think we will not hit that edge case in practice anyway (because if so lightningd would already be currently failing a lot). Maybe. Hmmmmmm.

@vincenzopalazzo
Copy link
Collaborator

Is there a way to restart just one test? The only button I can find seems to restart all of them.

yep, I think there is a method to restart only one check

@rustyrussell rustyrussell mentioned this pull request Oct 22, 2021
@rustyrussell
Copy link
Contributor

Is there a way to restart just one test? The only button I can find seems to restart all of them.

yep, I think there is a method to restart only one check

No, you have to restart the entire thing :( known GH actions limitation it seems...

@cdecker
Copy link
Member

cdecker commented Oct 22, 2021

ACK 9d5fbc1

@cdecker cdecker merged commit e733fdf into ElementsProject:master Oct 22, 2021
@ZmnSCPxj ZmnSCPxj deleted the use-closefrom branch October 22, 2021 16:43
@ZmnSCPxj
Copy link
Collaborator Author

If anyone is interested and has a MacOS, here is some code I would like to have compiled and executed:

#include<errno.h>
#include<libproc.h>
#include<stdbool.h>
#include<stdio.h>
#include<stdlib.h>
#include<sys/types.h>
#include<unistd.h>

static
void closefrom(int fromfd)
{
	int saved_errno = errno;

	int maxfd;

/* #ifdef HAVE_PROC_PIDINFO */
	int orig_sz, sz, i, num;
	pid_t pid = getpid();
	struct proc_fdinfo *infos;

	orig_sz = proc_pidinfo(pid, PROC_PIDLISTFDS, 0, NULL, 0);
{ /* debug */
if (orig_sz >= 0)
	printf("proc_pidinfo(NULL, 0) ok\n");
} /* debug */
	if (orig_sz == 0)
		goto quit;
	else if ((orig_sz > 0) && ((infos = malloc(orig_sz)) != NULL)) {
		sz = proc_pidinfo(pid, PROC_PIDLISTFDS, 0,
				  infos, orig_sz);
		if (sz >= 0 && sz <= orig_sz) {
{ /* debug */
	printf("proc_pidinfo(infos, orig_sz) ok\n");
} /* debug */
			num = sz / sizeof(struct proc_fdinfo);
			for (i = 0; i < num; ++i) {
				if (infos[i].proc_fd >= fromfd)
					close(infos[i].proc_fd);
			}
			free(infos);
			goto quit;
		} else
			free(infos);
	}
/* #endif // HAVE_PROC_PIDINFO */

{ /* debug */
if (orig_sz >= 0)
	printf("fell back oh no!\n");
} /* debug */

	maxfd = sysconf(_SC_OPEN_MAX);

	for (; fromfd < maxfd; ++fromfd)
		close(fromfd);

quit:
	errno = saved_errno;
}

static
bool can_proc_pidinfo(void)
{
/* #ifdef HAVE_PROC_PIDINFO */
	int sz = proc_pidinfo(getpid(), PROC_PIDLISTFDS, 0, NULL, 0);
	if (sz >= 0)
		return true;
/* #endif // HAVE_PROC_PIDINFO */
	return false;
}

static
bool closefrom_may_be_slow(void)
{
	if (can_proc_pidinfo())
		return false;
	else
		return true;
}

int main(void)
{
	int fds[2];
	ssize_t wres;

	char buf = '\0';

	printf("closefrom_may_be_slow: %s\n",
	       closefrom_may_be_slow() ? "true" : "false");

	if (pipe(fds) < 0) {
		perror("pipe");
		return 1;
	}

	closefrom(STDERR_FILENO + 1);

	/* Writing to the write end should fail.  */
	do {
		wres = write(fds[1], &buf, 1);
	} while ((wres < 0) && (errno == EINTR));
	if ((wres < 0) && (errno == EBADF)) {
		printf("successfully closed by closefrom!\n");
		return 0;
	} else if (wres < 0) {
		perror("write");
		return 1;
	} else {
		printf("unexpected success on supposedly-closed pipe!\n");
		return 1;
	}
}

On MacOS we use /dev/fd, which should work according to my research, but note that it may not be accessible if lightningd is run in a chroot jail. On the other hand, the above uses malloc, which has problems. First, it can fail, meaning we end up with the fallback. Second, in a few rare contexts (not in lightningd, but this is ccan code and intended to be much more general) you do not want to call malloc ever --- e.g. you are implementing your own programming language with a GC and want your GC to use sbrk directly, with the assumption that everything the GC controls is a single contiguous memory area: a random malloc risks a random uncontrolled sbrk, which puts a malloc-managed memory block in the middle of your GC heap (and remember, free rarely, if ever, puts memory back with a reversing sbrk; most malloc implementations will just retain the memory for reuse by a later malloc, and always sbrk more memory, never less).

Unfortunately, allocating off the stack is difficult. Variable-length arrays are in C99, but ccan is intended to be compatible with pre-C99 compilers. alloca may require the occasional alloca(0) on some systems --- and in those systems, it is emulated by a linked-list of malloced blocks anyway, so bleah. Static-sized stack arrays are not general --- what if the number of kept fds below fromfd exceeds the static size? So malloc is the most portable solution (i.e. in most contexts where closefrom makes sense, malloc use should be safe as long as you ensure free, and hardly anyone makes programming language implementations with a GC that assumes contiguous memory area anyway, most GC algos will want to split up memory into large chunks and you can just malloc those at the C level).

@whitslack
Copy link
Collaborator

@ZmnSCPxj: If you don't want to use malloc, why don't you just mmap an anonymous private chunk of memory and munmap it when you're finished with it?

@ZmnSCPxj
Copy link
Collaborator Author

Certainly a possibility if we consider the GC-with-contiguous-address-space thing, but that is marginal; mmap can still fail and force us to the slow close loop.

@NicolasDorier
Copy link
Collaborator

/proc seems to work fine inside containers

@ZmnSCPxj
Copy link
Collaborator Author

ZmnSCPxj commented Nov 8, 2021

@NicolasDorier thanks!

Successfully merging this pull request may close these issues.

**BROKEN** Could not subdaemon channel: Too many open files