Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ASYNC copy_files leaves open file descriptors in forked app #1388

Closed
ibethune opened this issue Sep 1, 2015 · 8 comments

Comments

Projects
None yet
4 participants
@ibethune
Copy link

commented Sep 1, 2015

We have discovered an issue on PrimeGrid, where we use a custom wrapper executable which itself forks and execs a binary. This binary is copied into the slot directly as it is specified with the <copy_file/> tag in the app_version XML.

A problem occurs when the binary is > 10 MB, then the async file copy mechanism is used. From my reading of the code in app_start.cpp ACTIVE_TASK::start(), it seems possible to fork() the child process while async copy is still ongoing, although I couldn't follow the exact code path. Thus the child process inherits the open-for-writing file descriptor for the async copy, and then when it tries to exec() the copied binary it hits an error because the file is still open for writing (even if the parent process has closed it's copy of the file descriptor).

You can see discussion of the problem in this thread http://www.primegrid.com/forum_thread.php?id=6412&nowrap=true#87572 , but I was also able to confirm locally, that the forked child process (the primegrid_llr_wrapper) does indeed hold open file descriptors for writing on the asynchronously copied files in the slot directory:

primegrid_llr_w 5743           boinc   11r      REG                8,1 37186295     270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743           boinc   12w      REG                8,1 37186295     393433 /var/lib/boinc-client/slots/1/primegrid_llr
primegrid_llr_w 5743           boinc   13r      REG                8,1 37186295     270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743           boinc   14w      REG                8,1 37186295     393467 /var/lib/boinc-client/slots/2/primegrid_llr
primegrid_llr_w 5743           boinc   15r      REG                8,1 37186295     270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743           boinc   16w      REG                8,1 37186295     393477 /var/lib/boinc-client/slots/3/primegrid_llr
primegrid_llr_w 5743           boinc   17r      REG                8,1 37186295     270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743           boinc   18w      REG                8,1 37186295     393487 /var/lib/boinc-client/slots/4/primegrid_llr
primegrid_llr_w 5743           boinc   19r      REG                8,1 37186295     270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743           boinc   20w      REG                8,1 37186295     393496 /var/lib/boinc-client/slots/5/primegrid_llr
primegrid_llr_w 5743           boinc   21r      REG                8,1 37186295     270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743           boinc   22w      REG                8,1 37186295     393504 /var/lib/boinc-client/slots/6/primegrid_llr
primegrid_llr_w 5743           boinc   23r      REG                8,1 37186295     270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743           boinc   24w      REG                8,1 37186295     393512 /var/lib/boinc-client/slots/7/primegrid_llr

At the moment we have a workaround where we just close all open fds in the child process immediately after it starts (apart from 0-2: stdin, stdout, stderr). It would be good to fix this in the client though. A couple of options could be:

  1. Make the logic foolproof that the fork() is not called until async copies have completed, and the files fclose()'d

  2. Get the fd associated with the async copied file and use fcntl(fd, F_SETFD, FD_CLOEXEC) to make sure that the fd is not inherited by the child process.

Happy to help with any debugging / test case that may be needed.

@JasonGroothuis

This comment has been minimized.

Copy link

commented Sep 2, 2015

Some distantly related notes they may or may not help tracking down the issues:
Constructive Discussions with Rom Walton, recently after some heated debate, revealed that some similar issues we had with output files (namely stderr.txt and possibly sometimes result files) were also being accessed before complete writes+close.
What was indicated to me, is that the client code avoids use of OS specific synchronisation primitives (for some design reasons), and a fixed wait was applied to reduce the probably of truncation/premature-read. While I personally wouldn't have approached it that way, it did apparently reduce/eliminate our (other-end) issues.
If that sounds similar enough, i.e. a design conflict with increased multithreading of system, libraies and drivers, then I can certainly recommend some more future-proof approaches (There are some that don't depend on using timed waits on non-realtime OSes)

--.----------------------------------------------------------------------------------------------------
Jason Richard Groothuis
bSc(compSci)


Date: Tue, 1 Sep 2015 13:56:40 -0700
From: notifications@github.com
To: boinc@noreply.github.com
Subject: [boinc] ASYNC copy_files leaves open file descriptors in forked app (#1388)

We have discovered an issue on PrimeGrid, where we use a custom wrapper executable which itself forks and execs a binary. This binary is copied into the slot directly as it is specified with the tag in the app_version XML.

A problem occurs when the binary is > 10 MB, then the async file copy mechanism is used. From my reading of the code in app_start.cpp ACTIVE_TASK::start(), it seems possible to fork() the child process while async copy is still ongoing, although I couldn't follow the exact code path. Thus the child process inherits the open-for-writing file descriptor for the async copy, and then when it tries to exec() the copied binary it hits an error because the file is still open for writing (even if the parent process has closed it's copy of the file descriptor).

You can see discussion of the problem in this thread http://www.primegrid.com/forum_thread.php?id=6412&nowrap=true#87572 , but I was also able to confirm locally, that the forked child process (the primegrid_llr_wrapper) does indeed hold open file descriptors for writing on the asynchronously copied files in the slot directory:

primegrid_llr_w 5743 boinc 11r REG 8,1 37186295 270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743 boinc 12w REG 8,1 37186295 393433 /var/lib/boinc-client/slots/1/primegrid_llr
primegrid_llr_w 5743 boinc 13r REG 8,1 37186295 270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743 boinc 14w REG 8,1 37186295 393467 /var/lib/boinc-client/slots/2/primegrid_llr
primegrid_llr_w 5743 boinc 15r REG 8,1 37186295 270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743 boinc 16w REG 8,1 37186295 393477 /var/lib/boinc-client/slots/3/primegrid_llr
primegrid_llr_w 5743 boinc 17r REG 8,1 37186295 270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743 boinc 18w REG 8,1 37186295 393487 /var/lib/boinc-client/slots/4/primegrid_llr
primegrid_llr_w 5743 boinc 19r REG 8,1 37186295 270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743 boinc 20w REG 8,1 37186295 393496 /var/lib/boinc-client/slots/5/primegrid_llr
primegrid_llr_w 5743 boinc 21r REG 8,1 37186295 270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743 boinc 22w REG 8,1 37186295 393504 /var/lib/boinc-client/slots/6/primegrid_llr
primegrid_llr_w 5743 boinc 23r REG 8,1 37186295 270220 /var/lib/boinc-client/projects/www.primegrid.com/sllr64.3.8.16
primegrid_llr_w 5743 boinc 24w REG 8,1 37186295 393512 /var/lib/boinc-client/slots/7/primegrid_llr

At the moment we have a workaround where we just close all open fds in the child process immediately after it starts (apart from 0-2: stdin, stdout, stderr). It would be good to fix this in the client though. A couple of options could be:

  1. Make the logic foolproof that the fork() is not called until async copies have completed, and the files fclose()'d

  2. Get the fd associated with the async copied file and use fcntl(fd, F_SETFD, FD_CLOEXEC) to make sure that the fd is not inherited by the child process.

Happy to help with any debugging / test case that may be needed.


Reply to this email directly or view it on GitHub.

@davidpanderson

This comment has been minimized.

Copy link
Contributor

commented Sep 2, 2015

I looked at the code and couldn't immediately see how this situation (dangling open files) can happen. When an async copy finishes (ASYNC_COPY::copy_chunk()) it closes the input and output files, renames the output file from temp name to final name, and calls ACTIVE_TASK::start() to start the task.

So I must be missing something. I'll investigate further when I have time. If anyone can supply additional info (like, are the files for async copy ever closed?) that would be great.

davidpanderson added a commit that referenced this issue Sep 2, 2015

client: fix problem where app processes inherit open files from async…
… copies

Github issue #1388 describes a situation where a wrapper can't exec
an executable file because there's an open file descriptor to it.
This happens only if the file is > 10 MB so that async copy is used.

The reason for this (which Rom identified) is as follows.
Suppose several such tasks start around the same time.
The first async copy finishes and the wrapper is started by fork/exec.
The process inherits open fds for all the other copies.
When they finish and are started, they won't be able to exec.

The fix is to use boinc_fopen() instead of fopen() in ASYNC_COPY.
The former sets its fd to close-on-exec.

In general the client should always use boinc_fopen() instead of fopen();
in addition to close-on-exec, it also does retry.
I changed a couple of other fopen()s also.
@davidpanderson

This comment has been minimized.

Copy link
Contributor

commented Sep 2, 2015

I checked in what I think is a fix.

Rom pointed out that when several such tasks start around the same time, each task process will have open descriptors for the async copies that haven't finished yet. When those tasks eventually finish and start, they'll have the dangling descriptor problem.

The solution is to use close-on-exec for async copy descriptors. It turns out that boinc_fopen(), BOINC's wrapper for fopen(), already does this, and we just needed to use it instead of fopen() in the ASYNC_COPY code.

@JasonGroothuis

This comment has been minimized.

Copy link

commented Sep 2, 2015

As well as the possibility of the close being omitted/incomplete, it could also possibly be a rare double close, which can be tough to recreate. Can look if there are clues to either condition after work.

Jason Richard Groothuis
bSc(compSci)


Date: Wed, 2 Sep 2015 14:10:54 -0700
From: notifications@github.com
To: boinc@noreply.github.com
CC: jason_groothuis@hotmail.com
Subject: Re: [boinc] ASYNC copy_files leaves open file descriptors in forked app (#1388)

I looked at the code and couldn't immediately see how this situation (dangling open files) can happen. When an async copy finishes (ASYNC_COPY::copy_chunk()) it closes the input and output files, renames the output file from temp name to final name, and calls ACTIVE_TASK::start() to start the task.

So I must be missing something. I'll investigate further when I have time. If anyone can supply additional info (like, are the files for async copy ever closed?) that would be great.


Reply to this email directly or view it on GitHub.

@JasonGroothuis

This comment has been minimized.

Copy link

commented Sep 2, 2015

Ah good, so the forum post was correct on that point.


Jason Richard Groothuis
bSc(compSci)


Date: Wed, 2 Sep 2015 15:58:25 -0700
From: notifications@github.com
To: boinc@noreply.github.com
CC: jason_groothuis@hotmail.com
Subject: Re: [boinc] ASYNC copy_files leaves open file descriptors in forked app (#1388)

I checked in what I think is a fix.

Rom pointed out that when several such tasks start around the same time, each task process will have open descriptors for the async copies that haven't finished yet. When those tasks eventually finish and start, they'll have the dangling descriptor problem.

The solution is to use close-on-exec for async copy descriptors. It turns out that boinc_fopen(), BOINC's wrapper for fopen(), already does this, and we just needed to use it instead of fopen() in the ASYNC_COPY code.


Reply to this email directly or view it on GitHub.

davidpanderson added a commit that referenced this issue Sep 2, 2015

client: fix problem where app processes inherit open files from async…
… copies

Github issue #1388 describes a situation where a wrapper can't exec
an executable file because there's an open file descriptor to it.
This happens only if the file is > 10 MB so that async copy is used.

The reason for this (which Rom identified) is as follows.
Suppose several such tasks start around the same time.
The first async copy finishes and the wrapper is started by fork/exec.
The process inherits open fds for all the other copies.
When they finish and are started, they won't be able to exec.

The fix is to use boinc_fopen() instead of fopen() in ASYNC_COPY.
The former sets its fd to close-on-exec.

In general the client should always use boinc_fopen() instead of fopen();
in addition to close-on-exec, it also does retry.
I changed a couple of other fopen()s also.
@romw

This comment has been minimized.

Copy link
Member

commented Sep 2, 2015

I backported this fix to the 7.6 branch.

@romw romw closed this Sep 2, 2015

@ibethune

This comment has been minimized.

Copy link
Author

commented Sep 3, 2015

Thanks guys, I retested with master and confirm the fix works as expected.

@davidpanderson

This comment has been minimized.

Copy link
Contributor

commented Sep 3, 2015

Glad to hear it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.