md5_file: Too many open files #1175

Open
romw opened this issue Feb 4, 2015 · 8 comments

@romw
Member

romw commented Feb 4, 2015

Reported by smoe on 8 Jan 42640604 02:59 UTC
I found that boinc-client had stopped for no apparent reason. It was running only a local, self-built SETI client. I had seen this once before, a long time ago, back then with WCG.

From stderrdae.txt:

No protocol specified
No protocol specified
No protocol specified
No protocol specified
...

dir_open: Could not open directory 'slots/0'.
dir_open: Could not open directory 'slots/18'.
dir_open: Could not open directory 'slots/17'.
dir_open: Could not open directory 'slots/12'.
dir_open: Could not open directory 'slots/7'.
dir_open: Could not open directory 'slots/4'.
dir_open: Could not open directory 'slots/22'.
dir_open: Could not open directory 'slots/19'.
dir_open: Could not open directory 'slots/9'.
dir_open: Could not open directory 'slots/16'.
dir_open: Could not open directory 'slots/14'.
dir_open: Could not open directory 'slots/20'.
dir_open: Could not open directory 'slots/8'.
dir_open: Could not open directory 'slots/3'.
dir_open: Could not open directory 'slots/23'.
dir_open: Could not open directory 'slots/11'.
...

dir_open: Could not open directory 'slots/7'.
dir_open: Could not open directory 'slots/7'.
dir_open: Could not open directory 'slots/7'.
md5_file: can't open projects/einstein.phys.uwm.edu/einstein_S6LV1_1.10_i686-pc-linux-gnu!__SSE2
md5_file: Too many open files
dir_open: Could not open directory 'projects/setiathome.berkeley.edu'.
dir_open: Could not open directory 'slots/24'.
md5_file: can't open projects/setiathome.berkeley.edu/14ja12ac.18155.67.4.10.61_1_0
md5_file: Too many open files
dir_open: Could not open directory 'slots/14'.
dir_open: Could not open directory 'slots/14'.
md5_file: can't open projects/einstein.phys.uwm.edu/hsgamma_FGRP1_0.23_i686-pc-linux-gnu
md5_file: Too many open files
md5_file: can't open projects/boinc.bakerlab.org_rosetta/minirosetta_3.26_x86_64-pc-linux-gnu
md5_file: Too many open files
dir_open: Could not open directory 'projects/docking.cis.udel.edu'.
dir_open: Could not open directory 'projects/spin.fh-bielefeld.de'.
dir_open: Could not open directory 'projects/boinc.fzk.de_poem'.
dir_open: Could not open directory 'projects/qah.uni-muenster.de'.
dir_open: Could not open directory 'projects/www.rechenkraft.net_yoyo'.
dir_open: Could not open directory 'projects/www.worldcommunitygrid.org'.
dir_open: Could not open directory 'slots/21'.
dir_open: Could not open directory 'slots/21'.
dir_open: Could not open directory 'slots/21'.
md5_file: can't open projects/www.worldcommunitygrid.org/wcg_faah_autodock_6.40_i686-pc-linux-gnu
md5_file: Too many open files
dir_open: Could not open directory 'projects/lhcathomeclassic.cern.ch_sixtrack'.
dir_open: Could not open directory 'slots/0'.
dir_open: Could not open directory 'slots/1'.
dir_open: Could not open directory 'slots/2'.
....

dir_open: Could not open directory 'slots/21'.
dir_open: Could not open directory 'slots/22'.
dir_open: Could not open directory 'slots/23'.
dir_open: Could not open directory 'slots/4'.
md5_file: can't open projects/setiathome.berkeley.edu/30dc09aj.1678.25025.13.10.226_2_0
md5_file: Too many open files

From stdoutdae.txt:

21-Aug-2012 10:08:37 [Temporarily failed download of 23jn11ad.13583.17249.14.10.205: transient HTTP error
21-Aug-2012 10:08:37 [Backing off 4 min 36 sec on download of 23jn11ad.13583.17249.14.10.205
21-Aug-2012 10:08:37 [Temporarily failed download of 31oc10ac.1632.15183.4.10.58: transient HTTP error
21-Aug-2012 10:08:37 [Backing off 5 min 27 sec on download of 31oc10ac.1632.15183.4.10.58
21-Aug-2012 10:09:01 [Project communication failed: attempting access to reference site
21-Aug-2012 10:09:02 [Internet access OK - project servers may be temporarily down.
21-Aug-2012 10:13:40 [Started download of 05my12ad.31349.14382.3.10.249
21-Aug-2012 10:13:40 [Started download of 05my12ad.31349.14382.3.10.255
21-Aug-2012 10:13:53 [Finished download of 05my12ad.31349.14382.3.10.249
21-Aug-2012 10:13:53 [Started download of 23jn11ad.13583.17249.14.10.241
21-Aug-2012 10:13:54 [Finished download of 05my12ad.31349.14382.3.10.255
21-Aug-2012 10:13:54 [Started download of 23jn11ad.13583.17249.14.10.205
21-Aug-2012 10:14:02 [Finished download of 23jn11ad.13583.17249.14.10.241
21-Aug-2012 10:14:02 [Started download of 05my12ad.31349.14382.3.10.224
21-Aug-2012 10:14:12 [Finished download of 23jn11ad.13583.17249.14.10.205
21-Aug-2012 10:14:12 [Finished download of 05my12ad.31349.14382.3.10.224
21-Aug-2012 10:14:12 [Started download of 31oc10ac.1632.15183.4.10.58
21-Aug-2012 10:14:12 [Started download of 30dc09aj.1678.25025.13.10.226
21-Aug-2012 10:14:29 [Finished download of 30dc09aj.1678.25025.13.10.226
21-Aug-2012 10:14:29 [Started download of 31oc10ac.1632.15183.4.10.64
21-Aug-2012 10:14:30 [Finished download of 31oc10ac.1632.15183.4.10.58
21-Aug-2012 10:14:34 [Finished download of 31oc10ac.1632.15183.4.10.64
21-Aug-2012 10:17:46 [Started download of 30jn10ab.1159.23777.7.10.6.vlar
21-Aug-2012 10:17:55 [Starting task 23jn11ad.13583.17249.14.10.241_1 using setiathome_enhanced version 612 in slot 0
21-Aug-2012 10:17:55 [Starting task 05my12ad.31349.14382.3.10.247_0 using setiathome_enhanced version 612 in slot 1
21-Aug-2012 10:17:55 [Starting task 23jn11ad.13583.17249.14.10.229_1 using setiathome_enhanced version 612 in slot 2
21-Aug-2012 10:17:55 [Starting task 05my12ad.31349.14382.3.10.224_1 using setiathome_enhanced version 612 in slot 3
21-Aug-2012 10:17:55 [Starting task 30dc09aj.1678.25025.13.10.226_2 using setiathome_enhanced version 612 in slot 4
21-Aug-2012 10:17:55 [Starting task 23jn11ad.13583.17249.14.10.228_0 using setiathome_enhanced version 612 in slot 5
21-Aug-2012 10:17:55 [Starting task 23jn11ad.13583.17249.14.10.248_0 using setiathome_enhanced version 612 in slot 6
21-Aug-2012 10:17:55 [Starting task 30dc09aj.1678.25025.13.10.220_2 using setiathome_enhanced version 612 in slot 7
21-Aug-2012 10:17:55 [Starting task 05my12ad.31349.14382.3.10.255_0 using setiathome_enhanced version 612 in slot 8
21-Aug-2012 10:17:55 [Starting task 05my12ad.31349.14382.3.10.249_0 using setiathome_enhanced version 612 in slot 9
21-Aug-2012 10:17:55 [Starting task 23jn11ad.13583.17249.14.10.205_1 using setiathome_enhanced version 612 in slot 10
21-Aug-2012 10:17:55 [Starting task 05my12ad.31349.14382.3.10.246_0 using setiathome_enhanced version 612 in slot 11
21-Aug-2012 10:17:55 [Starting task 27my10ac.18052.55637.5.10.1_2 using setiathome_enhanced version 612 in slot 12
21-Aug-2012 10:17:55 [Starting task 31oc10ac.1632.15183.4.10.53_0 using setiathome_enhanced version 612 in slot 13
21-Aug-2012 10:17:55 [Starting task 31oc10ac.1632.15183.4.10.41_1 using setiathome_enhanced version 612 in slot 14
21-Aug-2012 10:17:55 [Starting task 31oc10ac.1632.15183.4.10.58_0 using setiathome_enhanced version 612 in slot 15
21-Aug-2012 10:17:55 [Starting task 31oc10ac.1632.15183.4.10.49_0 using setiathome_enhanced version 612 in slot 16
21-Aug-2012 10:17:55 [Starting task 31oc10ac.1632.15183.4.10.52_0 using setiathome_enhanced version 612 in slot 17
21-Aug-2012 10:17:55 [Starting task 31oc10ac.1632.15183.4.10.64_0 using setiathome_enhanced version 612 in slot 18
21-Aug-2012 10:17:55 [Starting task 31oc10ac.1632.15183.4.10.36_1 using setiathome_enhanced version 612 in slot 19
21-Aug-2012 10:17:55 [Starting task 31oc10ac.1632.15183.4.10.28_1 using setiathome_enhanced version 612 in slot 20
21-Aug-2012 10:17:55 [Starting task 31oc10ac.1632.15183.4.10.47_0 using setiathome_enhanced version 612 in slot 21
21-Aug-2012 10:17:59 [Finished download of 30jn10ab.1159.23777.7.10.6.vlar
21-Aug-2012 10:17:59 [Starting task 30jn10ab.1159.23777.7.10.6.vlar_3 using setiathome_enhanced version 612 in slot 22
21-Aug-2012 10:18:26 [Started download of 19se10ac.457.271346.15.10.37.vlar
21-Aug-2012 10:18:35 [Finished download of 19se10ac.457.271346.15.10.37.vlar
21-Aug-2012 10:18:35 [Starting task 19se10ac.457.271346.15.10.37.vlar_3 using setiathome_enhanced version 612 in slot 23
21-Aug-2012 10:48:33 [Can't get task disk usage: opendir() failed
21-Aug-2012 10:48:33 [Can't get task disk usage: opendir() failed
21-Aug-2012 10:48:33 [Can't get task disk usage: opendir() failed
21-Aug-2012 10:48:33 [Can't get task disk usage: opendir() failed
21-Aug-2012 10:48:33 [Can't get task disk usage: opendir() failed
....

1-Aug-2012 11:38:37 [Can't get task disk usage: opendir() failed
21-Aug-2012 11:38:37 [Can't get task disk usage: opendir() failed
21-Aug-2012 11:38:37 [Can't get task disk usage: opendir() failed
21-Aug-2012 11:38:37 [Can't get task disk usage: opendir() failed
21-Aug-2012 11:45:55 [read_stderr_file(): malloc() failed
21-Aug-2012 11:45:55 [Computation for task 30dc09aj.1678.25025.13.10.226_2 finished
21-Aug-2012 11:45:55 [Can't open client_state_next.xml: fopen() failed
21-Aug-2012 11:45:55 [Couldn't write state file: fopen() failed; giving up

Migrated-From: http://boinc.berkeley.edu/trac/ticket/1203

@romw
Member Author

romw commented Feb 5, 2015

Commented by Nicolas on 12 May 42647863 23:38 UTC
It looks like something is leaking file descriptors. It's hard to know what the real cause of this bug is without more information. Have you seen this happen more than once?

@romw
Member Author

romw commented Feb 5, 2015

Commented by davea on 7 Oct 42647966 16:43 UTC
From the "read_stderr_file(): malloc() failed" message it looks like your system is out of swap space, or has some other memory-related problem. What is the memory usage of the client?

@romw
Member Author

romw commented Feb 5, 2015

Commented by Nicolas on 25 Jun 42647991 02:28 UTC
Actually, read_stderr_file() returns ERR_MALLOC if read_file_malloc() fails for any reason, including if it was unable to open the file. So I don't think there's any memory problem in this case.
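To illustrate why the distinction matters: when a process has used up its descriptor limit, open() fails with errno set to EMFILE ("Too many open files", as in the stderr log), which has nothing to do with memory. The sketch below provokes that condition under a temporarily lowered limit. It is an illustration only; errno_when_fds_exhausted and the limit of 16 are arbitrary choices for the demo, not anything from the BOINC source.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Lower the soft descriptor limit, open /dev/null until open() fails,
 * and return the errno of that failure. Everything opened here is
 * closed and the original limit restored before returning.
 * Returns 0 if the limit could not be changed. */
int errno_when_fds_exhausted(void) {
    struct rlimit saved, lowered;
    int fds[32];
    int n = 0, err = 0;

    if (getrlimit(RLIMIT_NOFILE, &saved) != 0) return 0;
    lowered = saved;
    lowered.rlim_cur = 16;  /* arbitrary small limit for the demo */
    if (setrlimit(RLIMIT_NOFILE, &lowered) != 0) return 0;

    while (n < 32) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0) { err = errno; break; }  /* save errno before cleanup */
        fds[n++] = fd;
    }
    while (n > 0) close(fds[--n]);
    setrlimit(RLIMIT_NOFILE, &saved);
    return err;
}
```

A caller that checks the saved errno can report descriptor exhaustion as such instead of folding every failure into one memory error code.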

@romw
Member Author

romw commented Feb 5, 2015

Commented by smoe on 11 Oct 42714773 09:42 UTC
The mystery is that lsof does not report the issue:

sudo lsof|cut -f1 -d\ |uniq -c | sort -n

There, the only process with a large number of open files is iceweasel, for all the images etc. While googling about it, I came across

http://stackoverflow.com/questions/10218266/debugging-file-descriptor-leak-in-kernel

which pointed to shared memory that was never released. Is this what is happening? Some polling/pushing on shared memory in a threaded environment that has somehow gone wild?

Please review the respective communication code for any such evidence.

Steffen

@romw
Member Author

romw commented Feb 5, 2015

Commented by smoe on 11 Oct 42714780 15:24 UTC
[fine. Just, when sprintf-ing to path, please consider making it an snprintf(path,sizeof(path),...)

Cheers,

Steffen

Replying to comment:4 Nicolas:

Actually, read_stderr_file() returns ERR_MALLOC if read_file_malloc() fails for any reason, including if it was unable to open the file. So I don't think there's any memory problem in this case.

@romw
Member Author

romw commented Feb 5, 2015

Commented by davea on 23 Feb 42726792 15:27 UTC
The problem with snprintf (and strncpy) is that if the buffer is exceeded, it's not null-terminated.
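For reference, the two functions behave differently here. A C99-conforming snprintf always NUL-terminates when the size is nonzero and reports truncation through its return value. The missing terminator applies to strncpy, and to pre-C99 variants such as Microsoft's older _snprintf. The sketch below shows both cases; format_path and copy_terminated are hypothetical helpers for illustration, not functions from the BOINC source.

```c
#include <stdio.h>
#include <string.h>

/* Format "dir/file" into buf. Returns 1 if the whole path fit, 0 if it
 * was truncated or formatting failed. With a C99 snprintf and size > 0,
 * buf is NUL-terminated in the truncated case as well. */
int format_path(char *buf, size_t size, const char *dir, const char *file) {
    int n = snprintf(buf, size, "%s/%s", dir, file);
    return n >= 0 && (size_t)n < size;
}

/* strncpy leaves dst unterminated when src fills the buffer, so copy at
 * most size - 1 bytes and terminate explicitly. */
void copy_terminated(char *dst, size_t size, const char *src) {
    if (size == 0) return;
    strncpy(dst, src, size - 1);
    dst[size - 1] = '\0';
}
```

Checking the return value also lets the caller treat an over-long path as an error rather than silently using a truncated one.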

@ChristianBeer ChristianBeer modified the milestones: Client/Manager 8.0, Undetermined Apr 10, 2017
@ChristianBeer
Member

Related to #1114, but this time BOINC is causing the problem on its own.
Suggested patch from the original ticket (needs to be adjusted):

--- client/app_control.cpp	(revision 26057)
+++ client/app_control.cpp	(working copy)
@@ -818,6 +818,7 @@
 int ACTIVE_TASK::read_stderr_file() {
     char* buf1, *buf2;
     char path[MAXPATHLEN];
+    int retval;
 
     // truncate stderr output to the last 63KB;
     // it's unlikely that more than that will be useful
@@ -825,9 +826,9 @@
     int max_len = 63*1024;
     sprintf(path, "%s/%s", slot_dir, STDERR_FILE);
     if (!boinc_file_exists(path)) return 0;
-    if (read_file_malloc(path, buf1, max_len, !config.stderr_head)) {
-        return ERR_MALLOC;
-    }
+
+    retval = read_file_malloc(path, buf1, max_len, !config.stderr_head);
+    if (retval) return retval;
 
     // if it's a vbox app, check for string in stderr saying
     // the job failed because CPU VM extensions disabled

@davidpanderson
Contributor

I made that change, but it doesn't involve leaking file descriptors.

@Ageless93 Ageless93 added this to Backlog in Client and Manager via automation Nov 11, 2017
@Ageless93 Ageless93 removed this from Backlog in Client and Manager Nov 11, 2017
@Ageless93 Ageless93 added this to Backlog in Client and Manager via automation Nov 11, 2017
@AenBleidd AenBleidd removed this from Backlog in Client and Manager Aug 14, 2023