Finish file present too long #3017
Comments
I also see this "finish file present too long" error from time to time in yoyo@home results.
I increased the timeout from 10 seconds to 5 minutes. (PR #3019)
That will probably do. I think many people on SETI have reported over the years that this error can happen if they need to reboot the machine just as a 'finish' file is written: this should be long enough for the reboot to complete before the timeout has finished (though we may need to consider whether it's long enough for a full security patch application). I don't know whether I'll have time to search the message boards tonight after maintenance is over, but I'll try to look tomorrow.
Richard, I do not understand how a reboot influences this issue. During system shutdown the client should exit gracefully and clean up. Do you mean a crash reboot? Or is the timestamp of the finish file taken into account?
It may not be the underlying cause, but my memory from SETI (which was offline overnight) leads to threads like https://setiathome.berkeley.edu/forum_thread.php?id=83398. We may need to investigate more deeply.
This bug happens very often for very short WUs that need less than 1 minute to complete, especially for tiny WUs that take only a few seconds. I saw it many times when I was crunching "short" tasks from SRBase. I think the number of CPU cores also plays some role: if I remember correctly, this bug happened more often on machines with 32+ cores.
How much room is there for negotiating this? Changing the timeout from 10 seconds to 5 minutes will help, but I don't think it makes the problem go away entirely. One scenario is people putting the computer to sleep at just the wrong moment. Another is shutting down the computer just after the science app has written the finish file: the next time the computer is started, the science app doesn't realize it was already done and instead continues from the previous 99% done checkpoint. I wouldn't have a problem if the finish file timeout check were removed altogether.
The mechanism is needed; otherwise CPUs can sit idle indefinitely if an app hangs while exiting.
A longer timeout should help with short-running apps. When many of them are finishing and new ones are starting, this can take longer than when only one of them is finishing and starting, especially on machines with many CPU cores. This change will help projects which have such WUs. Edit: The BOINC Client shutdown scenario is trickier. The BOINC Client should save the time of shutdown and, after the next start, use it to check whether some app finished processing during shutdown.
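One possible reading of the shutdown idea above, as a rough sketch only: if the client recorded when it shut down and when it started again, the time the machine was off could be excluded from the age of the finish file before comparing it to the timeout. The function name, parameters, and the idea of storing these two timestamps are assumptions for illustration and are not taken from the BOINC source.

```cpp
#include <algorithm>

// Hypothetical helper, not part of BOINC.
// Returns how long (in seconds) a finish file has been present while the
// client was actually running. All arguments are seconds since the epoch.
// Pass shutdown_time = 0 and startup_time = 0 if no downtime was recorded.
inline double finish_file_age_excluding_downtime(
    double finish_file_time,  // when the finish file was written
    double now,               // current time
    double shutdown_time,     // last recorded client shutdown
    double startup_time       // client start following that shutdown
) {
    double age = now - finish_file_time;
    // Only the part of the downtime that falls after the finish file
    // was written should be discounted.
    double downtime_start = std::max(shutdown_time, finish_file_time);
    double overlap = std::max(0.0, startup_time - downtime_start);
    return age - overlap;
}
```

With this adjustment, a finish file written 5 seconds before a 10-minute reboot would have an age of about 10 seconds shortly after restart, rather than more than 10 minutes.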
Very well then.
When BOINC runs at minimum priority, the system is under heavy load, and a science app finishes, BOINC aborts the task and does not submit the result. This is unfortunate, because the task has already completed successfully, yet the result is discarded.
The app does not need to do any complicated processing at exit. I have
exit(boinc_finish(0));
at the end of my app and it still triggers. This can also happen during swap thrashing. Relevant line:
boinc/client/app_control.cpp
Line 140 in f14d96d
I propose checking the process CPU time instead of wall clock time. Alternatively, just kill the app, but do not abort the task. After all, boinc_finish was called with success and all output files are present.
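A minimal sketch of how the two proposals could be combined, assuming the client can sample the process CPU time when it first sees the finish file and again on each check. A generous wall-clock cap is kept as a backstop, because an app that hangs without using any CPU would never exceed a CPU-only budget. All names and threshold values here are hypothetical and do not come from the BOINC source.

```cpp
// Hypothetical decision for a task whose finish file exists but whose
// process has not exited yet.
enum class FinishAction {
    Wait,            // give the app more time to exit
    KillKeepResult   // kill the process but still report the result
};

// Illustrative thresholds, in seconds.
struct FinishPolicy {
    double cpu_budget;  // CPU time the app may use after writing the finish file
    double wall_cap;    // wall-clock backstop for apps that hang while idle
};

// Example values only; the right numbers would need discussion.
const FinishPolicy kExamplePolicy{10.0, 300.0};

// cpu_since_finish:  process CPU seconds consumed since the finish file appeared
// wall_since_finish: wall-clock seconds elapsed since the finish file appeared
inline FinishAction decide_finish_action(const FinishPolicy& policy,
                                         double cpu_since_finish,
                                         double wall_since_finish) {
    if (cpu_since_finish >= policy.cpu_budget) return FinishAction::KillKeepResult;
    if (wall_since_finish >= policy.wall_cap) return FinishAction::KillKeepResult;
    return FinishAction::Wait;
}
```

Under heavy load or swap thrashing the app accumulates very little CPU time while exiting, so the CPU budget is not exhausted and the task keeps waiting until the wall-clock cap. In both timeout cases the sketch kills the process without discarding the result, which matches the second alternative above.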