Finish file present too long #3017

Closed
tomasbrod opened this issue Feb 12, 2019 · 11 comments

Comments

@tomasbrod

When BOINC runs at minimum priority, the system is under heavy load, and a science app finishes, BOINC aborts the task and does not submit the result. This is unfortunate, because the task has already completed successfully, yet the result is discarded.

The app does not need to do any complicated processing at exit. I have exit(boinc_finish(0)); at the end of my app and it still triggers. This can also happen during swap thrashing.

Relevant line:

atp->abort_task(EXIT_ABORTED_BY_CLIENT, "finish file present too long");

I propose checking the process CPU time instead of wall-clock time. Alternatively, just kill the app, but do not abort the task. After all, boinc_finish was called with success and all output files are present.
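The CPU-time idea could look roughly like the sketch below. It is not BOINC client code: `FinishState`, `exit_budget_exceeded`, and the 10-second budget are names and values assumed purely for illustration.

```cpp
#include <cassert>

// Hypothetical sketch, not BOINC client code. All names are assumptions.
struct FinishState {
    double cpu_at_finish;  // process CPU seconds when the finish file was first seen
    double cpu_now;        // process CPU seconds at the current poll
};

// Assumed budget: how much CPU time the app may use to exit.
constexpr double kExitCpuBudgetSecs = 10.0;

// Returns true only if the app has consumed more CPU time than the budget
// since its finish file appeared. A starved or swapped-out process
// accumulates little CPU time, so a loaded system alone does not trip this.
bool exit_budget_exceeded(const FinishState& s,
                          double budget = kExitCpuBudgetSecs) {
    assert(s.cpu_now >= s.cpu_at_finish);  // CPU time never runs backwards
    return (s.cpu_now - s.cpu_at_finish) > budget;
}
```

A process that hangs while blocked, without using any CPU, would never exceed such a budget, so a check like this would probably still need a generous wall-clock cap alongside it.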

@tomasbrod
Author

@UweBeckert
Contributor

I also see this "finish file present too long" error from time to time in yoyo@home results.

@davidpanderson
Contributor

davidpanderson commented Feb 12, 2019

I increased the timeout from 10 seconds to 5 minutes. (PR #3019)

@RichardHaselgrove
Contributor

That will probably do. I think many people on SETI have reported over the years that this error can happen if they need to reboot the machine just as a 'finish' file is written: this should be long enough for the reboot to complete before the timeout has finished (though we may need to consider whether it's long enough for a full security patch application). I don't know whether I'll have time to search the message boards tonight after maintenance is over, but I'll try to look tomorrow.

@tomasbrod
Author

Richard, I do not understand how a reboot influences this issue. During system shutdown the client should exit gracefully and clean up. Do you mean a crash reboot? Or is the timestamp of the finish file taken into account?

@RichardHaselgrove
Contributor

It may not be the underlying cause, but my memory from SETI (which was offline overnight) leads to threads like

https://setiathome.berkeley.edu/forum_thread.php?id=83398

We may need to investigate more deeply.

@sirzooro
Contributor

sirzooro commented Feb 17, 2019

This bug happens very often for very short WUs, which need less than 1 minute to complete, especially for tiny WUs that take only a few seconds. I saw it many times when I was crunching "short" tasks from SRBase. I think the number of CPU cores also plays some role; if I remember correctly, this bug happened more often on machines with 32+ cores.

@JuhaSointusalo
Contributor

@davidpanderson

How much room is there for negotiating this? Changing the timeout from 10 seconds to 5 minutes will help, but I don't think it makes the problem go away entirely. One scenario is people putting the computer to sleep at just the wrong moment. Another is shutting down the computer just after the science app has written the finish file: the next time the computer is started, the science app doesn't realize it was already done and instead continues from the previous 99% done checkpoint.

I wouldn't have a problem if the finish file timeout check was removed altogether.

@davidpanderson
Contributor

The mechanism is needed; otherwise CPUs can sit idle indefinitely if an app hangs while exiting (which has happened).
Problems happened with a 10-second timeout, so we try 5 minutes and see if any problems happen; they almost certainly won't.
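For readers unfamiliar with the mechanism, a wall-clock finish-file watchdog of this kind can be sketched as follows. This is an illustration under assumptions, not the actual client code: `FinishAction`, `check_finish_file`, and the constant names are hypothetical, and only the 10-second and 5-minute values come from this thread.

```cpp
#include <cassert>

// Hypothetical sketch of a finish-file watchdog; not the BOINC client source.
constexpr double kOldFinishTimeoutSecs = 10.0;   // timeout before the change
constexpr double kNewFinishTimeoutSecs = 300.0;  // 5 minutes, after the change

enum class FinishAction { Wait, Complete, Abort };

// finish_seen_at:  wall-clock seconds when the finish file was first noticed
// now:             current wall-clock seconds
// process_running: whether the app process still exists
FinishAction check_finish_file(double finish_seen_at, double now,
                               bool process_running,
                               double timeout = kNewFinishTimeoutSecs) {
    assert(now >= finish_seen_at);
    if (!process_running) {
        return FinishAction::Complete;  // app exited: handle as a normal finish
    }
    if (now - finish_seen_at > timeout) {
        return FinishAction::Abort;     // app is stuck in exit: free the CPU
    }
    return FinishAction::Wait;          // still within the grace period
}
```

With these values, an app that needs 30 seconds to exit on a loaded machine is aborted under the old timeout but left alone under the new one.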

@sirzooro
Contributor

sirzooro commented Mar 23, 2019

A longer timeout should help with short-running apps. When many of them are finishing and new ones are starting at the same time, this can take longer than when only one of them is finishing and starting, especially on machines with many CPU cores. This change will help projects that have such WUs.

Edit: The BOINC Client shutdown scenario is trickier. The client should save the time of shutdown and, after the next start, use it to check whether some app finished processing during shutdown.
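One possible reading of this idea, as a rough sketch: at startup, compare the finish file's modification time with the persisted shutdown time. Everything here is hypothetical (`StartupInfo`, `finished_during_shutdown`, and the assumption that the client persists a shutdown timestamp); none of it is taken from the BOINC source.

```cpp
#include <cassert>

// Hypothetical sketch; none of these names exist in the BOINC client.
struct StartupInfo {
    double saved_shutdown_time;  // wall-clock seconds, persisted when shutdown began
    bool   finish_file_exists;   // whether a finish file is present at startup
    double finish_file_mtime;    // wall-clock seconds; meaningful only if it exists
};

// True if a finish file exists and was written at or after the saved
// shutdown time, i.e. the app finished while the client was going down.
bool finished_during_shutdown(const StartupInfo& s) {
    // Timestamps are seconds since the epoch, so they are never negative.
    assert(s.saved_shutdown_time >= 0.0);
    return s.finish_file_exists &&
           s.finish_file_mtime >= s.saved_shutdown_time;
}
```

A client that detects this case at startup could then treat the task as finished instead of timing it out or restarting it from a checkpoint.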

@JuhaSointusalo
Contributor

JuhaSointusalo commented Mar 25, 2019 via email

6 participants