Redundancy to protect checkpoint files (CoreWrapper/Gromacs/FAHCore_a7) and results file (Core_22) #1195
Comments
FAHWrapper does not do that. It is a very simple program that was added to fix problems with shutting down old cores, which ignored CTRL-C but would shut down if their parent process quit.
OK. So the question still applies to Gromacs. And doesn't native Gromacs support the previous two checkpoints? Then, too, where does the code reside that compresses and checksums the upload package? Is that all part of Gromacs, or is it in the wrapper?
It's the core wrapper code that does this. Not to be confused with FAHWrapper.
What information do you need? Shall I post the entire discussion that was linked in the original post?
Can you provide a procedure for reproducing the problem?
Process a WU until it reaches 100%. Interrupt processing while the following messages are being issued.
I think this applies to any FAHCore, not just 0xa4. As I said earlier, it's in the code that prepares the package for upload by compressing and checksumming, which apparently is in the core wrapper. Since the upload package is incomplete and the last checkpoint from GROMACS or from OpenMM is gone, if you now restart processing of the WU, it will restart from 0%, not from 100%.
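To make the failure window described above concrete, here is a minimal sketch of a packaging step that keeps the last checkpoint until the compressed, checksummed package is safely on disk. It is not the core wrapper's actual code: the function name `package_results`, the `.tmp` and `.sha256` suffixes, and the choice of gzip and SHA-256 are all assumptions made for illustration.

```python
import gzip
import hashlib
import os


def package_results(results_path, package_path, checkpoint_path):
    """Hypothetical packaging step: compress, checksum, verify, then drop the checkpoint."""
    with open(results_path, "rb") as f:
        payload = gzip.compress(f.read())
    digest = hashlib.sha256(payload).hexdigest()

    # Write to a temporary name first so an interrupt never leaves a
    # half-written file under the final package name.
    tmp_path = package_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    # Read the bytes back and compare checksums before trusting the file.
    with open(tmp_path, "rb") as f:
        if hashlib.sha256(f.read()).hexdigest() != digest:
            os.remove(tmp_path)
            raise IOError("package verification failed; checkpoint kept")

    # Atomic rename: the package either exists completely or not at all.
    os.replace(tmp_path, package_path)
    with open(package_path + ".sha256", "w") as f:
        f.write(digest + "\n")

    # Only now is it safe to delete the last checkpoint.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
    return digest
```

With this ordering, an interrupt at any point before the final step leaves the checkpoint in place, so a restart could resume from the last checkpoint rather than from 0%.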
please merge with #1166
Yes, the problem was reported using FAHCore_22. I guess that suggests that they use the same wrapper code.
****************************** Date: 2020-05-19 *******************************
Note the absence of the two messages
I've reproduced the problem that I reported above in Windows, though I'm not exactly sure how. The condition I found it in: FAHControl reported that it was updating but nothing happened. FAHControl was running and FAHCore_22 was running. HideConsole was NOT used, so the live output from FAHControl appeared in a window, and the active WU was in the last few steps before 100%. I elected to let it run. Telnet 127.0.0.1 36330 did not respond (well, it did clear the screen rather than hanging). I tried to interrupt FAHClient with a CTRL-C (see log) and, after a long wait, with a second console interrupt. It progressed to 0x22:Folding@home Core Shutdown: FINISHED_UNIT but FAHClient did not report receiving a return code (as above). None of the FAH*wrapper programs are running. All of the files are intact but, even compressed, they exceed GitHub's limit. I'm manually deleting the other WU.
20:20:23:WU03:FS02:0x22:Completed 970000 out of 1000000 steps (97%)
20:35:36:WU03:FS02:0x22:Saving result file ..\logfile_01.txt
Upon restart, a new WU was downloaded for this GPU and an existing WU started from 0%. So whatever was left in the files, the client thought I needed a new assignment AND that I already had one, but not one with a recognizable results file that could be enqueued on the upload queue.
Usually a new checkpoint is created while an older one still exists.
After validation of the new one, the older one gets deleted.
So no work should get lost even if the OS crashes at an inopportune time.
https://foldingforum.org/viewtopic.php?f=96&t=29959#p295154
Does FAHWrapper create the new checkpoint before deleting the old one?
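For reference, the write-validate-then-delete pattern quoted above can be sketched as follows. This is only an illustration of the general technique, not the code used by Gromacs, the core wrapper, or FAHWrapper; the function name `write_checkpoint`, the `checkpoint_NNNNNNNN.bin` naming scheme, and the `keep` parameter are hypothetical.

```python
import hashlib
import os
import tempfile


def write_checkpoint(directory, data, keep=2):
    """Write a new checkpoint, verify it, and only then prune older ones."""
    os.makedirs(directory, exist_ok=True)
    prefix, suffix = "checkpoint_", ".bin"
    existing = sorted(
        name for name in os.listdir(directory)
        if name.startswith(prefix) and name.endswith(suffix)
    )
    next_index = int(existing[-1][len(prefix):-len(suffix)]) + 1 if existing else 0
    final_path = os.path.join(directory, f"{prefix}{next_index:08d}{suffix}")

    # 1. Write the new checkpoint under a temporary name and force it to disk.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # 2. Validate by reading it back; older checkpoints are still untouched.
    with open(tmp_path, "rb") as f:
        if hashlib.sha256(f.read()).digest() != hashlib.sha256(data).digest():
            os.remove(tmp_path)
            raise IOError("checkpoint verification failed; old checkpoints kept")

    # 3. Atomically give the verified file its final name.
    os.replace(tmp_path, final_path)

    # 4. Delete the oldest checkpoints so that `keep` files remain in total.
    for name in existing[:max(0, len(existing) + 1 - keep)]:
        os.remove(os.path.join(directory, name))
    return final_path
```

Because deletion is the last step, a crash at any earlier point leaves at least one complete older checkpoint on disk.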