Redundancy to protect checkpoint files (CoreWrapper/Gromacs/FAHCore_a7) and results file(Core_22) #1195

Open
bb30994 opened this issue May 3, 2017 · 10 comments
Labels
1.Type - Enhancement Reported issue is an enhancement. 3.Component - FAHCoreWrapper Reported issue relates to FAHCoreWrapper.

Comments

@bb30994

bb30994 commented May 3, 2017

Usually a new checkpoint is created while an older one still exists.
After validation of the new one, the older one gets deleted.
So no work should get lost even if the OS crashes at an inopportune time.
https://foldingforum.org/viewtopic.php?f=96&t=29959#p295154

Does FAHWrapper create the new checkpoint before deleting the old one?
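
For reference, the write, validate, then delete order described above could look roughly like the following Python sketch. It is not taken from FAHWrapper, the core wrapper, or Gromacs. The function name `write_checkpoint`, the numbered `.ckpt` naming scheme, and the `keep` parameter are all hypothetical and exist only to illustrate the ordering.

```python
import hashlib
import os
import tempfile


def write_checkpoint(directory: str, payload: bytes, keep: int = 2) -> str:
    """Write a new checkpoint, verify it, and only then prune older ones.

    Hypothetical helper: file names and the `keep` count are assumptions.
    """
    os.makedirs(directory, exist_ok=True)
    existing = sorted(f for f in os.listdir(directory) if f.endswith(".ckpt"))
    next_index = int(existing[-1].split(".")[0]) + 1 if existing else 0
    final_path = os.path.join(directory, f"{next_index:08d}.ckpt")

    # 1. Write to a temporary file in the same directory and force it to disk.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(payload)
        tmp.flush()
        os.fsync(tmp.fileno())

    # 2. Validate the new file by reading it back and comparing digests.
    with open(tmp_path, "rb") as check:
        written = hashlib.sha256(check.read()).digest()
    if written != hashlib.sha256(payload).digest():
        os.remove(tmp_path)
        raise IOError("new checkpoint failed validation; older ones were kept")

    # 3. Publish the new checkpoint atomically, then delete the oldest ones.
    os.replace(tmp_path, final_path)
    for old in existing[: max(0, len(existing) + 1 - keep)]:
        os.remove(os.path.join(directory, old))
    return final_path
```

With this ordering, a crash at any step leaves at least one previously validated checkpoint on disk.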

@jcoffland
Member

FAHWrapper does not do that. It is a very simple program that was added to fix problems with shutting down old cores, which ignored CTRL-C but would shut down if their parent process quit.

@bb30994
Author

bb30994 commented May 4, 2017

OK. So the question still applies to Gromacs. And doesn't native Gromacs support the previous two checkpoints?

Then, too, where does the code reside that compresses and checksums the upload package? Is that all part of Gromacs, or is it in the wrapper?

@jcoffland
Member

It's the core wrapper code that does this. Not to be confused with FAHWrapper.

@bb30994 bb30994 changed the title Redundancy to protect checkpoint files (FAHWrapper/Gromacs/FAHCore_a7) Redundancy to protect checkpoint files (CoreWrapper/Gromacs/FAHCore_a7) Jun 22, 2017
@jcoffland jcoffland added the 0.Status - More Information Reported issue needs more information before a decision is made. label Oct 30, 2017
@bb30994
Author

bb30994 commented Nov 22, 2017

What information do you need? Shall I post the entire discussion that was linked in the original post?

@jcoffland
Member

Can you provide a procedure for reproducing the problem?

@bb30994
Author

bb30994 commented Jan 24, 2018

Process a WU until it reaches 100%. Interrupt processing while the following messages are being issued.

01:35:36:WU00:FS00:0xa4:Finished Work Unit:
01:35:36:WU00:FS00:0xa4:- Reading up to 811536 from "00/wudata_01.trr": Read 811536
01:35:36:WU00:FS00:0xa4:trr file hash check passed.
01:35:36:WU00:FS00:0xa4:- Reading up to 746060 from "00/wudata_01.xtc": Read 746060
01:35:36:WU00:FS00:0xa4:xtc file hash check passed.
01:35:36:WU00:FS00:0xa4:edr file hash check passed.
01:35:36:WU00:FS00:0xa4:logfile size: 28730
01:35:36:WU00:FS00:0xa4:Leaving Run
01:35:37:WU00:FS00:0xa4:- Writing 1588814 bytes of core data to disk...
01:35:38:WU00:FS00:0xa4:Done: 1588302 -> 1538819 (compressed to 96.8 percent)
01:35:38:WU00:FS00:0xa4: ... Done.

I think this applies to any FAHCore, not just 0xa4. As I said earlier, it's in the code that prepares the package for upload by compressing and checksumming, which apparently is in the core wrapper.

Since the upload package is incomplete and the last checkpoint from GROMACS or from OpenMM is gone, if you now restart processing of the WU, it will restart from 0%, not from 100%.
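
To make the requested ordering concrete, here is a minimal Python sketch of one way to close that window: build the compressed, checksummed package under a temporary name, make it durable, rename it into place, and delete the checkpoint only after that. This is an illustration under stated assumptions, not the core wrapper's actual code or package format. The names `finalize_work_unit` and `package_is_valid`, the `.partial` suffix, and the "digest line followed by zlib data" layout are all hypothetical.

```python
import hashlib
import os
import zlib


def finalize_work_unit(result_paths, package_path, checkpoint_path):
    """Build the upload package durably before the checkpoint is removed."""
    raw = b""
    for path in result_paths:
        with open(path, "rb") as src:
            raw += src.read()
    compressed = zlib.compress(raw)
    digest = hashlib.sha256(compressed).hexdigest().encode("ascii")

    # Write under a temporary name so a crash never leaves a half-written
    # file at the final package path.
    tmp_path = package_path + ".partial"
    with open(tmp_path, "wb") as out:
        out.write(digest + b"\n" + compressed)
        out.flush()
        os.fsync(out.fileno())
    os.replace(tmp_path, package_path)  # atomic rename into place

    # Only now is it safe to drop the last checkpoint.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)


def package_is_valid(package_path):
    """On restart, check that a complete package with a matching digest exists."""
    try:
        with open(package_path, "rb") as src:
            digest, _, compressed = src.read().partition(b"\n")
    except FileNotFoundError:
        return False
    return hashlib.sha256(compressed).hexdigest().encode("ascii") == digest
```

If processing is interrupted before the rename, the checkpoint is still present and the WU can resume from it. If it is interrupted after the rename, `package_is_valid` succeeds and the package can be queued for upload.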

@bb30994
Author

bb30994 commented Jan 24, 2018

please merge with #1166

@jcoffland jcoffland removed the 0.Status - More Information Reported issue needs more information before a decision is made. label Jan 24, 2018
@jcoffland jcoffland added 1.Type - Enhancement Reported issue is an enhancement. 3.Component - FAHCoreWrapper Reported issue relates to FAHCoreWrapper. labels Apr 7, 2018
@bb30994
Author

bb30994 commented Jun 2, 2020

I think this applies to any FAHCore, not just 0xa4.

Yes, the problem was reported using FAHCore_22. I guess that suggests that they use the same wrapper code.
A manual exit happened at precisely the wrong time. On restart, there was no results file and the WU restarted from 0%.

****************************** Date: 2020-05-19 *******************************
19:33:36:WU00:FS00:0x22:Saving result file ..\logfile_01.txt
19:33:36:WU00:FS00:0x22:Saving result file checkpointState.xml
19:33:36:WU00:FS00:0x22:Saving result file checkpt.crc
19:33:36:WU00:FS00:0x22:Saving result file positions.xtc
19:33:36:WU00:FS00:0x22:Saving result file science.log
19:33:37:WU00:FS00:0x22:Folding@home Core Shutdown: FINISHED_UNIT
19:33:37:Clean exit

Note the absence of the two messages
...FahCore returned: FINISHED_UNIT (100 = 0x64)
...Sending unit results: id:0x state:SEND ....
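
A small Python sketch of how one might scan a log for this signature: a core reporting `FINISHED_UNIT` with no later `FahCore returned` line from the client. The function name `finished_without_return_code` is hypothetical, and the substring matching is based only on the log lines quoted in this issue, so it may not cover every client version.

```python
def finished_without_return_code(log_lines):
    """Return True if the last core shutdown was never acknowledged."""
    pending = False
    for line in log_lines:
        if "Core Shutdown: FINISHED_UNIT" in line:
            pending = True   # core says it finished
        elif "FahCore returned" in line:
            pending = False  # client recorded the return code
    return pending


sample = [
    "19:33:36:WU00:FS00:0x22:Saving result file science.log",
    "19:33:37:WU00:FS00:0x22:Folding@home Core Shutdown: FINISHED_UNIT",
    "19:33:37:Clean exit",
]
print(finished_without_return_code(sample))
```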

@bb30994 bb30994 changed the title Redundancy to protect checkpoint files (CoreWrapper/Gromacs/FAHCore_a7) Redundancy to protect checkpoint files (CoreWrapper/Gromacs/FAHCore_a7) and results file(Core_22) Jun 2, 2020
@bb30994
Author

bb30994 commented Jun 2, 2020

I've reproduced the problem that I reported above in Windows, though I'm not exactly sure how.

The condition I found it in: FAHControl reported that it was updating but nothing happened. FAHControl was running and FAHCore_22 was running. HideConsole was NOT used so the live output from FAHControl appeared in a window and the active WU was in the last few steps before 100%. I elected to let it run.

Telnet 127.0.0.1 36330 did not respond. (Well, it did clear the screen rather than hanging.)
bruce.zip
From that, I'm guessing that all of the ports had experienced the broken pipe failure and FAHClient was proceeding without external direction.

I tried to interrupt FAHClient with a CTRL-C (see log) and, after a long wait, with a second console interrupt. It progressed to 0x22:Folding@home Core Shutdown: FINISHED_UNIT but FAHClient did not report receiving a return code (as above).

None of the FAH*wrapper programs are running. All of the files are intact but, even compressed, they exceed GitHub's limit. I'm manually deleting the other WU.

20:20:23:WU03:FS02:0x22:Completed 970000 out of 1000000 steps (97%)
20:25:27:WU03:FS02:0x22:Completed 980000 out of 1000000 steps (98%)
20:30:28:WU03:FS02:0x22:Completed 990000 out of 1000000 steps (99%)
20:30:35:WARNING:Console control signal 0 on PID 2512
20:30:35:Exiting, please wait. . .
20:34:20:WARNING:Console control signal 0 on PID 2512
20:34:20:WARNING:Next signal will force exit
20:35:28:WU03:FS02:0x22:Completed 1000000 out of 1000000 steps (100%)

20:35:36:WU03:FS02:0x22:Saving result file ..\logfile_01.txt
20:35:36:WU03:FS02:0x22:Saving result file checkpointState.xml
20:35:36:WU03:FS02:0x22:Saving result file checkpt.crc
20:35:36:WU03:FS02:0x22:Saving result file positions.xtc
20:35:36:WU03:FS02:0x22:Saving result file science.log
20:35:36:WU03:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT

Upon restart, a new WU was downloaded for this GPU, and the existing WU started again from 0%. So whatever was left in the files, the client thought I needed a new assignment AND that I already had one, but not one with a recognizable results file that could be enqueued on the upload queue.
