Standby/Hibernate gives error on GPU since FahCore_22 0.0.10 #1529

Open
informatorius opened this issue Jun 24, 2020 · 8 comments
Labels
1.Type - Defect Reported issue is a defect. 3.Component - GROMACS Core Reported issue relates to FahCore_a7. 3.Component - OpenMM Core Reported issue relates to FahCore_21/FahCore_22. 4.OS - All Reported issue occurs on all supported OS platforms.

Comments

@informatorius

After Standby/Hibernate and resume on Windows, the GPU core FahCore_22 0.0.10 shows an error and resumes from the last checkpoint. But after several Standby/Hibernate cycles, the work unit gets dumped as BAD because the max error retries limit was reached. Standby/Hibernate should not count as an error.

@bb30994

bb30994 commented Jun 26, 2020

I see the same error in FAHCore_a7. Is it in the wrapper or in the actual FAHCore?

03:45:49:WU01:FS00:0xa7:Completed 242500 out of 250000 steps (97%)
06:01:50:WARNING:WU01:FS00:Detected clock skew (2 hours 06 mins), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
06:01:59:WARNING:WU01:FS00:FahCore returned an unknown error code which probably indicates that it crashed
06:01:59:WARNING:WU01:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
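
For anyone who wants to check their own logs for this pattern, here is a minimal sketch that pulls the reported skew out of the warning line and pairs it with a following WU_STALLED line. It assumes the message wording shown in the log excerpts in this thread; the function names `skew_seconds` and `stalls_after_skew` are made up for illustration and are not part of FAHClient.

```python
import re

# Matches the clock-skew warning as worded in the logs quoted in this thread.
SKEW_RE = re.compile(r"Detected clock skew \(([^)]*)\)")
UNIT_SECONDS = {"hour": 3600, "min": 60, "sec": 1}


def skew_seconds(line):
    """Return the reported skew in seconds, or None if this is not a skew warning."""
    m = SKEW_RE.search(line)
    if m is None:
        return None
    total = 0
    # Handles both "2 hours 06 mins" and "1 mins 01 secs" style durations.
    for value, unit in re.findall(r"(\d+)\s*(hour|min|sec)", m.group(1)):
        total += int(value) * UNIT_SECONDS[unit]
    return total


def stalls_after_skew(lines):
    """Yield (skew_seconds, line) for each WU_STALLED line that follows a skew warning."""
    last_skew = None
    for line in lines:
        s = skew_seconds(line)
        if s is not None:
            last_skew = s
        elif "WU_STALLED" in line and last_skew is not None:
            yield last_skew, line
            last_skew = None
```

Feeding it the four log lines above would pair the WU_STALLED line with the 2 hours 06 mins skew, which is the signature of a crash that follows a suspend rather than a genuine core fault.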

@informatorius
Author

informatorius commented Jun 27, 2020

The error you see was also shown previously in FahCore_21. But the WU just resumes from the last checkpoint and continues folding. It was not counted as an error toward the WU's "max error retries" limit.

Since FahCore_22 0.0.10, it additionally says "max error retries reached" after e.g. 3 standby/resume cycles and dumps the work unit unexpectedly. That is the problem.

@PantherX PantherX added 1.Type - Defect Reported issue is a defect. 3.Component - GROMACS Core Reported issue relates to FahCore_a7. 3.Component - OpenMM Core Reported issue relates to FahCore_21/FahCore_22. 4.OS - All Reported issue occurs on all supported OS platforms. labels Aug 28, 2020
@PantherX
Contributor

This can be slightly complicated to resolve since, AFAIK, the OS gives no "warning" before it goes into standby/hibernate. Thus, upon resume, the client either:

  • Expects all work to be present, which may not be the case
  • Resumes from the last valid checkpoint

Let's see what happens.

BTW, the "max error retries reached" message in FahCore_22 was generally meant to prevent looping on NaN errors, not for this use case, which is very different. Since FahCore_22 is in active development, a change might be considered, but let's wait and see.
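
To make the request concrete, here is a minimal sketch of a retry counter that only charges failures that were not preceded by a suspend/resume. This is not how FahCore_22 is implemented; the class name `RetryBudget`, the `after_suspend` flag, and the default limit of 3 (taken loosely from the "e.g. 3" mentioned above) are all assumptions for illustration.

```python
class RetryBudget:
    """Counts failures toward a retry limit, skipping those attributed to suspend/resume.

    Hypothetical sketch: the real core's bookkeeping is not shown in this thread.
    """

    def __init__(self, max_retries=3):
        self.max_retries = max_retries  # assumed limit, not a documented value
        self.errors = 0

    def record_failure(self, after_suspend):
        """Record one failure and return True if the work unit should be dumped."""
        if not after_suspend:
            # Only genuine failures (for example a NaN loop) consume the budget.
            self.errors += 1
        return self.errors >= self.max_retries
```

With this shape, any number of standby/resume restarts leaves the budget untouched, while repeated NaN-style failures still hit the limit as intended. The hard part, deciding reliably whether a failure followed a suspend, is left to the caller.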

@bb30994

bb30994 commented Aug 30, 2020

The most common cause for this is that the data residing in the GPU's memory has changed. I have successfully done a hibernate (which preserves everything in main RAM, including the state of the FAHCore) and was able to resume with an active WU folding EXCEPT if power was removed from the GPU. Obviously, if part of the WU's state resides in the GPU's memory and that memory is NOT preserved by the hibernate, the WU cannot resume. I DO NOT RECOMMEND TESTING THIS PROCEDURE. Pausing the WU before hibernating should work as long as you wait long enough, but when the OS initiates the hibernation (e.g., as a power-saving feature), this is not possible.

The proper procedure would be to write up an enhancement request for Microsoft (Linux? macOS?) to extend their hibernate procedure so that it checkpoints GPU state in addition to main RAM. Good luck with that.

@bb30994

bb30994 commented Aug 30, 2020

Theory: From the Windows perspective, the GPU is used to compute/rasterize data for display in a window. After a hibernation, that data can be redisplayed by refreshing the window, which recomputes it based on the program's in-RAM state. Assuming Windows sends a refresh request to each program that had an open window being displayed, can FAH intercept that request and recompute every pending GPU kernel?

We don’t have a way to detect suspend/resume at the core level, so ...

In other words, the refresh request from Windows could potentially be understood as a resume request.

Code to "drain" the GPU of active work would be a challenge since that means hanging the supporting CPU task until all kernels have quiesced.
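
Short of an OS notification, one heuristic is to watch for an unexplained gap between periodic ticks, which is roughly what the "Detected clock skew" warning in the logs reports. The sketch below is only an illustration under that assumption: `SuspendDetector`, `interval`, and `tolerance` are hypothetical names, and the injectable `clock` exists only so the logic can be exercised without actually sleeping.

```python
import time


class SuspendDetector:
    """Flags a likely suspend/resume when far more wall-clock time passed than expected.

    Hypothetical sketch, not FAHClient or FahCore code.
    """

    def __init__(self, interval, tolerance, clock=time.time):
        self.interval = interval    # expected seconds between tick() calls
        self.tolerance = tolerance  # extra seconds allowed before flagging a gap
        self.clock = clock          # wall clock by default; injectable for testing
        self.last = clock()

    def tick(self):
        """Call once per interval; return the unexplained gap in seconds, or 0.0."""
        now = self.clock()
        gap = now - self.last - self.interval
        self.last = now
        return gap if gap > self.tolerance else 0.0
```

Wall-clock time is used here on purpose, since some monotonic clocks do not advance while the machine is suspended. The trade-off is that a manual clock change or a long I/O stall produces the same signal, so a nonzero result can only be treated as "possibly resumed from suspend", not as proof.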

@jchodera
Member

jchodera commented Sep 13, 2020

We're looking into this, but the fix will require somewhat more extensive changes, so it likely won't be out until core22 0.0.13.

@sashmxm

sashmxm commented Mar 25, 2021

FAHCore22
/hive/miners/fah/7.6.21/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22
09:25:44:WU01:FS00:0x22:Completed 210000 out of 750000 steps (28%)
09:25:46:WU01:FS00:0x22:Checkpoint completed at step 210000
09:26:46:WARNING:WU01:FS00:Detected clock skew (1 mins 01 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
09:26:47:ERROR:Send error: 32: Broken pipe
09:27:41:WU01:FS00:0x22:Completed 217500 out of 750000 steps (29%)

09:34:21:WU01:FS00:0x22:Checkpoint completed at step 255000
09:35:21:WARNING:WU01:FS00:Detected clock skew (1 mins 02 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
09:35:22:ERROR:Send error: 32: Broken pipe
09:36:18:WU01:FS00:0x22:Completed 262500 out of 750000 steps (35%)

I'm still seeing this problem. Any ideas?

@jchodera
Member

@sashmxm : We're nearly finished with some core build updates that will make it easier to address this issue.

Can you describe what happens after hibernation? Does the WU fail to resume and get returned as an ERROR, or does something get stuck, requiring you to manually clear out the WU or restart the client?


5 participants