Standby/Hibernate gives error on GPU since FahCore_22 0.0.10 #1529

informatorius · 2020-06-24T06:23:57Z

After Standby/Hibernate and resume on Windows, the GPU core FahCore_22 0.0.10 reports an error and resumes from the last checkpoint. After several Standby/Hibernate cycles, however, the work unit gets dumped as BAD because the maximum number of error retries has been reached. Standby/Hibernate should not count as an error.

bb30994 · 2020-06-26T05:34:51Z

I see the same error in FAHCore_a7. Is it in the wrapper or in the actual FAHCore?

03:45:49:WU01:FS00:0xa7:Completed 242500 out of 250000 steps (97%)
06:01:50:WARNING:WU01:FS00:Detected clock skew (2 hours 06 mins), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
06:01:59:WARNING:WU01:FS00:FahCore returned an unknown error code which probably indicates that it crashed
06:01:59:WARNING:WU01:FS00:FahCore returned: WU_STALLED (127 = 0x7f)
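
To compare how long each pause lasted across reports, the duration inside the "Detected clock skew" warning can be pulled out of the log. The following is a minimal sketch that assumes only the message format visible in the excerpts in this thread; `skew_seconds` is a hypothetical helper, not part of the FAH client.

```python
import re

# Matches the clock-skew warning as it appears in the log excerpts in this thread.
SKEW_RE = re.compile(r"Detected clock skew \(([^)]*)\)")
UNIT_SECONDS = {"hour": 3600, "min": 60, "sec": 1}


def skew_seconds(line):
    """Return the reported skew in seconds, or None if the line has no skew warning."""
    match = SKEW_RE.search(line)
    if match is None:
        return None
    total = 0
    # The parenthesised text looks like "2 hours 06 mins" or "1 mins 01 secs".
    for value, unit in re.findall(r"(\d+)\s+(hour|min|sec)s?", match.group(1)):
        total += int(value) * UNIT_SECONDS[unit]
    return total
```

A skew of hours points to a real suspend, while a skew of about a minute may equally be an I/O delay or another slowdown, as the warning text itself notes.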

informatorius · 2020-06-27T07:53:05Z

The error you see also appeared previously in FahCore_21, but the WU just resumed from the last checkpoint and continued folding. It was not counted as an error toward the WU's "max error retries" limit.

Since FahCore_22 0.0.10, the core additionally reports "max error retries reached" after, e.g., 3 standby/resume cycles and dumps the work unit unexpectedly. That is the problem.

PantherX · 2020-08-28T21:48:06Z

This can be slightly complicated to resolve since, AFAIK, the OS gives no "warning" before it goes into standby/hibernate. Thus, upon resume, the client either:

Expects all work to be present, which may not be the case
Resumes from the last valid checkpoint

Let's see what happens.

BTW, the message in FahCore_22 about max error retries reached was generally meant to prevent looping errors for NAN, not for this use case which is very different. Since FahCore_22 is in active development, it might be considered but let's wait and see.
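
The behaviour requested in this issue can be sketched as a retry counter that only charges genuine errors against the limit. This is a hypothetical illustration, not the actual FahCore_22 logic: `RetryBudget`, `record_failure`, and the `after_suspend` flag are all assumed names, and how the core would learn that a failure followed a suspend is left open.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Hypothetical retry counter; not taken from the FahCore_22 source."""

    max_retries: int = 3  # assumed limit, matching the "e.g. 3" reported above
    errors: int = 0

    def record_failure(self, after_suspend: bool) -> bool:
        """Record a failed run.

        Returns True if the work unit may resume from its last checkpoint,
        False if it should be dumped because the error limit was reached.
        """
        if not after_suspend:
            # Only genuine failures (for example a NaN) count against the limit.
            self.errors += 1
        return self.errors < self.max_retries
```

With this shape, any number of standby/resume cycles leaves the budget untouched, while a looping NaN error still exhausts it after `max_retries` attempts.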

bb30994 · 2020-08-30T16:08:56Z

The most common cause of this is that the data residing in the GPU's memory has changed. I have successfully done a hibernate (which preserves everything in main RAM, including the state of the FAHCore) and was able to resume with an active WU folding, EXCEPT if power was removed from the GPU. Obviously, if part of the WU is running in the GPU's memory and that memory is NOT preserved by the hibernate, the WU cannot resume. I DO NOT RECOMMEND TESTING THIS PROCEDURE. Pausing the WU before hibernating should work as long as you wait long enough, but when the OS initiates the hibernation (e.g., as a power-saving feature), this is not possible.

The proper procedure would be to write up an enhancement request for Microsoft (Linux? MacOS?) to extend their hibernate procedure so that it checkpoints GPU state in addition to main RAM. Good luck with that.

bb30994 · 2020-08-30T16:24:48Z

Theory: From the Windows perspective, the GPU is used to compute/rasterize data for display in a window. After a hibernation, that data can be redisplayed by refreshing the window, which recomputes it from the program's in-RAM state. Assuming Windows sends a refresh request to each program that had an open window on display, can FAH intercept that request and recompute every pending GPU kernel?

We don’t have a way to detect suspend/resume at the core level, so ...

In other words, the refresh request from Windows could potentially be understood as a resume request.

Code to "drain" the GPU of active work would be a challenge since that means hanging the supporting CPU task until all kernels have quiesced.
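
Without a notification from the OS, one generic way to infer a resume is to watch for an unexpectedly large wall-clock gap between periodic ticks, which is roughly what the client's "Detected clock skew" warning reports. The sketch below is an assumption-laden illustration, not FAH code: `gap_exceeds`, `watch`, and the 5-second tolerance are hypothetical.

```python
import time


def gap_exceeds(previous, current, interval, tolerance=5.0):
    """True if more wall-clock time passed between two ticks than one interval plus tolerance."""
    return (current - previous) > interval + tolerance


def watch(interval=1.0, on_resume=print, ticks=3, clock=time.time, sleep=time.sleep):
    """Tick `ticks` times and call on_resume(gap_seconds) whenever a tick arrives late.

    `clock` and `sleep` are injectable so the loop can be exercised without waiting.
    """
    previous = clock()
    for _ in range(ticks):
        sleep(interval)
        current = clock()
        if gap_exceeds(previous, current, interval):
            # The process was not running for a while: suspend, hibernate, or a stall.
            on_resume(current - previous)
        previous = current
```

A gap alone cannot distinguish a suspend from an I/O stall or a manual clock change, so a handler would still need to verify GPU state (or fall back to the last checkpoint) rather than assume the device memory survived.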

jchodera · 2020-09-13T16:37:51Z

We're looking into this, but the fix will require somewhat more extensive changes, so it likely won't be out until core22 0.0.13.

sashmxm · 2021-03-25T09:41:48Z

FAHCore22
/hive/miners/fah/7.6.21/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22
09:25:44:WU01:FS00:0x22:Completed 210000 out of 750000 steps (28%)
09:25:46:WU01:FS00:0x22:Checkpoint completed at step 210000
09:26:46:WARNING:WU01:FS00:Detected clock skew (1 mins 01 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
09:26:47:ERROR:Send error: 32: Broken pipe
09:27:41:WU01:FS00:0x22:Completed 217500 out of 750000 steps (29%)

09:34:21:WU01:FS00:0x22:Checkpoint completed at step 255000
09:35:21:WARNING:WU01:FS00:Detected clock skew (1 mins 02 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
09:35:22:ERROR:Send error: 32: Broken pipe
09:36:18:WU01:FS00:0x22:Completed 262500 out of 750000 steps (35%)

I still have this problem... any ideas?

jchodera · 2021-03-25T19:39:24Z

@sashmxm : We're nearly finished with some core build updates that will make it easier to address this issue.

Can you describe what happens after hibernation? Does the WU fail to resume and get returned as an ERROR, or does something get stuck, requiring you to manually clear out the WU or restart the client?

PantherX added labels on Aug 28, 2020: 1.Type - Defect (reported issue is a defect); 3.Component - GROMACS Core (reported issue relates to FahCore_a7); 3.Component - OpenMM Core (reported issue relates to FahCore_21/FahCore_22); 4.OS - All (reported issue occurs on all supported OS platforms).

bb30994 assigned jchodera Nov 10, 2020
