Standby/Hibernate gives error on GPU since FahCore_22 0.0.10 #1529
Comments
I see the same error in FAHCore_a7. Is it in the wrapper or in the actual FAHCore?
03:45:49:WU01:FS00:0xa7:Completed 242500 out of 250000 steps (97%)
The error you see also appeared in FahCore_21. There, the WU simply resumed from the last checkpoint and continued folding, and the error was not counted toward the WU's "max error retries" limit. Since FahCore_22 0.0.10, the core additionally reports "max error retries reached" after, e.g., 3 standby/resume cycles and unexpectedly dumps the work unit. That is the problem.
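To make the requested behavior concrete, here is a minimal Python sketch of a retry counter that logs errors seen right after a resume but does not count them toward the limit. This is not FahCore code: the class `RetryPolicy`, its `after_resume` flag, and the limit of 3 are all assumptions for illustration.

```python
class RetryPolicy:
    """Hypothetical retry counter: only genuine compute errors count toward the limit."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries   # assumed limit, mirroring the "e.g. 3" above
        self.counted_errors = 0          # errors that count toward dumping the WU
        self.resume_errors = 0           # errors attributed to standby/hibernate

    def record_error(self, after_resume):
        """Record one error; return True if the work unit should be dumped."""
        if after_resume:
            # Attributed to suspend/resume: note it, but do not count it.
            self.resume_errors += 1
            return False
        self.counted_errors += 1
        return self.counted_errors >= self.max_retries
```

With this policy, any number of resume-related errors leaves the WU alive, while the third genuine error (for example a repeated NaN) still dumps it.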
This can be slightly complicated to resolve since there's no "warning" by the OS AFAIK before it goes into standby/hibernate. Thus upon resume, the client either:
Let's see what happens. BTW, the "max error retries reached" message in FahCore_22 was generally meant to prevent looping on NaN errors, not for this use case, which is very different. Since FahCore_22 is in active development, this might be considered, but let's wait and see.
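If the process really gets no warning before standby, one heuristic is to detect the sleep after the fact: keep a periodic heartbeat and treat an unusually large gap between two ticks as a probable suspend. The Python sketch below only illustrates that idea. `HeartbeatMonitor`, its `slack` factor, and the use of wall-clock timestamps are assumptions, not anything the client or core is known to do.

```python
class HeartbeatMonitor:
    """Hypothetical helper: flags a probable suspend when two ticks are far apart."""

    def __init__(self, interval, slack=5.0):
        self.interval = interval  # expected seconds between ticks
        self.slack = slack        # how many intervals may pass before we suspect a sleep
        self.last = None          # timestamp of the previous tick, if any

    def tick(self, now):
        """Record a heartbeat at time `now`; return True if a long gap preceded it."""
        gap_detected = (
            self.last is not None
            and (now - self.last) > self.interval * self.slack
        )
        self.last = now
        return gap_detected
```

In use, the caller would pass wall-clock timestamps (e.g. from `time.time()`) once per interval. An error that arrives shortly after `tick` returns True could then be attributed to standby/hibernate rather than to the simulation.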
The most common cause for this is that the data residing in the GPU's memory has changed. I have successfully done a hibernate (which preserves everything in main RAM, including the state of the FAHCore) and was able to resume with an active WU folding EXCEPT if power was removed from the GPU. Obviously, if part of the WU is running in the GPU's memory and that memory is NOT preserved by the hibernate, the WU cannot resume. I DO NOT RECOMMEND TESTING THIS PROCEDURE. Pausing the WU before hibernating should work as long as you wait long enough, but when the OS initiates the hibernation (e.g., as a power-saving feature), this is not possible. The proper procedure would be to write up an enhancement request for Microsoft (Linux? MacOS?) to enhance their hibernate procedure to add checkpointing of GPU status to their existing hibernation checkpointing of main RAM. Good luck with that.
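The recovery path described here (GPU-side state is lost, host-side state survives, work restarts from the last checkpoint) can be simulated in a few lines of Python. Everything in this sketch is hypothetical: `DeviceLostError`, `run_work_unit`, and its parameters stand in for the real core, which is not shown in this thread.

```python
class DeviceLostError(Exception):
    """Hypothetical error raised when GPU-side state is gone after a resume."""


def run_work_unit(total_steps, checkpoint_every, lose_device_at):
    """Simulate a WU; return (final_step, steps_recomputed)."""
    pending_losses = set(lose_device_at)  # steps at which the GPU state vanishes once
    checkpoint = 0                        # last step safely saved on the host
    step = 0
    recomputed = 0
    while step < total_steps:
        try:
            if step in pending_losses:
                pending_losses.discard(step)
                raise DeviceLostError(step)
            step += 1                     # one simulated step on the GPU
            if step % checkpoint_every == 0:
                checkpoint = step         # host-side checkpoint survives hibernate
        except DeviceLostError:
            recomputed += step - checkpoint
            step = checkpoint             # roll back and continue folding
    return step, recomputed
```

For example, `run_work_unit(100, 25, [60])` loses the device at step 60, rolls back to the checkpoint at step 50, and finishes at step 100 after recomputing 10 steps. The cost of each standby is bounded by the checkpoint interval, which is why a rollback alone need not be treated as a fatal error.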
Theory: From the Windows perspective, the GPU is used to compute/rasterize data for display in a window. After a hibernation, that data can be redisplayed by refreshing the window, which recomputes it from the program's in-RAM state. Assuming Windows sends a refresh request to each program that had an open window on display, can FAH intercept that request and recompute every pending GPU kernel?
In other words, the refresh request from Windows could potentially be understood as a resume request. Code to "drain" the GPU of active work would be a challenge since that means hanging the supporting CPU task until all kernels have quiesced.
We're looking into this, but the fix will require somewhat more extensive changes, so it likely won't be out until core22 0.0.13.
FAHCore22
09:34:21:WU01:FS00:0x22:Checkpoint completed at step 255000
Still seeing this problem... any ideas?
@sashmxm: We're nearly finished with some core build updates that will make it easier to address this issue. Can you describe what happens after hibernation? Does the WU fail to resume and get returned as an ERROR, or does something get stuck, requiring you to manually clear out the WU or restart the client?
After standby/hibernate and resume on Windows, the GPU core FahCore_22 0.0.10 shows an error and resumes from the last checkpoint. But after several standby/hibernate cycles, the work unit gets dumped as BAD because of "max error retries reached". Standby/hibernate should not count as an error.