CPU Folding: BAD_FRAME_CHECKSUM on 6 cores #1458
Comments
@sgnsajgon Can you please post your log file in the Official Forum (https://foldingforum.org/index.php) so the issue can be diagnosed? "Guru Meditation" can come from a number of GPU-related configuration or failure issues.
@shorttack I will; I'm currently waiting for the account activation email. This issue was labeled "OpenMM Core", but I don't think this is a GPU-related problem. The GPU slot works correctly and is restored correctly after a crash. It's a CPU slot problem.
@sgnsajgon I labeled it OpenMM Core because that's where "Guru Meditation" comes from. I suggested diagnosing with the Forum community because I have seen Guru Meditation with a failing graphics card, which would make it not a problem for us here at GitHub.
@shorttack So why is the "Guru Meditation" log entry annotated as the "0xa7" Gromacs core?
I have posted this to the Official Forum.
@shorttack I'm not sure that is the case. My "Guru Meditation" error is 100% reproducible, and it always occurs before the CPU slot starts work. It occurs after a new Google Cloud VM is started, FAHClient is launched, and the CPU slot tries to resume the previous WU from its checkpoint, because processing of that WU was interrupted by a prior VM preemption (hard kill). Nonetheless, I will try deploying other VMs with 6 and 8 vCPUs and no GPU to check whether the problem also occurs there.
@sgnsajgon Look in the Forum for problems with Core a7 and Windows shutdown. That's a known problem where Windows shuts down before folding is done cleaning up, leading to corruption on the next startup.
I have reproduced this issue using a VM with 8 vCPUs and no GPU. All vCPUs work as a single folding slot.
@shorttack Is the source code responsible for checkpoint resume, the persistence layer, Guru Meditation and the checksum available on GitHub or elsewhere? I cannot find it, and I would like to check it. I guess it would be the source code of FAHClient or the FAHCore.
It's FAHCore_a7 itself. All FAHClient is doing is telling the FAHCore to start processing the files in directory NN. The FAHCore contains the code that is attempting to open a corrupt checkpoint. The Client doesn't process checkpoint files, only the FAHCore does. The real issue is why the data are left somewhere in RAM rather than being finalized (synced) on disk by the FAHCore that's shutting down. The startup can't correct an incomplete disk image.
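To illustrate what "finalized (synced) on disk" means in general, here is a minimal Python sketch of a crash-safe checkpoint write. It is not FAHCore code (the core's source and checkpoint format are not shown in this thread); the function name `write_checkpoint_atomically` and the `.ckpt-` temp-file prefix are made up for the example. The idea is to write to a temporary file, `fsync` it, and only then rename it over the old checkpoint, so a hard kill leaves either the complete old file or the complete new one.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, data: bytes) -> None:
    """Hypothetical helper: replace `path` with `data` without ever
    exposing a half-written file to the next startup."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push its cache to disk
        os.replace(tmp_path, path)  # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # On POSIX, also sync the directory entry so the rename itself is durable.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

With this pattern, data that only reached RAM before the kill is lost, but the previous checkpoint on disk stays intact and readable.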
See also https://foldingforum.org/viewtopic.php?f=108&t=35147&p=333767#p333767 on containers and preemptible VMs.
There's a reasonable chance that this problem is related to the FAHCore wrapper rather than to a specific FAHCore. I'm not sure which steps involved in a PAUSE are handled internally by a FAHCore and which are handled by the wrapper.
FAHClient 7.6.9
I'm folding on Google Cloud Platform, using preemptible virtual machines with persistent storage.
Such a VM can be killed at any time, without a graceful shutdown, so FAHClient is forcibly killed. After a few minutes the VM is spawned again, and folding is resumed from the checkpoint.
In the past I used 2 VMs: the first with 1 vCPU for CPU folding, the second with 1 vCPU and a GPU for GPU folding. These VMs worked correctly: preemption, checkpoint restore and work resume all worked without errors.
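For readers unfamiliar with why a resume after a hard kill can fail a checksum, here is a small Python sketch of the general pattern: store a digest next to each checkpoint, verify it on startup, and fall back to an older checkpoint if the newest one was truncated. Everything here is an assumption made for illustration (the `.sha256` sidecar file, SHA-256 itself, and the function names); it does not describe how FAHCore_a7 stores or validates its frames.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Optional, Tuple


def save_with_digest(path: Path, data: bytes) -> None:
    """Hypothetical writer: store the data plus a SHA-256 sidecar file."""
    path.write_bytes(data)
    Path(str(path) + ".sha256").write_text(hashlib.sha256(data).hexdigest())


def load_if_valid(path: Path) -> Optional[bytes]:
    """Return the checkpoint bytes only if they match the stored digest."""
    digest_path = Path(str(path) + ".sha256")
    if not path.exists() or not digest_path.exists():
        return None
    data = path.read_bytes()
    expected = digest_path.read_text().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        return None  # truncated or corrupted, e.g. by a hard kill mid-write
    return data


def newest_valid_checkpoint(
    candidates: Iterable[Path],
) -> Optional[Tuple[Path, bytes]]:
    """Walk candidates (newest first) and return the first one that verifies."""
    for candidate in candidates:
        data = load_if_valid(candidate)
        if data is not None:
            return candidate, data
    return None
```

Keeping at least one older checkpoint around is what makes the fallback possible; with a single checkpoint file, a corrupted write leaves nothing to resume from.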
Now I'm using only one VM with 8 vCPUs and a GPU, folding on 2 slots:
The slots fold as intended, but after a preemption and resumption, the GPU slot resumes correctly while the CPU slot always fails in the same way.