CPU Folding: BAD_FRAME_CHECKSUM on 6 cores #1458

sgnsajgon · 2020-05-04T00:36:06Z

FAHClient 7.6.9

I'm folding on Google Cloud Platform, using preemptible virtual machines with persistent storage.
Such a VM can be killed at any time without a graceful shutdown, in which case FAHClient is forcibly killed. After a few minutes the VM is spawned again and folding resumes from the checkpoint.

In the past I used 2 VMs: the first with 1 vCPU for CPU folding, the second with 1 vCPU and a GPU for GPU folding. These VMs worked correctly: preemption, checkpoint restore, and work resumption all worked without errors.

Now I'm using only one VM with 8 vCPUs and a GPU, folding on 2 slots:

CPU slot utilizing 6 vCPUs.
GPU slot.

Both slots fold as intended, but after preemption and resumption the GPU slot resumes correctly while the CPU slot always fails in the same way:

10:16:28: CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
10:16:28: CPU ID: GenuineIntel Family 6 Model 79 Stepping 0
10:16:28: CPUs: 8
10:16:28: Memory: 29.45GiB
10:16:28: Free Memory: 27.47GiB
10:16:28: Threads: POSIX_THREADS
10:16:28: OS Version: 4.19
10:16:28: Has Battery: false
10:16:28: On Battery: false
10:16:28: UTC Offset: 0
10:16:28: PID: 10
10:16:28: CWD: /var/lib/fahclient
10:16:28: OS: Linux 4.19.112+ x86_64
10:16:28: OS Arch: AMD64
10:16:28: GPUs: 1
10:16:28: GPU 0: Bus:0 Slot:4 Func:0 NVIDIA:7 TU104GL [Tesla T4]
10:16:28: CUDA Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:7.5 Driver:10.1
10:16:28:OpenCL Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:1.2 Driver:418.67
(...)
10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation #885995a80cc46232.818eb4166c330456 (5455872.5459308) '02/01/state.cpt'
10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit() call
10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
10:16:29:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
10:16:29:WU02:FS00:0xa7:Saving result file frame51.trr
10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation #0.d6a0f109730da89c (0.5457120) '02/01/frame51.trr'
10:16:34:WARNING:WU02:FS00:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)
10:16:34:WARNING:WU02:FS00:Fatal error, dumping
10:16:34:WU02:FS00:Sending unit results: id:02 state:SEND error:DUMPED project:14570 run:0 clone:1034 gen:51 core:0xa7 unit:0x00000040287234c95e7ee8a1d36c7740
10:16:34:WU02:FS00:Connecting to 40.114.52.201:8080
10:16:34:WU00:FS00:Connecting to 65.254.110.245:80
10:16:35:WU02:FS00:Server responded WORK_ACK (400)
10:16:35:WU02:FS00:Cleaning up
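
When a failure like this recurs across many VM restarts, it can help to pull the FahCore return codes out of the client log automatically. The following is a minimal sketch that assumes the line layout shown in the log above; `parse_core_return` and its regular expression are hypothetical helpers written for illustration and are not part of FAHClient.

```python
import re

# Matches client log lines of the form seen above, for example:
# 10:16:34:WARNING:WU02:FS00:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)
RETURN_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<name>\w+) \((?P<dec>\d+) = (?P<hex>0x[0-9a-fA-F]+)\)"
)


def parse_core_return(line):
    """Return a dict describing a 'FahCore returned' line, or None if no match."""
    match = RETURN_RE.match(line.strip())
    if match is None:
        return None
    code = int(match.group("dec"))
    # The log prints the code twice (decimal and hex); check that they agree.
    if code != int(match.group("hex"), 16):
        raise ValueError("decimal and hex return codes disagree: " + line)
    return {
        "time": match.group("time"),
        "wu": match.group("wu"),
        "slot": match.group("slot"),
        "name": match.group("name"),
        "code": code,
    }
```

Applied to the `FahCore returned` line above, this yields the name `BAD_FRAME_CHECKSUM` with code 112 for `WU02` on slot `FS00`; lines of any other shape return `None`.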

shorttack · 2020-05-05T21:54:30Z

@sgnsajgon Can you please post your log file in the Official Forum: https://foldingforum.org/index.php for the issue to be diagnosed? "Guru Meditation" can come from a number of GPU-related configuration or failure issues.

sgnsajgon · 2020-05-07T01:34:06Z

@shorttack I will. I'm currently waiting for the account activation email.

This issue was labeled "OpenMM Core", but I don't think this is a GPU-related problem. The GPU slot works correctly and is restored correctly after a crash. It's a CPU slot problem.

shorttack · 2020-05-07T14:42:30Z

@sgnsajgon I labeled OpenMM Core because that's where "Guru Meditation" comes from. I suggested diagnosing with the Forum community because I have seen Guru Meditation in a failing graphics card, which would make it not a problem for us here at GitHub.

sgnsajgon · 2020-05-07T19:41:15Z

@shorttack Then why is the "Guru Meditation" log entry annotated with "0xa7", the Gromacs core?

10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation

sgnsajgon · 2020-05-07T19:41:57Z

I have posted to the Official Forum:

https://foldingforum.org/viewtopic.php?f=61&t=35123

shorttack · 2020-05-07T19:50:14Z

> Then why is the "Guru Meditation" log entry annotated with "0xa7", the Gromacs core?

Sorry, so many posts today in this part-time job. I have ALSO seen Guru Meditation in a system where the CPU was overheating. The AVX-512 instructions are very heat-intensive. You can try dialing back the AVX multiplier in your BIOS (Google it). Gotta go.

sgnsajgon · 2020-05-07T20:01:36Z

@shorttack I'm not sure that is the case. My "Guru Meditation" error is 100% reproducible, and it always occurs before the CPU slot starts work. It occurs after a new Google Cloud VM is started, FAHClient is launched, and the CPU slot tries to resume the previous WU from its checkpoint, because processing of that WU was interrupted by the prior VM preemption (hard kill).

Nonetheless, I will try deploying other VMs with 6 and 8 vCPUs and no GPU to check whether the problem also occurs there.

shorttack · 2020-05-08T00:26:46Z

@sgnsajgon Look in the Forum for problems with Core a7 and Windows shutdown. That's a known problem where Windows shuts down before folding is done cleaning up, leading to corruption on the next startup.

sgnsajgon · 2020-05-08T23:43:17Z

I have reproduced this issue using a VM with 8 vCPUs and no GPU. All vCPUs work as a single folding slot.

sgnsajgon · 2020-05-09T00:41:05Z

@shorttack Is the source code responsible for checkpoint resume, the persistence layer, Guru Meditation, and the checksum available on GitHub or elsewhere? I cannot find it and would like to check it. I guess it would be the source code of FAHClient or the FAHCore.

bb30994 · 2020-05-12T01:58:02Z

It's FAHCore_a7 itself. All FAHClient does is tell the FAHCore to start processing the files in directory NN. The FAHCore contains the code that attempts to open the corrupt checkpoint; the Client doesn't process checkpoint files, only the FAHCore does.

The real issue is why the data are left somewhere in RAM rather than being finalized (synced) on disk by the FAHCore that's shutting down. The startup can't correct an incomplete disk image.
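
To illustrate what "finalized (synced) on disk" would involve, here is a minimal sketch of a crash-safe checkpoint write on Linux. The FAHCore source is not shown in this thread, so this is not its actual code; `write_checkpoint_durably` is a hypothetical name, and the pattern shown (write to a temporary file, `fsync`, then atomically rename) is a general technique, offered only as an assumption about what a robust writer could do.

```python
import os
import tempfile


def write_checkpoint_durably(path, data):
    """Write bytes to path so a hard kill leaves either the old or the new file.

    Hypothetical helper for illustration; assumes a POSIX filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename below
    # stays on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push it to the disk
        os.replace(tmp_path, path)   # atomically swap in the new checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory entry so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this pattern a preempted VM would restart with either the previous complete checkpoint or the new complete one, never a partially written file. Whether the a7 core or its wrapper does anything similar is exactly the open question in this issue.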

shorttack · 2020-05-12T14:40:08Z

See also https://foldingforum.org/viewtopic.php?f=108&t=35147&p=333767#p333767 on containers and preemptible VMs.

bb30994 · 2020-06-03T20:49:36Z

There's a reasonable chance that this problem is related to the FAHCore-wrapper rather than to a specific FAHCore. I'm not sure which steps involved in a PAUSE are handled internally by a FAHCore and which are handled by the wrapper.

bb30994 mentioned this issue May 4, 2020

FAHCore_a7 and _22 don't always sync files on Windows Shutdown. #1314

Open

shorttack added the labels "Linux" and "3.Component - OpenMM Core" (Reported issue relates to FahCore_21/FahCore_22.) on May 5, 2020

PantherX added the labels "4.OS - Debian" (Reported issue occurs on Debian based OS: Debian, Mint, Ubuntu.) and "4.OS - Fedora" (Reported issue occurs on Fedora based OS: Fedora, Red Hat, CentOS.) and removed the label "Linux" on May 22, 2020

PantherX added the label "3.Component - FAHCoreWrapper" (Reported issue relates to FAHCoreWrapper.) and removed the label "3.Component - OpenMM Core" (Reported issue relates to FahCore_21/FahCore_22.) on Jun 4, 2020
