CPU Folding: BAD_FRAME_CHECKSUM on 6 cores #1458

Open
sgnsajgon opened this issue May 4, 2020 · 13 comments
Labels
3.Component - FAHCoreWrapper Reported issue relates to FAHCoreWrapper. 4.OS - Debian Reported issue occurs on Debian based OS (Debian, Mint, Ubuntu). 4.OS - Fedora Reported issue occurs on Fedora based OS (Fedora, Red Hat, CentOS).

Comments

@sgnsajgon

sgnsajgon commented May 4, 2020

FAHClient 7.6.9

I'm folding on Google Cloud Platform, using preemptible virtual machines with persistent storage.
Such a VM can be killed at any time without a graceful shutdown. When that happens, FAHClient is forcibly killed; after a few minutes the VM is spawned again and folding resumes from the checkpoint.

I previously used two VMs: the first with 1 vCPU for CPU folding, the second with 1 vCPU and a GPU for GPU folding. Both worked correctly: preemption, checkpoint restore, and work resumption all succeeded without errors.

Now I'm using only one VM with 8 vCPUs and a GPU, folding on 2 slots:

  1. CPU slot utilizing 6 vCPUs.
  2. GPU slot.

Both slots fold as intended. After a preemption and restart, however, the GPU slot resumes correctly, while the CPU slot always fails in the same way:

10:16:28: CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
10:16:28: CPU ID: GenuineIntel Family 6 Model 79 Stepping 0
10:16:28: CPUs: 8
10:16:28: Memory: 29.45GiB
10:16:28: Free Memory: 27.47GiB
10:16:28: Threads: POSIX_THREADS
10:16:28: OS Version: 4.19
10:16:28: Has Battery: false
10:16:28: On Battery: false
10:16:28: UTC Offset: 0
10:16:28: PID: 10
10:16:28: CWD: /var/lib/fahclient
10:16:28: OS: Linux 4.19.112+ x86_64
10:16:28: OS Arch: AMD64
10:16:28: GPUs: 1
10:16:28: GPU 0: Bus:0 Slot:4 Func:0 NVIDIA:7 TU104GL [Tesla T4]
10:16:28: CUDA Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:7.5 Driver:10.1
10:16:28:OpenCL Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:1.2 Driver:418.67
(...)
10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation #885995a80cc46232.818eb4166c330456 (5455872.5459308) '02/01/state.cpt'
10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit() call
10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
10:16:29:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
10:16:29:WU02:FS00:0xa7:Saving result file frame51.trr
10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation #0.d6a0f109730da89c (0.5457120) '02/01/frame51.trr'
10:16:34:WARNING:WU02:FS00:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)
10:16:34:WARNING:WU02:FS00:Fatal error, dumping
10:16:34:WU02:FS00:Sending unit results: id:02 state:SEND error:DUMPED project:14570 run:0 clone:1034 gen:51 core:0xa7 unit:0x00000040287234c95e7ee8a1d36c7740
10:16:34:WU02:FS00:Connecting to 40.114.52.201:8080
10:16:34:WU00:FS00:Connecting to 65.254.110.245:80
10:16:35:WU02:FS00:Server responded WORK_ACK (400)
10:16:35:WU02:FS00:Cleaning up
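
For anyone triaging many restarts, the failing units can be pulled out of a log automatically. The following is a minimal sketch, not part of FAHClient: it assumes the warning line format shown in the excerpt above, and the names `CORE_EXIT` and `find_core_failures` are hypothetical.

```python
import re

# Pattern derived from the excerpt above, e.g.:
# 10:16:34:WARNING:WU02:FS00:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)
# Only WARNING-level "FahCore returned" lines are matched (an assumption).
CORE_EXIT = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WARNING:"
    r"(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<status>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)"
)


def find_core_failures(lines):
    """Return one dict per matching 'FahCore returned' warning line."""
    failures = []
    for line in lines:
        match = CORE_EXIT.match(line.strip())
        if match:
            entry = match.groupdict()
            entry["code"] = int(entry["code"])  # decimal exit code
            failures.append(entry)
    return failures


# Three lines copied from the log excerpt above.
sample = [
    "10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit from science code",
    "10:16:34:WARNING:WU02:FS00:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)",
    "10:16:34:WARNING:WU02:FS00:Fatal error, dumping",
]
failures = find_core_failures(sample)
print(failures)
```

Feeding it a whole log file (for example `find_core_failures(open(path))`) would list every dumped unit with its slot and exit code.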

@shorttack

@sgnsajgon Can you please post your log file in the Official Forum: https://foldingforum.org/index.php for the issue to be diagnosed? "Guru Meditation" can come from a number of GPU-related configuration or failure issues.

@shorttack shorttack added Linux 3.Component - OpenMM Core Reported issue relates to FahCore_21/FahCore_22. labels May 5, 2020
@sgnsajgon
Author

sgnsajgon commented May 7, 2020

@shorttack I will; right now I'm waiting for the account activation email.

This issue was labeled "OpenMM Core", but I don't think it is a GPU-related problem. The GPU slot works correctly and is restored correctly after a crash. It's a CPU slot problem.

@shorttack

@sgnsajgon I labeled OpenMM Core because that's where "Guru Meditation" comes from. I suggested diagnosing with the Forum community because I have seen Guru Meditation in a failing graphics card, which would make it not a problem for us here at GitHub.

@sgnsajgon
Author

@shorttack So why is the "Guru Meditation" log entry annotated with "0xa7", the Gromacs core?

10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation

@sgnsajgon
Author

I have posted this to the Official Forum:

https://foldingforum.org/viewtopic.php?f=61&t=35123

@shorttack

So why "Guru Meditation" log entry is annotated as "0xa7" Gromacs core?
Sorry, so many posts today in this part-time job. I have ALSO seen Guru Meditation on a system where the CPU was overheating. The AVX-512 instructions are very heat-intensive. You can try dialing back the AVX multiplier in your BIOS (Google it). Gotta go.

@sgnsajgon
Author

@shorttack I'm not sure that is the case. My "Guru Meditation" error is 100% reproducible, and it always occurs before the CPU slot starts work. It happens after a new Google Cloud VM is started, FAHClient is launched, and the CPU slot tries to resume the previous WU from its checkpoint, because processing of that WU was interrupted by the prior VM preemption (hard kill).

Nonetheless, I will deploy additional VMs with 6 and 8 vCPUs and no GPU and check whether the problem occurs there as well.

@shorttack

@sgnsajgon Look in the Forum for problems with Core a7 and Windows shutdown. That's a known problem where Windows shuts down before folding is done cleaning up, leading to corruption on the next startup.

@sgnsajgon
Author

sgnsajgon commented May 8, 2020

I have reproduced this issue on a VM with 8 vCPUs and no GPU. All vCPUs run as a single folding slot.

@sgnsajgon
Author

sgnsajgon commented May 9, 2020

@shorttack Is the source code responsible for checkpoint resume, the persistence layer, Guru Meditation, and the checksum available on GitHub or elsewhere? I cannot find it and would like to inspect it. I guess it should be in the source of FAHClient or the FAHCore.

@bb30994

bb30994 commented May 12, 2020

It's FAHCore_a7, itself. All FAHClient is doing is telling the FAHCore to start processing the files in directory NN. It contains the code that is attempting to open a corrupt checkpoint. The Client doesn't process checkpoint files, just the FAHCore.
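
To illustrate why a truncated checkpoint is rejected at startup, here is a minimal sketch of a checksum test. It is an assumption for illustration only: FAHCore_a7's real checksum scheme is not described in this thread, SHA-256 is a stand-in, and `file_digest` and `checkpoint_is_intact` are hypothetical names.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    """SHA-256 of a file, read in chunks (stand-in for the core's checksum)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_is_intact(path, expected_digest):
    """True only if the file exists and matches the recorded digest."""
    return os.path.exists(path) and file_digest(path) == expected_digest


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "state.cpt")
    with open(path, "wb") as f:
        f.write(b"x" * 4096)          # complete checkpoint
    recorded = file_digest(path)
    intact_before = checkpoint_is_intact(path, recorded)

    # Simulate a hard kill: only part of the data reached the disk.
    with open(path, "wb") as f:
        f.write(b"x" * 1000)
    intact_after = checkpoint_is_intact(path, recorded)

print(intact_before, intact_after)
```

The point of the sketch is that the reader of the file can only detect the damage; it has no way to reconstruct the missing bytes.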

The real issue is why the data are left somewhere in RAM rather than being finalized (synced) on disk by the FAHCore that's shutting down. The startup can't correct an incomplete disk image.
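
One common way to avoid leaving a half-written file behind is to write to a temporary file, force it to disk, and then rename it over the old checkpoint. The sketch below shows that general pattern on a POSIX system; it is not the FAHCore's actual code, and `write_checkpoint_durably` is a hypothetical name.

```python
import os
import tempfile


def write_checkpoint_durably(path, data):
    """Replace `path` with `data` so a crash leaves the old or the new file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()               # push Python's buffer to the OS
        os.fsync(f.fileno())    # ask the OS to push its cache to the disk
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems

    # Sync the directory so the rename itself is persisted (POSIX only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "state.cpt")
    write_checkpoint_durably(demo_path, b"frame 51 data")
    with open(demo_path, "rb") as f:
        restored = f.read()
    leftover_tmp = os.path.exists(demo_path + ".tmp")

print(restored, leftover_tmp)
```

With this pattern a hard kill during the write can at worst leave a stale `.tmp` file next to the previous, still valid checkpoint.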

@shorttack

See also https://foldingforum.org/viewtopic.php?f=108&t=35147&p=333767#p333767 on containers and preemptible VMs.

@PantherX PantherX added 4.OS - Debian Reported issue occurs on Debian based OS (Debian, Mint, Ubuntu). 4.OS - Fedora Reported issue occurs on Fedora based OS (Fedora, Red Hat, CentOS). and removed Linux labels May 22, 2020
@bb30994

bb30994 commented Jun 3, 2020

There's a reasonable chance that this problem is related to the FAHCore wrapper rather than to a specific FAHCore. I'm not sure which steps involved in a PAUSE are handled internally by a FAHCore and which are handled by the wrapper.

@PantherX PantherX added 3.Component - FAHCoreWrapper Reported issue relates to FAHCoreWrapper. and removed 3.Component - OpenMM Core Reported issue relates to FahCore_21/FahCore_22. labels Jun 4, 2020
4 participants