AMD GPU on Linux can't start WU, FahCore keeps returning INTERRUPTED (102 = 0x66) #1570

Open
torokati44 opened this issue Oct 3, 2020 · 7 comments
Labels
1.Type - Defect Reported issue is a defect. 3.Component - OpenMM Core Reported issue relates to FahCore_21/FahCore_22. 4.OS - Fedora Reported issue occurs on Fedora based OS (Fedora, Red Hat, CentOS).

Comments

@torokati44

My Environment: FaH Version 7.6.13 on Fedora 32 (Linux 5.8.12-200.fc32.x86_64).

Expected Behavior: When I press "Fold", my GPU starts working on a WU.
Current Behavior: My GPU can't start any WUs; it's stuck on "Ready".

Things I have already tried:

  • Rebooting my machine.
  • Updating my GPU drivers.
  • Removing the GPU slot and adding it again.
  • Deleting the cores folder.

None of them made a difference.

Additional info:

  • Folding on this GPU worked very well until at least 2020-09-20.
  • My GPU is still correctly recognized by FAH as Ellesmere XT [Radeon RX 470/480/570/580/590], as before.
  • OpenCL works - Blender can render perfectly with the Cycles engine on the GPU.
  • Folding on my CPU is unaffected and works fine.

See a section of my Log that keeps repeating every few minutes:

21:42:35:WU00:FS01:Starting
21:42:35:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /home/attila/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 11530 -checkpoint 10 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
21:42:35:WU00:FS01:Started FahCore on PID 14330
21:42:35:Started thread 28 on PID 11530
21:42:35:WU00:FS01:Core PID:14334
21:42:35:WU00:FS01:FahCore 0x22 started
21:42:36:WU00:FS01:0x22:*********************** Log Started 2020-10-03T21:42:35Z ***********************
21:42:36:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
21:42:36:WU00:FS01:0x22:       Core: Core22
21:42:36:WU00:FS01:0x22:       Type: 0x22
21:42:36:WU00:FS01:0x22:    Version: 0.0.13
21:42:36:WU00:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:42:36:WU00:FS01:0x22:  Copyright: 2020 foldingathome.org
21:42:36:WU00:FS01:0x22:   Homepage: https://foldingathome.org/
21:42:36:WU00:FS01:0x22:       Date: Sep 19 2020
21:42:36:WU00:FS01:0x22:       Time: 01:10:35
21:42:36:WU00:FS01:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
21:42:36:WU00:FS01:0x22:     Branch: core22-0.0.13
21:42:36:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
21:42:36:WU00:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
21:42:36:WU00:FS01:0x22:             -funroll-loops -DOPENMM_GIT_HASH="\"189320d0\""
21:42:36:WU00:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
21:42:36:WU00:FS01:0x22:       Bits: 64
21:42:36:WU00:FS01:0x22:       Mode: Release
21:42:36:WU00:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
21:42:36:WU00:FS01:0x22:             <peastman@stanford.edu>
21:42:36:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 14330 -checkpoint 10
21:42:36:WU00:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
21:42:36:WU00:FS01:0x22:************************************ libFAH ************************************
21:42:36:WU00:FS01:0x22:       Date: Sep 15 2020
21:42:36:WU00:FS01:0x22:       Time: 05:14:43
21:42:36:WU00:FS01:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
21:42:36:WU00:FS01:0x22:     Branch: HEAD
21:42:36:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
21:42:36:WU00:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
21:42:36:WU00:FS01:0x22:             -funroll-loops
21:42:36:WU00:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
21:42:36:WU00:FS01:0x22:       Bits: 64
21:42:36:WU00:FS01:0x22:       Mode: Release
21:42:36:WU00:FS01:0x22:************************************ CBang *************************************
21:42:36:WU00:FS01:0x22:       Date: Sep 15 2020
21:42:36:WU00:FS01:0x22:       Time: 05:11:04
21:42:36:WU00:FS01:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
21:42:36:WU00:FS01:0x22:     Branch: HEAD
21:42:36:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
21:42:36:WU00:FS01:0x22:    Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
21:42:36:WU00:FS01:0x22:             -funroll-loops -fPIC
21:42:36:WU00:FS01:0x22:   Platform: linux2 4.19.76-linuxkit
21:42:36:WU00:FS01:0x22:       Bits: 64
21:42:36:WU00:FS01:0x22:       Mode: Release
21:42:36:WU00:FS01:0x22:************************************ System ************************************
21:42:36:WU00:FS01:0x22:        CPU: AMD Ryzen 7 2700X Eight-Core Processor
21:42:36:WU00:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 8 Stepping 2
21:42:36:WU00:FS01:0x22:       CPUs: 16
21:42:36:WU00:FS01:0x22:     Memory: 31.33GiB
21:42:36:WU00:FS01:0x22:Free Memory: 18.28GiB
21:42:36:WU00:FS01:0x22:    Threads: POSIX_THREADS
21:42:36:WU00:FS01:0x22: OS Version: 5.8
21:42:36:WU00:FS01:0x22:Has Battery: false
21:42:36:WU00:FS01:0x22: On Battery: false
21:42:36:WU00:FS01:0x22: UTC Offset: 2
21:42:36:WU00:FS01:0x22:        PID: 14334
21:42:36:WU00:FS01:0x22:        CWD: /home/attila/work
21:42:36:WU00:FS01:0x22:************************************ OpenMM ************************************
21:42:36:WU00:FS01:0x22:   Revision: 189320d0
21:42:36:WU00:FS01:0x22:********************************************************************************
21:42:36:WU00:FS01:0x22:Project: 14905 (Run 363, Clone 13, Gen 20)
21:42:36:WU00:FS01:0x22:Unit: 0x0000001d81d59d695f4ec9da7ba39dbb
21:42:36:WU00:FS01:0x22:Reading tar file core.xml
21:42:36:WU00:FS01:0x22:Reading tar file integrator.xml
21:42:36:WU00:FS01:0x22:Reading tar file state.xml
21:42:36:WU00:FS01:0x22:Reading tar file system.xml
21:42:36:WU00:FS01:0x22:Digital signatures verified
21:42:36:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
21:42:36:WU00:FS01:0x22:Version 0.0.13
21:42:37:WU00:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
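
A minimal sketch for spotting this kind of restart loop in a client log, assuming the log lines keep the `HH:MM:SS:WUnn:FSnn:FahCore returned: STATUS (code = 0xhex)` shape seen above. The function name `count_core_returns` is made up for illustration and is not part of the FAH client.

```python
import re
from collections import Counter

# Matches lines such as:
#   21:42:37:WU00:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
# The pattern is inferred from the log excerpt above, not from FAH documentation.
RETURN_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WU(?P<wu>\d+):FS(?P<slot>\d+):"
    r"FahCore returned: (?P<status>\w+) "
    r"\((?P<code>\d+) = 0x(?P<hex>[0-9a-fA-F]+)\)"
)


def count_core_returns(log_lines):
    """Count FahCore return statuses, keyed by (slot, status, code)."""
    counts = Counter()
    for line in log_lines:
        match = RETURN_RE.match(line.strip())
        if match:
            key = (match.group("slot"), match.group("status"), int(match.group("code")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "21:42:35:WU00:FS01:Starting",
        "21:42:37:WU00:FS01:FahCore returned: INTERRUPTED (102 = 0x66)",
    ]
    for (slot, status, code), n in count_core_returns(sample).items():
        print(f"slot {slot}: {status} ({code}) x{n}")
```

A high count of the same status for one slot within a short log window would suggest the core is crashing on startup rather than being interrupted by the user.
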
@torokati44
Author

Oh, I randomly discovered just now that the /var/lib/systemd/coredump/ folder was 4GiB in size, and growing. It was filled up with numerous files named core.FahCore_22.1000.2c72e4e63205416c9fe99fd43918be85.1439679.1601727111000000.lz4 (and similar), with a new one generated every 2-3 minutes.
When I decompressed one of these files and opened it in gdb (together with the FahCore_22 binary), it showed the following termination signal and stack trace:

Core was generated by `/home/attila/cores/cores.foldingathome.org/lin/64bit/22-0.0.13/Core_22.fah/FahC'.
Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
#0  0x00007fd700676e69 in ?? ()
#1  0x00000000007c8383 in std::error_code::default_error_condition() const ()
#2  0x00007fd7d4d2d3e7 in pocl_mkdir_p () from /lib64/libpocl.so.2.5.0
#3  0x00007fd7d4d1090c in pocl_cache_init_topdir () from /lib64/libpocl.so.2.5.0
#4  0x00007fd7d4d115ab in pocl_init_devices () from /lib64/libpocl.so.2.5.0
#5  0x00007fd7d4cefa4b in POclGetDeviceIDs () from /lib64/libpocl.so.2.5.0
#6  0x00007fd7d5fbb620 in clGetDeviceIDs () from /lib64/libOpenCL.so.1
#7  0x00007fd7d5fb1485 in _initClIcd_real () from /lib64/libOpenCL.so.1
#8  0x00007fd7d5fb1da4 in clGetPlatformIDs () from /lib64/libOpenCL.so.1
#9  0x0000000000444fc1 in ?? ()
#10 0x000000000045ae6e in ?? ()
#11 0x000000000047cd34 in ?? ()
#12 0x000000000047d2f5 in ?? ()
#13 0x000000000043e3ce in ?? ()
#14 0x00007fd7d57c3042 in __libc_start_main () from /lib64/libc.so.6

This points to a crash in POCL of course. Disabling it by commenting out the single line (with the library name) in /etc/OpenCL/vendors/pocl.icd allowed folding to commence on the GPU without problems.
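
The workaround can be sketched as a plain text transformation, assuming the ICD file holds one library name per line. The helper name `disable_icd_entries` and the sample file content are hypothetical; the thread only reports that prefixing the line with `#` stopped the crash, not how the ICD loader interprets such a line.

```python
def disable_icd_entries(icd_text):
    """Prefix every non-empty, not-yet-commented line with '#'.

    Whether the OpenCL ICD loader treats '#' as a comment marker or simply
    fails to load a library with that name is not established here; either
    way the platform is no longer loaded, per the report above.
    """
    out = []
    for line in icd_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical file content; the real pocl.icd is not quoted in this thread.
    original = "libpocl.so.2.5.0\n"
    print(disable_icd_entries(original), end="")
```

Writing the result back to /etc/OpenCL/vendors/pocl.icd needs root privileges, and keeping a copy of the original file makes the change easy to revert.
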

The strange thing is that neither clinfo nor Blender has any issues with the POCL platform being enabled, with clinfo even being able to query all sorts of properties of it correctly, as well as of the single device (the CPU) it offers.

Although this workaround "fixed" my immediate issue, I'd rather not close this until the problem is really fixed without having to disable POCL (not that I need it that much, but still), or until at least some hint is put into the log about what the problem might be. Finding this workaround was pure luck!
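
Since the core dumps piled up unnoticed, a small check like the following could help others confirm the same symptom. It is only a sketch: the directory and the `core.FahCore_22.` file name prefix are taken from this report, and `summarize_coredumps` is a made-up helper name.

```python
from pathlib import Path


def summarize_coredumps(directory, name_prefix="core.FahCore_22."):
    """Return (file_count, total_bytes) for files whose names start with name_prefix."""
    count = 0
    total = 0
    for entry in Path(directory).iterdir():
        if entry.is_file() and entry.name.startswith(name_prefix):
            count += 1
            total += entry.stat().st_size
    return count, total


if __name__ == "__main__":
    dump_dir = Path("/var/lib/systemd/coredump")  # location reported above
    try:
        n, size = summarize_coredumps(dump_dir)
        print(f"{n} FahCore_22 core dumps, {size / 2**30:.2f} GiB")
    except OSError as exc:
        # The directory may be missing or readable by root only.
        print(f"could not inspect {dump_dir}: {exc}")
```
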

@torokati44
Author

torokati44 commented Oct 4, 2020

The relevant error message in the POCL source says:

Could not create top directory (...) for cache. Note:
if you have proper rights to create that directory, and still
get the error, then most likely pocl and the program you're
trying to run are linked to different versions of libstdc++ library.
This is not a bug in pocl and there's nothing we
can do to fix it - you need both pocl and your program to be
compiled for your system. This is known to happen with
Luxmark benchmark binaries dowloaded from website; Luxmark
installed from your linux distribution's packages should work

After removing the default POCL cache directory, $HOME/.cache/pocl, it is recreated just fine when I try to start folding.
So, it's not a permissions issue, but a libstdc++ version mismatch issue... :/
The current libstdc++ version on my system seems to be 6.0.28 (the shared library version). EDIT: The package version is 10.2.1-1.fc32, but that most likely just tracks the corresponding gcc version.

Could this be an issue with the distributed FAH RPM packages?
EDIT: Or, rather, the Core_22 binary that is downloaded separately?
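
One rough way to compare the libstdc++ levels involved is to scan each binary for `GLIBCXX_x.y.z` version tags, much like `strings file | grep GLIBCXX` would. The sketch below does that on raw bytes. The function name is made up, the sample bytes are synthetic, and the approach only shows version tags that appear as strings in the file, so a statically linked libstdc++ may not show up this way.

```python
import re

# Version tags look like GLIBCXX_3.4 or GLIBCXX_3.4.28.
GLIBCXX_RE = re.compile(rb"GLIBCXX_(\d+)\.(\d+)(?:\.(\d+))?")


def glibcxx_versions(data):
    """Return the sorted list of distinct GLIBCXX version tuples found in raw bytes."""
    found = set()
    for match in GLIBCXX_RE.finditer(data):
        major, minor, patch = match.groups()
        found.add((int(major), int(minor), int(patch) if patch else 0))
    return sorted(found)


if __name__ == "__main__":
    # Synthetic example data, not taken from any real binary.
    blob = b"\x00GLIBCXX_3.4\x00GLIBCXX_3.4.21\x00GLIBCXX_3.4.9\x00"
    versions = glibcxx_versions(blob)
    print("highest tag:", ".".join(map(str, versions[-1])))
```

To use it on real files, read them in binary mode, for example `glibcxx_versions(open(path, "rb").read())`, once for the FahCore_22 binary and once for libpocl, and compare the highest tag each one mentions.
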

@katrichnikitos

I have the same issue with an Nvidia GPU. The workaround (commenting out the line in /etc/OpenCL/vendors/pocl.icd) helped, but a real fix is still needed.

@torokati44
Author

I have some more clues.

From my observation:

Folding on this GPU worked very well until at least 2020-09-20.

From the FaH Log:

21:42:36:WU00:FS01:0x22: Core: Core22
21:42:36:WU00:FS01:0x22: Version: 0.0.13
...
21:42:36:WU00:FS01:0x22: Date: Sep 19 2020

From my package manager (dnf):

Name : pocl
Version : 1.5
Install time : Tue Sep 22 09:48:03 2020

Additionally:

Name : libstdc++
Version : 10.2.1
Install time : Sun Aug 2 19:42:39 2020

Something... somewhere... sometime... became incompatible with something else! 😄

It's unfortunate that both the pocl package install and the newer Core22 build landed right around the time I last saw it working... (And then I stopped folding for a week or so...)
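
To line the dates up, here is a small sketch that orders the events quoted above and lists those that fall after the last day folding was seen working. Two caveats apply: the Core22 date is its build date, not the day the client downloaded it, and 2020-09-20 is only a lower bound on when things still worked. The event labels and the `changes_after` helper are made up for illustration.

```python
from datetime import datetime

DNF_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Tue Sep 22 09:48:03 2020"

events = {
    "libstdc++ 10.2.1 installed": datetime.strptime("Sun Aug 2 19:42:39 2020", DNF_FORMAT),
    "Core22 0.0.13 built": datetime.strptime("Sep 19 2020", "%b %d %Y"),
    "pocl 1.5 installed": datetime.strptime("Tue Sep 22 09:48:03 2020", DNF_FORMAT),
}

# Start of the last day on which folding was observed to work.
last_known_good = datetime(2020, 9, 20)


def changes_after(events, last_good):
    """Return event names dated strictly after last_good, oldest first."""
    later = [(when, name) for name, when in events.items() if when > last_good]
    return [name for when, name in sorted(later)]


if __name__ == "__main__":
    for name in changes_after(events, last_known_good):
        print(name, "->", events[name].isoformat(sep=" "))
```

With these inputs only the pocl install is dated after the last known good day, which fits the workaround but does not rule out a later Core22 download.
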

@jchodera
Member

Thanks for the report! We'll look into this, but it looks like it's a segfault within pocl on querying the OpenCL device IDs, right?

@torokati44
Author

Technically yes, but as the comments show, the root cause is not in pocl itself.

@torokati44
Author

The (current) TL;DR part (I think) is this comment from the pocl source, also quoted above:

... most likely pocl and the program you're
trying to run are linked to different versions of libstdc++ library.
This is not a bug in pocl and there's nothing we
can do to fix it - you need both pocl and your program to be
compiled for your system. This is known to happen with
Luxmark benchmark binaries dowloaded from website; Luxmark
installed from your linux distribution's packages should work

It looks like this is the case for Core22 as well, just like Luxmark...

@PantherX PantherX added 1.Type - Defect Reported issue is a defect. 3.Component - OpenMM Core Reported issue relates to FahCore_21/FahCore_22. 4.OS - Fedora Reported issue occurs on Fedora based OS (Fedora, Red Hat, CentOS). 4.OS - Debian Reported issue occurs on Debian based OS (Debian, Mint, Ubuntu). and removed 4.OS - Fedora Reported issue occurs on Fedora based OS (Fedora, Red Hat, CentOS). 4.OS - Debian Reported issue occurs on Debian based OS (Debian, Mint, Ubuntu). labels Nov 15, 2020