OpenCL stopped working "no devices" when upgrading to ROCm 3.0 from ROCm 2.10 #977
Comments
Moving back to ROCm 2.10 (from 3.0) produces working OpenCL. I see these packages are installed:
|
Moving forward to ROCm 3.0 again, with the same set of packages installed, OpenCL finds no devices:
|
Same on a Raven Ridge 2700U APU upgraded from 2.10 to 3.0. $ clinfo $ rocminfo HSA System Attributes Runtime Version: 1.1 ==========
|
After switching the hsakmt package from the Fedora one to the one provided by ROCm, I now have these packages installed. And clinfo: $ clinfo Platform Name AMD Accelerated Parallel Processing NULL platform behavior |
A simple upgrade from ROCm 2.10 to 3.0 of the packages at http://repo.radeon.com/rocm/yum/rpm when they became available. clinfo shows 0 devices available directly after the upgrade and after a reboot. Removing the packages and reverting back to 2.10 solves the problem. $ sudo dnf update $ clinfo Platform Name AMD Accelerated Parallel Processing NULL platform behavior |
The steps are: starting with a working OpenCL-only ROCm 2.10 installation, do a
which updates the packages as already indicated. At this point OpenCL stops detecting any devices, and this persists after a reboot. rocm-smi continues to detect correctly all the GPUs at all times. |
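For anyone who wants to script the "0 devices" check described above, here is a minimal sketch. It assumes clinfo prints a "Number of devices" line per platform (both the ROCm clinfo and the distro /usr/bin/clinfo appear to do so, but the exact spacing differs, so the pattern is a guess that tolerates either form). The function names are made up for this example.

```python
import re
import shutil
import subprocess

# Assumed formats: "Number of devices:   1" (ROCm clinfo) and
# "Number of devices      0" (distro clinfo). Lines such as
# "Number of platforms" or "Number of P2P devices" do not match.
_DEVICE_COUNT = re.compile(r"Number of devices:?\s+(\d+)")


def count_opencl_devices(clinfo_output: str) -> int:
    """Sum the per-platform device counts found in clinfo's text output."""
    return sum(int(n) for n in _DEVICE_COUNT.findall(clinfo_output))


def run_clinfo(binary: str = "clinfo") -> str:
    """Run clinfo and return its stdout; return "" if the binary is not found."""
    if shutil.which(binary) is None:
        return ""
    result = subprocess.run([binary], capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    devices = count_opencl_devices(run_clinfo())
    print(f"OpenCL devices reported by clinfo: {devices}")
```

Passing a full path such as /opt/rocm/opencl/bin/x86_64/clinfo to run_clinfo lets you compare the two clinfo binaries mentioned in this thread.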
You need to install comgr too. It's no longer part of the rocm-opencl package, for some unknown reason: libamd_comgr.so |
I did try installing comgr too; it didn't fix it. Maybe AMD could use the information from this thread to fix the 3.0 upgrade and update the instructions:
|
3.0 is not working. I made an error in a previous comment and ended up with both the 2.2 repo and 3.0 in my rocm.repo after switching back and forth a few times. I have deleted my previous comment that stated that this was fixed. |
When reinstalling 3.0 packages I noticed this: Running scriptlet: hsa-rocr-dev-1.1.9.0_rocm_rel_3.0_6_7128d0dc-1.x86_6 8/9 |
clinfo shows: Number of P2P devices (AMD) 0 When hsa-ext-rocr-dev-1.1.9.0-rocm-rel-3.0-6-7128d0dc-Linux.rpm is installed for image support clinfo crashes and so does Darktable. Last version that worked was hsa-ext-rocr-dev-1.1.9-122-ge5c4efb1-Linux.rpm from ROCm 2.9 $ clinfo |
For those wondering how to revert to a previous version on Debian-based distros:
sudo apt autoremove rocm-dkms rock-dkms
sudo vim /etc/apt/sources.list.d/rocm.list
Replace
sudo apt update
sudo apt install rocm-dkms # or any other set of packages you need |
Thanks all.
We have logged an internal issue for a proper fix. |
@rkothako could you please clarify what the working steps are for an OpenCL-only upgrade from 2.10 to 3.0 without dkms? I.e. I'm not using rocm-dkms, and most likely rocm-dkms would fail to compile anyway on the kernel I'm using (5.5). And could you also please explain what problem is fixed by the working upgrade steps (to help our understanding)? Thanks. |
Same here after upgrading from 2.10 to 3.0-6 with a Vega Frontier card.
How do you guys do regression testing??? |
New to the ROCm stack, but I have used AMD OpenCL before. I just tried to install ROCm on OpenSuse from the Yum repository, and I assume I am having the same problem? I first upgraded to kernel 5.4, so kernel support should be there. /opt/rocm/bin/rocminfo gives reasonable results (splits ThreadRipper into 4 agents??), then shows the GPU as below. When I run /opt/rocm/opencl/bin/x86_64/clinfo I get the following: Number of platforms: 1 Platform Name: AMD Accelerated Parallel Processing So, as I guess this thread indicates, OpenCL just does not work with ROCm 3.0?? If this is not the case, how do you fix this? How could this happen? Was OpenCL really not tested before this was released? Yikes! When will a fix be available? Partial output from rocminfo: Agent 5 Name: gfx900 |
I will note that I looked at the RPMs that were installed from the repository, as below, and noticed that opencl* has 2.0.0 as part of its version, where the other important ROCm packages have 3.0.0. Is that right, or should there be a later version of the opencl* packages?? linux-k3mw:/home/goesmgr # rpm -qa | grep rocm |
@rkothako : is there a way to upgrade from ROCm 2.10 to ROCm 3.0, OpenCL only, without dkms? Please let me know how I can do this upgrade. |
Hi @preda and all, |
Does not work for me with Ubuntu 18.04.3 LTS and kernel 5.0.0-37-generic, Vega Frontier card. Can you please post a step-by-step guide and test the next release with some common distributions? Thanks! |
Steps to follow:
|
@rkothako thank you, but I am talking about an install without rock-dkms, as is required when using a recent kernel that is not supported by rock-dkms. Did you try your instructions on a system with Linux kernel 5.4 or 5.5? |
@rkothako Thanks, this works for the /opt/rocm clinfo. The problem now (or still, since I thought that was the fix) is that leela zero (https://github.com/leela-zero/leela-zero, a Go engine with OpenCL) fails to compile all 290 kernels it tries during tuning (this worked with 2.10): |
rkothako & all: I looked, and I have the comgr and rocm-smi packages installed, and it is still not working. I am using OpenSuse with the 5.4 kernel, so rock-dkms is not used/needed. I have a "clean" install of just the 3.0 version from the latest zypper repository: zypper ar http://repo.radeon.com/rocm/zyp/zypper/ rocm-repo My earlier post shows the ROCm-related packages that were installed. To the developers here: why don't you try a clean OpenSuse Linux install, then add your repository and the packages? Then run clinfo and straighten out that issue, and then run some real testing on common OpenCL kernels and applications. Then post updates with a readme of tested configurations. I do not see anyone but the most experienced users wanting to try/use ROCm at this point. There is little point in spending all this effort on ROCm if no one can actually use it! |
I have a similar problem. Kernel 5.4.6 from mainline. ROCm from git: version 3.0. It looks to me like the following error messages show what is wrong: Unfortunately, the failure is happening within the binary-only AMD lib, so further analysis is difficult. |
Doing an strace gives some more info: More of the strace: |
I think I have worked out what the problem is when clinfo does not find the Vega GPU card: This is trying to mbind to node0 (success) and then tries to mbind to node1 (fails). So, the RAM is attached to node0 (Agent1), but no RAM is attached to node1 (Agent2). I have an AMD Threadripper 1950, which has a similar pattern of RAM install: all attached to node0, no RAM attached to node1. |
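To check whether a machine has the same layout (a NUMA node with no RAM attached), something like the following sketch might help. It assumes the usual Linux sysfs location /sys/devices/system/node/nodeN/meminfo and its "Node N MemTotal: X kB" line; the function names are made up for this example, and it only reports the layout, it does not prove that mbind is what fails.

```python
import glob
import os
import re

# Assumed sysfs line format: "Node 1 MemTotal:       0 kB"
_MEM_TOTAL = re.compile(r"Node\s+(\d+)\s+MemTotal:\s+(\d+)\s+kB")


def memoryless_nodes(meminfo_texts):
    """Return the sorted ids of NUMA nodes whose MemTotal is 0 kB."""
    empty = []
    for text in meminfo_texts:
        match = _MEM_TOTAL.search(text)
        if match and int(match.group(2)) == 0:
            empty.append(int(match.group(1)))
    return sorted(empty)


def read_node_meminfo(sysfs_root="/sys/devices/system/node"):
    """Read every nodeN/meminfo file under the (assumed) sysfs root."""
    texts = []
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*", "meminfo")):
        with open(path) as handle:
            texts.append(handle.read())
    return texts


if __name__ == "__main__":
    print("NUMA nodes without RAM:", memoryless_nodes(read_node_meminfo()))
```

On a system matching the description above, the output would be expected to list node 1.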
@jcdutton |
@ableeker |
Not sure if this will help anyone here, but I compiled both ROCm 3.0 and 2.10 entirely from source on my slackware64-current system and ran into the same problems as described here (rocminfo finds everything, clinfo doesn't). |
[@jcdutton:] I sure can! Here it is. |
@ableeker |
@jcdutton |
@ableeker |
@jcdutton |
Thanks to @preda for pointing out "rocm-opencl". I was able to get OpenCL working on my AMD RX580 on Ubuntu 19.10. I had to switch back to the ROCm PPA for 2.9, then installed that package, and now I have a working OpenCL environment (and in turn DaVinci Resolve), so far anyway. Thanks! I have not had success with ROCm 3.0 or other methods at this time. |
I wanted to clean up ROCm and OpenCL, so I removed everything (hopefully 2.10 as well as 3.0) and re-installed only the 3.0 versions of rocm-dev and rocm-opencl-dev. I was confident that 3.0 would work after this, because this time I had installed libncurses5. Alas, clinfo reported 0 devices, and programs that use OpenCL refuse to work. Am I missing something? Do I need to install another package for 3.0? By the way, installing just these 2 packages works great for me with 2.10. With 2.10 I install rocm-dev and rocm-opencl-dev, and don't install rocm-dkms, because I'm running Ubuntu 19.10 with kernel 5.3. I install ROCm because I need a fairly new version of OpenCL, but I only need OpenCL. So far, what I've seen is that if you only need OpenCL, programs that use OpenCL will happily work even if you don't install rocm-dev and only install rocm-opencl-dev (which installs rocm-opencl), or even only rocm-opencl. |
@ableeker |
@jcdutton The good news is, I've found it! Prepare the same way as for 2.10: add the repo, put the user in the video group, and add the udev rule. Then install ROCm OpenCL: install rocm-dkms or rocm-dev, then install rocm-opencl-dev. For me, anyway, libncurses5 was missing, so I installed that. At this stage clinfo reports 0 devices, because it can't find libamd_comgr.so. So install comgr. And it seems that rocm-smi-lib64 is needed as well. Installing these last two packages is what @rkothako told us to do. Now clinfo should be working. What's more, OpenCL should be working as well! Now the bad news: some applications that actually use OpenCL, and that were working with 2.10, will no longer work with 3.0. Luxmark is unable to compile kernels, and gpuowl crashes with a memory access fault. |
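A quick way to see whether the shared libraries named in this thread can be loaded at all is to try loading them with the dynamic loader. This is only a sketch: the library list comes from comments here (libamd_comgr.so, plus libtinfo.so.5, which a later comment reports as the file libncurses5 was needed for), and whether libamd_comgr.so is found depends on LD_LIBRARY_PATH or ldconfig covering the ROCm library directory on your system. The function name is made up for this example.

```python
import ctypes

# Names taken from the thread; treat the list as an assumption for your setup.
SUSPECT_LIBRARIES = ["libtinfo.so.5", "libamd_comgr.so"]


def unloadable(libraries):
    """Return the library names that the dynamic loader fails to load."""
    failed = []
    for name in libraries:
        try:
            ctypes.CDLL(name)  # uses the same search rules as dlopen()
        except OSError:
            failed.append(name)
    return failed


if __name__ == "__main__":
    missing = unloadable(SUSPECT_LIBRARIES)
    print("Could not load:", ", ".join(missing) if missing else "nothing")
```

An empty result does not guarantee that clinfo will find devices; it only rules out these particular missing files.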
I don't know if this applies to everyone, but I also had to install ncurses5 to get OpenCL working. |
On Linux 5.6.0-rc1 (that was working fine with ROCm 2.10 OpenCL-only) I upgraded to ROCm 3.1, and aside from setting LD_LIBRARY_PATH, the only additional thing I had to do to get clinfo working was to install libncurses5 (thanks @ableeker ). These are my rocm packages that I have installed
Trying to run gpuowl, OpenCL compiles the kernels without issue, but the execution does not proceed correctly. At this point I don't know where the blame lies (i.e. is gpuowl doing something invalid that happened to work before, or is something wrong with the new OpenCL codegen). Unfortunately, at the moment I'm not planning to debug this further, and I'll be moving back to 2.10. But this is already progress: OpenCL is technically working, although in gpuowl's case it is not usable (but as I said, I don't know yet on which side the problem is). It turns out gpuowl could be a useful tool for regression testing, because it produces big, complex kernels (exercising the compiler extensively) and at the same time is self-validating at runtime, so any codegen problems tend to show up promptly. It can also be used for performance regression testing. But that's a different topic. |
I'm not sure if this should be in the same issue, because I've just tried the new version 3.1. Anyway, I've noticed that I didn't need to install anything apart from rocm-opencl-dev; clinfo and OpenCL worked right after that. What's more, gpuowl was running correctly straight away. I didn't test it thoroughly, but it didn't abort with a memory access fault anymore, and it did start calculating. LuxMark is working again as well. Version 3.1 is looking rather good to me. |
@preda |
AMD has restructured the lot, installing just rocm-dev (or presumably rocm-dkms) will install everything needed for a working OpenCL environment. I didn't need to install comgr, rocm-opencl-dev, or rocm-smi-lib64. Looks like even libncurses5 isn't needed any longer. |
I'm getting |
Looks like this only happens when pre-compiling kernels. Will open a separate issue. |
ROCm 3.1 appears to be very different from ROCm 3.0. |
The install instructions mention that you should add yourself to the video group, but I also had to add myself to the "render" group, because that group owns /dev/dri/renderD128 |
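The group check above can be automated with a small sketch like this one. /dev/dri/renderD128 is the node named in the comment; /dev/kfd is added here as an assumption (it is the ROCm compute device node on the systems I know of, but it is not mentioned in this thread). The function name is made up, and the check only looks at group ownership, not at ACLs or mode bits.

```python
import grp
import os


def owning_groups_missing(paths, user_gids):
    """Return names of groups that own the given paths but are not in user_gids."""
    missing = []
    for path in paths:
        try:
            gid = os.stat(path).st_gid
        except FileNotFoundError:
            continue  # device node not present on this machine
        if gid in user_gids:
            continue
        try:
            name = grp.getgrgid(gid).gr_name
        except KeyError:
            name = str(gid)  # gid without a name in the group database
        if name not in missing:
            missing.append(name)
    return missing


if __name__ == "__main__":
    devices = ["/dev/kfd", "/dev/dri/renderD128"]  # /dev/kfd is an assumption
    my_gids = set(os.getgroups()) | {os.getegid()}
    print("Groups to join:", owning_groups_missing(devices, my_gids))
```

Note that group membership changes usually only take effect after logging in again, so a freshly added group may still show up as missing in the current session.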
I can't believe this, but jcdutton's answer fixed the issue for me. Fresh install of Ubuntu 20.04 on a new system, Vega 56, and a fresh install of ROCm 3.3.0 from the repo. rocminfo worked but clinfo segfaulted. Installing The clue is in the strace, which not long before the crash claims it can't find libtinfo.so.5. But then it complains about comgr before it crashes, leading many to believe that's the problem. I strongly recommend AMD add a dependency on libncurses5 for Ubuntu, since newer versions don't install it by default. Oh, and you probably shouldn't segfault if it's not there either. Hope this helps others. |
This issue spells out libncurses5 specifically: #1067 |
Since many find this issue, tracing the clinfo strace seems to be the most generic solution to find out what's wrong:
Other errors described in this issue seem to be self-explanatory (missing kernel arguments). |
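Reading a long strace by eye is tedious, so here is a rough sketch that lists shared objects the loader looked for and never managed to open. It assumes the trace was saved to a file (for example with strace's -o option) and that open/openat lines have the usual strace text form; the function name and the clinfo.strace file name are made up for this example. A library counts as missing only if every attempt failed with ENOENT, because the loader normally probes several directories before finding a file.

```python
import os
import re

# Assumed line form:
# openat(AT_FDCWD, "/lib/libtinfo.so.5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (...)
_OPEN_CALL = re.compile(
    r'open(?:at)?\((?:[^,"]+,\s*)?"([^"]+)"[^)]*\)\s*=\s*(-?\d+)(?:[ \t]+(\w+))?'
)


def missing_shared_objects(strace_text):
    """Return sorted .so names that failed with ENOENT and were never opened."""
    failed, opened = set(), set()
    for line in strace_text.splitlines():
        match = _OPEN_CALL.search(line)
        if not match:
            continue
        path, result, errno_name = match.groups()
        name = os.path.basename(path)
        if ".so" not in name:
            continue
        if int(result) >= 0:
            opened.add(name)
        elif errno_name == "ENOENT":
            failed.add(name)
    return sorted(failed - opened)


if __name__ == "__main__":
    with open("clinfo.strace") as handle:  # hypothetical file name
        print(missing_shared_objects(handle.read()))
```

In the case described above, libtinfo.so.5 would be expected in the output; later complaints about comgr can be a consequence rather than the cause.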
Thanks all for the help/suggestions over this thread. |
I updated from ROCm 2.10 to ROCm 3.0, and OpenCL stopped working: it reports 0 devices.
There are no errors in dmesg.
Kernel: Linux 5.4.5 and 5.5.0-rc2 (same behavior on both), GPU: Radeon VII.
rocm-smi correctly reports all the GPUs, so it seems the hardware is detected and initialized correctly:
But both /usr/bin/clinfo and /opt/rocm/opencl/bin/x86_64/clinfo report no devices:
The ROCm packages that I have installed: