ROCm from Radeon Software for Linux 21.40.1 still tries to provide support for R9 390X (gfx7) and wrecks the kernel #1624
Comments
The request is: ROCm should not break the system, whether something is supported or not.
Thanks @illwieckz for reaching out. |
Hi @ROCmSupport, thank you for your attention and kind answer. I'm not against the idea of keeping the ROCm stack for gfx7 if it is still working for some users. It would be very unfortunate to remove this code if it works for someone, but then an option is needed to prevent ROCm from attempting to handle gfx7 when both gfx7 and something else are hosted and gfx7 does not work. On my end, the last time ROCm worked with gfx7 was in 2018 (proof), and it only worked for a few months. So I guess there are two possible solutions:
Some notes though:
Hawaii is gfx7.
Radeon Software for Linux 20.04.1 stopped supporting gfx7 and others with amdgpu-pro 21.30, but I don't know if that was intentional or if a mistake was made: https://gitlab.freedesktop.org/drm/amd/-/issues/1806 AMD now provides no OpenCL solution at all for Hawaii. None of ROCm, PAL, or Orca provides working OpenCL today (and the Mesa solution is still incomplete, so image workflows are unusable):
It's possible to blacklist the card for ROCm by doing the following (assuming the unsupported card you want to blacklist is the first one):
export GPU_DEVICE_ORDINAL='0'
Or, to blacklist the second and the fourth ones:
export GPU_DEVICE_ORDINAL='1,3'
Or, to blacklist everything:
export GPU_DEVICE_ORDINAL=','
This would make it possible to use ROCm with a supported GPU while ignoring the unsupported one. One big problem is that it would also blacklist the card in Orca (and probably in PAL too), so with this trick you cannot use ROCm with one card and Orca with another.
An environment variable named |
The bug you point out is pretty bad. I reproduced it (I still had a Hawaii card lying around) and I'm going to send out a fix in a minute. There is an environment variable ROCR_VISIBLE_DEVICES that you can probably use as a workaround for now. See here for details: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/fc99cf8516ef4bfc6311471b717838604a673b73/src/core/inc/amd_filter_device.h#L58 Hawaii support in KFD is obviously mostly untested these days. It also depends on custom firmware that was never pushed upstream due to quality regressions in the graphics driver. Therefore I will also guard KFD support for Hawaii behind the module parameter amdgpu.exp_hw_support=1. That way users will not run ROCm on Hawaii by accident. |
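As a rough illustration of the ROCR_VISIBLE_DEVICES workaround mentioned above, here is a minimal Python sketch that launches a child process with the variable set, so that only the listed device indices are exposed to the ROCr runtime in that process. The helper names are hypothetical, and the assumption that index 1 is the supported card (with the Hawaii card at index 0) is only an example; check the actual enumeration order on your system.

```python
import os
import subprocess
import sys


def env_with_visible_devices(indices, base_env=None):
    """Return a copy of the environment with ROCR_VISIBLE_DEVICES set.

    `indices` lists the device indices that should remain visible.
    This helper is hypothetical; only the variable name comes from the thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ROCR_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return env


def run_with_visible_devices(command, indices):
    """Run `command` in a child process that inherits the restricted device list."""
    return subprocess.run(
        command,
        env=env_with_visible_devices(indices),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in child that only echoes the variable; replace it with a real
    # ROCm application to apply the restriction to that application.
    child = [sys.executable, "-c",
             "import os; print(os.environ['ROCR_VISIBLE_DEVICES'])"]
    print(run_with_visible_devices(child, [1]).stdout.strip())
```

Setting the variable per process, rather than exporting it globally, keeps the restriction limited to the applications launched this way.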
Oh, that's awesome! The last time I saw ROCr OpenCL working on the R9 390X was in 2018 (proof).
Interesting!
Support for the R9 390X was never good anyway (see drm/amd#1816): stability problems have been reported since 2015, with crashes occurring with the default kernel configuration. It even means one cannot run a Linux live CD/USB to install Linux on a computer with the display plugged into an R9 390X: the computer would crash before completing the installation (in fact, before starting the installation).
That will be very convenient! Thank you for your attention. Note: I own PCIe 2, PCIe 3 and PCIe 4 hosts, so I can test fixes for the R9 390X on those various configurations. I assume the main difference would be between PCIe 2 and PCIe 3 because of PCIe atomics (I reproduced the current bug on both PCIe 2 and PCIe 4, but haven't tested on PCIe 3).
I don't think it makes a difference. Hawaii doesn't support PCIe atomics in any case. My test system is PCIe 3. My KFD patches are here for review: https://lore.kernel.org/amd-gfx/20211208082531.918062-2-Felix.Kuehling@amd.com/T/ |
OK then, I remember comments about Hawaii and PCIe atomics and others where PCIe 2 or 3 seems to have made a difference, so this was confusing. The current host is PCIe4 on my end (ThreadRipper Pro 3955WX).
Great! Do you have a link to the firmware that may be required for this? Display support is so bad with R9 390X with stock firmwares I would be interested in knowing what can be worse (I would also not be surprised if it appears the problems reported with updated firmwares are the ones we already reproduce with usual firmwares). |
start_nocpsch would never set dqm->sched_running on Hawaii due to an early return statement. This would trigger asserts in other functions and end up in inconsistent states. Bug: ROCm/ROCm#1624 Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
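The class of bug described in the commit message, an early return that skips a state flag, can be illustrated with a toy model. This is a Python sketch, not the kernel code: `ToyQueueManager`, `toy_packet_manager_init`, `start_buggy` and `start_fixed` are hypothetical names, and the "fixed" variant only shows one plausible shape of such a fix, not necessarily the actual patch.

```python
from dataclasses import dataclass


@dataclass
class ToyQueueManager:
    """Toy stand-in for a device queue manager (hypothetical, not the KFD struct)."""
    is_hawaii: bool
    sched_running: bool = False


def toy_packet_manager_init(dqm: ToyQueueManager) -> int:
    """Pretend Hawaii-specific initialization; returns 0 on success."""
    return 0


def start_buggy(dqm: ToyQueueManager) -> int:
    """Models the bug: the Hawaii path returns before the flag is set."""
    if dqm.is_hawaii:
        return toy_packet_manager_init(dqm)  # early return skips the flag
    dqm.sched_running = True
    return 0


def start_fixed(dqm: ToyQueueManager) -> int:
    """Models a fix: every successful path sets the flag before returning."""
    status = 0
    if dqm.is_hawaii:
        status = toy_packet_manager_init(dqm)
    if status == 0:
        dqm.sched_running = True
    return status


if __name__ == "__main__":
    for start in (start_buggy, start_fixed):
        dqm = ToyQueueManager(is_hawaii=True)
        status = start(dqm)
        print(start.__name__, "status:", status, "sched_running:", dqm.sched_running)
```

In the buggy variant the start call reports success while the flag stays false, which is the kind of inconsistent state that later assertions would trip over.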
Hi @fxkamd this message just to say the bug is still reproducible. Or maybe one bug was fixed but there are more bugs remaining? Here are the software versions I use:
Here is what happens when I call If I press
Looks like you're using the amdgpu driver built into your 5.15 kernel. Are you sure the kernel includes the fix? |
Hi @fxkamd I tried again, here are the results of my tests. I'm running Ubuntu
So, in a way, with a complete ROCm installation there is no wreckage, but that very likely happens because some kernel module is blacklisting the device or something like that. ROCm itself doesn't do any check: when using stock kernel modules, ROCm actually tries to use the gfx7 card.
I'm not against gfx7 support being disabled by default (possibly with an option to unlock it for those who take the risk), but that had better be done on the ROCm side. Also, since ROCm still tries to use the gfx7 card when the provided dkms module is not used (and so still ships some support for it), I would like to know if that disablement on the kernel module side can be unlocked, so I can test whether ROCm gfx7 support works with the dkms kernel modules.
I totally forgot to respond in October 2022, but I experienced the exact same thing at the time: with dkms modules, the gfx7 device was simply not listed at all. I'm confirming the behavior is the same today with the more recent versions. I would like to know how to unlock the gfx7 device when running the dkms modules so I can do extended tests, if that's possible.
As a side remark, I noticed that latest old OpenCL Orca driver ( Edit: It looks like |
Some info:
So I enabled Both Fortunately, this doesn't wreck the kernel like before, so at least, yes, the original bug is fixed. There is a new one though. Also, the fact that OpenCL applications hang is a problem.
Since the current I reported the remaining issue in another dedicated thread: |
I tried this week to run ROCm with Hawaii on a Threadripper PRO based computer, installing ROCm from “Radeon™ Software for Linux® version 21.40.1 for Ubuntu 20.04.3”:
https://www.amd.com/en/support/kb/release-notes/rn-amdgpu-unified-linux-21-40-1
I was not expecting OpenCL to be provided, I was just curious about it, but I was expecting not to get my kernel wrecked and be required to reboot.
It looks like the broken Hawaii (GFX7) support is still provided with current ROCm, and it puts the kernel in such a bad state that the user is asked to reboot.
This means, for example, that a user cannot host both a Radeon R9 390X running Orca and a newer card like a Radeon PRO W6600 on the same system: the ROCm driver will break the kernel because it tries to provide support for the R9 390X.
Excerpt from dmesg:
Whole error log from dmesg: