ROCM fails with supported hardware on mainboard Asus Prime B450-Plus Ryzen CPU/APU combo #1013
Comments
After a lot of trial and error and nonsensical output, I cannot stress enough the need to completely cold boot the system when troubleshooting and changing any BIOS settings on the ASUS Prime B450-Plus mainboard (and possibly other mainboards that support APUs). It is not enough to reboot, power off, or quickly toggle the physical switch on the PSU. Some state seems to persist for a short window after power is disconnected unless you take the same steps you would when draining flea power on a blade server: make the change in BIOS, shut down, disconnect the power inputs, hold the power button for x seconds, reconnect power, and boot. If this isn't done (and even if it was done and you still see this behavior, try power cycling one more time), you end up with nonsensical output and errors, and anyone who doesn't realize this during troubleshooting can waste a lot of time.
So it appears I can have 2.9 running properly now with just a single card (dgpu: enabled, igpu: disabled, plus a full cold boot). I would still like to run both cards side by side (but not in parallel), but this will have to do for now. I'll post the logs/coredumps for the latest version I can get installed without problems with a single card. I'll keep you posted.
Edit: 3.0 works just fine when the problematic APU is disabled and state is thoroughly flushed. I've attached a tarball with rocminfo, clinfo, and OpenCL exercise test 02 along with the appropriate strace and crash/core dumps. Hopefully this provides enough information to nail down the problematic parts. If not, and additional information is needed, please reach out, as I can replicate the issue on demand (and I want to see this fixed).
3.0_preboot (fresh install, no reboot) |
As a follow-up, I won't be able to test this any further. I've had to send the mainboard back to ASUS after it bricked. The issue was present on BIOS rev 0604 and on 2008, which was supposed to include a number of stability fixes. There seemed to be a minor issue after the firmware upgrade (the USB mouse wasn't detected after booting into BIOS without hotplugging), and about a week after the upgrade, following a restart to change the BIOS settings for the GPUs, the motherboard bricked completely. No POST and no beep codes when you remove all RAM. The power button would turn the fans on, but there was no power to USB devices, no chirp at boot, and no POST, and the ASUS firmware crash recovery doesn't start, so it's a dead duck. Resetting the CMOS did fix the bootup chirp, beep codes, and USB power issue, but the board still never made it to POST and signaled no error (no video output), and it would never make it past POST or into BIOS when pressing the appropriate buttons without a display. The motherboard is now being sent back for RMA/repair.
In case someone from AMD would like to follow up with ASUS, I've included the specific changes that seem to have triggered the brick; the only other possibility is some form of memory bug in the new BIOS that eventually caused a resource exhaustion or corruption issue. The hardware is about a year old and has been running without issue until now, so it's unlikely to be a hardware issue. Aside from the state issues mentioned in my previous posts, the changes made to the BIOS only involved setting the primary from PCIe to Internal GPU (Ryzen 5 2400G APU) and the shared memory size to 3G (from a 16GB Viper stick) with multi-monitor support turned off; the dGPU in the slot was a 14 CU RX560 at the time. On the previous firmware (0604) this left the system without POST until the dGPU was removed, after which you could boot up and fix the settings. With the latest update (2008) this appears to cause a brick.
I was testing this as I realized I had forgotten that test case when testing the above settings. Output was tested from both GPU/APU HDMI ports (there was none), and all other previous troubleshooting was performed first with the bare minimum component system, and then with the dGPU installed (to rule out all options). I doubt anything rocm related was at issue, but I'll know more once ASUS has had a chance to examine the motherboard. I've been told to expect it to take at least 14 days plus shipping. I'll post an update when I know more. |
Received a brand new Asus Prime B450-Plus board (same BIOS revision), and I'm back to the same old segmentation fault with clinfo. It now no longer seems to allow IGFX to be selected as the primary when IGFX Multi-Monitor is disabled in the BIOS (i.e. the setting that bricked the old one). |
ping... It looks like 3.1 gets a little bit further but OpenCL is still broken, and with multimonitor support enabled I was able to import pytorch and get a segmentation fault with the stack trace referencing compute engine cpp. |
Tested 3.3 with the same options. ROCm is still nonfunctional: problems with the unpinned compute engine.cpp at line 126, a segmentation fault, or a system hang, depending on the BIOS settings and the tests run. There has been no response or assignment in the past 2 months, so at this point I'm moving on. I ended up buying an Nvidia Jetson developer kit in the interim, and the Nvidia customer service is quite possibly some of the worst I've ever interacted with. Instructing a customer seeking an RMA to post in a community forum as a first step for a defective unit, rather than starting a formal RMA, is fairly dubious and high on my list of overall worst-case experiences (this happened with the Jetson).
For anyone running into the same problem: from what I've been able to gather, there appear to be two things happening here. One is a firmware bug on my motherboard model, and possibly several others, with regard to Ryzen CPU/APU combos, possibly involving the ACPI Component Resource Association Table (CRAT) not being properly structured or exposed. The table theory is speculation at this point because it's outside my personal expertise, and there is no one to get help from since there has been no response over the last several months. The other issue is that ROCm, as far as I've been able to tell, doesn't have any way of selecting specific HSA agents and instead walks the agent list, which then fails the process or segfaults.
Symptoms you might see include ROCm segfaulting when it attempts to enumerate all HSA agents. Anyone with a Ryzen 5 2400G CPU/APU and that motherboard may run into unpinned compute engine errors and will be unable to run ROCm tests without error while the IGFX BIOS option is enabled. OpenCL won't work, and neither will TF2 or PyTorch. The APU is detected as an agent, and there doesn't appear to be any way to have amdkfd ignore the APU as an agent without disabling it in the BIOS at boot.
If you wanted to use the APU solely to render X11 while using the dGPU for deep learning, as far as I've been able to tell over the last two months of digging into this problem, it can't be done. I've tested rolling back my kernel and using the old dkms modules and previous versions, and I would have to roll back to ROCm v1.9 and kernel 4.18 to have it work again, which is not possible given my current environment requirements. Additionally, disabling the BIOS option for multi-monitor support allows testing to get further in the process, but ROCm hangs when running standard PyTorch tests, and OpenCL still fails. Disabling the multi-monitor option while setting IGFX as the primary with RAM allocation set to 1-3GB after a cold flea-power-drain boot still bricks the device permanently with firmware version 2008 (yes, it happened a second time).
It looks like I'll have to RMA the replacement mainboard I received back to ASUS. Last time it took about a month and they sent me a new board; I may just go with another manufacturer (not ASUS) instead of dealing with this. At this point I can't consider AMD cards a viable alternative to Nvidia for deep learning research, and there just isn't enough of a skilled community to handle the troubleshooting aspects. The project looked promising, but given the current level of QA and the challenges that need to be addressed, rather than waste time spinning wheels in the dark, I think it would be better to shelve the entire project until the necessary resources can be brought in to provide a base level of support towards correcting the QA issues. The testing and verification of ROCm functionality shouldn't be contingent on specific motherboard firmware or BIOS options when the hardware is listed as supported, or untested. |
I've attached the CRAT table dump in case someone comes along who would find it useful for following up on this issue. I saved a full ACPI dump to file; if it's needed, feel free to reach out. I don't know enough about low-level structures to be able to do anything with this. I'll make it available for the next six months. |
This may not help, but did you ever try restricting the visible devices? There are several methods available, e.g. the environment variable ROCR_VISIBLE_DEVICES=0 if you want only the first device to be visible. |
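As a rough illustration of the suggestion above, the following Python sketch launches a tool with `ROCR_VISIBLE_DEVICES` set in its environment. The helper names `restricted_env` and `run_restricted` are hypothetical, and which index maps to which device on a given system is an assumption that has to be checked against the tool's own output.

```python
import os
import subprocess


def restricted_env(device_indices, base_env=None):
    """Return a copy of the environment with ROCR_VISIBLE_DEVICES set.

    device_indices is a list of integer device indices; which index belongs
    to the dGPU or the APU is system-specific and not assumed here.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ROCR_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_indices)
    return env


def run_restricted(command, device_indices):
    """Run command (a list of arguments) with only the given devices visible."""
    return subprocess.run(
        command,
        env=restricted_env(device_indices),
        capture_output=True,
        text=True,
        check=False,  # inspect returncode instead of raising
    )
```

For example, `run_restricted(["rocminfo"], [0])` would run rocminfo with only the first device exposed; comparing its output for index 0 and index 1 shows which agents each value actually selects.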
@seesturm Thanks. I wasn't aware of those options and I'd been searching fairly regularly for the past several months. It would have been something additional I could have tried. Unfortunately the BIOS settings bricked the replacement board, again, so it doesn't POST and I won't have anything to test on until the second RMA replacement comes back if I end up going that route. At this point, I'm thinking I probably will just wash my hands of this and scrap my original plans of having an AMD based deep learning rig. The software just isn't at a QA level where a fairly experienced professional system admin can get the platform verified up and running anymore and there's too much floating outdated and unusable information out there with regard to troubleshooting. |
I have the new motherboard back from RMA (no longer bricked). @kentrussell At this point, I'd be willing to donate a problematic but functional unit (ASUS motherboard/Ryzen 5 2400G APU combo) if it means the issue gets some attention, I was testing the setup with an RX560 [14 compute unit]. In either case, ASUS is not addressing the problem with their firmware at all and I've managed to brick several motherboards now just toggling BIOS options (from the UEFI manager). Main problems occur when APU/CPU/dGPU are detected as enabled, or plugged in.
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
Setting ROCR_VISIBLE_DEVICES had some unexpected behavior in a docker image prior to the v2020 AGESA firmware update; I haven't been able to test afterwards, as testing on the host fails with a hang. Previously (v2008 firmware), a value of 0 had shown both CPU/APU as HSA agents and a value of 1 showed the Ryzen 5 CPU/dGPU as HSA agents; both still failed the sample tests. After the firmware update, rocminfo hangs with processes that eventually go defunct. It looks like the first node created by kfd is an APU topology node [0x0:0x0], then the dGPU node is added to the topology, and then the APU is added last. Testing was done with 20.04 LTS using the 5.4 upstream kernel. rocminfo throws a bad address error right before hanging. Let me know if you need anything from me, but at this point I'm looking to purchase hardware that actually works (ASUS... never again).
Edit: The hardware may either be sold at a loss or donated depending on the effort needed, given the current status of CV19 shutdowns. I'd rather have it go towards something productive.
...
[ 0.000000] No NUMA configuration found ... |
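For anyone wanting to check the node order described above, here is a minimal Python sketch that reads the KFD topology from sysfs. It assumes the topology is exposed under `/sys/class/kfd/kfd/topology/nodes/<n>/properties` as `key value` lines including `cpu_cores_count` and `simd_count`; the function names are hypothetical, and on some kernels the properties files may only be readable as root.

```python
from pathlib import Path


def parse_properties(text):
    """Parse 'key value' lines from a KFD node properties file into a dict of ints."""
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            props[parts[0]] = int(parts[1])
    return props


def classify_node(props):
    """Label a node by whether it reports CPU cores, GPU SIMDs, or both."""
    has_cpu = props.get("cpu_cores_count", 0) > 0
    has_gpu = props.get("simd_count", 0) > 0
    if has_cpu and has_gpu:
        return "cpu+gpu"
    if has_cpu:
        return "cpu"
    if has_gpu:
        return "gpu"
    return "unknown"


def list_nodes(root="/sys/class/kfd/kfd/topology/nodes"):
    """Return (node_id, label) pairs in numeric node order."""
    node_dirs = [p for p in Path(root).iterdir() if p.name.isdigit()]
    result = []
    for node_dir in sorted(node_dirs, key=lambda p: int(p.name)):
        text = (node_dir / "properties").read_text()
        result.append((int(node_dir.name), classify_node(parse_properties(text))))
    return result
```

Calling `list_nodes()` on the host prints nothing by itself; printing its return value gives a quick view of which node is the CPU/APU and which is the dGPU, which can then be compared with what rocminfo reports.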
@fxkamd Is this similar to something else you were discussing in the ROCK Bug Report with a similar issue regarding the BIOS not configuring the CRAT correctly? I have a vague recollection of you mentioning something along those lines, but can't seem to find the notifications in my inbox and Outlook didn't find anything either. |
Hi @dundir |
Thank you for the update. It may be about a week before I can test this as the system is currently running a series of batches on Nvidia hardware. I'll provide an update once I have had a chance to replace the AMD hardware and run the tests on the latest version. I'll reach out once I have an update. |
Hi @dundir, please check latest ROCm Documentation and ROCm 5.7.1 to see if your query has been resolved. If resolved, please close the ticket. Thanks. |
Hi Adam,
Unfortunately, I no longer have the mainboard that had this issue.
The Asus B450 Prime Plus with built-in APU (on CPU), failed earlier this
year from a bad PSU unit.
Should I leave this issue open in the meantime, in the case that someone
that has this board can test and verify if the issue persists?
As these issues were largely related to poor implementation by ASUS
regarding PCIe Atomics, I don't see how software would resolve this, but
this isn't my area of expertise.
Best Regards,
dundir
|
Hi @dundir, Thank you for your response. Let's close the ticket as it has been opened for more than 3 yrs. If another community member runs into the same failure again on ASUS Prime B450-Plus, they can file a new ticket for investigation. Thanks for your support, |
Good Evening, I had rocm 1.8 set up and running awhile back but only with a single card and I just ended up reinstalling and bringing everything up to the latest versions.
Initially I seemed to be getting hit with the segfault issue on a 5.3.0-28 18.04 LTS ubuntu kernel after install (fresh OS) when running both rocminfo/clinfo on version 3.0. Quite a number of other people seem to be having this issue as well and posted a proposed workaround (downgrading).
Downgrading both the package and its dependencies to 2.10 seemed to fix that particular issue but I'm still having some problems. I'm hoping someone can help me work through or around this.
I currently have a Ryzen 5 2400G CPU/APU running on an Asus Prime B450-Plus mainboard along with an RX560 14cu dgpu. My setup is intended to reserve the 2400G APU for hardware acceleration of the xserver running the desktop and nothing else, and to offload tensorflow/pytorch to the RX560 for my studies. I've done this by having randr set the sink to offload to just the igpu.
Just to be clear, I'm trying to get rocm working with just the RX560. I understand the memory model is different between the two cards and I'm not trying to get both running in parallel, only one (the RX560), with the other (Raven) handling desktop/media acceleration.
As an additional detail, I've confirmed the EFI utility is running the latest firmware version available from ASUS, and it's been set to have both cards enabled (amd/amd). The BIOS setting iGPU Multi-Monitor Support has two values other than Disabled: Enabled and HybridMode. I've tried both of them, with a cold boot in between (which seems to be necessary when toggling to or from Disabled), and saw no outward change in behavior.
After reading many of the issues and trying/tinkering with the udev rules, I think I'm running into at least two problems but I'm not sure how to fix either at this point.
The first problem is that rocminfo fails with an error (listed below). rocm-smi detects both cards, but some stats don't make sense and there are warnings for other stats (see below). The second problem is that clinfo fails as well, with the clGetDeviceIDs(-1) error.
I've confirmed the kernel has all three options necessary for kfd built by checking /boot/config-$(uname -r) and the user is located in the video group (ubuntu doesn't use a render group).
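The kernel config check mentioned above can be scripted. The sketch below is only an illustration: the post does not name the three options, so the list of required options is left as a parameter, and `CONFIG_HSA_AMD` appears in the usage note purely as an example. The function names are hypothetical.

```python
def enabled_options(config_text):
    """Map each kernel option set to 'y' or 'm' to its value.

    config_text is the content of a kernel config file such as
    /boot/config-<release>; commented-out options are ignored.
    """
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            options[name] = value
    return options


def missing_options(config_text, required):
    """Return the names in required that are not built in or built as a module."""
    enabled = enabled_options(config_text)
    return [name for name in required if name not in enabled]
```

A typical use would read the file for the running kernel, e.g. `open("/boot/config-" + os.uname().release).read()`, and pass it with the option names to check, such as `missing_options(text, ["CONFIG_HSA_AMD"])`; an empty list means everything requested is enabled.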
Usually my installation workflow starts with testing smi,rocminfo, and then clinfo prior to passing /dev/kfd for use in a docker container as these seem to be fairly solid milestones in ensuring everything is working properly on the host first.
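The milestone order described above (rocm-smi, then rocminfo, then clinfo, before moving on to docker) could be automated along these lines. This is a sketch under one loud assumption: that each tool exits with a non-zero status when it fails, which may not hold for every failure mode, so the output is still worth reading. `HOST_STAGES`, `default_runner`, and `first_failing_stage` are hypothetical names.

```python
import subprocess

# Host-side checks in the order described in the post.
HOST_STAGES = [
    ("rocm-smi", ["rocm-smi"]),
    ("rocminfo", ["rocminfo"]),
    ("clinfo", ["clinfo"]),
]


def default_runner(command):
    """Run one command and return its exit code; a missing binary counts as a failure."""
    try:
        return subprocess.run(command, capture_output=True, check=False).returncode
    except FileNotFoundError:
        return 127


def first_failing_stage(stages, runner=default_runner):
    """Run stages in order and return the name of the first failure, or None."""
    for name, command in stages:
        # A crash such as a segfault shows up as a non-zero (negative) code.
        if runner(command) != 0:
            return name
    return None
```

`first_failing_stage(HOST_STAGES)` returns `None` only when all three tools exit cleanly, which is the point at which passing /dev/kfd into a container makes sense.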
The weird part is rocminfo will work properly when sudoed, but only after a completely cold boot when the igpu setting in bios has been disabled and the only card detected is then the dgpu, otherwise it fails with the listed error at the below line or line 900. clinfo doesn't work in any case (same error).
Booting with both igpu+dgpu enabled (even if only using the latter) seems to introduce problems. Digging into an strace for clinfo, the two likely causes looked like a missing shared object (libamdocl-orca64.so, included with amdgpu-pro) or a permission issue on kfd. Changing the udev rules to include MODE="0666" seemed to correct the silent permission denied entry in the strace log but did nothing to correct any of the main issues, and opening kfd to read/write for others would likely have security implications and doesn't seem to provide any benefit.
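To see what a udev rule change actually did to the device node, something like the following Python sketch can help. It only reads file metadata; the function names are hypothetical, and `/dev/kfd` in the usage note is the device path mentioned in the post.

```python
import os
import stat


def mode_summary(st_mode):
    """Summarize a raw st_mode value: ls-style string plus world read/write flag."""
    return {
        "mode": stat.filemode(st_mode),
        "world_rw": bool(st_mode & stat.S_IROTH) and bool(st_mode & stat.S_IWOTH),
    }


def describe_access(path):
    """Report permissions on path and whether the current user can read and write it."""
    summary = mode_summary(os.stat(path).st_mode)
    summary["current_user_rw"] = os.access(path, os.R_OK | os.W_OK)
    return summary
```

Running `describe_access("/dev/kfd")` before and after a rule change shows whether `MODE="0666"` took effect (`world_rw` becomes true) and whether the current user can open the device without it, for example through group membership.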
I'd appreciate any help you can provide in getting rocm working. I've been at this for a few days and tomorrow I plan on downgrading to 2.9 to see if there are any changes.
[Edit]: Downgrading to 2.9 resulted in no changes in either utility's output.
Please let me know if any additional information is needed.
strace clinfo
https://pastebin.com/gz3gUBGA
strace rocminfo
https://pastebin.com/cJA7dgD5
cpuinfo
https://pastebin.com/LFHGqZ57
#> rocminfo (output):
ROCk module is loaded
user is member of video group
hsa api call failure at: /data/jenkins-workspace/compute-rocm-rel-2.10/rocminfo/rocminfo.cc:1102
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
#>clinfo (output):
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3019.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
ERROR: clGetDeviceIDs(-1)
#> lspci | grep "VGA":
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5)
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev c6)
#> rocm-smi
ROCm System Management Interface
WARNING: GPU[0] : Unable to read /sys/class/hwmon/hwmon0/temp1_input
WARNING: GPU[0] : Unable to read /sys/class/hwmon/hwmon0/temp1_input
WARNING: GPU[0] : Unable to read /sys/class/hwmon/hwmon0/power1_average
WARNING: GPU[0] : Unable to read /sys/class/hwmon/hwmon0/pwm1
WARNING: GPU[1] : Unable to read /sys/class/drm/card1/device/gpu_busy_percent
GPU  Temp   AvgPwr  SCLK    MCLK     Fan    Perf  PwrCap  VRAM%  GPU%
0    N/A    N/A     N/A     N/A      None%  off   46.0W   0%     0%
1    33.0c  N/A     400Mhz  1067Mhz  None%  auto  N/A     39%    N/A
==================End of ROCm SMI Log ==================
#> inxi -G
Graphics: Card-1: Advanced Micro Devices [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X]
Card-2: Advanced Micro Devices [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
Display Server: X.Org 1.20.5 drivers: amdgpu,amdgpu
Resolution: 1920x1080@60.00hz
OpenGL: renderer: AMD RAVEN (DRM 3.33.0, 5.3.0-28-generic, LLVM 9.0.0)
version: 4.5 Mesa 19.2.8