rocm 2.10: clinfo generates segfault in /opt/rocm/hsa/lib/libhsa-ext-image64.so.1:amd::GpuAgent::GetInfo() #962
Comments
After some difficulty I was able to get enough of the drivers, app, and libraries recompiled with debug info #1 0x00007f931f67c57d in hsa_amd_image_get_info_max_dim (component=..., here is the stack: |
Can confirm, installing hsa-ext-rocr will segfault anything trying to get OpenCL info (e.g. darktable, hashcat) |
Hi @barolo! |
Any news on this? |
No, not particularly. 3.0 came out, so I upgraded one of our UDOO Bolt V8s to that and I still get a segfault, but I don't know yet if it's the same segfault. I am debugging that now. |
@pqyptixa, on Raven Ridge it's still broken in the exact same place in 3.0, in the function hsa_amd_image_get_info_max_dim_impl() in libhsa-ext-image64.so.1.1.9. This can be reproduced by running /opt/rocm/opencl/bin/x86_64/clinfo via gdb, setting a breakpoint in hsa_amd_image_get_info_max_dim_impl(), and then stepping a few times inside the function. |
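For anyone who wants to script the reproduction described above, here is a minimal sketch that only builds the gdb command line. It assumes the clinfo path and function name quoted in this thread; the helper name `gdb_repro_argv` is made up for illustration. It stops at the breakpoint and prints a backtrace; the "stepping a few times" part still has to be done interactively.

```python
import shlex


def gdb_repro_argv(clinfo_path="/opt/rocm/opencl/bin/x86_64/clinfo",
                   breakpoint="hsa_amd_image_get_info_max_dim_impl"):
    """Build a gdb command line that runs clinfo and stops in the suspect function."""
    return [
        "gdb", "-batch",
        # The image library is loaded at run time, so the breakpoint must be
        # allowed to stay pending until the shared object shows up.
        "-ex", "set breakpoint pending on",
        "-ex", f"break {breakpoint}",
        "-ex", "run",
        "-ex", "bt",
        "--args", clinfo_path,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(gdb_repro_argv()))
```

Dropping `-batch` from the list leaves you at the gdb prompt once the breakpoint is hit, which is what you want for stepping through the function.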
Almost certainly related to ROCm/ROCR-Runtime#68. I can't assign this issue though since I don't have permissions in this repo. |
Oh hey! Just the person I was hoping to hear from! Thank you for your efforts, Mr. Skeely! I have an update: Good news: Analysis: This failure is on Ryzen CPU / RavenRidge GPU. What I realized this morning is that I don't know which of these query args is triggering it. Do any of them look like they would be immediately broken on RavenRidge:
|
@johnutz-PNSR can you say which files you moved? I renamed every /opt/rocm/hsa/lib/libhsa-ext-image64.so* file (and also /opt/rocm/lib/libhsa-ext-image64.so) and I still get a crash... |
(engages qa guy mode) |
Argh! Somehow hit CLOSE while typing my long repro story! I then executed /opt/rocm/opencl/bin/x86_64/clinfo. So, while stepping through libhsa-runtime64.so.1.1.9 I noticed that libhsa-runtime64.so.1.1.9 tries to load libhsa-ext-finalizer64.so.1.1.9 even though it doesn't come with ROCm. When libhsa-runtime64.so.1.1.9 didn't find libhsa-ext-finalizer64.so.1.1.9 it totally didn't care! So I moved libhsa-ext-image64.so.* out of the /opt/rocm/hsa/lib directory, and libhsa-runtime64.so.1.1.9 then happily shared all the facts it had access to back to /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1, which in turn returned them to /opt/rocm/opencl/bin/x86_64/clinfo, and /opt/rocm/opencl/bin/x86_64/clinfo happily prints them out to the screen. Note that I am being super picky about describing the paths to things for a reason: if you are not doing exactly what I am doing, then your setup is not the same as mine and I can make no statements about your possible outcome. (disengage qa guy mode - returning to dev guy mode) @skeelyamd, would this be a quick fix? Null pointers usually are, unless the null pointer comes from the software having baked-in assumptions about what hardware it expects to be visiting... Thanks! |
There's nothing particularly RavenRidge-specific in the faulting code path. It's essentially indexing a lookup table common to the processor family. That said, the values are returned to a variable-sized structure allocated by the caller. In this case that would be OpenCL (libamdocl64.so). If that structure isn't large enough, or a bogus pointer was given, then the function can segfault. From your trace I can see that it isn't null at least (value=0x7fff3f26f960). Between 2.9 and 3.0 no deliberate functional changes were introduced into libhsa-ext-image64.so. However, a critical support library and some build files were modified. It's possible that some RavenRidge-specific errors were introduced there. Removing the library disables image support and so skips OpenCL's query for image properties. As you noted, libhsa-ext-finalizer64 and libhsa-ext-image64 are optional libraries (with finalizer having been deprecated and removed long ago) and ROCr is quite happy not to find them. So the quickest workaround is to remove libhsa-ext-image64. Of course that leaves you without image support, but at least things won't be crashing. Most of the ROCr team is on holiday right now (including myself), so it's hard to say how quick a fix will be. We do have someone (who isn't on holiday) looking into this at the moment, but there's nothing interesting to say yet. In the meantime it would be helpful to know whether using the 2.9 OpenCL (libamdocl64) or the 2.9 image lib (libhsa-ext-image64) resolves the issue for you. If so, rolling back that one component should be a better workaround, and knowing which one (if either) will help ensure that what we reproduce is the same issue you are observing. |
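The "remove libhsa-ext-image64" workaround can be scripted so that the files are set aside rather than deleted. The following is a rough sketch, not an official procedure: the two library directories are the ones named in this thread, while the backup location and the function name `set_aside_image_lib` are hypothetical choices made for the example.

```python
import shutil
from pathlib import Path

# Directories mentioned in this thread; other installs may differ.
DEFAULT_LIB_DIRS = ("/opt/rocm/hsa/lib", "/opt/rocm/lib")
PATTERN = "libhsa-ext-image64.so*"


def set_aside_image_lib(lib_dirs=DEFAULT_LIB_DIRS,
                        backup_dir="/opt/rocm/disabled-image-lib"):
    """Move every libhsa-ext-image64.so* out of lib_dirs and return the new paths."""
    backup_root = Path(backup_dir)  # hypothetical location, pick your own
    moved = []
    for lib_dir in lib_dirs:
        src_dir = Path(lib_dir).absolute()
        if not src_dir.is_dir():
            continue  # this install does not have the directory
        # Mirror the source directory under the backup root so that files with
        # the same name from different directories do not collide.
        dest_dir = backup_root / src_dir.relative_to(src_dir.anchor)
        for src in sorted(src_dir.glob(PATTERN)):
            dest_dir.mkdir(parents=True, exist_ok=True)
            dest = dest_dir / src.name
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved


if __name__ == "__main__":
    for path in set_aside_image_lib():
        print("moved to", path)
```

On a real install this needs root, and moving the files back restores image support. If clinfo still crashes afterwards, the crash is probably not the one discussed here.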
@skeelyamd thank you for answering on your holiday. I appreciate it. I will attempt what you suggested as soon as I can, because they are interesting and simple avenues to approach. However, in performance testing the various OpenCV 4.2 backends I managed to crash the box hard, and a coworker informs me that it awaits my return to the office for a round of fsck. That will be Tuesday. |
@skeelyamd So you can tell the one human working over the holiday that they can focus on debugging libhsa-ext-image64.so.1.1.9 and ignore libOpenCL.so.1. Thanks for your suggestion, things are better now. |
All, Reading through the comments on this issue I surmise the following:
I tried to see if anything jumps out in a diff between the ROCm 2.9 and 3.0 releases. Initial review does not hint at anything obvious. Is it possible to request a console log from a run of rocminfo or clinfo with the Images library removed? I am trying to get the value of the "Chip ID" field. Meanwhile I will try to get access to a RavenRidge system. Per my lookup the Chip ID should be 0x15DD (5597). |
Hi @rerrabolu! Here is rocminfo on our UDOO Bolt V8: HSA System Attributes ========== Runtime Version: 1.1
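A quick way to check a reported Chip ID against the expected value, whether it arrives as hex or decimal. This is only a sketch: the expected value 0x15DD comes from the comment above, and the helper names are invented for the example.

```python
EXPECTED_CHIP_ID = 0x15DD  # value quoted above for RavenRidge


def parse_chip_id(text):
    """Parse a Chip ID written either as decimal ("5597") or hex ("0x15DD")."""
    # Base 0 lets int() pick the base from the prefix.
    return int(text.strip(), 0)


def matches_expected_chip_id(text, expected=EXPECTED_CHIP_ID):
    """Return True when the reported Chip ID equals the expected one."""
    return parse_chip_id(text) == expected


if __name__ == "__main__":
    for reported in ("0x15DD", "5597"):
        print(reported, "->", parse_chip_id(reported), matches_expected_chip_id(reported))
```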
|
Hi @rerrabolu ! |
I am trying to reproduce the error on a system that I can access. That is currently the limiting factor. |
Will update once I have more info. |
Couldn't ask for anything more! Thank you very much! |
Hi, I was able to reproduce the error on a device. We know the fix. Given the current release process, I am afraid I can't give a date when a fix will become available to the general public. In the interim, removing the images library from the loader list will help unless image-specific functionality is needed. |
@rerrabolu This is very good news! Thank you for your efforts! Thanks! johnu |
@rerrabolu, I ask about the bug's mode of failure because the image functionality seems pretty essential to our current development efforts on Ryzen. Would using the 2.9 libhsa-ext-image64.so provide us the image functionality we need? |
One of the modules ROCr Images initializes is addrlib. The failure is in that code. It is rather ASIC-specific. I will be surprised if you are able to use 2.9, as it too would suffer from the same problem. You can find the open-sourced addrlib here: https://github.com/mesa3d/mesa/tree/master/src/amd/addrlib I don't know if there is a way to share a pre-release version of the library. I will let my colleague Sean look into this. No promises. |
@rerrabolu thank you for explaining that. What's noteworthy is that the crash doesn't happen if I use the current released code and replace the current libhsa-ext-image64.so with the 2.9 version. When you say you will be surprised if I am able to use the 2.9 version, do you mean that you would expect it to crash just like the 3.0 version does, or that it won't perform the correct image behaviors despite not crashing? Please let me know if you have a chance. Thanks! johnu |
Thanks for confirming that using 2.9 is able to get around the issue. This will allow the fix to flow through the normal process. I was expecting 2.9 to crash as well. I have not looked into the 2.9 code base to determine if it will result in some incorrect image behaviors. I will try to look into it and update my observations here. I am going on break starting tomorrow. |
@rerrabolu is there a sample test app that I can use to demonstrate that the image stuff is working correctly? |
I can confirm this crash (or something with the same backtrace) still exists on 3.0.0.6. |
I'm having this problem too with v3.3. Is there a way to tell which upcoming release will include a fix? Deleting/replacing libhsa-ext-image64.so* does not get clinfo to work for me. |
@newmanmr, if removing libhsa-ext-image64.so* does not resolve this issue for you then you are seeing a different root cause. OpenCL runs a significant amount of code before and after initializing ROCr and this can fail at multiple points. I'd suggest checking the backtrace to see whether you are seeing the same issue or not. We are expecting the fix for this issue to be in the next release (3.5). |
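One rough way to do that backtrace check is to look at which shared object frame #0 came from. The sketch below assumes a gdb backtrace against stripped libraries, like the one in the original report, where each frame ends in `from <library path>` (sometimes wrapped onto the next line). The function names are made up for illustration, and a trace from a debug build, which prints `at file:line` instead, will not be recognized.

```python
import re


def frame_zero_library(backtrace):
    """Return the library path gdb printed for frame #0, or None if absent."""
    lines = backtrace.splitlines()
    for index, line in enumerate(lines):
        if re.match(r"#0\s", line.strip()):
            chunk = [line]
            # gdb may wrap the "from <path>" part onto following lines,
            # so collect lines until the next frame starts.
            for follow in lines[index + 1:]:
                if follow.strip().startswith("#"):
                    break
                chunk.append(follow)
            match = re.search(r"\bfrom\s+(\S+)", " ".join(chunk))
            return match.group(1) if match else None
    return None


def looks_like_this_issue(backtrace):
    """True when the innermost frame is inside libhsa-ext-image64."""
    library = frame_zero_library(backtrace)
    return library is not None and "libhsa-ext-image64" in library


if __name__ == "__main__":
    import sys
    # Usage: paste or pipe the output of gdb's "bt" command on stdin.
    print(looks_like_this_issue(sys.stdin.read()))
```

A `False` result only means the innermost frame is somewhere else; it does not say what the other root cause is.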
Hi @johnutz-PNSR, Please check latest ROCm Documentation and ROCm 5.7.1 to see if your issue has been resolved. If resolved, please close the ticket. Thanks. |
Original ticket is more than a year old and the person that opened the ticket has not responded to the latest request. If this is still an issue, please file a new ticket and we will investigate. Thanks! |
OS: Linux ryzendev2 5.0.0-37-generic #40~18.04.1-Ubuntu
HW: UDOO Bolt V8 - IOMMU is enabled in the BIOS
(lspci and lsmod are listed below the stack trace)
I am attempting to run clinfo.
I get a segfault in amd::GpuAgent::GetInfo() in /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.
But libhsa-ext-image64.so.1 is stripped and so the stack from the core file is less useful
than hoped:
gdb /usr/bin/clinfo -c ~/core
Reading symbols from /usr/bin/clinfo...(no debugging symbols found)...done.
[New LWP 1531]
[New LWP 1534]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `clinfo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f7cf8f3b850 in ?? () from /opt/rocm/hsa/lib/libhsa-ext-image64.so.1
[Current thread is 1 (Thread 0x7f7cfc44e740 (LWP 1531))]
(gdb) bt
#0 0x00007f7cf8f3b850 in ?? () from /opt/rocm/hsa/lib/libhsa-ext-image64.so.1
#1 0x00007f7cfae924d9 in amd::GpuAgent::GetInfo(hsa_agent_info_t, void*) const () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#2 0x00007f7cfaea6e68 in HSA::hsa_agent_get_info(hsa_agent_s, hsa_agent_info_t, void*) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#3 0x00007f7cfb270733 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#4 0x00007f7cfb270e32 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#5 0x00007f7cfb27245a in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#6 0x00007f7cfb23f28f in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#7 0x00007f7cfb23a297 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#8 0x00007f7cfb20dad5 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#9 0x00007f7cfb3853c9 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#10 0x00007f7cfb20dc0c in clIcdGetPlatformIDsKHR ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#11 0x00007f7cfc4563c5 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#12 0x00007f7cfc45818f in ?? ()
from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#13 0x00007f7cfb4a1827 in __pthread_once_slow (once_control=0x7f7cfc45c0d8, init_routine=0x7f7cfc457fb0) at pthread_once.c:116
#14 0x00007f7cfc4568f1 in clGetPlatformIDs ()
from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#15 0x000055fbf38ff722 in ?? ()
#16 0x00007f7cfbc78b97 in __libc_start_main (main=0x55fbf38ff5d0, argc=1, argv=0x7fff359e5e98, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fff359e5e88)
at ../csu/libc-start.c:310
#17 0x000055fbf38ffb3a in ?? ()
(gdb)
lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15d0
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Device 15d1
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:01.6 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:01.7 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15db
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15dc
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15e8
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15e9
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ea
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15eb
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ec
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ed
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ee
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ef
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev 83)
05:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device 15de
05:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Device 15df
05:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15e0
05:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15e1
05:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] Device 15e2
05:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Device 15e3
05:00.7 Non-VGA unclassified device: Advanced Micro Devices, Inc. [AMD] Device 15e6
06:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 61)
lsmod:
Module Size Used by
binfmt_misc 24576 1
nls_iso8859_1 16384 1
input_leds 16384 0
hid_generic 16384 0
usbhid 53248 0
edac_mce_amd 28672 0
kvm_amd 90112 0
ccp 86016 1 kvm_amd
kvm 647168 1 kvm_amd
snd_hda_codec_realtek 114688 1
irqbypass 16384 1 kvm
amdgpu 3915776 11
snd_hda_codec_generic 77824 1 snd_hda_codec_realtek
ledtrig_audio 16384 2 snd_hda_codec_generic,snd_hda_codec_realtek
snd_hda_codec_hdmi 53248 1
snd_hda_intel 49152 5
snd_hda_codec 135168 4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
snd_hda_core 86016 5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
snd_hwdep 20480 1 snd_hda_codec
snd_pcm 102400 4 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_core
snd_seq_midi 20480 0
amdkcl 28672 1 amdgpu
snd_seq_midi_event 16384 1 snd_seq_midi
amd_iommu_v2 20480 1 amdgpu
amdttm 102400 1 amdgpu
snd_rawmidi 36864 1 snd_seq_midi
crct10dif_pclmul 16384 1
amd_sched 32768 1 amdgpu
crc32_pclmul 16384 0
drm_kms_helper 180224 1 amdgpu
cdc_acm 36864 0
ghash_clmulni_intel 16384 0
drm 483328 16 drm_kms_helper,amd_sched,amdttm,amdgpu,amdkcl
snd_seq 69632 2 snd_seq_midi,snd_seq_midi_event
aesni_intel 372736 0
i2c_algo_bit 16384 1 amdgpu
fb_sys_fops 16384 1 drm_kms_helper
snd_seq_device 16384 3 snd_seq,snd_seq_midi,snd_rawmidi
aes_x86_64 20480 1 aesni_intel
syscopyarea 16384 1 drm_kms_helper
crypto_simd 16384 1 aesni_intel
cryptd 24576 3 crypto_simd,ghash_clmulni_intel,aesni_intel
glue_helper 16384 1 aesni_intel
snd_timer 36864 2 snd_seq,snd_pcm
sysfillrect 16384 1 drm_kms_helper
snd 86016 21 snd_hda_codec_generic,snd_seq,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek,snd_timer,snd_pcm,snd_rawmidi
sysimgblt 16384 1 drm_kms_helper
snd_pci_acp3x 16384 0
k10temp 16384 0
soundcore 16384 1 snd
mac_hid 16384 0
sch_fq_codel 20480 2
parport_pc 36864 0
ppdev 24576 0
lp 20480 0
parport 53248 3 parport_pc,lp,ppdev
ip_tables 32768 0
x_tables 40960 1 ip_tables
autofs4 45056 2
ahci 40960 2
libahci 32768 1 ahci
r8169 86016 0
i2c_amd_mp2_pci 20480 0
i2c_piix4 28672 0
realtek 20480 0
video 49152 0
sdhci_acpi 24576 0
sdhci 57344 1 sdhci_acpi
i2c_hid 28672 0
hid 126976 3 i2c_hid,usbhid,hid_generic
Any help with this would be greatly appreciated!
Thank you very much in advance!
John Utz
Pensar Development