rocm 2.10: clinfo generates segfault in /opt/rocm/hsa/lib/libhsa-ext-image64.so.1:amd::GpuAgent::GetInfo() #962

Closed
johnutz-PNSR opened this issue Dec 12, 2019 · 32 comments


@johnutz-PNSR

OS: Linux ryzendev2 5.0.0-37-generic #40~18.04.1-Ubuntu
HW: UDOO Bolt V8 - IOMMU is enabled in the BIOS

(lspci and lsmod are listed below the stack trace)

I am attempting to run clinfo.
I get a segfault in amd::GpuAgent::GetInfo() in /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.

But libhsa-ext-image64.so.1 is stripped and so the stack from the core file is less useful
than hoped:
gdb /usr/bin/clinfo -c ~/core
Reading symbols from /usr/bin/clinfo...(no debugging symbols found)...done.
[New LWP 1531]
[New LWP 1534]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `clinfo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f7cf8f3b850 in ?? () from /opt/rocm/hsa/lib/libhsa-ext-image64.so.1
[Current thread is 1 (Thread 0x7f7cfc44e740 (LWP 1531))]
(gdb) bt
#0 0x00007f7cf8f3b850 in ?? () from /opt/rocm/hsa/lib/libhsa-ext-image64.so.1
#1 0x00007f7cfae924d9 in amd::GpuAgent::GetInfo(hsa_agent_info_t, void*) const () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#2 0x00007f7cfaea6e68 in HSA::hsa_agent_get_info(hsa_agent_s, hsa_agent_info_t, void*) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#3 0x00007f7cfb270733 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#4 0x00007f7cfb270e32 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#5 0x00007f7cfb27245a in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#6 0x00007f7cfb23f28f in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#7 0x00007f7cfb23a297 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#8 0x00007f7cfb20dad5 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#9 0x00007f7cfb3853c9 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#10 0x00007f7cfb20dc0c in clIcdGetPlatformIDsKHR ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#11 0x00007f7cfc4563c5 in ?? ()
from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#12 0x00007f7cfc45818f in ?? ()
from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#13 0x00007f7cfb4a1827 in __pthread_once_slow (once_control=0x7f7cfc45c0d8, init_routine=0x7f7cfc457fb0) at pthread_once.c:116
#14 0x00007f7cfc4568f1 in clGetPlatformIDs ()
from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#15 0x000055fbf38ff722 in ?? ()
#16 0x00007f7cfbc78b97 in __libc_start_main (main=0x55fbf38ff5d0, argc=1, argv=0x7fff359e5e98, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fff359e5e88)
at ../csu/libc-start.c:310
#17 0x000055fbf38ffb3a in ?? ()
(gdb)

lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15d0
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Device 15d1
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:01.6 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:01.7 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15db
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15dc
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15e8
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15e9
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ea
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15eb
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ec
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ed
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ee
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Device 15ef
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev 83)
05:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device 15de
05:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Device 15df
05:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15e0
05:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15e1
05:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] Device 15e2
05:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Device 15e3
05:00.7 Non-VGA unclassified device: Advanced Micro Devices, Inc. [AMD] Device 15e6
06:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 61)

lsmod:
Module Size Used by
binfmt_misc 24576 1
nls_iso8859_1 16384 1
input_leds 16384 0
hid_generic 16384 0
usbhid 53248 0
edac_mce_amd 28672 0
kvm_amd 90112 0
ccp 86016 1 kvm_amd
kvm 647168 1 kvm_amd
snd_hda_codec_realtek 114688 1
irqbypass 16384 1 kvm
amdgpu 3915776 11
snd_hda_codec_generic 77824 1 snd_hda_codec_realtek
ledtrig_audio 16384 2 snd_hda_codec_generic,snd_hda_codec_realtek
snd_hda_codec_hdmi 53248 1
snd_hda_intel 49152 5
snd_hda_codec 135168 4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
snd_hda_core 86016 5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
snd_hwdep 20480 1 snd_hda_codec
snd_pcm 102400 4 snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_core
snd_seq_midi 20480 0
amdkcl 28672 1 amdgpu
snd_seq_midi_event 16384 1 snd_seq_midi
amd_iommu_v2 20480 1 amdgpu
amdttm 102400 1 amdgpu
snd_rawmidi 36864 1 snd_seq_midi
crct10dif_pclmul 16384 1
amd_sched 32768 1 amdgpu
crc32_pclmul 16384 0
drm_kms_helper 180224 1 amdgpu
cdc_acm 36864 0
ghash_clmulni_intel 16384 0
drm 483328 16 drm_kms_helper,amd_sched,amdttm,amdgpu,amdkcl
snd_seq 69632 2 snd_seq_midi,snd_seq_midi_event
aesni_intel 372736 0
i2c_algo_bit 16384 1 amdgpu
fb_sys_fops 16384 1 drm_kms_helper
snd_seq_device 16384 3 snd_seq,snd_seq_midi,snd_rawmidi
aes_x86_64 20480 1 aesni_intel
syscopyarea 16384 1 drm_kms_helper
crypto_simd 16384 1 aesni_intel
cryptd 24576 3 crypto_simd,ghash_clmulni_intel,aesni_intel
glue_helper 16384 1 aesni_intel
snd_timer 36864 2 snd_seq,snd_pcm
sysfillrect 16384 1 drm_kms_helper
snd 86016 21 snd_hda_codec_generic,snd_seq,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek,snd_timer,snd_pcm,snd_rawmidi
sysimgblt 16384 1 drm_kms_helper
snd_pci_acp3x 16384 0
k10temp 16384 0
soundcore 16384 1 snd
mac_hid 16384 0
sch_fq_codel 20480 2
parport_pc 36864 0
ppdev 24576 0
lp 20480 0
parport 53248 3 parport_pc,lp,ppdev
ip_tables 32768 0
x_tables 40960 1 ip_tables
autofs4 45056 2
ahci 40960 2
libahci 32768 1 ahci
r8169 86016 0
i2c_amd_mp2_pci 20480 0
i2c_piix4 28672 0
realtek 20480 0
video 49152 0
sdhci_acpi 24576 0
sdhci 57344 1 sdhci_acpi
i2c_hid 28672 0
hid 126976 3 i2c_hid,usbhid,hid_generic

Any help with this would be greatly appreciated!
Thank you very much in advance!

John Utz
Pensar Development

@johnutz-PNSR
Author

After some difficulty I was able to get enough of the driver's app and libraries recompiled with debug info.
Specifically:

#1 0x00007f931f67c57d in hsa_amd_image_get_info_max_dim (component=...,
attribute=12291, value=0x7fff3f26f960)
at /home/ryzendev/GIT/ROCR-Runtime/src/core/runtime/hsa_ext_interface.cpp:668

here is the stack:
Core was generated by `clinfo'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f931d6d2850 in ?? () from /opt/rocm/hsa/lib/libhsa-ext-image64.so.1
[Current thread is 1 (Thread 0x7f9320c83740 (LWP 8734))]
(gdb) bt
#0 0x00007f931d6d2850 in ?? () from /opt/rocm/hsa/lib/libhsa-ext-image64.so.1
#1 0x00007f931f67c57d in hsa_amd_image_get_info_max_dim (component=...,
attribute=12291, value=0x7fff3f26f960)
at /home/ryzendev/GIT/ROCR-Runtime/src/core/runtime/hsa_ext_interface.cpp:668
#2 0x00007f931f63c99d in amd::GpuAgent::GetInfo (this=0x55c2a1247c20, attribute=12291,
value=0x7fff3f26f960)
at /home/ryzendev/GIT/ROCR-Runtime/src/core/runtime/amd_gpu_agent.cpp:798
#3 0x00007f931f666e26 in HSA::hsa_agent_get_info (agent_handle=..., attribute=12291,
value=0x7fff3f26f960)
at /home/ryzendev/GIT/ROCR-Runtime/src/core/runtime/hsa.cpp:556
#4 0x00007f931f6a933e in hsa_agent_get_info (agent=..., attribute=12291,
value=0x7fff3f26f960)
at /home/ryzendev/GIT/ROCR-Runtime/src/core/common/hsa_table_interface.cpp:110
#5 0x00007f931faa8733 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#6 0x00007f931faa8e32 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#7 0x00007f931faaa45a in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#8 0x00007f931fa7728f in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#9 0x00007f931fa72297 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#10 0x00007f931fa45ad5 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#11 0x00007f931fbbd3c9 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#12 0x00007f931fa45c0c in clIcdGetPlatformIDsKHR ()
from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#13 0x00007f9320c8b3c5 in ?? () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#14 0x00007f9320c8d18f in ?? () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#15 0x00007f931fcd9827 in __pthread_once_slow (once_control=0x7f9320c910d8,
init_routine=0x7f9320c8cfb0) at pthread_once.c:116
#16 0x00007f9320c8b8f1 in clGetPlatformIDs ()
from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#17 0x000055c29f518722 in ?? ()
#18 0x00007f93204b0b97 in __libc_start_main (main=0x55c29f5185d0, argc=1,
argv=0x7fff3f270338, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fff3f270328) at ../csu/libc-start.c:310
#19 0x000055c29f518b3a in ?? ()
(gdb)

@barolo

barolo commented Dec 20, 2019

Can confirm: installing hsa-ext-rocr will segfault anything trying to get OpenCL info (e.g. darktable, hashcat).

@johnutz-PNSR
Author

Hi @barolo!
Further debugging has shown that in my case libhsa-runtime64.so.1.1.9 is trying to use the function pointer hsa_amd_image_get_info_max_dim_fn from libhsa-ext-image64.so.1 but cannot load libhsa-ext-image64.so.1, despite the fact that they are all in the same directory:

@pqyptixa

Any news on this?

@johnutz-PNSR
Author

No, not particularly. 3.00 came out, so I upgraded one of our UDOO Bolt V8s to that and I still get a segfault, but I don't know yet if it's the same segfault. I am debugging that now.

@johnutz-PNSR
Author

@pqyptixa , on raven ridge it's still broken in the exact same place in 3.0; in the function:

hsa_amd_image_get_info_max_dim_impl() in libhsa-ext-image64.so.1.1.9

This can be reproduced by running /opt/rocm/opencl/bin/x86_64/clinfo under gdb, setting a breakpoint in hsa_amd_image_get_info_max_dim_impl(), and then stepping a few times inside the function.
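
The reproduction steps described here could be scripted roughly as follows. This is only a sketch: the function name `write_repro_script` and the script path are made up for illustration, and it assumes the breakpoint symbol is resolvable once the library is loaded (on a stripped build gdb may never resolve it, in which case `run` simply stops at the segfault).

```shell
#!/usr/bin/env bash
# Write a gdb command file that stops at the suspect function and prints
# backtraces. The symbol name is the one quoted in this thread; the
# helper name and output path are arbitrary choices for this sketch.
write_repro_script() {
    cat > "$1" <<'EOF'
set breakpoint pending on
break hsa_amd_image_get_info_max_dim_impl
run
bt
info sharedlibrary hsa
continue
bt
EOF
}

# Usage (not executed here):
#   write_repro_script /tmp/clinfo-repro.gdb
#   gdb -batch -x /tmp/clinfo-repro.gdb /opt/rocm/opencl/bin/x86_64/clinfo
```

`set breakpoint pending on` matters because the image library is loaded at runtime, so the symbol does not exist yet when the breakpoint is requested.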

@skeelyamd
Collaborator

Almost certainly related to ROCm/ROCR-Runtime#68. I can't assign this issue though since I don't have permissions in this repo.

@johnutz-PNSR
Author

Oh hey! Just the person I was hoping to hear from! Thank you for your efforts, Mr. Skeely!

I have an update:
Bad news:
It still repros in 3.00.

Good news:
If I move libhsa-ext-image64.so* out of the way, the crash no longer happens; clinfo, OpenCV::DNN, etc. all work.

Analysis:
As near as I can tell, the crash is happening in hsa_amd_image_get_info_max_dim_impl in
libhsa-ext-image64.so.1.1.9. I say this because I think I am single-stepping into this
function before the segfault is raised.

This failure is on a Ryzen CPU / RavenRidge GPU.
I assume the problem doesn't repro on desktop-class GPUs.

What I realized this morning is that I don't know which of these query args is triggering it. Do any of them look like they would be immediately broken on RavenRidge?

case HSA_EXT_AGENT_INFO_IMAGE_1D_MAX_ELEMENTS:
case HSA_EXT_AGENT_INFO_IMAGE_1DA_MAX_ELEMENTS:
case HSA_EXT_AGENT_INFO_IMAGE_1DB_MAX_ELEMENTS:
case HSA_EXT_AGENT_INFO_IMAGE_2D_MAX_ELEMENTS:
case HSA_EXT_AGENT_INFO_IMAGE_2DA_MAX_ELEMENTS:
case HSA_EXT_AGENT_INFO_IMAGE_2DDEPTH_MAX_ELEMENTS:
case HSA_EXT_AGENT_INFO_IMAGE_2DADEPTH_MAX_ELEMENTS:
case HSA_EXT_AGENT_INFO_IMAGE_3D_MAX_ELEMENTS:
case HSA_EXT_AGENT_INFO_IMAGE_ARRAY_MAX_LAYERS:
  return hsa_amd_image_get_info_max_dim(public_handle(), attribute, value);
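
One way to narrow down which query is in flight: the debug backtrace earlier in this thread shows `attribute=12291` in every frame, so converting that to hex and searching the HSA image extension header for the matching enumerator should identify the case. A minimal sketch, where the helper names are made up and the header path in the usage comment is an assumption about the install layout:

```shell
#!/usr/bin/env bash
# Convert the decimal attribute value from the gdb frame to hex.
attr_to_hex() {
    printf '0x%x\n' "$1"
}

# Print header lines that assign the given hex value to an enumerator.
# The header path is an argument because its install location varies.
find_enumerator() {
    local header="$1" hex="$2"
    grep -n -i -E "=[[:space:]]*${hex}([^0-9a-fA-F]|\$)" "$header"
}

attr_to_hex 12291    # prints 0x3003

# Usage (header path is an assumption; adjust to your install):
#   find_enumerator /opt/rocm/hsa/include/hsa/hsa_ext_image.h "$(attr_to_hex 12291)"
```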

@pqyptixa

@johnutz-PNSR can you say which files you moved? I renamed every /opt/rocm/hsa/lib/libhsa-ext-image64.so* file (and also /opt/rocm/lib/libhsa-ext-image64.so) and I still get a crash...

@johnutz-PNSR
Author

(engages qa guy mode)
I am using a UDOO Bolt V8 (Ryzen V1000) with a fresh Ubuntu 18.04.3 install plus updates.
I do not have any of the AMDGPU Radeon Software for Linux installed.
I do not have the Ubuntu 18.04.3 clinfo or /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0 installed.
I installed rocm-3.00.
I then executed /opt/rocm/opencl/bin/x86_64/clinfo.
This led to my now-familiar crash, which occurs when clinfo asks /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
to ask /opt/rocm/hsa/lib/libhsa-runtime64.so.1 for info that is provided by hsa_amd_image_get_info_max_dim() that comes from

@johnutz-PNSR
Author

Argh! Somehow hit CLOSE while typing my long repro story!

I then executed /opt/rocm/opencl/bin/x86_64/clinfo.
This led to my now-familiar crash, which occurs when /opt/rocm/opencl/bin/x86_64/clinfo asks /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
to ask /opt/rocm/hsa/lib/libhsa-runtime64.so.1.1.9 for info that is provided by hsa_amd_image_get_info_max_dim(), which comes from
/opt/rocm/hsa/lib/libhsa-ext-image64.so.

So, while stepping through libhsa-runtime64.so.1.1.9 I noticed that it tries to load libhsa-ext-finalizer64.so.1.1.9 even though that library doesn't come with ROCm. When libhsa-runtime64.so.1.1.9 didn't find libhsa-ext-finalizer64.so.1.1.9, it totally didn't care! So I moved libhsa-ext-image64.so.* out of the /opt/rocm/hsa/lib directory, and libhsa-runtime64.so.1.1.9 then happily passed all the facts it had access to back to /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1, which in turn returned them to /opt/rocm/opencl/bin/x86_64/clinfo, which happily prints them to the screen.

Note that I am being super picky about describing the paths to things for a reason: if you are not doing exactly what I am doing, then your setup is not the same as mine and I can make no statements about your possible outcome.

(disengage qa guy mode - returning to dev guy mode)

@skeelyamd, would this be a quick fix? Null pointers usually are, unless the null pointer comes from the software having baked-in assumptions about what hardware it expects to be visiting...

Tnx!
johnu

@skeelyamd
Collaborator

There's nothing particularly RavenRidge specific in the faulting code path. It's essentially indexing a lookup table common to the processor family. That said the values are returned to a variable sized structure allocated by the caller. In this case that would be OpenCL (libamdocl64.so). If that structure isn't large enough or a bogus pointer was given then the function can segfault. From your trace I can see that it isn't null at least (value=0x7fff3f26f960).

Between 2.9 and 3.0 no deliberate functional changes were introduced into libhsa-ext-image64.so. However, a critical support library and some build files were modified. It's possible that some RavenRidge specific errors were introduced there.

Removing the library disables image support and so skips OpenCL's query for image properties. As you noted libhsa-ext-finalizer64 and libhsa-ext-image64 are optional libraries (with finalizer having been deprecated & removed long ago) and ROCr is quite happy not to find them. So the quickest workaround is to remove libhsa-ext-image64. Of course that leaves you without image support but at least things won't be crashing.
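
For anyone applying this workaround, a reversible version might look like the sketch below. The helper names and the `disabled-image-lib` backup directory are invented for illustration; the default directory is the one quoted in this thread. Some installs reportedly also have a `/opt/rocm/lib/libhsa-ext-image64.so` entry, which this sketch only handles if you pass that directory explicitly.

```shell
#!/usr/bin/env bash
# Move the optional image extension library aside so ROCr no longer finds it.
# lib_dir defaults to the path quoted in this thread; the backup directory
# name is a made-up choice for this sketch.
disable_image_lib() {
    local lib_dir="${1:-/opt/rocm/hsa/lib}"
    local backup_dir="$lib_dir/disabled-image-lib"
    local moved=0 f
    mkdir -p "$backup_dir" || return 1
    for f in "$lib_dir"/libhsa-ext-image64.so*; do
        # Skip the literal pattern when the glob matched nothing.
        [ -e "$f" ] || [ -L "$f" ] || continue
        mv -- "$f" "$backup_dir"/ || return 1
        moved=$((moved + 1))
    done
    echo "moved $moved file(s) to $backup_dir"
}

# Put the files back, e.g. after installing a release that contains the fix.
restore_image_lib() {
    local lib_dir="${1:-/opt/rocm/hsa/lib}"
    local backup_dir="$lib_dir/disabled-image-lib"
    local f
    for f in "$backup_dir"/libhsa-ext-image64.so*; do
        [ -e "$f" ] || [ -L "$f" ] || continue
        mv -- "$f" "$lib_dir"/ || return 1
    done
}

# Usage (needs root for the default path; not executed here):
#   sudo bash -c "$(declare -f disable_image_lib); disable_image_lib"
```

Moving rather than deleting keeps the original files and symlinks available for a later restore.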

Most of the ROCr team is on holiday right now (including myself) so it's hard to say how quick a fix will be. We do have someone (who isn't on holiday) looking into this at the moment but there's nothing interesting to say yet. In the meantime it would be helpful to know if using the 2.9 OpenCL (libamdocl64) or the 2.9 image lib (libhsa-ext-image64) resolves the issue for you. If so rolling back that one component should be a better workaround and knowing which (or if) will help ensure that what we reproduce is the same issue you are observing.

@johnutz-PNSR
Author

@skeelyamd thank you for answering on your holiday. I appreciate it.

I will attempt what you suggested as soon as I can, because they are interesting and simple avenues to approach.

However, in performance testing the various opencv4.2 backends i managed to crash the box hard and a coworker informs me that it awaits my return to the office for a round of fsck.

That will be tuesday.

@johnutz-PNSR
Author

@skeelyamd
Today is Tuesday, and I collected the rocm-2.9 OpenCL and HSA lib debs:
http://repo.radeon.com/rocm/apt/2.9.0/pool/main/h/hsa-ext-rocr-dev/hsa-ext-rocr-dev_1.1.9-122-ge5c4efb_amd64.deb
http://repo.radeon.com/rocm/apt/2.9.0/pool/main/r/rocm-opencl/rocm-opencl_1.2.0-2019100138_amd64.deb
I can confirm that the rocm-3.0 libhsa-ext-image64.so.1.1.9 is the source of the crash.
here is my testing matrix:
libOpenCL from 2.9 + libhsa-ext-image64 from 3.0 = SEGFAULT
libOpenCL from 2.9 + libhsa-ext-image64 from 2.9 = WORKS
libOpenCL from 3.0 + libhsa-ext-image64 from 2.9 = WORKS

So you can tell the one human working over the holiday that they can focus on debugging libhsa-ext-image64.so.1.1.9 and ignore libOpenCL.so.1.

Tnx for your suggestion, things are better now

@rerrabolu
Collaborator

All,

Reading through the comments on this issue I surmise the following:

  • Seg fault occurs in "hsa_amd_image_get_info_max_dim" method
  • OpenCL app clinfo can be used to trigger this seg fault
  • The seg fault can be avoided by removing Images library from the load path

I tried to see if anything jumps out in a diff between ROC 2.9 vs 3.0 releases. Initial review does not hint at something obvious.

Is it possible to get a console log from a run of rocminfo or clinfo with the Images library removed? I am trying to get the value of the "Chip ID" field. Meanwhile I will try to get access to a RavenRidge system. Per my lookup, the Chip ID should be 0x15DD (5597).
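
For anyone else gathering the same data point, the Chip ID lines can be pulled out of rocminfo output with a small filter. This is a sketch only: the helper names are made up, the pattern assumes the `Chip ID: 5597(0x15dd)` layout that rocminfo prints, and the rocminfo path in the usage comment is an assumption.

```shell
#!/usr/bin/env bash
# Print "<decimal> <hex>" for every "Chip ID" line in rocminfo-style text on stdin.
chip_ids() {
    sed -n 's/^[[:space:]]*Chip ID:[[:space:]]*\([0-9][0-9]*\)(\(0x[0-9a-fA-F]*\)).*/\1 \2/p'
}

# Convert a hex device ID to decimal.
hex_to_dec() {
    printf '%d\n' "$1"
}

hex_to_dec 0x15DD    # prints 5597

# Usage (requires a working rocminfo; path is an assumption):
#   /opt/rocm/bin/rocminfo | chip_ids
```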

@johnutz-PNSR
Author

Hi @rerrabolu!

here is rocminfo on our UDOO Bolt V8:

HSA System Attributes

Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE

==========
HSA Agents


Agent 1


Name: AMD Ryzen Embedded V1605B with Radeon Vega Gfx
Marketing Name: AMD Ryzen Embedded V1605B with Radeon Vega Gfx
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32(0x20) KB
Chip ID: 5597(0x15dd)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2000
BDFID: 1280
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 4
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 33554048(0x1fffe80) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
ISA Info:
N/A


Agent 2


Name: gfx902
Marketing Name: AMD Ryzen Embedded V1605B with Radeon Vega Gfx
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 0
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 5597(0x15dd)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1100
BDFID: 1280
Internal Node ID: 0
Compute Unit: 11
SIMDs per CU: 4
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 160(0xa0)
Max Work-item Per CU: 10240(0x2800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx902+xnack
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

@johnutz-PNSR
Author

Hi @rerrabolu !
Were you able to learn anything from my posting of the rocminfo output? Please let me know when you get a moment.
Tnx!
johnu

@rerrabolu
Collaborator

I am trying to reproduce the error, but I am currently limited by which systems I can access.

@rerrabolu
Collaborator

Will update this once I have more info.

@johnutz-PNSR
Author

Couldn't ask for anything more! Thank you very much!

@rerrabolu
Collaborator

Hi,

I was able to reproduce the error on a device. We know the fix. Given current release process, I am afraid I can't give a date when a fix will become available for general public. In the interim, removing images library from loader list will help unless image specific functionality is needed.

@johnutz-PNSR
Author

@rerrabolu This is very good news! Thank you for your efforts!
I recognize that this is a closed-source component, but nonetheless, can you tell me some details about the error?
Does it only occur with RavenRidge?
Is it along the lines of falling through a switch statement that doesn't contain a RavenRidge entry?
Or is it something else entirely?

Tnx!

johnu

@johnutz-PNSR
Author

@rerrabolu, I ask about the bug's mode of failure because the image functionality seems pretty essential to our current development efforts on Ryzen. Would using the 2.9 libhsa-ext-image64.so provide us the image functionality we need?

@rerrabolu
Collaborator

One of the modules ROCr Images initializes is addrlib. The failure is in that code. It is rather ASIC-specific. I will be surprised if you are able to use 2.9, as it too would suffer from the same problem. You can find the open-sourced addrlib here: https://github.com/mesa3d/mesa/tree/master/src/amd/addrlib

I don't know if there is a way to share a pre-release version of the library. I will ask my colleague Sean to look into this. No promises.

@johnutz-PNSR
Author

@rerrabolu thank you for explaining that.

What's noteworthy is that the crash doesn't happen if I use the current released code and replace the current libhsa-ext-image64.so with the 2.9 version.

When you say you will be surprised if I am able to use the 2.9 version, do you mean that you would expect it to crash just like the 3.0 version does, or that it won't perform the correct image behaviors despite not crashing?

Please let me know if you have a chance.

Tnx!

johnu

@rerrabolu
Collaborator

Thanks for confirming that using 2.9 is able to get around the issue. This will allow the fix to flow through the normal process.

I was expecting 2.9 to crash as well. I have not looked into the 2.9 code base to determine if it will result in some incorrect image behaviors. I will try to look into it and update my observations here, but note that I am going on break starting tomorrow.

@johnutz-PNSR
Author

@rerrabolu is there a sample test app that i can use to demonstrate the image stuff is working correctly?

@rhn

rhn commented Feb 27, 2020

I can confirm this crash (or something with the same backtrace) still exists on 3.0.0.6.

@newmanmr

newmanmr commented May 5, 2020

I'm having this problem too with v3.3. Is there a way to tell which upcoming release will include a fix?

Deleting/replacing libhsa-ext-image64.so* does not get clinfo to work for me.

@skeelyamd
Collaborator

@newmanmr, if removing libhsa-ext-image64.so* does not resolve this issue for you then you are seeing a different root cause. OpenCL runs a significant amount of code before and after initializing ROCr and this can fail at multiple points. I'd suggest checking the backtrace to see if you are seeing the same issue or not.
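
A quick way to do that comparison is to check whether the backtrace mentions the image library or the entry point seen in the traces above. This is a rough heuristic rather than a definitive test, and the helper name is made up for illustration:

```shell
#!/usr/bin/env bash
# Exit 0 if the gdb backtrace on stdin mentions the image extension library
# or the image-info entry point that appears in this issue's traces.
looks_like_this_issue() {
    grep -Eq 'libhsa-ext-image64\.so|hsa_amd_image_get_info_max_dim'
}

# Usage (binary and core paths are the examples from the original report):
#   gdb -batch -ex bt /usr/bin/clinfo ~/core | looks_like_this_issue \
#       && echo "same crash site" || echo "different crash site"
```

A match only says the crash is in the same library; a trace that ends elsewhere (for example entirely inside libamdocl64.so) points to a different problem.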

We are expecting the fix for this issue to be in the next release (3.5).

@nartmada
Collaborator

Hi @johnutz-PNSR, Please check latest ROCm Documentation and ROCm 5.7.1 to see if your issue has been resolved. If resolved, please close the ticket. Thanks.

@nartmada
Collaborator

Original ticket is more than a year old and the person that opened the ticket has not responded to the latest request. If this is still an issue, please file a new ticket and we will investigate. Thanks!
