Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in clinfo #1121

Closed
inducer opened this issue May 31, 2020 · 4 comments
Closed

Segfault in clinfo #1121

inducer opened this issue May 31, 2020 · 4 comments

Comments

@inducer
Copy link

inducer commented May 31, 2020

I've installed the ROCM runtime on my (Debian testing) machine (upstream kernel 5.3.14-1, dual-socket Haswell Xeon) following the instructions. When I run clinfo, I get a segfault. Adding "HSAKMT_DEBUG_LEVEL=7" sheds a bit more light, but not much:

$ HSAKMT_DEBUG_LEVEL=7  /opt/rocm-3.3.0/opencl/bin/x86_64/clinfo
acquiring VM for 13bc using 7
SVM alt (coherent):    0x1e00000 - 0x100167ffff
SVM (non-coherent): 0x1001680000 - 0x3fffffffff
Failed to map remapped mmio page on gpu_mem 0
[hsaKmtAllocMemory] node 0
[hsaKmtMapMemoryToGPU] address 0x1e08000
[hsaKmtAllocMemory] node 0
bind_mem_to_numa mem 0x1e01000 flags 0x40 size 0x1000 node_id 0
[hsaKmtMapMemoryToGPUNodes] address 0x1e01000 number of nodes 1
[hsaKmtAllocMemory] node 2
[hsaKmtAllocMemory] node 0
bind_mem_to_numa mem 0x1e14000 flags 0x40 size 0x2000 node_id 0
[hsaKmtMapMemoryToGPUNodes] address 0x1e14000 number of nodes 1
[hsaKmtAllocMemory] node 0
bind_mem_to_numa mem 0x1e04000 flags 0x1040 size 0x1000 node_id 0
[hsaKmtMapMemoryToGPUNodes] address 0x1e04000 number of nodes 1
[1]    1726631 segmentation fault (core dumped)  HSAKMT_DEBUG_LEVEL=7 /opt/rocm-3.3.0/opencl/bin/x86_64/clinfo‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

The backtrace from gdb is not very informative, other than placing the crash somewhere inside the OpenCL runtime:

Thread 1 "clinfo" received signal SIGSEGV, Segmentation fault.
__GI___libc_free (mem=0x437265776f500403) at malloc.c:3102
3102    malloc.c: No such file or directory.
(gdb) bt
#0  __GI___libc_free (mem=0x437265776f500403) at malloc.c:3102
#1  0x00007ffff7a1ef65 in ?? () from /opt/rocm-3.3.0/opencl/lib/x86_64/libamdocl64.so
#2  0x00007ffff7a23737 in ?? () from /opt/rocm-3.3.0/opencl/lib/x86_64/libamdocl64.so
#3  0x00007ffff79ec4bf in ?? () from /opt/rocm-3.3.0/opencl/lib/x86_64/libamdocl64.so
#4  0x00007ffff79e7096 in ?? () from /opt/rocm-3.3.0/opencl/lib/x86_64/libamdocl64.so
#5  0x00007ffff79b9b15 in ?? () from /opt/rocm-3.3.0/opencl/lib/x86_64/libamdocl64.so
#6  0x00007ffff7b37e39 in ?? () from /opt/rocm-3.3.0/opencl/lib/x86_64/libamdocl64.so
#7  0x00007ffff79b9c4c in clIcdGetPlatformIDsKHR () from /opt/rocm-3.3.0/opencl/lib/x86_64/libamdocl64.so
#8  0x00007ffff7e4dae3 in ?? () from /lib/x86_64-linux-gnu/libOpenCL.so.1
#9  0x00007ffff7e4e4e3 in clGetPlatformIDs () from /lib/x86_64-linux-gnu/libOpenCL.so.1
#10 0x000000000040cdd1 in ?? ()
#11 0x0000000000403b8c in ?? ()
#12 0x00007ffff7c70e0b in __libc_start_main (main=0x403aa0, argc=1, argv=0x7fffffffe688, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe678) at ../csu/libc-start.c:308
#13 0x000000000040c1fe in ?? ()‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

What can I do to troubleshoot this?

AMD Device from lspci:

81:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev cb)                                    
81:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Fiji HDMI/DP Audio [Radeon R9 Nano / FURY/FURY X]     ‍‍

Device nodes:

$ ls -lah /dev/kfd /dev/dri /dev/shm                                                                                                                                    andreask_work@stout 0:16
crw-rw---- 1 root gpu  242, 0 Dec  2 11:18 /dev/kfd

/dev/dri:

total 0
drwxr-xr-x  3 root root      120 Dec  2 11:18 .
drwxr-xr-x 19 root root     3.5K Dec 14 07:00 ..
drwxr-xr-x  2 root root      100 Dec  2 11:18 by-path
crw-rw----  1 root gpu  226,   0 Dec  2 11:18 card0
crw-rw----  1 root gpu  226,   1 Dec  2 11:18 card1
crw-rw----  1 root gpu  226, 128 Dec  2 11:18 renderD128

/dev/shm:

total 8.0K
drwxrwxrwt  2 root root   80 May 26 14:19 .
drwxr-xr-x 19 root root 3.5K Dec 14 07:00 ..
-rw-rw-r--  1 root gpu     8 May 27 00:13 hsakmt_shared_mem
-rw-rw-r--  1 root gpu    32 Dec  7 22:23 sem.hsakmt_semaphore‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Thanks!

@CircuitCraft42
Copy link

I'm having the same issue on Debian unstable (upstream kernel 5.6.14-1). However the first time I ran it I got

munmap_chunk(): invalid pointer
Aborted

And, after noticing I got inconsistent output, I ran it again a few times, also getting

free(): invalid pointer
Aborted

and

Number of platforms:                 1
  Platform Profile:                  FULL_PROFILE
  Platform Version:                  OpenCL 2.1 AMD-APP (3098.0)
  Platform Name:                     AMD Accelerated Parallel Processing
  Platform Vendor:                   Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback cl_amd_offline_devices


  Platform Name:                     AMD Accelerated Parallel Processing
ERROR: clGetDeviceIDs(-1)

as well as just Segmentation fault.

@CircuitCraft42
Copy link

CircuitCraft42 commented Jun 2, 2020

This comment had the solution. After installing the missing dependency, my GPU was listed correctly.

@inducer
Copy link
Author

inducer commented Jun 8, 2020

Thanks! I can confirm that installing libtinfo5 made this problem go away. So this bug comes down to adding the missing dependency (which is still missing in the ROCM 3.5 release).

@ROCmSupport
Copy link

Hi @inducer
Thanks for reaching out.
As we know, lintinfo5 solves the problem and we are taking care of this.
This ticket is similar to #1324
Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants