Bristol Ridge.- Asrock A320M PRO4 system: vector_copy runs OK from console, but hangs in an ssh terminal #191
Comments
P.S. This is on a Bristol Ridge APU (Asrock A320M PRO4, BIOS version 3.00), running Ubuntu 16.04.3 with no X server and ROCm 1.6. |
We have never seen this. SSH is the standard way we log in to systems for ROCm; by default ROCm is designed to run headless. We do not see this issue on Xeon E5, Core or EPYC based servers, nor on the desktop machines we mostly work on these days (Core i7 Extreme, Xeon E3, Core i7, Core i5, Ryzen and Threadripper). Are you sure SSH is correctly configured? |
Also, can you try clinfo instead? |
Thanks. The SSH config on both client and server is whatever Ubuntu gives you by default, except that I sometimes also enable X forwarding. With and without X forwarding made no difference.
'clinfo' reports identical text at the console and in a remote shell, except for the 'Platform ID' field. (The 'Platform ID' field value changes every time I run clinfo. It looks like it's printing a pointer value; maybe this is a bug?) Here's the clinfo output:
Number of platforms: 1
Platform Name: AMD Accelerated Parallel Processing
Do you have any tips on how to debug something like this? Are there tracing facilities in the runtime, thunk or kernel module? |
Sounds more like an issue in the base Linux kernel or the Linux kernel graphics driver, since the userland is working fine on the console. |
Can you run a HIP or HCC app over SSH? |
Another data point: I can get the hang at the console, just by piping the output to a file. This runs ok at the console: > ./vector_copy Crazy! Something is sensitive to exactly which pty or tty (??) the stdout is plumbed into. |
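One way to check what stdout is actually attached to in each case (console, ssh pty, pipe, or redirected file) is to inspect the file descriptor directly. The following is a minimal Python sketch, not part of ROCm or the vector_copy sample; the function name `describe_fd` is made up for illustration.

```python
import os
import stat


def describe_fd(fd: int) -> str:
    """Return a short label for what a file descriptor is attached to."""
    if os.isatty(fd):
        return "tty"  # console, ssh pty, or terminal emulator
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISCHR(mode):
        return "char device"  # e.g. /dev/null
    return "other"


# Example use (report on stderr so the answer stays visible when stdout
# is redirected):
#   import sys
#   print("stdout is a", describe_fd(1), file=sys.stderr)
```

This only reports the kind of object behind the descriptor. It does not explain why the kind would matter, which is the open question at this point in the thread.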
The kernel is the ROCK kernel, compiled from source checked out from github about two days ago. This hang was also happening with the stock ROCm 1.6 kernel. If you'd rather support that one, I could certainly switch back to it. Compiling from source didn't seem to help or hurt things. I haven't recompiled any other components yet, everything else is ROCm 1.6 off-the-shelf. (At least it should be. AMDGPU PRO was installed previously, and I wiped it. ldd confirms that only rocm libs are getting pulled into vector_copy) |
Strace on the hanging and nonhanging vector_copies shows some differences that happen somewhat earlier than the actual hang. The earliest diff is this: open("/dev/shm/sem.hsakmt_semaphore", O_RDWR|O_NOFOLLOW) = 4 // will run ok Everything else up to that point looked identical (ignoring the huge volume of pointer differences.) |
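Finding the earliest real difference between two strace logs is easier if pointer values are masked first. The sketch below is one hypothetical way to do that in Python; the helper names `normalize` and `first_divergence` are invented here, and it assumes the logs were captured without timestamps or PIDs (those would need masking too).

```python
import re
from itertools import zip_longest
from typing import Iterable, Optional, Tuple

# Hex values (addresses, mmap results) differ from run to run, so mask them.
_HEX = re.compile(r"0x[0-9a-fA-F]+")


def normalize(line: str) -> str:
    """Strip the trailing newline and replace every hex literal with 0xPTR."""
    return _HEX.sub("0xPTR", line.rstrip("\n"))


def first_divergence(
    a: Iterable[str], b: Iterable[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (1-based line number, line from a, line from b) for the first
    pair that differs after normalization, or None if the logs match.
    A missing line (one log is shorter) is reported as None."""
    for index, (left, right) in enumerate(zip_longest(a, b), start=1):
        norm_left = None if left is None else normalize(left)
        norm_right = None if right is None else normalize(right)
        if norm_left != norm_right:
            return index, left, right
    return None
```

With two log files open, `first_divergence(open("ok.strace"), open("hang.strace"))` would return the first line pair worth reading by hand. The file names are placeholders. A line-by-line comparison like this breaks down once the two runs interleave syscalls differently, so it is only useful for locating the first divergence.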
This is the 'square' HIP app; it produces the same result in either VNC or at the console:
john@dash80:
Hmm, let's try another sample...
john@dash80:
Is there a better sample than these? I'm just poking around /opt/rocm and picking samples at random.
Compiling 'square' for debug and running in gdb shows this stack trace:
#0 0x00007ffff6f84ec6 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1 |
Try an HCC sample
Greg
On Sep 2, 2017, at 9:48 AM, jcoiner <notifications@github.com> wrote:
This is the 'square' HIP app, it produces the same result in either VNC or at the console:
john@dash80:/foo/square$ make
make: Nothing to be done for 'all'.
john@dash80:/foo/square$ ./square.hip.out
info: running on device AMD A8-9600 RADEON R7, 10 COMPUTE CORES 4C+6G
info: architecture on AMD GPU device is: 801
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB)
info: copy Host2Device
info: launch 'vector_square' kernel
info: copy Device2Host
Segmentation fault
Hmm, let's try another sample...
john@dash80:/foo$ cp -R /opt/rocm/hip/samples/2_Cookbook/0_MatrixTranspose .
john@dash80:/foo$ cd 0_MatrixTranspose/
john@dash80:/foo/0_MatrixTranspose$
john@dash80:/foo/0_MatrixTranspose$ make
/opt/rocm/hip/bin/hipcc -g -c -o MatrixTranspose.o MatrixTranspose.cpp
/opt/rocm/hip/bin/hipcc MatrixTranspose.o -o MatrixTranspose
./MatrixTranspose
Device name AMD A8-9600 RADEON R7, 10 COMPUTE CORES 4C+6G
Makefile:29: recipe for target 'test' failed
make: *** [test] Segmentation fault
Is there a better sample than these? I'm just poking around /opt/rocm and picking samples at random.
Compiling 'square' for debug and running in gdb shows this stack trace:
#0 0x00007ffff6f84ec6 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#1 0x00007ffff6f8510a in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#2 0x00007ffff467a68c in waitComplete ()
at /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.6/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:3721
#3 0x00007ffff467b75d in operator() ()
at /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.6/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:3819
#4 _M_invoke<> ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1530
#5 operator() ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1520
#6 operator() ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:1342
#7 0x00007ffff467b6f2 in _M_invoke ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1856
#8 0x00007ffff467b667 in operator() ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:2267
#9 _M_do_set ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:527
#10 0x00007ffff7416a99 in __pthread_once_slow (once_control=0xc56288,
init_routine=0x7ffff65e9ac0 <__once_proxy>) at pthread_once.c:116
#11 0x00007ffff467bc4b in __gthread_once ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/x86_64-linux-gnu/c++/5.4.0/bits/gthr-default.h:699
#12 call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> , bool > ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/mutex:738
#13 _M_set_result ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:386
#14 _M_complete_async ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:1606
#15 0x00007ffff46685b6 in wait ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:319
#16 wait ()
at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:656
#17 wait ()
at /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.6/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:1148
#18 0x00007ffff4669834 in copy_ext ()
at /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.6/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:3345
#19 0x00007ffff7b53a5b in ihipStream_t::locked_copySync(void, void const, unsigned long, unsigned int, bool) () from /opt/rocm/hip/lib/libhip_hcc.so
#20 0x00007ffff7b7b0bf in hipMemcpy () from /opt/rocm/hip/lib/libhip_hcc.so
#21 0x000000000041e41d in main (argc=1, argv=0x7fffffffe208) at square.hipref.cpp:90
|
Is there an HCC sample within the ROCm source tree or under /opt? I'm looking... |
Just try this. https://github.com/RadeonOpenCompute/hcc/tree/clang_tot_upgrade/benchmarks/benchEmptyKernel
https://github.com/ROCm-Developer-Tools/HCC-Example-Application/tree/master/SyncVsAsyncArrayCopy
For HIP, I would try HIPinfo.
Greg
|
Thanks. I found ./hcc/benchmarks/benchEmptyKernel in the git checkout of ROCm. Copied it, compiled it. Result:
john@dash80:~/foo/benchEmptyKernel$ ./bench
Iterations per test: 10000
could not open code object './nullkernel-gfx803.hsaco'
Hmm, maybe a symlink will fix this?
john@dash80:
Iterations per test: 10000
dispatch_hsa_kernel+withcompletion+activewait : 11.45826
I don't know how to interpret those numbers, but it didn't print an obvious error and the exit code was 0, so I guess that's OK? This ran in a VNC session where vector_copy normally hangs. |
A data point: in InterruptSignal::WaitRelaxed(), increasing the kMaxElapsed timeout from 200 µs to 200 ms allows vector_copy to pass in an ssh terminal or in a VNC session (both contexts where it hung before). Passing cases never reach the blocking hsaKmtWaitOnEvent() call in WaitRelaxed(); they find the value they're waiting for in the polling loop. Failing cases time out while polling, call hsaKmtWaitOnEvent() to block, and that call never returns. I guess there's nothing special about the ssh terminal, except that timings probably shake out a bit differently, exposing a race condition. This probably means the GPU is never hanging, and the race is purely host-side. If I rewrite WaitRelaxed() to always block, this always hangs, at least for me:
|
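The poll-then-block behavior described in the previous comment can be modeled with a toy Python sketch. This is not the ROCm runtime code: `Signal`, `wait_relaxed`, and the `deliver_wakeup` flag are invented for illustration, and the blocking wait is given a finite timeout so the sketch terminates instead of hanging.

```python
import threading
import time


class Signal:
    """Toy stand-in for a completion signal: a value plus a wake-up event."""

    def __init__(self) -> None:
        self.value = 1
        self.wakeup = threading.Event()

    def complete(self, deliver_wakeup: bool = True) -> None:
        self.value = 0
        if deliver_wakeup:  # False models a wake-up that is never delivered
            self.wakeup.set()


def wait_relaxed(signal: Signal, max_spin_s: float, block_timeout_s: float) -> str:
    """Poll for up to max_spin_s seconds, then block on the wake-up event.

    Returns "polled" if the value changed during polling, "woken" if the
    blocking wait was released, or "timed out" if it was not.
    """
    deadline = time.monotonic() + max_spin_s
    while time.monotonic() < deadline:
        if signal.value == 0:
            return "polled"
    # Polling window expired: fall back to the blocking wait.
    if signal.wakeup.wait(block_timeout_s):
        return "woken"
    return "timed out"  # stands in for the hang seen in the thread
```

In this model, a completion whose wake-up is lost is harmless as long as it lands inside the polling window, which would match the observation that a longer kMaxElapsed hides the hang. Once the waiter has fallen through to the blocking wait, only the wake-up can release it.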
I think something goes wrong during interrupt setup on this system. From adding pr_debug statements to the amdkfd module, it appears that kfd_signal_event_interrupt() never gets called in the hanging case. I believe we expect this to be called when the GPU signals the event.
Digging deeper, I'm getting this sequence of kernel messages. This is an edited subset of my dmesg; I removed a bunch of intervening messages that seemed irrelevant:
[ 2.943868] amdgpu 0000:00:01.0: amdgpu: using MSI.
The 'JPC BOZO' messages are ones I added. What appears to be happening is that the first two times the vblank interrupt is enabled, we get a stream of 60 Hz interrupts as expected. The third time the kernel tries to enable vblank, we get a scary "No irq handler for vector" message, and then we never get another interrupt from the gfx core again. (And the gfx core isn't hung; vector_copy still passes if I force it to poll only.)
I tried booting with pci=nomsi,noaer to rule out MSI problems, but that panicked on boot before even mounting the root filesystem. So I haven't ruled out MSI problems yet.
In other news, this system's console behaves oddly. It takes several seconds to ALT-F[n] to another virtual terminal, and it also takes several seconds to wake up from sleep. Once it wakes up or switches terminals, it's responsive like normal. When the console wakes up from these multi-second comas, there are new messages in dmesg about "vblank wait timed out", and new messages on the console about "flip_done timed out". At first I thought that was unrelated to the OpenCL program hangs, but now I'm thinking maybe a single problem with interrupt setup is causing a spectrum of troubles. |
Disabling MSI support in the amdgpu driver (but not for the whole kernel) allows vector_copy to run reliably! I'm finally seeing the kfd_signal_event_interrupt() calls which wake the user process up from a call to hsaKmtWaitOnEvent(). This is how I disabled MSI:
... which is the wrong way to do it, as there's also a kernel module option, if I only knew how to set that... This raises questions: is there a bug in amdgpu interrupts setup on Bristol Ridge, is it a BIOS bug, or something else? I know nothing about interrupts setup, I'm just pattern-matching my way through here so I'll probably stop debugging this now and go ahead with disabling MSI as a workaround. |
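On setting a kernel module option: the usual Linux conventions are an `options <module> <param>=<value>` line in a file under /etc/modprobe.d/, or `<module>.<param>=<value>` on the kernel command line. The Python sketch below builds such a line and reads a parameter's current value back from sysfs. It assumes the amdgpu parameter is named `msi` and is exported under /sys/module; confirm both with `modinfo amdgpu` on the kernel in use. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def modprobe_option_line(module: str, param: str, value: str) -> str:
    """Build a line suitable for a file under /etc/modprobe.d/."""
    return f"options {module} {param}={value}"


def read_module_param(
    module: str, param: str, sysfs_root: Path = Path("/sys/module")
) -> Optional[str]:
    """Return the current value of a loaded module's parameter, or None if
    the module is not loaded or the parameter is not exported in sysfs."""
    path = sysfs_root / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

For example, `modprobe_option_line("amdgpu", "msi", "0")` gives `options amdgpu msi=0`, and after a reboot `read_module_param("amdgpu", "msi")` should show whether the setting took effect. If the module is loaded from the initramfs, the initramfs may need to be regenerated before the option applies.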
Disabling MSI in amdgpu also fixes the trouble with laggy virtual console switching, it's now instantaneous like it should be. |
Do you know if IOMMUv2 is enabled in your SBIOS? |
Yes, IOMMUv2 is enabled in the BIOS and recognized at kernel startup.
Also Xorg started working now that interrupt setup is working better with MSI disabled. Formerly X hung almost immediately on startup, now it's working well under ROCm. |
Interestingly, the MSI mode bit is only used in 3 places in amdgpu. It gates the call to pci_enable_msi() in amdgpu_irq_init(), and the corresponding call to pci_disable_msi() in amdgpu_irq_fini(). Also it shows up in this code in cz_ih_irq_init() [and the corresponding functions for other chips]:
That looks a little suspicious. If we want to set RPTR_REARM when msi=1, do we want to clear it when msi=0? Or should it always be 1? Either way we could remove the conditional. There are no other uses of the msi bit in amdgpu or amdkfd. Whatever goes wrong when MSI is enabled, the problem is likely outside the ROCm drivers. Let's close this; it seems out of scope for ROCm. |
EDIT: To summarize, disabling MSI interrupt mode caused a number of things to stop hanging.
Has anyone seen this? I have vector_copy running OK in a local login console. When I ssh to the same machine, as the same user, running the same binary, it hangs after "Dispatching the kernel succeeded":
john@dash80:~/foo/sample$ ./vector_copy
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is gfx801.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
"Obtaining machine model" succeeded.
"Getting agent profile" succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Finding a fine grained memory region succeeded.
Allocating argument memory for input parameter succeeded.
Allocating argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
^C
... until I CTRL-C it. These are the only differences in environment variables, they look innocuous to me:
$ diff env.broke env.ok
Before I spend time on this, is it a known issue? This should work, right?
FWIW, after the hang, I can still run vector_copy at the local console and it runs fine. I can even run it at the console while another vector_copy instance in the ssh is hanging and it runs fine on the console. So the GPU isn't getting completely hung; it's probably not a CP hang.
Thanks
John