
[Issue]: 100% GPU usage and high power draw when creating multiple HW queues with MES on RDNA3 #153

Googulator opened this issue Nov 10, 2023 · 19 comments

@Googulator

Googulator commented Nov 10, 2023

Followup to ROCm/ROCm#2625 since further debugging revealed it to be an amdgpu/amdkfd driver issue.

When running llama.cpp's server example on ROCm, using an RDNA3 GPU, GPU usage is shown as 100% and a high power consumption is measured at the wall outlet, even with the server at idle.

Investigating further, it seems that the issue is related to HIP stream usage: GPU usage first shoots up to persistent 100% when llama.cpp tries to create its second HIP stream. If I limit llama.cpp to use only a single stream, then GPU load behaves normally until it begins writing into GPU memory using hipMemcpy or hipMemset, at which point it permanently (EDIT: for the lifetime of the process that performed the hipMemcpy) jumps up to 100%, and stays there until llama.cpp is closed.

In minimal testcases, the following scenarios all yielded 100% GPU usage, despite never actually executing any user code on the GPU:

  • Creating a HIP stream while another HIP stream is open. Once triggered, closing the HIP streams doesn't help. (If I open a stream and close it, then open another one, with no overlap in time between the 2 streams, the issue isn't seen.)
  • Writing to GPU memory while a HIP stream is open. Once triggered, neither closing the HIP stream nor deallocating the memory previously written will cause the GPU load to come down, only killing the process helps. (If I close the stream before writing to GPU memory, the issue isn't seen, even if that memory was allocated before or during the stream's lifetime.)
  • Creating a HIP stream after any GPU memory write has taken place, even if the previously written memory is freed before the stream is created. Once triggered, closing the HIP stream doesn't help.
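The second scenario above can be sketched roughly as follows. This is a hypothetical illustration of the reported behavior, not one of the actual testcases (those, e.g. repro1.cpp, are attached to ROCm/ROCm#2625); error checking is omitted for brevity, and it requires a ROCm install plus an affected RDNA3 GPU to observe anything:

```cpp
// Hypothetical sketch of scenario 2: write to GPU memory while a HIP
// stream is open, then watch rocm-smi report 100% GPU usage until exit.
#include <hip/hip_runtime.h>
#include <unistd.h>
#include <cstdio>

int main() {
    hipStream_t stream;
    hipStreamCreate(&stream);       // open a HIP stream

    void *buf = nullptr;
    hipMalloc(&buf, 1 << 20);
    hipMemset(buf, 0, 1 << 20);     // first GPU memory write -> usage jumps to 100%

    hipFree(buf);                   // freeing the memory doesn't bring it back down
    hipStreamDestroy(stream);       // neither does destroying the stream

    printf("idle now; watch rocm-smi for 30 s\n");
    sleep(30);                      // GPU should be idle here, but stays at 100%
    return 0;                       // only process exit restores idle usage
}
```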

The issue is reproducible with the latest code in this repository, using ROCm 5.7.1 on a Radeon RX 7900 XT, and also on a Radeon RX 7800 XT.
Setting the module option "sched_policy=2" seems to be a viable workaround, at the cost of slightly higher power consumption when the GPU is fully idle, and the local ttyX consoles becoming laggy. ("sched_policy=1" didn't help.)
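For reference, a module option like this is usually persisted via a modprobe.d file (sketch only; the file name here is made up, and it takes effect on the next amdgpu module load or reboot):

```conf
# /etc/modprobe.d/amdgpu-workaround.conf  (hypothetical file name)
options amdgpu sched_policy=2
```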

Debugging this further, it seems that excessive power usage starts when the offending operation (memory write or stream creation) creates a new HW queue. On RDNA3, this always uses MES, even when mes=0 is specified in the module parameters.

Within the MES code, mes_v11_0_add_hw_queue then calls mes_v11_0_submit_pkt_and_poll_completion, which calls amdgpu_ring_commit. As soon as amdgpu_ring_commit returns, GPU usage spikes to 100%, and remains there, using about 100W of excess power.

Minimal testcases are available in ROCm/ROCm#2625.

@kentrussell
Contributor

Thanks for your report. We've got someone setting up a system internally to reproduce the issue and try to isolate where this GPU usage bug is coming from.

@Googulator
Author

Are there any updates? Were you successful in reproducing this?

@kentrussell
Contributor

kentrussell commented Nov 21, 2023

I'll try to get an update from the dev who was assigned to repro it.

EDIT: The dev assigned to it was a bit backlogged but he's getting on reproducing it this week. Thanks for your patience!

@kentrussell
Contributor

So our dev couldn't repro it with the following config:
OS: Ubuntu 22.04 6.2.0-37-generic
GPU: Radeon PRO W7900 (this is a Navi31 GPU)
DRIVER: ROCm 5.7.1 6.2.4-1664922.22.04

He saw the 100% spikes, but the usage went down again afterwards. Can you get the VBIOS for your card? rocm-smi --showvbios should be sufficient, or "dmesg | grep ATOM".

@Googulator
Author

The 100% GPU usage does go back down if the test program exits completely - but only then; this is the issue. It can be made more visible by extending the delay in my minimal testcases at the end of execution to e.g. a minute, which will cause 1 minute of inappropriate 100% GPU load.

rocm-smi --showvbios from system 1 (RX 7900 XT):

========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0]          : VBIOS version: 113-D70401-00
====================================================================================
=============================== End of ROCm SMI Log ================================

rocm-smi --showvbios from system 2 (RX 7800 XT):

========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0]          : VBIOS version: 113-APM6767CL-100
====================================================================================
=============================== End of ROCm SMI Log ================================

@lufixSch

lufixSch commented Dec 2, 2023

Any updates on this? I am observing the same issue with llama.cpp on my RX 7900 XTX

I'm adding my rocm-smi --showvbios as well. Maybe it helps.

rocm-smi GPU0 is the 7900 XTX. GPU1 is a 6750 XT (there is no issue with this one)
========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0]          : VBIOS version: 113-3E4710U-O4X
GPU[1]          : VBIOS version: 113-67KA6SHD1-X01
====================================================================================
=============================== End of ROCm SMI Log ================================

@gartnera

gartnera commented Dec 3, 2023

> So our dev couldn't repro it with the following config:
>
> He saw the 100% spikes but the usage went down again afterwards.

So it sounds like you can reproduce it? I can also reproduce this issue with a newer kernel on Arch Linux (Linux 6.6.1-arch1-1) on a 7900 XT (VBIOS version: 113-D70401-00).

Just to be clear, the issue is that calling hipStreamCreate shouldn't max out GPU usage, yet it does. Even after calling hipStreamDestroy, the maxed-out usage continues. The only thing that allows the GPU to return to idle is closing the program.

@gotzl

gotzl commented Dec 5, 2023

I can reproduce this as well, on an ASRock 7900 XTX (reference design).
I used repro1.cpp from the other thread, and I can clearly see in rocm-smi that GPU usage remains at 100% after the program outputs "HIP stream destroyed, waiting 5 more seconds". Only when the program finishes does GPU usage go down again.

rocm-smi --showvbios

======================= ROCm System Management Interface =======================
==================================== VBIOS =====================================
GPU[0]		: VBIOS version: 113-D7020100-102
================================================================================
============================= End of ROCm SMI Log ==============================

@lufixSch

lufixSch commented Dec 6, 2023

I now also tried the minimal testcases, both with both of my GPUs (6750 XT, 7900 XTX) in the system and with only the 7900 XTX.

I was not really able to reproduce the issue with the minimal testcase when both GPUs are in the system. @Googulator I tried HIP_VISIBLE_DEVICES=<gpu id> to select the device; I'm not sure whether that works with the minimal example.

When I removed the 6750 XT, I was able to reproduce the issue with repro1.cpp.

Additionally, I observed that sometimes (I could not reproduce it reliably) the GPU usage stays at 100% even after the process exits. Only after a reboot does the usage go back down.

@kentrussell
Contributor

@Ori-Messinger Any progress on this one? I know you've got a bunch of issues to repro ATM

@tada123

tada123 commented Dec 12, 2023

I also face the same issue on Linux archlinux-pc 6.1.67-1-lts #1 SMP PREEMPT_DYNAMIC Mon, 11 Dec 2023 12:58:39 +0000 x86_64 GNU/Linux with

>>> rocm-smi --showvbios
========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0]          : VBIOS version: 115-C994PI0-102
====================================================================================
=============================== End of ROCm SMI Log ================================

>>> rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 3 2200G with Radeon Vega Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 3 2200G with Radeon Vega Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3500                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            4                                  
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    16305612(0xf8cdcc) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16305612(0xf8cdcc) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16305612(0xf8cdcc) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 560 Series           
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26607(0x67ef)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1176                               
  BDFID:                   256                                
  Internal Node ID:        1                                  
  Compute Unit:            14                                 
  SIMDs per CU:            4                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 730                                
  SDMA engine uCode::      58                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             


The problem disappears when using the older linux-lts kernel.
Also, the diff between rocminfo output on the two kernels looks like this:

11c11
< DMAbuf Support:          YES
---
> DMAbuf Support:          NO
49c49
<       Size:                    16305612(0xf8cdcc) KB              
---
>       Size:                    16306580(0xf8d194) KB              
56c56
<       Size:                    16305612(0xf8cdcc) KB              
---
>       Size:                    16306580(0xf8d194) KB              
63c63
<       Size:                    16305612(0xf8cdcc) KB              
---
>       Size:                    16306580(0xf8d194) KB

@morphles

llama.cpp sitting idle at ~140W on a 7900 XTX ... this is unacceptable.
As a side note, power limiting overall is way too convoluted (I couldn't even set it up) compared to NVIDIA, where you just run nvidia-smi -pl <watts>.

@65a

65a commented Dec 31, 2023

This also affects W7900 cards on ROCm 6.0, kernel 6.6.8. A solution is desirable to avoid burning electricity and creating heat for no reason.

99W draw with nothing in VRAM and inference not running. radeontop reports Graphics Pipe, Clip Rectangle and Shader Clock at 100%.

EDIT: Filed https://gitlab.freedesktop.org/drm/amd/-/issues/3080 to ensure this is also on the radar there.

@65a

65a commented Jan 1, 2024

I've found that at least for llama.cpp, setting GPU_MAX_HW_QUEUES=1 in the environment works around this issue with no clear performance impact, but substantial power/thermal budget improvement. I still think this seems like a major issue that should be fixed without obscure environment variables.
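For anyone else landing here, the workaround is just a per-process environment variable; the commented-out server invocation below is illustrative (binary name and flags depend on your llama.cpp build):

```shell
# Cap HIP to a single hardware queue for this process tree only.
export GPU_MAX_HW_QUEUES=1
echo "$GPU_MAX_HW_QUEUES"            # prints 1
# ./server -m ./models/model.gguf    # hypothetical llama.cpp server invocation
```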

@Googulator
Author

Is this commit intended to be a fix for this issue?

https://gitlab.freedesktop.org/agd5f/linux/-/commit/7e505b272c7adb68c5353944eda4befb95e83935

I haven't been able to find it in this repository, only on the Freedesktop one.

@kentrussell
Contributor

We're not sure that the patch will fix it, but it missed the ROCm 6.0 cutoff, which is why it isn't here yet.
It'll be in ROCm 6.1 (unless they pick a really weird branching point). It can also be applied manually by editing the file in /usr/src/ and then rebuilding via dkms, if you want to give it a shot.

@Googulator
Author

What is the actual development repository for the kernel driver then, if not this one? The freedesktop one appears to be a staging repository for patches ready for upstreaming, not actual development.

@kentrussell
Contributor

We're working on that currently.

Right now, the upstream repo (maintained by Alex Deucher) is for the upstream kernel, which is where most of our patches come from. The DKMS code (which is exposed here) is not upstreamable, so we've got KCL (Kernel Compatibility Layer), IPC and RDMA here but not upstream. What happens is that the patches going into amd-staging-drm-next are picked over to this DKMS-supported branch by the KCL team. They adapt all patches to work on the various kernels that we support. Then that internal branch is what's used for ROCm releases, so we can support more OSes than just Ubuntu (or whatever distro supports the latest upstream kernel)

Currently we (more accurately, I) just update the master branch to reflect the latest ROCm release branch at release time, with no real develop branch to speak of. This is primarily because development on the DKMS branch is still done internally. I'm working on seeing if we can at least mirror the internal mainline branch here on some sort of weekly cadence, instead of only pushing updates at release time, but the process is taking a while. Lots of hoops to jump through.

@Wintoplay

@65a Thanks, it works.
