
[Issue]: 100% GPU usage and high power draw when creating multiple HW queues with MES on RDNA3 #153

Googulator opened this issue Nov 10, 2023 · 19 comments

@Googulator

Googulator commented Nov 10, 2023

Followup to ROCm/ROCm#2625 since further debugging revealed it to be an amdgpu/amdkfd driver issue.

When running llama.cpp's server example on ROCm, using an RDNA3 GPU, GPU usage is shown as 100% and a high power consumption is measured at the wall outlet, even with the server at idle.

Investigating further, it seems that the issue is related to HIP stream usage: GPU usage first shoots up to persistent 100% when llama.cpp tries to create its second HIP stream. If I limit llama.cpp to use only a single stream, then GPU load behaves normally until it begins writing into GPU memory using hipMemcpy or hipMemset, at which point it permanently (EDIT: for the lifetime of the process that performed the hipMemcpy) jumps up to 100%, and stays there until llama.cpp is closed.

In minimal testcases, the following scenarios all yielded 100% GPU usage, despite never actually executing any user code on the GPU:

  • Creating a HIP stream while another HIP stream is open. Once triggered, closing the HIP streams doesn't help. (If I open a stream and close it, then open another one, with no overlap in time between the 2 streams, the issue isn't seen.)
  • Writing to GPU memory while a HIP stream is open. Once triggered, neither closing the HIP stream nor deallocating the memory previously written will cause the GPU load to come down, only killing the process helps. (If I close the stream before writing to GPU memory, the issue isn't seen, even if that memory was allocated before or during the stream's lifetime.)
  • Creating a HIP stream after any GPU memory write has taken place, even if the previously written memory is freed before the stream is created. Once triggered, closing the HIP stream doesn't help.
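The second scenario above can be sketched roughly as follows. This is a hypothetical illustration of the reported behavior, not one of the actual testcases (those, e.g. repro1.cpp, are attached to ROCm/ROCm#2625); error checking is omitted for brevity, and it requires a ROCm install plus an affected RDNA3 GPU to observe anything:

```cpp
// Hypothetical sketch of scenario 2: write to GPU memory while a HIP
// stream is open, then watch rocm-smi report 100% GPU usage until exit.
#include <hip/hip_runtime.h>
#include <unistd.h>
#include <cstdio>

int main() {
    hipStream_t stream;
    hipStreamCreate(&stream);       // open a HIP stream

    void *buf = nullptr;
    hipMalloc(&buf, 1 << 20);
    hipMemset(buf, 0, 1 << 20);     // first GPU memory write -> usage jumps to 100%

    hipFree(buf);                   // freeing the memory doesn't bring it back down
    hipStreamDestroy(stream);       // neither does destroying the stream

    printf("idle now; watch rocm-smi for 30 s\n");
    sleep(30);                      // GPU should be idle here, but stays at 100%
    return 0;                       // only process exit restores idle usage
}
```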

The issue is reproducible with the latest code in this repository, using ROCm 5.7.1 on a Radeon RX 7900 XT, and also on a Radeon RX 7800 XT.
Setting the module option "sched_policy=2" seems to be a viable workaround, at the cost of slightly higher power consumption when the GPU is fully idle, and the local ttyX consoles becoming laggy. ("sched_policy=1" didn't help.)
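For reference, a module option like this is usually persisted via a modprobe.d file (sketch only; the file name here is made up, and it takes effect on the next amdgpu module load or reboot):

```conf
# /etc/modprobe.d/amdgpu-workaround.conf  (hypothetical file name)
options amdgpu sched_policy=2
```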

Debugging this further, it seems that excessive power usage starts when the offending operation (memory write or stream creation) creates a new HW queue. On RDNA3, this always uses MES, even when mes=0 is specified in the module parameters.

Within the MES code, mes_v11_0_add_hw_queue then calls mes_v11_0_submit_pkt_and_poll_completion, which calls amdgpu_ring_commit. As soon as amdgpu_ring_commit returns, GPU usage spikes to 100%, and remains there, using about 100W of excess power.

Minimal testcases are available in ROCm/ROCm#2625.

@kentrussell
Contributor

Thanks for your report. We've got someone setting up a system internally to reproduce the issue and try to isolate where this GPU usage bug is coming from.

@Googulator
Author

Are there any updates? Were you successful in reproducing this?

@kentrussell
Contributor

kentrussell commented Nov 21, 2023

I'll try to get an update from the dev who was assigned to repro it.

EDIT: The dev assigned to it was a bit backlogged but he's getting on reproducing it this week. Thanks for your patience!

@kentrussell
Contributor

So our dev couldn't repro it with the following config:
OS: Ubuntu 22.04 6.2.0-37-generic
GPU: Radeon PRO W7900 (this is a Navi31 GPU)
DRIVER: ROCm 5.7.1 6.2.4-1664922.22.04

He saw the 100% spikes, but the usage went down again afterwards. Can you get the VBIOS for your card? rocm-smi --showvbios should be sufficient, or "dmesg | grep ATOM".

@Googulator
Author

The 100% GPU usage does go back down if the test program exits completely - but only then; this is the issue. It can be made more visible by extending the delay in my minimal testcases at the end of execution to e.g. a minute, which will cause 1 minute of inappropriate 100% GPU load.

rocm-smi --showvbios from system 1 (RX 7900 XT):

========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0]          : VBIOS version: 113-D70401-00
====================================================================================
=============================== End of ROCm SMI Log ================================

rocm-smi --showvbios from system 2 (RX 7800 XT):

========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0]          : VBIOS version: 113-APM6767CL-100
====================================================================================
=============================== End of ROCm SMI Log ================================

@lufixSch

lufixSch commented Dec 2, 2023

Any updates on this? I am observing the same issue with llama.cpp on my RX 7900 XTX

I'm adding my rocm-smi --showvbios as well. Maybe it helps.

rocm-smi GPU0 is the 7900 XTX. GPU1 is a 6750 XT (there is no issue with this one)
========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0]          : VBIOS version: 113-3E4710U-O4X
GPU[1]          : VBIOS version: 113-67KA6SHD1-X01
====================================================================================
=============================== End of ROCm SMI Log ================================

@gartnera

gartnera commented Dec 3, 2023

> So our dev couldn't repro it with the following config:
>
> He saw the 100% spikes but the usage went down again afterwards.

So it sounds like you can reproduce it? I can also reproduce this issue with a newer kernel on Arch Linux (Linux 6.6.1-arch1-1) on a 7900 XT (VBIOS version: 113-D70401-00).

Just to be clear, the issue is that calling hipStreamCreate shouldn't max out GPU usage, yet it does. Even after calling hipStreamDestroy, the maxed-out usage continues. The only thing that allows the GPU to return to idle is closing the program.

@gotzl

gotzl commented Dec 5, 2023

I can reproduce this as well, on an ASRock 7900 XTX (reference design).
I used repro1.cpp from the other thread, and I can clearly see in rocm-smi that GPU usage remains at 100% after the program outputs "HIP stream destroyed, waiting 5 more seconds". Only when the program finishes does GPU usage go down again.

rocm-smi --showvbios

======================= ROCm System Management Interface =======================
==================================== VBIOS =====================================
GPU[0]		: VBIOS version: 113-D7020100-102
================================================================================
============================= End of ROCm SMI Log ==============================

@lufixSch

lufixSch commented Dec 6, 2023

I now also tried the minimal testcases, both with both of my GPUs (6750 XT, 7900 XTX) in the system and with only the 7900 XTX.

I was not really able to reproduce the issue with the minimal testcase when both GPUs are in the system. @Googulator I tried HIP_VISIBLE_DEVICES=<gpu id> to select the device; I'm not sure whether that works with the minimal example.

When I removed the 6750 XT, I was able to reproduce the issue with repro1.cpp.

Additionally, I observed that sometimes (I could not reproduce it reliably) the GPU usage stays at 100% even after the process exits. Only after a reboot does the usage go back down.

@kentrussell
Contributor

@Ori-Messinger Any progress on this one? I know you've got a bunch of issues to repro ATM

@tada123

tada123 commented Dec 12, 2023

I also face the same issue on Linux archlinux-pc 6.1.67-1-lts #1 SMP PREEMPT_DYNAMIC Mon, 11 Dec 2023 12:58:39 +0000 x86_64 GNU/Linux with

>>> rocm-smi --showvbios
========================= ROCm System Management Interface =========================
====================================== VBIOS =======================================
GPU[0]          : VBIOS version: 115-C994PI0-102
====================================================================================
=============================== End of ROCm SMI Log ================================

>>> rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 3 2200G with Radeon Vega Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 3 2200G with Radeon Vega Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3500                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            4                                  
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    16305612(0xf8cdcc) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16305612(0xf8cdcc) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16305612(0xf8cdcc) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 560 Series           
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26607(0x67ef)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1176                               
  BDFID:                   256                                
  Internal Node ID:        1                                  
  Compute Unit:            14                                 
  SIMDs per CU:            4                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 730                                
  SDMA engine uCode::      58                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             


The problem disappears when using the older linux-lts kernel.
Also, the diff between rocminfo output on the two kernels looks like this:

11c11
< DMAbuf Support:          YES
---
> DMAbuf Support:          NO
49c49
<       Size:                    16305612(0xf8cdcc) KB              
---
>       Size:                    16306580(0xf8d194) KB              
56c56
<       Size:                    16305612(0xf8cdcc) KB              
---
>       Size:                    16306580(0xf8d194) KB              
63c63
<       Size:                    16305612(0xf8cdcc) KB              
---
>       Size:                    16306580(0xf8d194) KB

@morphles

llama.cpp sitting idle at ~140W on a 7900 XTX ... this is unacceptable.
As a side note, power limiting overall is way too convoluted (I couldn't even set it up) compared to NVIDIA, where you just run nvidia-smi -pl <watts>.

@65a

65a commented Dec 31, 2023

This also affects W7900 cards on ROCm 6.0, kernel 6.6.8. A solution is desirable to avoid burning electricity and creating heat for no reason.

99W draw with nothing in VRAM and inference not running. radeontop reports Graphics Pipe, Clip Rectangle and Shader Clock at 100%.

EDIT: Filed https://gitlab.freedesktop.org/drm/amd/-/issues/3080 to ensure this is also on the radar there.

@65a

65a commented Jan 1, 2024

I've found that at least for llama.cpp, setting GPU_MAX_HW_QUEUES=1 in the environment works around this issue with no clear performance impact, but substantial power/thermal budget improvement. I still think this seems like a major issue that should be fixed without obscure environment variables.
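For anyone else landing here, the workaround is just a per-process environment variable; the commented-out server invocation below is illustrative (binary name and flags depend on your llama.cpp build):

```shell
# Cap HIP to a single hardware queue for this process tree only.
export GPU_MAX_HW_QUEUES=1
echo "$GPU_MAX_HW_QUEUES"            # prints 1
# ./server -m ./models/model.gguf    # hypothetical llama.cpp server invocation
```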

@Googulator
Author

Is this commit intended to be a fix for this issue?

https://gitlab.freedesktop.org/agd5f/linux/-/commit/7e505b272c7adb68c5353944eda4befb95e83935

I haven't been able to find it in this repository, only on the Freedesktop one.

@kentrussell
Contributor

We're not sure that the patch will fix it, but it missed the ROCm 6.0 cutoff, which is why it isn't here yet.
It'll be in ROCm 6.1 (unless they pick a really weird branching point). It can also be applied manually by editing the file in /usr/src/ and then rebuilding via dkms, if you want to give it a shot.

@Googulator
Author

What is the actual development repository for the kernel driver then, if not this one? The freedesktop one appears to be a staging repository for patches ready for upstreaming, not actual development.

@kentrussell
Contributor

We're working on that currently.

Right now, the upstream repo (maintained by Alex Deucher) is for the upstream kernel, which is where most of our patches come from. The DKMS code (which is exposed here) is not upstreamable, so we've got KCL (Kernel Compatibility Layer), IPC and RDMA here but not upstream. What happens is that the patches going into amd-staging-drm-next are picked over to this DKMS-supported branch by the KCL team. They adapt all patches to work on the various kernels that we support. Then that internal branch is what's used for ROCm releases, so we can support more OSes than just Ubuntu (or whatever distro supports the latest upstream kernel)

Currently we (more accurately, I) just update the master branch to reflect the latest ROCm release branch at release time, with no real develop branch to speak of. This is primarily because development on the DKMS branch is still done internally. I'm working on seeing if we can at least mirror the internal mainline branch here on some sort of weekly cadence, instead of only pushing updates at release time, but the process is taking a while. Lots of hoops to jump through.

@Wintoplay

@65a Thanks, it works.
