[Issue]: 100% GPU usage and high power draw when creating multiple HW queues with MES on RDNA3 #153
Comments
Thanks for your report. We've got someone setting up a system internally to reproduce the issue and try to isolate where this GPU usage bug is coming from.
Are there any updates? Were you successful in reproducing this?
I'll try to get an update from the dev who was assigned to repro it. EDIT: The dev assigned to it was a bit backlogged, but he's getting on reproducing it this week. Thanks for your patience!
So our dev couldn't repro it with the following config: He saw the 100% spikes, but the usage went down again afterwards. Can you get the VBIOS for your card? rocm-smi --showvbios should be sufficient, or dmesg | grep ATOM.
The 100% GPU usage does go back down if the test program exits completely - but only then; this is the issue. It can be made more visible by extending the delay at the end of execution in my minimal testcases to e.g. a minute, which causes one minute of inappropriate 100% GPU load. rocm-smi --showvbios from system 1 (RX 7900 XT):
rocm-smi --showvbios from system 2 (RX 7800 XT):
Any updates on this? I am observing the same issue with llama.cpp on my RX 7900 XTX. I'm adding my rocm-smi output: GPU0 is the 7900 XTX, GPU1 is a 6750 XT (there is no issue with that one).
So it sounds like you can reproduce it? I can also reproduce this issue with a newer kernel on Arch Linux. Just to be clear, the issue is that calling …
I can reproduce this as well, ASRock 7900 XTX (ref design). rocm-smi --showvbios
I now also tried the minimal testcases with both of my GPUs (6750 XT, 7900 XTX) in the system, and with only the 7900 XTX. I was not really able to reproduce the issue with the minimal testcase when both GPUs are in the system. @Googulator I tried … When I removed the 6750 XT, I was able to reproduce the issue with … Additionally, I observed that sometimes (I could not reproduce it reliably) the GPU usage stays at 100% even after the process exits. Only after a reboot does the usage go back down.
@Ori-Messinger Any progress on this one? I know you've got a bunch of issues to repro ATM.
I also face the same issue on
The problem disappears when using
llama.cpp sitting idle at ~140W on a 7900 XTX... this is unacceptable.
This also affects W7900 cards on ROCm 6.0, kernel 6.6.8. A solution is desirable to avoid burning electricity and creating heat for no reason: 99W draw with nothing in VRAM and inference not running. EDIT: Filed https://gitlab.freedesktop.org/drm/amd/-/issues/3080 to ensure this is also on the radar there.
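For scale, a quick back-of-envelope on what roughly 100 W of unnecessary idle draw adds up to (illustrative arithmetic only, not measured data):

```python
# Rough cost of ~100 W of excess idle draw, as reported in this thread.
excess_w = 100                          # watts of unnecessary draw
kwh_per_day = excess_w * 24 / 1000      # watt-hours -> kilowatt-hours
kwh_per_month = excess_w * 24 * 30 / 1000
print(kwh_per_day)    # 2.4 kWh wasted per day
print(kwh_per_month)  # 72.0 kWh wasted per 30-day month
```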
I've found that at least for llama.cpp, setting …
Is this commit intended to be a fix for this issue? https://gitlab.freedesktop.org/agd5f/linux/-/commit/7e505b272c7adb68c5353944eda4befb95e83935 I haven't been able to find it in this repository, only on the Freedesktop one.
We're not sure that the patch will fix it, but it missed the ROCm 6.0 cutoff, hence it's not here yet.
What is the actual development repository for the kernel driver then, if not this one? The Freedesktop one appears to be a staging repository for patches ready for upstreaming, not actual development.
We're working on that currently. Right now, the upstream repo (maintained by Alex Deucher) is for the upstream kernel, which is where most of our patches come from. The DKMS code (which is exposed here) is not upstreamable, so we've got KCL (Kernel Compatibility Layer), IPC and RDMA here but not upstream.

What happens is that the patches going into amd-staging-drm-next are picked over to this DKMS-supported branch by the KCL team. They adapt all patches to work on the various kernels that we support. That internal branch is then what's used for ROCm releases, so we can support more OSes than just Ubuntu (or whatever distro supports the latest upstream kernel).

Currently we (more accurately, I) just update the master branch to reflect the latest ROCm release branch at release time, with no real develop branch to speak of. This is primarily because development on the DKMS branch is still done internally. I'm working on seeing if we can at least try to mirror the internal mainline branch here as well on some sort of weekly cadence, instead of just dealing with updates at release time, but the process is taking a while. Lots of hoops to jump through.
@65a Thanks, it works.
Followup to ROCm/ROCm#2625 since further debugging revealed it to be an amdgpu/amdkfd driver issue.
When running llama.cpp's server example on ROCm with an RDNA3 GPU, GPU usage is shown as 100% and high power consumption is measured at the wall outlet, even with the server idle.
Investigating further, it seems that the issue is related to HIP stream usage: GPU usage first shoots up to persistent 100% when llama.cpp tries to create its second HIP stream. If I limit llama.cpp to use only a single stream, then GPU load behaves normally until it begins writing into GPU memory using hipMemcpy or hipMemset, at which point it permanently (EDIT: for the lifetime of the process that performed the hipMemcpy) jumps up to 100%, and stays there until llama.cpp is closed.
In minimal testcases, the following scenarios all yielded 100% GPU usage, despite never actually executing any user code on the GPU:
The issue is reproducible with the latest code in this repository, using ROCm 5.7.1 on a Radeon RX 7900 XT, and also on a Radeon RX 7800 XT.
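As a rough illustration of the kind of reproduction described above (this is a hedged sketch, not the original testcase from ROCm/ROCm#2625; it assumes a working hipcc toolchain):

```cpp
// Sketch: create a second HIP stream and touch GPU memory, then idle.
// On affected RDNA3 systems the report says GPU usage pins at 100% during
// the sleep, even though no user code is running on the GPU.
#include <hip/hip_runtime.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

#define HIP_CHECK(expr)                                              \
    do {                                                             \
        hipError_t err_ = (expr);                                    \
        if (err_ != hipSuccess) {                                    \
            fprintf(stderr, "HIP error %d at line %d\n", err_, __LINE__); \
            exit(1);                                                 \
        }                                                            \
    } while (0)

int main() {
    hipStream_t s1, s2;
    HIP_CHECK(hipStreamCreate(&s1));
    HIP_CHECK(hipStreamCreate(&s2));   // second stream: reported trigger

    void *buf = nullptr;
    HIP_CHECK(hipMalloc(&buf, 1 << 20));
    HIP_CHECK(hipMemset(buf, 0, 1 << 20));  // memset alone also reported to trigger it
    HIP_CHECK(hipDeviceSynchronize());

    // Process is now completely idle; watch rocm-smi and wall power here.
    sleep(60);

    HIP_CHECK(hipFree(buf));
    HIP_CHECK(hipStreamDestroy(s2));
    HIP_CHECK(hipStreamDestroy(s1));
    return 0;   // per the report, usage drops only once the process exits
}
```

Per the report, lengthening the sleep lengthens the window of inappropriate 100% load by the same amount.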
Setting the module option "sched_policy=2" seems to be a viable workaround, at the cost of slightly higher power consumption when the GPU is fully idle, and the local ttyX consoles becoming laggy. ("sched_policy=1" didn't help.)
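For reference, a persistent way to apply that workaround is a modprobe.d fragment (the filename below is illustrative; rebuilding the initramfs may be needed for it to take effect at boot):

```
# /etc/modprobe.d/amdgpu-sched-policy.conf (illustrative filename)
# sched_policy=2 selects the non-hardware-scheduling path; per this report
# it avoids the 100% idle usage at the cost of slightly higher idle power
# and laggy local ttyX consoles.
options amdgpu sched_policy=2
```

The equivalent kernel command-line form is amdgpu.sched_policy=2.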
Debugging this further, it seems that excessive power usage starts when the offending operation (memory write or stream creation) creates a new HW queue. On RDNA3, this always uses MES, even when mes=0 is specified in the module parameters.
Within the MES code, mes_v11_0_add_hw_queue then calls mes_v11_0_submit_pkt_and_poll_completion, which calls amdgpu_ring_commit. As soon as amdgpu_ring_commit returns, GPU usage spikes to 100%, and remains there, using about 100W of excess power.
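One way to confirm this call path on a live system is the kernel function tracer (a sketch assuming root and CONFIG_FUNCTION_TRACER; the functions may not appear if the compiler inlined them):

```shell
# Trace the MES queue-creation path named above while running the testcase.
cd /sys/kernel/tracing
echo mes_v11_0_add_hw_queue > set_ftrace_filter
echo mes_v11_0_submit_pkt_and_poll_completion >> set_ftrace_filter
echo amdgpu_ring_commit >> set_ftrace_filter
echo function > current_tracer
echo 1 > tracing_on
# ...run the testcase, then inspect:
cat trace
echo 0 > tracing_on
```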
Minimal testcases are available in ROCm/ROCm#2625.