[Issue]: Power info and energy counter APIs always return the same info #22
Comments
@marifamd @dmitrii-galantsev
@jaywonchung Sorry for the late response. I believe this was a bug in how we read our gpu_metrics struct: we did not force gpu_metrics to update unconditionally. This should be resolved in ROCm 6.1+ branches. This is the commit that resolved your issue: 78074d7
Sorry, hit the wrong comment button :). If you update to the 6.1 release you should see this resolved.
Hi @marifamd, thanks a lot for your response! Would that require a whole ROCm update to 6.1, or is there some possibility that it will still work if I just update the AMDSMI library to the version embedded inside the ROCm 6.1 branch? It's just that it was slightly confusing which versioning to follow, since according to #21 ROCm, the AMDSMI library, and AMDSMI tools are versioned separately, but GitHub releases/tags seem to only exist for new ROCm versions. Thanks.
Since we are packaged and aligned with ROCm releases, I would recommend updating to 6.1. This ensures you are on a stable build that is synced with the amdgpu driver. The alternative would be to build AMD-SMI yourself and install it over the version shipped with your current ROCm installation, which is not worth the effort compared with using the package-managed ROCm release. To clarify the versioning: the AMD-SMI Library and AMD-SMI CLI are versioned separately from ROCm as good practice, because they are separate projects that we track internally. For the end user, we align to the ROCm release schedule.
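One way to check whether a given installation still shows the behaviour discussed in this thread is to collect several readings while the GPU is under load and test whether they are all identical. The following is only a minimal sketch: `looks_stale` and `min_samples` are hypothetical names, and the readings are assumed to be whatever the power or energy call returns (for example, the dicts returned by `amdsmi_get_power_info`).

```python
from typing import Any, Sequence


def looks_stale(readings: Sequence[Any], min_samples: int = 5) -> bool:
    """Return True if there are at least `min_samples` readings and all are equal.

    Identical readings are only a hint, not proof: an idle GPU may
    legitimately report the same power value several times in a row.
    """
    if len(readings) < min_samples:
        # Too few samples to say anything useful.
        return False
    first = readings[0]
    return all(r == first for r in readings[1:])
```

For example, the fluctuating sequence 34, 33, 35, 34, 34 would not be flagged, while ten identical readings taken during a heavy workload would be.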
Problem Description
What I'm doing
I'm trying to measure the power and energy consumption of four MI100 accelerators on the node. It has multiple versions of ROCm installed (e.g., `/opt/rocm-5.7.1`, `/opt/rocm-6.0.2`), and I'm running the Python bindings of amdsmi like this:
To utilize the GPU, I wrote a simple script that does matrix multiplication repeatedly using PyTorch `2.2.2+rocm5.7`. It runs for about 15 seconds on one GPU (and much longer if I disable GPUs).
To look at power and energy, I'm running the following in the Python interpreter:
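A minimal sketch of a polling loop of this kind (not the original snippet). It assumes the amdsmi Python bindings provide `amdsmi_init`, `amdsmi_get_processor_handles`, `amdsmi_get_power_info`, and `amdsmi_get_energy_count`; the helper names `poll` and `make_amdsmi_reader` are hypothetical.

```python
import time
from typing import Callable, Dict, List


def poll(read: Callable[[], Dict], samples: int, interval_s: float = 0.1) -> List[Dict]:
    """Call `read` `samples` times, sleeping `interval_s` between calls."""
    out: List[Dict] = []
    for i in range(samples):
        out.append(read())
        if i + 1 < samples:
            time.sleep(interval_s)
    return out


def make_amdsmi_reader(device_index: int = 0) -> Callable[[], Dict]:
    """Build a reader for one GPU (assumes the amdsmi package is importable)."""
    import amdsmi  # imported lazily so `poll` can be used without ROCm installed

    amdsmi.amdsmi_init()
    handle = amdsmi.amdsmi_get_processor_handles()[device_index]

    def read() -> Dict:
        # Both calls are assumed to return plain dicts.
        return {
            "power": amdsmi.amdsmi_get_power_info(handle),
            "energy": amdsmi.amdsmi_get_energy_count(handle),
        }

    return read
```

On a machine with ROCm, `poll(make_amdsmi_reader(0), samples=50)` would collect 50 readings from the first GPU. Keeping the reader separate from the loop makes it possible to exercise the loop with a stub function on a machine without a GPU.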
At the same time, I have `watch` up on another tmux pane, which is spawning `amd-smi` continuously. Notice that it's using amdsmi distributed as part of ROCm 6.0.2:
Problems
ROCm 6.0.2
Power API
Through the Python interpreter, the return value of the power API never changes:
It's the exact same return value regardless of how many times I call it (in the `while` loop).
However, when I look at the `watch -n0.1 amd-smi metric -pE` monitor at the same time, I can see that the power (`CURRENT_POWER`) it reports fluctuates up and down slightly (34 W, 33 W, 35 W, 34 W, ...).
It feels as though the return value of `amdsmi_get_power_info` is cached somewhere inside the process and never gets updated, whereas `watch` spawns a new `amd-smi` process every time, loading in updated values.
Energy API
Through the Python interpreter, the return value of the energy API never changes:
It's the exact same return value regardless of how many times I call it (in the `while` loop).
This time, even the numbers shown in `watch -n0.1 amd-smi metric -pE` never change!
Side note: I suppose users should multiply `power` and `counter_resolution` to get micro-Joules. I think this API is quite unintuitive.
ROCm 5.7.1
Power API
There is no problem! Running the exact same code, I can see the power draw of one of the four GPUs maxing out:
Energy API
Same problem as ROCm 6.0.2 (return value never changing).
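The conversion from the side note above (multiply `power` by `counter_resolution` to get micro-Joules) can be sketched as follows. The field names `power` and `counter_resolution` come from this issue; the `timestamp` field and its unit (nanoseconds) are assumptions, and the helper names are hypothetical.

```python
from typing import Mapping


def energy_microjoules(sample: Mapping[str, float]) -> float:
    """Convert one energy-counter sample to micro-Joules."""
    return sample["power"] * sample["counter_resolution"]


def average_power_watts(first: Mapping[str, float], second: Mapping[str, float]) -> float:
    """Average power between two samples, from the energy and time deltas."""
    delta_uj = energy_microjoules(second) - energy_microjoules(first)
    delta_ns = second["timestamp"] - first["timestamp"]  # assumed to be in ns
    if delta_ns <= 0:
        # Identical samples (the symptom in this issue) end up here.
        raise ValueError("timestamps did not advance between samples")
    # 1 uJ / 1 ns = 1e-6 J / 1e-9 s = 1e3 W
    return delta_uj / delta_ns * 1e3
```

With a counter that never changes, two consecutive samples are identical, so `average_power_watts` raises instead of silently reporting 0 W.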
Operating System
Rocky Linux 9.1 (Blue Onyx)
CPU
AMD EPYC 7V13 64-Core Processor
GPU
AMD Instinct MI100
ROCm Version
ROCm 6.0.2, ROCm 5.7.1
ROCm Component
amdsmi
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
No response