Newer MI200+ GPUs report GPU-domain-level (GPU socket-level) power data as opposed to individual devices #553

Closed
tpatki opened this issue Jun 15, 2024 · 0 comments · Fixed by #554


tpatki commented Jun 15, 2024

As an example from our MI200 node, here's what rocm-smi (v6.0.2 onwards) reports. Note that GPU devices 1, 3, 5, and 7 do not report power values and show N/A instead. This is by design.

In Variorum, we leverage the newer rsmi_dev_power_get() API as that allows for data collection on both older and newer GPUs.

This API, however, fails on device IDs 1, 3, 5, and 7 with an RSMI_STATUS_NOT_SUPPORTED error, so Variorum prints failure messages even though no real failure has occurred. It would have been helpful if the RSMI API had returned 0.0 on the GPU devices that do not report power; a tool that loops through all GPU devices to gather power or other telemetry could then report 0.0 when the power data is unavailable instead of failing or exiting.

To get around this problem, we will now specifically check for RSMI_STATUS_NOT_SUPPORTED to bypass the Variorum error message and allow the loop that walks through all GPU devices to continue.
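A minimal sketch of that check, not the actual Variorum patch. So that it compiles without ROCm installed, it uses hypothetical stand-ins: `mock_status_t`, `mock_dev_power_get()`, and `collect_power()` are invented for illustration and are not RSMI or Variorum names. The mock assumes power is returned in microwatts and copies the values from the node shown in this issue. The -1.0 sentinel for devices without a reading is also an arbitrary choice of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the RSMI status codes, so this sketch builds
 * without ROCm. Real code would include the ROCm SMI header instead. */
typedef enum {
    MOCK_STATUS_SUCCESS = 0,
    MOCK_STATUS_NOT_SUPPORTED,
    MOCK_STATUS_ERROR
} mock_status_t;

#define NUM_DEVICES 8

/* Hypothetical mock of a per-device power query. It mirrors the node in this
 * issue: even device IDs report power (assumed to be in microwatts), odd
 * device IDs return "not supported". */
static mock_status_t mock_dev_power_get(uint32_t dev, uint64_t *power_uw)
{
    static const uint64_t sample_uw[NUM_DEVICES] = {
        87000000, 0, 85000000, 0, 94000000, 0, 86000000, 0
    };
    if (dev >= NUM_DEVICES) {
        return MOCK_STATUS_ERROR;
    }
    if (dev % 2 == 1) {
        return MOCK_STATUS_NOT_SUPPORTED;
    }
    *power_uw = sample_uw[dev];
    return MOCK_STATUS_SUCCESS;
}

/* Walk all devices. A "not supported" status is skipped quietly; any other
 * failure is reported on stderr. In both cases the loop continues.
 * watts[i] is set to -1.0 for devices without a reading.
 * Returns the number of devices that reported power. */
int collect_power(double *watts, size_t n)
{
    int reported = 0;
    assert(watts != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t uw = 0;
        mock_status_t st = mock_dev_power_get((uint32_t)i, &uw);
        watts[i] = -1.0;
        if (st == MOCK_STATUS_NOT_SUPPORTED) {
            continue; /* expected on this hardware, not a failure */
        }
        if (st != MOCK_STATUS_SUCCESS) {
            fprintf(stderr, "power query failed on device %zu (status %d)\n",
                    i, (int)st);
            continue;
        }
        watts[i] = (double)uw / 1e6; /* microwatts to watts */
        reported++;
    }
    return reported;
}
```

Calling `collect_power()` with an array of `NUM_DEVICES` doubles fills in the readings for devices 0, 2, 4, and 6 and leaves the sentinel on the others, without printing anything for the unsupported devices.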

$ ./rocm-smi


====================================== ROCm System Management Interface ======================================
================================================ Concise Info ================================================
Device  [Model : Revision]    Temp    Power  Partitions      SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
        Name (20 chars)       (Edge)  (Avg)  (Mem, Compute)                                                   
==============================================================================================================
0       [0x0b0c : 0x00]       37.0°C  87.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
        AMD INSTINCT MI200 (                                                                                  
1       [0x0b0c : 0x00]       39.0°C  N/A    N/A, N/A        800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
        AMD INSTINCT MI200 (                                                                                  
2       [0x0b0c : 0x00]       39.0°C  85.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
        AMD INSTINCT MI200 (                                                                                  
3       [0x0b0c : 0x00]       39.0°C  N/A    N/A, N/A        800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
        AMD INSTINCT MI200 (                                                                                  
4       [0x0b0c : 0x00]       38.0°C  94.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
        AMD INSTINCT MI200 (                                                                                  
5       [0x0b0c : 0x00]       40.0°C  N/A    N/A, N/A        800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
        AMD INSTINCT MI200 (                                                                                  
6       [0x0b0c : 0x00]       34.0°C  86.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  560.0W    0%   0%    
        AMD INSTINCT MI200 (                                                                                  
7       [0x0b0c : 0x00]       36.0°C  N/A    N/A, N/A        800Mhz  1600Mhz  0%   auto  0.0W      0%   0%    
        AMD INSTINCT MI200 (                                                                                  
==============================================================================================================
============================================ End of ROCm SMI Log =============================================
tpatki changed the title from "Newer MI200+ GPUs report GPU-domain-level (GPU socket-level )power data." to "Newer MI200+ GPUs report GPU-domain-level (GPU socket-level) power data as opposed to individual devices" on Jun 15, 2024