As an example from our MI200 node, here's what rocm-smi (v6.0.2 onwards) reports. Note that GPU devices 1, 3, 5, and 7 do not report power values and show N/A instead. This is by design.
In Variorum, we leverage the newer rsmi_dev_power_get() API as that allows for data collection on both older and newer GPUs.
This API, however, fails on device IDs 1, 3, 5, and 7 with an RSMI_STATUS_NOT_SUPPORTED error, so Variorum prints failure messages when no real failure has occurred. It would have been helpful if the RSMI API had returned 0.0 on the GPU devices that do not report power, so that a tool looping through all GPU devices to gather power or other telemetry could report 0.0 when power data is unavailable instead of failing or exiting.
To get around this problem, we will now check specifically for RSMI_STATUS_NOT_SUPPORTED to bypass the Variorum error message and allow the loop that walks through all GPU devices to continue.
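A minimal sketch of the workaround described above, not the actual Variorum patch. So that it compiles without ROCm installed, it uses hypothetical stand-ins: `mock_status_t` and `mock_power_get()` are made up for illustration and only mimic the behavior reported here (odd device IDs return a "not supported" status). The power values are invented, and the helper name `collect_power()` is an assumption as well. In real code the query would be `rsmi_dev_power_get()` and the check would be against `RSMI_STATUS_NOT_SUPPORTED`.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the RSMI status codes, so this sketch builds
 * without ROCm. Real code would include <rocm_smi/rocm_smi.h> instead. */
typedef enum {
    MOCK_STATUS_SUCCESS = 0,
    MOCK_STATUS_NOT_SUPPORTED,
    MOCK_STATUS_OTHER_ERROR
} mock_status_t;

/* Hypothetical mock of a per-device power query. Odd device IDs return
 * NOT_SUPPORTED, mimicking the MI200 behavior described in this issue.
 * The microwatt values are made up. */
static mock_status_t mock_power_get(uint32_t dev, uint64_t *power_uw)
{
    if (dev % 2 == 1) {
        return MOCK_STATUS_NOT_SUPPORTED;
    }
    *power_uw = 90000000ULL + dev * 1000000ULL;
    return MOCK_STATUS_SUCCESS;
}

/* Walk all devices and store power in watts. Devices that do not support
 * power reporting get 0.0 and are skipped silently; any other error is
 * logged and counted. Returns the number of real failures. */
static int collect_power(uint32_t num_devices, double *watts)
{
    int failures = 0;

    for (uint32_t i = 0; i < num_devices; i++) {
        uint64_t power_uw = 0;
        mock_status_t ret = mock_power_get(i, &power_uw);

        if (ret == MOCK_STATUS_NOT_SUPPORTED) {
            /* Not a failure: this device simply has no power sensor. */
            watts[i] = 0.0;
            continue;
        }
        if (ret != MOCK_STATUS_SUCCESS) {
            fprintf(stderr, "power query failed on device %u\n", (unsigned)i);
            watts[i] = 0.0;
            failures++;
            continue;
        }
        watts[i] = (double)power_uw / 1e6; /* microwatts to watts */
    }
    return failures;
}
```

Calling `collect_power(8, watts)` on the mock fills the even-numbered slots with a reading and the odd-numbered slots with 0.0, and the loop reaches every device without printing an error for the unsupported ones.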
tpatki changed the title from "Newer MI200+ GPUs report GPU-domain-level (GPU socket-level )power data." to "Newer MI200+ GPUs report GPU-domain-level (GPU socket-level) power data as opposed to individual devices" on Jun 15, 2024