DCGM_FI_PROF_GR_ENGINE_ACTIVE missing on nodes with H100 / H200 GPUs #226
Description
On nodes with H100 or H200 we are missing the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric.
Environment:
OpenShift 4.16 / NVIDIA GPU Operator 24.9.2
The metric DCGM_FI_PROF_GR_ENGINE_ACTIVE is not reported at all on H200 & H100 nodes:
H200 sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE
H200 sh-5.1#
While the same exporter on nodes with L40S GPUs does report it:
L40S sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE gpu utilization.
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-585834da-79ca-0599-5140-4ffdb131ae93",pci_bus_id="00000000:4E:00.0",device="nvidia0",modelName="NVIDIA L40S",Hostname="l40s.example.com",container="tensorflow",namespace="redacted",pod="tensorflow-0"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-8f8001ff-73ba-2e83-3cdc-4156ae4f3912",pci_bus_id="00000000:62:00.0",device="nvidia1",modelName="NVIDIA L40S",Hostname="l40s.example.com",container="tensorflow",namespace="redacted",pod="tensorflow-0"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-5bd39b62-f1d2-21f3-d9d5-24cf8d513c73",pci_bus_id="00000000:C9:00.0",device="nvidia2",modelName="NVIDIA L40S",Hostname="l40s.example.com"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-68fddb01-b19e-2ab2-e214-dd134a1deda9",pci_bus_id="00000000:DE:00.0",device="nvidia3",modelName="NVIDIA L40S",Hostname="l40s.example.com"} 0.000000
L40S sh-5.1#
Overall, this metric appears to be omitted entirely on H200 nodes:
H200 sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP
# HELP DCGM_FI_DEV_SM_CLOCK sm clock.
# HELP DCGM_FI_DEV_MEM_CLOCK mem clock.
# HELP DCGM_FI_DEV_MAX_SM_CLOCK max sm clock.
# HELP DCGM_FI_DEV_MAX_MEM_CLOCK max mem clock.
# HELP DCGM_FI_DEV_GPU_TEMP gpu temp.
# HELP DCGM_FI_DEV_POWER_USAGE power usage.
# HELP DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX power mgmt limit.
# HELP DCGM_FI_DEV_MEM_COPY_UTIL mem utilization.
# HELP DCGM_FI_DEV_ENC_UTIL enc utilization.
# HELP DCGM_FI_DEV_DEC_UTIL dec utilization.
H200 sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP | wc -l
10
while on L40S nodes exactly this one additional metric is present:
sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP
# HELP DCGM_FI_DEV_SM_CLOCK sm clock.
# HELP DCGM_FI_DEV_MEM_CLOCK mem clock.
# HELP DCGM_FI_DEV_MAX_SM_CLOCK max sm clock.
# HELP DCGM_FI_DEV_MAX_MEM_CLOCK max mem clock.
# HELP DCGM_FI_DEV_GPU_TEMP gpu temp.
# HELP DCGM_FI_DEV_POWER_USAGE power usage.
# HELP DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX power mgmt limit.
# HELP DCGM_FI_DEV_MEM_COPY_UTIL mem utilization.
# HELP DCGM_FI_DEV_ENC_UTIL enc utilization.
# HELP DCGM_FI_DEV_DEC_UTIL dec utilization.
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE gpu utilization.
sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP | wc -l
11
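For completeness, the gap between the two nodes can be confirmed programmatically by diffing the metric names from the HELP lines of each endpoint. A minimal sketch; the payloads below are abbreviated stand-ins for the real scrapes shown above, which in practice would come from fetching http://127.0.0.1:9400/metrics on each node:

```python
# Sketch: diff the metric names exposed by two dcgm-exporter endpoints.
# The payloads are abbreviated stand-ins for the /metrics output above.

h200_metrics = """\
# HELP DCGM_FI_DEV_SM_CLOCK sm clock.
# HELP DCGM_FI_DEV_GPU_TEMP gpu temp.
"""

l40s_metrics = """\
# HELP DCGM_FI_DEV_SM_CLOCK sm clock.
# HELP DCGM_FI_DEV_GPU_TEMP gpu temp.
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE gpu utilization.
"""

def metric_names(payload: str) -> set:
    """Extract metric names from '# HELP <name> <description>' lines."""
    return {
        line.split()[2]
        for line in payload.splitlines()
        if line.startswith("# HELP")
    }

# Metrics present on the L40S node but absent on the H200 node.
missing = metric_names(l40s_metrics) - metric_names(h200_metrics)
print(missing)  # {'DCGM_FI_PROF_GR_ENGINE_ACTIVE'}
```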
This breaks the GPU console plugin, which then does not show the H100 / H200 nodes: rh-ecosystem-edge/console-plugin-nvidia-gpu#66
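For anyone triaging this: dcgm-exporter decides which fields to publish from a counters CSV (on OpenShift typically supplied through the GPU Operator's metrics ConfigMap). If the field is expected in the output, its entry would look like the line below; whether DCGM then actually serves it depends on profiling-metric support on the node. The exact file path and ConfigMap name vary by deployment, so treat this as illustrative only:

```
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, gpu utilization.
```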