DCGM_FI_PROF_GR_ENGINE_ACTIVE missing on nodes with H100 / H200 GPUs #226

@duritong

Description

On nodes with H100 or H200 we are missing the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric.

Environment:

OpenShift 4.16 / NVIDIA GPU Operator 24.9.2

The metric DCGM_FI_PROF_GR_ENGINE_ACTIVE is not reported at all on H200 & H100 nodes:

H200 sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE
H200 sh-5.1# 

The same exporter on nodes with L40S GPUs does report it:

L40S sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE gpu utilization.
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-585834da-79ca-0599-5140-4ffdb131ae93",pci_bus_id="00000000:4E:00.0",device="nvidia0",modelName="NVIDIA L40S",Hostname="l40s.example.com",container="tensorflow",namespace="redacted",pod="tensorflow-0"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-8f8001ff-73ba-2e83-3cdc-4156ae4f3912",pci_bus_id="00000000:62:00.0",device="nvidia1",modelName="NVIDIA L40S",Hostname="l40s.example.com",container="tensorflow",namespace="redacted",pod="tensorflow-0"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-5bd39b62-f1d2-21f3-d9d5-24cf8d513c73",pci_bus_id="00000000:C9:00.0",device="nvidia2",modelName="NVIDIA L40S",Hostname="l40s.example.com"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-68fddb01-b19e-2ab2-e214-dd134a1deda9",pci_bus_id="00000000:DE:00.0",device="nvidia3",modelName="NVIDIA L40S",Hostname="l40s.example.com"} 0.000000
L40S sh-5.1# 

Overall, this metric appears to be entirely absent from the H200 nodes' output:

H200  sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP
# HELP DCGM_FI_DEV_SM_CLOCK sm clock.
# HELP DCGM_FI_DEV_MEM_CLOCK mem clock.
# HELP DCGM_FI_DEV_MAX_SM_CLOCK max sm clock.
# HELP DCGM_FI_DEV_MAX_MEM_CLOCK max mem clock.
# HELP DCGM_FI_DEV_GPU_TEMP gpu temp.
# HELP DCGM_FI_DEV_POWER_USAGE power usage.
# HELP DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX power mgmt limit.
# HELP DCGM_FI_DEV_MEM_COPY_UTIL mem utilization.
# HELP DCGM_FI_DEV_ENC_UTIL enc utilization.
# HELP DCGM_FI_DEV_DEC_UTIL dec utilization.
H200 sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP | wc -l
10

while on L40S nodes the output contains exactly one additional metric, the one in question:

sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP                         
# HELP DCGM_FI_DEV_SM_CLOCK sm clock.
# HELP DCGM_FI_DEV_MEM_CLOCK mem clock.
# HELP DCGM_FI_DEV_MAX_SM_CLOCK max sm clock.
# HELP DCGM_FI_DEV_MAX_MEM_CLOCK max mem clock.
# HELP DCGM_FI_DEV_GPU_TEMP gpu temp.
# HELP DCGM_FI_DEV_POWER_USAGE power usage.
# HELP DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX power mgmt limit.
# HELP DCGM_FI_DEV_MEM_COPY_UTIL mem utilization.
# HELP DCGM_FI_DEV_ENC_UTIL enc utilization.
# HELP DCGM_FI_DEV_DEC_UTIL dec utilization.
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE gpu utilization.
sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP | wc -l
11
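The manual comparison above can be automated. Here is a minimal sketch that extracts metric names from Prometheus exposition text and diffs two exporters' outputs; the sample strings stand in for the real `curl -s http://127.0.0.1:9400/metrics` responses, which would be fetched per node in practice:

```python
import re

def metric_names(metrics_text):
    """Extract the set of metric names from Prometheus exposition text."""
    return {m.group(1) for m in re.finditer(r"^# HELP (\S+)", metrics_text, re.M)}

# Abbreviated stand-ins for the two exporters' /metrics output.
h200 = """# HELP DCGM_FI_DEV_SM_CLOCK sm clock.
# HELP DCGM_FI_DEV_GPU_TEMP gpu temp.
"""
l40s = h200 + "# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE gpu utilization.\n"

# Metrics present on the L40S node but missing on the H200 node.
print(sorted(metric_names(l40s) - metric_names(h200)))
# → ['DCGM_FI_PROF_GR_ENGINE_ACTIVE']
```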

This breaks the GPU console plugin as it does not show the H100 / H200 nodes: rh-ecosystem-edge/console-plugin-nvidia-gpu#66
