DCGM_FI_PROF_GR_ENGINE_ACTIVE missing on nodes with H100 / H200 GPUs #226

@duritong

Description

On nodes with H100 or H200 we are missing the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric.

Environment:

OpenShift 4.16 / NVIDIA GPU Operator 24.9.2

The metric DCGM_FI_PROF_GR_ENGINE_ACTIVE is not reported at all on H200 & H100 nodes:

H200 sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE
H200 sh-5.1# 

The same exporter on nodes with L40S GPUs does report it:

L40S sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep DCGM_FI_PROF_GR_ENGINE_ACTIVE
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE gpu utilization.
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-585834da-79ca-0599-5140-4ffdb131ae93",pci_bus_id="00000000:4E:00.0",device="nvidia0",modelName="NVIDIA L40S",Hostname="l40s.example.com",container="tensorflow",namespace="redacted",pod="tensorflow-0"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-8f8001ff-73ba-2e83-3cdc-4156ae4f3912",pci_bus_id="00000000:62:00.0",device="nvidia1",modelName="NVIDIA L40S",Hostname="l40s.example.com",container="tensorflow",namespace="redacted",pod="tensorflow-0"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="2",UUID="GPU-5bd39b62-f1d2-21f3-d9d5-24cf8d513c73",pci_bus_id="00000000:C9:00.0",device="nvidia2",modelName="NVIDIA L40S",Hostname="l40s.example.com"} 0.000000
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="3",UUID="GPU-68fddb01-b19e-2ab2-e214-dd134a1deda9",pci_bus_id="00000000:DE:00.0",device="nvidia3",modelName="NVIDIA L40S",Hostname="l40s.example.com"} 0.000000
L40S sh-5.1# 

Overall, this metric appears to be entirely absent from the H200 nodes' output:

H200  sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP
# HELP DCGM_FI_DEV_SM_CLOCK sm clock.
# HELP DCGM_FI_DEV_MEM_CLOCK mem clock.
# HELP DCGM_FI_DEV_MAX_SM_CLOCK max sm clock.
# HELP DCGM_FI_DEV_MAX_MEM_CLOCK max mem clock.
# HELP DCGM_FI_DEV_GPU_TEMP gpu temp.
# HELP DCGM_FI_DEV_POWER_USAGE power usage.
# HELP DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX power mgmt limit.
# HELP DCGM_FI_DEV_MEM_COPY_UTIL mem utilization.
# HELP DCGM_FI_DEV_ENC_UTIL enc utilization.
# HELP DCGM_FI_DEV_DEC_UTIL dec utilization.
H200 sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP | wc -l
10

while on L40S nodes the output contains exactly one additional metric, the one in question:

sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP                         
# HELP DCGM_FI_DEV_SM_CLOCK sm clock.
# HELP DCGM_FI_DEV_MEM_CLOCK mem clock.
# HELP DCGM_FI_DEV_MAX_SM_CLOCK max sm clock.
# HELP DCGM_FI_DEV_MAX_MEM_CLOCK max mem clock.
# HELP DCGM_FI_DEV_GPU_TEMP gpu temp.
# HELP DCGM_FI_DEV_POWER_USAGE power usage.
# HELP DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX power mgmt limit.
# HELP DCGM_FI_DEV_MEM_COPY_UTIL mem utilization.
# HELP DCGM_FI_DEV_ENC_UTIL enc utilization.
# HELP DCGM_FI_DEV_DEC_UTIL dec utilization.
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE gpu utilization.
sh-5.1# curl -s http://127.0.0.1:9400/metrics | grep HELP | wc -l
11
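The manual comparison above can be automated. Here is a minimal sketch that extracts metric names from Prometheus exposition text and diffs two exporters' outputs; the sample strings stand in for the real `curl -s http://127.0.0.1:9400/metrics` responses, which would be fetched per node in practice:

```python
import re

def metric_names(metrics_text):
    """Extract the set of metric names from Prometheus exposition text."""
    return {m.group(1) for m in re.finditer(r"^# HELP (\S+)", metrics_text, re.M)}

# Abbreviated stand-ins for the two exporters' /metrics output.
h200 = """# HELP DCGM_FI_DEV_SM_CLOCK sm clock.
# HELP DCGM_FI_DEV_GPU_TEMP gpu temp.
"""
l40s = h200 + "# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE gpu utilization.\n"

# Metrics present on the L40S node but missing on the H200 node.
print(sorted(metric_names(l40s) - metric_names(h200)))
# → ['DCGM_FI_PROF_GR_ENGINE_ACTIVE']
```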

This breaks the GPU console plugin as it does not show the H100 / H200 nodes: rh-ecosystem-edge/console-plugin-nvidia-gpu#66
