Some GPM/DCGM statistics incorrectly use cycle count rather than elapsed time #1122

@arisu3

Description

NVIDIA Open GPU Kernel Modules Version

590.48.01-1

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Debian GNU/Linux 13 (trixie)

Kernel Release

6.12.73

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

RTX 5070 Laptop GPU

Describe the bug

Some GPM and DCGM metrics, such as SM usage and occupancy, are computed from cycle counts rather than elapsed time. As a result, large SM clock swings caused by DVFS can make these metrics severely inaccurate. A research paper (https://dl.acm.org/doi/full/10.1145/3784828.3785156) discusses software workarounds based on postprocessing, but it would be great if the metric calculations themselves could be fixed.

The paper describes:

The documentation for the semantically equivalent DCGM metric PROF_SM_ACTIVE describes the SM utilization as “the ratio of cycles an SM has at least 1 warp assigned”. While the GPU utilization is measured as a percentage of time, the SM utilization is measured as a percentage of cycles. Therefore, the SM utilization depends on the SM clock frequency during the measurement.
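To make the effect concrete, here is a minimal sketch with made-up numbers (not measurements from the driver). It assumes a one-second sampling window in which the SM is busy for the first half at a high clock and idle for the second half at a low clock. The `Phase` type, both function names, and the clock values are hypothetical and only illustrate the difference between a ratio of time and a ratio of cycles:

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One stretch of the sampling window at a constant SM clock (hypothetical model)."""
    duration_s: float   # wall-clock length of the phase in seconds
    sm_clock_hz: float  # SM clock during the phase (assumed constant)
    active: bool        # whether the SM had at least one warp assigned


def time_based_utilization(phases):
    """Fraction of elapsed time during which the SM was active."""
    total = sum(p.duration_s for p in phases)
    busy = sum(p.duration_s for p in phases if p.active)
    return busy / total


def cycle_based_utilization(phases):
    """Fraction of SM cycles during which the SM was active."""
    total = sum(p.duration_s * p.sm_clock_hz for p in phases)
    busy = sum(p.duration_s * p.sm_clock_hz for p in phases if p.active)
    return busy / total


# Assumed scenario: busy at 2.0 GHz for 0.5 s, then idle at 0.5 GHz for 0.5 s.
phases = [
    Phase(duration_s=0.5, sm_clock_hz=2.0e9, active=True),
    Phase(duration_s=0.5, sm_clock_hz=0.5e9, active=False),
]

time_util = time_based_utilization(phases)
cycle_util = cycle_based_utilization(phases)
print(f"time-based:  {time_util:.0%}")
print(f"cycle-based: {cycle_util:.0%}")
```

In this assumed scenario the SM is busy for half of the elapsed time, but the busy half contributes four times as many cycles as the idle half, so the cycle-based ratio comes out at 80% instead of 50%. If the clock were constant across the window, the two ratios would agree.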

This issue applies to any Blackwell (and presumably Ada/Hopper) GPU. I only have a Blackwell, and Blackwell does not support the proprietary kernel module, hence I cannot test with it.

To Reproduce

  • Enable GPM: nvidia-smi gpm -s 1
  • Monitor GPM metrics: nvidia-smi dmon --gpm-metrics 1,2
  • Trigger SM clock swings. Metric 1 remains accurate, while metric 2 becomes inaccurate.

Bug Incidence

Always

nvidia-bug-report.log.gz

N/A

More Info

No response

Labels: bug (Something isn't working)
