
Question about DCGM fields

Closed

Description

I was going through the different DCGM fields and have the following questions:

Question-1: What is the difference between the following pairs of fields?
DCGM_FI_DEV_GPU_UTIL vs DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_DEV_MEM_COPY_UTIL vs DCGM_FI_PROF_DRAM_ACTIVE
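
For context, here is a quick sketch of how I have been comparing the two pairs side by side. It just shells out to `dcgmi dmon` from Python; the numeric field IDs are what I copied out of dcgm_fields.h, so please treat them as my assumption rather than verified values:

```python
import subprocess

# Field IDs as I read them from dcgm_fields.h (please correct me if any are wrong):
#   203  DCGM_FI_DEV_GPU_UTIL            204  DCGM_FI_DEV_MEM_COPY_UTIL
#   1001 DCGM_FI_PROF_GR_ENGINE_ACTIVE   1005 DCGM_FI_PROF_DRAM_ACTIVE
FIELD_IDS = "203,1001,204,1005"

# Sample all four fields once per second for 30 samples so the DEV and PROF
# values can be compared side by side in the dcgmi dmon table output.
subprocess.run(["dcgmi", "dmon", "-e", FIELD_IDS, "-d", "1000", "-c", "30"], check=True)
```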

Question-2:
I am looking to track the following metrics for my AI workloads:

  1. GPU utilization over time - the idea is to understand whether the task is using the GPU efficiently or just wasting GPU resources.
  2. Memory utilization over time - the idea is to know whether there is scope to increase the batch size of the AI workload.
  3. How effectively is my task using the GPU's parallelization capabilities? I am after a statistic showing whether there is further scope to parallelize the computation, something like % core utilization and total core count. (Perhaps DCGM_FI_PROF_SM_ACTIVE?)
  4. Can I increase my computation batch size? I believe this should come from some memory utilization statistic.

Could you please advise which fields I should look into for the above stats? (My current guess at a mapping is sketched below.) Note that I need the same set of metrics on both Tesla T4 and A100 (MIG) cards. (Asking because issue#58 seems to mention that the *_DEV_* fields above do not work for MIG.)
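
To make Question-2 concrete, this is the metric-to-field mapping I currently have in mind. These are only my guesses from reading dcgm_fields.h and are not validated, especially not on MIG:

```python
# My current guesses at metric -> field mapping (IDs from dcgm_fields.h; unverified):
CANDIDATE_FIELDS = {
    # 1. GPU utilization over time
    "DCGM_FI_PROF_GR_ENGINE_ACTIVE": 1001,  # or DCGM_FI_DEV_GPU_UTIL (203) on non-MIG cards?
    # 2. and 4. memory utilization / batch-size headroom
    "DCGM_FI_DEV_FB_USED": 252,
    "DCGM_FI_DEV_FB_FREE": 251,
    # 3. parallelization / SM usage
    "DCGM_FI_PROF_SM_ACTIVE": 1002,
    "DCGM_FI_PROF_SM_OCCUPANCY": 1003,
}
```

If any of these are the wrong choice for the T4 or for A100 MIG instances, that is exactly what I am hoping to learn.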

Also, the DCGM documentation says that not all fields can be queried in parallel. Does this apply only to the *_PROF_* fields, or to the *_DEV_* fields as well? More specifically, I want to know whether DCGM_FI_DEV_GPU_UTIL can be added to any field group or whether it needs to be in a group of its own.
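
What I would like to end up with is a single field group mixing DEV and PROF fields, roughly like the sketch below. It uses the DcgmReader helper that ships with the DCGM Python bindings; the constructor arguments and method names are from my reading of DcgmReader.py and may not be exact, so please treat them as assumptions:

```python
import time
from DcgmReader import DcgmReader  # helper shipped with the DCGM Python bindings

# One field group mixing a *_DEV_* and two *_PROF_* fields -- is this allowed,
# or does DCGM_FI_DEV_GPU_UTIL (203) need to live in its own group?
reader = DcgmReader(fieldIds=[203, 1001, 1002], updateFrequency=1000000)  # 1 s in microseconds

for _ in range(10):
    # Latest sample per GPU, keyed by field name (as I understand the helper).
    print(reader.GetLatestGpuValuesAsFieldNameDict())
    time.sleep(1)
```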
