Description
Opened on Jan 13, 2023
I was going through the different DCGM fields and had the following questions:
Question-1: What is the difference between the following fields?
DCGM_FI_DEV_GPU_UTIL vs DCGM_FI_PROF_GR_ENGINE_ACTIVE
DCGM_FI_DEV_MEM_COPY_UTIL vs DCGM_FI_PROF_DRAM_ACTIVE
Question-2:
I am looking to track the following metrics for my AI workloads:
- GPU utilization over time - the idea is to understand whether the task is using the GPU efficiently or just wasting GPU resources.
- Memory utilization over time - the idea is to know whether there is scope to increase the batch size of the AI workload.
- How effectively is my task utilizing the GPU's parallelization capabilities? Some statistic to understand whether there is further scope to parallelize the computation, something like % core utilization and total core count. (Perhaps DCGM_FI_PROF_SM_ACTIVE?)
- Can I increase my computation batch size? I believe this should come from some memory utilization statistic.
Could you please advise which fields I should look into for the above stats? Note that I require the same set of metrics on both Tesla T4 and A100 (MIG) cards. (Asking because issue#58 seems to mention that the above-mentioned *_DEV_* fields do not work for MIG instances.)
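For reference, here is a small sketch of the candidate fields I have been looking at, keyed by the metric I want. The grouping and the "memory headroom" fields are my own guesses; the numeric IDs are copied from dcgm_fields.h (DCGM 3.x) and should be checked against the installed version:

```python
# My own mapping of desired metrics to candidate DCGM fields (not official).
# Numeric IDs copied from dcgm_fields.h (DCGM 3.x); please correct me if any
# of these are wrong or unavailable on T4 / A100 MIG.
CANDIDATE_FIELDS = {
    "gpu_utilization":  {"DCGM_FI_DEV_GPU_UTIL": 203,
                         "DCGM_FI_PROF_GR_ENGINE_ACTIVE": 1001},
    "memory_bandwidth": {"DCGM_FI_DEV_MEM_COPY_UTIL": 204,
                         "DCGM_FI_PROF_DRAM_ACTIVE": 1005},
    "parallelism":      {"DCGM_FI_PROF_SM_ACTIVE": 1002,
                         "DCGM_FI_PROF_SM_OCCUPANCY": 1003},
    "memory_headroom":  {"DCGM_FI_DEV_FB_USED": 252,
                         "DCGM_FI_DEV_FB_TOTAL": 250},
}

# Flat, sorted list of field IDs, e.g. to pass to `dcgmi dmon -e <ids>`:
all_ids = sorted(i for group in CANDIDATE_FIELDS.values()
                 for i in group.values())
print(",".join(str(i) for i in all_ids))
# → 203,204,250,252,1001,1002,1003,1005
```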
Also, the DCGM documentation says that not all fields can be queried in parallel. Does this apply only to the *_PROF_* fields, or to the *_DEV_* fields as well? More specifically, I wanted to know whether DCGM_FI_DEV_GPU_UTIL can be added to any field group, or whether it needs to be in a group of its own.
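To make the grouping question concrete, this is roughly how I was planning to test it empirically: watch a DEV field alongside PROF fields in a single `dcgmi dmon` invocation. A minimal sketch (the `dmon_cmd` helper is my own, not part of DCGM; it assumes `dcgmi` is on PATH):

```python
# Field IDs copied from dcgm_fields.h (verify against your DCGM version).
DEV_GPU_UTIL = 203     # DCGM_FI_DEV_GPU_UTIL
PROF_GR_ENGINE = 1001  # DCGM_FI_PROF_GR_ENGINE_ACTIVE
PROF_SM_ACTIVE = 1002  # DCGM_FI_PROF_SM_ACTIVE

def dmon_cmd(field_ids, samples=10, delay_ms=1000):
    """Build the argv for a `dcgmi dmon` run sampling the given fields
    (-e field IDs, -c sample count, -d delay between samples in ms)."""
    return ["dcgmi", "dmon",
            "-e", ",".join(str(f) for f in field_ids),
            "-c", str(samples),
            "-d", str(delay_ms)]

# Mix a DEV field with PROF fields in one watch to see if they coexist:
print(" ".join(dmon_cmd([DEV_GPU_UTIL, PROF_GR_ENGINE, PROF_SM_ACTIVE])))
# → dcgmi dmon -e 203,1001,1002 -c 10 -d 1000
```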