Confirm DCP GPU family #5

Kaka1127 · 2021-08-24T04:38:25Z

Hi.

I have two questions.

1.) I would like to know about the DCP GPU family. Which GPUs are included?
2.) How should I set up one standard dashboard to show GPU utilization across servers with different GPU families (T4, RTX A6000, A100, GeForce RTX 3080, and so on) in a K8s environment?

As you know, if a GPU is not included in the DCP GPU family, the DCGM_FI_PROF_* metrics will be disabled. If our cluster mixes GPU families, the dashboard will not work well. Or should I use the older "DCGM_FI_DEV_GPU_UTIL" metric instead?

Best regards.
Kaka

nikkon-dev · 2021-08-24T05:32:03Z

Hi @Kaka1127,

1.) The DCP metrics are supported for Datacenter grade Volta and newer GPUs - previously known as Tesla brand, Titan RTX (Volta), Quadro RTX. That includes GA100, GV100, A6000.
The RTX3* and RTX2* gaming series are not supported.

2.) Starting from DCGM 2.2.9, we improved the UX for heterogeneous environments - systems where several different GPUs are installed in the same machine. That will not help when your cluster has different configurations but a given node has only non-DCP-compatible GPU(s).
We are working on improving integration with k8s ConfigMaps. We would very much welcome feedback on what UX you would prefer in situations like yours.

WBR,
Nik

Kaka1127 · 2021-08-24T05:57:17Z

Hi @nikkon-dev

Thank you for your response and the useful information. But I am a bit confused by your answer to No. 2.
Did you mean that DCGM 2.2.9 supports a server with several different GPUs installed (A100, V100, A6000, and so on)?

In our use case, we only allow one GPU family per server in the K8s cluster, but I am interested in such a situation.
When a server has multiple GPU families installed, how should we specify which GPU a user would like to use?

I understand that NDF adds labels to each node and that I can select a node by using these labels. But I do not know whether a user can choose a specific GPU by selecting a label.

Best regards.
Kaka

nikkon-dev · 2021-08-24T06:05:21Z

@Kaka1127,

Let me provide more details on what changed in 2.2.9.
Yes, you should now be able to get metrics for different SKUs simultaneously.
If you have at least one GPU that supports a DCP metric, you should be able to enable that metric for all GPUs - unsupported GPUs will return zeros, but you should not observe failures.
Also, if you are using dcgmi directly, you should be able to run multiple parallel dcgmi dmon sessions with various combinations of GPUs/metrics.
Examples:
Installed GPUs: GA100, RTX2080Ti - you can enable DCGM_FI_PROF* metrics, and the RTX GPU will report zeros.
Installed GPUs: RTX2080Ti, RTX3090 - you cannot enable DCGM_FI_PROF* metrics - an error will be returned, as there are no GPUs in the monitoring group that would report any meaningful values.
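
The two examples above can be restated as a small model. This is only a sketch of the rule as described in this comment, not DCGM code: the capability table, `enable_prof_metric`, and `ProfilingNotSupported` are hypothetical names made up for illustration, and real support is decided by DCGM at runtime.

```python
from typing import Dict, List

# Hypothetical capability table for illustration only. DCGM determines
# real support at runtime; it does not use a lookup like this.
SUPPORTS_DCP: Dict[str, bool] = {
    "GA100": True,
    "GV100": True,
    "A6000": True,
    "RTX2080Ti": False,
    "RTX3090": False,
}


class ProfilingNotSupported(Exception):
    """Raised when no GPU in the group can report DCP metrics."""


def enable_prof_metric(gpus: List[str], readings: Dict[str, float]) -> Dict[str, float]:
    """Model the behavior described above for one DCGM_FI_PROF_* metric.

    If at least one GPU in the group supports DCP, the metric is enabled
    for the whole group and unsupported GPUs report 0.0. If no GPU in the
    group supports DCP, enabling the metric fails.
    GPUs missing from the table are treated as unsupported.
    """
    if not any(SUPPORTS_DCP.get(gpu, False) for gpu in gpus):
        raise ProfilingNotSupported("no GPU in this group supports DCP metrics")
    return {
        gpu: readings.get(gpu, 0.0) if SUPPORTS_DCP.get(gpu, False) else 0.0
        for gpu in gpus
    }
```

With a group of GA100 and RTX2080Ti the call succeeds and the RTX entry is 0.0; with RTX2080Ti and RTX3090 it raises, mirroring the error case in the second example.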

WBR,
Nik
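
For the mixed-family dashboard question, one possible approach (not confirmed anywhere in this thread) is to prefer the profiling metric where it reports a non-zero value and fall back to DCGM_FI_DEV_GPU_UTIL otherwise. The sketch below assumes DCGM_FI_PROF_GR_ENGINE_ACTIVE is a 0 to 1 ratio and DCGM_FI_DEV_GPU_UTIL is a percentage, that each series carries a `gpu` label, and that unsupported GPUs report zero or no profiling series. The sample text is made up for illustration, and the two metrics do not measure exactly the same thing, so treat the merged number as approximate.

```python
import re
from typing import Dict

# Made-up exposition text for illustration; real dcgm-exporter output
# carries more labels and more metrics.
SAMPLE = """\
# HELP lines and comments are skipped by the parser
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 80
DCGM_FI_DEV_GPU_UTIL{gpu="1"} 35
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0"} 0.75
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1"} 0
"""

# metric_name{...gpu="N"...} value
LINE = re.compile(r'^(\w+)\{[^}]*gpu="(\d+)"[^}]*\}\s+(\S+)$')


def parse(text: str) -> Dict[str, Dict[str, float]]:
    """Return {metric_name: {gpu_index: value}} for lines with a gpu label."""
    metrics: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            name, gpu, value = match.groups()
            metrics.setdefault(name, {})[gpu] = float(value)
    return metrics


def utilization_percent(text: str) -> Dict[str, float]:
    """Per-GPU utilization in percent, preferring the profiling metric."""
    metrics = parse(text)
    prof = metrics.get("DCGM_FI_PROF_GR_ENGINE_ACTIVE", {})
    dev = metrics.get("DCGM_FI_DEV_GPU_UTIL", {})
    merged: Dict[str, float] = {}
    for gpu in sorted(set(prof) | set(dev)):
        if prof.get(gpu, 0.0) > 0.0:
            merged[gpu] = prof[gpu] * 100.0  # ratio -> percent
        else:
            merged[gpu] = dev.get(gpu, 0.0)  # fallback for zero or missing
    return merged
```

An idle DCP-capable GPU also reports zero for the profiling metric, so it takes the fallback path too; in that case both metrics should be near zero and the result is still reasonable.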

Kaka1127 · 2021-08-24T08:54:08Z

@nikkon-dev

I got it! This is very helpful for us.

Regarding selecting a specific GPU among different SKUs in one server, I will open a new topic.

Best regards.
Kaka

lszxyz · 2022-01-10T08:49:31Z

docker run -itd --restart=always --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.3.1-2.6.1-ubuntu20.04
3080 2080

Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
time="2021-12-27T08:19:53Z" level=info msg="Starting dcgm-exporter"
time="2021-12-27T08:19:55Z" level=info msg="DCGM successfully initialized!"
time="2021-12-27T08:19:57Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2021-12-27T08:19:57Z" level=info msg="Starting webserver"
time="2021-12-27T08:19:57Z" level=info msg="Pipeline starting"

As shown above, I am running with RTX3* and RTX2* gaming-series GPUs. Should I be able to get some metrics?

nikkon-dev · 2022-01-10T08:59:39Z

@lszxyz,

There are two issues in your example:

1.) As Warning 2 shows, you need to provide the --cap-add SYS_ADMIN argument to the docker run command.
2.) DCP metrics (DCGM_FI_PROF_* family) are not supported on gaming RTX SKUs.
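
For the first issue, a minimal sketch of the adjusted command, built as an argument list so it can be inspected before running. `exporter_command` is a hypothetical helper name; the image tag, port, and `--gpus all` are copied from the command posted above, and the `--restart` and `--rm` options are left out for brevity.

```python
import shlex
from typing import List

# Image tag taken from the command posted above.
IMAGE = "nvcr.io/nvidia/k8s/dcgm-exporter:2.3.1-2.6.1-ubuntu20.04"


def exporter_command(image: str = IMAGE, port: int = 9400) -> List[str]:
    """Build a docker run command that grants the capability named in Warning #2."""
    return [
        "docker", "run", "-d",
        "--gpus", "all",
        "--cap-add", "SYS_ADMIN",  # needed for profiling metrics, per the warning
        "-p", f"{port}:{port}",
        image,
    ]


if __name__ == "__main__":
    # Print the command for review; pass the list to subprocess.run to execute it.
    print(shlex.join(exporter_command()))
```

Adding the capability only addresses the warning. On gaming RTX SKUs the DCGM_FI_PROF_* metrics remain unavailable either way.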

lszxyz · 2022-01-10T09:41:20Z

Regarding "DCP metrics (DCGM_FI_PROF_* family)": what does the abbreviation DCP stand for?

nikkon-dev · 2022-01-10T10:09:51Z

DCP stands for Datacenter Profiling

lszxyz · 2022-01-12T13:05:14Z

Are all the metrics that can be exported listed in this file, main/etc? Is dcgm_fan_speed_percent supported?

Kaka1127 closed this as completed Aug 24, 2021
