Confirm DCP GPU family #5

Closed
Kaka1127 opened this issue Aug 24, 2021 · 9 comments

@Kaka1127

Hi.

I have two questions.

  1. I would like to know about the DCP GPU family. Which GPUs does it include?

  2. How should I build one standard dashboard that shows GPU utilization across servers with different GPU families (T4, RTX A6000, A100, GeForce RTX 3080, and so on) in a K8s environment?

As you know, if a GPU is not in the DCP GPU family, the DCGM_FI_PROF_* metrics are disabled. If our cluster mixes GPU families, the dashboard will not work well. Or should I use the older "DCGM_FI_DEV_GPU_UTIL" metric instead?

Best regards.
Kaka

@nikkon-dev
Collaborator

Hi @Kaka1127,

1.) The DCP metrics are supported for Datacenter grade Volta and newer GPUs - previously known as Tesla brand, Titan RTX (Volta), Quadro RTX. That includes GA100, GV100, A6000.
The RTX3* and RTX2* gaming series are not supported.

2.) Starting from DCGM 2.2.9, we improved the UX for heterogeneous environments, meaning systems where several different GPUs are installed in the same machine. That will not help when your cluster has different configurations but each node has only non-DCP-compatible GPU(s).
We are working on improving integration with k8s ConfigMaps. Feedback on what UX you would prefer in situations like yours would be very welcome.

WBR,
Nik

@Kaka1127
Author

Hi @nikkon-dev

Thank you for your response and the helpful information. But I am a bit confused by your answer to No. 2.
Did you mean that DCGM 2.2.9 supports a server with several different GPUs installed (A100, V100, A6000, and so on)?

In our use case, we only allow one GPU family per server in the K8s cluster, but I am interested in that situation.
When a server has multiple GPU families installed, how should we specify the GPU that a user would like to use?

I understand that NDF adds a label to each node and that I can select a "node" by using this label. But I do not know whether a user can choose a specific GPU by selecting a label.

Best regards.
Kaka

@nikkon-dev
Collaborator

@Kaka1127,

Let me provide more details on what changed in 2.2.9.
Yes, you should now be able to get metrics for different SKUs simultaneously.
If at least one GPU supports a DCP metric, you should be able to enable that metric for all GPUs: unsupported GPUs will return zeros, but you should not observe failures.
Also, if you are using dcgmi directly, you should be able to run multiple parallel dcgmi dmon sessions with various combinations of GPUs and metrics.
Examples:
Installed GPUs: GA100, RTX2080Ti. You can enable DCGM_FI_PROF_* metrics, and the RTX GPU will report zeros.
Installed GPUs: RTX2080Ti, RTX3090. You cannot enable DCGM_FI_PROF_* metrics; an error is returned because no GPU in the monitoring group would report any meaningful values.

WBR,
Nik
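
For readers who prefer code, the rule described in the comment above can be sketched as a small Python model. This is only an illustration of the described behavior, not DCGM code: the `Gpu` class, the `supports_dcp` flag, the `ProfilingNotSupported` exception, and `read_prof_metric` are all hypothetical names, and real support detection happens inside DCGM.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gpu:
    name: str
    supports_dcp: bool  # hypothetical flag; DCGM decides this internally


class ProfilingNotSupported(Exception):
    """Hypothetical error: no GPU in the group can report DCP metrics."""


def read_prof_metric(group: List[Gpu], raw: Dict[str, float]) -> Dict[str, float]:
    """Model one DCGM_FI_PROF_* metric for a monitoring group.

    raw maps a GPU name to the value a DCP-capable GPU would report.
    """
    # Enabling the metric fails if no GPU in the group supports DCP.
    if not any(gpu.supports_dcp for gpu in group):
        raise ProfilingNotSupported("no DCP-capable GPU in the group")
    # Otherwise every GPU gets a value; unsupported GPUs report zero.
    return {
        gpu.name: raw.get(gpu.name, 0.0) if gpu.supports_dcp else 0.0
        for gpu in group
    }


mixed = [Gpu("GA100", True), Gpu("RTX2080Ti", False)]
print(read_prof_metric(mixed, {"GA100": 0.42}))
# prints {'GA100': 0.42, 'RTX2080Ti': 0.0}
```

Calling `read_prof_metric` with only `RTX2080Ti` and `RTX3090` entries (both with `supports_dcp=False`) raises `ProfilingNotSupported`, which mirrors the second example above.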

@Kaka1127
Author

@nikkon-dev

I got it! It is very helpful for us.

Regarding selecting a specific GPU among different SKUs in a server, I will create a new topic.

Best regards.
Kaka

@lszxyz

lszxyz commented Jan 10, 2022

docker run -itd --restart=always --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.3.1-2.6.1-ubuntu20.04
GPUs: 3080, 2080

Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
time="2021-12-27T08:19:53Z" level=info msg="Starting dcgm-exporter"
time="2021-12-27T08:19:55Z" level=info msg="DCGM successfully initialized!"
time="2021-12-27T08:19:57Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2021-12-27T08:19:57Z" level=info msg="Starting webserver"
time="2021-12-27T08:19:57Z" level=info msg="Pipeline starting"

As shown above, I ran it with RTX3* and RTX2* gaming-series GPUs. Should I still get some metrics?

@nikkon-dev
Collaborator

nikkon-dev commented Jan 10, 2022

@lszxyz,

There are two issues in your example:

  1. As Warning #2 shows, you need to pass the --cap-add SYS_ADMIN argument to the docker run command.
  2. DCP metrics (DCGM_FI_PROF_* family) are not supported on gaming RTX SKUs.
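
Both conditions can be spotted directly in the startup log. Below is a minimal Python sketch that scans dcgm-exporter output for the two messages quoted in this thread. The `diagnose` function and the wording of its findings are hypothetical, and the matched substrings are copied from the log above, so they may differ in other dcgm-exporter versions.

```python
from typing import List

# Substrings copied from the log lines quoted in this thread.
PRIVILEGE_HINT = "doesn't have sufficient privileges"
UNSUPPORTED_HINT = "Profiling is not supported for this group of GPUs or GPU"

SAMPLE_LOG = '''Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
time="2021-12-27T08:19:55Z" level=info msg="DCGM successfully initialized!"
time="2021-12-27T08:19:57Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
'''


def diagnose(log_text: str) -> List[str]:
    """Return one finding per log line that matches a known message."""
    findings = []
    for line in log_text.splitlines():
        if PRIVILEGE_HINT in line:
            findings.append("missing capability: add --cap-add SYS_ADMIN to docker run")
        elif UNSUPPORTED_HINT in line:
            findings.append("DCGM_FI_PROF_* metrics are not supported on these GPUs")
    return findings


for finding in diagnose(SAMPLE_LOG):
    print(finding)
```

Note that fixing the first finding does not resolve the second: on gaming RTX SKUs the DCGM_FI_PROF_* family stays unavailable even with SYS_ADMIN.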

@lszxyz

lszxyz commented Jan 10, 2022

Regarding "DCP metrics (DCGM_FI_PROF_* family)": what does DCP stand for?

@nikkon-dev
Collaborator

DCP stands for Datacenter Profiling

@lszxyz

lszxyz commented Jan 12, 2022

Are all the metrics that can be exported listed in the file under main/etc? Is dcgm_fan_speed_percent supported?
