Confirm DCP GPU family #5
Comments
Hi @Kaka1127,
1.) The DCP metrics are supported for Datacenter grade Volta and newer GPUs: previously known as Tesla brand, Titan RTX (Volta), Quadro RTX. That includes GA100, GV100, A6000.
2.) Starting from DCGM 2.2.9, we improved the UX for heterogeneous environments, that is, systems where several GPUs are installed into the same machine. That will not help in situations where you have different configurations in your cluster but each node has only non-DCP-compatible GPU(s).
WBR,
Hi @nikkon-dev,
Thank you for your response and the helpful information. I am a bit confused by point 2 of your answer, though. In our use case, each server in the K8s cluster is only allowed to contain GPUs of the same family, but I am still interested in the mixed situation. I understand that the NDF adds a label to each node and that I can select a node by using this label. But I do not know whether a user will be able to choose a specific GPU by selecting a label.
Best regards.
Let me provide more details on what's been changed in 2.2.9.
WBR,
I got it! It is very helpful for us. Regarding selecting a specific GPU among different SKUs in one server, I will create a new topic.
Best regards.
docker run -itd --restart=always --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.3.1-2.6.1-ubuntu20.04
Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
I ran it like this on RTX3* and RTX2* gaming-series GPUs. Should I be getting any metrics?
There are two issues in your example:
"DCP metrics (DCGM_FI_PROF_* family)": what does the abbreviation DCP stand for?
DCP stands for Datacenter Profiling.
Are all the metrics that can be exported listed in this file (main/etc)? Is dcgm_fan_speed_percent supported?
Hi.
I have two questions.
I would like to know about the DCP GPU family. Which GPUs does it include?
How should I build one standard dashboard that shows GPU utilization across servers with different GPU families (T4, RTX A6000, A100, GeForce RTX 3080, and so on) in a K8s environment?
As you know, if a GPU is not in the DCP GPU family, the DCGM_FI_PROF_* metrics are disabled. If our cluster mixes GPU families, the dashboard will not work well... Or should I use the older metric "DCGM_FI_DEV_GPU_UTIL"?
Best regards.
Kaka
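For the mixed-cluster dashboard question above, one possible approach is a per-GPU fallback: use a DCGM_FI_PROF_* metric where the exporter provides it, and DCGM_FI_DEV_GPU_UTIL otherwise. The sketch below illustrates that idea on dcgm-exporter's text output. It is only a rough sketch under assumptions: the choice of DCGM_FI_PROF_GR_ENGINE_ACTIVE as the profiling metric, the `UUID` and `gpu` label names, the hypothetical helper name `gpu_utilization`, and the made-up sample text are not taken from this thread. The two metrics are also not strictly equivalent measurements, so the combined number is an approximation.

```python
import re
from collections import defaultdict

# One sample line in Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Assumption: these are the two metrics the dashboard should combine.
PROF_METRIC = "DCGM_FI_PROF_GR_ENGINE_ACTIVE"  # ratio in 0..1, DCP GPUs only
DEV_METRIC = "DCGM_FI_DEV_GPU_UTIL"            # percent in 0..100


def gpu_utilization(metrics_text):
    """Return {gpu_id: percent}, preferring the profiling metric per GPU."""
    seen = defaultdict(dict)
    for line in metrics_text.splitlines():
        m = SAMPLE_RE.match(line.strip())
        if not m or m["name"] not in (PROF_METRIC, DEV_METRIC):
            continue  # skips comments, blank lines, and unrelated metrics
        labels = dict(LABEL_RE.findall(m["labels"]))
        # Assumption: GPUs are identified by a UUID label, else by a gpu index.
        gpu_id = labels.get("UUID") or labels.get("gpu", "unknown")
        seen[gpu_id][m["name"]] = float(m["value"])

    result = {}
    for gpu_id, values in seen.items():
        if PROF_METRIC in values:
            result[gpu_id] = values[PROF_METRIC] * 100.0  # ratio -> percent
        elif DEV_METRIC in values:
            result[gpu_id] = values[DEV_METRIC]
    return result


# Made-up sample: GPU-aaa exposes both metrics, GPU-bbb only the legacy one.
SAMPLE = '''
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaa"} 40
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-aaa"} 0.5
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbb"} 75
'''

if __name__ == "__main__":
    print(gpu_utilization(SAMPLE))
```

The same fallback could probably be expressed directly in the dashboard's query language instead; the Python version is only meant to make the selection logic explicit.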