Cannot Retrieve GPU PIDs from DCGM Metrics #347

Closed
doronkg opened this issue Jun 25, 2024 · 4 comments
Labels: question (Further information is requested)

Comments

@doronkg
Contributor

doronkg commented Jun 25, 2024

Ask your question

Hi, I'm using the NVIDIA GPU Operator to expose GPUs on my OpenShift cluster, and I'm trying to build a PromQL aggregation that correlates GPU PIDs (process IDs) with K8s Pods.

Among the exported DCGM metrics, I found none with a label representing the GPU PID.
The DCGM release notes mention the following:

The following features have been dropped or deprecated starting with DCGM 3.0:
The following field identifiers have been removed:
DCGM_FI_DEV_GRAPHICS_PIDS
DCGM_FI_DEV_COMPUTE_PIDS
...

My question: is there a way to retrieve this information in the current version?
Let me know if I should submit this issue to the DCGM GitHub repo instead.

As a workaround for the time being, I run a custom DaemonSet on all GPU nodes. It runs the following command to correlate each GPU PID with its Pod UID, and I expose the result as a custom metric:

$ nvidia-smi --query-compute-apps=pid --format=csv,noheader | xargs -I{} sh -c "echo -n '{}'; echo -n ','; grep -oPm1 '[0-9a-f]{8}(_[0-9a-f]{4}){3}_[0-9a-f]{12}' /proc/{}/cgroup | sed 's/_/-/g'" 
114855,c8b8d8a2-5e73-4c1a-b8e3-735e8a4e56d3
115044,1f7d9c8e-4a4b-455b-9b0d-9a2d1f4e6c2f

NOTE: This requires setting hostPID: true in the Pod spec.
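
For readers who want to adapt this, the one-liner above could be split into small functions along the following lines. This is only a sketch under a few assumptions: the metric name gpu_process_pod_info and the helper names are hypothetical (they are not DCGM or dcgm-exporter names), the output is simply printed in Prometheus text format, and the pattern also accepts dash-separated UIDs in case the node uses the cgroupfs driver rather than systemd.

```shell
#!/usr/bin/env bash
# Sketch: print one Prometheus-style sample per GPU compute process.
# The metric name "gpu_process_pod_info" is hypothetical, not a DCGM metric.

# Extract the first pod UID found in a cgroup file and normalise it to the
# dashed form used by the Kubernetes API. Accepts both underscore-separated
# UIDs (as in the original command) and dash-separated ones.
pod_uid_from_cgroup() {
  grep -oE '[0-9a-f]{8}([_-][0-9a-f]{4}){3}[_-][0-9a-f]{12}' "$1" 2>/dev/null \
    | head -n 1 \
    | tr '_' '-'
}

# Format one sample line in the Prometheus text exposition format.
format_sample() {
  printf 'gpu_process_pod_info{pid="%s",pod_uid="%s"} 1\n' "$1" "$2"
}

# Only query the driver when nvidia-smi is actually present on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-compute-apps=pid --format=csv,noheader |
    while read -r pid; do
      uid=$(pod_uid_from_cgroup "/proc/${pid}/cgroup" || true)
      # Skip processes that do not belong to a pod (no UID found).
      if [ -n "$uid" ]; then
        format_sample "$pid" "$uid"
      fi
    done
fi
```

As with the original command, this only sees other pods' processes when the container shares the host PID namespace. The pod UID label could then be joined in PromQL against a metric that carries both the pod UID and the pod name, if one is available in the cluster.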

Versions:
OpenShift: v4.12.35
Kubernetes: v1.25.12+ba5cc25
NVIDIA GPU Operator: v23.3.2
DCGM Exporter: v3.1.7

doronkg added the question label Jun 25, 2024

@dpointk

dpointk commented Jun 25, 2024

👀 following

@Lynnery

Lynnery commented Jun 25, 2024

Would like to see this implemented 👀

@nvvfedorov
Collaborator

@doronkg, please submit this issue to the DCGM repository.

@doronkg
Contributor Author

doronkg commented Jun 25, 2024

> @doronkg, please submit this issue to the DCGM repository.

Thanks, submitted here: NVIDIA/DCGM#175
I am closing this one.

doronkg closed this as completed Jun 25, 2024