Cannot Retrieve GPU PIDs from DCGM Metrics #175

Open
doronkg opened this issue Jun 25, 2024 · 0 comments

Ask your question

Hi, I'm using the NVIDIA GPU Operator to expose GPUs on my OpenShift cluster, and I'm trying to build a PromQL aggregation that correlates GPU process IDs (PIDs) with Kubernetes Pods.

None of the exported DCGM metrics carries a label for the GPU PID.
The DCGM release notes mention the following:

The following features have been dropped or deprecated starting with DCGM 3.0:
The following field identifiers have been removed:
DCGM_FI_DEV_GRAPHICS_PIDS
DCGM_FI_DEV_COMPUTE_PIDS
...

My question: is there a way to retrieve this information in the current version?
I originally submitted this issue to the DCGM Exporter GitHub repo.

As a temporary workaround, I run a custom DaemonSet on all GPU nodes. It runs the following command to correlate each GPU PID with its Pod UID, and I expose the result as a custom metric:

$ nvidia-smi --query-compute-apps=pid --format=csv,noheader | xargs -I{} sh -c "echo -n '{}'; echo -n ','; grep -oPm1 '[0-9a-f]{8}(_[0-9a-f]{4}){3}_[0-9a-f]{12}' /proc/{}/cgroup | sed 's/_/-/g'" 
114855,c8b8d8a2-5e73-4c1a-b8e3-735e8a4e56d3
115044,1f7d9c8e-4a4b-455b-9b0d-9a2d1f4e6c2f

NOTE: This requires setting hostPID: true in the Pod spec.
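
For illustration, here is a rough sketch of how the PID/UID pairs from the command above could be turned into lines in the Prometheus text exposition format. The metric name gpu_process_pod_info, the label names pid and pod_uid, and the function names are all hypothetical, not something DCGM or the GPU Operator provides. The UID pattern also accepts "-" separators, on the assumption that some nodes use the cgroupfs driver instead of systemd.

```shell
#!/bin/sh
# Sketch only: metric name, label names and function names are hypothetical.

# Read cgroup text on stdin and print the first Pod UID found in it.
# The systemd cgroup driver writes the UID with "_" separators, the
# cgroupfs driver with "-"; both are normalised to "-".
pod_uid_from_cgroup() {
    grep -oE '[0-9a-f]{8}([_-][0-9a-f]{4}){3}[_-][0-9a-f]{12}' \
        | head -n 1 \
        | tr '_' '-'
}

# Print one sample line: $1 = GPU PID, $2 = Pod UID.
format_metric() {
    printf 'gpu_process_pod_info{pid="%s",pod_uid="%s"} 1\n' "$1" "$2"
}

# Print one sample per compute process currently reported by nvidia-smi.
collect() {
    nvidia-smi --query-compute-apps=pid --format=csv,noheader | while read -r pid; do
        # Skip PIDs that are not visible under /proc (process exited,
        # or the container does not share the host PID namespace).
        [ -r "/proc/$pid/cgroup" ] || continue
        uid=$(pod_uid_from_cgroup < "/proc/$pid/cgroup")
        if [ -n "$uid" ]; then
            format_metric "$pid" "$uid"
        fi
    done
}
```

Calling collect prints one line per GPU process that belongs to a Pod. How those lines reach Prometheus (for example through node_exporter's textfile collector) depends on the deployment.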

Versions:
OpenShift: v4.12.35
Kubernetes: v1.25.12+ba5cc25
NVIDIA GPU Operator: v23.3.2
DCGM Exporter: v3.1.7
