Pod metrics display the DaemonSet name of dcgm-exporter rather than the pod with the GPU #27
Comments
Hey @nikkon-dev, do you need any additional information to help explain the pod metrics bug? I'll have to work around this one way or another and am trying to gauge my next steps. Thanks!
Hey @salliewalecka,
Sure thing! I'll try it out. I was just getting some issues previously where the community suggested the downgrade, but I'll see if the newest version works for me.
Hey! I upgraded to 2.3.1-2.6.0-ubuntu20.04, but I get a
@salliewalecka,
Still working on ^ as I thought I changed the entrypoint, but I'm still getting a crash loop. Interestingly enough, I did a
For DCGM (or dcgm-exporter) as a monitoring tool, the CUDA version does not really matter (we even still support CUDA 9). But I'm not sure where the first requirement came from.
@nikkon-dev I followed the setup from GKE (Google's hosted Kubernetes): https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers. The driver is containerized, I'm assuming, since it comes from the daemonset that GCP provides. The node is a VM running Container-Optimized OS, I believe, but I am unsure of how it is provided; I can go and find out if that is needed.
Ok. I think I understand your environment. You mentioned that you were able to run
Awesome, thanks! I also just asked GCP support to get official answers for the future. I will work on ^ and change the entry point. For some reason, though, I'm still getting CrashLoopBackOff, so I need to overcome that. It might take me 'til tomorrow to complete this. Thanks for all your help.
It wasn't running when I logged in but when I tried to restart it, I got:
The logs were
Could you check that
Ok. That's closer to the problem's root cause. That library is provided by the NVIDIA Docker runtime and controlled by the
Oh wait, maybe I found it here:
Does the
Yup, it looks like it... I must have had a wonky find command earlier.
Ok. It looks like we will need to collect the NVML debug logs to analyze them further.
NVML logs are encrypted binary blobs.
Somehow it looks like the logs never made it / never got created.
I have to call it for today but thanks so much for all your help again. |
Just a weird guess - what is the size of the nvidia-ml.so library? Not the symlink, but the final actual file. |
@nikkon-dev Nice thought! It's size 0.
Could you try to delete
Please try to refresh your dcgm-exporter images. We re-published all recent dcgm-exporter images with the fix.
I used the image dcgm-exporter:2.2.9-2.5.0-ubuntu20.04 and had to change the readiness probe from a 5 to a 30 second delay (maybe I could have shortened it). However, in Prometheus the tag still says
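For reference, here is a minimal sketch of that readiness-probe tweak; the /health path and port 9400 are assumptions based on dcgm-exporter's typical defaults, not values taken from the comment above:

```yaml
# Hypothetical readiness probe for the dcgm-exporter container,
# with the initial delay raised from 5 to 30 seconds as described above.
readinessProbe:
  httpGet:
    path: /health            # assumed default health endpoint
    port: 9400               # assumed default dcgm-exporter metrics port
  initialDelaySeconds: 30
```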
@nikkon-dev Sorry, I didn't get a chance to do the ... I have
@nikkon-dev I got it working! I needed to add this to my env as well, since it was the non-default option.
Now I see my pod coming through as exported_pod="my-pod-zzzzzzz-xxxx". Thanks a ton for all your help here!
@salliewalecka
env:
  - name: "DCGM_EXPORTER_KUBERNETES"
    value: "true"
  - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
    value: "device-name"
Is 'device-name' a constant value?
Yes. "device-name" and "uid" are two possible values here: dcgm-exporter/pkg/dcgmexporter/types.go Line 50 in d40847d
|
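For completeness, a minimal sketch of the alternative setting; only the variable names and the two values come from the comments above, and the interpretation of "uid" is my assumption:

```yaml
env:
  - name: "DCGM_EXPORTER_KUBERNETES"
    value: "true"
  # The other documented value; presumably matches GPUs by UUID
  # rather than by device name.
  - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
    value: "uid"
```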
Didn't help with gpu-operator v1.11.0 |
Expected Behavior: I'm trying to get GPU metrics working for my workloads and would expect to be able to see my pod name show up in the Prometheus metrics, as per this guide in the section "Per-pod GPU metrics in a Kubernetes cluster".
Existing Behavior: The metrics show up, but the "pod" tag is "somename-gpu-dcgm-exporter", which is unhelpful as it does not map back to my pods.
example metric:
DCGM_FI_DEV_GPU_TEMP{UUID="GPU-<UUID>", container="exporter", device="nvidia0", endpoint="metrics", gpu="0", instance="<Instance>", job="somename-gpu-dcgm-exporter", namespace="some-namespace", pod="somename-gpu-dcgm-exporter-vfbhl", service="somename-gpu-dcgm-exporter"}
K8s cluster: GKE clusters with a node pool running 2 V100 GPUs per node
Setup: I used helm template to generate the YAML to apply to my GKE cluster. I ran into the issue described here, so I needed to add privileged: true, downgrade to nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04, and add the nvidia-install-dir-host volume.
Things I've tried:
The daemonset looked as below:
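(The actual manifest did not survive in this copy of the issue. Below is a minimal hypothetical sketch reconstructed only from the setup notes above; the container name, image, privileged flag, and volume name come from this report, while the hostPath, mountPath, port, and labels are assumptions typical for a GKE deployment.)

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: somename-gpu-dcgm-exporter
spec:
  selector:
    matchLabels:
      app: somename-gpu-dcgm-exporter          # assumed label
  template:
    metadata:
      labels:
        app: somename-gpu-dcgm-exporter
    spec:
      containers:
        - name: exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04
          securityContext:
            privileged: true                   # added per the workaround above
          ports:
            - name: metrics
              containerPort: 9400              # assumed default dcgm-exporter port
          volumeMounts:
            - name: nvidia-install-dir-host
              mountPath: /usr/local/nvidia     # assumed GKE driver location
      volumes:
        - name: nvidia-install-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia  # assumed GKE COS driver install dir
```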