Pod metrics displays Daemonset name of dcgm-exporter rather than the pod with GPU #27

Closed
salliewalecka opened this issue Nov 17, 2021 · 32 comments

Comments

@salliewalecka

salliewalecka commented Nov 17, 2021

Expected Behavior: I'm trying to get GPU metrics working for my workloads and would expect to be able to see my pod name show up in the Prometheus metrics, as per this guide in the section "Per-pod GPU metrics in a Kubernetes cluster".

Existing Behavior: The metrics show up, but the "pod" tag is "somename-gpu-dcgm-exporter", which is unhelpful as it does not map back to my pods.

example metric: DCGM_FI_DEV_GPU_TEMP{UUID="GPU-<UUID>", container="exporter", device="nvidia0", endpoint="metrics", gpu="0", instance="<Instance>", job="somename-gpu-dcgm-exporter", namespace="some-namespace", pod="somename-gpu-dcgm-exporter-vfbhl", service="somename-gpu-dcgm-exporter"}

K8s cluster: GKE clusters with a nodepool running 2 V100 GPUs per node
Setup: I used helm template to generate the YAML to apply to my GKE cluster. I ran into the issue described here, so I needed to add privileged: true, downgrade to nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04, and add the nvidia-install-dir-host volume.
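For reference, the generation step looked roughly like this; the repo URL, release name, and value keys are an illustrative sketch of what I ran, not the exact command:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm template somename-gpu gpu-helm-charts/dcgm-exporter \
  --namespace some-namespace \
  --set image.tag=2.0.13-2.1.1-ubuntu18.04 \
  > dcgm-exporter.yaml
kubectl apply -f dcgm-exporter.yaml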

Things I've tried:

  • Verified DCGM_EXPORTER_KUBERNETES is set to true (a verification sketch follows this list)
  • Went through https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/kubernetes.go#L126 to see if I had misunderstood the functionality or could find an easy resolution
  • I see there has been a code change since my downgrade, but it seemed to enable MIG, which didn't appear to apply to me. Even if it did, the issue that forced the downgrade would still exist.
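The verification was roughly the following (the pod name is just a placeholder):

# check the env var and the pod-resources socket from inside the running exporter pod
kubectl -n some-namespace exec -it somename-gpu-dcgm-exporter-vfbhl -- \
  sh -c 'printenv DCGM_EXPORTER_KUBERNETES; ls -l /var/lib/kubelet/pod-resources'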

The daemonset looked as below:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: somename-gpu-dcgm-exporter
  namespace: some-namespace
  labels:
    helm.sh/chart: dcgm-exporter-2.4.0
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: somename-gpu
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: somename-gpu
      app.kubernetes.io/component: "dcgm-exporter"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/instance: somename-gpu
        app.kubernetes.io/component: "dcgm-exporter"
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-accelerator
                    operator: Exists
      serviceAccountName: gpu-dcgm-exporter
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: "Exists"
        - effect: NoSchedule
          key: nodeSize
          operator: Equal
          value: my-special-nodepool-taint
      containers:
      - name: exporter
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
          runAsNonRoot: false
          runAsUser: 0
          privileged: true
        image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
        imagePullPolicy: "IfNotPresent"
        args:
        - -f
        - /etc/dcgm-exporter/dcp-metrics-included.csv
        env:
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        ports:
        - name: "metrics"
          containerPort: 9400
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
        livenessProbe:
          httpGet:
            path: /health
            port: 9400
          initialDelaySeconds: 5
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /health
            port: 9400
          initialDelaySeconds: 5
@salliewalecka
Author

salliewalecka commented Nov 22, 2021

Hey @nikkon-dev, do you need any additional information to help explain the pod metrics bug? I'll have to work around this one way or another and am trying to gauge my next steps. Thanks!

@nikkon-dev
Collaborator

Hey @salliewalecka,
While we are trying to find out why the resource manager reports the GPU resource as associated with the dcgm-exporter pod instead of your job pod, could you try to update the dcgm-exporter itself? 2.0.13 is quite an old version, and there were changes in the k8s API that we are using. There is a 2.3.1-2.6.0-ubuntu20.04 version.

@salliewalecka
Author

Sure thing! I'll try it out. I was previously getting some issues where the community suggested the downgrade, but I'll see if the newest version works for me.

@salliewalecka
Author

Hey! When I upgraded to 2.3.1-2.6.0-ubuntu20.04, I got a pod error (CrashLoopBackOff) with the logs below. This was the error that caused me to downgrade before.

level=fatal msg="Error watching fields: This request is serviced by a module of DCGM that is not currently loaded"
Error: Failed to initialize NVML
level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

@nikkon-dev
Collaborator

@salliewalecka,
Is it possible to get debug logs from the nv-hostengine service running inside the dcgm-exporter container?
Usually those logs are disabled, and you need to restart nv-hostengine with the -f debug.log --log-level DEBUG command line arguments. The nv-hostengine service is controlled by the nvidia-dcgm.service or dcgm.service systemctl files. That's not straightforward to do right in the dcgm-exporter container, as k8s will restart the pod once the nv-hostengine process terminates.
In any case, you should be able to run the nvidia-smi command inside the dcgm-exporter container (you may change the entrypoint so it does not trigger the crash loop). If nvidia-smi inside the dcgm-exporter container returns an error, that most likely means the Nvidia container infrastructure is misconfigured. You may reference this documentation for the prerequisites: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#prerequisites - those are pretty much the same for dcgm-exporter and gpu-operator.
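A minimal sketch of that check (the pod name is a placeholder):

# run nvidia-smi inside the exporter pod; if the pod crash-loops, temporarily override
# the entrypoint in the daemonset (e.g. command: ["sleep", "infinity"]) and exec in
kubectl -n some-namespace exec -it <dcgm-exporter-pod> -- nvidia-smi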

@salliewalecka
Author

Still working on ^ as I thought I had changed the entrypoint, but I'm still getting a crash loop. Interestingly enough, I did a printenv for my first entrypoint and got output that included NVIDIA_REQUIRE_CUDA=cuda>=11.4 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 driver>=450, but when I set my entrypoint to nvidia-smi I saw: | NVIDIA-SMI 450.119.04 Driver Version: 450.119.04 CUDA Version: 11.0 |. Is CUDA required to be >= 11.4 in this version? As of now the driver version is something GKE is managing...

@nikkon-dev
Collaborator

For dcgm (or dcgm-exporter) as a monitoring tool, the CUDA version does not really matter (we even still support CUDA 9). But I'm not sure where the first requirement came from.
Are you using a bare-metal driver (installed on the node) or a containerized driver (https://docs.nvidia.com/datacenter/cloud-native/driver-containers/overview.html)? Is the node a VM, or a bare-metal machine as well? If it's a VM, how is the GPU provided - passthrough or virtualized?

@salliewalecka
Author

@nikkon-dev I followed the GPU setup for GKE (Google's hosted Kubernetes): https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers. The driver is containerized, I'm assuming, since it comes from the daemonset that GCP provides. The node is a VM running Container-Optimized OS, I believe. I am unsure of how the GPU is provided, but I can go and find out if that is needed.
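For context, the driver installer I applied is the daemonset from the GKE docs, roughly the following (COS variant; the exact manifest GKE recommends may differ by version):

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml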

@nikkon-dev
Collaborator

Ok, I think I understand your environment. You mentioned that you were able to run nvidia-smi in the dcgm-exporter container, right? If it runs and returns the driver version, that's enough to know that the driver is actually loaded.
In that case, we need to see nv-hostengine debug logs.
If you are able to change the entrypoint in the dcgm-exporter container to something like bash -c 'while sleep 10; do :; done', you should be able to log in to the container and then restart nv-hostengine with --log-level debug arguments (stop the nvidia-dcgm service and change the service definition name, or just manually start nv-hostengine -f /tmp/nvhostengine.debug.log --log-level debug). After these steps, you need to run the dcgm-exporter process and wait till it crashes/terminates.
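Putting those steps together, a rough sketch of the in-container sequence (flags as above; adjust paths as needed):

# with the entrypoint overridden so the container idles, exec in and then:
nv-hostengine -f /tmp/nvhostengine.debug.log --log-level debug
# run the exporter with the same config the daemonset uses and wait for it to fail
dcgm-exporter -f /etc/dcgm-exporter/dcp-metrics-included.csv
# then collect /tmp/nvhostengine.debug.log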

@salliewalecka
Author

Awesome, thanks! I also just asked GCP support to get official answers for the future. I will work on ^ and change the entrypoint. For some reason, though, I'm still getting CrashLoopBackOff, so I need to overcome that. It might take me 'til tomorrow to complete this. Thanks for all your help.

@salliewalecka
Author

It wasn't running when I logged in, but when I tried to restart it, I got:

root@somename-gpu-dcgm-exporter-aaa5:/# /usr/bin/nv-hostengine -f /tmp/nvhostengine.debug.log --log-level debug
Error: Failed to initialize NVML

The logs were

root@somename-gpu-dcgm-exporter-aaa5:/# tail -n500 /tmp/nvhostengine.debug.log
2021-11-23 00:15:20.506 DEBUG [76:76] Initialized base logger [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6806] [dcgmStartEmbedded_v2]
2021-11-23 00:15:20.507 INFO  [76:76] version:2.3.1;arch:x86_64;buildtype:Release;buildid:4;builddate:2021-10-14;commit:58dee5b7d36a6a9572d80dd1ba63f5a3000df63a;branch:rel_dcgm_2_3;buildplatform:Linux 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64;;crc:2daadee6a4a852d93855ef6127e30fbe [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6809] [dcgmStartEmbedded_v2]
2021-11-23 00:15:20.507 ERROR [76:76] Cannot initialize the hostengine: Error: Failed to initialize NVML [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3647] [DcgmHostEngineHandler::Init]
2021-11-23 00:15:20.507 ERROR [76:76] DcgmHostEngineHandler::Init failed [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6824] [dcgmStartEmbedded_v2]
2021-11-23 00:15:20.507 DEBUG [76:76] Before dcgmapiFreeClientHandler [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7137] [dcgmShutdown]
2021-11-23 00:15:20.507 INFO  [76:76] Another thread freed the client handler for us. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:255] [dcgmapiFreeClientHandler]
2021-11-23 00:15:20.507 DEBUG [76:76] After dcgmapiFreeClientHandler [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7139] [dcgmShutdown]
2021-11-23 00:15:20.507 DEBUG [76:76] dcgmShutdown completed successfully [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7174] [dcgmShutdown]
2021-11-23 00:15:28.951 DEBUG [82:82] Initialized base logger [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6806] [dcgmStartEmbedded_v2]
2021-11-23 00:15:28.951 INFO  [82:82] version:2.3.1;arch:x86_64;buildtype:Release;buildid:4;builddate:2021-10-14;commit:58dee5b7d36a6a9572d80dd1ba63f5a3000df63a;branch:rel_dcgm_2_3;buildplatform:Linux 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64;;crc:2daadee6a4a852d93855ef6127e30fbe [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6809] [dcgmStartEmbedded_v2]
2021-11-23 00:15:28.951 ERROR [82:82] Cannot initialize the hostengine: Error: Failed to initialize NVML [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3647] [DcgmHostEngineHandler::Init]
2021-11-23 00:15:28.951 ERROR [82:82] DcgmHostEngineHandler::Init failed [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6824] [dcgmStartEmbedded_v2]
2021-11-23 00:15:28.951 DEBUG [82:82] Before dcgmapiFreeClientHandler [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7137] [dcgmShutdown]
2021-11-23 00:15:28.951 INFO  [82:82] Another thread freed the client handler for us. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:255] [dcgmapiFreeClientHandler]
2021-11-23 00:15:28.951 DEBUG [82:82] After dcgmapiFreeClientHandler [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7139] [dcgmShutdown]
2021-11-23 00:15:28.951 DEBUG [82:82] dcgmShutdown completed successfully [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7174] [dcgmShutdown]

@nikkon-dev
Collaborator

Could you check that libnvidia-ml.so.1 is present inside the dcgm-exporter container?
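For example (note that find needs -name to match by file name):

find / -name 'libnvidia-ml.so.1' 2>/dev/null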

@salliewalecka
Author

somename-gpu-dcgm-exporter-aaa5:/# find / libnvidia-ml.so.1
find: 'libnvidia-ml.so.1': No such file or directory

@nikkon-dev
Collaborator

nikkon-dev commented Nov 23, 2021

Ok. That's closer to the root cause of the problem. That library is provided by the Nvidia docker runtime and controlled by the NVIDIA_DRIVER_CAPABILITIES env variable. It should be NVIDIA_DRIVER_CAPABILITIES=compute,utility.
A bit of doc here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html
I'm not sure if the GKE environment sets this right out of the box.
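If it isn't set, one way to add it to the exporter daemonset is something like this (illustrative; names match the manifest above):

kubectl -n some-namespace set env daemonset/somename-gpu-dcgm-exporter \
  NVIDIA_DRIVER_CAPABILITIES=compute,utility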

@salliewalecka
Author

salliewalecka commented Nov 23, 2021

Oh wait maybe I found it here:

root@somename-gpu-dcgm-exporter-sgjz5:/# find / -name 'libnvidia-ml.*'
/usr/local/nvidia/lib64/libnvidia-ml.so
/usr/local/nvidia/lib64/libnvidia-ml.so.450.119.04
/usr/local/nvidia/lib64/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.495.29.05
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

@nikkon-dev
Collaborator

Does the ldconfig -p | grep -i libnvidia-ml.so find it?

@salliewalecka
Author

salliewalecka commented Nov 23, 2021

gpu-dcgm-exporter-sgjz5:/# echo $NVIDIA_DRIVER_CAPABILITIES
compute,utility,compat32
gpu-dcgm-exporter-sgjz5:/# ldconfig -p | grep -i libnvidia-ml.so
	libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

@salliewalecka
Author

Yup, it looks like it... I must have had a wonky find command earlier.

@nikkon-dev
Collaborator

Ok. It looks like we will need to collect the NVML debug logs to analyze them further.

export __NVML_DBG_LVL=DEBUG
export __NVML_DBG_FILE=/tmp/nvml.debug.log
nv-hostengine -f /tmp/hostengine.debug.log --log-level DEBUG -n

NVML logs are encrypted binary blobs.

@salliewalecka
Author

salliewalecka commented Nov 23, 2021

gpu-dcgm-exporter-sgjz5:/# echo $__NVML_DBG_FILE
/tmp/nvml.debug.log
gpu-dcgm-exporter-sgjz5:/# echo $__NVML_DBG_LVL
DEBUG
gpu-dcgm-exporter-sgjz5:/# nv-hostengine -f /tmp/hostengine.debug.log --log-level DEBUG -n
Error: Failed to initialize NVML
User defined signal 1
gpu-dcgm-exporter-sgjz5:/# cat /tmp/nvml.debug.log
cat: /tmp/nvml.debug.log: No such file or directory

@salliewalecka
Author

Somehow it looks like the logs never got created.

@salliewalecka
Author

I have to call it for today but thanks so much for all your help again.

@nikkon-dev
Collaborator

nikkon-dev commented Nov 23, 2021

Just a weird guess - what is the size of the libnvidia-ml.so library? Not the symlink, but the actual final file.

@salliewalecka
Author

salliewalecka commented Nov 23, 2021

@nikkon-dev Nice thought! It's size 0.

root@barista-gpu-dcgm-exporter-sgjz5:/#  ls -al /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
lrwxrwxrwx 1 root root 25 Oct 28 15:11 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 -> libnvidia-ml.so.495.29.05
root@barista-gpu-dcgm-exporter-sgjz5:/# du -h /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.495.29.05
0	/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.495.29.05

@nikkon-dev
Collaborator

@salliewalecka,

Could you try to delete libnvidia-* from /usr/lib/x86_64-linux-gnu and run ldconfig after that in the dcgm-exporter container? After that, ldconfig -p | grep -i libnvidia-ml should show you the /usr/local/nvidia/lib64 location, and nv-hostengine should be able to run.
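Roughly, inside the container:

# remove the zero-size stub libraries baked into the image, then refresh the linker cache
rm /usr/lib/x86_64-linux-gnu/libnvidia-*
ldconfig
ldconfig -p | grep -i libnvidia-ml   # should now point at /usr/local/nvidia/lib64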

@nikkon-dev
Collaborator

@salliewalecka,

Please try to refresh your dcgm-exporter images. We re-published all recent dcgm-exporter images with the fix.

@salliewalecka
Author

I used the image dcgm-exporter:2.2.9-2.5.0-ubuntu20.04 and had to change the readiness probe delay from 5 to 30 seconds (maybe I could have shortened it). However, in Prometheus the tag still says pod="somename-gpu-dcgm-exporter-vgtvq"
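For reference, the probe change amounts to something like this (patch syntax is illustrative; I edited the rendered YAML directly):

kubectl -n some-namespace patch daemonset somename-gpu-dcgm-exporter --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 30}]'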

@salliewalecka
Author

@nikkon-dev Sorry, I didn't get a chance to do the ldconfig -p | grep -i libnvidia-ml. I'm assuming that since we got the exporter up and running, that recommendation is obsolete? Is there anything additional I need to be running to get the association with the pod?

I have

env:
  - name: "DCGM_EXPORTER_KUBERNETES"
    value: "true"

@salliewalecka
Author

@nikkon-dev I got it working! I needed to add this to my env as well, since it is the non-default option:

- name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
              value: "device-name"

Now I see my pod coming through as exported_pod="my-pod-zzzzzzz-xxxx". Thanks a ton for all your help here!
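For anyone hitting this later, an equivalent imperative change would be roughly this (env values as discussed above):

kubectl -n some-namespace set env daemonset/somename-gpu-dcgm-exporter \
  DCGM_EXPORTER_KUBERNETES=true \
  DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE=device-name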

@ysoftman

@salliewalecka
I have the same problem.
The pod tag in the DCGM_FI_DEV_GPU_UTIL metric (Prometheus) has an unhelpful value like 'some-name-gpu-dcgm-exporter'.
Do I have to add these container env vars to my daemonset?

env:
  - name: "DCGM_EXPORTER_KUBERNETES"
    value: "true"
  - name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
    value: "device-name"

Is 'device-name' a constant value?

@nikkon-dev
Collaborator

Yes. "device-name" and "uid" are two possible values here:

@Muscule

Muscule commented Jul 26, 2022

@nikkon-dev I got it working! I needed to add this to my env as well since it was the non-default option

- name: "DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE"
              value: "device-name"

Now I see my pod coming as exported_pod="my-pod-zzzzzzz-xxxx". Thanks a ton for all your help here!

Didn't help with gpu-operator v1.11.0
