Crash loop backoff with Error: Failed to initialize NVML on GKE #59

Closed
praveenperera opened this issue Apr 1, 2022 · 13 comments

@praveenperera

praveenperera commented Apr 1, 2022

Similar to issue: #27

My daemonset.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-metrics-dcgm-exporter
  namespace: default
  uid: 3415e29d-346f-4580-b99e-aaca03a672ad
  resourceVersion: '5254468'
  generation: 12
  creationTimestamp: '2022-03-31T15:25:15Z'
  labels:
    app.kubernetes.io/component: dcgm-exporter
    app.kubernetes.io/instance: nvidia-gpu-metrics
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/version: 2.6.5
    helm.sh/chart: dcgm-exporter-2.6.5
  annotations:
    deprecated.daemonset.template.generation: '12'
    meta.helm.sh/release-name: nvidia-gpu-metrics
    meta.helm.sh/release-namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: dcgm-exporter
      app.kubernetes.io/instance: nvidia-gpu-metrics
      app.kubernetes.io/name: dcgm-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: dcgm-exporter
        app.kubernetes.io/instance: nvidia-gpu-metrics
        app.kubernetes.io/name: dcgm-exporter
    spec:
      volumes:
        - name: pod-gpu-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
            type: ''
        - name: nvidia-install-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia
            type: ''
      containers:
        - name: exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04
          args:
            - '-f'
            - /etc/dcgm-exporter/dcp-metrics-included.csv
          ports:
            - name: metrics
              containerPort: 9400
              protocol: TCP
          env:
            - name: DCGM_EXPORTER_KUBERNETES
              value: 'true'
            - name: DCGM_EXPORTER_LISTEN
              value: ':9400'
          resources: {}
          volumeMounts:
            - name: pod-gpu-resources
              readOnly: true
              mountPath: /var/lib/kubelet/pod-resources
            - name: nvidia-install-dir-host
              mountPath: /usr/local/nvidia
          livenessProbe:
            httpGet:
              path: /health
              port: 9400
              scheme: HTTP
            initialDelaySeconds: 45
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 9400
              scheme: HTTP
            initialDelaySeconds: 45
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
            runAsUser: 0
            runAsNonRoot: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: nvidia-gpu-metrics-dcgm-exporter
      serviceAccount: nvidia-gpu-metrics-dcgm-exporter
      securityContext: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-accelerator
                    operator: Exists
      schedulerName: default-scheduler
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
  revisionHistoryLimit: 10

What I've tried

  • Tried running nvidia-smi in the container: same error
  • Ran ldconfig -p | grep -i libnvidia-ml.so: the library was found in /usr/local/nvidia/lib64/
  • Ran /usr/bin/nv-hostengine -f /tmp/nvhostengine.debug.log --log-level debug, which produced the log below (a kubectl exec sketch for reproducing these checks follows the log):
2022-04-01 15:25:29.629 DEBUG [29:29] Initialized base logger [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6806] [dcgmStartEmbedded_v2]
2022-04-01 15:25:29.629 INFO  [29:29] version:2.3.5;arch:x86_64;buildtype:Release;buildid:13;builddate:2022-03-09;commit:e7246b91195b78740e0db2d0f1edf15dd88436d6;branch:rel_dcgm_2_3;buildplatform:Linux 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64;;crc:d764e6617965aa186e46fc5540b128aa [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6809] [dcgmStartEmbedded_v2]
2022-04-01 15:25:29.632 ERROR [29:29] Cannot initialize the hostengine: Error: Failed to initialize NVML [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3677] [DcgmHostEngineHandler::Init]
2022-04-01 15:25:29.632 ERROR [29:29] DcgmHostEngineHandler::Init failed [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:6824] [dcgmStartEmbedded_v2]
2022-04-01 15:25:29.632 DEBUG [29:29] Before dcgmapiFreeClientHandler [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7137] [dcgmShutdown]
2022-04-01 15:25:29.632 INFO  [29:29] Another thread freed the client handler for us. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:255] [dcgmapiFreeClientHandler]
2022-04-01 15:25:29.632 DEBUG [29:29] After dcgmapiFreeClientHandler [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7139] [dcgmShutdown]
2022-04-01 15:25:29.632 DEBUG [29:29] dcgmShutdown completed successfully [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmApi.cpp:7174] [dcgmShutdown]
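For completeness, a rough sketch of how these checks can be reproduced from outside the pod with kubectl exec; the pod name is a placeholder for whatever the DaemonSet created:

# Placeholder pod name; substitute the real exporter pod
POD=nvidia-gpu-metrics-dcgm-exporter-xxxxx
kubectl exec "$POD" -- nvidia-smi
kubectl exec "$POD" -- sh -c 'ldconfig -p | grep -i libnvidia-ml.so'
kubectl exec "$POD" -- /usr/bin/nv-hostengine -f /tmp/nvhostengine.debug.log --log-level debug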
@glowkey
Collaborator

glowkey commented Apr 1, 2022

Just to clarify, the libnvidia-ml.so files were found inside the container? I just pulled the image and checked and did not find them. Also, have you followed the Kubernetes integration guide found here? https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html#integrating-gpu-telemetry-into-kubernetes

@praveenperera
Author

praveenperera commented Apr 1, 2022

@glowkey the NVIDIA drivers were already installed on the node by the nvidia-driver-installer that's now automatically included in GKE clusters.

The libnvidia-ml.so libraries are available inside the container because they were mounted from the node:

volumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
volumes:
  - name: nvidia-install-dir-host
    hostPath:
      path: /home/kubernetes/bin/nvidia

@nikkon-dev
Collaborator

@praveenperera,
Inside the container, could you run ldconfig -p | grep libnvidia-ml.so and see if there are any results?
It's not enough to just mount /usr/local/nvidia; you also need to tell the OS where to look for the NVIDIA libraries (update the ldcache).
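For what it's worth, a minimal sketch of updating the ldcache inside the container, assuming the driver libraries are mounted from the host at /usr/local/nvidia/lib64:

# Assumption: host driver directory is mounted at /usr/local/nvidia
echo "/usr/local/nvidia/lib64" > /etc/ld.so.conf.d/nvidia.conf
ldconfig
ldconfig -p | grep libnvidia-ml.so   # should now resolve the mounted libraries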

@praveenperera
Author

@nikkon-dev yes, sorry: ldconfig -p | grep libnvidia-ml.so was run inside the container, and it showed the library files where I expected them to be (the folder I mounted them to).

@nikkon-dev
Collaborator

@praveenperera,
Then we need to understand what the dynamic linker is loading on the system (nv-hostengine and the other DCGM libraries use RPATH, which may interfere with the system environment).
Could you provide the output of LD_DEBUG=all ./nv-hostengine -n?

@praveenperera
Author

Hey @nikkon-dev this is the output I get: https://gist.github.com/praveenperera/48ca14a4a898ef9a51d9e8b91b5076b1

And the output of ldconfig -p | grep libnvidia-ml.so is

root@nvidia-gpu-metrics-dcgm-exporter-dp86x:/# ldconfig -p | grep libnvidia-ml.so
	libnvidia-ml.so.1 (libc6,x86-64) => /usr/local/nvidia/lib64/libnvidia-ml.so.1
	libnvidia-ml.so (libc6,x86-64) => /usr/local/nvidia/lib64/libnvidia-ml.so

@lhriley

lhriley commented May 5, 2022

Was there any progress on this in the last month? We're seeing the exact same issue on GKE, and it would be great to get some actual metrics from the GPUs.

@nikkon-dev
Collaborator

nikkon-dev commented May 10, 2022

I took a look at your configuration, and here is an issue I noticed:
You should not mount the NVIDIA libraries inside the container on your own; the NVIDIA Docker runtime handles that automatically.
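As a sketch only (outside GKE, and assuming the NVIDIA container runtime is the node's default runtime), the container spec would then carry no hostPath mounts for the driver; the runtime injects libnvidia-ml.so based on the NVIDIA_VISIBLE_DEVICES environment variable, roughly:

# Sketch: no /usr/local/nvidia hostPath mount; the NVIDIA runtime injects the libraries
containers:
  - name: exporter
    image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04
    env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all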

@lhriley

lhriley commented May 11, 2022

I don't believe the NVIDIA Docker runtime is in play on GKE, but I could be wrong.

As far as I'm aware, this is all native containerd functionality via their Container-Optimized OS (COS) node image and a DaemonSet they provide to manage the NVIDIA drivers. So I believe we would need to mount the NVIDIA drivers as indicated in the example provided.
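For context, on COS nodes the drivers are installed by Google's DaemonSet roughly like this (URL as documented by GKE at the time; it places the driver under /home/kubernetes/bin/nvidia on the host, which is why that path gets hostPath-mounted):

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml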

@lhriley

lhriley commented May 12, 2022

I just got this working after reading some comments in the archived repo: NVIDIA/gpu-monitoring-tools#96 (comment)

My helm chart values:

###
#
# Reference: https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/values.yaml
#

serviceMonitor:
  enabled: false

resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi

securityContext:
  privileged: true

tolerations:
  - operator: Exists

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-accelerator
              operator: Exists

podAnnotations:
  ad.datadoghq.com/exporter.check_names: |
    ["openmetrics"]
  ad.datadoghq.com/exporter.init_configs: |
    [{}]
  ad.datadoghq.com/exporter.instances: |
    [
      {
        "openmetrics_endpoint": "http://%%host%%:9400/metrics",
        "namespace": "nvidia-dcgm-exporter",
        "metrics": [{"*":"*"}]
      }
    ]

extraHostVolumes:
  - name: vulkan-icd-mount
    hostPath: /home/kubernetes/bin/nvidia/vulkan/icd.d
  - name: nvidia-install-dir-host
    hostPath: /home/kubernetes/bin/nvidia

extraVolumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
    readOnly: true
  - name: vulkan-icd-mount
    mountPath: /etc/vulkan/icd.d
    readOnly: true

and...

❯ kubectl -n nvidia-dcgm-exporter logs -f nvidia-dcgm-exporter-4m855
time="2022-05-12T00:10:42Z" level=info msg="Starting dcgm-exporter"
time="2022-05-12T00:10:42Z" level=info msg="DCGM successfully initialized!"
time="2022-05-12T00:10:43Z" level=info msg="Collecting DCP Metrics"
time="2022-05-12T00:10:43Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2022-05-12T00:10:44Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-05-12T00:10:44Z" level=info msg="Pipeline starting"
time="2022-05-12T00:10:44Z" level=info msg="Starting webserver"
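For anyone following along, roughly how these values get applied with the upstream chart (repo URL as in the dcgm-exporter README; release and namespace names are just what I use):

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm upgrade --install nvidia-dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace nvidia-dcgm-exporter --create-namespace \
  --values values.yaml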

@praveenperera
Author

praveenperera commented May 12, 2022

I just got this working after reading some comments in the archived repo: NVIDIA/gpu-monitoring-tools#96 (comment)

....

I'll try that, thanks!

@vanHavel

Thanks a lot for sharing the values. I also got it running on GKE with that setup.
I had to bump the memory request up to 256Mi; otherwise the pods got OOMKilled.
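Concretely, that's just a tweak to the resources block in the values above; in my sketch I bump both the request and the limit, since the OOM kill is enforced against the limit:

resources:
  limits:
    cpu: 100m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 256Mi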

@xiaoyifan

Thanks for sharing the values. I did the same thing, bumping the memory to 256Mi, and things are working now. But it's odd: out of 17 pods, 4 are still hitting the CrashLoopBackOff issue. Not sure if anyone has a clue.
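If it helps, a quick sketch for checking whether the remaining crashers are also memory-related (namespace and pod name are placeholders):

kubectl -n nvidia-dcgm-exporter describe pod <crashing-pod> | grep -A 5 'Last State'
kubectl -n nvidia-dcgm-exporter logs <crashing-pod> --previous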
