
dcgm-exporter failed to start on GKE cluster (v1.16.11-gke.5) #96

Open
Dimss opened this issue Jul 21, 2020 · 12 comments

@Dimss

Dimss commented Jul 21, 2020

I have a GKE cluster with a GPU node pool.
The GPU nodes have valid labels, the nvidia device plugin pods are running on each GPU node, and the nvidia driver daemon set is deployed as well.
Kubernetes detects 1 allocatable GPU.
However, when I deploy `kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/2.0.0-rc.12/dcgm-exporter.yaml`, the dcgm-exporter pod goes into CrashLoopBackOff with the following error:

time="2020-07-21T18:49:33Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-07-21T18:49:33Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

The exact same setup works on AKS and EKS clusters without any issue.

Is there any limitation to using dcgm-exporter on GKE?
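
For reference, a quick way to double-check the points above (node labels, device plugin pods, allocatable GPU); the node name is a placeholder:

# GPU-related labels on the nodes
kubectl get nodes --show-labels | grep -i nvidia

# allocatable GPUs reported for a given node (placeholder name)
kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"

# device plugin / driver installer pods
kubectl get pods -n kube-system | grep -i nvidia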

@Morishiri

I have exactly the same issue right now.

@tamizhgeek

Same issue in EKS cluster now.

@tanrobotix

Same with on-premise cluster

@vizgin

vizgin commented Sep 28, 2020

Same with on-premise OKD 4.4 cluster

@tanrobotix

Well, I think I solved the problem.

@Dimss

Dimss commented Sep 29, 2020

@tanrobotix can you share your solution?

@tanrobotix

tanrobotix commented Sep 29, 2020

In my case, /etc/docker/daemon.json was:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

The default-runtime was not set. It should be set, and the runtime path should be absolute:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Then reload the systemd daemon and restart the Docker service:

systemctl daemon-reload
systemctl restart docker
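
If it helps, the change can be verified after the restart with:

# should print "nvidia" once the default runtime is picked up
docker info --format '{{.DefaultRuntime}}'

# should list the nvidia runtime alongside runc
docker info --format '{{json .Runtimes}}'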

@aaroncnb

aaroncnb commented Dec 19, 2020

I am having the same issue with an on-premise cluster (1 VM-based master node, 2 DGX Station GPU nodes) set up via Ansible and DeepOps:

https://github.com/NVIDIA/deepops/tree/master/docs/k8s-cluster

$:~/deepops# kubectl get nodes
NAME     STATUS                        ROLES    AGE   VERSION
gpu01    NotReady,SchedulingDisabled   <none>   2d    v1.18.9
gpu02    Ready                         <none>   23h   v1.18.9
mgmt01   Ready                         master   2d    v1.18.9
$~/deepops# kubectl get pods
NAME                                                              READY   STATUS             RESTARTS   AGE
dcgm-exporter-1608298867-jstgm                                    0/1     CrashLoopBackOff   242        19h
dcgm-exporter-1608298867-n52g2                                    1/1     Running            0          19h
gpu-operator-774ff7994c-gdpdl                                     1/1     Running            29         29h
gpu-test                                                          0/1     Terminating        0          23h
ingress-nginx-controller-6b4fdfdcf7-sb5hs                         1/1     Running            0          28h
nvidia-gpu-operator-node-feature-discovery-master-7d88b984j9grb   1/1     Running            3          29h
nvidia-gpu-operator-node-feature-discovery-worker-jpc24           1/1     Running            49         29h
nvidia-gpu-operator-node-feature-discovery-worker-lzn58           1/1     Running            26         29h
nvidia-gpu-operator-node-feature-discovery-worker-wfqgn           0/1     CrashLoopBackOff   195        19h
$~/deepops# kubectl logs pod/dcgm-exporter-1608298867-jstgm
time="2020-12-19T09:21:20Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-12-19T09:21:20Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

However, /etc/docker/daemon.json looks fine (and identical) on both GPU nodes. I'm not sure whether the persistent NotReady status of one of the nodes is related to this dcgm-exporter issue or not.

{
    "default-runtime": "nvidia",
    "default-shm-size": "1G",
    "default-ulimits": {
        "memlock": {
            "hard": -1,
            "name": "memlock",
            "soft": -1
        },
        "stack": {
            "hard": 67108864,
            "name": "stack",
            "soft": 67108864
        }
    },
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Many thanks in advance for any advice :)
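
(Generic checks that might be relevant here: why gpu01 is NotReady, and whether the nvidia runtime is actually the default on the GPU nodes.)

# node conditions for the NotReady node
kubectl describe node gpu01 | grep -A 10 Conditions

# on each GPU node: driver health and Docker's default runtime
nvidia-smi
docker info --format '{{.DefaultRuntime}}'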

@andre-lx

andre-lx commented Feb 12, 2021

Hi. Based on this Stack Overflow question, we solved it with the manifest below:

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  type: ClusterIP
  ports:
  - name: "metrics"
    port: 9400
    targetPort: 9400
    protocol: TCP
  selector:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter
      app.kubernetes.io/component: "dcgm-exporter"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/instance: dcgm-exporter
        app.kubernetes.io/component: "dcgm-exporter"
    spec:
      serviceAccountName: dcgm-exporter
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        image: nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04
        imagePullPolicy: IfNotPresent
        name: dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
        - mountPath: /usr/local/nvidia
          name: nvidia-install-dir-host
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /home/kubernetes/bin/nvidia
          type: ""
        name: nvidia-install-dir-host
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter
      app.kubernetes.io/component: "dcgm-exporter"
  endpoints:
  - port: "metrics"
    path: "/metrics"
    interval: "15s"
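
If it's useful, once the manifest is applied you can check that the exporter is up and serving metrics (namespace and names as defined above):

kubectl -n monitoring get pods -l app.kubernetes.io/name=dcgm-exporter
kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | head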

To make this work with "all the cloud providers", use node affinity with your own labels (assuming you have at least one user-defined label on the GPU nodes):

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: type
                operator: In
                values:
                - label-key1
                - label-key2
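
For example, with a hypothetical label key "type", a GPU node needs to carry one of the listed values for the DaemonSet to schedule there:

# hypothetical label; replace the key and value with whatever you already use on your GPU nodes
kubectl label node <gpu-node-name> type=label-key1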

@omesser

omesser commented Jul 1, 2021

Trying the suggested solution here of:

  • Adding securityContext.privileged=true
  • Adding the nvidia-install-dir-host hostPath volume + volumeMount

We've seen that this resolved the issue for GKE using nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04, but with the more recent nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04 it's still broken.
We've opted to downgrade for now, of course.
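
For anyone who only wants the delta, the two changes correspond roughly to this fragment of the exporter DaemonSet (paths taken from the GKE manifest posted earlier in this thread):

      containers:
      - name: dcgm-exporter
        # run privileged so DCGM can reach the GPUs
        securityContext:
          privileged: true
        volumeMounts:
        # mount the host driver install dir (GKE path)
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
      volumes:
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia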

@sadovnikov

Tried downgrading dcgm-exporter from 2.2.9-2.4.1-ubuntu20.04 to 2.0.13-2.1.1-ubuntu18.04, and setting securityContext.privileged=true - keeps failing.

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  48s                default-scheduler  Successfully assigned monitoring-system/dcgm-exporter-8z8n5 to gke-np-epo-sentalign-euwe4a-gke--gpus-0fd942e9-zom5
  Normal   Pulling    48s                kubelet            Pulling image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
  Normal   Pulled     37s                kubelet            Successfully pulled image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04" in 10.688555255s
  Normal   Created    16s (x3 over 34s)  kubelet            Created container exporter
  Normal   Started    16s (x3 over 34s)  kubelet            Started container exporter
  Normal   Pulled     16s (x2 over 33s)  kubelet            Container image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04" already present on machine
  Warning  BackOff    12s (x5 over 32s)  kubelet            Back-off restarting failed container

❯ kubectl -n monitoring-system logs -p dcgm-exporter-8z8n5
time="2021-09-14T07:03:10Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2021-09-14T07:03:10Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

I also tried adding all the volumes and volumeMounts from the nvidia-gpu-device-plugin DaemonSet, which GKE adds, but that didn't fix the problem either.
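
(For anyone wanting to do the same comparison, the device plugin's volumes can be dumped with something like the command below; I'm assuming the DaemonSet sits in kube-system.)

kubectl -n kube-system get daemonset nvidia-gpu-device-plugin -o yaml | grep -A 20 'volumes:'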

@ciiiii

ciiiii commented Oct 25, 2021

> Trying the suggested solution here of:
>
> * Adding `securityContext.privileged=true`
> * Adding `nvidia-install-dir-host` hostPath volume + volumeMount
>
> We've seen that this resolved the issue for GKE using `nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04`, but with the more recent `nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04` it's still broken.
> We've opted to downgrade for now, of course.

It works for me.
