
dcgm-exporter failed to start on GKE cluster (v1.16.11-gke.5) #96

Open
Dimss opened this issue Jul 21, 2020 · 12 comments

@Dimss

Dimss commented Jul 21, 2020

I have a GKE cluster with a GPU node pool.
The GPU nodes have valid labels, the nvidia device plugin pods are running on each GPU node, and the nvidia driver daemon set is deployed as well.
Kubernetes detects 1 allocatable GPU.
However, when I deploy `kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/2.0.0-rc.12/dcgm-exporter.yaml`, the dcgm-exporter pod goes into CrashLoopBackOff with the following error:

time="2020-07-21T18:49:33Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-07-21T18:49:33Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

The exact same setup works on AKS and EKS clusters without any issue.

Is there any limitation to using dcgm-exporter on GKE?
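
For reference, a quick way to double-check the points above (node labels, device plugin pods, allocatable GPU); the node name is a placeholder:

# GPU-related labels on the nodes
kubectl get nodes --show-labels | grep -i nvidia

# allocatable GPUs reported for a given node (placeholder name)
kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"

# device plugin / driver installer pods
kubectl get pods -n kube-system | grep -i nvidia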

@Morishiri

I have exactly the same issue right now.

@tamizhgeek

Same issue in EKS cluster now.

@tanrobotix

Same with on-premise cluster

@vizgin

vizgin commented Sep 28, 2020

Same with on-premise OKD 4.4 cluster

@tanrobotix

Well, I think I solved the problem.

@Dimss

Dimss commented Sep 29, 2020

@tanrobotix can you share your solution?

@tanrobotix

tanrobotix commented Sep 29, 2020

In my case, /etc/docker/daemon.json was:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

The default-runtime was not set. It should be set, and the runtime path should be absolute:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Then reload the systemd daemon and restart the Docker service:

systemctl daemon-reload
systemctl restart docker
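
If it helps, the change can be verified after the restart with:

# should print "nvidia" once the default runtime is picked up
docker info --format '{{.DefaultRuntime}}'

# should list the nvidia runtime alongside runc
docker info --format '{{json .Runtimes}}'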

@aaroncnb

aaroncnb commented Dec 19, 2020

I am having the same issue with an on-premise cluster (1 VM-based master node, 2 DGX Station GPU nodes) set up via Ansible and DeepOps:

https://github.com/NVIDIA/deepops/tree/master/docs/k8s-cluster

$:~/deepops# kubectl get nodes
NAME     STATUS                        ROLES    AGE   VERSION
gpu01    NotReady,SchedulingDisabled   <none>   2d    v1.18.9
gpu02    Ready                         <none>   23h   v1.18.9
mgmt01   Ready                         master   2d    v1.18.9
$~/deepops# kubectl get pods
NAME                                                              READY   STATUS             RESTARTS   AGE
dcgm-exporter-1608298867-jstgm                                    0/1     CrashLoopBackOff   242        19h
dcgm-exporter-1608298867-n52g2                                    1/1     Running            0          19h
gpu-operator-774ff7994c-gdpdl                                     1/1     Running            29         29h
gpu-test                                                          0/1     Terminating        0          23h
ingress-nginx-controller-6b4fdfdcf7-sb5hs                         1/1     Running            0          28h
nvidia-gpu-operator-node-feature-discovery-master-7d88b984j9grb   1/1     Running            3          29h
nvidia-gpu-operator-node-feature-discovery-worker-jpc24           1/1     Running            49         29h
nvidia-gpu-operator-node-feature-discovery-worker-lzn58           1/1     Running            26         29h
nvidia-gpu-operator-node-feature-discovery-worker-wfqgn           0/1     CrashLoopBackOff   195        19h
$~/deepops# kubectl logs pod/dcgm-exporter-1608298867-jstgm
time="2020-12-19T09:21:20Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-12-19T09:21:20Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

However, /etc/docker/daemon.json looks fine (and identical) on both GPU nodes. I'm not sure whether the persistent NotReady status of one of the nodes is related to this dcgm-exporter issue or not.

{
    "default-runtime": "nvidia",
    "default-shm-size": "1G",
    "default-ulimits": {
        "memlock": {
            "hard": -1,
            "name": "memlock",
            "soft": -1
        },
        "stack": {
            "hard": 67108864,
            "name": "stack",
            "soft": 67108864
        }
    },
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Many thanks in advance for any advice :)
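
(Generic checks that might be relevant here: why gpu01 is NotReady, and whether the nvidia runtime is actually the default on the GPU nodes.)

# node conditions for the NotReady node
kubectl describe node gpu01 | grep -A 10 Conditions

# on each GPU node: driver health and Docker's default runtime
nvidia-smi
docker info --format '{{.DefaultRuntime}}'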

@andre-lx

andre-lx commented Feb 12, 2021

Hi. Based on this Stack Overflow question, we solved it with the manifest below:

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  type: ClusterIP
  ports:
  - name: "metrics"
    port: 9400
    targetPort: 9400
    protocol: TCP
  selector:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter
      app.kubernetes.io/component: "dcgm-exporter"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/instance: dcgm-exporter
        app.kubernetes.io/component: "dcgm-exporter"
    spec:
      serviceAccountName: dcgm-exporter
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        image: nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04
        imagePullPolicy: IfNotPresent
        name: dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
        - mountPath: /usr/local/nvidia
          name: nvidia-install-dir-host
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /home/kubernetes/bin/nvidia
          type: ""
        name: nvidia-install-dir-host
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter
      app.kubernetes.io/component: "dcgm-exporter"
  endpoints:
  - port: "metrics"
    path: "/metrics"
    interval: "15s"
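
If it's useful, once the manifest is applied you can check that the exporter is up and serving metrics (namespace and names as defined above):

kubectl -n monitoring get pods -l app.kubernetes.io/name=dcgm-exporter
kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | head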

To make this work with "all the cloud providers", use node affinity with your own labels (assuming you have at least one user-defined label on the GPU nodes):

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: type
                operator: In
                values:
                - label-key1
                - label-key2
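
For example, with a hypothetical label key "type", a GPU node needs to carry one of the listed values for the DaemonSet to schedule there:

# hypothetical label; replace the key and value with whatever you already use on your GPU nodes
kubectl label node <gpu-node-name> type=label-key1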

@omesser

omesser commented Jul 1, 2021

Trying the suggested solution here of:

  • Adding securityContext.privileged=true
  • Adding the nvidia-install-dir-host hostPath volume + volumeMount

We've seen that this resolved the issue for GKE using nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04, but with the more recent nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04 it's still broken.
We've opted to downgrade for now, of course.
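
For anyone who only wants the delta, the two changes correspond roughly to this fragment of the exporter DaemonSet (paths taken from the GKE manifest posted earlier in this thread):

      containers:
      - name: dcgm-exporter
        # run privileged so DCGM can reach the GPUs
        securityContext:
          privileged: true
        volumeMounts:
        # mount the host driver install dir (GKE path)
        - name: nvidia-install-dir-host
          mountPath: /usr/local/nvidia
      volumes:
      - name: nvidia-install-dir-host
        hostPath:
          path: /home/kubernetes/bin/nvidia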

@sadovnikov

Tried downgrading dcgm-exporter from 2.2.9-2.4.1-ubuntu20.04 to 2.0.13-2.1.1-ubuntu18.04, and setting securityContext.privileged=true - keeps failing.

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  48s                default-scheduler  Successfully assigned monitoring-system/dcgm-exporter-8z8n5 to gke-np-epo-sentalign-euwe4a-gke--gpus-0fd942e9-zom5
  Normal   Pulling    48s                kubelet            Pulling image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
  Normal   Pulled     37s                kubelet            Successfully pulled image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04" in 10.688555255s
  Normal   Created    16s (x3 over 34s)  kubelet            Created container exporter
  Normal   Started    16s (x3 over 34s)  kubelet            Started container exporter
  Normal   Pulled     16s (x2 over 33s)  kubelet            Container image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04" already present on machine
  Warning  BackOff    12s (x5 over 32s)  kubelet            Back-off restarting failed container

❯ kubectl -n monitoring-system logs -p dcgm-exporter-8z8n5
time="2021-09-14T07:03:10Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2021-09-14T07:03:10Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

I also tried adding all the volumes and volumeMounts from the nvidia-gpu-device-plugin DaemonSet, which GKE adds, but that didn't fix the problem either.
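
(For anyone wanting to do the same comparison, the device plugin's volumes can be dumped with something like the command below; I'm assuming the DaemonSet sits in kube-system.)

kubectl -n kube-system get daemonset nvidia-gpu-device-plugin -o yaml | grep -A 20 'volumes:'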

@ciiiii

ciiiii commented Oct 25, 2021

> Trying the suggested solution here of:
>
> * Adding `securityContext.privileged=true`
> * Adding `nvidia-install-dir-host` hostPath volume + volumeMount
>
> We've seen that this resolved the issue for GKE using `nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04`, but with the more recent `nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04` it's still broken.
> We've opted to downgrade for now, of course.

It works for me.
