dcgm-exporter failed to start on GKE cluster (v1.16.11-gke.5) #96
Comments
I have exactly the same issue right now.
Same issue in an EKS cluster now.
Same with an on-premise cluster.
Same with an on-premise OKD 4.4 cluster.
Well, I think I solved the problem.
@tanrobotix can you share your solution? |
In my case the Docker default-runtime was not set. It should be set with an absolute path, then reload the daemon and restart the Docker service.
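A minimal sketch of what that fix usually looks like in `/etc/docker/daemon.json` (the runtime path `/usr/bin/nvidia-container-runtime` is an assumption; use the absolute path where `nvidia-container-runtime` is actually installed on your node):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

Then apply it, as the comment above describes:

```shell
sudo systemctl daemon-reload
sudo systemctl restart docker
```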
I am having the same issue with an on-premise cluster (1 VM-based master node, 2 DGX Station GPU nodes) setup via Ansible and DeepOps: https://github.com/NVIDIA/deepops/tree/master/docs/k8s-cluster
Many thanks in advance for any advice :)
Hi. Based on this Stack Overflow question, we solved it using affinity. To work with all the cloud providers, use affinity with labels (expecting that you have at least one user label defined on the GPU nodes):
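A sketch of the kind of affinity stanza meant here, added to the dcgm-exporter DaemonSet pod spec. The label key/value `gpu: "true"` is an assumption; substitute whatever user label your GPU nodes actually carry:

```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu        # hypothetical user label on the GPU nodes
                    operator: In
                    values:
                      - "true"
```

This keeps the exporter off CPU-only nodes without relying on any one cloud provider's built-in GPU node labels.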
Trying the suggested solution here of:
Tried downgrading.
Also tried adding all.
It works for me.
I've got a GKE cluster with a GPU node pool. The GPU nodes have valid labels, the NVIDIA device plugin pods are running on each GPU node, and the NVIDIA driver DaemonSet was deployed as well. Kubernetes detects 1 allocatable GPU.
However, when I deploy the exporter with
kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/2.0.0-rc.12/dcgm-exporter.yaml
the dcgm-exporter pod is in CrashLoopBackOff with the following error:
The exact same setup works on AKS and EKS clusters without any issue.
Is there any limitation to using dcgm-exporter on GKE?