1. Quick Debug Checklist
1. Issue or feature description
One of my nodes has a legacy GPU that is required for the machine to boot, and I am unable to exclude ("un-label") this node from receiving the full GPU Operator installation.
This is a MicroK8s cluster running 1.22 with the latest NVIDIA gpu-operator. The GTX 1070 GPU works just fine.
2. Steps to reproduce the issue
Multinode cluster with 2 nodes:
- one node with a legacy GPU
- one node with a newer GPU (a GTX 1070 in my case)
Label the node using `kubectl label node <legacy_node> legacy-gpu=true`
Install the gpu-operator helm chart with these values:

    operator:
      defaultRuntime: containerd
      runtimeClass: nvidia-container-runtime
      tolerations:
      - key: "legacy-gpu"
        operator: "Exists"
        effect: "NoSchedule"

The operator is installed on the proper node, but all other components are propagated to both nodes.
I tried adding this toleration to each section of the helm chart values and it has no effect. The components just keep being re-installed on each node (even after uninstalling and reinstalling the chart).
3. Information to attach (optional if deemed irrelevant)

    NAME                                   READY   STATUS                   RESTARTS      AGE     IP             NODE          NOMINATED NODE   READINESS GATES
    nvidia-driver-daemonset-bkcbr          1/1     Running                  0             7m57s   10.1.134.241   k8s-gpu       <none>           <none>
    gpu-feature-discovery-9slwv            1/1     Running                  0             7m54s   10.1.134.201   k8s-gpu       <none>           <none>
    nvidia-cuda-validator-gf9f9            0/1     Completed                0             6m17s   10.1.134.249   k8s-gpu       <none>           <none>
    nvidia-device-plugin-daemonset-dp9jp   1/1     Running                  0             7m56s   10.1.134.255   k8s-gpu       <none>           <none>
    nvidia-device-plugin-validator-cv4sj   0/1     Completed                0             6m6s    10.1.134.203   k8s-gpu       <none>           <none>
    nvidia-operator-validator-jqx6v        1/1     Running                  0             7m57s   10.1.134.199   k8s-gpu       <none>           <none>
    nvidia-dcgm-hk6pl                      1/1     Running                  0             7m55s   10.1.134.230   k8s-gpu       <none>           <none>
    nvidia-dcgm-exporter-trtxs             1/1     Running                  0             7m55s   10.1.134.239   k8s-gpu       <none>           <none>
    nvidia-driver-daemonset-klg6x          0/1     Init:CrashLoopBackOff    6 (74s ago)   7m57s   10.1.11.56     k8s-cluster   <none>           <none>
    nvidia-dcgm-ks99z                      0/1     Init:RunContainerError   3 (13s ago)   74s     10.1.11.28     k8s-cluster   <none>           <none>
    gpu-feature-discovery-j9wh9            0/1     Init:RunContainerError   3 (12s ago)   74s     10.1.11.51     k8s-cluster   <none>           <none>
    nvidia-device-plugin-daemonset-f59vn   0/1     Init:RunContainerError   3 (11s ago)   74s     10.1.11.39     k8s-cluster   <none>           <none>
    nvidia-dcgm-exporter-jq8n9             0/1     Init:CrashLoopBackOff    3 (16s ago)   74s     10.1.11.62     k8s-cluster   <none>           <none>
    nvidia-operator-validator-4876k        0/1     Init:CrashLoopBackOff    3 (13s ago)   74s     10.1.11.1      k8s-cluster   <none>           <none>

If there's any other missing info, let me know.