Describe the bug
On clusters with 2 or more GPU nodes, the nvidia-driver-daemonset permanently reports
Desired=1 and Misscheduled=1, even though driver pods are running and healthy on every GPU node.
The root cause is a requiredDuringSchedulingIgnoredDuringExecution podAntiAffinity
rule hardcoded in assets/state-driver/0500_daemonset.yaml:
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/component
              operator: In
              values:
                - nvidia-driver
        topologyKey: kubernetes.io/hostname
```

This conflicts with the fundamental behavior of a DaemonSet. The DaemonSet controller
marks the pod on the second node as misscheduled because the anti-affinity rule prohibits
scheduling there, but it cannot evict the pod due to IgnoredDuringExecution.
The result is a permanent Misscheduled=1 state.
This rule cannot be overridden by setting daemonsets.affinity: {} in values.yaml
or in ClusterPolicy, because it is injected directly from the asset file.
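For reference, this is the kind of override that has no effect here (a sketch of the attempted workaround described above; the exact values.yaml layout of the chart is assumed):

```yaml
# values.yaml -- attempted override; has no effect on the driver DaemonSet
# because the podAntiAffinity rule comes from the asset file, not the chart values
daemonsets:
  affinity: {}
```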
Reproduced on both v25.3.4 and v26.3.0.
To Reproduce
- Deploy the GPU Operator on a cluster with 2 or more GPU nodes
- Run `kubectl get ds nvidia-driver-daemonset -n gpu-operator` and observe `DESIRED=1`, `AVAILABLE=1`, while driver pods are running on all GPU nodes
- Run `kubectl describe ds nvidia-driver-daemonset -n gpu-operator | grep Misscheduled` and observe `Number of Nodes Misscheduled: 1`
Expected behavior
DESIRED should match the number of GPU nodes and Misscheduled should be 0.
A DaemonSet already guarantees one pod per eligible node, so a requiredDuringScheduling
podAntiAffinity rule is redundant and causes incorrect status reporting on multi-GPU-node clusters.
Suggested Fix
Replace requiredDuringSchedulingIgnoredDuringExecution with
preferredDuringSchedulingIgnoredDuringExecution, or remove the podAntiAffinity
block entirely from assets/state-driver/0500_daemonset.yaml.
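A minimal sketch of the softened rule for the first option (same label selector as the current asset, rewritten in the WeightedPodAffinityTerm shape that preferred affinity requires; the weight value is an assumption, not a tested patch):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100          # arbitrary weight; any value 1-100 is valid
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - nvidia-driver
          topologyKey: kubernetes.io/hostname
```

Unlike the required form, a preferred rule only influences scheduler scoring, so the DaemonSet controller no longer counts pods on additional GPU nodes as misscheduled.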
Environment
- GPU Operator Version: v25.3.4 (also reproduced on v26.3.0)
- OS: Ubuntu 24.04
- Kernel Version: 6.8.0-87-generic
- Container Runtime Version: containerd (k3s embedded)
- Kubernetes Distro and Version: k3s v1.33.5
Information to attach
Output of `kubectl get ds -n gpu-operator`:

```
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
nvidia-driver-daemonset     1         1         1       1            1
nvidia-operator-validator   1         1         1       1            1
```

Both daemonsets show DESIRED=1 while driver pods are Running/Ready on 2 GPU nodes:

```
nvidia-driver-daemonset-xxxxx   1/1   Running   k3s-gpu-worker1
nvidia-driver-daemonset-yyyyy   1/1   Running   k3s-gpu-worker2
```

`kubectl describe` on the driver DaemonSet reports:

```
Number of Nodes Misscheduled: 1
```