Describe the bug
I'm configuring NVIDIA GPU Operator 26.3.0 on OpenShift Virtualization 4.21 to run VMs using vGPUs from a Tesla T4 card. The driver version I'm using is 595.58.02.
Tesla cards don't support SR-IOV:
# nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Sun Apr 12 08:03:33 2026
Driver Version : 595.58.02
CUDA Version : Not Found
vGPU Driver Capability
Heterogenous Multi-vGPU : Supported
Attached GPUs : 1
GPU 00000000:41:00.0
Product Name : Tesla T4
Product Brand : NVIDIA
Product Architecture : Turing
Display Mode : Requested functionality has been deprecated
Display Attached : Yes
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
vGPU Device Capability
Fractional Multi-vGPU : Supported
Heterogeneous Time-Slice Profiles : Supported
Heterogeneous Time-Slice Sizes : Supported
Homogeneous Placements : Not Supported
MIG Time-Slicing : Not Supported
MIG Time-Slicing Mode : Disabled
[...]
GPU Virtualization Mode
Virtualization Mode : Host VGPU
Host VGPU Mode : Non SR-IOV <----
vGPU Heterogeneous Mode : Disabled
However, when querying sriov_totalvfs on the node, I get 16:
# lspci|grep -i nvidia
a1:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
# cat /sys/bus/pci/devices/0000\:a1\:00.0/sriov_numvfs
0
# cat /sys/bus/pci/devices/0000\:a1\:00.0/sriov_totalvfs
16
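For context, the two sysfs files answer different questions: sriov_totalvfs is what the device's config space advertises, while sriov_numvfs is what is actually enabled. A small sketch (hypothetical helper, not part of the operator) that makes the distinction explicit:

```shell
# Sketch: distinguish "VFs merely advertised" from "VFs actually enabled"
# for a PCI device, by comparing sriov_totalvfs with sriov_numvfs.
# The argument is a device directory, e.g. /sys/bus/pci/devices/0000:a1:00.0.
sriov_state() {
    dev="$1"
    total=$(cat "$dev/sriov_totalvfs" 2>/dev/null || echo 0)
    num=$(cat "$dev/sriov_numvfs" 2>/dev/null || echo 0)
    if [ "$num" -gt 0 ]; then
        echo "enabled ($num/$total VFs)"
    elif [ "$total" -gt 0 ]; then
        echo "advertised-only ($total VFs possible, none enabled)"
    else
        echo "no SR-IOV"
    fi
}
```

Pointed at the device directory above, this reports "advertised-only" on the T4, which matches the "Non SR-IOV" host vGPU mode that nvidia-smi prints.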
This makes the nvidia-sandbox-validator pod enter a loop in its init container waiting for the VFs, and it crashes after a while:
time="2026-04-06T10:22:16Z" level=info msg="Waiting for VFs to be available..."
2026/04/06 10:22:16 WARNING: unable to detect IOMMU FD for [0000:a1:00.0 open /sys/bus/pci/devices/0000:a1:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
time="2026-04-06T10:22:16Z" level=info msg="Waiting for VFs: 0/16 enabled across 1 GPU(s)"
2026/04/06 10:22:21 WARNING: unable to detect IOMMU FD for [0000:a1:00.0 open /sys/bus/pci/devices/0000:a1:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
time="2026-04-06T10:22:21Z" level=info msg="Waiting for VFs: 0/16 enabled across 1 GPU(s)"
From what I see, the operator waits for VFs whenever sriov_totalvfs is greater than 0, which is not correct for this card, since it doesn't support SR-IOV (the check is in gpu-operator/cmd/nvidia-validator/main.go, line 1792 at afda1f7, via go-nvlib):
https://github.com/NVIDIA/go-nvlib/blob/68058cecb77b8d5f014caec9a8e54e3485000b8e/pkg/nvpci/nvpci.go#L509
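To make the suspected logic concrete, here is a hedged sketch (not the operator's actual code; the helper name and the parsing are mine) of a stricter gating decision that also consults the "Host VGPU Mode" field shown in the nvidia-smi output above:

```shell
# Hypothetical sketch of a stricter "should we wait for VFs?" decision.
# The validator currently appears to wait whenever sriov_totalvfs > 0;
# this variant additionally bails out when the driver reports a
# non-SR-IOV host vGPU mode ("Host VGPU Mode : Non SR-IOV" in `nvidia-smi -q`).
should_wait_for_vfs() {
    totalvfs="$1"   # contents of sriov_totalvfs for the device
    smi_output="$2" # output of `nvidia-smi -q` for the same GPU
    case "$smi_output" in
        *"Host VGPU Mode"*"Non SR-IOV"*)
            echo no   # card runs legacy (non-SR-IOV) vGPU; no VFs will appear
            return
            ;;
    esac
    if [ "$totalvfs" -gt 0 ]; then
        echo yes
    else
        echo no
    fi
}
```

With the values from this node (totalvfs=16, mode "Non SR-IOV") the sketch would skip the wait instead of looping.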
To Reproduce
- Use OpenShift 4.21 cluster with virtualization operator (KubeVirt)
- Add kernel cmdline options amd_iommu=on iommu=pt to nodes
- Label nodes with nvidia.com/gpu.workload.config=vm-vgpu
- Create driver image for rhel9 with drivers version 595.58.02
- Install NVIDIA GPU Operator 26.3.0
- Create ClusterPolicy:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  daemonsets:
    updateStrategy: RollingUpdate
  dcgm:
    enabled: true
  dcgmExporter: {}
  devicePlugin: {}
  driver:
    enabled: false
    kernelModuleType: auto
  gfd: {}
  mig:
    strategy: single
  migManager:
    enabled: true
  nodeStatusExporter:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  sandboxDevicePlugin:
    enabled: true
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: "true"
  vfioManager:
    enabled: false
  vgpuDeviceManager:
    config:
      default: default
      name: vgpu-devices-config
    enabled: false
  vgpuManager:
    enabled: true
    image: vgpu-manager
    repository: image-registry.openshift-image-registry.svc:5000/nvidia-gpu-operator
    version: 595.58.02
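As a pre-flight check for the kernel-argument step above, a small sketch (hypothetical helper) that verifies a node's command line contains the required IOMMU options:

```shell
# Sketch: verify that a kernel command line (e.g. the contents of
# /proc/cmdline on a worker node) carries the IOMMU options from the
# reproduction steps. Hypothetical helper; the relative order of the
# two options on the command line does not matter.
iommu_args_present() {
    cmdline="$1"
    case "$cmdline" in
        *amd_iommu=on*) : ;;     # first option present, keep checking
        *) return 1 ;;
    esac
    case "$cmdline" in
        *iommu=pt*) return 0 ;;  # both options present
        *) return 1 ;;
    esac
}
```

Usage on a node: `iommu_args_present "$(cat /proc/cmdline)" || echo "IOMMU args missing"`.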
Expected behavior
No pods crashing
Environment (please provide the following information):
- GPU Operator Version: 26.3.0
- OS: Red Hat Enterprise Linux CoreOS 9.6.20260324-0
- Kernel Version: 5.14.0-570.103.1.el9_6.x86_64
- Container Runtime Version: CRI-O 1.34.6-2.rhaos4.21.gitbca534a.el9
- Kubernetes Distro and Version: Red Hat OpenShift 4.21
Information to attach (optional if deemed irrelevant)
$ oc get pods
NAME READY STATUS RESTARTS AGE
gpu-operator-5455bd7dc6-qhk2l 1/1 Running 0 3d21h
nvidia-sandbox-device-plugin-daemonset-mhg8d 0/1 Init:1/2 0 2d21h
nvidia-sandbox-device-plugin-daemonset-xrbf4 0/1 Init:1/2 0 2d21h
nvidia-sandbox-validator-mhld9 0/1 Init:1/3 610 (116s ago) 2d21h
nvidia-sandbox-validator-w8psk 0/1 Init:Error 608 (5m36s ago) 2d21h
nvidia-vgpu-manager-daemonset-9.6.20260324-0-4f5w7 2/2 Running 0 2d21h
nvidia-vgpu-manager-daemonset-9.6.20260324-0-slkq4 2/2 Running 0 2d21h
$ oc get ds
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 0 0 0 0 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 2d21h
nvidia-container-toolkit-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.container-toolkit=true 2d21h
nvidia-dcgm 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm=true 2d21h
nvidia-dcgm-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.dcgm-exporter=true 2d21h
nvidia-device-plugin-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true 2d21h
nvidia-device-plugin-mps-control-daemon 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 2d21h
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 2d21h
nvidia-node-status-exporter 0 0 0 0 0 nvidia.com/gpu.deploy.node-status-exporter=true 2d21h
nvidia-operator-validator 0 0 0 0 0 nvidia.com/gpu.deploy.operator-validator=true 2d21h
nvidia-sandbox-device-plugin-daemonset 2 2 0 2 0 nvidia.com/gpu.deploy.sandbox-device-plugin=true 2d21h
nvidia-sandbox-validator 2 2 0 2 0 nvidia.com/gpu.deploy.sandbox-validator=true 2d21h
nvidia-vgpu-manager-daemonset-9.6.20260324-0 2 2 2 2 2 feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=9.6.20260324-0,nvidia.com/gpu.deploy.vgpu-manager=true 2d21h
$ oc describe pod nvidia-sandbox-validator-mhld9
Name: nvidia-sandbox-validator-mhld9
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-sandbox-validator
Node: dell-r7525-01.gsslab.brq2.redhat.com/10.37.192.52
Start Time: Sun, 12 Apr 2026 12:18:18 +0200
Labels: app=nvidia-sandbox-validator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=7868c9cbcc
pod-template-generation=1
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.129.2.150/23"],"mac_address":"0a:58:0a:81:02:96","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.128.0....
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.129.2.150"
],
"mac": "0a:58:0a:81:02:96",
"default": true,
"dns": {}
}]
openshift.io/scc: nvidia-sandbox-validator
security.openshift.io/validated-scc-subject-type: serviceaccount
Status: Pending
IP: 10.129.2.150
IPs:
IP: 10.129.2.150
Controlled By: DaemonSet/nvidia-sandbox-validator
Init Containers:
vfio-pci-validation:
Container ID: cri-o://82b67f651ac1076a93417cb885b70851e1ca8bb37b8c340c40dd71ce755eeeb8
Image: nvcr.io/nvidia/gpu-operator:v26.3.0@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
Image ID: nvcr.io/nvidia/gpu-operator@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 12 Apr 2026 12:18:19 +0200
Finished: Sun, 12 Apr 2026 12:18:19 +0200
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: vfio-pci
NODE_NAME: (v1:spec.nodeName)
DEFAULT_GPU_WORKLOAD_CONFIG: vm-vgpu
Mounts:
/host from host-root (ro)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xqks5 (ro)
vgpu-manager-validation:
Container ID: cri-o://895ffde9b861cdc8fad1537a81f299eb2c1212f6699f6d9c563fe70479f556df
Image: nvcr.io/nvidia/gpu-operator:v26.3.0@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
Image ID: nvcr.io/nvidia/gpu-operator@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Running
Started: Wed, 15 Apr 2026 09:33:37 +0200
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 15 Apr 2026 09:28:25 +0200
Finished: Wed, 15 Apr 2026 09:33:25 +0200
Ready: False
Restart Count: 610
Environment:
WITH_WAIT: true
COMPONENT: vgpu-manager
NODE_NAME: (v1:spec.nodeName)
DEFAULT_GPU_WORKLOAD_CONFIG: vm-vgpu
Mounts:
/host from host-root (ro)
/run/nvidia/driver from driver-install-path (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xqks5 (ro)
vgpu-devices-validation:
Container ID:
Image: nvcr.io/nvidia/gpu-operator:v26.3.0@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: vgpu-devices
NODE_NAME: (v1:spec.nodeName)
DEFAULT_GPU_WORKLOAD_CONFIG: vm-vgpu
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xqks5 (ro)
Containers:
nvidia-sandbox-validator:
Container ID:
Image: nvcr.io/nvidia/gpu-operator:v26.3.0@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo all validations are successful; while true; do sleep 86400; done
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xqks5 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-path:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
kube-api-access-xqks5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
Optional: false
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.sandbox-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 8m23s (x610 over 2d21h) kubelet Created container: vgpu-manager-validation
Warning BackOff 3m22s (x5034 over 2d21h) kubelet Back-off restarting failed container vgpu-manager-validation in pod nvidia-sandbox-validator-mhld9_nvidia-gpu-operator(6996e347-078c-4060-b52c-6dfa9e26ebef)
Normal Pulled 3m11s (x611 over 2d21h) kubelet Container image "nvcr.io/nvidia/gpu-operator:v26.3.0@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310" already present on machine