
[Bug]: #2387

@raviar

Description

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
On a freshly installed OpenShift cluster on a Fusion HCI system, with two GPU servers each carrying 8x L40S GPUs, several validator-related operator pods in the NVIDIA GPU Operator namespace are in CrashLoopBackOff on these nodes.

To Reproduce
The pods restart every few seconds and go into CrashLoopBackOff.

Expected behavior
The pods should reach and remain in a steady state.

Environment (please provide the following information):

  • GPU Operator Version: 25.3.4
  • OS: Red Hat Enterprise Linux CoreOS 9.6.20250925-0 (Plow)
  • Kernel Version: 5.14.0-570.49.1.el9_6.x86_64
  • Container Runtime Version: 13.0
  • OpenShift Distro and Version: OpenShift 4.20.0

Information to attach (optional if deemed irrelevant)

  • pods status:
% oc get pods -n nvidia-gpu-operator
NAME                                           READY   STATUS                  RESTARTS           AGE
nvidia-cuda-validator-6zgzf                    0/1     Init:CrashLoopBackOff   3 (40s ago)        86s
nvidia-cuda-validator-9dcgj                    0/1     Init:CrashLoopBackOff   2 (25s ago)        41s
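To dig into why these validator pods are crash-looping, describe one of the failing pods and pull logs from all of its containers. A sketch using a pod name from the listing above; the commands need cluster access, so they are guarded:

```shell
# Hypothetical debugging sketch: inspect one of the crash-looping
# nvidia-cuda-validator pods listed above. Guarded so the snippet is a
# no-op on machines without the oc CLI or cluster access.
if command -v oc >/dev/null 2>&1; then
  # The Events section at the bottom of describe usually names the failing init container
  oc describe pod nvidia-cuda-validator-6zgzf -n nvidia-gpu-operator
  # Logs from the pod's containers, each line prefixed with the container name
  oc logs nvidia-cuda-validator-6zgzf -n nvidia-gpu-operator \
    --all-containers --prefix
else
  echo "oc not available; commands shown for reference only"
fi
```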
  • daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
% oc get daemonset -n cpd-operators
No resources found in cpd-operators namespace.
% oc get daemonset -n openshift-operators
No resources found in openshift-operators namespace.
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
oc describe pod nvidia-container-toolkit-daemonset-hnxm7 -n nvidia-gpu-operator
Name:                 nvidia-container-toolkit-daemonset-hnxm7
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 compute-1-ru25.sde.cloud9.ibm.com/9.47.144.35
Start Time:           Wed, 01 Apr 2026 04:32:53 -0400
Labels:               app=nvidia-container-toolkit-daemonset
                      controller-revision-hash=748698f757
                      pod-template-generation=1
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.131.2.64/23"],"mac_address":"0a:58:0a:83:02:40","gateway_ips":["10.131.2.1"],"routes":[{"dest":"10.128.0.0...
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.131.2.64"
                            ],
                            "mac": "0a:58:0a:83:02:40",
                            "default": true,
                            "dns": {}
                        }]
                      openshift.io/scc: privileged
                      security.openshift.io/validated-scc-subject-type: serviceaccount
Status:               Running
IP:                   10.131.2.64
IPs:
  IP:           10.131.2.64
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  cri-o://e3855094d1c9ce2f237554bc2bc66a52cd2b116230b9222d61520e7e26b82e4b
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:619e7bf48168d76a7e087c3bf6190fda9b10449c9839309fa4f9ed5b8a9e8804
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:619e7bf48168d76a7e087c3bf6190fda9b10449c9839309fa4f9ed5b8a9e8804
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 01 Apr 2026 04:32:59 -0400
      Finished:     Wed, 01 Apr 2026 04:34:16 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vwglc (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  cri-o://c41940a1c37130b25368d7da40006db6f5c00d94aa0930eb2d769869634534b4
    Image:         nvcr.io/nvidia/k8s/container-toolkit@sha256:51c8f71d3b3c08ae4eb4853697e3f8e6f11e435e666e08210178e6a1faf8028f
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:51c8f71d3b3c08ae4eb4853697e3f8e6f11e435e666e08210178e6a1faf8028f
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Running
      Started:      Wed, 01 Apr 2026 04:34:17 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      ROOT:                                             /usr/local/nvidia
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:  management.nvidia.com/gpu
      NVIDIA_VISIBLE_DEVICES:                           void
      TOOLKIT_PID_FILE:                                 /run/nvidia/toolkit/toolkit.pid
      RUNTIME:                                          crio
      RUNTIME_CONFIG:                                   /runtime/config-dir/99-nvidia.conf
      CRIO_CONFIG:                                      /runtime/config-dir/99-nvidia.conf
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /driver-root from driver-install-dir (rw)
      /host from host-root (ro)
      /run/nvidia/toolkit from toolkit-root (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /runtime/config-dir/ from crio-config (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vwglc (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  toolkit-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/toolkit
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  crio-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/crio/crio.conf.d
    HostPathType:  DirectoryOrCreate
  kube-api-access-vwglc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:                      <none>
Name:                 nvidia-container-toolkit-daemonset-l6fgw
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 compute-1-ru28.sde.cloud9.ibm.com/9.47.144.36
Start Time:           Wed, 01 Apr 2026 04:40:48 -0400
Labels:               app=nvidia-container-toolkit-daemonset
                      controller-revision-hash=748698f757
                      pod-template-generation=1
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.128.4.47/23"],"mac_address":"0a:58:0a:80:04:2f","gateway_ips":["10.128.4.1"],"routes":[{"dest":"10.128.0.0...
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.128.4.47"
                            ],
                            "mac": "0a:58:0a:80:04:2f",
                            "default": true,
                            "dns": {}
                        }]
                      openshift.io/scc: privileged
                      security.openshift.io/validated-scc-subject-type: serviceaccount
Status:               Running
IP:                   10.128.4.47
IPs:
  IP:           10.128.4.47
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  cri-o://31b4182e7de68552edf0ead344d24753948d731775ce8edd35338474459b7451
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:619e7bf48168d76a7e087c3bf6190fda9b10449c9839309fa4f9ed5b8a9e8804
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:619e7bf48168d76a7e087c3bf6190fda9b10449c9839309fa4f9ed5b8a9e8804
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 01 Apr 2026 04:40:50 -0400
      Finished:     Wed, 01 Apr 2026 04:42:31 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8w6hl (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  cri-o://83448b497563ca01e649d5c1a62b0f7ae1db321af1de42dc0e61b1767966bfd6
    Image:         nvcr.io/nvidia/k8s/container-toolkit@sha256:51c8f71d3b3c08ae4eb4853697e3f8e6f11e435e666e08210178e6a1faf8028f
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:51c8f71d3b3c08ae4eb4853697e3f8e6f11e435e666e08210178e6a1faf8028f
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      /bin/entrypoint.sh
    State:          Running
      Started:      Wed, 01 Apr 2026 04:42:32 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      ROOT:                                             /usr/local/nvidia
      NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND:  management.nvidia.com/gpu
      NVIDIA_VISIBLE_DEVICES:                           void
      TOOLKIT_PID_FILE:                                 /run/nvidia/toolkit/toolkit.pid
      RUNTIME:                                          crio
      RUNTIME_CONFIG:                                   /runtime/config-dir/99-nvidia.conf
      CRIO_CONFIG:                                      /runtime/config-dir/99-nvidia.conf
    Mounts:
      /bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
      /driver-root from driver-install-dir (rw)
      /host from host-root (ro)
      /run/nvidia/toolkit from toolkit-root (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /runtime/config-dir/ from crio-config (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/cdi from cdi-root (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8w6hl (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  nvidia-container-toolkit-entrypoint:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-container-toolkit-entrypoint
    Optional:  false
  toolkit-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/toolkit
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:  
  cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  crio-config:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/crio/crio.conf.d
    HostPathType:  DirectoryOrCreate
  kube-api-access-8w6hl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:                      <none>
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
oc logs nvidia-container-toolkit-daemonset-hnxm7 -c cuda-validation -n nvidia-gpu-operator
oc logs nvidia-container-toolkit-daemonset-l6fgw -c cuda-validation -n nvidia-gpu-operator
error: container cuda-validation is not valid for pod nvidia-container-toolkit-daemonset-hnxm7
error: container cuda-validation is not valid for pod nvidia-container-toolkit-daemonset-l6fgw
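The two `oc logs` errors above mean the toolkit daemonset pods have no container named `cuda-validation`. A sketch of listing a pod's actual container names before retrying `-c` (pod and namespace taken from this report; guarded so it is inert without cluster access):

```shell
# List the pod's init containers and regular containers, then fetch logs
# from one of those names. Per the describe output above, this pod's
# containers are driver-validation (init) and nvidia-container-toolkit-ctr.
if command -v oc >/dev/null 2>&1; then
  oc get pod nvidia-container-toolkit-daemonset-hnxm7 -n nvidia-gpu-operator \
    -o jsonpath='{.spec.initContainers[*].name}{" "}{.spec.containers[*].name}{"\n"}'
  oc logs nvidia-container-toolkit-daemonset-hnxm7 -n nvidia-gpu-operator \
    -c driver-validation
else
  echo "oc not available; commands shown for reference only"
fi
```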
% oc logs nvidia-container-toolkit-daemonset-hnxm7 -n nvidia-gpu-operator
oc logs nvidia-container-toolkit-daemonset-l6fgw -n nvidia-gpu-operator 
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
IS_HOST_DRIVER=false
NVIDIA_DRIVER_ROOT=/run/nvidia/driver
DRIVER_ROOT_CTR_PATH=/driver-root
NVIDIA_DEV_ROOT=/run/nvidia/driver
DEV_ROOT_CTR_PATH=/driver-root
time="2026-04-01T08:34:22Z" level=info msg="Parsing arguments"
time="2026-04-01T08:34:22Z" level=info msg="Starting nvidia-toolkit"
time="2026-04-01T08:34:22Z" level=info msg="disabling device node creation since --cdi-enabled=false"
time="2026-04-01T08:34:22Z" level=info msg="Verifying Flags"
time="2026-04-01T08:34:22Z" level=info msg=Initializing
time="2026-04-01T08:34:22Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:34:22Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2026-04-01T08:34:22Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:34:22Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2026-04-01T08:34:22Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:34:22Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2026-04-01T08:34:22Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2026-04-01T08:34:22Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container.so.1' => '/usr/lib64/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/lib64/libnvidia-container.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/lib64/libnvidia-container.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2026-04-01T08:34:22Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2026-04-01T08:34:22Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container-go.so.1' => '/usr/lib64/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/lib64/libnvidia-container-go.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/lib64/libnvidia-container-go.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.cdi' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.cdi' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.legacy' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.legacy' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy'"
time="2026-04-01T08:34:22Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2026-04-01T08:34:22Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
time="2026-04-01T08:34:22Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-ctk' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-ctk' to '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-ctk'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-cdi-hook' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-cdi-hook' to '/usr/local/nvidia/toolkit/nvidia-cdi-hook.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-cdi-hook.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-cdi-hook'"
time="2026-04-01T08:34:22Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-runtime.debug"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-runtime.log-level"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-runtime.mode"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-runtime.modes.cdi.annotation-prefixes"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-runtime.runtimes"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-cli.debug"
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  log-level = "info"
  mode = "auto"
  runtimes = ["/usr/bin/crun", "/usr/bin/runc", "docker-runc", "runc", "crun"]

  [nvidia-container-runtime.modes]

    [nvidia-container-runtime.modes.cdi]
      annotation-prefixes = ["cdi.k8s.io/"]
      default-kind = "management.nvidia.com/gpu"
      spec-dirs = ["/etc/cdi", "/var/run/cdi"]

    [nvidia-container-runtime.modes.csv]
      mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

    [nvidia-container-runtime.modes.legacy]
      cuda-compat-mode = "ldconfig"

[nvidia-container-runtime-hook]
  path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"
  skip-mode-detection = true

[nvidia-ctk]
  path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2026-04-01T08:34:22Z" level=info msg="Starting 'setup' for nvidia-toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing prestart hook"
time="2026-04-01T08:34:22Z" level=info msg="Waiting for signal"
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
IS_HOST_DRIVER=false
NVIDIA_DRIVER_ROOT=/run/nvidia/driver
DRIVER_ROOT_CTR_PATH=/driver-root
NVIDIA_DEV_ROOT=/run/nvidia/driver
DEV_ROOT_CTR_PATH=/driver-root
time="2026-04-01T08:42:37Z" level=info msg="Parsing arguments"
time="2026-04-01T08:42:37Z" level=info msg="Starting nvidia-toolkit"
time="2026-04-01T08:42:37Z" level=info msg="disabling device node creation since --cdi-enabled=false"
time="2026-04-01T08:42:37Z" level=info msg="Verifying Flags"
time="2026-04-01T08:42:37Z" level=info msg=Initializing
time="2026-04-01T08:42:37Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:42:37Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2026-04-01T08:42:37Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:42:37Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2026-04-01T08:42:37Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:42:37Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2026-04-01T08:42:37Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2026-04-01T08:42:37Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container.so.1' => '/usr/lib64/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/lib64/libnvidia-container.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/lib64/libnvidia-container.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2026-04-01T08:42:37Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2026-04-01T08:42:37Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container-go.so.1' => '/usr/lib64/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/lib64/libnvidia-container-go.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/lib64/libnvidia-container-go.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.cdi' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.cdi' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.legacy' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.legacy' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy'"
time="2026-04-01T08:42:37Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2026-04-01T08:42:37Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
time="2026-04-01T08:42:37Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-ctk' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-ctk' to '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-ctk'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-cdi-hook' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-cdi-hook' to '/usr/local/nvidia/toolkit/nvidia-cdi-hook.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-cdi-hook.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-cdi-hook'"
time="2026-04-01T08:42:37Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-runtime.log-level"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-runtime.mode"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-runtime.modes.cdi.annotation-prefixes"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-runtime.runtimes"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-cli.debug"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-runtime.debug"
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  log-level = "info"
  mode = "auto"
  runtimes = ["/usr/bin/crun", "/usr/bin/runc", "docker-runc", "runc", "crun"]

  [nvidia-container-runtime.modes]

    [nvidia-container-runtime.modes.cdi]
      annotation-prefixes = ["cdi.k8s.io/"]
      default-kind = "management.nvidia.com/gpu"
      spec-dirs = ["/etc/cdi", "/var/run/cdi"]

    [nvidia-container-runtime.modes.csv]
      mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

    [nvidia-container-runtime.modes.legacy]
      cuda-compat-mode = "ldconfig"

[nvidia-container-runtime-hook]
  path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"
  skip-mode-detection = true

[nvidia-ctk]
  path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2026-04-01T08:42:37Z" level=info msg="Starting 'setup' for nvidia-toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing prestart hook"
time="2026-04-01T08:42:37Z" level=info msg="Waiting for signal"
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
    • containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):
nvidia-gpu-operator_20260422_1616.tar.gz

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: see the must-gather script for the debug data it collects.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

Metadata

Labels: bug (Issue/PR to expose/discuss/fix a bug), needs-triage (issue or PR has not been assigned a priority-px label)
