toolkit-validation container errors out with "nvidia-smi": executable file not found in $PATH after migrating to containerd #936

@jpdstan

Description

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): ubuntu 20.04
  • Kernel Version: Linux version 5.15.0-1058-aws (buildd@lcy02-amd64-094) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Self-hosted kubernetes v1.22
  • GPU Operator Version: 23.9.0

2. Issue or feature description

We install the gpu-operator components without the operator itself, by applying the manifests manually. We are currently migrating from docker to containerd and are hitting this issue, which did not occur when we were on docker.

The driver and container-toolkit successfully run. However, we are getting stuck at the nvidia-operator-validator pod:

NAME                                            READY   STATUS                  RESTARTS       AGE
gpu-feature-discovery-lpv6q                     0/1     Init:0/1                0              49m
node-feature-discovery-gc-5c6cb6c949-4gdzv      1/1     Running                 0              15d
node-feature-discovery-master-676bb754d-g9ckx   1/1     Running                 0              7d5h
node-feature-discovery-worker-nxvpj             1/1     Running                 0              49m
nvidia-container-toolkit-containerd-pdrrc       1/1     Running                 0              49m
nvidia-dcgm-containerd-nlcbb                    0/1     Init:0/1                0              49m
nvidia-dcgm-exporter-containerd-l8nh6           0/1     Init:0/1                0              49m
nvidia-device-plugin-containerd-vcdr2           0/1     Init:0/1                0              49m
nvidia-driver-containerd-htdkd                  1/1     Running                 0              49m
nvidia-mig-manager-xb8dx                        0/1     Init:0/1                0              49m
nvidia-operator-validator-containerd-wvgr5      0/1     Init:CrashLoopBackOff   12 (72s ago)   49m

The logs show:

$ k logs pod/nvidia-operator-validator-containerd-wvgr5 -c toolkit-validation
time="2024-08-16T22:18:20Z" level=info msg="version: 762213f2"
toolkit is not ready
time="2024-08-16T22:18:20Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
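The error text has the shape of a failed `$PATH` lookup: the binary name could not be resolved to an executable in any `$PATH` directory. As a minimal sketch of the same kind of lookup (the helper name `check_on_path` is hypothetical and is not part of the validator), something like the following could be run in any container started with `runtimeClassName: nvidia` to see whether `nvidia-smi` was injected:

```shell
# Hypothetical helper (not part of gpu-operator): report whether a command
# name resolves to an executable via the current $PATH.
check_on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# In a container where the NVIDIA runtime did not inject the driver binaries,
# this would be expected to report nvidia-smi as missing.
check_on_path nvidia-smi || true
```

If this reports the binary as missing in a freshly created pod but as found after the pod is recreated, that would point at container setup rather than at the validator itself.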

We've already tried the following:

  • Spinning up gpu-operator v23.9.0 in our cluster using the helm chart directly (this works)
  • Making our configuration look pretty much identical to that generated by the helm chart, namely runtimeClassName: nvidia and setting up the mounts identically
  • Ensuring version compatibility across all the images
  • Deleting the operator-validator pod and recreating it (this works, as the error will disappear -- however this is not an ideal workaround for us, and we'd like to know why the helm-official deployment works but not ours)
    • Note that once the operator-validator pod is recreated and the toolkit-validation container completes, the other pods (dcgm, dcgm-exporter, device-plugin) all start to fail with their own error messages and need to be recreated as well. This strengthens our suspicion that the issue is not with the operator-validator itself but with the container-toolkit or driver not setting up the containers properly.
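Since recreating the pods makes the errors go away, one thing that could be ruled out is whether the `nvidia` runtime was already registered with containerd when the pods were first created. A rough sketch of such a check, to be run on the GPU node, is below. It assumes the containerd config lives at `/etc/containerd/config.toml` (the containerd default; our setup may differ), and the function name `check_runtime_registered` is made up for illustration:

```shell
# Hypothetical helper: check whether a containerd config file contains a
# runtime table whose name starts with "nvidia".
# Loose substring match: it also matches names such as "nvidia-cdi".
check_runtime_registered() {
  config_file="$1"
  if [ ! -r "$config_file" ]; then
    echo "unreadable: $config_file"
    return 2
  fi
  if grep -Fq 'containerd.runtimes.nvidia' "$config_file"; then
    echo "registered"
  else
    echo "not registered"
    return 1
  fi
}

# Assumed default path; override CONTAINERD_CONFIG if the node uses another one.
check_runtime_registered "${CONTAINERD_CONFIG:-/etc/containerd/config.toml}" || true
```

Comparing the timestamp of that file (and of the last containerd restart) against the creation time of the failing pods might show whether they were started before the runtime was configured. `kubectl get runtimeclass nvidia` can similarly confirm that the RuntimeClass object exists.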

This looks very similar to #265 (comment), but it's unclear what the actual fix is.

Manifests:

Logs

3. To reproduce

  1. kubectl apply all the GPU operator manifests above
  2. Wait for the driver pod to be Ready
  3. Wait for the container-toolkit pod to be Ready
  4. The operator-validator pod should start crashlooping
  5. Delete the operator-validator pod; the recreated pod should proceed
  6. dcgm and dcgm-exporter should be in the Error state and device-plugin should be in CrashLoopBackOff
