toolkit-validation container errors out with "nvidia-smi": executable file not found in $PATH after migrating to containerd #936

### 1. Quick Debug Information
* OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04
* Kernel Version: Linux version 5.15.0-1058-aws (buildd@lcy02-amd64-094) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #64~20.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
* Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
* K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Self-hosted kubernetes v1.22
* GPU Operator Version: 23.9.0


### 2. Issue or feature description
We deploy the GPU Operator components without the operator itself, by applying the manifests manually. We are migrating from Docker to containerd and are hitting an issue that did not occur on Docker.

The driver and container-toolkit successfully run. However, we are getting stuck at the `nvidia-operator-validator` pod:

```
NAME                                            READY   STATUS                  RESTARTS       AGE
gpu-feature-discovery-lpv6q                     0/1     Init:0/1                0              49m
node-feature-discovery-gc-5c6cb6c949-4gdzv      1/1     Running                 0              15d
node-feature-discovery-master-676bb754d-g9ckx   1/1     Running                 0              7d5h
node-feature-discovery-worker-nxvpj             1/1     Running                 0              49m
nvidia-container-toolkit-containerd-pdrrc       1/1     Running                 0              49m
nvidia-dcgm-containerd-nlcbb                    0/1     Init:0/1                0              49m
nvidia-dcgm-exporter-containerd-l8nh6           0/1     Init:0/1                0              49m
nvidia-device-plugin-containerd-vcdr2           0/1     Init:0/1                0              49m
nvidia-driver-containerd-htdkd                  1/1     Running                 0              49m
nvidia-mig-manager-xb8dx                        0/1     Init:0/1                0              49m
nvidia-operator-validator-containerd-wvgr5      0/1     Init:CrashLoopBackOff   12 (72s ago)   49m
```

The logs show:

```
$ k logs pod/nvidia-operator-validator-containerd-wvgr5 -c toolkit-validation
time="2024-08-16T22:18:20Z" level=info msg="version: 762213f2"
toolkit is not ready
time="2024-08-16T22:18:20Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
```
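Since the validator cannot find `nvidia-smi`, one thing that can be checked directly on the GPU node is whether the binary exists under the driver container's root. The following is a minimal sketch, not a definitive diagnostic: `find_nvidia_smi` is a hypothetical helper, and `/run/nvidia/driver` is assumed to be the driver root (the GPU Operator default; the manifests here may mount it elsewhere).

```shell
# Hypothetical helper: look for an executable nvidia-smi under a driver root.
# Assumption: the driver container's root filesystem is exposed on the host
# at /run/nvidia/driver unless another path is passed as the first argument.
find_nvidia_smi() {
  root="${1:-/run/nvidia/driver}"
  for candidate in "$root/usr/bin/nvidia-smi" "$root/bin/nvidia-smi"; do
    if [ -x "$candidate" ]; then
      # Print the first match and report success.
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  # Nothing executable found at the expected locations.
  return 1
}
```

Running `find_nvidia_smi` on the node only confirms the driver side. If the binary is present there, the error would point toward the runtime not injecting it into the validator container.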

We've already tried the following:
- Spinning up gpu-operator v23.9.0 in our cluster using the helm chart directly (this works)
- Making our configuration nearly identical to the one generated by the helm chart, namely setting `runtimeClassName: nvidia` and setting up the mounts identically
- Ensuring version compatibility across all the images
- Deleting the operator-validator pod and recreating it (this works, as the error disappears; however, it is not an ideal workaround for us, and we'd like to know why the official helm deployment works but ours does not)
  - Note that once the `operator-validator` pod is recreated and the `toolkit-validation` container completes, the other pods (`dcgm`, `dcgm-exporter`, `device-plugin`) also start failing with their own error messages and need to be recreated as well. This makes us further suspect that the issue is not with the `operator-validator` itself but with the `container-toolkit` or `driver`, which is not setting up the containers properly.
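
Because the pods rely on `runtimeClassName: nvidia`, it may also be worth confirming that containerd's configuration on the node actually contains an `nvidia` runtime handler. This is a minimal sketch under assumptions: `has_nvidia_runtime` is a hypothetical helper, and `/etc/containerd/config.toml` is assumed to be the config file the container-toolkit pod writes to (its manifest may point somewhere else).

```shell
# Hypothetical helper: succeed if the containerd config declares a runtime
# handler named "nvidia".
# Assumption: the config lives at /etc/containerd/config.toml unless another
# path is passed as the first argument.
has_nvidia_runtime() {
  config="${1:-/etc/containerd/config.toml}"
  # Return 2 when the file cannot be read, so "missing file" is
  # distinguishable from "handler not configured" (exit status 1).
  [ -r "$config" ] || return 2
  # Fixed-string match on the tail of the TOML table header, e.g.
  # [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  grep -qF 'containerd.runtimes.nvidia]' "$config"
}
```

For example, `has_nvidia_runtime && echo "nvidia handler present"` run on the GPU node. A present handler does not prove containerd has reloaded the file, so this is only one piece of the picture.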

This seems similar to https://github.com/NVIDIA/gpu-operator/issues/265#issuecomment-938189264 but it's unclear from that thread what the actual fix is.

Manifests:
- [nvidia-operator-validator manifest](https://gist.github.com/jpdstan/6fe507c6d5749c17610652036688aba8#file-nvidia-operator-validator-yaml)
- [nvidia-driver manifest](https://gist.github.com/jpdstan/6fe507c6d5749c17610652036688aba8#file-nvidia-driver-containerd-yaml)
- [nvidia-container-toolkit manifest](https://gist.github.com/jpdstan/6fe507c6d5749c17610652036688aba8#file-nvidia-container-toolkit-yaml)

Logs:
- [nvidia-container-toolkit logs](https://gist.github.com/jpdstan/6fe507c6d5749c17610652036688aba8#file-nvidia-container-toolkit-logs)
- [nvidia-container-runtime logs](https://gist.github.com/jpdstan/6fe507c6d5749c17610652036688aba8#file-nvidia-container-runtime-log)
- (After deleting the operator-validator):
  - [dcgm-exporter logs](https://gist.github.com/jpdstan/6fe507c6d5749c17610652036688aba8#file-nvidia-dcgm-exporter-log)
  - [dcgm logs](https://gist.github.com/jpdstan/6fe507c6d5749c17610652036688aba8#file-nvidia-dcgm-log)
  - [device plugin logs](https://gist.github.com/jpdstan/6fe507c6d5749c17610652036688aba8#file-nvidia-device-plugin-log)



### 3. To reproduce
1. `kubectl apply` all the GPU operator manifests above
2. Wait for the driver pod to be `Ready`
3. Wait for the container-toolkit pod to be `Ready`
4. The operator-validator pod starts crashlooping
5. Delete the operator-validator pod; the recreated operator-validator pod proceeds
6. dcgm and dcgm-exporter end up in `Error` state and device-plugin in `CrashLoopBackOff`
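
To watch steps 4 to 6 as they happen, a small filter over `kubectl get pods` output can list only the pods that are not fully ready. This is a minimal sketch: `not_ready_pods` is a hypothetical helper name, and it assumes the default column layout (NAME, READY, STATUS, ...) with a header row.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# "NAME STATUS" for every pod whose READY column is not n/n.
# Assumption: the first line is the header row, so it is skipped; do not
# combine this with --no-headers.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")          # "0/1" -> ready[1]="0", ready[2]="1"
    if (ready[1] != ready[2]) print $1, $3
  }'
}
```

It can be used as `kubectl get pods -n <namespace> | not_ready_pods`, where the namespace is whichever one the manifests were applied to.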
