We install the gpu-operator components without the gpu-operator itself and apply the manifests manually. We are currently migrating from docker to containerd and are hitting an issue that did not exist when we were on docker.
The driver and container-toolkit successfully run. However, we are getting stuck at the nvidia-operator-validator pod:
$ k logs pod/nvidia-operator-validator-containerd-wvgr5 -c toolkit-validation
time="2024-08-16T22:18:20Z" level=info msg="version: 762213f2"
toolkit is not ready
time="2024-08-16T22:18:20Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
We've already tried the following:
- Spinning up gpu-operator v23.9.0 in our cluster using the Helm chart directly (this works)
- Making our configuration look nearly identical to the one generated by the Helm chart, namely runtimeClassName: nvidia and identical mounts
- Ensuring version compatibility across all the images
- Deleting the operator-validator pod and recreating it (this works, in that the error disappears; however, it is not an ideal workaround for us, and we'd like to know why the official Helm deployment works but ours does not)
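Since runtimeClassName: nvidia only takes effect if containerd itself knows a runtime handler with that name, one extra check on the node may be worthwhile. The following is a minimal sketch, not part of our setup: the function name `has_nvidia_runtime` is hypothetical, and the default path `/etc/containerd/config.toml` is an assumption that may not match every install (drop-in or imported config files are not inspected).

```shell
# Sketch: report whether a containerd config declares an "nvidia" runtime handler.
# Assumption: the config lives at /etc/containerd/config.toml unless a path is passed.
# Limitation: files pulled in through containerd's "imports" setting are not checked.
has_nvidia_runtime() {
  config="${1:-/etc/containerd/config.toml}"
  if [ ! -r "$config" ]; then
    echo "cannot read $config" >&2
    return 2
  fi
  # Runtime handlers appear as TOML tables whose names end in
  # containerd.runtimes.<handler>; look for the one called "nvidia".
  grep -Eq '^[[:space:]]*\[.*containerd\.runtimes\.nvidia\]' "$config"
}
```

Run on a GPU node, `has_nvidia_runtime && echo present || echo missing` would show whether the handler was written to the config. If it is present but pods started before it was added, those pods would still have been created without it, which might be consistent with recreation fixing them.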
Note that once the operator-validator pod is recreated and the toolkit-validation container completes, the other pods (dcgm, dcgm-exporter, device-plugin) also start to fail with their own error messages and need to be recreated as well. This makes us further suspect that the issue lies not with the operator-validator itself but with the container-toolkit or driver, which is not setting up the container properly.
This looks very similar to #265 (comment), but it's unclear what the actual fix is.
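For anyone who needs the recreate-the-pods workaround as a stopgap while the root cause is unknown, it could be scripted roughly as follows. This is only a sketch under assumptions: the function name `restart_gpu_daemonsets` is hypothetical, the namespace and DaemonSet names passed to it are placeholders for whatever your manifests define, and it assumes the DaemonSets use the RollingUpdate update strategy.

```shell
# Sketch: restart GPU DaemonSets one at a time, waiting for each to settle
# before moving on to the next one.
# Assumption: namespace and DaemonSet names are supplied by the caller.
restart_gpu_daemonsets() {
  namespace="$1"
  shift
  for ds in "$@"; do
    # Recreate the pods by triggering a new rollout of the DaemonSet.
    kubectl -n "$namespace" rollout restart "daemonset/$ds" || return 1
    # Block until the rollout finishes or the timeout expires.
    kubectl -n "$namespace" rollout status "daemonset/$ds" --timeout=300s || return 1
  done
}
```

A call might look like `restart_gpu_daemonsets <namespace> <validator-daemonset> <device-plugin-daemonset>`, listing the validator first so that the dependent pods are recreated only after it has completed. This does not explain why the first start fails; it only automates the manual step.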
1. Quick Debug Information
20.04.2) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #6420.04.1-Ubuntu SMP Tue Apr 9 11:12:27 UTC 2024
2. Issue or feature description
Manifests:
Logs
3. To reproduce
- kubectl apply all the GPU operator manifests above
- [...] Ready
- [...] Ready
- [...] Error state and device-plugin should be in CrashLoopBackoff