Description
1. Issue or feature description
The nvidia-container-toolkit-daemonset-hrncd pod goes into CrashLoopBackOff indefinitely after reaching a Completed state.
2. Steps to reproduce the issue
Create a cluster with Ubuntu 20.04, Kubernetes server version 1.23.13, and containerd version 1.6.10.
Install the NVIDIA GPU Operator:
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
NAME: gpu-operator-1670258596
LAST DEPLOYED: Mon Dec 5 17:43:19 2022
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
nvidia-container-toolkit-daemonset-hrncd goes into CrashLoopBackOff after a Completed state (due to the DaemonSet restarting it).
While investigating this, I can see that the container changes the configuration in /etc/containerd/config.toml and restarts containerd. My understanding is that it should then loop in "Waiting for signal", but instead it proceeds to clean up the configuration. I can see the containerd service being restarted 3 times during the running phase. I don't know what is happening; even after looking at the code I don't understand why it receives a signal (or where that signal comes from).
https://github.com/NVIDIA/nvidia-container-toolkit/blob/a9fb7a4a8807f1fa7e09e63138a928a410bb15a7/tools/container/nvidia-toolkit/run.go#L261
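To illustrate what I expect at that point in run.go: the process should block on a signal before running its cleanup, as in this minimal shell sketch (the trap/wait pattern here is only an analogy for the Go signal handling, not the actual toolkit code):

```shell
#!/bin/sh
# Sketch of the expected "Waiting for signal" behavior:
# configure, then block until SIGTERM/SIGINT, then clean up.
cleanup() {
    echo "received signal, cleaning up containerd config"
}
trap cleanup TERM INT

echo "Waiting for signal"
# For demonstration only: send ourselves SIGTERM so the script exits.
kill -TERM $$ &
wait
```

In the failing pod, cleanup appears to run without any such signal being delivered, which is what I cannot explain.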
Do you have any clue about this? Thank you!
3. Information to attach (optional if deemed irrelevant)
- nvidia-container-toolkit-ctr log:
nvidia-container-toolkit-ctr.txt
- kubernetes pods status:
kubectl get pods --all-namespaces
- Containerd configuration file:
cat /etc/containerd/config.toml
- NVIDIA shared directory:
ls -la /run/nvidia
- NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
- NVIDIA driver directory:
ls -la /run/nvidia/driver
- kubelet logs
journalctl -u kubelet > kubelet.logs
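To count how many times containerd was restarted during the toolkit run, the journal can be filtered as well. A sketch (the inline sample log is illustrative; against a real node, pipe `journalctl -u containerd` into the grep instead):

```shell
#!/bin/sh
# Illustrative journal excerpt; on a real node use:
#   journalctl -u containerd --no-pager | grep -c 'Starting containerd'
log='Dec 05 17:43:20 node systemd[1]: Starting containerd...
Dec 05 17:44:01 node systemd[1]: Starting containerd...
Dec 05 17:45:12 node systemd[1]: Starting containerd...'

restarts=$(printf '%s\n' "$log" | grep -c 'Starting containerd')
echo "$restarts"
```

Run against the sample log above, this prints 3, matching the number of restarts I observed.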