1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? (see the sketch below)
- Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
My environment:
- Ubuntu 20.04
- Docker 20.10.7
- Kubernetes 1.22.3
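The checks behind the last two checklist items look roughly like this (a minimal sketch of the commands I mean; run on the GPU node and with cluster access respectively):

```bash
# On the GPU node: check that the required kernel modules are loaded
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# From a machine with kubectl access: confirm the ClusterPolicy CRD was applied
kubectl describe clusterpolicies --all-namespaces
```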
1. Issue or feature description
I have a Kubernetes cluster created with kubeadm that has several worker nodes without a GPU, plus a fresh worker node with a single GeForce RTX 3090 and without any drivers, NVIDIA toolkits, etc. installed.
When I install the GPU operator, I get to a state where all nodes run a feature-discovery-worker pod. The GPU worker additionally runs the following pods successfully:
- gpu-feature-discovery-XXXXX
- nvidia-container-toolkit-daemonset-XXXXX
- nvidia-dcgm-XXXXX
- nvidia-device-plugin-daemonset-XXXXX
- nvidia-driver-daemonset-XXXXX
However, the nvidia-cuda-validator-XXXX pod is stuck at status Init:0/1 and the nvidia-operator-validator is stuck at status Init:2/4.
From `kubectl describe pod -n gpu-operator-resources nvidia-cuda-validator-ffrw8`:
Warning FailedCreatePodSandBox 3m35s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4320b99990499dea8f6b68d77b5ffa15edfd0d2ced6d9afc320d14260522a517" network for pod "nvidia-cuda-validator-ffrw8": networkPlugin cni failed to set up pod "nvidia-cuda-validator-ffrw8_gpu-operator-resources" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "4320b99990499dea8f6b68d77b5ffa15edfd0d2ced6d9afc320d14260522a517"
The nvidia-operator-validator fails with "Back-off restarting failed container".
Additionally, the nvidia-dcgm-exporter pod constantly fails with the following logs:
time="2021-12-02T14:55:53Z" level=info msg="Starting dcgm-exporter"
time="2021-12-02T14:55:53Z" level=info msg="Attemping to connect to remote hostengine at XXX.XXX.XXX.XXX:5555"
time="2021-12-02T14:55:58Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
There is currently no firewall enabled.
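To rule out a connectivity problem between the exporter and the host engine, a check along these lines could be run (a sketch; the pod-name suffix and the XXX.XXX.XXX.XXX address are the placeholders from the logs above):

```bash
# Inspect the standalone host engine the exporter is trying to reach
kubectl logs -n gpu-operator-resources nvidia-dcgm-XXXXX

# From the GPU node: check whether the host engine port is reachable
# (XXX.XXX.XXX.XXX is the address from the dcgm-exporter log)
nc -vz XXX.XXX.XXX.XXX 5555
```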
When I preinstall a driver and the container toolkit, the installation works, so this workaround is still an option. However, it would be great if I could get the whole GPU operator to work correctly on its own.
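For reference, the workaround deployment looks roughly like this (a sketch; it assumes the chart's `driver.enabled` and `toolkit.enabled` values, which tell the operator to skip deploying the driver and container toolkit):

```bash
# Preinstall the NVIDIA driver and nvidia-container-toolkit on the GPU node first,
# then deploy the operator without its own driver/toolkit (chart values assumed)
helm install --wait --generate-name nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```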
2. Steps to reproduce the issue
helm install --wait --generate-name nvidia/gpu-operator
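(This assumes the NVIDIA Helm repository was added beforehand, e.g.:)

```bash
# Add and refresh the NVIDIA Helm repository before installing the chart
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```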
3. Information to attach (optional if deemed irrelevant)
- Kubernetes pods status: `kubectl get pods -n gpu-operator-resources`
- Kubernetes daemonset status: `kubectl get ds --all-namespaces`
- If a pod/ds is in an error state or pending state: `kubectl describe pod -n gpu-operator-resources nvidia-cuda-validator-249tt`
- If a pod/ds is in an error state or pending state: `kubectl describe pod -n gpu-operator-resources nvidia-operator-validator-`
- If a pod/ds is in an error state or pending state: `kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-pf6n6`
daemonset_status.txt
pod_status.txt
cuda_validator_pod_describe.txt
operator_validator_pod_describe.txt
dcgm_exporter_logs.txt
- Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- Docker configuration file: `cat /etc/docker/daemon.json` (original Docker configuration file)
- NVIDIA shared directory: `ls -la /run/nvidia`
- NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- NVIDIA driver directory: `ls -la /run/nvidia/driver`
- kubelet logs: `journalctl -u kubelet > kubelet.logs`
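For completeness, the attached files above were collected roughly as follows (a sketch; the operator-validator pod-name suffix is a placeholder):

```bash
kubectl get ds --all-namespaces > daemonset_status.txt
kubectl get pods -n gpu-operator-resources > pod_status.txt
kubectl describe pod -n gpu-operator-resources nvidia-cuda-validator-249tt > cuda_validator_pod_describe.txt
# the suffix below is a placeholder for the actual operator-validator pod name
kubectl describe pod -n gpu-operator-resources nvidia-operator-validator-XXXXX > operator_validator_pod_describe.txt
kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-pf6n6 > dcgm_exporter_logs.txt
journalctl -u kubelet > kubelet.logs
```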