The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
1. Issue or feature description
The pod nvidia-device-plugin-daemonset is stuck in Init:CrashLoopBackOff
2. Steps to reproduce the issue
Apply the Helm chart:
helm install loadgen-gpu --wait nvidia/gpu-operator -f values.yaml --version 1.6.2 --set operator.defaultRuntime=containerd --debug
Wait for the install to finish.
All of the pods come up except for nvidia-device-plugin-daemonset.
One detail that may be relevant: I replaced the containerd.io package that shipped with the Ubuntu image with the containerd package from the standard repos, since it seemed to provide a newer version.
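Since the package swap is a suspect, it may help to record the state of containerd on the GPU node itself. The following is a minimal sketch, not a definitive diagnostic: `socket_status` and `containerd_packages` are hypothetical helper names introduced here, the socket path is the one quoted in the error message in the describe output, and the package query assumes a Debian/Ubuntu node with `dpkg` available.

```shell
#!/bin/sh
# Report whether a Unix socket exists at the given path.
socket_status() {
    if [ -S "$1" ]; then
        echo present
    else
        echo missing
    fi
}

# Hypothetical helper: list installed containerd-related packages.
# Assumes a Debian/Ubuntu node; defined here but not invoked.
containerd_packages() {
    dpkg -l 'containerd*'
}

# Socket path taken from the error message in the pod description.
echo "containerd socket: $(socket_status /run/containerd/containerd.sock)"
```

Run on the affected node, this shows whether the socket file exists at all. A "connection refused" error with the socket present would suggest that nothing is listening on it; `systemctl status containerd` and `containerd_packages` can then be run by hand to see which package provides the service and whether it is active.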
3. Information to attach (optional if deemed irrelevant)
Describe output for the problematic pod:
(base) ➜ helm kc describe pod nvidia-device-plugin-daemonset-vlkm2 -n gpu-operator-resources
Name: nvidia-device-plugin-daemonset-vlkm2
Namespace: gpu-operator-resources
Priority: 0
Node: ip-172-32-62-232.us-east-2.compute.internal/172.32.62.232
Start Time: Fri, 27 Aug 2021 17:11:04 -0400
Labels: app=nvidia-device-plugin-daemonset
controller-revision-hash=7ff4fb4c4b
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Pending
IP: 172.32.33.219
IPs:
IP: 172.32.33.219
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
toolkit-validation:
Container ID: docker://6893451d063f1e493ba0f20a94d53455f9b82d7deee91808d9bb515634760cbc
Image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
Image ID: docker-pullable://nvcr.io/nvidia/k8s/cuda-sample@sha256:4593078cdb8e786d35566faa2b84da1123acea42f0d4099e84e2af0448724af1
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
/tmp/vectorAdd
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
Exit Code: 128
Started: Fri, 27 Aug 2021 17:17:22 -0400
Finished: Fri, 27 Aug 2021 17:17:22 -0400
Ready: False
Restart Count: 6
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from nvidia-device-plugin-token-j5wdj (ro)
Containers:
nvidia-device-plugin-ctr:
Container ID:
Image: nvcr.io/nvidia/k8s-device-plugin:v0.8.2-ubi8
Image ID:
Port: <none>
Host Port: <none>
Args:
--mig-strategy=single
--pass-device-specs=true
--fail-on-init-error=true
--device-list-strategy=envvar
--nvidia-driver-root=/run/nvidia/driver
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: all
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from nvidia-device-plugin-token-j5wdj (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
nvidia-device-plugin-token-j5wdj:
Type: Secret (a volume populated by a Secret)
SecretName: nvidia-device-plugin-token-j5wdj
Optional: false
QoS Class: BestEffort
Node-Selectors: kops.k8s.io/instancegroup=gpu-nodes
nvidia.com/gpu.present=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
nvidia.com/gpu
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m43s default-scheduler Successfully assigned gpu-operator-resources/nvidia-device-plugin-daemonset-vlkm2 to ip-172-32-62-232.us-east-2.compute.internal
Warning FailedCreatePodSandBox 9m42s kubelet, ip-172-32-62-232.us-east-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "3658bb95ae86ade79a31794b6f7e5a20f319549170ccca2b0d33dacf0918a55b" network for pod "nvidia-device-plugin-daemonset-vlkm2": networkPlugin cni failed to set up pod "nvidia-device-plugin-daemonset-vlkm2_gpu-operator-resources" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "3658bb95ae86ade79a31794b6f7e5a20f319549170ccca2b0d33dacf0918a55b" network for pod "nvidia-device-plugin-daemonset-vlkm2": networkPlugin cni failed to teardown pod "nvidia-device-plugin-daemonset-vlkm2_gpu-operator-resources" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
Normal SandboxChanged 9m25s (x2 over 9m41s) kubelet, ip-172-32-62-232.us-east-2.compute.internal Pod sandbox changed, it will be killed and re-created.
Normal Pulling 9m24s kubelet, ip-172-32-62-232.us-east-2.compute.internal Pulling image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
Normal Pulled 9m20s kubelet, ip-172-32-62-232.us-east-2.compute.internal Successfully pulled image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
Normal Started 8m33s (x4 over 9m20s) kubelet, ip-172-32-62-232.us-east-2.compute.internal Started container toolkit-validation
Normal Created 7m50s (x5 over 9m20s) kubelet, ip-172-32-62-232.us-east-2.compute.internal Created container toolkit-validation
Normal Pulled 7m50s (x4 over 9m19s) kubelet, ip-172-32-62-232.us-east-2.compute.internal Container image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" already present on machine
Warning BackOff 4m36s (x22 over 9m18s) kubelet, ip-172-32-62-232.us-east-2.compute.internal Back-off restarting failed container
Logs for the problematic pod
helm kc logs -f nvidia-device-plugin-daemonset-vlkm2 -n gpu-operator-resources
Error from server (BadRequest): container "nvidia-device-plugin-ctr" in pod "nvidia-device-plugin-daemonset-vlkm2" is waiting to start: PodInitializing
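The describe output above lists a `docker://` container ID for the init container, while the chart was installed with `operator.defaultRuntime=containerd`. Whether that mismatch is the cause is not established here, but it may be worth comparing the two. The sketch below assumes only that Kubernetes reports container IDs in the form `<type>://<id>`; `runtime_prefix` and `init_container_id` are hypothetical helper names, and `init_container_id` needs `kubectl` access to the cluster, so it is defined but not invoked.

```shell
#!/bin/sh
# Print the runtime prefix of a Kubernetes container ID ("<type>://<id>").
runtime_prefix() {
    printf '%s\n' "${1%%://*}"
}

# Hypothetical helper: fetch the first init container's ID for a pod.
# Usage: init_container_id <pod> <namespace>
# Requires kubectl access to the cluster; not invoked here.
init_container_id() {
    kubectl get pod "$1" -n "$2" \
        -o jsonpath='{.status.initContainerStatuses[0].containerID}'
}

# Container ID copied from the describe output above.
runtime_prefix "docker://6893451d063f1e493ba0f20a94d53455f9b82d7deee91808d9bb515634760cbc"
# prints: docker
```

The runtime the kubelet reports for each node can be read from the CONTAINER-RUNTIME column of `kubectl get nodes -o wide` and compared against this prefix.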
1. Quick Debug Checklist
i2c_core and ipmi_msghandler loaded on the nodes? -> Yes, both
(kubectl describe clusterpolicies --all-namespaces) -> Yes