nvidia-driver-daemonset CrashLoopBackOff  #327

@k8s-gpubuilder-markus

Description

This is a Kubernetes 1.23.4 node on Ubuntu 20.04 LTS, GPU is GP104BM [GeForce GTX 1070 Mobile]

nvidia-smi reports the driver properly installed:
NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6

Pods can be scheduled normally on this node, which runs containerd 1.5.5-0ubuntu3~20.04.2.
nvidia-container-runtime is installed on this node.

Containerd is configured properly and working, as demonstrated by
ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi

But when I install the gpu-operator with helm:

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

...some of the pods do not start properly:

NAMESPACE NAME READY STATUS RESTARTS AGE
gpu-operator pod/gpu-feature-discovery-2jn29 0/1 Init:0/1 0 99s
gpu-operator pod/gpu-operator-1646339147-node-feature-discovery-master-64cdng6xd 1/1 Running 0 62m
gpu-operator pod/gpu-operator-1646339147-node-feature-discovery-worker-cddlh 1/1 Running 0 62m
gpu-operator pod/gpu-operator-1646339147-node-feature-discovery-worker-gx9gg 1/1 Running 0 62m
gpu-operator pod/gpu-operator-7ff85f9c4f-fw7hb 1/1 Running 0 62m
gpu-operator pod/nvidia-container-toolkit-daemonset-vxdcj 0/1 Init:0/1 0 99s
gpu-operator pod/nvidia-dcgm-exporter-nb5mn 0/1 Init:0/1 0 99s
gpu-operator pod/nvidia-device-plugin-daemonset-rlnjj 0/1 Init:0/1 0 99s
gpu-operator pod/nvidia-driver-daemonset-x25nl 0/1 Init:CrashLoopBackOff 15 (99s ago) 62m
gpu-operator pod/nvidia-operator-validator-d8hrg 0/1 Init:0/4 0 99s

NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-operator daemonset.apps/gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 62m
gpu-operator daemonset.apps/gpu-operator-1646339147-node-feature-discovery-worker 2 2 2 2 2 62m
gpu-operator daemonset.apps/nvidia-container-toolkit-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.container-toolkit=true 62m
gpu-operator daemonset.apps/nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 62m
gpu-operator daemonset.apps/nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 62m
gpu-operator daemonset.apps/nvidia-driver-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.driver=true 62m
gpu-operator daemonset.apps/nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 62m
gpu-operator daemonset.apps/nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 62m
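To see why the driver daemonset is stuck, the usual first step is to inspect its failing init container. A diagnostic sketch, using the pod name nvidia-driver-daemonset-x25nl from the listing above; the init container name k8s-driver-manager is what recent chart versions typically use, so verify it with the jsonpath query before fetching logs:

```shell
# Show pod events and per-container status, including the failing init container
kubectl describe pod -n gpu-operator nvidia-driver-daemonset-x25nl

# List the pod's init container names (the name below is an assumption)
kubectl get pod -n gpu-operator nvidia-driver-daemonset-x25nl \
  -o jsonpath='{.spec.initContainers[*].name}'

# Dump the logs of the previous (crashed) run of that init container
kubectl logs -n gpu-operator nvidia-driver-daemonset-x25nl \
  -c k8s-driver-manager --previous
```

The `--previous` flag matters for CrashLoopBackOff: without it you often get logs from a container that has not produced output yet.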

Now my questions:

  1. Examining the pod logs, it looks like many pods depend on other pods and therefore can't start. Which pods are the first and most essential ones to come up?
  2. Can I reduce the components (features) that the gpu-operator deploys, to make troubleshooting easier? What is the minimum required to get it running? I just want to run a simple CUDA workload on this single-GPU node.
  3. The gpu-operator has a lot of different components; is there documentation on what each one does? For example, what does the validator do, and what is the difference between nvidia-driver and nvidia-device-plugin?
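On question 2: since the driver and nvidia-container-runtime are already installed on the host (as the working ctr test shows), a common minimal setup is to tell the operator to skip deploying them. A sketch, assuming the chart's documented driver.enabled and toolkit.enabled values; check them against your chart version with `helm show values nvidia/gpu-operator`:

```shell
# Install the operator but keep using the host's preinstalled
# driver 510.47.03 and nvidia-container-runtime
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```

With those two daemonsets disabled, the remaining components (device plugin, GFD, DCGM exporter, validator) only need a working NVIDIA runtime on the host, which this node already has.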
