nvidia-driver-daemonset CrashLoopBackOff  #327

@k8s-gpubuilder-markus

Description

This is a Kubernetes 1.23.4 node on Ubuntu 20.04 LTS, GPU is GP104BM [GeForce GTX 1070 Mobile]

nvidia-smi reports the driver properly installed:
NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6

Pods can be scheduled normally on this node, which runs containerd 1.5.5-0ubuntu3~20.04.2.
nvidia-container-runtime is installed on this node.

Containerd is configured properly and working, as demonstrated by
ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi

But when I install the gpu-operator with helm:

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

...some of the pods do not start properly:

NAMESPACE NAME READY STATUS RESTARTS AGE
gpu-operator pod/gpu-feature-discovery-2jn29 0/1 Init:0/1 0 99s
gpu-operator pod/gpu-operator-1646339147-node-feature-discovery-master-64cdng6xd 1/1 Running 0 62m
gpu-operator pod/gpu-operator-1646339147-node-feature-discovery-worker-cddlh 1/1 Running 0 62m
gpu-operator pod/gpu-operator-1646339147-node-feature-discovery-worker-gx9gg 1/1 Running 0 62m
gpu-operator pod/gpu-operator-7ff85f9c4f-fw7hb 1/1 Running 0 62m
gpu-operator pod/nvidia-container-toolkit-daemonset-vxdcj 0/1 Init:0/1 0 99s
gpu-operator pod/nvidia-dcgm-exporter-nb5mn 0/1 Init:0/1 0 99s
gpu-operator pod/nvidia-device-plugin-daemonset-rlnjj 0/1 Init:0/1 0 99s
gpu-operator pod/nvidia-driver-daemonset-x25nl 0/1 Init:CrashLoopBackOff 15 (99s ago) 62m
gpu-operator pod/nvidia-operator-validator-d8hrg 0/1 Init:0/4 0 99s

NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-operator daemonset.apps/gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 62m
gpu-operator daemonset.apps/gpu-operator-1646339147-node-feature-discovery-worker 2 2 2 2 2 62m
gpu-operator daemonset.apps/nvidia-container-toolkit-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.container-toolkit=true 62m
gpu-operator daemonset.apps/nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 62m
gpu-operator daemonset.apps/nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 62m
gpu-operator daemonset.apps/nvidia-driver-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.driver=true 62m
gpu-operator daemonset.apps/nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 62m
gpu-operator daemonset.apps/nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 62m
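To see why the driver daemonset is stuck, the usual first step is to inspect its failing init container. A diagnostic sketch, using the pod name nvidia-driver-daemonset-x25nl from the listing above; the init container name k8s-driver-manager is what recent chart versions typically use, so verify it with the jsonpath query before fetching logs:

```shell
# Show pod events and per-container status, including the failing init container
kubectl describe pod -n gpu-operator nvidia-driver-daemonset-x25nl

# List the pod's init container names (the name below is an assumption)
kubectl get pod -n gpu-operator nvidia-driver-daemonset-x25nl \
  -o jsonpath='{.spec.initContainers[*].name}'

# Dump the logs of the previous (crashed) run of that init container
kubectl logs -n gpu-operator nvidia-driver-daemonset-x25nl \
  -c k8s-driver-manager --previous
```

The `--previous` flag matters for CrashLoopBackOff: without it you often get logs from a container that has not produced output yet.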

Now my questions:

  1. Examining the pod logs, it looks like many pods depend on other pods and therefore can't start. Which pods are the first and most essential ones to come up?
  2. Can I reduce the components (features) that the gpu-operator deploys, to make troubleshooting easier? What is the minimum required to get it running? I just want to run a simple CUDA workload on this single-GPU node.
  3. The gpu-operator has a lot of different components; is there documentation on what each one does? For example, what does the validator do, and what is the difference between nvidia-driver and nvidia-device-plugin?
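On question 2: since the driver and nvidia-container-runtime are already installed on the host (as the working ctr test shows), a common minimal setup is to tell the operator to skip deploying them. A sketch, assuming the chart's documented driver.enabled and toolkit.enabled values; check them against your chart version with `helm show values nvidia/gpu-operator`:

```shell
# Install the operator but keep using the host's preinstalled
# driver 510.47.03 and nvidia-container-runtime
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```

With those two daemonsets disabled, the remaining components (device plugin, GFD, DCGM exporter, validator) only need a working NVIDIA runtime on the host, which this node already has.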
