
Unable to set up gpu-operator | nvidia-cuda-validator init "Failed to create pod sandbox" #289

@kuonat

Description


1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)

Ubuntu 20.04
Docker 20.10.7
Kubernetes 1.22.3
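
For reference, the checklist items above can be confirmed with commands roughly like the following (a minimal sketch; output omitted):

lsb_release -a                                     # OS release
docker version --format '{{.Server.Version}}'      # Docker engine version
kubectl version --short                            # Kubernetes client/server version
lsmod | grep -E 'i2c_core|ipmi_msghandler'         # required kernel modules loaded?
kubectl describe clusterpolicies --all-namespaces  # ClusterPolicy CRD applied by the operator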

1. Issue or feature description

I have a Kubernetes cluster created with kubeadm that has several worker nodes without GPUs and a freshly provisioned worker with a single GeForce RTX 3090. That node has no NVIDIA driver, container toolkit, or other NVIDIA software installed.

When I install the GPU operator, I reach a state where every node runs a feature-discovery-worker pod. The GPU worker additionally runs the following pods successfully:

  • gpu-feature-discovery-XXXXX
  • nvidia-container-toolkit-daemonset-XXXXX
  • nvidia-dcgm-XXXXX
  • nvidia-device-plugin-daemonset-XXXXX
  • nvidia-driver-daemonset-XXXXX

However, the nvidia-cuda-validator-XXXXX pod is stuck in status Init:0/1 and the nvidia-operator-validator pod is stuck in status Init:2/4.

From kubectl describe pod -n gpu-operator-resources nvidia-cuda-validator-ffrw8:

Warning FailedCreatePodSandBox 3m35s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4320b99990499dea8f6b68d77b5ffa15edfd0d2ced6d9afc320d14260522a517" network for pod "nvidia-cuda-validator-ffrw8": networkPlugin cni failed to set up pod "nvidia-cuda-validator-ffrw8_gpu-operator-resources" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "4320b99990499dea8f6b68d77b5ffa15edfd0d2ced6d9afc320d14260522a517"

The nvidia-operator-validator fails with "Back-off restarting failed container".
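
To narrow down the sandbox/CNI failure, a rough sketch of the checks that apply on the GPU node (Docker runtime, as in this cluster; container ID taken from the event above):

docker ps -a | grep 4320b99990499dea8f6b68d77b5ffa15edfd0d2ced6d9afc320d14260522a517   # does the terminated sandbox container still exist?
kubectl get pods -n kube-system -o wide                    # state of the CNI plugin pods
journalctl -u kubelet --since "1 hour ago" | grep -i cni   # recent CNI-related kubelet messages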

Additionally, the nvidia-dcgm-exporter pod constantly fails with the following logs:

time="2021-12-02T14:55:53Z" level=info msg="Starting dcgm-exporter"
time="2021-12-02T14:55:53Z" level=info msg="Attemping to connect to remote hostengine at XXX.XXX.XXX.XXX:5555"
time="2021-12-02T14:55:58Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"

There is currently no firewall enabled.
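
A sketch of the connectivity checks relevant to the nv-hostengine error (port taken from the log above; the host IP is a placeholder):

kubectl get pods -n gpu-operator-resources -o wide | grep dcgm   # where the hostengine (nvidia-dcgm) pod runs
kubectl logs -n gpu-operator-resources nvidia-dcgm-XXXXX         # hostengine pod logs
nc -vz XXX.XXX.XXX.XXX 5555                                      # is port 5555 reachable from the exporter's node?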

When I preinstall the driver and the container toolkit on the node, the installation works, so this workaround is still possible. However, it would be great to get the whole GPU operator working on its own.
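
For reference, the workaround install looks roughly like this, assuming the chart's driver.enabled and toolkit.enabled values are used to tell the operator not to manage the preinstalled components:

# driver and container toolkit already installed on the node, so skip both
helm install --wait --generate-name nvidia/gpu-operator \
    --set driver.enabled=false \
    --set toolkit.enabled=false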

2. Steps to reproduce the issue

helm install --wait --generate-name nvidia/gpu-operator
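
The full sequence, assuming the chart comes from the public NVIDIA Helm repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia   # add the NVIDIA chart repository
helm repo update                                          # refresh the local chart index
helm install --wait --generate-name nvidia/gpu-operator   # install with default values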

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n gpu-operator-resources
  • kubernetes daemonset status: kubectl get ds --all-namespaces
  • If a pod/ds is in an error state or pending state kubectl describe pod -n gpu-operator-resources nvidia-cuda-validator-249tt
  • If a pod/ds is in an error state or pending state kubectl describe pod -n gpu-operator-resources nvidia-operator-validator-
  • If a pod/ds is in an error state or pending state kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-pf6n6

daemonset_status.txt
pod_status.txt
cuda_validator_pod_describe.txt
operator_validator_pod_describe.txt
dcgm_exporter_logs.txt

  • Output of running a container on the GPU machine: docker run -it alpine echo foo
  • Docker configuration file: cat /etc/docker/daemon.json
  • Original Docker configuration file

daemon.txt
orig_daemon.txt

  • NVIDIA shared directory: ls -la /run/nvidia
  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
  • NVIDIA driver directory: ls -la /run/nvidia/driver
  • kubelet logs: journalctl -u kubelet > kubelet.logs

directory_outputs.txt
kubelet.logs.txt
