1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? (see the sketch below)
- Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)?
My environment:
- Ubuntu 20.04
- Docker 20.10.7
- Kubernetes 1.22.3
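The checks behind the last two checklist items look roughly like this (a minimal sketch of the commands I mean; run on the GPU node and with cluster access respectively):

```bash
# On the GPU node: check that the required kernel modules are loaded
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# From a machine with kubectl access: confirm the ClusterPolicy CRD was applied
kubectl describe clusterpolicies --all-namespaces
```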
1. Issue or feature description
I have a Kubernetes cluster created with kubeadm that has several worker nodes without a GPU, plus a fresh worker node with a single GeForce RTX 3090 and without any drivers, NVIDIA toolkits, etc. installed.
When I install the GPU operator, I get to a state where all nodes run a feature-discovery-worker pod. The GPU worker additionally runs the following pods successfully:
- gpu-feature-discovery-XXXXX
- nvidia-container-toolkit-daemonset-XXXXX
- nvidia-dcgm-XXXXX
- nvidia-device-plugin-daemonset-XXXXX
- nvidia-driver-daemonset-XXXXX
However, the nvidia-cuda-validator-XXXX pod is stuck at status Init:0/1 and the nvidia-operator-validator is stuck at status Init:2/4.
From `kubectl describe pod -n gpu-operator-resources nvidia-cuda-validator-ffrw8`:
Warning FailedCreatePodSandBox 3m35s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4320b99990499dea8f6b68d77b5ffa15edfd0d2ced6d9afc320d14260522a517" network for pod "nvidia-cuda-validator-ffrw8": networkPlugin cni failed to set up pod "nvidia-cuda-validator-ffrw8_gpu-operator-resources" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "4320b99990499dea8f6b68d77b5ffa15edfd0d2ced6d9afc320d14260522a517"
The nvidia-operator-validator fails with "Back-off restarting failed container".
Additionally, the nvidia-dcgm-exporter pod constantly fails with the following logs:
time="2021-12-02T14:55:53Z" level=info msg="Starting dcgm-exporter"
time="2021-12-02T14:55:53Z" level=info msg="Attemping to connect to remote hostengine at XXX.XXX.XXX.XXX:5555"
time="2021-12-02T14:55:58Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
There is currently no firewall enabled.
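To rule out a connectivity problem between the exporter and the host engine, a check along these lines could be run (a sketch; the pod-name suffix and the XXX.XXX.XXX.XXX address are the placeholders from the logs above):

```bash
# Inspect the standalone host engine the exporter is trying to reach
kubectl logs -n gpu-operator-resources nvidia-dcgm-XXXXX

# From the GPU node: check whether the host engine port is reachable
# (XXX.XXX.XXX.XXX is the address from the dcgm-exporter log)
nc -vz XXX.XXX.XXX.XXX 5555
```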
When I preinstall a driver and the container toolkit, the installation works, so this workaround is still an option. However, it would be great if I could get the whole GPU operator to work correctly on its own.
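For reference, the workaround deployment looks roughly like this (a sketch; it assumes the chart's `driver.enabled` and `toolkit.enabled` values, which tell the operator to skip deploying the driver and container toolkit):

```bash
# Preinstall the NVIDIA driver and nvidia-container-toolkit on the GPU node first,
# then deploy the operator without its own driver/toolkit (chart values assumed)
helm install --wait --generate-name nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```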
2. Steps to reproduce the issue
helm install --wait --generate-name nvidia/gpu-operator
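(This assumes the NVIDIA Helm repository was added beforehand, e.g.:)

```bash
# Add and refresh the NVIDIA Helm repository before installing the chart
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```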
3. Information to attach (optional if deemed irrelevant)
- Kubernetes pods status: `kubectl get pods -n gpu-operator-resources`
- Kubernetes daemonset status: `kubectl get ds --all-namespaces`
- If a pod/ds is in an error state or pending state: `kubectl describe pod -n gpu-operator-resources nvidia-cuda-validator-249tt`
- If a pod/ds is in an error state or pending state: `kubectl describe pod -n gpu-operator-resources nvidia-operator-validator-`
- If a pod/ds is in an error state or pending state: `kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-pf6n6`
daemonset_status.txt
pod_status.txt
cuda_validator_pod_describe.txt
operator_validator_pod_describe.txt
dcgm_exporter_logs.txt
- Output of running a container on the GPU machine: `docker run -it alpine echo foo`
- Docker configuration file: `cat /etc/docker/daemon.json` (original Docker configuration file)
- NVIDIA shared directory: `ls -la /run/nvidia`
- NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
- NVIDIA driver directory: `ls -la /run/nvidia/driver`
- kubelet logs: `journalctl -u kubelet > kubelet.logs`
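For completeness, the attached files above were collected roughly as follows (a sketch; the operator-validator pod-name suffix is a placeholder):

```bash
kubectl get ds --all-namespaces > daemonset_status.txt
kubectl get pods -n gpu-operator-resources > pod_status.txt
kubectl describe pod -n gpu-operator-resources nvidia-cuda-validator-249tt > cuda_validator_pod_describe.txt
# the suffix below is a placeholder for the actual operator-validator pod name
kubectl describe pod -n gpu-operator-resources nvidia-operator-validator-XXXXX > operator_validator_pod_describe.txt
kubectl logs -n gpu-operator-resources nvidia-dcgm-exporter-pf6n6 > dcgm_exporter_logs.txt
journalctl -u kubelet > kubelet.logs
```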