
Helm install of GPU operator doesn't run daemonset containers and validator containers #434

@premmotgi

Description


The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRI-O (>= 1.13)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? See the command sketch after this list.
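
As a quick way to verify the last two items, the checks could look like the following (the CRD name assumes the standard gpu-operator chart; adjust if a different chart version is in use):

    # On each GPU node: confirm the kernel modules are loaded
    lsmod | grep -E 'i2c_core|ipmi_msghandler'

    # Against the cluster: confirm the ClusterPolicy CRD exists and inspect it
    kubectl get crd clusterpolicies.nvidia.com
    kubectl describe clusterpolicies --all-namespaces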

2. Issue or feature description

After the helm install command for the GPU operator is run, only the node-feature-discovery pods come up; the gpu-operator daemonsets and the validator pods are never started.

3. Steps to reproduce the issue

  1. Run the helm install command to install the latest gpu-operator on RHEL 8.4 nodes (see the command sketch below)
  2. Check the running pods using kubectl get pods -n gpu-operator
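
A minimal sketch of the install commands, assuming the chart comes from the NVIDIA Helm repository and is installed with default values (the generated release name and flags below are assumptions, not taken from this report):

    # Add the NVIDIA Helm repository and install the latest gpu-operator chart
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    helm install --wait --generate-name \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator

    # Verify which pods actually come up
    kubectl get pods -n gpu-operator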

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
    [root@control01 ~]# kubectl get pods -A
    NAMESPACE      NAME                                                           READY   STATUS             RESTARTS   AGE
    default        my-release-enterprise-steam-867df478d5-4x296                   1/1     Running            0          4d23h
    gpu-operator   gpu-operator-7878f5869-mfnzc                                   1/1     Running            0          4d3h
    gpu-operator   gpu-operator-node-feature-discovery-master-59b4b67f4f-nsgpk    1/1     Running            0          4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-7plcj               0/1     CrashLoopBackOff   975        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-8b9kq               0/1     CrashLoopBackOff   975        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-hh2zn               0/1     CrashLoopBackOff   975        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-r5jlv               0/1     CrashLoopBackOff   975        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-s8rlb               0/1     CrashLoopBackOff   974        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-sc9x2               0/1     CrashLoopBackOff   975        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-v9j7c               0/1     CrashLoopBackOff   975        4d3h

  • kubernetes daemonset status: kubectl get ds --all-namespaces

  • If a pod/ds is in an error or pending state: kubectl describe pod -n NAMESPACE POD_NAME

  • If a pod/ds is in an error or pending state: kubectl logs -n NAMESPACE POD_NAME
    [root@control01 ~]# kubectl logs gpu-operator-node-feature-discovery-worker-v9j7c -n gpu-operator
    I1107 21:48:04.503798 1 nfd-worker.go:155] Node Feature Discovery Worker v0.10.1
    I1107 21:48:04.503857 1 nfd-worker.go:156] NodeName: 'worker03.robin.ai.lab'
    I1107 21:48:04.504345 1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
    I1107 21:48:04.504407 1 nfd-worker.go:461] worker (re-)configuration successfully completed
    I1107 21:48:04.504441 1 base.go:126] connecting to nfd-master at gpu-operator-node-feature-discovery-master:8080 ...
    I1107 21:48:04.504475 1 component.go:36] [core]parsed scheme: ""
    I1107 21:48:04.504480 1 component.go:36] [core]scheme "" not registered, fallback to default scheme
    I1107 21:48:04.504495 1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{gpu-operator-node-feature-discovery-master:8080 0 }] }
    I1107 21:48:04.504503 1 component.go:36] [core]ClientConn switching balancer to "pick_first"
    I1107 21:48:04.504507 1 component.go:36] [core]Channel switches to new LB policy "pick_first"
    I1107 21:48:04.504527 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
    I1107 21:48:04.504551 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
    I1107 21:48:04.504594 1 component.go:36] [core]Channel Connectivity change to CONNECTING
    W1107 21:48:24.505843 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 172.19.100.15:8080: i/o timeout". Reconnecting...
    I1107 21:48:24.505867 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
    I1107 21:48:24.505890 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
    I1107 21:48:25.505942 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
    I1107 21:48:25.505963 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
    I1107 21:48:25.506042 1 component.go:36] [core]Channel Connectivity change to CONNECTING
    W1107 21:48:45.506557 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 172.19.100.15:8080: i/o timeout". Reconnecting...
    I1107 21:48:45.506586 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
    I1107 21:48:45.506611 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
    I1107 21:48:47.031218 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
    I1107 21:48:47.031247 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
    I1107 21:48:47.031372 1 component.go:36] [core]Channel Connectivity change to CONNECTING
    I1107 21:49:04.505752 1 component.go:36] [core]Channel Connectivity change to SHUTDOWN
    I1107 21:49:04.505778 1 component.go:36] [core]Subchannel Connectivity change to SHUTDOWN
    F1107 21:49:04.505796 1 main.go:64] failed to connect: context deadline exceeded

[root@control01 ~]# kubectl logs gpu-operator-node-feature-discovery-worker-7plcj -n gpu-operator
I1107 21:59:22.763062 1 nfd-worker.go:155] Node Feature Discovery Worker v0.10.1
I1107 21:59:22.763113 1 nfd-worker.go:156] NodeName: 'worker08.robin.ai.lab'
I1107 21:59:22.763500 1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I1107 21:59:22.763555 1 nfd-worker.go:461] worker (re-)configuration successfully completed
I1107 21:59:22.763586 1 base.go:126] connecting to nfd-master at gpu-operator-node-feature-discovery-master:8080 ...
I1107 21:59:22.763618 1 component.go:36] [core]parsed scheme: ""
I1107 21:59:22.763627 1 component.go:36] [core]scheme "" not registered, fallback to default scheme
I1107 21:59:22.763646 1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{gpu-operator-node-feature-discovery-master:8080 0 }] }
I1107 21:59:22.763662 1 component.go:36] [core]ClientConn switching balancer to "pick_first"
I1107 21:59:22.763666 1 component.go:36] [core]Channel switches to new LB policy "pick_first"
I1107 21:59:22.763682 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1107 21:59:22.763701 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1107 21:59:22.763784 1 component.go:36] [core]Channel Connectivity change to CONNECTING
W1107 21:59:42.765933 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 172.19.100.15:8080: i/o timeout". Reconnecting...
I1107 21:59:42.765964 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I1107 21:59:42.766005 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I1107 21:59:43.766068 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1107 21:59:43.766079 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1107 21:59:43.766123 1 component.go:36] [core]Channel Connectivity change to CONNECTING
W1107 22:00:03.766610 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 172.19.100.15:8080: i/o timeout". Reconnecting...
I1107 22:00:03.766635 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I1107 22:00:03.766666 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I1107 22:00:05.644606 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1107 22:00:05.644626 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1107 22:00:05.644723 1 component.go:36] [core]Channel Connectivity change to CONNECTING
I1107 22:00:22.765866 1 component.go:36] [core]Channel Connectivity change to SHUTDOWN
I1107 22:00:22.765903 1 component.go:36] [core]Subchannel Connectivity change to SHUTDOWN
F1107 22:00:22.765921 1 main.go:64] failed to connect: context deadline exceeded
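
The worker logs above show the same failure on every node: the NFD workers cannot reach the nfd-master service (dial tcp 172.19.100.15:8080: i/o timeout), so no feature labels are ever published. Since the GPU operator keys off those node labels before it creates the driver, toolkit, device-plugin, and validator daemonsets, this connectivity failure would explain why those pods never appear. A few checks that could narrow down whether this is a CNI or host-firewall problem (the throwaway pod name and busybox image are placeholders, not taken from this report):

    # Confirm the nfd-master service exists and has endpoints
    kubectl get svc,endpoints -n gpu-operator | grep node-feature-discovery

    # Test reachability of the master service from a temporary pod;
    # an immediate HTTP/parse error means the port is reachable, a hang mirrors the worker timeout
    kubectl run nfd-conn-test --rm -it --restart=Never --image=busybox -- \
        wget -T 5 -O- http://gpu-operator-node-feature-discovery-master.gpu-operator.svc:8080

    # On the RHEL 8.4 nodes: check whether firewalld is blocking pod-to-pod traffic
    firewall-cmd --state && firewall-cmd --list-all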

  • Output of running a container on the GPU machine: docker run -it alpine echo foo
    [root@worker08 ~]# docker run -it alpine echo foo
    Unable to find image 'alpine:latest' locally
    latest: Pulling from library/alpine
    213ec9aee27d: Already exists
    Digest: sha256:bc41182d7ef5ffc53a40b044e725193bc10142a1243f395ee852a8d9730fc2ad
    Status: Downloaded newer image for alpine:latest
    foo

  • Docker configuration file: cat /etc/docker/daemon.json

  • Docker runtime configuration: docker info | grep runtime

  • NVIDIA shared directory: ls -la /run/nvidia

  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • NVIDIA driver directory: ls -la /run/nvidia/driver

  • kubelet logs journalctl -u kubelet > kubelet.logs
