What happened:
Tried to install an online cluster on AWS EC2 instances with gpu-operator. The gpu-operator-nfd-worker pod keeps running, but repeatedly logs the following:
I1027 04:06:00.077427 1 nfd-worker.go:155] Node Feature Discovery Worker v0.10.1
I1027 04:06:00.077505 1 nfd-worker.go:156] NodeName: 'master1'
I1027 04:06:00.077993 1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I1027 04:06:00.078074 1 nfd-worker.go:461] worker (re-)configuration successfully completed
I1027 04:06:00.084018 1 base.go:126] connecting to nfd-master at gpu-operator-node-feature-discovery-master:8080 ...
I1027 04:06:00.084068 1 component.go:36] [core]parsed scheme: ""
I1027 04:06:00.084075 1 component.go:36] [core]scheme "" not registered, fallback to default scheme
I1027 04:06:00.084110 1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{gpu-operator-node-feature-discovery-master:8080 <nil> 0 <nil>}] <nil> <nil>}
I1027 04:06:00.084125 1 component.go:36] [core]ClientConn switching balancer to "pick_first"
I1027 04:06:00.084129 1 component.go:36] [core]Channel switches to new LB policy "pick_first"
I1027 04:06:00.084176 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1027 04:06:00.084201 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1027 04:06:00.084323 1 component.go:36] [core]Channel Connectivity change to CONNECTING
W1027 04:06:15.663844 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup gpu-operator-node-feature-discovery-master on 10.43.0.10:53: read udp 10.42.0.18:57827->10.43.0.10:53: i/o timeout". Reconnecting...
I1027 04:06:15.663863 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I1027 04:06:15.663892 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I1027 04:06:16.663975 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1027 04:06:16.664010 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1027 04:06:16.664093 1 component.go:36] [core]Channel Connectivity change to CONNECTING
W1027 04:06:26.672234 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup gpu-operator-node-feature-discovery-master on 10.43.0.10:53: no such host". Reconnecting...
I1027 04:06:26.672255 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I1027 04:06:26.672284 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I1027 04:06:28.419087 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1027 04:06:28.419124 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1027 04:06:28.419226 1 component.go:36] [core]Channel Connectivity change to CONNECTING
I1027 04:06:28.420372 1 component.go:36] [core]Subchannel Connectivity change to READY
I1027 04:06:28.420391 1 component.go:36] [core]Channel Connectivity change to READY
E1027 04:06:28.596637 1 network.go:145] failed to read net iface attribute speed: read /host-sys/class/net/ens5/speed: invalid argument
I1027 04:06:28.599362 1 nfd-worker.go:472] starting feature discovery...
I1027 04:06:28.599529 1 nfd-worker.go:484] feature discovery completed
I1027 04:06:28.599541 1 nfd-worker.go:565] sending labeling request to nfd-master
E1027 04:07:28.891971 1 network.go:145] failed to read net iface attribute speed: read /host-sys/class/net/ens5/speed: invalid argument
I1027 04:07:28.894501 1 nfd-worker.go:472] starting feature discovery...
I1027 04:07:28.894623 1 nfd-worker.go:484] feature discovery completed
I1027 04:07:28.894634 1 nfd-worker.go:565] sending labeling request to nfd-master
E1027 04:08:28.923784 1 network.go:145] failed to read net iface attribute speed: read /host-sys/class/net/ens5/speed: invalid argument
What you expected to happen:
NFD should support AWS EC2 instances without needing the interface speed from /host-sys/class/net/ens5/speed, so that the error above is not logged.
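For reference, the "invalid argument" in the log corresponds to the read of the sysfs `speed` attribute failing with EINVAL. Below is a minimal Python sketch of how such a read could be tolerated instead of reported as an error. It is not NFD's code: the helper name `read_iface_speed` and the `sys_root` parameter are hypothetical and used only for illustration, and treating a negative value as "unknown" is an assumption about how some drivers report a missing speed.

```python
import errno
import os


def read_iface_speed(iface, sys_root="/sys"):
    """Return the link speed in Mb/s, or None if the driver does not report one.

    Hypothetical helper for illustration only; not part of NFD.
    """
    path = os.path.join(sys_root, "class", "net", iface, "speed")
    try:
        with open(path) as f:
            raw = f.read().strip()
    except OSError as e:
        # Some NIC drivers reject this read with EINVAL ("invalid argument").
        # Treat that as "speed not available" instead of an error.
        if e.errno == errno.EINVAL:
            return None
        raise
    if not raw:
        return None
    speed = int(raw)
    # Assumption: a negative value means the speed is unknown.
    return speed if speed >= 0 else None


if __name__ == "__main__":
    net_dir = "/sys/class/net"
    for iface in sorted(os.listdir(net_dir)):
        try:
            print(iface, read_iface_speed(iface))
        except (OSError, ValueError) as e:
            print(iface, "unreadable:", e)
```

Run on one of the affected instances, this would be expected to print `None` for `ens5` rather than raise, assuming the read fails with EINVAL there as the log suggests.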
How to reproduce it (as minimally and precisely as possible):
Install a K3s (v1.19.11) cluster with 3 EC2 instances.
Install gpu-operator (using nfd v0.10.1).
Environment:
- Kubernetes version: K3s v1.19.11
- Cloud provider or hardware configuration: AWS / g4dn.8xlarge
- OS: The bug was reproduced on both Ubuntu 20.04 and RHEL 8
- Kernel: 5.15.0-1022-aws
- Install tools: K3s
- Network plugin and version: Flannel v0.12.0-k3s1
#429 (comment)