gpu-operator-nfd-worker fails to read net interface attribute speed #429

@yotama-anv

What happened:
I tried to install an online cluster on AWS EC2 instances with gpu-operator. The gpu-operator-nfd-worker pod kept running, but it logged the following error on every discovery cycle (once per minute):

I1027 04:06:00.077427       1 nfd-worker.go:155] Node Feature Discovery Worker v0.10.1
I1027 04:06:00.077505       1 nfd-worker.go:156] NodeName: 'master1'
I1027 04:06:00.077993       1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I1027 04:06:00.078074       1 nfd-worker.go:461] worker (re-)configuration successfully completed
I1027 04:06:00.084018       1 base.go:126] connecting to nfd-master at gpu-operator-node-feature-discovery-master:8080 ...
I1027 04:06:00.084068       1 component.go:36] [core]parsed scheme: ""
I1027 04:06:00.084075       1 component.go:36] [core]scheme "" not registered, fallback to default scheme
I1027 04:06:00.084110       1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{gpu-operator-node-feature-discovery-master:8080  <nil> 0 <nil>}] <nil> <nil>}
I1027 04:06:00.084125       1 component.go:36] [core]ClientConn switching balancer to "pick_first"
I1027 04:06:00.084129       1 component.go:36] [core]Channel switches to new LB policy "pick_first"
I1027 04:06:00.084176       1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1027 04:06:00.084201       1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1027 04:06:00.084323       1 component.go:36] [core]Channel Connectivity change to CONNECTING
W1027 04:06:15.663844       1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup gpu-operator-node-feature-discovery-master on 10.43.0.10:53: read udp 10.42.0.18:57827->10.43.0.10:53: i/o timeout". Reconnecting...
I1027 04:06:15.663863       1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I1027 04:06:15.663892       1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I1027 04:06:16.663975       1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1027 04:06:16.664010       1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1027 04:06:16.664093       1 component.go:36] [core]Channel Connectivity change to CONNECTING
W1027 04:06:26.672234       1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: lookup gpu-operator-node-feature-discovery-master on 10.43.0.10:53: no such host". Reconnecting...
I1027 04:06:26.672255       1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I1027 04:06:26.672284       1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I1027 04:06:28.419087       1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1027 04:06:28.419124       1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1027 04:06:28.419226       1 component.go:36] [core]Channel Connectivity change to CONNECTING
I1027 04:06:28.420372       1 component.go:36] [core]Subchannel Connectivity change to READY
I1027 04:06:28.420391       1 component.go:36] [core]Channel Connectivity change to READY
E1027 04:06:28.596637       1 network.go:145] failed to read net iface attribute speed: read /host-sys/class/net/ens5/speed: invalid argument
I1027 04:06:28.599362       1 nfd-worker.go:472] starting feature discovery...
I1027 04:06:28.599529       1 nfd-worker.go:484] feature discovery completed
I1027 04:06:28.599541       1 nfd-worker.go:565] sending labeling request to nfd-master
E1027 04:07:28.891971       1 network.go:145] failed to read net iface attribute speed: read /host-sys/class/net/ens5/speed: invalid argument
I1027 04:07:28.894501       1 nfd-worker.go:472] starting feature discovery...
I1027 04:07:28.894623       1 nfd-worker.go:484] feature discovery completed
I1027 04:07:28.894634       1 nfd-worker.go:565] sending labeling request to nfd-master
E1027 04:08:28.923784       1 network.go:145] failed to read net iface attribute speed: read /host-sys/class/net/ens5/speed: invalid argument

What you expected to happen:
nfd-worker should support AWS EC2 instances, where the interface speed cannot be read from that file, without logging the error above.
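
To make the expectation concrete, here is a minimal sketch in Python (not nfd's actual Go code) of reading a sysfs network attribute while tolerating this failure. The "invalid argument" text in the log corresponds to errno EINVAL, so the sketch treats EINVAL as "attribute not available" and re-raises anything else. The function name `read_net_attr`, its parameters, and the injectable `read_text` hook are hypothetical and exist only for illustration; the `/host-sys` root is taken from the path in the log.

```python
import errno
from pathlib import Path
from typing import Callable, Optional


def read_net_attr(
    iface: str,
    attr: str,
    sysfs_root: str = "/host-sys",  # mount point seen in the log above
    read_text: Callable[[Path], str] = Path.read_text,  # hypothetical hook, eases testing
) -> Optional[str]:
    """Return the attribute value, or None if the kernel reports EINVAL."""
    path = Path(sysfs_root) / "class" / "net" / iface / attr
    try:
        return read_text(path).strip()
    except OSError as err:
        if err.errno == errno.EINVAL:
            # The attribute exists but cannot be read for this interface;
            # treat it as absent instead of reporting an error.
            return None
        raise  # any other failure (e.g. permission denied) is still surfaced


if __name__ == "__main__":
    speed = read_net_attr("ens5", "speed")
    print("speed:", speed if speed is not None else "not available")
```

With something along these lines, a caller could skip the speed attribute for `ens5` and still process the remaining attributes, rather than emitting an error-level message on every cycle.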

How to reproduce it (as minimally and precisely as possible):
Install a K3s (v1.19.11) cluster on 3 EC2 instances.
Install gpu-operator (which uses NFD v0.10.1).

Environment:

  • Kubernetes version: K3s v1.19.11
  • Cloud provider or hardware configuration: AWS / g4dn.8xlarge
  • OS: The bug was reproduced on both Ubuntu 20.04 and RHEL 8
  • Kernel: 5.15.0-1022-aws
  • Install tools: K3s
  • Network plugin and version: Flannel v0.12.0-k3s1
