Tunnelfront pod can't reach the API server via DNS or by IP, so the whole cluster is broken #1322

Closed
mlushpenko opened this issue Nov 18, 2019 · 13 comments

@mlushpenko

What happened: I think node networking is somehow broken; the coredns pod is constantly restarting, and tunnelfront is running but can't reach the API server.

Anything else we need to know?:

coredns logs taken from docker (as kubectl logs is hanging):

root@aks-system-10099729-vmss000007:~# docker logs 58b7a00332a0 -f
[WARNING] No files matching import glob pattern: custom/*.override
[WARNING] No files matching import glob pattern: custom/*.server
.:53
2019-11-18T15:29:26.276Z [INFO] CoreDNS-1.3.1
2019-11-18T15:29:26.276Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2019-11-18T15:29:26.276Z [INFO] plugin/reload: Running configuration MD5 = d8c69602fc5a3428908dc8f34f9aae58
2019-11-18T15:29:32.282Z [ERROR] plugin/errors: 2 5511450955529259098.1955468461322526016. HINFO: read udp 10.40.2.38:47985->168.63.129.16:53: i/o timeout
2019-11-18T15:29:35.277Z [ERROR] plugin/errors: 2 5511450955529259098.1955468461322526016. HINFO: read udp 10.40.2.38:44873->168.63.129.16:53: i/o timeout
2019-11-18T15:29:36.277Z [ERROR] plugin/errors: 2 5511450955529259098.1955468461322526016. HINFO: read udp 10.40.2.38:46537->168.63.129.16:53: i/o timeout
2019-11-18T15:29:37.277Z [ERROR] plugin/errors: 2 5511450955529259098.1955468461322526016. HINFO: read udp 10.40.2.38:48115->168.63.129.16:53: i/o timeout
2019-11-18T15:29:40.278Z [ERROR] plugin/errors: 2 5511450955529259098.1955468461322526016. HINFO: read udp 10.40.2.38:42165->168.63.129.16:53: i/o timeout
E1118 15:29:41.278942       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://aks-dev-cdce5e25.hcp.westeurope.azmk8s.io:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp: lookup aks-dev-cdce5e25.hcp.westeurope.azmk8s.io on 168.63.129.16:53: read udp 10.40.2.38:59718->168.63.129.16:53: i/o timeout
E1118 15:29:41.278950       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://aks-dev-cdce5e25.hcp.westeurope.azmk8s.io:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp: lookup aks-dev-cdce5e25.hcp.westeurope.azmk8s.io on 168.63.129.16:53: read udp 10.40.2.38:59718->168.63.129.16:53: i/o timeout
E1118 15:29:41.278962       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://aks-dev-cdce5e25.hcp.westeurope.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup aks-dev-cdce5e25.hcp.westeurope.azmk8s.io on 168.63.129.16:53: read udp 10.40.2.38:59718->168.63.129.16:53: i/o timeout
2019-11-18T15:29:43.278Z [ERROR] plugin/errors: 2 5511450955529259098.1955468461322526016. HINFO: read udp 10.40.2.38:55289->168.63.129.16:53: i/o timeout
2019-11-18T15:29:46.278Z [ERROR] plugin/errors: 2 5511450955529259098.1955468461322526016. HINFO: read udp 10.40.2.38:39865->168.63.129.16:53: i/o timeout
2019-11-18T15:29:49.278Z [ERROR] plugin/errors: 2 5511450955529259098.1955468461322526016. HINFO: read udp 10.40.2.38:58160->168.63.129.16:53: i/o timeout
2019-11-18T15:29:52.279Z [ERROR] plugin/errors: 2 5511450955529259098.1955468461322526016. HINFO: read udp 10.40.2.38:53832->168.63.129.16:53: i/o timeout
2019-11-18T15:29:55.279Z [ERROR] plugin/errors: 2 5511450955529259098.1955468461322526016. HINFO: read udp 10.40.2.38:40530->168.63.129.16:53: i/o timeout
E1118 15:30:02.280343       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://aks-dev-cdce5e25.hcp.westeurope.azmk8s.io:443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup aks-dev-cdce5e25.hcp.westeurope.azmk8s.io on 168.63.129.16:53: read udp 10.40.2.38:42700->168.63.129.16:53: i/o timeout
E1118 15:30:02.280351       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:322: Failed to list *v1.Namespace: Get https://aks-dev-cdce5e25.hcp.westeurope.azmk8s.io:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp: lookup aks-dev-cdce5e25.hcp.westeurope.azmk8s.io on 168.63.129.16:53: read udp 10.40.2.38:42700->168.63.129.16:53: i/o timeout
E1118 15:30:02.280373       1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://aks-dev-cdce5e25.hcp.westeurope.azmk8s.io:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp: lookup aks-dev-cdce5e25.hcp.westeurope.azmk8s.io on 168.63.129.16:53: read udp 10.40.2.38:42700->168.63.129.16:53: i/o timeout

In this case, coredns can't reach Azure DNS. The same happens for the tunnelfront pod:

root@aks-system-10099729-vmss000007:~# docker logs af6ad094f65e
[ Mon Nov 18 15:37:53 UTC 2019 ] INF: Starting: Every:[5] kubeconfig:[/etc/kubernetes/kubeconfig/kubeconfig] SSH-Server:[aks-dev-cdce5e25.hcp.westeurope.azmk8s.io]
[ Mon Nov 18 15:37:53 UTC 2019 ] INF: Creating tunnel interface
[ Mon Nov 18 15:37:53 UTC 2019 ] INF: device tun0 was already setup
[ Mon Nov 18 15:37:53 UTC 2019 ] INF: ip Configuration:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: tun0: <NO-CARRIER,POINTOPOINT,MULTICAST,NOARP,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 500
    link/none
    inet 192.0.2.200 peer 192.0.2.100/32 scope global tun0
       valid_lft forever preferred_lft forever
67: eth0@if68: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fe:54:9d:3d:a8:a5 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.40.2.181/16 scope global eth0
       valid_lft forever preferred_lft forever
[ Mon Nov 18 15:37:53 UTC 2019 ] INF: ******************
[ Mon Nov 18 15:37:53 UTC 2019 ] INF: iptables were previously configured..
[ Mon Nov 18 15:37:53 UTC 2019 ] INF: current ip tables
Chain PREROUTING (policy ACCEPT)
num  target     prot opt source               destination

Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
num  target     prot opt source               destination
1    MASQUERADE  all  --  anywhere             anywhere
2    MASQUERADE  all  --  anywhere             anywhere
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
num  target     prot opt source               destination
1    ACCEPT     all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination
[ Mon Nov 18 15:37:53 UTC 2019 ] INF: ****************
[ Mon Nov 18 15:37:53 UTC 2019 ] INF: Attempting to download tunnel-front secret
Unable to connect to the server: dial tcp: lookup aks-dev-cdce5e25.hcp.westeurope.azmk8s.io on 168.63.129.16:53: read udp 10.40.2.181:38788->168.63.129.16:53: i/o timeout
[ Mon Nov 18 15:39:14 UTC 2019 ] WRN: Attempt:[1] - failed:Get secret:[tunnelfront] from kubernetes. will retry
Unable to connect to the server: dial tcp: lookup aks-dev-cdce5e25.hcp.westeurope.azmk8s.io on 168.63.129.16:53: read udp 10.40.2.181:50761->168.63.129.16:53: i/o timeout
[ Mon Nov 18 15:40:36 UTC 2019 ] WRN: Attempt:[2] - failed:Get secret:[tunnelfront] from kubernetes. will retry

I also tried to skip DNS by overriding the Kubernetes API server address in /etc/hosts inside the tunnelfront pod, and it still didn't work:

docker logs 17eaa3b1c2b9 -f
...
[ Mon Nov 18 15:59:22 UTC 2019 ] INF: Attempting to download tunnel-front secret
Unable to connect to the server: dial tcp: lookup aks-dev-cdce5e25.hcp.westeurope.azmk8s.io on 168.63.129.16:53: read udp 10.40.2.77:57245->168.63.129.16:53: i/o timeout
[ Mon Nov 18 16:00:42 UTC 2019 ] WRN: Attempt:[1] - failed:Get secret:[tunnelfront] from kubernetes. will retry
Unable to connect to the server: dial tcp: lookup aks-dev-cdce5e25.hcp.westeurope.azmk8s.io on 168.63.129.16:53: read udp 10.40.2.77:35086->168.63.129.16:53: i/o timeout
[ Mon Nov 18 16:02:04 UTC 2019 ] WRN: Attempt:[2] - failed:Get secret:[tunnelfront] from kubernetes. will retry
Unable to connect to the server: dial tcp 51.145.190.162:443: i/o timeout
[ Mon Nov 18 16:04:06 UTC 2019 ] WRN: Attempt:[3] - failed:Get secret:[tunnelfront] from kubernetes. will retry
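For reference, a hedged sketch of how that kind of override can be applied from the node (the container ID and API server IP are taken from the logs above; it assumes the tunnelfront image ships a shell):

# append a static hosts entry for the API server FQDN inside the tunnelfront container
docker exec 17eaa3b1c2b9 sh -c 'echo "51.145.190.162 aks-dev-cdce5e25.hcp.westeurope.azmk8s.io" >> /etc/hosts'

The third attempt in the log above (dial tcp 51.145.190.162:443: i/o timeout) suggests the override took effect but the TCP connection to the API server IP still timed out.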

At the same time, if we just run a random alpine container, it can connect to Azure DNS (and it is running on the same node, so I have no idea why):

root@aks-system-10099729-vmss000007:/home/aks_admin# docker run -it --rm alpine ash
/ # apk add bind-tools
/ # dig @168.63.129.16 google.com
; <<>> DiG 9.14.7 <<>> @168.63.129.16 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24050
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1280
;; QUESTION SECTION:
;google.com.                    IN      A
;; ANSWER SECTION:
google.com.             146     IN      A       172.217.168.238
;; Query time: 2 msec
;; SERVER: 168.63.129.16#53(168.63.129.16)
;; WHEN: Mon Nov 18 16:03:09 UTC 2019
;; MSG SIZE  rcvd: 55

And yet, some other pods like external-dns can't connect to Azure DNS (they don't need to reach it directly, since they go through coredns, but we checked just for testing purposes).
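Since docker run gets a fresh network namespace while the failing pods keep their existing ones, a hedged way to narrow this down is to repeat the lookup from inside a failing container's network namespace (the container ID is the coredns one from the logs above; assumes dig is installed on the node, otherwise nslookup works too):

# find the container's PID, then run the same query inside its network namespace
PID=$(docker inspect --format '{{.State.Pid}}' 58b7a00332a0)
nsenter -t "$PID" -n -- dig @168.63.129.16 google.com

If this times out while the fresh alpine container succeeds, the problem is scoped to the pod network namespaces rather than the node's own connectivity.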

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-20T04:49:16Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:07:57Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Size of cluster (how many worker nodes are in the cluster?): 1 node per pool (with autoscaling to 3), two node pools.
  • Others: we are using calico network policies, but I removed all of them to make sure it's not related:
 maksym@Maksyms-MacBook-Pro  ~  kubectl get networkpolicies  --all-namespaces
No resources found.
@jnoller (Contributor) commented Nov 18, 2019

#102 (comment) - please see this comment. The failure above is usually caused by the workload saturating the OS disk IOPS and causing the crash.

@GuyPaddock

I, too, ran into a CoreDNS failure that brought our cluster down last night. It seemed as though either traffic was not getting to CoreDNS or it was failing to resolve all requests. In my case, though, CoreDNS did not report any errors and was not cycling.

Linking here just in case these are related:
#1320

We were not doing any IOPS-heavy workloads at the time, though the DNS outage started shortly after re-deploying the dev instance of a PHP application we are working on.

@jnoller (Contributor) commented Nov 18, 2019

@GuyPaddock You don't think it was IOPS intensive, but if you're using a DS2v_3 with a 100 GB OS disk, the OS disk maxes out at 500 IOPS. That means 3-4 running containers can saturate it (docker host IO counts are 3-4x the in-memory kube metrics data).
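A hedged sketch of checking a node pool's OS disk size with the Azure CLI (the node resource group name is a placeholder; the scale set name is taken from the node name earlier in this issue):

# show the OS disk size (GB) configured on the node pool's scale set
az vmss show \
  --resource-group MC_myResourceGroup_myCluster_westeurope \
  --name aks-system-10099729-vmss \
  --query "virtualMachineProfile.storageProfile.osDisk.diskSizeGB"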

@GuyPaddock commented Nov 18, 2019

Could be possible. In my case, I had to drain, reboot, and uncordon each of the nodes in the cluster to get networking + DNS back up. I get that IOPS exhaustion can cause problems, but I had been deploying and testing the same app throughout Saturday the 16th and Sunday the 17th without issue. It was only a final deployment at around 11:30 PM ET on the 17th where the cluster decided it was unhappy.
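A hedged sketch of that per-node drain/reboot/uncordon cycle (node and scale set names are illustrative, borrowed from earlier in this issue; the kubectl flags are the 1.15-era ones, and the reboot step assumes VMSS-based nodes):

# drain the node, reboot the underlying scale set instance, then uncordon
kubectl drain aks-system-10099729-vmss000007 --ignore-daemonsets --delete-local-data
az vmss restart --resource-group MC_myResourceGroup_myCluster_westeurope \
  --name aks-system-10099729-vmss --instance-ids 7
kubectl uncordon aks-system-10099729-vmss000007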

If it was IOPS exhaustion, the only thing I can think of is that the arrangement of pods on the nodes played a part. We use Kured to ensure that our nodes get kernel security updates rolled out gracefully, and Kured cycled the nodes around 5 PM yesterday. The cluster was healthy after all those reboots, but of course a drain is going to re-arrange the pods on the cluster as it reschedules all the workloads. Perhaps the particular arrangement of pods on the cluster caused IOPS exhaustion on one of the nodes?

@jnoller is there anything within Azure I can look at to determine if IOPS played a role?

@jnoller (Contributor) commented Nov 18, 2019

Yes, pod (container) imbalance would cause this, especially on a small cluster. You need to enable monitoring for the IO queue depth and host-level IOPS metrics; when the queue depth spikes, that's usually when the SDN and other components get starved.
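A hedged sketch of pulling those host-level disk metrics with the Azure CLI (the subscription ID and resource group are placeholders, and the exact metric names may differ by platform version):

# list OS disk queue depth and IOPS metrics for the node pool's scale set
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/MC_myResourceGroup_myCluster_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-system-10099729-vmss" \
  --metric "OS Disk Queue Depth" "OS Disk Read Operations/Sec" "OS Disk Write Operations/Sec" \
  --interval PT1M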

@mleneveut

We had to kill all the kube-system pods to resolve a similar issue. The services had no endpoints and the deployments were reporting 0/1, although the pods showed Running without errors.
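A hedged sketch of that workaround (it deletes every pod in kube-system so their controllers recreate them; disruptive, but those pods are all managed by deployments or daemonsets):

# force all kube-system pods to be recreated
kubectl -n kube-system delete pods --all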

#1320 (comment)

The symptoms were that the metrics-server pod could not reach the API server anymore.

Error: Get https://xxx.hcp.northeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp: lookup xxx.hcp.northeurope.azmk8s.io on 100.66.204.10:53: read udp 100.66.200.117:50378->100.66.204.10:53: read: connection refused

@mlushpenko (Author)

Thanks, @jnoller, we see some spikes on queue length:
[screenshot: disk queue length metrics]

We will soon try a new cluster with 512 GB premium OS disks. Is that a good size for production workloads? It seems a bit unfortunate that disk size has to be tuned to get a reliable API server connection. I've found the best-practices guide on storage, but I would appreciate an addition to the docs along the lines of "for a reliable cluster, use OS disks of at least 200 GB", or even making that the default size for node pools with a warning not to go below it.
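A hedged sketch of adding a node pool with a larger OS disk (resource group, cluster, and pool names are placeholders; --node-osdisk-size is in GB):

# create an additional node pool with a 512 GB OS disk
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myCluster \
  --name systemlarge \
  --node-count 1 \
  --node-osdisk-size 512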

Also, we contacted support and they reconciled our cluster by making a PUT request on the cluster resource; I guess that helped, as the cluster started working afterwards. We also had one node manually powered off, but I don't think that should have any effect, as we have several clusters with the same setup and they seem to be running fine.

@jnoller (Contributor) commented Dec 20, 2019

Please see this comment for additional information: #1320 (comment)

@jnoller (Contributor) commented Jan 8, 2020

Please also see this issue for intermittent NodeNotReady, DNS latency, and other crashes related to system load: #1373

@github-actions

Action required from @Azure/aks-pm

@ghost commented Jul 26, 2020

Action required from @Azure/aks-pm

ghost added the Needs Attention 👋 label on Jul 26, 2020
palma21 added the Needs Information label and removed the Needs Attention 👋 and action-required labels on Jul 27, 2020
ghost added the stale label on Sep 25, 2020
@ghost commented Sep 25, 2020

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

ghost closed this as completed on Oct 11, 2020
@ghost commented Oct 11, 2020

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. @mlushpenko, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.

ghost locked as resolved and limited conversation to collaborators on Nov 10, 2020