WEU cluster experiencing new issue with "server misbehaving" #305

Closed
Zimmergren opened this issue Apr 16, 2018 · 23 comments

@Zimmergren

Starting to see these new error messages across all my ReplicaSets this morning after scaling out to 5 nodes (from 1), during load:

Get https://aks-nodepool1-17125322-3:10250/containerLogs/default/rcst-unicorn-969f78bd7-2fapt/rcst-unicornrunner?tailLines=5000&timestamps=true: dial tcp: lookup aks-nodepool1-17125322-3 on 172.30.0.10:53: server misbehaving

Cluster Location: West Europe.
Cluster VM Size: A4m_v2

@slack
Contributor

slack commented Apr 25, 2018

Looks like kube-dns may be having issues. Were you able to delete kube-dns pods to restore service?
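
For reference, restarting kube-dns usually means deleting its pods and letting the Deployment recreate them. A minimal sketch, assuming the default kube-dns Deployment in kube-system with the standard k8s-app=kube-dns label:

# list the kube-dns pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# delete them; the Deployment schedules fresh replacements
kubectl delete pods -n kube-system -l k8s-app=kube-dns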

slack added the question label Apr 25, 2018
@Zimmergren
Author

I dropped the cluster and had to provision a new one to continue my tests. I saw it twice, on two older clusters, but I haven't seen it since.

@Zimmergren
Author

After repeated attempts to reproduce this issue, I am unable to do so with the latest version of Kubernetes. I will consider this a non-issue unless it happens again.

@yarinm

yarinm commented Aug 20, 2018

I just got this error for the second time in two days. Yesterday, restarting the kube-dns pods solved the issue, but now it's not working.

@timmydo

timmydo commented Aug 23, 2018

Experiencing this issue also: error: error upgrading connection: error dialing backend: dial tcp: lookup aks-agentpool-31100981-3 on 172.30.0.10:53: server misbehaving

@timmydo

timmydo commented Aug 23, 2018

@slack I tried deleting both kube-dns pods but it didn't seem to fix the issue. Any ideas?

@timmydo

timmydo commented Aug 23, 2018

Solved by rebooting the VM for aks-agentpool-31100981-3. It would be nice if there were some sort of watchdog that could automatically fix this...
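
For anyone who would rather reboot from the CLI than the portal, a rough sketch, assuming an availability-set cluster whose node VMs live in the auto-generated MC_* node resource group (the resource group and cluster names below are placeholders):

# find the node resource group for the cluster
az aks show --resource-group <cluster_rg> --name <cluster_name> --query nodeResourceGroup -o tsv

# restart the affected node VM
az vm restart --resource-group <node_rg> --name aks-agentpool-31100981-3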

@vcorr

vcorr commented Aug 24, 2018

Same problem; the cluster was created yesterday with the newest Kubernetes...

@rnkhouse

Same error. Why was this issue closed without any solution?

@vcorr

vcorr commented Aug 29, 2018

While waiting for a proper fix, I found that draining the affected node "fixes" it. Obviously this only works if you have more than one node.

@rzal

rzal commented Aug 30, 2018

Same problem in centralus on aks-uspool-81828611-1 (Ready, agent, 16d, v1.11.1).

@DenisBiondic

"Fix" from @vcorr worked for us too, probably since some of daemon-set containers was restarted (dns, proxy etc.)

kubectl drain aks-nodepool1-xxxxx --ignore-daemonsets
kubectl uncordon aks-nodepool1-xxxxx

@amandadebler

The drain/uncordon fix did not work for us. As I recall, 172.30.0.10 is in a reserved subnet that AKS uses, one we are specifically directed not to use for the cluster or Docker CIDR; our clusters use 172.18.0.0/24 for the cluster CIDR and 172.19.0.1/24 for the Docker bridge, with 172.18.0.10 as the DNS. Yet we get this error in one of our West Europe clusters using advanced networking (custom vnet with corporate network IPs) when pulling pod logs or trying to exec commands in pods, even ls. We have several similar clusters that do not have this issue.
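
One quick way to see which DNS service IP the cluster is actually advertising, assuming the default kube-dns Service in kube-system:

# the CLUSTER-IP column is the in-cluster DNS address; compare it with the IP in the error
kubectl get service kube-dns -n kube-system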

@vglafirov

Issue is not resolved. We are experiencing similar instabilities for most of our clusters in different regions.

@kvolkovich-sc

Experiencing the same issue. Restarting the nodes didn't help.

@andig

andig commented Nov 16, 2018

Similar symptoms here (see the last comment). Problem with the API server?

@kvolkovich-sc

Resolved itself for us after ~30 minutes of downtime.

@David-Green

I believe this is related to Azure/acs-engine#3503, with the fix in kubernetes/kubernetes#70353. It can be confirmed by running kubectl describe on the nodes, which shows they are missing their InternalIP (a quick check is sketched below). A restart of those servers does fix the issue in this case. The fix is merged in 1.13, but that's not available in AKS yet; there's a backport to 1.11, which I assume would become available in AKS, but it isn't merged yet.
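
A quick way to spot affected nodes across the whole cluster; the INTERNAL-IP column should show up empty (or <none>) for nodes hitting this:

kubectl get nodes -o wide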

@choovick

choovick commented Dec 3, 2018

@David-Green Thanks. Experienced this issue on a single-node v1.11.3 cluster that had been running for a month.

kubectl describe node <node_name> | grep 'Addresses:' -A 2 confirms there is no InternalIP on my node as well. I guess we are waiting for Kubernetes v1.13...

@slack, should this stay open until Azure/acs-engine#3503 is resolved? The "server misbehaving" message is not mentioned in the related issues.

The workaround advised above did not work, but this did:

kubectl drain <node_name> --ignore-daemonsets

# reboot VM in dashboard or via ssh

kubectl uncordon <node_name>

Bouncing the node without draining probably works as well, but it's messy if you are running a multi-node cluster.

Now the InternalIP is present:

$ kubectl describe node <node_name> | grep 'Addresses:' -A 2
Addresses:
  InternalIP:  10.240.0.4
  Hostname:    <node_name>

@erewok

erewok commented Dec 17, 2018

We're seeing this issue on one of our nodes as well; it is now missing an IP address. We can't drain the pods on that node, unfortunately, so we're going to spin up a new node, reschedule the important pods onto it, and restart the affected node.

@baracoder

This issue appeared on our AKS 1.11.2 cluster in West Europe. Restarting the nodes "solved" it for now.

@toddgardner

We experienced this issue as well; support said it occurred with 1.11.2 and 1.11.3, and that upgrading to the most recent version would prevent it in the future.
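
For reference, the upgrade support suggested can be done via the Azure CLI; a sketch with placeholder names:

# list the versions this cluster can upgrade to
az aks get-upgrades --resource-group <cluster_rg> --name <cluster_name> --output table

# upgrade the control plane and nodes to a specific version
az aks upgrade --resource-group <cluster_rg> --name <cluster_name> --kubernetes-version <version>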

@David-Green

David-Green commented Jan 3, 2019

As far as I can tell, this is still not fixed in AKS. The change was backported to the 1.11 release; it was expected to make it into 1.11.5 but only made it into 1.11.6 (you can see it in the last line of the changelog here):
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#changelog-since-v1115

I also got the same answer from support about the fix being in the current AKS version, but until someone shows me otherwise I'm assuming it's not, as the latest version AKS offers is still Kubernetes 1.11.5:

az aks get-versions --location westus --output table
KubernetesVersion    Upgrades
-------------------  ----------------------
1.11.5               None available
1.11.4               1.11.5
1.10.9               1.11.4, 1.11.5
1.10.8               1.10.9, 1.11.4, 1.11.5
1.9.11               1.10.8, 1.10.9
1.9.10               1.9.11, 1.10.8, 1.10.9
1.8.15               1.9.10, 1.9.11
1.8.14               1.8.15, 1.9.10, 1.9.11

Azure locked as resolved and limited conversation to collaborators Aug 8, 2020