WEU cluster experiencing new issue with "server misbehaving" #305

Closed
Zimmergren opened this issue Apr 16, 2018 · 23 comments

@Zimmergren

Starting to see these new error messages across all my ReplicaSets this morning after scaling out to 5 nodes (from 1), during load:

Get https://aks-nodepool1-17125322-3:10250/containerLogs/default/rcst-unicorn-969f78bd7-2fapt/rcst-unicornrunner?tailLines=5000&timestamps=true: dial tcp: lookup aks-nodepool1-17125322-3 on 172.30.0.10:53: server misbehaving

Cluster Location: West Europe.
Cluster VM Size: A4m_v2

@slack
Contributor

slack commented Apr 25, 2018

Looks like kube-dns may be having issues. Were you able to delete kube-dns pods to restore service?
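
For reference, restarting kube-dns usually means deleting its pods and letting the Deployment recreate them. A minimal sketch, assuming the default kube-dns Deployment in kube-system with the standard k8s-app=kube-dns label:

# list the kube-dns pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# delete them; the Deployment schedules fresh replacements
kubectl delete pods -n kube-system -l k8s-app=kube-dns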

slack added the question label Apr 25, 2018
@Zimmergren
Author

I dropped the cluster and had to provision a new one to continue my tests. I saw it twice, on two older clusters, but I haven't seen it since.

@Zimmergren
Author

After repeated attempts to reproduce this issue, I am unable to do so with the latest version of Kubernetes. I will consider this a non-issue unless it happens again.

@yarinm

yarinm commented Aug 20, 2018

I just got this error for the second time in two days. Yesterday, restarting the kube-dns pods solved the issue, but now it's not working.

@timmydo

timmydo commented Aug 23, 2018

Experiencing this issue also: error: error upgrading connection: error dialing backend: dial tcp: lookup aks-agentpool-31100981-3 on 172.30.0.10:53: server misbehaving

@timmydo

timmydo commented Aug 23, 2018

@slack I tried deleting both kube-dns pods but it didn't seem to fix the issue. Any ideas?

@timmydo

timmydo commented Aug 23, 2018

Solved by rebooting the VM for aks-agentpool-31100981-3. It would be nice if there were some sort of watchdog that could automatically fix this...
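
For anyone who would rather reboot from the CLI than the portal, a rough sketch, assuming an availability-set cluster whose node VMs live in the auto-generated MC_* node resource group (the resource group and cluster names below are placeholders):

# find the node resource group for the cluster
az aks show --resource-group <cluster_rg> --name <cluster_name> --query nodeResourceGroup -o tsv

# restart the affected node VM
az vm restart --resource-group <node_rg> --name aks-agentpool-31100981-3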

@vcorr

vcorr commented Aug 24, 2018

Same problem; the cluster was created yesterday with the newest Kubernetes...

@rnkhouse

Same error. Why was this issue closed without any solution?

@vcorr

vcorr commented Aug 29, 2018

While waiting for a proper fix, I found that draining the affected node "fixes" it. Obviously this only works if you have more than one node.

@rzal

rzal commented Aug 30, 2018

Same problem in centralus on aks-uspool-81828611-1 (Ready, agent, 16d, v1.11.1).

@DenisBiondic

"Fix" from @vcorr worked for us too, probably since some of daemon-set containers was restarted (dns, proxy etc.)

kubectl drain aks-nodepool1-xxxxx --ignore-daemonsets
kubectl uncordon aks-nodepool1-xxxxx

@amandadebler

The drain/uncordon fix did not work for us. As I recall, 172.30.0.10 is in a reserved subnet that AKS uses, one we are specifically directed not to use for the cluster or Docker CIDR; our clusters use 172.18.0.0/24 for the cluster CIDR and 172.19.0.1/24 for the Docker bridge, with 172.18.0.10 as the DNS. Yet we get this error in one of our West Europe clusters using advanced networking (custom vnet with corporate network IPs) when pulling pod logs or trying to exec commands in pods, even ls. We have several similar clusters that do not have this issue.
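
One quick way to see which DNS service IP the cluster is actually advertising, assuming the default kube-dns Service in kube-system:

# the CLUSTER-IP column is the in-cluster DNS address; compare it with the IP in the error
kubectl get service kube-dns -n kube-system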

@vglafirov

Issue is not resolved. We are experiencing similar instabilities for most of our clusters in different regions.

@kvolkovich-sc

Experiencing the same issue. Restarting the nodes didn't help.

@andig

andig commented Nov 16, 2018

Similar symptoms here (see the last comment). Problem with the API server?

@kvolkovich-sc

Resolved itself for us after ~30 minutes of downtime.

@David-Green

I believe this is related to Azure/acs-engine#3503, with the fix in kubernetes/kubernetes#70353. It can be confirmed by running kubectl describe on the nodes, which shows they are missing their InternalIP (a quick check is sketched below). A restart of those servers does fix the issue in this case. The fix is merged in 1.13, but that's not available in AKS yet; there's a backport to 1.11, which I assume would become available in AKS, but it isn't merged yet.
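
A quick way to spot affected nodes across the whole cluster; the INTERNAL-IP column should show up empty (or <none>) for nodes hitting this:

kubectl get nodes -o wide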

@choovick

choovick commented Dec 3, 2018

@David-Green Thanks. Experienced this issue on a single-node v1.11.3 cluster that had been running for a month.

kubectl describe node <node_name> | grep 'Addresses:' -A 2 confirms there is no InternalIP on my node as well. I guess we are waiting for Kubernetes v1.13...

@slack, should this stay open until Azure/acs-engine#3503 is resolved? The "server misbehaving" message is not mentioned in the related issues.

The workaround advised above did not work, but this did:

kubectl drain <node_name> --ignore-daemonsets

# reboot VM in dashboard or via ssh

kubectl uncordon <node_name>

Bouncing the node without draining probably works as well, but it's messy if you are running a multi-node cluster.

Now the InternalIP is present:

$ kubectl describe node <node_name> | grep 'Addresses:' -A 2
Addresses:
  InternalIP:  10.240.0.4
  Hostname:    <node_name>

@erewok

erewok commented Dec 17, 2018

We're seeing this issue on one of our nodes as well; it is now missing an IP address. We can't drain the pods on that node, unfortunately, so we're going to spin up a new node, reschedule the important pods onto it, and restart the affected node.

@baracoder

This issue appeared on our AKS 1.11.2 cluster in West Europe. Restarting the nodes "solved" it for now.

@toddgardner

We experienced this issue as well; support said it occurred with 1.11.2 and 1.11.3, and that upgrading to the most recent version would prevent it in the future.
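
For reference, the upgrade support suggested can be done via the Azure CLI; a sketch with placeholder names:

# list the versions this cluster can upgrade to
az aks get-upgrades --resource-group <cluster_rg> --name <cluster_name> --output table

# upgrade the control plane and nodes to a specific version
az aks upgrade --resource-group <cluster_rg> --name <cluster_name> --kubernetes-version <version>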

@David-Green

David-Green commented Jan 3, 2019

As far as I can tell, this is still not fixed in AKS. The change was backported to the 1.11 release; it was expected to make it into 1.11.5 but only made it into 1.11.6 (you can see it in the last line of the changelog here):
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#changelog-since-v1115

I also got the same answer from support about the fix being in the current AKS version, but until someone shows me otherwise I'm assuming it's not, as the latest version AKS offers is still Kubernetes 1.11.5:

az aks get-versions --location westus --output table
KubernetesVersion    Upgrades
-------------------  ----------------------
1.11.5               None available
1.11.4               1.11.5
1.10.9               1.11.4, 1.11.5
1.10.8               1.10.9, 1.11.4, 1.11.5
1.9.11               1.10.8, 1.10.9
1.9.10               1.9.11, 1.10.8, 1.10.9
1.8.15               1.9.10, 1.9.11
1.8.14               1.8.15, 1.9.10, 1.9.11

Azure locked as resolved and limited conversation to collaborators Aug 8, 2020