WEU cluster experiencing new issue with "server misbehaving" #305
Comments
Looks like …
I dropped the cluster and had to provision a new one to continue my tests. I saw it twice on two old clusters, but I haven't seen it since.
After repeated attempts, I am unable to reproduce this issue with the latest version of Kubernetes. I will consider this a non-issue unless it happens again.
I just got this error for the second time in two days. Yesterday, restarting the kube-dns pods solved the issue, but now it's not working.
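For anyone else trying this: "restarting the kube-dns pods" just means deleting them and letting the deployment recreate them. A minimal sketch, assuming the usual k8s-app=kube-dns label in kube-system; adjust the selector if your cluster differs.

```sh
# List the kube-dns pods (AKS normally labels them k8s-app=kube-dns)
kubectl -n kube-system get pods -l k8s-app=kube-dns

# Delete them; the kube-dns deployment recreates them automatically
kubectl -n kube-system delete pods -l k8s-app=kube-dns

# Watch the replacements come back up
kubectl -n kube-system get pods -l k8s-app=kube-dns -w
```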
Experiencing this issue also: `error: error upgrading connection: error dialing backend: dial tcp: lookup aks-agentpool-31100981-3 on 172.30.0.10:53: server misbehaving`
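A quick way to check whether that same lookup also fails from inside the cluster; a sketch only, assuming a busybox image can be pulled, and reusing the node name and DNS IP from the error above.

```sh
# Run a throwaway pod and try the lookup the kubelet/API server is failing on
kubectl run dns-test --rm -it --restart=Never --image=busybox -- \
  nslookup aks-agentpool-31100981-3 172.30.0.10
```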
@slack I tried deleting both kube-dns pods, but it didn't seem to fix the issue. Any ideas?
Solved by rebooting the VM for the affected node.
Same problem; cluster created yesterday with the newest Kubernetes...
Same error. Why was this case closed without any solution?
While waiting for a proper fix, I found that draining the node with this problem "fixes" it. Obviously this only works if you have more than one node.
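For reference, the drain workaround is just the standard cordon/evict/uncordon cycle. A sketch, using the node name from the error above as an example:

```sh
# Evict everything except DaemonSet pods from the affected node
kubectl drain aks-agentpool-31100981-3 --ignore-daemonsets --delete-local-data

# Once the pods have been rescheduled on other nodes, make it schedulable again
kubectl uncordon aks-agentpool-31100981-3
```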
Same problem on centralus: `aks-uspool-81828611-1   Ready   agent   16d   v1.11.1`
"Fix" from @vcorr worked for us too, probably since some of daemon-set containers was restarted (dns, proxy etc.)
The drain/uncordon fix did not work for us. As I recall, 172.30.0.10 is in a reserved subnet that AKS uses and that we are specifically directed not to use for the cluster or Docker CIDR; our clusters use 172.18.0.0/24 for the cluster CIDR and 172.19.0.1/24 for the Docker bridge, with 172.18.0.10 as the DNS. Yet we get this error in one of our West Europe clusters using advanced networking (a custom vnet with corporate network IPs) when pulling pod logs or trying to exec commands in pods, even ls. We have several similar clusters that do not have this issue.
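For context, those ranges are what we pass at cluster creation. A rough sketch with our values; the resource names and subnet ID are made up, and mapping "cluster CIDR" onto --service-cidr is my reading of the az aks create flags.

```sh
az aks create \
  --resource-group my-rg \
  --name my-aks-weu \
  --network-plugin azure \
  --vnet-subnet-id "/subscriptions/<sub-id>/resourceGroups/net-rg/providers/Microsoft.Network/virtualNetworks/corp-vnet/subnets/aks-subnet" \
  --service-cidr 172.18.0.0/24 \
  --dns-service-ip 172.18.0.10 \
  --docker-bridge-address 172.19.0.1/24
```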
The issue is not resolved. We are experiencing similar instabilities on most of our clusters, in different regions.
Experiencing the same issue. Restarting the nodes didn't help.
Similar symptoms here, as in the last comment. Problem with the API server?
Repaired for us after ~30 minutes of downtime.
I believe this is related to Azure/acs-engine#3503, with the fix in kubernetes/kubernetes#70353. It can be confirmed by running `kubectl describe` on the nodes, which shows they are missing their InternalIP. A restart of those servers does fix the issue in this case. The fix is merged in 1.13, but that's not available in AKS yet; there's a backport to 1.11 which I assume would be available in AKS, but it isn't merged yet.
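To check for the missing InternalIP that @David-Green describes, something like this works; a sketch, and the node name is just an example.

```sh
# The INTERNAL-IP column should show an address for every node
kubectl get nodes -o wide

# Print only node name + InternalIP; affected nodes show an empty value
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'

# Full detail for a single node
kubectl describe node aks-agentpool-31100981-3
```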
@David-Green Thanks. Experienced this issue on a single-node v1.11.3 cluster that had been running for a month.
@slack should this stay open until Azure/acs-engine#3503 is resolved? The workaround advised above did not work, but this did:
Bouncing the node without draining probably works as well, but it's dirty if you are running a multi-node cluster. Now the IP is located...
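A sketch of what "bouncing the node" can look like from the Azure side; the MC_ resource group and node name are examples, and this assumes availability-set nodes rather than a VM scale set.

```sh
# Node VMs live in the managed "MC_<rg>_<cluster>_<region>" resource group
az vm list --resource-group MC_my-rg_my-aks-weu_westeurope --output table

# Restart the affected node's VM
az vm restart --resource-group MC_my-rg_my-aks-weu_westeurope --name aks-agentpool-31100981-3
```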
We're seeing this issue on one of our nodes as well: it is now missing an IP address. We can't drain the pods on the node, unfortunately, so we're going to spin up a new node, reschedule the important pods onto it, and restart the original node.
This issue appeared on our AKS 1.11.2 cluster in West Europe. Restarting the nodes "solved" it for now.
We experienced this issue; support said it occurred with 1.11.2 and 1.11.3, and that upgrading to the most recent version would prevent it in the future.
As far as I can tell, this is still not fixed in AKS. The change was backported to the 1.11 release; it was expected to make it into 1.11.5 but only made it into 1.11.6 (you can see it in the last line of the changelog here). I also got the same answer from support about the fix being in the current version on Azure, but until someone shows me otherwise I'm assuming it's not, as the latest version Azure is showing is still Kubernetes 1.11.5.
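To compare what AKS currently offers with what the nodes actually run, a sketch; the resource names are examples.

```sh
# Kubernetes versions AKS currently offers in the region
az aks get-versions --location westeurope --output table

# Upgrades available for an existing cluster
az aks get-upgrades --resource-group my-rg --name my-aks-weu --output table

# Version actually running on each node
kubectl get nodes
```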
Starting to see these new error messages across all my ReplicaSets this morning after scaling out to 5 nodes (from 1), during load:
Cluster Location: West Europe.
Cluster VM Size: A4m_v2
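For anyone reproducing, a scale-out like that is typically just az aks scale; a sketch with example names.

```sh
# Scale the default node pool from 1 to 5 nodes
az aks scale --resource-group my-rg --name my-aks-weu --node-count 5
```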