Tunnelfront pod can't reach the API server, neither via DNS nor via IP, so the whole cluster is broken #1322
Comments
#102 (comment) - please see this comment. The failure described above is usually caused by the workload saturating the OS disk IOPS and causing the crash.
I, too, ran into a CoreDNS issue that brought our cluster down last night. It seemed as though either traffic was not getting to CoreDNS or it was failing to resolve all requests. In my case, though, CoreDNS did not report any errors and was not cycling. Linking here just in case these are related. We were not running any IOPS-heavy workloads at the time, though the DNS outage started shortly after re-deploying the dev instance of a PHP application we are working on.
@GuyPaddock You may not think it was IOPS-intensive, but if you're using a DS2_v3 with a 100 GB OS disk, the OS disk maxes out at 500 IOPS. That means even 3-4 running containers can saturate it (Docker host IO is roughly 3-4x what the in-memory kube metrics report).
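A quick way to check whether a node's disks are being throttled is to watch IO stats from a debug pod on the node; a minimal sketch, assuming a recent kubectl (the node name is a placeholder — on older clusters, SSH to the node instead):

```sh
# Start an interactive debug pod on the suspect node
kubectl debug node/aks-nodepool1-12345678-0 -it --image=ubuntu

# Inside the pod: /proc/diskstats reflects the host's devices, so iostat
# shows node-level IO. Sustained %util near 100 with a growing average
# queue size points to OS-disk IOPS throttling.
apt-get update && apt-get install -y sysstat
iostat -x 5
```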
Could be possible. In my case, I had to drain, reboot, and uncordon each of the nodes in the cluster to get networking and DNS back up (roughly the sequence sketched below).

I get that IOPS exhaustion can cause problems, but I had been deploying and testing the same app throughout Saturday the 16th and Sunday the 17th without issue. It was only a final deployment at around 11:30 PM ET on the 17th when the cluster decided it was unhappy.

If it was IOPS exhaustion, the only thing I can think of is that the arrangement of pods on the nodes played a part. We use Kured to ensure that our nodes get kernel security updates rolled out gracefully, and Kured cycled the nodes around 5 PM yesterday. The cluster was healthy after all those reboots, but of course a drain is going to re-arrange the pods on the cluster as it reschedules all the workloads. Perhaps the particular arrangement of pods on the cluster caused IOPS exhaustion on one of the nodes?

@jnoller, is there anything within Azure I can look at to determine if IOPS played a role?
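For reference, the drain/reboot/uncordon cycle was roughly the following (node and resource-group names are placeholders; on a VMSS-based node pool you would use `az vmss restart` instead):

```sh
# Evict workloads from the node (DaemonSet pods stay put).
# On newer kubectl the flag is --delete-emptydir-data.
kubectl drain aks-nodepool1-12345678-0 --ignore-daemonsets --delete-local-data

# Reboot the underlying VM in the node resource group
az vm restart --resource-group MC_myRG_myAKS_eastus --name aks-nodepool1-12345678-0

# Allow scheduling on the node again
kubectl uncordon aks-nodepool1-12345678-0
```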
Yes, pod (container) imbalance would cause this, especially on a small cluster. You need to enable monitoring for the IO queue depth and host-level IOPS metrics; when the queue depth spikes, that's usually when the SDN and other components get starved.
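If it helps anyone, host-level disk metrics can be pulled with the Azure CLI; a sketch along these lines (the resource ID is a placeholder, and exact metric names vary by VM series and API version):

```sh
# Sample the OS disk queue depth for a node VM, per minute
az monitor metrics list \
  --resource "/subscriptions/<sub-id>/resourceGroups/<node-rg>/providers/Microsoft.Compute/virtualMachines/<node-vm>" \
  --metric "OS Disk Queue Depth" \
  --interval PT1M \
  --output table
```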
We had to kill all the kube-system pods to resolve a similar issue. The services had no endpoints, and the deployments reported 0/1 even though the pods showed Running without errors. The symptom was that the metrics-server pod could no longer reach the API server.
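In case it's useful to others, the blunt version of this is a single command; the controllers (Deployments/DaemonSets) behind the pods recreate them:

```sh
# Delete every pod in kube-system; their controllers recreate them.
# Disruptive: expect brief DNS/metrics gaps while pods restart.
kubectl delete pods --all --namespace kube-system
```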
Thanks, @jnoller, we do see some spikes in queue length. We will soon try a new cluster with 512 GB premium OS disks; is that a good size for production workloads? It seems a bit unfortunate that disk size has to be tuned to get a reliable API server connection. I've found the best-practices guide on storage, but would appreciate an addition to the docs on this.

Also, we contacted support and they reconciled our cluster by making a PUT request on the cluster (see the sketch below); I guess that helped, as the cluster started working afterwards. We also had one node manually powered off, but I don't think that should have any effect, as we have several clusters with the same setup and they seem to be running fine.
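For anyone who can't wait for support: a no-op update through the CLI reportedly issues the same kind of GET-then-PUT and triggers a reconcile (behavior may depend on CLI version; resource group and cluster name are placeholders):

```sh
# Fetch the managed cluster's resource ID, then PUT it back unchanged,
# which asks the resource provider to reconcile the cluster
CLUSTER_ID=$(az aks show --resource-group myResourceGroup --name myAKSCluster --query id -o tsv)
az resource update --ids "$CLUSTER_ID"
```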
Please see this comment for additional information: #1320 (comment)
Please also see this issue for intermittent NodeNotReady, DNS latency, and other crashes related to system load: #1373
Action required from @Azure/aks-pm
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. @mlushpenko, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question, issue, or suggestion.
What happened: I think node networking is somehow broken. The `coredns` pod is constantly restarting, and `tunnelfront` is running but can't reach the API server.

Anything else we need to know?:

`coredns` logs taken from Docker (as `kubectl logs` is hanging):

In this case, `coredns` can't reach Azure DNS. The same happens for the `tunnelfront` pod:

I also tried to skip DNS by overriding the Kubernetes API server address in `/etc/hosts` inside the `tunnelfront` pod, and it still didn't work:

At the same time, if we just run a random alpine pod, it can connect to Azure DNS (and it is running on the same node, so no idea):

And yet, some other pods like `external-dns` can't connect to Azure DNS (they don't need to, because of coredns, but we checked just for testing purposes).
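A sketch of how such a check can be run from a throwaway alpine pod (the API server FQDN below is a placeholder; 168.63.129.16 is Azure's well-known virtual DNS IP):

```sh
# Throwaway pod for network checks; removed again on exit
kubectl run net-test --rm -it --image=alpine --restart=Never -- sh

# Inside the pod: busybox nslookup takes an explicit server argument,
# so this queries Azure DNS directly, bypassing coredns
nslookup myaks-dns-abcdef.hcp.westeurope.azmk8s.io 168.63.129.16

# Check raw reachability of the API server on 443; even a 401/403
# response proves TCP and TLS work (apk itself needs working egress)
apk add --no-cache curl
curl -vk --max-time 5 https://myaks-dns-abcdef.hcp.westeurope.azmk8s.io/
```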
Environment:

Kubernetes version (use `kubectl version`):