Unable to connect to the server: net/http: TLS handshake timeout #324
Comments
Same problem here. This was working until today.
Same problem here. It stopped working. I've created a namespace, and I thought at first that this was the cause. I've also created two public IPs in my master resource group. I don't know if one of these changes was the cause.
Same here. Can we please get an estimate of when this will be fixed? It seems the master nodes are down, because I cannot reach them.
Same happens to my cluster intermittently. Out of the past 6 days, I have been unable to reach my cluster for three and a half. This is getting ridiculous.
Same here for a k8s cluster in eastus. It was working okay up until yesterday, started misbehaving earlier today, and is now completely inaccessible.
Even though the issue is closed, there's been ongoing discussion in #112 about recurring issues. Microsoft support confirmed yesterday that there was an infrastructure problem and that it had been resolved. We were only able to get things back up by destroying the AKS resource and building a brand-new cluster. For those just experiencing these connection problems for the first time: this has been going on for months. You should definitely not rely on AKS for stable deployments; clusters can and do go down with little option for recovery. Furthermore, Microsoft support has been extremely slow to respond to requests and has ultimately provided us with simple "closed as resolved" status updates. While Microsoft works out the kinks, we moved our production workloads from ACS/AKS to Google Kubernetes Engine on GCP.
Sorry for the service issues. We've been chasing QoS issues in East US and West Europe. East US was addressed yesterday (4/24/2018) and QoS is improving. West Europe service updates have just completed, and we expect QoS to start improving there as well.
FYI, I was experiencing the same issue the past few days, but it appears it was resolved for me recently in East US.
Started having this issue in US East as of today. We run two clusters: one for staging, one for production. The assets being served by the VMs are accessible normally (i.e., our production website), but no updates can be made using the Kubernetes service itself.
Re: AKS TLS handshake timeout. I am starting to collect info on this issue, its costs, and workarounds in a question over on Stack Overflow. If you could answer a couple of the questions there, it would help. Feel free to respond here or (as mentioned) on Stack Overflow and I will collect the responses.
@mdoulaty did you try the 'scale up nodes' workaround? If so, did it work for you? (It fixed my situation.) I also started tracking the number of users that solution helps, but only found one reference to it on GitHub (4 thumbs up), so I am unsure how many people have even tried it. I added an answer about it over here (and will add cases to it as I hear back). Thanks for the other info; I will throw that into the spreadsheet.
No, I didn't try it for this issue...
I think I'm having this issue too since today. Many of my pods keep being restarted because the liveness probe keeps failing, and the reason it keeps failing is that the request times out. I also had the same error message. Edit: West Europe region.
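If the probe failures are just transient timeouts, one thing worth trying in the meantime is relaxing the probe timing so a brief slowdown doesn't get the pod restarted. A rough sketch with `kubectl patch` (the deployment and container name `my-app` are placeholders, and the numbers are only a starting point, not a recommendation):

```bash
# Loosen the liveness probe on a deployment so short timeouts don't trigger restarts.
# "my-app" (deployment and container name) is a placeholder - substitute your own.
kubectl patch deployment my-app --type=strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"my-app","livenessProbe":{"timeoutSeconds":10,"periodSeconds":20,"failureThreshold":6}}]}}}}'
```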
@nphmuller I had the same problem a couple of days ago. Restarting the node boxes brought them back to life.
@aevitas @nphmuller could you check your system resources on the nodes / VMs for me? @aevitas it sounds like a restart of the node VMs worked for you; when/if I run into it again I will try that on my side.
@necevil Do you want me to look for something specifically in the resource statistics of the nodes, or do you just want them for the sake of gathering more data?
@aevitas I'm trying to figure out whether CPU and network IO specifically drop when the cluster is impacted and then go back up to their 'normal' state afterwards. This would, in theory, allow the creation of an alarm that could keep track of the problem. It's not strictly a requirement, but I figure if we can isolate some of the behaviors beyond kubectl connectivity, maybe we can make more informed decisions about which workaround is likely to work.
You could also add an alert that periodically checks the k8s API URL. Last time I ran into the TLS handshake timeout, I was unable to reach the k8s API entirely.
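A minimal sketch of such a check, run from cron or a monitoring box (it assumes kubectl is already configured against the affected cluster; the log path is a placeholder):

```bash
#!/usr/bin/env bash
# Periodically probe the Kubernetes API server; log a line when it is unreachable,
# which an alerting system can then pick up. Assumes kubectl points at the cluster to watch.
if ! kubectl get --raw /healthz --request-timeout=10s >/dev/null 2>&1; then
    echo "$(date -u +%FT%TZ) k8s API unreachable (possible TLS handshake timeout)" >> /var/log/k8s-api-check.log
fi
```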
Hi all, I just experienced this same issue with a cluster in East US. I scaled the cluster up from 2 to 3 nodes and it has come back to life. Massively annoying...
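For anyone else wanting to try the scale-up workaround, it amounts to something like this with the Azure CLI (the resource group and cluster names are placeholders):

```bash
# Scale the AKS node pool from 2 to 3 nodes - per the reports above this is often
# enough to bring the cluster back. "my-rg" and "my-aks" are placeholders.
az aks scale --resource-group my-rg --name my-aks --node-count 3
```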
First, thank you for reaching out to the AKS team. For this issue (and the TLS handshake timeout error), what we have seen is that it can be traced back to API server downtime (e.g., an upgrade, an overload of etcd, or another issue), or to custom network rules / NSGs that block the ports needed for API / worker communication. Unfortunately the error is highly contextual to the cluster, region, etc.

AKS GitHub issues must generally be reproducible (e.g., here is the YAML/settings/packages installed; the issue can be recreated on existing clusters as well as newly created clusters). Unfortunately, this issue is cluster-specific. For cluster-specific issues (those that require support and on-call staff to access your subscription and clusters), we ask that you file an Azure technical support ticket with the identified issue, re-creation steps if available, and a link to this GitHub thread. We cannot ask for access, logs, and other sensitive information on GitHub, or ask you to email/Slack this information to us; our portal ticketing system will protect that information, including subscription IDs and credentials.

If we discover as part of the root cause that this is an AKS or upstream bug, we will mark this issue as a known issue and ensure it is on the product backlog as a repair item. This will allow us to root-cause the issue and ensure it is not specific to the environment or another configuration-level issue, such as NSGs or custom network devices.
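If you suspect the NSG angle mentioned above, one quick way to review the rules sitting in front of the nodes is something like the following. The node resource group name is a placeholder (AKS typically generates one starting with MC_), and `<nsg-name>` comes from the first command:

```bash
# List the NSGs in the cluster's node resource group, then dump their rules to check
# that nothing blocks API server <-> worker traffic (TCP 443 in particular).
# "MC_my-rg_my-aks_eastus" is a placeholder node resource group name.
az network nsg list --resource-group MC_my-rg_my-aks_eastus --query '[].name' -o tsv
az network nsg rule list --resource-group MC_my-rg_my-aks_eastus --nsg-name <nsg-name> -o table
```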
This problem may be due to your server proxy; try disabling the local proxy to resolve it.
@wanghui122725501 is there any way we can fix this while keeping the proxy? The server I am using to communicate with the k8s cluster is behind a company proxy, which can't be removed.
Are we saying that kubectl cannot be used behind a corporate proxy? That seems rather limiting, don't you think?
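For what it's worth, kubectl (like most Go-based tooling) honors the standard proxy environment variables, so running it behind a corporate proxy is usually a matter of something like the following. The proxy address below is a placeholder:

```bash
# Route kubectl's API traffic through the corporate proxy.
# "proxy.corp.example:3128" is a placeholder - use your proxy's address and port.
export HTTPS_PROXY=http://proxy.corp.example:3128
# Hosts and domains that should bypass the proxy (local addresses, in-cluster suffixes, etc.).
export NO_PROXY=localhost,127.0.0.1,.svc,.cluster.local
kubectl cluster-info
```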
My AKS environment has been working great for the past three weeks, until today. I can no longer kubectl into my environment. I've tried from the Azure Cloud Shell and from my local terminal, and neither works. I have also deleted my ~/.kube/config file and pulled the latest credentials, and still no luck. Prior to this happening, I noticed several weird anomalies, including fields within my manifest no longer being accepted and RBAC-mapped objects being denied access to resources. My suspicion is that the Kubernetes environment in East US was updated, based on the problems I'm seeing, but I never did a k8s version dump, so I have no way to confirm.
Any help on this would be greatly appreciated.
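For reference, the credential refresh described above is roughly the following with the Azure CLI (resource group and cluster names are placeholders):

```bash
# Wipe the stale kubeconfig, pull fresh AKS credentials, then sanity-check connectivity.
# "my-rg" and "my-aks" are placeholders for your resource group and cluster.
rm ~/.kube/config
az aks get-credentials --resource-group my-rg --name my-aks
kubectl cluster-info
```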