
Unable to connect to the server: net/http: TLS handshake timeout #324

Closed
CarlosOVillanueva opened this issue Apr 24, 2018 · 25 comments

@CarlosOVillanueva

My AKS environment had been working great for the past three weeks until today. I can no longer kubectl into my environment. I've tried from the Azure Cloud Shell and from my local terminal, and neither works. I have also deleted my /.kube/config file and pulled the latest credentials, and still no luck.
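
(For anyone following along, the standard way to re-pull AKS credentials is via the Azure CLI; resource group and cluster names below are placeholders:)

$ az aks get-credentials --resource-group <my-resource-group> --name <my-aks-cluster> --overwrite-existing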

Prior to this happening, I noticed several weird anomalies, including fields within my manifests no longer being accepted and RBAC-mapped objects being denied access to resources. My suspicion is that the Kubernetes environment in East US was updated, based on the problems I'm seeing, but I never did a k8s version dump, so I have no way to confirm.

Any help on this would be greatly appreciated.

@pflickin

Same problem here. This was working until today.

@gabrielbarceloscn

Same problem here. It stopped working. I created a namespace, and at first I thought that was the cause. I also created two public IPs in my master resource group. I don't know whether either of these changes was the cause.

@dyhpoon

dyhpoon commented Apr 25, 2018

Same here. Can we please get an estimate of when this will be fixed? It seems the master nodes are down, because I cannot reach them.

@aevitas

aevitas commented Apr 25, 2018

Same happens to my cluster intermittently. Out of the past 6 days, I have been unable to reach my cluster for three and a half. This is getting ridiculous.

@mdoulaty

Same here for a k8s cluster in eastus

$ kubectl get pods
Unable to connect to the server: net/http: TLS handshake timeout

It was working okay up until yesterday, started misbehaving earlier today, and is now completely inaccessible.

@seniorquico

Even though the issue is closed, there's been ongoing discussion in #112 about recurring issues. Microsoft support confirmed yesterday there was an infrastructure problem and that it had been resolved. We were only able to get things back up by destroying the AKS resource and building a brand new cluster.
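
(For anyone who ends up taking the same route, destroying and rebuilding boils down to deleting and re-creating the managed resource with the Azure CLI, then pulling fresh credentials; a sketch, with placeholder names and node count:)

$ az aks delete --resource-group <my-resource-group> --name <my-aks-cluster>
$ az aks create --resource-group <my-resource-group> --name <my-aks-cluster> --node-count 3 --generate-ssh-keys
$ az aks get-credentials --resource-group <my-resource-group> --name <my-aks-cluster>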

For those just experiencing these connection problems for the first time... this has been going on for months. You should definitely not rely on AKS for stable deployments. They can and do go down with little option for recovery. Furthermore, Microsoft support has been crazy slow to respond to requests and has ultimately provided us with simple "closed as resolved" status updates.

While Microsoft works out the kinks, we moved our production workloads from ACS/AKS to Google Kubernetes Engine on GCP.

@slack
Contributor

slack commented Apr 25, 2018

Sorry for the service issues. We've been chasing QoS issues in East US and West Europe.

East US was addressed yesterday (4/24/2018) and QoS is improving. West Europe service updates just completed, and we expect QoS to start improving there as well.

@itsMicah

itsMicah commented Apr 27, 2018

FYI, I was experiencing the same issue over the past few days, but it appears it was resolved for me recently in East US.

@necevil

necevil commented Jun 5, 2018

Started having this issue in U.S. East as of today.
Filing a ticket with support since it looks like it's an issue connecting to the Kubernetes server rather than anything that we broke.

We run two clusters: one for staging, one for production.
Production is unreachable, but staging, which is essentially a mirror right now (also in U.S. East), can be contacted normally.

The assets being served by the VMs (i.e. our production website) are accessible normally, but no updates can be made using the Kubernetes service itself.

@necevil

necevil commented Jun 6, 2018

Re: AKS TLS Handshake Timeout.
@mdoulaty @aevitas @dyhpoon @gabrielrb @pflickin @CarlosOVillanueva

I am starting to collect info on this issue, costs, and workarounds in a question over on StackOverflow:
https://stackoverflow.com/questions/50726534/why-cant-kubectl-connect-to-the-azure-aks-server-managing-my-cluster-net-http

If you guys could answer a couple of these (here or at the StackOverflow link above), that would help:

  1. Have you had the TLS issue impact your clusters more than once?
  2. How long did it take before your cluster was back up and could be connected to (what was the outage duration)?
  3. Did it 'heal' itself / resolve / come back to life on its own, or did you have to open a support ticket and wait for MS to fix it?
  4. If you opened a ticket, how long was it open before the problem was solved on your cluster?
  5. Did you notice anything weird, like your CPU / network IO dropping but your disk usage jumping higher?
  6. Would you be in favor of allowing AKS preview customers to create higher-severity support tickets (for this specific issue only), regardless of their support plan, in order to get a more timely solution?

Feel free to respond here or (as mentioned) hit StackOverflow and I will collect the responses.

@necevil

necevil commented Jun 6, 2018

@mdoulaty did you try the 'scale up nodes' workaround?

If so, did it work for you? (It fixed my situation.) I also started tracking the number of users that solution helps, but I've only found one reference to it on GitHub (4 thumbs up), so I am unsure how many people have even tried it.

I added an answer about it over here (and will add cases to that as I hear back):
https://stackoverflow.com/questions/50726534/unable-to-connect-net-http-tls-handshake-timeout-why-cant-kubectl-connect?answertab=votes#tab-top
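
(For reference, the 'scale up nodes' workaround itself is a one-liner against the cluster; the resource group, cluster name, and new node count below are placeholders:)

$ az aks scale --resource-group <my-resource-group> --name <my-aks-cluster> --node-count 4

Scaling back down afterwards is the same command with the original count.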

Thanks for the other info, I'll throw that into the spreadsheet.

@mdoulaty

mdoulaty commented Jun 7, 2018

No, I didn't try it for this issue...

@nphmuller

nphmuller commented Jun 8, 2018

I think I'm having this issue too, as of today. Many of my pods keep being restarted because the liveness probe keeps failing, and the reason it keeps failing is that the request times out:
Warning Unhealthy 8m (x39 over 1h) kubelet, aks-agentpool-10516367-0 Liveness probe failed: Get http://10.50.48.14:8080/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers).

Also had the message net/http: TLS handshake timeout a couple of times when running kubectl commands.

Edit: West-Europe region
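
(For anyone digging into the same symptoms, the probe failures above are the sort of thing surfaced by the usual event and pod inspection commands; the pod name is a placeholder:)

$ kubectl get events --sort-by=.metadata.creationTimestamp
$ kubectl describe pod <my-pod>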

@aevitas

aevitas commented Jun 8, 2018

@nphmuller I had the same problem a couple of days ago. Restarting the node boxes brought them back to life.
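
(For the CLI route: the agent VMs live in the cluster's MC_* node resource group, so restarting them is roughly the following; the group and VM names are placeholders:)

$ az vm list --resource-group MC_<my-resource-group>_<my-aks-cluster>_<region> --output table
$ az vm restart --resource-group MC_<my-resource-group>_<my-aks-cluster>_<region> --name <aks-agentpool-vm-name>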

@nphmuller

nphmuller commented Jun 8, 2018

@aevitas Thanks!
I used @necevil's workaround for now, which seems to work, although when scaling up, the new node was stuck in a NotReady state because of a different issue.
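
(To check node status after scaling, the usual commands apply; the node name is a placeholder:)

$ kubectl get nodes
$ kubectl describe node <new-node-name>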

@necevil

necevil commented Jun 8, 2018

@aevitas @nphmuller could you guys check your system resources on the nodes/VMs for me?
Specifically the CPU, Network IO and disk stuff discussed over here: https://stackoverflow.com/questions/50726534/unable-to-connect-net-http-tls-handshake-timeout-why-cant-kubectl-connect

@aevitas sounds like a restart of the node VM worked for you; when/if I run into it again, I will try that on my side.

@aevitas

aevitas commented Jun 9, 2018

@necevil Do you want me to look for something specifically in the resource statistics of the nodes, or do you just want them for the sake of gathering more data?

@necevil

necevil commented Jun 9, 2018

@aevitas trying to figure out if the CPU and Network IO specifically drop when the cluster is impacted and then go back up to their 'normal' state after.

This would in theory allow the creation of an alarm that could keep track of the problem. It's not strictly a requirement, but I figure if we can isolate some of the behaviors beyond kubectl connectivity, maybe we can make more informed decisions about which workaround is likely to work.
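
(For anyone who wants to pull the same numbers, per-VM metrics are available through the CLI; the subscription, resource group, and VM name in the resource ID below are placeholders:)

$ az monitor metrics list --resource /subscriptions/<sub-id>/resourceGroups/MC_<rg>_<cluster>_<region>/providers/Microsoft.Compute/virtualMachines/<node-vm-name> --metric "Percentage CPU" --interval PT5M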

@aevitas

aevitas commented Jun 9, 2018 via email

@benbuckland

Hi all, I just experienced this same issue with a cluster in East US. I scaled the cluster up from 2 to 3 nodes and it has come back to life.

Massively annoying...

@jnoller
Contributor

jnoller commented Apr 3, 2019

First, thank you for reaching out to the AKS team.

For this issue (and the TLS handshake timeout error specifically), what we have seen is that it can usually be traced back to API server downtime (e.g. an upgrade, an overloaded etcd, or another issue) or to custom network rules/NSGs that block the ports needed for API/worker communication. Unfortunately, the error is highly contextual to the cluster, region, etc.
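
(As a rough first check from a client machine, not an official diagnostic, and with placeholder names: you can look up the API server FQDN and test whether TCP 443 is reachable at all:)

$ az aks show --resource-group <my-resource-group> --name <my-aks-cluster> --query fqdn --output tsv
$ nc -vz <returned-fqdn> 443

If the TCP connection succeeds but kubectl still times out during the TLS handshake, the problem is more likely on the API server side than in NSG or firewall rules.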

AKS GitHub issues must generally be re-creatable (e.g. "here are the YAML, settings, packages, etc. installed; the issue can be reproduced on existing clusters as well as newly created clusters"). Unfortunately, this issue is cluster-specific.

For cluster-specific issues (those that will require support and on-call staff to access your subscription and clusters), we ask that you file an Azure technical support ticket with the identified issue, re-creation steps if available, and a link to this GitHub thread.

We cannot ask for access, logs, and other sensitive information on GitHub, or ask you to email/Slack/etc. this information to us. Our portal ticketing system will protect that information, including subscription IDs and credentials. If we discover as part of the root cause that this is an AKS or upstream bug, we will mark this as a known issue and ensure it is on the product backlog as a repair item.

This will allow us to root-cause the issue and ensure it's not specific to the environment or another configuration-level issue, such as NSGs, custom network devices, etc.

jnoller closed this as completed on Apr 3, 2019
@wanghui122725501

This problem is due to your local proxy settings; disable the proxy to resolve it. For example:
[root@centos7-node8 ~]# vim /etc/profile
#export http_proxy="http://192.168.199.248:1080"
#export https_proxy="http://192.168.199.248:1080"
#export ftp_proxy=$http_proxy
[root@centos7-node8 ~]# source /etc/profile
(That is, comment out the http_proxy/https_proxy/ftp_proxy exports in /etc/profile and re-source it so kubectl no longer goes through the proxy.)

@ansha1

ansha1 commented Aug 23, 2019

@wanghui122725501 is there any way to fix this while keeping the proxy? The server I am using to communicate with the k8s cluster is behind a company proxy, which can't be removed.
Appreciate your help!

@hguerrier

Are we saying that kubectl cannot be used behind a corporate proxy? Seems rather limiting, don't you think?
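
(For what it's worth, kubectl can generally work behind a corporate proxy as long as the API server is exempted from proxying; a sketch, with a placeholder FQDN, using the standard no_proxy environment variable that Go-based clients such as kubectl honor:)

$ export no_proxy="<my-aks-cluster-fqdn>,localhost,127.0.0.1"
$ export NO_PROXY="$no_proxy"
$ kubectl get pods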

@Javacppc

Javacppc commented Dec 3, 2019

This problem is due to your local proxy settings; disable the proxy to resolve it. For example:
[root@centos7-node8 ~]# vim /etc/profile
#export http_proxy="http://192.168.199.248:1080"
#export https_proxy="http://192.168.199.248:1080"
#export ftp_proxy=$http_proxy
[root@centos7-node8 ~]# source /etc/profile
Yes, thank you! You solved my problem!

Azure locked this issue as resolved and limited the conversation to collaborators on Aug 8, 2020.