
Unable to connect to the server: net/http: TLS handshake timeout #324

Closed
CarlosOVillanueva opened this issue Apr 24, 2018 · 25 comments

@CarlosOVillanueva

My AKS environment had been working great for the past three weeks until today. I can no longer kubectl into my environment. I've tried from the Azure Cloud Shell and from my local terminal, and neither works. I have also deleted my /.kube/config file and pulled the latest credentials, and still no luck.
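
(For anyone following along, the standard way to re-pull AKS credentials is via the Azure CLI; resource group and cluster names below are placeholders:)

$ az aks get-credentials --resource-group <my-resource-group> --name <my-aks-cluster> --overwrite-existing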

Prior to this happening, I noticed several weird anomalies, including fields within my manifests no longer being accepted and RBAC-mapped objects being denied access to resources. My suspicion is that the Kubernetes environment in East US was updated, based on the problems I'm seeing, but I never did a k8s version dump, so I have no way to confirm.

Any help on this would be greatly appreciated.

@pflickin

Same problem here. This was working until today.

@gabrielbarceloscn

Same problem here. It stopped working. I created a namespace, and at first I thought that was the cause. I also created two public IPs in my master resource group. I don't know whether either of these changes was the cause.

@dyhpoon

dyhpoon commented Apr 25, 2018

Same here. Can we please get an estimate of when this will be fixed? It seems the master nodes are down, because I cannot reach them.

@aevitas

aevitas commented Apr 25, 2018

Same happens to my cluster intermittently. Out of the past 6 days, I have been unable to reach my cluster for three and a half. This is getting ridiculous.

@mdoulaty

Same here for a k8s cluster in eastus

$ kubectl get pods
Unable to connect to the server: net/http: TLS handshake timeout

It was working okay up until yesterday, started misbehaving earlier today, and is now completely inaccessible.

@seniorquico

Even though the issue is closed, there's been ongoing discussion in #112 about recurring issues. Microsoft support confirmed yesterday there was an infrastructure problem and that it had been resolved. We were only able to get things back up by destroying the AKS resource and building a brand new cluster.
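
(For anyone who ends up taking the same route, destroying and rebuilding boils down to deleting and re-creating the managed resource with the Azure CLI, then pulling fresh credentials; a sketch, with placeholder names and node count:)

$ az aks delete --resource-group <my-resource-group> --name <my-aks-cluster>
$ az aks create --resource-group <my-resource-group> --name <my-aks-cluster> --node-count 3 --generate-ssh-keys
$ az aks get-credentials --resource-group <my-resource-group> --name <my-aks-cluster>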

For those just experiencing these connection problems for the first time... this has been going on for months. You should definitely not rely on AKS for stable deployments. They can and do go down with little option for recovery. Furthermore, Microsoft support has been crazy slow to respond to requests and has ultimately provided us with simple "closed as resolved" status updates.

While Microsoft works out the kinks, we moved our production workloads from ACS/AKS to Google Kubernetes Engine on GCP.

@slack
Contributor

slack commented Apr 25, 2018

Sorry for the service issues. We've been chasing QoS issues in East US and West Europe.

East US was addressed yesterday (4/24/2018) and QoS is improving. West Europe service updates just completed, and we expect QoS to start improving there as well.

@itsMicah

itsMicah commented Apr 27, 2018

FYI, I was experiencing the same issue over the past few days, but it appears it was resolved for me recently in East US.

@necevil

necevil commented Jun 5, 2018

Started having this issue in U.S. East as of today.
Filing a ticket with support since it looks like it's an issue connecting to the Kubernetes server rather than anything that we broke.

We run two clusters: one for staging, one for production.
Production is unreachable, but staging, which is essentially a mirror right now (also in U.S. East), can be contacted normally.

The assets being served by the VMs (i.e. our production website) are accessible normally, but no updates can be made using the Kubernetes service itself.

@necevil

necevil commented Jun 6, 2018

Re: AKS TLS Handshake Timeout.
@mdoulaty @aevitas @dyhpoon @gabrielrb @pflickin @CarlosOVillanueva

I am starting to collect info on this issue, costs, and workarounds in a question over on StackOverflow:
https://stackoverflow.com/questions/50726534/why-cant-kubectl-connect-to-the-azure-aks-server-managing-my-cluster-net-http

If you guys could answer a couple of these (here or at the StackOverflow link above), that would help:

  1. Have you had the TLS issue impact your clusters more than once?
  2. How long did it take before your cluster was back up and could be connected to (what was the outage duration)?
  3. Did it 'heal' itself / resolve / come back to life on its own, or did you have to open a support ticket and wait for MS to fix it?
  4. If you opened a ticket, how long was it open before the problem was solved on your cluster?
  5. Did you notice anything weird, like your CPU / network IO dropping but your disk usage jumping higher?
  6. Would you be in favor of allowing AKS preview customers to create higher-severity support tickets (for this specific issue only), regardless of their support plan, in order to get a more timely solution?

Feel free to respond here or (as mentioned) hit StackOverflow and I will collect the responses.

@necevil

necevil commented Jun 6, 2018

@mdoulaty did you try the 'scale up nodes' workaround?

If so, did it work for you? (It fixed my situation.) I also started tracking the number of users that solution helps, but I've only found one reference to it on GitHub (4 thumbs up), so I am unsure how many people have even tried it.

I added an answer about it over here (and will add cases to that as I hear back):
https://stackoverflow.com/questions/50726534/unable-to-connect-net-http-tls-handshake-timeout-why-cant-kubectl-connect?answertab=votes#tab-top
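
(For reference, the 'scale up nodes' workaround itself is a one-liner against the cluster; the resource group, cluster name, and new node count below are placeholders:)

$ az aks scale --resource-group <my-resource-group> --name <my-aks-cluster> --node-count 4

Scaling back down afterwards is the same command with the original count.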

Thanks for the other info, I'll throw that into the spreadsheet.

@mdoulaty

mdoulaty commented Jun 7, 2018

No, I didn't try it for this issue...

@nphmuller

nphmuller commented Jun 8, 2018

I think I'm having this issue too, as of today. Many of my pods keep being restarted because the liveness probe keeps failing, and the reason it keeps failing is that the request times out:
Warning Unhealthy 8m (x39 over 1h) kubelet, aks-agentpool-10516367-0 Liveness probe failed: Get http://10.50.48.14:8080/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers).

Also had the message net/http: TLS handshake timeout a couple of times when running kubectl commands.

Edit: West-Europe region
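
(For anyone digging into the same symptoms, the probe failures above are the sort of thing surfaced by the usual event and pod inspection commands; the pod name is a placeholder:)

$ kubectl get events --sort-by=.metadata.creationTimestamp
$ kubectl describe pod <my-pod>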

@aevitas

aevitas commented Jun 8, 2018

@nphmuller I had the same problem a couple of days ago. Restarting the node boxes brought them back to life.
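
(For the CLI route: the agent VMs live in the cluster's MC_* node resource group, so restarting them is roughly the following; the group and VM names are placeholders:)

$ az vm list --resource-group MC_<my-resource-group>_<my-aks-cluster>_<region> --output table
$ az vm restart --resource-group MC_<my-resource-group>_<my-aks-cluster>_<region> --name <aks-agentpool-vm-name>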

@nphmuller

nphmuller commented Jun 8, 2018

@aevitas Thanks!
I used @necevil's workaround for now, which seems to work, although when scaling up, the new node was stuck in a NotReady state because of a different issue.
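
(To check node status after scaling, the usual commands apply; the node name is a placeholder:)

$ kubectl get nodes
$ kubectl describe node <new-node-name>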

@necevil

necevil commented Jun 8, 2018

@aevitas @nphmuller could you guys check your system resources on the nodes/VMs for me?
Specifically the CPU, Network IO and disk stuff discussed over here: https://stackoverflow.com/questions/50726534/unable-to-connect-net-http-tls-handshake-timeout-why-cant-kubectl-connect

@aevitas sounds like a restart of the node VM worked for you; when/if I run into it again, I will try that on my side.

@aevitas

aevitas commented Jun 9, 2018

@necevil Do you want me to look for something specifically in the resource statistics of the nodes, or do you just want them for the sake of gathering more data?

@necevil

necevil commented Jun 9, 2018

@aevitas trying to figure out if the CPU and Network IO specifically drop when the cluster is impacted and then go back up to their 'normal' state after.

This would in theory allow the creation of an alarm that could keep track of the problem. It's not strictly a requirement, but I figure if we can isolate some of the behaviors beyond kubectl connectivity, maybe we can make more informed decisions about which workaround is likely to work.
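
(For anyone who wants to pull the same numbers, per-VM metrics are available through the CLI; the subscription, resource group, and VM name in the resource ID below are placeholders:)

$ az monitor metrics list --resource /subscriptions/<sub-id>/resourceGroups/MC_<rg>_<cluster>_<region>/providers/Microsoft.Compute/virtualMachines/<node-vm-name> --metric "Percentage CPU" --interval PT5M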

@aevitas

aevitas commented Jun 9, 2018 via email

@benbuckland

Hi all, I just experienced this same issue with a cluster in East US. I scaled the cluster up from 2 to 3 nodes and it has come back to life.

Massively annoying...

@jnoller
Contributor

jnoller commented Apr 3, 2019

First, thank you for reaching out to the AKS team.

For this issue (and the TLS handshake timeout error specifically), what we have seen is that it can usually be traced back to API server downtime (e.g. an upgrade, an overloaded etcd, or another issue) or to custom network rules/NSGs that block the ports needed for API/worker communication. Unfortunately, the error is highly contextual to the cluster, region, etc.
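
(As a rough first check from a client machine, not an official diagnostic, and with placeholder names: you can look up the API server FQDN and test whether TCP 443 is reachable at all:)

$ az aks show --resource-group <my-resource-group> --name <my-aks-cluster> --query fqdn --output tsv
$ nc -vz <returned-fqdn> 443

If the TCP connection succeeds but kubectl still times out during the TLS handshake, the problem is more likely on the API server side than in NSG or firewall rules.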

AKS GitHub issues must generally be re-creatable (e.g. "here are the YAML, settings, packages, etc. installed; the issue can be reproduced on existing clusters as well as newly created clusters"). Unfortunately, this issue is cluster-specific.

For cluster-specific issues (those that will require support and on-call staff to access your subscription and clusters), we ask that you file an Azure technical support ticket with the identified issue, re-creation steps if available, and a link to this GitHub thread.

We cannot ask for access, logs, and other sensitive information on GitHub, or ask you to email/Slack/etc. this information to us. Our portal ticketing system will protect that information, including subscription IDs and credentials. If we discover as part of the root cause that this is an AKS or upstream bug, we will mark this as a known issue and ensure it is on the product backlog as a repair item.

This will allow us to root-cause the issue and ensure it's not specific to the environment or another configuration-level issue, such as NSGs, custom network devices, etc.

jnoller closed this as completed on Apr 3, 2019
@wanghui122725501

This problem is due to your local proxy settings; disable the proxy to resolve it. For example:
[root@centos7-node8 ~]# vim /etc/profile
#export http_proxy="http://192.168.199.248:1080"
#export https_proxy="http://192.168.199.248:1080"
#export ftp_proxy=$http_proxy
[root@centos7-node8 ~]# source /etc/profile
(That is, comment out the http_proxy/https_proxy/ftp_proxy exports in /etc/profile and re-source it so kubectl no longer goes through the proxy.)

@ansha1

ansha1 commented Aug 23, 2019

@wanghui122725501 is there any way to fix this while keeping the proxy? The server I am using to communicate with the k8s cluster is behind a company proxy, which can't be removed.
Appreciate your help!

@hguerrier

Are we saying that kubectl cannot be used behind a corporate proxy? Seems rather limiting, don't you think?
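
(For what it's worth, kubectl can generally work behind a corporate proxy as long as the API server is exempted from proxying; a sketch, with a placeholder FQDN, using the standard no_proxy environment variable that Go-based clients such as kubectl honor:)

$ export no_proxy="<my-aks-cluster-fqdn>,localhost,127.0.0.1"
$ export NO_PROXY="$no_proxy"
$ kubectl get pods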

@Javacppc

Javacppc commented Dec 3, 2019

This problem is due to your local proxy settings; disable the proxy to resolve it. For example:
[root@centos7-node8 ~]# vim /etc/profile
#export http_proxy="http://192.168.199.248:1080"
#export https_proxy="http://192.168.199.248:1080"
#export ftp_proxy=$http_proxy
[root@centos7-node8 ~]# source /etc/profile
Yes, thank you! You solved my problem!

Azure locked this issue as resolved and limited the conversation to collaborators on Aug 8, 2020.