TLS Timeout #581

Closed
talhairfanbentley opened this issue Aug 6, 2018 · 37 comments

@talhairfanbentley

talhairfanbentley commented Aug 6, 2018

I'm randomly getting TLS timeouts that fix themselves after a while, but in the meantime Kubernetes gets unresponsive, as in no response is returned from the CLI. My cluster is in West US.

@theobolo

theobolo commented Aug 6, 2018

Yep, same here. Sometimes the master API is really slow or basically just not available for 5 or 10 minutes, sometimes longer ... are you having some trouble with master scaling on your side, @Azure?

I've been experiencing this for 2 weeks, by the way. It's very important that your master servers are 100% available, since a lot of services use the k8s API for different things ... for the moment I can say that AKS is not really performing well on that specific point.

@talhairfanbentley
Author

No, I have no issue scaling the cluster up or down. But I don't know how to check the status of the master node, can you guide me?

@theobolo

theobolo commented Aug 6, 2018

@talhairfanbentley No, I was asking the Azure team whether they are having some trouble with their k8s masters on AKS, because apparently there is some :)

The purpose of AKS is basically to host the masters and manage them for us; that's the main difference between ACS and AKS, since it's an integrated Azure service.
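Since the masters are managed, a quick way to check from the client side is to see whether the API server itself responds. A minimal sketch, assuming kubectl is already pointed at the AKS cluster:

    kubectl cluster-info            # prints the managed API server endpoint
    kubectl get --raw /healthz      # asks the API server for its own health status
    kubectl get componentstatuses   # scheduler / controller-manager / etcd, where exposed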

@talhairfanbentley
Author

Well, hope they fix it soon.

@theobolo

theobolo commented Aug 6, 2018

Same ... that's a big issue for AKS.

@dtzar

dtzar commented Aug 6, 2018

This is a known issue and has been happening for a while. It is currently happening for me and I just filed a support ticket. Cluster scale up/down worked for me with no issues, but I'm still having the same problem.

Linking together existing issues:
#493
#416
#324
#268
#14
#124

@weinong
Contributor

weinong commented Aug 7, 2018

Can you guys send subscriptionID, resource group, resource, and region to aks-help@service.microsoft.com for us to take a look? thanks!
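For anyone looking up those values, a rough sketch with the Azure CLI (the resource group and cluster names below are placeholders, not from this thread):

    RG=my-rg                                   # placeholder resource group
    CLUSTER=my-aks                             # placeholder cluster name
    az account show --query id -o tsv          # current subscription ID
    az aks show -g "$RG" -n "$CLUSTER" --query "{name:name, location:location, resourceGroup:resourceGroup}" -o table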

@novitoll

+1 Same diagnosis. But I am not surprised, you know

@talhairfanbentley
Author

@weinong isn't the subscriptionID supposed to be kept secret? How can I share that with you?
But I can share that my cluster and all the resources are in the East US region of Azure.

@jskulavik

Hi @talhairfanbentley,

If you send that information directly to aks-help@service.microsoft.com via a secure email client, your information will be kept securely within Microsoft.

@talhairfanbentley
Author

@jskulavik thanks, I'll talk to my manager first.

@novitoll

I've noticed normal behavior over the last 2 days on one AKS cluster; others with the same k8s version still have a laggy k8s API (resource deletion/creation takes longer). Hopefully someone from the AKS dev team will provide some feedback/status here.

@talhairfanbentley
Author

It's been 2 weeks now. I still face timeout issues sometimes.

@novitoll

A pod that used to be deleted and recreated within 1-5 seconds on self-managed Kubernetes (via acs-engine) now takes ~14 minutes on AKS GA.

@jskulavik Could you please provide any feedback on how stable the AKS schedulers are? We are about to roll back from AKS GA to the ACS solution.

@jskulavik

Hi @novitoll,

As AKS is a managed service, there are significantly more resources provisioned on your behalf during cluster creation than with a pure ACS-Engine cluster creation. The benefit of the AKS provisioning workflow is that you receive the managed service, where Azure manages the Kubernetes infrastructure on your behalf. This is most likely the reason you are seeing the difference in provisioning time.

@novitoll

@jskulavik I understand; I used an ACS-Engine cluster before AKS went GA. But this is more of a "complaint" about what we've been suffering since migrating to AKS. So I'm offering my help (I could not find the source code of AKS, but apparently it's just a performance issue) and requesting a fix for the latency issue in AKS.

If we have an N-replica Deployment, then it seems that each replica takes ~6 minutes to be recreated during a RollingUpdate. So imagine a 3-replica deployment waiting for 18 minutes :) Insane.
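As a rough client-side measurement, a sketch along these lines can time a rolling update end to end (the deployment and image names are made up for illustration):

    # Trigger a rolling update, then time how long the rollout takes to complete
    kubectl set image deployment/myapp myapp=myregistry.azurecr.io/myapp:v2
    time kubectl rollout status deployment/myapp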

With ACS, when we managed our masters ourselves, there was zero latency, but we faced other, bigger issues.

Please let me know if I can help (fix some Go code, etc.), but we need to sort this issue out.

@jskulavik

Thank you @novitoll. We are constantly working to improve AKS, not only from a performance perspective, but across the board. This type of feedback is a great way to help. The more feedback like this that we receive, the better. So thank you again and please continue this feedback. I assure you, we're working hard to address your concerns.

@Kemyke

Kemyke commented Aug 22, 2018

We are experiencing the same issue today in the West Europe region.
Is waiting the only thing we can do?

@jskulavik

Hi @Kemyke,

You can submit a support request via portal.azure.com linking to this issue. Thank you.

@talhairfanbentley
Author

Hoping for these major issues to be solved soon.

@subesokun

Linking in #605 as it sounds related

@subesokun

I've been facing issues with Helm deployments for quite some time now, and to me they sound related to this issue. I re-opened my support request and linked in this issue 👍

* helm_release.myapp: error creating tunnel: "forwarding ports: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out"
* helm_release.myapp: rpc error: code = Unknown desc = release myapp failed: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.extensions)

@talhairfanbentley
Author

@subesokun yes, it seems to be the same.
Update:
I haven't faced it in the last 5 days.

@subesokun

@talhairfanbentley Oh ok, I'm facing these issues in the EASTUS and WESTEUROPE regions, mostly during nightly deployments. For each nightly build a new AKS cluster gets created via IaC and some apps are deployed into it. Sometimes AKS provisioning fails (well, that's another issue), but most of the time the deployment of the apps fails with the errors mentioned above.

@talhairfanbentley
Author

Are you trying to deploy them through the command line or with YAML files?

@subesokun

YAML files / Helm charts via Helm: helm install --timeout 1800 --wait ...

@talhairfanbentley
Author

Try increasing the timeout.
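In case it helps, a Helm 2-style sketch with a larger timeout (the chart and release names are placeholders; --timeout is in seconds, so 3600 = 60 minutes):

    # --wait blocks until all resources are ready or the timeout expires
    helm install stable/nginx-ingress --name myapp --timeout 3600 --wait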

@subesokun

subesokun commented Aug 27, 2018

This timeout is Helm-specific (1800 s = 30 min for the complete deployment), and usually my deployments take about 5 minutes since I'm creating an LB service. The timeout I'm actually facing is at a lower level (the connection level), so my deployments sometimes fail after only 2 minutes when a connection times out.

@talhairfanbentley
Author

Hmm, strange. Well, we can wait for Microsoft to look into this.

@BernhardRode

Maybe you should follow this topic:
#620

@agolomoodysaada
Contributor

Every time I scale in or scale out an AKS cluster, all kubectl commands start reporting TLS timeouts. Also, if an Azure VM fails, the same thing occurs. This happens with Kubernetes versions 1.10.* and 1.11.*.
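A minimal sketch of that sequence with placeholder names, showing the scale operation followed by the client-side call that starts timing out:

    az aks scale --resource-group my-rg --name my-aks --node-count 5   # scale out (or in)
    kubectl get nodes   # starts reporting TLS timeouts right after the scale operation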

@talhairfanbentley
Author

@agolomoodysaada are you sure it only occurs on these specific versions?

@agolomoodysaada
Contributor

agolomoodysaada commented Nov 2, 2018

Yes, I'm sure. That's why I wrote that comment. I should mention this is on EastUS and EastUS2. So perhaps WestUS is on a different AKS release?

@talhairfanbentley
Author

Well, I had this issue in EastUS and EastUS2. I changed my region to CentralUS and everything is working fine for now.

@lud97x

lud97x commented Nov 27, 2018

Hello,
We have the issue in North Europe and West Europe.

@talhairfanbentley
Author

Hey guys, I might have a fix for you all. I created my cluster in Central US. Also, from what I gathered in my findings, this timeout thing has something to do with the API server on the master node, so try to reduce the frequency of calls to your API server.
In summary:

  1. Try to create your cluster in Central US (see the sketch below).
  2. Try to reduce the frequency of your calls to the API server if you are hitting it frequently.
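A sketch of point 1 with the Azure CLI, using placeholder resource group and cluster names:

    az group create --name my-rg --location centralus
    az aks create --resource-group my-rg --name my-aks --location centralus --node-count 3 --generate-ssh-keys
    az aks get-credentials --resource-group my-rg --name my-aks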

@jnoller
Contributor

jnoller commented Apr 4, 2019

Closing this issue as old/stale.

The error reported (TLS timeout) can be caused by many different things: API server/etcd overload, custom NSG rules blocking needed ports, and more (custom firewalls, URL whitelisting, etc.).

If you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc.), please open an Azure technical support ticket.
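For the NSG angle specifically, one way to review the rules in the node resource group that AKS generates (the resource group and cluster names are placeholders):

    # Find the auto-generated node resource group, then list its NSGs and their rules
    NODE_RG=$(az aks show -g my-rg -n my-aks --query nodeResourceGroup -o tsv)
    az network nsg list -g "$NODE_RG" --query "[].name" -o tsv
    az network nsg rule list -g "$NODE_RG" --nsg-name <nsg-name-from-previous-step> -o table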

@jnoller jnoller closed this as completed Apr 4, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Aug 2, 2020