TLS Timeout #581

Closed
talhairfanbentley opened this issue Aug 6, 2018 · 37 comments

@talhairfanbentley

talhairfanbentley commented Aug 6, 2018

I'm randomly getting TLS timeouts that fix themselves after a while, but in the meantime Kubernetes gets unresponsive, as in no response is returned from the CLI. My cluster is in West US.

@theobolo

theobolo commented Aug 6, 2018

Yep, same here. Sometimes the master API is really slow or basically just not available for 5 or 10 minutes, sometimes longer ... are you having some trouble with master scaling on your side, @Azure?

I've been experiencing this for 2 weeks, by the way. It's very important that your master servers are 100% available, since a lot of services use the k8s API for different things ... for the moment I can say that AKS is not really performing well on that specific point.

@talhairfanbentley
Author

No, I have no issue scaling the cluster up or down. But I don't know how to check the status of the master node, can you guide me?

@theobolo

theobolo commented Aug 6, 2018

@talhairfanbentley No, I was asking the Azure team whether they are having some trouble with their k8s masters on AKS, because apparently there is some :)

The purpose of AKS is basically to host the masters and manage them for us; that's the main difference between ACS and AKS, since it's an integrated Azure service.
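Since the masters are managed, a quick way to check from the client side is to see whether the API server itself responds. A minimal sketch, assuming kubectl is already pointed at the AKS cluster:

    kubectl cluster-info            # prints the managed API server endpoint
    kubectl get --raw /healthz      # asks the API server for its own health status
    kubectl get componentstatuses   # scheduler / controller-manager / etcd, where exposed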

@talhairfanbentley
Author

Well, hope they fix it soon.

@theobolo

theobolo commented Aug 6, 2018

Same ... that's a big issue for AKS.

@dtzar

dtzar commented Aug 6, 2018

This is a known issue and has been happening for a while. It is currently happening for me and I just filed a support ticket. Cluster scale up/down worked for me with no issues, but I'm still having the same problem.

Linking together existing issues:
#493
#416
#324
#268
#14
#124

@weinong
Contributor

weinong commented Aug 7, 2018

Can you guys send subscriptionID, resource group, resource, and region to aks-help@service.microsoft.com for us to take a look? thanks!
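For anyone looking up those values, a rough sketch with the Azure CLI (the resource group and cluster names below are placeholders, not from this thread):

    RG=my-rg                                   # placeholder resource group
    CLUSTER=my-aks                             # placeholder cluster name
    az account show --query id -o tsv          # current subscription ID
    az aks show -g "$RG" -n "$CLUSTER" --query "{name:name, location:location, resourceGroup:resourceGroup}" -o table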

@novitoll

+1 Same diagnosis. But I am not surprised, you know

@talhairfanbentley
Author

@weinong isn't the subscriptionID supposed to be kept secret? How can I share that with you?
But I can share that my cluster and all the resources are in the East US region of Azure.

@jskulavik

Hi @talhairfanbentley,

If you send that information directly to aks-help@service.microsoft.com via a secure email client, your information will be kept securely within Microsoft.

@talhairfanbentley
Author

@jskulavik thanks, I'll talk to my manager first.

@novitoll

I've noticed normal behavior over the last 2 days on one AKS cluster; others with the same k8s version still have a laggy k8s API (resource deletion/creation takes longer). Hopefully someone from the AKS dev team will provide some feedback/status here.

@talhairfanbentley
Author

It's been 2 weeks now. I still face timeout issues sometimes.

@novitoll

A pod that used to be deleted and recreated within 1-5 seconds on self-managed Kubernetes (via acs-engine) now takes ~14 minutes on AKS GA.

@jskulavik Could you please provide any feedback on how stable the AKS schedulers are? We are about to roll back from AKS GA to the ACS solution.

@jskulavik

Hi @novitoll,

As AKS is a managed service, there are significantly more resources provisioned on your behalf during cluster creation than with a pure ACS-Engine cluster creation. The benefit of the AKS provisioning workflow is that you receive the managed service, where Azure manages the Kubernetes infrastructure on your behalf. This is most likely the reason you are seeing the difference in provisioning time.

@novitoll

@jskulavik I understand; I used an ACS-Engine cluster before AKS went GA. But this is more of a "complaint" about what we've been suffering since migrating to AKS. So I'm offering my help (I could not find the source code of AKS, but apparently it's just a performance issue) and requesting a fix for the latency issue in AKS.

If we have an N-replica Deployment, then it seems that each replica takes ~6 minutes to be recreated during a RollingUpdate. So imagine a 3-replica deployment waiting for 18 minutes :) Insane.
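As a rough client-side measurement, a sketch along these lines can time a rolling update end to end (the deployment and image names are made up for illustration):

    # Trigger a rolling update, then time how long the rollout takes to complete
    kubectl set image deployment/myapp myapp=myregistry.azurecr.io/myapp:v2
    time kubectl rollout status deployment/myapp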

With ACS, when we managed our masters ourselves, there was zero latency, but we faced other, bigger issues.

Please let me know if I can help (fix some Go code, etc.), but we need to sort this issue out.

@jskulavik

Thank you @novitoll. We are constantly working to improve AKS, not only from a performance perspective, but across the board. This type of feedback is a great way to help. The more feedback like this that we receive, the better. So thank you again and please continue this feedback. I assure you, we're working hard to address your concerns.

@Kemyke

Kemyke commented Aug 22, 2018

We are experiencing the same issue today in the West Europe region.
Is waiting the only thing we can do?

@jskulavik

Hi @Kemyke,

You can submit a support request via portal.azure.com linking to this issue. Thank you.

@talhairfanbentley
Author

Hoping for these major issues to be solved soon.

@subesokun

Linking in #605 as it sounds related

@subesokun

I've been facing issues with Helm deployments for quite some time now, and to me they sound related to this issue. I re-opened my support request and linked in this issue 👍

* helm_release.myapp: error creating tunnel: "forwarding ports: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out"
* helm_release.myapp: rpc error: code = Unknown desc = release myapp failed: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.extensions)

@talhairfanbentley
Author

@subesokun yes, it seems to be the same.
Update:
I haven't faced it in the last 5 days.

@subesokun

@talhairfanbentley Oh ok, I'm facing these issues in the EASTUS and WESTEUROPE regions, mostly during nightly deployments. For each nightly build a new AKS cluster gets created via IaC and some apps are deployed into it. Sometimes AKS provisioning fails (well, that's another issue), but most of the time the deployment of the apps fails with the errors mentioned above.

@talhairfanbentley
Author

Are you trying to deploy them through the command line or with YAML files?

@subesokun

YAML files / Helm charts via Helm: helm install --timeout 1800 --wait ...

@talhairfanbentley
Author

Try increasing the timeout.
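In case it helps, a Helm 2-style sketch with a larger timeout (the chart and release names are placeholders; --timeout is in seconds, so 3600 = 60 minutes):

    # --wait blocks until all resources are ready or the timeout expires
    helm install stable/nginx-ingress --name myapp --timeout 3600 --wait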

@subesokun

subesokun commented Aug 27, 2018

This timeout is Helm-specific (1800 s = 30 min for the complete deployment), and usually my deployments take about 5 minutes since I'm creating an LB service. The timeout I'm actually facing is at a lower level (the connection level), so my deployments sometimes fail after only 2 minutes when a connection times out.

@talhairfanbentley
Author

Hmm, strange. Well, we can wait for Microsoft to look into this.

@BernhardRode

Maybe you should follow this topic:
#620

@agolomoodysaada
Contributor

Every time I scale in or scale out an AKS cluster, all kubectl commands start reporting TLS timeouts. Also, if an Azure VM fails, the same thing occurs. This happens with Kubernetes versions 1.10.* and 1.11.*.
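A minimal sketch of that sequence with placeholder names, showing the scale operation followed by the client-side call that starts timing out:

    az aks scale --resource-group my-rg --name my-aks --node-count 5   # scale out (or in)
    kubectl get nodes   # starts reporting TLS timeouts right after the scale operation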

@talhairfanbentley
Author

@agolomoodysaada are you sure it only occurs on these specific versions?

@agolomoodysaada
Contributor

agolomoodysaada commented Nov 2, 2018

Yes, I'm sure. That's why I wrote that comment. I should mention this is on EastUS and EastUS2. So perhaps WestUS is on a different AKS release?

@talhairfanbentley
Author

Well, I had this issue in EastUS and EastUS2. I changed my region to CentralUS and everything is working fine for now.

@lud97x

lud97x commented Nov 27, 2018

Hello,
We have the issue in North Europe and West Europe.

@talhairfanbentley
Author

Hey guys, I might have a fix for you all. I created my cluster in Central US. Also, from what I gathered in my findings, this timeout thing has something to do with the API server on the master node, so try to reduce the frequency of calls to your API server.
In summary:

  1. Try to create your cluster in Central US (see the sketch below).
  2. Try to reduce the frequency of your calls to the API server if you are hitting it frequently.
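A sketch of point 1 with the Azure CLI, using placeholder resource group and cluster names:

    az group create --name my-rg --location centralus
    az aks create --resource-group my-rg --name my-aks --location centralus --node-count 3 --generate-ssh-keys
    az aks get-credentials --resource-group my-rg --name my-aks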

@jnoller
Contributor

jnoller commented Apr 4, 2019

Closing this issue as old/stale.

The error reported (TLS timeout) can be caused by many different things: API server/etcd overload, custom NSG rules blocking needed ports, and more (custom firewalls, URL whitelisting, etc.).

If you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc.), please open an Azure technical support ticket.
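For the NSG angle specifically, one way to review the rules in the node resource group that AKS generates (the resource group and cluster names are placeholders):

    # Find the auto-generated node resource group, then list its NSGs and their rules
    NODE_RG=$(az aks show -g my-rg -n my-aks --query nodeResourceGroup -o tsv)
    az network nsg list -g "$NODE_RG" --query "[].name" -o tsv
    az network nsg rule list -g "$NODE_RG" --nsg-name <nsg-name-from-previous-step> -o table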

@jnoller jnoller closed this as completed Apr 4, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Aug 2, 2020