
Performance degradation for high levels of in-cluster kube-apiserver traffic #620

Closed
slack opened this issue Aug 27, 2018 · 16 comments

@slack
Contributor

slack commented Aug 27, 2018

The AKS team is aware of performance issues when in-cluster components (like Tiller or Istio) generate a large amount of traffic to the Kubernetes API. Symptoms include slow in-cluster API responses, a slow Kubernetes dashboard, long-running API watches timing out, and an inability to establish outbound connections to the AKS cluster's external API endpoint.

In the short term, we have made a few changes to the AKS infrastructure that are expected to help, but not eliminate, these timeouts. The updated configuration will begin global rollout in the coming weeks. Customers who create new clusters or upgrade existing clusters will automatically receive the updated deployment.

In parallel, engineering is working on a long-term fix. As we make progress, we will update this GitHub issue.

@stdistef

Were any changes made to azureproxy? How can customers check if they were upgraded?

@strtdusty

strtdusty commented Aug 29, 2018

The azureproxy deployment has disappeared from two of my clusters and all kube-svc-redirect pods are in CrashLoop... is this the effect of the upgrade?

@derekperkins
Contributor

see #626

@m1o1

m1o1 commented Aug 30, 2018

Is this related to / responsible for #455, #522, and #577 ?

@blackbaud-brandonstirnaman

Any updates on this issue?

@strtdusty

Can you please add some color around the issue you are still chasing? I opened #637 today and am wondering if it is related.

@slack
Contributor Author

slack commented Sep 13, 2018

Quick update, thanks for your patience. The configuration changes for azureproxy were released and completed global deployment on 8/31. All AKS clusters that were upgraded or created after that date have the new configuration applied automatically.

As part of the rollout, there were some circumstances where incorrect limits/requests were set on kube-svc-redirect, which prevented the updated Pod from starting (or caused it to crash shortly after starting). A hotfix for impacted clusters has since been deployed.

What changed

  1. We moved the azureproxy component from a standalone Kubernetes Deployment into a sidecar container in the kube-svc-redirect DaemonSet.
  2. We updated per-AKS-cluster connection timeouts to 10 minutes.

While we don't yet publish a release version in a customer-visible spot, you can check the kube-svc-redirect DaemonSet in your kube-system namespace; its pods should now each run two containers:

$ kubectl -n kube-system get ds,po -l component=kube-svc-redirect
NAME                                     DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
daemonset.extensions/kube-svc-redirect   3         3         3         3            3           beta.kubernetes.io/os=linux   23d

NAME                          READY     STATUS    RESTARTS   AGE
pod/kube-svc-redirect-2lgl6   2/2       Running   0          15d
pod/kube-svc-redirect-6czxw   2/2       Running   0          15d
pod/kube-svc-redirect-8n69s   2/2       Running   0          15d
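
If you want to double-check that the azureproxy sidecar made it into the pod spec, the DaemonSet template lists the container names (the exact names are an assumption here and may vary between releases):

$ kubectl -n kube-system get ds kube-svc-redirect \
    -o jsonpath='{.spec.template.spec.containers[*].name}'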

What this does

DaemonSet change

The azureproxy component lives in-cluster and is network-local to all customer nodes; this nginx proxy is responsible for hauling traffic destined for kubernetes.local back to the AKS control plane. Under a high enough connection rate, the VM running azureproxy could exhaust its allocated outbound ports, causing slowness and/or connection-refused errors.

Now traffic destined for the K8s control plane will remain local to the host originating the traffic, spreading the "query load" across a larger number of Azure VMs. Note that this moves the goalposts rather than completely fixing the problem. This workaround performs better with more cluster members.
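
A quick way to confirm that the proxy now runs on every node (and so outbound API traffic is spread across all of them) is to compare the node count with the DaemonSet's desired count; this is just a sanity check, not an official verification step:

$ kubectl get nodes --no-headers | wc -l
$ kubectl -n kube-system get ds kube-svc-redirect -o jsonpath='{.status.desiredNumberScheduled}'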

Connection timeouts

Previous idle connection timeouts were too conservative and impacted watches. The symptoms would appear as aborted or closed connections before a watch timeout. Under most circumstances, the controller/informer loop would re-initialize the watch and carry on. Timeouts are now set to a minimum of 10 minutes.
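
One rough way to observe the new behavior is to hold a raw watch open against the API server and see how long the connection survives. timeoutSeconds is a standard watch query parameter, and 600 seconds matches the new 10-minute floor; this is a sketch, not an official test:

$ time kubectl get --raw \
    "/api/v1/namespaces/kube-system/pods?watch=true&timeoutSeconds=600" > /dev/null

If the command returns well before the 10-minute mark, something between the client and the API server is still closing idle connections early.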

Going forward

We've seen a large decrease in connection related issues across the fleet, but aren't yet out of the woods. Engineering continues to work on long-term networking updates to fully address high-volume in-cluster K8s workloads.

@strtdusty

@slack thanks a lot for the update. Understanding what is going on with these changes is really important for us.

@novitoll

novitoll commented Sep 14, 2018

Nice. Finally :) Guys, could you please confirm that restarting the AKS nodes is the way to get this patch?
CC: @slack

@novitoll

We restarted our nodes to get the patch (and verified that azureproxy is now running inside kube-svc-redirect) and noticed that the time for Helm deployments (with pod recreation) decreased, but not consistently. Sometimes it still takes 6 minutes.


13:12:36 + kubectl --namespace xxx rollout status deployment/xxx
13:12:38 Waiting for deployment spec update to be observed...
13:14:15 Waiting for rollout to finish: 1 old replicas are pending termination...
13:14:15 Waiting for rollout to finish: 1 old replicas are pending termination...
13:14:15 Waiting for rollout to finish: 1 old replicas are pending termination...
13:14:15 Waiting for rollout to finish: 0 of 1 updated replicas are available...
13:19:51 deployment "xxx" successfully rolled out
# ~ 7 mins

Could you please assist on this?
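
(For anyone debugging similar slow rollouts: describing the deployment and listing recent events usually shows what the unavailable replica is waiting on, e.g. image pulls or readiness probes. The xxx placeholders below stand for the redacted namespace/deployment names above.)

$ kubectl --namespace xxx describe deployment/xxx
$ kubectl --namespace xxx get events --sort-by=.lastTimestamp | tail -n 20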

@fhoy

fhoy commented Oct 24, 2018

Istio recently made some changes to reduce the number of watches set on the API server (istio/istio#7675 (comment)), and this has had positive effects on stability in at least one of our AKS clusters. Is this relevant to how you continue to tune AKS, or was the impact of the number of watches already known? In any case, would it be possible to tune AKS to handle at least a somewhat higher number of watches?

@digeler

digeler commented Oct 31, 2018

Where is the source code for kube-svc-redirect and azureproxy?
Can you share the list of PRs?

@charlspjohn

I have an AKS cluster in the westeurope region with 58 nodes. My Prometheus server's scrapes are timing out regularly, mostly for the cadvisor and node-metrics targets. (We have seen similar timeouts in various API-related operations as well.)

These problems arose somewhere during the growth of this cluster from 14 nodes to 44 nodes. Since Prometheus queries the API server to scrape the node and cadvisor metrics, the traffic caused by Prometheus increased by almost ~5 times. Could this be a reason for the API performance degradation?

Scrape config samples:
https://github.com/helm/charts/blob/2fc288f1b2ad095d5b6dd0b60f9a3a16f663b049/stable/prometheus/values.yaml#L1028
https://github.com/helm/charts/blob/2fc288f1b2ad095d5b6dd0b60f9a3a16f663b049/stable/prometheus/values.yaml#L1065
(The Prometheus scrape interval is 45s.)
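
For reference, both of those scrape jobs relabel their targets so that requests go through the API server proxy, so every scrape is an additional request against the control plane. A single cadvisor scrape can be reproduced by hand roughly like this (the node is just whichever one is listed first; this is a sketch, not part of the chart):

$ NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
$ kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" | head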

@stdistef

I wonder if you are hitting SNAT port exhaustion? If you are using Advanced Networking for your cluster and a UDR with next hop to an NVA firewall, you could be maxing out SNAT ports. The Azure Load Balancer does have a metric for SNAT port usage, assuming your NVAs are behind one.
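
A sketch of how to pull that metric with the Azure CLI; the metric name (SnatConnectionCount here) and the resource group / load balancer names are assumptions and may differ by load balancer SKU and CLI version:

$ LB_ID=$(az network lb show -g <node-resource-group> -n <lb-name> --query id -o tsv)
$ az monitor metrics list --resource "$LB_ID" --metric SnatConnectionCount --interval PT1M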

@Azure Azure deleted a comment from prune998 Apr 20, 2019
@Azure Azure deleted a comment from thehappycoder Jul 12, 2019
@stale

stale bot commented Jul 20, 2020

This issue has been automatically marked as stale because it has not had activity in 90 days. It will be closed if no further activity occurs. Thank you!

@stale stale bot added the stale Stale issue label Jul 20, 2020
@ghost ghost closed this as completed Aug 4, 2020
@ghost

ghost commented Aug 4, 2020

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. slack, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 4, 2020