kube-manager is being rate limited by Azure #420
Comments
@feiskyer Is this rate limiting something that can be addressed w/ backoff/QPS tuning? |
Yep, it is. |
@feiskyer are you referring to https://github.com/Azure/aks-engine/blob/e250a6c5065cc941bcc9cb9feb6461a1449b2a47/docs/howto/kubernetes-large-clusters.md#backoff-configuration-options ? We currently have the following set as per the defaults:
What do you suggest setting them to? And which kubernetes service flags do these relate to? I can't see any on https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/ |
Those options are configured in the cloud config file, e.g. /etc/kubernetes/azure.json. The meaning of each option is documented here. The right values really depend on the cluster. From my experience, setting cloudProviderRateLimitQPS to 6 works better for most cases. |
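For reference, this is roughly what the rate-limit and backoff section of /etc/kubernetes/azure.json looks like with the suggested QPS of 6. A minimal sketch using the backoff/rate-limit field names from the aks-engine large-clusters doc linked above; values other than the QPS are illustrative, and the credential and cluster fields around them are omitted:

{
    "cloudProviderBackoffDuration": 5,
    "cloudProviderBackoffExponent": 1.5,
    "cloudProviderBackoffJitter": 1,
    "cloudProviderBackoffRetries": 6,
    "cloudProviderRateLimitBucket": 10,
    "cloudProviderRateLimitQPS": 6
}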
Should that be made the default? This isn't a particularly large cluster (3 masters, 6 agents) and we see the throttling across a number of Azure APIs. |
That's reasonable; the throttling shows up across APIs because the rate limit is applied to all Azure APIs collectively. @jackfrancis could we set the default value of cloudProviderRateLimitQPS higher? |
Setting the QPS higher would increase the rate of traffic to Azure APIs. Don't we want to do the opposite? |
Right, but the current QPS is too small (still far from Azure's rate limit), and it may make kubernetes slow to finish its work if there are a lot of requests. |
What is Azure's rate limit? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I found the same errors. |
Does this message mean that we are being throttled by the internal kubernetes throttling mechanism or by Azure? |
I have a "busy" dev cluster with 3 masters, 19 agents (4 VMSS), and 35 LoadBalancer Services. I tried setting:

{
    "cloudProviderBackoffDuration": 5,
    "cloudProviderBackoffExponent": 1.5,
    "cloudProviderBackoffJitter": 1,
    "cloudProviderBackoffRetries": 6,
    "cloudProviderRateLimitBucket": 10,
    "cloudProviderRateLimitQPS": 50,
    ...
}

In the end I disabled rate limiting altogether, and since then there have been no more errors. |
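For anyone wanting to reproduce what @sylr describes, the cluster-wide way to do it is through the aks-engine api model rather than by editing /etc/kubernetes/azure.json on each node. A minimal sketch, assuming the cloudProviderRateLimit boolean documented in aks-engine's cluster definition options; everything else in the api model is elided:

{
    "properties": {
        "orchestratorProfile": {
            "kubernetesConfig": {
                "cloudProviderRateLimit": false
            }
        }
    }
}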
@sylr you mean you set |
Re-opening as we're observing similar issues, it seems that we can improve our defaults here. |
This is the message I'm receiving:
This happens when creating an ILB against a cluster that looks like this in terms of nodes:
Setting the limit to 100 removed the rate limiting errors; setting it to 11 did not. I'm retrying with 50.

The curious thing is that the rate limit does not seem to accommodate any "burst": if reconciling a request (e.g., a new LB) ever requires a number of simultaneous calls, the rate limiter never disengages and the reconciliation never finishes. If that's correct, we need a formula to calculate that simultaneous request count for the most expensive Azure API operation (is LB the most expensive?), and use the api model to assign a default value that will work. The number of VMSS instances is certainly a key multiplier in that formula.

A further challenge is that for VMSS scenarios the instance count is by design not fixed, so we also need to think about how to continually adjust these rate limiting values as the size of the cluster evolves. @juan-lee @CecileRobertMichon FYI on the last point |
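The burst behavior described above can be illustrated with a non-blocking token bucket. This is a minimal sketch using client-go's flowcontrol package; whether the Azure cloud provider wires its limiter exactly this way is an assumption here, and the 6/10 numbers simply mirror the azure.json fields discussed earlier:

package main

import (
	"fmt"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// Token bucket holding 10 tokens ("cloudProviderRateLimitBucket"),
	// refilled at 6 tokens/second ("cloudProviderRateLimitQPS").
	limiter := flowcontrol.NewTokenBucketRateLimiter(6, 10)

	// Simulate a reconciliation that needs 25 Azure API calls at once,
	// e.g. one VMSSGetInstanceView per instance of a large scale set.
	rejected := 0
	for i := 0; i < 25; i++ {
		// TryAccept does not wait for a token: once the bucket is empty
		// it fails immediately, which is the kind of failure that
		// surfaces as "cloud provider rate limited(read)" in the logs.
		if !limiter.TryAccept() {
			rejected++
		}
	}
	fmt.Printf("%d of 25 simultaneous calls rejected\n", rejected)
}

If every retry of the reconciliation needs more simultaneous calls than the bucket holds, the loop plausibly never completes, which would match the "never disengages" observation above.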
Yes
100 QPS × 60 seconds × 60 minutes = 360,000 queries per hour. Azure's read limit per subscription is 12,000 queries per hour (roughly 3.3 sustained QPS), so I believe setting 100 QPS is like deactivating rate limiting at the cloud provider level, because Azure will throttle calls far before the cloud provider rate limiting kicks in. I suggested this: kubernetes-sigs/cloud-provider-azure#202 |
Also I believe this issue should be renamed "kube-manager Azure API calls are being rate limited by cloud provider" |
Is this a request for help?:
No
Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE
What version of aks-engine?:
0.28.0
Kubernetes version:
1.13.1
What happened:
The kube-manager logs show frequent errors about being rate limited by Azure:
E0201 14:48:49.915225 1 attacher.go:138] azureDisk - Error checking if volumes (mycluster-dynamic-pvc-$uuid mycluster-dynamic-pvc-$uuid) are attached to current node ("k8s-agentpool1-1234567-vmss000002"). err=azure - cloud provider rate limited(read) for operation:VMSSGetInstanceView
E0201 14:48:49.915258 1 operation_generator.go:186] VolumesAreAttached failed for checking on node "k8s-agentpool1-1234567-vmss000002" with: azure - cloud provider rate limited(read) for operation:VMSSGetInstanceView
E0201 09:17:48.805158 1 service_controller.go:219] error processing service instrumentation/elasticsearch-logging (will retry): error getting LB for service instrumentation/elasticsearch-logging: azure - cloud provider rate limited(read) for operation:LBList
I0201 09:17:48.805172 1 event.go:221] Event(v1.ObjectReference{Kind:"Service", Namespace:"instrumentation", Name:"myservice", UID:"919ec973-1358-11e9-a57f-000d3a29b594", APIVersion:"v1", ResourceVersion:"3891", FieldPath:""}): type: 'Warning' reason: 'CleanupLoadBalancerFailed' Error cleaning up load balancer (will retry): error getting LB for service mynamespace/myservice: azure - cloud provider rate limited(read) for operation:LBList
E0201 09:17:48.806361 1 azure_backoff.go:166] LoadBalancerClient.List(mycluster) - backoff: failure, will retry,err=azure - cloud provider rate limited(read) for operation:LBList
What you expected to happen:
No errors when kube-manager makes requests to Azure APIs.
How to reproduce it (as minimally and precisely as possible):
3 masters with availability set, agents in a VMSS.
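A minimal api model sketch matching that topology (the availabilityProfile values are the standard aks-engine ones; the pool name and counts are illustrative, taken from the 3-master/6-agent cluster described above):

{
    "properties": {
        "masterProfile": {
            "count": 3,
            "availabilityProfile": "AvailabilitySet"
        },
        "agentPoolProfiles": [
            {
                "name": "agentpool1",
                "count": 6,
                "availabilityProfile": "VirtualMachineScaleSets"
            }
        ]
    }
}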
Anything else we need to know: