This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

kube-manager is being rate limited by Azure #420

Closed
danmassie opened this issue Feb 1, 2019 · 18 comments · Fixed by #1693

danmassie (Contributor) commented Feb 1, 2019

Is this a request for help?:
No

Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of aks-engine?:
0.28.0

Kubernetes version:
1.13.1

What happened:
The kube-controller-manager logs show frequent errors indicating that requests are being rate limited by Azure:

E0201 14:48:49.915225 1 attacher.go:138] azureDisk - Error checking if volumes (mycluster-dynamic-pvc-$uuid mycluster-dynamic-pvc-$uuid) are attached to current node ("k8s-agentpool1-1234567-vmss000002"). err=azure - cloud provider rate limited(read) for operation:VMSSGetInstanceView
E0201 14:48:49.915258 1 operation_generator.go:186] VolumesAreAttached failed for checking on node "k8s-agentpool1-1234567-vmss000002" with: azure - cloud provider rate limited(read) for operation:VMSSGetInstanceView

E0201 09:17:48.805158 1 service_controller.go:219] error processing service instrumentation/elasticsearch-logging (will retry): error getting LB for service instrumentation/elasticsearch-logging: azure - cloud provider rate limited(read) for operation:LBList
I0201 09:17:48.805172 1 event.go:221] Event(v1.ObjectReference{Kind:"Service", Namespace:"instrumentation", Name:"myservice", UID:"919ec973-1358-11e9-a57f-000d3a29b594", APIVersion:"v1", ResourceVersion:"3891", FieldPath:""}): type: 'Warning' reason: 'CleanupLoadBalancerFailed' Error cleaning up load balancer (will retry): error getting LB for service mynamespace/myservice: azure - cloud provider rate limited(read) for operation:LBList
E0201 09:17:48.806361 1 azure_backoff.go:166] LoadBalancerClient.List(mycluster) - backoff: failure, will retry, err=azure - cloud provider rate limited(read) for operation:LBList

What you expected to happen:
No errors when kube-controller-manager makes requests to Azure APIs.

How to reproduce it (as minimally and precisely as possible):
3 masters in an availability set, agents in a VMSS.

Anything else we need to know:

jackfrancis (Member) commented:

@feiskyer Is this rate limiting something that can be addressed w/ backoff/QPS tuning?

feiskyer (Member) commented Feb 2, 2019

Yep, it's cloudProviderRateLimitQPS. You should increase it based on the logs.

danmassie (Contributor, Author) commented Mar 4, 2019

@feiskyer are you referring to https://github.com/Azure/aks-engine/blob/e250a6c5065cc941bcc9cb9feb6461a1449b2a47/docs/howto/kubernetes-large-clusters.md#backoff-configuration-options ?

We currently have the following set as per the defaults:

"cloudProviderBackoff": true,
        "cloudProviderBackoffRetries": 6,
        "cloudProviderBackoffJitter": 1,
        "cloudProviderBackoffDuration": 5,
        "cloudProviderBackoffExponent": 1.5,
        "cloudProviderRateLimit": true,
        "cloudProviderRateLimitQPS": 3,
        "cloudProviderRateLimitBucket": 10,

What do you suggest setting them to? And which Kubernetes service flags do these relate to? I can't see any at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/

feiskyer (Member) commented Mar 5, 2019

Those options are configured in the cloud config file, e.g. /etc/kubernetes/azure.json. The meaning of each option is documented here.

The right values really depend on the cluster. From my experience, setting cloudProviderRateLimitQPS to 6 should work better for most cases.
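
For illustration, a minimal sketch of how that might look in /etc/kubernetes/azure.json, showing only the throttling-related fields with the QPS raised to 6 (credentials and the rest of the file omitted):

{
    "cloudProviderBackoff": true,
    "cloudProviderBackoffRetries": 6,
    "cloudProviderBackoffJitter": 1,
    "cloudProviderBackoffDuration": 5,
    "cloudProviderBackoffExponent": 1.5,
    "cloudProviderRateLimit": true,
    "cloudProviderRateLimitQPS": 6,
    "cloudProviderRateLimitBucket": 10,
    ...
}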

danmassie (Contributor, Author) commented:

Should that be made the default? This isn't a particularly large cluster (3 masters, 6 agents) and we see the throttling across a number of Azure APIs.

feiskyer (Member) commented Mar 5, 2019

> Should that be made the default? This isn't a particularly large cluster (3 masters, 6 agents) and we see the throttling across a number of Azure APIs.

That's reasonable; the throttling happens because the rate limit is applied across all Azure APIs. @jackfrancis could we set the default value of cloudProviderRateLimitQPS higher?

jackfrancis (Member) commented:

Setting the QPS higher would increase the rate of traffic to Azure APIs. Don't we want to do the opposite?

feiskyer (Member) commented Mar 6, 2019

> Setting the QPS higher would increase the rate of traffic to Azure APIs. Don't we want to do the opposite?

Right, but the current QPS is too small (still far from Azure's rate limit), and it may make Kubernetes slow to finish its work when there are a lot of requests.

danmassie (Contributor, Author) commented:

What is Azure's rate limit?

palma21 added this to To do in backlog via automation Mar 21, 2019
stale bot commented May 5, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label May 5, 2019
stale bot closed this as completed May 12, 2019
backlog automation moved this from To do to Done May 12, 2019
oronboni commented Jul 23, 2019

I found the same errors.
Warning LoadBalancerUpdateFailed 37m (x23 over 74m) service-controller Error updating load balancer with new hosts map[XXX:{} XXX:{} XXX:{} XXX:{} k8s-atpstg1eus2-15640228-vmss00000f:{} XXX:{} XXX:{} XXX:{} XXX:{} XXX:{} XXX:{} XXX:{} XXX:{} XXX:{}]: [EnsureHostInPool(ingress/nginx-ingress-controller): backendPoolID(/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Network/loadBalancers/XXX/backendAddressPools/XXX) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:VMSSGet", EnsureHostInPool(wdatp-infra-system-ingress/nginx-ingress-controller): backendPoolID(/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Network/loadBalancers/XXX/backendAddressPools/XXX) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:VMSSGetInstanceView"]
Warning ListLoadBalancers 4m7s (x50 over 178m) azure-cloud-provider azure - cloud provider rate limited(read) for operation:LBList
Normal EnsuringLoadBalancer 2m35s (x41 over 178m) service-controller Ensuring load balancer

sylr (Contributor) commented Jul 30, 2019

> azure - cloud provider rate limited(read)

Does this message mean that we are throttled by the internal Kubernetes throttling mechanism or by Azure?

sylr (Contributor) commented Jul 30, 2019

I have a "busy" dev cluster with 3 masters, 19 agents (4 VMSS), and 35 LoadBalancer Services.

Setting cloudProviderRateLimitQPS to 50 did not help; I still had cloud provider rate limited(read) errors.

{
    "cloudProviderBackoffDuration": 5,
    "cloudProviderBackoffExponent": 1.5,
    "cloudProviderBackoffJitter": 1,
    "cloudProviderBackoffRetries": 6,
    "cloudProviderRateLimitBucket": 10,
    "cloudProviderRateLimitQPS": 50,
    ...
}

In the end I disabled rate limiting altogether, and since then there have been no more errors.

jackfrancis (Member) commented:

@sylr you mean you set "cloudProviderRateLimit": false in the kubernetesConfig?

jackfrancis (Member) commented:

Re-opening as we're observing similar issues; it seems we can improve our defaults here.

jackfrancis reopened this Jul 30, 2019
stale bot removed the stale label Jul 30, 2019
jackfrancis (Member) commented:

This is the message I'm receiving:

I0730 22:05:47.417773       1 event.go:209] Event(v1.ObjectReference{Kind:"Service", Namespace:"default", Name:"ingress-nginx-ilb", UID:"ca23b592-b315-11e9-8d8c-000d3afe148b", APIVersion:"v1", ResourceVersion:"1539", FieldPath:""}): type: 'Warning' reason: 'LoadBalancerUpdateFailed' Error updating load balancer with new hosts map[1119k8s01000000:{} 1119k8s01000001:{} 1119k8s01000002:{} 1119k8s01000003:{} 1119k8s01000004:{} 1119k8s01000005:{} 1119k8s01000006:{} k8s-agentpool1-11196332-vmss000000:{} k8s-agentpool1-11196332-vmss000001:{}]: EnsureHostInPool(default/ingress-nginx-ilb): backendPoolID(/subscriptions/31614129-0f24-4a4c-9731-53ceecc3017d/resourceGroups/kubernetes-westus2-90057/providers/Microsoft.Network/loadBalancers/kubernetes-westus2-90057-internal/backendAddressPools/kubernetes-westus2-90057) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:VMSSGetInstanceView"

This happens when creating an ILB against a cluster whose nodes look like this:

    "masterProfile": {
      "count": 1,
      "dnsPrefix": "",
      "vmSize": "Standard_D2_v3"
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 2,
        "vmSize": "Standard_D2_v3",
        "availabilityProfile": "VirtualMachineScaleSets"
      },
      {
        "name": "agentwin",
        "count": 7,
        "vmSize": "Standard_D2_v3",
        "availabilityProfile": "VirtualMachineScaleSets",
        "osType": "Windows"
      }
    ],

Setting the limit to 100 removed the rate limits. Setting it to 11 did not. I'm retrying with 50.

The curious thing about this is that the rate limiter doesn't seem to accommodate any "burst": if reconciling a request (e.g., a new LB) ever requires some number of simultaneous requests, the rate limiter never disengages and the reconciliation never finishes.

If that's correct, then we need to figure out a formula to calculate that simultaneous request count for the most expensive Azure API request (is LB the most expensive?), and use the api model to assign a default value that will work. The number of VMSS instances is certainly a key multiplier in that formula.

Also a challenge here is that for VMSS scenarios the instance count is by design not meant to be fixed, so we need to think about how to continually adjust these rate-limiting values as the size of the cluster evolves.

@juan-lee @CecileRobertMichon FYI on the last point
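
For context on the burst point above: as I understand it, these two settings configure a token-bucket limiter, where cloudProviderRateLimitBucket is the maximum burst size and cloudProviderRateLimitQPS is the steady refill rate. If that's right, then with the defaults (QPS 3, bucket 10), reconciling the LB backend pool for the 9-agent cluster above needs roughly one VMSSGetInstanceView read per instance, i.e., about 9 calls, which nearly drains the bucket in a single pass; retries arriving faster than the 3/s refill would then keep the bucket empty, matching the "never disengages" behavior.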

sylr (Contributor) commented Jul 31, 2019

> @sylr you mean you set "cloudProviderRateLimit": false in the kubernetesConfig?

Yes
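
For anyone following along, a minimal sketch of where that flag lives in the aks-engine api model (assuming the usual kubernetesConfig placement; all other fields omitted):

{
    "properties": {
        "orchestratorProfile": {
            "kubernetesConfig": {
                "cloudProviderRateLimit": false
            }
        }
    }
}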

> Setting the limit to 100 removed the rate limits. Setting it to 11 did not. I'm retrying with 50.

100 QPS * 60 seconds * 60 minutes = 360,000 queries per hour

Azure's read limit per subscription is 12,000 queries per hour, so I believe setting 100 QPS amounts to deactivating rate limiting at the cloud provider level, because Azure will throttle calls long before the cloud provider's rate limiting kicks in.
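
As a rough cross-check, running that figure the other way: 12,000 reads per hour ÷ 3,600 seconds ≈ 3.3 QPS, so the default cloudProviderRateLimitQPS of 3 already sits at about the sustained read budget Azure allows, and any higher sustained rate just moves the throttling from the cloud provider to Azure itself.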

I suggested this: kubernetes-sigs/cloud-provider-azure#202

sylr (Contributor) commented Jul 31, 2019

Also, I believe this issue should be renamed "kube-manager Azure API calls are being rate limited by cloud provider".
