This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

kube-manager is being rate limited by Azure #420

Closed
danmassie opened this issue Feb 1, 2019 · 18 comments · Fixed by #1693

danmassie (Contributor) commented Feb 1, 2019

Is this a request for help?:
No

Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of aks-engine?:
0.28.0

Kubernetes version:
1.13.1

What happened:
The kube-controller-manager logs show frequent errors indicating that requests are being rate limited by Azure:

E0201 14:48:49.915225 1 attacher.go:138] azureDisk - Error checking if volumes (mycluster-dynamic-pvc-$uuid mycluster-dynamic-pvc-$uuid) are attached to current node ("k8s-agentpool1-1234567-vmss000002"). err=azure - cloud provider rate limited(read) for operation:VMSSGetInstanceView
E0201 14:48:49.915258 1 operation_generator.go:186] VolumesAreAttached failed for checking on node "k8s-agentpool1-1234567-vmss000002" with: azure - cloud provider rate limited(read) for operation:VMSSGetInstanceView

E0201 09:17:48.805158 1 service_controller.go:219] error processing service instrumentation/elasticsearch-logging (will retry): error getting LB for service instrumentation/elasticsearch-logging: azure - cloud provider rate limited(read) for operation:LBList
I0201 09:17:48.805172 1 event.go:221] Event(v1.ObjectReference{Kind:"Service", Namespace:"instrumentation", Name:"myservice", UID:"919ec973-1358-11e9-a57f-000d3a29b594", APIVersion:"v1", ResourceVersion:"3891", FieldPath:""}): type: 'Warning' reason: 'CleanupLoadBalancerFailed' Error cleaning up load balancer (will retry): error getting LB for service mynamespace/myservice: azure - cloud provider rate limited(read) for operation:LBList
E0201 09:17:48.806361 1 azure_backoff.go:166] LoadBalancerClient.List(mycluster) - backoff: failure, will retry, err=azure - cloud provider rate limited(read) for operation:LBList

What you expected to happen:
No errors when kube-controller-manager makes requests to Azure APIs.

How to reproduce it (as minimally and precisely as possible):
3 masters in an availability set, agents in a VMSS.

Anything else we need to know:

jackfrancis (Member) commented:

@feiskyer Is this rate limiting something that can be addressed w/ backoff/QPS tuning?

feiskyer (Member) commented Feb 2, 2019

Yep, it's cloudProviderRateLimitQPS. You should increase it based on the logs.

danmassie (Contributor, Author) commented Mar 4, 2019

@feiskyer are you referring to https://github.com/Azure/aks-engine/blob/e250a6c5065cc941bcc9cb9feb6461a1449b2a47/docs/howto/kubernetes-large-clusters.md#backoff-configuration-options ?

We currently have the following set as per the defaults:

"cloudProviderBackoff": true,
        "cloudProviderBackoffRetries": 6,
        "cloudProviderBackoffJitter": 1,
        "cloudProviderBackoffDuration": 5,
        "cloudProviderBackoffExponent": 1.5,
        "cloudProviderRateLimit": true,
        "cloudProviderRateLimitQPS": 3,
        "cloudProviderRateLimitBucket": 10,

What do you suggest setting them to? And which Kubernetes service flags do these relate to? I can't see any at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/

feiskyer (Member) commented Mar 5, 2019

Those options are configured in the cloud config file, e.g. /etc/kubernetes/azure.json. The meaning of each option is documented here.

The right values really depend on the cluster. From my experience, setting cloudProviderRateLimitQPS to 6 should work better for most cases.
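
For illustration, a minimal sketch of how that might look in /etc/kubernetes/azure.json, showing only the throttling-related fields with the QPS raised to 6 (credentials and the rest of the file omitted):

{
    "cloudProviderBackoff": true,
    "cloudProviderBackoffRetries": 6,
    "cloudProviderBackoffJitter": 1,
    "cloudProviderBackoffDuration": 5,
    "cloudProviderBackoffExponent": 1.5,
    "cloudProviderRateLimit": true,
    "cloudProviderRateLimitQPS": 6,
    "cloudProviderRateLimitBucket": 10,
    ...
}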

danmassie (Contributor, Author) commented:

Should that be made the default? This isn't a particularly large cluster (3 masters, 6 agents) and we see the throttling across a number of Azure APIs.

feiskyer (Member) commented Mar 5, 2019

> Should that be made the default? This isn't a particularly large cluster (3 masters, 6 agents) and we see the throttling across a number of Azure APIs.

That's reasonable; the throttling happens because the rate limit is applied across all Azure APIs. @jackfrancis could we set the default value of cloudProviderRateLimitQPS higher?

jackfrancis (Member) commented:

Setting the QPS higher would increase the rate of traffic to Azure APIs. Don't we want to do the opposite?

feiskyer (Member) commented Mar 6, 2019

> Setting the QPS higher would increase the rate of traffic to Azure APIs. Don't we want to do the opposite?

Right, but the current QPS is too small (still far from Azure's rate limit), and it may make Kubernetes slow to finish its work when there are a lot of requests.

danmassie (Contributor, Author) commented:

What is Azure's rate limit?

palma21 added this to To do in backlog via automation Mar 21, 2019
stale bot commented May 5, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label May 5, 2019
stale bot closed this as completed May 12, 2019
backlog automation moved this from To do to Done May 12, 2019
oronboni commented Jul 23, 2019

I found the same errors.
Warning LoadBalancerUpdateFailed 37m (x23 over 74m) service-controller Error updating load balancer with new hosts map[XXX:{} XXX:{} XXX:{} XXX:{} k8s-atpstg1eus2-15640228-vmss00000f:{} XXX:{} XXX:{} XXX:{} XXX:{} XXX:{} XXX:{} XXX:{} XXX:{} XXX:{}]: [EnsureHostInPool(ingress/nginx-ingress-controller): backendPoolID(/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Network/loadBalancers/XXX/backendAddressPools/XXX) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:VMSSGet", EnsureHostInPool(wdatp-infra-system-ingress/nginx-ingress-controller): backendPoolID(/subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Network/loadBalancers/XXX/backendAddressPools/XXX) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:VMSSGetInstanceView"]
Warning ListLoadBalancers 4m7s (x50 over 178m) azure-cloud-provider azure - cloud provider rate limited(read) for operation:LBList
Normal EnsuringLoadBalancer 2m35s (x41 over 178m) service-controller Ensuring load balancer

sylr (Contributor) commented Jul 30, 2019

> azure - cloud provider rate limited(read)

Does this message mean that we are throttled by the internal Kubernetes throttling mechanism or by Azure?

sylr (Contributor) commented Jul 30, 2019

I have a "busy" dev cluster with 3 masters, 19 agents (4 VMSS), and 35 LoadBalancer Services.

Setting cloudProviderRateLimitQPS to 50 did not help; I still had cloud provider rate limited(read) errors.

{
    "cloudProviderBackoffDuration": 5,
    "cloudProviderBackoffExponent": 1.5,
    "cloudProviderBackoffJitter": 1,
    "cloudProviderBackoffRetries": 6,
    "cloudProviderRateLimitBucket": 10,
    "cloudProviderRateLimitQPS": 50,
    ...
}

In the end I disabled rate limiting altogether, and since then there have been no more errors.

jackfrancis (Member) commented:

@sylr you mean you set "cloudProviderRateLimit": false in the kubernetesConfig?

jackfrancis (Member) commented:

Re-opening as we're observing similar issues; it seems we can improve our defaults here.

jackfrancis reopened this Jul 30, 2019
stale bot removed the stale label Jul 30, 2019
jackfrancis (Member) commented:

This is the message I'm receiving:

I0730 22:05:47.417773       1 event.go:209] Event(v1.ObjectReference{Kind:"Service", Namespace:"default", Name:"ingress-nginx-ilb", UID:"ca23b592-b315-11e9-8d8c-000d3afe148b", APIVersion:"v1", ResourceVersion:"1539", FieldPath:""}): type: 'Warning' reason: 'LoadBalancerUpdateFailed' Error updating load balancer with new hosts map[1119k8s01000000:{} 1119k8s01000001:{} 1119k8s01000002:{} 1119k8s01000003:{} 1119k8s01000004:{} 1119k8s01000005:{} 1119k8s01000006:{} k8s-agentpool1-11196332-vmss000000:{} k8s-agentpool1-11196332-vmss000001:{}]: EnsureHostInPool(default/ingress-nginx-ilb): backendPoolID(/subscriptions/31614129-0f24-4a4c-9731-53ceecc3017d/resourceGroups/kubernetes-westus2-90057/providers/Microsoft.Network/loadBalancers/kubernetes-westus2-90057-internal/backendAddressPools/kubernetes-westus2-90057) - failed to ensure host in pool: "azure - cloud provider rate limited(read) for operation:VMSSGetInstanceView"

This happens when creating an ILB against a cluster whose nodes look like this:

    "masterProfile": {
      "count": 1,
      "dnsPrefix": "",
      "vmSize": "Standard_D2_v3"
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 2,
        "vmSize": "Standard_D2_v3",
        "availabilityProfile": "VirtualMachineScaleSets"
      },
      {
        "name": "agentwin",
        "count": 7,
        "vmSize": "Standard_D2_v3",
        "availabilityProfile": "VirtualMachineScaleSets",
        "osType": "Windows"
      }
    ],

Setting the limit to 100 removed the rate limits. Setting it to 11 did not. I'm retrying with 50.

The curious thing about this is that the rate limiter doesn't seem to accommodate any "burst": if reconciling a request (e.g., a new LB) ever requires some number of simultaneous requests, the rate limiter never disengages and the reconciliation never finishes.

If that's correct, then we need to figure out a formula to calculate that simultaneous request count for the most expensive Azure API request (is LB the most expensive?), and use the api model to assign a default value that will work. The number of VMSS instances is certainly a key multiplier in that formula.

Also a challenge here is that for VMSS scenarios the instance count is by design not meant to be fixed, so we need to think about how to continually adjust these rate-limiting values as the size of the cluster evolves.

@juan-lee @CecileRobertMichon FYI on the last point
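
For context on the burst point above: as I understand it, these two settings configure a token-bucket limiter, where cloudProviderRateLimitBucket is the maximum burst size and cloudProviderRateLimitQPS is the steady refill rate. If that's right, then with the defaults (QPS 3, bucket 10), reconciling the LB backend pool for the 9-agent cluster above needs roughly one VMSSGetInstanceView read per instance, i.e., about 9 calls, which nearly drains the bucket in a single pass; retries arriving faster than the 3/s refill would then keep the bucket empty, matching the "never disengages" behavior.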

sylr (Contributor) commented Jul 31, 2019

> @sylr you mean you set "cloudProviderRateLimit": false in the kubernetesConfig?

Yes
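
For anyone following along, a minimal sketch of where that flag lives in the aks-engine api model (assuming the usual kubernetesConfig placement; all other fields omitted):

{
    "properties": {
        "orchestratorProfile": {
            "kubernetesConfig": {
                "cloudProviderRateLimit": false
            }
        }
    }
}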

> Setting the limit to 100 removed the rate limits. Setting it to 11 did not. I'm retrying with 50.

100 QPS * 60 seconds * 60 minutes = 360,000 queries per hour

Azure's read limit per subscription is 12,000 queries per hour, so I believe setting 100 QPS amounts to deactivating rate limiting at the cloud provider level, because Azure will throttle calls long before the cloud provider's rate limiting kicks in.
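
As a rough cross-check, running that figure the other way: 12,000 reads per hour ÷ 3,600 seconds ≈ 3.3 QPS, so the default cloudProviderRateLimitQPS of 3 already sits at about the sustained read budget Azure allows, and any higher sustained rate just moves the throttling from the cloud provider to Azure itself.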

I suggested this: kubernetes-sigs/cloud-provider-azure#202

sylr (Contributor) commented Jul 31, 2019

Also, I believe this issue should be renamed "kube-manager Azure API calls are being rate limited by cloud provider".
