This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

feat: enable smart cloudprovider rate limiting #1693

Merged
acs-bot merged 7 commits into Azure:master on Aug 5, 2019

Conversation

@jackfrancis (Member) commented Jul 30, 2019

Reason for Change:

This PR changes the cloudprovider-enforced rate limiter configuration for VMSS scenarios to allow for more "burstability". Specifically, for Kubernetes LoadBalancer reconciliation, VMSS instance Get() calls are made during backend pool creation; one Get() call per VMSS instance. Because we can't anticipate in advance exactly how many VMSS instances will be served by an AKS Engine control plane (and because we don't currently have any functionality to update control plane cloudprovider configuration in response to scale up/down events), we need to configure the control plane so that it can handle the maximum number of supported VMSS instances. At present, we support up to 100 nodes per pool, and so we configure the "burstability" of the rate limit to be a factor of 100 per pool in VMSS cluster configurations.

In addition, this PR sets the actual QPS configuration to at least 10% of the burstability tolerance, to prevent situations where the real-world QPS (the rate at which the rate limiter releases requests from the "burst queue") can never empty that queue.
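
For concreteness, here is a minimal sketch of the sizing logic described above (the helper names, the pool counting, and the exact per-pool burst factor are illustrative assumptions; the merged defaults live in SetCloudProviderRateLimitDefaults in pkg/api/types.go):

package main

import "fmt"

const (
	maxVMSSNodesPerPool  = 100 // current per-pool node cap noted above
	minQPSToBucketFactor = 0.1 // QPS floor: at least 10% of the bucket size
)

// rateLimitDefaults sizes the burst bucket so that one LoadBalancer
// reconciliation (one VMSS Get() per instance) fits into it, and floors
// the QPS so the bucket can actually drain.
func rateLimitDefaults(vmssPoolCount int) (bucket int, qps float64) {
	bucket = maxVMSSNodesPerPool * vmssPoolCount
	qps = minQPSToBucketFactor * float64(bucket)
	return bucket, qps
}

func main() {
	bucket, qps := rateLimitDefaults(2) // e.g., two VMSS pools of up to 100 nodes each
	fmt.Printf("cloudProviderRateLimitBucket=%d cloudProviderRateLimitQPS=%.1f\n", bucket, qps)
}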

This PR does not change the node cloudprovider configuration for AKS (which would have no effect, but this skip is there for sanity).

Issue Fixed:

Fixes #420

Requirements:

Notes:

codecov bot commented Jul 31, 2019

Codecov Report

Merging #1693 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1693      +/-   ##
==========================================
+ Coverage    76.2%   76.21%   +0.01%     
==========================================
  Files         130      130              
  Lines       19194    19204      +10     
==========================================
+ Hits        14626    14636      +10     
  Misses       3771     3771              
  Partials      797      797

@jackfrancis (Member Author)

@feiskyer do you have any thoughts on an iterative improvement for programmatically generating a rate limit QPS value based on the cluster IaaS configuration? My initial prototype of multiplying the number of VMSS nodes in a cluster by 4 seems to improve the reliability of LB creation for clusters that have >= 10 VMSS nodes at cluster creation time.

Again, this is a stopgap fix, as we don't currently have anything in place that will continually increase or decrease the rate limit enforcement of the control plane based on changes to the cluster's IaaS over time. The goal here is to improve real world LB functionality for folks who are using VMSS and who are unable to create LB resources immediately after the cluster comes online.
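
A quick illustration of that prototype heuristic (the function name is hypothetical, and the factor of 4 is the value mentioned above, not necessarily what was merged):

package main

import "fmt"

// prototypeQPS scales the rate limiter QPS with cluster size, per the
// heuristic described in the comment above.
func prototypeQPS(vmssNodeCount int) float64 {
	return 4 * float64(vmssNodeCount)
}

func main() {
	fmt.Println(prototypeQPS(10)) // a 10-node VMSS cluster would get QPS 40
}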

pkg/api/types.go Outdated
func (p *Properties) SetCloudProviderRateLimitDefaults() {
	if p.OrchestratorProfile.KubernetesConfig.CloudProviderRateLimitBucket == 0 {
		if p.HasVMSSAgentPool() {
			var maxVMSSNodesSupported int
@jackfrancis (Member Author)

@feiskyer the idea here is that certain events (e.g., LoadBalancer reconciliation) generate VMSS read API calls, so we need to provide burstability that accommodates large VMSS clusters

@jackfrancis (Member Author)

also @yangl900 FYI

pkg/api/types.go Outdated
		}
	}
	if p.OrchestratorProfile.KubernetesConfig.CloudProviderRateLimitQPS == 0 {
		const minQPSToBucketFactor float64 = 0.1
@jackfrancis (Member Author)

and @feiskyer @yangl900 the idea here is that we might not want to have the ratio of QPS to bucket size be too extreme; not sure if this is overthinking it...

Member

makes sense

@feiskyer (Member) commented Aug 2, 2019

@jackfrancis thanks for improving this. I think there's still no perfect way to determine those limits, because they also depend on the number and update frequency of services and nodes.

Per the Azure documentation, for each Azure subscription and tenant, Resource Manager allows up to 12,000 read requests per hour and 1,200 write requests per hour. So the safest read QPS is 3 and the safest write QPS is 0.3 (though CloudProviderRateLimitQPSWrite should be an integer).
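
The arithmetic behind those figures, as a small sketch (the per-hour limits are the ARM throttling numbers quoted above):

package main

import "fmt"

func main() {
	const secondsPerHour = 3600.0
	const armReadsPerHour = 12000.0 // ARM read requests allowed per hour
	const armWritesPerHour = 1200.0 // ARM write requests allowed per hour

	fmt.Printf("sustainable read QPS:  %.2f\n", armReadsPerHour/secondsPerHour)  // ~3.33
	fmt.Printf("sustainable write QPS: %.2f\n", armWritesPerHour/secondsPerHour) // ~0.33
}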

But those operations don't happen every second, and usually there is a burst of requests during a short period when services or nodes change.

So I think the changes here make sense to reduce rate limiting issues on the cloud provider side.

@jackfrancis changed the title from "[WIP] feat: enable smart cloudprovider rate limiting" to "feat: enable smart cloudprovider rate limiting" on Aug 2, 2019
@jackfrancis (Member Author)

Validated that this fix unblocks LoadBalancer reconciliation for VMSS clusters of up to 200 nodes (2 pools of 100).

@jackfrancis (Member Author)

@yangl900 @palma21 FYI: these VMSS-influenced cloudprovider changes work for us in our testing of large VMSS node pool cluster configurations; I recommend investing in your own investigation to tune the "burst queue" (i.e., bucket size) for large VMSS clusters.

@palma21 (Member) commented Aug 3, 2019

cc @chengliangli0918 @xizhamsft
@jnoller (large cluster relation only)

@mboersma (Member) commented Aug 5, 2019

/lgtm

@mboersma (Member) commented Aug 5, 2019

/approve

@CecileRobertMichon (Contributor) left a comment

lgtm

@mboersma (Member) commented Aug 5, 2019

/lgtm

acs-bot added the lgtm label on Aug 5, 2019
acs-bot merged commit 13dbdc5 into Azure:master on Aug 5, 2019
acs-bot commented Aug 5, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon, jackfrancis, mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [CecileRobertMichon,jackfrancis,mboersma]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Successfully merging this pull request may close these issues.

kube-manager is being rate limited by Azure