feat: enable smart cloudprovider rate limiting #1693
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1693 +/- ##
==========================================
+ Coverage 76.2% 76.21% +0.01%
==========================================
Files 130 130
Lines 19194 19204 +10
==========================================
+ Hits 14626 14636 +10
Misses 3771 3771
Partials 797 797
@feiskyer do you have any thoughts on an iterative improvement for programmatically generating a rate limit QPS value based on the cluster IaaS configuration? My initial prototype of multiplying

Again, this is a stopgap fix, as we don't currently have anything in place that will continually increase or decrease the control plane's rate limit enforcement as the cluster's IaaS changes over time. The goal here is to improve real-world LB functionality for folks who are using VMSS and are unable to create LB resources immediately after the cluster comes online.
Force-pushed 3468d44 to 29532b2
pkg/api/types.go
Outdated
func (p *Properties) SetCloudProviderRateLimitDefaults() {
	if p.OrchestratorProfile.KubernetesConfig.CloudProviderRateLimitBucket == 0 {
		if p.HasVMSSAgentPool() {
			var maxVMSSNodesSupported int
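The fragment above is where the diff begins sizing the rate limit bucket. A minimal, self-contained sketch of that idea (the helper name `computeRateLimitBucket` and the 100-nodes-per-pool constant are assumptions drawn from the PR description, not the actual code):

```go
package main

import "fmt"

// Assumed limit from the PR discussion: up to 100 nodes per VMSS pool.
const maxVMSSNodesPerPool = 100

// computeRateLimitBucket is a hypothetical helper: size the token bucket
// so that a LoadBalancer reconcile (one VMSS Get() call per instance) can
// burst through every node in every VMSS pool without being throttled.
func computeRateLimitBucket(vmssPoolCount, defaultBucket int) int {
	maxVMSSNodesSupported := vmssPoolCount * maxVMSSNodesPerPool
	if maxVMSSNodesSupported > defaultBucket {
		return maxVMSSNodesSupported
	}
	return defaultBucket
}

func main() {
	// Two pools of up to 100 nodes each need a bucket of 200.
	fmt.Println(computeRateLimitBucket(2, 10))
}
```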
@feiskyer the idea here is that for certain events (e.g., LB) VMSS read API calls are generated, so we need to provide burstability that accommodates large VMSS clusters
also @yangl900 FYI
pkg/api/types.go
Outdated
		}
	}
	if p.OrchestratorProfile.KubernetesConfig.CloudProviderRateLimitQPS == 0 {
		const minQPSToBucketFactor float64 = 0.1
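The `minQPSToBucketFactor` constant above pins QPS to a fraction of the bucket size. A hedged sketch of how such a floor might be applied (the `deriveQPS` helper is hypothetical; only the 0.1 factor comes from the diff):

```go
package main

import "fmt"

const minQPSToBucketFactor float64 = 0.1

// deriveQPS is a hypothetical sketch: keep the sustained QPS at no less
// than 10% of the bucket size, so the burst queue can always drain.
func deriveQPS(currentQPS float64, bucket int) float64 {
	if floor := minQPSToBucketFactor * float64(bucket); currentQPS < floor {
		return floor
	}
	return currentQPS
}

func main() {
	// A bucket of 200 implies a QPS floor of 20.
	fmt.Println(deriveQPS(3, 200))
}
```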
makes sense
Force-pushed 29532b2 to d35e3e8
@jackfrancis thanks for improving this. I think there's still no perfect way to determine these limits, because they also depend on the number and update frequency of services and nodes. Per Azure documentation, for each Azure subscription and tenant, Resource Manager allows up to 12,000 read requests per hour and 1,200 write requests per hour. So the safest read QPS is 3 and write QPS is 0.3 (though CloudProviderRateLimitQPSWrite should be an integer). But those operations won't happen every second, and usually there will be a lot of requests during a short period when services or nodes change. So I think the changes here make sense to reduce the rate limit issues from the cloud provider side.
Validated this fix works to unblock LoadBalancer reconciliation for VMSS clusters of up to 200 nodes (2 pools of 100).
cc @chengliangli0918 @xizhamsft
/lgtm
/approve
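The arithmetic behind the "safest" QPS figures cited above can be made explicit: dividing the hourly Azure Resource Manager limits by 3600 seconds yields the sustained rates that can never trip throttling (a sketch assuming perfectly even traffic):

```go
package main

import "fmt"

// Azure Resource Manager throttling limits cited in the review:
// 12,000 read and 1,200 write requests per hour, per subscription.
const (
	readsPerHour   = 12000.0
	writesPerHour  = 1200.0
	secondsPerHour = 3600.0
)

func main() {
	// Sustained request rates that stay under the hourly limits.
	fmt.Printf("safe read QPS:  %.2f\n", readsPerHour/secondsPerHour)
	fmt.Printf("safe write QPS: %.2f\n", writesPerHour/secondsPerHour)
}
```

This is why the review rounds to a read QPS of 3 and a write QPS of 0.3: real traffic is bursty rather than even, which is the case the larger bucket in this PR is designed for.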
lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: CecileRobertMichon, jackfrancis, mboersma. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Reason for Change:
This PR changes the cloudprovider-enforced rate limiter configuration for VMSS scenarios to allow for more "burstability". Specifically, during Kubernetes LoadBalancer reconciliation, VMSS instance Get() calls are made while creating the backend pool: one Get() call per VMSS instance. Because we can't anticipate in advance exactly how many VMSS instances will be served by an AKS Engine control plane (and because we don't currently have any functionality to update control plane cloudprovider configuration in response to scale up/down events), we need to configure the control plane so that it can handle the maximum number of supported VMSS instances. At present, we support up to 100 nodes per pool, so we configure the "burstability" of the rate limit to be a factor of 100 per pool in VMSS cluster configurations.
In addition, this PR sets the actual QPS configuration to be at least 10% of the burstability tolerance, to prevent situations where the real world QPS (which is the allowed velocity of the rate limiter to release requests from the "burst queue") is never able to empty the "burst queue".
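The 10% floor described above can be sanity-checked with a drain-time calculation: the time for the rate limiter to release a full "burst queue" is simply bucket size divided by QPS. A minimal sketch (the `drainSeconds` helper is illustrative, not part of this PR):

```go
package main

import "fmt"

// drainSeconds computes how long the rate limiter takes to release a
// completely full burst queue at a given sustained QPS.
func drainSeconds(bucket int, qps float64) float64 {
	return float64(bucket) / qps
}

func main() {
	// With QPS pinned to at least 10% of the bucket, a full burst always
	// drains within 10 seconds, regardless of cluster size.
	fmt.Println(drainSeconds(200, 0.1*200))
	// Without the floor, a large bucket paired with a tiny QPS would
	// leave requests queued for minutes.
	fmt.Println(drainSeconds(200, 1))
}
```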
This PR does not change the node cloudprovider configuration for AKS (changing it would have no effect; the skip is there for sanity).
Issue Fixed:
Fixes #420
Requirements:
Notes: