This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

feat: enable smart cloudprovider rate limiting #1693

Merged
acs-bot merged 7 commits into Azure:master on Aug 5, 2019

Conversation

@jackfrancis (Member) commented Jul 30, 2019

Reason for Change:

This PR changes the cloudprovider-enforced rate limiter configuration for VMSS scenarios to allow for more "burstability". Specifically, for Kubernetes LoadBalancer reconciliation, VMSS instance Get() calls are made during backend pool creation; one Get() call per VMSS instance. Because we can't anticipate in advance exactly how many VMSS instances will be served by an AKS Engine control plane (and because we don't currently have any functionality to update control plane cloudprovider configuration in response to scale up/down events), we need to configure the control plane so that it can handle the maximum number of supported VMSS instances. At present, we support up to 100 nodes per pool, and so we configure the "burstability" of the rate limit to be a factor of 100 per pool in VMSS cluster configurations.

In addition, this PR sets the actual QPS configuration to at least 10% of the burstability tolerance, to prevent situations where the real-world QPS (the rate at which the rate limiter releases requests from the "burst queue") can never empty that queue.
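
For concreteness, here is a minimal sketch of the sizing logic described above (the helper names, the pool counting, and the exact per-pool burst factor are illustrative assumptions; the merged defaults live in SetCloudProviderRateLimitDefaults in pkg/api/types.go):

package main

import "fmt"

const (
	maxVMSSNodesPerPool  = 100 // current per-pool node cap noted above
	minQPSToBucketFactor = 0.1 // QPS floor: at least 10% of the bucket size
)

// rateLimitDefaults sizes the burst bucket so that one LoadBalancer
// reconciliation (one VMSS Get() per instance) fits into it, and floors
// the QPS so the bucket can actually drain.
func rateLimitDefaults(vmssPoolCount int) (bucket int, qps float64) {
	bucket = maxVMSSNodesPerPool * vmssPoolCount
	qps = minQPSToBucketFactor * float64(bucket)
	return bucket, qps
}

func main() {
	bucket, qps := rateLimitDefaults(2) // e.g., two VMSS pools of up to 100 nodes each
	fmt.Printf("cloudProviderRateLimitBucket=%d cloudProviderRateLimitQPS=%.1f\n", bucket, qps)
}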

This PR does not change the node cloudprovider configuration for AKS (which would have no effect, but this skip is there for sanity).

Issue Fixed:

Fixes #420

Requirements:

Notes:

codecov bot commented Jul 31, 2019

Codecov Report

Merging #1693 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1693      +/-   ##
==========================================
+ Coverage    76.2%   76.21%   +0.01%     
==========================================
  Files         130      130              
  Lines       19194    19204      +10     
==========================================
+ Hits        14626    14636      +10     
  Misses       3771     3771              
  Partials      797      797

@jackfrancis (Member Author)

@feiskyer do you have any thoughts on an iterative improvement for programmatically generating a rate limit QPS value based on the cluster IaaS configuration? My initial prototype of multiplying the number of VMSS nodes in a cluster by 4 seems to improve the reliability of LB creation for clusters that have >= 10 VMSS nodes at cluster creation time.

Again, this is a stopgap fix, as we don't currently have anything in place that will continually increase or decrease the rate limit enforcement of the control plane based on changes to the cluster's IaaS over time. The goal here is to improve real world LB functionality for folks who are using VMSS and who are unable to create LB resources immediately after the cluster comes online.
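
A quick illustration of that prototype heuristic (the function name is hypothetical, and the factor of 4 is the value mentioned above, not necessarily what was merged):

package main

import "fmt"

// prototypeQPS scales the rate limiter QPS with cluster size, per the
// heuristic described in the comment above.
func prototypeQPS(vmssNodeCount int) float64 {
	return 4 * float64(vmssNodeCount)
}

func main() {
	fmt.Println(prototypeQPS(10)) // a 10-node VMSS cluster would get QPS 40
}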

pkg/api/types.go Outdated
func (p *Properties) SetCloudProviderRateLimitDefaults() {
	if p.OrchestratorProfile.KubernetesConfig.CloudProviderRateLimitBucket == 0 {
		if p.HasVMSSAgentPool() {
			var maxVMSSNodesSupported int
@jackfrancis (Member Author)

@feiskyer the idea here is that certain events (e.g., LoadBalancer reconciliation) generate VMSS read API calls, so we need to provide burstability that accommodates large VMSS clusters

@jackfrancis (Member Author)

also @yangl900 FYI

pkg/api/types.go Outdated
		}
	}
	if p.OrchestratorProfile.KubernetesConfig.CloudProviderRateLimitQPS == 0 {
		const minQPSToBucketFactor float64 = 0.1
@jackfrancis (Member Author)

and @feiskyer @yangl900 the idea here is that we might not want to have the ratio of QPS to bucket size be too extreme; not sure if this is overthinking it...

Member

makes sense

@feiskyer (Member) commented Aug 2, 2019

@jackfrancis thanks for improving this. I think there's still no perfect way to determine those limits, because they also depend on the number and update frequency of services and nodes.

Per the Azure documentation, for each Azure subscription and tenant, Resource Manager allows up to 12,000 read requests per hour and 1,200 write requests per hour. So the safest read QPS is 3 and the safest write QPS is 0.3 (though CloudProviderRateLimitQPSWrite should be an integer).
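
The arithmetic behind those figures, as a small sketch (the per-hour limits are the ARM throttling numbers quoted above):

package main

import "fmt"

func main() {
	const secondsPerHour = 3600.0
	const armReadsPerHour = 12000.0 // ARM read requests allowed per hour
	const armWritesPerHour = 1200.0 // ARM write requests allowed per hour

	fmt.Printf("sustainable read QPS:  %.2f\n", armReadsPerHour/secondsPerHour)  // ~3.33
	fmt.Printf("sustainable write QPS: %.2f\n", armWritesPerHour/secondsPerHour) // ~0.33
}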

But those operations don't happen every second, and usually there is a burst of requests during a short period when services or nodes change.

So I think the changes here make sense to reduce rate limiting issues on the cloud provider side.

@jackfrancis changed the title from "[WIP] feat: enable smart cloudprovider rate limiting" to "feat: enable smart cloudprovider rate limiting" on Aug 2, 2019
@jackfrancis (Member Author)

Validated that this fix unblocks LoadBalancer reconciliation for VMSS clusters of up to 200 nodes (2 pools of 100).

@jackfrancis (Member Author)

@yangl900 @palma21 FYI: these VMSS-influenced cloudprovider changes work for us in our testing of large VMSS node pool cluster configurations; I recommend investing in your own investigation to tune the "burst queue" (i.e., bucket size) for large VMSS clusters.

@palma21 (Member) commented Aug 3, 2019

cc @chengliangli0918 @xizhamsft
@jnoller (large cluster relation only)

@mboersma (Member) commented Aug 5, 2019

/lgtm

@mboersma (Member) commented Aug 5, 2019

/approve

@CecileRobertMichon (Contributor) left a comment

lgtm

@mboersma (Member) commented Aug 5, 2019

/lgtm

acs-bot added the lgtm label on Aug 5, 2019
acs-bot merged commit 13dbdc5 into Azure:master on Aug 5, 2019
acs-bot commented Aug 5, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon, jackfrancis, mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [CecileRobertMichon,jackfrancis,mboersma]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Successfully merging this pull request may close these issues.

kube-manager is being rate limited by Azure