fix #51135 make CFS quota period configurable #63437

Conversation

szuecs
Member

@szuecs szuecs commented May 4, 2018

What this PR does / why we need it:

This PR makes it possible for users to change the CFS quota period from the default 100ms to some other value between 1µs and 1s.
#51135 shows that multiple production users have serious issues running reasonable workloads in Kubernetes. The 100ms CFS quota period adds far too much latency.

Which issue(s) this PR fixes:
Fixes #51135

Special notes for your reviewer:

Release note:

Adds a kubelet parameter and config option to change the CFS quota period from the default 100ms to some other value between 1µs and 1s. This was done to improve response latencies for workloads running in clusters with guaranteed and burstable QoS classes.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 4, 2018
@k8s-ci-robot k8s-ci-robot requested review from dims and pmorie May 4, 2018 15:45
@liggitt
Member

liggitt commented May 4, 2018

@kubernetes/sig-node-pr-reviews @derekwaynecarr

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 4, 2018
@derekwaynecarr
Member

The value used today is the default in many OS distributions.

There are many users running with 100ms today that could see a change in behavior.

I was not at KubeCon, so I missed some context for the discussion. I prefer we have a kubelet flag to tweak the desired cfs_period_us setting on a Linux host rather than hard-coding it. Red Hat has customers that disable CFS quota entirely via the existing flag. I think it's reasonable to pair that flag with an additional option to tweak the default CFS period. I do not think we need to let it be tweaked per pod.

@dchen1107 @vishh

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 4, 2018
@dims
Member

dims commented May 4, 2018

@derekwaynecarr The "cpu-cfs-quota" command line flag in kubelet, right? So we would have a "cpu-cfs-quota-period" to go along with it?

@derekwaynecarr
Member

derekwaynecarr commented May 4, 2018 via email

@dims
Member

dims commented May 4, 2018

@szuecs looking forward to an update in this PR with the suggestion above

@derekwaynecarr
Member

see my comment here for why a flag is preferred. Not all nodes are tuned for latency.

#51135 (comment)

@szuecs
Member Author

szuecs commented May 5, 2018

Thanks for all your comments. I will probably work on it next week; we have a short week, so I am not sure whether it will take longer than that.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 7, 2018
@szuecs
Member Author

szuecs commented May 7, 2018

@derekwaynecarr I added the CLI flag and config option. I hope I added it everywhere it is needed.

I tested that:

  • tests run and report ok: % go test ./pkg/kubelet/cm -v
  • I can build the kubelet locally: % make WHAT=cmd/kubelet

Let me know if I have to change anything.

@vishh
Contributor

vishh commented May 7, 2018

/hold
I'd like for us to reach a consensus on #51135 (comment) prior to merging this patch.

@szuecs szuecs changed the title from "fix #51135 set default quota period to 5ms based on user experience" to "fix #51135 make CFS quota period configurable" on May 7, 2018
@dims
Member

dims commented May 8, 2018

@szuecs I think you need to set the default value in the SetDefaults_KubeletConfiguration method too:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/kubeletconfig/v1beta1/defaults.go#L151-L153
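
For illustration, the defaulting dims points at would roughly take this shape (a sketch only, assuming the *metav1.Duration field CPUCFSQuotaPeriod added in this PR and the pointer-based pattern already used in SetDefaults_KubeletConfiguration):

```go
// Sketch, not the actual defaults.go change: give CPUCFSQuotaPeriod a 100ms
// default (the kernel's cpu.cfs_period_us default of 100000µs) whenever the
// config does not set it.
if obj.CPUCFSQuotaPeriod == nil {
	obj.CPUCFSQuotaPeriod = &metav1.Duration{Duration: 100 * time.Millisecond}
}
```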

@dims
Member

dims commented May 8, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 8, 2018
@dims
Member

dims commented May 8, 2018

@derekwaynecarr @vishh Do we count this as a new feature? If so, does it need an alpha feature gate?

@dims
Member

dims commented May 8, 2018

@derekwaynecarr @vishh Also, do we leave the current default as-is? (when nothing is specified in the command line)

@k8s-ci-robot k8s-ci-robot added area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels May 8, 2018
@dims
Member

dims commented May 8, 2018

@szuecs

  • if you run hack/update-bazel.sh, that should take care of the verify job failure (and the bazel-test job failure too)
  • the fix for the e2e failure should have already gone in, but you may have to rebase onto master

So, please squash the commits (for easier review) and rebase onto master, and we should get all green (fingers crossed)

thanks,
Dims

@szuecs szuecs force-pushed the fix/51135-set-saneer-default-cpu.cfs_period branch from 8bba451 to f0b2f0d on May 9, 2018 16:17
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, dims, szuecs, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 63437, 68081). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

@k8s-github-robot k8s-github-robot merged commit 147520f into kubernetes:master Sep 1, 2018
@k8s-ci-robot
Contributor

@szuecs: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-local-e2e 0a3c4dbc6c8fd43ade794aa7a5cd8fa0bc251637 link /test pull-kubernetes-local-e2e
pull-kubernetes-e2e-gce 588d280 link /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hjacobs

hjacobs commented Sep 2, 2018

@derekwaynecarr @dims @timothysc @szuecs thanks!

@dashpole
Contributor

dashpole commented Sep 6, 2018

This has broken the release-blocking alphafeatures test suite: https://k8s-testgrid.appspot.com/sig-release-1.12-blocking#gce-cos-1.12-alphafeatures

The kubelet is crashlooping:
server.go:207] invalid configuration: CPUCFSQuotaPeriod (--cpu-cfs-quota-period) {0s} must be between 1usec and 1sec, inclusive

Tracking bug for release-blocking failures: #68313
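
For context, the failing check is a simple range validation on the configured period; a sketch of its shape (not the kubelet's actual validation code) follows. With no default applied, the period arrives as the zero value {0s} and fails the check at startup.

```go
// Sketch: reject CFS quota periods outside [1µs, 1s].
func validateCPUCFSQuotaPeriod(period time.Duration) error {
	if period < time.Microsecond || period > time.Second {
		return fmt.Errorf("invalid configuration: CPUCFSQuotaPeriod (--cpu-cfs-quota-period) {%v} must be between 1usec and 1sec, inclusive", period)
	}
	return nil
}
```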

@guineveresaenger
Contributor

@dashpole @derekwaynecarr @dims please give this your immediate attention re: #68313.

@@ -154,6 +155,9 @@ func SetDefaults_KubeletConfiguration(obj *KubeletConfiguration) {
if obj.CPUCFSQuota == nil {
obj.CPUCFSQuota = utilpointer.BoolPtr(true)
}
if obj.CPUCFSQuotaPeriod == nil && obj.FeatureGates[string(features.CPUCFSQuotaPeriod)] {
Contributor

I don't know if this actually works... We generally do not feature gate the defaulting of flags. We generally only place the feature gate around code we don't want to run, rather than around configuration.

Member

This does look like a problem. The flag should always default; it just shouldn't have been used. Apologies for missing this in review.
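
Put differently, the defaulting should be unconditional and the feature gate should only decide whether the configured value is honored. A rough sketch of that shape (illustrative names; cpuCFSQuotaPeriod stands for the configured metav1.Duration, and the microsecond conversion is an assumption):

```go
// Always default the config field, regardless of the feature gate.
if obj.CPUCFSQuotaPeriod == nil {
	obj.CPUCFSQuotaPeriod = &metav1.Duration{Duration: 100 * time.Millisecond}
}

// Gate only the consumption of the value, falling back to the built-in
// 100000µs quotaPeriod when the gate is off.
var period int64 = quotaPeriod
if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.CPUCFSQuotaPeriod) {
	period = int64(cpuCFSQuotaPeriod.Duration / time.Microsecond)
}
```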

@dashpole
Copy link
Contributor

dashpole commented Sep 7, 2018

Fix is out: #68386

k8s-github-robot pushed a commit that referenced this pull request Sep 7, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Remove feature gate from kubelet defaulting

**What this PR does / why we need it**:
Fixes a release-blocking test: https://k8s-testgrid.appspot.com/sig-release-1.12-blocking#gce-cos-1.12-alphafeatures
Regression added by #63437
This solution was discussed on slack in the sig-release channel
This should be targeted for 1.12

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Issue #68313

**Special notes for your reviewer**:
/hold
testing to make sure this fixes the issue
Using: `make test-e2e-node FOCUS=ImageGCNoEviction SKIP= PARALLELISM=1 REMOTE=true TEST_ARGS='--feature-gates=CustomCPUCFSQuotaPeriod=true'` to reproduce the issue, as it runs a test with the feature gate enabled.

**Release note**:
```release-note
NONE
```

/assign @dims @derekwaynecarr 
/sig node
/kind bug
/priority critical-urgent
@warmchang
Contributor

ping @Pingan2017

milesbxf added a commit to monzo/kubernetes that referenced this pull request Mar 19, 2019
Adds a monzo.com/cpu-period resource, which allows tuning the period of
time over which the kernel tracks CPU throttling. In upstream Kubernetes
versions pre-1.12, this is not tunable and is hardcoded to the kernel
default (100ms).

We originally introduced this after seeing long GC pauses clustered
around 100ms [1], which was eventually traced to CFS throttling.
Essentially, for very latency-sensitive & bursty
workloads (like HTTP microservices!) it's recommended to set the CFS
quota period lower. We mostly set ours at 5ms across the board. See [2]
and [3] for further discussion in the Kubernetes repository.

This is fixed in upstream 1.12 via a slightly different path [4]; the
period is now tunable via a kubelet CLI flag. This doesn't give us as
fine-grained control, but we can still set this and optimise for the
vast majority of our workloads.

[1] golang/go#19378
[2] kubernetes#51135
[3] kubernetes#67577
[4] kubernetes#63437
milesbxf added a commit to monzo/kubernetes that referenced this pull request Mar 19, 2019
Adds a monzo.com/cpu-period resource, which allows tuning the period of
time over which the kernel tracks CPU throttling. In upstream Kubernetes
versions pre-1.12, this is not tunable and is hardcoded to the kernel
default (100ms).

We originally introduced this after seeing long GC pauses clustered
around 100ms [1], which was eventually traced to CFS throttling.
Essentially, for very latency-sensitive & bursty
workloads (like HTTP microservices!) it's recommended to set the CFS
quota period lower. We mostly set ours at 5ms across the board. See [2]
and [3] for further discussion in the Kubernetes repository.

This is fixed in upstream 1.12 via a slightly different path [4]; the
period is now tunable via a kubelet CLI flag. This doesn't give us as
fine-grained control, but we can still set this and optimise for the
vast majority of our workloads.

[1] golang/go#19378
[2] kubernetes#51135
[3] kubernetes#67577
[4] kubernetes#63437

Squashed commits:

commit 61551b0
Merge: a446c68 de2c6cb
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Wed Mar 13 16:16:17 2019 +0000

    Merge pull request #2 from monzo/v1.9.11-kubelet-register-cpu-period

    Register nodes with monzo.com/cpu-period resource

commit de2c6cb
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Wed Mar 13 15:14:58 2019 +0000

    Register nodes with monzo.com/cpu-period resource

    We have a custom pod resource which allows tuning the CPU throttling
    period. Upgrading to 1.9 causes this to break scheduling logic, as the
    scheduler and pod preemption controller take this resource into account
    when deciding where to place pods, and which pods to preempt.

    This patches the kubelet so that it registers its node with a large
    amount of this resource - 10000 * max number of pods (default 110). We
    typically run pods with this set to 5000, so this should be plenty.

commit a446c68
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Tue Jan 29 16:43:03 2019 +0000

    ResourceConfig.CpuPeriod is now uint64, not int64

    Some changes to upstream dependencies between v1.7 and v1.9 mean that
    the CpuPeriod field of ResourceConfig has changed type; unfortunately
    this means the Monzo CFS period patch doesn't compile.

    This won't change behaviour at all - the apiserver already validates
    that `monzo.com/cpu-period` can never be negative. The only edge case is
    if someone sets it higher than the int64 positive bound (this will
    result in an overflow), but I don't think this is worth mitigating

commit 1ead2d6
Author: Oliver Beattie <oliver@obeattie.com>
Date:   Wed Aug 9 22:57:53 2017 +0100

    [PLAT-713] Allow the CFS period to be tuned for containers
@bartoszhernas

Our PHP container started seeing problems connected with CPU limits.

We started profiling and saw that the initSoapClient method, which usually takes 10ms, started taking up to 120s (no mistake, 2 minutes...). We do not use any container limits, we only use requests. After requesting much more CPU than possibly needed (40%), the issue disappeared.

We are hosted on GKE; is there an option to disable CPU limits or reduce the CFS period there? Does anyone have an idea how to fix this problem on GKE?

@@ -220,6 +220,8 @@ type KubeletConfiguration struct {
// cpuCFSQuota enables CPU CFS quota enforcement for containers that
// specify CPU limits
CPUCFSQuota bool
// CPUCFSQuotaPeriod sets the CPU CFS quota period value, cpu.cfs_period_us, defaults to 100ms
Contributor

I'm super confused about this one and can't figure it out by myself: the cpu.cfs_period_us default is described as "100ms", but the field is actually in microseconds:

cpu.cfs_period_us: the length of a period (in microseconds)

In the change I'm commenting on, the author clearly acted under the assumption that milliseconds are used in k8s, and now that assumption is all over the test cases and comments. But I wonder where that 1000x discrepancy between the k8s value and the Linux value comes from? And does it exist, or is the k8s default actually in microseconds, the same as Linux? Unless we multiply that default value of QuotaPeriod by 1000 somewhere, converted to time.Duration it's 100 microseconds and not milliseconds.

cc @vishh @sjenning and especially @hjacobs: I hope you all have some context about this and can tell me why I'm wrong and the k8s code comments are right.

Member Author

Yes, cfs_period_us is in µs; that's why it has the _us suffix.
Maybe https://go.dev/play/p/65fG1OmLFYN helps to understand.
As far as I remember, the time.Duration is only used if you want to set it via the kubelet flag/config.
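
A tiny standalone illustration of the unit handling (not the kubelet code): the 100ms Go duration used as the kubelet default converts to exactly the kernel's cpu.cfs_period_us default of 100000 microseconds.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// The kubelet default is expressed as a Go duration of 100 milliseconds.
	period := 100 * time.Millisecond

	// cpu.cfs_period_us takes microseconds, so the value written into the
	// cgroup file is the duration divided by one microsecond.
	fmt.Println(int64(period / time.Microsecond)) // 100000
}
```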

Contributor

@paskal paskal Jul 22, 2022

The comment in that MR says "ms". In contrast, I think it had better be "microseconds". I can send an MR to change it, as well as the explicit millisecond mentions in the current code, where it should be microseconds. I'll ping you in it.

In this MR, you wrote a milliseconds duration in multiple places in defaults and tests (CPUCFSQuotaPeriod: metav1.Duration{Duration: 100 * time.Millisecond},); I wonder if that should be changed to microseconds to match the actual default value of 100 microseconds?

Also, the discussion which led to this MR has multiple people talking about changing the value from 100 milliseconds to 5/25 milliseconds for testing. Still, it doesn't make sense unless everyone was altering the source code in microseconds, thinking it's milliseconds. @hjacobs even did a talk about it, discussing the default value as if it were in milliseconds. Your code introducing milliseconds and not microseconds (so 1000x) contributed to my confusion, and I wonder if it's intentional or if you are mistaken and it should be changed to microseconds to change the real default.

The only way the current code would make sense to me is if the value you feed to k8s (100 milliseconds) is divided by 1000 at some point and then matches the real Linux default of 100 microseconds, but I haven't found such a place yet, and I doubt that's the case.

Contributor

@szuecs, the simplest form of the question I can ask is the following.

Does the 100ms value of CPUCFSQuotaPeriod result in the same k8s behaviour as the default Linux value of 100 microseconds, or is it 1000 times higher, since it's milliseconds versus microseconds in Linux?

Contributor

I was wrong, and the original code was correct. 100ms is indeed the default value in Linux, and I got confused by conversions in two places in the code that had no comments about them. PR #112123 adds comments to the code added here.

// we set the period to 100ms by default
period = quotaPeriod
if !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.CPUCFSQuotaPeriod) {
period = quotaPeriod
Contributor

@szuecs that's the precise place that worries me: previously we used quotaPeriod as-is, which, translated to time.Duration, equals 100 microseconds. And now we set it to cpuCFSQuotaPeriod, which according to the code and comments defaults to 100 milliseconds.

Member Author

  1. I don't see how this code can be worrying, because it shows the same assignment as before. If the feature gate is not set, it is set to quotaPeriod as before; see https://github.com/kubernetes/kubernetes/pull/63437/files/588d2808b77d11f235b6eba5c21bcaa89a2f7804#r932625167
  2. If you enable the feature gate, you can set it to the value you like.

Contributor

"ms" mentioned across the code alongside the tests with milliseconds and not microseconds confused me. Now I see it's still microseconds, and comments should be altered, and maybe tests as well. I'll prepare the PR, thanks for the help!

Member Author

Thanks @paskal, makes sense to update it if it's wrong :)

Contributor

I've created #111520 about it.

Contributor

I was wrong, and the original code was correct. 100ms is indeed the default value in Linux, and I got confused by conversions in two places in the code that had no comments about them. PR #112123 adds comments to the code added here.

Member Author

I think I read the kernel documentation at least 5 times when I did the investigation and wrote the code here.

const (
// Taken from lmctfy https://github.com/google/lmctfy/blob/master/lmctfy/controllers/cpu_controller.cc
minShares = 2
sharesPerCPU = 1024
milliCPUToCPU = 1000

// 100000 is equivalent to 100ms
quotaPeriod = 100 * minQuotaPeriod
quotaPeriod = 100000
minQuotaPeriod = 1000
Member Author

The quotaPeriod calculation was inlined:
100 * minQuotaPeriod = 100 * 1000 = 100000

Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/apiserver
  • area/kubeadm
  • area/kubelet
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API
  • kind/feature Categorizes issue or PR as related to a new feature.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • release-note Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/apps Categorizes an issue or PR as relevant to SIG Apps.
  • sig/architecture Categorizes an issue or PR as relevant to SIG Architecture.
  • sig/cli Categorizes an issue or PR as relevant to SIG CLI.
  • sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • sig/node Categorizes an issue or PR as relevant to SIG Node.
  • sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
  • size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Avoid setting CPU limits for Guaranteed pods