fix #51135 make CFS quota period configurable #63437

Conversation

szuecs
Member

@szuecs szuecs commented May 4, 2018

What this PR does / why we need it:

This PR makes it possible for users to change the CFS quota period from the default 100ms to some other value between 1µs and 1s.
#51135 shows that multiple production users have serious issues running reasonable workloads in Kubernetes. The 100ms CFS quota period adds far too much latency.

Which issue(s) this PR fixes:
Fixes #51135

Special notes for your reviewer:

Release note:

Adds a kubelet parameter and config option to change the CFS quota period from the default 100ms to some other value between 1µs and 1s. This was done to improve response latencies for workloads running in clusters with guaranteed and burstable QoS classes.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 4, 2018
@k8s-ci-robot k8s-ci-robot requested review from dims and pmorie May 4, 2018 15:45
@liggitt
Member

liggitt commented May 4, 2018

@kubernetes/sig-node-pr-reviews @derekwaynecarr

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 4, 2018
@derekwaynecarr
Member

The value used today is the default in many OS distributions.

There are many users running with 100ms today that could see a change in behavior.

I was not at KubeCon, so I missed some context for the discussion. I prefer we have a kubelet flag to tweak the desired cfs_period_us setting on a Linux host rather than hard-coding it. Red Hat has customers that disable CFS quota entirely via the existing flag. I think it's reasonable to pair that flag with an additional option to tweak the default CFS period. I do not think we need to let it be tweaked per pod.

@dchen1107 @vishh

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 4, 2018
@dims
Member

dims commented May 4, 2018

@derekwaynecarr The "cpu-cfs-quota" command line flag in kubelet, right? So we would have a "cpu-cfs-quota-period" to go along with it?

@derekwaynecarr
Member

derekwaynecarr commented May 4, 2018 via email

@dims
Member

dims commented May 4, 2018

@szuecs looking forward to an update in this PR with the suggestion above

@derekwaynecarr
Member

see my comment here for why a flag is preferred. Not all nodes are tuned for latency.

#51135 (comment)

@szuecs
Member Author

szuecs commented May 5, 2018

Thanks for all your comments. I will probably work on it next week; we have a short week, so I am not sure whether it will take longer than that.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 7, 2018
@szuecs
Member Author

szuecs commented May 7, 2018

@derekwaynecarr I added the CLI flag and config option. I hope I added it everywhere it is needed.

I tested that:

  • tests run and report ok: % go test ./pkg/kubelet/cm -v
  • I can build the kubelet locally: % make WHAT=cmd/kubelet

Let me know if I have to change anything.

@vishh
Contributor

vishh commented May 7, 2018

/hold
I'd like for us to reach a consensus on #51135 (comment) prior to merging this patch.

@szuecs szuecs changed the title from "fix #51135 set default quota period to 5ms based on user experience" to "fix #51135 make CFS quota period configurable" on May 7, 2018
@dims
Member

dims commented May 8, 2018

@szuecs I think you need to set the default value in the SetDefaults_KubeletConfiguration method too:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/kubeletconfig/v1beta1/defaults.go#L151-L153
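
For illustration, the defaulting dims points at would roughly take this shape (a sketch only, assuming the *metav1.Duration field CPUCFSQuotaPeriod added in this PR and the pointer-based pattern already used in SetDefaults_KubeletConfiguration):

```go
// Sketch, not the actual defaults.go change: give CPUCFSQuotaPeriod a 100ms
// default (the kernel's cpu.cfs_period_us default of 100000µs) whenever the
// config does not set it.
if obj.CPUCFSQuotaPeriod == nil {
	obj.CPUCFSQuotaPeriod = &metav1.Duration{Duration: 100 * time.Millisecond}
}
```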

@dims
Member

dims commented May 8, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 8, 2018
@dims
Member

dims commented May 8, 2018

@derekwaynecarr @vishh Do we count this as a new feature? If so, does it need an alpha feature gate?

@dims
Member

dims commented May 8, 2018

@derekwaynecarr @vishh Also, do we leave the current default as-is? (when nothing is specified in the command line)

@k8s-ci-robot k8s-ci-robot added area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels May 8, 2018
@dims
Member

dims commented May 8, 2018

@szuecs

  • if you run hack/update-bazel.sh, that should take care of the verify job failure (and the bazel-test job failure too)
  • the fix for the e2e failure should have already gone in, but you may have to rebase onto master

So, please squash the commits (for easier review) and rebase onto master, and we should get all green (fingers crossed)

thanks,
Dims

@szuecs szuecs force-pushed the fix/51135-set-saneer-default-cpu.cfs_period branch from 8bba451 to f0b2f0d on May 9, 2018 16:17
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, dims, szuecs, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 63437, 68081). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

@k8s-github-robot k8s-github-robot merged commit 147520f into kubernetes:master Sep 1, 2018
@k8s-ci-robot
Contributor

@szuecs: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-local-e2e 0a3c4dbc6c8fd43ade794aa7a5cd8fa0bc251637 link /test pull-kubernetes-local-e2e
pull-kubernetes-e2e-gce 588d280 link /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hjacobs

hjacobs commented Sep 2, 2018

@derekwaynecarr @dims @timothysc @szuecs thanks!

@dashpole
Contributor

dashpole commented Sep 6, 2018

This has broken the release-blocking alphafeatures test suite: https://k8s-testgrid.appspot.com/sig-release-1.12-blocking#gce-cos-1.12-alphafeatures

The kubelet is crashlooping:
server.go:207] invalid configuration: CPUCFSQuotaPeriod (--cpu-cfs-quota-period) {0s} must be between 1usec and 1sec, inclusive

Tracking bug for release-blocking failures: #68313
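
For context, the failing check is a simple range validation on the configured period; a sketch of its shape (not the kubelet's actual validation code) follows. With no default applied, the period arrives as the zero value {0s} and fails the check at startup.

```go
// Sketch: reject CFS quota periods outside [1µs, 1s].
func validateCPUCFSQuotaPeriod(period time.Duration) error {
	if period < time.Microsecond || period > time.Second {
		return fmt.Errorf("invalid configuration: CPUCFSQuotaPeriod (--cpu-cfs-quota-period) {%v} must be between 1usec and 1sec, inclusive", period)
	}
	return nil
}
```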

@guineveresaenger
Contributor

@dashpole @derekwaynecarr @dims please give this your immediate attention re: #68313.

@@ -154,6 +155,9 @@ func SetDefaults_KubeletConfiguration(obj *KubeletConfiguration) {
if obj.CPUCFSQuota == nil {
obj.CPUCFSQuota = utilpointer.BoolPtr(true)
}
if obj.CPUCFSQuotaPeriod == nil && obj.FeatureGates[string(features.CPUCFSQuotaPeriod)] {
Contributor

I don't know if this actually works... We generally do not feature gate the defaulting of flags. We generally only place the feature gate around code we don't want to run, rather than around configuration.

Member

This does look like a problem. The flag should always default; it just shouldn't have been used. Apologies for missing this in review.
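
Put differently, the defaulting should be unconditional and the feature gate should only decide whether the configured value is honored. A rough sketch of that shape (illustrative names; cpuCFSQuotaPeriod stands for the configured metav1.Duration, and the microsecond conversion is an assumption):

```go
// Always default the config field, regardless of the feature gate.
if obj.CPUCFSQuotaPeriod == nil {
	obj.CPUCFSQuotaPeriod = &metav1.Duration{Duration: 100 * time.Millisecond}
}

// Gate only the consumption of the value, falling back to the built-in
// 100000µs quotaPeriod when the gate is off.
var period int64 = quotaPeriod
if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.CPUCFSQuotaPeriod) {
	period = int64(cpuCFSQuotaPeriod.Duration / time.Microsecond)
}
```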

@dashpole
Copy link
Contributor

dashpole commented Sep 7, 2018

Fix is out: #68386

k8s-github-robot pushed a commit that referenced this pull request Sep 7, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Remove feature gate from kubelet defaulting

**What this PR does / why we need it**:
Fixes a release-blocking test: https://k8s-testgrid.appspot.com/sig-release-1.12-blocking#gce-cos-1.12-alphafeatures
Regression added by #63437
This solution was discussed on slack in the sig-release channel
This should be targeted for 1.12

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Issue #68313

**Special notes for your reviewer**:
/hold
testing to make sure this fixes the issue
Using: `make test-e2e-node FOCUS=ImageGCNoEviction SKIP= PARALLELISM=1 REMOTE=true TEST_ARGS='--feature-gates=CustomCPUCFSQuotaPeriod=true'` to reproduce the issue, as it runs a test with the feature gate enabled.

**Release note**:
```release-note
NONE
```

/assign @dims @derekwaynecarr 
/sig node
/kind bug
/priority critical-urgent
@warmchang
Contributor

ping @Pingan2017

milesbxf added a commit to monzo/kubernetes that referenced this pull request Mar 19, 2019
Adds a monzo.com/cpu-period resource, which allows tuning the period of
time over which the kernel tracks CPU throttling. In upstream Kubernetes
versions pre-1.12, this is not tunable and is hardcoded to the kernel
default (100ms).

We originally introduced this after seeing long GC pauses clustered
around 100ms [1], which was eventually traced to CFS throttling.
Essentially, for very latency-sensitive & bursty
workloads (like HTTP microservices!) it's recommended to set the CFS
quota period lower. We mostly set ours at 5ms across the board. See [2]
and [3] for further discussion in the Kubernetes repository.

This is fixed in upstream 1.12 via a slightly different path [4]; the
period is now tunable via a kubelet CLI flag. This doesn't give us as
fine-grained control, but we can still set this and optimise for the
vast majority of our workloads.

[1] golang/go#19378
[2] kubernetes#51135
[3] kubernetes#67577
[4] kubernetes#63437
milesbxf added a commit to monzo/kubernetes that referenced this pull request Mar 19, 2019
Adds a monzo.com/cpu-period resource, which allows tuning the period of
time over which the kernel tracks CPU throttling. In upstream Kubernetes
versions pre-1.12, this is not tunable and is hardcoded to the kernel
default (100ms).

We originally introduced this after seeing long GC pauses clustered
around 100ms [1], which was eventually traced to CFS throttling.
Essentially, for very latency-sensitive & bursty
workloads (like HTTP microservices!) it's recommended to set the CFS
quota period lower. We mostly set ours at 5ms across the board. See [2]
and [3] for further discussion in the Kubernetes repository.

This is fixed in upstream 1.12 via a slightly different path [4]; the
period is now tunable via a kubelet CLI flag. This doesn't give us as
fine-grained control, but we can still set this and optimise for the
vast majority of our workloads.

[1] golang/go#19378
[2] kubernetes#51135
[3] kubernetes#67577
[4] kubernetes#63437

Squashed commits:

commit 61551b0
Merge: a446c68 de2c6cb
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Wed Mar 13 16:16:17 2019 +0000

    Merge pull request #2 from monzo/v1.9.11-kubelet-register-cpu-period

    Register nodes with monzo.com/cpu-period resource

commit de2c6cb
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Wed Mar 13 15:14:58 2019 +0000

    Register nodes with monzo.com/cpu-period resource

    We have a custom pod resource which allows tuning the CPU throttling
    period. Upgrading to 1.9 causes this to break scheduling logic, as the
    scheduler and pod preemption controller take this resource into account
    when deciding where to place pods, and which pods to preempt.

    This patches the kubelet so that it registers its node with a large
    amount of this resource - 10000 * max number of pods (default 110). We
    typically run pods with this set to 5000, so this should be plenty.

commit a446c68
Author: Miles Bryant <milesbryant@monzo.com>
Date:   Tue Jan 29 16:43:03 2019 +0000

    ResourceConfig.CpuPeriod is now uint64, not int64

    Some changes to upstream dependencies between v1.7 and v1.9 mean that
    the CpuPeriod field of ResourceConfig has changed type; unfortunately
    this means the Monzo CFS period patch doesn't compile.

    This won't change behaviour at all - the apiserver already validates
    that `monzo.com/cpu-period` can never be negative. The only edge case is
    if someone sets it higher than the int64 positive bound (this will
    result in an overflow), but I don't think this is worth mitigating

commit 1ead2d6
Author: Oliver Beattie <oliver@obeattie.com>
Date:   Wed Aug 9 22:57:53 2017 +0100

    [PLAT-713] Allow the CFS period to be tuned for containers
@bartoszhernas

Our PHP container started seeing problems connected with CPU limits.

We started profiling and saw that the initSoapClient method, which usually takes 10ms, started taking up to 120s (no mistake, 2 minutes...). We do not use any container limits, we only use requests. After requesting much more CPU than possibly needed (40%), the issue disappeared.

We are hosted on GKE; is there an option to disable CPU limits or reduce the CFS period there? Does anyone have an idea how to fix this problem on GKE?

@@ -220,6 +220,8 @@ type KubeletConfiguration struct {
// cpuCFSQuota enables CPU CFS quota enforcement for containers that
// specify CPU limits
CPUCFSQuota bool
// CPUCFSQuotaPeriod sets the CPU CFS quota period value, cpu.cfs_period_us, defaults to 100ms
Contributor

I'm super confused about this one and can't figure it out by myself: the cpu.cfs_period_us default is described as "100ms", but the field is actually in microseconds:

cpu.cfs_period_us: the length of a period (in microseconds)

In the change I'm commenting on, the author clearly acted under the assumption that milliseconds are used in k8s, and now that assumption is all over the test cases and comments. But I wonder where that 1000x discrepancy between the k8s value and the Linux value comes from? And does it exist, or is the k8s default actually in microseconds, the same as Linux? Unless we multiply that default value of QuotaPeriod by 1000 somewhere, converted to time.Duration it's 100 microseconds and not milliseconds.

cc @vishh @sjenning and especially @hjacobs: I hope you all have some context about this and can tell me why I'm wrong and the k8s code comments are right.

Member Author

Yes, cfs_period_us is in µs; that's why it has the _us suffix.
Maybe https://go.dev/play/p/65fG1OmLFYN helps to understand.
As far as I remember, the time.Duration is only used if you want to set it via the kubelet flag/config.
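
A tiny standalone illustration of the unit handling (not the kubelet code): the 100ms Go duration used as the kubelet default converts to exactly the kernel's cpu.cfs_period_us default of 100000 microseconds.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// The kubelet default is expressed as a Go duration of 100 milliseconds.
	period := 100 * time.Millisecond

	// cpu.cfs_period_us takes microseconds, so the value written into the
	// cgroup file is the duration divided by one microsecond.
	fmt.Println(int64(period / time.Microsecond)) // 100000
}
```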

Contributor

@paskal paskal Jul 22, 2022

The comment in that MR says "ms". In contrast, I think it had better be "microseconds". I can send an MR to change it, as well as the explicit millisecond mentions in the current code, where it should be microseconds. I'll ping you in it.

In this MR, you wrote a milliseconds duration in multiple places in defaults and tests (CPUCFSQuotaPeriod: metav1.Duration{Duration: 100 * time.Millisecond},); I wonder if that should be changed to microseconds to match the actual default value of 100 microseconds?

Also, the discussion which led to this MR has multiple people talking about changing the value from 100 milliseconds to 5/25 milliseconds for testing. Still, it doesn't make sense unless everyone was altering the source code in microseconds, thinking it's milliseconds. @hjacobs even did a talk about it, discussing the default value as if it were in milliseconds. Your code introducing milliseconds and not microseconds (so 1000x) contributed to my confusion, and I wonder if it's intentional or if you are mistaken and it should be changed to microseconds to change the real default.

The only way the current code would make sense to me is if the value you feed to k8s (100 milliseconds) is divided by 1000 at some point and then matches the real Linux default of 100 microseconds, but I haven't found such a place yet, and I doubt that's the case.

Contributor

@szuecs, the simplest form of the question I can ask is the following.

Does the 100ms value of CPUCFSQuotaPeriod result in the same k8s behaviour as the default Linux value of 100 microseconds, or is it 1000 times higher, since it's milliseconds versus microseconds in Linux?

Contributor

I was wrong, and the original code was correct. 100ms is indeed the default value in Linux, and I got confused by conversions in two places in the code that had no comments about them. PR #112123 adds comments to the code added here.

// we set the period to 100ms by default
period = quotaPeriod
if !utilfeature.DefaultFeatureGate.Enabled(kubefeatures.CPUCFSQuotaPeriod) {
period = quotaPeriod
Contributor

@szuecs that's the precise place that worries me: previously we used quotaPeriod as-is, which, translated to time.Duration, equals 100 microseconds. And now we set it to cpuCFSQuotaPeriod, which according to the code and comments defaults to 100 milliseconds.

Member Author

  1. I don't see how this code can be worrying, because it shows the same assignment as before. If the feature gate is not set, it is set to quotaPeriod as before; see https://github.com/kubernetes/kubernetes/pull/63437/files/588d2808b77d11f235b6eba5c21bcaa89a2f7804#r932625167
  2. If you enable the feature gate, you can set it to the value you like.

Contributor

"ms" mentioned across the code alongside the tests with milliseconds and not microseconds confused me. Now I see it's still microseconds, and comments should be altered, and maybe tests as well. I'll prepare the PR, thanks for the help!

Member Author

Thanks @paskal, makes sense to update it if it's wrong :)

Contributor

I've created #111520 about it.

Contributor

I was wrong, and the original code was correct. 100ms is indeed the default value in Linux, and I got confused by conversions in two places in the code that had no comments about them. PR #112123 adds comments to the code added here.

Member Author

I think I read the kernel documentation at least 5 times when I did the investigation and wrote the code here.

const (
// Taken from lmctfy https://github.com/google/lmctfy/blob/master/lmctfy/controllers/cpu_controller.cc
minShares = 2
sharesPerCPU = 1024
milliCPUToCPU = 1000

// 100000 is equivalent to 100ms
quotaPeriod = 100 * minQuotaPeriod
quotaPeriod = 100000
minQuotaPeriod = 1000
Member Author

The quotaPeriod calculation was inlined:
100 * minQuotaPeriod = 100 * 1000 = 100000

Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/apiserver
  • area/kubeadm
  • area/kubelet
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API
  • kind/feature Categorizes issue or PR as related to a new feature.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • release-note Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/apps Categorizes an issue or PR as relevant to SIG Apps.
  • sig/architecture Categorizes an issue or PR as relevant to SIG Architecture.
  • sig/cli Categorizes an issue or PR as relevant to SIG CLI.
  • sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • sig/node Categorizes an issue or PR as relevant to SIG Node.
  • sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
  • size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Avoid setting CPU limits for Guaranteed pods