This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

feat: enable alternate kube-reserved cgroups #3201

Merged
merged 6 commits into master from ace/systemd on May 12, 2020

Conversation

@alexeldeib (Contributor) commented May 5, 2020

Reason for Change:
Kubernetes enforces resource control through cgroups. Kubelet understands several cgroup configuration options, like --kube-reserved-cgroup and --system-reserved-cgroup, which specify how to partition a system for resource control. Kubelet subtracts the resource quantities specified through kube-reserved and system-reserved from total node resources to arrive at node allocatable.
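For concreteness, a hedged sketch of the kubelet flags in play (the reservation values here are purely illustrative):

# Node allocatable = node capacity - kube-reserved - system-reserved - eviction thresholds.
# --kube-reserved-cgroup names the pre-existing cgroup that should contain the kube daemons.
kubelet \
  --kube-reserved=cpu=200m,memory=500Mi \
  --kube-reserved-cgroup=/kubereserved.slice \
  --enforce-node-allocatable=pods,kube-reserved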

Currently aks-engine exposes the kubelet flags for configuring enforcement of these options, but there's no way to actually set up the cgroups beforehand. This PR makes that possible by creating a systemd slice, plus appropriate drop-ins placing kubelet and the container runtime inside it. That is, when using the features enabled in this PR, a cgroup will be created which corresponds to kube-reserved, leaving other system daemons in the system slice. When a user specifies kubeReservedCgroup, the system slice roughly acts as system-reserved.
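Roughly, the delivered files would look like the following sketch (the slice path matches the /etc/systemd/system/$KUBE_RESERVED_CGROUP.slice convention documented later in this PR; the kubelet drop-in path is an assumption, by analogy with the containerd one):

# /etc/systemd/system/kubereserved.slice -- the new slice unit
[Unit]
Description=Slice to contain kubelet and the container runtime
Before=slices.target

# /etc/systemd/system/kubelet.service.d/kubereserved-slice.conf -- drop-in moving kubelet into it
[Service]
Slice=kubereserved.slice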

This PR does not align enforcement of the provided value against kubelet -- we could probably do this by checking for a kubelet flag that matches --kube-reserved-cgroup, but I figured I'd open it as-is for feedback.

Issue Fixed:
consider this PR the feature request 😄

Requirements:
Can place kubelet and docker into arbitrary systemd slices.
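(A hypothetical way to verify that placement on a node:)

# Should report the configured slice once the drop-ins are active:
systemctl show -p Slice kubelet.service docker.service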

Notes:

codecov bot commented May 5, 2020

Codecov Report

Merging #3201 into master will increase coverage by 0.00%.
The diff coverage is 81.81%.


@@           Coverage Diff           @@
##           master    #3201   +/-   ##
=======================================
  Coverage   71.43%   71.43%           
=======================================
  Files         147      147           
  Lines       25653    25664   +11     
=======================================
+ Hits        18324    18333    +9     
- Misses       6187     6188    +1     
- Partials     1142     1143    +1     
Impacted Files                      Coverage Δ
pkg/api/types.go                    94.35% <ø> (ø)
pkg/api/vlabs/types.go              73.30% <ø> (ø)
pkg/engine/cse.go                   100.00% <ø> (ø)
pkg/engine/templates_generated.go   39.64% <ø> (ø)
pkg/engine/template_generator.go    81.47% <77.77%> (-0.05%) ⬇️
pkg/api/converterfromapi.go         93.65% <100.00%> (+0.01%) ⬆️
pkg/api/convertertoapi.go           93.16% <100.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c0d7ac...74f195f.

@alexeldeib (Contributor Author)

@jackfrancis curious how you feel about this approach vs trying to detect and/or match more tightly what the user might provide as a raw kubelet flag. I found this a bit more palatable, even if it requires more manual jiggering.

@alexeldeib alexeldeib force-pushed the ace/systemd branch 2 times, most recently from 5a3d08c to c1493ec on May 5, 2020 23:51
@alexeldeib (Contributor Author)

Cc @xuto2

Before=slices.target
Requires=-.slice
After=-.slice
#EOF
Member:

This #EOF sentinel pattern is meant to work in concert w/ the wait_for_file func in the CSE bootstrap scripts. If the following is true:

  • in the event that CSE runs prior to cloud-init finishing paving the filesystem, these missing files will prevent CSE from completing successfully/deterministically

...then we should keep the #EOF and add the appropriate wait_for_file invocations against these new files (probably inside the ensureKubelet func). Unfortunately, because the CSE files themselves do not operate against a per-pool or per-master context, we have to pass that context via env vars.

Either that, or we can generalize this approach and pick a "default" cgroup so that we always deliver this new cgroup implementation flavor, and allow the value of kubeReservedCgroup to be user-configured (we could apply that per-node/per-master user configuration in cloud-init, and skip the CSE env var boilerplate).

Hope that makes sense!
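(For reference, a sketch of the suggested guard, modeled on the existing wait_for_file calls in the CSE scripts; the kubelet drop-in path and the ERR_FILE_WATCH_TIMEOUT variable are assumptions:)

# Inside ensureKubelet: block until cloud-init has written the drop-in (timeout 1200s, poll every 1s).
wait_for_file 1200 1 /etc/systemd/system/kubelet.service.d/kubereserved-slice.conf || exit $ERR_FILE_WATCH_TIMEOUT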

Contributor Author:

I'm going to always deliver the cgroup and default it to kubereserved, but not run anything in it. That's effectively a no-op unless users manually move stuff into the slice.
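(If a user did want to opt in manually, the usual systemd pattern would be something like this sketch; the unit name is hypothetical:)

# Move an arbitrary unit into the delivered slice via a drop-in, then restart it:
mkdir -p /etc/systemd/system/myservice.service.d
printf '[Service]\nSlice=kubereserved.slice\n' > /etc/systemd/system/myservice.service.d/10-slice.conf
systemctl daemon-reload && systemctl restart myservice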

Contributor Author:

Is it just me, or should

DOCKER_MOUNT_FLAGS_SYSTEMD_FILE=/etc/systemd/system/docker.service.d/clear_mount_propagation_flags.conf
wait for the file to exist?

Member:

Yeah, I'm not sure what's going on there...

@alexeldeib (Contributor Author) May 6, 2020:

meh, this got a little tricky since the name of the file determines the slice. That doesn't work for the current setup with the system slice, since it already exists (and not via this filepath). I don't really think there's any harm in forcing users to pick a slice, but I'm also totally fine with doing this clusterwide and evolving it if we need to. That's what I stuck with for now. The end goal is to pin down one really well-isolated config anyway.

@mboersma mboersma added this to In progress in backlog via automation May 6, 2020
@alexeldeib (Contributor Author)

Creating some Monday morning notifications. How does this look to you, @jackfrancis?

@@ -65,6 +65,7 @@ $ aks-engine get-versions
| gcHighThreshold | no | Sets the --image-gc-high-threshold value on the kubelet configuration. Default is 85. [See kubelet Garbage Collection](https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/) |
| gcLowThreshold | no | Sets the --image-gc-low-threshold value on the kubelet configuration. Default is 80. [See kubelet Garbage Collection](https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/) |
| kubeletConfig | no | Configure various runtime configuration for kubelet. See `kubeletConfig` [below](#feat-kubelet-config) |
| kubeReservedCgroup | no | The name of a systemd slice to create for containtment of both kubelet and the container runtime. This should not point to an existing systemd slice. Defaults to the empty string, which means kubelet and the container runtime will be placed in the system slice. |
Member:

I think it's more accurate to say:

Defaults to "kubereserved", which means kubelet and the container runtime will be placed in the system slice.

Is that right?

Contributor Author:

No, I think the current phrasing is correct. The default behavior is always to put them in the system slice unless specified. If you specify anything else, that slice will be created and the units will be placed inside it.

Will update for clarity

Member:

I think the "Defaults to the empty string" is what's confusing to me, as the default of that property will actually be "kubereserved".

@@ -434,6 +434,10 @@ func (cs *ContainerService) setOrchestratorDefaults(isUpgrade, isScale bool) {
o.KubernetesConfig.ContainerRuntimeConfig = make(map[string]string)
}

if o.KubernetesConfig.KubeReservedCgroup == "" {
Member:

Based on this, it is actually not possible for a fully populated (post-defaults enforcement) api model to have a "" value of KubeReservedCgroup. In which case, the HasKubeReservedCgroup func is redundant, right?

Contributor Author:

whoops, I'll remove this

Contributor Author:

This is leftover from playing around with per-nodepool and defaulting, but it didn't work out the way I'd like.

Member:

That explains everything! :)

pkg/api/common/const.go (outdated thread, resolved)
@@ -65,6 +65,7 @@ $ aks-engine get-versions
| gcHighThreshold | no | Sets the --image-gc-high-threshold value on the kubelet configuration. Default is 85. [See kubelet Garbage Collection](https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/) |
| gcLowThreshold | no | Sets the --image-gc-low-threshold value on the kubelet configuration. Default is 80. [See kubelet Garbage Collection](https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/) |
| kubeletConfig | no | Configure various runtime configuration for kubelet. See `kubeletConfig` [below](#feat-kubelet-config) |
| kubeReservedCgroup | no | The name of a systemd slice to create for containtment of both kubelet and the container runtime. When this value is a non-empty string, a file will be dropped at `/etc/systemd/system/$KUBE_RESERVED_CGROUP.slice` creating a systemd slice. Both kubelet and docker will run in this slice. This should not point to an existing systemd slice. If this value is unspecified or specified as the empty string, kubelet and the container runtime will run in the system slice by default. |
Member:

s/containtment/containment

Member:

Also, to be clear, the default "kubeReservedCgroup": "" results in no functional change compared to current aks-engine:master, correct?

Contributor Author:

> no functional change compared to current aks-engine:master, correct

Correct, that is the intent.

Will fix spelling

@@ -306,6 +306,7 @@ installContainerd() {
ensureContainerd() {
wait_for_file 1200 1 /etc/systemd/system/containerd.service.d/exec_start.conf || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
wait_for_file 1200 1 /etc/containerd/config.toml || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
wait_for_file 1200 1 /etc/systemd/system/containerd.service.d/kubereserved-slice.conf || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
Member:

This should be inside a {{- if HasKubeReservedCgroup}} block, no?

Contributor Author:

Nice catch

My last set of commits...not so clean
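(A sketch of the suggested guard, reusing the template helpers visible in the hunk above; whether the final commits kept it conditional is a separate question, since the slice ended up always being delivered:)

{{- if HasKubeReservedCgroup}}
wait_for_file 1200 1 /etc/systemd/system/containerd.service.d/kubereserved-slice.conf || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
{{- end}}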

"containerRuntimeConfig": {
"dataDir": "/mnt/docker"
},
"kubeReservedCgroup": "kubesystem",
Member:

warning: dups!

Contributor Author:

cries in merge commits

backlog automation moved this from In progress to Review in progress May 12, 2020
@jackfrancis (Member) left a comment

lgtm

acs-bot commented May 12, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexeldeib, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis jackfrancis merged commit 9f3ec93 into Azure:master May 12, 2020
backlog automation moved this from Review in progress to Done May 12, 2020
alexeldeib added 3 commits to alexeldeib/aks-engine that referenced this pull request Jun 24, 2020
xuto2 pushed a commit that referenced this pull request Jun 24, 2020
* feat: enable alternate kube-reserved cgroups (#3201)

* fix bad commit