feat: enable alternate kube-reserved cgroups #3201
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #3201    +/-   ##
=========================================
  Coverage   71.43%   71.43%
=========================================
  Files         147      147
  Lines       25653    25664    +11
=========================================
+ Hits        18324    18333     +9
- Misses       6187     6188     +1
- Partials     1142     1143     +1
@jackfrancis curious how you feel about this approach vs trying to detect and/or match more tightly what the user might provide as a raw kubelet flag. I found this a bit more palatable, even if it requires more manual jiggering.
Force-pushed from 5a3d08c to c1493ec
Cc @xuto2
Before=slices.target
Requires=-.slice
After=-.slice
#EOF
This `#EOF` sentinel chars pattern is meant to work in concert w/ the `wait_for_file` func in the CSE bootstrap scripts. If the following is true:
- in the event that CSE runs prior to cloud-init being finished paving the filesystem, these missing files will prevent CSE from completing successfully/deterministically

...then we should keep the `#EOF` and add the appropriate `wait_for_file` invocations against these new files (probably inside the `ensureKubelet` func). Unfortunately, because the CSE files themselves do not operate against a per-pool or per-master context, we have to pass that context via env vars.
Either that, or we can generalize this approach and pick a "default" cgroup so that we always deliver this new cgroup implementation flavor, and allow the value of `kubeReservedCgroup` to be user-configured (we could apply that per-node/per-master user configuration in cloud-init, and skip the CSE env var boilerplate thing).
Hope that makes sense!
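The sentinel/poll pattern described above can be sketched roughly like this. This is a minimal bash approximation, not the exact aks-engine `wait_for_file` implementation, whose signature and behavior may differ:

```shell
#!/usr/bin/env bash
# Sketch (assumption): poll until the target file ends with the "#EOF"
# sentinel, so CSE does not act on a file cloud-init is still writing.
wait_for_file() {
  retries=$1
  wait_sleep=$2
  filepath=$3
  for i in $(seq 1 "$retries"); do
    # Sentinel check: the file is considered fully delivered once #EOF appears.
    if grep -Fq "#EOF" "$filepath" 2>/dev/null; then
      return 0
    fi
    if [ "$i" -lt "$retries" ]; then
      sleep "$wait_sleep"
    fi
  done
  return 1
}

# Example: wait up to 3 x 1s for the slice file to be fully delivered.
wait_for_file 3 1 /etc/systemd/system/kubereserved.slice || echo "timed out waiting for slice file"
```

The real CSE scripts use much larger retry budgets (e.g. `wait_for_file 1200 1 …`) and exit with a dedicated error code on timeout.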
I'm going to always deliver the cgroup and default it to kubereserved, but not run anything in it. That's effectively a no-op unless users manually move stuff into the slice.
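Since the delivered slice is a no-op until units are moved into it, a user opting in might do so with a systemd drop-in along these lines (the drop-in path and file name here are illustrative assumptions):

```
# /etc/systemd/system/kubelet.service.d/10-kubereserved-slice.conf
# Hypothetical drop-in: moves kubelet into the delivered slice.
[Service]
Slice=kubereserved.slice
```

followed by `systemctl daemon-reload && systemctl restart kubelet` for the change to take effect.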
Is it just me, or should
DOCKER_MOUNT_FLAGS_SYSTEMD_FILE=/etc/systemd/system/docker.service.d/clear_mount_propagation_flags.conf
Yeah, I'm not sure what's going on there...
meh, this got a little tricky since the name of the file determines the slice. That doesn't work for the current setup with the system slice, since it already exists (and not via this filepath). I don't really think there's any harm forcing users to pick a slice, but I'm also totally fine with doing this clusterwide and evolving it if we need. That's what I stuck with for now. The end goal is to pin down one really well-isolated config anyway.
Creating some Monday morning notifications, let me know how this looks to you @jackfrancis?
docs/topics/clusterdefinitions.md
Outdated
@@ -65,6 +65,7 @@ $ aks-engine get-versions
| gcHighThreshold | no | Sets the --image-gc-high-threshold value on the kublet configuration. Default is 85. [See kubelet Garbage Collection](https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/) |
| gcLowThreshold | no | Sets the --image-gc-low-threshold value on the kublet configuration. Default is 80. [See kubelet Garbage Collection](https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/) |
| kubeletConfig | no | Configure various runtime configuration for kubelet. See `kubeletConfig` [below](#feat-kubelet-config) |
| kubeReservedCgroup | no | The name of a systemd slice to create for containtment of both kubelet and the container runtime. This should not point to an existing systemd slice. Defaults to the empty string, which means kubelet and the container runtime will be placed in the system slice. |
I think it's more accurate to say:
Defaults to "kubereserved", which means kubelet and the container runtime will be placed in the system slice.
Is that right?
No, I think the current phrasing is correct. The default behavior is always to put them in the system slice unless specified. If you specify anything else, that slice will be created and the units will be placed inside it.
Will update for clarity
I think the "Defaults to the empty string" is what's confusing to me, as the default of that property will actually be "kubereserved".
pkg/api/defaults.go
Outdated
@@ -434,6 +434,10 @@ func (cs *ContainerService) setOrchestratorDefaults(isUpgrade, isScale bool) {
		o.KubernetesConfig.ContainerRuntimeConfig = make(map[string]string)
	}

	if o.KubernetesConfig.KubeReservedCgroup == "" {
Based on this, it is actually not possible for a fully populated (post-defaults enforcement) api model to have a `""` value of `KubeReservedCgroup`. In which case, the `HasKubeReservedCgroup` func is redundant, right?
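A hedged sketch of the defaulting logic under discussion, using illustrative type and constant names rather than aks-engine's exact API:

```go
package main

import "fmt"

// DefaultKubeReservedCgroup mirrors the assumed default slice name.
const DefaultKubeReservedCgroup = "kubereserved"

// KubernetesConfig is a stand-in for the relevant slice of aks-engine's
// api model; field name taken from the diff above.
type KubernetesConfig struct {
	KubeReservedCgroup string
}

// setDefaults replaces an empty KubeReservedCgroup with the default,
// so a post-defaults model never carries "".
func setDefaults(k *KubernetesConfig) {
	if k.KubeReservedCgroup == "" {
		k.KubeReservedCgroup = DefaultKubeReservedCgroup
	}
}

func main() {
	k := &KubernetesConfig{}
	setDefaults(k)
	fmt.Println(k.KubeReservedCgroup) // prints "kubereserved"
}
```

Under that assumption a `HasKubeReservedCgroup`-style emptiness check after defaults enforcement can never be false, which is the redundancy being pointed out.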
whoops, I'll remove this
This is leftover from playing around with per-nodepool defaulting, but it didn't work out the way I'd like.
That explains everything! :)
docs/topics/clusterdefinitions.md
Outdated
@@ -65,6 +65,7 @@ $ aks-engine get-versions
| gcHighThreshold | no | Sets the --image-gc-high-threshold value on the kublet configuration. Default is 85. [See kubelet Garbage Collection](https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/) |
| gcLowThreshold | no | Sets the --image-gc-low-threshold value on the kublet configuration. Default is 80. [See kubelet Garbage Collection](https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/) |
| kubeletConfig | no | Configure various runtime configuration for kubelet. See `kubeletConfig` [below](#feat-kubelet-config) |
| kubeReservedCgroup | no | The name of a systemd slice to create for containtment of both kubelet and the container runtime. When this value is a non-empty string, a file will be dropped at `/etc/systemd/system/$KUBE_RESERVED_CGROUP.slice` creating a systemd slice. Both kubelet and docker will run in this slice. This should not point to an existing systemd slice. If this value is unspecified or specified as the empty string, kubelet and the container runtime will run in the system slice by default. |
s/containtment/containment
Also, to be clear, the default `"kubeReservedCgroup": ""` results in no functional change compared to current aks-engine:master, correct?
no functional change compared to current aks-engine:master, correct
Correct, that is the intent.
Will fix spelling
@@ -306,6 +306,7 @@ installContainerd() {
ensureContainerd() {
	wait_for_file 1200 1 /etc/systemd/system/containerd.service.d/exec_start.conf || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
	wait_for_file 1200 1 /etc/containerd/config.toml || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
	wait_for_file 1200 1 /etc/systemd/system/containerd.service.d/kubereserved-slice.conf || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
This should be inside a `{{- if HasKubeReservedCgroup}}` block, no?
Nice catch
My last set of commits...not so clean
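For reference, a guarded version of that `wait_for_file` line might look like the following template fragment (helper names taken from the quoted review comment; the exact surrounding template context is assumed):

```
{{- if HasKubeReservedCgroup}}
	wait_for_file 1200 1 /etc/systemd/system/containerd.service.d/kubereserved-slice.conf || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
{{- end}}
```

This way the CSE script only blocks on the drop-in when the feature is actually enabled, avoiding a guaranteed `ERR_FILE_WATCH_TIMEOUT` on clusters that never deliver the file.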
Signed-off-by: Alexander Eldeib <alexeldeib@gmail.com>
"containerRuntimeConfig": {
    "dataDir": "/mnt/docker"
},
"kubeReservedCgroup": "kubesystem",
warning: dups!
cries in merge commits
lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alexeldeib, jackfrancis The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
* feat: enable alternate kube-reserved cgroups (#3201) * fix bad commit
Reason for Change:

Kubernetes enforces resource control through cgroups. Kubelet understands several cgroup configuration options like `--kube-reserved-cgroup` and `--system-reserved-cgroup`, which specify how to partition a system for resource control. Kubelet subtracts the resource quantities specified through kube-reserved and system-reserved from total node resources to arrive at node allocatable.

Currently AKS-Engine exposes the kubelet flags for configuring enforcement of these options, but there's no way to actually set up the cgroups beforehand. This PR makes that possible by creating a systemd slice and appropriate drop-ins for kubelet and the container runtime to exist in. That is, when using the features enabled in this PR, a cgroup will be created which corresponds to `kube-reserved`, leaving other system daemons in the system slice. The system slice roughly acts as `system-reserved` when a user specifies kubeReservedCgroup.

This PR does not align enforcement of the provided value against kubelet -- we could probably do this by checking for a flag that matches `--kube-reserved-cgroup`, but I figured I'd open as is for feedback.

Issue Fixed:

consider this PR the feature request 😄

Requirements:

Can place kubelet and docker into arbitrary systemd slices.
Notes:
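As a note on how this pairs with kubelet-side enforcement: the slice created here would typically be referenced by kubelet flags along these lines (`--kube-reserved`, `--kube-reserved-cgroup`, and `--enforce-node-allocatable` are standard kubelet flags; the resource values and cgroup path below are illustrative):

```
--kube-reserved=cpu=500m,memory=1Gi
--kube-reserved-cgroup=/kubereserved.slice
--enforce-node-allocatable=pods,kube-reserved
```

Without `kube-reserved` in `--enforce-node-allocatable`, the reservation only affects the node allocatable calculation; with it, kubelet also enforces the limits on the named cgroup.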