
Kubelet CPU usage linearly increases #750

Closed
sg3s opened this issue Dec 7, 2018 · 16 comments

@sg3s

sg3s commented Dec 7, 2018

We were asked to open an issue on the public tracker following our interactions with support engineers (118113025003216) & slack discussion.

What happened:
We've been experiencing degraded performance & mainly CPU creep over longer periods of time on clusters rolled out several months ago.

The CPU creep causes our cluster workers to become unresponsive/NotReady as the CPU load approaches 100%.

The upstream issue is the following
kubernetes/kubernetes#64137

  • As we debugged the issue we noticed that our monitoring agent was consuming a lot of resources, though this turned out not to be the cause
  • On AKS we only monitor containers, which meant it took us a while to figure out that kubelet was causing the overall CPU creep/high CPU usage
  • Our monitoring partner tracked down the known issue related to kubelet/systemd interactions, which we were able to confirm

We've been having this issue on all clusters since at least mid-October. We've been trying to manage the load on our production and test clusters; because these are actively used and carry more significant workloads, they reached 100% CPU load on multiple nodes, which also makes their monitoring graphs very hard to read.

The clearest example is our staging cluster, which never reached 100% CPU. See the graph below for the past 3 months. We scaled this cluster up/changed the node before we knew what caused the issue, but the problem was obviously also present on the new node.

[Graph: staging cluster node CPU usage over the past 3 months, climbing steadily]

We rarely make large changes to this cluster or put it under high load, so the line should be roughly horizontal, not trending upward.

What you expected to happen:

We expect CPU usage not to increase significantly as long as conditions/configuration do not change.

How to reproduce it (as minimally and precisely as possible):

Reproduction steps are noted in the upstream issue.

From what I understand this occurs when you:

  • Use specific kernel / systemd versions (see below)
  • And schedule pods (with volumes)

The problem is then exacerbated when you use (Cron)Jobs in k8s because you're constantly scheduling pods.
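
The upstream issue has the authoritative reproduction, but for illustration, something like the following generates the kind of constant pod-with-volume churn described above (a hypothetical manifest, not the upstream reproducer; the in-memory emptyDir just stands in for "a volume"):

# hypothetical load generator: a new short-lived pod with a volume every minute
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: churn
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: churn
            image: busybox
            command: ["sh", "-c", "ls /data && sleep 5"]
            volumeMounts:
            - {name: data, mountPath: /data}
          volumes:
          - name: data
            emptyDir: {medium: Memory}
EOF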

Anything else we need to know?:
This should already be a known issue that is being worked on by the vendors responsible for the components mentioned. This issue exists because, within AKS, it is not possible for customers to customize/patch those components themselves.

Environment:

  • Kubernetes version (use kubectl version):
2018-12-07 13:12:18 user infra (master) (aks-sams/pw-sso) $ k version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-05-12T04:12:12Z", GoVersion:"go1.9.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
root@aks-agentpool-13066911-1:/home/azureuser# uname -a
Linux aks-agentpool-13066911-1 4.15.0-1030-azure #31~16.04.1-Ubuntu SMP Tue Oct 30 19:40:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
root@aks-agentpool-13066911-1:/home/azureuser# systemd --version
systemd 229
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN
root@aks-agentpool-13066911-1:/home/azureuser# kubelet --version
Kubernetes v1.11.3
  • Size of cluster (how many worker nodes are in the cluster?)

Production: 2-3
Staging: 1
Test: 1-2

  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)

Workloads roughly consist of the following:

  • Basic nginx-ingress & cert-manager and a monitoring agent
  • HTTP Microservices (nginx/static content, PHP, Node.js)
  • Synchronisation application/api/service in PHP, using k8s CronJobs

No service mesh or other exotic components that would make this a special setup.

  • Other:

Thanks to our monitoring partner, we were able to confirm the upstream issue with the following instructions. I have not seen these explicitly noted in a ticket, so I am including them here for others trying to confirm the issue.

On some combinations of systemd and kernel versions (see the upstream issue), the kubelet leaks systemd slices and slows down to a crawl. This can be confirmed by inspecting the contents of the metrics/cadvisor endpoint (the full path, if the read-only port is enabled, is http://localhost:10255/metrics/cadvisor).

A "normal" payload is under a MB (or a couple of MB).

These leaked slices have an empty container_name label; you can confirm this is the problem by comparing the total number of lines with the number of lines that have an empty label:

$ wc -l metrics-cadvisor-worker13.txt
291239 metrics-cadvisor-worker13.txt
$ grep 'container_name=""' metrics-cadvisor-worker13.txt | wc -l
289924

Our cadvisor metrics were 15+ MB after the hosts had been running uninterrupted for several days.
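
For reference, fetching and checking this in one go from a node looks roughly like the following (assuming the kubelet read-only port on 10255 is still enabled; adjust the path if not):

$ curl -s http://localhost:10255/metrics/cadvisor -o metrics-cadvisor.txt
$ du -h metrics-cadvisor.txt                         # "normal" is a MB or two
$ grep -c 'container_name=""' metrics-cadvisor.txt   # grows unboundedly on affected nodes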

@yves-vogl

Thanks for the report. We are planning to use the Cluster Autoscaler, so this issue means we need additional awareness when clusters do not seem to scale back down after the typical amount of time.

@marekr

marekr commented Dec 11, 2018

This should already be a known issue being worked on by other vendors responsible for the mentioned components. This issue exists because within AKS it is not possible to customize / patch said components as a customer.

It is possible, through Kubernetes wizardry. You can make a DaemonSet which executes shell commands as a host process with full host access. You can use it to patch systemd and leave a flag behind so that the next run of the DaemonSet knows it does not need to apply the update again.

It's a terrible hack, but it works as a stopgap until Microsoft patches the standard VM image.
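
A very rough sketch of this pattern (a hypothetical manifest, just to illustrate the idea): a privileged DaemonSet with hostPID that uses nsenter to execute a command in the host's namespaces on every node, then sleeps so it isn't rescheduled in a loop.

# hypothetical example: run a one-off command on every node's host OS
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: host-patcher
  namespace: kube-system
spec:
  selector:
    matchLabels: {app: host-patcher}
  template:
    metadata:
      labels: {app: host-patcher}
    spec:
      hostPID: true
      containers:
      - name: patch
        image: ubuntu:16.04
        securityContext: {privileged: true}
        command: ["/bin/sh", "-c"]
        args:
        - |
          # nsenter into PID 1's namespaces = run directly on the host
          nsenter -t 1 -m -u -i -n -- sh -c 'echo "apply your systemd patch here"'
          sleep infinity
EOF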

@sg3s
Author

sg3s commented Dec 12, 2018

@marekr Could you elaborate on what kind of workaround you're referring to? We currently have a DaemonSet that cleans up cgroups intermittently, but that of course isn't a (good) fix.

We checked for other options, like fixing the host image/systemd version used, but:

  1. We were not able to find a version/upgrade path available through the default package manager
  2. We did not want to try to build the package from source and replace systemd

Executing commands on the host from a container is ill-advised, so we haven't thought it through seriously; we're also not sure how to do that short of exposing the host's crontab to the container and adding a line there...

We're open to workaround options if they can be implemented from a k8s deployment perspective.

@jnoller
Contributor

jnoller commented Dec 14, 2018

@sg3s Sorry for the delayed response - we're looking at this to see if we can mitigate it with patching systemd. I have your internal support ticket and have communicated this to engineering.

@MGA-dotSource

@jnoller is there any update on this? Since we run lots of CronJobs, this is kind of a showstopper...

@jnoller
Contributor

jnoller commented Apr 3, 2019

One thing we changed recently to help with Job leaking is that we reduced the garbage collection threshold (see the changelog), which should help with large numbers of Jobs.
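
A quick way to see whether finished Job pods are piling up before garbage collection gets to them (this only counts them; on AKS the GC threshold itself lives on the managed control plane, so it is an observation rather than a knob you can turn):

$ kubectl get pods --all-namespaces --no-headers --field-selector=status.phase=Succeeded | wc -l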

I'm also digging in to see where the bug is and what mitigations are available.

@jnoller added the triage label Apr 5, 2019
@derekrprice

This is currently happening to us and crashing our cluster weekly. I would really appreciate that DaemonSet workaround if you could provide a complete implementation. A fix on the AKS side would be even better!

@derekrprice

@reaperes's systemd cgroup cleanup code from kubernetes/kubernetes#64137 seemed the cleanest and most surgical of all the workarounds I've found documented for this and the related issues, so I've converted it into a DaemonSet that runs the fix hourly on every node in a cluster. You could set any interval you like, of course, but the script isn't very resource intensive and hourly seemed reasonable.

It takes about a day or so for the CPU load to become noticeable in my cluster and a week or so for it to crash a node. I've been running this for a few days now in my staging cluster and it appears to keep the CPU load under control.
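
For anyone rolling their own, a minimal sketch of that kind of cleanup looks something like the following. This is not @reaperes's actual script (see the upstream issue for that), and it assumes the leaked objects show up as abandoned transient run-*.scope units left behind by kubelet volume mounts, so verify that on your own nodes before stopping anything. It has to run on the host, e.g. from a privileged DaemonSet via nsenter, on whatever interval you choose:

# stop abandoned transient scope units left behind by kubelet mounts (assumption: the leak manifests this way)
systemctl list-units --type=scope --state=abandoned --no-legend \
  | awk '{print $1}' \
  | grep '^run-' \
  | xargs -r -n1 systemctl stop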

@github-actions

Action required from @Azure/aks-pm

@ghost

ghost commented Jul 26, 2020

Action required from @Azure/aks-pm

@ghost added the Needs Attention label Jul 26, 2020
@ghost

ghost commented Aug 5, 2020

Issue needing attention of @Azure/aks-leads

@ghost

ghost commented Aug 20, 2020

Issue needing attention of @Azure/aks-leads

@ghost

ghost commented Sep 5, 2020

Issue needing attention of @Azure/aks-leads

@palma21 added the stale label and removed the Needs Attention and action-required labels Sep 9, 2020
@ghost removed the stale label Sep 9, 2020
@ghost

ghost commented Sep 16, 2020

@Azure/aks-pm issue needs labels

@ghost

ghost commented Sep 23, 2020

@Azure/aks-pm issue needs labels

@palma21
Member

palma21 commented Sep 25, 2020

stale-close

@palma21 closed this as completed Sep 25, 2020
@ghost locked as resolved and limited conversation to collaborators Oct 25, 2020