
Kubelet CPU usage linearly increases #750

Closed
sg3s opened this issue Dec 7, 2018 · 16 comments

@sg3s

sg3s commented Dec 7, 2018

We were asked to open an issue on the public tracker following our interactions with support engineers (118113025003216) & slack discussion.

What happened:
We've been experiencing degraded performance & mainly CPU creep over longer periods of time on clusters rolled out several months ago.

The CPU creep causes our cluster workers to become unresponsive/NotReady as the CPU load approaches 100%.

The upstream issue is the following
kubernetes/kubernetes#64137

  • As we debugged the issue we noticed that our monitoring agent was consuming a lot of resources, though this turned out not to be the cause
  • On AKS we only monitor containers, which meant it took us a while to figure out that kubelet was causing the overall CPU creep/high CPU usage
  • Our monitoring partner tracked down the known issue related to kubelet/systemd interactions, which we were able to confirm

We've been having this issue on all clusters since at least mid-October. We've been trying to manage the load on our production and test clusters; because these are actively used and carry more significant workloads, they reached 100% CPU load on multiple nodes, which also makes their monitoring graphs very hard to read.

The clearest example is our staging cluster, which never reached 100% CPU. See the graph below for the past 3 months. We scaled this cluster up/changed the node before we knew what caused the issue, but the problem was obviously also present on the new node.

[Graph: staging cluster node CPU usage over the past 3 months, climbing steadily]

We rarely make large changes to this cluster or put it under high load, so the line should be roughly horizontal, not trending upward.

What you expected to happen:

We expect CPU usage not to increase significantly as long as conditions/configuration do not change.

How to reproduce it (as minimally and precisely as possible):

Reproduction steps are noted in the upstream issue.

From what I understand this occurs when you:

  • Use specific kernel / systemd versions (see below)
  • And schedule pods (with volumes)

The problem is then exacerbated when you use (Cron)Jobs in k8s because you're constantly scheduling pods.
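
The upstream issue has the authoritative reproduction, but for illustration, something like the following generates the kind of constant pod-with-volume churn described above (a hypothetical manifest, not the upstream reproducer; the in-memory emptyDir just stands in for "a volume"):

# hypothetical load generator: a new short-lived pod with a volume every minute
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: churn
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: churn
            image: busybox
            command: ["sh", "-c", "ls /data && sleep 5"]
            volumeMounts:
            - {name: data, mountPath: /data}
          volumes:
          - name: data
            emptyDir: {medium: Memory}
EOF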

Anything else we need to know?:
This should already be a known issue that is being worked on by the vendors responsible for the components mentioned. This issue exists because, within AKS, it is not possible for customers to customize/patch those components themselves.

Environment:

  • Kubernetes version (use kubectl version):
2018-12-07 13:12:18 user infra (master) (aks-sams/pw-sso) $ k version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-05-12T04:12:12Z", GoVersion:"go1.9.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
root@aks-agentpool-13066911-1:/home/azureuser# uname -a
Linux aks-agentpool-13066911-1 4.15.0-1030-azure #31~16.04.1-Ubuntu SMP Tue Oct 30 19:40:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
root@aks-agentpool-13066911-1:/home/azureuser# systemd --version
systemd 229
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN
root@aks-agentpool-13066911-1:/home/azureuser# kubelet --version
Kubernetes v1.11.3
  • Size of cluster (how many worker nodes are in the cluster?)

Production: 2-3
Staging: 1
Test: 1-2

  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)

Workloads roughly consist of the following:

  • Basic nginx-ingress & cert-manager and a monitoring agent
  • HTTP Microservices (nginx/static content, PHP, Node.js)
  • Synchronisation application/api/service in PHP, using k8s CronJobs

No service mesh or other exotic components that would make this a special setup.

  • Other:

Thanks to our monitoring partner, we were able to confirm the upstream issue with the following instructions. I have not seen these explicitly noted in a ticket, so I am including them here for others trying to confirm the issue.

On some combinations of systemd and kernel versions (see the upstream issue), the kubelet leaks systemd slices and slows down to a crawl. This can be confirmed by inspecting the contents of the metrics/cadvisor endpoint (the full path, if the read-only port is enabled, is http://localhost:10255/metrics/cadvisor).

A "normal" payload is under a MB (or a couple of MB).

These leaked slices have an empty container_name label; you can confirm this is the problem by comparing the total number of lines with the number of lines that have an empty label:

$ wc -l metrics-cadvisor-worker13.txt
291239 metrics-cadvisor-worker13.txt
$ grep 'container_name=""' metrics-cadvisor-worker13.txt | wc -l
289924

Our cadvisor metrics were 15+ MB after the hosts had been running uninterrupted for several days.
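
For reference, fetching and checking this in one go from a node looks roughly like the following (assuming the kubelet read-only port on 10255 is still enabled; adjust the path if not):

$ curl -s http://localhost:10255/metrics/cadvisor -o metrics-cadvisor.txt
$ du -h metrics-cadvisor.txt                         # "normal" is a MB or two
$ grep -c 'container_name=""' metrics-cadvisor.txt   # grows unboundedly on affected nodes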

@yves-vogl

Thanks for the report. We are planning to use the Cluster Autoscaler, so this issue means we need additional awareness when clusters do not seem to scale back down after the typical amount of time.

@marekr

marekr commented Dec 11, 2018

This should already be a known issue being worked on by other vendors responsible for the mentioned components. This issue exists because within AKS it is not possible to customize / patch said components as a customer.

It is possible, through Kubernetes wizardry. You can make a DaemonSet which executes shell commands as a host process with full host access. You can use it to patch systemd and leave a flag behind so that the next run of the DaemonSet knows it does not need to apply the update again.

It's a terrible hack, but it works as a stopgap until Microsoft patches the standard VM image.
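
A very rough sketch of this pattern (a hypothetical manifest, just to illustrate the idea): a privileged DaemonSet with hostPID that uses nsenter to execute a command in the host's namespaces on every node, then sleeps so it isn't rescheduled in a loop.

# hypothetical example: run a one-off command on every node's host OS
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: host-patcher
  namespace: kube-system
spec:
  selector:
    matchLabels: {app: host-patcher}
  template:
    metadata:
      labels: {app: host-patcher}
    spec:
      hostPID: true
      containers:
      - name: patch
        image: ubuntu:16.04
        securityContext: {privileged: true}
        command: ["/bin/sh", "-c"]
        args:
        - |
          # nsenter into PID 1's namespaces = run directly on the host
          nsenter -t 1 -m -u -i -n -- sh -c 'echo "apply your systemd patch here"'
          sleep infinity
EOF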

@sg3s
Author

sg3s commented Dec 12, 2018

@marekr Could you elaborate on what kind of workaround you're referring to? We currently have a DaemonSet that cleans up cgroups intermittently, but that of course isn't a (good) fix.

We checked for other options, like fixing the host image/systemd version used, but:

  1. We were not able to find a version/upgrade path available through the default package manager
  2. We did not want to try to build the package from source and replace systemd

Executing commands on the host from a container is ill-advised, so we haven't thought it through seriously; we're also not sure how to do that short of exposing the host's crontab to the container and adding a line there...

We're open to workaround options if they can be implemented from a k8s deployment perspective.

@jnoller
Contributor

jnoller commented Dec 14, 2018

@sg3s Sorry for the delayed response - we're looking at this to see if we can mitigate it with patching systemd. I have your internal support ticket and have communicated this to engineering.

@MGA-dotSource

@jnoller is there any update on this? Since we run lots of CronJobs, this is kind of a showstopper...

@jnoller
Contributor

jnoller commented Apr 3, 2019

One thing we changed recently to help with Job leaking is that we reduced the garbage collection threshold (see the changelog), which should help with large numbers of Jobs.
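
A quick way to see whether finished Job pods are piling up before garbage collection gets to them (this only counts them; on AKS the GC threshold itself lives on the managed control plane, so it is an observation rather than a knob you can turn):

$ kubectl get pods --all-namespaces --no-headers --field-selector=status.phase=Succeeded | wc -l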

I'm also digging in to see where the bug is and what mitigations are available.

@jnoller added the triage label Apr 5, 2019
@derekrprice

This is currently happening to us and crashing our cluster weekly. I would really appreciate that DaemonSet workaround if you could provide a complete implementation. A fix on the AKS side would be even better!

@derekrprice

@reaperes's systemd cgroup cleanup code from kubernetes/kubernetes#64137 seemed the cleanest and most surgical of all the workarounds I've found documented for this and the related issues, so I've converted it into a DaemonSet that runs the fix hourly on every node in a cluster. You could set any interval you like, of course, but the script isn't very resource intensive and hourly seemed reasonable.

It takes about a day or so for the CPU load to become noticeable in my cluster and a week or so for it to crash a node. I've been running this for a few days now in my staging cluster and it appears to keep the CPU load under control.
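
For anyone rolling their own, a minimal sketch of that kind of cleanup looks something like the following. This is not @reaperes's actual script (see the upstream issue for that), and it assumes the leaked objects show up as abandoned transient run-*.scope units left behind by kubelet volume mounts, so verify that on your own nodes before stopping anything. It has to run on the host, e.g. from a privileged DaemonSet via nsenter, on whatever interval you choose:

# stop abandoned transient scope units left behind by kubelet mounts (assumption: the leak manifests this way)
systemctl list-units --type=scope --state=abandoned --no-legend \
  | awk '{print $1}' \
  | grep '^run-' \
  | xargs -r -n1 systemctl stop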

@github-actions

Action required from @Azure/aks-pm

@ghost

ghost commented Jul 26, 2020

Action required from @Azure/aks-pm

@ghost added the Needs Attention label Jul 26, 2020
@ghost

ghost commented Aug 5, 2020

Issue needing attention of @Azure/aks-leads

@ghost

ghost commented Aug 20, 2020

Issue needing attention of @Azure/aks-leads

@ghost

ghost commented Sep 5, 2020

Issue needing attention of @Azure/aks-leads

@palma21 added the stale label and removed the Needs Attention and action-required labels Sep 9, 2020
@ghost removed the stale label Sep 9, 2020
@ghost

ghost commented Sep 16, 2020

@Azure/aks-pm issue needs labels

@ghost

ghost commented Sep 23, 2020

@Azure/aks-pm issue needs labels

@palma21
Member

palma21 commented Sep 25, 2020

stale-close

@palma21 closed this as completed Sep 25, 2020
@ghost locked as resolved and limited conversation to collaborators Oct 25, 2020