
How to modify the default prometheus-to-sd resource limits? #327

Open

artazar opened this issue Apr 27, 2020 · 10 comments
@artazar

artazar commented Apr 27, 2020

Hi,

Several of my GKE clusters experience constant CPUThrottlingHigh alerts, coming from prometheus-to-sd pods.

This daemonset has incredibly low CPU requests/limits by default:

resources:
  limits:
    cpu: 3m
    memory: 20Mi
  requests:
    cpu: 1m
    memory: 20Mi

When I try to edit the prometheus-to-sd daemonset and increase these values, they get reverted to defaults.
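
For reference, this is roughly the kind of edit I'm attempting, which gets reverted (a sketch: the daemonset and container names are assumed to both be prometheus-to-sd in kube-system, and the values are just examples):

# Strategic merge patch: the container entry is matched by name,
# only the resources field is changed.
# Names and values below are assumptions/examples, adjust as needed.
kubectl -n kube-system patch daemonset prometheus-to-sd -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "prometheus-to-sd",
            "resources": {
              "limits":   {"cpu": "20m", "memory": "50Mi"},
              "requests": {"cpu": "10m", "memory": "20Mi"}
            }
          }
        ]
      }
    }
  }
}'
# The patch applies, but the values are restored to the defaults shortly after.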

Questions:

  1. Is it possible to modify these default values in any way?

  2. Why are they so low? Such low values are very likely to cause CPU throttling and increase the monitoring noise from GKE clusters.

Similar report by a different user: https://stackoverflow.com/questions/58182345/cpu-throttling-on-default-gke-pods

My GKE cluster version is v1.15.11-gke.5

@artazar artazar changed the title How to modify the default prometheus-to-sd hardware limits? How to modify the default prometheus-to-sd resource limits? Apr 27, 2020
@serathius
Contributor

/cc @loburm

@loburm
Member

loburm commented Apr 28, 2020

Hi Artem,

This container is used by GKE engineers to collect operational metrics from system components and shouldn't influence user workloads. Previously we had a bug in prometheus-to-sd that caused a memory leak and high CPU usage, so the decision was made to minimize the potential impact by setting those limits.

I'm really sorry for the alerts that this component is causing in your cluster. Our team is working on completely removing it and replacing it with the OpenTelemetry agent.

@artazar
Author

artazar commented Apr 28, 2020

Hi @loburm,
Thank you for your answer! It is good to know you are working on improvements.
The strange thing is that these alerts were not seen on version 1.14 but started occurring en masse on 1.15.

@danieldides

We're seeing similar issues on our clusters in GKE. The prometheus-to-sd containers are being OOMKilled and CPU throttled and causing CrashLoopBackOffs. Hopefully this one can get resolved soon; it's triggering all our alerting. 😅
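
To confirm it really is OOMKills, something like this works (a sketch; pods are matched by name prefix, so adjust the grep to your cluster):

# Show restart counts for the affected pods
kubectl -n kube-system get pods | grep prometheus-to-sd

# Show the last termination reason per pod (look for OOMKilled)
kubectl -n kube-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep prometheus-to-sd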

@bfil

bfil commented Jun 8, 2020

The only way I found to change the resource limits is to set a LimitRange on the kube-system namespace, which is not ideal because it affects everything in the namespace, so I had to make sure the default limit was high enough for the rest of the pods:

apiVersion: v1
kind: LimitRange
metadata:
  name: kube-system-resource-limits
  namespace: kube-system
spec:
  limits:
    - default:
        memory: 200Mi
      defaultRequest:
        memory: 20Mi
      type: Container
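
To apply it and check the defaults took effect (assuming the manifest above is saved as kube-system-limits.yaml, a name I made up):

# The manifest already sets namespace: kube-system, so no -n flag is needed for apply
kubectl apply -f kube-system-limits.yaml

# Verify the default request/limit values are in place
kubectl -n kube-system describe limitrange kube-system-resource-limits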

Every other change affecting the pod seems to be reverted by GKE.

I put this in place because of a memory leak in prometheus-to-sd (part of the event-exporter-gke pod). With the LimitRange, the container gets OOMKilled when it approaches the limit, which prevents it from disrupting the other pods running on the nodes, but it's not pretty:

[Screenshot 2020-06-08 at 09:34:36]

I hope this helps someone.

@bfil

bfil commented Jul 9, 2020

The memory leak seems to have gone away after the prometheus-to-sd container was automatically upgraded from 0.8.0 to 0.10.0.
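
If anyone wants to check which image their cluster is currently running, something along these lines should do it (a sketch; it just lists container images in kube-system and filters for prometheus-to-sd):

# List pod name + container images for anything running prometheus-to-sd
kubectl -n kube-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' \
  | grep prometheus-to-sd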

@jamesproud

@bfil I'm still seeing this with prometheus-to-sd-exporter | CrashLoopBackOff | gke.gcr.io/prometheus-to-sd:v0.10.0-gke.0. @loburm Any ideas when this will be replaced?

@bfil

bfil commented Jul 22, 2020

@jamesproud I experienced some issues at the end of June that went away after Google pushed an update to the event-exporter-gke pod on the 29th of June. This is what the disruption looked like:

[Screenshot 2020-07-22 at 09:35:11]

It has all looked good since the last revision:

[Screenshot 2020-07-22 at 09:35:58]

My cluster is on the stable release channel on version 1.16.9-gke.6 and it has been fine since the 29th of June so far.

[Screenshot 2020-07-22 at 09:38:35]

Not sure if this helps.

@Romiko

Romiko commented Nov 15, 2020

Hi, this daemonset is causing a lot of issues when running virtual nodes via virtual-kubelet. Please change the tolerations, or provide a way for me to patch them. I have a virtual node that runs pods in Azure Container Instances, so it does not make sense to have this running on the virtual node at all.
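
For illustration, this is the kind of patch I would like to be able to apply to keep it off virtual nodes. Since removing a toleration via a patch is awkward, this adds a node affinity instead (a sketch: the type=virtual-kubelet node label is an assumption, check the labels on your virtual node; and GKE will most likely revert the edit anyway, which is exactly the problem):

# Merge patch adding a node affinity that excludes virtual-kubelet nodes
# (node label key/value below are assumptions)
kubectl -n kube-system patch daemonset prometheus-to-sd --type merge -p '{
  "spec": {
    "template": {
      "spec": {
        "affinity": {
          "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
              "nodeSelectorTerms": [
                {"matchExpressions": [
                  {"key": "type", "operator": "NotIn", "values": ["virtual-kubelet"]}
                ]}
              ]
            }
          }
        }
      }
    }
  }
}'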

@Nadavpe

Nadavpe commented Nov 30, 2020

It is perfectly fine to set low defaults for these resources, but you really should let us raise them.
This is causing a lot of monitoring noise, and we are willing to spend some extra resources to get rid of it.

The other option is to start silencing these alerts, which is not something we would like to do.
Please release the values to our control, or at least let us increase the resources.
