
How to modify the default prometheus-to-sd resource limits? #327

Open

artazar opened this issue Apr 27, 2020 · 10 comments
@artazar

artazar commented Apr 27, 2020

Hi,

Several of my GKE clusters experience constant CPUThrottlingHigh alerts, coming from prometheus-to-sd pods.

This daemonset has incredibly low CPU requests/limits by default:

resources:
  limits:
    cpu: 3m
    memory: 20Mi
  requests:
    cpu: 1m
    memory: 20Mi

When I try to edit the prometheus-to-sd daemonset and increase these values, they get reverted to defaults.
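
For reference, this is roughly the kind of edit I'm attempting, which gets reverted (a sketch: the daemonset and container names are assumed to both be prometheus-to-sd in kube-system, and the values are just examples):

# Strategic merge patch: the container entry is matched by name,
# only the resources field is changed.
# Names and values below are assumptions/examples, adjust as needed.
kubectl -n kube-system patch daemonset prometheus-to-sd -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "prometheus-to-sd",
            "resources": {
              "limits":   {"cpu": "20m", "memory": "50Mi"},
              "requests": {"cpu": "10m", "memory": "20Mi"}
            }
          }
        ]
      }
    }
  }
}'
# The patch applies, but the values are restored to the defaults shortly after.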

Questions:

  1. Is it possible to modify these default values in any way?

  2. Why are they so low? Such low values are very likely to cause CPU throttling and increase the monitoring noise from GKE clusters.

Similar report by a different user: https://stackoverflow.com/questions/58182345/cpu-throttling-on-default-gke-pods

My GKE cluster version is v1.15.11-gke.5

@artazar artazar changed the title How to modify the default prometheus-to-sd hardware limits? How to modify the default prometheus-to-sd resource limits? Apr 27, 2020
@serathius
Contributor

/cc @loburm

@loburm
Member

loburm commented Apr 28, 2020

Hi Artem,

This container is used by GKE engineers to collect operational metrics from system components and shouldn't influence user workloads. Previously we had a bug in prometheus-to-sd that caused a memory leak and high CPU usage, so the decision was made to minimize the potential impact by setting those limits.

I'm really sorry for the alerts that this component is causing in your cluster. Our team is working on completely removing it and replacing it with the OpenTelemetry agent.

@artazar
Author

artazar commented Apr 28, 2020

Hi @loburm,
Thank you for your answer! It is good to know you are working on improvements.
The strange thing is that these alerts were not seen on version 1.14 but started occurring en masse on 1.15.

@danieldides

We're seeing similar issues on our clusters in GKE. The prometheus-to-sd containers are being OOMKilled and CPU throttled and causing CrashLoopBackOffs. Hopefully this one can get resolved soon; it's triggering all our alerting. 😅
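
To confirm it really is OOMKills, something like this works (a sketch; pods are matched by name prefix, so adjust the grep to your cluster):

# Show restart counts for the affected pods
kubectl -n kube-system get pods | grep prometheus-to-sd

# Show the last termination reason per pod (look for OOMKilled)
kubectl -n kube-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep prometheus-to-sd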

@bfil

bfil commented Jun 8, 2020

The only way I found to change the resource limits is to set a LimitRange on the kube-system namespace, which is not ideal because it affects everything in the namespace, so I had to make sure the default limit was high enough for the rest of the pods:

apiVersion: v1
kind: LimitRange
metadata:
  name: kube-system-resource-limits
  namespace: kube-system
spec:
  limits:
    - default:
        memory: 200Mi
      defaultRequest:
        memory: 20Mi
      type: Container
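
To apply it and check the defaults took effect (assuming the manifest above is saved as kube-system-limits.yaml, a name I made up):

# The manifest already sets namespace: kube-system, so no -n flag is needed for apply
kubectl apply -f kube-system-limits.yaml

# Verify the default request/limit values are in place
kubectl -n kube-system describe limitrange kube-system-resource-limits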

Every other change affecting the pod seems to be reverted by GKE.

I put this in place because of a memory leak in prometheus-to-sd (part of the event-exporter-gke pod). With the LimitRange, the container gets OOMKilled when it approaches the limit, which prevents it from disrupting the other pods running on the nodes, but it's not pretty:

[Screenshot 2020-06-08 at 09:34:36]

I hope this helps someone.

@bfil

bfil commented Jul 9, 2020

The memory leak seems to have gone away after the prometheus-to-sd container was automatically upgraded from 0.8.0 to 0.10.0.
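
If anyone wants to check which image their cluster is currently running, something along these lines should do it (a sketch; it just lists container images in kube-system and filters for prometheus-to-sd):

# List pod name + container images for anything running prometheus-to-sd
kubectl -n kube-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' \
  | grep prometheus-to-sd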

@jamesproud

@bfil I'm still seeing this with prometheus-to-sd-exporter | CrashLoopBackOff | gke.gcr.io/prometheus-to-sd:v0.10.0-gke.0. @loburm Any ideas when this will be replaced?

@bfil

bfil commented Jul 22, 2020

@jamesproud I experienced some issues at the end of June that went away after Google pushed an update to the event-exporter-gke pod on the 29th of June. This is what the disruption looked like:

[Screenshot 2020-07-22 at 09:35:11]

It has all looked good since the last revision:

[Screenshot 2020-07-22 at 09:35:58]

My cluster is on the stable release channel on version 1.16.9-gke.6 and it has been fine since the 29th of June so far.

[Screenshot 2020-07-22 at 09:38:35]

Not sure if this helps.

@Romiko

Romiko commented Nov 15, 2020

Hi, this daemonset is causing a lot of issues when running virtual nodes via virtual-kubelet. Please change the tolerations, or provide a way for me to patch them. I have a virtual node that runs pods in Azure Container Instances, so it does not make sense to have this running on the virtual node at all.
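
For illustration, this is the kind of patch I would like to be able to apply to keep it off virtual nodes. Since removing a toleration via a patch is awkward, this adds a node affinity instead (a sketch: the type=virtual-kubelet node label is an assumption, check the labels on your virtual node; and GKE will most likely revert the edit anyway, which is exactly the problem):

# Merge patch adding a node affinity that excludes virtual-kubelet nodes
# (node label key/value below are assumptions)
kubectl -n kube-system patch daemonset prometheus-to-sd --type merge -p '{
  "spec": {
    "template": {
      "spec": {
        "affinity": {
          "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
              "nodeSelectorTerms": [
                {"matchExpressions": [
                  {"key": "type", "operator": "NotIn", "values": ["virtual-kubelet"]}
                ]}
              ]
            }
          }
        }
      }
    }
  }
}'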

@Nadavpe

Nadavpe commented Nov 30, 2020

It is perfectly fine to set low defaults for these resources, but you really should let us raise them.
This is causing a lot of monitoring noise, and we are willing to spend some extra resources to get rid of it.

The other option is to start silencing these alerts, which is not something we would like to do.
Please release the values to our control, or at least let us increase the resources.
