cnrm-resource-stats-recorder OOMing #239

ct-dh · 2020-07-13T12:01:59Z

Describe the bug
cnrm-resource-stats-recorder is crashlooping due to OOM on 1 of our GKE clusters. We are using the workload identity based setup and are not using the GKE addon yet as it is still in beta.

The memory limit was set to 64 Mi in the version of CC we have deployed, I tried bumping it to 256 Mi and it seems to be getting stuck in the listing all CRDs part of the code.

Recording the stats of Config Connector resources                                                                                                                                     
2020/07/13 11:32:01 Registering OpenCensus views                                                                                                                                      
2020/07/13 11:32:01 Registering the Prometheus exporter                                                                                                                               
2020/07/13 11:32:01 Recording the process start time                                                                                                                                  
I0713 11:32:03.965616       1 request.go:621] Throttling request took 1.070573036s, request: GET:https://10.12.16.1:443/apis/container.cnrm.cloud.google.com/v1beta1?timeout=32s      
2020/07/13 11:32:05 recording the build info                                                                                                                                          
2020/07/13 11:32:05 listing all CRDs managed by Config Connector                                                                                                                      
2020/07/13 11:32:50 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:33:43 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:34:12 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:34:42 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:35:08 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:35:20 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:35:46 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:36:26 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:36:41 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:36:51 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:37:18 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:37:45 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:38:15 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:38:44 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:39:02 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:39:24 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:39:46 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:40:24 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:40:35 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:41:02 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:41:51 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:42:00 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:42:11 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:42:51 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:43:07 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:43:34 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:44:08 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:44:26 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:44:49 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:45:23 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:45:36 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:45:50 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:46:24 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:46:57 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:47:22 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:47:31 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:47:56 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)

The above just continued to be the only output for at least 30 minutes, I haven't checked for longer yet.

We were not on the latest version of CCas mentioned below, so I diffed the install YAML for our version and the latest version and could see the container image version had been bumped (from 97b6128 -> e032470), so I tried changing it to that (I didn't update the annotations for the CC version if that might have some impact) and it didn't help.

One thing that would be helpful for these sort of problems is having access to the source for cnrm-resource-stats-recorder (and the other parts of CC) but as far as I can tell this isn't publicly available? If I am wrong, please can you point me to the repos.

In another cluster of ours, the logs show the listing CRD step as completing within about 30 seconds.

ConfigConnector Version

kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}'
1.9.1%

I am happy to try an upgrade, especially if I can get an answer on #238 and no that it can be done with no impact on a running cluster.

The text was updated successfully, but these errors were encountered:

ct-dh · 2020-07-13T14:20:04Z

I found a second cluster where this was happening and is less relied on than the others so it is easier to test things with. I tried applying the most recent CC yaml and applying it to upgrade to the latest version (this seemed to work with no errors) but the behavior with cnrm-resource-stats-recorder seems to be the same (OOM with the default limits, and seem to not progress if given more RAM limit).

snuggie12 · 2020-07-13T17:39:52Z

@ct-dh I had OOM problems until I bumped the CPU limits. When CPU wasn't being throttled the memory never seemed to spike near the default limit.

However, I'm back to having OOM problems because I deployed KCC with the operator and there does not appear to be an option to alter the resources on the stats recorder. There might be something with an annotation but as you stated there's no access to the code AFAIK.

Hope my suggestion helps, I think I need a separate issue for my OOM problem.

xiaobaitusi · 2020-07-13T18:23:40Z

Hi @ct-dh and @snuggie12 ,

Thanks for reporting the issue. We will increase the CPU limit to help mitigate the issue as @snuggie12 suggested.

Thanks,
Zeya

ct-dh · 2020-07-14T08:35:52Z

@xiaobaitusi and @snuggie12 I just wanted to confirm that indeed increasing the CPU limit allowed the recorder process to progress as per below (it would still OOM within a minute or so with the original 64Mi memory limit):

Recording the stats of Config Connector resources                                                                                                                                     
2020/07/14 08:31:36 Registering OpenCensus views                                                                                                                                      
2020/07/14 08:31:36 Registering the Prometheus exporter                                                                                                                               
2020/07/14 08:31:36 Recording the process start time                                                                                                                                  
I0714 08:31:37.539324       1 request.go:621] Throttling request took 1.028922629s, request: GET:https://10.4.16.1:443/apis/dataflow.cnrm.cloud.google.com/v1beta1?timeout=32s        
2020/07/14 08:31:38 recording the build info                                                                                                                                          
2020/07/14 08:31:38 listing all CRDs managed by Config Connector                                                                                                                      
2020/07/14 08:31:44 finished one run of recording resource metrics                                                                                                                    
2020/07/14 08:32:44 recording the build info                                                                                                                                          
2020/07/14 08:32:44 listing all CRDs managed by Config Connector                                                                                                                      
2020/07/14 08:32:52 finished one run of recording resource metrics                                                                                                                    
2020/07/14 08:33:52 recording the build info                                                                                                                                          
2020/07/14 08:33:52 listing all CRDs managed by Config Connector                                                                                                                      
2020/07/14 08:34:05 finished one run of recording resource metrics                                                                                                                    
2020/07/14 08:34:13 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/14 08:34:22 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/14 08:35:05 recording the build info                                                                                                                                          
2020/07/14 08:35:05 listing all CRDs managed by Config Connector

So just to clarify, it seems like both the memory limit and the CPU limit need increasing.

snuggie12 · 2020-07-14T13:54:49Z

A few questions:

What'd you update the cpu resources to? I should have specified in my comment (my apologies) but I believe the defaults are 10/20m and I changed mine to 10/100m (requests/limits)
What version did you test the limit change on? Mine was 1.11.1 and I noticed you had different clusters at different versions at different times.
What's your setup like?
- Do you have one controller per cluster or are you in namespaced mode? I'm in namespaced mode and have 5 or 6 namespaces with controllers (no clue if this matters as I don't know what the stats recorder does or if it would be impacted by number of controllers.)
- How many GCP resources do you have that use kcc?

ct-dh · 2020-07-14T14:01:02Z

I changed the CPU limit to 100m and memory to 128Mi.
Both clusters I was working with were on 1.9.1 initially, I tried upgrading 1 to 1.14.x but it didn't change the behaviour.
We have 1 controller per cluster using the workload identity based instructions. In terms of numbers of resources using KCC on these clusters, they were both testing environments and no-one has jumped onto using KCC yet other than a few things so it's very low, i.e less than 10.

snuggie12 · 2020-07-14T14:25:08Z

The only thing I see in common is the workload identity. I didn't notice the issue until someone on my team switched us over to that (you have to use it if in namespaced mode.)

I neglected to mention that my testing showed the CPU was proportionate to the problem. The pod always had a massive peak in memory at the beginning that eventually decreased and would level off around halfway between request and limit. However, changing the CPU would change the height of the peak and the duration before it hit. Your data seems to somewhat match up with that saying it progressed further.

Anyways, hopefully they can figure out what's happening.

spew · 2020-07-14T16:27:06Z

We have identified what we believe is the root cause, basically a runaway workload, and will be making a change to the recorder's logic which will reduce its requirements. This fix will be in this week's release.

jcanseco · 2020-07-16T18:16:06Z

Hi all, we just released v1.15.0 which should fix the issue. Please try it out and let us know if it fixes the problem.

snuggie12 · 2020-07-17T19:41:25Z

I still have heavy CPU throttling and eventually OOM, however it seems to have been prolonged from 7 mins to 10 mins for the recorder container to restart and then OOM.

jcanseco · 2020-07-20T19:39:42Z

@snuggie12, gotcha, thanks for reporting, we'll keep looking into then and let you know when we have further updates.

spew · 2020-07-29T18:38:32Z

We have discovered that this occurs when prometheus is enable on the cluster and pulling metrics from the stats recorder process. We are determining what to do next.

spew · 2020-07-29T19:53:53Z

@snuggie12 do you have prometheus enabled on your cluster?

snuggie12 · 2020-07-30T02:58:37Z

We use prometheus installed via the prometheus-operator so there's probably quite a few defaults. If we block scraping to the recorder container it will get fixed?

spew · 2020-07-30T03:15:41Z

Yes, I believe it will. If you could try and let us know that would be great. Also, you can try reducing the number of replicas in the Prometheus spec.

snuggie12 · 2020-07-30T03:53:39Z

I don't see any. I think the only thing grabbing data is kube-state-metrics? If prometheus is scraping it there is only one replica.

I also noticed that 9/10 controllers are running a lower version vs the stats recorder. I scaled down 7 of them to 0 but it did not seem to change the performance at all. I only did this because I noticed they are listening on the same port so didn't know if the recorder is scraping the controllers or what.

snuggie12 · 2020-07-30T18:12:28Z

I did a curl on the stats-recorder pod to see which metrics show up and have confirmed they aren't in our prometheus system.

I did notice that our scrape interval on the recorder container is set to 60 seconds and my curl took 69 seconds. I do not see a setting on the prom-to-sd container specifying how often it queries. Curious if putting initialDelaySeconds on the readiness probe could help at all?

It's currently spitting out ~6500 metrics. It seems to be across 15 namespaces, 10 statuses and 56 GVKs though that figures out to be 8400 so some metrics must not be present. I believe the namespaces are anywhere there has previously been a cnrm resource, there is one but not currently a controller, or there is currently a resource and controller.

IDK if any of that data helps, but maybe something will stick out as odd.

spew · 2020-07-30T18:45:01Z

Hi @snuggie12 note: we are doubling the CPU for the stats-recorder pod in this week's release as in our testing that has prevented the issue.

toumorokoshi · 2020-11-25T23:22:00Z

Closing this issue since we have doubled the resource for stats-recorder. Please re-open if you find that cnrm-resource-stats-recorder is still OOMing, with the latest release.

ct-dh added the bug Something isn't working label Jul 13, 2020

snuggie12 mentioned this issue Jul 13, 2020

Allow to change resource limits on pods deployed via the operator #240

Open

yhrn mentioned this issue Aug 12, 2020

Insufficient resources for cnrm-webhook-manager #252

Closed

toumorokoshi closed this as completed Nov 25, 2020

This was referenced Apr 7, 2022

[Snyk] Fix for 1 vulnerabilities AndreLSnyk/k8s-config-connector#17

Open

[Snyk] Fix for 1 vulnerabilities Matthelonianxl/k8s-config-connector#22

Open

[Snyk] Fix for 5 vulnerabilities Matthelonianxl/k8s-config-connector#23

Closed

AndreLSnyk mentioned this issue Dec 14, 2023

[Snyk] Security upgrade @google-cloud/pubsub from 0.19.1 to 0.20.0 AndreLSnyk/k8s-config-connector#29

Open

Matthelonianxl mentioned this issue Dec 15, 2023

[Snyk] Security upgrade @google-cloud/pubsub from 0.19.1 to 0.20.0 Matthelonianxl/k8s-config-connector#35

Open

AndreLSnyk mentioned this issue Jan 1, 2024

[Snyk] Security upgrade @google-cloud/pubsub from 0.19.1 to 0.20.0 AndreLSnyk/k8s-config-connector#31

Open

Matthelonianxl mentioned this issue Jan 1, 2024

[Snyk] Security upgrade @google-cloud/pubsub from 0.19.1 to 0.20.0 Matthelonianxl/k8s-config-connector#36

Open

AndreLSnyk mentioned this issue Jan 5, 2024

[Snyk] Security upgrade @google-cloud/pubsub from 0.19.1 to 0.20.0 AndreLSnyk/k8s-config-connector#33

Open

Matthelonianxl mentioned this issue Jan 5, 2024

[Snyk] Security upgrade @google-cloud/pubsub from 0.19.1 to 0.20.0 Matthelonianxl/k8s-config-connector#38

Open

AndreLSnyk mentioned this issue Mar 15, 2024

[Snyk] Security upgrade @google-cloud/pubsub from 0.19.1 to 0.20.0 AndreLSnyk/k8s-config-connector#36

Open

Matthelonianxl mentioned this issue Mar 16, 2024

[Snyk] Security upgrade @google-cloud/pubsub from 0.19.1 to 0.20.0 Matthelonianxl/k8s-config-connector#42

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cnrm-resource-stats-recorder OOMing #239

cnrm-resource-stats-recorder OOMing #239

ct-dh commented Jul 13, 2020

ct-dh commented Jul 13, 2020

snuggie12 commented Jul 13, 2020

xiaobaitusi commented Jul 13, 2020

ct-dh commented Jul 14, 2020 •

edited

snuggie12 commented Jul 14, 2020

ct-dh commented Jul 14, 2020

snuggie12 commented Jul 14, 2020

spew commented Jul 14, 2020

jcanseco commented Jul 16, 2020

snuggie12 commented Jul 17, 2020

jcanseco commented Jul 20, 2020

spew commented Jul 29, 2020

spew commented Jul 29, 2020

snuggie12 commented Jul 30, 2020

spew commented Jul 30, 2020

snuggie12 commented Jul 30, 2020

snuggie12 commented Jul 30, 2020 •

edited

spew commented Jul 30, 2020

toumorokoshi commented Nov 25, 2020

cnrm-resource-stats-recorder OOMing #239

cnrm-resource-stats-recorder OOMing #239

Comments

ct-dh commented Jul 13, 2020

ct-dh commented Jul 13, 2020

snuggie12 commented Jul 13, 2020

xiaobaitusi commented Jul 13, 2020

ct-dh commented Jul 14, 2020 • edited

snuggie12 commented Jul 14, 2020

ct-dh commented Jul 14, 2020

snuggie12 commented Jul 14, 2020

spew commented Jul 14, 2020

jcanseco commented Jul 16, 2020

snuggie12 commented Jul 17, 2020

jcanseco commented Jul 20, 2020

spew commented Jul 29, 2020

spew commented Jul 29, 2020

snuggie12 commented Jul 30, 2020

spew commented Jul 30, 2020

snuggie12 commented Jul 30, 2020

snuggie12 commented Jul 30, 2020 • edited

spew commented Jul 30, 2020

toumorokoshi commented Nov 25, 2020

ct-dh commented Jul 14, 2020 •

edited

snuggie12 commented Jul 30, 2020 •

edited