
cnrm-resource-stats-recorder OOMing #239

Closed
ct-dh opened this issue Jul 13, 2020 · 19 comments
Labels
bug Something isn't working

Comments


ct-dh commented Jul 13, 2020

Describe the bug
cnrm-resource-stats-recorder is crashlooping due to OOM on one of our GKE clusters. We are using the Workload Identity based setup and are not using the GKE add-on yet, as it is still in beta.

The memory limit was set to 64Mi in the version of CC we have deployed. I tried bumping it to 256Mi, but it then seems to get stuck at the 'listing all CRDs' part of the code:

Recording the stats of Config Connector resources                                                                                                                                     
2020/07/13 11:32:01 Registering OpenCensus views                                                                                                                                      
2020/07/13 11:32:01 Registering the Prometheus exporter                                                                                                                               
2020/07/13 11:32:01 Recording the process start time                                                                                                                                  
I0713 11:32:03.965616       1 request.go:621] Throttling request took 1.070573036s, request: GET:https://10.12.16.1:443/apis/container.cnrm.cloud.google.com/v1beta1?timeout=32s      
2020/07/13 11:32:05 recording the build info                                                                                                                                          
2020/07/13 11:32:05 listing all CRDs managed by Config Connector                                                                                                                      
2020/07/13 11:32:50 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:33:43 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:34:12 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:34:42 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:35:08 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:35:20 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:35:46 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:36:26 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:36:41 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:36:51 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:37:18 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:37:45 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:38:15 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:38:44 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:39:02 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:39:24 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:39:46 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:40:24 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:40:35 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:41:02 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:41:51 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:42:00 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:42:11 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:42:51 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:43:07 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:43:34 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:44:08 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:44:26 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:44:49 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:45:23 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:45:36 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:45:50 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:46:24 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:46:57 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:47:22 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:47:31 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/13 11:47:56 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)

The above continued to be the only output for at least 30 minutes; I haven't checked for longer yet.
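
In case anyone wants to reproduce the diagnosis, this is roughly how I confirmed the restarts were OOM kills (a sketch only; the namespace comes from our install, and the pod name placeholder is whatever the recorder pod is called on your cluster):

kubectl -n cnrm-system get pods
# shows the restart count for the recorder pod

kubectl -n cnrm-system get pod <recorder-pod-name> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# prints OOMKilled if the last restart was a memory kill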

We were not on the latest version of CC, as mentioned below, so I diffed the install YAML between our version and the latest and could see the container image tag had been bumped (from 97b6128 to e032470). I tried changing the image to that tag (I didn't update the CC version annotations, in case that has some impact), but it didn't help.

One thing that would be helpful for this sort of problem is having access to the source for cnrm-resource-stats-recorder (and the other parts of CC), but as far as I can tell this isn't publicly available? If I am wrong, please point me to the repos.

In another cluster of ours, the logs show the 'listing all CRDs' step completing within about 30 seconds.

ConfigConnector Version

kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}'
1.9.1

I am happy to try an upgrade, especially if I can get an answer on #238 and know that it can be done with no impact on a running cluster.

ct-dh added the bug label Jul 13, 2020
Author

ct-dh commented Jul 13, 2020

I found a second cluster where this was happening that is less relied upon than the others, so it is easier to test with. I applied the most recent CC YAML to upgrade to the latest version (this seemed to work with no errors), but the behaviour of cnrm-resource-stats-recorder seems to be the same (OOM with the default limits, and no progress if given a higher memory limit).
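
For reference, the upgrade itself was just re-applying the install bundle; the exact paths below are from memory of the install docs at the time, so treat them as assumptions and check the current guide before running them:

# download and unpack the latest (non-operator) release bundle -- path assumed
gsutil cp gs://cnrm/latest/release-bundle.tar.gz .
tar xzf release-bundle.tar.gz

# apply the Workload Identity flavour of the install YAML -- directory name assumed
kubectl apply -f install-bundle-workload-identity/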

@snuggie12

@ct-dh I had OOM problems until I bumped the CPU limits. When CPU wasn't being throttled the memory never seemed to spike near the default limit.
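
In case it helps with reproducing: the throttling itself is easy to see with the usual tools. A sketch of what I looked at (the namespace is assumed to be cnrm-system, and the counter names are the standard cAdvisor ones):

kubectl -n cnrm-system top pods
# rough view of live CPU/memory usage vs. the requests and limits

# if you already scrape cAdvisor, the standard throttling counters tell the same story, e.g.
#   container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total
# filtered to the recorder container in cnrm-system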

However, I'm back to having OOM problems because I deployed KCC with the operator and there does not appear to be an option to alter the resources on the stats recorder. There might be something with an annotation but as you stated there's no access to the code AFAIK.

Hope my suggestion helps, I think I need a separate issue for my OOM problem.

@xiaobaitusi
Contributor

Hi @ct-dh and @snuggie12 ,

Thanks for reporting the issue. We will increase the CPU limit to help mitigate the issue as @snuggie12 suggested.

Thanks,
Zeya

Author

ct-dh commented Jul 14, 2020

@xiaobaitusi and @snuggie12 I just wanted to confirm that increasing the CPU limit did indeed allow the recorder process to progress, as shown below (it would still OOM within a minute or so with the original 64Mi memory limit):

Recording the stats of Config Connector resources                                                                                                                                     
2020/07/14 08:31:36 Registering OpenCensus views                                                                                                                                      
2020/07/14 08:31:36 Registering the Prometheus exporter                                                                                                                               
2020/07/14 08:31:36 Recording the process start time                                                                                                                                  
I0714 08:31:37.539324       1 request.go:621] Throttling request took 1.028922629s, request: GET:https://10.4.16.1:443/apis/dataflow.cnrm.cloud.google.com/v1beta1?timeout=32s        
2020/07/14 08:31:38 recording the build info                                                                                                                                          
2020/07/14 08:31:38 listing all CRDs managed by Config Connector                                                                                                                      
2020/07/14 08:31:44 finished one run of recording resource metrics                                                                                                                    
2020/07/14 08:32:44 recording the build info                                                                                                                                          
2020/07/14 08:32:44 listing all CRDs managed by Config Connector                                                                                                                      
2020/07/14 08:32:52 finished one run of recording resource metrics                                                                                                                    
2020/07/14 08:33:52 recording the build info                                                                                                                                          
2020/07/14 08:33:52 listing all CRDs managed by Config Connector                                                                                                                      
2020/07/14 08:34:05 finished one run of recording resource metrics                                                                                                                    
2020/07/14 08:34:13 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/14 08:34:22 http: superfluous response.WriteHeader call from cnrm.googlesource.com/cnrm/vendor/github.com/prometheus/client_golang/prometheus/promhttp.httpError (http.go:344)
2020/07/14 08:35:05 recording the build info                                                                                                                                          
2020/07/14 08:35:05 listing all CRDs managed by Config Connector

So just to clarify, it seems like both the memory limit and the CPU limit need increasing.

@snuggie12

A few questions:

  • What did you update the CPU resources to? I should have specified in my comment (my apologies), but I believe the defaults are 10m/20m and I changed mine to 10m/100m (requests/limits).
  • What version did you test the limit change on? Mine was 1.11.1 and I noticed you had different clusters at different versions at different times.
  • What's your setup like?
    • Do you have one controller per cluster or are you in namespaced mode? I'm in namespaced mode and have 5 or 6 namespaces with controllers (no clue if this matters as I don't know what the stats recorder does or if it would be impacted by number of controllers.)
    • How many GCP resources do you have that use kcc?

Author

ct-dh commented Jul 14, 2020

  • I changed the CPU limit to 100m and the memory limit to 128Mi (roughly as sketched below).
  • Both clusters I was working with were on 1.9.1 initially; I tried upgrading one to 1.14.x, but it didn't change the behaviour.
  • We have one controller per cluster, using the Workload Identity based instructions. In terms of the number of resources using KCC on these clusters: they were both testing environments and no one has jumped onto KCC yet other than for a few things, so it's very low, i.e. fewer than 10.
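
For completeness, the change itself was roughly the following (a sketch only: it assumes the recorder is installed as a Deployment named cnrm-resource-stats-recorder in cnrm-system with a container called recorder, and re-applying the install YAML will reset the values):

kubectl -n cnrm-system set resources deployment/cnrm-resource-stats-recorder \
  -c recorder --limits=cpu=100m,memory=128Mi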

@snuggie12

The only thing I see in common is the Workload Identity. I didn't notice the issue until someone on my team switched us over to that (you have to use it in namespaced mode).

I neglected to mention that my testing showed the CPU limit was proportional to the problem. The pod always had a massive peak in memory at the beginning that eventually decreased and levelled off around halfway between the request and the limit. However, changing the CPU would change the height of the peak and how long it took to reach it. Your data seems to roughly match that, given it progressed further.

Anyways, hopefully they can figure out what's happening.

Contributor

spew commented Jul 14, 2020

We have identified what we believe is the root cause (essentially a runaway workload) and will be making a change to the recorder's logic that will reduce its resource requirements. This fix will be in this week's release.

@jcanseco
Member

Hi all, we just released v1.15.0 which should fix the issue. Please try it out and let us know if it fixes the problem.
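
If it helps, the quickest way to confirm which version a cluster is actually running after the upgrade is the same annotation check from the original report:

kubectl get ns cnrm-system -o jsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com/version}'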

@snuggie12

I still see heavy CPU throttling and eventually an OOM; however, the time it takes for the recorder container to restart and then OOM seems to have been prolonged from about 7 minutes to 10 minutes.

@jcanseco
Member

@snuggie12, gotcha, thanks for reporting. We'll keep looking into this and let you know when we have further updates.

Contributor

spew commented Jul 29, 2020

We have discovered that this occurs when Prometheus is enabled on the cluster and pulling metrics from the stats recorder process. We are determining what to do next.

Contributor

spew commented Jul 29, 2020

@snuggie12 do you have Prometheus enabled on your cluster?

@snuggie12

We use Prometheus installed via the prometheus-operator, so there are probably quite a few defaults in play. If we block scraping of the recorder container, will that fix it?
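
What I have in mind is just a drop rule on the pod name in whatever scrape config or ServiceMonitor currently picks the pod up; a sketch in raw Prometheus scrape-config terms (a ServiceMonitor would use the equivalent relabelings field), with the pod-name pattern assumed from the deployment name:

relabel_configs:
  - source_labels: [__meta_kubernetes_pod_name]
    regex: cnrm-resource-stats-recorder.*
    action: drop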

Contributor

spew commented Jul 30, 2020

Yes, I believe it will. If you could try that and let us know, that would be great. Also, you can try reducing the number of replicas in the Prometheus spec.
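
If you are on the prometheus-operator, that is just the replicas field on the Prometheus custom resource; something along these lines, with the namespace and object name being whatever your install uses:

kubectl -n monitoring patch prometheus k8s --type merge -p '{"spec":{"replicas":1}}'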

@snuggie12

I don't see any. I think the only thing grabbing data is kube-state-metrics? If Prometheus is scraping it, there is only one replica.

I also noticed that 9 of the 10 controllers are running a lower version than the stats recorder. I scaled 7 of them down to 0, but it did not seem to change the performance at all. I only did this because I noticed they are listening on the same port, so I didn't know if the recorder is scraping the controllers or what.


snuggie12 commented Jul 30, 2020

I did a curl on the stats-recorder pod to see which metrics show up, and have confirmed they aren't in our Prometheus system.

I did notice that our scrape interval on the recorder container is set to 60 seconds and my curl took 69 seconds. I do not see a setting on the prom-to-sd container specifying how often it queries. Curious if putting initialDelaySeconds on the readiness probe could help at all?
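
For reference, I was just port-forwarding to the pod and timing a scrape by hand, roughly like this (the pod name and metrics port are placeholders for whatever your recorder pod exposes):

kubectl -n cnrm-system port-forward pod/<recorder-pod-name> 8888:<metrics-port>
# in another shell: time the scrape and count the non-comment (sample) lines
time curl -s localhost:8888/metrics | grep -c '^[^#]'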

It's currently spitting out ~6500 metrics. They seem to span 15 namespaces, 10 statuses and 56 GVKs, though that works out to 8400 combinations, so some series must not be present. I believe the namespaces are anywhere that has previously had a cnrm resource, currently has one but no controller, or currently has both a resource and a controller.

IDK if any of that data helps, but maybe something will stick out as odd.

Contributor

spew commented Jul 30, 2020

Hi @snuggie12, note that we are doubling the CPU for the stats-recorder pod in this week's release, as in our testing that has prevented the issue.

@toumorokoshi
Contributor

Closing this issue since we have doubled the resources for the stats-recorder. Please re-open if you find that cnrm-resource-stats-recorder is still OOMing with the latest release.
