cnrm-resource-stats-recorder OOMing #239
I found a second cluster where this was happening; it is less relied on than the others, so it is easier to test things with. I applied the most recent CC YAML to upgrade to the latest version (this seemed to work with no errors), but the behavior of cnrm-resource-stats-recorder seems to be the same: it OOMs with the default limits, and seems not to progress if given a higher memory limit.
@ct-dh I had OOM problems until I bumped the CPU limits. When the CPU wasn't being throttled, the memory never seemed to spike near the default limit. However, I'm back to having OOM problems because I deployed KCC with the operator, and there does not appear to be an option to alter the resources on the stats recorder. There might be something with an annotation, but as you stated there's no access to the code AFAIK. Hope my suggestion helps; I think I need a separate issue for my OOM problem.
Hi @ct-dh and @snuggie12, thanks for reporting the issue. We will increase the CPU limit to help mitigate the issue as @snuggie12 suggested. Thanks,
@xiaobaitusi and @snuggie12 I just wanted to confirm that indeed increasing the CPU limit allowed the recorder process to progress as per below (it would still OOM within a minute or so with the original 64Mi memory limit):
So just to clarify, it seems like both the memory limit and the CPU limit need increasing.
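In case it helps anyone else hitting this, a resources stanza along these lines is roughly what I ended up with (the container name and the exact values here are my guesses based on this thread, not shipped defaults or an official recommendation):

```yaml
# Hypothetical limits for the recorder container; values are what this
# thread suggests trying, and the container name is an assumption.
containers:
  - name: recorder
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m         # bumped so the process isn't CPU-throttled
        memory: 256Mi     # up from the 64Mi default that was OOMing
```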
A few questions:
The only thing I see in common is workload identity. I didn't notice the issue until someone on my team switched us over to that (you have to use it in namespaced mode). I neglected to mention that my testing showed the CPU limit was proportional to the problem. The pod always had a massive peak in memory at the beginning that eventually decreased and leveled off around halfway between the request and the limit. However, changing the CPU would change the height of the peak and the duration before it hit. Your data seems to roughly match that, in that it progressed further. Anyways, hopefully they can figure out what's happening.
We have identified what we believe is the root cause, basically a runaway workload, and will be making a change to the recorder's logic which will reduce its requirements. This fix will be in this week's release. |
Hi all, we just released v1.15.0 which should fix the issue. Please try it out and let us know if it fixes the problem. |
I still have heavy CPU throttling and eventually OOM, however it seems to have been prolonged from 7 mins to 10 mins for the recorder container to restart and then OOM. |
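For what it's worth, the throttling shows up clearly with a cAdvisor query along these lines (the `container` label value is a guess on my part):

```promql
# Fraction of CFS periods in which the recorder container was throttled
rate(container_cpu_cfs_throttled_periods_total{container="recorder"}[5m])
  / rate(container_cpu_cfs_periods_total{container="recorder"}[5m])
```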
@snuggie12, gotcha, thanks for reporting. We'll keep looking into it and let you know when we have further updates.
We have discovered that this occurs when Prometheus is enabled on the cluster and pulling metrics from the stats recorder process. We are determining what to do next.
@snuggie12 do you have prometheus enabled on your cluster? |
We use Prometheus installed via the prometheus-operator, so there are probably quite a few defaults in play. If we block scraping of the recorder container, will that fix it?
Yes, I believe it will. If you could try and let us know that would be great. Also, you can try reducing the number of replicas in the Prometheus spec. |
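With the prometheus-operator, blocking the scrape could look roughly like this (the monitor name, namespace, pod label, and port name here are assumptions on our part; adjust to your setup):

```yaml
# Hypothetical PodMonitor that drops the recorder container's targets.
# Metadata, selector label, and port name are assumptions, not taken
# from the CC manifests.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: cnrm-system
  namespace: cnrm-system
spec:
  selector:
    matchLabels:
      cnrm.cloud.google.com/system: "true"
  podMetricsEndpoints:
    - port: metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_container_name]
          regex: recorder
          action: drop
```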
I don't see any. I think the only thing grabbing data is kube-state-metrics? If Prometheus is scraping it, there is only one replica. I also noticed that 9/10 controllers are running a lower version than the stats recorder. I scaled 7 of them down to 0, but it did not seem to change the performance at all. I only did this because I noticed they are listening on the same port, so I didn't know if the recorder is scraping the controllers or what.
I did a curl on the stats-recorder pod to see which metrics show up and have confirmed they aren't in our Prometheus system. I did notice that our scrape interval for the recorder container is set to 60 seconds and my curl took 69 seconds; I do not see a setting on the prom-to-sd container specifying how often it queries. It's currently spitting out ~6500 metrics. It seems to be across 15 namespaces, 10 statuses, and 56 GVKs, though that works out to 8400, so some metrics must not be present. I believe the namespaces are anywhere there has previously been a cnrm resource, there is one but not currently a controller, or there is currently a resource and controller. IDK if any of that data helps, but maybe something will stick out as odd.
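For reference, the dimension math above is just:

```shell
# Upper bound on metric series from the label dimensions observed above
namespaces=15
statuses=10
gvks=56
echo $((namespaces * statuses * gvks))   # prints 8400; only ~6500 are actually exposed
```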
Hi @snuggie12 note: we are doubling the CPU for the stats-recorder pod in this week's release as in our testing that has prevented the issue. |
Closing this issue since we have doubled the resources for stats-recorder. Please re-open if you find that cnrm-resource-stats-recorder is still OOMing with the latest release.
Describe the bug
cnrm-resource-stats-recorder is crashlooping due to OOM on 1 of our GKE clusters. We are using the workload identity based setup and are not using the GKE addon yet as it is still in beta.
The memory limit was set to 64Mi in the version of CC we have deployed. I tried bumping it to 256Mi, and it then seems to get stuck in the part of the code that lists all CRDs.
The above just continued to be the only output for at least 30 minutes, I haven't checked for longer yet.
We were not on the latest version of CC, as mentioned below, so I diffed the install YAML for our version and the latest version and could see the container image version had been bumped (from 97b6128 -> e032470). I tried changing to that (I didn't update the annotations for the CC version, in case that has some impact) and it didn't help.
One thing that would be helpful for these sort of problems is having access to the source for cnrm-resource-stats-recorder (and the other parts of CC) but as far as I can tell this isn't publicly available? If I am wrong, please can you point me to the repos.
In another cluster of ours, the logs show the listing CRD step as completing within about 30 seconds.
ConfigConnector Version
I am happy to try an upgrade, especially if I can get an answer on #238 and know that it can be done with no impact on a running cluster.