Running Multiple Recommenders with Checkpointing results in deletion of checkpoints from other recommenders #6387
Hey @javin-stripe, thanks for the great description and analysis! This absolutely does not feel like intended behavior, and it probably indicates that this isn't a widely used feature, or that all users of this feature rely on a mechanism other than checkpoints for history data. Thanks for offering to provide a fix, this would be highly appreciated! When you say the fix should be relatively simple, which solution do you have in mind?
Hey @voelzmo, thanks for getting back to me! I haven't really done any work on this since posting this issue and finding essentially the workaround (i.e., if you can cycle your nodes outside of VPA: since the garbage collection timer is in-memory, it resets on a new node). However, the way I see this fix being implemented is essentially as follows:
We will likely want to group each checkpoint with the VPA object itself before filtering the VPAs, and then only look at checkpoints that also survive the filtering. Let me know if that doesn't make sense, or if there is anything you would like to see in the fix! I should have some cycles in the next few weeks to put up a PR.
Hey @javin-stripe, thanks for outlining your thoughts on how to approach this before jumping into the PR! For garbage collecting the checkpoints you correctly found that we should operate on an unfiltered list of all existing VPA objects in the entire cluster, so my best guess would be to avoid calling LoadVPAs (which filters by recommender) in the garbage collection path and fetch the unfiltered list there instead.
The rest of the method can probably stay the same: we just need to make sure that we check against the map of unfiltered VPAs and not the one in the cluster state. As an alternative, I thought about keeping a list of unfiltered VPAs in the cluster state itself. Does this make sense?
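To illustrate the direction discussed here, below is a minimal, self-contained sketch. None of these names are the actual VPA code; vpaID, checkpoint, garbageCollectCheckpoints and deleteFn are hypothetical stand-ins for the feeder's listers and clients. The only point it demonstrates is that the orphan check runs against an unfiltered set of VPAs:

```go
package main

import "fmt"

// vpaID identifies a VPA object by namespace and name.
type vpaID struct{ Namespace, Name string }

// checkpoint is a VPA checkpoint reduced to the fields that matter for GC.
type checkpoint struct{ Namespace, Name, VPAObjectName string }

// garbageCollectCheckpoints deletes checkpoints whose owning VPA no longer
// exists. It deliberately checks against the unfiltered set of VPAs in the
// cluster, so checkpoints belonging to VPAs handled by other recommenders are
// not treated as orphaned.
func garbageCollectCheckpoints(allVPAs []vpaID, checkpoints []checkpoint, deleteFn func(ns, name string) error) {
	existing := make(map[vpaID]bool, len(allVPAs))
	for _, v := range allVPAs {
		existing[v] = true
	}
	for _, cp := range checkpoints {
		if existing[vpaID{cp.Namespace, cp.VPAObjectName}] {
			continue // owning VPA still exists, keep the checkpoint
		}
		if err := deleteFn(cp.Namespace, cp.Name); err != nil {
			fmt.Printf("failed to delete orphaned checkpoint %s/%s: %v\n", cp.Namespace, cp.Name, err)
		}
	}
}

func main() {
	// Every VPA in the cluster, regardless of which recommender runs this code.
	allVPAs := []vpaID{{"ns-a", "vpa-default"}, {"ns-b", "vpa-custom"}}
	checkpoints := []checkpoint{
		{"ns-a", "vpa-default-app", "vpa-default"},
		{"ns-b", "vpa-custom-app", "vpa-custom"},
		{"ns-c", "stale-checkpoint", "deleted-vpa"}, // the only true orphan
	}
	garbageCollectCheckpoints(allVPAs, checkpoints, func(ns, name string) error {
		fmt.Printf("deleting %s/%s\n", ns, name)
		return nil
	})
}
```

With this shape, a checkpoint is only deleted when its VPA is gone from the cluster entirely, not merely invisible to the current recommender.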
Yup, that sounds good, I'll loop back with a PR at some point. Thanks for the thoughts and the approach!
I'm facing the same issue, and it's currently keeping us from using multiple VPA recommenders. @javin-stripe, any progress with the PR? If you're busy I could also give it a shot.
/area vertical-pod-autoscaler
@BojanZelic - unfortunately this was de-prioritized on our end. We have the workaround of setting the garbage collection timeframe very high and making sure our VPA pods are cycled before then. Sadly, I'm not sure if or when I'll be able to pick this work up again.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Which component are you using?:
Vertical Pod Autoscaler
What version of the component are you using?:
Component version: v0.13.0
What k8s version are you using (kubectl version)?:
What environment is this in?:
Self-hosted on EC2
What did you expect to happen?:
When running multiple recommenders, only checkpoints that correspond with that recommender are eligible for deletion during garbage collection. I.e., for a checkpoint to be deleted by RecommenderA, the checkpoint needs to be associated with RecommenderA.
What happened instead?:
RecommenderA filters out any VPAs that aren’t associated with it, and then in the garbage collection process those checkpoints would be considered ‘orphaned’ and deleted.
The following log line could be found, and only for VPA checkpoints not associated with the recommender: "Orphaned VPA checkpoint cleanup - deleting some-namespace/some-vpa-checkpoint."
This can also be seen in code:
1. Inside LoadVPAs, only VPAs that match the recommender's name are included; any VPAs assigned to a different recommender are filtered out. We can see this happening based on the logs: "Ignoring vpaCRD vpa-something in namespace some-namespace as new-recommender recommender doesn't process CRDs implicitly destined to default recommender". So we know that all the VPAs can be loaded at this point, but they are being filtered out.
2. The filtered VPAs are then passed to a cluster-state function that either simply updates the VPA object inside the cluster state, or creates and then updates it if it's not present yet.
3. This is the only place the clusterState.Vpas map is added to.
4. The checkpoint garbage collection calls LoadVPAs again, to make sure it is working on fresh data (remember that this is where the filtering happens!). It then grabs all checkpoints, in all namespaces (which is not filtered), and checks whether the VPA for each checkpoint exists in the clusterState. Since we know from (3) that the clusterState only contains the filtered list of VPAs, and this step checks all VPA checkpoints, it starts deleting the VPA checkpoints not associated with that recommender (a simplified sketch of this flow follows the list).
5. We can validate that this happened based on the logs: "I1216 06:14:17.567806 571 cluster_feeder.go:331] Orphaned VPA checkpoint cleanup - deleting some-namespace/some-vpa-checkpoint."
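To make the failure mode concrete, here is a heavily condensed, self-contained sketch of the flow described above. It is not the actual VPA source; the names (loadVPAs, garbageCollectCheckpoints, vpaSpec, etc.) are paraphrased stand-ins for the real feeder logic:

```go
package main

import "fmt"

type vpaID struct{ Namespace, Name string }

// vpaSpec is a VPA object reduced to its identity and assigned recommender.
type vpaSpec struct {
	ID          vpaID
	Recommender string
}

type checkpoint struct{ Namespace, Name, VPAObjectName string }

// loadVPAs mirrors step 1: only VPAs assigned to this recommender make it into
// the cluster state; everything else is dropped.
func loadVPAs(all []vpaSpec, myRecommender string) map[vpaID]bool {
	state := map[vpaID]bool{}
	for _, v := range all {
		if v.Recommender != myRecommender {
			fmt.Printf("Ignoring vpaCRD %s in namespace %s\n", v.ID.Name, v.ID.Namespace)
			continue
		}
		state[v.ID] = true
	}
	return state
}

// garbageCollectCheckpoints mirrors step 4: every checkpoint in the cluster is
// compared against the filtered state, so checkpoints belonging to VPAs of
// other recommenders look orphaned and get deleted.
func garbageCollectCheckpoints(state map[vpaID]bool, checkpoints []checkpoint) {
	for _, cp := range checkpoints {
		if !state[vpaID{cp.Namespace, cp.VPAObjectName}] {
			fmt.Printf("Orphaned VPA checkpoint cleanup - deleting %s/%s\n", cp.Namespace, cp.Name)
		}
	}
}

func main() {
	vpas := []vpaSpec{
		{vpaID{"ns-a", "vpa-a"}, "default"},
		{vpaID{"ns-b", "vpa-b"}, "new-recommender"},
	}
	checkpoints := []checkpoint{
		{"ns-a", "vpa-a-app", "vpa-a"},
		{"ns-b", "vpa-b-app", "vpa-b"},
	}
	// Running as the default recommender: vpa-b is filtered out, so its
	// perfectly valid checkpoint is reported (and would be deleted) as orphaned.
	state := loadVPAs(vpas, "default")
	garbageCollectCheckpoints(state, checkpoints)
}
```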
How to reproduce it (as minimally and precisely as possible):
This can easily be reproduced by spinning up two Recommenders in the same cluster, keeping one as the default recommender (i.e., don't override the recommender name) and adding a new one.
Create a few VPAs for each of the corresponding Recommenders, and then start watching for the ‘Orphaned VPA checkpoint cleanup*’ log.
After getting them both running, list the checkpoints in that cluster and you should see them being deleted every ~10 minutes (i.e., the garbage collection interval).
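For reference, a minimal pair of VPA objects for such a reproduction could look like the following. This is a sketch assuming the spec.recommenders selector available in recent VPA versions and a second recommender started with --recommender-name=custom; the names and target workloads are placeholders:

```yaml
# VPA handled by the default recommender (no recommenders field set)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-default
  namespace: demo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-default
  updatePolicy:
    updateMode: "Off"
---
# VPA pinned to the second recommender
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-custom
  namespace: demo
spec:
  recommenders:
    - name: custom
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-custom
  updatePolicy:
    updateMode: "Off"
```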
Anything else we need to know?:
The fix should be relatively simple; I can put that together if we agree that this is actually a bug and not intended behavior.
The workaround is to set the garbage collection interval to something very large and replace your recommender pods before that interval elapses. However, this has the downside that you are effectively turning off the checkpoint garbage collection logic.
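As an illustration only, assuming the checkpoint GC interval is controlled by a flag such as --checkpoints-gc-interval (please verify the exact flag name for your VPA version), the workaround amounts to something like this in the recommender's container args:

```yaml
# Hypothetical recommender container args: push the checkpoint GC interval far
# out so the pods can be cycled before garbage collection ever runs.
args:
  - --recommender-name=custom
  - --checkpoints-gc-interval=240h   # default is on the order of ~10 minutes
```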