VPA: prune stale container aggregates, split recommendations over true number of containers #6745

jkyros · 2024-04-22T15:39:57Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Previously we weren't cleaning up "stale" aggregates when container names changed (because of renames or removals), and that was resulting in:

- VPAs showing recommendations for containers which no longer exist
- Resources being split across containers which no longer exist (resulting in some containers ending up with resource limits too small for them to effectively live)

There was also a corner case during a rollout after a container was renamed in or removed from a deployment: we were counting the number of unique container names rather than the actual number of containers in each pod, so we were splitting resources that shouldn't have been split.

This PR is an attempt to clean up those stale aggregates without incurring too much overhead, and make sure that the resources get spread across the correct number of containers during a rollout.

Which issue(s) this PR fixes:

Fixes #6744

Special notes for your reviewer:

There are probably a lot of different ways we can do the pruning of stale aggregates for missing containers:

- I went with explicitly marking and sweeping them because it saved us an additional loop through all the pods and containers
- We could just as easily have a PruneAggregates() that runs after LoadPods() and goes through everything and removes them (or do this work as part of LoadPods(), but that seems...deceptive?)
- We could probably also tweak the existing garbageCollectAggregateCollectionStates and run it immediately after LoadPods() every time, but that might be expensive.

I'm not super-attached to any particular approach, I'd just like to fix this, so I can retool it if necessary.

If I am being ignorant, and there are corner cases I'm missing, absolutely let me know.
It probably needs some tests/cleanup, and I'll change the names of things to...whatever you want them to be. 😄

Does this PR introduce a user-facing change?

Added pruning of container aggregates and changed container math so resources will no longer be split across the wrong number of containers

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Previously we were dividing the resources per pod by the number of container aggregates, but in a situation where we're doing a rollout and the container names are changing (either a rename, or a removal) we're splitting resources across the wrong number of containers, resulting in smaller values than we should actually have. This collects a count of containers in the model when the pods are loaded, and uses the "high water mark value", so in the event we are doing something like adding a container during a rollout, we favor the pod that has the additional container. There are probably better ways to do this plumbing, but this was my initial attempt, and it does fix the issue.
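To make the rollout arithmetic concrete, here is a minimal sketch of the difference between counting unique container names and taking the per-pod high-water mark. The types, function names, and the 300m figure are illustrative assumptions, not the recommender's actual code or defaults.

```go
package main

import "fmt"

// podContainers maps a pod name to the names of the containers it runs.
// Hypothetical type for illustration only; not part of the VPA code base.
type podContainers map[string][]string

// uniqueContainerNames counts distinct container names across all pods.
// During a rename rollout this over-counts, because old and new names coexist.
func uniqueContainerNames(pods podContainers) int {
	seen := map[string]struct{}{}
	for _, names := range pods {
		for _, n := range names {
			seen[n] = struct{}{}
		}
	}
	return len(seen)
}

// maxContainersPerPod returns the high-water mark: the largest number of
// containers found in any single pod.
func maxContainersPerPod(pods podContainers) int {
	highWater := 0
	for _, names := range pods {
		if len(names) > highWater {
			highWater = len(names)
		}
	}
	return highWater
}

// perContainerMinimum splits a pod-level minimum evenly across n containers.
// Values of n below 1 are treated as 1 to avoid dividing by zero.
func perContainerMinimum(podMinimumMilliCPU int64, n int) int64 {
	if n < 1 {
		n = 1
	}
	return podMinimumMilliCPU / int64(n)
}

func main() {
	// Mid-rollout: the old pod still runs "sidecar-old", the new one "sidecar-new".
	pods := podContainers{
		"app-old": {"app", "sidecar-old"},
		"app-new": {"app", "sidecar-new"},
	}
	fmt.Println(perContainerMinimum(300, uniqueContainerNames(pods))) // 100
	fmt.Println(perContainerMinimum(300, maxContainersPerPod(pods)))  // 150
}
```

With three unique names but only two containers in any one pod, dividing by the name count shrinks each share from 150 to 100, which is the "smaller values than we should actually have" effect described above.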

Previously we were only cleaning checkpoints after something happened to the VPA or the targetRef, and so when a container got renamed the checkpoint would stick around forever. Since we're trying to clean up the aggregates immediately now, we need to force the checkpoint garbage collection to clean up any checkpoints that don't have matching aggregates. If the checkpoints did get loaded back in after a restart, PruneContainers() would take the aggregates back out, but we probably shouldn't leave the checkpoints out there.
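As a rough sketch of the "no matching aggregate" check, assuming checkpoints can be identified by a VPA name plus a container name: the `checkpoint` type and `staleCheckpoints` helper below are hypothetical and only illustrate the selection logic, not the actual checkpoint writer or API objects.

```go
package main

import "fmt"

// checkpoint is a hypothetical stand-in for a stored per-container checkpoint.
type checkpoint struct {
	VPAName       string
	ContainerName string
}

// staleCheckpoints returns the checkpoints whose container no longer has an
// aggregate under the same VPA. liveAggregates maps VPA name -> container
// name -> true for every aggregate that survived pruning.
func staleCheckpoints(checkpoints []checkpoint, liveAggregates map[string]map[string]bool) []checkpoint {
	var stale []checkpoint
	for _, cp := range checkpoints {
		// Indexing a missing (nil) inner map yields false, so a VPA with no
		// aggregates at all marks every one of its checkpoints as stale.
		if !liveAggregates[cp.VPAName][cp.ContainerName] {
			stale = append(stale, cp)
		}
	}
	return stale
}

func main() {
	checkpoints := []checkpoint{
		{VPAName: "my-vpa", ContainerName: "app"},
		{VPAName: "my-vpa", ContainerName: "sidecar-old"},
	}
	live := map[string]map[string]bool{
		"my-vpa": {"app": true, "sidecar-new": true},
	}
	for _, cp := range staleCheckpoints(checkpoints, live) {
		fmt.Println("would delete checkpoint for", cp.VPAName, cp.ContainerName)
	}
}
```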

Previously we were letting the rate-limited garbage collector clean up the aggregate states, and that works really well in most cases. But when the list of containers in a pod changes, due to either the removal or the rename of a container, the aggregates for the old containers stick around forever and cause problems. To get around this, this change marks all existing aggregates/initial aggregates in the list for each VPA as "not under a VPA" every time before we LoadPods(), and then LoadPods() re-marks the aggregates as "under a VPA" for all the ones that are still there. That lets us easily prune the stale container aggregates that are still marked as "not under a VPA" but are still wrongly in the VPA's list. This does leave the ultimate garbage collection to the rate-limited garbage collector, which should be fine; we just needed the stale entries removed from the per-VPA lists so they didn't affect VPA behavior.
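The mark-and-sweep idea can be sketched roughly as follows. This is a simplified model under stated assumptions: `vpaState`, `aggregateState`, and the method names are hypothetical, aggregates are keyed only by container name, and the real per-VPA structures in the recommender are more involved.

```go
package main

import (
	"fmt"
	"sort"
)

// aggregateState is a hypothetical per-container aggregate with a mark bit.
type aggregateState struct {
	underVPA bool
}

// vpaState is a hypothetical per-VPA list of aggregates, keyed by container name.
type vpaState struct {
	aggregates map[string]*aggregateState
}

func newVPAState(containerNames ...string) *vpaState {
	v := &vpaState{aggregates: map[string]*aggregateState{}}
	for _, n := range containerNames {
		v.aggregates[n] = &aggregateState{underVPA: true}
	}
	return v
}

// markAllNotUnderVPA is the "mark" phase, run before pods are loaded.
func (v *vpaState) markAllNotUnderVPA() {
	for _, a := range v.aggregates {
		a.underVPA = false
	}
}

// observePod re-marks (or creates) the aggregate for every container that
// still exists in a loaded pod.
func (v *vpaState) observePod(containerNames []string) {
	for _, n := range containerNames {
		a, ok := v.aggregates[n]
		if !ok {
			a = &aggregateState{}
			v.aggregates[n] = a
		}
		a.underVPA = true
	}
}

// sweep removes every aggregate that was not re-marked and returns the
// removed container names in sorted order.
func (v *vpaState) sweep() []string {
	var removed []string
	for n, a := range v.aggregates {
		if !a.underVPA {
			delete(v.aggregates, n) // deleting during range is safe in Go
			removed = append(removed, n)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	v := newVPAState("app", "sidecar-old")
	v.markAllNotUnderVPA()
	v.observePod([]string{"app", "sidecar-new"}) // the sidecar was renamed
	fmt.Println(v.sweep())                       // [sidecar-old]
	fmt.Println(len(v.aggregates))               // 2
}
```

In this sketch the sweep only detaches entries from the per-VPA map; anything held elsewhere (the cluster-wide aggregate map mentioned in the review) would still be left to the rate-limited garbage collector.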

k8s-ci-robot · 2024-04-22T15:40:03Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jkyros
Once this PR has been reviewed and has the lgtm label, please assign kgolab for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

vertical-pod-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2024-04-22T15:40:07Z

Hi @jkyros. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kwiesmueller · 2024-04-29T17:53:57Z

vertical-pod-autoscaler/pkg/recommender/input/cluster_feeder.go

+	// TODO(jkyros): This only removes the container state from the VPA's aggregate states, there
+	// is still a reference to them in feeder.clusterState.aggregateStateMap, and those get
+	// garbage collected eventually by the rate limited aggregate garbage collector later.
+	// Maybe we should clean those up here too since we know which ones are stale?


Is it a lot of extra work to do that? Do you see any risks doing it here?

jkyros · 2024-05-02 (edited)

No, I don't think it's a lot of extra work, it should be reasonably cheap to clean them up here since it's just deletions from the other maps if the keys exist, I just didn't know all the history.

It seemed possible at least that we were intentionally waiting to clean up the aggregates so if there was an unexpected hiccup we didn't just immediately blow away all that aggregate history we worked so hard to get? (Like maybe someone oopses, deletes their deployment, then puts it back? Right now we don't have to start over -- the pods come back in, find their container aggregates, and resume. But if I clean them up here, we have to start over...)

kwiesmueller · 2024-04-29T17:58:31Z

vertical-pod-autoscaler/pkg/recommender/model/cluster.go

+// the correct number and not just the number of aggregates that have *ever* been present. (We don't want minimum resources
+// to erroneously shrink, either)
+func (cluster *ClusterState) setVPAContainersPerPod(pod *PodState) {
+	for _, vpa := range cluster.Vpas {


I'm wondering if there is already a place where this logic could go so we don't have to loop over all VPAs for every pod again here.
In large clusters with a VPA to Pod ratio that's closer to 1 this could be a little wasteful.

jkyros · 2024-05-02

Hmm, yeah, I struggled with finding a less expensive way without making too much of a mess. Unless I'm missing something (and I might be) we don't seem to have a VPA <--> Pod map -- probably because we didn't need one until now? At the very least I think I should gate this to only run if the number of containers in the pod is > 1.

Like, I think our options are:

1. update the VPA as the pods roll through (which requires me to find the VPA for each pod like I did here) or
2. count the containers as we load the VPAs (but we load the VPAs before we load the pods, so we'd have to go through the pods again, so that doesn't help us)
3. have the VPA actually track the pods it's managing, something like this: jkyros@6ddc208 (could also just be an array of PodIDs and we could look up the state so we could save the memory cost of the PodState pointer, but you know what I mean)

I put it where I did (option 1) because at least LoadPods() was already looping through all the pods so we could freeload off the "outer" pod loop and I figured we didn't want to spend the memory on option 3. If we'd entertain option 3 and are okay with the memory usage, I can totally do that?

jkyros added 3 commits April 19, 2024 21:36

k8s-ci-robot added the do-not-merge/work-in-progress and kind/bug labels Apr 22, 2024

k8s-ci-robot requested review from krzysied and voelzmo April 22, 2024 15:40

k8s-ci-robot added the area/vertical-pod-autoscaler label Apr 22, 2024

k8s-ci-robot added the cncf-cla: yes and needs-ok-to-test labels Apr 22, 2024

k8s-ci-robot added the size/L label Apr 22, 2024

jkyros marked this pull request as ready for review April 22, 2024 17:56

k8s-ci-robot removed the do-not-merge/work-in-progress label Apr 22, 2024

k8s-ci-robot requested a review from kgolab April 22, 2024 17:57

kwiesmueller reviewed Apr 29, 2024
