VPA: When targetRef is a Rollout, VerticalPodAutoscalerCheckpoint history is reset during deployment #6730

kodmaskinen · 2024-04-18T06:56:11Z

Which component are you using?:
vertical-pod-autoscaler

What version of the component are you using?:
Component version: 1.0.0

What k8s version are you using (kubectl version)?:
1.29.1

kubectl version Output

$ kubectl version
Client Version: v1.29.4
Server Version: v1.29.1-eks-508b6b3

What environment is this in?:
EKS

What did you expect to happen?:
I expect the VPA to retain the history from earlier versions of the same Rollout.

What happened instead?:
VPA deletes the history from the VerticalPodAutoscalerCheckpoint during deployment of a new version using Argo Rollouts. As a result, the memory target is often initially set too low, which causes unnecessary OOM kills.

How to reproduce it (as minimally and precisely as possible):

Deploy Argo Rollouts in the cluster.
Create a Rollout object (using blue/green strategy).
Create a VPA object referencing the Rollout.
Wait a while until VerticalPodAutoscalerCheckpoint is populated.
Update the rollout and promote (if not using auto-promotion).
When rollout is finished, VerticalPodAutoscalerCheckpoint history is reset/deleted.

Anything else we need to know?:
This may be related to the issue mentioned in #5598.


voelzmo · 2024-04-22T12:34:36Z

Yeah, so deleting the VPACheckpoints is 100% related to what I described in #5598:

- The Pods for the new version in a Rollout are created before the Selector is changed to match them. As pointed out before, a Rollout only updates the Selector to match the new Pods after the new version is promoted.
- The VPA object points to the Rollout object, so the new Pods are not considered to be under the control of the VPA.
  - Note: as I described in #5598 ("VPA: The admission controller does not apply recommendations to pods when deploying using Argo Rollouts"), this is the reason why Pods for the new version don't get the same request values assigned as the old version has. You would need to build some additional mechanism which gets the current Pod requests (or the recommendations from the VPA Status) and puts them on the new version before rolling in the update. Most likely this is the cause of the OOMKills you're seeing?
- Checkpoints are maintained for Pods under VPA control by creating an individual VPACheckpoint per VPA-Container pair.
- VPA recognizes the new Pods, doesn't find them to be under the control of a VPA, and does not create a checkpoint yet.
- Internally, the recommender gathers usage metric samples for the new Containers and creates a new Aggregation.
- When an Aggregation is created for the first time, the recommender checks whether it matches an existing VPA and, if so, adds it to the VPA model.
- For a Rollout, this check is false, in contrast to e.g. a Deployment, where the selectors are updated when the new ReplicaSet is created.
- Once the Rollout Selector has been switched, VPACheckpoints are created, but only from the new Aggregations. They're never merged with the existing Aggregations.
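
The sequence above can be illustrated with a small toy model. This is only a sketch of the described behaviour, not the recommender's actual code: `ToyVpa`, `ToyRecommender`, `switch_selector`, the `version` label and the sample values are all made up for illustration, and selectors are simplified to plain equality matches.

```python
from dataclasses import dataclass, field


def matches(selector: dict, labels: dict) -> bool:
    """Simplified equality-based label match (an empty selector matches nothing here)."""
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


@dataclass
class ToyVpa:
    selector: dict
    # Aggregations currently linked to this VPA: label key -> memory samples (MiB).
    aggregations: dict = field(default_factory=dict)

    def checkpoint(self) -> list:
        """Samples that would end up in this VPA's checkpoint."""
        return sorted(s for samples in self.aggregations.values() for s in samples)


class ToyRecommender:
    def __init__(self, vpa: ToyVpa):
        self.vpa = vpa
        self.all_aggregations = {}  # every aggregation, linked to a VPA or not

    def add_sample(self, pod_labels: dict, memory_mib: int) -> None:
        key = tuple(sorted(pod_labels.items()))
        if key not in self.all_aggregations:
            self.all_aggregations[key] = []
            # The match against the VPA is only evaluated when the
            # aggregation is created for the first time.
            if matches(self.vpa.selector, pod_labels):
                self.vpa.aggregations[key] = self.all_aggregations[key]
        self.all_aggregations[key].append(memory_mib)

    def switch_selector(self, selector: dict) -> None:
        # After promotion only aggregations matching the new selector are
        # linked; the old ones are not merged in.
        self.vpa.selector = selector
        self.vpa.aggregations = {
            key: samples
            for key, samples in self.all_aggregations.items()
            if matches(selector, dict(key))
        }


def simulate():
    blue = {"app": "web", "version": "blue"}    # hypothetical labels
    green = {"app": "web", "version": "green"}
    rec = ToyRecommender(ToyVpa(selector=blue))
    rec.add_sample(blue, 500)
    rec.add_sample(blue, 520)
    rec.add_sample(green, 200)  # green Pods exist, selector still points at blue
    before = rec.vpa.checkpoint()
    rec.switch_selector(green)  # promotion
    rec.add_sample(green, 210)
    return before, rec.vpa.checkpoint()
```

In this model `simulate()` returns `([500, 520], [200, 210])`: the history gathered for the old version is gone from the checkpoint once the selector points at the new Pods.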

Hope that explains it a bit.

In general, it seems that the way Rollouts are designed is pretty incompatible with how VPA currently works. I guess that's also one of the reasons why e.g. knative doesn't have VPA support: it is pretty hard to integrate with a process that rolls out new versions by first creating the Pods and only later switching and updating the selector.
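
For the "additional mechanism" mentioned in the note above, one possible shape is a step that copies the current VPA targets into the Rollout's pod template before the update is applied. The following is a minimal sketch that works on plain dictionaries; `seed_requests_from_vpa` is a hypothetical helper, not part of VPA or Argo Rollouts, and fetching the objects from the cluster and writing the result back is left out. It assumes the VPA status has the usual `recommendation.containerRecommendations[].target` layout.

```python
import copy


def seed_requests_from_vpa(pod_template: dict, vpa_status: dict) -> dict:
    """Return a copy of pod_template whose container requests use the VPA targets.

    Containers without a recommendation are left untouched.
    """
    recommendations = {
        rec["containerName"]: rec.get("target", {})
        for rec in vpa_status.get("recommendation", {}).get("containerRecommendations", [])
    }
    seeded = copy.deepcopy(pod_template)  # never mutate the caller's template
    for container in seeded["spec"]["containers"]:
        target = recommendations.get(container["name"])
        if target:
            requests = container.setdefault("resources", {}).setdefault("requests", {})
            requests.update(target)
    return seeded


# Example values are made up for illustration.
template = {
    "spec": {
        "containers": [
            {"name": "app", "resources": {"requests": {"memory": "128Mi"}}},
            {"name": "sidecar"},
        ]
    }
}
status = {
    "recommendation": {
        "containerRecommendations": [
            {"containerName": "app", "target": {"cpu": "250m", "memory": "600Mi"}}
        ]
    }
}
seeded_template = seed_requests_from_vpa(template, status)
```

Here the `app` container in `seeded_template` ends up with the recommended `cpu` and `memory` requests, while `sidecar` and the original `template` stay unchanged.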

voelzmo · 2024-04-22T12:35:29Z

/remove-kind bug
/kind support

kodmaskinen · 2024-04-22T14:30:07Z

Thanks for the explanation!

It seems to me like it would work if VPA treated a Rollout more like a Deployment and used the .spec.Selector instead of the .status.Selector. It would, however, need to handle the case where a Rollout references a Deployment in .spec.workloadRef, and in that case get the selector from the .spec.Selector of the Deployment.
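
To make the proposal concrete, here is a rough sketch of the selector lookup it describes. It is not how VPA resolves selectors today; `resolve_rollout_selector` and the `get_deployment` callback are hypothetical names, and the objects are plain dictionaries using the lowercase field names from the manifests (`spec.selector`, `spec.workloadRef`).

```python
from typing import Callable, Optional


def resolve_rollout_selector(
    rollout: dict,
    get_deployment: Callable[[str, str], Optional[dict]],
) -> Optional[dict]:
    """Pick the selector from the spec instead of the status.

    If the Rollout references a Deployment via spec.workloadRef, the
    Deployment's spec.selector is used; otherwise the Rollout's own.
    """
    spec = rollout.get("spec", {})
    ref = spec.get("workloadRef")
    if ref and ref.get("kind") == "Deployment":
        namespace = rollout.get("metadata", {}).get("namespace", "default")
        deployment = get_deployment(namespace, ref["name"])
        if deployment is None:
            return None  # referenced Deployment not found
        return deployment.get("spec", {}).get("selector")
    return spec.get("selector")


# Example objects are made up for illustration.
deployments = {
    ("prod", "web"): {"spec": {"selector": {"matchLabels": {"app": "web"}}}},
}
lookup = lambda namespace, name: deployments.get((namespace, name))

plain_rollout = {
    "metadata": {"namespace": "prod"},
    "spec": {"selector": {"matchLabels": {"app": "api"}}},
}
ref_rollout = {
    "metadata": {"namespace": "prod"},
    "spec": {"workloadRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "web"}},
}
plain_selector = resolve_rollout_selector(plain_rollout, lookup)
ref_selector = resolve_rollout_selector(ref_rollout, lookup)
```

With these inputs `plain_selector` is the Rollout's own `matchLabels` for `app: api`, and `ref_selector` is taken from the referenced Deployment (`app: web`).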

kodmaskinen added the kind/bug Categorizes issue or PR as related to a bug. label Apr 18, 2024

k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 22, 2024
