Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VPA: When targetRef is a Rollout, VerticalPodAutoscalerCheckpoint history is reset during deployment #6730

Open
kodmaskinen opened this issue Apr 18, 2024 · 3 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@kodmaskinen
Copy link

kodmaskinen commented Apr 18, 2024

Which component are you using?:
vertical-pod-autoscaler

What version of the component are you using?:
Component version: 1.0.0

What k8s version are you using (kubectl version)?:
1.29.1

kubectl version Output
$ kubectl version
Client Version: v1.29.4
Server Version: v1.29.1-eks-508b6b3

What environment is this in?:
EKS

What did you expect to happen?:
I expect the VPA to retain the history from earlier versions of the same Rollout.

What happened instead?:
VPA deletes the history from the VerticalPodAutoscalerCheckpoint during deployment of a new version using Argo Rollouts, which often means that the memory target is initially set to low which causes unnecessary OOM situations.

How to reproduce it (as minimally and precisely as possible):

  • Deploy Argo Rollouts in cluster.
  • Create a Rollout object (using blue/green strategy).
  • Create a VPA object referencing the Rollout.
  • Wait a while until VerticalPodAutoscalerCheckpoint is populated.
  • Update the rollout and promote (if not using auto-promotion).
  • When rollout is finished, VerticalPodAutoscalerCheckpoint history is reset/deleted.

Anything else we need to know?:
This may be related to the issue mentioned in #5598.

@kodmaskinen kodmaskinen added the kind/bug Categorizes issue or PR as related to a bug. label Apr 18, 2024
@voelzmo
Copy link
Contributor

voelzmo commented Apr 22, 2024

Yeah, so deleting the VPACheckpoints is 100% related to what I described in #5598:

  • The Pods for the new version in a Rollout are created before the Selector is changed to match them. As pointed out before, Rollout works by only updating the Selector to match the new Pods after promoting the new version
  • The VPA object points to the Rollout object, therefore, the new Pods are not determined to be under control of the VPA
  • Checkpoints are maintained for Pods under VPA control by creating an individual VPACheckpoint per VPA-Container pair
  • VPA recognizes those new Pods, doesn't find them to be under control of a VPA, does not create a checkpoint yet
  • Internally, the recommender gathers usage metric samples for the new Containers and creates a new Aggregation
  • When an Aggregation is created the first time, it is checked if it matches an existing VPA and if so, adds it to the VPA model
  • For a Rollout, this check is false, in contrast to e.g. a Deployment, where the selectors are updated when the new ReplicaSet is created.
  • Once the Rollout Selector has been switched, VPACheckpoints are created, but only from the new Aggregations. They're never merged with the existing Aggregations

Hope that explains it a bit.

In general, it seems that the way how Rollouts are designed, it is pretty incompatible how VPA works currently. I guess that's also one of the reasons, why e.g. knative doesn't have VPA support: it is pretty hard to integrate with the process of rolling out new versions by first creating the Pods and only later on switching and updating the selector.

@voelzmo
Copy link
Contributor

voelzmo commented Apr 22, 2024

/remove-kind bug
/kind support

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 22, 2024
@kodmaskinen
Copy link
Author

Thanks for the explanation!

It seems to me like it would work if VPA treated a Rollout more like a Deployment and used the .spec.Selector instead of the .status.Selector. It would, however, need to handle the case where a Rollout references a Deployment in .spec.workloadRef, and in that case get the selector from the .spec.Selector of the Deployment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/support Categorizes issue or PR as a support question.
Projects
None yet
Development

No branches or pull requests

3 participants