AEP-7862: Decouple Startup CPU Boost from VPA modes - updates #8175

Merged
merged 1 commit into from
Jun 26, 2025

Conversation

laoj2
Contributor

@laoj2 laoj2 commented May 27, 2025

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

Changes to AEP-7862:

  • Decouple the startup boost config from regular VPA so that the API is simpler for users who only want CPU boost without regular VPA updates (we no longer require a new VPA mode, and there is a way to specify the startup boost config at the VPA/workload level).
  • Added more YAML examples

#7862
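
For illustration, a minimal sketch of the decoupled shape described in the bullets above, using draft field names from the AEP (these are assumptions, not the final API):

```yaml
# Hypothetical: CPU startup boost with regular VPA updates turned off.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: boost-only-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"     # no regular VPA updates
  startupBoost:           # assumed placement at the VPA/workload level
    cpu:
      type: Factor
      value: "2"          # double the recommended CPU request at startup
```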

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. area/vertical-pod-autoscaler needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 27, 2025
@k8s-ci-robot
Contributor

Hi @laoj2. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 27, 2025
@laoj2
Contributor Author

laoj2 commented May 27, 2025

/assign @omerap12
/assign @adrianmoisey

@laoj2 laoj2 force-pushed the aep-7862-fixes-to-api branch from 954bbbc to 0b42a31 Compare May 28, 2025 17:18
@laoj2 laoj2 requested review from adrianmoisey and omerap12 May 28, 2025 20:18
@laoj2 laoj2 force-pushed the aep-7862-fixes-to-api branch from 0b42a31 to ea041d5 Compare June 6, 2025 16:07
@laoj2 laoj2 requested a review from raywainman June 6, 2025 16:08
@omerap12
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 16, 2025
@omerap12
Member

@shubhamrai1993, something is wrong with GitHub and I can't comment on your review, so I'll do it here.

This is indeed a valid concern. I don't see any reasonable solution for that to be honest.

force recreation of the pod on container crash so that it starts in a boosted state

I don’t think this is a good solution. Evicting the entire pod means that the pod would go through the full scheduling cycle again. In the worst-case scenario, the original node might no longer have available resources, and the pod’s spot could be taken. This could lead to the need to provision another node or preempt another pod - depending on the user’s configuration.

transition back to boosted resources if pod has reached unhealthy state or has crashed. This might not be possible anymore if the node where the pod was scheduled no longer has enough resources.

True. But if the pod has reached an unhealthy state, it should generally be evicted anyway. So in my opinion, the user should handle the eviction explicitly in such cases.

Maybe @laoj2 has some ideas?

@adrianmoisey
Member

Unrelated to the changes in this PR specifically, but what would happen in a scenario where the CPU resources for a pod have been scaled down by VPA after readiness/liveness probe success, and the pod subsequently crashes for some reason (OOM, transition to an unhealthy state)?

This is a problem because if a restart is attempted with the non-boosted (lower) resources, then the initial CPU spike will get throttled, and the pod might even get stuck in CrashLoopBackOff with repeated readiness/liveness failures.

This is indeed a valid concern. I don't see any reasonable solution for that to be honest.

I have some thoughts on this. The AEP is suggesting a CPU boost only. I don't think that CPU throttling can be the cause of an OOM (I guess it can, due to slower garbage collection). But if the non-boosted CPU is the cause of a problem, then that means the CPU limits are too low anyway, and the VPA needs to increase them.

Also, this feature isn't even created yet, and will be in alpha. It's possible we want to test this in the real world a bit.

@omerap12
Member

Agree with @adrianmoisey's comment above. We can open a discussion about how we want to handle memory boost when the time comes.

@laoj2
Contributor Author

laoj2 commented Jun 19, 2025

Unrelated to the changes in this PR specifically, but what would happen in a scenario where the CPU resources for a pod have been scaled down by VPA after readiness/liveness probe success, and the pod subsequently crashes for some reason (OOM, transition to an unhealthy state)?

In the current design, the containers get boosted via the admission-controller webhook. So, you're right that it'll require a pod recreation to get the resources boosted again.

For option (1), starting a container with lower requests and boosting them afterwards can be problematic. The resize request itself may be slow (O(10s) to finish?), which may be too long for some applications. And for some apps (like JVM-based apps), the container detection mechanism won't reflect resources increased after the application has already started, so the boosted resources will be wasted.

For option (2), we could consider adding an option to evict the pod upon a container restart. But I agree with @omerap12 and @adrianmoisey here: we can start simple, test this feature out in the real world, then we can consider tweaks to the logic and see if we need to handle these cases.

@laoj2 laoj2 force-pushed the aep-7862-fixes-to-api branch from ea041d5 to 61a7116 Compare June 23, 2025 20:58
Member

@omerap12 omerap12 left a comment


Overall LGTM, there are just a few notes from me.

updateMode: "Off"
startupBoost:
  cpu:
    value: 3.0
Member


As @adrianmoisey pointed out, using floats is discouraged: #8026 (comment).

Perhaps we should apply the same approach I suggested here: https://github.com/kubernetes/autoscaler/pull/8026/files#diff-b666c5599d02971edb0c08090acb86761b7c726c97c4c6c1bbf00ff5883a2364R87.

Contributor Author


Good catch! I changed this to a string. Value is a "factor" when Type=Factor or a quantity when Type=Quantity, so we need a string to give us enough flexibility to handle the quantity cases. We can handle the conversion logic in the controller code.
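
For illustration, a minimal sketch of what that controller-side conversion could look like, assuming hypothetical BoostType constants and field names (the actual implementation may differ):

```go
// Hedged sketch: interpret the string Value as either a multiplicative
// factor or an absolute quantity, depending on Type.
package controller

import (
	"fmt"
	"strconv"

	"k8s.io/apimachinery/pkg/api/resource"
)

type BoostType string

const (
	BoostTypeFactor   BoostType = "Factor"
	BoostTypeQuantity BoostType = "Quantity"
)

// boostedCPU computes the boosted CPU request from the recommended request
// and a string value whose meaning depends on the boost type.
func boostedCPU(recommended resource.Quantity, t BoostType, value string) (resource.Quantity, error) {
	switch t {
	case BoostTypeFactor:
		factor, err := strconv.ParseFloat(value, 64)
		if err != nil {
			return resource.Quantity{}, fmt.Errorf("invalid factor %q: %w", value, err)
		}
		// Multiply the recommendation in milliCPU to limit float drift.
		boosted := resource.NewMilliQuantity(int64(float64(recommended.MilliValue())*factor), resource.DecimalSI)
		return *boosted, nil
	case BoostTypeQuantity:
		q, err := resource.ParseQuantity(value)
		if err != nil {
			return resource.Quantity{}, fmt.Errorf("invalid quantity %q: %w", value, err)
		}
		return q, nil
	default:
		return resource.Quantity{}, fmt.Errorf("unknown boost type %q", t)
	}
}
```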

Member


Thanks! Seems OK to me.

@laoj2 laoj2 force-pushed the aep-7862-fixes-to-api branch from 61a7116 to eba5f1e Compare June 24, 2025 17:36
@laoj2 laoj2 requested a review from omerap12 June 24, 2025 17:38
@omerap12
Member

/lgtm
/assign @adrianmoisey

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 24, 2025
@adrianmoisey
Member

/title AEP-7862: Decouple Startup CPU Boost from VPA modes - updates

@adrianmoisey
Member

/retitle AEP-7862: Decouple Startup CPU Boost from VPA modes - updates

@k8s-ci-robot k8s-ci-robot changed the title Decouple Startup CPU Boost from VPA modes AEP-7862: Decouple Startup CPU Boost from VPA modes - updates Jun 25, 2025
multiplier for the recommended CPU request. For example, a value of `2` will
double the CPU request.
* `Quantity`: The `StartupBoost.CPU.Value` field will be interpreted as an
absolute CPU resource quantity (e.g., `"500m"`, `"1"`) to be used as the CPU
Member


I just want to sanity check this. If Type was set to Factor and Value was set to 500m, would that be invalid, and would the API reject it?

Contributor Author


That's right, @adrianmoisey. I updated the Validation section with this information.
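
A hedged sketch of that validation check (the function and field names are assumptions): when Type is Factor, Value must parse as a plain number, so a quantity string like "500m" is rejected.

```go
// Hypothetical validation for the Factor case.
package validation

import (
	"fmt"
	"strconv"
)

func validateBoostValue(boostType, value string) error {
	if boostType == "Factor" {
		// A factor must be a bare number; quantity suffixes like "m" fail here.
		if _, err := strconv.ParseFloat(value, 64); err != nil {
			return fmt.Errorf("StartupBoost.CPU.Value %q is not a valid factor when Type=Factor", value)
		}
	}
	return nil
}
```

For example, validateBoostValue("Factor", "500m") would return an error, while validateBoostValue("Factor", "2") would pass.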

@adrianmoisey
Member

I'm generally happy with this
Pinging more folks for visibility: @voelzmo @raywainman @maxcao13

@laoj2 laoj2 force-pushed the aep-7862-fixes-to-api branch from eba5f1e to 4d5a01c Compare June 25, 2025 20:15
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 25, 2025
@laoj2 laoj2 force-pushed the aep-7862-fixes-to-api branch from 4d5a01c to ecb4297 Compare June 25, 2025 20:16
during the startup boost phase.
* [Optional] `StartupBoost.CPU.Duration`: if specified, it indicates for how
long to keep the pod boosted **after** it goes to `Ready`.
The new `StartupBoost` parameter will be added to both:
Member


Is adding the option to both the spec and the container resource policy simply for convenience?

Contributor Author


I expect that people will want to configure the CPU boost only for the main container of a pod. The container resource policy helps with that, so you don't get all of the pod's containers boosted.
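
A hedged example of what that could look like, with startupBoost under containerPolicies (the placement and field names are assumptions based on the AEP draft):

```yaml
# Hypothetical: boost only the "main" container, not the sidecars.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  resourcePolicy:
    containerPolicies:
      - containerName: main
        startupBoost:       # assumed per-container placement
          cpu:
            type: Factor
            value: "3"      # triple the recommended CPU request at startup
```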

Member


Makes sense to me, and looks good to me. :)

@adrianmoisey
Member

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2025
Member

@omerap12 omerap12 left a comment


Great work! Looking forward to the implementation of this feature!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianmoisey, laoj2, omerap12

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit af75d6e into kubernetes:master Jun 26, 2025
7 checks passed
@adrianmoisey
Member

Hey @kubernetes/sig-autoscaling-leads
In the spirit of #8277, could you give this a retrospective review?

@raywainman
Member

raywainman commented Jul 2, 2025

Overall LGTM. Just one comment on my end... (and apologies for not getting this in time)

I'm not a huge fan of overloading the value field for both the factor and quantity cases; it feels a bit icky to have to parse the string depending on the use case. It would also be really nice to be able to use the resource.Quantity type for the quantity case, for example.

Otherwise we will need to parse the string field to handle all the various ways someone could encode a CPU value (e.g. 500m, 0.5, 5x10^-1, etc.).

You could have two fields:

  • factorValue (type int)
  • quantity (type resource.Quantity)

Then we have some simple validation to make sure only one of these is used at a time.

I know the API is slightly uglier this way, but in my opinion it's a lot less ambiguous.

(This may have been discussed previously - I did a quick grep of the comments but couldn't find anything but curious what you all think).

cc @laoj2 @adrianmoisey @omerap12
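
A minimal sketch of the two-field shape proposed above; the field names come from the comment, and int32 (rather than plain int) is an assumption based on common Kubernetes API conventions:

```go
// Hedged sketch, not a final API.
package vpa

import "k8s.io/apimachinery/pkg/api/resource"

// StartupBoostCPU replaces the overloaded string Value with two typed,
// mutually exclusive fields.
type StartupBoostCPU struct {
	// FactorValue multiplies the recommended CPU request, e.g. 2 doubles it.
	FactorValue *int32 `json:"factorValue,omitempty"`
	// Quantity is an absolute CPU request to apply during boost, e.g. "500m".
	Quantity *resource.Quantity `json:"quantity,omitempty"`
}
```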

@adrianmoisey
Member

(This may have been discussed previously - I did a quick grep of the comments but couldn't find anything but curious what you all think).

It was discussed in #8175 (comment)

I think you're right though. I had misread the HPA Horizontalpodautoscaler.spec.metrics.resource.target field and assumed that a single value was overloaded, which is why I made that suggestion. However, I was wrong:

$ kubectl explain Horizontalpodautoscaler.spec.metrics.resource.target
GROUP:      autoscaling
KIND:       HorizontalPodAutoscaler
VERSION:    v2

FIELD: target <MetricTarget>


DESCRIPTION:
    target specifies the target value for the given metric
    MetricTarget defines the target value, average value, or average utilization
    of a specific metric

FIELDS:
  averageUtilization	<integer>
    averageUtilization is the target value of the average of the resource metric
    across all relevant pods, represented as a percentage of the requested value
    of the resource for the pods. Currently only valid for Resource metric
    source type

  averageValue	<Quantity>
    averageValue is the target value of the average of the metric across all
    relevant pods (as a quantity)

  type	<string> -required-
    type represents whether the metric type is Utilization, Value, or
    AverageValue

  value	<Quantity>
    value is the target value of the metric (as a quantity).

You could have two fields:

  • factorValue (type int)
  • quantity (type resource.Quantity)

Agreed, so in summary:

StartupBoost.CPU.Type - Could be Factor or Quantity. 
StartupBoost.CPU.factorValue
StartupBoost.CPU.quantity

Depending on which Type is selected, we need to enforce that its "value" counterpart is also set, so these are the only valid combinations:

StartupBoost.CPU.Type - Factor
StartupBoost.CPU.factorValue - 2
StartupBoost.CPU.Type - Quantity
StartupBoost.CPU.quantity - 200m

And if it's unset, it's equivalent to:

StartupBoost.CPU.Type - Factor
StartupBoost.CPU.factorValue - 1

OR

StartupBoost.CPU.Type - Quantity
StartupBoost.CPU.quantity - 0

I'm unsure if we want (or need) to set defaults for either of those.

@omerap12
Member

omerap12 commented Jul 2, 2025

Thanks @raywainman! I agree.
@adrianmoisey, IMO we don't need any defaults.

@adrianmoisey
Member

@laoj2 Just want to make sure you saw this discussion, sorry for having it post-merge

@laoj2
Contributor Author

laoj2 commented Jul 7, 2025

Thanks for the recommendations @raywainman @adrianmoisey.

@adrianmoisey, regarding your summary: I think we can skip the StartupBoost.CPU.Type field?

And only one of StartupBoost.CPU.quantity or StartupBoost.CPU.factorValue can be set at a time (and this can be enforced in the controller).

Yeah, we probably don't need defaults, and if we need to "disable" the feature for a single container, we can set StartupBoost.CPU.factorValue=1 in the ContainerPolicy.

I'll send a PR with the updates.
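
For illustration, a hedged sketch of the controller-side check described above: at most one of factorValue or quantity may be set. Field names are assumptions from this discussion, not the final API.

```go
// Hypothetical mutual-exclusion check enforced in the controller.
package vpa

import (
	"errors"

	"k8s.io/apimachinery/pkg/api/resource"
)

func validateStartupBoostCPU(factorValue *int32, quantity *resource.Quantity) error {
	if factorValue != nil && quantity != nil {
		return errors.New("at most one of StartupBoost.CPU.factorValue or StartupBoost.CPU.quantity may be set")
	}
	return nil
}
```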

@adrianmoisey
Member

regarding your summary. I think we can skip the StartupBoost.CPU.Type field?

It's a pattern I've noticed in the HPA.
It has Horizontalpodautoscaler.spec.metrics.type and also Horizontalpodautoscaler.spec.metrics.resource.target.type

Maybe it's worth asking in #sig-api-machinery (on Slack) what they recommend?

@raywainman
Member

Maybe it's worth asking in #sig-api-machinery (on Slack) what they recommend?

Good idea - then let's put together a quick PR to update the AEP with the outcome.

Thanks all!

@laoj2
Contributor Author

laoj2 commented Jul 9, 2025

Thanks folks! I asked #sig-api-machinery about this here: https://kubernetes.slack.com/archives/C0EG7JC6T/p1751910562504969
