AEP-7862: Decouple Startup CPU Boost from VPA modes - updates #8175
Conversation
Hi @laoj2. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign @omerap12
/ok-to-test
@shubhamrai1993, something is wrong with GitHub and I can't comment on your review, so I'll do it here. This is indeed a valid concern. To be honest, I don't see any reasonable solution for that.
I don’t think this is a good solution. Evicting the entire pod means that the pod would go through the full scheduling cycle again. In the worst-case scenario, the original node might no longer have available resources, and the pod’s spot could be taken. This could lead to the need to provision another node or preempt another pod - depending on the user’s configuration.
True. But if the pod has reached an unhealthy state, it should generally be evicted anyway. So in my opinion, the user should handle the eviction explicitly in such cases. Maybe @laoj2 has some ideas?
I have some thoughts on this. The AEP is suggesting a CPU boost only. I don't think that CPU throttling can be the cause of an OOM (I guess it can, due to slower garbage collection). But if the non-boosted CPU is the cause of a problem, then that means the CPU limits are too low anyway, and the VPA needs to increase them. Also, this feature hasn't even been implemented yet and will be in alpha; it's possible we want to test it in the real world a bit.
Agreed. We can open a discussion about how we want to handle memory boost when the time comes.
In the current design, the containers get boosted via the admission-controller webhook. So, you're right that it'll require a pod recreation to get the resources boosted again. For option (1), starting a container with lower requests and boosting them afterwards can be problematic. The resize request itself may be slow (O(10s) to finish?), which may be too long for some applications. And for some apps (like JVM-based apps), the container resource detection mechanism won't pick up resources that were increased after the application has already started, so the boosted resources will be wasted. For option (2), we could consider adding an option to evict the pod upon a container restart. But I agree with @omerap12 and @adrianmoisey here: we can start simple, test this feature out in the real world, and then consider tweaks to the logic and whether we need to handle these cases.
Overall lgtm, there are just a few notes from me.
updateMode: "Off"
startupBoost:
  cpu:
    value: 3.0
As @adrianmoisey pointed out, using floats is discouraged: #8026 (comment).
Perhaps we should apply the same approach I suggested here: https://github.com/kubernetes/autoscaler/pull/8026/files#diff-b666c5599d02971edb0c08090acb86761b7c726c97c4c6c1bbf00ff5883a2364R87.
Good catch! I changed this to `string`. Value is a "factor" when `Type=factor` or a quantity when `Type=quantity`, so we need this to be a string to give us enough flexibility to handle the quantity cases. So, we can handle the conversion logic in the controller code.
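For illustration, a minimal sketch of how the string-valued `value` could be used under the two `type` interpretations; field names and nesting are taken from this AEP discussion and are not a final API:

```yaml
# Sketch only: names follow the AEP discussion, not a released VPA API.
startupBoost:
  cpu:
    type: Factor
    value: "3"       # interpreted as a multiplier of the recommended CPU request
---
startupBoost:
  cpu:
    type: Quantity
    value: "500m"    # interpreted as an absolute CPU quantity
```

Keeping `value` a string means a single field can carry either form, with the parsing and conversion handled in the controller.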
Thanks! seems ok to me
/lgtm
/title AEP-7862: Decouple Startup CPU Boost from VPA modes - updates
/retitle AEP-7862: Decouple Startup CPU Boost from VPA modes - updates
multiplier for the recommended CPU request. For example, a value of `2` will
double the CPU request.
* `Quantity`: The `StartupBoost.CPU.Value` field will be interpreted as an
absolute CPU resource quantity (e.g., `"500m"`, `"1"`) to be used as the CPU
I just want to sanity check this: if Type was set to Factor and Value was set to 500m, would that be invalid, and would the API reject it?
That's right, @adrianmoisey. I updated the Validation section with this information.
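To make the validation point concrete, here is a hypothetical valid/invalid pair (again using the AEP's draft field names, not a final API):

```yaml
# Valid (sketch): Factor type paired with a plain multiplier.
startupBoost:
  cpu:
    type: Factor
    value: "2"
---
# Invalid (sketch): Factor type paired with a resource quantity;
# validation would reject this combination.
startupBoost:
  cpu:
    type: Factor
    value: "500m"
```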
I'm generally happy with this
during the startup boost phase.
* [Optional] `StartupBoost.CPU.Duration`: if specified, it indicates for how
long to keep the pod boosted **after** it goes to `Ready`.
The new `StartupBoost` parameter will be added to both:
Is adding the option to both the spec and the container resource policy simply for convenience?
I expect that people will want to configure the CPU boost only for the main container of a pod. The container resource policy helps with that, so you don't end up with all of the pod's containers boosted.
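As a rough sketch of that scenario (the `startupBoost` fields and their placement are assumptions based on this AEP, not a final API; the names `my-app` and `main` are hypothetical), boosting only the main container might look like:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                # hypothetical workload
  resourcePolicy:
    containerPolicies:
      - containerName: main     # only the main container gets boosted
        startupBoost:
          cpu:
            type: Factor
            value: "2"
            duration: 30s       # keep the boost 30s after the pod is Ready
      - containerName: "*"      # sidecars/other containers: no startupBoost set
```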
Makes sense to me, and looks good to me. :)
/approve
Great work! Looking forward to the implementation of this feature!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: adrianmoisey, laoj2, omerap12. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Hey @kubernetes/sig-autoscaling-leads
Overall LGTM. Just one comment on my end (and apologies for not getting this in time): I'm not a huge fan of overloading the `Value` field to carry either a factor or a quantity. Otherwise we will need to parse the string field to handle all the various ways someone could encode a CPU value (e.g. 500m, 0.5, 5x10^-1, etc.). You could have two fields: a `Factor` and a `Quantity`.
Then we have some simple validation to make sure only one of these is used at a time. I know the API is slightly uglier this way, but in my opinion it is a lot less ambiguous. (This may have been discussed previously; I did a quick grep of the comments but couldn't find anything, but I'm curious what you all think.)
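A minimal sketch of what this two-field shape might look like (field names here are assumptions for illustration only); validation would ensure that at most one of them is set per container:

```yaml
# Option A (sketch): boost expressed as a multiplier of the recommendation.
startupBoost:
  cpu:
    factor: 2
---
# Option B (sketch): boost expressed as an absolute CPU quantity (resource.Quantity).
startupBoost:
  cpu:
    quantity: 500m
# Setting both factor and quantity at once would be rejected by validation.
```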
It was discussed in #8175 (comment). I think you're right though; I had mis-read the HPA.
Agreed, so in summary:
Depending on which Type is selected, we need to enforce that its "value" counterpart is also set, so these are the only valid combinations:
And if it's unset, it's the equivalent of:
OR
I'm unsure if we want to (or need to?) set defaults to either of those?
Thanks @raywainman! I agree.
@laoj2 Just want to make sure you saw this discussion, sorry for having it post-merge.
Thanks for the recommendations @raywainman @adrianmoisey. @adrianmoisey, regarding your summary: I think we can skip the `Type` field, and only one of `Factor` or `Quantity` can be set. Yeah, we probably don't need defaults, and if we need to "disable" the feature for a single container, we can set that accordingly in the container's resource policy. I'll send a PR with the updates.
It's a pattern I've noticed in the HPA. Maybe it's worth asking in #sig-api-machinery (on Slack) what they recommend?
Good idea. Then let's put together a quick PR to update the AEP with the outcome. Thanks all!
Thanks folks! I asked #sig-api-machinery about this here: https://kubernetes.slack.com/archives/C0EG7JC6T/p1751910562504969
What type of PR is this?
/kind documentation
/kind feature
What this PR does / why we need it:
Changes to AEP-7862:
#7862