
Adding Alpha criteria for KEP to support IPPR & Pod Level Resources #5423


Merged

Conversation

@ndixita (Contributor) commented Jun 18, 2025

  • One-line PR description: Split the KEP from Pod-level resources #2837 to add the alpha criteria for in-place pod-level resources resize

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added the do-not-merge/work-in-progress and cncf-cla: yes labels Jun 18, 2025
@k8s-ci-robot added the kind/kep, sig/node, and size/XL labels Jun 18, 2025
@ndixita force-pushed the pod-level-resources-ippr-alpha branch from 4e712ae to 71e896c on June 19, 2025 04:57
@ndixita force-pushed the pod-level-resources-ippr-alpha branch from 71e896c to c561b22 on June 19, 2025 05:05
@ndixita marked this pull request as ready for review June 19, 2025 05:17
@k8s-ci-robot removed the do-not-merge/work-in-progress label Jun 19, 2025
@ndixita (Contributor Author) commented Jun 19, 2025

/assign @tallclair
/assign @soltysh

@ndixita force-pushed the pod-level-resources-ippr-alpha branch from c561b22 to 287a985 on June 19, 2025 06:05
3. Decrease pod-level cgroup (if needed)
4. Increase container resources

Note: As part of a resize operation, users will now be permitted to add or modify pod-level resources within the Pod specification. Upon successful update, the Kubelet will ensure its internal checkpoint for the Pod is updated to reflect these new resource definitions. Importantly, to optimize performance and prevent redundant operations, the Kubelet will only trigger an actual cgroup resize for the Pod's sandbox if the specified pod-level resources are not equal to the aggregated sum of the individual container-level resources.
@ndixita (Contributor Author)

@tallclair I couldn't find your comment because I changed the origin of my branch. I tried to address your question about allowing adding pod-level resources as a part of resize. PTAL

Member

We are going to allow adding & removing container-level resources. I feel like the use case is pretty weak, so I'm not too worried about this constraint, but it is a minor inconsistency.
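
For illustration, the skip condition described in the note above could be sketched in Go as follows. This is a minimal sketch, not the actual Kubelet implementation; the helper names (`aggregateContainerLimits`, `needsPodCgroupResize`) are hypothetical, and it assumes the pod-level `Resources` field on the PodSpec introduced by the Pod Level Resources feature.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// aggregateContainerLimits sums one resource across the pod's containers.
func aggregateContainerLimits(pod *v1.Pod, name v1.ResourceName) *resource.Quantity {
	total := resource.NewQuantity(0, resource.DecimalSI)
	for i := range pod.Spec.Containers {
		if q, ok := pod.Spec.Containers[i].Resources.Limits[name]; ok {
			total.Add(q)
		}
	}
	return total
}

// needsPodCgroupResize reports whether the pod sandbox cgroup would need an
// actual resize for this resource: per the note above, only when an explicit
// pod-level limit differs from the aggregated container-level limits.
func needsPodCgroupResize(pod *v1.Pod, name v1.ResourceName) bool {
	if pod.Spec.Resources == nil {
		return false // no explicit pod-level resources in the spec
	}
	podLimit, ok := pod.Spec.Resources.Limits[name]
	if !ok {
		return false
	}
	return podLimit.Cmp(*aggregateContainerLimits(pod, name)) != 0
}
```

Under this check, a pod whose explicit pod-level limits already equal the container aggregate skips the redundant sandbox cgroup write.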

@ndixita (Contributor Author) commented Jun 19, 2025

/assign @mrunalp

@soltysh (Contributor) left a comment

This is mostly ok, the two bits that I'd like to get clarified before approving are:

  1. defaulting
  2. Resources field in PodStatus.

@k8s-ci-robot (Contributor)

@ndixita: Can not set label lead-opted-in: Must be member in one of these teams: [release-team-enhancements release-team-leads sig-api-machinery-leads sig-apps-leads sig-architecture-leads sig-auth-leads sig-autoscaling-leads sig-cli-leads sig-cloud-provider-leads sig-cluster-lifecycle-leads sig-contributor-experience-leads sig-docs-leads sig-instrumentation-leads sig-k8s-infra-leads sig-multicluster-leads sig-network-leads sig-node-leads sig-release-leads sig-scalability-leads sig-scheduling-leads sig-security-leads sig-storage-leads sig-testing-leads sig-windows-leads]

In response to this:

/label lead-opted-in

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ndixita force-pushed the pod-level-resources-ippr-alpha branch from 287a985 to 0aacd00 on June 19, 2025 17:34
@ndixita force-pushed the pod-level-resources-ippr-alpha branch 2 times, most recently from 7b5a42b to 9445a26 on June 19, 2025 18:48
@ndixita (Contributor Author) commented Jun 19, 2025

This is mostly ok, the two bits that I'd like to get clarified before approving are:

  1. defaulting
  2. Resources field in PodStatus.

@soltysh added some details under the "Surfacing Pod Resource Requirements" heading for clarity. PTAL

@ndixita force-pushed the pod-level-resources-ippr-alpha branch from 9445a26 to 01f01ff on June 19, 2025 23:56
@ndixita force-pushed the pod-level-resources-ippr-alpha branch 2 times, most recently from 63f9b28 to ea2804e on June 20, 2025 05:38
@soltysh (Contributor) left a comment

/approve
the PRR

@k8s-ci-robot added the approved label Jun 20, 2025

controller to inspect running and pending pods to quickly determine their
effective resource requirements.

* The field will first be populated by the API server defaulting logic.
Member

There is a weird consequence of defaulting in the apiserver: if a pod is resized before it is scheduled, then the desired resources will be different. Do we care about this scenario?

Contributor

This would only be temporary until it is updated by the kubelet in generateAPIPodStatus, which presumably should happen quickly, right? I think that seems fine?

Member

Should happen quickly in most cases, but if the pod is unschedulable it could be long-lasting. It is also sort of a resize pending/in-progress state, but without the associated condition. The other question it raises is: should the scheduler schedule based on the desired, original, or max resources?

Contributor

What's the alternative? Leaving it empty until the kubelet gets to it?

@ndixita (Contributor Author) commented Jun 20, 2025

Should happen quickly in most cases, but if the pod is unschedulable it could be long-lasting.

I think it is better to have the scheduler set PodStatus.Resources after it binds the pod to a node in that case. If a pod is resized before the scheduler sets PodStatus.Resources, it will be nil, and controllers that need the pod's requests in that case could compute them from the pod spec using the resource helper.
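
A minimal sketch of that fallback, assuming the proposed pod-level PodStatus.Resources field; PodRequests from k8s.io/component-helpers/resource is the existing resource helper referenced above, and the function name here is illustrative:

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	resourcehelpers "k8s.io/component-helpers/resource"
)

// effectiveRequests returns a pod's effective resource requests: prefer the
// kubelet-populated pod-level status field once it is set, otherwise compute
// the value from the spec with the component resource helper.
func effectiveRequests(pod *v1.Pod) v1.ResourceList {
	// The proposed pod-level PodStatus.Resources field is nil until the
	// kubelet populates it, e.g. for pods that are not yet scheduled.
	if pod.Status.Resources != nil && pod.Status.Resources.Requests != nil {
		return pod.Status.Resources.Requests
	}
	// Fall back to computing from the spec; the helper accounts for init
	// containers and, when set, pod-level resources.
	return resourcehelpers.PodRequests(pod, resourcehelpers.PodResourcesOptions{})
}
```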

@natasha41575 (Contributor) commented Jun 20, 2025

I think I'm missing the context for the motivation of defaulting this value before it gets to the kubelet, so I can't really speak to the consequences of not doing that.

But I think:

Should happen quickly in most cases, but if the pod is unschedulable it could be long-lasting. It is also sort of a resize pending/in-progress state, but without the associated condition.

We're already having conversations about putting the kubelet in the scheduling path to deal with other resize / scheduling race conditions. If we prioritize that for 1.35 and include this in that discussion, the problems here could become much easier to resolve.

The other question it raises is, should the scheduler schedule based on the desired, original, or max resources?

The kubelet admits based on desired and wouldn't see the original, so the scheduler probably would need to do the same?

If a pod is resized before scheduler sets PodStatus.Resources, it will be nil and controllers that need the info about pod requests in that case could compute it from pod spec using the resource helper

If controllers are going to have to handle the case that PodStatus.Resources could be nil anyway, then I'm not seeing the benefit of having the apiserver or scheduler set PodStatus.Resources.

@ndixita (Contributor Author) commented Jun 20, 2025

I think we are converging towards the same thing. If we want to address the other resize/scheduling race conditions, it makes sense for the kubelet to set PodStatus.Resources. The idea behind this field is to allow controllers to inspect scheduled pods to quickly determine their effective resource requirements.

For unscheduled pods, controllers can rely on the component resource helpers to determine the requests from the PodSpec object.

Agreed that there is not much benefit to having the scheduler set PodStatus.Resources in that case. It would only help users understand what defaults the API server applies to the PodSpec before the pod is scheduled, if that's needed, but it comes with downsides with respect to the other resize/scheduling race conditions.

Member

Leaving PodStatus.Resources unset for now sounds good. The Kubelet owns this field, so I don't think the scheduler should touch it (especially not after binding to the node). It's really only safe for another component to set it on create; after that, the Kubelet should be the only writer.

The motivation for defaulting it is if any components want to use this field instead of the component-helper to calculate pod-level resource totals. If we don't set it in defaulting, that means that it can't be used for admission (e.g. quota controls), or any controllers that run before the pod is started.

@ndixita force-pushed the pod-level-resources-ippr-alpha branch from ea2804e to e30e963 on June 20, 2025 18:46
@ndixita force-pushed the pod-level-resources-ippr-alpha branch 3 times, most recently from 3023a9f to 3eec5cd on June 20, 2025 19:00
@natasha41575 (Contributor) left a comment

LGTM

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp, natasha41575, ndixita, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ndixita force-pushed the pod-level-resources-ippr-alpha branch from 3eec5cd to c7b276c on June 20, 2025 19:18
@ndixita force-pushed the pod-level-resources-ippr-alpha branch from c7b276c to 0e02825 on June 20, 2025 20:24
@ndixita requested a review from tallclair June 20, 2025 20:26
@tallclair (Member) left a comment

/lgtm

@k8s-ci-robot added the lgtm label Jun 20, 2025
@k8s-ci-robot merged commit acca58c into kubernetes:master Jun 20, 2025
4 checks passed
@k8s-ci-robot added this to the v1.34 milestone Jun 20, 2025