
Pass down resources to CRI #4113

Open
wants to merge 12 commits into master

Conversation

marquiz
Contributor

@marquiz commented Jun 28, 2023

  • One-line PR description: KEP for extending the CRI API to pass down unmodified resource information from the kubelet to the CRI runtime.
  • Other comments:

Co-authored-by: Antti Kervinen <antti.kervinen@intel.com>
@k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jun 28, 2023
@k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 28, 2023
@marquiz
Contributor Author

marquiz commented Jun 28, 2023

/cc @haircommander @mikebrow @zvonkok @fidencio @kad

@k8s-ci-robot
Contributor

@marquiz: GitHub didn't allow me to request PR reviews from the following users: fidencio.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @haircommander @mikebrow @zvonkok @fidencio @kad

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines 289 to 290
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 2;
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 3;
Contributor

should the keys here be a special type instead of unstructured?

Contributor Author

I don't think it's possible to have something like `type ResourceName string` in protobuf: proto3 map keys are restricted to integral and string scalar types, and there are no named type aliases. Please correct me if I'm wrong.

Contributor Author

ping @haircommander, are you satisfied with the reply (close as resolved)?

@marquiz mentioned this pull request Jul 18, 2023
@marquiz
Contributor Author

marquiz commented Jul 18, 2023

/retitle Pass down resources to CRI

@k8s-ci-robot changed the title from "KEP: Initial version of the Pass down resources to CRI" to "Pass down resources to CRI" Jul 18, 2023
@bart0sh moved this from Triage to Needs Reviewer in SIG Node PR Triage Jul 20, 2023
@zvonkok

zvonkok commented Aug 2, 2023

@marquiz We need to check how this will work with DRA and CDI devices, i.e. whether the resource claim name alone gives us enough information to know which devices need to be added to the sandbox.

@zvonkok

zvonkok commented Aug 2, 2023

@marquiz There is already some code for sandbox sizing (accumulation of CPU and memory resources) that we leverage in Kata, see kubernetes/kubernetes#104886 for reference. What are the plans for this interface: deprecate it or keep it?

@zvonkok

zvonkok commented Aug 2, 2023

@bergwolf @egernst FYI

Contributor

@elezar left a comment

Thanks @marquiz.

It would be good to get more concrete details on the use cases that this would enable.
There is also the question of complex devices that are managed by device plugins where there isn't a clear mapping from the resources entry (e.g. vendor.com/xpu: 1) to the resources added to the container, or DRA where the associated resources.requests.claims entry is not mentioned.


#### Story 3

As a cluster administrator, I want to install an OCI hook/runc wrapper/NRI
Contributor

Could you expand on this use case? How does extending the CRI translate to modifications in the OCI runtime specification which is interpreted by runc (or wrappers)?

Contributor Author

The CRI changes (in this KEP) would not directly translate to anything in the OCI config. They are just "informational": a hook/wrapper/plugin can use the data to tweak the OCI config, say, to do customized CPU pinning. I'll put some more flesh on this section...

Contributor Author

@elezar I updated Story 3, PTAL

Contributor

Thanks.

Contributor Author

Resolved?

requests:
  cpu: 100m
  memory: 100M
  vendor.com/xpu: 1
Contributor

For clarification: This does not indicate the properties of the resource that was actually allocated for a container requesting one of these devices?

Contributor Author

That's very much true. I think I'll add a note about this in the KEP somewhere.

Contributor Author

@elezar I added a note about device plugin resources after this example. WDYT?

keps/sig-node/4112-passdown-resources-to-cri/README.md (outdated, resolved)
WindowsPodSandboxConfig windows = 9;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 10;
Member

@MikeZappa87 since you shared recently something along these lines for the networking capabilities, this KEP also means to interface with NRI

@zvonkok

zvonkok commented Aug 3, 2023

Another point to consider is how we're going to integrate or not these enhancements with the new containerd Sandbox API.

@marquiz
Contributor Author

marquiz commented Aug 3, 2023

There is already some code for sandbox sizing, accumulation of resources CPU and Memory for reference: kubernetes/kubernetes#104886 that we leverage in Kata, what are the plans for this interface, deprecate or keep it?

@zvonkok that one covers just the native resources, and it gives them in an "obfuscated" form, i.e. without telling the actual requests/limits (plus it's for Linux resources only). I think we wouldn't, or even couldn't, touch that interface, i.e. we'd keep it.

@zvonkok

zvonkok commented Aug 3, 2023

@zvonkok

zvonkok commented Sep 1, 2023

Since the DevicePlugin API supports CDI devices with this KEP: #4011, we should try to add more restrictions and requirements on how we want to design this passthrough interface. @marquiz FYI

zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Mar 12, 2024
With each release make sure we ship a GPU enabled rootfs/initrd

Fixes: kata-containers#6554

DependsOn: kata-containers#6664 kata-containers#6595 kata-containers#6993 kata-containers#6949

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>

gpu: reintroduce pcie_root_port and add pcie_switch_port

In Kubernetes we still do not have proper VM sizing
at sandbox creation level. This KEP tries to mitigate
that: kubernetes/enhancements#4113 but this can take
some time until Kube and containerd or other runtimes
have those changes rolled out.

Before we used a static config of VFIO ports, and we
introduced CDI support which needs a patched containerd.
We want to eliminate the patched containerd in the GPU case
as well.

Fixes: kata-containers#8860

 SQUASH

runtimeclass example

fabricmanager
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Mar 14, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Mar 15, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 5, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 9, 2024
- support sidecar containers: instead of separate lists for init and
  regular containers, have one list and include the type of container
  (init, sidecar, regular)
- add notes about mounts and devices when describing changes to
  CreateContainer and UpdateContainerResources requests
- update description of kubelet: more accurate description of what
  information is included in each CRI request
- fix typos
- kep.yaml: update milestone
@marquiz
Contributor Author

marquiz commented Apr 9, 2024

Thanks @tallclair for the review

I'm concerned that this KEP as is forces the container runtime to reimplement too much of the container lifecycle, which in the best case puts a burden on CRI implementation maintainers, and in the worst case could slow down future Kubernetes feature development.

The goal is to not require any changes in existing CRI implementations. The CRI runtime can simply ignore the data if it doesn't need/want to pre-allocate or pre-optimize resources for the pod. The idea is to enable a kind of "forward lookup" into the future for those who need it. This probably needs to be communicated better in the proposal (and in the API, with comments and naming). Thoughts?

For example, the proposed API separates out init containers & regular containers, but sidecar containers blur those lines. Now, calculating the maximum resource requirements for the pod involves accounting for sidecar containers: https://github.com/kubernetes/kubernetes/blob/4a4f5dbc079e85e63f62178af962cb65bd60d987/pkg/api/v1/resource/helpers.go#L50. I don't think we should treat this as a 1-off change.

This is a very valid point. The proposal was now changed: instead of separate lists for init and regular containers, it now has one list that contains all containers, each element in the list including the type of container (init, sidecar or regular).

Why not have the Kubelet create a pod-level aggregated view of the resources? Similar to what is already done with the sandbox annotations, but without translating to the platform-specific types?

I believe that would cause gray hairs/problems in some scenarios, e.g. in VM sizing and CoCo. For example, how would you aggregate resource limits? Also, you could make better decisions in the case where an init container requests a lot of resources compared to the regular containers. In CoCo, knowing exactly what resources each container needs helps implement the principle of least privilege/smaller attack surface (no sharing of unnecessary mounts between containers, for example).

Ref e.g.: #4113 (comment)
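To illustrate the limit-aggregation problem, here is a small Python sketch (purely illustrative; the container roles and CPU numbers are hypothetical, not taken from the KEP) showing how a single pod-level sum over-provisions a VM compared to sizing from the per-container breakdown:

```python
def vm_cpu_estimates(init_limit, app_limit, sidecar_limit):
    """Two ways a VM-based runtime could size a sandbox from CPU limits
    (illustrative sketch: single resource, one container per role)."""
    # Pod-level aggregation: one number, all limits added up.
    naive_sum = init_limit + app_limit + sidecar_limit
    # Per-container data: the init container never runs concurrently
    # with the app container; only the sidecar overlaps both phases.
    true_peak = max(init_limit + sidecar_limit,
                    app_limit + sidecar_limit)
    return naive_sum, true_peak

# init: 4 CPUs, app: 2 CPUs, sidecar: 1 CPU
naive, peak = vm_cpu_estimates(4, 2, 1)
print(naive, peak)  # 7 5 -> the aggregate over-provisions by 2 CPUs
```

Any single pre-aggregated number forces one of these choices on the kubelet; passing the per-container spec lets the runtime pick the semantics it needs.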

zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 10, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 10, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 11, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 11, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 11, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 12, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 15, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 15, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 15, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 16, 2024
@marquiz
Contributor Author

marquiz commented Apr 16, 2024

I pushed an update last week but forgot to leave a comment:

  • sidecar containers: instead of separate lists for init and regular containers, have one list and include the type of container (init, sidecar, regular)
  • add notes about mounts and devices when describing changes to CreateContainer and UpdateContainerResources requests
  • update description of kubelet changes: more accurate description of what information is included in each CRI request
  • fix typos
  • kep.yaml: updated milestone to v1.31

zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 16, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 18, 2024
@tallclair
Member

/assign

zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 29, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request Apr 29, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request May 2, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request May 2, 2024
zvonkok added a commit to zvonkok/kata-containers that referenced this pull request May 2, 2024
@tallclair
Member

For example, the proposed API separates out init containers & regular containers, but sidecar containers blur those lines. Now, calculating the maximum resource requirements for the pod involves accounting for sidecar containers: https://github.com/kubernetes/kubernetes/blob/4a4f5dbc079e85e63f62178af962cb65bd60d987/pkg/api/v1/resource/helpers.go#L50. I don't think we should treat this as a 1-off change.

This is a very valid point. The proposal was now changed: instead of separate lists for init and regular containers, it now has one list that contains all containers, each element in the list including the type of container (init, sidecar or regular).

While sidecars did need to be addressed by the original proposal, it misses the big picture I was trying to raise here: exposing this pod lifecycle information into the container runtime will create friction for future k8s changes with pod lifecycle implications. Now any pod lifecycle change is potentially a breaking change to the runtime, so we need to manage runtime version skew in a way we didn't before. This is why I prefer to keep as much of the lifecycle logic in the Kubelet as we can.
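The sidecar-aware accounting in the linked helper can be sketched roughly as follows (a simplified illustration of the idea, for a single resource, ignoring pod overhead and in-place resize; not the actual kubelet code):

```python
def effective_pod_request(containers):
    """containers: (type, request) pairs in pod declaration order, where
    type is 'init', 'sidecar' (restartable init) or 'regular'."""
    long_running = 0   # regular + sidecar containers run together
    sidecars = 0       # sidecars keep running through later init steps
    init_peak = 0
    for ctype, req in containers:
        if ctype == 'regular':
            long_running += req
        elif ctype == 'sidecar':
            long_running += req
            sidecars += req
            init_peak = max(init_peak, sidecars)
        else:  # plain init container: runs alone with earlier sidecars
            init_peak = max(init_peak, req + sidecars)
    return max(long_running, init_peak)

# A sidecar started before a heavy init step counts on top of it:
print(effective_pod_request(
    [('sidecar', 1), ('init', 5), ('regular', 1)]))  # 6, not 5
```

This is exactly the kind of lifecycle knowledge the comment argues should stay in the kubelet rather than be re-implemented, and kept in sync, by every runtime.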

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status
Projects
SIG Node PR Triage
Needs Reviewer