KEP-5075: DRA: Consumable Capacity #5104


Merged (1 commit) on Jun 18, 2025

Conversation

@sunya-ch (Contributor) commented Jan 30, 2025

  • One-line PR description: Enable DRA API and scheduling to support a shared device allocation with consumable capacity.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 30, 2025
@sunya-ch sunya-ch marked this pull request as draft January 30, 2025 01:58
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 30, 2025
@k8s-ci-robot (Contributor):

Welcome @sunya-ch!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 30, 2025
@k8s-ci-robot (Contributor):

Hi @sunya-ch. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 30, 2025
@sunya-ch sunya-ch mentioned this pull request Jan 30, 2025
4 tasks
@pohly (Contributor) commented Feb 4, 2025

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Feb 4, 2025
@johnbelamaric (Member) left a comment:

Thanks, this is a good start. After reading this thoroughly, I think we can meet all the use cases with "per-device allocatable resources", and do not need to expose the "device provisioning" aspect to the K8s control plane. I think the driver may still do that, but it's actually not something the scheduler needs to care about.

I think if we repurpose this to just "per-device allocatable resources", we will not only meet these use cases, but we can meet some of the ones for modeling standard resources in DRA.

Please take a look and if you want to meet to discuss it, let me know.

-->
- Introduce the ability to allocate on-demand provisioned devices via DRA. This should cover the use cases of macvlan and ipvlan in a DRA driver for CNI, and virtual accelerator devices with an on-demand memory fraction.
- Enhance secondary networks with the capability to be dynamically allocated based on current availability, such as bandwidth.
- Leverage the enhancement of capacity consumption from [KEP-4815](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
Member:

I want to do something distinct from the capacity consumption in there. That consumption is 100% internal to the resource slice, and is about shared resources. In this case, we are talking about consumption by the claims. The reason capacity is separate from attributes was exactly to support this kind of consumption model.

Contributor Author:

@johnbelamaric I updated the third goal to

- Enable capacity field to be consumable.

This enhancement's system design introduces the terms `device source`, `provision limit`, `provision request`, and `provisioned device`. In addition to the existing device instance, a `device source` represents the root device or device specification used as a source to provision or generate a new device allocated to the Pod. The number of provisions per `device source` can be limited by the `provision limit`. Device sources are exposed by device vendors or provisioners via the Publish API.
Member:

Ok, I like this KEP, but I don't think it's necessary to introduce all this machinery. I am also not sure there is a need to "provision" a new device. Instead, I would suggest we can allocate a partial device and share the underlying device. That is, we can move toward "platform-mediated sharing of devices", rather than just the current mechanism of pointing multiple containers or pods at the same resource claim ("user-mediated sharing of devices").

In platform mediated sharing, we do not share resource claims, but instead we have separate resource claims which hold a partial allocation of the same underlying device from the same resource slice. Each of those resource claims contains the amount of allocation. This is analogous to how resources are consumed by pods on nodes; in this case, resources are consumed by claims on devices.

For some more background see this (outdated) document, and the prototype linked in there.

Consider if the capacity values in a device were considered allocatable. Then, you could have a resource claim for a portion of the device. The scheduler would be able to reconstruct "how much" of the device is consumed by aggregating all resource claims that are attached to that device.

Rather than a specific "provisioning limit" on the number of devices, the limit is hit when there are insufficient resources in capacity to satisfy the claim. This allows the limit to be hit based upon specific resources consumed.

With this, I think in fact we should be able to model "compute devices" that represent the CPU, memory, and other standard node resources. I think this would satisfy the requirements @catblade has for aligning CPU with GPU with NIC, for example.

WDYT? Would you be willing/able to rework this a bit and see if this idea can satisfy your needs? Pretty much, we need to be able to write claims that "allocate" from the capacity fields. In the minimal implementation, we wouldn't actually need a change in the ResourceSlice API, I don't think, only in the ResourceClaim API. In the case of things like virtual interfaces, the name of the created interface could be stored in the ResourceClaim.Status data.

For a more robust implementation, we would want the DeviceCapacity entries in the Capacities to be able to do things like block allocation ("you asked for 1 byte, but I can only allocate in 1Mi increments" - see the doc above). We may also need to be able to model different segments of resources, tagged with different topology, and be able to identify which types of resources are aggregatable across those topologies (for example, "8Gi is in numa0, 8Gi is in numa1 - if you ask for numa alignment and 4Gi, we give you a CPU and memory numa aligned, but if you ask for 10Gi we have to aggregate across the boundary"). Those are harder, maybe we start by writing it up without those, then wait until @catblade complains :)
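A rough sketch of the consumption model described above: capacity is treated as allocatable, claims draw it down, and the "limit" is hit exactly when remaining capacity is insufficient. All names and types here are illustrative stand-ins, not the proposed API:

```go
package main

import "fmt"

// capacityFits reports whether a request fits on a device, given the device's
// advertised capacity and the amounts already consumed by the resource claims
// allocated against it. The limit is hit when remaining capacity runs out,
// rather than via an explicit provisioning count.
func capacityFits(capacity, consumed, request map[string]int64) bool {
	for name, want := range request {
		total, ok := capacity[name]
		if !ok {
			return false // device does not advertise this capacity at all
		}
		if consumed[name]+want > total {
			return false // insufficient remaining capacity
		}
	}
	return true
}

func main() {
	gi := int64(1) << 30
	capacity := map[string]int64{"bandwidth": 10 * gi, "vlan": 1000}
	consumed := map[string]int64{"bandwidth": 9 * gi, "vlan": 3} // aggregated over existing claims
	fmt.Println(capacityFits(capacity, consumed, map[string]int64{"bandwidth": 1 * gi})) // fits exactly
	fmt.Println(capacityFits(capacity, consumed, map[string]int64{"bandwidth": 2 * gi})) // would exceed 10Gi
}
```

The scheduler-side aggregation of `consumed` over all claims attached to the device is what makes this equivalent to the "resources consumed by claims on devices" model described above.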

Member:

(by the way, there may well be use cases for "provisioning", I just think the ones listed so far don't need it...if you have more use cases, please do add them)


@johnbelamaric I think your comment is apt, and agree.

@sunya-ch (Contributor Author) commented Feb 12, 2025

In platform mediated sharing, we do not share resource claims, but instead we have separate resource claims which hold a partial allocation of the same underlying device from the same resource slice. Each of those resource claims contains the amount of allocation. This is analogous to how resources are consumed by pods on nodes; in this case, resources are consumed by claims on devices.

I totally agree with this direction.

The objective of this KEP is to distinguish the case where the device allocated to the pod is actually a new device provisioned dynamically based on the master device. For instance, macvlan and ipvlan are new virtual network devices (e.g., net1) which will be provisioned, configured, and removed by the cni-dra-driver based on the selected master device (e.g., eth1).

However, at the same time, I agree that we can consider provision limit as one capacity, and leave all provisioned data (specific to the new device) to the resourceclaim status.

Is it possible to have an explicit way to clarify the type of sharing in the ResourceClaim status, such as for a provisioned device, a pre-sliced device like MIG, or a dynamically-sliced device like a virtual GPU?

Member:

The objective of this KEP is to distinguish the case where the device allocated to the pod is actually a new device provisioned dynamically based on the master device. For instance, macvlan and ipvlan are new virtual network devices (e.g., net1) which will be provisioned, configured, and removed by the cni-dra-driver based on the selected master device (e.g., eth1).

While at the actual node layer, this is what is happening, it's not clear to me that the K8s control plane needs to model this. It's not clear to me there needs to be awareness at the ResourceSlice/ResourceClaim level of these devices that are provisioned just to contain the allocation of a shared underlying device. It might be necessary, I am just not 100% sure yet and want to explore the options.

Is it possible to have an explicit way to clarify the type of sharing in the ResourceClaim status, such as for a provisioned device, a pre-sliced device like MIG, or a dynamically-sliced device like a virtual GPU?

I think we are probably going to need a way to indicate that a device can be shared by platform-mediated sharing. Earlier versions of DRA had a Shareable flag to indicate that a device could be shared by user-mediated sharing, but we decided it was unnecessary (since the user controls this anyway). But if the platform needs to manage it, then the driver author may need to let K8s know whether a device is shareable. More to explore!

devices:
  provisions:
  - name: net1
    deviceClassName: vlan-cni.networking.x-k8s.io
Member:

I think you'll also need to attach some config to specify the VLAN ID, for example

Contributor Author:

Yes, the CNI configuration will also be required in the ResourceClaim, per @LionelJouin's design in cni-dra-driver:
https://github.com/kubernetes-sigs/cni-dra-driver/blob/eb94de8c61835de1f13eefe700de9d9d89615531/docs/design/device-configuration-api.md#design

I will add the reference in this KEP too.

- Introduce the ability to allocate on-demand provisioned devices via DRA. This should cover the use cases of macvlan and ipvlan in a DRA driver for CNI, and virtual accelerator devices with an on-demand memory fraction.

Suggestions for grammar/clarity: Introduce the ability to allocate on-demand provisioned devices via DRA. This should cover the use cases of macvlan or ipvlan in a DRA driver for CNI and virtual accelerator devices with on-demand memory fraction.

@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from 6a57299 to 72f9f82 Compare February 12, 2025 04:34
@sunya-ch (Contributor Author) commented Feb 12, 2025

@johnbelamaric @catblade Thank you so much for taking care of this KEP.
I have updated the KEP to address some comments, such as filling in the KEP information I could and applying the rewrite suggestions first.
To reduce the number of notifications, I have put a 🚀 icon on the comments that I think I have addressed. Please consider clicking the resolve button if you agree.

For the comments related to design changes, I will also update the KEP correspondingly next.

@sunya-ch (Contributor Author) commented Feb 12, 2025

@johnbelamaric I have tried drafting your suggestion below. Please correct my understanding.

kind: ResourceSlice
...
spec:
  driver: cni.dra.networking.x-k8s.io
  devices:
  - name: eth1
    basic:
      attributes:
        name:
          string: "eth1"
      capacity:
        vlan: # <- define provision limit here
          quantity: 1000 
        bandwidth:
          quantity: 10Gi
---
kind: DeviceClass
metadata:
  name: provisionable.networking.x-k8s.io
---
kind: ResourceClaim
...
spec:
  devices: # <- move back the provision request to device request
    requests:
    - name: macvlan
      deviceClassName: provisionable.networking.x-k8s.io
      allocationMode: ExactCount
      count: 1
      resources:  # <- add resources in the request (like PodCapacityClaim?). This field does not exist yet; however, I think you also have an idea to make this available based on your suggestion.
      # Discussion point 1: should the following resource requests be per each count?
        vlan:
          quantity: 1 # <- I guess that you expect this resource consumption to be defined in the resource claim.
          # Discussion point 2: do we need an explicit vlan quantity here when it should never be other than 1?
        bandwidth:
          quantity: 1Gi
    config:
    - requests:
      - macvlan
      opaque:
        driver: cni.dra.networking.x-k8s.io
        parameters: # CNIParameters with the GVK, interface name and CNI Config (in YAML format).
          apiVersion: cni.networking.x-k8s.io/v1alpha1
          kind: CNI
          ifName: "net1"
          config:
            cniVersion: 1.0.0
            name: net1
            plugins:
            - type: macvlan
              master: eth0
              mode: bridge
              ipam:
                type: host-local
                ranges:
                - - subnet: 10.10.1.0/24

For the resource claim status, I would still like to propose a way to distinguish how the device is shared there. Can we instead add some optional field in the AllocatedDeviceStatus? For example,

type AllocatedDeviceStatus struct {
...
	// ProvisionData contains provisioning-related information specific to the provisioned device.
	//
	// This data may include provisioner-specific information.
	//
	// +optional
	ProvisionData *ProvisionData
}

@johnbelamaric (Member) commented Feb 12, 2025

@johnbelamaric I have tried drafting your suggestion below. Please correct my understanding.

Yes, what you propose in this comment is very much what I am suggesting. And your "discussion" points are definitely things I agree need more discussion and thought.

For the resource claim status, I would still like to propose a way to distinguish how the device is shared there. Can we instead add some optional field in the AllocatedDeviceStatus? For example,

Maybe yes? Let's ground it in the specific use cases.

@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from c861fb3 to 32e1c9c Compare February 18, 2025 03:09
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 18, 2025
@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from 32e1c9c to d0eb088 Compare February 18, 2025 03:11
@sunya-ch sunya-ch changed the title KEP-5075: DRA: Dynamic Device Provisioning Support KEP-5075: DRA: Consumable Capacity Feb 18, 2025
@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from d0eb088 to 3e4b3b4 Compare February 18, 2025 08:58
Comment on lines 176 to 214
requests:
- name: macvlan
  deviceClassName: vlan-cni.networking.x-k8s.io
  allocationMode: Shared
  resources:
    requests:
      bandwidth: "1Gi"
@aojea (Member) commented Feb 18, 2025

I do not understand how this is going to be related to eth1; what if there are eth2 and eth3 too?
The config is opaque, so we should not depend on opaque values; is that what this is suggesting?

Contributor Author:

The user can specify the device via a selector (for example, by the device's attribute name = eth1). Additionally, DRA allows a pod to dynamically allocate a device based on its available resources, such as bandwidth. The selected device information will also be sent to the driver, and the network driver will then configure the selected device.

Member:

I assume the idea here is to underspecify when possible, but be more specific if needed. So if there are also eth2 and eth3, then leaving it like this means that any one of eth1-3 will work. If that is not the case, then a selector is needed to further narrow down the eligible devices.

Member:

Yes exactly, Morten. We can "schedule" onto any of the underlying eth* devices. If we want a specific one - or one with a specific attribute (like network or trunking) - then that should be specified like any other selector.

and its attributes match the request's selectors and constraints.
"share" refers to the amount of resources reserved for or allocated to the claim request.
The newly added `allocatedShare` field in the `DeviceRequestAllocationResult` will be set when the allocation is successful.
A device can be shared among (allocated to) different pods.
Member:

in the same node?

Contributor Author:

Yes. A device on the same node can be shared by multiple pods. There will be multiple resource claims with the same device allocated, but with different resource shares.
I will update the sentence to add "in the same node" for clarification.

Member:

I think multi-node resources could have those pods split across nodes. I don't think there is a need to nail it down here.

@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from d8cbdfc to 94eea31 Compare June 5, 2025 05:32
Comment on lines 286 to 290
// If the capacity is consumable (i.e., a ClaimPolicy is specified),
// the consumed amount is deducted and cached in memory by the scheduler.
// Note that the remaining capacity is not reflected in the resource slice.
Member:

What happens if the scheduler restarts or fails over to another scheduler instance? Is it able to recover the right state?

Contributor Author:

@aojea Yes, it should be. It works in the same manner as DRA today: it gathers the allocated information from the resource claim statuses. The in-flight capacity information in the allocator is cached and processed the same way as the devices currently being allocated.
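The recovery path described here, rebuilding the allocator's cached view from resource claim statuses, could look roughly like this sketch; the types are illustrative stand-ins, not the actual scheduler code:

```go
package main

import "fmt"

// allocation loosely mirrors a per-claim allocation result: which device was
// allocated and how much of each capacity the claim consumes. Illustrative only.
type allocation struct {
	device   string
	consumed map[string]int64
}

// rebuildConsumed reconstructs an in-memory view of consumed capacity per
// device by aggregating over all allocated claims, analogous to how the DRA
// allocator recovers allocated-device state from claim statuses on restart.
func rebuildConsumed(claims []allocation) map[string]map[string]int64 {
	perDevice := map[string]map[string]int64{}
	for _, c := range claims {
		if perDevice[c.device] == nil {
			perDevice[c.device] = map[string]int64{}
		}
		for name, amount := range c.consumed {
			perDevice[c.device][name] += amount
		}
	}
	return perDevice
}

func main() {
	gi := int64(1) << 30
	claims := []allocation{
		{device: "eth1", consumed: map[string]int64{"bandwidth": 1 * gi}},
		{device: "eth1", consumed: map[string]int64{"bandwidth": 2 * gi}},
	}
	fmt.Println(rebuildConsumed(claims)["eth1"]["bandwidth"] == 3*gi) // true
}
```

Because the claim statuses are the source of truth, any scheduler instance can recompute the same aggregate after a restart or failover.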

@pohly (Contributor) left a comment:

My comments so far have been addressed, but I need to do another full pass.

//
// +optional
// +listType=atomic
Options []resource.Quantity
Contributor:

The KEP now lists proposals for the maximum size of slices. Those look good enough for merging the KEP. During the implementation, we need to verify that the overall size does not get too large and perhaps do some more tweaks (let's note that under the test plan). Then before beta we need to collect feedback on whether the resulting limits are sufficient for real-world use (covered by the beta criteria).


// CapacitySharingPolicyRange defines a valid range for consumable capacity values.
//
// - If the requested amount is less than Minimum, it is rounded up to the Minimum value.
Contributor:

Agreed. Counting usage by device class is very coarse and remains so with this KEP, so neither worse nor better 😐
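A sketch of the rounding behavior described in the quoted `CapacitySharingPolicyRange` doc comment, with a hypothetical step parameter added for the "1Mi increments" block-allocation idea discussed earlier in the thread; the function and parameter names are assumptions, not the final API:

```go
package main

import "fmt"

// roundRequest applies the range policy from the quoted doc comment: requests
// below the minimum are rounded up to the minimum. The step parameter models
// the hypothetical block-allocation idea ("you asked for 1 byte, but I can
// only allocate in 1Mi increments"); it is an assumption for illustration.
func roundRequest(request, minimum, step int64) int64 {
	if request < minimum {
		return minimum
	}
	if step > 0 {
		if over := (request - minimum) % step; over != 0 {
			request += step - over // round up to the next increment
		}
	}
	return request
}

func main() {
	mi := int64(1) << 20
	fmt.Println(roundRequest(1, mi, mi) == mi)        // 1 byte rounds up to the 1Mi minimum
	fmt.Println(roundRequest(3*mi+1, mi, mi) == 4*mi) // partial increment rounds up
}
```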

@BenTheElder (Member) left a comment:

bah, I thought these review comments were submitted last week >.<


Yes, this feature can be disabled once it has been enabled.
The `allowMultipleAllocations` flag, `sharingPolicy`, `capacityRequests`, and `consumedCapacities` fields will be dropped.
Member:

dropped, or ignored?

Contributor Author:

These fields are dropped by the existing implementation strategy. Should they be ignored instead?

Contributor:

These fields are dropped by the existing implementation strategy. Should they be ignored instead?

There is per-feature flexibility to choose the best option, but often new usage of the fields is disallowed while existing usage is allowed to remain; that allows cleanup. I think you've anticipated this in the next sentence about the status retaining the information.


This feature depends on the DRA structured parameters feature being enabled, and on DRA drivers being deployed.
Member:

drivers that support the feature? or just drivers with pre-existing support?
(IIUC the former)

Contributor Author:

@BenTheElder Yes, the driver must support this feature. I will add the clarification.

@BenTheElder BenTheElder dismissed their stale review June 11, 2025 02:42

[PRR was filled out]

@dom4ha (Member) commented Jun 11, 2025

It looks good. Note that the sig-scheduling approval should already be granted.


###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Will consider in the beta timeframe.
Contributor:

While not required for alpha, I do always encourage people to spend time considering the metrics they'll need for debugging early. I suspect the existing metric for successful claims will be useful.

Contributor Author:

@deads2k Thank you. I added the apiserver_request and scheduler_plugin_execution_duration_seconds metrics in the section.

@deads2k (Contributor) commented Jun 11, 2025

PRR lgtm, approving PRR

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, dom4ha, sunya-ch

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2025
@pohly (Contributor) left a comment:

Complete pass, including API review.

Some nits, but no blockers. Remaining open questions are all things that we can still decide later during the API review of the implementation.

This would move the device from the host into the user's pod, preventing other users from accessing the shareable device.

**Mitigation:**
- Administrators should define clear device classes for shareable and unshareable devices to prevent such misallocations.
Contributor:

Just for my understanding: the user allocated a shared device, but then did something in their pod which makes the device exclusive to the pod?

I don't see how different device classes help here. It does not prevent a malicious user from doing the above, unless there are additional constraints in place, like limiting the quota for the shareable device class to zero for such a user.

This is primarily an issue for the DRA driver managing shareable devices, not this KEP. In this KEP we can assume that a DRA driver enforces proper sharing and that users cannot circumvent this.

//
// +optional
// +featureGate=DRAConsumableCapacity
ShareID *string
Contributor:

Nit: indent with the same amount of spaces as elsewhere.

I'm undecided whether we want to enforce this as a hex value. This seems a bit unusual from an API perspective and, as presented here, doesn't specify a maximum length of the string.

Perhaps just go with the usual "must be a DNS label"? Doing it as a short hex string then becomes an implementation detail, without being part of the API.

Let's first clarify that before re-indenting, otherwise this comment thread is immediately "obsolete"...

@liggitt (Member) commented Jun 17, 2025

If this is a random string that correlates to nothing else, and whose only purpose is to be unique, maybe use ShareUID instead? can decide / tweak during impl

This seems a little fuzzy / underspecified currently. I think it's ok to figure out specifics at implementation time and update here, but they do need to be figured out:

  1. format / min-length / max-length restrictions
  2. how uniqueness is validated
  3. whether we just trust writers to generate unique values correctly ... what happens if they don't?
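If ShareID ends up validated as a DNS label, as suggested above, the check could look like this sketch (the pattern and 63-character limit follow the RFC 1123 label rules; whether this validation is adopted is still an open question in the thread):

```go
package main

import (
	"fmt"
	"regexp"
)

// dnsLabel matches RFC 1123 DNS labels: lowercase alphanumerics and '-',
// starting and ending with an alphanumeric character.
var dnsLabel = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validShareID checks a candidate ShareID against the "must be a DNS label"
// rule, including the 63-character length limit for DNS labels. With this
// rule, generating a short hex string remains a valid implementation detail
// without being baked into the API.
func validShareID(id string) bool {
	return len(id) <= 63 && dnsLabel.MatchString(id)
}

func main() {
	fmt.Println(validShareID("0a1b2c3d"))  // a short hex string is also a valid DNS label
	fmt.Println(validShareID("Not_Valid")) // uppercase letters and '_' are rejected
}
```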


@pohly (Contributor) commented Jun 17, 2025

Status check...

There are some open questions regarding the API:

Those can be revisited during the actual API review of the implementation.

I think the PR is in a good enough shape to merge it. @sunya-ch is traveling this week and I'll be out starting Thursday, so let me propose merging this tomorrow (Wednesday) unless some new concerns are brought up.

/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2025
@pohly (Contributor) commented Jun 17, 2025

/label tide/merge-method-squash

@sunya-ch: I prefer one manually squashed commit and can re-add LGTM if needed in this repo (not all do), but don't worry if you don't get to it.

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 17, 2025
Comment on lines +496 to +480
// If the allocation result includes a ShareID, the Device field is extended with the ShareID,
// formatted as `<device name>/<share id>`.
Member:

Mixing these into the same field is a little surprising... multiple possible formats/meanings seems like it will make it harder and more error-prone to consume, especially if consumers get used to this always matching the device name, and then it surprisingly starts appending /<share id> for shared devices in the future.

is there a good reason not to have the device name and share id as separate fields here?

Contributor:

We had it as a separate field initially, but it turned out that you were right to warn about the separate field becoming a new key in the status list map.

Encoding more information in an existing string is the lesser evil: not nice conceptually, but it works.

@sunya-ch: this is one of those aspects that need to be added as an additional source code comment in types.go (i.e., outside of the doc comment), because that is where future readers will find it when wondering about the API design.

Member

@liggitt liggitt Jun 17, 2025

oh, I didn't realize this was an existing field.... then I really don't think we should redefine it to sometimes smush in another value / change the meaning and format ... do existing consumers of this field use the value to look up the device? I would assume they do

Member

we're on the cusp of promoting to v1, this seems really important to pin down... if we can't add new optional fields, AND give them defaults, AND make them key fields to allow for otherwise-duplicate items in []AllocatedDeviceStatus, we're in a sticky spot

did we over-specify the map-list aspects of Devices []AllocatedDeviceStatus in ways we can't evolve? do we need to tweak anything there or is there anything we can tweak?

	// +listType=map
	// +listMapKey=driver
	// +listMapKey=device
	// +listMapKey=pool
	// +featureGate=DRAResourceClaimDeviceStatus
	Devices []AllocatedDeviceStatus `json:"devices,omitempty" protobuf:"bytes,4,opt,name=devices"`

Contributor

do existing consumers of this field use the value to look up the device? I would assume they do.

Some consumers will simply show the string as part of the driver/pool/device triplet. Those are fine. It will even look nice if they use the recommended (but not required) "concatenate with slash" format.

Some consumers might try to look up the device in the ResourceSlice or correlate. Those consumers need to be aware of the additional share ID. Primarily these are schedulers. They need to be aware of this feature either way to behave correctly. Most consumers shouldn't need to do this, partly because the ResourceSlice with the device definition isn't even guaranteed to be available.

Finally, DRA drivers need to be aware if they support this feature. Nothing changes for other DRA drivers.

did we over-specify the map-list aspects of Devices []AllocatedDeviceStatus in ways we can't evolve?

Having it as a +listType=map is useful because different DRA drivers own different entries in it and then can use SSA.

That the API machinery behind it doesn't fully support adding a new map key is a bummer. I think we should try to fix this in 1.34. If we can, then the "device" string can remain as it is and "shareID" can be stored as a new field. Downgrades to 1.33 with the extra field will be problematic, but that's "downgrading after using an alpha feature", so it's not guaranteed to work smoothly.

Figuring this out is a blocker for the GA promotion and the implementation PR of this KEP. I'm sufficiently motivated to tackle this 😅 But it's not an area that I know, so help would be needed.

cc @jpbetz
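For illustration, a consumer that needs to correlate a status entry back to a ResourceSlice device could split off the optional share ID from the `<device name>/<share id>` string described above. This is a sketch only; the helper name is hypothetical and not part of the KEP:

```go
package main

import (
	"fmt"
	"strings"
)

// splitDeviceShareID is a hypothetical helper: it separates the device
// name from the optional share ID appended as "<device name>/<share id>".
// A plain device name (no "/") is returned unchanged with an empty share ID.
func splitDeviceShareID(device string) (name, shareID string) {
	if i := strings.IndexByte(device, '/'); i >= 0 {
		return device[:i], device[i+1:]
	}
	return device, ""
}

func main() {
	name, id := splitDeviceShareID("gpu-0/5f2c")
	fmt.Println(name, id) // gpu-0 5f2c

	name, id = splitDeviceShareID("gpu-0")
	fmt.Println(name, id == "") // gpu-0 true
}
```

Consumers that only display the string unchanged (the common case described above) would not need this; only consumers that look devices up in a ResourceSlice would.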

Contributor

Small correction: Devices []AllocatedDeviceStatus is not going to GA in 1.34 because it is not part of core DRA. It's part of #4817, which remains in beta.

Still important to figure out, of course!

//
// +optional
// +featureGate=DRAConsumableCapacity
ShareID *string
Member

@liggitt liggitt Jun 17, 2025

If this is a random string that correlates to nothing else, and whose only purpose is to be unique, maybe use ShareUID instead? can decide / tweak during impl

This seems a little fuzzy / underspecified currently. I think it's ok to figure out specifics at implementation time and update here, but they do need to be figured out:

  1. format / min-length / max-length restrictions
  2. how uniqueness is validated
  3. whether we just trust writers to generate unique values correctly ... what happens if they don't?
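One possible answer to points 1–3, purely as a sketch and not the KEP's design: the allocator mints a short random hex string, giving a fixed format and length, with uniqueness being probabilistic and a collision among a device's existing allocations rejected at validation time. All names here are assumptions:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newShareID sketches one way a writer might mint a share ID:
// 8 random bytes, hex-encoded into 16 characters. Uniqueness is
// probabilistic, so a writer would still need to check for collisions
// against the device's existing allocations before using the value.
func newShareID() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	id, err := newShareID()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(id)) // 16
}
```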

// making it less reliable than ShareID.

// ConsumedCapacities tracks the amount of capacity consumed per device as part of the claim request.
// The consumed amount may differ from the requested amount: it is rounded up to the nearest valid
Member

may differ from the requested amount

but may not be less than the requested amount?


// DeviceCapacity describes a quantity associated with a device.
type DeviceCapacity struct {
// Value defines how much of a certain device capacity is available.
Member

non-blocking nit to tweak during impl or in a follow-up: I'd probably drop or rephrase "available" from the doc here, since that is ambiguous whether it is real-time / updated as the device is allocated and then used

//
// If the requested amount is not listed in the options, it cannot be allocated to this device.
type CapacitySharingPolicyDiscrete struct {
// Options defines a list of additional valid capacity values that can be requested.
Member

Suggested change
// Options defines a list of additional valid capacity values that can be requested.
// Options defines the list of valid capacity values that can be requested.

// Minimum + chunk size must be less than or equal to the capacity value.
//
// +optional
ChunkSize *resource.Quantity
Member

non-blocking question: is there a more intuitive / standard name for this? Can think about it on the way to implementation, maybe StepSize or UnitSize?
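The minimum-plus-chunk rule quoted above can be sketched with plain integers (standing in for resource.Quantity, avoided here to keep the example self-contained); the helper name and -1 error convention are assumptions:

```go
package main

import "fmt"

// roundUpToChunk returns the smallest amount >= requested of the form
// minimum + n*chunk (n >= 0), or -1 if that amount would exceed the
// capacity value. Quantities are modeled as int64 for a self-contained
// sketch; the real API uses resource.Quantity.
func roundUpToChunk(requested, minimum, chunk, capacity int64) int64 {
	if requested < minimum {
		requested = minimum // round up to the minimum first
	}
	amount := requested
	if chunk > 0 {
		if rem := (requested - minimum) % chunk; rem != 0 {
			amount = requested + (chunk - rem) // round up to next chunk boundary
		}
	}
	if amount > capacity {
		return -1 // cannot be allocated from this device
	}
	return amount
}

func main() {
	fmt.Println(roundUpToChunk(5, 2, 4, 16))  // 6 (= minimum 2 + one chunk of 4)
	fmt.Println(roundUpToChunk(1, 2, 4, 16))  // 2 (rounded up to the minimum)
	fmt.Println(roundUpToChunk(15, 2, 4, 14)) // -1 (rounded amount exceeds capacity)
}
```

This also illustrates the point raised earlier in the thread: the consumed amount may be larger than, but never less than, the requested amount.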

Comment on lines 315 to 319
// DiscreteValues defines a set of acceptable quantity values in consuming requests.
//
// +optional
// +oneOf=ValidSharingValues
DiscreteValues *CapacitySharingPolicyDiscrete
Member

non-blocking: on the way to implementation, consider if we need the nesting levels or if this could simplify / flatten to ValidValues []resource.Quantity
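A flattened ValidValues check could indeed be very simple. A sketch, with quantities modeled as int64 and the name taken from the suggestion above:

```go
package main

import "fmt"

// isValidDiscrete reports whether a requested amount matches one of the
// valid capacity values (the flattened ValidValues idea suggested above).
func isValidDiscrete(requested int64, validValues []int64) bool {
	for _, v := range validValues {
		if v == requested {
			return true
		}
	}
	return false
}

func main() {
	opts := []int64{1, 2, 4, 8}
	fmt.Println(isValidDiscrete(4, opts)) // true
	fmt.Println(isValidDiscrete(3, opts)) // false: not allocatable on this device
}
```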

//
// +optional
// +oneOf=ValidSharingValues
ValueRange *CapacitySharingPolicyRange
Member

non-blocking, can think about this as we implement: if we rename the above option to ValidValues, this could rename to ValidRange or something similar

Comment on lines +358 to +356
// Minimum specifies the minimum capacity allowed for a consumption request.
//
// Minimum must be less than or equal to the capacity value.
// Default must be more than or equal to the minimum.
//
// +required
Minimum resource.Quantity

// Maximum defines the upper limit for capacity that can be requested.
//
// Maximum must be less than or equal to the capacity value.
// Minimum and default must be less than or equal to the maximum.
//
// +optional
Maximum *resource.Quantity
Member

non-blocking, but make a note to decide whether to use Minimum/Maximum or match LimitRange and use Min/Max
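The ordering constraints stated in the doc comments above (minimum <= default <= maximum <= capacity value) could be validated along these lines; field and method names here are assumptions for the sketch, with int64 standing in for resource.Quantity:

```go
package main

import (
	"errors"
	"fmt"
)

// rangePolicy models the range fields discussed above; maximum and
// defaultVal are pointers to mirror the optional API fields.
type rangePolicy struct {
	minimum    int64
	maximum    *int64 // optional
	defaultVal *int64 // optional in this sketch
}

// validate checks the constraints quoted in the doc comments:
// minimum <= default <= maximum <= capacity value.
func (p rangePolicy) validate(capacity int64) error {
	if p.minimum > capacity {
		return errors.New("minimum must be <= capacity value")
	}
	if p.maximum != nil && *p.maximum > capacity {
		return errors.New("maximum must be <= capacity value")
	}
	if p.maximum != nil && p.minimum > *p.maximum {
		return errors.New("minimum must be <= maximum")
	}
	if p.defaultVal != nil {
		if *p.defaultVal < p.minimum {
			return errors.New("default must be >= minimum")
		}
		if p.maximum != nil && *p.defaultVal > *p.maximum {
			return errors.New("default must be <= maximum")
		}
	}
	return nil
}

func main() {
	max := int64(8)
	def := int64(4)
	p := rangePolicy{minimum: 2, maximum: &max, defaultVal: &def}
	fmt.Println(p.validate(16)) // <nil>
	fmt.Println(p.validate(1))  // minimum must be <= capacity value
}
```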

@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from be65f28 to 8bb378b Compare June 18, 2025 06:24
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2025
@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from 8bb378b to a33d33d Compare June 18, 2025 06:47
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from a33d33d to dcc2aaa Compare June 18, 2025 10:54
@pohly
Contributor

pohly commented Jun 18, 2025

/lgtm

I'm still inclined to merge this KEP today.

How to add the share ID (#5104 (comment)) is under discussion. We need to figure out the best way forward before merging the implementation.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2025
@pohly
Contributor

pohly commented Jun 18, 2025

I can't get hold of @liggitt on Slack for confirmation and will be out for the rest of the week now, so I am going ahead and will merge this KEP as it is now. Future discussion points:

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 18, 2025
@k8s-ci-robot k8s-ci-robot merged commit 52abef7 into kubernetes:master Jun 18, 2025
4 checks passed
@github-project-automation github-project-automation bot moved this from Needs Review to Done in SIG Scheduling Jun 18, 2025
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.