
Add New Resource API proposal #782

Closed

Conversation

@vikaschoudhary16 (Contributor) commented Jul 6, 2017

Notes for reviewers

First proposal submitted to the community repo, please advise if something's not right with the format or procedure, etc.

cc @aveshagarwal @jeremyeder @derekwaynecarr @vishh @jiayingz

Signed-off-by: vikaschoudhary16 vichoudh@redhat.com

@k8s-ci-robot (Contributor)

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 6, 2017
// +patchMergeKey=key
// +patchStrategy=merge
Key string
// Example 0.1, intel etc
Contributor

What is this?

Contributor Author

These are examples of Values. For example, if Key is 'version', Values can contain '0.1'; or if Key is 'vendor', Values can contain 'intel'. I will make it clearer.
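For illustration, a sketch of how such (Key, Values) pairs might appear in a resource class's `matchExpressions`, following the YAML examples shown elsewhere in this proposal (the specific keys and values here are hypothetical):

```yaml
kind: ResourceClass
metadata:
  name: intel.device
spec:
  resourceSelector:
    - matchExpressions:
        - key: "vendor"
          operator: "In"
          values:
            - "intel"
        - key: "version"
          operator: "In"
          values:
            - "0.1"
```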

ResourceSelectorOpExists ResourceSelectorOperator = "Exists"
ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
ResourceSelectorOpGt ResourceSelectorOperator = "Gt"
ResourceSelectorOpLt ResourceSelectorOperator = "Lt"
Contributor

You mentioned Equal To on line 43 but no Eq operator here.

Contributor Author

Will correct line 43 to use "In" in place of "Equal to".

Contributor

I think using In for Eq causes confusion.

* User can view the current usage/availability details about the resource class using kubectl.

### User story
Admin knows what all devices are present on nodes and have deployed corresponding device plugins. Device plugins will make devices appear in node status. Next admin creates resource classes that have generic/portable names and metadata which can select available devices.
Contributor

Slightly rephrased this paragraph.

The administrator has deployed device plugins to support hardware present in the cluster. The Kubelet will update its status indicating the presence of this hardware via those device plugins. To offer this hardware to applications deployed on Kubernetes in a portable way, the administrator creates a number of resource classes to represent that hardware. These resource classes will include metadata about the device and selection criteria.

Contributor Author

Thanks, will update!

5. The user deletes the pod or the pod terminates
6. Kubelet reads pod object annotation for devices consumed and calls `Deallocate` on the matching Device Plugins

The scheduler is incharge of both, selecting a node and also selecting a device for requested resource classes.
Contributor

In addition to node selection, the scheduler is also responsible for selecting a device that matches the resource class requested by the user.

* Select device at pod admission while applying predicates and change all api interfaces that are required to pass selected device to container runtime manager.
* Create resource consumption state again at container runtime manager and select device.

None of the above approach seems cleaner than doing device selection at scheduler.
Contributor

Is it worth mentioning here that this decision helps to retain a clean abstraction between the container runtime and kubernetes? ISTR that was a portion of the discussion.


## Future Scope
* RBAC: It can further be explored that how to tie resource classes with RBAC like any other existing API resource objects.
* Nested Resource Classes: In future device plugins and resource classes can be extended to support the nested resource class functionality where one resource class could be comprised of a group of sub-resource classes. For example 'numa-node' resource class comprised of sub-resource classes, 'single-core'.
Contributor

We need to communicate the plans/decisions on how ResourceClasses relate to Opaque Integer Resources.

@ConnorDoyle

Contributor Author

Whatever existing resource discovery tools update node objects with OIR will adapt to update node status devAllocatable with devices instead. Will add more details.

@vikaschoudhary16 vikaschoudhary16 force-pushed the resource_class branch 2 times, most recently from a3d30c3 to c7f5820 Compare July 6, 2017 19:05
@derekwaynecarr derekwaynecarr self-assigned this Jul 6, 2017
@RenaudWasTaken

@vikaschoudhary16 I don't see any mention of overlapping resources; is that something you plan to address in another PR?

@vikaschoudhary16 (Contributor Author)

@RenaudWasTaken It is explained implicitly by describing how a resource class can select devices with key-value metadata, and how one resource class can select different devices. Overlapping is essentially the functionality that provides portability.
I will add more details to make it more visible and explicit.
Thanks!

@RenaudWasTaken commented Jul 7, 2017

@RenaudWasTaken It is explained implicitly by explaining how resource class can select device with k-v metadata

I'm also interested in how you solved selecting multiple overlapping resource classes, because it is not a trivial problem.
An example would be:

  • node has 2 GPUs:
    • 1 GPU with 4G
    • 1 GPU with 8G
  • Cluster has 2 resource classes:
    • GPU with memory > 2 (lowMemGPU)
    • GPU with memory > 4 (highMemGPU)
  • User submits pod with 2 containers:
    • The first one requests 1 lowMemGPU
    • The second one requests 1 highMemGPU

In this example if your selection algorithm is a first fit then it is very possible that you won't be able to satisfy the request because you might give the gpu with 8G to the first container.

It seems to me that your only solution is to generate all the possible permutations, but that doesn't scale well...
Another solution I thought about was to have multiple algorithms and give either the end user or the cluster admin the option to select which one they wanted to use.
But it feels like this edge case doesn't have any good solutions...

What do you think ?

@vikaschoudhary16 (Contributor Author)

@RenaudWasTaken

Another solution I thought about was to have multiple algorithms and give the option to either the end user or the cluster admin to select which one he wanted to use.

Thanks, I missed adding these details though I had them in mind. This proposal implements a first-fit selection process. For better resource usage and scalability, the selection algorithm will be optimized in the future along the lines you mentioned.
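The failure mode Renaud describes can be sketched concretely. This is not from the proposal; the types, the first-fit routine, and the "most restrictive request first" mitigation below are all invented for illustration:

```go
package main

import "fmt"

type gpu struct {
	name string
	mem  int // memory in GB
}

// class is a hypothetical stand-in for a ResourceClass selector:
// it matches any GPU with more than minMem GB of memory.
type class struct {
	name   string
	minMem int
}

// firstFit hands each request the first unassigned GPU that matches it,
// returning false if some request cannot be satisfied.
func firstFit(gpus []gpu, reqs []class) (map[string]string, bool) {
	used := make([]bool, len(gpus))
	out := map[string]string{}
	for _, r := range reqs {
		ok := false
		for i, g := range gpus {
			if !used[i] && g.mem > r.minMem {
				used[i] = true
				out[r.name] = g.name
				ok = true
				break
			}
		}
		if !ok {
			return nil, false
		}
	}
	return out, true
}

func main() {
	gpus := []gpu{{"gpu-8g", 8}, {"gpu-4g", 4}}
	// Container 1 requests lowMemGPU (>2G), container 2 requests highMemGPU (>4G).
	// Naive order: the 8G GPU is given away first, so highMemGPU cannot be satisfied.
	_, ok := firstFit(gpus, []class{{"lowMemGPU", 2}, {"highMemGPU", 4}})
	fmt.Println("naive order satisfiable:", ok)
	// Handling the most restrictive request first avoids this particular failure.
	_, ok = firstFit(gpus, []class{{"highMemGPU", 4}, {"lowMemGPU", 2}})
	fmt.Println("restrictive-first satisfiable:", ok)
}
```

Ordering heuristics like this help with the two-class example but do not solve the general matching problem, which is why the full-permutation concern above remains valid.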

name: nvidia.high.mem
spec:
resourceSelector:
-

Is this a superfluous line?

Contributor Author

@fabiand this is yaml syntax for nested sequences, https://learn.getgrav.org/advanced/yaml#sequences


Oh obviously …
Would pulling matchExpressions into line 89 look saner?

    - matchExpressions:

Contributor Author

Ah, sure, will update. Thanks!


## Motivation
Compute resources in Kubernetes are represented as a key-value map with the key
being a string and the value being a 'Quantity' which can (optionally) be

Are you referring to OIRs here?

Contributor Author

@fabiand yes.

resourceSelector:
-
matchExpressions:
-

Maybe something missing here as well?

Contributor Author

@fabiand same as above.

@fabiand fabiand mentioned this pull request Jul 7, 2017
@saad-ali (Member) commented Jul 7, 2017

A couple of drive-by comments:

  1. What happens when a ResourceClass is deleted before the pod referencing it? Particularly if kubelet has to enforce the limits.
  2. Also consider the case where kubelet and/or scheduler crash and lose in-memory state (and the ResourceClass object is deleted).
  3. Define which, if any, of the fields of ResourceClass are immutable, and if they are mutable what the expected behavior is.
  4. Maybe worth calling out more explicitly that this will be non-namespaced and expected to be created by cluster admins.

@vikaschoudhary16 (Contributor Author)

@saad-ali
Thanks for taking a look. Please find my responses as follows:

  1. What happens when a ...

This proposal's scope is bounded by the device plugin proposal's scope, which is to cover devices only, not resources such as CPU and memory.
Since the device for the pod is selected by the scheduler, if the resource class gets deleted, either the pod will fail the predicate at the scheduler or the pod will already have the device info.

  2. Also consider the case where kubelet and/or scheduler crash ...

In the scheduler crash case, for any unscheduled pods that request the deleted resource class, scheduling will fail at predicate validation. Similarly, the predicate will also fail at the kubelet, because for any new pod the kubelet recreates the resource consumption state from scratch.

  3. Once created, a resource class object is immutable; only its status will be updated by the scheduler later on.
  4. Sure, will mention explicitly.

- "1G"
```
Above resource class will select all the hugepages with size greater than
equal to 1 GB.


s/greater than equal to/greater than

values:
- "nic"
key: "speed"
operator: "In"


Did you mean either Eq or Gt instead of In?

- "40GBPS"
```
Above resource class will select all the NICs with speed greater than equal to
40 GBPS.


This should change based on the change above.

2. Iterate over all existing nodes in cache to figure out if there are devices
on these nodes which are selectable by resource class. If found, update the
resource class availability status in local cache.
3. Patch the status of resource class api object with availability state in locyy


s/locyy/local


In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet


Maybe change this to Preferring device selection in the scheduler (instead of the Kubelet).

device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet
Kubelet does not maintain any cache. Therefore to know the availability of a device,


s/to know the availability of a device/to know the quantity of the device available for scheduling/

### Reason for not preferring device selection at kubelet
Kubelet does not maintain any cache. Therefore to know the availability of a device,
will have to calculate current total consumption by iterating over all the admitted
pods running on the node. This is already done today while running predicates for


Maybe rephrase as the quantity of the device already consumed should be calculated by iterating.

consumption state that is created at runtime for each pod, are exactly same,
current api interfaces does not allow to pass selected device to container manager
(where actually device plugin will be invoked from). This problem occurs because
devices are determined internally from resource classes while other resource


Is this not opposite (i.e., resource classes are determined from devices)?

From the events perspective, handling for the following events will be added/updated:

### Resource Class Creation
1. Init and add resource class info into local cache


Nit: s/Init/Initialize

## Opaque Integer Resources
This API will supercede the [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
(OIR). External agents can continue to attach additional 'opaque' resources to
nodes, but the special naming scheme that is part of the current OIR approach


s/but the special naming scheme that is part of the current OIR approach will no longer be necessary/using device plugins

Contributor Author

@balajismaniam
Not necessarily device plugins; OIRs could also be used the same as they are used today (without device plugins). Using OIR with device plugins would be the case where a plugin has not adapted device advertisement per the device plugin proposal and uses OIRs to advertise resources.


Got it. Thanks.

plugins.

## Opaque Integer Resources
This API will supercede the [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)


Nit: s/supercede/supersede

*Resource Class* is a new type, objects of which provides abstraction over
[Devices](https://github.com/RenaudWasTaken/community/blob/a7762d8fa80b9a805dbaa7deb510e95128905148/contributors/design-proposals/device-plugin.md#resourcetype).
A *Resource Class* object selects devices using `matchExpressions`, a list of
(operator, key, value). A *Resource Class* object selects a device if atleast
Contributor

s/atleast/at least

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet
Contributor

This section was a little confusing to read (what exactly does device selection mean?) -- maybe a concrete example would help?

Was the option considered where the scheduler is responsible for mapping resource classes from the container spec to resource names from the node capacityV2, but assigning specific devices is left to the Kubelet? IIRC a reason to delay device binding to Kubelet was to avoid publishing hardware topology to the API server.

Contributor Author

@ConnorDoyle
I have updated several initial sections of the doc to make it clearer. Device selection means selecting a device using the resource class details, which includes applying different operators as explained in the 'Resource Class' section of this document.

scheduler is responsible for mapping resource classes from the container spec to resource names from the node capacityV2

Yes, and this section notes the challenges in that approach. I have updated this section to make it clearer. Hope it is more understandable now.

Kubelet was to avoid publishing hardware topology to the API server.

This proposal assumes that device details are updated in node status by vendor device plugins, as proposed in device plugin proposal.

1. Get the requested resource class name and quantity from pod spec.
2. Select nodes by applying predicates according to requested quantity and Resource
class's state present in the cache.
3. On the selected node, select a Device from the stored devices info in cache
Contributor

Would it be better to say something like "concrete resource" here instead of "Device"? There's no limitation that resource classes can only represent devices right?

Contributor Author

Yes, there is a limitation. A resource class can represent only the devices that are advertised by device plugins, in the Device structure format, in the node status.

Contributor

@vikaschoudhary16 But devices may have the same name (device-id) across nodes:

  • Node1 has dev-1, which satisfies ResourceClassA
  • Node2 has dev-1, which also satisfies ResourceClassA

How do we cache the devices info in ResourceClassA? Use this pattern: Node1-dev-1 and Node2-dev-1?

Contributor Author

A deviceinfo structure will be maintained in the scheduler cache, so each device will have a list of all the resource classes it satisfies.
Take a look: https://github.com/vikaschoudhary16/kubernetes/pull/3/files#diff-558cb8bde14dca10a3151bfc222a3aae

be addressed by [device plugin proposal](https://github.com/kubernetes/community/pull/695/files)

## Resource Class
*Resource Class* is a new type, objects of which provides abstraction over
Contributor

s/provides/provide


1. A user submits a pod spec requesting 'X' resource classes.
2. The scheduler filters the nodes which do not match the resource requests.
3. scheduler selects a device for each resource class requested and annotates
Contributor

annotates the pod object

Do you have an example of what the annotation might look like?

Contributor Author

@ConnorDoyle
sure, will update. Thanks!

@vikaschoudhary16 vikaschoudhary16 force-pushed the resource_class branch 2 times, most recently from f6d6c95 to 53c2a80 Compare July 11, 2017 08:12
@vikaschoudhary16 (Contributor Author)

Thanks @ConnorDoyle @saad-ali @balajismaniam @RenaudWasTaken @fabiand for the review comments.
I have updated the doc to address those. PTAL!

extended to support the nested resource class functionality where one resource
class could be comprised of a group of sub-resource classes. For example 'numa-node'
resource class comprised of sub-resource classes, 'single-core'.
* Multiple device selection algorithms, each with a different selection strategy,

I think this should be thoroughly discussed and not just a "side note" on the bottom of the design doc.
Maybe a sig-scheduling discussion ?

Contributor Author

@RenaudWasTaken Totally agree. The Scheduler section has also been updated regarding the "first fit" approach.
The example use case you quoted can be handled by a "best fit" approach, which I plan to cover in a follow-up proposal. Meanwhile, if a user doesn't want to use "first fit", they can request devices using OIRs and bypass resource classes. That's my thinking; if the community thinks differently, I'm happy to discuss and adapt the proposal accordingly.

@vikaschoudhary16 (Contributor Author)

/sig scheduling

@derekwaynecarr (Member)

I will prioritize reviewing this further when device plugins are agreed upon.

`scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu=4`
where `scheduler.alpha.kubernetes.io/resClass` is the common prefix for all the
device annotations, `test-res-class` is the resource class name,
`nvidia-tesla-gpu` is the selected device name and `4` is the quantity requested.
Contributor

@vikaschoudhary16 OK, I see here is how the user would request an amount of one of these resources. I still think that combining the device with the resource class in the selector is going to be problematic, though.

Let me ask something... do you envision the user knowing about the devices on nodes that are providing a particular resource class? Put another way... would (should?) the proposed end user that is constructing the pod spec and specifying the resource requirement selector scheduler.alpha.kubernetes.io/resClass_test-res-class_nvidia-tesla-gpu=4 actually know that the "test-res-class" resources were being provided by the "nvidia-tesla-gpu" device(s)?

My guess is that, often, the end user won't know about the underlying device that is providing some amount of abstracted resources. The deployer of the cloud infrastructure knows that information, of course. But the end user that is consuming some of those resources won't necessarily know the exact vendor/device information and deployers may not actually want the end user to know the device/vendor/model specifics :-)

To respond to your comment from above that we should think of resource classes in the same vein as EC2 instance types, I'd just comment that Amazon has complete and total control over the information it gives users with regards to the amount of resources and the types of capabilities that its instance types comprise. Amazon can change (and has changed in the past) its mind about the quantities of resources and quality/vendor models associated with a particular instance type. If Kubernetes continues to be agnostic to cloud infrastructure providers, I think two things are necessary:

  1. The way that Kubernetes advertises resources and capabilities to end users should hide underlying device implementation details as much as possible. This includes hiding vendor information as much as possible.
  2. All concepts that provide a grouping/coupling mechanism between qualitative and quantitative things should be possible to specify as their individual quantitative and qualitative components.

The second point probably deserves a little more explanation. What I mean is that if you are going to expose a concept like a ResourceClass object -- something that allows the deployer to describe a collection of consumed resource amounts as well as capabilities that describe one or more providers of those consumable resources -- then the end user should be able to request a pod consumes those coupled resources and lands on a node with those capabilities without using the coupled object.

In other words, if you have a ResourceClass that looks like this:

kind: ResourceClass
metadata:
  name: gpu.high.mem
spec:
  resourceSelector:
    - matchExpressions:
        - key: "gpu.vendor"
          operator: "In"
          values:
            - "nvidia"
            - "intel"
        - key: "gpu.memory"
          operator: "GtEq"
          values:
            - "4G"

the end user should be able to request a pod where container in the pod need 1 gpu.high.mem OR the end user should be able to request a pod where containers in the pod need 4G of resource type gpu.memory and the "gpu.vendor" annotation/selector is "intel" instead of either "intel" OR "nvidia". That's what I mean about breaking the coupling down into its finest-grained representation and allowing the end user to specify that fine-grained request.

Hope that makes sense! I recognize that sometimes, the terminology I use is overlapping and confusing, so I apologize in advance about that. I'm trying to bridge the terminology differences between the OpenStack infrastructure representation of these things with the proposed Kubernetes representation of similar ideas.

Best,
-jay

Contributor Author

@jaypipes

The way that Kubernetes advertises resources and capabilities to end users should hide underlying device implementation details as much as possible. This includes hiding vendor information as much as possible.

That's up to the deployer. The deployer is free not to use any keys in the resource class that they think the user should not know about. The proposal does not make it mandatory to create resource classes with any vendor-specific details.

All concepts that provide a grouping/coupling mechanism between qualitative and quantitative things should be possible to specify as their individual quantitative and qualitative components.

I think I understood your point. The problem with this approach is that it would become an identity mapping. Resource classes are aimed at creating broader abstractions where arbitrary ranges for resource properties can also be supported. There are two main problems:

  • Portability is gone: With the current approach, a resource class is an allocatable unit. So although it provides a broader abstraction, its consumed, capacity, and remaining units are still countable. This way, the admin knows what quota is being offered. If the user is left free to choose the range of a resource property, like gpu.memory gt 30, one cluster may support it but another may not. But if resource classes are treated as single allocated units with non-mutated properties, it is easier for admins to control their availability as a resource across clusters.
  • We can't expect the end user to know that much about device properties. There would be far more chances of misconfiguration.

Thoughts?

@ScorpioCPH (Contributor) left a comment

Hi, I'm interested in the resource classes to devices mapping; I think it is the key point of this proposal.

cluster. Device plugins, running on nodes, will update node status indicating
the presence of this hardware. To offer this hardware to applications deployed
on kubernetes in a portable way, the administrator creates a number of resource
classes to represent that hardware. These resource classes will include metadata
Contributor

Thanks for your explanation.

the presence of this hardware. To offer this hardware to applications deployed
on kubernetes in a portable way, the administrator creates a number of resource
classes to represent that hardware. These resource classes will include metadata
about the devices as selection criteria.
Contributor

Can you give an example of how resource classes know which devices are selectable? AFAIK, as mentioned above, the administrator creates a resource class without any device or node info, so who is responsible for doing the mapping (ResourceClass --> Devices) work?

Contributor Author

The scheduler will watch the apiserver for resource class object creation, and the scheduler already has nodeinfo in its cache. For each kind of device, a deviceinfo structure will be instantiated, and the list of deviceinfo will reside in nodeinfo. At each resource class creation, the scheduler will iterate over the deviceinfo list, and if the resource class matches a device, that deviceinfo's list of resource class references is updated.

For more detalis take a look at PoC: https://github.com/vikaschoudhary16/kubernetes/pull/3/files#diff-558cb8bde14dca10a3151bfc222a3aae
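The cache bookkeeping described above can be sketched as follows. This is a minimal illustration, not the PoC's actual code: the names `DeviceInfo`, `ResourceClass`, `selects`, and `onResourceClassAdd` are assumptions, and the selector is simplified to exact key/value matching.

```go
package main

import "fmt"

// DeviceInfo sketches the per-device record described above as living in the
// scheduler's nodeInfo cache; field names are assumptions, not the PoC's types.
type DeviceInfo struct {
	Name            string
	Properties      map[string]string
	MatchingClasses []string // resource classes known to select this device
}

// ResourceClass is reduced here to a name plus required property key/values.
type ResourceClass struct {
	Name     string
	Selector map[string]string
}

// selects reports whether every selector entry is satisfied by the device's
// property map.
func selects(rc ResourceClass, d *DeviceInfo) bool {
	for k, v := range rc.Selector {
		if d.Properties[k] != v {
			return false
		}
	}
	return true
}

// onResourceClassAdd mirrors the watch handler: on each resource class
// creation, iterate the cached devices and record the class on every device
// it selects.
func onResourceClassAdd(rc ResourceClass, devices []*DeviceInfo) {
	for _, d := range devices {
		if selects(rc, d) {
			d.MatchingClasses = append(d.MatchingClasses, rc.Name)
		}
	}
}

func main() {
	devices := []*DeviceInfo{
		{Name: "gpu0", Properties: map[string]string{"Kind": "nvidia-gpu"}},
		{Name: "nic0", Properties: map[string]string{"Kind": "sriov-nic"}},
	}
	rc := ResourceClass{Name: "nvidia.high.mem", Selector: map[string]string{"Kind": "nvidia-gpu"}}
	onResourceClassAdd(rc, devices)
	fmt.Println(devices[0].MatchingClasses)
}
```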

1. Initialize and add resource class info into local cache
2. Iterate over all existing nodes in cache to figure out if there are devices
on these nodes which are selectable by resource class. If found, update the
resource class availability status in local cache.
Contributor

@vikaschoudhary16 I'm very interested in the details about this. As commented above, how does a Resource Class know which devices are selectable?

Contributor Author

Does the above reply answer this question?

Contributor

Yes, thanks!

on the matching Device Plugins

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.
Contributor

I'm wondering how the scheduler knows which device matches the resource class?

Contributor Author

Sorry, now that the device plugin has been merged, I need to update this proposal with the remaining device plugin enhancements for resource classes.
In the ListAndWatch() response, the device plugin will also send device properties in an arbitrary key-value map. Using this, a Device API object will be created by the device manager. The scheduler will watch the API server for any new Device object creation and keep the device info in its cache in sync.
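A rough sketch of what that extension could look like. This is hedged: the merged v1alpha device plugin API only carried an ID and health state, so the `Properties` field and the object-naming helper below are assumptions based on this reply, not the real API.

```go
package main

import "fmt"

// Device sketches a ListAndWatch entry extended with the arbitrary key-value
// property map this reply proposes. Properties is an assumption; the merged
// v1alpha API did not include it.
type Device struct {
	ID         string
	Health     string
	Properties map[string]string
}

// apiObjectName sketches how the device manager might name the cluster-visible
// Device API object the scheduler would watch; the naming scheme is hypothetical.
func apiObjectName(nodeName string, d Device) string {
	return nodeName + "-" + d.ID
}

func main() {
	d := Device{
		ID:     "GPU-0",
		Health: "Healthy",
		Properties: map[string]string{
			"Kind":   "nvidia-gpu",
			"memory": "32G",
		},
	}
	// The scheduler would watch for objects like this and sync its cache.
	fmt.Println(apiObjectName("node-1", d), d.Properties)
}
```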

Contributor

Thanks, sounds like it depends on the device properties exposed by the device plugin. So we can't support any matchExpressions that cannot be understood via device properties, am I right?

Contributor Author

Right. matchExpressions must reference a subset of the properties exposed by the device plugin.
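As a minimal sketch of that matching rule (the `Requirement` type and `In`/`GtEq` operators are taken from the examples in this thread; the quantity parsing is a deliberate simplification of Kubernetes' resource.Quantity):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Requirement mirrors one matchExpressions entry of a ResourceClass.
type Requirement struct {
	Key      string
	Operator string // only "In" and "GtEq" are sketched here
	Values   []string
}

// parseQuantity converts strings like "30G" into bytes. Real code would use
// Kubernetes' resource.Quantity.
func parseQuantity(s string) (float64, error) {
	mult := 1.0
	if strings.HasSuffix(s, "G") {
		mult = 1e9
		s = strings.TrimSuffix(s, "G")
	}
	v, err := strconv.ParseFloat(s, 64)
	return v * mult, err
}

// deviceMatches returns true only if every requirement refers to a property
// the device plugin actually exposed and that property satisfies it, i.e.
// matchExpressions must reference a subset of the exposed properties.
func deviceMatches(props map[string]string, reqs []Requirement) bool {
	for _, r := range reqs {
		val, ok := props[r.Key]
		if !ok || len(r.Values) == 0 {
			return false // property not exposed (or empty requirement)
		}
		switch r.Operator {
		case "In":
			found := false
			for _, v := range r.Values {
				if v == val {
					found = true
				}
			}
			if !found {
				return false
			}
		case "GtEq":
			have, err1 := parseQuantity(val)
			want, err2 := parseQuantity(r.Values[0])
			if err1 != nil || err2 != nil || have < want {
				return false
			}
		default:
			return false
		}
	}
	return true
}

func main() {
	gpu := map[string]string{"Kind": "nvidia-gpu", "memory": "32G"}
	class := []Requirement{
		{Key: "Kind", Operator: "In", Values: []string{"nvidia-gpu"}},
		{Key: "memory", Operator: "GtEq", Values: []string{"30G"}},
	}
	fmt.Println(deviceMatches(gpu, class))
}
```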


## Motivation
The Kubernetes system knows only two resource types, 'CPU' and 'Memory'. Any other
resource can be requested by a pod using the opaque-integer-resource (OIR) mechanism.
Contributor

Maybe we should say ExtendedResource now?

Contributor Author

Right. As we discussed on Slack, I will iterate.

```yaml
kind: ResourceClass
metadata:
  name: fast.nic
```

Just a side note that the device manager today does not support network interfaces. Thus - as resource classes build upon device manager - this would not work.

However, I'd love to see that the device manager is getting extended to support this one-off case of network resources as well.

Contributor

This is a key cross-sig deliverable. We have discussed it at length in the RMWG and it was deferred because we wanted it to be led by sig-network (or more likely the new Network Plumbing WG).

Contributor

Is it possible that we start with a simple model to support high-performance NICs with a combination of a CNI plugin and a device plugin? The CNI plugin takes care of network interface setup and management, and can be portable across different container orchestration systems. The device plugin can run as a sidecar container and take care of device initialization, health monitoring, and resource advertising. I know having the device plugin act like a middleman between CNI and Kubelet may pose certain limitations. We can discuss them and see whether we may solve them by enriching the information passed between the device plugin and Kubelet. However, at least for now, I hope that information can stay at the resource level (such as resource name and properties) and the device level (such as device runtime configuration). This way, it is clear at the API level that Kubelet is the central place in charge of resource allocation and container runtime setup, which I feel will be easier to extend to support future features like cross-resource affinity allocation.

Contributor

I don't think the SIG has enough resources to implement this case; I would like to see this design kept as simple as possible and hopefully merged in the 1.10 cycle.


Ack, thanks for the info.


@jiayingz technically it can also be done closer to CNI. To me it just looks like it will be a much cleaner interface to provide the network connectivity via the device manager. This will effectively formalize how components can add additional NICs to a pod. Today this is unspecified and everybody is pretty much using CNI in some way to achieve this.
The problem is that there are so many different assumptions by different projects. Eventually it even implies changing the CNI config of the host, which is not necessarily desirable.

OTOH if we could use the device manager, then we know a way how to distribute such a plugin and how it would operate, which could eventually lead to more cooperation, and to less wild west.

Member

@timothysc timothysc left a comment

I'd love to see this POC'd as a completely pluggable concept addition.

* If I request a resource `nvidia.gpu.high.mem` for my pod, any 'nvidia-gpu'
type device which has memory greater than or equal to 'X' GB, should be able
to satisfy this request, independent of other device capabilities such as
'version' or 'nvlink locality' etc.
Member

Getting too selective on non-fungible resources can be a dangerous game vs. channeling a lowest common denominator.

For a decade on grid systems we allowed arbitrary matching on any attribute via an expressive configuration language, and it was eventually highly abused by its users to hoard the prized resources.

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for preferring device selection at the scheduler and not at the kubelet


What if a user requests an ER in pods that are not handled by the scheduler (DaemonSet pods or static pods)?

@resouer
Contributor

resouer commented Jan 31, 2018

@vikaschoudhary16 knock knock :) As @timothysc suggested, maybe we can arrange a PoC? Any free bandwidth?

classes which could select this device in the cache.
5. Patch the resource class objects with new 'Requested' in the `ResourceClassStatus`.
6. Add the pod reference in local DeviceToPod mapping structure in the cache.
7. Patch the pod object with selected device annotation with prefix 'scheduler.alpha.kubernetes.io/resClass'

Would like to know the opinion of some sig-scheduling folks on the best way to achieve that.
Sending a patch request during host selection for a pod seems a bit different from the current scheduling implementation.

Maybe @bsalamat or @timothysc ?

Contributor

Sending a patch request directly here is not right.

But we can probably update the annotation of the assumedPod, and update the Pod API object during bind(), which is async.

@bsalamat Make sense?
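That suggestion can be sketched as follows (hypothetical types; only the annotation prefix comes from the proposal text):

```go
package main

import "fmt"

// Pod is a stand-in for the scheduler's locally assumed pod object.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// The annotation prefix comes from the proposal; everything else is an
// illustration of updating the assumed pod in the scheduler cache and
// deferring the API write to the asynchronous bind() step.
const resClassPrefix = "scheduler.alpha.kubernetes.io/resClass"

func annotateAssumedPod(p *Pod, className, deviceID string) {
	if p.Annotations == nil {
		p.Annotations = map[string]string{}
	}
	p.Annotations[resClassPrefix+"/"+className] = deviceID
}

func main() {
	p := &Pod{Name: "cuda-vector-add"}
	// Record the selected device on the assumed pod; the real Pod object
	// would only be patched later, during the async bind().
	annotateAssumedPod(p, "nvidia.high.mem", "GPU-0")
	fmt.Println(p.Annotations)
}
```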

Member

I generally do not like the idea of the scheduler knowing about the mapping of devices to resource classes. This does not fit well into the scheduler's logic and does not seem to be something that the scheduler should be aware of.
It is also not clear why the scheduler needs to update the pod with the selected device annotation.

Contributor

@bsalamat The problem I can see here is: the device plugin is not expected to have any scheduling logic (e.g. selecting devices) by design, and the scheduler also does not want to have it, so the only place left to select devices would be the kubelet. But what this proposal describes requires awareness of the whole picture of all device info in the cluster to make a decision, and that is not something a single kubelet is capable of (it only has the device info of its own node).

I am willing to push this design forward as it helps a lot to enable GPU topology in Kubernetes, and it also fixes blockers for other devices like FPGAs. So it would be great to know your ideas for scheduling.

@RenaudWasTaken

RenaudWasTaken commented Feb 9, 2018

Do you mind adding a pod spec example to your design document? I'm fairly certain resource classes are supposed to be requested in the resources field, but it's probably better to have explicit confirmation :)

After discussing it a bit internally and since this is going to be discussed at the Face 2 Face, we think it might be a good idea to discuss device sharing ideas in this proposal as sharing might have a significant impact on the technical implementation.
And in general if we agree that the sharing block should be implemented with this model, it would be a pretty good argument that this design document is a good step forward.

There are two kinds of device sharing:

  • "simple", simple because you only need to express the sharing notion between the containers
  • "complex", complex because it requires expressing more than just the fact that devices are shared, it also needs to express an action by the underlying infrastructure (mainly the device plugin) for the service to properly be used.
    e.g., for GPUs, MPS needs a daemon to run

simple sharing

Seems to be something that might be expressed as a construct on top of the ResourceClass API and built into the podSpec, exactly like other sharing APIs such as volumes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  initContainers:
    - name: myInitContainer1
      image: "nvidia/cuda"
      resources:
        limits:
          devices: ["nvidia-gpu"]
  containers:
    - name: myInferenceContainer1
      image: "nvidia/cuda"
      resources:
        limits:
          devices: ["nvidia-gpu"]
    - name: myInferenceContainer2
      image: "nvidia/cuda"
      resources:
        limits:
          devices: ["nvidia-gpu"]
  devices:
    - name: "nvidia-gpu"
      resources:
        nvidia.high.mem: 1
---
kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
  resourceSelector:
    - matchExpressions:
        - key: "Kind"
          operator: "In"
          values:
            - "nvidia-gpu"
        - key: "memory"
          operator: "GtEq"
          values:
            - "30G"
```

Complex sharing

Could be expressed by adding labels to the devices field, such as the following (which could be advertised by the device plugin):

```yaml
  devices:
    - name: "nvidia-gpu"
      labels: ["nvidia.com/MPS"]
      resources:
        nvidia.high.mem: 1
```


1. Discovery, advertisement, allocation/deallocation of devices is expected to
be addressed by [device plugin proposal](https://github.com/kubernetes/community/pull/695/files)

## Resource Class
Member

I skimmed over the doc. I think any resource class proposal should cover both node resources and cluster resources. This proposal does not cover the latter.

@ns-sundar

ns-sundar commented Feb 15, 2018

I have been lurking for a while to see how this shapes up. This proposal seems aimed only at selecting devices based on metadata. There is certainly a need to model resources in K8s, but the current proposal needs to be enhanced in many ways to reflect the variety of devices, usage models and use cases.

Let us start by stating what we would want from a resource model:

  • Many types of resources cannot be shared: once assigned to a pod/
    container, they cannot be assigned to another until the first one
    releases it (or is preempted). E.g. SR-IOV VF. So, we would want
    a way to track the inventory and usage of resources.
  • Some devices such as FPGAs can offer multiple 'regions', which can be
    programmed with different accelerators. If we represent each accelerator
    as a resource, a simple model of 'devices with resource classes' is
    not enough. We instead need a way to express that a region nested
    inside an FPGA contains an accelerator. This calls for a hierarchy.
  • FPGAs (and GPUs) often contain local memory, which is a different kind
    of resource. A user may want, say an ipsec accelerator with 2 GB of
    local memory, both of which need to come from the same device.
    Further, depending on the implementation, some of the memory may be
    dedicated to one region, while others can be shared across regions
    within the device.

To crystallize the ideas above, consider this FPGA card currently in the market [*]. It can be abstracted as below:

      dedicated <-> region <-> common <-> region <-> dedicated
        memory        A        memory       B         memory

How would such a device be represented and handled in this proposal? I think the scope needs to be broadened considerably.

[*] This is only an example. I am not affiliated with, or own stock in, the company selling this product.

@XiaoningDing

@RenaudWasTaken In your example of complex sharing, do users have to be aware of MPS? Is it possible that the device plugin talks to the MPS daemon to set up per-container GPU resource limits, without users knowing about MPS?

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: derekwaynecarr

Assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vikaschoudhary16 vikaschoudhary16 force-pushed the resource_class branch 5 times, most recently from a2ff1dc to adb3da2 on April 11, 2018 09:36
Signed-off-by: vikaschoudhary16 <vichoudh@redhat.com>
@vikaschoudhary16 vikaschoudhary16 changed the title Add Resource Class proposal Add New Resource API proposal Apr 11, 2018
ComputeResource objects are similar to PV objects in storage. A ComputeResource is tied to a physical resource that can have a wide range of vendor-specific properties. Kubelet will create or update ComputeResource objects upon any resource availability or property change for node-level resources. Once a node is configured to support the ComputeResource API and the underlying resource is exported as a ComputeResource, its quantity should NOT be included in the conventional NodeStatus Capacity/Allocatable fields, to avoid counting the resource twice. The ComputeResource API can be included in NodeStatus to facilitate resource introspection.
For cluster level resources, a special controller or a scheduler extender can create a ComputeResource and dynamically bind that to a node during or after scheduling.
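By analogy with a PV, a ComputeResource object might look like this (a hypothetical sketch; the field names below are illustrative and not defined by the proposal text):

```yaml
# Hypothetical ComputeResource object, by analogy with a PV.
kind: ComputeResource
metadata:
  name: node-1-nvidia-gpu-0
spec:
  nodeName: node-1        # for cluster-level resources this binding could be set later
  properties:
    Kind: nvidia-gpu
    memory: 32G
status:
  phase: Available
```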

### ResourceClass API
Contributor

Will users use ResourceClass directly in their pods? How about letting them create ComputeResourceClaims (CRCs) and use CRCs in their pods, just like PV/PVC? Users can express their requirements in CRCs, and the scheduler (or some controller) can match CRCs to CRs.
And use ResourceClass for auto-provisioning if necessary?

@vikaschoudhary16
Contributor Author

Created a KEP, #2265, for this, so closing this one.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.