
Scheduler extension Proposal #11470

Closed
ravigadde opened this issue Jul 17, 2015 · 25 comments
Labels
area/extensibility, priority/backlog, sig/scheduling

Comments

@ravigadde
Contributor

The Kubernetes scheduler schedules based on resources managed by Kubernetes. Scheduling based on opaque resource counting helps extend this further. But when there is a need for contextual scheduling against resources managed outside of Kubernetes (for example: place a pod where its storage is), there is no mechanism to do it today.

The proposal is to make the Kubernetes scheduler extensible by adding the capability to make HTTP calls out to another endpoint to help achieve this functionality. I am curious whether you think the cloud provider abstraction is the right abstraction for the implementation.

Here is a rough draft of what I am thinking about. I would like to solicit community feedback.

// "api" is the Kubernetes API package; "schedulerapi" refers to the scheduler's
// API package (plugin/pkg/scheduler/api).
type SchedulerExtension interface {

      // Filter based on provider-implemented predicate functions.
      Filter(pod *api.Pod, nodes *api.NodeList) (*api.NodeList, error)

      // Prioritize based on provider-implemented priority functions. Weight*priority is summed for
      // each such priority function. The returned score is added to the score computed by the
      // Kubernetes scheduler, and the total score is used for host selection.
      Prioritize(pod *api.Pod, nodes *api.NodeList) (*schedulerapi.HostPriorityList, error)

      // Inform the provider about the scheduling decision. This could also be done by the provider
      // watching the apiserver pods/binding endpoint.
      Bind(pod *api.Pod, host string) error

      // Inform the provider about the unbind, to be called by the apiserver in the pod deletion path.
      // This could also be accomplished by watching the apiserver pods endpoint.
      Unbind(pod *api.Pod, host string) error
}
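
To make the intended call flow concrete, here is a minimal sketch of how a scheduler could consult such an extension. The function name and wiring are hypothetical and only illustrate the Filter → Prioritize → Bind sequence described above; they are not actual scheduler code.

// Sketch only: assumes the api and schedulerapi packages noted above.
// filterAndScoreWithExtension narrows the candidate nodes with the provider's
// predicates and collects the provider's scores; the caller would add those
// scores to the built-in priority results, pick the best host, and then call
// ext.Bind(pod, chosenHost) so the provider can account for the resources used.
func filterAndScoreWithExtension(ext SchedulerExtension, pod *api.Pod, nodes *api.NodeList) (*api.NodeList, *schedulerapi.HostPriorityList, error) {
    filtered, err := ext.Filter(pod, nodes)
    if err != nil {
        return nil, nil, err
    }
    scores, err := ext.Prioritize(pod, filtered)
    if err != nil {
        return nil, nil, err
    }
    return filtered, scores, nil
}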
@ravigadde
Contributor Author

cc @bgrant0607 @davidopp @erictune

@bgrant0607 added the sig/scheduling, area/extensibility, and team/master labels on Jul 18, 2015
@erictune
Member

You gave the example of "put a pod where its storage is". Could this be implemented via a node selector on the pod?
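
For concreteness, a minimal sketch of that suggestion, assuming whatever agent knows where the volume lives publishes a node label (the label key below is hypothetical):

// podPinnedToVolumeHosts returns a pod that the default scheduler will only place
// on nodes carrying the (hypothetical) label advertising the volume's location.
func podPinnedToVolumeHosts() *api.Pod {
    return &api.Pod{
        ObjectMeta: api.ObjectMeta{Name: "db"},
        Spec: api.PodSpec{
            NodeSelector: map[string]string{
                "storage.example.com/volume-host": "true", // hypothetical label key
            },
            Containers: []api.Container{{Name: "db", Image: "mysql"}},
        },
    }
}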

@bgrant0607 added the priority/backlog label on Jul 24, 2015
@bgrant0607
Member

Speaking of labels, we'll need a way for Kubelets to export

  • node attributes, represented as labels
  • node resources, represented as non-zero capacity of generic counted resources

with these attributes/resources provided by a configuration file and/or plugin -- supporting both a file (which the kubelet would monitor for changes) and HTTP would likely be adequate.

With respect to co-location and other commonly desired features, attempting to arrive at a general solution would be more desirable than hiding the behavior in an extension.

As for actual extension of the scheduler, there are a couple possible approaches:

  1. Run a different scheduler. This is the approach used in Kubernetes-on-Mesos.
  2. Add new fit and/or priority functions to the generic scheduler. This is a special case of (1) where most of the existing scheduling code is reused (see the sketch after this list).
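
As an illustration of approach (2), here is a sketch of what a provider-specific fit predicate could look like. The signature mirrors the general shape of the scheduler's fit predicates at the time, but treat both it and the storage lookup as illustrative assumptions rather than the actual plugin API.

// storageLocalityPredicate admits only nodes that host the pod's storage,
// according to some external lookup (stubbed out below).
func storageLocalityPredicate(pod *api.Pod, existingPods []*api.Pod, node string) (bool, error) {
    return volumeLivesOn(pod, node)
}

// volumeLivesOn stands in for whatever query the storage backend exposes;
// it is a hypothetical helper, not part of Kubernetes.
func volumeLivesOn(pod *api.Pod, node string) (bool, error) {
    // ...ask the external storage system; always true in this stub.
    return true, nil
}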

The current cloudprovider API isn't a good model -- see #2770 and #10503.

I do want to support multiple schedulers, including user-provided schedulers. The main thing we need is a way to not apply the default scheduling behavior, perhaps by namespace, using some sort of "admission control" plugin. I want that to be a fairly generic mechanism, since I expect to need it at least for horizontal auto-scaling and vertical auto-sizing, as well, if not other things in the future. This is related to the "initializers" topic (#3585).

A custom scheduler would then need to be configured for which pods to schedule. This probably requires adding information of some form to the pods. We should take that into account when thinking about the general initializer mechanism.

The custom scheduler should then be able to watch all pods and nodes to keep its state up to date, but I imagine we could add more information, events, or something to make this more convenient, as requested in #1517.

@timothysc @eparis

@bgrant0607
Member

Also, I think the scheduler is a candidate for moving to its own github project, once we figure out how to manage testing, releases, etc.

@ravigadde
Contributor Author

@erictune

In this case, the pod user doesn't need to know where the volume is actually deployed. Also, it may not be a predicate, but a priority function that prefers to deploy the pod on the host where the volume physically resides. There are a few more examples I can think of:
a) Scheduling based on network requirements (connectivity to a specific subnet that doesn't span all nodes in the cluster)
b) Scheduling based on QoS requirements
(These are not supported in the pod spec today. They may be some day, but a provider may have unique requirements that are not captured in the stock scheduler.)

@bgrant0607
I am following #10503 and prototyping this scheduling extension using the HTTP cloud provider.

For our use case, we have QoS, network, and storage requirements that affect scheduling. A different vendor may have different requirements. If there is a way to generically extend the scheduler, it will help everyone.

We would also like to be able to use the stock kubernetes distribution (through one of the many OS vendors who support it) rather than ship our own scheduler. Hence the push for extensibility.

Is it worth discussing tomorrow, at a face-to-face meeting, or at LinuxCon?

@davidopp
Member

Aspects of this topic were discussed earlier in #9920

@bgrant0607
Member

How about the August 7 community hangout?

@ravigadde
Contributor Author

@bgrant0607
Thanks! I am out that week and the week after on vacation. I am not sure whether there will be a hangout the week of LinuxCon, but I will be there Mon-Wed and available anytime to discuss.

@bgrant0607
Member

cc @smarterclayton @eparis @ncdc

@davidopp
Member

I agree with @bgrant0607 that writing your own scheduler is probably a better direction. We should try to factor the default Kubernetes scheduler so that people who want to write their own schedulers can import/reuse the core parts as a library; otherwise there will be a lot of code duplication across all the schedulers people write. @lavalamp's controller framework and modeler are steps towards that, in a sense.

But it would be great to hear more about your thoughts.

@ravigadde
Contributor Author

@davidopp

Thanks for sharing your thoughts. Technically, either approach works. As I mentioned earlier in the thread, it's more for business reasons that we would like to use the stock Kubernetes distribution from one of the OS vendors: they sell support for the distro, and if we ship our own scheduler, it violates that support model.

I will post some prototype changes to this thread in the next couple of days. The scope of change to the scheduler is minimal; I hope that alleviates some of the concerns. We can discuss the rest on a call or in person.

@markebalch

An extensible scheduler has value beyond a third-party support model. Being able to plug more flexible placement information into an existing Kubernetes deployment, where the default scheduler is otherwise doing a fine job, saves rework and porting/updating every time the default scheduler (or the plugin) changes.

Having many different people release their own schedulers (even with libraries) seems to defeat the point of a common project that takes the best of what everyone has to offer; instead it makes people choose all-or-nothing with a monolithic scheduler.

There may be valid concerns about destabilizing the scheduler with non-deterministic responses. We can overcome these, as other projects have done, with well-defined interfaces. The plugin is responsible for adhering to the interface and any response criteria that are established. Are there any specific concerns with the concept of an extensible scheduler, or is this more a matter of priority?

@ravigadde
Contributor Author

@davidopp @bgrant0607

Sample code here - ravigadde/kube-scheduler@23ad25b

Please let me know your thoughts.

@bgrant0607
Member

Sorry, @ravigadde. Still digging out of the pre-1.0 backlog.

I have a few quick comments:

  1. Scheduler extensions should not be tied to cloudprovider in any way. There are many reasons why people might want to extend the scheduler.
  2. We have existing mechanisms for adding new filtering and prioritization passes. Why not just use/improve them? Then you'd just need to add bind/unbind hooks.
  3. FYI/FWIW, we plan on making significant changes to the scheduler for performance/scale, for improvements in the way prioritization works, and to take observed resource usage into account. You could continue to develop on the current implementation in the meantime, but we may fork or rewrite it.

@ravigadde
Contributor Author

@bgrant0607

Thanks for your comments.

  1. I was trying to latch on to an existing abstraction. I can introduce a new abstraction.
  2. None that I am aware of that work without changing the binary. As I have mentioned before, we have a requirement to be able to call out from the stock scheduler binary for our extensions.
    With regards to the bind/unbind calls: in our case some resources are tracked outside of Kubernetes. The extension is not aware of the final node selection and hence needs to be informed of it to account for the resources used.
  3. Thanks. Are there any specific threads that I can track?

@davidopp
Member

@ravigadde Please correct me if I'm wrong -- IIUC, your PR allows the scheduler to outsource "Prioritize" and "Filter" operations (i.e. priority functions and predicates) to an external process which it contacts via HTTP (this other process also has endpoints for "bind" and "unbind" so it can be informed of changes in cluster state--BTW I'm not sure this is sufficient since it also needs to know each machine's capacity, labels, etc.). Is that correct?

If that's correct, I'm not understanding why this is considered to be using "stock kubernetes distribution" whereas writing your own scheduler is not. The only difference I see is the direction of the communication -- in your model, Kubernetes scheduler calls out to your scheduler, while in the model @bgrant0607 and I were discussing above (see also #11793), your scheduler calls into Kubernetes to watch state and post bindings.

That said, I guess there isn't any reason why we couldn't have the scheduler call out to another process (modulo performance concerns). But I agree with @bgrant0607 that there's no reason to use the cloud provider interface -- you could probably just configure the identity of the remote endpoint using the scheduler config file (see plugin/pkg/scheduler/api/).
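
To make that model concrete, a minimal sketch of such an external process: a small HTTP server exposing filter and prioritize endpoints that the scheduler would call. The paths and wire format below are hypothetical placeholders, not a defined API.

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// Hypothetical wire types for illustration only.
type extenderArgs struct {
    Pod   json.RawMessage `json:"pod"`
    Nodes json.RawMessage `json:"nodes"`
}

type hostPriority struct {
    Host  string `json:"host"`
    Score int    `json:"score"`
}

func main() {
    // Filter: apply provider-specific predicates (storage locality, network, QoS, ...)
    // and return the subset of nodes that pass. This stub echoes all nodes back.
    http.HandleFunc("/filter", func(w http.ResponseWriter, r *http.Request) {
        var args extenderArgs
        if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        w.Header().Set("Content-Type", "application/json")
        w.Write(args.Nodes)
    })

    // Prioritize: score each candidate node; the scheduler would add these scores
    // to the ones produced by its built-in priority functions. This stub expresses
    // no preference.
    http.HandleFunc("/prioritize", func(w http.ResponseWriter, r *http.Request) {
        var args extenderArgs
        if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode([]hostPriority{})
    })

    log.Fatal(http.ListenAndServe(":8888", nil))
}

The scheduler side would then only need to know the endpoint's base URL, which fits the config-file approach suggested above.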

@ravigadde
Contributor Author

@davidopp
Thanks for your response.

Yes, that is correct. The other process handling these calls is aware of node resources/labels; it watches the apiserver for any changes.

There are subtle differences in the two models (mostly non-technical).
a) The Kubernetes binaries are provided, maintained, and supported by an OS/OSS vendor. We don't want to make an exception for the scheduler.
b) We could technically fork and add our extensions to the scheduler, but there would be a sync overhead that will only get worse over time. Also, if something breaks, the onus is on us to prove that our scheduler is not significantly different from the stock version.
c) The Kubernetes binaries can be independently upgraded without any dependency on us (as long as the APIs don't change).

The proposal aims to address the above issues. I originally intended for the apiserver to invoke Unbind on pod deletion so the resources associated with the pod can be cleaned up; hence I needed an interface that could be shared by both. But this could also be achieved by watching the apiserver for pod deletion. I will go with your suggestion of using the scheduler config file and create a PR. Please let me know if anything is not clear; we can discuss in the call tomorrow.
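
For illustration, one possible shape for such an entry in the scheduler config file, expressed as the Go struct it would decode into. Every field name here is a hypothetical sketch, not a settled schema.

// extenderConfig is a hypothetical sketch of a per-extension entry in the
// scheduler's config file; field names are illustrative only.
type extenderConfig struct {
    // Base URL of the extension endpoint, e.g. "https://storage-scheduler.example.com".
    URLPrefix string `json:"urlPrefix"`
    // Paths appended to URLPrefix for the filter and prioritize calls.
    FilterVerb     string `json:"filterVerb"`
    PrioritizeVerb string `json:"prioritizeVerb"`
    // Weight applied to the scores returned by the extension before they are
    // added to the scheduler's own scores.
    Weight int `json:"weight"`
}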

@davidopp
Member

@ravigadde I'd be happy to discuss with you offline. Send me an email at (my github username) @ google.com and we can arrange a time.

@bgrant0607
Member

@ravigadde

I don't see how this proposal addresses your expressed concerns.

a) Your custom scheduling logic would be in a non-maintained/supported binary.
b) Your custom scheduling logic would amount to a fork.
c) You would be responsible for upgrading your own scheduling component.

As for this specific API:

We need to be able to run the "fit" check in several places. In addition to the scheduler, we already run it in the Kubelet, and we have discussed also running it in the apiserver upon calls to /binding. We're not going to call out to another endpoint in all those places.

Also, while we need to support the current scheduler configuration for some time, we anticipate creating a new approach to prioritization, and would like to avoid creating hard-to-remove dependencies on the current approach.

Notification of binds and unbinds is insufficient to communicate the state of the cluster. How would the scheduler extension get the initial state? How would it update its state after an outage of the apiserver? What should be done if the scheduler were unable to contact your extension? The logic needs to be "level-based". https://github.com/kubernetes/kubernetes/blob/master/docs/design/principles.md#control-logic

Is there something we could do to make it easier to fork and extend the scheduler, such as refactoring more of it into reusable libraries / frameworks?

We could also investigate how to make it easier to keep a scheduler's state up to date using get/watch: #1517. That will likely happen as a consequence of increasing the amount of caching done in the scheduler.
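
As a sketch of what "level-based" could look like for such an extension: rebuild the full state with a list on startup (and after any failure), then follow the watch stream for incremental updates. The snippet below talks to the apiserver's REST API directly; authentication, resourceVersion bookkeeping, and real error handling are omitted, and the apiserver URL is assumed.

// Imports assumed: "encoding/json", "net/http", "time".
// syncPods keeps an extension's view of pods level-based: every time the watch
// breaks (or on startup), it re-lists the authoritative state before resuming.
func syncPods(apiserverURL string, apply func(eventType string, obj json.RawMessage)) {
    for {
        // 1. Re-list: authoritative snapshot of the current state.
        resp, err := http.Get(apiserverURL + "/api/v1/pods")
        if err != nil {
            time.Sleep(5 * time.Second)
            continue
        }
        var list struct {
            Items []json.RawMessage `json:"items"`
        }
        json.NewDecoder(resp.Body).Decode(&list)
        resp.Body.Close()
        for _, item := range list.Items {
            apply("SYNC", item) // replace local state from the snapshot
        }

        // 2. Watch: apply incremental changes until the stream breaks, then re-list.
        w, err := http.Get(apiserverURL + "/api/v1/pods?watch=true")
        if err != nil {
            continue
        }
        dec := json.NewDecoder(w.Body)
        for {
            var ev struct {
                Type   string          `json:"type"` // ADDED, MODIFIED, DELETED
                Object json.RawMessage `json:"object"`
            }
            if err := dec.Decode(&ev); err != nil {
                break // watch ended or broke; go back to re-listing
            }
            apply(ev.Type, ev.Object)
        }
        w.Body.Close()
    }
}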

@bgrant0607
Member

Meeting summary:

  • Drop bind and unbind. They are insufficient and unnecessary.
  • We need to align the extension HTTP(S) API design with other HTTP-based extensions, such as the HTTP cloudprovider (#10503).
  • For reasons described above, this specific extension API would be tied to the existing scheduler.
  • This API would be implemented behind the existing fit and prioritization plugin and configuration mechanisms, would be optional, and would not be enabled by default.
  • If we were to perform optimizations that would cause nodes to not be evaluated, we'd need to disable them if such an API were enabled.
  • Likely this API would not meet our scaling goals, so we would not recommend it for high-scale deployments.
  • The fit predicates would not be called from anywhere but the scheduler, which has implications for moving the scheduler fit predicates into a library (#12744).

@smarterclayton
Contributor

Meeting summary:

  • Drop bind and unbind. They are insufficient and unnecessary.

In the proposal, or our API?

@ravigadde
Contributor Author

@smarterclayton - In the proposal.
@bgrant0607 - Thank you for the summary.

@timothysc
Member

IMHO, this whole proposal sounds like what you really want is #17197.

@ravigadde
Contributor Author

No it isn't. I will explain in the other thread.

@davidopp
Member

This was implemented in #13580

ref #11470
