
Scheduler extension Proposal #11470

Closed
ravigadde opened this issue Jul 17, 2015 · 25 comments
Labels
area/extensibility, priority/backlog, sig/scheduling

Comments

@ravigadde
Contributor

The Kubernetes scheduler schedules based on resources managed by Kubernetes. Scheduling based on opaque resource counting helps extend this further. But when there is a need for contextual scheduling against resources managed outside of Kubernetes (for example: place a pod where its storage is), there is no mechanism to do it today.

The proposal is to make the Kubernetes scheduler extensible by adding the capability to make HTTP calls out to another endpoint to help achieve this functionality. I am curious whether you think the cloud provider abstraction is the right abstraction for the implementation.

Here is a rough draft of what I am thinking about. I would like to solicit community feedback.

// "api" is the Kubernetes API package; "schedulerapi" refers to the scheduler's
// API package (plugin/pkg/scheduler/api).
type SchedulerExtension interface {

      // Filter based on provider-implemented predicate functions.
      Filter(pod *api.Pod, nodes *api.NodeList) (*api.NodeList, error)

      // Prioritize based on provider-implemented priority functions. Weight*priority is summed for
      // each such priority function. The returned score is added to the score computed by the
      // Kubernetes scheduler, and the total score is used for host selection.
      Prioritize(pod *api.Pod, nodes *api.NodeList) (*schedulerapi.HostPriorityList, error)

      // Inform the provider about the scheduling decision. This could also be done by the provider
      // watching the apiserver pods/binding endpoint.
      Bind(pod *api.Pod, host string) error

      // Inform the provider about the unbind, to be called by the apiserver in the pod deletion path.
      // This could also be accomplished by watching the apiserver pods endpoint.
      Unbind(pod *api.Pod, host string) error
}
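
To make the intended call flow concrete, here is a minimal sketch of how a scheduler could consult such an extension. The function name and wiring are hypothetical and only illustrate the Filter → Prioritize → Bind sequence described above; they are not actual scheduler code.

// Sketch only: assumes the api and schedulerapi packages noted above.
// filterAndScoreWithExtension narrows the candidate nodes with the provider's
// predicates and collects the provider's scores; the caller would add those
// scores to the built-in priority results, pick the best host, and then call
// ext.Bind(pod, chosenHost) so the provider can account for the resources used.
func filterAndScoreWithExtension(ext SchedulerExtension, pod *api.Pod, nodes *api.NodeList) (*api.NodeList, *schedulerapi.HostPriorityList, error) {
    filtered, err := ext.Filter(pod, nodes)
    if err != nil {
        return nil, nil, err
    }
    scores, err := ext.Prioritize(pod, filtered)
    if err != nil {
        return nil, nil, err
    }
    return filtered, scores, nil
}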
@ravigadde
Contributor Author

cc @bgrant0607 @davidopp @erictune

@bgrant0607 added the sig/scheduling, area/extensibility, and team/master labels on Jul 18, 2015
@erictune
Member

You gave the example of "put a pod where its storage is". Could this be implemented via a node selector on the pod?
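
For concreteness, a minimal sketch of that suggestion, assuming whatever agent knows where the volume lives publishes a node label (the label key below is hypothetical):

// podPinnedToVolumeHosts returns a pod that the default scheduler will only place
// on nodes carrying the (hypothetical) label advertising the volume's location.
func podPinnedToVolumeHosts() *api.Pod {
    return &api.Pod{
        ObjectMeta: api.ObjectMeta{Name: "db"},
        Spec: api.PodSpec{
            NodeSelector: map[string]string{
                "storage.example.com/volume-host": "true", // hypothetical label key
            },
            Containers: []api.Container{{Name: "db", Image: "mysql"}},
        },
    }
}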

@bgrant0607 added the priority/backlog label on Jul 24, 2015
@bgrant0607
Member

Speaking of labels, we'll need a way for Kubelets to export

  • node attributes, represented as labels
  • node resources, represented as non-zero capacity of generic counted resources

with these attributes/resources provided by a configuration file and/or plugin -- supporting both a file (which the kubelet would monitor for changes) and HTTP would likely be adequate.

With respect to co-location and other commonly desired features, attempting to arrive at a general solution would be more desirable than hiding the behavior in an extension.

As for actual extension of the scheduler, there are a couple possible approaches:

  1. Run a different scheduler. This is the approach used in Kubernetes-on-Mesos.
  2. Add new fit and/or priority functions to the generic scheduler. This is a special case of (1) where most of the existing scheduling code is reused (see the sketch after this list).
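
As an illustration of approach (2), here is a sketch of what a provider-specific fit predicate could look like. The signature mirrors the general shape of the scheduler's fit predicates at the time, but treat both it and the storage lookup as illustrative assumptions rather than the actual plugin API.

// storageLocalityPredicate admits only nodes that host the pod's storage,
// according to some external lookup (stubbed out below).
func storageLocalityPredicate(pod *api.Pod, existingPods []*api.Pod, node string) (bool, error) {
    return volumeLivesOn(pod, node)
}

// volumeLivesOn stands in for whatever query the storage backend exposes;
// it is a hypothetical helper, not part of Kubernetes.
func volumeLivesOn(pod *api.Pod, node string) (bool, error) {
    // ...ask the external storage system; always true in this stub.
    return true, nil
}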

The current cloudprovider API isn't a good model -- see #2770 and #10503.

I do want to support multiple schedulers, including user-provided schedulers. The main thing we need is a way to not apply the default scheduling behavior, perhaps by namespace, using some sort of "admission control" plugin. I want that to be a fairly generic mechanism, since I expect to need it at least for horizontal auto-scaling and vertical auto-sizing, as well, if not other things in the future. This is related to the "initializers" topic (#3585).

A custom scheduler would then need to be configured for which pods to schedule. This probably requires adding information of some form to the pods. We should take that into account when thinking about the general initializer mechanism.

The custom scheduler should then be able to watch all pods and nodes to keep its state up to date, but I imagine we could add more information, events, or something to make this more convenient, as requested in #1517.

@timothysc @eparis

@bgrant0607
Member

Also, I think the scheduler is a candidate for moving to its own github project, once we figure out how to manage testing, releases, etc.

@ravigadde
Contributor Author

@erictune

In this case, the pod user doesn't need to know where the volume is actually deployed. Also, it may not be a predicate, but a priority function that prefers to deploy the pod on the host where the volume physically resides. There are a few more examples I can think of:
a) Scheduling based on network requirements (connectivity to a specific subnet that doesn't span all nodes in the cluster)
b) Scheduling based on QoS requirements
(These are not supported in the pod spec today. They may be some day, but a provider may have unique requirements that are not captured in the stock scheduler.)

@bgrant0607
I am following #10503 and prototyping this scheduling extension using the HTTP cloud provider.

For our use case, we have QoS, network, and storage requirements that affect scheduling. A different vendor may have different requirements. If there is a way to generically extend the scheduler, it will help everyone.

We would also like to be able to use the stock kubernetes distribution (through one of the many OS vendors who support it) rather than ship our own scheduler. Hence the push for extensibility.

Is it worth discussing tomorrow, at a face-to-face meeting, or at LinuxCon?

@davidopp
Member

Aspects of this topic were discussed earlier in #9920

@bgrant0607
Member

How about the August 7 community hangout?

@ravigadde
Contributor Author

@bgrant0607
Thanks! I am out that week and the week after on vacation. I am not sure whether there will be a hangout the week of LinuxCon, but I will be there Mon-Wed and available anytime to discuss.

@bgrant0607
Member

cc @smarterclayton @eparis @ncdc

@davidopp
Member

I agree with @bgrant0607 that writing your own scheduler is probably a better direction. We should try to factor the default Kubernetes scheduler so that people who want to write their own schedulers can import/reuse the core parts as a library; otherwise there will be a lot of code duplication across all the schedulers people write. @lavalamp's controller framework and modeler are steps towards that, in a sense.

But it would be great to hear more about your thoughts.

@ravigadde
Contributor Author

@davidopp

Thanks for sharing your thoughts. Technically, either approach works. As I mentioned earlier in the thread, it's more for business reasons that we would like to use the stock Kubernetes distribution from one of the OS vendors: they sell support for the distro, and if we ship our own scheduler, it violates that support model.

I will post some prototype changes to this thread in the next couple of days. The scope of change to the scheduler is minimal; I hope that alleviates some of the concerns. We can discuss the rest on a call or in person.

@markebalch

An extensible scheduler has value beyond a third-party support model. Being able to plug more flexible placement information into an existing Kubernetes deployment, where the default scheduler is otherwise doing a fine job, saves rework and porting/updating every time the default scheduler (or the plugin) changes.

Having many different people release their own schedulers (even with libraries) seems to defeat the point of a common project that takes the best of what everyone has to offer; instead it makes people choose all-or-nothing with a monolithic scheduler.

There may be valid concerns about destabilizing the scheduler with non-deterministic responses. We can overcome these, as other projects have done, with well-defined interfaces. The plugin is responsible for adhering to the interface and any response criteria that are established. Are there any specific concerns with the concept of an extensible scheduler, or is this more a matter of priority?

@ravigadde
Contributor Author

@davidopp @bgrant0607

Sample code here - ravigadde/kube-scheduler@23ad25b

Please let me know your thoughts.

@bgrant0607
Member

Sorry, @ravigadde. Still digging out of the pre-1.0 backlog.

I have a few quick comments:

  1. Scheduler extensions should not be tied to cloudprovider in any way. There are many reasons why people might want to extend the scheduler.
  2. We have existing mechanisms for adding new filtering and prioritization passes. Why not just use/improve them? Then you'd just need to add bind/unbind hooks.
  3. FYI/FWIW, we plan on making significant changes to the scheduler for performance/scale, for improvements in the way prioritization works, and to take observed resource usage into account. You could continue to develop on the current implementation in the meantime, but we may fork or rewrite it.

@ravigadde
Contributor Author

@bgrant0607

Thanks for your comments.

  1. I was trying to latch on to an existing abstraction. I can introduce a new abstraction.
  2. None that I am aware of that work without changing the binary. As I have mentioned before, we have a requirement to be able to call out from the stock scheduler binary for our extensions.
    With regards to the bind/unbind calls: in our case some resources are tracked outside of Kubernetes. The extension is not aware of the final node selection and hence needs to be informed of it to account for the resources used.
  3. Thanks. Are there any specific threads that I can track?

@davidopp
Member

@ravigadde Please correct me if I'm wrong -- IIUC, your PR allows the scheduler to outsource "Prioritize" and "Filter" operations (i.e. priority functions and predicates) to an external process which it contacts via HTTP (this other process also has endpoints for "bind" and "unbind" so it can be informed of changes in cluster state--BTW I'm not sure this is sufficient since it also needs to know each machine's capacity, labels, etc.). Is that correct?

If that's correct, I'm not understanding why this is considered to be using "stock kubernetes distribution" whereas writing your own scheduler is not. The only difference I see is the direction of the communication -- in your model, Kubernetes scheduler calls out to your scheduler, while in the model @bgrant0607 and I were discussing above (see also #11793), your scheduler calls into Kubernetes to watch state and post bindings.

That said, I guess there isn't any reason why we couldn't have the scheduler call out to another process (modulo performance concerns). But I agree with @bgrant0607 that there's no reason to use the cloud provider interface -- you could probably just configure the identity of the remote endpoint using the scheduler config file (see plugin/pkg/scheduler/api/).
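
To make that model concrete, a minimal sketch of such an external process: a small HTTP server exposing filter and prioritize endpoints that the scheduler would call. The paths and wire format below are hypothetical placeholders, not a defined API.

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// Hypothetical wire types for illustration only.
type extenderArgs struct {
    Pod   json.RawMessage `json:"pod"`
    Nodes json.RawMessage `json:"nodes"`
}

type hostPriority struct {
    Host  string `json:"host"`
    Score int    `json:"score"`
}

func main() {
    // Filter: apply provider-specific predicates (storage locality, network, QoS, ...)
    // and return the subset of nodes that pass. This stub echoes all nodes back.
    http.HandleFunc("/filter", func(w http.ResponseWriter, r *http.Request) {
        var args extenderArgs
        if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        w.Header().Set("Content-Type", "application/json")
        w.Write(args.Nodes)
    })

    // Prioritize: score each candidate node; the scheduler would add these scores
    // to the ones produced by its built-in priority functions. This stub expresses
    // no preference.
    http.HandleFunc("/prioritize", func(w http.ResponseWriter, r *http.Request) {
        var args extenderArgs
        if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode([]hostPriority{})
    })

    log.Fatal(http.ListenAndServe(":8888", nil))
}

The scheduler side would then only need to know the endpoint's base URL, which fits the config-file approach suggested above.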

@ravigadde
Contributor Author

@davidopp
Thanks for your response.

Yes, that is correct. The other process handling these calls is aware of node resources/labels; it watches the apiserver for any changes.

There are subtle differences in the two models (mostly non-technical).
a) The Kubernetes binaries are provided, maintained, and supported by an OS/OSS vendor. We don't want to make an exception for the scheduler.
b) We could technically fork and add our extensions to the scheduler, but there would be a sync overhead that will only get worse over time. Also, if something breaks, the onus is on us to prove that our scheduler is not significantly different from the stock version.
c) The Kubernetes binaries can be independently upgraded without any dependency on us (as long as the APIs don't change).

The proposal aims to address the above issues. I originally intended for the apiserver to invoke Unbind on pod deletion so the resources associated with the pod can be cleaned up; hence I needed an interface that could be shared by both. But this could also be achieved by watching the apiserver for pod deletion. I will go with your suggestion of using the scheduler config file and create a PR. Please let me know if anything is not clear; we can discuss in the call tomorrow.
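
For illustration, one possible shape for such an entry in the scheduler config file, expressed as the Go struct it would decode into. Every field name here is a hypothetical sketch, not a settled schema.

// extenderConfig is a hypothetical sketch of a per-extension entry in the
// scheduler's config file; field names are illustrative only.
type extenderConfig struct {
    // Base URL of the extension endpoint, e.g. "https://storage-scheduler.example.com".
    URLPrefix string `json:"urlPrefix"`
    // Paths appended to URLPrefix for the filter and prioritize calls.
    FilterVerb     string `json:"filterVerb"`
    PrioritizeVerb string `json:"prioritizeVerb"`
    // Weight applied to the scores returned by the extension before they are
    // added to the scheduler's own scores.
    Weight int `json:"weight"`
}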

@davidopp
Member

@ravigadde I'd be happy to discuss with you offline. Send me an email at (my github username) @ google.com and we can arrange a time.

@bgrant0607
Member

@ravigadde

I don't see how this proposal addresses your expressed concerns.

a) Your custom scheduling logic would be in a non-maintained/supported binary.
b) Your custom scheduling logic would amount to a fork.
c) You would be responsible for upgrading your own scheduling component.

As for this specific API:

We need to be able to run the "fit" check in several places. In addition to the scheduler, we already run it in the Kubelet, and we have discussed also running it in the apiserver upon calls to /binding. We're not going to call out to another endpoint in all those places.

Also, while we need to support the current scheduler configuration for some time, we anticipate creating a new approach to prioritization, and would like to avoid creating hard-to-remove dependencies on the current approach.

Notification of binds and unbinds is insufficient to communicate the state of the cluster. How would the scheduler extension get the initial state? How would it update its state after an outage of the apiserver? What should be done if the scheduler were unable to contact your extension? The logic needs to be "level-based". https://github.com/kubernetes/kubernetes/blob/master/docs/design/principles.md#control-logic

Is there something we could do to make it easier to fork and extend the scheduler, such as refactoring more of it into reusable libraries / frameworks?

We could also investigate how to make it easier to keep a scheduler's state up to date using get/watch: #1517. That will likely happen as a consequence of increasing the amount of caching done in the scheduler.
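
As a sketch of what "level-based" could look like for such an extension: rebuild the full state with a list on startup (and after any failure), then follow the watch stream for incremental updates. The snippet below talks to the apiserver's REST API directly; authentication, resourceVersion bookkeeping, and real error handling are omitted, and the apiserver URL is assumed.

// Imports assumed: "encoding/json", "net/http", "time".
// syncPods keeps an extension's view of pods level-based: every time the watch
// breaks (or on startup), it re-lists the authoritative state before resuming.
func syncPods(apiserverURL string, apply func(eventType string, obj json.RawMessage)) {
    for {
        // 1. Re-list: authoritative snapshot of the current state.
        resp, err := http.Get(apiserverURL + "/api/v1/pods")
        if err != nil {
            time.Sleep(5 * time.Second)
            continue
        }
        var list struct {
            Items []json.RawMessage `json:"items"`
        }
        json.NewDecoder(resp.Body).Decode(&list)
        resp.Body.Close()
        for _, item := range list.Items {
            apply("SYNC", item) // replace local state from the snapshot
        }

        // 2. Watch: apply incremental changes until the stream breaks, then re-list.
        w, err := http.Get(apiserverURL + "/api/v1/pods?watch=true")
        if err != nil {
            continue
        }
        dec := json.NewDecoder(w.Body)
        for {
            var ev struct {
                Type   string          `json:"type"` // ADDED, MODIFIED, DELETED
                Object json.RawMessage `json:"object"`
            }
            if err := dec.Decode(&ev); err != nil {
                break // watch ended or broke; go back to re-listing
            }
            apply(ev.Type, ev.Object)
        }
        w.Body.Close()
    }
}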

@bgrant0607
Member

Meeting summary:

  • Drop bind and unbind. They are insufficient and unnecessary.
  • We need to align the extension HTTP(S) API design with other HTTP-based extensions, such as the HTTP cloudprovider (#10503).
  • For reasons described above, this specific extension API would be tied to the existing scheduler.
  • This API would be implemented behind the existing fit and prioritization plugin and configuration mechanisms, would be optional, and would not be enabled by default.
  • If we were to perform optimizations that would cause nodes to not be evaluated, we'd need to disable them if such an API were enabled.
  • Likely this API would not meet our scaling goals, so we would not recommend it for high-scale deployments.
  • The fit predicates would not be called from anywhere but the scheduler, which has implications for moving the scheduler fit predicates into a library (#12744).

@smarterclayton
Contributor

Meeting summary:

  • Drop bind and unbind. They are insufficient and unnecessary.

In the proposal, or our API?

@ravigadde
Contributor Author

@smarterclayton - In the proposal.
@bgrant0607 - Thank you for the summary.

@timothysc
Member

IMHO, this whole proposal sounds like what you really want is #17197.

@ravigadde
Contributor Author

No it isn't. I will explain in the other thread.

@davidopp
Member

This was implemented in #13580

ref #11470
