Proposal: kubeflow-scheduler #68
Comments
We can discuss more details about this proposal on the Google doc. |
Comments are welcome :-) /cc @DjangoPeng @mitake |
@ScorpioCPH @gaocegege Thanks for your work on kubeflow-scheduler. As we all know, Kubeflow is dedicated to supporting multiple ML frameworks, but they differ in their resource, cluster, and other specifications. So, do we plan to make one kubeflow-scheduler for all ML frameworks, or one per framework? |
@ScorpioCPH thanks for creating the doc. Can I add the motivation and ideas for the gang scheduler to the doc directly? |
@mitake Sure, please feel free to edit this doc. |
@DjangoPeng It depends on the different requirements of the ML frameworks; maybe we can start with TensorFlow. |
@DjangoPeng Yeah, I think one scheduler for all ML frameworks would be awesome, but I am not sure we could implement such a generic one. Let's start with TensorFlow, since Kubeflow started with it, too. |
SGTM. @ScorpioCPH @gaocegege |
Thank you for putting this together. The proposal seems to address a couple of problems with a custom scheduler.
However, an alternative solution would be to extend the existing GPU support to handle multiple types of GPUs and to use kubernetes-incubator/kube-arbitrator (see also kubeflow/training-operator#165). What are the advantages and disadvantages of using a custom scheduler compared to these alternative approaches? |
This is terrific stuff - I share @jlewi's questions. I don't mean to dismiss using a custom scheduler, but I am curious if we can avoid having to write our own. +1 for supporting other frameworks (eventually). I'm totally ok skipping that in the first version. |
The main advantage of the custom scheduler approach is that we don't have to add scheduling mechanisms and heuristics to the default scheduler. That is good for development speed and makes maintenance easier. The main disadvantage is that we would probably need a concurrency control mechanism between schedulers. That seems to be a very subtle and difficult problem, and no clean solution is available yet. Also, kube-arbitrator would be used in the custom scheduler approach for controlling the TfJob objects of tf_operator. But kube-arbitrator itself doesn't provide a gang-scheduling mechanism, so I think we need the custom scheduler anyway. cc @k82cn |
kube-arbitrator already supports that. |
And gang-scheduling is just one part of kube-arbitrator; we already discussed several points about supporting "batch" workloads (Spark, TF) last year: https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA/edit :). kube-arbitrator will handle the scheduling part; it needs a TF job controller to work with it, e.g. to send resource requests to it and monitor the job's status. |
@k82cn Thanks! We will take a look at this. |
For the GPU scheduling stuff, I'd really like it if @vishh could weigh in since he's far more knowledgeable about accelerators in Kubernetes. Unfortunately, I think he's out of the office until the last week of January. Can we wait until then to finalize plans about GPU scheduling?
Can you explain why we would need to add heuristics to the default scheduler? My expectation is that Kubernetes scheduling will eventually provide all the necessary scheduling features to support hardware accelerators e.g. GPUs. So I wouldn't expect tensorflow/k8s or Kubeflow to be adding heuristics to handle GPU scheduling. Similarly, I wouldn't expect tensorflow/k8s or Kubeflow to provide its own solution to gang-scheduling since this seems like a general problem shared by a variety of workloads (e.g. Spark Jobs). So I'd expect kube-arbitrator or a more generic solution to solve this. |
@jlewi Hi, yes, there are some solutions for our requirements. The GPU device-plugin is supported in the v1.8 release, but it just tells us that a node has some GPUs, without further detail such as the GPU type. And for gang scheduling, thanks so much for bringing up a great discussion here: kubeflow/training-operator#165; we can continue this topic and push it forward together :) |
The issue title "kubeflow-scheduler" seems much more general than the proposal, which focuses on introducing Resource Class.
@ScorpioCPH @gaocegege Can we treat ResourceClass as an optional experimental feature in a Kubeflow deployment? |
LGTM. But beyond Resource Class, @mitake also wants to discuss more about gang scheduling.
Yes, this proposal is about scheduling not about lifecycle management.
Agreed. Resource Class is just like other resources (CPU/memory) in a way: you can specify the requirement or leave it empty. But without GPUs, the performance of many ML workloads may be very low (both in training and inference) :) |
@jiayingz is planning to work on something related to resource classes in H1 2018. The current way to target nodes with particular accelerator types is to use node selectors and node labels. (See doc from PR kubernetes/website#6736) |
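To make the node-label approach above concrete, a minimal sketch (the accelerator label key follows the convention in the doc referenced above; node, pod, and image names are illustrative):

```yaml
# First, label GPU nodes with their accelerator type, e.g.:
#   kubectl label nodes gpu-node-1 accelerator=nvidia-tesla-k80
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
  - name: cuda-test
    image: nvidia/cuda:9.0-base        # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1              # GPU count via the device plugin
  nodeSelector:
    accelerator: nvidia-tesla-k80      # land only on K80 nodes
```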
If every heuristic we need could be added to the default scheduler easily, that would be nice, of course. But we are facing an interesting problem related to scheduling distributed DL frameworks: slow GPUs. We found that a few GPUs can become slow because of overheating. This is a very environment- and application-specific problem, so I thought adding the heuristics to the default scheduler or other general mechanisms wouldn't be so easy. Anyway, I'm still trying to understand how the problem should be solved, so I don't have strong opinions about how to solve it (although our internal solution relies on a custom scheduler and is doing well for now). Of course it would be great if it could be solved by general mechanisms, e.g. the default scheduler or kube-arbitrator. |
That's really interesting. Could you detect the overheating and mark the node unhealthy? |
No need to mark the whole node as unhealthy; with our device plugin, we could mark individual GPUs as unhealthy if we notice that one GPU is repeatedly throttled. It's possible, but it will require some thought about the strategy we use for detecting this case. |
@mitake do you know whether the overheating problem can be detected through some NVML throttling events, as @flx42 mentioned? When such problems happened, were those slow GPUs still usable/recoverable, or would you rather restart your jobs on different nodes with GPUs available? Also curious whether the jobs you mentioned were running on cloud or on prem. On cloud, I would expect such failure events to be detected at the VM level. |
@ScorpioCPH and @gaocegege could you explain more about why the currently available approaches are not enough to solve your use cases (e.g., inter-pod affinity to have jobs scheduled on the same machine as much as possible, and node labels plus nodeSelector to schedule jobs on specialized hardware)? We are hoping to make some progress on the Resource Class design next year and would like to better understand the use cases we should target. |
@jiayingz Glad to hear that! Please involve us in the Resource Class work. Some quick comments about your questions:
|
@ScorpioCPH Do you see a lot of clusters with heterogeneous nodes? I'd expect that most clusters use one type of GPU on each node. |
@jlewi Not a lot, but it exists :) |
@ScorpioCPH @gaocegege So what are the next steps for this proposal? |
It seems to me like this is a stop-gap solution until kube-arbitrator and resource classes are ready for usage. Is that correct? |
+1, I'll try to host a meeting to clarify the requirements :). |
Sorry for my late reply.
The only behaviour we could observe was the slowing down of the problematic GPU. So the cause, overheating, is our guess (it is hard to rule out other root causes that could produce the same result). The GPU is a GeForce, so probably its cooling fan was broken. We could detect the problem by benchmarking, not by management tools.
Probably not. We just observed the throttling through performance degradation.
The GPU was usable. This is worse than the GPU simply stopping, because the training process merely became slow. If the training process had simply stopped and could be restarted, the entire experiment could have been finished faster.
The problem happened in our on-premise cluster. @YujiOshima can share more details. He is the owner of the cluster. |
Thanks a lot for sharing the details, @mitake. I think the use case you described raises an interesting question about whether we want the device plugin to export a management API that can be used to drain or change properties of the underlying devices. Most likely we already have underlying tools to do this, but there is a benefit to having a consistent and portable management model for extended resources. I'm not sure there are strong use cases at the moment. We perhaps want to continue this discussion outside the Kubeflow repo. |
+1 for a use-case meeting. If possible, keeping scheduling decoupled from Kubeflow seems desirable, given the broad cross-application needs for similar scheduling functionality. |
Why? AFAIK NVML or DCGM supports this use case.
If you look at
|
Hi @jiayingz @erikerlandson @flx42! Here are more details of our problem. The result for a healthy GPU is as below.
It keeps a throughput of about 1900 samples/sec to the end.
The temperature of that GPU looks too high.
As @flx42 says, we can see that it is throttled because of thermal limits.
However, the problem can only be detected after the GPU has overheated and performance has already degraded. |
It seems challenging to detect this issue in advance. Especially for temperature: if the fan is only slightly under-performing (but still enough to trigger a clock throttle), it might take seconds or minutes before this happens. We could still try to detect the cases where the GPU is obviously not healthy. At startup time, a device plugin could implement a lengthy health check if it wants to. For instance, launching heavy computations for 30 seconds, then checking the health, and then not advertising the unhealthy GPUs. |
Would it be possible to use a clock throttle signal?
|
What do you mean by signal? With the new DCGM API, you can register a callback that gets called whenever a thermal violation happens. We are going to release Go bindings for DCGM very soon. |
That's perfect! I'm looking for just a bit more - basically a container or some mechanism to register with Kubernetes (or some other system) that the node is unhealthy, and begin evicting until it becomes healthy. For example, using the NodeProblemDetector: https://github.com/kubernetes/node-problem-detector
|
This seems to be still within the realm of the device plugin. IIRC, pods are not evicted today if a device becomes unhealthy; the device is just excluded from scheduling for future jobs. Is that correct @jiayingz @RenaudWasTaken? |
@flx42 what you described is the current behavior. However, some external tools can take a more aggressive response, like tearing down the affected pod or even draining the node. @YujiOshima I guess one possible mitigation is to have a livenessProbe in the training worker pod that watches for thermal-caused throttling, as sketched below? |
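A rough sketch of that mitigation (the probe script path, image, and thresholds are hypothetical; the script would need to wrap nvidia-smi or the DCGM bindings and exit non-zero on sustained throttling):

```yaml
# Fragment of the training worker pod spec. If the check keeps failing,
# kubelet restarts the container per the pod's restartPolicy, and the
# job controller can observe the restarts and react.
containers:
- name: tensorflow
  image: tensorflow/tensorflow:1.4.0-gpu
  livenessProbe:
    exec:
      # Hypothetical script: exits non-zero on sustained thermal throttling.
      command: ["/bin/sh", "-c", "/opt/health/check-gpu-throttle.sh"]
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3
```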
Should we close this issue? We have separate issues for gang-scheduling and kube-arbitrator. Can we close this issue in favor of those? |
@ScorpioCPH @gaocegege Thoughts about closing this issue? |
SGTM |
Kubeflow Scheduler
Status: Draft
Version: Alpha
Implementation Owner: TBD
Authors:
Motivation
Kubeflow has a controller (or operator) that makes it easy to create ML jobs, which are defined by a CRD. When we create new ML jobs, kube-scheduler reacts by scheduling them onto nodes that satisfy their requests. This works well for most workloads, but it is not good enough for ML workloads. Instead, we need a more advanced scheduler to help us schedule ML workloads more efficiently. We can achieve this goal with a custom scheduler.
Use Cases
API
Kubernetes has a proposal about Resource Class, which provides a better resource representation. We can use this feature for resource management.
In the TFJob spec, we can tell kubeflow-scheduler which resources we request by adding a Resource Class entry to the spec, in the same way as a CPU/memory request. For example, say we want a TF worker to use 2 NVIDIA Tesla K80 GPUs while training, as in the sketch below.
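A sketch of what that could look like (the resource-class request key is hypothetical, since the Resource Class API is still under discussion; the surrounding shape follows the TFJob v1alpha1 API of that time):

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: mnist-train
spec:
  replicaSpecs:
  - replicas: 1
    tfReplicaType: WORKER
    template:
      spec:
        containers:
        - name: tensorflow
          image: tensorflow/tensorflow:1.4.0-gpu   # illustrative image
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
              resource-class/nvidia-tesla-k80: 2   # hypothetical Resource Class request
```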
Design
Resource Class
The details of Resource Class are still TBD, as it is under discussion now.
But we can use a CRD and label selectors to implement a simple version of this (only supporting homogeneous architectures for now), which is proposed here; a sketch follows.
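For illustration, one possible shape of that CRD-based stand-in (the group, kind, and field names here are hypothetical, since the real design is TBD):

```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: resourceclasses.kubeflow.org
spec:
  group: kubeflow.org
  version: v1alpha1
  scope: Cluster
  names:
    kind: ResourceClass
    plural: resourceclasses
    singular: resourceclass
---
# An instance that maps a class name to nodes via a label selector;
# the scheduler would resolve a class request to the underlying
# device-plugin resource on matching nodes.
apiVersion: kubeflow.org/v1alpha1
kind: ResourceClass
metadata:
  name: nvidia-tesla-k80
spec:
  resourceName: nvidia.com/gpu     # underlying extended resource
  nodeSelector:
    accelerator: nvidia-tesla-k80  # nodes that carry this class
```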
Scheduler
Kubeflow-scheduler should work well with the default scheduler (kube-scheduler); see the sketch after the list below for how Pods are routed to it. It will watch Pods created by the kubeflow-controller and evaluate whether a Node satisfies the requirements of each Pod via predicate functions. These predicate functions may include:
- Is this Pod CPU-sensitive (e.g. TensorFlow PS) or GPU-sensitive (e.g. TensorFlow Worker)?
- Does this Pod belong to a group of the same job? E.g. workers of the same TensorFlow training job.
- Is this Pod network-sensitive, requesting InfiniBand for high throughput and low latency?
- Is this Pod storage-sensitive, requesting SSD hardware for high I/O speed?
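A minimal sketch of how Pods created by the kubeflow-controller would opt in to this scheduler (the default scheduler skips Pods whose spec.schedulerName names a different scheduler; the labels and names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mnist-train-worker-0
  labels:
    tf_job_name: mnist-train         # lets group/gang predicates find peer Pods
    tf_replica_type: worker
spec:
  schedulerName: kubeflow-scheduler  # handled by kubeflow-scheduler, not kube-scheduler
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:1.4.0-gpu
```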
Where to Start