Proposal: kubeflow-scheduler #68

Closed
ScorpioCPH opened this issue Dec 24, 2017 · 45 comments

Comments

@ScorpioCPH
Member

ScorpioCPH commented Dec 24, 2017

Kubeflow Scheduler

Status: Draft
Version: Alpha
Implementation Owner: TBD

Authors:

Motivation

Kubeflow has a controller (or operator) that makes it easy to create ML jobs, which are defined by a CRD. When we create new ML jobs, kube-scheduler reacts by scheduling them onto nodes that satisfy their resource requests.

This works well for most workloads, but it is not good enough for ML workloads.
Instead, we need a more advanced scheduler that places ML workloads more efficiently.
We can achieve this goal with a custom scheduler.

Use Cases

  • For distributed ML workloads, communication bandwidth can become the bottleneck, so we want the Pods of a job to land on the same machine as much as possible.
  • ML workloads can be accelerated by hardware accelerators (e.g. GPUs), so we want to ensure these workloads are only scheduled on nodes with the specialized hardware.
  • GPUs come in various models with different properties (e.g. cores, memory), so we want to ensure that our workloads get enough resources.
    • Case A: our training jobs require more than 8 GB of GPU memory because we have a large model.
    • Case B: we want some serving jobs to use NVIDIA Tesla K80 GPUs for better inference performance.
  • GPUs can be connected together (e.g. via NVLink) for much higher performance, so hardware topology needs to be considered.

API

Kubernetes has a proposal for Resource Class, which provides a richer resource representation. We can use this feature for resource management.

In the TFJob spec, we can tell kubeflow-scheduler which resources we request by adding a Resource Class to the spec, just like a CPU/memory request.

For example, suppose we want a TF worker to use 2 NVIDIA Tesla K80 GPUs for training:

spec:
  containers:
    - image: tf_training_worker_gpu_1
      name: tensorflow_worker_gpu_1
      resources:
        requests:
          nvidia.com/gpu.tesla.k80: 2
        limits:
          nvidia.com/gpu.tesla.k80: 2

Design

Resource Class

The details of Resource Class are still TBD, as it is under discussion.
But we can use a CRD and label selectors to implement a simple version of it (supporting only homogeneous architectures for now), as proposed here.

Scheduler

Kubeflow-scheduler should work well alongside the default scheduler (kube-scheduler). It will watch the Pods created by kubeflow-controller and evaluate whether a Node satisfies a Pod's requirements using predicate functions; a minimal sketch of how a Pod opts into the custom scheduler is shown after the list below.

These predicate functions may include:

  • Is the Pod CPU-sensitive (e.g. a TensorFlow PS) or GPU-sensitive (e.g. a TensorFlow worker)?
  • Does the Pod belong to a group of Pods from the same job, e.g. workers of the same TensorFlow training job?
  • Is the Pod network-sensitive, requesting InfiniBand for high throughput and low latency?
  • Is the Pod storage-sensitive, requesting SSD hardware for high I/O performance?
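
A minimal sketch of that opt-in, assuming the custom scheduler is registered under the name kubeflow-scheduler and that the controller labels Pods with tf_job_name (both names are assumptions for illustration, not defined by this proposal):

# Hypothetical worker Pod: only spec.schedulerName changes which scheduler owns it.
apiVersion: v1
kind: Pod
metadata:
  name: tf-worker-0
  labels:
    tf_job_name: my-training-job      # assumed label; lets predicates group Pods of one job
spec:
  schedulerName: kubeflow-scheduler   # Pods without this field stay with kube-scheduler
  containers:
    - name: tensorflow-worker
      image: tf_training_worker_gpu_1
      resources:
        requests:
          nvidia.com/gpu.tesla.k80: 2
        limits:
          nvidia.com/gpu.tesla.k80: 2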

Where to Start

@ScorpioCPH
Member Author

We can discuss more details about this proposal in the Google doc.

@gaocegege
Member

gaocegege commented Dec 24, 2017

Welcome comments :-)

/cc @DjangoPeng @mitake

@DjangoPeng
Member

@ScorpioCPH @gaocegege Thanks for your work on kubeflow-scheduler.

As we all know, Kubeflow is dedicated to supporting multiple ML frameworks, but these frameworks have somewhat different specifications for resources, clusters, and so on.

So, do we plan to make one kubeflow-scheduler for all ML frameworks, or one per framework?

@mitake
Contributor

mitake commented Dec 25, 2017

@ScorpioCPH thanks for creating the doc. Can I add motivations and ideas of the gang scheduler to the doc directly?

@ScorpioCPH
Member Author

@mitake Sure, please feel free to edit this doc.

@ScorpioCPH
Member Author

@DjangoPeng It depends on the requirements of the different ML frameworks; maybe we can start with TensorFlow.

@gaocegege
Member

@DjangoPeng Yeah, I think one scheduler for all ML frameworks would be awesome, but I am not sure we could implement such a generic one. Let's start with TensorFlow, since Kubeflow starts from it too.

@DjangoPeng
Member

SGTM. @ScorpioCPH @gaocegege

@jlewi
Contributor

jlewi commented Dec 25, 2017

Thank you for putting this together.

So the proposal seems to address a couple of problems with a custom scheduler

  1. Better GPU scheduling
     • Support multiple types of GPUs, each with different resources
     • Support NVLink
  2. Gang scheduling

However, an alternative solution would be to extend the existing GPU support to support multiple types of GPUs and to use kubernetes-incubator/kube-arbitrator (see also kubeflow/training-operator#165).

What are the advantages and disadvantages of using a custom scheduler compared to alternative approaches?

/cc @vishh @foxish

@aronchick
Contributor

This is terrific stuff - I share @jlewi's questions. I don't mean to dismiss using a custom scheduler, but I am curious if we can avoid having to write our own.

+1 for supporting other frameworks (eventually). I'm totally ok skipping that in the first version.

@mitake
Contributor

mitake commented Dec 26, 2017

The main advantage of the custom scheduler approach is that we don't have to add scheduling mechanisms and heuristics to the default scheduler. It is good for improving development speed and making maintenance easier.

The main disadvantage is that we would probably need a concurrency-control mechanism between the schedulers. That seems to be a very subtle and difficult problem, and no clean solution is available yet.

Also, in the custom scheduler approach, kube-arbitrator will be used for controlling the TfJob objects of tf_operator. But kube-arbitrator itself doesn't provide a gang-scheduling mechanism. So I think we need the custom scheduler anyway. cc @k82cn

@k82cn

k82cn commented Dec 26, 2017

But kube-arbitrator itself doesn't provide a gang-scheduling mechanism.

kube-arbitrator already supports that.

@k82cn

k82cn commented Dec 26, 2017

And gang scheduling is just one part of kube-arbitrator; we already discussed several points about supporting "batch" workloads (Spark, TF) last year: https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA/edit :).

kube-arbitrator will handle the scheduling part; it needs a TF job controller to work with it, e.g. to send resource requests to it and monitor job status.

@ScorpioCPH
Member Author

@k82cn Thanks! We will take a look at this.

@jlewi
Contributor

jlewi commented Dec 26, 2017

For the GPU scheduling stuff, I'd really like it if @vishh could weigh in since he's far more knowledgeable about accelerators in Kubernetes. Unfortunately, I think he's out of the office until the last week of January. Can we wait until then to finalize plans about GPU scheduling?

@mitake

The main advantage of the custom scheduler approach is that we don't have to add scheduling mechanisms and heuristics to the default scheduler. It is good for improving development speed and making maintenance easier.

Can you explain why we would need to add heuristics to the default scheduler?

My expectation is that Kubernetes scheduling will eventually provide all the necessary scheduling features to support hardware accelerators e.g. GPUs. So I wouldn't expect tensorflow/k8s or Kubeflow to be adding heuristics to handle GPU scheduling.

Similarly, I wouldn't expect tensorflow/k8s or Kubeflow to provide its own solution to gang-scheduling since this seems like a general problem shared by a variety of workloads (e.g. Spark Jobs). So I'd expect kube-arbitrator or a more generic solution to solve this.

@ScorpioCPH
Member Author

@jlewi Hi, yes, there are some existing solutions for our requirements. The GPU device plugin is supported in the v1.8 release, but it just tells us there are N nvidia-gpu allocatable in the cluster without any more details (e.g. type, cores, memory). That may not be enough for ML cases.
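
For comparison, a minimal sketch of what the device-plugin path can express today (values are illustrative): the request below only says "one NVIDIA GPU", with no way to name the model, core count, or memory.

spec:
  containers:
    - name: tensorflow-worker
      image: tf_training_worker_gpu_1
      resources:
        limits:
          nvidia.com/gpu: 1   # a count only; the GPU model cannot be expressed here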

And Resource Class, which is proposed here, would be helpful for our requirements; we have received a lot of feedback about this feature, but it seems like a long-term goal for Kubernetes. So I wrote another proposal here to implement this in another way without changing Kubernetes core code. I think Kubeflow may be a good case to push Resource Class forward.

/cc @vikaschoudhary16

For gang scheduling, thanks so much for bringing up a great discussion in kubeflow/training-operator#165; we can continue that topic and push it forward together :)

@jlewi
Contributor

jlewi commented Dec 27, 2017

The issue title "kubeflow-scheduler" seems much more general than the proposal which focuses on introducing Resource class.

  • Should we narrow the scope of this issue to Resource class?

  • The proposal doesn't seem to require any changes to the TfJob controller

    • Users make use of the new Resource class by specifying schedulerName and requests and limits
    • Is this accurate?
  • So given the above, can we make ResourceClass support optional in a Kubeflow deployment?
    - It would have to be enabled if users wanted to use ResourceClasses with TfJobs

  • It looks like the ResourceClass proposal (Add New Resource API proposal kubernetes/community#782) has been open since July with no clear resolution
    - There's an old comment about trying to pick this up after 1.9 (which was just released).
    - Including optional support in Kubeflow might be a good way to push the issue forward.

@ScorpioCPH @gaocegege Can we treat ResourceClass as an optional experimental feature in a Kubeflow deployment?

@ScorpioCPH
Member Author

@jlewi

Should we narrow the scope of this issue to Resource Class?

LGTM, but beyond Resource Class, @mitake also wants to discuss gang scheduling in more depth.

The proposal doesn't seem to require any changes to the TfJob controller

Yes, this proposal is about scheduling, not about lifecycle management.

It would have to be enabled if users wanted to use Resource Class with TfJobs

Agreed. Resource Class is just like other resources (CPU/memory) in that you can specify the requirement or leave it empty. But without GPUs, the performance of many ML workloads may be very low (for both training and inference) :)

@rohitagarwal003
Contributor

@jiayingz is planning to work on something related to resource classes in H1 2018.

The current way to target nodes with particular accelerator types is to use node selectors and node labels. (See doc from PR kubernetes/website#6736)
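
A hedged sketch of that approach (the accelerator label key and nvidia-tesla-k80 value follow a common convention; any cluster-specific node label applied by the admin would work):

apiVersion: v1
kind: Pod
metadata:
  name: tf-worker-k80
spec:
  containers:
    - name: tensorflow-worker
      image: tf_training_worker_gpu_1
      resources:
        limits:
          nvidia.com/gpu: 1            # device-plugin resource: count only
  nodeSelector:
    accelerator: nvidia-tesla-k80      # node label identifying the GPU type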

@mitake
Contributor

mitake commented Dec 28, 2017

@jlewi

Can you explain why we would need to add heuristics to the default scheduler?

If every heuristic we need could be added to the default scheduler easily, that would of course be nice. But we are facing an interesting problem related to scheduling distributed DL frameworks: slow GPUs. We found that a few GPUs can become slow because of overheating. This is a very environment- and application-specific problem, so I thought adding such heuristics to the default scheduler or other general mechanisms wouldn't be so easy.

Anyway, I'm still trying to understand how the problem should be solved, so I don't have strong opinions about the approach (although our internal solution relies on a custom scheduler and is doing well for now). Of course, it would be great if it could be solved by general mechanisms, e.g. the default scheduler or kube-arbitrator.

@jlewi
Contributor

jlewi commented Dec 28, 2017

@mitake

That's really interesting. Could you detect the overheating and mark the node unhealthy?

@flx42

flx42 commented Dec 28, 2017

No need to mark the whole node as unhealthy: with our device plugin, we could mark individual GPUs as unhealthy if we notice that one GPU is repeatedly throttled. It's possible, but it will require some thought about the strategy we use to detect this case.

@jiayingz

@mitake do you know whether the overheating problem can be detected through some nvml throttling events as @flx42 mentioned? When such problems happened, were those slow GPUs still usable/recoverable, or would you rather restart your jobs on different nodes with available GPUs? Also curious on whether the jobs you mentioned were running on cloud or on prem. In the cloud, I would expect such failure events to be detected at the VM level.

@jiayingz

@ScorpioCPH and @gaocegege could you explain more about why the currently available approaches are not enough for your use cases (e.g., inter-pod affinity to have jobs scheduled on the same machine as much as possible, and node labels with a nodeSelector to schedule jobs on specialized hardware)? We are hoping to make some progress on the resource class design next year and would like to better understand the use cases we should target.
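
For reference, a minimal sketch of the inter-pod affinity approach mentioned above (the tf_job_name label and the preferred weighting are assumptions for illustration, not part of this proposal):

spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                tf_job_name: my-training-job    # assumed label set by the job controller
            topologyKey: kubernetes.io/hostname # prefer co-locating workers on one node
  containers:
    - name: tensorflow-worker
      image: tf_training_worker_gpu_1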

@ScorpioCPH
Member Author

@jiayingz Glad to hear that! Please involve us in the Resource Class work.

Some quick comments about your questions:

  • Node labels and a nodeSelector are a good solution for a homogeneous architecture (each node has only one type of GPU), but not enough for a heterogeneous architecture, where one node has more than one type of GPU.
  • And many cases show that we can get more performance if we connect GPUs together (e.g. via NVLink), so device topology support is another use case.

@jlewi
Contributor

jlewi commented Dec 29, 2017

@ScorpioCPH Do you see a lot of clusters with heterogeneous nodes? I'd expect that most clusters use one type of GPU on each node.

@ScorpioCPH
Member Author

@jlewi Not a lot, but it exists :)

@jlewi
Contributor

jlewi commented Jan 2, 2018

@ScorpioCPH @gaocegege So what are the next steps for this proposal?

@foxish
Contributor

foxish commented Jan 2, 2018

It seems to me like this is a stop-gap solution until kube-arbitrator and resource classes are ready for use. Is that correct?
Maybe the folks here, and @k82cn, @erikerlandson should have a meeting to understand the requirements and use-cases to ensure that we prioritize work on kube-arbitrator appropriately to allow leveraging it here.

@ScorpioCPH
Member Author

@jlewi Hi, @foxish's comment looks good to me :) We can focus on tf-operator for now.

@k82cn

k82cn commented Jan 3, 2018

Maybe the folks here, and @k82cn, @erikerlandson should have a meeting to understand the requirements and use-cases to ensure that we prioritize work on kube-arbitrator appropriately to allow leveraging it here.

+1, I'll try to host a meeting to clarify the requirements :).

@mitake
Contributor

mitake commented Jan 9, 2018

@jlewi @flx42 @jiayingz

Sorry for my late reply.

That's really interesting. Could you detect the overheating and mark the node unhealthy?

The only behaviour we could observe was the slowing down of the problematic GPU, so the reason, overheating, is our guess (it is hard to rule out other root causes that could produce the same result). The GPU is a GeForce, so probably its cooling fan was broken. We could detect it by benchmarking, not by management tools.

@mitake do you know whether the overheating problem can be detected through some nvml throttling events as @flx42 mentioned?

Probably not. We just observed the throttling through performance degradation.

When such problems happened, were those slow GPUs still usable/recoverable, or would you rather restart your jobs on different nodes with available GPUs?

The GPU was still usable, which is worse than it simply stopping, because the training process became slow. If the training process had simply stopped and been restarted, the entire experiment could have finished faster.

Also curious on whether the jobs you mentioned were running on cloud or on prem.

The problem happened in our on-premises cluster.

@YujiOshima can share more details. He is the owner of the cluster.

@jiayingz

jiayingz commented Jan 9, 2018

Thanks a lot for sharing the details, @mitake. I think the use case you described raises an interesting question about whether we want the device plugin to export a management API that can be used to drain or change properties of the underlying devices. Most likely we already have underlying tools to do this, but there is a benefit to having a consistent and portable management model for extended resources. I'm not sure there are strong use cases at the moment. We perhaps want to continue this discussion outside the Kubeflow repo.

@erikerlandson

+1 for a use-case meeting. If possible, keeping scheduling decoupled from Kubeflow seems desirable, given the broad cross-application need for similar scheduling functionality.

@flx42

flx42 commented Jan 9, 2018

The GPU is a GeForce, so probably its cooling fan was broken. We could detect it by benchmarking, not by management tools.

Why? AFAIK NVML or DCGM supports this use case.
For instance:

$ nvidia-smi -q | grep -A8 'Clocks Throttle'
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active

If you look at nvml.h:

/** HW Slowdown (reducing the core clocks by a factor of 2 or more) is engaged
 *
 * This is an indicator of:
 *   - temperature being too high
 *   - External Power Brake Assertion is triggered (e.g. by the system power supply)
 *   - Power draw is too high and Fast Trigger protection is reducing the clocks
 *   - May be also reported during PState or clock change
 *      - This behavior may be removed in a later release.
 *
 * @see nvmlDeviceGetTemperature
 * @see nvmlDeviceGetTemperatureThreshold
 * @see nvmlDeviceGetPowerUsage
 */
#define nvmlClocksThrottleReasonHwSlowdown                0x0000000000000008LL

@YujiOshima
Contributor

YujiOshima commented Jan 10, 2018

Hi @jiayingz @erikerlandson @flx42 !
Thanks @mitake .

Here are more details of our problem.
I trained ResNet on CIFAR-10 as a benchmark.
GPU: GeForce GTX 1080 Ti.
DL Framework: MXNet

The result for a healthy GPU is shown below.

$ kubectl logs jobs/fec7152e290645c3-1-2
INFO:root:Epoch[0] Batch [20]   Speed: 2015.42 samples/sec      accuracy=0.131696
INFO:root:Epoch[0] Batch [40]   Speed: 2025.30 samples/sec      accuracy=0.191016
INFO:root:Epoch[0] Batch [60]   Speed: 2018.21 samples/sec      accuracy=0.230273
INFO:root:Epoch[0] Batch [80]   Speed: 2025.97 samples/sec      accuracy=0.258496
INFO:root:Epoch[0] Train-accuracy=0.297909
INFO:root:Epoch[0] Time cost=25.134
INFO:root:Epoch[0] Validation-accuracy=0.314941
INFO:root:Epoch[1] Batch [20]   Speed: 2016.25 samples/sec      accuracy=0.302920
INFO:root:Epoch[1] Batch [40]   Speed: 2015.39 samples/sec      accuracy=0.334473
INFO:root:Epoch[1] Batch [60]   Speed: 2008.51 samples/sec      accuracy=0.347852
INFO:root:Epoch[1] Batch [80]   Speed: 2001.21 samples/sec      accuracy=0.377051
INFO:root:Epoch[1] Train-accuracy=0.397748
INFO:root:Epoch[1] Time cost=24.828
INFO:root:Epoch[1] Validation-accuracy=0.434570
INFO:root:Epoch[2] Batch [20]   Speed: 1993.33 samples/sec      accuracy=0.410249
INFO:root:Epoch[2] Batch [40]   Speed: 1989.81 samples/sec      accuracy=0.419043
INFO:root:Epoch[2] Batch [60]   Speed: 1988.56 samples/sec      accuracy=0.437500
INFO:root:Epoch[2] Batch [80]   Speed: 1975.50 samples/sec      accuracy=0.463379
INFO:root:Epoch[2] Train-accuracy=0.471680
INFO:root:Epoch[2] Time cost=24.840
INFO:root:Epoch[2] Validation-accuracy=0.481702
INFO:root:Epoch[3] Batch [20]   Speed: 1915.47 samples/sec      accuracy=0.487630
INFO:root:Epoch[3] Batch [40]   Speed: 1943.98 samples/sec      accuracy=0.500098
INFO:root:Epoch[3] Batch [60]   Speed: 1948.77 samples/sec      accuracy=0.518262
INFO:root:Epoch[3] Batch [80]   Speed: 1894.11 samples/sec      accuracy=0.523438
INFO:root:Epoch[3] Train-accuracy=0.540441
INFO:root:Epoch[3] Time cost=25.947
INFO:root:Epoch[3] Validation-accuracy=0.536523
INFO:root:Epoch[4] Batch [20]   Speed: 1898.14 samples/sec      accuracy=0.546875
INFO:root:Epoch[4] Batch [40]   Speed: 1900.62 samples/sec      accuracy=0.556152
INFO:root:Epoch[4] Batch [60]   Speed: 1868.62 samples/sec      accuracy=0.565723
INFO:root:Epoch[4] Batch [80]   Speed: 1865.30 samples/sec      accuracy=0.586328
INFO:root:Epoch[4] Train-accuracy=0.576861
INFO:root:Epoch[4] Time cost=26.507
INFO:root:Epoch[4] Validation-accuracy=0.620580

It keeps a throughput of about 1900 samples/sec to the end.
But some GPUs slow down drastically during training.

INFO:root:Epoch[0] Batch [20]   Speed: 1826.97 samples/sec      accuracy=0.156343
INFO:root:Epoch[0] Batch [40]   Speed: 1833.35 samples/sec      accuracy=0.218457
INFO:root:Epoch[0] Batch [60]   Speed: 1705.14 samples/sec      accuracy=0.267090
INFO:root:Epoch[0] Batch [80]   Speed: 1251.16 samples/sec      accuracy=0.298828
INFO:root:Epoch[0] Train-accuracy=0.333180
INFO:root:Epoch[0] Time cost=37.115
INFO:root:Epoch[0] Validation-accuracy=0.369727
INFO:root:Epoch[1] Batch [20]   Speed: 624.62 samples/sec       accuracy=0.344401
INFO:root:Epoch[1] Batch [40]   Speed: 477.14 samples/sec       accuracy=0.367188
INFO:root:Epoch[1] Batch [60]   Speed: 408.84 samples/sec       accuracy=0.378906
INFO:root:Epoch[1] Batch [80]   Speed: 372.14 samples/sec       accuracy=0.406836
INFO:root:Epoch[1] Train-accuracy=0.434858
INFO:root:Epoch[1] Time cost=115.186
INFO:root:Epoch[1] Validation-accuracy=0.426270
INFO:root:Epoch[2] Batch [20]   Speed: 350.03 samples/sec       accuracy=0.449498
INFO:root:Epoch[2] Batch [40]   Speed: 394.87 samples/sec       accuracy=0.460938
INFO:root:Epoch[2] Batch [60]   Speed: 433.00 samples/sec       accuracy=0.483789
INFO:root:Epoch[2] Batch [80]   Speed: 413.60 samples/sec       accuracy=0.504199
INFO:root:Epoch[2] Train-accuracy=0.510498
INFO:root:Epoch[2] Time cost=119.830
INFO:root:Epoch[2] Validation-accuracy=0.487767
INFO:root:Epoch[3] Batch [20]   Speed: 555.49 samples/sec       accuracy=0.527995
INFO:root:Epoch[3] Batch [40]   Speed: 534.64 samples/sec       accuracy=0.548438
INFO:root:Epoch[3] Batch [60]   Speed: 579.96 samples/sec       accuracy=0.556055
INFO:root:Epoch[3] Batch [80]   Speed: 580.17 samples/sec       accuracy=0.565918
INFO:root:Epoch[3] Train-accuracy=0.580423

The temperature of that GPU looks too high.

$ nvidia-smi -q -d Temperature

==============NVSMI LOG==============

Timestamp                           : Wed Jan 10 10:53:43 2018
Driver Version                      : 384.90

Attached GPUs                       : 4
GPU 00000000:05:00.0
    Temperature
        GPU Current Temp            : 61 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

GPU 00000000:06:00.0
    Temperature
        GPU Current Temp            : 90 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

As @flx42 says, we can see that it is being throttled for thermal reasons.

$ nvidia-smi -q | grep -A8 'Clocks Throttle'
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Active
    FB Memory Usage
        Total                       : 11172 MiB
--

However, the problem can only be detected after overheating has already degraded the performance.
I infer that the cooling fan is the cause, but I don't know the exact reason yet.
I hope to detect such problems in advance and avoid scheduling tasks onto that GPU.
Of course, it is also useful to be able to mark the GPU as broken after overheating.

@flx42

flx42 commented Jan 10, 2018

It seems challenging to detect this issue in advance. Especially for temperature, if the fan is only slightly under-performing (but still enough for triggering a clock throttle), it might take seconds or minutes before this happens.

We could still try to detect the cases where the GPU is obviously not healthy. At startup time, a device plugin could implement a lengthy health check if it wants to: for instance, launch heavy computations for 30 seconds, then check the health and don't advertise the unhealthy GPUs.

@aronchick
Contributor

aronchick commented Jan 10, 2018 via email

@flx42

flx42 commented Jan 10, 2018

What do you mean by signal? With the new DCGM API, you can register a callback that gets called whenever a thermal violation happens. We are going to release Go bindings for DCGM very soon.

@aronchick
Contributor

aronchick commented Jan 10, 2018 via email

@flx42

flx42 commented Jan 10, 2018

This still seems to be within the realm of the device plugin. IIRC, pods are not evicted today if a device becomes unhealthy; the device is just excluded from scheduling for future jobs. Is that correct, @jiayingz @RenaudWasTaken?

@jiayingz

@flx42 What you described is the current behavior. However, some external tools can take a more aggressive response, like tearing down the affected pod or even draining the node.

@YujiOshima I guess one possible mitigation is to have a livenessProbe in the training worker pod that watches for thermally caused throttling?
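
A rough sketch of that mitigation, assuming nvidia-smi is available inside the worker container (the grep pattern, period, and failure threshold are illustrative only):

# Fail the probe when the GPU reports thermal throttling, so the kubelet
# restarts the worker and it can be rescheduled elsewhere.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - "! nvidia-smi -q | grep -E 'Thermal Slowdown *: Active'"
  periodSeconds: 60
  failureThreshold: 3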

@jlewi
Contributor

jlewi commented Mar 26, 2018

@ScorpioCPH @gaocegege

Should we close this issue? We have issues for gang scheduling and kube-arbitrator:
kubeflow/training-operator#349

Can we close this issue in favor of those?

@jlewi
Contributor

jlewi commented Sep 4, 2018

@ScorpioCPH @gaocegege Thoughts about closing this issue?

@gaocegege
Member

SGTM

@jlewi jlewi closed this as completed Oct 8, 2018