Proposal: kubeflow-scheduler #68

Closed
ScorpioCPH opened this issue Dec 24, 2017 · 45 comments

Comments

@ScorpioCPH
Member

ScorpioCPH commented Dec 24, 2017

Kubeflow Scheduler

Status: Draft
Version: Alpha
Implementation Owner: TBD

Authors:

Motivation

Kubeflow has a controller (or operator) that makes it easy to create ML jobs, which are defined by a CRD. When we create new ML jobs, kube-scheduler reacts by scheduling them onto nodes that satisfy their resource requests.

This works well for most workloads, but it is not good enough for ML workloads.
Instead, we need a more advanced scheduler that places ML workloads more efficiently.
We can achieve this goal with a custom scheduler.

Use Cases

  • For distributed ML workloads, communication bandwidth can become the bottleneck, so we want the Pods of a job to land on the same machine as much as possible.
  • ML workloads can be accelerated by hardware accelerators (e.g. GPUs), so we want to ensure these workloads are only scheduled on nodes with the specialized hardware.
  • GPUs come in various models with different properties (e.g. cores, memory), so we want to ensure that our workloads get enough resources.
    • Case A: our training jobs require more than 8 GB of GPU memory because we have a large model.
    • Case B: we want some serving jobs to use NVIDIA Tesla K80 GPUs for better inference performance.
  • GPUs can be connected together (e.g. via NVLink) for much higher performance, so hardware topology needs to be considered.

API

Kubernetes has a proposal for Resource Class, which provides a richer resource representation. We can use this feature for resource management.

In the TFJob spec, we can tell kubeflow-scheduler which resources we request by adding a Resource Class to the spec, just like a CPU/memory request.

For example, suppose we want a TF worker to use 2 NVIDIA Tesla K80 GPUs for training:

spec:
  containers:
    - image: tf_training_worker_gpu_1
      name: tensorflow_worker_gpu_1
      resources:
        requests:
          nvidia.com/gpu.tesla.k80: 2
        limits:
          nvidia.com/gpu.tesla.k80: 2

Design

Resource Class

The details of Resource Class are still TBD, as it is under discussion.
But we can use a CRD and label selectors to implement a simple version of it (supporting only homogeneous architectures for now), as proposed here.

Scheduler

Kubeflow-scheduler should work well alongside the default scheduler (kube-scheduler). It will watch the Pods created by kubeflow-controller and evaluate whether a Node satisfies a Pod's requirements using predicate functions; a minimal sketch of how a Pod opts into the custom scheduler is shown after the list below.

These predicate functions may include:

  • Is the Pod CPU-sensitive (e.g. a TensorFlow PS) or GPU-sensitive (e.g. a TensorFlow worker)?
  • Does the Pod belong to a group of Pods from the same job, e.g. workers of the same TensorFlow training job?
  • Is the Pod network-sensitive, requesting InfiniBand for high throughput and low latency?
  • Is the Pod storage-sensitive, requesting SSD hardware for high I/O performance?
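
A minimal sketch of that opt-in, assuming the custom scheduler is registered under the name kubeflow-scheduler and that the controller labels Pods with tf_job_name (both names are assumptions for illustration, not defined by this proposal):

# Hypothetical worker Pod: only spec.schedulerName changes which scheduler owns it.
apiVersion: v1
kind: Pod
metadata:
  name: tf-worker-0
  labels:
    tf_job_name: my-training-job      # assumed label; lets predicates group Pods of one job
spec:
  schedulerName: kubeflow-scheduler   # Pods without this field stay with kube-scheduler
  containers:
    - name: tensorflow-worker
      image: tf_training_worker_gpu_1
      resources:
        requests:
          nvidia.com/gpu.tesla.k80: 2
        limits:
          nvidia.com/gpu.tesla.k80: 2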

Where to Start

@ScorpioCPH
Member Author

We can discuss more details about this proposal in the Google doc.

@gaocegege
Member

gaocegege commented Dec 24, 2017

Welcome comments :-)

/cc @DjangoPeng @mitake

@DjangoPeng
Member

@ScorpioCPH @gaocegege Thanks for your work on kubeflow-scheduler.

As we all know, Kubeflow is dedicated to supporting multiple ML frameworks, but these frameworks have somewhat different specifications for resources, clusters, and so on.

So, do we plan to make one kubeflow-scheduler for all ML frameworks, or one per framework?

@mitake
Contributor

mitake commented Dec 25, 2017

@ScorpioCPH thanks for creating the doc. Can I add motivations and ideas of the gang scheduler to the doc directly?

@ScorpioCPH
Member Author

@mitake Sure, please feel free to edit this doc.

@ScorpioCPH
Member Author

@DjangoPeng It depends on the requirements of the different ML frameworks; maybe we can start with TensorFlow.

@gaocegege
Member

@DjangoPeng Yeah, I think one scheduler for all ML frameworks would be awesome, but I am not sure we could implement such a generic one. Let's start with TensorFlow, since Kubeflow starts from it too.

@DjangoPeng
Member

SGTM. @ScorpioCPH @gaocegege

@jlewi
Contributor

jlewi commented Dec 25, 2017

Thank you for putting this together.

So the proposal seems to address a couple of problems with a custom scheduler

  1. Better GPU scheduling
     • Support multiple types of GPUs, each with different resources
     • Support NVLink
  2. Gang scheduling

However, an alternative solution would be to extend the existing GPU support to support multiple types of GPUs and to use kubernetes-incubator/kube-arbitrator (see also kubeflow/training-operator#165).

What are the advantages and disadvantages of using a custom scheduler compared to alternative approaches?

/cc @vishh @foxish

@aronchick
Contributor

This is terrific stuff - I share @jlewi's questions. I don't mean to dismiss using a custom scheduler, but I am curious if we can avoid having to write our own.

+1 for supporting other frameworks (eventually). I'm totally ok skipping that in the first version.

@mitake
Contributor

mitake commented Dec 26, 2017

The main advantage of the custom scheduler approach is that we don't have to add scheduling mechanisms and heuristics to the default scheduler. It is good for improving development speed and making maintenance easier.

The main disadvantage is that we would probably need a concurrency-control mechanism between the schedulers. That seems to be a very subtle and difficult problem, and no clean solution is available yet.

Also, in the custom scheduler approach, kube-arbitrator will be used for controlling the TfJob objects of tf_operator. But kube-arbitrator itself doesn't provide a gang-scheduling mechanism. So I think we need the custom scheduler anyway. cc @k82cn

@k82cn

k82cn commented Dec 26, 2017

But kube-arbitrator itself doesn't provide a gang-scheduling mechanism.

kube-arbitrator already supports that.

@k82cn

k82cn commented Dec 26, 2017

And gang scheduling is just one part of kube-arbitrator; we already discussed several points about supporting "batch" workloads (Spark, TF) last year: https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA/edit :).

kube-arbitrator will handle the scheduling part; it needs a TF job controller to work with it, e.g. to send resource requests to it and monitor job status.

@ScorpioCPH
Member Author

@k82cn Thanks! We will take a look at this.

@jlewi
Contributor

jlewi commented Dec 26, 2017

For the GPU scheduling stuff, I'd really like it if @vishh could weigh in since he's far more knowledgeable about accelerators in Kubernetes. Unfortunately, I think he's out of the office until the last week of January. Can we wait until then to finalize plans about GPU scheduling?

@mitake

The main advantage of the custom scheduler approach is that we don't have to add scheduling mechanisms and heuristics to the default scheduler. It is good for improving development speed and making maintenance easier.

Can you explain why we would need to add heuristics to the default scheduler?

My expectation is that Kubernetes scheduling will eventually provide all the necessary scheduling features to support hardware accelerators e.g. GPUs. So I wouldn't expect tensorflow/k8s or Kubeflow to be adding heuristics to handle GPU scheduling.

Similarly, I wouldn't expect tensorflow/k8s or Kubeflow to provide its own solution to gang-scheduling since this seems like a general problem shared by a variety of workloads (e.g. Spark Jobs). So I'd expect kube-arbitrator or a more generic solution to solve this.

@ScorpioCPH
Member Author

@jlewi Hi, yes, there are some existing solutions for our requirements. The GPU device plugin is supported in the v1.8 release, but it just tells us there are N nvidia-gpu allocatable in the cluster without any more details (e.g. type, cores, memory). That may not be enough for ML cases.
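
For comparison, a minimal sketch of what the device-plugin path can express today (values are illustrative): the request below only says "one NVIDIA GPU", with no way to name the model, core count, or memory.

spec:
  containers:
    - name: tensorflow-worker
      image: tf_training_worker_gpu_1
      resources:
        limits:
          nvidia.com/gpu: 1   # a count only; the GPU model cannot be expressed here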

And Resource Class, which is proposed here, would be helpful for our requirements; we have received a lot of feedback about this feature, but it seems like a long-term goal for Kubernetes. So I wrote another proposal here to implement this in another way without changing Kubernetes core code. I think Kubeflow may be a good case to push Resource Class forward.

/cc @vikaschoudhary16

For gang scheduling, thanks so much for bringing up a great discussion in kubeflow/training-operator#165; we can continue that topic and push it forward together :)

@jlewi
Contributor

jlewi commented Dec 27, 2017

The issue title "kubeflow-scheduler" seems much more general than the proposal which focuses on introducing Resource class.

  • Should we narrow the scope of this issue to Resource class?

  • The proposal doesn't seem to require any changes to the TfJob controller

    • Users make use of the new Resource class by specifying schedulerName and requests and limits
    • Is this accurate?
  • So given the above, can we make ResourceClass support optional in a Kubeflow deployment?
    - It would have to be enabled if users wanted to use ResourceClasses with TfJobs

  • It looks like the ResourceClass proposal (Add New Resource API proposal kubernetes/community#782) has been open since July with no clear resolution
    - There's an old comment about trying to pick this up after 1.9 (which was just released).
    - Including optional support in Kubeflow might be a good way to push the issue forward.

@ScorpioCPH @gaocegege Can we treat ResourceClass as an optional experimental feature in a Kubeflow deployment?

@ScorpioCPH
Member Author

@jlewi

Should we narrow the scope of this issue to Resource Class?

LGTM, but beyond Resource Class, @mitake also wants to discuss gang scheduling in more depth.

The proposal doesn't seem to require any changes to the TfJob controller

Yes, this proposal is about scheduling, not about lifecycle management.

It would have to be enabled if users wanted to use Resource Class with TfJobs

Agreed. Resource Class is just like other resources (CPU/memory) in that you can specify the requirement or leave it empty. But without GPUs, the performance of many ML workloads may be very low (for both training and inference) :)

@rohitagarwal003
Contributor

@jiayingz is planning to work on something related to resource classes in H1 2018.

The current way to target nodes with particular accelerator types is to use node selectors and node labels. (See doc from PR kubernetes/website#6736)
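
A hedged sketch of that approach (the accelerator label key and nvidia-tesla-k80 value follow a common convention; any cluster-specific node label applied by the admin would work):

apiVersion: v1
kind: Pod
metadata:
  name: tf-worker-k80
spec:
  containers:
    - name: tensorflow-worker
      image: tf_training_worker_gpu_1
      resources:
        limits:
          nvidia.com/gpu: 1            # device-plugin resource: count only
  nodeSelector:
    accelerator: nvidia-tesla-k80      # node label identifying the GPU type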

@mitake
Contributor

mitake commented Dec 28, 2017

@jlewi

Can you explain why we would need to add heuristics to the default scheduler?

If every heuristic we need could be added to the default scheduler easily, that would of course be nice. But we are facing an interesting problem related to scheduling distributed DL frameworks: slow GPUs. We found that a few GPUs can become slow because of overheating. This is a very environment- and application-specific problem, so I thought adding such heuristics to the default scheduler or other general mechanisms wouldn't be so easy.

Anyway, I'm still trying to understand how the problem should be solved, so I don't have strong opinions about the approach (although our internal solution relies on a custom scheduler and is doing well for now). Of course, it would be great if it could be solved by general mechanisms, e.g. the default scheduler or kube-arbitrator.

@jlewi
Contributor

jlewi commented Dec 28, 2017

@mitake

That's really interesting. Could you detect the overheating and mark the node unhealthy?

@flx42

flx42 commented Dec 28, 2017

No need to mark the whole node as unhealthy: with our device plugin, we could mark individual GPUs as unhealthy if we notice that one GPU is repeatedly throttled. It's possible, but it will require some thought about the strategy we use to detect this case.

@jiayingz

@mitake do you know whether the overheating problem can be detected through some nvml throttling events as @flx42 mentioned? When such problems happened, were those slow GPUs still usable/recoverable, or would you rather restart your jobs on different nodes with available GPUs? Also curious on whether the jobs you mentioned were running on cloud or on prem. In the cloud, I would expect such failure events to be detected at the VM level.

@jiayingz

@ScorpioCPH and @gaocegege could you explain more about why the currently available approaches are not enough for your use cases (e.g., inter-pod affinity to have jobs scheduled on the same machine as much as possible, and node labels with a nodeSelector to schedule jobs on specialized hardware)? We are hoping to make some progress on the resource class design next year and would like to better understand the use cases we should target.
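
For reference, a minimal sketch of the inter-pod affinity approach mentioned above (the tf_job_name label and the preferred weighting are assumptions for illustration, not part of this proposal):

spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                tf_job_name: my-training-job    # assumed label set by the job controller
            topologyKey: kubernetes.io/hostname # prefer co-locating workers on one node
  containers:
    - name: tensorflow-worker
      image: tf_training_worker_gpu_1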

@ScorpioCPH
Member Author

@jiayingz Glad to hear that! Please involve us in the Resource Class work.

Some quick comments about your questions:

  • Node labels and a nodeSelector are a good solution for a homogeneous architecture (each node has only one type of GPU), but not enough for a heterogeneous architecture, where one node has more than one type of GPU.
  • And many cases show that we can get more performance if we connect GPUs together (e.g. via NVLink), so device topology support is another use case.

@jlewi
Contributor

jlewi commented Dec 29, 2017

@ScorpioCPH Do you see a lot of clusters with heterogeneous nodes? I'd expect that most clusters use one type of GPU on each node.

@ScorpioCPH
Member Author

@jlewi Not a lot, but it exists :)

@jlewi
Contributor

jlewi commented Jan 2, 2018

@ScorpioCPH @gaocegege So what are the next steps for this proposal?

@foxish
Contributor

foxish commented Jan 2, 2018

It seems to me like this is a stop-gap solution until kube-arbitrator and resource classes are ready for use. Is that correct?
Maybe the folks here, and @k82cn, @erikerlandson should have a meeting to understand the requirements and use-cases to ensure that we prioritize work on kube-arbitrator appropriately to allow leveraging it here.

@ScorpioCPH
Member Author

@jlewi Hi, @foxish's comment looks good to me :) We can focus on tf-operator for now.

@k82cn

k82cn commented Jan 3, 2018

Maybe the folks here, and @k82cn, @erikerlandson should have a meeting to understand the requirements and use-cases to ensure that we prioritize work on kube-arbitrator appropriately to allow leveraging it here.

+1, I'll try to host a meeting to clarify the requirements :).

@mitake
Contributor

mitake commented Jan 9, 2018

@jlewi @flx42 @jiayingz

Sorry for my late reply.

That's really interesting. Could you detect the overheating and mark the node unhealthy?

The only behaviour we could observe was the slowing down of the problematic GPU, so the reason, overheating, is our guess (it is hard to rule out other root causes that could produce the same result). The GPU is a GeForce, so probably its cooling fan was broken. We could detect it by benchmarking, not by management tools.

@mitake do you know whether the overheating problem can be detected through some nvml throttling events as @flx42 mentioned?

Probably not. We just observed the throttling through performance degradation.

When such problems happened, were those slow GPUs still usable/recoverable, or would you rather restart your jobs on different nodes with available GPUs?

The GPU was still usable, which is worse than it simply stopping, because the training process became slow. If the training process had simply stopped and been restarted, the entire experiment could have finished faster.

Also curious on whether the jobs you mentioned were running on cloud or on prem.

The problem happened in our on-premises cluster.

@YujiOshima can share more details. He is the owner of the cluster.

@jiayingz

jiayingz commented Jan 9, 2018

Thanks a lot for sharing the details, @mitake. I think the use case you described raises an interesting question about whether we want the device plugin to export a management API that can be used to drain or change properties of the underlying devices. Most likely we already have underlying tools to do this, but there is a benefit to having a consistent and portable management model for extended resources. I'm not sure there are strong use cases at the moment. We perhaps want to continue this discussion outside the Kubeflow repo.

@erikerlandson

+1 for a use-case meeting. If possible, keeping scheduling decoupled from Kubeflow seems desirable, given the broad cross-application need for similar scheduling functionality.

@flx42

flx42 commented Jan 9, 2018

The GPU is a GeForce, so probably its cooling fan was broken. We could detect it by benchmarking, not by management tools.

Why? AFAIK NVML or DCGM supports this use case.
For instance:

$ nvidia-smi -q | grep -A8 'Clocks Throttle'
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active

If you look at nvml.h:

/** HW Slowdown (reducing the core clocks by a factor of 2 or more) is engaged
 *
 * This is an indicator of:
 *   - temperature being too high
 *   - External Power Brake Assertion is triggered (e.g. by the system power supply)
 *   - Power draw is too high and Fast Trigger protection is reducing the clocks
 *   - May be also reported during PState or clock change
 *      - This behavior may be removed in a later release.
 *
 * @see nvmlDeviceGetTemperature
 * @see nvmlDeviceGetTemperatureThreshold
 * @see nvmlDeviceGetPowerUsage
 */
#define nvmlClocksThrottleReasonHwSlowdown                0x0000000000000008LL

@YujiOshima
Contributor

YujiOshima commented Jan 10, 2018

Hi @jiayingz @erikerlandson @flx42 !
Thanks @mitake .

Here are more details of our problem.
I trained ResNet on CIFAR-10 as a benchmark.
GPU: GeForce GTX 1080 Ti.
DL Framework: MXNet

The result for a healthy GPU is shown below.

$ kubectl logs jobs/fec7152e290645c3-1-2
INFO:root:Epoch[0] Batch [20]   Speed: 2015.42 samples/sec      accuracy=0.131696
INFO:root:Epoch[0] Batch [40]   Speed: 2025.30 samples/sec      accuracy=0.191016
INFO:root:Epoch[0] Batch [60]   Speed: 2018.21 samples/sec      accuracy=0.230273
INFO:root:Epoch[0] Batch [80]   Speed: 2025.97 samples/sec      accuracy=0.258496
INFO:root:Epoch[0] Train-accuracy=0.297909
INFO:root:Epoch[0] Time cost=25.134
INFO:root:Epoch[0] Validation-accuracy=0.314941
INFO:root:Epoch[1] Batch [20]   Speed: 2016.25 samples/sec      accuracy=0.302920
INFO:root:Epoch[1] Batch [40]   Speed: 2015.39 samples/sec      accuracy=0.334473
INFO:root:Epoch[1] Batch [60]   Speed: 2008.51 samples/sec      accuracy=0.347852
INFO:root:Epoch[1] Batch [80]   Speed: 2001.21 samples/sec      accuracy=0.377051
INFO:root:Epoch[1] Train-accuracy=0.397748
INFO:root:Epoch[1] Time cost=24.828
INFO:root:Epoch[1] Validation-accuracy=0.434570
INFO:root:Epoch[2] Batch [20]   Speed: 1993.33 samples/sec      accuracy=0.410249
INFO:root:Epoch[2] Batch [40]   Speed: 1989.81 samples/sec      accuracy=0.419043
INFO:root:Epoch[2] Batch [60]   Speed: 1988.56 samples/sec      accuracy=0.437500
INFO:root:Epoch[2] Batch [80]   Speed: 1975.50 samples/sec      accuracy=0.463379
INFO:root:Epoch[2] Train-accuracy=0.471680
INFO:root:Epoch[2] Time cost=24.840
INFO:root:Epoch[2] Validation-accuracy=0.481702
INFO:root:Epoch[3] Batch [20]   Speed: 1915.47 samples/sec      accuracy=0.487630
INFO:root:Epoch[3] Batch [40]   Speed: 1943.98 samples/sec      accuracy=0.500098
INFO:root:Epoch[3] Batch [60]   Speed: 1948.77 samples/sec      accuracy=0.518262
INFO:root:Epoch[3] Batch [80]   Speed: 1894.11 samples/sec      accuracy=0.523438
INFO:root:Epoch[3] Train-accuracy=0.540441
INFO:root:Epoch[3] Time cost=25.947
INFO:root:Epoch[3] Validation-accuracy=0.536523
INFO:root:Epoch[4] Batch [20]   Speed: 1898.14 samples/sec      accuracy=0.546875
INFO:root:Epoch[4] Batch [40]   Speed: 1900.62 samples/sec      accuracy=0.556152
INFO:root:Epoch[4] Batch [60]   Speed: 1868.62 samples/sec      accuracy=0.565723
INFO:root:Epoch[4] Batch [80]   Speed: 1865.30 samples/sec      accuracy=0.586328
INFO:root:Epoch[4] Train-accuracy=0.576861
INFO:root:Epoch[4] Time cost=26.507
INFO:root:Epoch[4] Validation-accuracy=0.620580

It keeps a throughput of about 1900 samples/sec to the end.
But some GPUs slow down drastically during training.

INFO:root:Epoch[0] Batch [20]   Speed: 1826.97 samples/sec      accuracy=0.156343
INFO:root:Epoch[0] Batch [40]   Speed: 1833.35 samples/sec      accuracy=0.218457
INFO:root:Epoch[0] Batch [60]   Speed: 1705.14 samples/sec      accuracy=0.267090
INFO:root:Epoch[0] Batch [80]   Speed: 1251.16 samples/sec      accuracy=0.298828
INFO:root:Epoch[0] Train-accuracy=0.333180
INFO:root:Epoch[0] Time cost=37.115
INFO:root:Epoch[0] Validation-accuracy=0.369727
INFO:root:Epoch[1] Batch [20]   Speed: 624.62 samples/sec       accuracy=0.344401
INFO:root:Epoch[1] Batch [40]   Speed: 477.14 samples/sec       accuracy=0.367188
INFO:root:Epoch[1] Batch [60]   Speed: 408.84 samples/sec       accuracy=0.378906
INFO:root:Epoch[1] Batch [80]   Speed: 372.14 samples/sec       accuracy=0.406836
INFO:root:Epoch[1] Train-accuracy=0.434858
INFO:root:Epoch[1] Time cost=115.186
INFO:root:Epoch[1] Validation-accuracy=0.426270
INFO:root:Epoch[2] Batch [20]   Speed: 350.03 samples/sec       accuracy=0.449498
INFO:root:Epoch[2] Batch [40]   Speed: 394.87 samples/sec       accuracy=0.460938
INFO:root:Epoch[2] Batch [60]   Speed: 433.00 samples/sec       accuracy=0.483789
INFO:root:Epoch[2] Batch [80]   Speed: 413.60 samples/sec       accuracy=0.504199
INFO:root:Epoch[2] Train-accuracy=0.510498
INFO:root:Epoch[2] Time cost=119.830
INFO:root:Epoch[2] Validation-accuracy=0.487767
INFO:root:Epoch[3] Batch [20]   Speed: 555.49 samples/sec       accuracy=0.527995
INFO:root:Epoch[3] Batch [40]   Speed: 534.64 samples/sec       accuracy=0.548438
INFO:root:Epoch[3] Batch [60]   Speed: 579.96 samples/sec       accuracy=0.556055
INFO:root:Epoch[3] Batch [80]   Speed: 580.17 samples/sec       accuracy=0.565918
INFO:root:Epoch[3] Train-accuracy=0.580423

The temperature of that GPU looks too high.

$ nvidia-smi -q -d Temperature

==============NVSMI LOG==============

Timestamp                           : Wed Jan 10 10:53:43 2018
Driver Version                      : 384.90

Attached GPUs                       : 4
GPU 00000000:05:00.0
    Temperature
        GPU Current Temp            : 61 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

GPU 00000000:06:00.0
    Temperature
        GPU Current Temp            : 90 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

As @flx42 says, we can see that it is being throttled for thermal reasons.

$ nvidia-smi -q | grep -A8 'Clocks Throttle'
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Active
    FB Memory Usage
        Total                       : 11172 MiB
--

However, the problem can only be detected after overheating has already degraded the performance.
I infer that the cooling fan is the cause, but I don't know the exact reason yet.
I hope to detect such problems in advance and avoid scheduling tasks onto that GPU.
Of course, it is also useful to be able to mark the GPU as broken after overheating.

@flx42

flx42 commented Jan 10, 2018

It seems challenging to detect this issue in advance. Especially for temperature, if the fan is only slightly under-performing (but still enough for triggering a clock throttle), it might take seconds or minutes before this happens.

We could still try to detect the cases where the GPU is obviously not healthy. At startup time, a device plugin could implement a lengthy health check if it wants to: for instance, launch heavy computations for 30 seconds, then check the health and don't advertise the unhealthy GPUs.

@aronchick
Contributor

aronchick commented Jan 10, 2018 via email

@flx42

flx42 commented Jan 10, 2018

What do you mean by signal? With the new DCGM API, you can register a callback that gets called whenever a thermal violation happens. We are going to release Go bindings for DCGM very soon.

@aronchick
Contributor

aronchick commented Jan 10, 2018 via email

@flx42

flx42 commented Jan 10, 2018

This still seems to be within the realm of the device plugin. IIRC, pods are not evicted today if a device becomes unhealthy; the device is just excluded from scheduling for future jobs. Is that correct, @jiayingz @RenaudWasTaken?

@jiayingz

@flx42 What you described is the current behavior. However, some external tools can take a more aggressive response, like tearing down the affected pod or even draining the node.

@YujiOshima I guess one possible mitigation is to have a livenessProbe in the training worker pod that watches for thermally caused throttling?
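
A rough sketch of that mitigation, assuming nvidia-smi is available inside the worker container (the grep pattern, period, and failure threshold are illustrative only):

# Fail the probe when the GPU reports thermal throttling, so the kubelet
# restarts the worker and it can be rescheduled elsewhere.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - "! nvidia-smi -q | grep -E 'Thermal Slowdown *: Active'"
  periodSeconds: 60
  failureThreshold: 3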

@jlewi
Contributor

jlewi commented Mar 26, 2018

@ScorpioCPH @gaocegege

Should we close this issue? We have issues for gang scheduling and kube-arbitrator:
kubeflow/training-operator#349

Can we close this issue in favor of those?

@jlewi
Contributor

jlewi commented Sep 4, 2018

@ScorpioCPH @gaocegege Thoughts about closing this issue?

@gaocegege
Member

SGTM

@jlewi jlewi closed this as completed Oct 8, 2018