
Add CFS design #468

Merged · 3 commits · Nov 20, 2017
Conversation

typhoonzero (Collaborator):

No description provided.

- `ReplicaSet` of `pserver` process
- `Job` of `trainer` process
- Queue to sort `TrainingJob` resource for schedule
- Scheduler to determine which job to run or to scale by:
Collaborator:

What exactly do these four job priorities mean?

Collaborator (author):

What is described below is not four priorities; the scheduler uses these values to decide each job's desired running state, i.e., the job's score.

a job has been waiting long enough, it can finally be scheduled onto the cluster,
no matter how low its priority is (unless the cluster is completely occupied by
production services).
1. A cluster may run both online services and offline batch jobs. The online
gongweibao (Collaborator), Nov 7, 2017:

My understanding: we could divide jobs by their nature into several tiers:

  • online
  • offline
  • team experiment
  • personal experiment

This way users may get a more intuitive understanding of job priority, and to some extent it avoids the current situation where everyone raises the priority of their own jobs.

A job in a higher tier always has higher priority than one in a lower tier; finer-grained priority ordering only applies within the same tier.

When submitting a job, users could provide both parameters: a job nature and a job priority.

Collaborator (author):

Priority levels are introduced in the Interface section below.

consumption will be considered.

The scheduler stores all nodes in a red-black tree, sorted by the score
`sum(Prio() * ResourceScore() * Running() * running time)`
Collaborator:

This formula appears rather abruptly. It's not clear how well this ordering works; is there an explanation?
Also, how is the running time estimated?
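As one possible reading of the formula, here is a minimal Go sketch; every accessor and value below is an illustrative assumption, not part of the design:

```go
package main

import "fmt"

// unit is a hypothetical scheduling unit; its fields mirror the accessors
// named in the score formula (Prio, ResourceScore, Running, running time).
type unit struct {
	prio          int64   // priority level, e.g. 10/100/1000/10000
	resourceScore float64 // weight of the requested resources (GPU > CPU)
	running       int64   // number of currently running pods
	runningTime   float64 // accumulated running time, in seconds
}

// score computes Prio() * ResourceScore() * Running() * running time for
// one unit; the scheduler would sum this over a job's units and keep the
// jobs sorted by the result in the red-black tree.
func (u unit) score() float64 {
	return float64(u.prio) * u.resourceScore * float64(u.running) * u.runningTime
}

func main() {
	u := unit{prio: 100, resourceScore: 1.5, running: 4, runningTime: 3600}
	fmt.Println(u.score()) // 2.16e+06
}
```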


## References

https://en.wikipedia.org/wiki/Completely_Fair_Scheduler
Collaborator:

Indentation issue.

Collaborator (author):

Comments all done.


### Interface

The scheduler deals with an atomic scheduling unit named `Node`. The `TrainingJob`
Reviewer:

Would this be confused with a Kubernetes Node, i.e., a physical machine?

Collaborator (author):

Yep. Will change the naming.

Collaborator (author):

Done.


## Background

We are going to define a PaddlePaddle cluster job as a Kubernetes [TPR]() or

Collaborator (author):

Done.

services have high priority and are not interruptible. But training jobs can
re-use the cluster resources when the online services reach a time of day
when they are less active.
1. As for quota, each user's quota should be considered so that a scheduled job is not
Yancey1989 (Collaborator), Nov 8, 2017:

I think pserver and etcd should have higher priority than trainers, and because

> jobs require GPU resource should have higher priority to run on GPU machines than CPU only jobs

pserver and etcd will be assigned to CPU nodes with higher priority.

Collaborator (author):

I thought pserver, etcd, master, and trainers in a single TrainingJob should have the same priority.


Cases that need to be considered during the implementation:

1. GPUs are much more expensive than CPUs; jobs requiring GPU resources should
Collaborator:

laugch -> launch

Running() int64

// Obj returns inner scheduling unit.
Obj() *interface{}
Collaborator:

Maybe `Obj() interface{}`; an interface is itself a "pointer".

}
```

Currently we only support 4 levels of priority. Note that the priority is not
helinwang (Collaborator), Nov 9, 2017:

Maybe

const (
  Experiment PrioLevel = 10
  Offline = 100
  Normal = 1000
  Production = 10000
)

could be more extensible?
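As a compilable sketch of this suggestion (the `PrioLevel` type name is taken from the comment above; the values are only illustrative):

```go
// PrioLevel is the priority level of a TrainingJob; a larger value means
// a higher priority. Spacing the levels by a factor of ten leaves room
// to insert new levels later without renumbering the existing ones.
type PrioLevel int64

const (
	Experiment PrioLevel = 10
	Offline    PrioLevel = 100
	Normal     PrioLevel = 1000
	Production PrioLevel = 10000
)
```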

Collaborator (author):

All done.

then the job result can be updated to the production service. Other jobs like
experiments and one-shot jobs have lower priority; they can be scaled up when
the cluster is free and scaled down when the cluster is busy.
1. Otherwise, jobs should share the cluster resources fairly, which means, if
Collaborator:

Topic for discussion: how "fair" do we want to be? If we are perfectly fair, every job's trainer count will be constantly in flux, but cold-starting a trainer has a cost.

Maybe we can have a "freezing window": we only do non-urgent scaling when entering the next window.

Collaborator (author):

Reasonable. The idea of a "freezing window" is awesome; I will add it to the design.
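A minimal sketch of what such a freezing window might look like (the window length and the urgent bypass are assumptions, not settled design):

```go
package scheduler

import "time"

// ScalingGate suppresses non-urgent scaling until the current freezing
// window has elapsed, so trainers are not cold-started on every pass.
type ScalingGate struct {
	Window    time.Duration // length of one freezing window, e.g. 5 * time.Minute
	lastScale time.Time     // when the last scaling round ran
}

// Allow reports whether a scaling operation may run now. Urgent operations
// (e.g. freeing resources for a production job) bypass the window; all
// others wait until the next window boundary.
func (g *ScalingGate) Allow(urgent bool, now time.Time) bool {
	if urgent || now.Sub(g.lastScale) >= g.Window {
		g.lastScale = now
		return true
	}
	return false
}
```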


- Parser to parse `TrainingJob` resource to corresponding job components,
including:
- `ReplicaSet` of master process
Collaborator:

Using ReplicaSet / StatefulSet / Job means we will depend on Kubernetes' scheduler for scheduling Pods. Since we are creating our own scheduler, should we rely on Kubernetes' scheduler or not? What are the pros and cons of each case?

Collaborator (author):

Sorry for the late reply.
Yes, you are right. Not using the default k8s scheduler would give us more control over `TrainingJob`s (see here). The scheduler is in charge of placing pods on nodes.

Pros:

  • We can add weight to resource types like GPU when scheduling pods.
  • Scaling and scheduling can live in the same process.
  • New resource types, like FPGA, can be taken care of.

Cons:

  • The core function of scheduling pods onto nodes would be the same for the `TrainingJob` scheduler. The resource request per node won't change; we only change the number of pods to run, which is already done by the autoscaler.
  • It is hard to implement; we would have to implement leader election for high availability.

I think using the default scheduler for the pod-to-node placement is enough for now; we only need to queue `TrainingJob`s by priority and hand them to k8s.
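To illustrate the queue-by-priority part, here is a minimal sketch using Go's container/heap; the `TrainingJob` fields are hypothetical placeholders:

```go
package main

import (
	"container/heap"
	"fmt"
)

// TrainingJob is a stand-in for the TPR object; only the fields needed
// for ordering are sketched here.
type TrainingJob struct {
	Name string
	Prio int64 // higher value runs first
}

// jobQueue implements heap.Interface, ordering jobs by priority.
type jobQueue []*TrainingJob

func (q jobQueue) Len() int            { return len(q) }
func (q jobQueue) Less(i, j int) bool  { return q[i].Prio > q[j].Prio }
func (q jobQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *jobQueue) Push(x interface{}) { *q = append(*q, x.(*TrainingJob)) }
func (q *jobQueue) Pop() interface{} {
	old := *q
	n := len(old)
	job := old[n-1]
	*q = old[:n-1]
	return job
}

func main() {
	q := &jobQueue{}
	heap.Push(q, &TrainingJob{Name: "experiment-1", Prio: 10})
	heap.Push(q, &TrainingJob{Name: "production-1", Prio: 10000})
	// production-1 is handed to Kubernetes first.
	fmt.Println(heap.Pop(q).(*TrainingJob).Name)
}
```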

helinwang (Collaborator), Nov 14, 2017:

@typhoonzero I see, thanks! Agreed that using the default scheduler is better for our current use case.

Another related question: do we need to use the k8s Job / StatefulSet at all? Another possibility is that we submit the creation and deletion of Pods directly (while still using the default scheduler).

Collaborator (author):

> Another related question: do we need to use the k8s Job / StatefulSet at all? Another possibility is that we submit the creation and deletion of Pods directly (while still using the default scheduler).

This is possible and may be useful. However, the controller would then have to track all pods' statuses itself, which is what the k8s Job/StatefulSet controllers already implement.

Pros of directly controlling pods (a sketch follows below):

  • We can control scaling by pod status, i.e., scale up/down the slowest pod (the pod running the smallest batch ID).
  • We can dynamically change the resource requests of pods.
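A rough sketch of what direct pod control could look like with a recent client-go API (the namespace, labels, and trainer image are made-up placeholders, and error handling is trimmed):

```go
package controller

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleUp creates one more trainer pod for a job, leaving placement to the
// default scheduler; the controller, not a Job/StatefulSet, owns the pod.
func scaleUp(ctx context.Context, cs kubernetes.Interface, jobName string, idx int) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   fmt.Sprintf("%s-trainer-%d", jobName, idx),
			Labels: map[string]string{"paddle-job": jobName},
		},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "paddlepaddle/trainer:latest", // placeholder image
			}},
		},
	}
	_, err := cs.CoreV1().Pods("paddle-jobs").Create(ctx, pod, metav1.CreateOptions{})
	return err
}

// scaleDown removes one specific trainer pod, e.g. the slowest one.
func scaleDown(ctx context.Context, cs kubernetes.Interface, podName string) error {
	return cs.CoreV1().Pods("paddle-jobs").Delete(ctx, podName, metav1.DeleteOptions{})
}
```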

Collaborator (author):

@helinwang @Yancey1989 what do you think of "submitting the creation and deletion of Pods directly"? I'll update the doc if we all agree on this.

Collaborator:

> another possibility is we can submit the creation and deletion of Pods directly (but still using the default scheduler)

Maybe we also need a custom scheduler, because merely creating and deleting Pods still produces Pending Pods, and the default scheduler uses a FIFO queue to schedule them, so we cannot dynamically adjust the priorities of all the pending Pods.

Collaborator:

@gongweibao: please take a look at this discussion; it's related to your converting the Python start-job code into a Go controller.

Collaborator:

OK!

gongweibao (Collaborator) left a comment:

LGTM

typhoonzero merged commit 304c4fc into PaddlePaddle:develop on Nov 20, 2017