Add fault tolerant design doc for fluid #11625
Conversation
Wouldn't it be more convenient to have the master manage the state of the task queues?
The default fault tolerance using the checkpoint feature has some limitations:

1. Processes on all nodes must be restarted and load the checkpoint from storage.
1. The offset of the data reader is not saved, so a recovered job must train from the start.
Isn't the offset already saved?
I don't think the current version saves it yet.
### Trainer Recovery
Trainers will use etcd transactions to fetch training data chunks from the "Todo" queue and put them into a "Pending" queue.
Which component splits the data into chunks and puts the chunks into the "Todo" queue?
A chunk here means a recordio chunk, or it can also be a whole file. The startup program can acquire a distributed lock and then push the chunks. This operation could also be generalized into something like distributed_run_once.
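A minimal sketch of what such a `distributed_run_once` helper could look like, assuming the python-etcd3 client and hypothetical key names (`/ft/init_done`, `/ft/todo/<index>`); neither the key layout nor the client library is specified in the doc:

```python
import etcd3

def distributed_run_once(client, fn, lock_name="init_lock", done_key="/ft/init_done"):
    """Run `fn` exactly once across all trainers, guarded by a distributed etcd lock."""
    with client.lock(lock_name, ttl=60):
        value, _ = client.get(done_key)
        if value is None:            # nobody has run it yet
            fn()
            client.put(done_key, "1")

def push_chunks_to_todo(client, chunk_paths):
    # Each "chunk" is a recordio chunk or simply a whole file path.
    for i, path in enumerate(chunk_paths):
        client.put("/ft/todo/%05d" % i, path)

client = etcd3.client()
distributed_run_once(
    client, lambda: push_chunks_to_todo(client, ["data/part-00000", "data/part-00001"]))
```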
and be pushed back to "Todo" later on. When the failed trainer is brought up by Kubernetes, | ||
it will ask for a new chunk from "Todo" queue and continues the training. | ||
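A sketch of how a trainer could claim a chunk atomically, assuming python-etcd3 and a hypothetical key layout where `/ft/todo/<index>` holds a chunk path and `/ft/pending/<index>` holds a small JSON record with a claim timestamp:

```python
import json
import time

import etcd3

def claim_chunk(client):
    """Atomically move one chunk from the "Todo" queue to the "Pending" queue."""
    for value, meta in client.get_prefix("/ft/todo/"):
        todo_key = meta.key.decode()
        pending_key = todo_key.replace("/ft/todo/", "/ft/pending/", 1)
        record = json.dumps({"chunk": value.decode(), "claimed_at": time.time()})
        # The transaction succeeds only if the Todo entry still exists,
        # so two trainers can never claim the same chunk.
        ok, _ = client.transaction(
            compare=[client.transactions.version(todo_key) > 0],
            success=[client.transactions.delete(todo_key),
                     client.transactions.put(pending_key, record)],
            failure=[])
        if ok:
            return value.decode()
    return None  # "Todo" queue is empty

client = etcd3.client()
chunk = claim_chunk(client)
```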
Each trainer has a daemonized thread that periodically obtains a distributed etcd lock and tries to find timed-out chunks in the "Pending" queue.
What's the time interval to do it? If we have a large number of trainers, many of these actions may be duplicated.
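For illustration, a sketch of that daemonized thread, still assuming python-etcd3 and the key layout above; the 60-second interval and 10-minute timeout are placeholder values, and because the scan runs under an etcd lock, at most one trainer performs it per round even when there are many trainers:

```python
import json
import threading
import time

import etcd3

SCAN_INTERVAL = 60     # placeholder: how often each trainer tries to run the scan
CHUNK_TIMEOUT = 600    # placeholder: seconds before a pending chunk is considered lost

def requeue_timed_out_chunks(client):
    now = time.time()
    for value, meta in client.get_prefix("/ft/pending/"):
        record = json.loads(value)
        if now - record["claimed_at"] > CHUNK_TIMEOUT:
            pending_key = meta.key.decode()
            todo_key = pending_key.replace("/ft/pending/", "/ft/todo/", 1)
            # Push the chunk back to "Todo" and drop the stale "Pending" entry.
            client.transaction(
                compare=[client.transactions.version(pending_key) > 0],
                success=[client.transactions.delete(pending_key),
                         client.transactions.put(todo_key, record["chunk"])],
                failure=[])

def scan_loop(client):
    while True:
        # Only the trainer that gets the lock scans; the others skip this round.
        lock = client.lock("pending_scan", ttl=120)
        if lock.acquire(timeout=1):
            try:
                requeue_timed_out_chunks(client)
            finally:
                lock.release()
        time.sleep(SCAN_INTERVAL)

client = etcd3.client()
threading.Thread(target=scan_loop, args=(client,), daemon=True).start()
```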
### Parameter Server Recovery

When one of the pservers goes down and is then restarted by Kubernetes,
it will start in a different pod with a different network identity (IP address). Meanwhile,
Do we need to split this out into a separate name service?
receiving "get" calls, then the recovered pserver will start to wait "send" calls, this may | ||
cause the job wait for ever. | ||
|
||
We design the pserver can start with a "recovery mode", when it's automatically bringed up |
I don't quite understand this part about the barrier. Isn't the barrier state information saved?
My understanding is that in recovery mode, the pserver will skip the update for the first batch.
One question here: if the first batch is skipped, that means we lose one batch during recovery, so some batches are never used for optimization.
- parameter server liveness
- trainer liveness
- distributed job queues recording the training data offsets
Another point: we also need to require a uniform data format from the user, such as recordio, so that we can save offsets and dispatch tasks.
To be "Full Fault-Tolerant", we will enable the distributed training job to be able to | ||
detect hardware failures and recover the training process in a short time. To achieve | ||
this, the following states must be recorded and watched by the job nodes, we |
the following states must be recorded and watched by the job nodes
How do the job nodes record or watch these states? Do you mean a distributed storage like etcd?
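One way these states could be kept in etcd, sketched with python-etcd3 (the key prefix `/ft/pservers/` and the TTL are assumptions): each node registers itself under a lease it keeps alive, and the other nodes watch the prefix to detect failures and recoveries.

```python
import etcd3

client = etcd3.client()

# A pserver registers its liveness; the key disappears automatically
# if the process dies and stops refreshing the lease.
lease = client.lease(ttl=10)
client.put("/ft/pservers/0", "10.1.2.3:8000", lease=lease)
lease.refresh()   # called periodically from a heartbeat thread

# Trainers watch the prefix to learn about crashes and re-registrations.
events_iterator, cancel = client.watch_prefix("/ft/pservers/")
for event in events_iterator:
    if isinstance(event, etcd3.events.DeleteEvent):
        print("pserver went down:", event.key.decode())
    elif isinstance(event, etcd3.events.PutEvent):
        print("pserver (re)registered at:", event.value.decode())
```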
### Parameter Server Recovery

When one of the pservers goes down and is then restarted by Kubernetes,
it will start in a different pod with a different network identity (IP address). Meanwhile,
Somehow, if only the pserver container is restarted instead of the whole Pod, would the IP address still be changed?
"Pending" queue.when one chunk is finished the chunk's index will be pushed to the etcd "Complete" | ||
queue. | ||
|
||
When one trainer fails, the data chunk should be in "Pending" queue, this chunk will timeout |
When one trainer fails, the data chunk should be in "Pending" queue, this chunk will timeout
+and be pushed back to "Todo" later on
If we don't have a master, who monitors the chunk timeout?
receiving "get" calls, then the recovered pserver will start to wait "send" calls, this may | ||
cause the job wait for ever. | ||
|
||
We design the pserver can start with a "recovery mode", when it's automatically bringed up |
My understanding is that in recovery mode, the pserver will skip the update for the first batch.
How about using a figure to make this design clearer?
@Yancey1989 sure, will add.
- parameter server liveness
- trainer liveness
- distributed job queues recording the training data offsets
Developing, maintaining, and testing the saving of training data progress will probably take a lot of time. Do we really need to maintain this state?
By comparison, if each reader just randomly reads its assigned files, without a central training data dispatch queue, the implementation could be much simpler.
Does the approach of each reader randomly reading files need to go through some testing and validation? If it works, the design would indeed be much simpler.
We previously ran into a scenario where a large amount of incremental data has to be trained every day; the data volume is huge and finishing one pass per day already takes a lot of time. With a completely random reader, the training data might not be fully covered, and I'm not sure whether that would have an impact.
My understanding is that having readers pick files at random effectively adds a "sampler" whose sampling method is fixed to "random sampling", which should have some limitations?
True, randomly sampling from scratch after every restart does have limitations, especially with scaling, where trainers may restart frequently. So it does seem necessary to record the reader state.
I think we can do without a data dispatch queue at first: the files to train on can be assigned when the trainer starts, the reader state is saved, and training can continue after a restart.
If a dispatch queue is really needed, there could be a dequeue op together with a while op: in each loop iteration the reader dequeues a file name, reads it to the end, and then enters the next iteration. A separate process would be dedicated to dispatching data (enqueueing file paths). But I'm not sure this is actually needed.
Just some thoughts, for reference :)
I feel that to support EDL we still need a queue, to record whether a full pass has been completed.
Yeah, that's true. OK.
When one of the pservers goes down and is then restarted by Kubernetes,
it will start in a different pod with a different network identity (IP address). Meanwhile,
trainers may still be trying to send gradients to that non-existent server, so trainers must
watch the pserver liveness states and change the retry request target to the newly recovered
Actually, we can switch the request target without watching whether the pserver is alive: when an RPC returns a CONNECTION_REFUSE error, just check etcd to see which pserver is currently registered.
The RPC deadline may be set very long, though; then the time to recover via a timeout would be too long.
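A sketch of that suggestion, assuming python-etcd3, the `/ft/pservers/<id>` registration key from above, and a hypothetical `send_gradient(endpoint, grads)` RPC; gRPC's `UNAVAILABLE` status stands in for the CONNECTION_REFUSE case:

```python
import time

import etcd3
import grpc  # used only for the status codes; the RPC stub itself is hypothetical

client = etcd3.client()

def lookup_pserver(pserver_id):
    """Read the endpoint currently registered for this pserver in etcd."""
    value, _ = client.get("/ft/pservers/%d" % pserver_id)
    return value.decode() if value else None

def send_with_relookup(pserver_id, grads, send_gradient, max_retries=10):
    endpoint = lookup_pserver(pserver_id)
    for _ in range(max_retries):
        try:
            return send_gradient(endpoint, grads)   # hypothetical RPC call
        except grpc.RpcError as e:
            if e.code() == grpc.StatusCode.UNAVAILABLE:
                # Connection refused: the pserver has probably moved to a new pod,
                # so re-read its registered endpoint instead of retrying blindly.
                time.sleep(1)
                endpoint = lookup_pserver(pserver_id)
            else:
                raise
    raise RuntimeError("pserver %d unreachable after retries" % pserver_id)
```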
receiving "get" calls, then the recovered pserver will start to wait "send" calls, this may | ||
cause the job wait for ever. | ||
|
||
We design the pserver can start with a "recovery mode", when it's automatically bringed up |
One question here: if the first batch is skipped, that means we lose one batch during recovery, so some batches are never used for optimization.
current barrier. When training continues, the "recovery mode" is turned off automatically.
In general, the pserver can start up with an option `--recovory` which enables the barrier condition
wait method for only one loop.
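A sketch of how that one-loop barrier skip could work inside the pserver's update loop; the condition variable, trainer count, and `recovery_mode` flag are assumptions for illustration, not the actual Fluid implementation:

```python
import threading

class BarrierCondition(object):
    """Barrier the pserver uses to wait until all trainers have sent gradients."""

    def __init__(self, num_trainers, recovery_mode=False):
        self.num_trainers = num_trainers
        self.recovery_mode = recovery_mode   # set from the startup option when recovered
        self.arrived = 0
        self.cond = threading.Condition()

    def trainer_arrived(self):
        with self.cond:
            self.arrived += 1
            if self.arrived >= self.num_trainers:
                self.cond.notify_all()

    def wait_all_trainers(self):
        with self.cond:
            if self.recovery_mode:
                # A recovered pserver has missed some "send" calls of the current
                # batch, so waiting for them would hang the job. Skip this barrier
                # once, then behave normally from the next batch on.
                self.recovery_mode = False
                self.arrived = 0
                return
            while self.arrived < self.num_trainers:
                self.cond.wait()
            self.arrived = 0
```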
Another question: after the pserver recovers, how are the parameters stored on that pserver restored?
Add new fault-tolerant design doc which has no master.