
Add fault tolerant design doc for fluid #11625

Closed

Conversation

typhoonzero
Contributor

Add new fault-tolerant design doc which has no master.

Contributor

@gongweibao gongweibao left a comment

Would it be more convenient to have a master manage the state of the task queues?

The default fault-tolerant mechanism using the checkpoint feature has some limitations:

1. Processes on all nodes must be restarted and load the checkpoint from storage.
1. The offset of the data reader is not saved, so a recovered job must train from the start.
Contributor

Isn't the offset already saved?

Contributor Author

Not in the current version, as far as I know.
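For context, a minimal sketch of what "saving the data reader offset" could look like: the checkpoint records which file and which record the reader had reached, so a recovered job can skip data it already consumed instead of training from the start. The path, JSON format, and helper names are assumptions for illustration, not the existing Fluid checkpoint API.

```python
import json
import os

READER_STATE = "/mnt/ckpt/reader_state.json"  # hypothetical shared-storage path

def save_reader_state(file_idx, record_offset):
    # Persist the reader position alongside the model checkpoint.
    with open(READER_STATE, "w") as f:
        json.dump({"file_idx": file_idx, "record_offset": record_offset}, f)

def load_reader_state():
    # A fresh job (no state file) starts from the beginning.
    if not os.path.exists(READER_STATE):
        return 0, 0
    with open(READER_STATE) as f:
        state = json.load(f)
    return state["file_idx"], state["record_offset"]

def reader(file_list):
    start_file, start_rec = load_reader_state()
    for fi in range(start_file, len(file_list)):
        with open(file_list[fi]) as f:
            for rec_no, record in enumerate(f):
                if fi == start_file and rec_no < start_rec:
                    continue  # skip records consumed before the failure
                yield record
                save_reader_state(fi, rec_no + 1)  # in practice, save per batch
```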


### Trainer Recovery

Trainers will use etcd transactions to fetch training data chunks from the "Todo" queue, and put them into a
Contributor

Which component splits the data into chunks and puts the chunks into the "Todo" queue?

Contributor Author
@typhoonzero typhoonzero Jun 25, 2018

A chunk here means a recordio chunk, or even a whole file. The startup program can acquire a distributed lock and then push the chunks. This operation could also be generalized into something like distributed_run_once.
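A hedged sketch of the distributed_run_once idea above: whichever trainer's startup program acquires the etcd lock first splits the dataset into chunks (recordio chunks or whole files) and pushes their indices to the "Todo" queue; the others see the "done" marker and skip. The python etcd3 client and the key names are assumptions, not part of the design doc.

```python
import etcd3

TODO_PREFIX = "/paddle/job-0/todo/"            # hypothetical key layout
INIT_DONE_KEY = "/paddle/job-0/todo_init_done"

def push_chunks_once(chunk_ids, etcd_host="127.0.0.1"):
    client = etcd3.client(host=etcd_host)
    with client.lock("todo-init", ttl=60):     # distributed lock
        done, _ = client.get(INIT_DONE_KEY)
        if done is not None:
            return                             # another trainer already pushed
        for cid in chunk_ids:
            client.put(TODO_PREFIX + cid, "")  # enqueue chunk index
        client.put(INIT_DONE_KEY, "1")         # one-time init finished
```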

and be pushed back to the "Todo" queue later on. When the failed trainer is brought up by Kubernetes,
it will ask for a new chunk from the "Todo" queue and continue the training.

Each trainer has a daemonized thread that periodically obtains a distributed etcd lock and tries to find
Contributor

What's the time interval to do it? If we have a large number of trainers, many of these actions may be duplicated.

### Parameter Server Recovery

When one of the pservers goes down and is then restarted by Kubernetes,
it will start in a different pod with a different network identity (IP address). Meanwhile,
Contributor

Do we need to split this out into a separate name service?

receiving "get" calls, then the recovered pserver will start to wait "send" calls, this may
cause the job wait for ever.

We design the pserver so that it can start in a "recovery mode" when it's automatically brought up
Contributor

I didn't quite understand this part about the barrier. Isn't the barrier state saved?

Contributor

My understanding is that in recovery mode, the pserver will skip the update for the first batch.

Collaborator

One question here: if the first batch is skipped, that means we lose one batch during recovery, so some batches are never used for optimization.

- parameter server liveness
- trainer liveness
- distributed job queues recording the training data offsets

Contributor

Another point: we also need to require users to use a uniform data format, such as recordio, so that we can save offsets and dispatch tasks.


To be "Full Fault-Tolerant", we will enable the distributed training job to be able to
detect hardware failures and recover the training process in a short time. To achieve
this, the following states must be recorded and watched by the job nodes, we
Contributor

the following states must be recorded and watched by the job nodes

How do the job nodes record or watch these states? Do you mean a distributed storage like etcd?
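For concreteness, one possible etcd layout for these states (an assumption for illustration; the doc only says the states must be recorded and watched, and etcd is the storage used elsewhere in the design). Liveness falls out of lease expiry: a node writes its key with a TTL lease and keeps refreshing it, so the key disappears and watchers are notified if the node dies.

```python
# /paddle/job-0/pservers/<id>    -> endpoint, kept alive by a TTL lease
# /paddle/job-0/trainers/<id>    -> trainer id, kept alive by a TTL lease
# /paddle/job-0/todo/<chunk>     -> chunks not yet claimed
# /paddle/job-0/pending/<chunk>  -> chunks currently being trained
# /paddle/job-0/complete/<chunk> -> finished chunks (training data progress)
import etcd3

def register_pserver(pserver_id, endpoint, ttl=10, etcd_host="127.0.0.1"):
    client = etcd3.client(host=etcd_host)
    lease = client.lease(ttl)  # expires if heartbeats stop
    client.put("/paddle/job-0/pservers/%s" % pserver_id, endpoint, lease=lease)
    return lease               # caller periodically calls lease.refresh()
```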

### Parameter Server Recovery

When one of the pservers goes down and is then restarted by Kubernetes,
it will start in a different pod with a different network identity (IP address). Meanwhile,
Contributor

In some cases, only the pserver container is restarted instead of the whole Pod, so the IP address would not be changed.

"Pending" queue.when one chunk is finished the chunk's index will be pushed to the etcd "Complete"
queue.

When one trainer fails, the data chunk should be in the "Pending" queue; this chunk will time out
Contributor

When one trainer fails, the data chunk should be in the "Pending" queue; this chunk will time out
and be pushed back to "Todo" later on.

If we don't have a master, who monitors the chunk timeout?
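One way to handle the timeout without a master, as a hedged sketch: a trainer claims a chunk by moving it from "Todo" to "Pending" in a single etcd transaction, attaching its own TTL lease, so the pending key disappears automatically if the trainer dies; the daemonized thread mentioned earlier in the doc then re-queues any chunk that is in none of the three queues. Key names and the python etcd3 client are assumptions.

```python
import etcd3

JOB = "/paddle/job-0"

def claim_chunk(client, chunk, lease):
    todo = JOB + "/todo/" + chunk
    pending = JOB + "/pending/" + chunk
    ok, _ = client.transaction(
        compare=[client.transactions.version(todo) > 0],  # still unclaimed
        success=[client.transactions.delete(todo),
                 client.transactions.put(pending, "", lease=lease)],
        failure=[],
    )
    return ok  # False: another trainer claimed it first

def requeue_timed_out(client, all_chunks):
    # Run periodically under client.lock("scavenger"): a chunk whose pending
    # key expired (trainer died) and that was never completed goes back to "Todo".
    for chunk in all_chunks:
        in_some_queue = any(
            client.get(JOB + q + chunk)[0] is not None
            for q in ("/todo/", "/pending/", "/complete/"))
        if not in_some_queue:
            client.put(JOB + "/todo/" + chunk, "")
```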

receiving "get" calls, then the recovered pserver will start to wait "send" calls, this may
cause the job wait for ever.

We design the pserver can start with a "recovery mode", when it's automatically bringed up
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我的理解是在recovery mode, pserver会跳过第一个batch的更新。

@Yancey1989
Contributor

How about using a figure to make this design clearer?

@typhoonzero
Contributor Author

@Yancey1989 sure, will add.

@typhoonzero typhoonzero changed the title Add fault tolerant design doc Add fault tolerant design doc for fluid Jun 21, 2018

- parameter server liveness
- trainer liveness
- distributed job queues recording the training data offsets
Contributor

Developing, maintaining, and testing the bookkeeping of training-data progress will probably take a lot of time. Do we really need to maintain this state?
By comparison, if each reader just reads its assigned files at random, with no central queue dispatching training data, the implementation could be much simpler.

Contributor Author

Would the approach of each reader randomly reading files need to go through some testing and validation first? If it works, the design would indeed be much simpler.

We previously hit a scenario where a large amount of incremental data had to be trained every day; the data volume was huge and finishing one pass per day already took a lot of time. With a fully random reader, the training data might not be fully covered, and I'm not sure whether that would have an impact.

My understanding is that having the reader randomly pick files effectively adds a "sampler" whose sampling method is fixed to random sampling, which should have some limitations?

Contributor

Indeed, randomly sampling from scratch after every restart has limitations, especially with scaling, when trainers may restart frequently. So it does seem we need to record the reader state.

I think we can do without a data dispatch queue at first: when a trainer starts it is assigned which files to train on, the reader state is saved, and it can continue after a restart.

If a dispatch queue is really needed, there could be a dequeue op plus a while op: in each loop the reader dequeues a file name and reads it to the end before entering the next loop, with a separate process dedicated to dispatching data (enqueueing file paths). But I'm not sure this is actually needed.

Just some thoughts, for reference :)

Contributor Author

I feel that to support EDL we still need a queue, to record whether a pass has been fully completed?

Contributor

Yeah, that's true. Sounds good.
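For reference, a rough sketch of the dequeue-per-iteration reader loop discussed above, written as plain Python rather than Fluid dequeue/while operators: each iteration takes one file path from a shared queue, reads the file to the end, then loops; a separate process enqueues the file paths for the pass. The etcd-backed queue and key prefix are assumptions for illustration.

```python
import etcd3

FILES_PREFIX = "/paddle/job-0/pass-files/"  # hypothetical per-pass file queue

def file_reader_loop(etcd_host="127.0.0.1"):
    client = etcd3.client(host=etcd_host)
    while True:
        items = list(client.get_prefix(FILES_PREFIX))
        if not items:
            break                            # the pass is finished
        value, meta = items[0]
        if not client.delete(meta.key):      # dequeue; another trainer won the race
            continue
        with open(value.decode()) as f:      # read this file to the end
            for record in f:
                yield record
```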

When one of the pservers goes down and is then restarted by Kubernetes,
it will start in a different pod with a different network identity (IP address). Meanwhile,
trainers may still be trying to send gradients to that non-existing server. So trainers must
watch the pserver liveness states and change the retry request target to the newly recovered
Collaborator

Actually, we can switch the request target without watching whether the pserver is alive: when an RPC returns a CONNECTION_REFUSE error, just look up in etcd which pserver is currently registered.

Contributor Author

The deadline might be set to a very long value; recovering only after such a timeout would take too long.
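A hedged sketch of the alternative above: instead of watching pserver liveness, the trainer reacts to a failed RPC by re-reading the pserver endpoint registered in etcd and retrying against it, keeping the per-request deadline short so detection stays fast (which addresses the long-deadline concern). The grpc/etcd3 usage, key name, and helper are assumptions for illustration.

```python
import grpc
import etcd3

PSERVER_KEY = "/paddle/job-0/pservers/0"  # hypothetical registration key

def send_with_failover(send_fn, etcd_client, endpoint, timeout=5.0, retries=3):
    # send_fn(endpoint, timeout=...) performs one "send gradients" RPC.
    for _ in range(retries):
        try:
            return send_fn(endpoint, timeout=timeout)
        except grpc.RpcError as e:
            if e.code() not in (grpc.StatusCode.UNAVAILABLE,
                                grpc.StatusCode.DEADLINE_EXCEEDED):
                raise
            value, _ = etcd_client.get(PSERVER_KEY)  # who is the pserver now?
            if value is not None:
                endpoint = value.decode()            # switch to the recovered pserver
    raise RuntimeError("pserver still unreachable after %d retries" % retries)
```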

receiving "get" calls, then the recovered pserver will start to wait "send" calls, this may
cause the job wait for ever.

We design the pserver can start with a "recovery mode", when it's automatically bringed up
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里有一点疑问, 如果跳过了第一个batch, 那就是说在recovery时, 我们会损失一个batch, 导致部分batch没有被优化

current barrier. When training continues, the "recovery mode" is turned off automatically.
In general, the pserver can start up with an option `--recovery` which enables the barrier condition
wait method for only one loop.

Collaborator

One more question: after a pserver recovers, how are the parameters that were stored on that pserver restored?
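To make the barrier behavior in the quoted passage concrete, a minimal sketch (illustrative pseudologic, not the actual Fluid listen_and_serv implementation): a pserver started with `--recovery` skips the send barrier exactly once, so it can rejoin trainers that have already passed the gradient-sending point, then behaves normally.

```python
import threading

class SendBarrier(object):
    def __init__(self, num_trainers, recovery=False):
        self.num_trainers = num_trainers
        self.recovery = recovery           # set when started with --recovery
        self.cond = threading.Condition()
        self.received = 0

    def wait_all_sends(self):
        if self.recovery:
            self.recovery = False          # relax the barrier for one loop only
            return
        with self.cond:
            while self.received < self.num_trainers:
                self.cond.wait()
            self.received = 0

    def on_send(self):
        with self.cond:
            self.received += 1
            self.cond.notify_all()
```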

@Yancey1989 Yancey1989 mentioned this pull request Jul 4, 2018
@luotao1
Contributor

luotao1 commented Feb 1, 2019

Thanks for contributing documentation to PaddlePaddle! Since documents have been moved to the FluidDoc repo, we are closing this PR. You are welcome to contribute documentation to the FluidDoc repo.

@luotao1 luotao1 closed this Feb 1, 2019