Add fault tolerant design doc for fluid #11625
Conversation
Wouldn't it be more convenient to have the master manage the state of the task queues?
The default fault tolerance using the checkpoint feature has some limitations:

1. Processes on all nodes must be restarted and load the checkpoint from storage.
1. The offset of the data reader is not saved, so a recovered job must train from the start.
Isn't the offset already saved?
I don't think the current version saves it yet.
### Trainer Recovery
Trainers will use etcd transactions to fetch training data chunks from the "Todo" queue and put them into a "Pending" queue.
Which component splits the data into chunks and puts the chunks into the "Todo" queue?
A chunk here means a recordio chunk, or it can also be a whole file. The startup program can acquire a distributed lock and then push the chunks. This operation could also be generalized into something like distributed_run_once.
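A minimal sketch of what such a `distributed_run_once` helper could look like, assuming the python-etcd3 client and hypothetical key names (`/ft/init_done`, `/ft/todo/<index>`); neither the key layout nor the client library is specified in the doc:

```python
import etcd3

def distributed_run_once(client, fn, lock_name="init_lock", done_key="/ft/init_done"):
    """Run `fn` exactly once across all trainers, guarded by a distributed etcd lock."""
    with client.lock(lock_name, ttl=60):
        value, _ = client.get(done_key)
        if value is None:            # nobody has run it yet
            fn()
            client.put(done_key, "1")

def push_chunks_to_todo(client, chunk_paths):
    # Each "chunk" is a recordio chunk or simply a whole file path.
    for i, path in enumerate(chunk_paths):
        client.put("/ft/todo/%05d" % i, path)

client = etcd3.client()
distributed_run_once(
    client, lambda: push_chunks_to_todo(client, ["data/part-00000", "data/part-00001"]))
```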
and be pushed back to "Todo" later on. When the failed trainer is brought up by Kubernetes, | ||
it will ask for a new chunk from "Todo" queue and continues the training. | ||
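A sketch of how a trainer could claim a chunk atomically, assuming python-etcd3 and a hypothetical key layout where `/ft/todo/<index>` holds a chunk path and `/ft/pending/<index>` holds a small JSON record with a claim timestamp:

```python
import json
import time

import etcd3

def claim_chunk(client):
    """Atomically move one chunk from the "Todo" queue to the "Pending" queue."""
    for value, meta in client.get_prefix("/ft/todo/"):
        todo_key = meta.key.decode()
        pending_key = todo_key.replace("/ft/todo/", "/ft/pending/", 1)
        record = json.dumps({"chunk": value.decode(), "claimed_at": time.time()})
        # The transaction succeeds only if the Todo entry still exists,
        # so two trainers can never claim the same chunk.
        ok, _ = client.transaction(
            compare=[client.transactions.version(todo_key) > 0],
            success=[client.transactions.delete(todo_key),
                     client.transactions.put(pending_key, record)],
            failure=[])
        if ok:
            return value.decode()
    return None  # "Todo" queue is empty

client = etcd3.client()
chunk = claim_chunk(client)
```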
Each trainer has a daemonized thread that periodically obtains a distributed etcd lock and tries to find timed-out chunks in the "Pending" queue.
What's the time interval to do it? If we have a large number of trainers, many of these actions may be duplicated.
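For illustration, a sketch of that daemonized thread, still assuming python-etcd3 and the key layout above; the 60-second interval and 10-minute timeout are placeholder values, and because the scan runs under an etcd lock, at most one trainer performs it per round even when there are many trainers:

```python
import json
import threading
import time

import etcd3

SCAN_INTERVAL = 60     # placeholder: how often each trainer tries to run the scan
CHUNK_TIMEOUT = 600    # placeholder: seconds before a pending chunk is considered lost

def requeue_timed_out_chunks(client):
    now = time.time()
    for value, meta in client.get_prefix("/ft/pending/"):
        record = json.loads(value)
        if now - record["claimed_at"] > CHUNK_TIMEOUT:
            pending_key = meta.key.decode()
            todo_key = pending_key.replace("/ft/pending/", "/ft/todo/", 1)
            # Push the chunk back to "Todo" and drop the stale "Pending" entry.
            client.transaction(
                compare=[client.transactions.version(pending_key) > 0],
                success=[client.transactions.delete(pending_key),
                         client.transactions.put(todo_key, record["chunk"])],
                failure=[])

def scan_loop(client):
    while True:
        # Only the trainer that gets the lock scans; the others skip this round.
        lock = client.lock("pending_scan", ttl=120)
        if lock.acquire(timeout=1):
            try:
                requeue_timed_out_chunks(client)
            finally:
                lock.release()
        time.sleep(SCAN_INTERVAL)

client = etcd3.client()
threading.Thread(target=scan_loop, args=(client,), daemon=True).start()
```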
### Parameter Server Recovery

When one of the pservers goes down and is then restarted by Kubernetes,
it will start in a different pod with a different network identity (IP address). Meanwhile,
Do we need to split this out into a separate name service?
receiving "get" calls, then the recovered pserver will start to wait "send" calls, this may | ||
cause the job wait for ever. | ||
|
||
We design the pserver can start with a "recovery mode", when it's automatically bringed up |
I don't quite understand this part about the barrier. Isn't the barrier state information saved?
My understanding is that in recovery mode, the pserver will skip the update for the first batch.
One question here: if the first batch is skipped, that means we lose one batch during recovery, so some batches are never used for optimization.
- parameter server liveness
- trainer liveness
- distributed job queues recording the training data offsets
Another point: we also need to require a uniform data format from the user, such as recordio, so that we can save offsets and dispatch tasks.
To be "Full Fault-Tolerant", we will enable the distributed training job to be able to | ||
detect hardware failures and recover the training process in a short time. To achieve | ||
this, the following states must be recorded and watched by the job nodes, we |
the following states must be recorded and watched by the job nodes
How do the job nodes record or watch these states? Do you mean a distributed storage like etcd?
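One way these states could be kept in etcd, sketched with python-etcd3 (the key prefix `/ft/pservers/` and the TTL are assumptions): each node registers itself under a lease it keeps alive, and the other nodes watch the prefix to detect failures and recoveries.

```python
import etcd3

client = etcd3.client()

# A pserver registers its liveness; the key disappears automatically
# if the process dies and stops refreshing the lease.
lease = client.lease(ttl=10)
client.put("/ft/pservers/0", "10.1.2.3:8000", lease=lease)
lease.refresh()   # called periodically from a heartbeat thread

# Trainers watch the prefix to learn about crashes and re-registrations.
events_iterator, cancel = client.watch_prefix("/ft/pservers/")
for event in events_iterator:
    if isinstance(event, etcd3.events.DeleteEvent):
        print("pserver went down:", event.key.decode())
    elif isinstance(event, etcd3.events.PutEvent):
        print("pserver (re)registered at:", event.value.decode())
```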
### Parameter Server Recovery

When one of the pservers goes down and is then restarted by Kubernetes,
it will start in a different pod with a different network identity (IP address). Meanwhile,
Somehow, if only the pserver container is restarted instead of the whole Pod, would the IP address still be changed?
"Pending" queue.when one chunk is finished the chunk's index will be pushed to the etcd "Complete" | ||
queue. | ||
|
||
When one trainer fails, the data chunk should be in "Pending" queue, this chunk will timeout |
When one trainer fails, the data chunk should be in "Pending" queue, this chunk will timeout
+and be pushed back to "Todo" later on
If we don't have a master, who monitors the chunk timeout?
receiving "get" calls, then the recovered pserver will start to wait "send" calls, this may | ||
cause the job wait for ever. | ||
|
||
We design the pserver can start with a "recovery mode", when it's automatically bringed up |
My understanding is that in recovery mode, the pserver will skip the update for the first batch.
How about using a figure to make this design clearer?
@Yancey1989 sure, will add.
- parameter server liveness
- trainer liveness
- distributed job queues recording the training data offsets
Developing, maintaining, and testing the saving of training data progress will probably take a lot of time. Do we really need to maintain this state?
By comparison, if each reader just randomly reads its assigned files, without a central training data dispatch queue, the implementation could be much simpler.
Does the approach of each reader randomly reading files need to go through some testing and validation? If it works, the design would indeed be much simpler.
We previously ran into a scenario where a large amount of incremental data has to be trained every day; the data volume is huge and finishing one pass per day already takes a lot of time. With a completely random reader, the training data might not be fully covered, and I'm not sure whether that would have an impact.
My understanding is that having readers pick files at random effectively adds a "sampler" whose sampling method is fixed to "random sampling", which should have some limitations?
True, randomly sampling from scratch after every restart does have limitations, especially with scaling, where trainers may restart frequently. So it does seem necessary to record the reader state.
I think we can do without a data dispatch queue at first: the files to train on can be assigned when the trainer starts, the reader state is saved, and training can continue after a restart.
If a dispatch queue is really needed, there could be a dequeue op together with a while op: in each loop iteration the reader dequeues a file name, reads it to the end, and then enters the next iteration. A separate process would be dedicated to dispatching data (enqueueing file paths). But I'm not sure this is actually needed.
Just some thoughts, for reference :)
I feel that to support EDL we still need a queue, to record whether a full pass has been completed.
Yeah, that's true. OK.
When one of the pservers goes down and is then restarted by Kubernetes,
it will start in a different pod with a different network identity (IP address). Meanwhile,
trainers may still be trying to send gradients to that non-existent server, so trainers must
watch the pserver liveness states and change the retry request target to the newly recovered
Actually, we can switch the request target without watching whether the pserver is alive: when an RPC returns a CONNECTION_REFUSE error, just check etcd to see which pserver is currently registered.
The RPC deadline may be set very long, though; then the time to recover via a timeout would be too long.
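A sketch of that suggestion, assuming python-etcd3, the `/ft/pservers/<id>` registration key from above, and a hypothetical `send_gradient(endpoint, grads)` RPC; gRPC's `UNAVAILABLE` status stands in for the CONNECTION_REFUSE case:

```python
import time

import etcd3
import grpc  # used only for the status codes; the RPC stub itself is hypothetical

client = etcd3.client()

def lookup_pserver(pserver_id):
    """Read the endpoint currently registered for this pserver in etcd."""
    value, _ = client.get("/ft/pservers/%d" % pserver_id)
    return value.decode() if value else None

def send_with_relookup(pserver_id, grads, send_gradient, max_retries=10):
    endpoint = lookup_pserver(pserver_id)
    for _ in range(max_retries):
        try:
            return send_gradient(endpoint, grads)   # hypothetical RPC call
        except grpc.RpcError as e:
            if e.code() == grpc.StatusCode.UNAVAILABLE:
                # Connection refused: the pserver has probably moved to a new pod,
                # so re-read its registered endpoint instead of retrying blindly.
                time.sleep(1)
                endpoint = lookup_pserver(pserver_id)
            else:
                raise
    raise RuntimeError("pserver %d unreachable after retries" % pserver_id)
```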
receiving "get" calls, then the recovered pserver will start to wait "send" calls, this may | ||
cause the job wait for ever. | ||
|
||
We design the pserver can start with a "recovery mode", when it's automatically bringed up |
One question here: if the first batch is skipped, that means we lose one batch during recovery, so some batches are never used for optimization.
current barrier. When training continues, the "recovery mode" is turned off automatically.
In general, the pserver can start up with an option `--recovory` which enables the barrier condition
wait method for only one loop.
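A sketch of how that one-loop barrier skip could work inside the pserver's update loop; the condition variable, trainer count, and `recovery_mode` flag are assumptions for illustration, not the actual Fluid implementation:

```python
import threading

class BarrierCondition(object):
    """Barrier the pserver uses to wait until all trainers have sent gradients."""

    def __init__(self, num_trainers, recovery_mode=False):
        self.num_trainers = num_trainers
        self.recovery_mode = recovery_mode   # set from the startup option when recovered
        self.arrived = 0
        self.cond = threading.Condition()

    def trainer_arrived(self):
        with self.cond:
            self.arrived += 1
            if self.arrived >= self.num_trainers:
                self.cond.notify_all()

    def wait_all_trainers(self):
        with self.cond:
            if self.recovery_mode:
                # A recovered pserver has missed some "send" calls of the current
                # batch, so waiting for them would hang the job. Skip this barrier
                # once, then behave normally from the next batch on.
                self.recovery_mode = False
                self.arrived = 0
                return
            while self.arrived < self.num_trainers:
                self.cond.wait()
            self.arrived = 0
```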
Another question: after the pserver recovers, how are the parameters stored on that pserver restored?
Add new fault-tolerant design doc which has no master.