Add save checkpoint on pserver. #10376

Closed
typhoonzero opened this issue May 3, 2018 · 5 comments

Comments

@typhoonzero
Contributor

No description provided.

@seiriosPlus seiriosPlus self-assigned this May 4, 2018
@seiriosPlus seiriosPlus moved this from TODO to Fault Tolerant TODOs in PaddlePaddle Distributed Refactoring (Due: 201802) May 4, 2018
@seiriosPlus seiriosPlus moved this from Fault Tolerant TODOs to TODO in PaddlePaddle Distributed Refactoring (Due: 201802) May 4, 2018
@typhoonzero typhoonzero moved this from TODO to Fault Tolerant TODOs in PaddlePaddle Distributed Refactoring (Due: 201802) May 4, 2018
@seiriosPlus seiriosPlus moved this from Fault Tolerant TODOs to TODO in PaddlePaddle Distributed Refactoring (Due: 201802) May 4, 2018
@typhoonzero typhoonzero moved this from TODO to Fault Tolerant TODOs in PaddlePaddle Distributed Refactoring (Due: 201802) May 4, 2018
@seiriosPlus seiriosPlus moved this from Fault Tolerant TODOs to DOING in PaddlePaddle Distributed Refactoring (Due: 201802) May 9, 2018
@seiriosPlus
Collaborator

seiriosPlus commented May 9, 2018

The goal of the checkpoint work is to add checkpoint save/restore to the PServer, and to add restoring of variables and connections to the Trainer, in order to achieve fault tolerance.
The development plan is:

  1. PServer save checkpoint
  2. PServer restore checkpoint
  3. Trainer restore variables and connections
  4. Need to support async and sync

@seiriosPlus
Collaborator

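Background:
During training, PaddlePaddle jobs can fail because of sporadic external problems such as hardware or network faults. A checkpoint mechanism is needed to save training results partway through a run, so that after a failure training can be restored from the checkpoint and resumed.
Goals:
The work is split into three milestones:
M1: Save parameters on the Trainer 0 node. If any PServer/Trainer fails, restart all training nodes, restore the parameters from the checkpoint, and continue training.
M2: Support saving checkpoints on the PServer side.
M3:

  • Store the pserver endpoints in etcd; trainers can dynamically watch the pserver endpoints and reconnect after a change.
  • Store the offset in etcd; after recovering, a trainer reads its reader offset from etcd.

M1 design:
Trainer 0 saves all parameters through the checkpoint, and each Trainer independently saves its own reader offset to the checkpoint files.
NFS is needed to store/synchronize the checkpoint files. All trainers need read and write permission; pservers need read permission.
Add two OPs: ckpt_restore_op and ckpt_save_op. ckpt_restore_op is used in the startup_program phase to read parameters from the checkpoint file; ckpt_save_op is used after send_op to write the parameters to the checkpoint file.
The designs for M2 and M3 will be added later.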
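
As a rough illustration of the M1 flow (not the actual ckpt_save_op / ckpt_restore_op implementation), the sketch below saves and restores a checkpoint in a shared directory using only the Python standard library. The function names, the JSON file layout, and the file names `params.json` and `reader_offset_<id>.json` are assumptions made for this example; the write-then-rename step is also an assumption, added so that a reader on the shared filesystem is less likely to see a half-written file.

```python
import json
import os
import tempfile


def _atomic_write_json(path, payload):
    """Write JSON to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)


def save_checkpoint(ckpt_dir, trainer_id, reader_offset, params=None):
    """Hypothetical save step: every trainer stores its offset, trainer 0 stores params."""
    os.makedirs(ckpt_dir, exist_ok=True)
    offset_path = os.path.join(ckpt_dir, "reader_offset_%d.json" % trainer_id)
    _atomic_write_json(offset_path, {"offset": reader_offset})
    # Only Trainer 0 writes the full parameter set, as described in the M1 design.
    if trainer_id == 0 and params is not None:
        _atomic_write_json(os.path.join(ckpt_dir, "params.json"), params)


def restore_checkpoint(ckpt_dir, trainer_id):
    """Hypothetical restore step: read the shared params and this trainer's offset."""
    with open(os.path.join(ckpt_dir, "params.json")) as f:
        params = json.load(f)
    offset_path = os.path.join(ckpt_dir, "reader_offset_%d.json" % trainer_id)
    with open(offset_path) as f:
        reader_offset = json.load(f)["offset"]
    return params, reader_offset
```

In a restart after failure, each trainer would call `restore_checkpoint` with its own id during startup and resume reading data from the returned offset.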

@typhoonzero
Contributor Author

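Could the name ckpt_restore_op avoid the abbreviation if possible? It is longer that way, but easier to understand.

One more detail: when a pserver loads a checkpoint, it needs to know which part of the data it has to load, and that information should be passed to restore_op through an attr.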
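
To make the second point concrete, here is a minimal sketch of a pserver loading only its own part of a checkpoint. It assumes a simple round-robin assignment of parameter names to endpoints, which is chosen only for illustration and is not necessarily how PaddlePaddle distributes parameters; `assign_params` and `load_shard` are hypothetical helpers, and the list of names stands in for what would be passed to the restore op as an attr.

```python
def assign_params(param_names, endpoints):
    """Assumed policy: round-robin over sorted names so every node computes the same split."""
    assignment = {ep: [] for ep in endpoints}
    for i, name in enumerate(sorted(param_names)):
        assignment[endpoints[i % len(endpoints)]].append(name)
    return assignment


def load_shard(checkpoint, names_to_load):
    """Return only the parameters this pserver owns; fail loudly if one is missing."""
    missing = [n for n in names_to_load if n not in checkpoint]
    if missing:
        raise KeyError("checkpoint is missing parameters: %s" % missing)
    return {n: checkpoint[n] for n in names_to_load}


# Example: two pservers sharing three parameters (endpoints are placeholders).
checkpoint = {"fc_w": [0.1, 0.2], "fc_b": [0.0], "emb_w": [0.5]}
endpoints = ["ps0:6174", "ps1:6174"]
assignment = assign_params(checkpoint.keys(), endpoints)
ps0_params = load_shard(checkpoint, assignment["ps0:6174"])
```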

@seiriosPlus
Collaborator

All occurrences of ckpt in the OP names will be changed to checkpoint.
Noted on the pserver detail; I will take care of it during development.

@shanyi15
Collaborator

Hello, this issue has not been updated in the past month, so we will close it today. If you still need to follow up after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!

PaddlePaddle Distributed Refactoring (Due: 201802) automation moved this from DOING to DONE Aug 15, 2018
3 participants