Add save checkpoint on pserver. #10376

typhoonzero · 2018-05-03T09:05:48Z

No description provided.

seiriosPlus · 2018-05-09T02:56:57Z

The goal of the checkpoint work is to add checkpoint save/restore to the PServer, and to add restoring of variables and connections to the Trainer, in order to achieve fault tolerance.
The development plan is:

PServer save checkpoint
PServer restore checkpoint
Trainer restore variables and connections
Need to support async and sync

seiriosPlus · 2018-05-13T13:56:28Z

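Background:
During training, a PaddlePaddle job can fail because of occasional external problems such as hardware or network faults. A checkpoint mechanism is needed to save training results partway through, so that after a failure training can be restored from the checkpoint and continued.
Goals:
The work is split into three milestones:
M1: Save parameters on the Trainer 0 node. If any PServer/Trainer fails, restart all training nodes, restore the parameters from the checkpoint, and continue training.
M2: Support save checkpoint on the PServer side.
M3:

Store the pserver endpoints in etcd; trainers can dynamically watch the pserver endpoints and reconnect after they change.
Store the offset in etcd; after a trainer recovers, it reads its reader offset from etcd.

M1 design:
Trainer 0 saves all parameters through the checkpoint; each Trainer independently saves its own reader offset to the checkpoint files.
NFS is required to store/synchronize the checkpoint files; all trainers need read and write permission, and the pservers need read permission.
Add two OPs: ckpt_restore_op and ckpt_save_op. ckpt_restore_op is used in the startup_program stage to read parameters from the checkpoint file; ckpt_save_op is used after send_op to write parameters to the checkpoint file.
The designs for M2 and M3 will be added later.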
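
A minimal sketch of the M1 flow, in plain Python rather than as Fluid OPs, may help make the file layout concrete. Everything here is an assumption for illustration: the function names `save_checkpoint` and `restore_checkpoint`, the file names, and the JSON format are hypothetical and are not part of PaddlePaddle. The checkpoint directory is assumed to live on the shared NFS mount.

```python
import json
import os
import tempfile


def _atomic_write_json(path, payload):
    """Write to a temp file in the same directory, then rename it into place,
    so a reader never observes a half-written checkpoint file."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)


def save_checkpoint(ckpt_dir, trainer_id, params, reader_offset):
    """Hypothetical save step: only Trainer 0 writes the parameters,
    while every trainer writes its own reader offset."""
    os.makedirs(ckpt_dir, exist_ok=True)
    if trainer_id == 0:
        _atomic_write_json(os.path.join(ckpt_dir, "params.json"), params)
    offset_path = os.path.join(ckpt_dir, "offset_trainer_%d.json" % trainer_id)
    _atomic_write_json(offset_path, {"reader_offset": reader_offset})


def restore_checkpoint(ckpt_dir, trainer_id):
    """Hypothetical restore step: read the shared parameters and this
    trainer's own reader offset."""
    with open(os.path.join(ckpt_dir, "params.json")) as f:
        params = json.load(f)
    offset_path = os.path.join(ckpt_dir, "offset_trainer_%d.json" % trainer_id)
    with open(offset_path) as f:
        reader_offset = json.load(f)["reader_offset"]
    return params, reader_offset
```

In this sketch a trainer other than Trainer 0 can pass any `params` value to `save_checkpoint`; it is ignored, which mirrors the rule that only Trainer 0 persists parameters.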

typhoonzero · 2018-05-14T02:53:30Z

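Could the name of the ckpt_restore_op op avoid the abbreviation? It would be a bit longer, but easier to understand.

One more detail: when a pserver loads a checkpoint, it needs to know which part of the data it is responsible for loading, and pass that to the restore_op through an attr.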
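
One possible reading of this detail, sketched in plain Python: the attr carries the list of variable names assigned to this pserver, and the restore step loads only those entries. The function name `restore_pserver_slice` and the choice of a name list as the attr format are assumptions for illustration; the issue does not specify either.

```python
def restore_pserver_slice(checkpoint, assigned_var_names):
    """Return only the variables this pserver is responsible for.

    checkpoint: dict mapping variable name -> saved value (hypothetical format).
    assigned_var_names: names passed in via the op attribute (assumed format).
    """
    missing = [name for name in assigned_var_names if name not in checkpoint]
    if missing:
        # Fail loudly instead of starting with partially restored state.
        raise KeyError("checkpoint is missing variables: %s" % ", ".join(missing))
    return {name: checkpoint[name] for name in assigned_var_names}
```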

seiriosPlus · 2018-05-14T03:36:39Z

Every ckpt in the OP names will be changed to checkpoint.
Noted on the pserver detail; I will keep it in mind during development.

shanyi15 · 2018-08-15T10:27:09Z

Hello, this issue has not been updated in the past month. We will close it today for the sake of other users' experience. If you still need to follow up on this question after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for the inconvenience caused by the closure and thank you for your support of PaddlePaddle!

typhoonzero created this issue from a note in PaddlePaddle Distributed Refactoring (Due: 201802) (TODO) May 3, 2018

seiriosPlus self-assigned this May 4, 2018

seiriosPlus moved this from TODO to Fault Tolerant TODOs in PaddlePaddle Distributed Refactoring (Due: 201802) May 4, 2018

seiriosPlus moved this from Fault Tolerant TODOs to TODO in PaddlePaddle Distributed Refactoring (Due: 201802) May 4, 2018

typhoonzero moved this from TODO to Fault Tolerant TODOs in PaddlePaddle Distributed Refactoring (Due: 201802) May 4, 2018

seiriosPlus moved this from Fault Tolerant TODOs to TODO in PaddlePaddle Distributed Refactoring (Due: 201802) May 4, 2018

typhoonzero moved this from TODO to Fault Tolerant TODOs in PaddlePaddle Distributed Refactoring (Due: 201802) May 4, 2018

seiriosPlus moved this from Fault Tolerant TODOs to DOING in PaddlePaddle Distributed Refactoring (Due: 201802) May 9, 2018

seiriosPlus mentioned this issue May 9, 2018

add checkpoint util class and implement #10532

Merged

seiriosPlus mentioned this issue May 23, 2018

Incremental Learning Support for Fluid with Distribution #10870

Closed

shanyi15 closed this as completed Aug 15, 2018

PaddlePaddle Distributed Refactoring (Due: 201802) automation moved this from DOING to DONE Aug 15, 2018
