Saving models/parameters in fault tolerant training #2638

Closed · typhoonzero opened this issue Jun 28, 2017 · 10 comments

@typhoonzero
Contributor

Related PR: #2634

In a discussion with @helinwang this morning, the previous thought was to save parameters to a distributed storage service by merging the parameters from all pservers.

In general there are two ways:

  • save parameter snapshots on each pserver, and then merge them together
    • recommended method: use the API call Save to trigger snapshot saving; each pserver saves its parameters to the distributed filesystem, which also saves the pserver status for recovery. Users can use a "model merge tool" to merge all the parts of the model and then use it.
  • save the merged parameters (the model) on one trainer
    • trainers fetch the whole model every iteration, so saving the model from a trainer does not need a "merge" step. The model is saved every pass.
    • how to select exactly one trainer to save the model (see the sketch at the end of this comment):
      • use an etcd distributed lock or transaction
      • use something like hash(trainer_ip) % trainer_count == 0

Notice: when users want to stop the training and use the current output model, they can stop the job right away, because the job saves the model of every pass to the distributed storage service.
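
A minimal sketch of the hash-based selection rule, assuming each trainer knows its own IP and the total trainer count; the helper name and the use of md5 are illustrative, not part of the actual design:

```python
# Pick the saving trainer via hash(trainer_ip) % trainer_count == 0.
# md5 is used only to get a hash that is stable across processes and
# machines (Python's built-in hash() is randomized per process).
import hashlib

def is_saving_trainer(trainer_ip: str, trainer_count: int) -> bool:
    digest = hashlib.md5(trainer_ip.encode()).hexdigest()
    return int(digest, 16) % trainer_count == 0

# Example: only trainers whose IP hashes to 0 modulo trainer_count save the model.
print(is_saving_trainer("10.0.0.3", trainer_count=4))
```

Note that this rule only picks "exactly one" trainer probabilistically: depending on the IPs, zero or several trainers may satisfy it, which is one reason the etcd lock option is also listed.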

@helinwang
Contributor

helinwang commented Jun 28, 2017

Thanks @typhoonzero for leading the discussion!

Here is my summary of possible solutions:

There are two ways the model can be saved:

  1. In the training program, the Python code specifies where and when to save the model.
  2. The user can cancel the training job at any time, but is still able to get the most recent model.

For 2, we need to provide the user a way to convert the snapshots saved in different places by different pservers into one single model file (maybe not immediately required).
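
As a rough illustration of what such a convert tool could do, here is a minimal sketch that assumes each pserver dumps its shard as a pickled dict of parameter name to numpy array under a shared directory, with parameters row-sharded in pserver order; the file layout and names are hypothetical, not PaddlePaddle's actual snapshot format:

```python
# Hypothetical merge tool: combine per-pserver snapshot shards into one model.
# Assumes each pserver wrote {param_name: ndarray} to snapshot_dir/pserver_<i>.pkl.
import glob
import pickle
import numpy as np

def merge_snapshots(snapshot_dir: str) -> dict:
    shards = []
    for path in sorted(glob.glob(f"{snapshot_dir}/pserver_*.pkl")):
        with open(path, "rb") as f:
            shards.append(pickle.load(f))
    model = {}
    for name in shards[0]:
        # concatenate each parameter's row shards back into the full tensor
        model[name] = np.concatenate([s[name] for s in shards], axis=0)
    return model
```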

For 1, there are two ways:

a. The trainer saves the model.

  • Pro:
    • Can reuse the current model-saving logic from local training, except for sparse training.
  • Con:
    • When doing sparse training, the trainer does not have the entire model. It needs to download the entire model from the pservers and save it to disk.

b. The pservers save the model (can share code with the snapshot process, but needs a pserver that converts the saved snapshots into a single model).

  • Pro:
    • Same saving logic for dense and sparse training.
    • Same saving logic as "2" (convert the snapshots to a model after the job is cancelled).
  • Con:
    • Does not work without a distributed filesystem, because pservers on different nodes need to save the model to places visible to the convert program, which converts the snapshots into a single model.

After some more thinking, I am currently inclined to let the trainer save the model, because it works even when a distributed filesystem is not present.

Would love to know what you guys think!
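
Here is a minimal sketch of option (a); `pull_full_model` is a hypothetical callable standing in for whatever RPC the trainer would use to fetch the parameters it is missing under sparse training:

```python
# Sketch of option (a): the saving trainer writes the model itself.
import pickle

def save_model_from_trainer(local_model, pull_full_model=None, path="model.pkl"):
    # Dense training: the trainer already holds the whole model.
    # Sparse training: fetch the full parameter set from the pservers first.
    model = local_model if pull_full_model is None else pull_full_model()
    with open(path, "wb") as f:
        pickle.dump(model, f)  # one file per pass, no merge step needed
```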

@Yancey1989
Contributor

I am also inclined to have the trainer save the model, for the following reasons:

  • If users want to use the model, they don't need to use a tool to merge the parameters coming from all the pservers.
  • If users want to recover the training, they can choose the parameters from a specified pass number.

From @typhoonzero:

recommended method: use the API call Save to trigger snapshot saving; each pserver saves its parameters to the distributed filesystem, which also saves the pserver status for recovery. Users can use a "model merge tool" to merge all the parts of the model and then use it.

I think the trainer also needs an API to do the snapshot; it will save the model and call Save to trigger the pserver snapshot.
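
A rough sketch of what such a combined trainer-side call could look like; `pservers`, `ps.save`, and `save_model_fn` are hypothetical stand-ins, not the real pserver client interface:

```python
# Hypothetical trainer-side checkpoint: save the trainer's copy of the model
# and also trigger each pserver's snapshot so pserver state can be recovered.
def checkpoint(model, pservers, pass_id, save_model_fn):
    # Trainer-side model file, usable for inference or resuming training.
    save_model_fn(model, f"model_pass_{pass_id}")
    # Ask every pserver to snapshot its own shard (the Save call above).
    for ps in pservers:
        ps.save(f"pserver_snapshot_pass_{pass_id}")
```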

@dzhwinter
Contributor

@helinwang
Of these two saving logics, I prefer 2, which is more user friendly; furthermore, we can add a lambda function to determine the "most recent" model, e.g. the one with the minimum training loss on an evaluation dataset.

2. The user can cancel the training job at any time, but is still able to get the most recent model.

Considering decoupling the pservers from the etcd cluster as you guys mentioned, saving the model on a leader trainer sounds like a good idea.
Regardless of how the leader trainer role is voted, suppose we already have a leader trainer here.

There are some problems with this approach:
1. In a sparse-update job, when the leader trainer invokes saving, it needs to wait until the whole model has been downloaded, and the training world has to wait until that has finished; otherwise, some parts of the model will be polluted.
2. As we all know, reading and writing files on a distributed filesystem is very time consuming. Having one trainer write such a large model to the filesystem forces us to use an asynchronous write method, which requires a replica of the whole model locally on the trainer machine. This means every trainer machine doubles its memory footprint compared to before (we don't know which one will be the leader), which may raise the bar for deploying our system.

@helinwang
Contributor

@dzhwinter Thanks for the feedback! Very valuable.

Regardless of how the leader trainer role is voted, suppose we already have a leader trainer here.

Yes, when using etcd we can elect a leader trainer. When training on MPI without etcd, the trainers could have trainer IDs, and the trainer with ID 0 could be the leader.

the training world has to wait until that has finished; otherwise, some parts of the model will be polluted.

In my opinion, pollution is fine; as usual in deep learning, the training process is stochastic. Also, when doing ASGD training, while one trainer is downloading the model, we allow the model to be modified by other trainers uploading gradients during the download. So the model training process already involves "pollution". Maybe we can allow it, unless it is proven to cause more harm than the benefit (system simplicity) is worth.

This means every trainer machine doubles its memory footprint compared to before (we don't know which one will be the leader)

I think we can know who the leader is. When using etcd, we can elect a leader through etcd. When using MPI, trainer 0 can be the leader.
Furthermore, when doing ASGD, the elected trainer can save the model without blocking the other trainers (assuming we allow "pollution"). In SGD, the saving trainer could block the other trainers if the pservers don't have timeout logic to continue to the next batch when some trainer is not uploading its gradient on time.
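
As a sketch of the etcd-based option, here is what electing the saving trainer with a distributed lock could look like, using the third-party python-etcd3 client (an assumption, not what PaddlePaddle actually uses); the endpoint, lock name, and TTL are illustrative:

```python
# Sketch: only the trainer that acquires the etcd lock writes the model.
import etcd3

def try_save_as_leader(save_fn, host="127.0.0.1", port=2379):
    client = etcd3.client(host=host, port=port)
    lock = client.lock("save_model", ttl=60)  # lease expires if the holder dies
    if lock.acquire(timeout=5):               # only one trainer gets the lock
        try:
            save_fn()                         # the lock holder writes the model
        finally:
            lock.release()
        return True
    return False                              # another trainer is saving this pass
```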

@helinwang
Contributor

@typhoonzero @Yancey1989 @dzhwinter Thanks for the comments!

Seems we all think letting the trainer save the model would be a good idea (please let me know otherwise). I have created a PR summarizing our discussion. Please review: #2655

@dzhwinter
Contributor

@helinwang

Regardless of how the leader trainer role is voted, suppose we already have a leader trainer here.

What I meant is that both of the approaches wuyi listed are OK; below I won't discuss how to select trainer 0, and will focus on other aspects.

use an etcd distributed lock or transaction
use something like hash(trainer_ip) % trainer_count == 0

Assuming model pollution is allowed, the pollution time window is:
trainer 0 downloading the full model + a single machine writing the model to the filesystem.
The language models commonly used with Paddle are on the order of a few hundred MB; advertising models are a bit larger, so estimate on the order of a GB. HDFS write speed is roughly 20 MB/sec (benchmark).
A rough estimate of the time window is 1 minute, which is long enough for many minibatch updates. Model pollution should have little impact on workloads outside the company, where accuracy requirements are not high. For internal workloads, most gains are at the level of 0.1% AUC, so the pollution could well wipe out the gain, leaving only the last pass's unpolluted model usable.
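
The back-of-envelope numbers above, reproduced as a quick calculation (the ~1 GB model size and ~20 MB/s HDFS write speed are the comment's own assumptions):

```python
model_size_mb = 1024            # ~1 GB model (ads-model scale, per the comment)
hdfs_write_mb_per_s = 20        # rough single-writer HDFS throughput
window_s = model_size_mb / hdfs_write_mb_per_s
print(f"write time ≈ {window_s:.0f} s, so the window is roughly a minute "
      "once the full-model download time is added")
```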

@dzhwinter
Contributor

If we want to guarantee that the high accuracy is not affected at all, we need to double the memory (because we don't know which machine will be chosen as trainer 0; even on an MPI cluster that is decided randomly by the MPI role function), hence the conclusion in my comment above.

@helinwang
Contributor

@dzhwinter Got it! The assumption here is that a model in which some parameters are old and some are new will be worse. I don't think that necessarily holds: the model is updated by the gradient computed on one mini-batch multiplied by a very small step size, and the updated model is only better with respect to that one mini-batch. With so much randomness, we can't say whether the effect on the test set is positive or negative.

@dzhwinter
Contributor

Yeah, it's hard to say whether this effect on the gains is positive or negative. Just wanted to flag this point; that's all I can think of for now.

@helinwang
Contributor

@dzhwinter Understood. Since the first version does not support sparse updates, how about we put this in the TODO list for now and weigh the trade-offs when we get there?
