Design doc: save model in cluster training. #2655

Merged 3 commits into PaddlePaddle:develop on Jun 30, 2017

Conversation

helinwang (Contributor) commented on Jun 28, 2017

Fixes: #2638 #2658


There are two types of model: dense (e.g., weight for a
fully-connected layer) and sparse model (e.g., word
embedding). Pservers always jointly have the entire model at any given
jacquesqiao (Member):

Word embedding is not a sparse model. When the input training data is sparse and the user configures the parameter to be sparse, the trainer will detect which part of the parameter should be updated in this batch.

helinwang (Contributor, Author):

@jacquesqiao Thanks for pointing that out! Does it mean it's a sparse model only if the user configures the parameter to be sparse, and the input for the calculation involving the sparse parameter must be sparse?

dzhwinter (Contributor), Jun 29, 2017:

I think a language model is a sparse model, and word embedding is definitely one kind of sparse model. To be clear, maybe "sparse update" is more accurate than "sparse model".
One training instance does not need the whole parameter, which is what makes it a sparse update.


Thanks! Will change to "sparse update"
-- Helin

Member:

I consulted @lcy-seso, and she tells me that a sparse model is a kind of training method that makes the model itself sparse, which is different from sparse update. Language models and word embeddings have no relation to sparse models. And since we will mostly not use sparse models, we can just use "sparse update" going forward.

helinwang (Contributor, Author):

Thanks! Will change to "sparse update".

helinwang (Contributor, Author):

Done.
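
For context on what "configuring the parameter to be sparse" means in the discussion above, here is a minimal sketch in the v2-style Python API. It assumes the `sparse_update` flag on `paddle.attr.Param`; the names are illustrative and not part of this PR.

```python
# A minimal sketch, assuming the v2-style API and the `sparse_update` flag on
# paddle.attr.Param; names are illustrative and not taken from this PR.
import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

# Integer word IDs are a sparse input: each instance touches only a few rows
# of the embedding table, so only those rows need to be updated per batch.
word_ids = paddle.layer.data(
    name='word_ids',
    type=paddle.data_type.integer_value_sequence(10000))

embedding = paddle.layer.embedding(
    input=word_ids,
    size=128,
    param_attr=paddle.attr.Param(
        name='word_embedding',
        sparse_update=True))  # the "sparse update" case discussed above
```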

The model is the output of the training process. There are two
ways in which the user can obtain a model:

- Save model triggered by user code: user code asks PaddlePaddle to
typhoonzero (Contributor):

Since we are saving the model in the trainer, there's no "asks PaddlePaddle" to do something, which sounds like a remote API call. Maybe change it to "user code can save the model by itself when a batch or a pass finishes."

helinwang (Contributor, Author):

Thanks! Will change.

helinwang (Contributor, Author):

@typhoonzero Actually it depends on whether we implement a method for saving the model, or let the user save the model from the parameters by themselves. Can you take a look at #2655 (comment)?
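
To make "user code can save the model by itself when a batch or a pass finishes" concrete, here is a hedged sketch using the v2-style trainer API and the `parameter.to_tar()` call mentioned later in this thread; the small MNIST network and reader are placeholders, not part of this design.

```python
# A minimal sketch, assuming the v2-style API (paddle.trainer.SGD,
# parameters.to_tar()); the MNIST network below is only a placeholder.
import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(784))
y = paddle.layer.data(name='y', type=paddle.data_type.integer_value(10))
predict = paddle.layer.fc(input=x, size=10, act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(input=predict, label=y)

parameters = paddle.parameters.create(cost)
optimizer = paddle.optimizer.Momentum(momentum=0.9, learning_rate=0.01)


def event_handler(event):
    # Trainer-side saving: write the parameters at the end of every pass.
    if isinstance(event, paddle.event.EndPass):
        with open('params_pass_%d.tar' % event.pass_id, 'w') as f:
            parameters.to_tar(f)


trainer = paddle.trainer.SGD(
    cost=cost, parameters=parameters, update_equation=optimizer)
trainer.train(
    reader=paddle.batch(paddle.dataset.mnist.train(), batch_size=128),
    num_passes=5,
    event_handler=event_handler)
```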


- Convert model from the snapshot: model being converted from
pservers' periodic snapshot. In this way, the user can cancel a job
at any time, and still have a relatively fresh model (we snapshot
around every 5 minutes).
Contributor:

We could emphasize that "snapshot" here means a model snapshot; otherwise someone may confuse it with the checkpoint if they haven't taken a look at the checkpoint design.

helinwang (Contributor, Author):

Thanks! By "snapshot", I meant "checkpoint". Will change it to "checkpoint".

helinwang (Contributor, Author):

Done.

dense model, but only have a fraction of the sparse model at any given
time.

#### Pservers Saving Model
typhoonzero (Contributor):

After a short discussion with @dzhwinter, we think saving a snapshot on the pserver side is needed for recovering the state of the pservers. A pserver snapshot should contain not only the parameters but also some state such as optimizer internals.

Saving a snapshot on the pserver can be triggered by a Save() RPC call from the trainers to the pservers. Trainers can save models with parameter.to_tar() in event handlers.

The pserver-side "snapshot" will only be used for pserver recovery, while the trainer-side saved model can be used for inference.

helinwang (Contributor, Author):

@typhoonzero @dzhwinter

> saving a snapshot on the pserver side is needed for recovering the state of the pservers. A pserver snapshot should contain not only the parameters but also some state such as optimizer internals.

Agree!

> Saving a snapshot on the pserver can be triggered by a Save() RPC call from the trainers to the pservers.

I think we can add that if we find it necessary later. But for the first version I am inclined to just let the pservers save periodically.

> Trainers can save models with parameter.to_tar() in event handlers.

I think the trainer needs a Python function save_model(save_dir). In that function it will first ask the trainer client whether it is elected to save the model, and save the model only if elected. Otherwise every trainer will try to save the model, putting too much burden on the distributed FS.

> The pserver-side "snapshot" will only be used for pserver recovery, while the trainer-side saved model can be used for inference.

Agree!

Contributor:

Same ideas as ours. 👍
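
A hedged sketch of the save_model(save_dir) idea proposed above, where only the elected trainer writes to the distributed FS, and it writes under its own trainerID. The names client, is_elected_to_save(), and trainer_id are hypothetical, used only to illustrate the election being discussed; no such API exists in this PR.

```python
# Hypothetical sketch of the proposed trainer-side save_model(save_dir).
# `client.is_elected_to_save()` and `client.trainer_id` are made-up names that
# stand in for the election mechanism described above.
import os


def save_model(client, parameters, save_dir, pass_id):
    """Save the model only if this trainer was elected, so that not every
    trainer writes to the distributed FS at once."""
    if not client.is_elected_to_save():
        return  # another trainer was elected; skip saving

    # Each trainer writes under its own trainerID, so paths never collide.
    out_dir = os.path.join(save_dir, client.trainer_id)
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    with open(os.path.join(out_dir, 'pass-%d.tar' % pass_id), 'w') as f:
        parameters.to_tar(f)
```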

at any time, and still have a relatively fresh model (we snapshot
around every 5 minutes).

### Save Model Triggered by User Code
Contributor:

It seems that this section describes the reason why we chose trainer-side model saving, so how about changing the title from "Save Model Triggered by User Code" to "Trainer Saving Model vs. Pservers Saving Model"?

helinwang (Contributor, Author):

Good idea! Will do.

helinwang (Contributor, Author):

Done.


Each trainer will be given the directory to save the model. The
elected trainer will save the model to
`given-directory/trainerID`. Since the trainerID is unique, this would
Contributor:

If a split-brain happens, maybe two trainers will save the model to given-directory/00001/pass-0-* and given-directory/00002/pass-0-*; which one will we choose to recover from?
How about adding a file lock under the path given-directory/, so that a trainer saves the model only if it can get the lock?

helinwang (Contributor, Author):

Thanks! Great question.
I think the saved model is not for recovery (the checkpoint is for recovery); it's for the user to use for inference, or as an initial model to train from.
Back to the question: if users want to initialize training from a saved model, they will specify the path themselves. We don't need to decide for them.

Contributor:

OK, I got it.
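
As a usage note on the point above: since the saved model is for inference (or as an initial model), the user points at a concrete path, e.g. one trainer's output, explicitly. A short hedged sketch, assuming the v2-style Parameters.from_tar() loader; the path follows the given-directory/trainerID layout discussed above and is illustrative only.

```python
# A minimal sketch, assuming paddle.parameters.Parameters.from_tar(); the path
# below is illustrative, following the given-directory/trainerID layout above.
import paddle.v2 as paddle

with open('given-directory/00001/pass-0.tar') as f:
    parameters = paddle.parameters.Parameters.from_tar(f)
```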

wangkuiyi (Collaborator):

I like a design doc like this, which peels the reasoning apart layer by layer like an onion and explains everything clearly!

helinwang merged commit 45a78a4 into PaddlePaddle:develop on Jun 30, 2017
helinwang deleted the save_model_doc branch on June 30, 2017 20:08