Design doc: save model in cluster training. #2655

Merged 3 commits into PaddlePaddle:develop on Jun 30, 2017

Conversation

helinwang (Contributor) commented on Jun 28, 2017

Fixes: #2638 #2658


There are two types of model: dense (e.g., weight for a
fully-connected layer) and sparse model (e.g., word
embedding). Pservers always jointly have the entire model at any given
jacquesqiao (Member):

Word embedding is not a sparse model. When the input training data is sparse and the user configures the parameter to be sparse, the trainer will detect which part of the parameter should be updated in this batch.

helinwang (Contributor, Author):

@jacquesqiao Thanks for pointing that out! Does it mean it's a sparse model only if the user configures the parameter to be sparse, and the input for the calculation involving the sparse parameter must be sparse?

dzhwinter (Contributor), Jun 29, 2017:

I think a language model is a sparse model, and word embedding is definitely one kind of sparse model. To be clear, maybe "sparse update" is more accurate than "sparse model".
One training instance does not need the whole parameter, which is what makes it a sparse update.


Thanks! Will change to "sparse update"
-- Helin

Member:

I consulted @lcy-seso, and she tells me that a sparse model is a kind of training method that makes the model itself sparse, which is different from sparse update. Language models and word embeddings have no relation to sparse models. And since we will mostly not use sparse models, we can just use "sparse update" going forward.

helinwang (Contributor, Author):

Thanks! Will change to "sparse update".

helinwang (Contributor, Author):

Done.
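
For context on what "configuring the parameter to be sparse" means in the discussion above, here is a minimal sketch in the v2-style Python API. It assumes the `sparse_update` flag on `paddle.attr.Param`; the names are illustrative and not part of this PR.

```python
# A minimal sketch, assuming the v2-style API and the `sparse_update` flag on
# paddle.attr.Param; names are illustrative and not taken from this PR.
import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

# Integer word IDs are a sparse input: each instance touches only a few rows
# of the embedding table, so only those rows need to be updated per batch.
word_ids = paddle.layer.data(
    name='word_ids',
    type=paddle.data_type.integer_value_sequence(10000))

embedding = paddle.layer.embedding(
    input=word_ids,
    size=128,
    param_attr=paddle.attr.Param(
        name='word_embedding',
        sparse_update=True))  # the "sparse update" case discussed above
```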

The model is the output of the training process. There are two
ways in which the user can obtain a model:

- Save model triggered by user code: user code asks PaddlePaddle to
typhoonzero (Contributor):

Since we are saving the model in the trainer, there's no "asks PaddlePaddle" to do something, which sounds like a remote API call. Maybe change it to "user code can save the model by itself when a batch or a pass finishes."

helinwang (Contributor, Author):

Thanks! Will change.

helinwang (Contributor, Author):

@typhoonzero Actually it depends on whether we implement a method for saving the model, or let the user save the model from the parameters by themselves. Can you take a look at #2655 (comment)?
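
To make "user code can save the model by itself when a batch or a pass finishes" concrete, here is a hedged sketch using the v2-style trainer API and the `parameter.to_tar()` call mentioned later in this thread; the small MNIST network and reader are placeholders, not part of this design.

```python
# A minimal sketch, assuming the v2-style API (paddle.trainer.SGD,
# parameters.to_tar()); the MNIST network below is only a placeholder.
import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(784))
y = paddle.layer.data(name='y', type=paddle.data_type.integer_value(10))
predict = paddle.layer.fc(input=x, size=10, act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(input=predict, label=y)

parameters = paddle.parameters.create(cost)
optimizer = paddle.optimizer.Momentum(momentum=0.9, learning_rate=0.01)


def event_handler(event):
    # Trainer-side saving: write the parameters at the end of every pass.
    if isinstance(event, paddle.event.EndPass):
        with open('params_pass_%d.tar' % event.pass_id, 'w') as f:
            parameters.to_tar(f)


trainer = paddle.trainer.SGD(
    cost=cost, parameters=parameters, update_equation=optimizer)
trainer.train(
    reader=paddle.batch(paddle.dataset.mnist.train(), batch_size=128),
    num_passes=5,
    event_handler=event_handler)
```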


- Convert model from the snapshot: model being converted from
pservers' periodic snapshot. In this way, the user can cancel a job
at any time, and still have a relatively fresh model (we snapshot
around every 5 minutes).
Contributor:

We could emphasize that "snapshot" here means a model snapshot; otherwise someone may confuse it with the checkpoint if they haven't taken a look at the checkpoint design.

helinwang (Contributor, Author):

Thanks! By "snapshot", I meant "checkpoint". Will change it to "checkpoint".

helinwang (Contributor, Author):

Done.

dense model, but only have a fraction of the sparse model at any given
time.

#### Pservers Saving Model
typhoonzero (Contributor):

After a short discussion with @dzhwinter, we think saving a snapshot on the pserver side is needed for recovering the state of the pservers. A pserver snapshot should contain not only the parameters but also some state such as optimizer internals.

Saving a snapshot on the pserver can be triggered by a Save() RPC call from the trainers to the pservers. Trainers can save models with parameter.to_tar() in event handlers.

The pserver-side "snapshot" will only be used for pserver recovery, while the trainer-side saved model can be used for inference.

helinwang (Contributor, Author):

@typhoonzero @dzhwinter

> saving a snapshot on the pserver side is needed for recovering the state of the pservers. A pserver snapshot should contain not only the parameters but also some state such as optimizer internals.

Agree!

> Saving a snapshot on the pserver can be triggered by a Save() RPC call from the trainers to the pservers.

I think we can add that if we find it necessary later. But for the first version I am inclined to just let the pservers save periodically.

> Trainers can save models with parameter.to_tar() in event handlers.

I think the trainer needs a Python function save_model(save_dir). In that function it will first ask the trainer client whether it is elected to save the model, and save the model only if elected. Otherwise every trainer will try to save the model, putting too much burden on the distributed FS.

> The pserver-side "snapshot" will only be used for pserver recovery, while the trainer-side saved model can be used for inference.

Agree!

Contributor:

Same ideas as ours. 👍
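
A hedged sketch of the save_model(save_dir) idea proposed above, where only the elected trainer writes to the distributed FS, and it writes under its own trainerID. The names client, is_elected_to_save(), and trainer_id are hypothetical, used only to illustrate the election being discussed; no such API exists in this PR.

```python
# Hypothetical sketch of the proposed trainer-side save_model(save_dir).
# `client.is_elected_to_save()` and `client.trainer_id` are made-up names that
# stand in for the election mechanism described above.
import os


def save_model(client, parameters, save_dir, pass_id):
    """Save the model only if this trainer was elected, so that not every
    trainer writes to the distributed FS at once."""
    if not client.is_elected_to_save():
        return  # another trainer was elected; skip saving

    # Each trainer writes under its own trainerID, so paths never collide.
    out_dir = os.path.join(save_dir, client.trainer_id)
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    with open(os.path.join(out_dir, 'pass-%d.tar' % pass_id), 'w') as f:
        parameters.to_tar(f)
```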

at any time, and still have a relatively fresh model (we snapshot
around every 5 minutes).

### Save Model Triggered by User Code
Contributor:

It seems that this section describes the reason why we chose trainer-side model saving, so how about changing the title from "Save Model Triggered by User Code" to "Trainer Saving Model vs. Pservers Saving Model"?

helinwang (Contributor, Author):

Good idea! Will do.

helinwang (Contributor, Author):

Done.


Each trainer will be given the directory to save the model. The
elected trainer will save the model to
`given-directory/trainerID`. Since the trainerID is unique, this would
Contributor:

If a split-brain happens, maybe two trainers will save the model to given-directory/00001/pass-0-* and given-directory/00002/pass-0-*; which one will we choose to recover from?
How about adding a file lock under the path given-directory/, so that a trainer saves the model only if it can get the lock?

helinwang (Contributor, Author):

Thanks! Great question.
I think the saved model is not for recovery (the checkpoint is for recovery); it's for the user to use for inference, or as an initial model to train from.
Back to the question: if users want to initialize training from a saved model, they will specify the path themselves. We don't need to decide for them.

Contributor:

OK, I got it.
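
As a usage note on the point above: since the saved model is for inference (or as an initial model), the user points at a concrete path, e.g. one trainer's output, explicitly. A short hedged sketch, assuming the v2-style Parameters.from_tar() loader; the path follows the given-directory/trainerID layout discussed above and is illustrative only.

```python
# A minimal sketch, assuming paddle.parameters.Parameters.from_tar(); the path
# below is illustrative, following the given-directory/trainerID layout above.
import paddle.v2 as paddle

with open('given-directory/00001/pass-0.tar') as f:
    parameters = paddle.parameters.Parameters.from_tar(f)
```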

wangkuiyi (Collaborator):

I like a design doc like this, which peels the reasoning apart layer by layer like an onion and explains everything clearly!

helinwang merged commit 45a78a4 into PaddlePaddle:develop on Jun 30, 2017
helinwang deleted the save_model_doc branch on June 30, 2017 20:08