Do we need an update API in the new pserver cclient? #2347

Closed
jacquesqiao opened this issue Jun 1, 2017 · 11 comments

@jacquesqiao
Member

In this design (https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/pserver_client.md) we don't have an update() interface for updating parameters; I guess we want to do this immediately after calling paddle_send_grads.

My question is: do we need to add an update function to the pserver cclient for updating the parameters?

@dzhwinter
Contributor

dzhwinter commented Jun 1, 2017

We do not need an update() interface in v1, since the parameter optimization job is done on the pserver side.
I thought the trainer needs to provide an update() interface to support a future trainer-side optimizer (not implemented for now).

@jacquesqiao
Member Author

Great, I agree with you that for now we do not need to optimize locally. What I mean here is: do we need an update interface in the cclient to tell the pserver to do the update/optimize? In the current design, the update/optimize is done by the pserver implicitly.
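To make the question concrete, here is a rough sketch of the two alternatives in Go. This is not the actual cclient API; every identifier below is a made-up stand-in for illustration only:

```go
package pserverclient

// Hypothetical types, for illustration only.
type Gradient struct {
	Name    string
	Content []byte
}

type Parameter struct {
	Name    string
	Content []byte
}

// Option A (current design): the pserver optimizes implicitly after it
// receives the gradients; the trainer never asks for an update.
type ImplicitClient interface {
	SendGrads(grads []Gradient) error              // pserver updates on its own
	GetParams(names []string) ([]Parameter, error) // may block until updated
}

// Option B (the question): the trainer explicitly tells the pserver when to
// run the update/optimize step.
type ExplicitClient interface {
	SendGrads(grads []Gradient) error
	Update() error // the proposed update() call
	GetParams(names []string) ([]Parameter, error)
}
```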

@helinwang
Contributor

helinwang commented Jun 1, 2017

For ASGD, the pserver will update the parameters immediately once a trainer sends it a gradient. When a trainer calls get parameter, it will always get the latest model.
For SGD, the pserver will wait for all trainers to report their gradients, and only then update the parameters. When a trainer calls get parameter, the call will block until the update is finished. For fault tolerance, the pserver needs to control when to update the parameters: it keeps a timer, and if some trainer does not send its gradient in time, it performs the update anyway. A trainer does not have this information, so it should not be the one controlling when to perform the update.

From the behavior above, it feels like the trainer's role is just to provide gradients to the pserver, and the pserver decides when and how to update the parameters?
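To make that behavior concrete, here is a rough Go sketch of a pserver that follows the rules above (immediate updates for ASGD; for SGD, wait for all trainers or a timeout, and block get-parameter until the update is applied). The names, the timeout value, and the update rule are illustrative assumptions, not the real pserver code:

```go
package pserver

import (
	"sync"
	"time"
)

// Server is an illustrative sketch, not the real PaddlePaddle pserver.
type Server struct {
	mu          sync.Mutex
	cond        *sync.Cond
	numTrainers int
	pending     [][]float32 // gradients received for the current mini-batch
	params      []float32
	updated     bool // whether the current mini-batch update has been applied
	async       bool // true = ASGD, false = SGD
}

func NewServer(numTrainers int, params []float32, async bool) *Server {
	s := &Server{numTrainers: numTrainers, params: params, async: async, updated: true}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// SendGrad is called when a trainer reports a gradient.
func (s *Server) SendGrad(grad []float32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.async {
		s.apply([][]float32{grad}) // ASGD: update immediately
		return
	}
	if s.updated {
		// First gradient of a new mini-batch: arm the fault-tolerance timer.
		s.updated = false
		time.AfterFunc(10*time.Second, s.forceUpdate)
	}
	s.pending = append(s.pending, grad)
	if len(s.pending) == s.numTrainers {
		s.apply(s.pending) // SGD: update once every trainer has reported
	}
}

// forceUpdate fires when some trainer did not send its gradient in time.
func (s *Server) forceUpdate() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.updated {
		s.apply(s.pending)
	}
}

// GetParams blocks (in SGD mode) until the pending update has been applied.
func (s *Server) GetParams() []float32 {
	s.mu.Lock()
	defer s.mu.Unlock()
	for !s.updated {
		s.cond.Wait()
	}
	out := make([]float32, len(s.params))
	copy(out, s.params)
	return out
}

// apply folds the given gradients into the parameters; caller must hold mu.
func (s *Server) apply(grads [][]float32) {
	const lr = 0.01
	for _, g := range grads {
		for i := range s.params {
			s.params[i] -= lr * g[i] / float32(len(grads))
		}
	}
	s.pending = nil
	s.updated = true
	s.cond.Broadcast()
}
```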

@dzhwinter
Contributor

dzhwinter commented Jun 1, 2017

  • For the pserver-side optimization, I totally agree with you.
    There are a few details to figure out when it comes to SGD:
    1. How will a trainer report that its training epoch is over?
    For example, if there are 8 parts of data and 3 trainer nodes, one node will obviously be short one batch of data. In SGD, will this machine send empty parameters, just deregister itself from the training node map, or use some other method?
    2. When do we determine that the whole training process is finished?
    When all the machines reach the same epoch count?
    3. What about the lagged-node problem? Do we kick a node out during the training process?
    ....
    I think these things need to be discussed thoroughly. I expect the synchronization part will be a fair amount of code; should we put it into the pserver?

  • Another issue: will we support trainer-side optimization? As far as I know, we would only implement this feature in the future for performance reasons?
    https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/pserver_client.md#model-optimization-using-gradients

@helinwang
Contributor

helinwang commented Jun 1, 2017

@dzhwinter here is what I have in mind:

For example, if there are 8 parts of data and 3 trainer nodes, one node will obviously be short one batch of data. In SGD, will this machine send empty parameters, just deregister itself from the training node map, or use some other method?

In this case the trainer lacking data will just time out; the pserver will move on after the timeout threshold has been reached.

When do we determine that the whole training process is finished?

It's not the responsibility of the trainer or the pserver. The master server will know when a pass or the whole training process is finished.

What about the lagged-node problem? Do we kick a node out during the training process?

Same as the first one: there will be a timeout threshold.

Another issue: will we support trainer-side optimization? As far as I know, we would only implement this feature in the future for performance reasons?

I think we already agreed that our plan for the first version is to only implement trainer-side optimization: trainers will send parameter diffs to the pservers, and the pservers will only do simple averaging.
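A minimal sketch of that averaging step, assuming the plan above (trainers send parameter diffs, the pserver averages them); the names are illustrative, not the actual API:

```go
package pserver

// applyDiffs folds the parameter diffs reported by the trainers into the
// parameters held by the pserver using simple averaging. Illustrative only.
func applyDiffs(params []float32, diffs [][]float32) {
	if len(diffs) == 0 {
		return
	}
	n := float32(len(diffs))
	for _, d := range diffs {
		for i := range params {
			params[i] += d[i] / n
		}
	}
}
```

For example, with two trainers reporting diffs of +2 and 0 for the same parameter, the stored value moves by +1.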

@dzhwinter
Contributor

dzhwinter commented Jun 2, 2017

The timeout strategy seems great; it simplifies the coordination problem!
The timeout threshold will be set to retry times * send-data interval. Can we afford this time-delay overhead? If there is one lagged node (not a dead one), then the delay will be the full timeout threshold on every send_grads/update. Can we afford that price?
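To make the cost concrete with illustrative numbers (not figures from this thread): if the retry count is 3 and the send-data interval is 5 seconds, the threshold is 3 * 5 = 15 seconds, so a single lagged (but alive) trainer could add up to 15 seconds of waiting to every synchronous send_grads/update round.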

@typhoonzero
Contributor

typhoonzero commented Jun 2, 2017

Agree with @helinwang on the update() interface.

In this case the trainer lacking data will just time out; the pserver will move on after the timeout threshold has been reached.

The pserver's timeout is different from the "master timeout". A master timeout will mark the task as "failed". Consider that the trainer lacking batches is training a "small" task; the process may look like this (a rough trainer-side sketch follows the list):

  1. The trainer with the "small" task finishes its task earlier than the others.
  2. The master will mark this task as "Done" and dispatch a new task to this trainer.
  3. The trainer will then call "paddle_get_params" to fetch parameters to start a new training batch; because the pserver may still be waiting for all trainers to send gradients, the call may block until the parameters are updated on the pserver side.
  4. The cluster goes on training, with one trainer training a new task and the other trainers training old tasks.
  5. Go back to 1.
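Here is that process as a rough trainer-side loop in Go. The client methods and types (GetTask, MarkDone, GetParams, SendGrads, Task) are hypothetical stand-ins for the real master client and paddle_get_params / paddle_send_grads calls; the key point is that the get-parameters call may block while the pserver waits for the other trainers:

```go
package trainer

// Illustrative interfaces and names only; the real master/pserver clients differ.
type Task struct {
	ID      int
	Batches [][]float32
}

type MasterClient interface {
	GetTask() (Task, error) // blocks until the master dispatches a task
	MarkDone(t Task) error
}

type PserverClient interface {
	GetParams() ([]float32, error) // may block until the pserver's update finishes
	SendGrads(grads []float32) error
}

// trainLoop follows the process above: fetch a task, and for each batch fetch
// the (possibly still-updating) parameters, compute gradients, and send them.
func trainLoop(master MasterClient, pserver PserverClient) error {
	for {
		task, err := master.GetTask()
		if err != nil {
			return err
		}
		for _, batch := range task.Batches {
			// Step 3: this call may block while the pserver is still waiting
			// for gradients from the other trainers.
			params, err := pserver.GetParams()
			if err != nil {
				return err
			}
			grads := computeGradient(params, batch) // hypothetical stand-in
			if err := pserver.SendGrads(grads); err != nil {
				return err
			}
		}
		if err := master.MarkDone(task); err != nil {
			return err
		}
	}
}

// computeGradient is a placeholder for the actual forward/backward pass.
func computeGradient(params, batch []float32) []float32 {
	return make([]float32, len(params))
}
```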

@jacquesqiao
Member Author

jacquesqiao commented Jun 2, 2017

@typhoonzero

Then the trainer will call "paddle_get_params" to fetch parameters to start a new training batch, because master may still be waiting for all trainers to send gradients

master may still be waiting ==> pserver may still be waiting?

@helinwang
Contributor

@typhoonzero Yes, the process is correct 👍

@helinwang
Contributor

helinwang commented Jun 2, 2017

@dzhwinter

The timeout threshold will be set to retry times * send-data interval. Can we afford this time-delay overhead? If there is one lagged node (not a dead one), then the delay will be the full timeout threshold on every send_grads/update. Can we afford that price?

I think for SGD we expect all trainers to finish a mini-batch in roughly the same time, so timeouts should not happen often. However, timeouts may still happen frequently due to network or other issues. In that case we can use a more aggressive timeout, add backup trainers, or switch to ASGD.

@helinwang
Contributor

helinwang commented Aug 10, 2017

The API has been discussed and we have reached agreement. Closing this issue.
