"add part of trainer design doc" #2363

Closed
wants to merge 1 commit into from

dzhwinter

No description provided.


## Synchronous SGD

In synchronous SGD, each trainer needs to wait for the other nodes to finish training on every minibatch, and does not go on to the next epoch of training if any node lags behind.

@helinwang helinwang Jun 2, 2017

Here are some terms we usually use:

  • step: one forward-backward step, which computes the gradient.
  • mini-batch: several data instances used in a single step.
  • task: multiple mini-batches; the master server assigns tasks to trainers.
  • pass: all the training data, consisting of multiple tasks.
  • epoch: the start of a new pass.

In this line, "epoch" is used together with "mini-batch". I think by "epoch" you actually mean "step"?
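
To make the nesting of these terms concrete, here is a minimal sketch. The names `Task`, `split_pass`, and `count_steps` are hypothetical and only illustrate the definitions above; they are not part of the trainer or master code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Task:
    """A task: several mini-batches that the master hands out together."""
    mini_batches: List[List[int]]


def split_pass(instances: Sequence[int], batch_size: int,
               batches_per_task: int) -> List[Task]:
    """Split one pass (all training data) into tasks made of mini-batches."""
    batches = [list(instances[i:i + batch_size])
               for i in range(0, len(instances), batch_size)]
    return [Task(batches[i:i + batches_per_task])
            for i in range(0, len(batches), batches_per_task)]


def count_steps(tasks: List[Task]) -> int:
    """One step (forward + backward) is run per mini-batch."""
    return sum(len(task.mini_batches) for task in tasks)
```

With 10 instances, a mini-batch size of 2, and 2 mini-batches per task, one pass consists of 3 tasks and takes 5 steps.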


<img src="src/paddle-trainer.png" width="600"/>

To wait for the other trainers in the same epoch, use waitEpochFinish to decide whether an epoch has finished and whether the trainer can enter the next training epoch.
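
The waiting behaviour described above can be sketched with a barrier. This is only an illustration under assumptions: threads stand in for trainer nodes, the barrier is placed after every step rather than every epoch, and `run_synchronous_steps` is a hypothetical name, not the `waitEpochFinish` proposed in this PR.

```python
import threading
from typing import List, Tuple


def run_synchronous_steps(num_trainers: int, num_steps: int) -> List[Tuple[int, int]]:
    """Run num_steps on num_trainers threads, synchronizing after each step."""
    barrier = threading.Barrier(num_trainers)
    lock = threading.Lock()
    log: List[Tuple[int, int]] = []  # (step, trainer_id) in completion order

    def trainer(trainer_id: int) -> None:
        for step in range(num_steps):
            # Placeholder for the forward/backward pass on one mini-batch.
            with lock:
                log.append((step, trainer_id))
            # Block until every trainer has finished this step.
            barrier.wait()

    threads = [threading.Thread(target=trainer, args=(i,))
               for i in range(num_trainers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because no trainer passes the barrier until all of them have logged the current step, the step numbers in the returned log never decrease.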
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trainer does not need to know about epochs (the start of a new pass); it just gets tasks from the master. So I think waitEpochFinish is not necessary.
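
A minimal sketch of that task-driven loop, assuming a `queue.Queue` as a stand-in for the master server; `run_trainer` and `train_step` are hypothetical names, and the real master API may look quite different:

```python
import queue
from typing import Callable, List


def run_trainer(task_queue: "queue.Queue[List[list]]",
                train_step: Callable[[list], None]) -> int:
    """Pull tasks until none are left; return the number of steps run."""
    steps = 0
    while True:
        try:
            # Ask the (stand-in) master for the next task.
            task = task_queue.get_nowait()
        except queue.Empty:
            # No more tasks: the trainer stops without tracking epochs.
            return steps
        for mini_batch in task:
            train_step(mini_batch)  # one forward-backward step
            steps += 1
```

The trainer never counts passes here; whether a new pass has started is known only to whoever fills the queue.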


## Event Handler

The trainer that processes Python client events is selected in the same way as for parameter initialization: every trainer tries to acquire a distributed lock, and the one that succeeds is elected leader. The leader trainer keeps writing a file or sending metric data to the evaluatorServer, so that the Python client can use that data to draw metrics in real time.
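
The election step described above can be sketched as follows. This assumes a local `threading.Lock` as a stand-in for the distributed lock and threads as stand-ins for trainers; `elect_leader` is a hypothetical name, and reporting to the evaluatorServer is left out.

```python
import threading
from typing import List


def elect_leader(num_trainers: int) -> List[int]:
    """Return the ids of the trainers that won the election."""
    election_lock = threading.Lock()  # stand-in for a distributed lock
    leaders: List[int] = []

    def trainer(trainer_id: int) -> None:
        # Non-blocking attempt: only the first trainer to arrive gets the lock.
        if election_lock.acquire(blocking=False):
            # The leader keeps the lock for the whole job, so it is not released.
            leaders.append(trainer_id)

    threads = [threading.Thread(target=trainer, args=(i,))
               for i in range(num_trainers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return leaders
```

Exactly one trainer ends up in the returned list; which one depends on thread scheduling.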

@helinwang helinwang Jun 2, 2017

It may be too early to put the "Event Handler" section into a design doc, since we have not reached consensus yet.
Please see: #2364 (comment)

@dzhwinter dzhwinter closed this Aug 14, 2017
2 participants