Distributed training progress and todo #1820

Closed
helinwang opened this issue Apr 20, 2017 · 3 comments

helinwang (Contributor) commented Apr 20, 2017

Progress:
Design docs:

TODO:
For the first-level bullet points, please refer to the corresponding design docs. The second-level bullet points are questions we still need to figure out.

  • Implement PaddlePaddle Server.
  • Implement the master program.
  • Implement a fault-tolerant parameter server.
    • Do we need to rewrite the parameter server? How much effort would it take to add fault tolerance in C++? If the effort is comparable to or greater than a rewrite in golang, maybe we should rewrite it in golang.
    • Do we need to support sparse parameter updates in v1?
    • What kinds of update rules does the parameter server need to support in v1? Maybe only a simple "add" (nothing momentum-based).
  • Implement a fault-tolerant trainer.
    • Be able to scale up the number of trainers.
    • This involves changes to both the Python part and the native part (C++ or golang). We need to define a clean C API for Python to use.
  • Client submits cluster training jobs.
  • Set up the etcd service (no detailed specification in the design doc; we will not reuse etcd from k8s, per "distributed training: should we re-use etcd from k8s?" #1807).
    • How do we control the etcd access namespace?
  • Provision a filesystem for each user (no detailed design yet).
  • Collect logs and display them to users (no detailed design yet).
  • Upload custom datasets to the cluster.
    • Do we need to support merging data files into one big custom file to speed up sequential reads? Is this a performance concern for the first version?
    • How can the trainer read the dataset and stay backward/forward compatible? Maybe we need a reader "driver" for each dataset (see the sketch after this list).
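
A rough illustration of the reader "driver" idea from the last bullet. These names (`register_driver`, `get_reader`, `line_reader`) are hypothetical and do not exist in PaddlePaddle; the sketch only shows how dataset-specific parsing could sit behind a small, stable interface so the trainer stays backward/forward compatible:

```python
# Hypothetical per-dataset reader "driver" registry (illustration only).
_DRIVERS = {}


def register_driver(fmt, reader_factory):
    """Associate a dataset format name with a factory that builds a reader."""
    _DRIVERS[fmt] = reader_factory


def get_reader(fmt, path):
    """Return a record iterator for `path` using the registered driver."""
    if fmt not in _DRIVERS:
        raise KeyError("no reader driver registered for format: %s" % fmt)
    return _DRIVERS[fmt](path)


# Example driver: one training instance per line of a text file.
def line_reader(path):
    def reader():
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")
    return reader


register_driver("text-lines", line_reader)
```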
typhoonzero (Contributor) commented

Shall we first implement a "workable" version that runs on Paddle Cloud? A simple version with basic fault tolerance needs the features below. Running async SGD, this version can handle trainer failures and scale the number of trainers.

  • All reader implementations should check the environment variable "PADDLE_LOCAL_TRAIN": if true, do normal local file reading; if false, the reader fetches a "task" (which is a file path) from the master queue and then reads that file. A sketch follows this list.
  • A simple master implementation that dispatches tasks using queues.
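
A minimal sketch of the environment-variable switch described above. `fetch_task_from_master` is a placeholder for the (not yet designed) master RPC; it is not an existing API:

```python
import glob
import os


def dispatching_reader(local_pattern, fetch_task_from_master):
    """Yield file paths to read, switching on PADDLE_LOCAL_TRAIN.

    `fetch_task_from_master` is a hypothetical callable that asks the
    master queue for the next task (a file path) and returns None when
    no tasks are left.
    """
    if os.getenv("PADDLE_LOCAL_TRAIN", "false").lower() == "true":
        # Local training: read files directly from the local filesystem.
        for path in glob.glob(local_pattern):
            yield path
    else:
        # Cluster training: every "task" handed out by the master is a path.
        while True:
            path = fetch_task_from_master()
            if path is None:
                break
            yield path
```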

helinwang (Contributor, Author) commented Apr 20, 2017

I agree with having a first "workable" implementation of fault-tolerant distributed training. Supporting only async SGD at first seems very reasonable to me. As a side note, Google internally uses async SGD for the majority of its jobs, and the same goes for Cai Cloud Technology.

  • For the "PADDLE_LOCAL_TRAIN" env variable: if we support two modes (local file reading vs. master dispatch), it feels like we are adding complexity, since local file reading can be implemented by having the master dispatch paths on a distributed storage.

  • Yes, the master dispatching is meant to be simple. It's not just that the first version needs to be simple; I think even the final version needs to be simple. I suggest we avoid premature optimization at all costs (a sketch of what "simple" could look like follows).
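
To make the "keep the master simple" point concrete, here is a rough sketch (an assumed structure for illustration, not the actual design) of a master that keeps a todo queue plus a pending map and re-dispatches tasks whose trainer appears to have failed:

```python
import time
from collections import deque


class SimpleMaster(object):
    """Toy task dispatcher: a todo queue plus a pending map, nothing more."""

    def __init__(self, paths, timeout_sec=600):
        self.todo = deque(paths)   # tasks (file paths) not yet handed out
        self.pending = {}          # path -> time it was dispatched
        self.timeout_sec = timeout_sec

    def get_task(self):
        """Hand out the next path; recycle timed-out tasks first."""
        now = time.time()
        for path, started in list(self.pending.items()):
            if now - started > self.timeout_sec:   # trainer likely failed
                del self.pending[path]
                self.todo.append(path)
        if not self.todo:
            return None
        path = self.todo.popleft()
        self.pending[path] = time.time()
        return path

    def finish_task(self, path):
        """A trainer reports a task done; stop tracking it."""
        self.pending.pop(path, None)
```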

Let's discuss more during meeting :)

helinwang (Contributor, Author) commented

Moved to #1860
