Distributed training progress and todo #1820

Closed
helinwang opened this issue Apr 20, 2017 · 3 comments

helinwang (Contributor) commented Apr 20, 2017

Progress:
Design docs:

TODO:
For the first-level bullet points, please refer to the corresponding design docs. The second-level bullet points are questions we still need to figure out.

  • Implement PaddlePaddle Server.
  • Implement the master program.
  • Implement a fault-tolerant parameter server.
    • Do we need to rewrite the parameter server? How much effort would it take to add fault tolerance in C++? If the effort is comparable to or greater than a rewrite in golang, maybe we should rewrite it in golang.
    • Do we need to support sparse parameter updates in v1?
    • What kinds of update rules does the parameter server need to support in v1? Maybe only a simple "add" (nothing momentum-based).
  • Implement a fault-tolerant trainer.
    • Be able to scale up the number of trainers.
    • This involves changes to both the Python part and the native part (C++ or golang). We need to define a clean C API for Python to use.
  • Client submits cluster training jobs.
  • Set up the etcd service (no detailed specification in the design doc; we will not reuse etcd from k8s, per "distributed training: should we re-use etcd from k8s?" #1807).
    • How do we control the etcd access namespace?
  • Provision a filesystem for each user (no detailed design yet).
  • Collect logs and display them to users (no detailed design yet).
  • Upload custom datasets to the cluster.
    • Do we need to support merging data files into one big custom file to speed up sequential reads? Is this a performance concern for the first version?
    • How can the trainer read the dataset and stay backward/forward compatible? Maybe we need a reader "driver" for each dataset (see the sketch after this list).
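
A rough illustration of the reader "driver" idea from the last bullet. These names (`register_driver`, `get_reader`, `line_reader`) are hypothetical and do not exist in PaddlePaddle; the sketch only shows how dataset-specific parsing could sit behind a small, stable interface so the trainer stays backward/forward compatible:

```python
# Hypothetical per-dataset reader "driver" registry (illustration only).
_DRIVERS = {}


def register_driver(fmt, reader_factory):
    """Associate a dataset format name with a factory that builds a reader."""
    _DRIVERS[fmt] = reader_factory


def get_reader(fmt, path):
    """Return a record iterator for `path` using the registered driver."""
    if fmt not in _DRIVERS:
        raise KeyError("no reader driver registered for format: %s" % fmt)
    return _DRIVERS[fmt](path)


# Example driver: one training instance per line of a text file.
def line_reader(path):
    def reader():
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")
    return reader


register_driver("text-lines", line_reader)
```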
typhoonzero (Contributor) commented

Shall we first implement a "workable" version that runs on Paddle Cloud? A simple version with basic fault tolerance needs the features below. Running async SGD, this version can handle trainer failures and scale the number of trainers.

  • All reader implementations should check the environment variable "PADDLE_LOCAL_TRAIN": if true, do normal local file reading; if false, the reader fetches a "task" (which is a file path) from the master queue and then reads that file. A sketch follows this list.
  • A simple master implementation that dispatches tasks using queues.
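
A minimal sketch of the environment-variable switch described above. `fetch_task_from_master` is a placeholder for the (not yet designed) master RPC; it is not an existing API:

```python
import glob
import os


def dispatching_reader(local_pattern, fetch_task_from_master):
    """Yield file paths to read, switching on PADDLE_LOCAL_TRAIN.

    `fetch_task_from_master` is a hypothetical callable that asks the
    master queue for the next task (a file path) and returns None when
    no tasks are left.
    """
    if os.getenv("PADDLE_LOCAL_TRAIN", "false").lower() == "true":
        # Local training: read files directly from the local filesystem.
        for path in glob.glob(local_pattern):
            yield path
    else:
        # Cluster training: every "task" handed out by the master is a path.
        while True:
            path = fetch_task_from_master()
            if path is None:
                break
            yield path
```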

helinwang (Contributor, Author) commented Apr 20, 2017

I agree with having a first "workable" implementation of fault-tolerant distributed training. Supporting only async SGD at first seems very reasonable to me. As a side note, Google internally uses async SGD for the majority of its jobs, and the same goes for Cai Cloud Technology.

  • For the "PADDLE_LOCAL_TRAIN" env variable: if we support two modes (local file reading vs. master dispatch), it feels like we are adding complexity, since local file reading can be implemented by having the master dispatch paths on a distributed storage.

  • Yes, the master dispatching is meant to be simple. It's not just that the first version needs to be simple; I think even the final version needs to be simple. I suggest we avoid premature optimization at all costs (a sketch of what "simple" could look like follows).
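
To make the "keep the master simple" point concrete, here is a rough sketch (an assumed structure for illustration, not the actual design) of a master that keeps a todo queue plus a pending map and re-dispatches tasks whose trainer appears to have failed:

```python
import time
from collections import deque


class SimpleMaster(object):
    """Toy task dispatcher: a todo queue plus a pending map, nothing more."""

    def __init__(self, paths, timeout_sec=600):
        self.todo = deque(paths)   # tasks (file paths) not yet handed out
        self.pending = {}          # path -> time it was dispatched
        self.timeout_sec = timeout_sec

    def get_task(self):
        """Hand out the next path; recycle timed-out tasks first."""
        now = time.time()
        for path, started in list(self.pending.items()):
            if now - started > self.timeout_sec:   # trainer likely failed
                del self.pending[path]
                self.todo.append(path)
        if not self.todo:
            return None
        path = self.todo.popleft()
        self.pending[path] = time.time()
        return path

    def finish_task(self, path):
        """A trainer reports a task done; stop tracking it."""
        self.pending.pop(path, None)
```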

Let's discuss more during meeting :)

helinwang (Contributor, Author) commented

Moved to #1860
