
2018 milestones #9108

Closed
reyoung opened this issue Mar 15, 2018 · 13 comments

Comments

@reyoung
Collaborator

reyoung commented Mar 15, 2018

Fluid supports multi-GPU and cluster training, with high usability

Deadline:

KPI:

  1. Make Fluid support all models in PaddlePaddle/Book and PaddlePaddle/models. Complete Fluid's inference framework on Linux and mobile.
    • Get Baidu teams (Speech, NLP, Image, Abacus) to use Fluid to train models and run inference.
  2. Training the models in PaddlePaddle/Book on multiple GPUs and on a cluster is no slower than TensorFlow.
  3. Training the models in PaddlePaddle/Book on multiple GPUs and on a cluster consumes no more memory than TensorFlow.
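A minimal sketch of how such speed and memory KPIs could be measured. This is a hypothetical harness, not part of Fluid; the `train_step` callable is a placeholder for one training iteration of either framework, and real GPU memory would need vendor tooling rather than `tracemalloc`:

```python
import time
import tracemalloc

def benchmark(train_step, steps=100):
    """Measure wall-clock time and peak Python-heap memory for a loop of
    `train_step` calls. `train_step` is a placeholder standing in for one
    training iteration; it is NOT a real Fluid or TensorFlow entry point."""
    tracemalloc.start()
    start = time.perf_counter()
    for _ in range(steps):
        train_step()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

# Toy stand-in for a training step.
elapsed, peak = benchmark(lambda: sum(i * i for i in range(1000)), steps=10)
```

The same harness would be run once per framework on the same model and hardware to produce the comparison numbers.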

Fluid distributed computing

Deadline:

KPI:

  1. Make Fluid support EDL (Elastic Deep Learning), so that Fluid cluster training can adapt to workload changes by provisioning and de-provisioning resources autonomously.
  2. Support model parallelism as well as data parallelism.
  3. Make Fluid support OpenMPI APIs for distributed all-reduce.
  4. Make Fluid support GPU direct where possible.
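To make the all-reduce KPI concrete, here is a pure-Python sketch of the ring all-reduce communication pattern that MPI and NCCL implement natively. This simulates the algorithm only; there is no actual networking, and each "worker" is just a list of chunks:

```python
def ring_allreduce(workers):
    """Simulate ring all-reduce. `workers` is a list of n workers, each
    holding a list of n chunks (lists of numbers). Afterwards every worker
    holds the elementwise sum of all workers' original data."""
    n = len(workers)
    # Phase 1: reduce-scatter. After n-1 steps, worker i owns the fully
    # reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            src, dst = i, (i + 1) % n
            chunk = (i - step) % n
            workers[dst][chunk] = [
                a + b for a, b in zip(workers[dst][chunk], workers[src][chunk])
            ]
    # Phase 2: all-gather. Each reduced chunk is forwarded around the ring
    # so every worker ends up with every reduced chunk.
    for step in range(n - 1):
        for i in range(n):
            src, dst = i, (i + 1) % n
            chunk = (i + 1 - step) % n
            workers[dst][chunk] = list(workers[src][chunk])
    return workers

# Three workers, each holding a gradient vector split into 3 chunks.
ws = [[[1, 1], [2, 2], [3, 3]],
      [[10, 10], [20, 20], [30, 30]],
      [[100, 100], [200, 200], [300, 300]]]
ring_allreduce(ws)
```

Each worker sends only `2 * (n - 1) / n` of the parameter size per step, which is why this pattern scales well and why GPU-direct transports help: every transfer is a neighbor-to-neighbor copy.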

Compatible with ONNX

Deadline:

KPI:

  1. Make ProgramDesc convertible to ONNX model files.
  2. Make ONNX models convertible into ProgramDesc, so that Fluid can train ONNX models.
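A toy sketch of the one-way conversion idea. The `OP_MAP` table, `convert` function, and dict-based node format here are illustrative assumptions only; a real converter would work on Fluid's ProgramDesc protobuf and build nodes with the `onnx.helper` API:

```python
# Hypothetical op-name mapping; not the real Fluid or ONNX interfaces.
OP_MAP = {"mul": "MatMul", "elementwise_add": "Add", "relu": "Relu"}

def convert(program_ops):
    """Map a toy list of ProgramDesc-like ops to ONNX-style node dicts,
    failing loudly on any op with no ONNX counterpart."""
    nodes = []
    for op in program_ops:
        if op["type"] not in OP_MAP:
            raise NotImplementedError(f"no ONNX mapping for {op['type']}")
        nodes.append({"op_type": OP_MAP[op["type"]],
                      "inputs": op["inputs"],
                      "outputs": op["outputs"]})
    return nodes

# A linear layer: y = x * w + b, expressed as two ops.
toy_program = [
    {"type": "mul", "inputs": ["x", "w"], "outputs": ["xw"]},
    {"type": "elementwise_add", "inputs": ["xw", "b"], "outputs": ["y"]},
]
onnx_nodes = convert(toy_program)
```

The reverse direction (KPI 2) is the same table read backwards, plus reconstructing the ProgramDesc blocks that ONNX's flat graph does not have.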

Support CSP program model and imperative programming

Deadline:

KPI:

  1. Users use Python only as a compiler frontend that produces a ProgramDesc; an interpreter executes the ProgramDesc.
  2. The ProgramDesc includes IfElse and While operators, and supports automatic differentiation.
  3. Saving and loading models and printing metrics are all configured in the ProgramDesc. Integrate deeply with VisualDL to provide a GUI.
  4. Support configuring CSP (coroutines, channels, select) in ProgramDesc. Use CSP to implement multi-GPU and cluster training.
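For readers unfamiliar with CSP, here is a minimal sketch of the channel idea using Python threads and a bounded queue as stand-ins. This is only an analogy; Fluid's actual CSP primitives would live in the ProgramDesc and be executed by the interpreter, not by Python threads:

```python
import threading
import queue

def producer(ch, items):
    for item in items:
        ch.put(item)          # "send" on the channel; blocks when full
    ch.put(None)              # sentinel marking the channel as closed

def consumer(ch, results):
    while True:
        item = ch.get()       # "receive" on the channel; blocks when empty
        if item is None:
            break
        results.append(item * 2)

ch = queue.Queue(maxsize=2)   # bounded channel, like Go's make(chan int, 2)
results = []
t1 = threading.Thread(target=producer, args=(ch, [1, 2, 3]))
t2 = threading.Thread(target=consumer, args=(ch, results))
t1.start(); t2.start(); t1.join(); t2.join()
```

The bounded queue gives the backpressure that makes CSP attractive for multi-GPU pipelines: a fast producer (data loading) automatically blocks until a slow consumer (the GPU) catches up.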
@helinwang
Contributor

helinwang commented Mar 15, 2018

Should we specify a TensorFlow version (e.g., the latest release, v1.6.0) for the performance comparison? Otherwise, if they release a new, well-optimized version after we surpass TensorFlow's performance, we would be in an awkward position.

I'm not saying we should not aim to be better than the latest TensorFlow; my point is that maybe we should focus on a fixed target first.

@helinwang
Contributor

Support model parallelism as well as data parallelism

Is there a need for model parallelism beyond large embedding lookups? If not, we may want to change this to "Support large embedding lookups as well as data parallelism".

@helinwang
Contributor

helinwang commented Mar 15, 2018

Make Fluid support OpenMPI APIs to do distributed all-reduce.

Currently, we use the parameter server architecture (via the send/recv operators) for parameter updates; it is a completely different architecture from all-reduce. From my understanding, their theoretical network throughput and per-step time consumption are similar. We already support the parameter server architecture, so what is the reason we need to support another approach with similar performance?

Support to configure CSP(coroutines, channel, select) in ProgramDesc. Use CSP to implement multi GPUs and cluster training.

If we use CSP for cluster training, it looks more like the parameter server architecture than the all-reduce architecture.
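The throughput similarity mentioned above can be checked with a back-of-envelope calculation. The formulas below are standard approximations (not measurements from either system): a parameter-server worker pushes its gradients and pulls fresh parameters, while a ring all-reduce worker moves `2 * (n - 1) / n` of the parameter size per step:

```python
def ps_bytes_per_worker(param_bytes):
    # Parameter server: push gradients up + pull parameters down.
    return 2 * param_bytes

def ring_allreduce_bytes_per_worker(param_bytes, n_workers):
    # Ring all-reduce: reduce-scatter + all-gather, each phase moving
    # (n - 1) / n of the parameter size.
    return 2 * param_bytes * (n_workers - 1) / n_workers

p = 100 * 2**20  # e.g., a 100 MB model
ps = ps_bytes_per_worker(p)
ar = ring_allreduce_bytes_per_worker(p, 8)
```

For any reasonable worker count the two are within a factor of `n / (n - 1)` of each other, which supports the point that the architectures differ in topology and fault behavior more than in raw network cost.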

@wangkuiyi
Collaborator

wangkuiyi commented Mar 15, 2018

I agree with @helinwang that MPI AllReduce is NOT part of PaddlePaddle's milestones. I am open to someone outside the PaddlePaddle team trying that approach, but it makes sense only if they can do it while running PaddlePaddle jobs in containers.

@wangkuiyi
Collaborator

wangkuiyi commented Mar 15, 2018

Thanks to @reyoung and @helinwang and others for this list.

I tried to summarize the milestones as follows. @PaddlePaddle/paddle team, please comment.

Note from @panyx0718 about eng time: 6 fulltime months could mean 3 people spending 2 months working fulltime on something.

Version | ETA | Features | Eng time
0.11.2 | end-March | Support FP16 | 2+ fulltime months
0.11.3 | mid-April | Improve the performance of the single-node-multi-GPU setting | 5+ fulltime months
0.11.4 | end-April | Improve the performance of the multi-node setting | 5+ fulltime months
 | | Implement the distributed lookup table using memcached | ?
0.12 | end-April | ONNX-compatible and Fluid-to-ONNX converter | ?
 | | Support the imperative programming paradigm | 30+ fulltime months
0.12.1 | end-May | Improve the Fluid API and make it stable | 15+ fulltime months
0.13 | end-June | Fluid on Kubernetes and EDL | ?
0.13.1 | end-July | Develop the transpiler for CUDA | ?
0.13.2 | mid-August | Develop the transpiler for ARM | ?
0.13.3 | end-August | Develop the transpiler for ROCm (depends on AMD team's work) | ?
0.13.4 | mid-September | Develop the transpiler for Intel Movidius | ?
0.14 | mid-October | Fluid Debugger | 5 fulltime months
1.0 | end-October | Wrap up the above work and officially release the new set of tutorials, documentation, websites | ?
1.0.1 | end-November | Fluid on EDL/Kubernetes, stable and performant for cloud service deployment | ?

@varunarora

This is a solid list, thanks to everyone who worked on it. (more comments coming soon)

@typhoonzero
Contributor

typhoonzero commented Mar 16, 2018

@helinwang @wangkuiyi The reason for supporting MPI all-reduce is that the latest OpenMPI implementations can support GPU direct if the hardware supports it. This is the fastest way to implement very high-performance distributed GPU training. Alternatively, we can still try the more time-consuming way: implementing GPU direct using the CUDA libraries directly. What do you think?

@PaddleCI
Contributor

PaddleCI commented Mar 16, 2018 via email

@panyx0718
Contributor

@wangkuiyi Thanks for the milestone.

I suggest we add two more columns: 1. the number of full-time engineers and 2. the number of months spent developing the feature. For example:
0.13.1 | end-July | Develop the transpiler for CUDA | fulltime N people | develop for M months

@panyx0718 panyx0718 self-assigned this Mar 16, 2018
@wangkuiyi
Collaborator

Good point! @panyx0718 Please feel free to add these columns.

@helinwang
Contributor

@typhoonzero @wangkuiyi @PaddleCI thanks for the comments! Good to know the MPI need from Baidu, as well as the GPU direct support from openmpi.

Given that we already have NCCL all-reduce, the development time for integrating OpenMPI (or NCCL2) may not be that high, plus we get the additional benefit of already-tuned communication and GPU direct support. That could save us a lot of effort. Fault recovery can be added via checkpointing, and fault tolerance on MPI could perhaps be added with some peer-aware logic that creates a new MPI communicator when a node leaves or joins.
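The checkpointing idea can be sketched simply. This is an illustrative stand-in, not Fluid code; the state dict and file layout are assumptions, and a real trainer would checkpoint parameters and optimizer state via the framework's own save/load ops:

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write `state` to a temp file, then rename atomically, so a crash
    mid-write never leaves a corrupt checkpoint behind."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path, default=None):
    """Restore the last good checkpoint, or `default` on a fresh start."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)

ckpt = os.path.join(tempfile.gettempdir(), "fluid_demo.ckpt")
save_checkpoint({"step": 42, "weights": [0.1, 0.2]}, ckpt)
restored = load_checkpoint(ckpt)
```

On an MPI node failure, surviving workers would rebuild the communicator and every worker would resume from the last checkpoint, trading a few steps of recomputation for simple recovery logic.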

@panyx0718
Contributor

@wangkuiyi Done

@shanyi15
Collaborator

Hello, this issue has not been updated in the past month, so we will close it today. If you still need to follow up after it is closed, feel free to reopen it and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
