Adding design doc for multi device (background) #5284

Closed
wants to merge 6 commits

Conversation

kavyasrinet

This is the design document that explains the background of multi-device training.
The next steps are:

  1. Design doc for how multi-device training works in TensorFlow
  2. Design doc for our Python API.
  3. Design doc for ProgramDesc

A typical system has multiple computing devices. In PaddlePaddle, the supported device types as of now are CPU and GPU (we will add FPGA support in the future).

If a PaddlePaddle operation has both CPU and GPU implementations, we decide which kernel to execute based on the device type.
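
To make the kernel-selection idea concrete, here is a minimal Python sketch. It is hypothetical, not PaddlePaddle's actual operator/kernel registry (which lives in C++); the class and method names are made up for illustration.

```python
# Hypothetical sketch of per-device kernel dispatch.
class MulOp:
    def __init__(self):
        # One kernel implementation per supported device type.
        self.kernels = {
            "CPU": self._mul_cpu,
            "GPU": self._mul_gpu,
        }

    def run(self, place, x, y):
        # Pick the kernel matching the place (device) the op is asked to run
        # on; fall back to the CPU kernel if none is registered for it.
        kernel = self.kernels.get(place, self._mul_cpu)
        return kernel(x, y)

    def _mul_cpu(self, x, y):
        return x * y

    def _mul_gpu(self, x, y):
        # A real GPU kernel would launch device code; this is a placeholder.
        return x * y


print(MulOp().run("GPU", 3, 4))  # 12
```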
Training deep learning models can be resource intensive. Even with a very powerful GPU, some models can take a very long time to train. This is especially true for recurrent models, where the execution of each step depends on the execution and output of the previous step.
Member

Training is one scenario for using multiple devices. We may also need to support multiple devices in inference. For example, an FPGA may be used for inference, but an FPGA is not suitable for all operators; we may switch to the CPU for some complex operators and then switch back to the FPGA.

Author

Thanks for pointing this out. Will update the writeup.

@QiJune requested a review from kexinzhao on November 1, 2017, 18:40
QiJune (Member) commented Nov 1, 2017

@QingshuChen and @shijiaxin, please kindly review this doc. We are planning to support the multi-device feature in Paddle.

2. ProgramDesc: Behind the scenes we need a module that can convert the Python API into a ProgramDesc (which is a collection of repeated OpDescs). The ProgramDesc will then be sent to the Executor, which creates the Ops and eventually runs them.

We need to design the above two components as well as propose how the Python API will be parsed into a ProgramDesc.
These components will be addressed in the following design documents, one for each component.
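
As a rough illustration of the second component, here is a sketch only: the `OpDesc`/`ProgramDesc` dataclasses, the kernel table, and the dictionary scope are toy stand-ins, not the real PaddlePaddle definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OpDesc:
    # Simplified stand-in for the repeated OpDesc entries in a ProgramDesc.
    type: str
    inputs: Dict[str, str]
    outputs: Dict[str, str]


@dataclass
class ProgramDesc:
    ops: List[OpDesc] = field(default_factory=list)


class Executor:
    """Creates and runs the ops described by a ProgramDesc on a given place."""

    # Hypothetical kernel table keyed by (op type, place).
    KERNELS = {
        ("sum", "CPU"): lambda a, b: a + b,
        ("mul", "CPU"): lambda a, b: a * b,
    }

    def __init__(self, place="CPU"):
        self.place = place

    def run(self, program, scope):
        # Walk the OpDescs in order, look up each kernel, and run it.
        for op in program.ops:
            kernel = self.KERNELS[(op.type, self.place)]
            a = scope[op.inputs["X"]]
            b = scope[op.inputs["Y"]]
            scope[op.outputs["Out"]] = kernel(a, b)
        return scope


# The Python API would build a ProgramDesc like this behind the scenes.
prog = ProgramDesc(ops=[
    OpDesc("mul", {"X": "x", "Y": "y"}, {"Out": "xy"}),
    OpDesc("sum", {"X": "xy", "Y": "b"}, {"Out": "out"}),
])
scope = {"x": 2.0, "y": 3.0, "b": 1.0}
print(Executor("CPU").run(prog, scope)["out"])  # 7.0
```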
Contributor

The C-API is usually used for inference. We may also need to add a C-API document for multi-device support.

Author

Sounds good. Fixed in the recent commit.

wangkuiyi (Collaborator) left a comment

We will have various combinations of devices. For example:

  • DGX-1: x64 CPU + CUDA
  • Intel Nervana system: x64 CPU + Xeon Phi
  • PX2/3: ARM CPU + CUDA
  • Some servers: x64 CPU + FPGA

I agree with @kavyasrinet that a survey of TensorFlow is a good starting point, so that we can get a systematic view of the challenges and possible solutions.

### Model Parallelism
Model parallelism, on the other hand, works by keeping a part of the model on each available device. This is useful when the model is too large to fit on one device, or when parts of the model can be executed independently (in parallel). In this setup, each device trains a part of the model and passes its updates on to the next device.

Here, we want to explore the model parallelism setup, where different parts of the same model reside on different devices and communicate with each other by sending updates.
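
A toy, framework-agnostic sketch of this setup follows; `to_device`, `part_1`, `part_2`, and the device names are placeholders for whatever transfer primitive and sub-models a real framework would provide.

```python
def to_device(tensor, device):
    # Placeholder: a real framework would copy the tensor to `device` here.
    return tensor


def part_1(x):
    # First part of the model, intended to live on device 0 (e.g. "GPU:0").
    return [2.0 * v for v in x]


def part_2(h):
    # Second part of the model, intended to live on device 1 (e.g. "GPU:1").
    return sum(h)


def forward(x):
    h = part_1(to_device(x, "GPU:0"))  # compute the first part on device 0
    y = part_2(to_device(h, "GPU:1"))  # hand the activations to device 1
    return y


print(forward([1.0, 2.0, 3.0]))  # 12.0
```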
Collaborator

This section explores model parallelism; does that imply that we prefer model parallelism over data parallelism?

Author

As per my discussion with @QiJune, he mentioned that we should focus on model parallelism at the moment, since another team is already looking at data parallelism. If that's not the case, I would be happy to include more about data parallelism as well.


Hello, I think the computation node (operator) placement policy is another topic worth mentioning in the multi-device setting.
Sometimes the model itself is not parallel, but overall training and inference efficiency still benefits from putting some compute nodes, such as the DNN/CNN part, on devices like GPU or FPGA. Of course, the user could explicitly specify on which device each of the model's nodes is computed, but I think how to assign each node's computing device automatically is worth discussing.
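
(For reference, explicit per-op placement in TensorFlow 1.x graph mode looks roughly like the sketch below; ops the user does not pin are left to the runtime's automatic placement policy. The specific ops chosen here are just an example.)

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode API

# Explicitly pin the convolutional part of the graph to the first GPU...
with tf.device('/device:GPU:0'):
    x = tf.random_normal([8, 224, 224, 3])
    conv = tf.layers.conv2d(x, filters=16, kernel_size=3)

# ...and keep a cheaper reduction on the CPU. Any op built outside a
# tf.device block is placed automatically by the runtime.
with tf.device('/device:CPU:0'):
    pooled = tf.reduce_mean(conv, axis=[1, 2])
```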

Author

Thanks for pointing that out. I covered how TensorFlow does that in PR #5412, but this is a good suggestion; I will fix this PR to include it too.

Author

Hello @zealoct, I have made the changes you proposed in the latest commit. Please have a look when you can.
