MPI-enabled PaddlePaddle #9405

Closed
seiriosPlus opened this issue Mar 27, 2018 · 3 comments


@seiriosPlus
Collaborator

By using the MPI API, we enable PaddlePaddle to take advantage of high-performance, low-latency networks such as InfiniBand.
There are two benefits:

  1. Enable RDMA in PaddlePaddle, which brings high-performance, low-latency networking.
  2. Enable GPUDirect in PaddlePaddle, which brings the highest-throughput, lowest-latency GPU reads and writes.
@wangkuiyi
Collaborator

wangkuiyi commented Mar 28, 2018

It looks to me that in order to utilize InfiniBand and GPUDirect via MPI, we need to call MPI_Allreduce?

MPI_Allreduce is mutually exclusive with the parameter server, fault tolerance, and elastic scheduling. It would be important to draft a design doc to make sure that Fluid supports both modes -- AllReduce and ParameterServer.
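For reference, here is a minimal sketch of the gradient-averaging step that MPI_Allreduce enables, written as a standalone C program against the MPI API. It is independent of PaddlePaddle's code; the buffer size and gradient values are toy placeholders:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Toy stand-in for a trainer's local gradient tensor. */
    float grads[4] = {1.0f * rank, 2.0f, 3.0f, 4.0f};
    float summed[4];

    /* Sum gradients across all trainers; every rank receives the
       result, so no parameter server is involved. */
    MPI_Allreduce(grads, summed, 4, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    /* Average so every trainer applies the identical update locally. */
    for (int i = 0; i < 4; ++i)
        summed[i] /= (float)nranks;

    if (rank == 0)
        printf("averaged grad[0] = %f\n", summed[0]);

    MPI_Finalize();
    return 0;
}
```

Because every rank ends up with the same averaged gradients, each trainer updates its own copy of the parameters, which is exactly why this mode has no central parameter server to make fault-tolerant or scale elastically.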

@wangkuiyi
Collaborator

Also, I noticed that the current distributed computing solution includes a transpiler that generates the ProgramDesc messages for trainers and parameter servers. If we are going to use MPI_Allreduce to replace parameter servers, do we need a new transpiler that generates the ProgramDesc for trainers in the AllReduce mode?

@seiriosPlus
Collaborator Author

Our current target is to speed up PaddlePaddle with distributed training. For that, we need to call the MPI_Isend and MPI_Irecv APIs to send and receive data between nodes in the MPI cluster (see the sketch after the list below).
Introducing the Open MPI API to PaddlePaddle can bring two benefits:

  1. Enable RDMA in PaddlePaddle, which brings high-performance, low-latency networking.
  2. Enable GPUDirect in PaddlePaddle, which brings the highest-throughput, lowest-latency GPU reads and writes.
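
To illustrate, here is a minimal sketch of a non-blocking exchange with MPI_Isend/MPI_Irecv, again as a standalone C program against the MPI API. The buffer size and the ring topology are assumptions for illustration, not PaddlePaddle's actual communication pattern:

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024  /* toy tensor size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    float sendbuf[N], recvbuf[N];
    for (int i = 0; i < N; ++i)
        sendbuf[i] = (float)rank;

    /* Non-blocking send to the next rank and receive from the previous
       rank in a ring; both calls return immediately. */
    int next = (rank + 1) % nranks;
    int prev = (rank + nranks - 1) % nranks;
    MPI_Request reqs[2];
    MPI_Isend(sendbuf, N, MPI_FLOAT, next, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf, N, MPI_FLOAT, prev, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Computation can overlap with the transfers here. */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("rank 0 received %f from rank %d\n", recvbuf[0], prev);

    MPI_Finalize();
    return 0;
}
```

With an RDMA-capable transport such as Open MPI over InfiniBand, these transfers can bypass the kernel network stack, and a CUDA-aware MPI build can use GPUDirect to read and write GPU memory directly.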
