Mpi enabled #9490
Changes from 4 commits
3b39815
b1adcd4
94ce020
9d256dd
900b3e9
5bf671b
10669f1
d2ba05a
@@ -0,0 +1,33 @@
# MPI-enabled PaddlePaddle Design Doc

# Background
When we run distributed multi-GPU training, the communication overhead between servers becomes the major bottleneck in PaddlePaddle Fluid. We want to use RDMA and GPUDirect to reduce this overhead, so we enable these features in PaddlePaddle with the help of MPI.
We will introduce the OpenMPI API to PaddlePaddle, which brings two benefits:

1. RDMA support in PaddlePaddle, which provides a high-bandwidth, low-latency network.
2. GPUDirect support in PaddlePaddle, which provides the highest-throughput, lowest-latency GPU reads and writes.
## Execute args
Launch the script with the ```mpirun``` launcher, for example: ```mpirun -np 3 -hosts node1,node2,node3 python train.py```. By doing this, we can number the actors (trainer/pserver/master) from 0 to (n-1). A node's number is the rank of the calling process within the communicator (an integer); MPI processes identify each other by this rank ID. We have to create a mapping between PaddlePaddle's actors and their rank IDs so that we can communicate with the correct destinations when using MPI operations.
**We have to store the rank ID and the mapping in global variables.**
## New OP/MODULE
We won't replace all gRPC requests with MPI requests: the standard gRPC library is still used for all administrative operations, while the MPI API is used to transfer tensors or SelectedRows to pservers. Based on this idea, we create two new operators to handle sends and receives: ```mpi_send_op``` and ```mpi_listenandserve_op```. They are similar to [send_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/send_op.cc) and [listen_and_serv_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/listen_and_serv_op.cc). We will also build a new module to wrap the MPI send and receive process.
### mpi_module
We will build a new module to wrap the MPI send and receive process. MPI send and receive differ from gRPC: an MPI [receive](https://www.open-mpi.org/doc/v1.8/man3/MPI_Irecv.3.php) must know the receive buffer size and element type in advance. For this reason, we have to communicate twice: the first communication sends metadata about the gradient through gRPC, and the second is the real communication through MPI, which sends the gradient data to ```mpi_listenandserve_op```.
The detailed flow is below:
![](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/dist_train/src/mpi_module.png)
### mpi_send_op
Very similar to ```send_op```; we will replace the gRPC code used to send gradients with ```mpi_module```, and wrap it with ```framework::Async```.
### mpi_listenandserve_op
Very similar to ```listen_and_serv_op```; we will replace the gRPC code used to receive gradients with ```mpi_module```, and wrap it with ```framework::Async```.
### modify distribute_transpiler.py
We need to add an argument to choose whether to use MPI. If MPI is enabled, distribute_transpiler will replace ```send_op``` with ```mpi_send_op```, and ```listenandserve_op``` with ```mpi_listenandserve_op```.
## Build args
Because MPI and CUDA require hardware support, we will add some build arguments to control compilation.
**The specific arguments are under design.**
@@ -0,0 +1,29 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#include "mpi_client.h"
#include "mpi_utils.h"

namespace paddle {
namespace operators {
namespace detail {

bool MPIClient::AsyncSendVariable() {
  // Placeholder: issue a non-blocking send of a test message to rank 1.
  const char* msg = "123456787654";
  int dst = 1;
  MPIIsend send(dst, msg);
  send.Send();
  return send.IsFinished();
}

bool MPIClient::Wait() { return true; }

}  // namespace detail
}  // namespace operators
}  // namespace paddle
@@ -0,0 +1,56 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#pragma once

#include <iostream>
#include <map>
#include <string>

#include "paddle/fluid/framework/data_type.h"
#include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/scope.h"
#include "paddle/fluid/framework/selected_rows.h"

namespace paddle {
namespace operators {
namespace detail {

class MPIClient {
 public:
  // bool AsyncSendVariable(const std::string& ep,
  //                        const platform::DeviceContext& ctx,
  //                        const framework::Scope& scope,
  //                        const std::string& var_name,
  //                        int64_t time_out = 600 * 1000);

  // bool AsyncGetVariable(const std::string& ep,
  //                       const platform::DeviceContext& ctx,
  //                       const framework::Scope& scope,
  //                       const std::string& var_name,
  //                       int64_t time_out = 600 * 1000);

  // void AsyncSendBatchBarrier(const std::string& ep,
  //                            int64_t time_out = 600 * 1000);

  // void AsyncSendFetchBarrier(const std::string& ep,
  //                            int64_t time_out = 600 * 1000);

  bool AsyncSendVariable();

  bool Wait();

 private:
  int64_t req_count_ = 0;
};

}  // namespace detail
}  // namespace operators
}  // namespace paddle
@@ -0,0 +1,23 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#pragma once

namespace paddle {
namespace operators {
namespace detail {

// Placeholder for the MPI-side server; members are still to be designed.
class MPIServer {
 public:
 private:
};

}  // namespace detail
}  // namespace operators
}  // namespace paddle
@@ -0,0 +1,92 @@
//
// Created by tangwei12 on 2018/3/27.
//

#include <stdio.h>
#include <string.h>

#include <mpi.h>
#include "mpi_utils.h"

#define max_worker_name_length 128
#define mpi_tag 2008

namespace paddle {
namespace operators {
namespace detail {

MPIUtils::MPIUtils(const std::string& worker_name) {
  InitMPI();

  int rank = 0, size = 1;
  char my_name[max_worker_name_length];
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  snprintf(my_name, max_worker_name_length, "%s", worker_name.c_str());

  // Gather every worker's name so each process can map names to ranks.
  std::vector<char> worker_names(size * max_worker_name_length);
  MPI_Allgather(my_name, max_worker_name_length, MPI_CHAR, &worker_names[0],
                max_worker_name_length, MPI_CHAR, MPI_COMM_WORLD);
  for (int i = 0; i < size; i++) {
    name_to_id_[std::string(&worker_names[i * max_worker_name_length])] = i;
  }
}

void MPIUtils::InitMPI() {
  int flag = 0;
  MPI_Initialized(&flag);

  if (!flag) {
    int rank = 0, size = 1, len = -1;
    char host_name[max_worker_name_length];

    MPI_Init(0, 0);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host_name, &len);
  }
}

MPIIsend::MPIIsend(int dst, const char* req) {
  done1_ = 0;
  dst_ = dst;
  length_ = strlen(req);
  req_ = req;
  msg1_ = MPI_REQUEST_NULL;  // so Wait in the destructor is safe pre-Send
}

void MPIIsend::Send() {
  MPI_Isend(req_, length_, MPI_CHAR, dst_, mpi_tag, MPI_COMM_WORLD, &msg1_);
  MPI_Test(&msg1_, &done1_, MPI_STATUS_IGNORE);
}

bool MPIIsend::IsFinished() {
  MPI_Status status;
  if (!done1_) MPI_Test(&msg1_, &done1_, &status);
  return done1_ != 0;
}

MPIIsend::~MPIIsend() {
  // Ensure the non-blocking send has completed before the buffer goes away.
  MPI_Wait(&msg1_, MPI_STATUS_IGNORE);
}

MPIIrecv::MPIIrecv() {}

void MPIIrecv::Recv() {}

bool MPIIrecv::IsFinished() { return true; }

MPIIrecv::~MPIIrecv() {}

}  // namespace detail
}  // namespace operators
}  // namespace paddle
@@ -0,0 +1,55 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#pragma once
#include <mpi.h>
#include <map>
#include <string>
#include <vector>

namespace paddle {
namespace operators {
namespace detail {

class MPIUtils {
 public:
  explicit MPIUtils(const std::string& worker_name);
  int GetRankID(const std::string& task_id);

 private:
  void InitMPI();
  std::map<std::string, int> name_to_id_;
};

class MPIIsend {
 public:
  MPIIsend(int dst, const char* buf);
  bool IsFinished();
  void Send();
  ~MPIIsend();

 private:
  int dst_;
  int done1_;
  int length_;
  const char* req_;
  MPI_Request msg1_;
};

class MPIIrecv {
 public:
  MPIIrecv();
  bool IsFinished();
  void Recv();
  ~MPIIrecv();
};

}  // namespace detail
}  // namespace operators
}  // namespace paddle