Feature/multigpu #4838

Merged
merged 48 commits into from
Oct 29, 2017

Conversation

@dzhwinter (Contributor) commented Oct 16, 2017

NCCL is a library of MPI-like communication primitives that can be used to synchronize values across multiple GPU cards.

This is the simplest implementation of #3769. Currently, we only support multiple GPU cards running the same topology, with parameters synchronized before the optimizer step.

To minimize the review workload, this PR only implements the AllReduce operator, which will be used frequently to synchronize parameters/gradients between GPU cards.
We leave the other operators (Gather/Bcast) to future work.

To support the NCCL library in the current refactoring stage, here is the brief plan:

  • feature/multigpu NCCL support design doc. Multigpu Feature #3769
    Every GPU runs the same graph/blocks and can only be synchronized at specific parameters/gradients.
  • third_party NCCL library integration. "add nccl cmake enforce" #4818
  • Construct NCCL communicators. To demonstrate that the communicators are implemented correctly, support the NCCL AllReduce primitive. Feature/multigpu #4838
  • AllGather/Bcast and any other NCCL primitives.

This will be supported only if performance becomes a bottleneck:

  • hand-written allreduce/bcast primitive routines.
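For reference, here is a minimal, stand-alone sketch (not PaddlePaddle code) of the NCCL AllReduce primitive this PR wraps, following the classic single-process, multi-device pattern of the public NCCL API; buffer and stream setup is assumed to happen elsewhere.

  #include <cuda_runtime.h>
  #include <nccl.h>
  #include <vector>

  // Sum-reduce `count` floats across all cards and leave the result on every card.
  void AllReduceSketch(const std::vector<int>& gpus,
                       const std::vector<float*>& sendbufs,
                       const std::vector<float*>& recvbufs, size_t count,
                       const std::vector<cudaStream_t>& streams) {
    std::vector<ncclComm_t> comms(gpus.size());
    // One communicator per GPU card; every card runs the same topology.
    ncclCommInitAll(comms.data(), static_cast<int>(gpus.size()), gpus.data());
    // (With NCCL 2.x, wrap the loop in ncclGroupStart()/ncclGroupEnd() when
    // issuing all calls from a single thread.)
    for (size_t i = 0; i < gpus.size(); ++i) {
      cudaSetDevice(gpus[i]);
      ncclAllReduce(sendbufs[i], recvbufs[i], count, ncclFloat, ncclSum,
                    comms[i], streams[i]);
    }
    for (auto& c : comms) ncclCommDestroy(c);
  }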

@@ -290,6 +290,15 @@ class ExecutionContext {
return device_context_;
}

//! Get a input which has multiple variables.
Member: The comment is not accurate; all of our inputs/outputs are stored in vectors.

dzhwinter (Contributor, author): Done.

const std::vector<std::string>& Inputs(const std::string& name) const {
return op_.Inputs(name);
}
//! Get an output which has multiple variables.
Member: Same as above.

dzhwinter (Contributor, author): Done.

@@ -0,0 +1,17 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Member: What is this file used for?

dzhwinter (Contributor, author): It builds an operator-independent module, which may be used by a Multiexecutor or other modules.


auto x_dims = ctx->GetInputsDim("X");

// std::string reduction = ctx->Attrs().Get<std::string>("reduction");
@jacquesqiao (Member), Oct 26, 2017: Are these checks needed?

dzhwinter (Contributor, author): Added the reduction check. Done.

AddOutput("Out", "The output of Reduce op");
AddAttr<int>("root",
"root gpu of the parameter. if not set(-1). hashed by name.")
.SetDefault(-1);
@jacquesqiao (Member), Oct 26, 2017: Use a const value to represent -1, such as kInvalidGPUId.

dzhwinter (Contributor, author): Done.
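A sketch of the suggested change (illustrative; where the constant finally lives in the code is an assumption):

  // Named constant instead of the magic number -1.
  constexpr int kInvalidGPUId = -1;

  AddAttr<int>("root",
               "root gpu of the parameter. if not set (kInvalidGPUId), "
               "the root is chosen by hashing the variable name.")
      .SetDefault(kInvalidGPUId);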


auto* comm = ctx.Input<Communicator>("Communicator");

auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
Member: Is the cast needed here?

dzhwinter (Contributor, author): The returned value is a DeviceContext; the abstract class doesn't provide a stream interface.
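In other words (a sketch; the exact accessor names are assumed rather than quoted from the PR):

  // DeviceContext is the abstract base class and has no stream() accessor,
  // so the CUDA-specific subclass must be obtained via a cast.
  auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
                    ctx.device_context())
                    .stream();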

auto ins_names = ctx.Inputs("X");
std::hash<std::string> hasher;
for (size_t i = 0; i < ins.size(); ++i) {
if (root == -1) {
Member: Replace -1 with a const value.

dzhwinter (Contributor, author): Done.
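A sketch of the hashed-root selection being discussed, using kInvalidGPUId as the named constant from the review (variable names such as `gpus` are illustrative):

  auto ins_names = ctx.Inputs("X");
  std::hash<std::string> hasher;
  for (size_t i = 0; i < ins_names.size(); ++i) {
    int root = ctx.Attr<int>("root");
    if (root == kInvalidGPUId) {
      // Derive a deterministic root card from the variable name, so every
      // device picks the same root without extra coordination.
      root = hasher(ins_names[i]) % gpus.size();
    }
    // ... launch the NCCL reduce/bcast for input i with this root ...
  }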

@@ -290,6 +290,16 @@ class ExecutionContext {
return device_context_;
}

//! Get variables vector with same input name.
Member: I think "Get actual name vector for this input." is better.

dzhwinter (Contributor, author): Done.

std::vector<int> gpus = Attr<std::vector<int>>("gpus");
PADDLE_ENFORCE(!gpus.empty(), "Attr(gpus) should not be empty.");
platform::Communicator *comm =
scope.FindVar(name)->GetMutable<platform::Communicator>();
@jacquesqiao (Member), Oct 28, 2017: Maybe add a check, e.g.

if (scope.FindVar(name) == nullptr) {...}

because this op doesn't have an InferShape to ensure the output is there.

dzhwinter (Contributor, author): Done.
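A sketch of the suggested guard (illustrative; the exact message and macro choice are assumptions):

  auto* var = scope.FindVar(name);
  // This op has no InferShape to guarantee the output variable exists,
  // so check before dereferencing.
  PADDLE_ENFORCE(var != nullptr, "Communicator variable '%s' was not created.",
                 name);
  platform::Communicator* comm = var->GetMutable<platform::Communicator>();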

See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
Member: We don't use Eigen here.

dzhwinter (Contributor, author): Done.

auto outs = ctx.MultiOutput<LoDTensor>("Out");

std::string reduction = ctx.Attr<std::string>("reduction");
ncclRedOp_t reduction_op_ = ncclSum;
@jacquesqiao (Member), Oct 28, 2017: Where do ncclSum, ncclMax, and ncclProd come from?

dzhwinter (Contributor, author): NCCL provides all four reduction operations; see http://docs.nvidia.com/deeplearning/sdk/nccl-api/ncclapidoc.html#ncclredop_t.

This operator's "reduction" attribute indicates which operation to use.

} else if (reduction == "ncclProd") {
reduction_op_ = ncclProd;
} else {
PADDLE_ENFORCE(false, "Invalid reduction. default ncclSum.");
Member: Change PADDLE_ENFORCE(false, ...) to PADDLE_THROW.

dzhwinter (Contributor, author): Done.
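After the review, the mapping roughly reads as follows (a sketch; only the fall-through branch changes from PADDLE_ENFORCE(false, ...) to PADDLE_THROW):

  std::string reduction = ctx.Attr<std::string>("reduction");
  ncclRedOp_t reduction_op_ = ncclSum;
  if (reduction == "ncclMin") {
    reduction_op_ = ncclMin;
  } else if (reduction == "ncclMax") {
    reduction_op_ = ncclMax;
  } else if (reduction == "ncclSum") {
    reduction_op_ = ncclSum;
  } else if (reduction == "ncclProd") {
    reduction_op_ = ncclProd;
  } else {
    PADDLE_THROW("Invalid reduction. default ncclSum.");
  }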

See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
Member: Remove this line.

dzhwinter (Contributor, author): Done.

op2->SetInput("X", {"st"});
op2->SetInput("Communicator", {"comm"});
op2->SetOutput("Out", {"rt"});
op2->SetAttr("root", {kRoot});
Member: I think this line should be

op2->SetAttr("root", kRoot);

dzhwinter (Contributor, author): Done.

template <class T>
void PerThreadProgram(int gpu_id, const f::OpDescBind &op_desc,
f::Scope *scope) {
std::unique_lock<std::mutex> lk(mu);
Member: Why is a lock needed here?

dzhwinter (Contributor, author): Because we call the GetMutable interface, which is not thread safe.

Member: In my understanding, each PerThreadProgram runs on an independent scope and place. In that situation, do we still have a thread-safety problem? Just want to make sure.

@dzhwinter (Contributor, author), Oct 29, 2017: Yes, we do. If I remove this lock guard, I actually get a segmentation fault. In my view,

  T* GetMutable() {
    if (!IsType<T>()) {
      holder_.reset(new PlaceholderImpl<T>(new T()));
    }
    return static_cast<T*>(holder_->Ptr());
  }

GetMutable uses the global new to allocate memory for type T, and at this stage we do not have any guard on it. The scope and place only separate the allocated pointers from each other, so the scope hierarchy only takes effect for the user's program built on top of the scope.

If we want thread safety, we need a lock around the new T, I think.
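A sketch of the guard being discussed (names are illustrative; the point is only the serialization of variable creation):

  std::mutex mu;  // shared by all per-GPU threads

  template <class T>
  void PerThreadProgram(int gpu_id, const f::OpDescBind& op_desc, f::Scope* scope) {
    {
      // Variable::GetMutable allocates its holder with an unsynchronized
      // `new T()`, so variable creation must be serialized across threads.
      std::unique_lock<std::mutex> lk(mu);
      scope->Var("Communicator")->GetMutable<platform::Communicator>();
    }
    // ... build and run the per-GPU ops outside the lock ...
  }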

}
}

// ncclAReduceOp with desc
Member: ncclAReduceOp => ncclReduceOp

dzhwinter (Contributor, author): Done.

@jacquesqiao (Member) left a review comment: Great job!

@dzhwinter merged commit 833d0ad into PaddlePaddle:develop on Oct 29, 2017.