Backward on parallel do using nccl #8361

Merged: 38 commits, Feb 20, 2018

Conversation

@tonyyang-svail commented Feb 11, 2018

Part of the optimization of parallel_do

This PR contains the following:

  1. add the nccl library to the framework
  2. add an nccl callback on backward
  3. add an nccl flag to parallel_do
  4. use an assign op to overwrite the reduced gradient (see the sketch after this list)
  5. verify the correctness of parallel_do with nccl
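For orientation, here is a minimal sketch of the per-gradient pattern that items 2 and 4 describe: allreduce into a scratch output, then an assign op that overwrites the original gradient. This is plain Python with op descs modeled as dicts, not the actual fluid _create_op_desc_ API; the scratch-name suffix is hypothetical.

def nccl_backward_ops(grad_name):
    # ncclAllReduce cannot run in place, so reduce into a scratch output first
    allreduce_out = grad_name + '__nccl_all_reduce__'  # hypothetical name
    return [
        {'type': 'ncclAllReduce',
         'inputs': {'X': [grad_name],
                    'Communicator': ['nccl_com__do_not_change_']},
         'outputs': {'Out': [allreduce_out]}},
        # assign writes the reduced result back over the original gradient
        {'type': 'assign',
         'inputs': {'X': [allreduce_out]},
         'outputs': {'Out': [grad_name]}},
    ]

print(nccl_backward_ops('fc_0.w_0@GRAD'))

After backward runs these ops, each device's gradient holds the cross-device sum.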

@CLAassistant commented Feb 11, 2018

CLA assistant check: all committers have signed the CLA.

@@ -239,7 +304,8 @@ def empty_callback(block, context):
         sub_block = program.block(op.block_attr("sub_block"))
         grad_sub_block = program.create_block(parent_idx=sub_block.idx)
         _append_backward_ops_(sub_block, sub_block.ops, grad_sub_block,
-                              no_grad_dict, grad_to_var)
+                              no_grad_dict, grad_to_var,
+                              _callback_lookup_(op))
@JiayiFeng (Collaborator) commented Feb 11, 2018:

How can we apply more than one callback here? E.g., we would like to apply the nccl and error-clip callbacks at the same time.

Also, I think op is a bad name. The parameter actually means the op that owns the current block; op is too broad for that.

@tonyyang-svail (Author) replied:

@JiayiFeng Thank you for your review.

> How can we apply more than one callback here? E.g., we would like to apply the nccl and error-clip callbacks at the same time.

I will change the callback to a list of callbacks.

> Also, I think op is a bad name. The parameter actually means the op that owns the current block; op is too broad for that.

What do you mean by "The parameter"?
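A hedged sketch of the "list of callbacks" change being discussed: every callback keeps the (block, context) signature of empty_callback shown in the diff above, and all of them fire when a grad op is appended. The names here are illustrative, not the actual fluid helpers.

def run_callbacks(callbacks, block, context):
    # invoke every registered callback in order
    for cb in callbacks:
        cb(block, context)

def error_clip_callback(block, context):
    pass  # would append error-clipping ops

def nccl_callback(block, context):
    pass  # would append ncclAllReduce + assign ops

run_callbacks([error_clip_callback, nccl_callback], block=None, context={})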

    }
  }
  for (auto &s : Outputs(framework::GradVarName(kParameters))) {
    if (s == "@EMPTY@") {
@tonyyang-svail (Author) commented Feb 16, 2018:

Backward will change some of the gradients to @EMPTY@ if we don't need to calculate them. For example, we don't need to calculate the gradient of layer.data. In that case, parallel_do should skip them.
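A tiny illustration of that skip (the gradient names are made up):

param_grads = ['fc_0.w_0@GRAD', '@EMPTY@', 'fc_0.b_0@GRAD']
# parallel_do only reduces and copies out gradients that were actually computed
needed = [g for g in param_grads if g != '@EMPTY@']
assert needed == ['fc_0.w_0@GRAD', 'fc_0.b_0@GRAD']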

@@ -28,8 +28,8 @@ using Tensor = framework::Tensor;

 // Base convolution operator definations for other conv
 // like operators to reuse the implementation.
-inline int OutputSize(int input_size, int filter_size, int dilation,
-                      int padding, int stride) {
+inline int ConvOutputSize(int input_size, int filter_size, int dilation,
@tonyyang-svail (Author) commented Feb 16, 2018:

The name OutputSize is too general, and it sits in the paddle namespace, which is too broad a scope.

@@ -60,8 +60,9 @@ void ConvOp::InferShape(framework::InferShapeContext* ctx) const {
                    "Due to the settings of paddings, filter_dims and "
                    "dilations, the output size is less than 0, please check "
                    "again.");
-    output_shape.push_back(OutputSize(in_dims[i + 2], filter_dims[i + 2],
-                                      dilations[i], paddings[i], strides[i]));
+    output_shape.push_back(ConvOutputSize(in_dims[i + 2], filter_dims[i + 2],
@tonyyang-svail (Author) replied:

When I was debugging conv_op.cc, it looked like OutputSize had been linked to another function...

@@ -14,10 +14,13 @@ limitations under the License. */

 #include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/operators/nccl/nccl_gpu_common.h"
+#include "paddle/fluid/operators/nccl/nccl_gpu_common.h"
A contributor commented:

paddle/fluid/operators/nccl/nccl_gpu_common.h is included twice.

@tonyyang-svail (Author) replied:

Typo, thanks for pointing it out.

  PADDLE_ENFORCE(!gpus.empty(), "Attr(gpus) should not be empty.");
  // A parallel do may not use all the gpus. For example, the batch size is 7
  // in the last batch while we have 8 gpus. In this case, parallel_do will
  // create 7 parallel scopes, so ncclInitOp should create 7 gpu peers.
A contributor commented:

You mentioned the "last batch"; does that imply ncclInitOp will be called for every mini-batch?

@tonyyang-svail (Author) replied:

Yes.

  // in the last batch while we have 8 gpus. In this case, parallel_do will
  // create 7 parallel scopes, so ncclInitOp should create 7 gpu peers.
  auto &parallel_scopes = scope.FindVar(Input(kParallelScopes))
                              ->Get<std::vector<framework::Scope *>>();
A contributor commented:

Does Scope support serialization?

@tonyyang-svail (Author) replied:

When do we need to serialize a scope?

The contributor replied:

Sorry, never mind, I got confused.

  auto &parallel_scopes = scope.FindVar(Input(kParallelScopes))
                              ->Get<std::vector<framework::Scope *>>();
  std::vector<int> gpus(parallel_scopes.size());
  for (int i = 0; i < static_cast<int>(parallel_scopes.size()); ++i) {
A contributor commented:

Since only parallel_scopes.size() is used, could we just pass kNumParallelScopes instead of kParallelScopes?

@tonyyang-svail (Author) replied:

We don't know kNumParallelScopes at compile time.
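In other words, the peer count is derived at run time from the scopes themselves. A minimal sketch of that sizing, assuming one peer per scope as in the C++ above:

def num_gpu_peers(parallel_scopes):
    # one nccl peer per parallel scope created for this mini-batch
    return len(parallel_scopes)

# a batch of 7 on an 8-GPU machine yields 7 scopes, hence 7 peers
assert num_gpu_peers(['scope_%d' % i for i in range(7)]) == 7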

    }
  }
  for (auto &s : Outputs(framework::GradVarName(kParameters))) {
    if (s == "@EMPTY@") {
A contributor commented:

Can @EMPTY@ be put into a constant?

@tonyyang-svail (Author) replied:

Sure.
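For example, a one-line sketch (the constant name is hypothetical):

EMPTY_VAR_NAME = '@EMPTY@'  # hypothetical constant replacing the magic string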

    VLOG(3) << "Moving " << s;
    CopyOrShare(*sub_scopes[0]->FindVar(s), place, scope.FindVar(s));
  }
  WaitOnPlaces(places);
A contributor commented:

Not related to this PR, but I am curious why parallel_do has to wait for all streams to complete. I thought even the executor does not wait.

@tonyyang-svail (Author) replied:

I can't think of a case where not waiting would be wrong. But just as we always join threads after launching them, it feels natural for parallel_do to wait for all its streams.

The contributor replied:

That could affect performance: we introduced a synchronization point that we are not sure we need.

@tonyyang-svail (Author) replied:

There are several places in parallel_do where we have to wait, so this line won't have a large effect.

def _callback_lookup_(op):
    """
    Only used in _append_backward_ops_.
    Builds and returns a callback function for a certain op. For example,
A contributor commented:

Can you add more comments about what a callback function is? (E.g., is it something that gets called after an op is completed?)
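For reference, the contract suggested by the surrounding code is callback(block, context), invoked by _append_backward_ops_ after each grad op is appended. A hedged sketch of what _callback_lookup_ might return; the op-type check and names are illustrative:

def _callback_lookup_sketch(op_type):
    # real code would inspect the op desc, not a bare string
    if op_type == 'parallel_do':
        def parallel_do_grad_callback(block, context):
            pass  # would append ncclInit / ncclAllReduce / assign op descs
        return parallel_do_grad_callback
    return None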

        op_desc = _create_op_desc_(
            "ncclInit",
            {"parallel_scopes": self.parallel_scopes_name},
            {"Communicator": ['nccl_com__do_not_change_']}, {})
A contributor commented:

Can you put nccl_com__do_not_change_ into a constant?
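E.g., a one-line sketch (the constant name is hypothetical):

NCCL_COMMUNICATOR_NAME = 'nccl_com__do_not_change_'  # hypothetical constant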

"X": [o_argu],
"Communicator":
['nccl_com__do_not_change_']
}, {"Out": [allreduce_out_name]},
A contributor commented:

allreduce_out_name is assigned to o_argu in the next op; why can't o_argu be the output here, so that we don't need the next assign op?

@tonyyang-svail (Author) replied:

ncclAllreduce requires a separate buffer to hold the result, i.e. it doesn't support in-place operation.

    else:
        new_callbacks = callbacks + [_callback_lookup_(op)]
    _append_backward_ops_(sub_block, sub_block.ops, grad_sub_block,
                          no_grad_dict, grad_to_var, new_callbacks)
A contributor commented:

Do we need to do callbacks = new_callbacks, since callbacks is used later as well?

@tonyyang-svail (Author) replied:

The callbacks used later should not contain _callback_lookup_(op).

helinwang previously approved these changes Feb 17, 2018

@helinwang (Contributor) left a comment:

LGTM!

@helinwang merged commit 633756a into PaddlePaddle:develop Feb 20, 2018