Feature/insert reduce_op to parallel exe #10096

Conversation

Contributor

@chengduoZH chengduoZH commented Apr 20, 2018

This PR's work includes:

  • Insert broadcast_op and reduce_op into the SSA graph.
  • Change the parameter-update strategy. If the model has 100 parameters and runs on 4 cards, under the new strategy each card updates only 25 of the parameters rather than all of them, and then broadcasts its updated parameters to the other cards. These operations are carried out by reduce_op and broadcast_op (a minimal sketch of the assignment follows this list).
  • If the program runs on GPU, broadcast_op calls ncclBcast; otherwise it falls back to tensor copying.
  • If reduce_op's inputs are sparse, reduce_op concatenates them; otherwise it uses NCCL or tensor copying according to the input's place.
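A minimal sketch of the parameter-to-device assignment described above, using hypothetical names rather than the PR's actual SSA-graph builder code:

#include <string>
#include <vector>

// Hypothetical helper (not the PR's code): round-robin ownership of
// parameters across devices. Device d reduces and updates owned[d], then
// broadcasts those parameters to the other devices.
std::vector<std::vector<std::string>> AssignParamsToDevices(
    const std::vector<std::string> &param_names, size_t num_devices) {
  std::vector<std::vector<std::string>> owned(num_devices);
  for (size_t i = 0; i < param_names.size(); ++i) {
    owned[i % num_devices].push_back(param_names[i]);
  }
  return owned;  // e.g. 100 parameters on 4 cards -> 25 per card
}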

I compared the performance before and after this optimization; the results are in seconds per batch:

        4-card, 4 threads   4-card, 8 threads   4-card, 16 threads
before  1.004974976         0.967834597         1.021792168
after   0.936332986         0.920828621         1.077100386

That is to say, parallel_exe now supports updating sparse parameters, and at the same time its performance is improved.

@reyoung reyoung force-pushed the feature/insert_reduce_to_parallel_exe branch from cf3a209 to a22d385 Compare April 20, 2018 09:40
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch 2 times, most recently from f9eafab to 898b196 Compare April 20, 2018 10:33
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch 7 times, most recently from f6113f6 to 9aa2b65 Compare April 20, 2018 17:05
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch 2 times, most recently from aa04680 to f5749fe Compare April 21, 2018 02:51
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from f5749fe to cec94e1 Compare April 21, 2018 03:56
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from b932a66 to 1389575 Compare April 22, 2018 14:45
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from 1389575 to 20ba594 Compare April 22, 2018 16:06
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch 2 times, most recently from b4f0207 to 8b0adbe Compare April 24, 2018 03:24
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from 8b0adbe to f965e9a Compare April 24, 2018 05:12
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from 9dd6517 to 7b58d47 Compare April 24, 2018 06:39
Contributor

@panyx0718 panyx0718 left a comment

Could you further verify the performance improvement in different environments? The improvement seems marginal in the 4-card, 8-thread case, while this change introduces significant complexity.

call();
}
}
if (*out_handle != *in_var_handle) {
Contributor

What is out_handle? What is the purpose of this code block? Can you add some comments to explain it?

Contributor Author

Broadcast sends the root device's data to all the other devices, so if the input and the output on the root device are different, we need to add a copy operation.

Contributor

What is the root device, and what are the input and output of the root device?

Contributor Author

Sorry, the description above was a bit obscure. I regard the source device, which sends the data to the other devices, as the root node.
The function of broadcast_op is to send the input data to all the devices, so one of the outputs must be on the same device as the input; if their values (tensor or selected_rows) are not the same variable, we have to do the memory copy manually.
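A minimal sketch of that idea with simplified types (not the actual BroadcastOpHandle): broadcast fills every non-root output, and the root's own output gets an explicit copy only when it is a different variable from the input.

#include <cstddef>
#include <functional>
#include <vector>

struct Var { int id; };  // stand-in for the variable behind a VarHandle

void RunBroadcast(Var *in, const std::vector<Var *> &outs, size_t root_idx,
                  const std::function<void(Var *, Var *)> &copy) {
  for (size_t i = 0; i < outs.size(); ++i) {
    if (i == root_idx) {
      // Broadcast does not write to the root itself; copy manually only if
      // the root's output is not the same variable as the input.
      if (outs[i] != in) copy(in, outs[i]);
    } else {
      copy(in, outs[i]);  // device-to-device broadcast in the real op
    }
  }
}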

BroadcastOpHandle(const std::vector<Scope *> &local_scopes,
const std::vector<platform::Place> &places);
const std::vector<platform::Place> &places, bool use_nccl)
Contributor

use_nccl_ should be false, so we don't need the use_nccl argument here?

Contributor Author

BroadcastOp can also broadcast CPU data, which cannot use NCCL.

Contributor

Right, that's why we don't need the use_nccl argument here for the CPU case?

Contributor Author

Thanks for pointing that out. I added this parameter in order to compare the performance of memory copying and ncclBcast; the result is that ncclBcast is faster, so use_nccl can be removed.

#endif

if (use_gpu_) {
#ifndef PADDLE_WITH_CUDA
Contributor

There are so many #ifdefs here; can we improve this block of code?
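One possible cleanup, sketched here with a hypothetical helper rather than the PR's code: confine the single PADDLE_WITH_CUDA check to one function so the call site has an ordinary branch instead of nested #ifdefs.

#include <stdexcept>

// Hypothetical helper: keep the preprocessor check in one place.
inline void BroadcastData(bool use_gpu) {
  if (use_gpu) {
#ifdef PADDLE_WITH_CUDA
    // ... NCCL broadcast path ...
#else
    throw std::runtime_error("built without CUDA but use_gpu is true");
#endif
  } else {
    // ... CPU tensor-copy path ...
  }
}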

// Wait input done, this Wait is asynchronous operation
WaitInputVarGenerated(in_var_handles);

std::vector<int64_t> out_rows;
std::vector<Tensor> in_tensors;
std::vector<platform::Place> in_places;
// std::vector<platform::Place> in_places;
Contributor

this shouldn't exist

Contributor Author

Done

in_places.push_back(in_p);
PADDLE_ENFORCE_EQ(in_p.which(), pre_place.which(),
"Places must be all on CPU or all on CUDA.");
// in_places.push_back(in_handle->place_);
Contributor

remove?

Contributor Author

Done

std::vector<std::unordered_set<std::string>> var_name_on_devices;
std::vector<std::unordered_set<std::string>> bcast_var_name_set;

var_name_on_devices.resize(places_.size());
Contributor

should this be reserve()? Is it different from resize?

Contributor Author

Refer to: https://stackoverflow.com/questions/7397768/choice-between-vectorresize-and-vectorreserve

The resize() method (and passing an argument to the constructor is equivalent to that) will insert or delete the appropriate number of elements to make the vector the given size (it has an optional second argument to specify their value). It affects size(), iteration will go over all those elements, push_back will insert after them, and you can access them directly using operator[].

The reserve() method only allocates memory and leaves it uninitialized. It only affects capacity(); size() is unchanged. There is no value for the objects, because nothing is added to the vector. If you then insert elements, no reallocation will happen, because it was done in advance, but that is the only effect.
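A small self-contained demo of the difference (standard C++, nothing Paddle-specific): resize() changes size() and value-initializes the elements, so operator[] is immediately valid, while reserve() only raises capacity().

#include <cassert>
#include <vector>

int main() {
  std::vector<int> a, b;
  a.resize(4);   // size() == 4, elements value-initialized to 0
  b.reserve(4);  // size() == 0, capacity() >= 4, no elements yet
  assert(a.size() == 4 && a[2] == 0);
  assert(b.size() == 0 && b.capacity() >= 4);
  b.push_back(1);  // no reallocation: capacity was reserved in advance
  return 0;
}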

@@ -55,7 +55,7 @@ struct ReduceOpHandle : public OpHandleBase {

std::string Name() const override;

bool IsMultiDeviceTransfer() override { return false; };
bool IsMultiDeviceTransfer() override { return true; };
Contributor

I'm not so sure about this now that many things have changed.

@@ -28,7 +28,9 @@ def __init__(self,
loss_name=None,
main_program=None,
num_threads=None,
threads_per_dev=None,
Contributor

It seems confusing to have both num_threads and threads_per_dev

Contributor Author

Done

auto dev_ctx = dev_ctxes_.at(out_p);
RunAndRecordEvent(out_p, [in_tensor, out_var, dev_ctx, out_p] {
paddle::framework::TensorCopy(
in_tensor, out_p, *(dev_ctx),
Contributor

No need for the parentheses around dev_ctx?

Contributor Author

Have removed.

auto *out = out_var_handles[j];
auto *out_var = var_scopes.at(out->scope_idx_)->FindVar(out->name_);

if (*out != *in_var_handle) {
Contributor

Are you comparing VarHandle instead of VarHandle*? Is it correct?
Hopefully, we don't use "auto" everywhere.

Contributor Author

Are you comparing VarHandle instead of VarHandle*?

Line 90 compares the instances, not the pointers.

Is it correct?

The pointers of the input (in_var_handle) and the output (out) must be different, because the output VarHandle is always newly created (see the linked code).

Hopefully, we don't use "auto" everywhere.

Thanks, I will correct that later.

}

int type = platform::ToNCCLDataType(in_tensor.type());
all_reduce_calls.emplace_back([=] {
Contributor

it is "broadcast" not "all_reduce"?

Contributor Author

Yes, thanks!

@@ -30,7 +30,8 @@ def __init__(self,
num_threads=None,
allow_op_delay=False,
share_vars_from=None,
customize_loss_grad=False):
customize_loss_grad=False,
use_nccl_allreduce=True):
Contributor

any comments?

Contributor Author

Done

@@ -565,24 +568,31 @@ def test_parallel_testing(self):
depth = 8
mix_hidden_lr = 1e-3
embedding_name = 'emb'
is_sparse = True
use_nccl_allreduce = False
Contributor

Does it cover is_sparse=False and use_nccl_allreduce=True?

Contributor Author

Thanks, I will add a unit test.


bool operator!=(const VarHandle& o) const {
return o.generated_op_ != generated_op_ || o.name_ != name_ ||
o.scope_idx_ != scope_idx_;
Contributor

@panyx0718 panyx0718 Apr 24, 2018

what about version_ and place_? this seems dangerous

Contributor Author

Thanks. This operator is used here:

if (*out != *in_var_handle)

Their version_ can be different, but the place_... Let me think about it again.

Collaborator

There is no need to implement != field by field; you can just return !this->operator==(o);

Contributor Author

Done.
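A minimal sketch of the suggested simplification, assuming VarHandle defines (or gains) a matching operator== over the same fields; the member != is then just its negation:

// Assumes VarHandle::operator== already exists; note the !, which is easy
// to drop by accident and would otherwise invert the meaning.
bool VarHandle::operator!=(const VarHandle &o) const {
  return !(*this == o);
}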

@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch 3 times, most recently from 55d542c to be5a84b Compare April 24, 2018 16:11
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from be5a84b to ea78be2 Compare April 25, 2018 11:19
@chengduoZH chengduoZH force-pushed the feature/insert_reduce_to_parallel_exe branch from 93daa01 to c0a3746 Compare May 2, 2018 10:39
@chengduoZH chengduoZH closed this May 25, 2018