
listen_and_serv_op support async update #10042

Merged

Conversation


@jacquesqiao (Member) commented on Apr 19, 2018

fix: #9997

@jacquesqiao changed the title from "[WIP] listen_and_serv_op support async update" to "listen_and_serv_op support async update" on Apr 24, 2018
std::unordered_map<std::string, int32_t> grad_to_id;
std::unordered_map<int32_t, std::string> id_to_grad;

auto grad_to_id_str = Attr<std::vector<std::string>>("grad_to_id");
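For context, each entry of this attribute presumably pairs a gradient name with the id of its optimize block. A minimal parsing sketch, assuming an entry format like "w1@GRAD:1" and a Split helper, neither of which is shown in this excerpt:

// Sketch: fill both lookup tables from entries like "w1@GRAD:1".
// Split is a hypothetical helper that splits a string on a delimiter.
for (const auto &entry : grad_to_id_str) {
  std::vector<std::string> pieces = Split(entry, ':');
  int32_t block_id = static_cast<int32_t>(std::stoi(pieces[1]));
  grad_to_id[pieces[0]] = block_id;
  id_to_grad[block_id] = pieces[0];
}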
Contributor:

grad_to_id_str could be generated by listen_and_serv_op at initialization: read the ProgramDesc blocks and build the mapping there, so that we could avoid this attribute altogether.

Member Author:

I think grad_to_id_str should be created in Python by the transpiler, because the transpile logic knows how to split the operators and blocks. It is fine for listen_and_serv_op to just use the result; otherwise it would have to understand the detailed logic of the transpiler.

Contributor:

First, we want to make listen_and_serv_op general, which means it should not know whether its attributes and inputs are "grads" or parameters; it should simply receive the data and run a block.

In that case, for async execution, listen_and_serv is responsible for determining which block to run when the data arrives. Just opening this up for discussion.

Member Author:

After discussing with @typhoonzero, I get the point, and I totally agree that listen_and_serv_op should be a general operator! We will find a better way to implement the async update in future PRs.

@@ -30,9 +30,13 @@ enum CallStatus { PROCESS = 0, FINISH };
class RequestBase {
public:
explicit RequestBase(GrpcService::AsyncService* service,
-                        ::grpc::ServerCompletionQueue* cq,
+                        ::grpc::ServerCompletionQueue* cq, bool sync_mode,
Contributor:

Maybe is_sync, or just sync, would convey the meaning better?

Member Author:

I think sync_mode says the server works in a synchronous mode, whereas is_sync suggests the object itself is synchronous. So I think sync_mode is better.

Contributor:

I see.

auto optimize_prepared = executor->Prepare(*program, block_list);
std::unordered_map<std::string,
std::shared_ptr<framework::ExecutorPrepareContext>>
grad_to_prepared;
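Presumably the prepared contexts are then keyed by gradient name so an arriving variable can select the block to run. A sketch under that assumption (pairing block_list entries with the id_to_grad map from the earlier snippet; the map is later renamed grad_to_prepared_block, see below):

// Sketch: Prepare() returns one context per block id in block_list;
// key each context by the gradient that block optimizes.
for (size_t i = 0; i < block_list.size(); ++i) {
  grad_to_prepared[id_to_grad[block_list[i]]] = optimize_prepared[i];
}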
Contributor:

grad_to_prepared_block

Member Author:

done

LOG(ERROR) << "run sub program error " << e.what();
}
});
// TODO(qiao) maybe we can remove this
Contributor:

Removing this would make the mode even more "async": the trainer wouldn't even know whether the gradient it sent has been applied to the server-side weights before it fetches the latest weights. Or do you mean letting updates to different weights run in parallel?

Member Author:

If we keep this wait, the current implementation updates gradients in sequence. This may affect training quality; I will run some tests on it.

Member Author:

After discussing with @typhoonzero, we think each gradient should be put into an independent blocking queue to ensure the updates don't conflict.

Contributor:

each gradient should be put into an independent blocking queue

Do you mean each gradient of one parameter? For example, given grad_w1(trainer0), grad_w1(trainer1), and grad_w2(trainer0), would we put grad_w1(trainer0) and grad_w1(trainer1) into one queue, and grad_w2(trainer0) into another?

According to the design doc, maybe we need multiple BlockingQueues, one owned by each parameter, to implement a per-parameter update lock.

Member Author:

Yes, we need multiple blocking queues, each storing the gradients for one parameter. But we don't need to add a lock, because the queue blocks until the optimize block has finished.
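A minimal, self-contained sketch of that idea (all names here are illustrative, not the PR's actual implementation): one blocking queue per parameter, each drained by a single consumer thread, serializes updates to one parameter while different parameters update in parallel, with the queue's internal mutex as the only synchronization.

#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Minimal blocking queue: Pop() blocks until an item is available.
template <typename T>
class BlockingQueue {
 public:
  void Push(T item) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      items_.push_back(std::move(item));
    }
    cv_.notify_one();
  }
  T Pop() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !items_.empty(); });
    T item = std::move(items_.front());
    items_.pop_front();
    return item;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<T> items_;
};

// Usage sketch: one queue and one consumer thread per parameter, so
// updates to a single parameter never overlap and need no extra lock,
// while different parameters are updated in parallel.
// apply_gradient is a hypothetical stand-in for running the
// parameter's optimize block.
void StartConsumers(
    std::unordered_map<std::string, BlockingQueue<std::string>>& queues,
    void (*apply_gradient)(const std::string& param, const std::string& grad),
    std::vector<std::thread>* workers) {
  for (auto& kv : queues) {
    const std::string& param = kv.first;
    BlockingQueue<std::string>& queue = kv.second;
    workers->emplace_back([&param, &queue, apply_gradient] {
      for (;;) {
        std::string grad = queue.Pop();  // blocks until a gradient arrives
        apply_gradient(param, grad);     // serialized per parameter
      }
    });
  }
}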

queue_(queue),
responder_(&ctx_) {
if (sync_mode_) {
request_.reset(new VariableResponse(scope, dev_ctx_, false));
Contributor:

request_.reset(new VariableResponse(
scope,
dev_ctx_,
!sync_mode_ // create_scope
));

Member Author:

I thought about this for a while, and I think the current code makes the intent easier for readers to understand.

@@ -61,7 +63,7 @@ class VariableResponse {
// other: number of error field.
int Parse(const ::grpc::ByteBuffer& byte_buffer);

-  const framework::Scope& GetLocalScope() const { return *local_scope_; }
+  framework::Scope& GetLocalScope() const { return *local_scope_; }
Contributor:

Consider a GetMutableLocalScope that returns a pointer, so we can avoid removing the const?
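That is, keep the const accessor and add a mutable one, roughly as follows (a sketch assuming local_scope_ is a pointer member, as the existing body suggests):

const framework::Scope& GetLocalScope() const { return *local_scope_; }
framework::Scope* GetMutableLocalScope() { return local_scope_; }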

Member Author:

done

@@ -221,6 +327,12 @@ from send_op and send back variables to recv_op.
"IP address to listen on.")
.SetDefault("127.0.0.1:6164")
.AddCustomChecker([](const std::string &ip) { return !ip.empty(); });
AddAttr<std::vector<std::string>>(
"grad_to_id",
Contributor:

grad_to_block_id?

Member Author:

done
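For reference, the renamed attribute declaration might read roughly as follows (the doc string, example entries, and default are illustrative, not copied from the PR):

AddAttr<std::vector<std::string>>(
    "grad_to_block_id",
    "a map from gradient name to the id of the sub-block that optimizes it, "
    "e.g. ['param1@GRAD:1', 'param2@GRAD:2']")
    .SetDefault({});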

@@ -143,7 +143,8 @@ def transpile(self,
program=None,
pservers="127.0.0.1:6174",
trainers=1,
-                  split_method=splitter.round_robin):
+                  split_method=splitter.round_robin,
+                  sync_mode=True):
Contributor:

This new parameter needs a comment.

Member Author:

done

AsyncExecuteBlock(executor, grad_to_prepared_block[recv_var_name].get(),
v.second->GetMutableLocalScope());
// TODO(qiao): explain why
if (var->IsType<framework::SelectedRows>()) {
Contributor:

Maybe we don't need to clear the rows, because each gradient var is in a new scope.

Member Author:

Great suggestion! Removed.
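For reference, the removed guard presumably reset the rows of a SelectedRows gradient; a reconstruction from the truncated snippet above (the body shown here is an assumption, not the PR's exact code):

// No longer needed: each gradient arrives in a fresh scope, so stale
// rows cannot leak from one update into the next.
if (var->IsType<framework::SelectedRows>()) {
  var->GetMutable<framework::SelectedRows>()->mutable_rows()->clear();
}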

@typhoonzero (Contributor) left a comment:

It LGTM now. Does the Python wrapping in io.py need to be updated in this PR, or later?

@jacquesqiao (Member Author):

@typhoonzero The CI is too slow; I will open another PR to fix io.py.

@Yancey1989 (Contributor) left a comment:

LGTM++

@jacquesqiao jacquesqiao merged commit 6d93456 into PaddlePaddle:develop Apr 26, 2018