Reduce memory copy when communicating between trainer and pserver. #9271

gongweibao · 2018-03-21T01:11:01Z

No description provided.


typhoonzero · 2018-03-21T06:11:07Z

benchmark/cluster/vgg16/vgg16_fluid.py

@@ -237,6 +242,8 @@ def train_loop(exe, trainer_prog):
            "TRAINING_ROLE",
            "TRAINER")  # get the training role: trainer/pserver

+        #print(debuger.pprint_program_codes(fluid.default_main_program().desc))


Remove comments.

typhoonzero · 2018-03-21T06:11:14Z

benchmark/cluster/vgg16/vgg16_fluid.py

@@ -251,6 +258,7 @@ def train_loop(exe, trainer_prog):
            if not current_endpoint:
                print("need env SERVER_ENDPOINT")
                exit(1)
+            print("get_pserver_program")


remove print

Not removed yet

typhoonzero · 2018-03-21T06:12:47Z

benchmark/cluster/vgg16/vgg16_tf_local.py

@@ -0,0 +1,315 @@
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.


Why not use an argument to decide whether to run TF locally or not?
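One way this suggestion could look is a single script with a command-line flag that switches between local and distributed runs. The sketch below is illustrative only: the `--local` flag and the `parse_args` and `select_mode` helpers are hypothetical names, not taken from this PR.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options (hypothetical flag names, for illustration)."""
    parser = argparse.ArgumentParser(
        description="VGG16 TensorFlow benchmark (sketch)")
    # Assumption: one boolean flag is enough to choose the run mode.
    parser.add_argument(
        "--local",
        action="store_true",
        help="run on a single machine instead of a trainer/pserver cluster")
    return parser.parse_args(argv)


def select_mode(args):
    """Return which code path the benchmark script should take."""
    return "local" if args.local else "distributed"


if __name__ == "__main__":
    print(select_mode(parse_args()))
```

Passing `--local` on the command line would select the local path; omitting it keeps the distributed path, so one file could cover both cases.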

typhoonzero · 2018-03-21T06:13:30Z

paddle/fluid/framework/threadpool.cc

@@ -32,7 +32,8 @@ void ThreadPool::Init() {
    // TODO(Yancey1989): specify the max threads number
    int num_threads = std::thread::hardware_concurrency();
    PADDLE_ENFORCE_GT(num_threads, 0);
-    threadpool_.reset(new ThreadPool(num_threads));
+    // threadpool_.reset(new ThreadPool(num_threads));
+    threadpool_.reset(new ThreadPool(1));


This change should be reverted.

typhoonzero · 2018-03-21T06:24:50Z

paddle/fluid/operators/detail/bytebuffer_stream.h

+  ::google::protobuf::io::ZeroCopyInputStream* contents() override {
+    DeleteStream();
+    stream_ = new (&space_) Reader(buffer_);
+    return stream_;


Why do we need so many wrappers? Just creating the ::grpc::GrpcBufferReader as a ZeroCopyInputStream could be simpler.

An abstract type cannot be used as a parameter type.

Then make it so there is no abstract type at all. The interface is simple enough to understand without this abstraction.

I don't quite follow.
The Parse function needs to support two parameter types, ByteBuffer and grpc_byte_buffer. Both can be converted to a ZeroCopyInputStream, but ZeroCopyInputStream itself cannot be used as a parameter type.

typhoonzero · 2018-03-21T06:24:56Z

paddle/fluid/operators/detail/grpc_client.cc

+    struct timeval t0_wait, t1_wait;
+    gettimeofday(&t0_wait, 0);
+    std::thread::id this_id = std::this_thread::get_id();
+    */


remove comments

typhoonzero

LGTM, a lot of work to finish the SerialiseTraits

typhoonzero · 2018-03-21T11:31:18Z

benchmark/cluster/vgg16/vgg16_fluid.py

@@ -251,6 +258,7 @@ def train_loop(exe, trainer_prog):
            if not current_endpoint:
                print("need env SERVER_ENDPOINT")
                exit(1)
+            print("get_pserver_program")


Not removed yet

gongweibao · 2018-03-21T11:59:20Z

Done.


gongweibao added 24 commits March 12, 2018 09:59

init

1377f06

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

0824a8a

… optsend

modify py

a0ecc4e

fix debugger

0aba233

add test_debugger.py

2de648d

test

022479d

fix learningratebugs

32fe144

modify grpc

f45b065

merge

9b3197b

compile ok

d159525

add client

55ccdb2

add testtime

12c8c06

add some

ff01579

add test_serde

69b6ee5

add selected rows

0e404ac

test gpu and cpu ok

476b56c

add selectrows

91e154b

clean up

6907810

add serializetraits

d3cd31b

compile ok

16c6e7e

fix bugs

c32f228

merge

9d028f9

cleanup some

9b7a35c

add print

ac081a2

typhoonzero reviewed Mar 21, 2018


gongweibao added 2 commits March 21, 2018 06:41

clean up

735a5c0

cleanup

e25a270

gongweibao requested review from Yancey1989, helinwang and typhoonzero March 21, 2018 07:15

fix by comments

0d36059

gongweibao changed the title ~~[WIP]Reduce memory copy when communicating between trainer and pserver.~~ Reduce memory copy when communicating between trainer and pserver. Mar 21, 2018

gongweibao mentioned this pull request Mar 21, 2018

do not copy when deserialize #9209

Closed

rename tensorparser

011c909

typhoonzero previously approved these changes Mar 21, 2018


cleanup codes

27c1ed7

gongweibao dismissed typhoonzero’s stale review via 27c1ed7 March 21, 2018 11:58

typhoonzero approved these changes Mar 22, 2018


gongweibao merged commit 990d639 into PaddlePaddle:develop Mar 22, 2018

gongweibao deleted the optsend branch March 22, 2018 02:47
