
Send recv op #5520

Merged
merged 32 commits on Nov 28, 2017
Conversation

typhoonzero
Contributor

@typhoonzero typhoonzero commented Nov 9, 2017

Implemented following this design.

Still a work in progress; this enables distributed training using gRPC.

  • The RPC communication implementation is kept separate (under detail), so we can switch to other libraries quickly.
  • send_op and recv_op
  • C++ unit tests

TODO in later PRs:

  • add RemoteOptimizer and unit tests
  • add benchmarks
  • add a queue op to buffer tensors for send and recv

@typhoonzero typhoonzero changed the title [WIP] Send recv op Send recv op Nov 16, 2017
@jacquesqiao
Member

Why choose gRPC rather than baidu-rpc?

for (size_t len : lod_length[i]) {
level.push_back(level.back() + len);
}
}
}

void SerializeToStream(std::ostream &os, const LoDTensor &tensor,
Contributor

@helinwang I think the serialize function can be put into tensor_util.h, which is part of PR #5455.
Did you change your viewpoint after reading the Boost serialization reference manual?

Contributor

@helinwang helinwang Nov 17, 2017

@dzhwinter Thanks for asking! It's OK if the only tensor type we support, now and in the future, is LoDTensor (so we don't need polymorphism). Otherwise I think it would be better to have serialization as an interface that every tensor type implements.

Contributor Author

@typhoonzero typhoonzero Nov 20, 2017

Putting non-member functions in separate source files seems good. Also, we may need to serialize SelectedRows in the future, so rather than putting the serialize functions in a util file, at this point I agree with @helinwang that we should have "serialization as an interface that every tensor type implements". Would you mind updating PR #5455 to add the serialization functions?

Contributor

@helinwang
We need to support multiple types (LoDTensor, SelectedRows, LoDTensorArray), so we can make it a template free function (global function).
And I think:

  1. Polymorphism is not a real demand in this application. When you do a serialization, you have an object with a proper type, so you can use a reference instead of a pointer. The class always provides sufficient information to serialize itself.
    http://www.boost.org/doc/libs/1_45_0/libs/serialization/doc/serialization.html

  2. Other popular libraries, such as protobuf and msgpack, also serialize objects through free functions.

MsgPack::Serializer serializer(socket);  
std::vector<std::unique_ptr<MsgPack::Element>> arrayWithoutElements, arrayWith3Elements;
arrayWith3Elements.push_back(MsgPack::Factory(true));
arrayWith3Elements.push_back(MsgPack__Factory(Array(std::move(arrayWithoutElements))));
arrayWith3Elements.push_back(MsgPack::Factory("Hello World!"));  
serializer << MsgPack__Factory(Array(std::move(arrayWith3Elements)));

MsgPack::Deserializer deserializer(socket);  
deserializer.deserialize([](std::unique_ptr<MsgPack::Element> parsed) {
    std::cout << "Parsed: " << *parsed << "\n";
    return false;
}, true);

Contributor

If we treat Tensor and the operators as a computing library, then we should use free functions. That is, just treat Tensor as a third-party library.
@typhoonzero @helinwang
@typhoonzero @helinwang

Contributor Author

  1. If we treat Tensor and the operators as a computing library, then we'll have to describe the data inside each type to be serialized. We are using xxxDesc for now, e.g.
    message TensorDesc {
      required DataType data_type = 1;
      repeated int64 dims = 2; // [UNK, 640, 480] is saved as [-1, 640, 480]
    }

  2. Using the desc and the data pointers is enough for a "free function" to serialize it.

ServerBuilder builder;
builder.AddListeningPort(server_address, grpc::InsecureServerCredentials());
builder.RegisterService(service.get());
// rpc_server.reset(new Server(builder.BuildAndStart());
Contributor
No commented-out code lines, please.

Contributor Author

Done

@typhoonzero
Contributor Author

@jacquesqiao I put the RPC implementation in a separate place, so it'll be easy to switch. I'll add a benchmark for training-job metrics like throughput (by model size), since brpc's benchmarks are mainly about small queries.

Contributor

@helinwang helinwang left a comment

Thank you! Did you get a chance to profile the serialization speed of the protobuf string (used in tensor serialization)?

* You can pass ofstream or ostringstream to serialize to a file
* or to an in-memory string. A GPU tensor will be copied to CPU.
*/
void SerializeToStream(std::ostream& os, const LoDTensor& tensor,
Contributor

For discussion: consider making SerializeToStream and DeserializeFromStream member functions of LoDTensor. Reason stated here: https://github.com/PaddlePaddle/Paddle/pull/5520/files#r151577859
CC: @dzhwinter

namespace operators {
namespace detail {

Status SendRecvServerImpl::InitVariables(
Contributor

Is this for initializing variables? I think send/recv should not have anything to do with initializing the variables.

Contributor Author

Yep. Will remove.

namespace operators {
namespace detail {

bool RPCClient::InitVariables(const framework::Scope& scope,
Contributor

Is this for initializing variables? I think send/recv should not have anything to do with initializing the variables.

namespace operators {
namespace detail {

bool RPCClient::InitVariables(const framework::Scope& scope,
Contributor

@helinwang helinwang Nov 17, 2017

The names RPCClient and SendRecvServerImpl do not match as a pair, and maybe the names should state that they are related to send/recv, so perhaps something like sendImpl / recvImpl?

// TODO(typhoonzero): deserialize in_tensor and run the pserver network.
std::istringstream iss(in_var->serialized());
framework::DeserializeFromStream(iss, &t);
lodtensor_queue_.Push(std::move(t));
Contributor

From my understanding, each send/recv connects one edge in the graph between nodes, so why is a queue (for multiple values) necessary?

namespace paddle {
namespace operators {

void RunServer(Server **rpc_server,
Contributor

Do we currently start one server per recv OP? There could easily be tens or even hundreds of send/recv OP pairs; maybe we should start only one server. E.g., the send/recv queue is an argument to the constructor of the send/recv OP, and there is only one single instance of the send/recv server.

Contributor Author

This will be done in the next PR. From my current thinking, we'll need a RemoteOptimizer interface to do this work: it will create the server-side program and put it in the recv op as a member block to run.

Collaborator

@wangkuiyi wangkuiyi left a comment

I am afraid that this PR is too big and could be slow to merge. It looks to me like many pieces, e.g., the introduction of gRPC, could be separate PRs.

.clang-format Outdated
@@ -24,5 +24,8 @@ Standard: Cpp11
AllowAllParametersOfDeclarationOnNextLine: true
BinPackParameters: false
BinPackArguments: false
---
Language: Proto
# Don't format .proto files.
Collaborator

Why not format .proto files?

Contributor Author

Removed these lines. I'm not sure why my previous style check wouldn't pass.

@@ -133,6 +133,8 @@ include(external/any) # download libn::any
include(external/eigen) # download eigen3
include(external/pybind11) # download pybind11
include(external/nccl)
include(external/cares)
include(external/grpc)
Collaborator

I vaguely remember you said you wanted to try brpc? @typhoonzero

Contributor Author

Yep. Will add a benchmark in the next PR to decide which to use. That won't affect the current code structure.

@@ -467,3 +467,43 @@ function(py_test TARGET_NAME)
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})
endif()
endfunction()


function(grpc_library TARGET_NAME)
Collaborator

We need a comment for this function.

Contributor Author

Done.

# See the License for the specific language governing permissions and
# limitations under the License.
#

Contributor

Are cares and grpc needed in all kinds of Paddle builds? I suggest adding an if statement here to return early when they are not needed, or at least adding the following to avoid breaking the build for mobile:

IF(MOBILE_INFERENCE)
    return()
ENDIF()

Contributor Author

Sure. Will add. Thanks for reminding.

Contributor Author

Done.

Contributor

@dzhwinter dzhwinter left a comment

This prototype has a few differences from our design doc. @helinwang https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/refactor/parameter_server.md

  1. This PR splits the roles of trainer and pserver, which should be isomorphic.
  2. The SimpleBlockingQueue needs to support an asynchronous Send operation, just like the rendezvous in TensorFlow.
  3. The RPC service only has one SendVariable interface; can it satisfy a non-blocking function call such as Send?
  4. EmptyVariable in gRPC seems ugly; do we have a better way to improve it?
    @helinwang

@dzhwinter
Contributor

Since this prototype is separate from our main develop branch, I think we can merge this PR and enhance the implementation later. BTW, there are conflicts. @typhoonzero

@typhoonzero
Contributor Author

This PR splits the roles of trainer and pserver, which should be isomorphic.

No. This is just a basic implementation of the send/recv ops. We must have higher-level wrappers for how to use them, as in the unit test's code. I'm afraid the high-level API must split the roles of trainer and pserver to keep the API simpler.

@dzhwinter
Contributor

ok. I see.

Contributor

@dzhwinter dzhwinter left a comment

LGTM.
