Split send_op into fetch_vars_op and send_vars_op #9161

Closed
2 of 4 tasks
Yancey1989 opened this issue Mar 16, 2018 · 6 comments

Yancey1989 commented Mar 16, 2018

Currently, the trainer sends all gradients only after all the backward ops have executed, like:

w1-->opA->w2->opB->opB(backward)->w2'->opA(backward)->w1'->send(w1',w2')

In the above process, the send op does not send any gradient until all the forward and backward ops are done.

But actually, we could send w2' right after opB(backward) and w1' right after opA(backward); executing computing ops and IO ops in parallel would improve performance. On the other hand, the current SendOp does not only SEND: it also waits for all send requests to finish and then receives the updated parameters from the pserver, so we also need to split these responsibilities into multiple ops.

For sync update:

fetch(w1)-->opA->fetch(w2)->opB->opB(backward)->w2'->send(w2')->opA(backward)->w1'->send(w1')->send_barrier()

For async update, there is no send_barrier() op at the end of the process:

fetch(w1)-->opA->fetch(w2)->opB->opB(backward)->w2'->send(w2')->opA(backward)->w1'->send(w1')
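A minimal C++ sketch of the intended split (GradSender and RpcSendGrad are hypothetical names, not PaddlePaddle's actual API): send(w') fires the RPC asynchronously and returns immediately, while send_barrier() waits for every outstanding request:

```cpp
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for the RPC call that ships one gradient
// to the pserver.
void RpcSendGrad(const std::string& grad_name) {
  // ... issue the request and wait for its completion ...
}

class GradSender {
 public:
  // send(w'): launch the request on a background thread and return
  // immediately so the next backward op can start.
  void AsyncSend(const std::string& grad_name) {
    pending_.push_back(
        std::async(std::launch::async, RpcSendGrad, grad_name));
  }

  // send_barrier(): block until every queued send has finished.
  void Barrier() {
    for (auto& f : pending_) f.wait();
    pending_.clear();
  }

 private:
  std::vector<std::future<void>> pending_;
};
```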

TODO

  • Implement AsyncSendOp, SendBarrierOp.
  • Implement an IO threadpool to handle async sends (see the sketch after this list).
  • Enhance the distribute transpiler to emit the async send op.
  • Update the benchmark report.
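A rough sketch of such an IO threadpool, built only on the C++ standard library (the name IOThreadPool is hypothetical); blocking send/recv tasks would be queued here so they never occupy a computing thread:

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class IOThreadPool {
 public:
  explicit IOThreadPool(std::size_t num_threads) {
    for (std::size_t i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // run the blocking send/recv outside the lock
        }
      });
    }
  }

  // Called by the executor: hand a blocking IO task to the pool.
  void Enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  ~IOThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

 private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_ = false;
};
```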
@Yancey1989 Yancey1989 changed the title Send gradient immediately after the execution of backward op Async send gradient after execution of backward op Mar 16, 2018

helinwang commented Mar 16, 2018

Thank you! This is indeed very important.

  1. One question: do we need a sync recvOP or an async recvOP (same for the send OP)?

    @reyoung is leading the effort on a C++ implementation of the parallel executor, which analyzes dependencies and automatically runs different OPs in parallel. I think the parallel executor assumes all CPU OPs are sync OPs (i.e., once Run finishes, the OP no longer reads its input or writes its output). This matters because an OP that uses the output of a previous OP needs to wait for that OP to finish before starting execution, and there is currently no way to know when an async CPU OP has finished.

    For example:

    Async CPU recv OP that receives tensor A -> GPU OP that uses tensor A
    

    The second OP needs tensor A to be fully received before it starts running.

    My understanding could be wrong; maybe you can sync with @reyoung so that he knows the distributed-training use case, and the parallel executor and send/recv operators can work together properly.

  2. Another question: if we decide to use sync send/recv operators, they will block executor threads. How many threads do we want the executor to have?

  3. One solution is for every async operator to somehow notify the executor (using a callback or a future), so that the operator stays async but the executor knows when it has actually finished.
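A small C++ sketch of that third option, with std::future as the completion signal (AsyncOpHandle and its methods are hypothetical, not the executor's real interface):

```cpp
#include <future>

class AsyncOpHandle {
 public:
  // Run() starts the IO on a background thread and returns at once,
  // keeping a future the executor can query.
  void Run() {
    done_ = std::async(std::launch::async, [this] { DoIO(); });
  }

  // The executor calls this before scheduling any op that consumes
  // this op's output.
  void WaitDone() {
    if (done_.valid()) done_.wait();
  }

 private:
  void DoIO() { /* the actual blocking send/recv */ }
  std::future<void> done_;
};
```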


Yancey1989 commented Mar 19, 2018

Hi @helinwang, thanks for your comments.

One question: do we need a sync recvOP or an async recvOP (same for the send OP)?

I had a short discussion with @reyoung at the office; maybe we don't need an async Recv, for example:

forward:          fc1(w1)->send1(w1')->fc2(w2)->send2(w2')->fc3(w3)->send3(w3')->send_barrier()...
backpropagation:  recv1(w1)->fc1'(w1)->recv2(w2)->fc2'(w2)->recv3(w3)->fc3'(w3)->recv_barrier()...

There is no dependency among the RecvOps, so ParallelExecutor will execute them concurrently:

thread0:  Wait(send_barrier)-->recv1(w1)-->Wait(recv1)-->fc1'(w1)
thread1:  Wait(send_barrier)-->recv2(w2)-->Wait(fc1', recv2)-->fc2'(w2)
thread2:  Wait(send_barrier)-->recv3(w3)-->Wait(fc2', recv3)-->fc3'(w3)
thread3:  Wait(recv1, recv2, recv3)-->recv_barrier()

So it looks like RecvOps would not block the computing ops; maybe we just need a sync RecvOp, and ParallelExecutor will execute them concurrently.
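A toy C++ illustration of this point (std::async stands in for the executor's worker threads; this is not ParallelExecutor's real code): each sync recv blocks only its own thread, so it delays nothing but the ops that consume its output:

```cpp
#include <future>

void BackwardPass() {
  // Each sync RecvOp runs on its own worker thread.
  auto recv1 = std::async(std::launch::async, [] { /* blocking recv of w1 */ });
  auto recv2 = std::async(std::launch::async, [] { /* blocking recv of w2 */ });
  auto recv3 = std::async(std::launch::async, [] { /* blocking recv of w3 */ });

  recv1.wait();  // fc1' depends only on w1; recv2/recv3 are still in flight
  /* run fc1'(w1) */
  recv2.wait();
  /* run fc2'(w2) */
  recv3.wait();
  /* run fc3'(w3) */
}
```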


Yancey1989 commented Mar 19, 2018

Another question: if we decide to use sync send/recv operators, they will block executor threads. How many threads do we want the executor to have?

I'm not sure; maybe we need to do some experiments, or maybe we need a separate IO ThreadPool with more threads than the computing ThreadPool. Either way, I think we need a parameter to configure the ThreadPool size.

One solution is for every async operator to somehow notify the executor (using a callback or a future), so that the operator stays async but the executor knows when it has actually finished.

As in the comment above, maybe we only need more threads, so that we don't block the computing threads?
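A sketch of the configuration this implies (the struct and field names are hypothetical): the computing pool stays sized roughly to the CPU cores, while the IO pool is larger and separately configurable because its threads mostly block on RPC:

```cpp
#include <cstddef>

// Hypothetical strategy struct; the concrete knob names would live in
// the executor's configuration.
struct ExecutorStrategy {
  std::size_t num_compute_threads = 4;  // ~= number of CPU cores
  std::size_t num_io_threads = 16;      // larger: these mostly block on RPC
};
```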


gongweibao commented Mar 21, 2018

A simple picture:
[image attached]

helinwang commented

@Yancey1989 thanks! Maybe in the future we need two types of threads: one for computing, one for IO.

@gongweibao thanks for the picture!

Yancey1989 commented

thanks! Maybe in the future we need two types of threads: one for computing, one for IO.

Agreed; I have added it to the TODO list and will implement it ASAP.

@Yancey1989 Yancey1989 changed the title Async send gradient after execution of backward op Split send_op into fetch_vars_op and send_vars_op May 3, 2018
PaddlePaddle Distributed Refactoring (Due: 201802) automation moved this from Perf TODOs to DONE May 29, 2018