Split send_op into fetch_vars_op and send_vars_op #9161

Closed
2 of 4 tasks
Yancey1989 opened this issue Mar 16, 2018 · 6 comments

Yancey1989 commented Mar 16, 2018

Currently, the trainer sends all gradients only after all the backward ops have executed, like:

w1-->opA->w2->opB->opB(backward)->w2'->opA(backward)->w1'->send(w1',w2')

In the above process, the send op does not send any gradient until all the forward and backward ops are done.

But actually, we could send w2' right after opB(backward) and w1' right after opA(backward); executing computing ops and IO ops in parallel would improve performance. On the other hand, the current SendOp does not only SEND: it also waits for all send requests to finish and then receives the updated parameters from the pserver, so we also need to split these responsibilities into multiple ops.

For sync update:

fetch(w1)-->opA->fetch(w2)->opB->opB(backward)->w2'->send(w2')->opA(backward)->w1'->send(w1')->send_barrier()

For async update, there is no send_barrier() op at the end of the process:

fetch(w1)-->opA->fetch(w2)->opB->opB(backward)->w2'->send(w2')->opA(backward)->w1'->send(w1')
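A minimal C++ sketch of the intended split (GradSender and RpcSendGrad are hypothetical names, not PaddlePaddle's actual API): send(w') fires the RPC asynchronously and returns immediately, while send_barrier() waits for every outstanding request:

```cpp
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for the RPC call that ships one gradient
// to the pserver.
void RpcSendGrad(const std::string& grad_name) {
  // ... issue the request and wait for its completion ...
}

class GradSender {
 public:
  // send(w'): launch the request on a background thread and return
  // immediately so the next backward op can start.
  void AsyncSend(const std::string& grad_name) {
    pending_.push_back(
        std::async(std::launch::async, RpcSendGrad, grad_name));
  }

  // send_barrier(): block until every queued send has finished.
  void Barrier() {
    for (auto& f : pending_) f.wait();
    pending_.clear();
  }

 private:
  std::vector<std::future<void>> pending_;
};
```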

TODO

  • Implement AsyncSendOp, SendBarrierOp.
  • Implement an IO threadpool to handle async sends (see the sketch after this list).
  • Enhance the distribute transpiler to emit the async send op.
  • Update the benchmark report.
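A rough sketch of such an IO threadpool, built only on the C++ standard library (the name IOThreadPool is hypothetical); blocking send/recv tasks would be queued here so they never occupy a computing thread:

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class IOThreadPool {
 public:
  explicit IOThreadPool(std::size_t num_threads) {
    for (std::size_t i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front());
            tasks_.pop();
          }
          task();  // run the blocking send/recv outside the lock
        }
      });
    }
  }

  // Called by the executor: hand a blocking IO task to the pool.
  void Enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  ~IOThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

 private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_ = false;
};
```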
@Yancey1989 Yancey1989 changed the title Send gradient immediately after the execution of backward op Async send gradient after execution of backward op Mar 16, 2018

helinwang commented Mar 16, 2018

Thank you! This is indeed very important.

  1. One question: do we need a sync recvOP or an async recvOP (same for the send OP)?

    @reyoung is leading the effort on a C++ implementation of the parallel executor, which analyzes dependencies and automatically runs different OPs in parallel. I think the parallel executor assumes all CPU OPs are sync OPs (i.e., once Run finishes, the OP no longer reads its input or writes its output). This matters because an OP that uses the output of a previous OP needs to wait for that OP to finish before starting execution, and there is currently no way to know when an async CPU OP has finished.

    For example:

    Async CPU recv OP that receives tensor A -> GPU OP that uses tensor A
    

    The second OP needs tensor A to be fully received before it starts running.

    My understanding could be wrong; maybe you can sync with @reyoung so that he knows the distributed-training use case, and the parallel executor and send/recv operators can work together properly.

  2. Another question: if we decide to use sync send/recv operators, they will block executor threads. How many threads do we want the executor to have?

  3. One solution is for every async operator to somehow notify the executor (using a callback or a future), so that the operator stays async but the executor knows when it has actually finished.
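A small C++ sketch of that third option, with std::future as the completion signal (AsyncOpHandle and its methods are hypothetical, not the executor's real interface):

```cpp
#include <future>

class AsyncOpHandle {
 public:
  // Run() starts the IO on a background thread and returns at once,
  // keeping a future the executor can query.
  void Run() {
    done_ = std::async(std::launch::async, [this] { DoIO(); });
  }

  // The executor calls this before scheduling any op that consumes
  // this op's output.
  void WaitDone() {
    if (done_.valid()) done_.wait();
  }

 private:
  void DoIO() { /* the actual blocking send/recv */ }
  std::future<void> done_;
};
```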


Yancey1989 commented Mar 19, 2018

Hi @helinwang, thanks for your comments.

One question: do we need a sync recvOP or an async recvOP (same for the send OP)?

I had a short discussion with @reyoung at the office; maybe we don't need an async Recv, for example:

forward:          fc1(w1)->send1(w1')->fc2(w2)->send2(w2')->fc3(w3)->send3(w3')->send_barrier()...
backpropagation:  recv1(w1)->fc1'(w1)->recv2(w2)->fc2'(w2)->recv3(w3)->fc3'(w3)->recv_barrier()...

There is no dependency among the RecvOps, so ParallelExecutor will execute them concurrently:

thread0:  Wait(send_barrier)-->recv1(w1)-->Wait(recv1)-->fc1'(w1)
thread1:  Wait(send_barrier)-->recv2(w2)-->Wait(fc1', recv2)-->fc2'(w2)
thread2:  Wait(send_barrier)-->recv3(w3)-->Wait(fc2', recv3)-->fc3'(w3)
thread3:  Wait(recv1, recv2, recv3)-->recv_barrier()

So it looks like RecvOps would not block the computing ops; maybe we just need a sync RecvOp, and ParallelExecutor will execute them concurrently.
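A toy C++ illustration of this point (std::async stands in for the executor's worker threads; this is not ParallelExecutor's real code): each sync recv blocks only its own thread, so it delays nothing but the ops that consume its output:

```cpp
#include <future>

void BackwardPass() {
  // Each sync RecvOp runs on its own worker thread.
  auto recv1 = std::async(std::launch::async, [] { /* blocking recv of w1 */ });
  auto recv2 = std::async(std::launch::async, [] { /* blocking recv of w2 */ });
  auto recv3 = std::async(std::launch::async, [] { /* blocking recv of w3 */ });

  recv1.wait();  // fc1' depends only on w1; recv2/recv3 are still in flight
  /* run fc1'(w1) */
  recv2.wait();
  /* run fc2'(w2) */
  recv3.wait();
  /* run fc3'(w3) */
}
```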


Yancey1989 commented Mar 19, 2018

Another question: if we decide to use sync send/recv operators, they will block executor threads. How many threads do we want the executor to have?

I'm not sure; maybe we need to do some experiments, or maybe we need a separate IO ThreadPool with more threads than the computing ThreadPool. Either way, I think we need a parameter to configure the ThreadPool size.

One solution is for every async operator to somehow notify the executor (using a callback or a future), so that the operator stays async but the executor knows when it has actually finished.

As in the comment above, maybe we only need more threads, so that we don't block the computing threads?
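A sketch of the configuration this implies (the struct and field names are hypothetical): the computing pool stays sized roughly to the CPU cores, while the IO pool is larger and separately configurable because its threads mostly block on RPC:

```cpp
#include <cstddef>

// Hypothetical strategy struct; the concrete knob names would live in
// the executor's configuration.
struct ExecutorStrategy {
  std::size_t num_compute_threads = 4;  // ~= number of CPU cores
  std::size_t num_io_threads = 16;      // larger: these mostly block on RPC
};
```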


gongweibao commented Mar 21, 2018

A simple picture:
[image attached]

helinwang commented

@Yancey1989 thanks! Maybe in the future we need two types of threads: one for computing, one for IO.

@gongweibao thanks for the picture!

Yancey1989 commented

thanks! Maybe in the future we need two types of threads: one for computing, one for IO.

Agreed; I have added it to the TODO list and will implement it ASAP.

@Yancey1989 Yancey1989 changed the title Async send gradient after execution of backward op Split send_op into fetch_vars_op and send_vars_op May 3, 2018
PaddlePaddle Distributed Refactoring (Due: 201802) automation moved this from Perf TODOs to DONE May 29, 2018