
overlap send ops and backward ops #10550

Merged: 24 commits into PaddlePaddle:develop from overlap_send_op (May 29, 2018)

Conversation

@Yancey1989 (Contributor) commented May 10, 2018

Fixed #9161
Fixed #10969

Experiment with vgg16 + flowers on P40, 2 pservers + 2 trainers

| branch  | GPUs per trainer | executor          | performance      |
|---------|------------------|-------------------|------------------|
| develop | 1                | default executor  | 12.946086 imgs/s |
| overlap | 1                | default executor  | 14.587641 imgs/s |
| develop | 8                | parallel executor | 144.8 imgs/s     |
| overlap | 8                | parallel executor | 175 imgs/s       |

Throughput improves by about 12% on a single device and about 20% on multiple devices.

@Yancey1989 changed the title to "overlap send ops and backward ops" on May 15, 2018
@Yancey1989 requested a review from panyx0718 on May 22, 2018
@panyx0718 requested a review from chengduoZH on May 23, 2018
@typhoonzero (Contributor): It seems test_dist_transpiler needs to be updated.


class PSDispatcher(object):
"""
DistributedSpliter is the base class for dispatching vars

Contributor: "DistributedSpliter" is not the same as the class name.

@Yancey1989 (Author): Done.

std::vector<std::string> MultiDevSSAGraphBuilder::FindDistTrainSendVars(
const ProgramDesc &program) const {
std::vector<std::string> send_vars;
for (auto *op : program.Block(0).AllOps()) {

Contributor: Add a comment to say "since parameters are all in block 0, it's enough to only scan send ops in block 0"?

@Yancey1989 (Author): Done.

const ProgramDesc &program) const {
std::vector<std::string> send_vars;
for (auto *op : program.Block(0).AllOps()) {
if (op->Type() == "send_vars" || op->Type() == "send") {

Contributor: We need to gradually remove these "send_vars" and "send" strings; they are hard to maintain.

@Yancey1989 (Author): I added a comment here; can we fix this in another PR?

const ProgramDesc &program) const {
std::vector<std::string> recv_vars;
for (auto *op : program.Block(0).AllOps()) {
if (op->Type() == "recv" || op->Type() == "send") {

Contributor: Why is "send" here as well?

@Yancey1989 (Author), May 27, 2018: Done, deleted the deprecated send op.

@@ -4,6 +4,7 @@ cc_library(scale_loss_grad_op_handle SRCS scale_loss_grad_op_handle.cc DEPS op_h
cc_library(fetch_op_handle SRCS fetch_op_handle.cc DEPS op_handle_base scope lod_tensor ddim memory)

Contributor: This PR needs a convergence experiment to make sure training still converges to the same place.

@Yancey1989 (Author): Sure, will do that.

return false;
}

/**
* Check any of opvars contains `.block` and in sendvars
*/
auto checker = [](const std::vector<std::string> &opvars,
const std::vector<std::string> &sendvars) -> bool {
const std::vector<std::string> &rpc_vars) -> bool {
for (auto &var : opvars) {
if (var.find(".block") != std::string::npos &&

Contributor: What's the meaning of '.block'?

@Yancey1989 (Author): The dist transpiler splits a gradient var into multiple blocks and appends a .block suffix to each piece's name; I will add some comments here to explain the hard-coded string.
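
(Illustration only: a minimal sketch of the naming convention described above, with made-up variable names; the real transpiler output may differ.)

```cpp
#include <string>

// Hypothetical example of the ".block" suffix the dist transpiler appends to
// each split piece of a gradient var, e.g.
//   "fc_0.w_0@GRAD" -> "fc_0.w_0@GRAD.block0", "fc_0.w_0@GRAD.block1"
// The checker above keys off this suffix to recognize split pieces.
bool LooksLikeSplitGradPiece(const std::string &name) {
  return name.find(".block") != std::string::npos;
}
```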

return true;
}
}
return false;
};

if (op.Type() == "split" || op.Type() == "split_byref") {
return checker(op.OutputArgumentNames(), send_op->InputArgumentNames());
if (op.Type() == "split" || op.Type() == "split_byref" ||

Contributor: Is there a way to make this code more maintainable?

@Yancey1989 (Author), May 27, 2018: I reviewed this PR again, and we don't need to pay attention to the op type: in the DistributedTranspiler we split the var before send and concat the vars after recv, so looking at the inputs and outputs is enough.

const OpDesc &op) const {
CreateComputationalOp(result, op, 0);
if (op.Type() == "concat") {
ConnectOp(result, result->ops_.back().get(), "fetch_barrier");

Contributor: Can there be two fetch_barrier ops, and could this connect to the wrong one?

@Yancey1989 (Author): In the current implementation of the dist transpiler there is only one fetch_barrier op. I will also add some comments here; maybe adding a bool connect_all argument to ConnectOp would be better.
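
(A rough sketch of the connect_all idea, using toy graph types rather than the real SSAGraph API.)

```cpp
#include <string>
#include <vector>

// Toy stand-ins (not the real SSAGraph types) to illustrate the suggestion:
// an optional connect_all flag so the caller can depend on every op of a
// given type instead of only the most recently added one.
struct ToyOp { std::string type; std::vector<ToyOp*> deps; };
struct ToyGraph { std::vector<ToyOp*> ops; };

void ConnectOp(ToyGraph *graph, ToyOp *op, const std::string &prev_op_type,
               bool connect_all = false) {
  for (auto it = graph->ops.rbegin(); it != graph->ops.rend(); ++it) {
    if ((*it)->type == prev_op_type) {
      op->deps.push_back(*it);   // add a dependency edge
      if (!connect_all) return;  // default: only the most recent match
    }
  }
}
```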

@@ -257,8 +257,7 @@ std::shared_ptr<grpc::Channel> RPCClient::GetChannel(const std::string& ep) {

auto ch =
grpc::CreateCustomChannel(ep, grpc::InsecureChannelCredentials(), args);

channels_[ep] = ch;
channels_[key] = ch;

Contributor: Just curious, what's the benefit of having multiple channels per ep?

@Yancey1989 (Author), May 27, 2018: It's an experimental feature for multi-channel throughput; I will revert these changes and do more testing in the next PR.

@Yancey1989 (Author) commented May 28, 2018

1. develop branch acc logs:
Pass = 0, Training performance = 13.056212 imgs/s, Train accuracy = 0.034640, Test accuracy = 0.025490
Pass = 1, Training performance = 13.458697 imgs/s, Train accuracy = 0.042609, Test accuracy = 0.029412
Pass = 2, Training performance = 13.474373 imgs/s, Train accuracy = 0.049005, Test accuracy = 0.041176
Pass = 3, Training performance = 13.460915 imgs/s, Train accuracy = 0.053057, Test accuracy = 0.048039
Pass = 4, Training performance = 13.443091 imgs/s, Train accuracy = 0.057180, Test accuracy = 0.058824
Pass = 5, Training performance = 13.433230 imgs/s, Train accuracy = 0.059657, Test accuracy = 0.055882
Pass = 6, Training performance = 13.580382 imgs/s, Train accuracy = 0.061427, Test accuracy = 0.045098
Pass = 7, Training performance = 13.464448 imgs/s, Train accuracy = 0.063039, Test accuracy = 0.054902
2. overlap branch acc logs:
Pass = 0, Training performance = 14.861740 imgs/s, Train accuracy = 0.030411, Test accuracy = 0.027451
Pass = 1, Training performance = 15.150844 imgs/s, Train accuracy = 0.034640, Test accuracy = 0.046078
Pass = 2, Training performance = 15.204456 imgs/s, Train accuracy = 0.042338, Test accuracy = 0.037255
Pass = 3, Training performance = 15.216520 imgs/s, Train accuracy = 0.048626, Test accuracy = 0.047059
Pass = 4, Training performance = 15.235357 imgs/s, Train accuracy = 0.052724, Test accuracy = 0.049020
Pass = 5, Training performance = 15.267941 imgs/s, Train accuracy = 0.055998, Test accuracy = 0.062745
Pass = 6, Training performance = 15.212211 imgs/s, Train accuracy = 0.059382, Test accuracy = 0.070588
Pass = 7, Training performance = 15.209697 imgs/s, Train accuracy = 0.061555, Test accuracy = 0.067647
Pass = 8, Training performance = 15.232591 imgs/s, Train accuracy = 0.064455, Test accuracy = 0.060784

@Yancey1989 (Author) commented May 28, 2018

A problem was found while testing distributed training: the trainer hangs with ParallelExecutor but works well with the default Executor. This appears in both the develop branch and the overlap branch; I created issue #10963 to track this bug.

@panyx0718 (Contributor) previously approved these changes on May 28, 2018:

Some small comments can be fixed now. Others can be left as TODO.

return checker(op.InputArgumentNames(), send_op->OutputArgumentNames());
return checker(op.OutputArgumentNames(), send_vars) ||
checker(op.InputArgumentNames(), recv_vars);
return false;

Contributor: Remove this line?

@Yancey1989 (Author): Done.

"rpc op should be in [send,"
"send_vars, send_barrier. recv, fetch_barrier]");
}

// FIXME(wuyi): send op always copy from GPU 0

Contributor: Remove this FIXME as well?

@Yancey1989 (Author): Done, and added a TODO to improve the performance.

// Wait input done
for (auto *in : inputs_) {
auto &p = static_cast<VarHandle *>(in)->place_;
if (in->DebugString() == "dummy") { // HACK

Contributor: Need a better solution here soon.

@Yancey1989 (Author): Done, I added a TODO comment here.

@@ -38,6 +39,7 @@ class Variable {

template <typename T>
T* GetMutable() {
std::unique_lock<std::mutex> lock(mutex_);

Contributor: Add a TODO here to make Variable completely thread-safe?

@Yancey1989 (Author): Done.

@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
limitations under the License. */
#pragma once

#pragma once

Contributor: Redundant?

@Yancey1989 (Author): Done.

@@ -20,6 +20,9 @@ namespace operators {

inline bool NeedSend(const framework::Scope& scope,

Contributor: Should this method test whether the variable is a parameter on the pserver?

@Yancey1989 (Author): Maybe not; an op doesn't need to care about where the parameter lives (the transpiler knows much more about that), it only needs to check whether the variable is initialized.
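
(A toy version of that check, with made-up ToyScope/ToyVar types standing in for the real framework classes.)

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for the real framework::Scope / Variable types.
struct ToyVar { bool initialized = false; };
struct ToyScope { std::unordered_map<std::string, ToyVar> vars; };

// The op only asks "does this variable exist and is it initialized?";
// which pserver owns the parameter is the transpiler's concern.
inline bool NeedSend(const ToyScope& scope, const std::string& varname) {
  auto it = scope.vars.find(varname);
  return it != scope.vars.end() && it->second.initialized;
}
```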

return [op.type for op in trainer.global_block().ops
] + ["split_byref", "send", "concat"]
ops = [op.type for op in trainer.global_block().ops] + [
"split_byref", "send_vars", "send_barrier", "recv", "recv",

Contributor: Is "recv" duplicated?

@Yancey1989 (Author): No, one parameter -> one recv op, so there can be multiple send ops and recv ops; for example, the SSA graph of fit_a_line:

[image: SSA graph of fit_a_line]


class HashName(PSDispatcher):
"""
Hash variable names to servral endpoints

Contributor: Typo: "servral" should be "several".

@Yancey1989 (Author): Done.

eplist = ps_dispatcher.dispatch(splited_vars)
if len(splited_vars) == 1:
orig_varname = splited_vars[0].name
index = find_op_by_output_arg(program.global_block(),

Contributor: Does dist_transpiler work well with memory_optimize_transpiler? memory_optimize_transpiler tends to change var names. Leave a TODO?

@Yancey1989 (Author): Thanks for the reminder; we have not tested whether the dist transpiler works well with memory_optimize_transpiler. It's a good point.

# type="fetch_barrier",
# inputs={},
# outputs={"RPCClient": rpc_client_var},
# attrs={"endpoints": pserver_endpoints})

Contributor: Clean up?

@Yancey1989 (Author): Done.

for (auto *op : program.Block(0).AllOps()) {
// TODO(Yancey1989): use a graceful method to find send op,
// instead of the the hard code string
if (op->Type() == "send_vars") {

Contributor: Can we merge the send_vars and send ops by adding some attributes to control the behaviour?

@Yancey1989 (Author): Please follow the comment: #10550 (comment)

std::ostringstream sout;
PrintGraphviz(*graph, sout);
VLOG(10) << sout.str();
std::ofstream fout("/tmp/graph.dot");

Contributor: Use GFLAG to define this output path.

@Yancey1989 (Author): Done.
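
(A minimal sketch of the suggestion, assuming gflags; the flag name ssa_graph_path and the helper below are illustrative, not the flag actually added in this PR.)

```cpp
#include <fstream>
#include <string>

#include "gflags/gflags.h"

// Hypothetical flag; the real flag name/default may differ.
DEFINE_string(ssa_graph_path, "/tmp/graph.dot",
              "Path used to dump the SSA graph in Graphviz (dot) format.");

// Write the already-rendered dot text to the user-configurable path.
void DumpGraphviz(const std::string &dot_text) {
  std::ofstream fout(FLAGS_ssa_graph_path);
  fout << dot_text;
}
```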

// See the License for the specific language governing permissions and
// limitations under the License.

#include "paddle/fluid/framework/details/rpc_op_handle.h"

Contributor: If send_op_handle is not used anymore, can you just rename it?

@Yancey1989 (Author): Deleted the unused send_op_handle.

@@ -38,6 +39,7 @@ class Variable {

template <typename T>
T* GetMutable() {
std::unique_lock<std::mutex> lock(mutex_);

Contributor: I don't think we can give the var a mutex here. Vars are never thread-safe by themselves in any programming language; you must protect them where you use them concurrently, not here.

Also, adding a mutex cannot actually protect the var's contents.

@Yancey1989 (Author), May 28, 2018: This cannot make the var completely thread-safe; it only makes the GetMutable function thread-safe. I filed an issue to follow up on making Variable and the gRPC client thread-safe.

Related issue: #10969
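
(A toy illustration of the point above; the types are made up and the real Variable differs.)

```cpp
#include <memory>
#include <mutex>

// Toy version of the idea in the diff (not the real paddle::framework::Variable):
// the lock only serializes creation of the underlying holder, so two threads
// calling GetMutable<T>() at the same time cannot both allocate it. It does
// NOT make later use of the returned T* thread-safe.
class ToyVariable {
 public:
  template <typename T>
  T* GetMutable() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (holder_ == nullptr) holder_ = std::make_shared<T>();
    return static_cast<T*>(holder_.get());
  }

 private:
  std::mutex mutex_;
  std::shared_ptr<void> holder_;  // type-erased storage, as a simplification
};
```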

Contributor: I think we should use a workaround instead of changing the Variable implementation anyway.

@Yancey1989 (Author): Maybe we can use a single GRPCClient instance as a singleton, or do you have other ideas?
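
(A minimal sketch of the singleton idea under discussion; illustrative only, not the actual RPCClient API.)

```cpp
#include <mutex>

// Illustrative Meyers singleton: one process-wide client instance, created
// lazily and thread-safely (C++11 guarantees for local statics).
class RPCClient {
 public:
  static RPCClient &Instance() {
    static RPCClient client;
    return client;
  }

  RPCClient(const RPCClient &) = delete;
  RPCClient &operator=(const RPCClient &) = delete;

 private:
  RPCClient() = default;
  std::mutex mutex_;  // would guard shared channel state in a real client
};
```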

@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
limitations under the License. */
#pragma once

#pragma once

Contributor: Why #pragma once twice?

@Yancey1989 (Author): Redundant code, deleted it.

#include <future> // NOLINT
#include <ostream>

#include "paddle/fluid/framework/data_type.h"

Contributor: I think we might only need send and recv ops, with attrs to control their behavior, so that we don't copy this code around.

@Yancey1989 (Author), May 28, 2018: Good idea, but it's hard with the current implementation of SSAGraph, which uses op types and input/output arguments to determine dependencies; we need to do more work to merge send_op, send_vars_op, batch_barrier, recv_op, and fetch_barrier. I created an issue to follow up on this comment.
Related issue: #10968

@typhoonzero (Contributor) left a review comment: Please do not merge this PR until it is cleaned up. Thanks!

@Yancey1989 (Author) commented May 29, 2018

Hi @typhoonzero, I fixed the comments above and made RPCClient a singleton.

@typhoonzero (Contributor) left a review comment: LGTM++

@@ -134,12 +175,14 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(

bool is_forwarding = true;
for (auto *op : program.Block(0).AllOps()) {
if (op->Type() == "send") {
// append send op if program is distributed trainer main program.
if (boost::get<int>(

Contributor: Cool.

auto &p = places_[0];
auto *s = local_scopes_[0];
// FIXME(wuyi): send op always copy from GPU 0

Contributor: Should we add back this FIXME?

@Yancey1989 (Author): I will add this comment back in the next PR.

@Yancey1989 merged commit d92a75b into PaddlePaddle:develop on May 29, 2018
@Yancey1989 deleted the overlap_send_op branch on May 29, 2018 at 08:01