Optimize while_op for test #14764

Merged
merged 23 commits into PaddlePaddle:develop from the core_opt_while_op branch on Jan 16, 2019
Conversation

Contributor

@Xreki Xreki commented Dec 6, 2018

  • Avoid creating and deleting variables in every iteration when is_test is set.
  • Add a dispensable input ExecutorPrepareContext to while_op, so that the operators of this block can be created in advance. (Moved to another PR.)
  • Fix some missing deps in cmake.
  • Use async TensorCopy instead of TensorCopySync in reshape_op.
  • Simplify the CPU kernel of compare_op when numel() of the input tensor is 1.

@Xreki Xreki force-pushed the core_opt_while_op branch 2 times, most recently from 26a1cc8 to f377755 Compare December 10, 2018 06:51
@Xreki Xreki force-pushed the core_opt_while_op branch 2 times, most recently from 846b535 to 9a6ee11 Compare December 11, 2018 09:53
@Xreki
Contributor Author

Xreki commented Jan 4, 2019

Hit the following link error in PR_CI_python35 (see the build log):

../../operators/distributed/libsendrecvop_rpc.a(grpc_client.cc.o): In function `std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > >, std::__future_base::_Result_base::_Deleter>, std::_Bind_simple<std::reference_wrapper<std::future<std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > > paddle::framework::ThreadPool::RunAndGetException<paddle::operators::distributed::GRPCClient::AsyncSendVar(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, paddle::platform::DeviceContext const&, paddle::framework::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long)::{lambda()#1}>(paddle::operators::distributed::GRPCClient::AsyncSendVar(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, paddle::platform::DeviceContext const&, paddle::framework::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long)::{lambda()#1})::{lambda()#1}> ()>, std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > > >::_M_invoke(std::_Any_data const&)':
grpc_client.cc:(.text+0x28ec): undefined reference to `grpc::GenericStub::PrepareUnaryCall(grpc::ClientContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, grpc::ByteBuffer const&, grpc::CompletionQueue*)'
../../operators/distributed/libsendrecvop_rpc.a(grpc_client.cc.o): In function `std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > >, std::__future_base::_Result_base::_Deleter>, std::_Bind_simple<std::reference_wrapper<std::future<std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > > paddle::framework::ThreadPool::RunAndGetException<paddle::operators::distributed::GRPCClient::AsyncPrefetchVar(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, paddle::platform::DeviceContext const&, paddle::framework::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long)::{lambda()#1}>(paddle::operators::distributed::GRPCClient::AsyncPrefetchVar(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, paddle::platform::DeviceContext const&, paddle::framework::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long)::{lambda()#1})::{lambda()#1}> ()>, std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > > >::_M_invoke(std::_Any_data const&)':
grpc_client.cc:(.text+0x304c): undefined reference to `grpc::GenericStub::PrepareUnaryCall(grpc::ClientContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, grpc::ByteBuffer const&, grpc::CompletionQueue*)'
../../operators/distributed/libsendrecvop_rpc.a(grpc_client.cc.o): In function `paddle::operators::distributed::GRPCClient::GetChannel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
grpc_client.cc:(.text+0x3683): undefined reference to `grpc::InsecureChannelCredentials()'
grpc_client.cc:(.text+0x369c): undefined reference to `grpc::CreateCustomChannel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<grpc::ChannelCredentials> const&, grpc::ChannelArguments const&)'
../../operators/distributed/libsendrecvop_rpc.a(grpc_client.cc.o): In function `std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > >, std::__future_base::_Result_base::_Deleter>, std::_Bind_simple<std::reference_wrapper<std::future<std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > > paddle::framework::ThreadPool::RunAndGetException<paddle::operators::distributed::GRPCClient::_AsyncGetVar(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, paddle::platform::DeviceContext const&, paddle::framework::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long)::{lambda()#1}>(paddle::operators::distributed::GRPCClient::_AsyncGetVar(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, paddle::platform::DeviceContext const&, paddle::framework::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long)::{lambda()#1})::{lambda()#1}> ()>, std::unique_ptr<paddle::platform::EnforceNotMet, std::default_delete<paddle::platform::EnforceNotMet> > > >::_M_invoke(std::_Any_data const&)':
[ 82%] Generating paddle_fluid.dir/read_op.objdir
grpc_client.cc:(.text+0x837e): undefined reference to `grpc::GenericStub::PrepareUnaryCall(grpc::ClientContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, grpc::ByteBuffer const&, grpc::CompletionQueue*)'
../../operators/distributed/libsendrecvop_rpc.a(grpc_client.cc.o): In function `paddle::operators::distributed::BaseProcessor::Prepare(std::shared_ptr<paddle::operators::distributed::VarHandle>, long)':
grpc_client.cc:(.text._ZN6paddle9operators11distributed13BaseProcessor7PrepareESt10shared_ptrINS1_9VarHandleEEl[_ZN6paddle9operators11distributed13BaseProcessor7PrepareESt10shared_ptrINS1_9VarHandleEEl]+0xcb): undefined reference to `grpc::Timepoint2Timespec(std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&, gpr_timespec*)'
../../operators/distributed/libsendrecvop_rpc.a(grpc_server.cc.o): In function `paddle::operators::distributed::AsyncGRPCServer::StartServer()':
grpc_server.cc:(.text+0x24b0): undefined reference to `grpc::ServerBuilder::ServerBuilder()'
grpc_server.cc:(.text+0x24b8): undefined reference to `grpc::InsecureServerCredentials()'
grpc_server.cc:(.text+0x24cd): undefined reference to `grpc::ServerBuilder::AddListeningPort(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<grpc::ServerCredentials>, int*)'
grpc_server.cc:(.text+0x2530): undefined reference to `grpc::ServerBuilder::SetOption(std::unique_ptr<grpc::ServerBuilderOption, std::default_delete<grpc::ServerBuilderOption> >)'
grpc_server.cc:(.text+0x255c): undefined reference to `grpc::ServerBuilder::RegisterService(grpc::Service*)'
grpc_server.cc:(.text+0x25f4): undefined reference to `grpc::ServerBuilder::AddCompletionQueue(bool)'
grpc_server.cc:(.text+0x273c): undefined reference to `grpc::ServerBuilder::BuildAndStart()'
grpc_server.cc:(.text+0x2ca7): undefined reference to `grpc::ServerBuilder::~ServerBuilder()'
grpc_server.cc:(.text+0x3621): undefined reference to `grpc::ServerBuilder::~ServerBuilder()'
collect2: error: ld returned 1 exit status
paddle/fluid/inference/analysis/CMakeFiles/test_analyzer.dir/build.make:522: recipe for target 'paddle/fluid/inference/analysis/test_analyzer' failed
make[2]: *** [paddle/fluid/inference/analysis/test_analyzer] Error 1
CMakeFiles/Makefile2:62669: recipe for target 'paddle/fluid/inference/analysis/CMakeFiles/test_analyzer.dir/all' failed
make[1]: *** [paddle/fluid/inference/analysis/CMakeFiles/test_analyzer.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....

@CLAassistant

CLAassistant commented Jan 8, 2019

CLA assistant check
All committers have signed the CLA.

@Xreki
Contributor Author

Xreki commented Jan 9, 2019

The following unit tests failed in CI:

[11:18:17][Step 1/1] The following tests FAILED:
[11:18:17][Step 1/1] 	136 - test_analyzer_rnn1 (SEGFAULT)
[11:18:17][Step 1/1] 	137 - test_analyzer_rnn2 (SEGFAULT)
[11:18:17][Step 1/1] 	138 - test_analyzer_dam (SEGFAULT)
[11:18:17][Step 1/1] 	139 - test_analyzer_small_dam (SEGFAULT)
[11:18:17][Step 1/1] 	140 - test_analyzer_ner (SEGFAULT)
[11:18:17][Step 1/1] 	141 - test_analyzer_lac (SEGFAULT)
[11:18:17][Step 1/1] 	142 - test_analyzer_mm_dnn (SEGFAULT)
[11:18:17][Step 1/1] 	143 - test_analyzer_text_classification (SEGFAULT)
[11:18:17][Step 1/1] 	144 - test_analyzer_seq_conv1 (SEGFAULT)
[11:18:17][Step 1/1] 	145 - test_analyzer_seq_pool1 (SEGFAULT)
[11:18:17][Step 1/1] 	146 - test_analyzer_ocr (SEGFAULT)
[11:18:17][Step 1/1] 	147 - test_analyzer_mobilenet_transpose (SEGFAULT)
[11:18:17][Step 1/1] 	148 - test_analyzer_resnet50 (SEGFAULT)
[11:18:17][Step 1/1] 	149 - test_analyzer_mobilenet_depthwise_conv (SEGFAULT)
[11:18:17][Step 1/1] Errors while running CTest
[11:18:17][Step 1/1] 	152 - test_trt_models (SEGFAULT)

This is fixed in #15222.

Contributor

@luotao1 luotao1 left a comment

LGTM for cmake part.

auto &current_scope = scope.NewScope();
step_scopes->push_back(&current_scope);
executor.RunPreparedContext(ctx_ptr, &current_scope, false, true, true);
if (is_test) {
Contributor

no need?

Contributor Author

Done.

executor.RunPreparedContext(ctx.get(), &current_scope, false, true, true);
if (is_test) {
scope.DeleteScope(&current_scope);
executor.CreateVariables(*program, &current_scope, block->ID());
Contributor

Why doesn't inference need to create variables, while training does?

Contributor Author

void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope,
                                  bool create_local_scope, bool create_vars,
                                  bool keep_kids) {
  PADDLE_ENFORCE_NOT_NULL(scope);
  Scope* local_scope = scope;
  if (create_vars) {
    if (create_local_scope) {
      local_scope = &scope->NewScope();
    }
    CreateVariables(ctx->prog_, local_scope, ctx->block_id_);
  }

When create_vars is true, RunPreparedContext creates the variables. The step scopes are used by while_grad_op, so every step scope has to be kept. As a result, with 100 iterations CreateVariables is called 100 times.

if (is_test) {
  scope.DeleteScope(&current_scope);
}

In test mode, each step scope is deleted at the end of its iteration. There is therefore no need to create 100 step scopes and call CreateVariables 100 times; once is enough. This cuts the measured time from 17.8619 ms to 16.0768 ms, roughly 10%.
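
To make the difference concrete, here is a minimal, self-contained sketch of the two loop shapes described above. It does not use Paddle's classes: ToyScope, LoopStats, RunTrainStyle and RunTestStyle are hypothetical names invented for illustration, and the "variables" are plain integers. It only counts how many scopes are created and how often CreateVariables runs in each style.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for a scope (hypothetical, not framework::Scope).
class ToyScope {
 public:
  // Creates any missing variables and records that the call happened.
  void CreateVariables(const std::vector<std::string>& names) {
    ++create_calls_;
    for (const auto& name : names) vars_.emplace(name, 0);
  }
  int& Var(const std::string& name) { return vars_.at(name); }
  int create_calls() const { return create_calls_; }

 private:
  std::unordered_map<std::string, int> vars_;
  int create_calls_ = 0;
};

struct LoopStats {
  int scopes_created = 0;
  int create_calls = 0;
  int sum = 0;
};

// Training-style loop: a fresh scope per step, all kept for a backward pass.
LoopStats RunTrainStyle(int iterations) {
  LoopStats stats;
  std::vector<ToyScope> step_scopes;
  step_scopes.reserve(static_cast<std::size_t>(iterations));
  for (int i = 0; i < iterations; ++i) {
    step_scopes.emplace_back();
    ToyScope& scope = step_scopes.back();
    scope.CreateVariables({"x"});
    scope.Var("x") = i;
    stats.sum += scope.Var("x");
    stats.create_calls += scope.create_calls();
  }
  stats.scopes_created = static_cast<int>(step_scopes.size());
  return stats;
}

// Test-style loop: one scope, variables created once and reused every step.
LoopStats RunTestStyle(int iterations) {
  LoopStats stats;
  ToyScope scope;
  scope.CreateVariables({"x"});
  for (int i = 0; i < iterations; ++i) {
    scope.Var("x") = i;
    stats.sum += scope.Var("x");
  }
  stats.scopes_created = 1;
  stats.create_calls = scope.create_calls();
  return stats;
}
```

Both functions compute the same sum; only the bookkeeping differs. Reuse like this is only safe when nothing later needs the per-step state, which is the is_test case discussed here.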

namespace api {
namespace details {

static void CheckWhileOpInput(framework::ProgramDesc *program) {
Contributor

This is not "check", this is "change"

framework::Variable *ptr = scope->Var(context_var->Name());
framework::InitializeVariable(ptr, context_var->GetType());

auto *tensor = ptr->GetMutable<framework::LoDTensor>();
Contributor

Why is this a LoDTensor rather than an ExecutorPrepareContext?

Contributor Author

To use ExecutorPrepareContext directly, we would need to add a new type to the variable type list. That requires many changes, starting here:

message VarType {
  enum Type {
    // Pod Types
    BOOL = 0;
    INT16 = 1;
    INT32 = 2;
    INT64 = 3;
    FP16 = 4;
    FP32 = 5;
    FP64 = 6;
    // Tensor<size_t> is used in C++.
    SIZE_T = 19;
    UINT8 = 20;
    INT8 = 21;
    // Other types that may need additional descriptions
    LOD_TENSOR = 7;
    SELECTED_ROWS = 8;
    FEED_MINIBATCH = 9;
    FETCH_LIST = 10;
    STEP_SCOPES = 11;
    LOD_RANK_TABLE = 12;
    LOD_TENSOR_ARRAY = 13;
    PLACE_LIST = 14;
    READER = 15;
    // Any runtime decided variable type is raw
    // raw variables should manage their own allocations
    // in operators like nccl_op
    RAW = 17;
    TUPLE = 18;
  }

Do you think that is better?

Contributor

Does RAW work?

Contributor Author

I tried it and ran into complicated cmake dependency errors: the change introduces a circular dependency in cmake. I'd like to remove this code and fix it in another PR. Is that OK?
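
For readers following the RAW suggestion, the general idea is a variable whose payload type is decided at runtime, so it can own an arbitrary object instead of encoding it in a tensor. The sketch below is a generic illustration of that idea only. ToyVariable and ToyPrepareContext are hypothetical names, and nothing here reflects how Paddle's Variable or RAW type is actually implemented or whether it would avoid the cmake problem mentioned above.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <typeindex>
#include <typeinfo>

// Hypothetical type-erased variable: owns one object of a runtime-chosen type.
class ToyVariable {
 public:
  // Returns the held object, default-constructing it on first use.
  // Throws if the variable already holds a different type.
  template <typename T>
  T* GetMutable() {
    if (!holder_) {
      // shared_ptr<void> keeps the deleter for T, so cleanup stays correct.
      holder_ = std::make_shared<T>();
      type_ = std::type_index(typeid(T));
    } else if (type_ != std::type_index(typeid(T))) {
      throw std::runtime_error("variable holds a different type");
    }
    return static_cast<T*>(holder_.get());
  }

  template <typename T>
  bool IsType() const {
    return holder_ != nullptr && type_ == std::type_index(typeid(T));
  }

 private:
  std::shared_ptr<void> holder_;
  std::type_index type_{typeid(void)};
};

// Hypothetical stand-in for a prepared executor context.
struct ToyPrepareContext {
  int block_id = 0;
  int num_ops = 0;
};
```

With a holder like this, a caller could store the context directly via GetMutable<ToyPrepareContext>() and read it back later, with a type check guarding against misuse.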

executor_->CreateVariables(*inference_program_,
sub_scope_ ? sub_scope_ : scope_.get(), 0);
framework::Scope *scope = sub_scope_ != nullptr ? sub_scope_ : scope_.get();
inference::api::details::PrepareExecutor(inference_program_.get(),
Contributor

should this live in a pass?

Contributor Author

A pass is not a good fit here, because we need to create an ExecutorPrepareContext for each of the final programs. These contexts are used directly in Predictor.Run().

I'm actually considering implementing this in

std::unique_ptr<ExecutorPrepareContext> Executor::Prepare(
const ProgramDesc& program, int block_id,
const std::vector<std::string>& skip_ref_cnt_vars) {

But I'm not sure it is stable enough, so I implemented it in NativePredictor first to verify it.
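
The pattern under discussion, preparing the context once and reusing it on every Run() call, can be sketched as follows. This is an assumption-laden toy, not the NativePredictor code: ToyPredictor and ToyPreparedContext are hypothetical names, and the "ops" are simple integer functions.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical prepared context: the ready-to-run ops of one block.
struct ToyPreparedContext {
  std::vector<std::function<int(int)>> ops;
};

class ToyPredictor {
 public:
  // The context is built once, at construction time.
  ToyPredictor() : ctx_(Prepare()) {}

  // Runs the cached ops; no op construction happens on this path.
  int Run(int input) const {
    int value = input;
    for (const auto& op : ctx_->ops) value = op(value);
    return value;
  }

  int prepare_calls() const { return prepare_calls_; }

 private:
  std::unique_ptr<ToyPreparedContext> Prepare() {
    ++prepare_calls_;
    std::unique_ptr<ToyPreparedContext> ctx(new ToyPreparedContext);
    ctx->ops.push_back([](int x) { return x + 1; });
    ctx->ops.push_back([](int x) { return x * 2; });
    return ctx;
  }

  // Declared before ctx_ so it is initialized before Prepare() runs.
  int prepare_calls_ = 0;
  std::unique_ptr<ToyPreparedContext> ctx_;
};
```

Because Run() only reads the cached context, repeated calls pay no setup cost, which is why the context has to exist for the final program rather than being produced by an intermediate pass.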

@@ -0,0 +1,33 @@
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Contributor

no need for this file?

framework::TensorCopySync(*in, ctx.GetPlace(), out);
framework::TensorCopy(
*in, ctx.GetPlace(),
ctx.template device_context<platform::DeviceContext>(), out);
Contributor

@velconia why was this copy sync?

Collaborator

This code was copied from the original implementation, but I guess the sync is not needed here.

@Xreki Xreki merged commit 568cc2f into PaddlePaddle:develop Jan 16, 2019
Xreki added a commit to Xreki/Paddle that referenced this pull request Jan 16, 2019
Xreki added a commit that referenced this pull request Jan 17, 2019
* Revert the modification of while_op in #14764.
test=develop

* Remove the dependency of GRPC_DEPS.
test=develop