Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when transpile program with piecewise_decay to distributed program #11429

Closed
typhoonzero opened this issue Jun 13, 2018 · 4 comments · Fixed by #11520
Closed

Error when transpile program with piecewise_decay to distributed program #11429

typhoonzero opened this issue Jun 13, 2018 · 4 comments · Fixed by #11520

Comments

@typhoonzero
Copy link
Contributor

typhoonzero commented Jun 13, 2018

When piecewise_decay is defined the pserver side program will have a wrong conditional_block refer to a block id that doesn't exist on the pserver program.

This error was first met by @kolinwei

optimizer = fluid.optimizer.Momentum(
            learning_rate=fluid.layers.piecewise_decay(
                boundaries=bd, values=lr),
            momentum=0.9,
            regularization=fluid.regularizer.L2Decay(1e-4))
@kolinwei
Copy link
Contributor

图像的SE_ResNeXt50_32x4d 模型中lr_strategy=piecewise_decay

@Yancey1989
Copy link
Contributor

Yancey1989 commented Jun 13, 2018

Can also easy to reproduce it on fit-a-line, this model is simple enough and easy to debug:

sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)

=>

    sgd_optimizer = fluid.optimizer.Momentum(learning_rate=fluid.layers.piecewise_decay(
                boundaries=[3,6,9], values=[0.1, 0.2, 0.3, 0.4]), momentum=0.9)

error logs:

I0613 10:07:19.290724  2461 grpc_server.cc:122] RequestGet fc_0.w_0
I0613 10:07:19.290731  2461 rpc_server.cc:99] RPCServer WaitCond RequestGet
PC: @                0x0 (unknown)
*** SIGSEGV (@0x8) received by PID 2383 (TID 0x7fa54ffef700) from PID 8; stack trace: ***
    @     0x7fa8b49ff390 (unknown)
    @     0x7fa85ec0213b paddle::operators::ConditionalBlockOp::RunImpl()
    @     0x7fa85e1e2638 paddle::framework::Executor::RunPreparedContext()
    @     0x7fa85eedd4cf _ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultIS0_IN6paddle8platform13EnforceNotMetESt14default_deleteISA_EEEES3_ESt12_Bind_simpleIFSt17reference_wrapperIZNS8_9framework10ThreadPool18RunAndGetExceptionIZNS8_9operatorsL21ParallelExecuteBlocksERKSt6vectorImSaImEEPNSI_8ExecutorERKSM_ISt10shared_ptrINSI_22ExecutorPrepareContextEESaISV_EEPNSI_11ProgramDescEPNSI_5ScopeEEUlvE_EESt6futureISD_ET_EUlvE_EvEESD_EEE9_M_invokeERKSt9_Any_data
    @     0x7fa85ed5c8ce std::__future_base::_State_baseV2::_M_do_set()
    @     0x7fa8b49fca99 __pthread_once_slow
    @     0x7fa85eedcc87 _ZNSt13__future_base11_Task_stateIZN6paddle9framework10ThreadPool18RunAndGetExceptionIZNS1_9operatorsL21ParallelExecuteBlocksERKSt6vectorImSaImEEPNS2_8ExecutorERKS6_ISt10shared_ptrINS2_22ExecutorPrepareContextEESaISF_EEPNS2_11ProgramDescEPNS2_5ScopeEEUlvE_EESt6futureISt10unique_ptrINS1_8platform13EnforceNotMetESt14default_deleteISS_EEET_EUlvE_SaIiEFSV_vEE6_M_runEv
    @     0x7fa85f0fa3c4 paddle::framework::ThreadPool::TaskLoop()
    @     0x7fa87369dc80 (unknown)
    @     0x7fa8b49f56ba start_thread
    @     0x7fa8b472b41d clone
    @                0x0 (unknown)

Needs to support condition_block_op in DistributeTranspiler.

@velconia
Copy link
Collaborator

ConditionalBlock has the attribute: sub_block, the Block in sub_block attribute will NOT be cloned into pserver_program in dist_transpiler, we just added these sub_block recursivly

@Yancey1989
Copy link
Contributor

As the comment #11520 (comment), there also has a bug for sub_block in pserver.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants