Fluid distributed pserver crashes with SIGSEGV #9351

Closed
xymyeah opened this issue Mar 24, 2018 · 11 comments

@xymyeah
Contributor

xymyeah commented Mar 24, 2018

The Fluid distributed pserver crashes with SIGSEGV.
commit: c83dd9b4c2e4685319773d4bf6c2164c498cd1dc

Pserver crash details:
*** Aborted at 1521898840 (unix time) try "date -d @1521898840" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x1783000000) received by PID 498 (TID 0x7fd12ebfd700) from PID 18446744071612399616; stack trace: ***
@ 0x7fd311f42500 (unknown)
@ 0x7fd287c5bc98 paddle::memory::detail::MetadataCache::load()
@ 0x7fd287c5bb3c paddle::memory::detail::MemoryBlock::split()
@ 0x7fd287c5a542 _ZN6paddle6memory6detail14BuddyAllocator12SplitToAllocESt23_Rb_tree_const_iteratorISt5tupleIJmmPvEEEm
@ 0x7fd287c5aa45 paddle::memory::detail::BuddyAllocator::Alloc()
@ 0x7fd287c57f25 paddle::memory::Alloc<>()
@ 0x7fd287bd224b paddle::framework::Tensor::PlaceholderImpl<>::PlaceholderImpl()
@ 0x7fd287bd77f8 paddle::framework::Tensor::mutable_data()
@ 0x7fd2882bdbbf paddle::operators::detail::VariableResponse::CopyLodTensorData()
@ 0x7fd2882be537 paddle::operators::detail::VariableResponse::Parse()
@ 0x7fd2882b8e4d grpc::ServerInterface::PayloadAsyncRequest<>::FinalizeResult()
@ 0x7fd2883294f3 grpc::CompletionQueue::AsyncNextInternal()
@ 0x7fd2882b4adb paddle::operators::detail::AsyncGRPCServer::HandleRequest()
@ 0x7fd2882b88b5 std::thread::_Impl<>::_M_run()
@ 0x7fd3078f7470 (unknown)
@ 0x7fd311f3a851 start_thread
@ 0x7fd3115fd90d clone
@ 0x0 (unknown)
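
The header lines of the trace can be decoded without access to the failing machine. The sketch below converts the abort timestamp to UTC (the same thing `date -d @1521898840` does) and checks one plausible reading of the impossible "from PID" value: that it is the low 32 bits of the fault address, sign-extended to 64 bits. That reading is an assumption based on the numbers in this log, not something confirmed in the thread, and the helper names are made up for illustration.

```python
from datetime import datetime, timezone

# Values copied from the crash header above.
ABORT_UNIX_TIME = 1521898840          # "*** Aborted at 1521898840 (unix time)"
FAULT_ADDR = 0x1783000000             # "SIGSEGV (@0x1783000000)"
REPORTED_PID = 18446744071612399616   # "from PID 18446744071612399616"


def abort_time_utc(ts):
    """Return the abort time as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


def low32_sign_extended(addr):
    """Take the low 32 bits of addr as a signed int, then view it as unsigned 64-bit."""
    low = addr & 0xFFFFFFFF
    if low >= 0x80000000:
        low -= 1 << 32
    return low & 0xFFFFFFFFFFFFFFFF


print(abort_time_utc(ABORT_UNIX_TIME))                 # 2018-03-24T13:40:40+00:00
print(hex(REPORTED_PID))                               # 0xffffffff83000000
print(low32_sign_extended(FAULT_ADDR) == REPORTED_PID) # True
```

If that reading is right, the "from PID" field carries no information about another process and can be ignored when triaging; the useful part of the trace is the allocation path through `BuddyAllocator::Alloc()` and `MemoryBlock::split()`.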

@xymyeah
Contributor Author

xymyeah commented Mar 24, 2018

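This is a language-model job using fluid.layers.dynamic_gru, run as a distributed CPU job with 1 pserver and 1 trainer; the dataset is imikolov.
The configuration is as follows:
Number of trainers: 1
CPU cores per trainer process: 10
Memory per trainer process: 15Gi
Number of parameter server nodes: 1
CPU cores per parameter server: 10
Memory per parameter server: 15Gi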

@typhoonzero
Contributor

How can I reproduce this error? What code did you run?

@xymyeah
Contributor Author

xymyeah commented Mar 24, 2018

The code belongs to the business team, so I can't paste it here directly; I have already sent it to you privately.

@typhoonzero
Contributor

I cannot reproduce this with an empty dict and the code you gave me. I still need more details.

@typhoonzero added the User label (used to mark user questions) on Mar 26, 2018
@xymyeah
Contributor Author

xymyeah commented Mar 26, 2018

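The dict does exist, including inside the Docker container, and I also sent it to you separately offline.
I have already provided the runtime environment details offline as well.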

@typhoonzero
Contributor

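Using the environment the user provided, I can reproduce it after logging in over ssh. It looks like a sparse-related bug; @Yancey1989, could you please take a look?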

@Yancey1989
Contributor

OK.

@Yancey1989
Contributor

After #9409 was merged, there is another error on the PServer:

E0328 03:54:46.540556  3693 listen_and_serv_op.cc:152] run sub program error enforce framework::product(lr_dims) == 1 failed, 0 != 1
Learning rate should have 1 element at [/paddle/paddle/fluid/operators/sgd_op.cc:36]
PaddlePaddle Call Stacks:
0       0x7f6a2a6ddf1cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
1       0x7f6a2ae144eap paddle::operators::SGDOp::InferShape(paddle::framework::InferShapeContext*) const + 970
2       0x7f6a2aed7358p paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 104
3       0x7f6a2aed4cd8p paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 72
4       0x7f6a2a78ff66p paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool) + 1750
5       0x7f6a2a790e77p paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool) + 103
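
The enforce message says that the product of the learning-rate dimensions was 0 where exactly 1 was required. The sketch below restates that check in plain Python rather than Paddle's C++, to show how a shape with a zero-sized dimension yields `0 != 1`. The function names and the example shapes `[1]` and `[0]` are hypothetical; the log does not show the actual dimensions of the learning-rate variable, only that their product was 0.

```python
from functools import reduce
from operator import mul


def product(dims):
    """Number of elements implied by a shape: the product of its dimensions."""
    return reduce(mul, dims, 1)


def check_learning_rate_dims(lr_dims):
    """Illustrative stand-in for the enforce in the log; not Paddle's implementation."""
    n = product(lr_dims)
    if n != 1:
        raise ValueError(
            f"enforce product(lr_dims) == 1 failed, {n} != 1: "
            "Learning rate should have 1 element"
        )


check_learning_rate_dims([1])      # a one-element learning rate passes

try:
    check_learning_rate_dims([0])  # hypothetical zero-element shape
except ValueError as err:
    print(err)
```

A zero-element shape is what one would expect from a variable that exists in the pserver program but was never given a value, which fits the later diagnosis in this thread that the transpiler generated a wrong pserver program once learning rate decay was added.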

@Yancey1989 reopened this on Mar 28, 2018
@typhoonzero
Contributor

This is a new issue: adding learning rate decay causes the transpiler to generate a wrong pserver program. Moving the discussion to #9429.

@Yancey1989
Contributor

After #9489 was merged, this model runs well with more than one trainer instance (I tested it with 2 trainers + 2 pservers), but it still hits an error with a single trainer instance.

@Yancey1989
Contributor

After #9558 was merged, it passed the test with both 1 trainer + 1 pserver and 2 trainers + 2 pservers.

Hi @xymyeah, I tested the job on my host and it works well; could you double-check your job with the latest code?
