fluid分布式pserver出现SIGSEGV异常 #9351
A language-model job using fluid.layers.dynamic_gru: a distributed CPU job with 1 pserver and 1 trainer, on the imikolov dataset.
How can I reproduce this error? What code did you run?
The code belongs to a business team, so it is not convenient to post it here directly; I have already given it to you privately.
Cannot reproduce with an empty dict and the code you provided. Still need more details.
The dict does exist, and it is also present inside the Docker container; I have shared it offline separately as well.
I can reproduce it by logging in via SSH to the environment provided by the user. It looks like a bug in the sparse update path; @Yancey1989, could you please take a look?
OK.
After #9409 was merged, there is another error on the PServer:
E0328 03:54:46.540556 3693 listen_and_serv_op.cc:152] run sub program error enforce framework::product(lr_dims) == 1 failed, 0 != 1
Learning rate should have 1 element at [/paddle/paddle/fluid/operators/sgd_op.cc:36]
PaddlePaddle Call Stacks:
0 0x7f6a2a6ddf1cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
1 0x7f6a2ae144eap paddle::operators::SGDOp::InferShape(paddle::framework::InferShapeContext*) const + 970
2 0x7f6a2aed7358p paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 104
3 0x7f6a2aed4cd8p paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 72
4 0x7f6a2a78ff66p paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool) + 1750
5 0x7f6a2a790e77p paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool) + 103
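The enforce above fires because the SGD op requires the learning rate to be a tensor holding exactly one element, and the broken pserver program hands it an empty tensor (0 != 1). A minimal sketch of that invariant in plain NumPy (illustrative only, not PaddlePaddle's actual kernel):

```python
import numpy as np

def sgd_update(param, grad, lr):
    """Apply one SGD step: param - lr * grad.

    Mirrors the shape check in sgd_op.cc: the learning rate must be
    a tensor with exactly one element, otherwise the op raises
    (the `framework::product(lr_dims) == 1` enforce).
    """
    lr = np.asarray(lr)
    if lr.size != 1:
        raise ValueError(
            "Learning rate should have 1 element, got %d" % lr.size)
    return param - lr.item() * grad

param = np.array([1.0, 2.0, 3.0])
grad = np.array([0.5, 0.5, 0.5])
print(sgd_update(param, grad, np.array([0.1])))   # well-formed scalar LR

# An empty LR tensor, as produced by the bad pserver program,
# trips the check:
try:
    sgd_update(param, grad, np.array([]))
except ValueError as e:
    print(e)
```

In the failing job the learning-rate decay variables were not transpiled onto the pserver correctly, so the LR tensor arriving at the SGD op was empty rather than scalar.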
This is a new issue: adding learning rate decay causes the transpiler to generate a wrong pserver program. Moving the discussion to #9429.
After #9489 was merged, this model runs well with more than one trainer instance (I tested it with 2 trainers + 2 pservers), but there is still an error with a single trainer instance.
commit: c83dd9b4c2e4685319773d4bf6c2164c498cd1dc
Detailed pserver error information:
*** Aborted at 1521898840 (unix time) try "date -d @1521898840" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x1783000000) received by PID 498 (TID 0x7fd12ebfd700) from PID 18446744071612399616; stack trace: ***
@ 0x7fd311f42500 (unknown)
@ 0x7fd287c5bc98 paddle::memory::detail::MetadataCache::load()
@ 0x7fd287c5bb3c paddle::memory::detail::MemoryBlock::split()
@ 0x7fd287c5a542 _ZN6paddle6memory6detail14BuddyAllocator12SplitToAllocESt23_Rb_tree_const_iteratorISt5tupleIJmmPvEEEm
@ 0x7fd287c5aa45 paddle::memory::detail::BuddyAllocator::Alloc()
@ 0x7fd287c57f25 paddle::memory::Alloc<>()
@ 0x7fd287bd224b paddle::framework::Tensor::PlaceholderImpl<>::PlaceholderImpl()
@ 0x7fd287bd77f8 paddle::framework::Tensor::mutable_data()
@ 0x7fd2882bdbbf paddle::operators::detail::VariableResponse::CopyLodTensorData()
@ 0x7fd2882be537 paddle::operators::detail::VariableResponse::Parse()
@ 0x7fd2882b8e4d grpc::ServerInterface::PayloadAsyncRequest<>::FinalizeResult()
@ 0x7fd2883294f3 grpc::CompletionQueue::AsyncNextInternal()
@ 0x7fd2882b4adb paddle::operators::detail::AsyncGRPCServer::HandleRequest()
@ 0x7fd2882b88b5 std::thread::_Impl<>::_M_run()
@ 0x7fd3078f7470 (unknown)
@ 0x7fd311f3a851 start_thread
@ 0x7fd3115fd90d clone
@ 0x0 (unknown)