run dist demo word2vec failed #8678

Closed
Yancey1989 opened this issue Mar 1, 2018 · 1 comment

Yancey1989 commented Mar 1, 2018

Starting 10 pservers and 10 trainers with sparse update turned on makes the pservers crash:

server-1:

I0301 08:39:03.283473 12305 executor.cc:134] CPUPlace Op(sgd), inputs:{Grad[fc_1.w_0@GRAD.block8[26, 2073]({})], LearningRate[tmp_9[1]({})], Param[fc_1.w_0.block8[26, 2073]({})]}, outputs:{ParamOut[fc_1.w_0.block8[26, 2073]({})]}.
I0301 08:39:03.283483 12305 operator.cc:521] expected_kernel_key:data_type[float32]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0301 08:39:03.310343 12305 listen_and_serv_op.cc:117] received grad: shared_w@GRAD.block1.trainer_9
*** Aborted at 1519893543 (unix time) try "date -d @1519893543" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 12305 (TID 0x7f4879419700) from PID 0; stack trace: ***
    @     0x7f4878df3390 (unknown)
    @     0x7f482da161e7 paddle::memory::detail::MetadataCache::load()
    @     0x7f482da15c73 paddle::memory::detail::MemoryBlock::type()
    @     0x7f482da150a1 paddle::memory::detail::BuddyAllocator::Free()
    @     0x7f482da11060 paddle::memory::Free<>()
    @     0x7f482d97321e paddle::framework::Tensor::PlaceholderImpl<>::~PlaceholderImpl()
    @     0x7f482d979a26 std::_Sp_counted_base<>::_M_release()
    @     0x7f482d97d27b paddle::framework::Tensor::mutable_data()
    @     0x7f482e0e55ae paddle::framework::TensorFromStream()
    @     0x7f482e0e0481 paddle::framework::DeserializeFromStream()
    @     0x7f482e0c3b8b paddle::operators::detail::DeserializeFromMessage()
    @     0x7f482e017ad1 paddle::operators::ListenAndServOp::RunImpl()
    @     0x7f482e0772a8 paddle::framework::OperatorBase::Run()

server-2:

E0301 08:39:03.331059 12306 listen_and_serv_op.cc:139] run sub program error enforce numel() > 0 failed, 0 <= 0
When calling this method, the Tensor's numel must be larger than zero. Please check Tensor::Resize has been called first. at [/paddle/paddle/fluid/framework/tensor_impl.h:123]
PaddlePaddle Call Stacks:
0       0x7f98da6a876cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
1       0x7f98da6ae521p paddle::framework::Tensor::mutable_data(boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::type_index) + 1233
2       0x7f98dad0fafbp paddle::operators::SumKernel<paddle::platform::CPUDeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const + 1323
3       0x7f98dadaa944p paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 1556
4       0x7f98dada82a8p paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 72

Yancey1989 added the Bug label Mar 1, 2018
Yancey1989 self-assigned this Mar 1, 2018

Yancey1989 commented Mar 7, 2018

Progress update:
There are two bugs in this issue:

  1. As the server-2 log shows, if one of the pserver instances does not receive any sparse vars, its GRAD var is never initialized, so the sum op fails the numel() > 0 check in Tensor::mutable_data. We should handle this case.

  2. The server-1 log looks like a memory bug, but I'm not sure yet; I will follow up on it.
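
A minimal sketch of the first bug, for illustration only. It is a toy model in plain Python, not PaddlePaddle code: `ToyTensor` and `sum_grads` are hypothetical names, and skipping empty gradients is only one assumed way to "deal with this situation", not necessarily the fix that was applied. The sketch reproduces the failing check (calling `mutable_data` on a tensor whose `numel()` is 0) and shows a guard that ignores gradients that never arrived.

```python
# Toy model only: these classes are hypothetical stand-ins, not PaddlePaddle APIs.
from typing import List, Optional


class ToyTensor:
    """Mimics the numel() > 0 check reported in the server-2 log."""

    def __init__(self, shape: Optional[List[int]] = None):
        # An empty shape models a GRAD var that was never resized or initialized.
        self.shape: List[int] = shape or []
        self.data: Optional[List[float]] = None

    def numel(self) -> int:
        n = 1 if self.shape else 0
        for dim in self.shape:
            n *= dim
        return n

    def mutable_data(self) -> List[float]:
        if self.numel() <= 0:
            raise RuntimeError("enforce numel() > 0 failed, 0 <= 0")
        if self.data is None:
            self.data = [0.0] * self.numel()
        return self.data


def sum_grads(grads: List[ToyTensor]) -> Optional[ToyTensor]:
    """Sum only the gradients that were received (assumed guard: skip empty ones)."""
    received = [g for g in grads if g.numel() > 0]
    if not received:
        # This pserver got no sparse vars in this step, so there is nothing to apply.
        return None
    out = ToyTensor(list(received[0].shape))
    buf = out.mutable_data()
    for g in received:
        for i, value in enumerate(g.mutable_data()):
            buf[i] += value
    return out
```

In this sketch a caller would skip the optimizer step when `sum_grads` returns `None`. The toy does not check that all received gradients share one shape, which a real implementation would need to do.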
