Distributed training for transformer core dump #11387

Closed
xuezhong opened this issue Jun 12, 2018 · 3 comments

@xuezhong
Contributor

Paddle version: 19fd071

Transformer training script: PaddlePaddle/models#982

Command to reproduce the core dump:
python -u train.py --src_vocab_fpath /paddle/dataset/nist06n/cn_30001.dict --trg_vocab_fpath /paddle/dataset/nist06n/en_30001.dict --train_file_pattern '/paddle/train/part-*' --use_token_batch True --batch_size 1024 --pool_size 10000 --shuffle True --shuffle_batch True --sort_type pool --special_token '_GO' '_EOS' '_UNK'

Environment variables for the pserver:
export PADDLE_PSERVERS=127.0.0.1
export POD_IP=127.0.0.1
export PADDLE_TRAINERS_NUM=2
export PADDLE_TRAINER_ID=0
export TRAINING_ROLE=PSERVER
export PADDLE_IS_LOCAL=0
export PADDLE_PORT=6176

Environment variables for the two trainers:
export CUDA_VISIBLE_DEVICES=3
export TRAINING_ROLE=TRAINER
export PADDLE_PSERVERS=127.0.0.1
export POD_IP=127.0.0.1
export PADDLE_TRAINERS_NUM=2
export PADDLE_TRAINER_ID=1
export PADDLE_IS_LOCAL=0
export PADDLE_PORT=6176

export CUDA_VISIBLE_DEVICES=2
export TRAINING_ROLE=TRAINER
export PADDLE_PSERVERS=127.0.0.1
export POD_IP=127.0.0.1
export PADDLE_TRAINERS_NUM=2
export PADDLE_TRAINER_ID=2
export PADDLE_IS_LOCAL=0
export PADDLE_PORT=6176

@panyx0718 panyx0718 self-assigned this Jun 12, 2018
@xuezhong
Contributor Author

xuezhong commented Jun 12, 2018

Core dump info with Paddle version 19fd071:
psserver begin run
*** Aborted at 1528781171 (unix time) try "date -d @1528781171" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 9632 (TID 0x7fc897fff700) from PID 0; stack trace: ***
@ 0x7fca7c762390 (unknown)
@ 0x7fca41b3782b paddle::operators::detail::VariableResponse::CopyLodTensorData()
@ 0x7fca41b39d51 paddle::operators::detail::VariableResponse::Parse()
@ 0x7fca41b31f28 grpc::ServerInterface::PayloadAsyncRequest<>::FinalizeResult()
@ 0x7fca41b56c42 grpc::CompletionQueue::AsyncNextInternal()
@ 0x7fca41b2d192 paddle::operators::detail::AsyncGRPCServer::HandleRequest()
@ 0x7fca41b30607 std::thread::_Impl<>::_M_run()
@ 0x7fca76f4fc80 (unknown)
@ 0x7fca7c7586ba start_thread
@ 0x7fca7c48e41d clone
@ 0x0 (unknown)
Segmentation fault

@xuezhong
Contributor Author

xuezhong commented Jun 12, 2018

More debug info with Paddle version c36dd3b:
(gdb) bt
Python Exception <type 'exceptions.ImportError'> No module named gdb.frames:
#0 0x00007fe94a0d0f60 in std::unique_ptr<paddle::framework::Variable::Placeholder, std::default_delete<paddle::framework::Variable::Placeholder> >::get (
this=0x0) at /usr/include/c++/5/bits/unique_ptr.h:305
#1 0x00007fe94a0d0db8 in std::unique_ptr<paddle::framework::Variable::Placeholder, std::default_delete<paddle::framework::Variable::Placeholder> >::operator bool (this=0x0) at /usr/include/c++/5/bits/unique_ptr.h:319
#2 0x00007fe94a0b8ea7 in std::operator!=<paddle::framework::Variable::Placeholder, std::default_delete<paddle::framework::Variable::Placeholder> >(std::unique_ptr<paddle::framework::Variable::Placeholder, std::default_delete<paddle::framework::Variable::Placeholder> > const&, decltype(nullptr)) (__x=...)
at /usr/include/c++/5/bits/unique_ptr.h:648
#3 0x00007fe94a0daf3c in paddle::framework::Variable::IsType<paddle::framework::LoDTensor> (this=0x0) at /paddle/Paddle/paddle/fluid/framework/variable.h:49
#4 0x00007fe94a0c72a1 in paddle::framework::Variable::GetMutable<paddle::framework::LoDTensor> (this=0x0)
at /paddle/Paddle/paddle/fluid/framework/variable.h:41
#5 0x00007fe94b5ed327 in paddle::operators::detail::VariableResponse::CopyLodTensorData (this=0x7fe7ac05e4f0, input=0x7fe77a1fb940, ctx=..., dims=...,
length=2048) at /paddle/Paddle/paddle/fluid/operators/detail/variable_response.cc:121
#6 0x00007fe94b5eea98 in paddle::operators::detail::VariableResponse::Parse (this=0x7fe7ac05e4f0, source=0x7fe77a1fbb20)
at /paddle/Paddle/paddle/fluid/operators/detail/variable_response.cc:405
#7 0x00007fe94b5e0c3d in grpc::SerializationTraits<paddle::operators::detail::VariableResponse, void>::Deserialize (buffer=0x7fe79c001460,
msg=0x7fe7ac05e4f0, max_message_size=2147483647) at /paddle/Paddle/paddle/fluid/operators/detail/grpc_service.h:63
#8 0x00007fe94b5eb984 in grpc::ServerInterface::PayloadAsyncRequest<paddle::operators::detail::VariableResponse>::FinalizeResult (this=0x7fe7ac05e5d0,
tag=0x7fe77a1fbce8, status=0x7fe77a1fbcd4) at /paddle/build/third_party/install/grpc/include/grpc++/impl/codegen/server_interface.h:194
#9 0x00007fe94b608362 in grpc::CompletionQueue::AsyncNextInternal(void**, bool*, gpr_timespec) () at /usr/include/c++/5/bits/stl_construct.h:75
#10 0x00007fe94b5d17c3 in grpc::CompletionQueue::Next (this=0x7fe7ac0191b0, tag=0x7fe77a1fbce8, ok=0x7fe77a1fbcd4)
at /paddle/build/third_party/install/grpc/include/grpc++/impl/codegen/completion_queue.h:168
#11 0x00007fe94b5de10d in paddle::operators::detail::AsyncGRPCServer::HandleRequest(grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>) (this=0x140261f0, cq=0x7fe7ac0191b0, rpc_name=..., TryToRegisterNewOne=...) at /paddle/Paddle/paddle/fluid/operators/detail/grpc_server.cc:301
#12 0x00007fe94b5ecb0a in std::_Mem_fn_base<void (paddle::operators::detail::AsyncGRPCServer::*)(grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>), true>::operator()<grpc::ServerCompletionQueue*&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>&, void>(paddle::operators::detail::AsyncGRPCServer*, grpc::ServerCompletionQueue*&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>&) const (this=0x7fe7ac07d468, __object=0x140261f0) at /usr/include/c++/5/functional:600
#13 0x00007fe94b5ec9ad in std::_Bind<std::_Mem_fn<void (paddle::operators::detail::AsyncGRPCServer::*)(grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>)> (paddle::operators::detail::AsyncGRPCServer*, grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>)>::__call<void, , 0ul, 1ul, 2ul, 3ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul>) (this=0x7fe7ac07d468, __args=...) at /usr/include/c++/5/functional:1074
---Type <return> to continue, or q <return> to quit---
#14 0x00007fe94b5ec7e1 in std::_Bind<std::_Mem_fn<void (paddle::operators::detail::AsyncGRPCServer::*)(grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>)> (paddle::operators::detail::AsyncGRPCServer*, grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>)>::operator()<, void>()
(this=0x7fe7ac07d468) at /usr/include/c++/5/functional:1133
#15 0x00007fe94b5ec7a6 in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (paddle::operators::detail::AsyncGRPCServer::*)(grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>)> (paddle::operators::detail::AsyncGRPCServer*, grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>)> ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x7fe7ac07d468) at /usr/include/c++/5/functional:1531
#16 0x00007fe94b5ec1dc in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (paddle::operators::detail::AsyncGRPCServer::*)(grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>)> (paddle::operators::detail::AsyncGRPCServer*, grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>)> ()>::operator()() (this=0x7fe7ac07d468) at /usr/include/c++/5/functional:1520
#17 0x00007fe94b5ebb3c in std::thread::_Impl<std::_Bind_simple<std::_Bind<std::_Mem_fn<void (paddle::operators::detail::AsyncGRPCServer::*)(grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>)> (paddle::operators::detail::AsyncGRPCServer*, grpc::ServerCompletionQueue*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)>)> ()> >::_M_run() (this=0x7fe7ac07d450) at /usr/include/c++/5/thread:115
#18 0x00007fe97d90bc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#19 0x00007fe98718f6ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007fe986ec541d in clone () from /lib/x86_64-linux-gnu/libc.so.6

@luotao1 luotao1 added the "User" label (used to mark user questions) Jun 12, 2018
@panyx0718
Contributor

Checked offline: the crash was caused by an incorrect trainer_id setting.
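
The thread does not say which value was wrong. One plausible reading, assuming trainer ids are 0-based and must be smaller than PADDLE_TRAINERS_NUM (an assumption, not confirmed in this thread), is that the second trainer's PADDLE_TRAINER_ID=2 is out of range when PADDLE_TRAINERS_NUM=2. A minimal sketch of a sanity check that could run before training starts; the helper name `check_trainer_id` is hypothetical and not part of Paddle:

```python
import os


def check_trainer_id(env=os.environ):
    """Return an error message if the trainer id looks out of range, else None.

    Assumption (not confirmed by the thread): valid ids are 0 .. trainers_num - 1.
    """
    if env.get("TRAINING_ROLE", "") != "TRAINER":
        return None  # this sketch only checks trainer processes
    trainers_num = int(env["PADDLE_TRAINERS_NUM"])
    trainer_id = int(env["PADDLE_TRAINER_ID"])
    if not 0 <= trainer_id < trainers_num:
        return "PADDLE_TRAINER_ID=%d is outside the range 0..%d" % (
            trainer_id, trainers_num - 1)
    return None


if __name__ == "__main__":
    message = check_trainer_id()
    if message is not None:
        raise SystemExit(message)
```

Under that assumption, the two trainers above would use PADDLE_TRAINER_ID=0 and PADDLE_TRAINER_ID=1 instead of 1 and 2.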
