MPI cluster error: SocketChannel.cpp check failed len >= 0 #2585
Comments
Reduce the learning rate
Mon Jun 26 10:30:50 2017[1,7]<stderr>:+ ./paddle_trainer --num_gradient_servers=100 --trainer_id=7 --pservers=... --rdma_tcp=tcp --nics=xgbe0 --port=8429 --ports_num=1 --test_all_data_in_one_period=true --log_period=20 --num_passes=50 --trainer_count=8 --config_args=is_cluster=1 --ports_num_for_sparse=1 --local=0 --config=conf/trainer_config.conf --save_dir=./output --use_gpu=0
Mon Jun 26 10:30:51 2017[1,7]<stderr>:[INFO 2017-06-26 10:30:51,104 networks.py:1482] The input order is [ad_user_layer, ad_unit_layer, ad_plan_layer, user_layer, media_layer, label]
Mon Jun 26 10:30:51 2017[1,7]<stderr>:[INFO 2017-06-26 10:30:51,105 networks.py:1488] The output order is [__cost_0__]
Mon Jun 26 10:31:06 2017[1,92]<stderr>:*** Aborted at 1498444266 (unix time) try "date -d @1498444266" if you are using GNU date ***
Mon Jun 26 10:31:06 2017[1,92]<stderr>:PC: @ 0x0 (unknown)
Mon Jun 26 10:31:06 2017[1,80]<stderr>:*** Aborted at 1498444266 (unix time) try "date -d @1498444266" if you are using GNU date ***
Mon Jun 26 10:31:06 2017[1,80]<stderr>:PC: @ 0x0 (unknown)
Mon Jun 26 10:31:06 2017[1,80]<stderr>:*** SIGSEGV (@0x0) received by PID 7976 (TID 0x7f991b84d780) from PID 0; stack trace: ***
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x7f991b008160 (unknown)
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x6bfd35 paddle::ClassificationErrorEvaluator::evalImp()
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x6b8690 paddle::Evaluator::eval()
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x6a2708 paddle::CombinedEvaluator::eval()
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x6a857a paddle::MultiGradientMachine::eval()
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x74bdf6 paddle::TrainerInternal::trainOneBatch()
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x749616 paddle::Trainer::trainOneDataBatch()
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x749abd paddle::Trainer::trainOnePass()
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x74a375 paddle::Trainer::train()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:*** Aborted at 1498444266 (unix time) try "date -d @1498444266" if you are using GNU date ***
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x5a3d70 main
Mon Jun 26 10:31:06 2017[1,31]<stderr>:PC: @ 0x0 (unknown)
Mon Jun 26 10:31:06 2017[1,92]<stderr>:*** SIGSEGV (@0x0) received by PID 10024 (TID 0x7fe638281780) from PID 0; stack trace: ***
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x7fe637a3c160 (unknown)
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x6bfd35 paddle::ClassificationErrorEvaluator::evalImp()
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x6b8690 paddle::Evaluator::eval()
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x6a2708 paddle::CombinedEvaluator::eval()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:*** SIGSEGV (@0x0) received by PID 11243 (TID 0x7f2fde0d9780) from PID 0; stack trace: ***
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x6a857a paddle::MultiGradientMachine::eval()
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x74bdf6 paddle::TrainerInternal::trainOneBatch()
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x749616 paddle::Trainer::trainOneDataBatch()
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x7f2fdd894160 (unknown)
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x749abd paddle::Trainer::trainOnePass()
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x6bfd35 paddle::ClassificationErrorEvaluator::evalImp()
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x74a375 paddle::Trainer::train()
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x5a3d70 main
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x6b8690 paddle::Evaluator::eval()
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x6a2708 paddle::CombinedEvaluator::eval()
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x6a857a paddle::MultiGradientMachine::eval()
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x74bdf6 paddle::TrainerInternal::trainOneBatch()
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x749616 paddle::Trainer::trainOneDataBatch()
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x749abd paddle::Trainer::trainOnePass()
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x74a375 paddle::Trainer::train()
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x7fe636c80bd5 __libc_start_main
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x5a3d70 main
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x7f991a24cbd5 __libc_start_main
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x5b2169 (unknown)
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x5b2169 (unknown)
Mon Jun 26 10:31:06 2017[1,92]<stderr>: @ 0x0 (unknown)
Mon Jun 26 10:31:06 2017[1,80]<stderr>: @ 0x0 (unknown)
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x7f2fdcad8bd5 __libc_start_main
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x5b2169 (unknown)
Mon Jun 26 10:31:06 2017[1,31]<stderr>: @ 0x0 (unknown)
Mon Jun 26 10:31:07 2017[1,45]<stderr>:*** Aborted at 1498444267 (unix time) try "date -d @1498444267" if you are using GNU date ***
Mon Jun 26 10:31:07 2017[1,45]<stderr>:PC: @ 0x0 (unknown)
Mon Jun 26 10:31:07 2017[1,31]<stderr>:./train.sh: line 207: 11243 Segmentation fault PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Mon Jun 26 10:31:07 2017[1,31]<stderr>:+ '[' 139 -ne 0 ']'
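The stack traces from ranks 31, 80 and 92 above are interleaved line by line, which makes each one hard to read. A minimal sketch for regrouping such output by rank follows, assuming every line carries the `[job,rank]<stream>:` tag seen in this log. The name `group_by_rank` is hypothetical and not part of Paddle or MPI.

```python
import re
from collections import defaultdict

# Matches the tag prepended to each line, e.g. "[1,80]<stderr>:" or "[1,53]:".
# Assumption: the second number is the MPI rank, as the log above suggests.
TAG = re.compile(r"\[(\d+),(\d+)\](?:<(\w+)>)?:")


def group_by_rank(lines):
    """Return {rank: [message, ...]}, keeping each rank's lines in order."""
    grouped = defaultdict(list)
    for line in lines:
        match = TAG.search(line)
        if match is None:
            continue  # no rank tag on this line; skip it
        rank = int(match.group(2))
        # Keep only the text after the tag, without surrounding whitespace.
        grouped[rank].append(line[match.end():].strip())
    return dict(grouped)
```

Feeding the saved job log through `group_by_rank(open(path))` and printing one rank at a time gives one contiguous trace per trainer process.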
Closed
Please refer to the latest FAQ: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/faq/index_cn.rst#id15. Closing this for now.
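Symptom: cluster training fails with an error, while local mode runs normally. Along the way, using classification_cost reported a top k error; I tried the fix from #2574, and the related errors are as follows: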
Fri Jun 23 15:40:01 2017[1,53]:F0623 15:40:01.543020 31957 LightNetwork.cpp:397] Check failed: error >= 0 ERROR connecting to 10.87.100.36: Connection refused [111]
Fri Jun 23 15:40:01 2017[1,53]:*** Check failure stack trace: ***
Fri Jun 23 15:40:01 2017[1,53]:F0623 15:40:01.543030 32628 LightNetwork.cpp:397] Check failed: error >= 0 ERROR connecting to 10.87.100.36: Connection refused [111]
Fri Jun 23 15:40:01 2017[1,53]:*** Check failure stack trace: ***
Fri Jun 23 15:40:01 2017[1,53]: @ 0x91316d google::LogMessage::Fail()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x91316d google::LogMessage::Fail()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x916c1c google::LogMessage::SendToLog()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x912c93 google::LogMessage::Flush()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x916c1c google::LogMessage::SendToLog()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x912e99 google::LogMessage::~LogMessage()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x912c93 google::LogMessage::Flush()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x916147 google::ErrnoLogMessage::~ErrnoLogMessage()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x912e99 google::LogMessage::~LogMessage()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x768fa1 paddle::SocketClient::TcpClient()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x916147 google::ErrnoLogMessage::~ErrnoLogMessage()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x7691a1 paddle::SocketClient::SocketClient()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x768fa1 paddle::SocketClient::TcpClient()
Fri Jun 23 15:40:01 2017[1,53]: @ 0xf06e50 paddle::ParameterClient2::init()
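`Connection refused [111]` in the trace above means the trainer reached the host but nothing was accepting connections on the target port at that moment. One way to check this outside of Paddle is a plain TCP probe run from a trainer node. The sketch below uses only the Python standard library; `port_is_open` is a hypothetical helper, and the port to test is an assumption, since this log does not show which port the pserver was started on.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused (errno 111 on Linux), timeouts,
        # unreachable hosts, and name resolution failures.
        return False
```

Calling it with the pserver address from the log and the `--port` value the job was launched with shows whether the pserver process is listening before the trainers start. If it returns False, the pserver may have exited already or may not have finished starting.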