MPI cluster error: SocketChannel.cpp check failed len >= 0 #2585

Closed
zhangyong15 opened this issue Jun 23, 2017 · 4 comments

@zhangyong15

zhangyong15 commented Jun 23, 2017

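Symptom: cluster training fails with an error, while local mode runs normally. Along the way, using classification_cost reported a top k error; I tried the fix described in #2574, and the resulting errors are as follows: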
Fri Jun 23 15:40:01 2017[1,53]:F0623 15:40:01.543020 31957 LightNetwork.cpp:397] Check failed: error >= 0 ERROR connecting to 10.87.100.36: Connection refused [111]
Fri Jun 23 15:40:01 2017[1,53]:*** Check failure stack trace: ***
Fri Jun 23 15:40:01 2017[1,53]:F0623 15:40:01.543030 32628 LightNetwork.cpp:397] Check failed: error >= 0 ERROR connecting to 10.87.100.36: Connection refused [111]
Fri Jun 23 15:40:01 2017[1,53]:*** Check failure stack trace: ***
Fri Jun 23 15:40:01 2017[1,53]: @ 0x91316d google::LogMessage::Fail()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x91316d google::LogMessage::Fail()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x916c1c google::LogMessage::SendToLog()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x912c93 google::LogMessage::Flush()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x916c1c google::LogMessage::SendToLog()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x912e99 google::LogMessage::~LogMessage()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x912c93 google::LogMessage::Flush()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x916147 google::ErrnoLogMessage::~ErrnoLogMessage()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x912e99 google::LogMessage::~LogMessage()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x768fa1 paddle::SocketClient::TcpClient()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x916147 google::ErrnoLogMessage::~ErrnoLogMessage()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x7691a1 paddle::SocketClient::SocketClient()
Fri Jun 23 15:40:01 2017[1,53]: @ 0x768fa1 paddle::SocketClient::TcpClient()
Fri Jun 23 15:40:01 2017[1,53]: @ 0xf06e50 paddle::ParameterClient2::init()
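
`Connection refused [111]` is errno `ECONNREFUSED` on Linux, which typically means nothing was accepting connections on the target port when the trainer tried to reach the parameter server. A minimal sketch for checking this from a trainer node, using only Python's standard `socket` module, might look like the following. The helper name `can_connect` is hypothetical, and the log above does not show which port was tried, so the port has to be supplied by whoever runs the check.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111 on Linux) and timeouts.
        return False
```

Calling `can_connect("10.87.100.36", port)` with the port given to the pserver process would be one way to use it. A `False` result suggests the parameter server on that node is not up yet, has already exited, or is blocked by the network, though the sketch cannot tell these cases apart.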

@jacquesqiao
Member

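The first error turned out to be caused by some incorrect data files on the cluster:
[image]
After the user removed these bad files, the job still failed.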

@jacquesqiao
Member

The second error was caused by a floating point exception.
[image]

I suggested that the user reduce the learning rate; this is being tested.
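
As a rough illustration of why a smaller learning rate can help with numeric blow-ups, here is a toy sketch in plain Python. It is not PaddlePaddle code and not the user's model: it runs fixed-step gradient descent on f(x) = x^2, and the function name `run_gradient_descent` and the step count are made up for the example. With too large a step the iterate grows until it overflows to inf and then NaN; with a smaller step it converges.

```python
import math


def run_gradient_descent(lr, steps=2000, x0=1.0):
    """Minimize f(x) = x**2 with fixed-step gradient descent; return final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad   # each step multiplies x by (1 - 2 * lr)
    return x


small = run_gradient_descent(lr=0.1)  # |1 - 2*lr| = 0.8 < 1, so x shrinks toward 0
large = run_gradient_descent(lr=1.5)  # |1 - 2*lr| = 2 > 1, so |x| doubles each step

print(math.isfinite(small), math.isfinite(large))  # prints: True False
```

Whether this is what happened in the failing job is an assumption; the sketch only shows the general mechanism by which an oversized step can turn parameters into non-finite values.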

@zhangyong15
Author

zhangyong15 commented Jun 26, 2017

After reducing the learning rate:

Mon Jun 26 10:30:50 2017[1,7]<stderr>:+ ./paddle_trainer --num_gradient_servers=100 --trainer_id=7 --pservers=...  --rdma_tcp=tcp --nics=xgbe0 --port=8429 --ports_num=1 --test_all_data_in_one_period=true --log_period=20 --num_passes=50 --trainer_count=8 --config_args=is_cluster=1 --ports_num_for_sparse=1 --local=0 --config=conf/trainer_config.conf --save_dir=./output --use_gpu=0
Mon Jun 26 10:30:51 2017[1,7]<stderr>:[INFO 2017-06-26 10:30:51,104 networks.py:1482] The input order is [ad_user_layer, ad_unit_layer, ad_plan_layer, user_layer, media_layer, label]
Mon Jun 26 10:30:51 2017[1,7]<stderr>:[INFO 2017-06-26 10:30:51,105 networks.py:1488] The output order is [__cost_0__]
Mon Jun 26 10:31:06 2017[1,92]<stderr>:*** Aborted at 1498444266 (unix time) try "date -d @1498444266" if you are using GNU date ***
Mon Jun 26 10:31:06 2017[1,92]<stderr>:PC: @                0x0 (unknown)
Mon Jun 26 10:31:06 2017[1,80]<stderr>:*** Aborted at 1498444266 (unix time) try "date -d @1498444266" if you are using GNU date ***
Mon Jun 26 10:31:06 2017[1,80]<stderr>:PC: @                0x0 (unknown)
Mon Jun 26 10:31:06 2017[1,80]<stderr>:*** SIGSEGV (@0x0) received by PID 7976 (TID 0x7f991b84d780) from PID 0; stack trace: ***
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @     0x7f991b008160 (unknown)
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @           0x6bfd35 paddle::ClassificationErrorEvaluator::evalImp()
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @           0x6b8690 paddle::Evaluator::eval()
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @           0x6a2708 paddle::CombinedEvaluator::eval()
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @           0x6a857a paddle::MultiGradientMachine::eval()
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @           0x74bdf6 paddle::TrainerInternal::trainOneBatch()
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @           0x749616 paddle::Trainer::trainOneDataBatch()
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @           0x749abd paddle::Trainer::trainOnePass()
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @           0x74a375 paddle::Trainer::train()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:*** Aborted at 1498444266 (unix time) try "date -d @1498444266" if you are using GNU date ***
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @           0x5a3d70 main
Mon Jun 26 10:31:06 2017[1,31]<stderr>:PC: @                0x0 (unknown)
Mon Jun 26 10:31:06 2017[1,92]<stderr>:*** SIGSEGV (@0x0) received by PID 10024 (TID 0x7fe638281780) from PID 0; stack trace: ***
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @     0x7fe637a3c160 (unknown)
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @           0x6bfd35 paddle::ClassificationErrorEvaluator::evalImp()
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @           0x6b8690 paddle::Evaluator::eval()
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @           0x6a2708 paddle::CombinedEvaluator::eval()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:*** SIGSEGV (@0x0) received by PID 11243 (TID 0x7f2fde0d9780) from PID 0; stack trace: ***
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @           0x6a857a paddle::MultiGradientMachine::eval()
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @           0x74bdf6 paddle::TrainerInternal::trainOneBatch()
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @           0x749616 paddle::Trainer::trainOneDataBatch()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @     0x7f2fdd894160 (unknown)
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @           0x749abd paddle::Trainer::trainOnePass()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @           0x6bfd35 paddle::ClassificationErrorEvaluator::evalImp()
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @           0x74a375 paddle::Trainer::train()
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @           0x5a3d70 main
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @           0x6b8690 paddle::Evaluator::eval()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @           0x6a2708 paddle::CombinedEvaluator::eval()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @           0x6a857a paddle::MultiGradientMachine::eval()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @           0x74bdf6 paddle::TrainerInternal::trainOneBatch()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @           0x749616 paddle::Trainer::trainOneDataBatch()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @           0x749abd paddle::Trainer::trainOnePass()
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @           0x74a375 paddle::Trainer::train()
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @     0x7fe636c80bd5 __libc_start_main
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @           0x5a3d70 main
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @     0x7f991a24cbd5 __libc_start_main
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @           0x5b2169 (unknown)
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @           0x5b2169 (unknown)
Mon Jun 26 10:31:06 2017[1,92]<stderr>:    @                0x0 (unknown)
Mon Jun 26 10:31:06 2017[1,80]<stderr>:    @                0x0 (unknown)
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @     0x7f2fdcad8bd5 __libc_start_main
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @           0x5b2169 (unknown)
Mon Jun 26 10:31:06 2017[1,31]<stderr>:    @                0x0 (unknown)
Mon Jun 26 10:31:07 2017[1,45]<stderr>:*** Aborted at 1498444267 (unix time) try "date -d @1498444267" if you are using GNU date ***
Mon Jun 26 10:31:07 2017[1,45]<stderr>:PC: @                0x0 (unknown)
Mon Jun 26 10:31:07 2017[1,31]<stderr>:./train.sh: line 207: 11243 Segmentation fault      PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
Mon Jun 26 10:31:07 2017[1,31]<stderr>:+ '[' 139 -ne 0 ']'
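
The `139` in the last line is the shell's exit status for the crashed trainer. Shells such as bash report a process killed by signal N as status 128 + N, so 139 corresponds to signal 11, SIGSEGV, which matches the `Segmentation fault` message. A small sketch of that arithmetic with Python's standard `signal` module (the helper name `describe_exit_status` is hypothetical):

```python
import signal


def describe_exit_status(status):
    """Map a shell exit status to a signal name when it encodes one."""
    if status > 128:
        try:
            # Statuses above 128 encode "killed by signal (status - 128)".
            return signal.Signals(status - 128).name
        except ValueError:
            return "unknown signal %d" % (status - 128)
    return "exit code %d" % status


print(describe_exit_status(139))  # SIGSEGV on Linux
```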

@typhoonzero
Contributor

Please refer to the latest FAQ: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/faq/index_cn.rst#id15

Closing this for now.
