Running Fluid CPU distributed fluid_machine_translation on MPI intermittently core dumps #9326

Closed
alexqdh opened this issue Mar 22, 2018 · 7 comments

Labels
User (used to tag user questions)

Comments

alexqdh commented Mar 22, 2018

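Running the Fluid CPU distributed fluid_machine_translation job on MPI with 2 nodes intermittently core dumps. Of 6 submissions of the same job, 2 failed and 4 succeeded.
Based on the code at commit ID: 0e30fae
Error message: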

Thu Mar 22 16:42:24 2018[1,0]<stderr>:  File "workspace/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/executor.py", line 349, in run
Thu Mar 22 16:42:24 2018[1,0]<stderr>:    self.executor.run(program_cache.desc, scope, 0, True, True)
Thu Mar 22 16:42:24 2018[1,0]<stderr>:paddle.fluid.core.EnforceNotMet:  at [/paddle/paddle/fluid/operators/send_op.cc:77]
Thu Mar 22 16:42:24 2018[1,0]<stderr>:PaddlePaddle Call Stacks: 
Thu Mar 22 16:42:24 2018[1,0]<stderr>:0       0x7fcd45e32ac6p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
Thu Mar 22 16:42:24 2018[1,0]<stderr>:1       0x7fcd46464a24p paddle::operators::SendOp::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 2932
Thu Mar 22 16:42:24 2018[1,0]<stderr>:2       0x7fcd464bd108p paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) + 56
Thu Mar 22 16:42:24 2018[1,0]<stderr>:3       0x7fcd45ebfc46p paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool) + 1110
Thu Mar 22 16:42:24 2018[1,0]<stderr>:4       0x7fcd45ec0b69p paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool) + 89
Thu Mar 22 16:42:24 2018[1,0]<stderr>:5       0x7fcd45e47e5bp void pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, paddle::framework::Executor, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, pybind11::name, pybind11::is_method, pybind11::sibling>(void (paddle::framework::Executor::*)(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool)#1}, void, paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<void, paddle::framework::Executor, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, pybind11::name, pybind11::is_method, pybind11::sibling>(void (paddle::framework::Executor::*)(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool)#1}&&, void (*)(paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) + 555
Thu Mar 22 16:42:24 2018[1,0]<stderr>:6       0x7fcd45e41894p pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 2596
Thu Mar 22 16:42:24 2018[1,0]<stderr>:7       0x7fcda5423010p PyEval_EvalFrameEx + 16384
Thu Mar 22 16:42:24 2018[1,0]<stderr>:8       0x7fcda5424b80p PyEval_EvalCodeEx + 2128
Thu Mar 22 16:42:24 2018[1,0]<stderr>:9       0x7fcda542308ep PyEval_EvalFrameEx + 16510
Thu Mar 22 16:42:24 2018[1,0]<stderr>:10      0x7fcda5424b80p PyEval_EvalCodeEx + 2128
Thu Mar 22 16:42:24 2018[1,0]<stderr>:11      0x7fcda542308ep PyEval_EvalFrameEx + 16510
Thu Mar 22 16:42:24 2018[1,0]<stderr>:12      0x7fcda5424b80p PyEval_EvalCodeEx + 2128
Thu Mar 22 16:42:24 2018[1,0]<stderr>:13      0x7fcda542308ep PyEval_EvalFrameEx + 16510
Thu Mar 22 16:42:24 2018[1,0]<stderr>:14      0x7fcda5424b80p PyEval_EvalCodeEx + 2128
Thu Mar 22 16:42:24 2018[1,0]<stderr>:15      0x7fcda5424c82p PyEval_EvalCode + 50
Thu Mar 22 16:42:24 2018[1,0]<stderr>:16      0x7fcda543d60fp
Thu Mar 22 16:42:24 2018[1,0]<stderr>:17      0x7fcda543e67ep PyRun_FileExFlags + 126
Thu Mar 22 16:42:24 2018[1,0]<stderr>:18      0x7fcda543f7d7p PyRun_SimpleFileExFlags + 199
Thu Mar 22 16:42:24 2018[1,0]<stderr>:19      0x7fcda544fd9dp Py_Main + 3133
Thu Mar 22 16:42:24 2018[1,0]<stderr>:20      0x7fcda4695bd5p __libc_start_main + 245
Thu Mar 22 16:42:24 2018[1,0]<stderr>:21            0x4007c1p

pserver error:

Thu Mar 22 16:40:30 2018[1,0]<stderr>:E0322 16:40:30.836912222   17810 tcp_server_posix.cc:65]     check for SO_REUSEPORT: {"created":"@1521708030.836874063","description":"OS Error","errno":92,"file":"src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":168,"os_error":"Protocol not available","syscall":"setsockopt(SO_REUSEPORT)"}
Thu Mar 22 16:40:30 2018[1,0]<stderr>:E0322 16:40:30.840531023   17810 server_chttp2.cc:38]        {"created":"@1521708030.840470493","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":309,"referenced_errors":[{"created":"@1521708030.840463290","description":"Unable to configure socket","fd":7,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":200,"referenced_errors":[{"created":"@1521708030.840451258","description":"OS Error","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":173,"os_error":"Address already in use","syscall":"bind"}]}]}
Thu Mar 22 16:40:30 2018[1,0]<stderr>:*** Aborted at 1521708030 (unix time) try "date -d @1521708030" if you are using GNU date ***
Thu Mar 22 16:40:30 2018[1,0]<stderr>:PC: @                0x0 (unknown)

Yancey1989 added the Bug and User (used to tag user questions) labels and removed the Bug label on Mar 22, 2018

Yancey1989 (Contributor) commented

In the pserver log:

Thu Mar 22 16:40:30 2018[1,0]:E0322 16:40:30.840531023 17810 server_chttp2.cc:38] {"created":"@1521708030.840470493","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":309,"referenced_errors":[{"created":"@1521708030.840463290","description":"Unable to configure socket","fd":7,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":200,"referenced_errors":[{"created":"@1521708030.840451258","description":"OS Error","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":173,"os_error":"Address already in use","syscall":"bind"}]}]}

In particular, this part:

src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":173,"os_error":"Address already in use

It looks like the port was already in use?
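
For reference, the nested gRPC error object in that log line can be unpacked to reach the innermost OS error. This is only a minimal sketch and not part of Paddle or gRPC: the helper name `leaf_os_errors` is hypothetical, and it assumes the payload keeps the JSON shape seen in the log above. On Linux, errno 98 is `EADDRINUSE`.

```python
import json

# JSON payload copied from the pserver log line above (timestamp prefix removed).
GRPC_ERROR = '''{"created":"@1521708030.840470493","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":309,"referenced_errors":[{"created":"@1521708030.840463290","description":"Unable to configure socket","fd":7,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":200,"referenced_errors":[{"created":"@1521708030.840451258","description":"OS Error","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":173,"os_error":"Address already in use","syscall":"bind"}]}]}'''


def leaf_os_errors(err):
    """Yield (errno, os_error, syscall) for every nested entry that carries an errno.

    Hypothetical helper: walks the "referenced_errors" lists recursively.
    """
    if "errno" in err:
        yield err["errno"], err.get("os_error"), err.get("syscall")
    for child in err.get("referenced_errors", []):
        yield from leaf_os_errors(child)


found = list(leaf_os_errors(json.loads(GRPC_ERROR)))
print(found)
```

The only entry carrying an errno is the failed `bind` call, which is consistent with the port already being taken by another process.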

alexqdh (Author) commented Mar 23, 2018

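When running MPI jobs, we have now added a check that retries on port conflicts. Until now we detected a port conflict by matching error logs like this one: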

Check failed: bind(socket_, (struct# sockaddr *)&serv_addr, sizeof(serv_addr)) >= 0 ERROR on binding 10.89.xxx.xx

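Could you please confirm whether an error like

os_error":"Address already in use

is also caused by a port conflict? If so, we will add it as an extra condition that triggers a retry.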
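
The keyword check described here could look roughly like the following. This is a sketch under assumptions, not the actual MPI job wrapper: the names `PORT_CONFLICT_MARKERS` and `looks_like_port_conflict` are hypothetical, and the two markers are simply the substrings quoted in this thread.

```python
# Hypothetical marker list: the two substrings quoted in this thread.
PORT_CONFLICT_MARKERS = (
    "ERROR on binding",                     # older bind() check failure
    '"os_error":"Address already in use"',  # gRPC error payload
)


def looks_like_port_conflict(log_text):
    """Return True if any known port-conflict marker appears in the log text."""
    return any(marker in log_text for marker in PORT_CONFLICT_MARKERS)


# Sample lines taken from the logs in this issue.
old_style = ("Check failed: bind(socket_, (struct# sockaddr *)&serv_addr, "
             "sizeof(serv_addr)) >= 0 ERROR on binding 10.89.xxx.xx")
grpc_style = '{"description":"OS Error","errno":98,"os_error":"Address already in use","syscall":"bind"}'
unrelated = '{"description":"OS Error","errno":92,"os_error":"Protocol not available","syscall":"setsockopt(SO_REUSEPORT)"}'

for line in (old_style, grpc_style, unrelated):
    print(looks_like_port_conflict(line))
```

A job wrapper would call such a function on the captured stderr of a failed run and resubmit only when it returns True.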

Yancey1989 (Contributor) commented

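This is a gRPC error log. At present, a port that is already in use does produce this kind of keyword, but I do not recommend using log content as the detection condition: a gRPC or Fluid version update may change the log text and silently break the check.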
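
One way to avoid depending on log text is to probe the port directly before starting the pserver and to look at the OS error code instead. The following is a minimal sketch under assumptions: `port_is_free` and `first_free_port` are hypothetical helpers, not Paddle or gRPC APIs, and another process can still take the port between the probe and the real bind, so this reduces conflicts rather than eliminating them.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Try to bind a TCP socket to (host, port); return False only on EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise  # other failures (e.g. permission denied) are not port conflicts
        return True


def first_free_port(candidates, host="0.0.0.0"):
    """Return the first candidate port that can currently be bound, or None."""
    for port in candidates:
        if port_is_free(port, host):
            return port
    return None
```

A launcher could call `first_free_port` on the range of ports it is allowed to use and pass the result to the pserver, falling back to a retry only if the later bind still fails.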

chengduoZH (Contributor) commented Mar 29, 2018

@alexqdh Has the problem been resolved?

alexqdh (Author) commented Mar 30, 2018

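Thanks, we have added the new keyword for detecting port conflicts. As I understand it, properly resolving port conflicts in the future would require Paddle itself to detect them automatically, right?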

Yancey1989 (Contributor) commented

There is no plan for that at the moment. Most cluster schedulers today, such as Kubernetes, already manage ports well, so port conflicts do not occur there.

alexqdh (Author) commented Mar 30, 2018

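OK, understood. There are still quite a few existing MPI users, though, so for now we can only detect conflicts this way as a temporary measure.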
