Distributed training crash: grpc error:Connect Failed #7702

Closed
helinwang opened this issue Jan 19, 2018 · 5 comments

Comments

@helinwang
Contributor

helinwang commented Jan 19, 2018

The crash happens on the trainer after training has run for a while.

In my setup I changed the distributed fit-a-line example to run for 1000 passes, and the crash happens frequently (2 out of 3 tries).

Commands:

GLOG_logtostderr=1 GLOG_v=3 PSERVERS=172.17.0.5:6174 SERVER_ENDPOINT=172.17.0.5:6174 TRAINING_ROLE=PSERVER python notest_dist_fit_a_line.py 

GLOG_logtostderr=1 GLOG_v=3 PSERVERS=172.17.0.5:6174 SERVER_ENDPOINT=172.17.0.5:6174 TRAINING_ROLE=TRAINER python notest_dist_fit_a_line.py 

GLOG_logtostderr=1 GLOG_v=0 PSERVERS=172.17.0.5:6174 SERVER_ENDPOINT=172.17.0.5:6174 TRAINING_ROLE=TRAINER python notest_dist_fit_a_line.py 

notest_dist_fit_a_line.py is taken from here
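
For readers unfamiliar with the setup, the commands above pass the role and endpoints through the `PSERVERS`, `SERVER_ENDPOINT` and `TRAINING_ROLE` environment variables. The sketch below shows one way a script could read them. The helper name `read_dist_config` and the returned dictionary are hypothetical and not taken from `notest_dist_fit_a_line.py`, and treating `PSERVERS` as a comma-separated list is an assumption (the commands above only ever pass a single endpoint).

```python
import os


def read_dist_config(environ=None):
    """Read distributed-training settings from environment variables.

    Hypothetical helper for illustration only; the real script may differ.
    """
    env = os.environ if environ is None else environ
    # Assumption: PSERVERS is a comma-separated list of "host:port" endpoints.
    pservers = [ep.strip() for ep in env.get("PSERVERS", "").split(",") if ep.strip()]
    role = env.get("TRAINING_ROLE", "").upper()
    if role not in ("PSERVER", "TRAINER"):
        raise ValueError("TRAINING_ROLE must be PSERVER or TRAINER, got %r" % role)
    return {
        "pservers": pservers,                          # all parameter servers
        "server_endpoint": env.get("SERVER_ENDPOINT"),  # endpoint this process uses
        "role": role,
    }
```

Calling `read_dist_config()` with no argument reads the real process environment, so the same function works for both the PSERVER and the TRAINER commands.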

I0119 21:26:14.525514 16639 send_op.cc:44] sending fc_0.w_0@GRAD
I0119 21:26:14.525590 16639 send_op.cc:44] sending fc_0.b_0@GRAD
E0119 21:26:14.529606 16639 grpc_client.cc:119] proc param error:name:[fc_0.w_0@GRAD] ep:[172.17.0.5:6174] grpc error:Connect Failed
Traceback (most recent call last):
  File "notest_dist_fit_a_line.py", line 70, in <module>
    fetch_list=[avg_cost])
  File "/root/.local/lib/python2.7/site-packages/paddle/v2/fluid/executor.py", line 177, in run
    self.executor.run(program.desc, scope, 0, True, True)
paddle.v2.fluid.core.EnforceNotMet:  at [/home/helin/repo/Paddle/paddle/operators/send_op.cc:47]
PaddlePaddle Call Stacks: 
0       0x7faab725cf17p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 727
1       0x7faab7aefaacp paddle::operators::SendOp::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 2988
2       0x7faab7310107p paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool) + 1463
3       0x7faab7275893p void pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, paddle::framework::Executor, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, pybind11::name, pybind11::is_method, pybind11::sibling>(void (paddle::framework::Executor::*)(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool)#1}, void, paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<void, paddle::framework::Executor, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, pybind11::name, pybind11::is_method, pybind11::sibling>(void (paddle::framework::Executor::*)(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool)#1}&&, void (*)(paddle::framework::Executor*, paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) + 579
4       0x7faab72734e4p pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 1236
5             0x4cad00p PyEval_EvalFrameEx + 28048
6             0x4c2705p PyEval_EvalCodeEx + 597
7             0x4ca088p PyEval_EvalFrameEx + 24856
8             0x4c2705p PyEval_EvalCodeEx + 597
9             0x4c24a9p PyEval_EvalCode + 25
10            0x4f19efp
11            0x4ec372p PyRun_FileExFlags + 130
12            0x4eaaf1p PyRun_SimpleFileExFlags + 401
13            0x49e208p Py_Main + 1736
14      0x7fab4c825830p __libc_start_main + 240
15            0x49da59p _start + 41
@helinwang
Contributor Author

@gongweibao
Contributor

proc param error:name:[fc_0.w_0@GRAD] ep:[172.17.0.5:6174] grpc error:Connect Failed

This crash may be because the pserver has exited or does not exist.
If so, what is the reason for the pserver's errors?
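
One quick way to test this hypothesis from the trainer machine is to try opening a plain TCP connection to the endpoint named in the error. The sketch below assumes Python 3 (the traceback above is from Python 2.7, where socket errors are not `OSError` subclasses), and `endpoint_reachable` is a hypothetical helper name. A successful connect only shows that something is listening on the port; it does not prove the gRPC service behind it is healthy.

```python
import socket


def endpoint_reachable(endpoint, timeout=2.0):
    """Return True if a TCP connection to "host:port" can be opened."""
    host, port = endpoint.rsplit(":", 1)
    try:
        # create_connection raises OSError (connection refused, timeout,
        # unreachable host) when nothing accepts connections on the endpoint.
        conn = socket.create_connection((host, int(port)), timeout=timeout)
    except OSError:
        return False
    conn.close()
    return True
```

For the setup in this issue the call would be `endpoint_reachable("172.17.0.5:6174")`, run while the trainer is reporting `Connect Failed`.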

@helinwang
Contributor Author

I kept the training program running for a few minutes and then got this error. The pserver was still running.

@gongweibao
Contributor

Intuitively, sending and receiving data over the network on a local machine should not fail this often. Is there a logic error we haven't noticed?

@helinwang
Contributor Author

I cannot reproduce this on the latest develop branch anymore.
