Run distributed text_classification error proc param error:name:[fc_0.w_0@GRAD.block0.trainer_3] ep:[192.168.16.28:30256] grpc error:Connect Failed #9454

Closed
typhoonzero opened this issue Mar 28, 2018 · 2 comments
@typhoonzero
Contributor

Background: running distributed vgg16 works fine, but the model at
https://github.com/typhoonzero/fluid_gpu_benchmark/blob/master/text_fluid.py fails with the following errors the first time the trainer tries to send variables.

E0328 11:59:35.739938   211 grpc_client.cc:189] proc param error:name:[fc_0.w_0@GRAD.block3.trainer_3] ep:[192.168.16.27:30256] grpc error:Connect Failed
E0328 11:59:35.739984   208 grpc_client.cc:189] proc param error:name:[fc_0.b_0@GRAD.trainer_3] ep:[192.168.16.27:30256] grpc error:Connect Failed
E0328 11:59:35.740000   203 grpc_client.cc:189] proc param error:name:[sequence_conv_0.w_0@GRAD.block0.trainer_3] ep:[192.168.16.27:30256] grpc error:Connect Failed
E0328 11:59:35.740012   204 grpc_client.cc:189] proc param error:name:[embedding_0.w_0@GRAD.block0.trainer_3] ep:[192.168.16.27:30256] grpc error:Connect Failed
E0328 11:59:35.740399   202 grpc_client.cc:189] proc param error:name:[fc_1.b_0@GRAD.trainer_3] ep:[192.168.16.28:30256] grpc error:Connect Failed
E0328 11:59:35.740417   198 grpc_client.cc:189] proc param error:name:[sequence_conv_0.w_0@GRAD.block1.trainer_3] ep:[192.168.16.28:30256] grpc error:Connect Failed
E0328 11:59:35.740432   209 grpc_client.cc:189] proc param error:name:[embedding_0.w_0@GRAD.block1.trainer_3] ep:[192.168.16.28:30256] grpc error:Connect Failed
E0328 11:59:35.740423   207 grpc_client.cc:189] proc param error:name:[fc_0.w_0@GRAD.block0.trainer_3] ep:[192.168.16.28:30256] grpc error:Connect Failed
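
To see which pserver endpoints are failing, the errors above can be grouped by endpoint. This is only a rough sketch: the regular expression is inferred from the log lines in this issue, not from the format string in grpc_client.cc, and `group_failures_by_endpoint` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Pattern inferred from the log lines in this issue (an assumption);
# the actual message format in grpc_client.cc may differ.
LINE_RE = re.compile(
    r"proc param error:name:\[(?P<name>[^\]]+)\]\s+"
    r"ep:\[(?P<ep>[^\]]+)\]\s+grpc error:(?P<err>.+)$"
)


def group_failures_by_endpoint(log_text):
    """Map each endpoint to the list of (variable name, error) pairs that failed."""
    failures = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            failures[match.group("ep")].append(
                (match.group("name"), match.group("err").strip())
            )
    return dict(failures)
```

Applied to the eight lines above, this would show both `192.168.16.27:30256` and `192.168.16.28:30256` failing, so the problem is not limited to a single pserver.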
@typhoonzero typhoonzero self-assigned this Mar 28, 2018
@typhoonzero
Contributor Author

typhoonzero commented Mar 28, 2018

Cannot reproduce this on a single node running 4 pservers and 4 trainers on different ports, but the issue does reproduce on two of our test Kubernetes clusters, using hostNetwork mode.

@typhoonzero
Contributor Author

The connection error occurs because the pserver starts very slowly; the trainer has to wait until the server is ready. Closing.
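
One way to make a trainer wait, sketched below, is to poll each pserver endpoint with a plain TCP connect before the first send. This is a minimal illustration, not what Paddle does internally: `endpoint_is_up` and `wait_for_endpoints` are hypothetical helpers, the timeout values are arbitrary, and a successful TCP connect only shows that the port is listening, not that the gRPC service behind it is fully initialized.

```python
import socket
import time


def endpoint_is_up(endpoint, connect_timeout=1.0):
    """Return True if a TCP connection to 'host:port' can be opened."""
    host, port = endpoint.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=connect_timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def wait_for_endpoints(endpoints, timeout=300.0, interval=2.0):
    """Poll until every endpoint accepts connections or the timeout expires.

    Returns the sorted list of endpoints that are still unreachable;
    an empty list means all of them are ready.
    """
    deadline = time.monotonic() + timeout
    pending = set(endpoints)
    while pending:
        pending = {ep for ep in pending if not endpoint_is_up(ep)}
        if not pending:
            break
        if time.monotonic() >= deadline:
            return sorted(pending)
        time.sleep(interval)
    return []
```

A trainer entry point could call `wait_for_endpoints(["192.168.16.27:30256", "192.168.16.28:30256"])` and abort if the returned list is non-empty. How the endpoint list reaches the trainer is left out here.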
