fix distribute transpiler GRPC error code 4, RPC Deadline #18984

Merged
merged 5 commits into from Aug 26, 2019

Collaborator

@seiriosPlus seiriosPlus commented Aug 2, 2019

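Fixes a hang in synchronous training that led to GRPC Deadline errors.

Verification:
In the production scenario where the problem previously occurred, training has run for a cumulative 80 hours without a GRPC Deadline error.

TODO:
The error code 14 problem still needs a fix: the RPC-related design has to be rewritten first, after which retry can be added.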
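
For reference, gRPC status code 4 is DEADLINE_EXCEEDED and code 14 is UNAVAILABLE. The retry mentioned in the TODO could look roughly like the sketch below. It is an illustration only, not Paddle's RPC code: `RpcError`, `call_with_retry`, and the parameters `max_attempts` and `base_delay` are hypothetical names introduced here, and the RPC layer is simulated with plain Python callables.

```python
import time
from enum import IntEnum


class StatusCode(IntEnum):
    """The two gRPC status codes discussed in this PR."""
    DEADLINE_EXCEEDED = 4
    UNAVAILABLE = 14


class RpcError(Exception):
    """Hypothetical error type carrying a gRPC-style status code."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code


def call_with_retry(rpc, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call rpc(); retry with exponential backoff only on UNAVAILABLE.

    Any other error, or UNAVAILABLE on the last attempt, is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except RpcError as err:
            if err.code != StatusCode.UNAVAILABLE or attempt == max_attempts:
                raise
            # Back off before the next attempt: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Only UNAVAILABLE is retried in this sketch. A deadline error is re-raised immediately, since retrying a call that hangs would only hide the kind of problem this PR fixes.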

@seiriosPlus seiriosPlus changed the title [WIP]Sync mode fix [WIP]sync mode fix Aug 7, 2019
@seiriosPlus seiriosPlus changed the title [WIP]sync mode fix sync mode fix Aug 7, 2019
@seiriosPlus seiriosPlus changed the title sync mode fix fix distribute transpiler GRPC error code 4, GRPC Deadline Aug 7, 2019
@seiriosPlus seiriosPlus changed the title fix distribute transpiler GRPC error code 4, GRPC Deadline fix distribute transpiler GRPC error code 4, RPC Deadline Aug 7, 2019
@@ -575,7 +575,7 @@ def transpile(self,
 self.grad_name_to_param_name[grad_varname],
 splited_grad_varname
 ],
-"sync_mode": not self.sync_mode,
+"sync_mode": True,
Contributor

Why hard code True?

Collaborator Author

The counting inside the trainer-side RPC client has been removed; counting is now done inside send_op/recv_op, so the op no longer needs sync_mode.
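
A rough way to picture this change is below. It is a sketch under assumptions, not the actual send_op/recv_op code, which this thread does not show. `RpcClient`, `async_send`, and the Python `send_op` function are hypothetical stand-ins, and the RPC itself is simulated with a thread pool. The point is that the op collects the handles of the requests it issued and waits on exactly those, so the client holds no shared pending-request counter and the op needs no `sync_mode` switch.

```python
from concurrent.futures import ThreadPoolExecutor


class RpcClient:
    """Hypothetical trainer-side client; it keeps no pending-request counter."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def async_send(self, endpoint, var_name):
        # Simulate an RPC: return a handle (future) owned by the caller.
        return self._pool.submit(lambda: (endpoint, var_name))


def send_op(client, endpoints, var_name, timeout=5.0):
    """Issue one request per endpoint, then wait on this op's own handles."""
    handles = [client.async_send(ep, var_name) for ep in endpoints]
    # The op tracks its own requests; a timeout here surfaces as an
    # exception in this op instead of a mismatch in a shared counter.
    return [h.result(timeout=timeout) for h in handles]
```

Usage would be along the lines of `send_op(RpcClient(), ["ps0", "ps1"], "w@GRAD")`, which blocks until both simulated requests finish.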

Contributor

@gongweibao gongweibao left a comment

Some parts need fixing.

Member

@guru4elephant guru4elephant left a comment

LGTM

Contributor

@luotao1 luotao1 left a comment

LGTM for paddle_enforce

Contributor

@gongweibao gongweibao left a comment

LGTM++

@seiriosPlus seiriosPlus merged commit 19dac67 into PaddlePaddle:develop Aug 26, 2019
seiriosPlus added a commit to seiriosPlus/Paddle that referenced this pull request Aug 28, 2019
…le#18984)

* fix sync mode hang in transpiler
* remove sync mode in send/recv
* replace PADDLE_ENFORCE with PADDLE_ENFORCE_NE
seiriosPlus added a commit that referenced this pull request Aug 29, 2019
* fix bug in Class MultiSlotDataGenerator's function _gen_str, test=develop (#18222)
* fix some bug when merge sparse embedding parameters, test=develop (#18223)
* fix communicator with pyreader (#18350)
* delete AllocatorFacade destructor  (#18606)
* fix distribute transpiler GRPC error code 4, RPC Deadline (#18984)
* merge pr #18441