fix distribute transpiler GRPC error code 4, RPC Deadline #18984
@@ -575,7 +575,7 @@ def transpile(self,
 self.grad_name_to_param_name[grad_varname],
 splited_grad_varname
 ],
-"sync_mode": not self.sync_mode,
+"sync_mode": True,
Why hard-code True?
The counting inside the trainer-side RPC client has been removed and moved into send_op/recv_op, so the op no longer needs sync_mode.
Some parts need fixing.
LGTM
LGTM for paddle_enforce
LGTM++
…le#18984)
* fix sync mode hang in transpiler
* remove sync mode in send/recv
* replace PADDLE_ENFORCE with PADDLE_ENFORCE_NE
* fix bug in Class MultiSlotDataGenerator's function _gen_str, test=develop (#18222)
* fix some bug when merge sparse embedding parameters, test=develop (#18223)
* fix communicator with pyreader (#18350)
* delete AllocatorFacade destructor (#18606)
* fix distribute transpiler GRPC error code 4, RPC Deadline (#18984)
* merge pr #18441
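Fixes a hang in synchronous training that caused GRPC Deadline errors.
Verification:
In the production scenario where the problem previously occurred, the job has run for a cumulative 80 hours without hitting a GRPC Deadline error.
TODO:
Fixing the error code 14 issue requires rewriting the RPC-related design first; retry will then be added to address it.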
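For context on the TODO above: in gRPC's status code table, code 4 is DEADLINE_EXCEEDED and code 14 is UNAVAILABLE. The following is only a rough sketch of what a retry on code 14 could look like, not the design this PR or Paddle will adopt. `RpcError`, `call_with_retry`, and the parameters `max_attempts`, `base_delay`, and `sleep` are hypothetical names invented for illustration; none of them are Paddle or gRPC APIs.

```python
import time

# gRPC status code values (from the gRPC status code table).
DEADLINE_EXCEEDED = 4
UNAVAILABLE = 14


class RpcError(Exception):
    """Hypothetical error type carrying a gRPC status code."""

    def __init__(self, code):
        super().__init__("rpc failed with error code %d" % code)
        self.code = code


def call_with_retry(rpc_call, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call rpc_call(), retrying only on UNAVAILABLE with exponential backoff.

    Any other error code, including DEADLINE_EXCEEDED, is re-raised at once.
    """
    for attempt in range(max_attempts):
        try:
            return rpc_call()
        except RpcError as err:
            last_attempt = attempt == max_attempts - 1
            if err.code != UNAVAILABLE or last_attempt:
                raise
            # Back off before the next attempt: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
```

In this sketch only code 14 is retried, since it usually signals a transient connection problem. A deadline error is surfaced immediately, because the hang fixed in this PR showed that a deadline can point to a logic bug that retrying would only hide.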