Split send op to send_vars and send_barrier #9303
"(RPCClient) The RPC client object which is"
"initialized at most once.");
AddComment(R"DOC(
Send operator
What is the difference between `SendVarsOp` and the current `SendOp`? Should we plan to remove the current `SendOp` and rename `SendVarsOp` to `SendOp`?
In the current implementation, `SendOp` executes both `send vars` and `send barrier signal`. `SendVarsOp` only executes `send vars`, and `SendBarrierOp` executes `send barrier signal`, so that we can execute `send vars` alongside the computing process and send the barrier signal at the end.
> Should we plan to remove the current `SendOp` and rename `SendVarsOp` to `SendOp`

I think so; maybe we can do it after all the work for #9161 is done.
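To illustrate the split described above, here is a minimal sketch of the intended ordering: each variable is sent asynchronously while the step continues, and a single barrier is issued only after every send has completed. `FakeRpcClient`, `SendVar`, `SendBarrier`, and `RunStep` are hypothetical names made up for this example; they are not the operators or the RPC client in this PR.

```cpp
#include <future>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for an RPC client (an assumption, not Paddle's class).
// It only records the calls it receives.
struct FakeRpcClient {
  std::mutex mu;
  std::vector<std::string> log;

  void SendVar(const std::string& name) {
    std::lock_guard<std::mutex> lock(mu);
    log.push_back("send:" + name);
  }
  void SendBarrier() {
    std::lock_guard<std::mutex> lock(mu);
    log.push_back("barrier");
  }
};

// Launches one asynchronous send per variable (the "send vars" part), then
// waits for all of them and issues one barrier (the "send barrier" part).
std::vector<std::string> RunStep(const std::vector<std::string>& vars) {
  FakeRpcClient client;
  std::vector<std::future<void>> pending;
  for (const auto& name : vars) {
    // In a real step, the computation producing the next variable would
    // overlap with this send.
    pending.push_back(std::async(std::launch::async,
                                 [&client, name] { client.SendVar(name); }));
  }
  for (auto& f : pending) f.get();  // all sends are finished here
  client.SendBarrier();             // the barrier always comes last
  return client.log;
}
```

The order of the individual sends is not deterministic in this sketch; the only guarantee it shows is that the barrier follows all of them.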
AddInput("X", "(Tensor, SelectedRows) Input variables to be sent")
    .AsDuplicable();
AddOutput("RPCClient",
          "(RPCClient) The RPC client object which is"
Do you mean "The RPC client object which is already initialized", or "The RPC client object which will be initialized at most once"?
Done.
This operator will send variables to listen_and_serve op at the parameter server.
)DOC");
AddAttr<int>("wait",
Does `wait` mean it's a sync send or an async send? Maybe change `wait` to `sync_send`?
Done.
public:
  SendVarsOpMaker(OpProto* proto, OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "(Tensor, SelectedRows) Input variables to be sent")
If we use one send per tensor, we really need to separate the compute thread pool from the IO thread pool. One model could easily have hundreds of parameters. If compute and IO share the same thread pool, we can't decide how many threads the pool should have: too few threads are not enough for IO, and too many threads are too much for compute.
Thanks, I agree, and I think it's an important feature. I added a TODO comment here and will implement it ASAP.
LGTM++
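For reference, the separation discussed in this thread could look roughly like the sketch below: two independent pools, so the number of IO threads can be sized for hundreds of in-flight sends without changing how many threads compete for the CPU. `SimplePool` and `RunWithSeparatePools` are hypothetical names for illustration only, the pool sizes (2 and 8) are arbitrary assumptions, and none of this is the framework's actual thread pool.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size thread pool (illustrative only).
class SimplePool {
 public:
  explicit SimplePool(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      workers_.emplace_back([this] { Loop(); });
  }
  // Drains the remaining tasks, then joins every worker.
  ~SimplePool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }
  void Submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Loop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stop requested and queue is empty
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
  bool stop_ = false;
};

// Runs fake sends on an IO pool and fake compute tasks on a separate compute
// pool; returns {sends completed, compute tasks completed}.
std::pair<int, int> RunWithSeparatePools(int num_params, int num_ops) {
  std::atomic<int> sent{0};
  std::atomic<int> computed{0};
  {
    SimplePool compute_pool(2);  // assumed size, tuned for CPU-bound work
    SimplePool io_pool(8);       // assumed size, tuned for blocking sends
    for (int i = 0; i < num_params; ++i) io_pool.Submit([&sent] { ++sent; });
    for (int i = 0; i < num_ops; ++i)
      compute_pool.Submit([&computed] { ++computed; });
  }  // both pools finish their queues and join here
  return {sent.load(), computed.load()};
}
```

With one pool per workload, a burst of per-tensor sends queues up on the IO pool only, and compute tasks never wait behind them.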
This operator will send variables to listen_and_serve op at the parameter server.
)DOC");
AddAttr<int>("ync_send",
`ync_send` => `sync_send`
Thanks, I will fix the typo in the next PR.
Related #9161