-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[2.4] Add support to large object in LauncherExecutor using CellPipe #2401
[2.4] Add support to large object in LauncherExecutor using CellPipe #2401
Conversation
/build |
/build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why so many changes? The only change needed is cell.py.
If you want to improve other things (seem to be non-trivial), please do so in another PR and describe the reason.
72169dc
to
5668a1b
Compare
Sounds good, the cell changing is moved to #2406 This PR handles all other issues I encountered in testing large models + LauncherExecutor/Client API |
/build |
can you update the descriptions, otherwise, it seems a lot of changes in one PR ( seems to be 3+ PR changes get into one PR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add few quesstions
Issue
We need #2406 to enable streaming capabilities to CellPipe as well
When the object/model is large, the pull_task from client side will takes a lot of time, so we don't want to start the PipeHandler heartbeat at "START_RUN" event
Metric relay component does not need separate timeout itself, so adding an option in PipeHandler to disable the heartbeat
Since we always will have task pipe for flare_agent, we just need it to do heartbeats so even if metric pipe is provided, it does not need to send heartbeats
When peer_read_timeout is None, it is not waiting forever, instead the underlying PipeHandler will just use a default request timeout of 5 seconds which is not enough for the model weights/weight diff, so change to a larger default value
PipeHandler send Abort or End signal does not need to wait forever, so provide a timeout there
Same as Fix LauncherExecutor handle_event #2370 we need to invoke TaskExchanger's handle_event in LauncherExecutor
Description
Add support to large object to LauncherExecutor/ClientAPI utilizing CellPipe
Types of changes
./runtest.sh
.