[2.4] Add support to large object in LauncherExecutor using CellPipe #2401

YuanTingHsieh
Collaborator

@YuanTingHsieh YuanTingHsieh commented Mar 15, 2024

Issue

We also need #2406, which adds streaming capability to CellPipe.

  1. When the object/model is large, pull_task on the client side takes a long time, so we don't want to start the PipeHandler heartbeat at the "START_RUN" event.

  2. The metric relay component does not need its own timeout, so add an option in PipeHandler to disable the heartbeat.

  3. Since we will always have a task pipe for flare_agent, only that pipe needs to send heartbeats; even if a metric pipe is provided, it does not need to send them.

  4. When peer_read_timeout is None, the handler does not wait forever. Instead, the underlying PipeHandler uses a default request timeout of 5 seconds, which is not enough to transfer model weights or weight diffs, so change to a larger default value.

  5. When PipeHandler sends an Abort or End signal, it does not need to wait forever, so provide a timeout there.

  6. As in #2370 (Fix LauncherExecutor handle_event), LauncherExecutor needs to invoke TaskExchanger's handle_event.
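
To illustrate item 4, here is a minimal sketch of how a `None` timeout can silently become a short default. The function names, the 60-second value, and the transfer rate are assumptions made for illustration; they are not NVFlare's actual API or defaults.

```python
from typing import Optional

# Hypothetical constants, for illustration only; not NVFlare's actual names or values.
SHORT_DEFAULT_TIMEOUT = 5.0   # the 5-second fallback described in item 4
LARGE_OBJECT_TIMEOUT = 60.0   # an assumed larger default


def effective_timeout(peer_read_timeout: Optional[float], default: float) -> float:
    """Return the timeout a request would actually use.

    None does not mean "wait forever" here: it falls back to `default`.
    """
    return default if peer_read_timeout is None else peer_read_timeout


def transfer_completes(size_mb: float, mb_per_sec: float, timeout: float) -> bool:
    """Rough check: does a transfer of `size_mb` finish within `timeout` seconds?"""
    return size_mb / mb_per_sec <= timeout


# A 2000 MB model at an assumed 100 MB/s needs about 20 seconds.
short = effective_timeout(None, SHORT_DEFAULT_TIMEOUT)
longer = effective_timeout(None, LARGE_OBJECT_TIMEOUT)
print(transfer_completes(2000, 100, short))   # times out under the short default
print(transfer_completes(2000, 100, longer))  # fits under the larger default
```

The point is only that leaving the timeout unset does not protect a large transfer; the fallback value decides whether it succeeds.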

Description

Add support for large objects in LauncherExecutor/ClientAPI using CellPipe.

  • Update timeout values for larger objects
  • Start the pipe_handler's heartbeat sending and heartbeat checking only after the task has been pulled and is ready to be executed
  • MetricRelay component does not need its own heartbeat mechanism
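
The ordering in the second and third bullets can be sketched as follows. `SketchHandler`, `enable_heartbeat`, and `run_task` are hypothetical names used only for illustration; they are not the actual PipeHandler or LauncherExecutor interfaces.

```python
from typing import Callable, List


class SketchHandler:
    """Hypothetical stand-in for a pipe handler; not NVFlare's PipeHandler."""

    def __init__(self, enable_heartbeat: bool = True):
        self.enable_heartbeat = enable_heartbeat
        self.heartbeat_running = False
        self.events: List[str] = []

    def start_heartbeat(self) -> None:
        # No-op when heartbeats are disabled (for example, a metric relay).
        if self.enable_heartbeat and not self.heartbeat_running:
            self.heartbeat_running = True
            self.events.append("heartbeat_started")


def run_task(handler: SketchHandler, pull_task: Callable[[], str]) -> str:
    """Pull the task first, then begin heartbeats, so a slow pull cannot trip the check."""
    handler.events.append("pull_started")
    task = pull_task()  # may be slow for a large model
    handler.events.append("task_pulled")
    handler.start_heartbeat()  # heartbeat checking begins only now
    return task


task_handler = SketchHandler()
run_task(task_handler, lambda: "train")
print(task_handler.events)

metric_handler = SketchHandler(enable_heartbeat=False)
run_task(metric_handler, lambda: "metrics")
print(metric_handler.heartbeat_running)
```

Starting the heartbeat after the pull means the time spent transferring a large model is not counted against the peer's heartbeat timeout.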

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@YuanTingHsieh
Collaborator Author

/build

@YuanTingHsieh
Collaborator Author

/build

Collaborator

@yanchengnv yanchengnv left a comment

Why so many changes? The only change needed is cell.py.

If you want to improve other things (which seem to be non-trivial), please do so in another PR and describe the reason.

@YuanTingHsieh YuanTingHsieh force-pushed the add_support_to_large_objects_to_cell_pipe branch from 72169dc to 5668a1b Compare March 18, 2024 17:37
@YuanTingHsieh YuanTingHsieh changed the title [2.4] Add support to large object to CellPipe [2.4] Add support to large object in LauncherExecutor using CellPipe Mar 18, 2024
@YuanTingHsieh
Collaborator Author

Sounds good, the cell changes have been moved to #2406.

This PR handles all the other issues I encountered while testing large models with LauncherExecutor/Client API.

@YuanTingHsieh
Collaborator Author

/build

@chesterxgchen
Collaborator

3. Since we always will have task pipe for flare_agent, we just need it to do heartbeats so even if metric pipe is provided, it does not need to send heartbeats

Can you update the description? Otherwise it looks like a lot of changes in one PR (it seems changes for 3+ PRs ended up in one PR).

Collaborator

@chesterxgchen chesterxgchen left a comment

Added a few questions.


4 participants