Creating standardized APIs for in-framework and HF deployment #302
Conversation
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
oyilmaz-nvidia
left a comment
This looks very good to me. Hoping the PyTriton in-framework path can also have the same structure for multi-GPU after this PR :)
nemo_deploy/deploy_ray.py
Outdated
enable_flash_decode: bool = False,
legacy_ckpt: bool = False,
max_batch_size: int = 32,
random_seed: Optional[int] = None,
Should we add max_ongoing_requests for in-framework as well?
Would this still be required if we expose max_ongoing_requests to the user?
Also, how does max_ongoing_requests work? Is it per replica? And is a batch of requests (e.g., if bs=8, then 8 requests together) counted as 1 request by the max_ongoing_requests arg?
Ignore the question on per replica. Just saw that it is per replica.
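The per-replica semantics settled on above can be sketched with a small helper. This is a hedged illustration, not code from the PR: the helper name is hypothetical, and it only encodes the counting rule discussed in the thread (each request counts individually against Ray Serve's per-replica max_ongoing_requests cap, so a batch of bs requests consumes bs slots).

```python
def max_concurrent_batches(max_ongoing_requests: int, batch_size: int) -> int:
    """Hypothetical helper illustrating the thread's discussion: with a
    per-replica max_ongoing_requests cap, each request counts individually,
    so a replica serving batches of `batch_size` can hold at most this many
    full batches in flight at once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return max_ongoing_requests // batch_size


# With the thread's example of bs=8 and a cap of 32, a replica can
# admit 4 full batches of requests concurrently.
print(max_concurrent_batches(32, 8))
```

So a batch is not collapsed into 1 request for admission control; sizing the cap should account for batch size times the desired number of in-flight batches.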
Thank you so much @pthombre for the PR! Overall LGTM. Left some comments. Let me know once they are addressed and tests are added, and I can approve.
Args:
    device_map (str): The device mapping strategy ('auto', 'balanced', etc.)
"""
if device_map == "balanced" or device_map == "auto":
Also, is device_map removed to resolve the issue with the latest transformers version?
Yes. The default value of device_map is None. Users can pass whatever value they need to support their deployment now.
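The behavior described in this answer can be sketched as a small kwargs builder. This is a hedged sketch, not the PR's actual code: build_model_kwargs is a hypothetical name, and the point is only the pattern of defaulting device_map to None and forwarding it solely when the user sets a value, so the framework's own default placement applies otherwise.

```python
from typing import Optional


def build_model_kwargs(device_map: Optional[str] = None) -> dict:
    """Hypothetical illustration of the device_map handling described above:
    default is None; the key is only forwarded when the user supplies a
    value (e.g. "auto", "balanced", or an explicit device like "cuda:0")."""
    kwargs: dict = {}
    if device_map is not None:
        kwargs["device_map"] = device_map
    return kwargs


# No device_map by default; set only when the user asks for one.
print(build_model_kwargs())
print(build_model_kwargs("auto"))
```

The resulting dict would then be splatted into a loader call such as transformers' from_pretrained, keeping the deployment API neutral about placement.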
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
/ok to test 029119b
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
/ok to test 2a51fc7
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
/ok to test 42fceaf
Creating standardized APIs for in-framework and HF deployment. These will be used in the deploy scripts and the NeMo Eval repo.