
Conversation

@pthombre (Contributor) commented Aug 5, 2025

Creating standardized APIs for in-framework and HF deployment. These will be used in the deploy scripts and the NeMo Eval repo.

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
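For context, a minimal sketch of what a standardized deployment entry point could look like; the function name `deploy` and its parameters here are illustrative assumptions for this description, not the exact signatures added in this PR:

```python
from typing import Optional

def deploy(
    model_path: str,
    backend: str = "in-framework",     # "in-framework" or "hf" (assumed names)
    num_gpus: int = 1,
    max_batch_size: int = 32,
    random_seed: Optional[int] = None,
):
    """Hypothetical uniform entry point shared by both deployment backends,
    so deploy scripts and NeMo Eval can call either one the same way."""
    if backend == "in-framework":
        ...  # start the in-framework (NeMo/Megatron) server
    elif backend == "hf":
        ...  # start the Hugging Face server
    else:
        raise ValueError(f"Unknown backend: {backend}")
```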
copy-pr-bot (bot) commented Aug 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pthombre pthombre requested a review from athitten August 5, 2025 03:00
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
@oyilmaz-nvidia (Contributor) left a comment


This looks very good to me. Hoping PyTriton in-framework can also have the same structure for multi-GPU after this PR :)

enable_flash_decode: bool = False,
legacy_ckpt: bool = False,
max_batch_size: int = 32,
random_seed: Optional[int] = None
A contributor commented:

Should we add max_ongoing_requests for in-fw as well?

@athitten (Contributor) commented Aug 7, 2025

Would this still be required if we expose max_ongoing_requests to the user?

Also, how does max_ongoing_requests work? Is it per replica, and how does it count batches, e.g. if bs=8, are the 8 requests together considered as 1 request by the max_ongoing_requests arg?

Ignore the question on per replica. Just saw that it is per replica.
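To make the per-replica semantics concrete, here is a small sketch assuming the serving layer is Ray Serve, where max_ongoing_requests caps in-flight requests per replica and counts requests individually; the class and paths below are illustrative, not the code in this PR:

```python
from ray import serve

# max_ongoing_requests (Ray Serve >= 2.10) is enforced per replica: each
# replica accepts at most this many in-flight requests before Serve queues
# them. Requests are counted one by one; any server-side batching (e.g. a
# max_batch_size of 8) is a separate knob inside the replica.
@serve.deployment(num_replicas=2, max_ongoing_requests=64)
class ModelServer:  # illustrative name
    def __init__(self, model_path: str):
        self.model_path = model_path  # a real server would load the model here

    async def __call__(self, request):
        return {"echo": await request.json()}  # placeholder inference

# app = ModelServer.bind("/path/to/checkpoint")
# serve.run(app)
```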

@athitten (Contributor) commented Aug 7, 2025

Thank you so much @pthombre for the PR! Overall LGTM. Left some comments. Lmk once they are addressed and tests are added, and I can approve.

Args:
device_map (str): The device mapping strategy ('auto', 'balanced', etc.)
"""
if device_map == "balanced" or device_map == "auto":
A contributor commented:

Also, is device_map removed to resolve the issue with the latest transformers version?

@pthombre (Contributor, Author) replied:

Yes. The default value of device_map is None. Users can pass whatever value they need to support their deployment now.
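A minimal sketch of the resulting behavior, assuming the HF path forwards device_map straight through to transformers' from_pretrained (the wrapper below is illustrative, not the exact code in this PR):

```python
from typing import Optional

from transformers import AutoModelForCausalLM

def load_hf_model(model_id: str, device_map: Optional[str] = None):
    # device_map defaults to None, so transformers keeps its default placement
    # unless the caller explicitly opts in to a strategy such as "auto" or
    # "balanced" (these require the accelerate package); the value is passed
    # through untouched.
    return AutoModelForCausalLM.from_pretrained(model_id, device_map=device_map)

# model = load_hf_model("gpt2")                      # plain load, no device_map
# model = load_hf_model("gpt2", device_map="auto")   # opt-in multi-GPU placement
```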

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
@pthombre (Contributor, Author) commented:
/ok to test 029119b

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
@pthombre (Contributor, Author) commented:
/ok to test 2a51fc7

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
@pthombre (Contributor, Author) commented:
/ok to test 42fceaf

@pthombre pthombre merged commit 62485cc into main Aug 14, 2025
59 of 60 checks passed
@pthombre pthombre deleted the pranav/uniform_deployment_apis branch August 14, 2025 07:39