Add TRT-LLM params like max_num_tokens and opt_num_tokens #9210
Conversation
max_lora_rank (int): maximum lora rank.
max_num_tokens (int):
opt_num_tokens (int):
save_nemo_model_config (bool):
Missing descriptions for the last three params.
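For reference, the missing entries could read along these lines (the wording is a suggestion based on the usual TensorRT-LLM meaning of these parameters, not taken from the PR):

    max_lora_rank (int): maximum lora rank.
    max_num_tokens (int): maximum number of batched input tokens the engine
        can process in a single forward pass.
    opt_num_tokens (int): number of batched input tokens the engine is
        optimized for; typically below max_num_tokens.
    save_nemo_model_config (bool): whether to save the NeMo model config
        alongside the exported engine.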
"-dcf", | ||
"--disable_context_fmha", | ||
"-drip", | ||
"--disable_remove_input_padding", |
Could we rather use --remove_input_padding instead of negated flags like --disable_remove_input_padding?
That would require providing --remove_input_padding in customer code to keep the current behavior unchanged (i.e. to use remove_input_padding=True when no parameter is specified). It is only worthwhile if it leads to better naming consistency with the TensorRT-LLM code.
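For what it's worth, a positive flag name can keep the current default without requiring any change in customer code. A minimal sketch, assuming Python >= 3.9 (argparse.BooleanOptionalAction); not taken from the PR:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--remove_input_padding",
    action=argparse.BooleanOptionalAction,
    default=True,  # remove_input_padding stays True when nothing is passed
)
# Passing no flag keeps the current behavior; users opt out explicitly
# with the auto-generated --no-remove_input_padding.
args = parser.parse_args([])
assert args.remove_input_padding is True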
@@ -214,6 +214,8 @@ def run_trt_llm_inference(
     max_prompt_embedding_table_size=max_prompt_embedding_table_size,
     use_lora_plugin=use_lora_plugin,
     lora_target_modules=lora_target_modules,
+    max_num_tokens=int(max_input_token * max_batch_size * 0.2),
Could we possibly have a comment on this line explaining the expression max_input_token * max_batch_size * 0.2, e.g. what 0.2 stands for and how it is derived in general?
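Something along these lines, perhaps; note the stated rationale for 0.2 is an assumption on my part, not from the PR:

max_input_token, max_batch_size = 512, 8  # example values
# With remove_input_padding enabled, batches rarely reach the worst case of
# max_input_token * max_batch_size tokens; budget roughly 20% of it instead.
max_num_tokens = int(max_input_token * max_batch_size * 0.2)  # 819 here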
@@ -214,6 +214,8 @@ def run_trt_llm_inference(
     max_prompt_embedding_table_size=max_prompt_embedding_table_size,
     use_lora_plugin=use_lora_plugin,
     lora_target_modules=lora_target_modules,
+    max_num_tokens=int(max_input_token * max_batch_size * 0.2),
+    opt_num_tokens=60,
Let's have this 60 in the place where all the defaults are defined? All the other params are defined elsewhere.
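A minimal sketch of hoisting the magic number next to the other defaults (the constant name and trimmed-down signature are hypothetical):

DEFAULT_OPT_NUM_TOKENS = 60

def run_trt_llm_inference(opt_num_tokens: int = DEFAULT_OPT_NUM_TOKENS):
    # Call sites then omit the argument unless they need to override it.
    print(opt_num_tokens)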
paged_kv_cache: bool = True,
remove_input_padding: bool = True,
max_num_tokens: int = None,
opt_num_tokens: int = None,
Formally, the max_num_tokens and opt_num_tokens parameters should have Optional[int] typing.
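The suggested fix, sketched on a trimmed-down signature (the function name here is hypothetical):

from typing import Optional

def export_engine(
    paged_kv_cache: bool = True,
    remove_input_padding: bool = True,
    max_num_tokens: Optional[int] = None,
    opt_num_tokens: Optional[int] = None,
) -> None:
    ...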
use_lora_plugin=args.use_lora_plugin,
lora_target_modules=args.lora_target_modules,
max_lora_rank=args.max_lora_rank,
save_nemo_model_config=True,
Maybe also here: save_nemo_model_config=True -> args.save_nemo_model_config, and pipeline_parallel_size=1 -> args.pipeline_parallel_size (if it is possible to change this at all currently). A sketch of the wiring follows.
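As sketched below, both literals could come from argparse instead; the flag names are assumptions based on this comment, not flags confirmed to exist in the script:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--save_nemo_model_config", action="store_true")
parser.add_argument("--pipeline_parallel_size", type=int, default=1)
args = parser.parse_args()

# ...then pass args.save_nemo_model_config and args.pipeline_parallel_size
# to the export call instead of the hard-coded True and 1.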
Are all argparse args in https://github.com/NVIDIA/NeMo/blob/main/scripts/export/export_to_trt_llm.py#L24 and https://github.com/NVIDIA/NeMo/blob/main/scripts/deploy/nlp/deploy_triton.py#L28 the same? If yes, or if a reasonable common subset exists, then we could perhaps define them in a single location somewhere in …
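One way to define the shared flags once and reuse them in both scripts; the helper name and its location are hypothetical, and only flags from this PR are shown:

import argparse

def add_common_trt_llm_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Shared TRT-LLM export flags, defined in one place and imported by
    # both export_to_trt_llm.py and deploy_triton.py.
    parser.add_argument("--max_num_tokens", type=int, default=None)
    parser.add_argument("--opt_num_tokens", type=int, default=None)
    return parser

# Each script would then build its parser via:
# parser = add_common_trt_llm_args(argparse.ArgumentParser())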
I left some comments, rather minor things like params default organization, naming, etc.
* Add params like max_num_tokens and opt_num_tokens
* remove padding param added
* update params like max_num_token
* Apply isort and black reformatting
* remove context context_fmha param for now
* add params like max num token to the script
* Apply isort and black reformatting

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
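A hedged usage sketch for the new parameters; the NeMo export API details below (import path, constructor, and keyword names other than max_num_tokens and opt_num_tokens) are assumptions, not taken from this PR:

from nemo.export import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/trt_llm_model_dir")
exporter.export(
    nemo_checkpoint_path="/models/model.nemo",
    model_type="llama",
    max_batch_size=8,
    max_input_token=512,
    max_num_tokens=819,  # e.g. int(512 * 8 * 0.2)
    opt_num_tokens=60,
)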
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information