Add TRT-LLM params like max_num_tokens and opt_num_tokens #9210
Conversation
max_lora_rank (int): maximum lora rank.
max_num_tokens (int):
opt_num_tokens (int):
save_nemo_model_config (bool):
Missing descriptions for the last three params.
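For reference, the missing entries could read along these lines (the wording is a suggestion based on the usual TensorRT-LLM meaning of these parameters, not taken from the PR):

    max_lora_rank (int): maximum lora rank.
    max_num_tokens (int): maximum number of batched input tokens the engine
        can process in a single forward pass.
    opt_num_tokens (int): number of batched input tokens the engine is
        optimized for; typically below max_num_tokens.
    save_nemo_model_config (bool): whether to save the NeMo model config
        alongside the exported engine.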
"-dcf", | ||
"--disable_context_fmha", | ||
"-drip", | ||
"--disable_remove_input_padding", |
Could we rather use --remove_input_padding instead of negated flags like --disable_remove_input_padding?
That would require providing --remove_input_padding in customer code to keep the current behavior unchanged (i.e. to use remove_input_padding=True when no parameter is specified). It is only worthwhile if it leads to better naming consistency with the TensorRT-LLM code.
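For what it's worth, a positive flag name can keep the current default without requiring any change in customer code. A minimal sketch, assuming Python >= 3.9 (argparse.BooleanOptionalAction); not taken from the PR:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--remove_input_padding",
    action=argparse.BooleanOptionalAction,
    default=True,  # remove_input_padding stays True when nothing is passed
)
# Passing no flag keeps the current behavior; users opt out explicitly
# with the auto-generated --no-remove_input_padding.
args = parser.parse_args([])
assert args.remove_input_padding is True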
@@ -214,6 +214,8 @@ def run_trt_llm_inference(
     max_prompt_embedding_table_size=max_prompt_embedding_table_size,
     use_lora_plugin=use_lora_plugin,
     lora_target_modules=lora_target_modules,
+    max_num_tokens=int(max_input_token * max_batch_size * 0.2),
Could we possibly have a comment on this line explaining the expression max_input_token * max_batch_size * 0.2, e.g. what 0.2 stands for and how it is derived in general?
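Something along these lines, perhaps; note the stated rationale for 0.2 is an assumption on my part, not from the PR:

max_input_token, max_batch_size = 512, 8  # example values
# With remove_input_padding enabled, batches rarely reach the worst case of
# max_input_token * max_batch_size tokens; budget roughly 20% of it instead.
max_num_tokens = int(max_input_token * max_batch_size * 0.2)  # 819 here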
@@ -214,6 +214,8 @@ def run_trt_llm_inference(
     max_prompt_embedding_table_size=max_prompt_embedding_table_size,
     use_lora_plugin=use_lora_plugin,
     lora_target_modules=lora_target_modules,
+    max_num_tokens=int(max_input_token * max_batch_size * 0.2),
+    opt_num_tokens=60,
Let's have this 60 in the place where all the defaults are defined? All the other params are defined elsewhere.
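A minimal sketch of hoisting the magic number next to the other defaults (the constant name and trimmed-down signature are hypothetical):

DEFAULT_OPT_NUM_TOKENS = 60

def run_trt_llm_inference(opt_num_tokens: int = DEFAULT_OPT_NUM_TOKENS):
    # Call sites then omit the argument unless they need to override it.
    print(opt_num_tokens)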
paged_kv_cache: bool = True,
remove_input_padding: bool = True,
max_num_tokens: int = None,
opt_num_tokens: int = None,
Formally, the max_num_tokens and opt_num_tokens parameters should have Optional[int] typing.
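The suggested fix, sketched on a trimmed-down signature (the function name here is hypothetical):

from typing import Optional

def export_engine(
    paged_kv_cache: bool = True,
    remove_input_padding: bool = True,
    max_num_tokens: Optional[int] = None,
    opt_num_tokens: Optional[int] = None,
) -> None:
    ...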
use_lora_plugin=args.use_lora_plugin,
lora_target_modules=args.lora_target_modules,
max_lora_rank=args.max_lora_rank,
save_nemo_model_config=True,
Maybe also here: save_nemo_model_config=True -> args.save_nemo_model_config, and pipeline_parallel_size=1 -> args.pipeline_parallel_size (if it is possible to change this at all currently). A sketch of the wiring follows.
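As sketched below, both literals could come from argparse instead; the flag names are assumptions based on this comment, not flags confirmed to exist in the script:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--save_nemo_model_config", action="store_true")
parser.add_argument("--pipeline_parallel_size", type=int, default=1)
args = parser.parse_args()

# ...then pass args.save_nemo_model_config and args.pipeline_parallel_size
# to the export call instead of the hard-coded True and 1.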
Are all argparse args in https://github.com/NVIDIA/NeMo/blob/main/scripts/export/export_to_trt_llm.py#L24 and https://github.com/NVIDIA/NeMo/blob/main/scripts/deploy/nlp/deploy_triton.py#L28 the same? If yes, or if a reasonable common subset exists, then we could perhaps define them in a single location somewhere in …
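One way to define the shared flags once and reuse them in both scripts; the helper name and its location are hypothetical, and only flags from this PR are shown:

import argparse

def add_common_trt_llm_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Shared TRT-LLM export flags, defined in one place and imported by
    # both export_to_trt_llm.py and deploy_triton.py.
    parser.add_argument("--max_num_tokens", type=int, default=None)
    parser.add_argument("--opt_num_tokens", type=int, default=None)
    return parser

# Each script would then build its parser via:
# parser = add_common_trt_llm_args(argparse.ArgumentParser())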
I left some comments, rather minor things like params default organization, naming, etc.
* Add params like max_num_tokens and opt_num_tokens
* remove padding param added
* update params like max_num_token
* Apply isort and black reformatting
* remove context context_fmha param for now
* add params like max num token to the script
* Apply isort and black reformatting

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
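A hedged usage sketch for the new parameters; the NeMo export API details below (import path, constructor, and keyword names other than max_num_tokens and opt_num_tokens) are assumptions, not taken from this PR:

from nemo.export import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/trt_llm_model_dir")
exporter.export(
    nemo_checkpoint_path="/models/model.nemo",
    model_type="llama",
    max_batch_size=8,
    max_input_token=512,
    max_num_tokens=819,  # e.g. int(512 * 8 * 0.2)
    opt_num_tokens=60,
)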
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information