TRTLLM new API support #9003

Merged: 41 commits into main, May 13, 2024

Conversation

meatybobby (Collaborator) opened this pull request:

What does this PR do?

Adds support for the new TensorRT-LLM checkpoint and build APIs in the NeMo export path and moves runtime inference to ModelRunnerCpp.

Collection: [Note which collection this PR will affect]

Changelog

  • Export NeMo checkpoints through the new TRT-LLM checkpoint and build APIs.
  • Switch the runtime to ModelRunnerCpp and update the generate API.
  • Add falcon and MoE config support, p-tuning and LoRA table handling, and TP/PP fixes.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
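A minimal usage sketch, assuming the TensorRTLLM exporter wrapper in nemo/export/tensorrt_llm.py; the paths are illustrative and exact parameter names may differ between NeMo versions:

# Hedged sketch only; checkpoint path and keyword names are illustrative assumptions.
from nemo.export import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")      # directory where the engine is written
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/LLAMA2-7B-base.nemo",     # hypothetical .nemo checkpoint
    model_type="llama",
)
# Parameter name may vary by NeMo version (e.g. max_output_token vs. max_output_len).
output = trt_llm_exporter.forward(["What is the fastest GPU?"], max_output_len=32)
print(output)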

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There's no need to comment jenkins on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@oyilmaz-nvidia (Collaborator) left a comment:

Tested llama, gemma, starcoder1, and the nemotrons, and they work, but I'm getting the following error from starcoder2:

Traceback (most recent call last):
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 540, in <module>
    run_inference_tests(args)
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 453, in run_inference_tests
    ) = run_existing_checkpoints(
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 319, in run_existing_checkpoints
    return run_trt_llm_inference(
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 195, in run_trt_llm_inference
    trt_llm_exporter.export(
  File "/opt/NeMo/nemo/export/tensorrt_llm.py", line 204, in export
    build_and_save_engine(
  File "/opt/NeMo/nemo/export/trt_llm/tensorrt_llm_build.py", line 411, in build_and_save_engine
    model = model_cls.from_config(model_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 360, in from_config
    return cls(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 328, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/gpt/model.py", line 224, in __init__
    transformer = GPTModel(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/gpt/model.py", line 179, in __init__
    self.layers = DecoderLayerList(GPTDecoderLayer, config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 262, in __init__
    super().__init__([cls(config, idx) for idx in self.layer_list])
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 262, in <listcomp>
    super().__init__([cls(config, idx) for idx in self.layer_list])
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/gpt/model.py", line 80, in __init__
    self.attention = Attention(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 510, in __init__
    self.rotary_embedding_dim = int(self.attention_head_size *
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
Exception ignored in: <function PretrainedModel.__del__ at 0x7ff64dcdecb0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 351, in __del__
    self.release()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 348, in release
    release_gc()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 439, in release_gc
    torch.cuda.ipc_collect()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 813, in ipc_collect
    _lazy_init()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 321, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable

CUDA call was originally invoked at:

  File "/opt/NeMo/tests/export/test_nemo_export.py", line 19, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/__init__.py", line 1427, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 1303, in <module>
    _lazy_call(_register_triton_kernels)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 244, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

And from falcon, I'm getting the following:

Path: /opt/checkpoints/FALCON-7B-base/FALCON-7B-base-1.nemo and model: FALCON-7B-base with 1 gpus will be tested
saving weights: 100%|█████████████████████████████████████████████████████████████████████████████████████| 194/194 [00:03<00:00, 50.28it/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 287/287 [00:00<00:00, 881kB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████| 2.73M/2.73M [00:00<00:00, 8.40MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████| 281/281 [00:00<00:00, 1.02MB/s]
Traceback (most recent call last):
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 540, in <module>
    run_inference_tests(args)
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 453, in run_inference_tests
    ) = run_existing_checkpoints(
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 319, in run_existing_checkpoints
    return run_trt_llm_inference(
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 195, in run_trt_llm_inference
    trt_llm_exporter.export(
  File "/opt/NeMo/nemo/export/tensorrt_llm.py", line 192, in export
    weights_dicts, model_configs, self.tokenizer = nemo_to_trtllm_config(
  File "/opt/NeMo/nemo/export/trt_llm/nemo_utils.py", line 374, in nemo_to_trtllm_config
    DECODER_MODEL_TYPE[decoder_type],
KeyError: 'falcon'

We should either update these models to the new API or fall back to the old code path to support them.

I haven't tested mistral and mixtral yet.
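For reference, the starcoder2 failure is the rotary embedding fraction arriving as None when Attention computes rotary_embedding_dim (int * None), and the falcon failure is a missing "falcon" entry in the DECODER_MODEL_TYPE mapping in nemo_utils.py. A minimal, self-contained sketch of the kind of change the later commits ("Add rotary_pct default value", "Add falcon") appear to make; the config key and the TRT-LLM architecture names below are assumptions:

# Hypothetical sketch; the real mapping and key names live in nemo/export/trt_llm/nemo_utils.py.
DECODER_MODEL_TYPE = {
    "llama": "LlamaForCausalLM",      # architecture names assumed for illustration
    "gptnext": "GPTForCausalLM",
    "falcon": "FalconForCausalLM",    # the missing entry that triggers the KeyError above
}

nemo_model_config = {"num_attention_heads": 32}               # illustrative config without rotary info
rotary_pct = nemo_model_config.get("rotary_percentage", 1.0)  # default so head_size * rotary_pct is never int * None
print(rotary_pct, DECODER_MODEL_TYPE["falcon"])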


(Diff context for the review comment below:)

prompt_tasks = None if task_ids is None else ",".join(str(task) for task in task_ids)

outputs = decoder.generate(


I think you can just use the high-level API (HLAPI) LLM class to run inference. That API is much cleaner to call than this ModelRunnerCpp.
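For illustration, a minimal sketch of what an HLAPI-based call could look like; the import path and signatures differ across TensorRT-LLM versions (newer releases expose LLM and SamplingParams at the top level), so treat the details as assumptions:

# Hypothetical sketch; module path and argument names depend on the installed TensorRT-LLM version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/tmp/trt_llm_engine")   # illustrative engine / checkpoint directory
sampling = SamplingParams(max_tokens=64, temperature=0.1)
outputs = llm.generate(["What is the fastest GPU for LLM inference?"], sampling)
for out in outputs:
    print(out.outputs[0].text)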

@meatybobby changed the title from "TRTLLM new API support (WIP)" to "TRTLLM new API support" on May 3, 2024.
nemo/export/trt_llm/nemo/convert.py: Fixed
nemo/export/trt_llm/nemo_utils.py: Fixed
@oyilmaz-nvidia (Collaborator) commented:

Tested most of the models and I'm getting this gibberish output only from the nemotron 22B base one:

--- Output:  [['is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is in the best way to get the best price for your car.\n\nHow to sell your car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car'], ['is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is the best way to get to the hotel from the airport?\n\nThe hotel is located 1 km from the airport, which means that there is is is is is is is is is is is is is is is is is is is is is is is is is is'], ['is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is the best way to get to the hotel from the airport?\n\nThe hotel is located 1 km from the airport.\n\nWhat is the best way way way way way way way way way way way way way way way way way way way way way way way way way']]

Can you guys please take a look at this?

oyilmaz-nvidia previously approved these changes on May 8, 2024.

@oyilmaz-nvidia (Collaborator) left a comment:

Let's merge this PR and then we'll likely do a cleanup; basically, we'll need to remove the parts we don't use anymore.

@oyilmaz-nvidia (Collaborator) commented:

jenkins

oyilmaz-nvidia and others added 2 commits on May 13, 2024 at 13:27:
Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
@pablo-garay merged commit 77090d4 into main on May 13, 2024.
131 checks passed
@pablo-garay deleted the bobchen/nemotron branch on May 13, 2024 at 23:14.
pablo-garay pushed a commit that referenced this pull request May 22, 2024
* Add trtllm checkpoint

* Change model config

* fix no query_group

* Using build API

* Change export to new API

* Update generate API

* Fix runtime config

* Fix for llama

* Fix for ptuning

* Fix TP issue

* Change TP rank for building weight dict

* Add lora config

* add prompt embedding table config

* Fix PP isue

* PP layers fix

* Fix no prompt task ids

* Add bos for Gemma

* Add multi block mode

* Embedding and layernorm for PP

* MPI multiprocess support for multinode

* Only output text on first rank

* Change to ModelRunnerCpp

* Add falcon

* Add rotary_pct default value

* Falcon fix

* Add MOE config

* Fix MOE weight dict

* Clean code

* Add rotary_base

* Fix MOE config

* Fix falcon new architecture

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix Gemma 7B

* Add rotary_scaling

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>

---------

Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: abharwani <abharwani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jun 5, 2024
(Same squashed commit message as above.)
Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
(Same squashed commit message as above.)
@ko3n1g mentioned this pull request on Jul 18, 2024.