TRTLLM new API support #9003

Merged: 41 commits into main, May 13, 2024

Conversation

meatybobby (Collaborator) opened this pull request:

What does this PR do?

Adds support for the new TensorRT-LLM checkpoint and build APIs in the NeMo export path and moves runtime inference to ModelRunnerCpp.

Collection: [Note which collection this PR will affect]

Changelog

  • Export NeMo checkpoints through the new TRT-LLM checkpoint and build APIs.
  • Switch the runtime to ModelRunnerCpp and update the generate API.
  • Add falcon and MoE config support, p-tuning and LoRA table handling, and TP/PP fixes.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
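A minimal usage sketch, assuming the TensorRTLLM exporter wrapper in nemo/export/tensorrt_llm.py; the paths are illustrative and exact parameter names may differ between NeMo versions:

# Hedged sketch only; checkpoint path and keyword names are illustrative assumptions.
from nemo.export import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")      # directory where the engine is written
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/LLAMA2-7B-base.nemo",     # hypothetical .nemo checkpoint
    model_type="llama",
)
# Parameter name may vary by NeMo version (e.g. max_output_token vs. max_output_len).
output = trt_llm_exporter.forward(["What is the fastest GPU?"], max_output_len=32)
print(output)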

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There's no need to comment jenkins on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@oyilmaz-nvidia (Collaborator) left a comment:

Tested llama, gemma, starcoder1, and the nemotrons, and they work, but I'm getting the following error from starcoder2:

Traceback (most recent call last):
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 540, in <module>
    run_inference_tests(args)
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 453, in run_inference_tests
    ) = run_existing_checkpoints(
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 319, in run_existing_checkpoints
    return run_trt_llm_inference(
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 195, in run_trt_llm_inference
    trt_llm_exporter.export(
  File "/opt/NeMo/nemo/export/tensorrt_llm.py", line 204, in export
    build_and_save_engine(
  File "/opt/NeMo/nemo/export/trt_llm/tensorrt_llm_build.py", line 411, in build_and_save_engine
    model = model_cls.from_config(model_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 360, in from_config
    return cls(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 328, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/gpt/model.py", line 224, in __init__
    transformer = GPTModel(config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/gpt/model.py", line 179, in __init__
    self.layers = DecoderLayerList(GPTDecoderLayer, config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 262, in __init__
    super().__init__([cls(config, idx) for idx in self.layer_list])
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 262, in <listcomp>
    super().__init__([cls(config, idx) for idx in self.layer_list])
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/gpt/model.py", line 80, in __init__
    self.attention = Attention(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 510, in __init__
    self.rotary_embedding_dim = int(self.attention_head_size *
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
Exception ignored in: <function PretrainedModel.__del__ at 0x7ff64dcdecb0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 351, in __del__
    self.release()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 348, in release
    release_gc()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 439, in release_gc
    torch.cuda.ipc_collect()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 813, in ipc_collect
    _lazy_init()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 321, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable

CUDA call was originally invoked at:

  File "/opt/NeMo/tests/export/test_nemo_export.py", line 19, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/__init__.py", line 1427, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 1303, in <module>
    _lazy_call(_register_triton_kernels)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 244, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

And from falcon, I'm getting the following:

Path: /opt/checkpoints/FALCON-7B-base/FALCON-7B-base-1.nemo and model: FALCON-7B-base with 1 gpus will be tested
saving weights: 100%|█████████████████████████████████████████████████████████████████████████████████████| 194/194 [00:03<00:00, 50.28it/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 287/287 [00:00<00:00, 881kB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████| 2.73M/2.73M [00:00<00:00, 8.40MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████| 281/281 [00:00<00:00, 1.02MB/s]
Traceback (most recent call last):
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 540, in <module>
    run_inference_tests(args)
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 453, in run_inference_tests
    ) = run_existing_checkpoints(
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 319, in run_existing_checkpoints
    return run_trt_llm_inference(
  File "/opt/NeMo/tests/export/test_nemo_export.py", line 195, in run_trt_llm_inference
    trt_llm_exporter.export(
  File "/opt/NeMo/nemo/export/tensorrt_llm.py", line 192, in export
    weights_dicts, model_configs, self.tokenizer = nemo_to_trtllm_config(
  File "/opt/NeMo/nemo/export/trt_llm/nemo_utils.py", line 374, in nemo_to_trtllm_config
    DECODER_MODEL_TYPE[decoder_type],
KeyError: 'falcon'

We should either update these models to the new API or fall back to the old code path to support them.

I haven't tested mistral and mixtral yet.
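For reference, the starcoder2 failure is the rotary embedding fraction arriving as None when Attention computes rotary_embedding_dim (int * None), and the falcon failure is a missing "falcon" entry in the DECODER_MODEL_TYPE mapping in nemo_utils.py. A minimal, self-contained sketch of the kind of change the later commits ("Add rotary_pct default value", "Add falcon") appear to make; the config key and the TRT-LLM architecture names below are assumptions:

# Hypothetical sketch; the real mapping and key names live in nemo/export/trt_llm/nemo_utils.py.
DECODER_MODEL_TYPE = {
    "llama": "LlamaForCausalLM",      # architecture names assumed for illustration
    "gptnext": "GPTForCausalLM",
    "falcon": "FalconForCausalLM",    # the missing entry that triggers the KeyError above
}

nemo_model_config = {"num_attention_heads": 32}               # illustrative config without rotary info
rotary_pct = nemo_model_config.get("rotary_percentage", 1.0)  # default so head_size * rotary_pct is never int * None
print(rotary_pct, DECODER_MODEL_TYPE["falcon"])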


(Diff context for the review comment below:)

prompt_tasks = None if task_ids is None else ",".join(str(task) for task in task_ids)

outputs = decoder.generate(


I think you can just use the high-level API (HLAPI) LLM class to run inference. That API is much cleaner to call than this ModelRunnerCpp.
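For illustration, a minimal sketch of what an HLAPI-based call could look like; the import path and signatures differ across TensorRT-LLM versions (newer releases expose LLM and SamplingParams at the top level), so treat the details as assumptions:

# Hypothetical sketch; module path and argument names depend on the installed TensorRT-LLM version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/tmp/trt_llm_engine")   # illustrative engine / checkpoint directory
sampling = SamplingParams(max_tokens=64, temperature=0.1)
outputs = llm.generate(["What is the fastest GPU for LLM inference?"], sampling)
for out in outputs:
    print(out.outputs[0].text)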

@meatybobby changed the title from "TRTLLM new API support (WIP)" to "TRTLLM new API support" on May 3, 2024.
nemo/export/trt_llm/nemo/convert.py: Fixed
nemo/export/trt_llm/nemo_utils.py: Fixed
@oyilmaz-nvidia (Collaborator) commented:

Tested most of the models and I'm getting this gibberish output only from the nemotron 22B base one:

--- Output:  [['is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is in the best way to get the best price for your car.\n\nHow to sell your car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car car'], ['is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is the best way to get to the hotel from the airport?\n\nThe hotel is located 1 km from the airport, which means that there is is is is is is is is is is is is is is is is is is is is is is is is is is'], ['is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is is the best way to get to the hotel from the airport?\n\nThe hotel is located 1 km from the airport.\n\nWhat is the best way way way way way way way way way way way way way way way way way way way way way way way way way']]

Can you guys please take a look at this?

oyilmaz-nvidia previously approved these changes on May 8, 2024.

@oyilmaz-nvidia (Collaborator) left a comment:

Let's merge this PR and then we'll likely do a cleanup; basically, we'll need to remove the parts we don't use anymore.

@oyilmaz-nvidia (Collaborator) commented:

jenkins

oyilmaz-nvidia and others added 2 commits on May 13, 2024 at 13:27:
Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
@pablo-garay merged commit 77090d4 into main on May 13, 2024.
131 checks passed
@pablo-garay deleted the bobchen/nemotron branch on May 13, 2024 at 23:14.
pablo-garay pushed a commit that referenced this pull request May 22, 2024
* Add trtllm checkpoint

* Change model config

* fix no query_group

* Using build API

* Change export to new API

* Update generate API

* Fix runtime config

* Fix for llama

* Fix for ptuning

* Fix TP issue

* Change TP rank for building weight dict

* Add lora config

* add prompt embedding table config

* Fix PP isue

* PP layers fix

* Fix no prompt task ids

* Add bos for Gemma

* Add multi block mode

* Embedding and layernorm for PP

* MPI multiprocess support for multinode

* Only output text on first rank

* Change to ModelRunnerCpp

* Add falcon

* Add rotary_pct default value

* Falcon fix

* Add MOE config

* Fix MOE weight dict

* Clean code

* Add rotary_base

* Fix MOE config

* Fix falcon new architecture

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix Gemma 7B

* Add rotary_scaling

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>

---------

Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: abharwani <abharwani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jun 5, 2024
(Same squashed commit message as above.)
Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
(Same squashed commit message as above.)
@ko3n1g mentioned this pull request on Jul 18, 2024.