TRTLLM new API support #9003
Conversation
Tested llama, gemma, starcoder1, and nemotron models, and they all work. But I'm getting the following error from starcoder2:
Traceback (most recent call last):
File "/opt/NeMo/tests/export/test_nemo_export.py", line 540, in <module>
run_inference_tests(args)
File "/opt/NeMo/tests/export/test_nemo_export.py", line 453, in run_inference_tests
) = run_existing_checkpoints(
File "/opt/NeMo/tests/export/test_nemo_export.py", line 319, in run_existing_checkpoints
return run_trt_llm_inference(
File "/opt/NeMo/tests/export/test_nemo_export.py", line 195, in run_trt_llm_inference
trt_llm_exporter.export(
File "/opt/NeMo/nemo/export/tensorrt_llm.py", line 204, in export
build_and_save_engine(
File "/opt/NeMo/nemo/export/trt_llm/tensorrt_llm_build.py", line 411, in build_and_save_engine
model = model_cls.from_config(model_config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 360, in from_config
return cls(config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 328, in __call__
obj = type.__call__(cls, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/gpt/model.py", line 224, in __init__
transformer = GPTModel(config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/gpt/model.py", line 179, in __init__
self.layers = DecoderLayerList(GPTDecoderLayer, config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 262, in __init__
super().__init__([cls(config, idx) for idx in self.layer_list])
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 262, in <listcomp>
super().__init__([cls(config, idx) for idx in self.layer_list])
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/gpt/model.py", line 80, in __init__
self.attention = Attention(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 510, in __init__
self.rotary_embedding_dim = int(self.attention_head_size *
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
Exception ignored in: <function PretrainedModel.__del__ at 0x7ff64dcdecb0>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 351, in __del__
self.release()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 348, in release
release_gc()
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 439, in release_gc
torch.cuda.ipc_collect()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 813, in ipc_collect
_lazy_init()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 321, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable
CUDA call was originally invoked at:
File "/opt/NeMo/tests/export/test_nemo_export.py", line 19, in <module>
import torch
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/usr/local/lib/python3.10/dist-packages/torch/__init__.py", line 1427, in <module>
_C._initExtension(manager_path())
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 1303, in <module>
_lazy_call(_register_triton_kernels)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 244, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))
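For what it's worth, the failing line computes rotary_embedding_dim as attention_head_size multiplied by a rotary fraction that is evidently None for starcoder2, so the converted config is most likely missing a rotary-percentage value. Below is a minimal sketch of the kind of default that would avoid this; the key names are assumptions, not the actual NeMo/TRT-LLM schema (note the commit list further down includes an "Add rotary_pct default value" change along these lines):

# Minimal sketch (not the actual NeMo code): default the rotary
# fraction to 1.0 when the source config doesn't define it, so the
# TRT-LLM Attention layer never computes attention_head_size * None.
nemo_model_config = {"rotary_percentage": None}  # e.g. what a starcoder2 export might yield

rotary_pct = nemo_model_config.get("rotary_percentage")
trtllm_config = {"rotary_pct": 1.0 if rotary_pct is None else float(rotary_pct)}  # assumed key name
print(trtllm_config)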
And from falcon, I'm getting the following:
Path: /opt/checkpoints/FALCON-7B-base/FALCON-7B-base-1.nemo and model: FALCON-7B-base with 1 gpus will be tested
saving weights: 100%|█████████████████████████████████████████████████████████████████████████████████████| 194/194 [00:03<00:00, 50.28it/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 287/287 [00:00<00:00, 881kB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████| 2.73M/2.73M [00:00<00:00, 8.40MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████| 281/281 [00:00<00:00, 1.02MB/s]
Traceback (most recent call last):
File "/opt/NeMo/tests/export/test_nemo_export.py", line 540, in <module>
run_inference_tests(args)
File "/opt/NeMo/tests/export/test_nemo_export.py", line 453, in run_inference_tests
) = run_existing_checkpoints(
File "/opt/NeMo/tests/export/test_nemo_export.py", line 319, in run_existing_checkpoints
return run_trt_llm_inference(
File "/opt/NeMo/tests/export/test_nemo_export.py", line 195, in run_trt_llm_inference
trt_llm_exporter.export(
File "/opt/NeMo/nemo/export/tensorrt_llm.py", line 192, in export
weights_dicts, model_configs, self.tokenizer = nemo_to_trtllm_config(
File "/opt/NeMo/nemo/export/trt_llm/nemo_utils.py", line 374, in nemo_to_trtllm_config
DECODER_MODEL_TYPE[decoder_type],
KeyError: 'falcon'
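The KeyError means decoder_type == "falcon" has no entry in the DECODER_MODEL_TYPE mapping in nemo_utils.py. Here is a sketch of the likely shape of the fix; every entry except the falcon line is illustrative rather than copied from the file:

# Sketch: extend the decoder-type -> TRT-LLM architecture mapping so
# falcon checkpoints resolve to a model class name. Illustrative only.
DECODER_MODEL_TYPE = {
    "gptnext": "GPTForCausalLM",    # assumed existing entry
    "llama": "LlamaForCausalLM",    # assumed existing entry
    "falcon": "FalconForCausalLM",  # proposed addition for this error
}
print(DECODER_MODEL_TYPE["falcon"])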
We should either update these models to the new API or fall back to the old code path to support them.
I haven't tested mistral and mixtral yet.
prompt_tasks = None if task_ids is None else ",".join(str(task) for task in task_ids)
...
outputs = decoder.generate(
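For context on the snippet above, here is a minimal sketch of driving ModelRunnerCpp directly, roughly following the TensorRT-LLM examples/run.py pattern; the paths are placeholders and keyword names vary across releases:

# Sketch of direct ModelRunnerCpp usage (TensorRT-LLM 0.9-era API;
# check the installed version for exact signatures).
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunnerCpp

tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer")     # placeholder path
runner = ModelRunnerCpp.from_dir(engine_dir="/path/to/engine_dir")  # placeholder path

batch_input_ids = [torch.tensor(tokenizer.encode("Hello, world"), dtype=torch.int32)]
output_ids = runner.generate(
    batch_input_ids=batch_input_ids,
    max_new_tokens=32,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0][0].tolist()))  # shape: [batch, beam, tokens]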
I think you can just use the HLAPI's LLM class to run inference. That API is much cleaner to call than this ModelRunnerCpp.
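For reference, a sketch of the high-level LLM path that comment refers to; the import location has moved across tensorrt_llm releases (older versions exposed it under tensorrt_llm.hlapi), so treat the details as approximate:

# Sketch of the TensorRT-LLM high-level API.
from tensorrt_llm import LLM, SamplingParams  # lived under tensorrt_llm.hlapi in older releases

llm = LLM(model="/path/to/hf_model_or_trtllm_checkpoint")  # placeholder path
params = SamplingParams(max_tokens=32, temperature=0.0)

for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)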
Tested most of the models; I'm getting gibberish output only from the nemotron 22B base one.
Can you please take a look at this?
Let's merge this PR and then we'll likely do a cleanup; basically, we'll need to remove the parts that we don't use anymore.
jenkins
Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
* Add trtllm checkpoint
* Change model config
* Fix no query_group
* Using build API
* Change export to new API
* Update generate API
* Fix runtime config
* Fix for llama
* Fix for ptuning
* Fix TP issue
* Change TP rank for building weight dict
* Add lora config
* Add prompt embedding table config
* Fix PP issue
* PP layers fix
* Fix no prompt task ids
* Add bos for Gemma
* Add multi block mode
* Embedding and layernorm for PP
* MPI multiprocess support for multinode
* Only output text on first rank
* Change to ModelRunnerCpp
* Add falcon
* Add rotary_pct default value
* Falcon fix
* Add MOE config
* Fix MOE weight dict
* Clean code
* Add rotary_base
* Fix MOE config
* Fix falcon new architecture
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix Gemma 7B
* Add rotary_scaling
* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>

---------

Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: abharwani <abharwani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
What does this PR do?
Updates the NeMo TensorRT-LLM export path to the new TensorRT-LLM checkpoint/build/generate APIs and switches runtime inference to ModelRunnerCpp.
Collection: [Note which collection this PR will affect]
Changelog
Usage
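A sketch of how the exporter touched by this PR is driven, based on the calls visible in tests/export/test_nemo_export.py above; paths are placeholders and keyword names may not match the final API exactly:

# Sketch of the export + inference flow exercised by test_nemo_export.py.
from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/opt/checkpoints/engine_dir")  # engine output dir (placeholder)
exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/model.nemo",  # placeholder .nemo path
    model_type="llama",  # e.g. llama / gptnext / falcon
    n_gpus=1,
)
print(exporter.forward(["What is the color of a banana?"]))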
Jenkins CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
There's no need to comment "jenkins" on the PR to trigger Jenkins CI. The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information