
Update TensorRT-LLM #1598

Merged

kaiyux merged 2 commits on May 14, 2024
Conversation

@kaiyux commented on May 14, 2024

  • Model Support
    • Support Neva
    • Support Kosmos-2
  • Features
    • Support quantization for Nemotron models
    • Add LoRA support for Mixtral and Qwen
    • Add weight-stripping feature
      • A new command trtllm-refit is added
      • See documentation at examples/sample_weight_stripping/README.md
    • Add weight streaming feature
      • See documentation at docs/source/advanced/weight-streaming.md
    • The Python high-level API
      • Add embedding parallelism, embedding sharing, and fused MLP support
      • Enable use of the executor API
    • Add in-flight batching support for ChatGLM models
    • Add support to ModelRunnerCpp so that it runs with the executor API for IFB-compatible models
  • API
    • [BREAKING CHANGE] Refactor scheduling configurations
      • Unify SchedulerPolicy, which had the same name in batch_scheduler and executor, and rename it to CapacitySchedulerPolicy.
      • Expand the existing scheduling configuration from SchedulerPolicy to SchedulerConfig to enhance extensibility. The latter also introduces a chunk-based configuration called ContextChunkingPolicy. (A hedged sketch of the new configuration appears after this list.)
    • [BREAKING CHANGE] Remove the use_context_fmha_for_generation argument from the trtllm-build command since it is no longer used
    • [BREAKING CHANGE] The input prompt is removed from the generation output of the generate() and generate_async() APIs.
      • For example, given the prompt A B, the original generation result could be <s>A B C D E, where only C D E is the actual output; the result is now just C D E. (See the sketch after this list.)
    • [BREAKING CHANGE] Switch the default add_special_tokens/skip_special_tokens values in the TensorRT-LLM backend to True, aligning with the Hugging Face setting (triton-inference-server/tensorrtllm_backend#446). Thanks to @XiaobingSuper for the contribution; the changes are integrated in Update TensorRT-LLM backend triton-inference-server/tensorrtllm_backend#454.
    • GptSession and TrtGptModelV1 are marked as deprecated
  • Bug fixes
    • Fix an NVRTC runtime ABI compatibility issue
  • Performance
    • [BREAKING CHANGE] Set the default tokens_per_block argument of the trtllm-build command to 64 for better performance
    • Enhance the multiple-profiles feature; the multiple_profiles argument of the trtllm-build command now builds more optimization profiles for better performance
    • Enhance the custom AllReduce by adding a heuristic: fall back to the native NCCL kernel when hardware requirements are not satisfied, to get the best performance
    • Optimize the performance of the checkpoint conversion process for LLaMA
  • Documentation
    • Add documentation for the KV cache reuse feature; see docs/source/kv_cache_reuse.md
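
As referenced in the scheduling item above, here is a minimal sketch of how the refactored configuration might be wired up through the Python executor bindings. The module path tensorrt_llm.bindings.executor, the SchedulerConfig and ExecutorConfig keyword arguments, and the enum member shown are assumptions inferred from this PR's description, not verified API signatures.

```python
# Hedged sketch of the refactored scheduling configuration.
# Assumption: the Python executor bindings expose SchedulerConfig,
# CapacitySchedulerPolicy, and ExecutorConfig with the names and keyword
# arguments used below; the released signatures may differ.
from tensorrt_llm.bindings import executor as trtllm

# SchedulerPolicy is renamed to CapacitySchedulerPolicy and now lives inside
# a SchedulerConfig, which is also where the chunk-based
# ContextChunkingPolicy can be configured.
scheduler_config = trtllm.SchedulerConfig(
    capacity_scheduler_policy=trtllm.CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
)

# The scheduler configuration is handed to the executor via ExecutorConfig.
executor_config = trtllm.ExecutorConfig(scheduler_config=scheduler_config)
```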

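And a tiny sketch of the generate() breaking change noted above. The import path and LLM constructor are hypothetical stand-ins for the Python high-level API; only the before/after behavior described in the comments comes from the PR description.

```python
# Hypothetical high-level-API usage; the import path and constructor below
# are assumptions and may not match the actual package layout.
from tensorrt_llm.hlapi import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # hypothetical model argument

# Given the prompt "A B":
output = llm.generate("A B")

# Before this release, the returned text echoed the prompt, e.g. "<s>A B C D E".
# After this release, only the newly generated tokens are returned: "C D E".
```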