
Update TensorRT-LLM #1598

Merged

kaiyux merged 2 commits on May 14, 2024
Conversation

@kaiyux commented on May 14, 2024

  • Model Support
    • Support Neva
    • Support Kosmos-2
  • Features
    • Support quantization for Nemotron models
    • Add LoRA support for Mixtral and Qwen
    • Add weight-stripping feature
      • A new command trtllm-refit is added
      • See documentation at examples/sample_weight_stripping/README.md
    • Add weight streaming feature
      • See documentation at docs/source/advanced/weight-streaming.md
    • The Python high-level API
      • Add embedding parallelism, embedding sharing, and fused MLP support
      • Enable use of the executor API
    • Add in-flight batching support for ChatGLM models
    • Add support to ModelRunnerCpp so that it runs with the executor API for IFB-compatible models
  • API
    • [BREAKING CHANGE] Refactor scheduling configurations
      • Unify SchedulerPolicy, which had the same name in batch_scheduler and executor, and rename it to CapacitySchedulerPolicy.
      • Expand the existing scheduling configuration from SchedulerPolicy to SchedulerConfig to enhance extensibility. The latter also introduces a chunk-based configuration called ContextChunkingPolicy. (A hedged sketch of the new configuration appears after this list.)
    • [BREAKING CHANGE] Remove the use_context_fmha_for_generation argument from the trtllm-build command since it is no longer used
    • [BREAKING CHANGE] The input prompt is removed from the generation output of the generate() and generate_async() APIs.
      • For example, given the prompt A B, the original generation result could be <s>A B C D E, where only C D E is the actual output; the result is now just C D E. (See the sketch after this list.)
    • [BREAKING CHANGE] Switch the default add_special_tokens/skip_special_tokens values in the TensorRT-LLM backend to True, aligning with the Hugging Face setting (triton-inference-server/tensorrtllm_backend#446). Thanks to @XiaobingSuper for the contribution; the changes are integrated in Update TensorRT-LLM backend triton-inference-server/tensorrtllm_backend#454.
    • GptSession and TrtGptModelV1 are marked as deprecated
  • Bug fixes
    • Fix an NVRTC runtime ABI compatibility issue
  • Performance
    • [BREAKING CHANGE] Set the default tokens_per_block argument of the trtllm-build command to 64 for better performance
    • Enhance the multiple-profiles feature; the multiple_profiles argument of the trtllm-build command now builds more optimization profiles for better performance
    • Enhance the custom AllReduce by adding a heuristic: fall back to the native NCCL kernel when hardware requirements are not satisfied, to get the best performance
    • Optimize the performance of the checkpoint conversion process for LLaMA
  • Documentation
    • Add documentation for the KV cache reuse feature; see docs/source/kv_cache_reuse.md
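
As referenced in the scheduling item above, here is a minimal sketch of how the refactored configuration might be wired up through the Python executor bindings. The module path tensorrt_llm.bindings.executor, the SchedulerConfig and ExecutorConfig keyword arguments, and the enum member shown are assumptions inferred from this PR's description, not verified API signatures.

```python
# Hedged sketch of the refactored scheduling configuration.
# Assumption: the Python executor bindings expose SchedulerConfig,
# CapacitySchedulerPolicy, and ExecutorConfig with the names and keyword
# arguments used below; the released signatures may differ.
from tensorrt_llm.bindings import executor as trtllm

# SchedulerPolicy is renamed to CapacitySchedulerPolicy and now lives inside
# a SchedulerConfig, which is also where the chunk-based
# ContextChunkingPolicy can be configured.
scheduler_config = trtllm.SchedulerConfig(
    capacity_scheduler_policy=trtllm.CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
)

# The scheduler configuration is handed to the executor via ExecutorConfig.
executor_config = trtllm.ExecutorConfig(scheduler_config=scheduler_config)
```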

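And a tiny sketch of the generate() breaking change noted above. The import path and LLM constructor are hypothetical stand-ins for the Python high-level API; only the before/after behavior described in the comments comes from the PR description.

```python
# Hypothetical high-level-API usage; the import path and constructor below
# are assumptions and may not match the actual package layout.
from tensorrt_llm.hlapi import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # hypothetical model argument

# Given the prompt "A B":
output = llm.generate("A B")

# Before this release, the returned text echoed the prompt, e.g. "<s>A B C D E".
# After this release, only the newly generated tokens are returned: "C D E".
```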