TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference

Documentation python cuda trt version license

Architecture   |   Results   |   Examples   |   Documentation


How to use TensorRT-LLM for LLMServingSim

This repo is modified to generate the performance model used in LLMServingSim.

We use an NVIDIA RTX 3090 and GPT as the example device and model; modify the scripts and code for your own device and model.

For other technical issues while using TensorRT-LLM, refer to the original documentation and GitHub repository.

  1. Clone the repository
git clone https://github.com/JaehongCS20/TensorRT-LLM.git
cd TensorRT-LLM
  2. Update the submodules (ensure that Git LFS is installed)
# Install Git LFS (optional if already installed)
apt-get update && apt-get -y install git git-lfs
git lfs install
# Update the repo
git submodule update --init --recursive
git lfs pull
  3. Build the Docker image and run it (modify run_docker.sh for your system configuration)
make -C docker release_build
./run_docker.sh
  4. Build your TensorRT-LLM model

We use GPT as the example.

cd examples/gpt
./make_gpt.sh
  5. Run the profiling scripts to generate the performance model for LLMServingSim (a parsing sketch follows this list)
./profile.sh
python parse_csv.py
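
The repo's parse_csv.py performs the actual conversion of the profiled results; the sketch below only illustrates the general idea of turning measured latencies into a lookup table that a simulator can query. It is a minimal sketch using hypothetical column names (batch_size, input_len, latency_ms) and a hypothetical file name (profile_results.csv); replace them with whatever schema your profile.sh run actually emits.

import csv
from collections import defaultdict

# Hypothetical names; adjust to the CSV that profile.sh actually produces.
CSV_PATH = "profile_results.csv"

# (batch_size, input_len) -> list of measured latencies in milliseconds
latency_samples = defaultdict(list)

with open(CSV_PATH, newline="") as f:
    for row in csv.DictReader(f):
        key = (int(row["batch_size"]), int(row["input_len"]))
        latency_samples[key].append(float(row["latency_ms"]))

# Average repeated measurements so the simulator sees one latency per configuration.
perf_model = {key: sum(vals) / len(vals) for key, vals in latency_samples.items()}

for (batch, input_len), latency_ms in sorted(perf_model.items()):
    print(f"batch={batch:4d} input_len={input_len:5d} latency={latency_ms:9.3f} ms")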

Latest News

  • [2024/10/22] New 📝 Step-by-step instructions on how to ✅ Optimize LLMs with NVIDIA TensorRT-LLM, ✅ Deploy the optimized models with Triton Inference Server, ✅ Autoscale LLMs deployment in a Kubernetes environment. 🙌 Technical Deep Dive: ➡️ link
  • [2024/10/07] 🚀🚀🚀Optimizing Microsoft Bing Visual Search with NVIDIA Accelerated Libraries ➡️ link

  • [2024/09/29] 🌟 AI at Meta PyTorch + TensorRT v2.4 🌟 ⚡TensorRT 10.1 ⚡PyTorch 2.4 ⚡CUDA 12.4 ⚡Python 3.12 ➡️ link

  • [2024/09/17] ✨ NVIDIA TensorRT-LLM Meetup ➡️ link

  • [2024/09/17] ✨ Accelerating LLM Inference at Databricks with TensorRT-LLM ➡️ link

  • [2024/09/17] ✨ TensorRT-LLM @ Baseten ➡️ link

  • [2024/09/04] 🏎️🏎️🏎️ Best Practices for Tuning TensorRT-LLM for Optimal Serving with BentoML ➡️ link

  • [2024/08/20] 🏎️SDXL with #TensorRT Model Optimizer ⏱️⚡ 🏁 cache diffusion 🏁 quantization aware training 🏁 QLoRA 🏁 #Python 3.12 ➡️ link

  • [2024/08/13] 🐍 DIY Code Completion with #Mamba ⚡ #TensorRT #LLM for speed 🤖 NIM for ease ☁️ deploy anywhere ➡️ link

  • [2024/08/06] 🗫 Multilingual Challenge Accepted 🗫 🤖 #TensorRT #LLM boosts low-resource languages like Hebrew, Indonesian and Vietnamese ⚡➡️ link

  • [2024/07/30] Introducing🍊 @SliceXAI ELM Turbo 🤖 train ELM once ⚡ #TensorRT #LLM optimize ☁️ deploy anywhere ➡️ link

  • [2024/07/23] 👀 @AIatMeta Llama 3.1 405B trained on 16K NVIDIA H100s - inference is #TensorRT #LLM optimized ⚡ 🦙 400 tok/s - per node 🦙 37 tok/s - per user 🦙 1 node inference ➡️ link

  • [2024/07/09] Checklist to maximize multi-language performance of @meta #Llama3 with #TensorRT #LLM inference: ✅ MultiLingual ✅ NIM ✅ LoRA tuned adaptors ➡️ Tech blog

  • [2024/07/02] Let the @MistralAI MoE tokens fly 📈 🚀 #Mixtral 8x7B with NVIDIA #TensorRT #LLM on #H100. ➡️ Tech blog

Previous News

TensorRT-LLM Overview

TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, in-flight batching, paged KV caching, quantization (FP8, INT4 AWQ, INT8 SmoothQuant, ++) and much more, to perform inference efficiently on NVIDIA GPUs.

TensorRT-LLM provides a Python API to build LLMs into optimized TensorRT engines. It contains runtimes in Python (bindings) and C++ to execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can be executed on a wide range of configurations from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
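
As a rough illustration of that Python API, the sketch below compiles a model into a TensorRT engine and generates text with the high-level LLM interface. This is a minimal sketch assuming a recent upstream release that exposes tensorrt_llm.LLM and SamplingParams (this fork follows the older explicit build-and-run flow shown in examples/gpt), and "gpt2" is just a placeholder Hugging Face model id.

from tensorrt_llm import LLM, SamplingParams  # high-level API in recent upstream releases (assumption)

# Engine compilation happens ahead of time inside LLM(); generation then reuses the engine.
llm = LLM(model="gpt2")  # placeholder checkpoint; any supported model path or HF id works

prompts = ["TensorRT-LLM is"]
sampling = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)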

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs via a PyTorch-like Python API. Refer to the Support Matrix for a list of supported models.

TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library. It leverages much of TensorRT's deep learning optimizations and adds LLM-specific optimizations on top, as described above. TensorRT is an ahead-of-time compiler; it builds "Engines" which are optimized representations of the compiled model containing the entire execution graph. These engines are optimized for a specific GPU architecture, and can be validated, benchmarked, and serialized for later deployment in a production environment.
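
To make the ahead-of-time flow concrete, the sketch below deserializes a previously built engine directory and runs generation through the Python runtime. It is a minimal sketch that assumes an engine already exists under ./engine_outputs (for example, one produced by examples/gpt/make_gpt.sh) and follows the ModelRunner helper used by the upstream examples/run.py; parameter names can differ between releases, so treat it as orientation only.

from transformers import AutoTokenizer          # assumption: a GPT-2 tokenizer matches the built engine
from tensorrt_llm.runtime import ModelRunner    # Python runtime that loads serialized engines

engine_dir = "./engine_outputs"                 # hypothetical path to an already-built engine
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Deserialize the engine that was compiled ahead of time for this specific GPU.
runner = ModelRunner.from_dir(engine_dir=engine_dir)

# Tokenize a prompt and run generation through the runtime.
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.int()
output_ids = runner.generate(batch_input_ids=[input_ids[0]],
                             max_new_tokens=32,
                             end_id=tokenizer.eos_token_id,
                             pad_id=tokenizer.eos_token_id)

# output_ids is shaped [batch, beams, seq]; decode the first beam of the first request.
print(tokenizer.decode(output_ids[0][0].tolist(), skip_special_tokens=True))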

Getting Started

To get started with TensorRT-LLM, visit our documentation.

Community

  • Model zoo (generated by TRT-LLM rel 0.9 a9356d4b7610330e89c1010f342a9ac644215c52)
