TensorRT Edge-LLM 0.8.0 Release 2026-06-02
We are very excited to announce release 0.8.0 of TensorRT Edge-LLM!
TensorRT Edge-LLM 0.8.0 completes the transition to the checkpoint-based export workflow. The tensorrt_edgellm package is now the official front-end for quantization and ONNX export.
This release has been validated on the primary NVIDIA edge platforms, including Jetson Orin and Thor-class devices, and adds support to the just released Jetpack 7.2, which can be downloaded from Nvidia’s official site. As a general sizing guideline, a model is expected to run when the target device has at least the model size plus approximately 3 GB of available memory for runtime overhead. Please see our updated Quick Start Guide, Supported Model List, and model examples for the latest deployment guidance.
Breaking Changes
- The legacy ONNX export workflow has been removed. The checkpoint-based workflow is now the official TensorRT Edge-LLM export path.
experimental/llm_loaderandexperimental/quantizationfunctionality has moved into thetensorrt_edgellmpackage.- Use the unified CLI commands:
tensorrt-edgellm-quantizetensorrt-edgellm-exporttensorrt-edgellm-merge-loratensorrt-edgellm-reduce-vocabtensorrt-edgellm-preprocess-audio
- Older per-component export commands such as
tensorrt-edgellm-export-llm,tensorrt-edgellm-export-visual, andpython -m llm_loader.export_all_clishould be replaced bytensorrt-edgellm-export.
Key Features
- Promoted
tensorrt_edgellmas the official checkpoint-based quantization and ONNX export workflow, with unified CLI commands for export, quantization, LoRA, vocabulary reduction, and audio preprocessing - Added Qwen3/Qwen3.5/Qwen3.6 MoE NVFP4 export
- Reduced engine build memory for INT4 FFN, INT4 MoE, and LM-head weights
- Added OpenAI-compatible server tool-calling support, per-request stop strings, and Qwen3-TTS streaming improvements
- Updated dependency pins, including PyTorch 2.12, Transformers 5.9, and ModelOpt 0.44
Other Important Features
Runtime and Performance
- Added pluggable decoding support for vanilla, EAGLE, and MTP runtime paths
- Added Qwen3-ASR export and quantization workflows, including independent LLM and audio-encoder precision selection, including FP8 Qwen3 ASR Audio Encoder
- Added fused gate+up and XQA kernel support for new MoE configurations
- Fixed XQA kernel-loader thread-safety and CUDA 13.3 MoE top-k softmax build issues
Export and Quantization
- Improved mixed-precision quantization handling for fused QKV and gate/up projections
Server and API
- Added experimental Dockerfiles for containerized development
- Expanded high-level Python API and server validation for LLM, VLM, and streaming flows
Documentation
- Updated documentation for checkpoint export, quantization, LoRA, FP8 KV cache, FP8 embedding, speculative decoding, ASR, TTS, and server usage
NVIDIA Contributors
@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @poweiw @yuanyao-nv @Jasper-NV @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe @nvxingkaiz @ever-wong @zhijial-nvidia @willg-nv @levichen-nvidia @xiangg-nv @nvyocox @nv-samcheng