TensorRT Edge-LLM 0.8.0 Release 2026-06-02

We are very excited to announce release 0.8.0 of TensorRT Edge-LLM!

TensorRT Edge-LLM 0.8.0 completes the transition to the checkpoint-based export workflow. The tensorrt_edgellm package is now the official front-end for quantization and ONNX export.

This release has been validated on the primary NVIDIA edge platforms, including Jetson Orin and Thor-class devices, and adds support to the just released Jetpack 7.2, which can be downloaded from Nvidia’s official site. As a general sizing guideline, a model is expected to run when the target device has at least the model size plus approximately 3 GB of available memory for runtime overhead. Please see our updated Quick Start Guide, Supported Model List, and model examples for the latest deployment guidance.

Breaking Changes

The legacy ONNX export workflow has been removed. The checkpoint-based workflow is now the official TensorRT Edge-LLM export path.
experimental/llm_loader and experimental/quantization functionality has moved into the tensorrt_edgellm package.
Use the unified CLI commands:
- tensorrt-edgellm-quantize
- tensorrt-edgellm-export
- tensorrt-edgellm-merge-lora
- tensorrt-edgellm-reduce-vocab
- tensorrt-edgellm-preprocess-audio
Older per-component export commands such as tensorrt-edgellm-export-llm, tensorrt-edgellm-export-visual, and python -m llm_loader.export_all_cli should be replaced by tensorrt-edgellm-export.

Key Features

Promoted tensorrt_edgellm as the official checkpoint-based quantization and ONNX export workflow, with unified CLI commands for export, quantization, LoRA, vocabulary reduction, and audio preprocessing
Added Qwen3/Qwen3.5/Qwen3.6 MoE NVFP4 export
Reduced engine build memory for INT4 FFN, INT4 MoE, and LM-head weights
Added OpenAI-compatible server tool-calling support, per-request stop strings, and Qwen3-TTS streaming improvements
Updated dependency pins, including PyTorch 2.12, Transformers 5.9, and ModelOpt 0.44

Other Important Features

Runtime and Performance

Added pluggable decoding support for vanilla, EAGLE, and MTP runtime paths
Added Qwen3-ASR export and quantization workflows, including independent LLM and audio-encoder precision selection, including FP8 Qwen3 ASR Audio Encoder
Added fused gate+up and XQA kernel support for new MoE configurations
Fixed XQA kernel-loader thread-safety and CUDA 13.3 MoE top-k softmax build issues

Export and Quantization

Improved mixed-precision quantization handling for fused QKV and gate/up projections

Server and API

Added experimental Dockerfiles for containerized development
Expanded high-level Python API and server validation for LLM, VLM, and streaming flows

Documentation

Updated documentation for checkpoint export, quantization, LoRA, FP8 KV cache, FP8 embedding, speculative decoding, ASR, TTS, and server usage

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @poweiw @yuanyao-nv @Jasper-NV @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe @nvxingkaiz @ever-wong @zhijial-nvidia @willg-nv @levichen-nvidia @xiangg-nv @nvyocox @nv-samcheng

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v0.8.0

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

TensorRT Edge-LLM 0.8.0 Release 2026-06-02

Breaking Changes

Key Features

Other Important Features

Runtime and Performance

Export and Quantization

Server and API

Documentation

NVIDIA Contributors

Contributors

Uh oh!