Skip to content

v0.8.0

Latest

Choose a tag to compare

@JCalafato JCalafato released this 03 Jun 18:40
f9cc746

TensorRT Edge-LLM 0.8.0 Release 2026-06-02

We are very excited to announce release 0.8.0 of TensorRT Edge-LLM!

TensorRT Edge-LLM 0.8.0 completes the transition to the checkpoint-based export workflow. The tensorrt_edgellm package is now the official front-end for quantization and ONNX export.

This release has been validated on the primary NVIDIA edge platforms, including Jetson Orin and Thor-class devices, and adds support to the just released Jetpack 7.2, which can be downloaded from Nvidia’s official site. As a general sizing guideline, a model is expected to run when the target device has at least the model size plus approximately 3 GB of available memory for runtime overhead. Please see our updated Quick Start Guide, Supported Model List, and model examples for the latest deployment guidance.

Breaking Changes

  • The legacy ONNX export workflow has been removed. The checkpoint-based workflow is now the official TensorRT Edge-LLM export path.
  • experimental/llm_loader and experimental/quantization functionality has moved into the tensorrt_edgellm package.
  • Use the unified CLI commands:
    • tensorrt-edgellm-quantize
    • tensorrt-edgellm-export
    • tensorrt-edgellm-merge-lora
    • tensorrt-edgellm-reduce-vocab
    • tensorrt-edgellm-preprocess-audio
  • Older per-component export commands such as tensorrt-edgellm-export-llm, tensorrt-edgellm-export-visual, and python -m llm_loader.export_all_cli should be replaced by tensorrt-edgellm-export.

Key Features

  • Promoted tensorrt_edgellm as the official checkpoint-based quantization and ONNX export workflow, with unified CLI commands for export, quantization, LoRA, vocabulary reduction, and audio preprocessing
  • Added Qwen3/Qwen3.5/Qwen3.6 MoE NVFP4 export
  • Reduced engine build memory for INT4 FFN, INT4 MoE, and LM-head weights
  • Added OpenAI-compatible server tool-calling support, per-request stop strings, and Qwen3-TTS streaming improvements
  • Updated dependency pins, including PyTorch 2.12, Transformers 5.9, and ModelOpt 0.44

Other Important Features

Runtime and Performance

  • Added pluggable decoding support for vanilla, EAGLE, and MTP runtime paths
  • Added Qwen3-ASR export and quantization workflows, including independent LLM and audio-encoder precision selection, including FP8 Qwen3 ASR Audio Encoder
  • Added fused gate+up and XQA kernel support for new MoE configurations
  • Fixed XQA kernel-loader thread-safety and CUDA 13.3 MoE top-k softmax build issues

Export and Quantization

  • Improved mixed-precision quantization handling for fused QKV and gate/up projections

Server and API

  • Added experimental Dockerfiles for containerized development
  • Expanded high-level Python API and server validation for LLM, VLM, and streaming flows

Documentation

  • Updated documentation for checkpoint export, quantization, LoRA, FP8 KV cache, FP8 embedding, speculative decoding, ASR, TTS, and server usage

NVIDIA Contributors

@fans-nv @nvluxiaoz @nvamberl @ruocheng-nv @poweiw @yuanyao-nv @Jasper-NV @charllll @mahu888 @xinrzhang-star @JCalafato @nvmbreughe @nvxingkaiz @ever-wong @zhijial-nvidia @willg-nv @levichen-nvidia @xiangg-nv @nvyocox @nv-samcheng