v1.3.0rc20

Pre-release

@mikeiovine mikeiovine released this 30 Jun 01:11
· 51 commits to main since this release

This RC is the last version to support the TensorRT backend; the TensorRT backend will be removed in the next version.

  • Known Issues

    • DeepSeek V3/V3.2 can crash with an illegal memory access or hang during warm-up.
    • Autotuning for Qwen3-family models can crash with "Assertion failed: Failed to initialize cutlass TMA WS grouped gemm."
  • API

    • Add API to configure TeaCache coefficients (#13170)
    • BREAKING CHANGE: Make request chat_template opt-in (#14646)
  • Feature

    • Add preparatory work for DeepSeek V4 (#15378, #15379, #15381, #15394, #15402, #15222)
    • Add MXFP8 weight format plus CUTLASS W8A8 Linear and MoE (#14962)
    • Add Marlin NVFP4 backend for MoE and Linear on Hopper (#13476)
    • Add CUDA graph wrapper for multimodal encoders (#14829)
    • Support cross-attention with FlashInfer TRT-LLM Gen kernels on Blackwell (#15429)
    • Support post-norm and per-aux fc_norm for Eagle3 draft models (Eagle 3.1) (#14988)
    • Add EPLB support for Qwen3.5 (#15543)
    • Optimize CuteDSL NVFP4 MoE grouped/SwiGLU GEMM accumulation pipeline (#15258)
    • Add CuTe DSL GVR-TopK load-balance optimization (#15304)
    • Enable split-KV heuristic for low-occupancy cross-attention in LTX-2 FA4 (#15399)
    • Fuse MLP up-GEMM + bias + GELU(tanh) + NVFP4-quant into the CuteDSL epilogue for LTX2 and WAN (#15299)
    • Add async MP4 encoding and a configurable noise latent via environment variables in VisualGen (#15229)
  • Fix

    • Harden disagg cache transceiver teardown (#15422)
    • Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice (#15461)
    • Fix overallocation of draft KV cache (#15017)
    • Disable NCCL window buffers on GB10 (#15559)
    • Fix wrong NCCL fallback in nemotron-h (#15294)
    • Fix CuteDSL NVFP4 EPLB weight layout (#15538)
    • Enable CuTe DSL BF16 kernels for SM100 PP (#14993)
    • Fix Gemma4 multimodal vision TP and xgrammar startup crashes (#15566)
    • Add necessary methods for guided decoding in Kimi K2.5 (#15180)
    • Re-enable Ulysses for LTX-2 v2a cross-attention (#15303)
    • Fix passing scaled timestep to time_embedder in Cosmos3 (#15545)
    • Clarify and align trtllm-bench runtime logging (#15254)
  • Documentation

    • Add deployment guide for MiniMax M3 (#15587)
    • Add Qwen Image visual generation examples (#15235)
  • Benchmark

    • Add Qwen-Image-Bench evaluator (#14837)
    • Add modularized perf tests for attention and MoE (discrete/continuous) (#15541)
    • Add Qwen3.5-397B-A17B-NVFP4 B200 aggregated perf-sanity tests (#15650)
    • Add DeepSeek R1 0528 FP4 performance test to llm_perf_core.yml (#15453)
  • Test & Infra

    • Move more test cases to post-merge (#15568)
    • Stabilize perf-sanity tests (#15440)
    • Avoid type checking failures due to pip dependency resolution (#15517)
    • Gate GPT-OSS TRT-LLM Gen MoE tests to SM100/SM103 (#15128)
    • Add GPT-OSS disagg test for transceiver v2 (#15301)
    • Fix Cosmos3 tests after VisualGen config split (#15170)
    • Fix visual gen test leak issue (#15236)
    • Fix Qwen3-Next bf16 4gpu test (#15206)
    • Clean up Nemotron test cases (#15586)
    • Fix and unwaive step3p7 test cases (#15583)
    • Add MiniMax test coverage with multi-node M2.5 checkpoint evaluation (#15361)
    • Add GLM NVFP4 stress test (#15437)
    • Remove unreferenced accuracy tests and orphaned entries (#15593)
    • Update .gitattributes (#15606)

What's Changed

  • [None][fix] AutoDeploy: Fixed wrong dist_backend AUTO detection when using trtllm-llmapi-launch by @MrGeva in #15423
  • [None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #15341
  • [TRTLLMINF-81][feat] Avoid failed runners on infra retry by @dpitman-nvda in #15237
  • [https://nvbugs/6179661][fix] Harden disagg cache transceiver teardown by @chienchunhung in #15422
  • [https://nvbugs/6273846][test] gate GPT-OSS TRTLLM Gen MoE tests to SM100/SM103 by @dongfengy in #15128
  • [None][fix] avoid type checking failures due to pip dependency resolution by @ixlmar in #15517
  • [None][feat] VisualGen: async mp4 encode + fixed noise latent via env vars by @wu6u3tw in #15229
  • [https://nvbugs/6337235][test] Fix MX/GMS model loader fixtures by @chienchunhung in #15471
  • [None][test] Un-waive K2.5 Thinking FP4 disagg-NIXL e2e/gen_only tests by @chenfeiz0326 in #15443
  • [None][test] Waive 3 failed cases for main in QA CI by @tensorrt-cicd in #15509
  • [None][test] Waive 11 failed cases for main in QA CI by @tensorrt-cicd in #15506
  • [None][test] Waive 4 failed cases for main in QA CI by @tensorrt-cicd in #15505
  • [TRTLLM-13550][feat] WideEP FT: add MPI signal handler replacement (1d.0) by @chienchunhung in #14160
  • [None][test] Remove 60 closed-bug waive entries for main by @tensorrt-cicd in #15511
  • [#3237][fix] Support negative numbers in MajorityVote digit validation by @nikJ13 in #12294
  • [None][test] Waive 10 failed cases for main in post-merge by @tensorrt-cicd in #15535
  • [None][test] Waive 9 failed cases for main in QA CI by @tensorrt-cicd in #15504
  • [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15499
  • [None][test] Waive 4 failed cases for main in QA CI by @tensorrt-cicd in #15510
  • [None][fix] AutoDeploy: handle torch dist all_gather in multi_stream MLA transform by @MrGeva in #15456
  • [None][feat] Add Gemma-4 NVFP4 quantized models to AutoDeploy registry by @marinayanov in #15382
  • [None][fix] Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice by @achartier in #15461
  • [https://nvbugs/6306936][test] Re-enable AutoDeploy disagg tests by @govind-ramnarayan in #15325
  • [None][infra] split single-node perf sanity GB200 by @tburt-nv in #15548
  • [None][chore] Bump version to 1.3.0rc20 by @yuanjingx87 in #15551
  • [#10710][fix] clarify and align trtllm-bench runtime logging by @marinayanov in #15254
  • [https://nvbugs/6290345][fix] Fix allreduce benchmark input setup by @nv-lschneider in #15427
  • [None][feat] DSv4 prep: IndexerTopK and TopK primitives by @lfr-0531 in #15381
  • [None][perf] Cutedsl NVF4 MOE: grouped/swiglu GEMM: Fix acc pipeline release arrive threads + FC2 meta stage code clean by @liyuhannnnn in #15258
  • [https://nvbugs/6271740][test] Update llm_perf_core.yml to include new performance test for DeepSeek R1 0528 FP4 model by @yufeiwu-nv in #15453
  • [None][fix] Stabilize perf-sanity tests by @chenfeiz0326 in #15440
  • [None][test] fix Cosmos3 tests after VisualGen config split by @bobboli in #15170
  • [None][feat] DSv4 prep: compressor and mHC primitives by @lfr-0531 in #15379
  • [None][infra] Waive 3 failed cases for main in post-merge 2802 by @ZhanruiSunCh in #15571
  • [https://nvbugs/6264844][fix] Fix wrong NCCL fallback in nemotron-h by @Wanli-Jiang in #15294
  • [None][test] Waive 6 failed cases for main in QA CI by @tensorrt-cicd in #15570
  • [https://nvbugs/6344108][fix] skip TestNemotron3Super120B on pre-blackwell by @bo-nv in #15539
  • [None][fix] Fix passing scaled timestep to time_embedder in Cosmos3 by @bastefaniak in #15545
  • [None][chore] Remove nv-internal-release guardword comments in mega_moe_nvfp4 by @xxi-nv in #15575
  • [None][ci] move more test cases to post merge by @QiJune in #15568
  • [https://nvbugs/6185146][fix] Use mat_a.new_empty([m, n_out//2]) / input_scale.new_empty([sf_size]) in the by @tensorrt-cicd in #14710
  • [TRTLLM-35882][feat] cute dsl gvr-topk load-balance optimization by @limin2021 in #15304
  • [None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #15579
  • [None][test] waive hang issues by @xinhe-nv in #15576
  • [None][test] waive hang issues by @xinhe-nv in #15581
  • [#14874][feat] AutoDeploy : Perf optimization for gpt-oss-120b for low conc by @taylor-yb-lee in #15531
  • [TRTLLM-12982][perf] reuse multi-item scoring position_ids and params by @ixlmar in #15413
  • [TRTLLM-13599][test] Refine Qwen3.5 test cases by @nv-guomingz in #15544
  • [TRTLLMINF-111][infra] Reuse image sqsh file by @EmmaQiaoCh in #15147
  • [None][feat] DSv4 prep: MoE routing and backend support by @lfr-0531 in #15402
  • [None][feat] DSv4 prep: runtime cache foundations by @lfr-0531 in #15378
  • [https://nvbugs/6156233][fix] Lower GSM8K reference for the three GPT-OSS/20B-MXFP4 entries with… by @tensorrt-cicd in #15393
  • [None][chore] Small cleanups to MultimodalModelMixin by @2ez4bz in #15322
  • [TRTLLM-13123][feat] CUDA graph wrapper for multimodal encoders by @2ez4bz in #14829
  • [TRTLLM-12622][feat] Add native post-processing hook to trtllm-serve by @xwang233 in #15239
  • [None][feat] Add Qwen Image visual generation examples by @yibinl-nvidia in #15235
  • [TRTLLM-13490][feat] Support cross-attention with FlashInfer TRTLLM-Gen kernels on Blackwell by @cascade812 in #15429
  • [None][fix] LTX-2: re-enable Ulysses for v2a cross-attention by @luyiyun1021 in #15303
  • [TRTLLM-13246][feat] Wave 1: migrate aliases to setup_aliases and stage GMS RO load by @chienchunhung in #15014
  • [None][feat] Support post-norm and per-aux fc_norm for Eagle3 draft models by @Dogacel in #14988
  • [None][fix] fix FA4 install in devel docker by @o-stoner in #14706
  • [https://nvbugs/6276842][test] Loosen rtol/atol on encoder CUDA graph logits parity check by @tingyangk in #15527
  • [#15179][fix] Add necessary methods for guided decoding in Kimi K2.5 by @chungen04 in #15180
  • [None][test] Waive failed unittest on all devices (nvbugs/6335726) by @guqiqi in #15585
  • [None][infra] add blossom-ci authorized users by @niukuo in #15549
  • [https://nvbugs/6166097][fix] Fix CuteDSL NVFP4 EPLB weight layout by @nv-xtf in #15538
  • [None][test] GPT-OSS disagg test for transceiver v2 by @Shixiaowei02 in #15301
  • [None][feat] Add BaseResourceManager-based KV-cache compression manager framework by @Hudayday in #15106
  • [None][infra] use default split when CBTS test-db download fails by @crazydemo in #15592
  • [#12715][fix] Disable NCCL window buffers on GB10 by @nv-lschneider in #15559
  • [TRTLLM-11353][feat] API to configure TeaCache coefficients by @o-stoner in #13170
  • [TRTLLM-12242][feat] Add Marlin NVFP4 backend for MoE and Linear on Hopper by @xuantengh in #13476
  • [https://nvbugs/6094068][fix] Fix Qwen3-Next bf16 4gpu test by @JadoTu in #15206
  • [None][feat] Dis-agg transceiver mass integration from the DSV4 branch by @Shixiaowei02 in #15222
  • [https://nvbugs/6224637][fix] Enable CuTe DSL BF16 kernels for SM100 PP by @yuxianq in #14993
  • [https://nvbugs/6256531][test] Unwaive Llama guided decoding xgrammar by @sunnyqgg in #15240
  • [None][feat] DSv4: sparse cache manager adapter by @lfr-0531 in #15394
  • [TRTLLM-12982][chore] relocate torch_multi_arange by @ixlmar in #15416
  • [None][infra] Waive 15 failed cases for main in post-merge 2804 by @ZhanruiSunCh in #15620
  • [None][test] Waive hang issues by @xinhe-nv in #15609
  • [TRTLLM-13371][perf] LTX-2 FA4: enable split-KV heuristic (num_splits=0) for low-occupancy cross-attn by @luyiyun1021 in #15399
  • [https://nvbugs/6346546][fix] fix mRoPE CUDA graph gate for text requests by @yechank-nvidia in #15589
  • [https://nvbugs/6274932] [fix] Fix and unwaive step3p7 test cases by @kaiyux in #15583
  • [TRTLLM-13600][test] Clean up Qwen3 test cases by @nv-guomingz in #15591
  • [TRTLLM-13601][test] Clean up Nemotron test cases by @nv-guomingz in #15586
  • [None][infra] Fix node list query failing on tcsh login nodes by @yiqingy0 in #15623
  • Revert "[TRTLLM-12622][feat] Add native post-processing hook to trtllm-serve" by @tburt-nv in #15629
  • [TRTLLM-13612][test] Remove unreferenced accuracy tests and orphaned … by @nv-guomingz in #15593
  • [https://nvbugs/6215688][fix] Fix visual gen test leaked issue by @yibinl-nvidia in #15236
  • [https://nvbugs/6021427][fix] BREAKING CHANGE: Make request chat_template opt-in by @yibinl-nvidia in #14646
  • [None][infra] AutoDeploy: Add trtllm runner for standalone llm-c by @bmarimuthu-nv in #15630
  • [https://nvbugs/6274614][fix] remove spec tokens env for stress test by @chuangz0 in #15153
  • [TRTLLM-13444][test] Add Qwen-Image text-to-image unit tests by @yingguo-trt in #15580
  • [TRTLLM-13247][feat] Wave 2: stage Linear and Attention transforms by @chienchunhung in #15288
  • [https://nvbugs/6368480][fix] Cache the SM count once in FmhaDispatcher's constructor and reuse the cached… by @chenfeiz0326 in #15611
  • [None][test] Add modularized perf tests (attention + MoE discrete/continuous) by @ruodil in #15541
  • [#15565][fix] AutoDeploy: Fix Super MTP IMA introduced by checkpointing replay by @galagam in #15622
  • [#15613][fix] Gemma4 multimodal: fix vision TP and xgrammar startup crashes by @Thachnh in #15566
  • [TRTLLM-12762][test] Add Test coverage for MiniMax Model with multi-node, M2.5 checkpoints eval by @jieli-matrix in #15361
  • [TRTLLM-13575][feat] Add EPLB support for Qwen3.5 by @nv-guomingz in #15543
  • [None][test] add GLM nvfp4 stress test by @xinhe-nv in #15437
  • [TRTLLM-12982][chore] improve multi-item scoring request validation by @ixlmar in #15627
  • [None][test] Add Qwen3.5-397B-A17B-NVFP4 B200 aggregated perf-sanity tests by @chenfeiz0326 in #15650
  • [None][infra] take test durations into account to determine cbts splits num by @crazydemo in #15614
  • [None][doc] Add deploy guide for Minimax M3 by @WeiHaocheng in #15587
  • [None][chore] Update .gitattributes by @ziyixiong-nv in #15606
  • [https://nvbugs/6239637][fix] Unwaive Qwen3.5 cases on A100 platform by @nv-guomingz in #15481
  • [TRTLLM-13712][feat] Add Qwen-Image-Bench evaluator by @yibinl-nvidia in #14837
  • [https://nvbugs/6248783][test] Unwaive Qwen3 skip softmax test by @bobboli in #15652
  • [None][fix] User/tjohnsen/evict empty blocks first by @thorjohnsen in #11685
  • [https://nvbugs/6293536][fix] Stage KV block offsets through a fresh host buffer by @thorjohnsen in #15546
  • [TRTLLM-13370][perf] LTX2 + WAN: Fuse MLP up-GEMM + bias + GELU(tanh) + NVFP4-quant into CuteDSL epilogue by @luyiyun1021 in #15299
  • [https://nvbugs/6248837][chore] waive memory polluters by @tburt-nv in #15665
  • [https://nvbugs/6269778][fix] Fix overallocation of draft KV cache by @mikeiovine in #15017
  • [None][feat] add MXFP8 weight format + CUTLASS W8A8 Linear and MoE by @WeiHaocheng in #14962
  • [https://nvbugs/6344612][test] relax GPT-OSS GPQA references due to high variance in random sampling by @dongfengy in #15567
  • [https://nvbugs/6062416][fix] Cache NCCL window allocation failures by size by @nv-lschneider in #15596

Full Changelog: v1.3.0rc19...v1.3.0rc20