v1.3.0rc19

Pre-release

@mikeiovine mikeiovine released this 23 Jun 16:49
· 28 commits to main since this release
  • Known Issues

    • Llama 3.1 8B FP8 can hang during the autotuner warmup on GB200.
  • Model Support

    • Support NVIDIA Wan2.2-T2V quantized checkpoints (#15093)
    • Enable MTP for Step-3.7 NVFP4 and port Step-3.7VL vision tower to TRT-LLM modules (#14926)
    • Support T5 and BART in the PyTorch backend (#13919)
    • Support MiniMax-M3 in the PyTorch backend (#15292)
  • API

    • Align VisualGen serve request schema with VisualGenParams (#14733)
    • Support multi-item scoring in LLM.encode (#14693)
    • Drop legacy --extra_visual_gen_options CLI alias (#15262)
  • Feature

    • Enable TRTLLM MoE backend for Nemotron-H BF16 checkpoint (#14944)
    • Add async Ulysses pipeline (enabled for LTX-2 and WAN) (#13978)
    • Make TrtllmGenAttention the default decode backend on Blackwell+ (#14618)
    • Skip redundant data expand in DeepGemmFusedMoE via fused expand+quant Triton kernel (#14591)
    • Add Prometheus metrics for prompt cache, speculative decoding, perplexity, and batch occupancy (#12636)
    • Add Indexer TopK single-block / multi-pass radix implementation (#14268)
    • Enable gen-only speculative decoding for disagg setups (#14546)
    • Support EAGLE3 dynamic trees on Blackwell (#12958)
    • Add CUDA graph support for per-expert LoRA in Cutlass backend (#14881)
    • Add support for beam search in disaggregated serving (#14876)
    • Add maximal LLMAPI capture in usage telemetry (#14398)
    • Optimize Qwen2.5/3/3.5-VL performance (#11943)
    • Add skip-softmax TMA-load + sync-MMA warp-specialized context FMHA for sm_120/sm_121 (#15163)
    • Enable TRTLLM cross attention backend (#15345)
    • Support per-request mm_processor_kwargs for Qwen3-VL (#14702)
    • Add prefetch_reuse_blocks and configurable prefetch count (#15149)
    • Add MegaMoECuteDsl NVFP4 MoE backend (#14608)
    • Make EAGLE3 honor sampling params by default (#14745)
    • Add multiple FMHA library support to TRTLLM attention backend (#15204)
    • Add checkpointing variant of replay for MTP for mamba models (#14203)
  • Fix

    • Remove redundant TikTokenTokenizer shim from Kimi-K2.5 input processor (#14741)
    • Rename misnamed tunable_fp4_quantize kwarg and add real SF-swizzle control (#15002)
    • Gate FlashInfer GDN kernels to supported configurations (#15094)
    • Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate (#15088)
    • Select CUTLASS MoE backend on non-Blackwell SMs for Qwen3.5-35B-A3B FP8 (#15081)
    • Fix SageAttention kernel regression by using static scheduler (#15047)
    • Fall back to local cache when loading tokenizer for gated models (#12998)
    • Fix PyExecutor FPM iteration timing (#14922)
    • Register multimodal placeholders for Qwen3.5 MoE VLM serving (#15079)
    • Fix and unwaive Nemotron-related bugs (#15085)
    • Guard DSA DSL atom-split against MTP draft next (#14891)
    • Scope disagg-ctx cache-transfer quorum vote to TP instead of WORLD (#15136)
    • Clear workspace in run_mla_generation to avoid illegal memory access (#15173)
    • Fix MAX_UTILIZATION reuse token budget (#15066)
    • Add kv_transfer_timeout_ms to avoid timeout (#15152)
    • Preserve ip:port for trtllm-serve visual-gen (#14355)
    • Fix guided decoding (xgrammar) + EAGLE-3 + draft_len_schedule crash during CUDA graph capture (#15023)
    • Stabilize Mamba replay state update (#14841)
    • Fix max_context_length value for attention workspace sizing (#15156)
    • Fix issue where host KV cache usage would double when speculative decoding is used (#14373)
    • Disable NCCL_SYMMETRIC tactic on GB10 (DGX Spark) (#12902)
    • Fix attentionOp FP8 MLA KV-reuse workspace calculation (#14852)
    • Fix beam search log_probs non-determinism with batch_size > 1 (#15125)
    • Forward secondary_offload_min_priority to KVCacheManager in PyTorch executor (#13768)
    • Enable multi-block mode for XQA HMMA spec-dec (#15312)
    • Fix TinyGEMM barrier bug (#15338)
    • Fix stale sparse attention kwargs (#15460)
    • Fix CppMambaHybridCacheManager to handle dp dummy request (#15054)
    • Fix embedding vocab mask for rejection sampling in Kimi-K2.5 (#15233)
  • Documentation

    • Add FLUX visual generation examples (#14987)
    • Add Qwen3.5 deployment guide doc (#15111)
    • Fix stale --disable_xqa reference in legacy docs (#13395)
    • Add Cache-DiT documentation (#15268)
  • Benchmark

    • Weight trtllm-bench AR/AL averages by output length (#14998)
  • Test & Infra

    • Add accuracy tests for nemotron-v3-ultra (#14808)
    • Remove TestLlama4ScoutInstruct tests (#15144)
    • Require minimum of 4 GPUs in llm_perf_core.yml and add new performance tests (#15090)
    • Add DFlash coverage for Qwen3.5 MoE variant (#15132)
    • Add e2e example tests for flux1/2, ltx2, wan_i2v, and cosmos3 (#15126)
    • Enable disagg cancellation stress test (#15174)
    • Fix periodic-junit in unittest pytest (#14075)
    • Add K2.5 and GLM-5 to the CI perf tests (#14960)
    • Add Qwen3-32B FP8 disagg stress test (#14278)
    • Sunset old disagg test cases for the QA side (#15290)
    • Add e2e Tensor Parallel LPIPS tests for VisualGen (#15208)
    • Remove TensorRT performance baseline and update to PyTorch only (#15256)
    • Add integration tests for MoE LoRA and bugfixes (#15271)
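
The Benchmark entry above (#14998) changes how trtllm-bench averages AR/AL. A minimal sketch of what weighting an average by output length means, assuming AL denotes a per-request speculative-decoding acceptance length; the data, the `weighted_mean` helper, and all variable names are hypothetical and are not taken from trtllm-bench:

```python
from typing import Sequence


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Return sum(v * w) / sum(w); raise ValueError on invalid input."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical per-request data: (output length in tokens, acceptance length).
requests = [(10, 1.0), (990, 3.0)]
output_lens = [n for n, _ in requests]
accept_lens = [al for _, al in requests]

# Unweighted: every request counts equally, however short its output.
plain = sum(accept_lens) / len(accept_lens)
# Weighted: each request contributes in proportion to its output length.
weighted = weighted_mean(accept_lens, output_lens)

print(f"unweighted mean: {plain:.2f}")        # 2.00
print(f"length-weighted mean: {weighted:.2f}")  # 2.98
```

In this toy data the short request pulls the unweighted mean down to 2.00, while the length-weighted mean (2.98) stays close to the value seen by most generated tokens.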

What's Changed

  • [None][infra] Waive TestQwen3NextInstruct nvfp4 cases by @mzweilz in #15086
  • [https://nvbugs/6248757][fix] Avoid running all reduce in aux stream by @tensorrt-cicd in #14917
  • [https://nvbugs/6221483][fix] AutoDeploy: Fix Eagle metadata host syncs by @govind-ramnarayan in #14714
  • [None][feat] add FLUX visual generation examples by @karljang in #14987
  • [https://nvbugs/6261164][fix] AutoDeploy: Don't allocate speculative caches when speculation is off by @tensorrt-cicd in #15020
  • [https://nvbugs/6211189][fix] Lower the reference to 46.5 (matching cross-GPU empirical mean) and remove the t by @tensorrt-cicd in #14799
  • [None][refactor] split VisualGen pipeline and model configs by @bobboli in #14956
  • [TRTLLM-11457][feat] Async Ulysses pipeline (Enabled for LTX-2 + WAN) by @luyiyun1021 in #13978
  • [TRTLLM-11548][doc] Add Qwen3.5 deployment guide doc by @nv-guomingz in #15111
  • [https://nvbugs/6181383][fix] Build inner text/vision/audio sub-configs as empty PretrainedConfig() then setat by @tensorrt-cicd in #14399
  • [https://nvbugs/6273850][chore] waive TestQwen3_5_4B::test_bf16 for all GPUs by @tburt-nv in #15112
  • [None][doc] Add docs for AutoDeploy transforms by @bmarimuthu-nv in #15122
  • [None][infra] Waive 4 failed cases for main in post-merge 2769 by @ZhanruiSunCh in #15140
  • [https://nvbugs/6227203][fix] Remove redundant TikTokenTokenizer shim from KimiK25InputProcessor by @tianyuxbear in #14741
  • [None][fix] tunable_fp4_quantize: rename misnamed kwarg + add real SF-swizzle control by @luyiyun1021 in #15002
  • [None][test] Fix gen_only missing prev_device_step_time race in perf sanity by @tensorrt-cicd in #15108
  • [None][test] Fix disagg test result dir by @fredricz-20070104 in #14864
  • [TRTLLM-13332][test] Remove TestLlama4ScoutInstruct tests by @QiJune in #15144
  • [https://nvbugs/6266705][fix] Gate FlashInfer GDN kernels to supporte… by @nv-guomingz in #15094
  • [https://nvbugs/6255037][fix] Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate by @eopXD in #15088
  • [https://nvbugs/6194812][test] Update llm_perf_core.yml to require a minimum of 4 GPUs and add new performance tests by @yufeiwu-nv in #15090
  • [TRTLLMINF-112][infra] Reduce the waiting time between check node is online or not by @EmmaQiaoCh in #14819
  • [None][infra] Waive 1 failed cases for main in pre-merge 41821 by @ZhanruiSunCh in #15135
  • [None][infra] CBTS Layer 3: pass test-db via Artifactory instead of env var by @crazydemo in #15142
  • [TRTLLM-13264][feat] Add native bias epilogue to NVFP4 GEMM by @luyiyun1021 in #15053
  • [https://nvbugs/6278380][unwaive] unwaive ad cases by @crazydemo in #15148
  • [https://nvbugs/6244474][fix] AutoDeploy: Remove llama perf test from CI by @MrGeva in #15107
  • [https://nvbugs/6212252][fix] Select CUTLASS MoE backend on non-Blackwell SMs in TestQwen3_5_35B_A3B::test_fp8 by @xxi-nv in #15081
  • [TRTLLM-13302][feat] Register NVIDIA Wan2.2-T2V quantized checkpoints by @zhenhuaw-me in #15093
  • [None][chore] add VisualGen team as the codeowner of the VisualGen Attention by @zhenhuaw-me in #15150
  • [None][feat] Default on FlashInferTrtllmGenAttention by @yihwang-nv in #14618
  • [None][infra] Test DFW with BSL branch by @yuanjingx87 in #14597
  • [TRTLLM-12214][perf] customMoeRoutingKernel: lower BLOCK_SIZE to 128, raise maxNumBlocks by @xwang233 in #14590
  • [TRTLLM-12214][perf] DeepGemmFusedMoE: skip redundant data expand via fused expand+quant Triton kernel by @xwang233 in #14591
  • [TRTLLM-12648][test] implement disagg cancellation load thread by @chienchunhung in #15124
  • [None][fix] Fix regression from SageAttention kernel: Use static scheduler by @xrq-phys in #15047
  • [TRTLLM-12467][feat] EPD improvements by @venkywonka in #13864
  • [None][feat] Expose stored block-hash chain to KV cache connector by @jthomson04 in #14806
  • [#12805][fix] Fall back to local cache when loading tokenizer for gated models by @1MrazorT1 in #12998
  • [None][feat] Support partial RoPE fusion for Hopper kernels in XQA for Laguna by @DomBrown in #15110
  • [None][infra] Add nv-xtf, rahul-steiger-nv, tedzhouhk, tensorrt-cicd to blossom-ci allowlist by @ZhanruiSunCh in #14955
  • [None][feat] Add Prometheus metrics for prompt cache, speculative decoding, perplexity, and batch occupancy by @vedularaghu in #12636
  • [None][chore] Unwaive DSV32 helix tests by @brb-nv in #14871
  • [None][fix] unset UCX_TLS=tcp by @tburt-nv in #15008
  • [None][feat] Port 13 AutoDeploy custom models to sharding IR + opt them in via registry by @greg-kwasniewski1 in #14778
  • [None][chore] Make image paths absolute in blog22 by @brb-nv in #15177
  • Fix PyExecutor FPM iteration timing by @tedzhouhk in #14922
  • [#13816][feat] AutoDeploy: Optimize gpt-oss-120b perf by @taylor-yb-lee in #14202
  • [None][fix] Register Multimodal Placeholders for Qwen3.5 MoE VLM Serving by @anurags25 in #15079
  • [None][feat] Weight trtllm-bench AR/AL averages by output length by @zhaoyangwang-nvidia in #14998
  • [TRTLLM-13052][feat] Enable TRTLLM moe backend for nemotron-h BF16 ckpt by @Wanli-Jiang in #14944
  • [None][fix] Fix and unwaive nemotron related bugs by @Wanli-Jiang in #15085
  • [https://nvbugs/6140226][test] Add DFlash coverage for Qwen3.5 MoE variant by @yingguo-trt in #15132
  • [None][test] temporarily waive Cosmos3 B200 failures by @bobboli in #15195
  • [NVBUG-6241842][fix] DSA DSL atom-split: guard against MTP draft next… by @limin2021 in #14891
  • [#11423][feat] AutoDeploy: Basic Disagg Support by @govind-ramnarayan in #14057
  • [https://nvbugs/6280060][fix] Scope disagg-ctx cache-transfer quorum vote to TP instead of WORLD by @tensorrt-cicd in #15136
  • [None][test] Add e2e example tests for flux1/2, ltx2, wan_i2v, and cosmos3 by @chang-l in #15126
  • [#12632][feat] Add pipeline cache support for AutoDeploy by @nvchenghaoz in #13729
  • [None][test] Add support for nemotron_3_ultra_550b_nvfp4 model in performance tests and configurations by @yufeiwu-nv in #15166
  • [None][feat] Indexer TopK: single-block / multi-pass radix by @dcampora in #14268
  • [None][fix] Clear workspace in run_mla_generation to avoid potential illegal memory access issue by @yihwang-nv in #15173
  • [None][chore] Unwaive AutoDeploy accuracy tests by @bmarimuthu-nv in #14971
  • [None][test] Increase kv_transfer_timeout_ms for b200 deepseek-r1 disagg gen_only perf test by @tensorrt-cicd in #15205
  • [None][feat] Enable MTP for Step-3.7 NVFP4 and port Step-3.7VL vision tower to TRT-LLM modules by @kaiyux in #14926
  • [https://nvbugs/6266370][fix] Fix MAX_UTILIZATION reuse token budget on main by @brb-nv in #15066
  • [https://nvbugs/6272573][ci] Unwaive skipped test by @2ez4bz in #15118
  • [https://nvbugs/6245279][fix] AutoDeploy: Unwaive accuracy tests by @galagam in #15214
  • [TRTLLM-12491][feat] Align VisualGen serve request schema with VisualGenParams by @zhenhuaw-me in #14733
  • [None][test] Add MLA chunked-prefill SM dispatch regression coverage by @DhineshPonnarasan in #13904
  • [TRTLLM-12648][test] enable disagg cancellation stress test by @chienchunhung in #15174
  • [None][feat] Preserve cache_salt string in KV cache events by @jthomson04 in #13051
  • [https://nvbugs/6104831][fix] Port dataTransceiver shared_ptr lifetime fix by @chienchunhung in #14979
  • [None][fix] Fix AutoDeploy transform docs generation by @bmarimuthu-nv in #15228
  • [None][feat] Targeted warmup-waste cleanup by @dominicshanshan in #14609
  • [None][fix] Remove TLLM_RUBIN_FEATURES by @yuxianq in #15143
  • [https://nvbugs/6108994][fix] add kv_transfer_timeout_ms to avoid timeout by @bo-nv in #15152
  • [TRTLLM-12657][infra] Fix periodic-junit in unittest pytest by @yiqingy0 in #14075
  • [https://nvbugs/6143883][fix] Preserve ip:port for trtllm-serve visual-gen by @JunyiXu-nv in #14355
  • [TRTLLM-12958][feat] Enable gen-only spec dec by @bo-nv in #14546
  • [https://nvbugs/6162120][test] Remove 78 closed-bug waive entries for main by @tensorrt-cicd in #15061
  • [https://nvbugs/6278399][fix] Add x86_64 path using CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR with… by @tensorrt-cicd in #15129
  • [TRTLLM-11538][feat] Blackwell custom mask fmha support by @sunnyqgg in #12958
  • [None][infra] Waive 6 failed cases for main in post-merge 2773 by @ZhanruiSunCh in #15250
  • [None][feat] Enhance CuteDSL NVF4 MOE by @liyuhannnnn in #15092
  • [None][infra] Waive 3 failed cases for main in post-merge 2772 by @ZhanruiSunCh in #15253
  • [None][test] Update K2.5 and GLM-5 into CI Perf Test by @chenfeiz0326 in #14960
  • [None][feat] enable GQA and cross-attention for attn2d by @NVShreyas in #14961
  • [#12230][fix] Add bounds checking in autotuner _find_nearest_profile for SM121 by @mihai-chiorean in #12310
  • [None][refactor] visual_gen Attention: drop redundant enable_ulysses kwarg (rebase artifact from #13978) by @luyiyun1021 in #15141
  • [None][fix] Generalize FP8 checkpoint loading for Qwen3.5 by @amukkara in #15067
  • [#13858][fix] AutoDeploy fix the piecewise vlm issue by @nvchenghaoz in #14006
  • [TRTLLM-12507][feat] Cudagraph support for per-expert lora in Cutlass backend - Part 2 by @brb-nv in #14881
  • [None][test] Remove stale perf sanity waives by @cascade812 in #15269
  • [None][infra] Waive 8 failed cases for main in pre-merge 42699 by @ZhanruiSunCh in #15273
  • [None][fix] Install processor-output validation filter at module import by @aswinvisva in #14832
  • [None][infra] Waive 10 failed cases for main in pre-merge 42753 by @ZhanruiSunCh in #15275
  • [TRTLLM-12534][fix] Nemotron Nano - properly account for text prompts in inflight batching with EVS on by @moraxu in #15016
  • [None][doc] Fix stale --disable_xqa reference in legacy docs by @Erfandarzi in #13395
  • [TRTLLM-11403][doc] Cache-DiT documentation by @o-stoner in #15268
  • [#15022][fix] Guided decoding (xgrammar) + EAGLE-3 + draft_len_schedule reaching 0 crashes during CUDA graph capture, "bitmask must have the same batch size as logits" by @chungen04 in #15023
  • [TRTLLM-12154][test] Add Qwen3-32B FP8 disagg stress test by @brnguyen2 in #14278
  • [TRTLLM-13141][feat] Add backend-agnostic SourceIdentity gate for weight sharing by @chienchunhung in #14878
  • [None][feat] Add PyTorch reset_prefix_cache API by @milesial in #14970
  • [None][fix] Stabilize Mamba replay state update by @sunnyqgg in #14841
  • [None][infra] Waive remaining AutoDeploy Disagg tests until fix lands by @govind-ramnarayan in #15282
  • [None][test] Sunset the old disagg test cases for the qa side by @fredricz-20070104 in #15290
  • [None][infra] Waive 1 failed cases for main in pre-merge 42836 by @ZhanruiSunCh in #15293
  • [None][fix] Fix max_context_length value for attention workspace sizing by @pengbowang-nv in #15156
  • [TRTLLM-12038][feat] Add accuracy tests for nemotron-v3-ultra by @Wanli-Jiang in #14808
  • [#14672][fix] AutoDeploy: Vendor OpenELMConfig locally to fix OpenELM config loading by @plapagesse in #15175
  • [https://nvbugs/6035425][fix] Fix KV cache host splitting logic by @mikeiovine in #14373
  • [None][refactor] Move KV cache manager V2 to separate file by @jiaganc in #14680
  • [TRTLLM-12963][refactor] LTX-2 attention: drop dead k_pe parameter; require cached cross-attn by @luyiyun1021 in #14555
  • [TRTLLM-10184][chore] Remove legacy XQA precompiled path by @pengbowang-nv in #14941
  • [TRTLLM-35882][feat] cute dsl gvr-top multi-cta optimization by @limin2021 in #15198
  • [None][fix] Revert "Add PyTorch reset_prefix_cache API (#14970)" by @xxi-nv in #15306
  • Revert "[None][test] Add support for nemotron_3_ultra_550b_nvfp4 model in performance tests and configurations" by @tburt-nv in #15310
  • [https://nvbugs/6309375][test] AutoDeploy: Remove stale fallback test by @govind-ramnarayan in #15316
  • [None][fix] AutoDeploy: set enable_spec_decode on ADEngine for disagg by @Shixiaowei02 in #15260
  • [TRTLLM-12498][feat] Add support for beam search in disaggregated serving by @athena-nv in #14876
  • [None][chore] 2 more WAN multi-gpu tests by @NVShreyas in #15223
  • [TRTLLM-12721][feat] Add disagg transfer state consensus by @chienchunhung in #15139
  • [None][infra] Waive 1 failed cases for main in pre-merge 43047 by @ZhanruiSunCh in #15326
  • [#12715][fix] disable NCCL_SYMMETRIC tactic on GB10 (DGX Spark) by @nv-lschneider in #12902
  • [None][feat] AutoDeploy: Qwen3.5: Apply whitelist based sharding and apply lm_head sharding by @taylor-yb-lee in #15185
  • [https://nvbugs/6293015][fix] Add a delegating `@property def vocab_size_padded(self) -> int: return… by @tensorrt-cicd in #15219
  • [TRTLLM-12842][feat] Maximal LLMAPI capture in usage telemetry by @venkywonka in #14398
  • [TRTLLM-12427][perf] Qwen2.5/3/3.5-VL Performance Optimization by @yechank-nvidia in #11943
  • [TRTLLM-11408][test] Add e2e Tensor Parallel LPIPS tests for VisualGen by @yingguo-trt in #15208
  • [None][infra] Waive 1 failed cases for main in pre-merge 43173 by @ZhanruiSunCh in #15358
  • [None][infra] Record CBTS decision to OpenSearch for CI-health monitoring by @crazydemo in #15210
  • [None][feat] MNNVL Performance Optimization and FP8/NVFP4 Quant Fusion by @timlee0212 in #14476
  • [None][refactor] Remove TensorRT performance baseline and update to PyTorch only by @yufeiwu-nv in #15256
  • [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15315
  • [https://nvbugs/6029882][fix] Fix attentionOp fp8 mla kvreuse workspace calculation by @pengbowang-nv in #14852
  • [None][infra] pin pytest and click workaround by @cascade812 in #15357
  • [None][feat] skip-softmax on SM120: TMA-load + sync-MMA warp-specialized context FMHA for sm_120/sm_121 by @dcampora in #15163
  • [None][fix] Fix beam search log_probs non-determinism with batch_size > 1 by @achartier in #15125
  • [Bugfix] Forward secondary_offload_min_priority to KVCacheManager in PyTorch executor by @Saddss in #13768
  • [None][chore] Bump version to 1.3.0rc19 by @yuanjingx87 in #15188
  • [TRTLLMINF-103][feat] Keep SLURM timeouts non-retryable by @dpitman-nvda in #15183
  • [TRTLLM-12982][feat] support multi item scoring in LLM.encode by @ixlmar in #14693
  • [https://nvbugs/6281014][fix] fix the repeated cute.compile and simplify the test by @JadoTu in #15331
  • [None][chore] Integration tests for MoE lora & bugfixes by @brb-nv in #15271
  • [TRTLLM-12339][feat] enable TRTLLM cross attention backend by @cascade812 in #15345
  • [TRTLLM-12807][test] Guard thop attention kwarg aliases by @yuxianq in #15335
  • [None][infra] Waive 21 failed cases for main in post-merge 2780 by @ZhanruiSunCh in #15373
  • [None][fix] pool-qualify KV cache transfer pending keys by @chienchunhung in #15272
  • [None][refactor] Enhance pytest integration by updating test node generation to support fixture inheritance and dynamic collection by @yufeiwu-nv in #15374
  • [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15377
  • [https://nvbugs/312578][fix] split test_cache_transceiver_single_process by @chuangz0 in #15369
  • [None][infra] Update the new duration based on OpenSearch result by @EmmaQiaoCh in #15364
  • [https://nvbugs/6245861][fix] Gate the two ID None-checks on finish_reason in _GEN_PENDING_FINISH_REASONS… by @tensorrt-cicd in #14908
  • [https://nvbugs/6223556][fix] Propagate gen-first ctx usage via aux buffer to postproc by @reasonsolo in #15246
  • [None][test] Fix Mamba hybrid transceiver helper by @chienchunhung in #15323
  • [None][feat] Qwen3-VL: support per-request mm_processor_kwargs by @aswinvisva in #14702
  • [TRTLLM-12982][chore] NVTX-annotate logits processor by @ixlmar in #15408
  • [TRTLLM-12339][feat] Support T5 and BART in the PyTorch backend by @cascade812 in #13919
  • [TRTLLM-13333][feat] Add prefetch_reuse_blocks and configurable prefetch count by @reasonsolo in #15149
  • [None][feat] DSv4 prep: attention op plumbing by @lfr-0531 in #15384
  • [None][test] Waive 8 failed cases for main in post-merge by @tensorrt-cicd in #15389
  • [#15182][fix] Fix embedding vocab mask for handling rejection sampling in Kimi-K2.5 by @chungen04 in #15233
  • [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15320
  • [None][refactor] Refactor Skip Softmax Attention Interface by @bobboli in #14687
  • [None][infra] Waive 1 failed cases for main in pre-merge 43656 by @ZhanruiSunCh in #15439
  • [None][infra] Waive 11 failed cases for main in post-merge 2782 by @ZhanruiSunCh in #15395
  • [https://nvbugs/6248837][fix] Densify trtllm-gen fmha warmup grid to catch missing kernels by @pengbowang-nv in #15305
  • [TRTLLM-13378][feat] Drop legacy --extra_visual_gen_options CLI alias by @zhenhuaw-me in #15262
  • [TRTLLM-12950][feat] Add MegaMoECuteDsl NVFP4 MoE backend by @xxi-nv in #14608
  • [None][perf] DSv4 prep: attention fusion custom ops by @lfr-0531 in #15390
  • [TRTLLM-12669][refactor] Eagle3 sampling: auto-detect greedy fast-path, mixed-batch rejection sampling, draft honors target params by @zhaoyangwang-nvidia in #14745
  • [TRTLLMINF-137][infra] Skip to create perf report when there is not perf test results by @yiqingy0 in #15446
  • [https://nvbugs/6270671][fix] Replace the hardcoded multiBlock=1 with a call to… by @tensorrt-cicd in #15312
  • [TRTLLMINF-113][infra] Add timeout protection to Setup/Initialize stages by @ZhanruiSunCh in #14682
  • [None][infra] Waive 1 failed cases for main in pre-merge 43720 by @ZhanruiSunCh in #15449
  • [None][infra] Waive 2 failed cases for main in post-merge 2785 by @ZhanruiSunCh in #15450
  • [None][perf] executor: avoid deepcopy of prompt_token_ids on enqueue by @lancelly in #14895
  • [None][infra] Waive 1 failed cases for main in pre-merge 43712 by @ZhanruiSunCh in #15447
  • [None][ci] tighten VisualGen CBTS routing by @zhenhuaw-me in #15259
  • [None][fix] fix tinygemm barrier bug by @yweng0828 in #15338
  • [TRTLLM-12199][feat] WideEP FT: add EPGroupHealth thread-safe rank mask (1a.1) by @chienchunhung in #13302
  • [None][infra] Waive 18 failed cases for main in pre-merge 43878 by @ZhanruiSunCh in #15469
  • [None][fix] Fix stale sparse attention kwargs by @bobboli in #15460
  • [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15411
  • [TRTLLM-12807][feat] Add multiple FMHA library support to TRTLLM attention backend by @yuxianq in #15204
  • [None][infra] Waive 1 failed cases for main in pre-merge 43917 by @ZhanruiSunCh in #15478
  • [None][feat] Side-stream for MM encoder by @2ez4bz in #14322
  • [None][feat] BREAKING: Add MiniMax-M3 PyTorch backend bring-up with API changes by @WeiHaocheng in #15292
  • [https://nvbugs/6215678][fix] Point --output-artifact-dir at a unique per-run subdir `{model}-openai-complet by @tensorrt-cicd in #14742
  • [None][fix] fix CppMambaHybridCacheManager to handle dp dummy request by @bo-nv in #15054
  • [None][test] Waive 5 failed cases for main in post-merge by @tensorrt-cicd in #15392
  • [None][test] Waive 9 failed cases for main in post-merge by @tensorrt-cicd in #15391
  • [None][test] Waive 5 failed cases for main in QA CI by @tensorrt-cicd in #15360
  • [None][test] Waive 8 failed cases for main in QA CI by @tensorrt-cicd in #15342
  • [None][feat] Checkpointing variant of replay for MTP for mamba models by @hnover-nv in #14203
  • [None][test] Waive 23 failed cases for main in QA CI by @tensorrt-cicd in #15337
  • [None][test] Waive 3 failed cases for main in QA CI by @tensorrt-cicd in #15319

Full Changelog: v1.3.0rc18...v1.3.0rc19