## Highlights
This release features 274 commits from 123 contributors (27 new contributors!).
- Progress in large-scale serving
  - DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762)
  - Heterogeneous TP (#18833), NixlConnector: enable FlashInfer backend (#19090)
  - DP: API-server scaleout with many-to-many server-engine comms (#17546), support DP with Ray (#18779), allow AsyncLLMEngine.generate to target a specific DP rank (#19102), add data parallel rank to KVEventBatch (#18925)
  - Tooling: simplify EP kernels installation (#19412)
  - RLHF workflow: support in-place model weights loading (#18745)
- Initial full support for Hybrid Memory Allocator (#17996), support cross-layer KV sharing (#18212)
- Add FlexAttention to vLLM V1 (#16078)
- Various production hardening related to full cuda graph mode (#19171, #19106, #19321)
### Model Support
- Support Magistral (#19193), LoRA support for InternVL (#18842), minicpm eagle support (#18943), NemotronH support (#18863, #19249)
- Enable data parallel for Llama4 vision encoder (#18368)
- Add DeepSeek-R1-0528 function call chat template (#18874)
### Hardware Support & Performance Optimizations
- Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205), Qwen3-235B-A22B (#19315)
- Blackwell: Add Cutlass MLA backend (#17625), Tunings for SM100 FP8 CUTLASS kernel (#18778), Use FlashInfer by default on Blackwell GPUs (#19118), Tune `scaled_fp8_quant` by increasing vectorization (#18844)
- FP4: Add compressed-tensors NVFP4 support (#18312), FP4 MoE kernel optimization (#19110)
- CPU: V1 support for the CPU backend (#16441)
- ROCm: Add AITER grouped topk for DeepSeekV2 (#18825)
- POWER: Add IBM POWER11 Support to CPU Extension Detection (#19082)
- TPU: Initial support of model parallelism with single worker using SPMD (#18011), Multi-LoRA Optimizations for the V1 TPU backend (#15655)
- Neuron: Add multi-LoRA support for Neuron. (#18284), Add Multi-Modal model support for Neuron (#18921), Support quantization on neuron (#18283)
- Platform: Make torch distributed process group extendable (#18763)
### Engine features
- Add Lora Support to Beam Search (#18346)
- Add rerank support to run_batch endpoint (#16278)
- CLI: add run batch (#18804)
- Server: custom logging (#18403), allowed_token_ids in ChatCompletionRequest (#19143)
- `LLM` API: make `use_tqdm` accept a callable for custom progress bars (#19357)
- Perf: CUDA kernel for applying repetition penalty in the sampler (#18437)
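As a minimal sketch of the new progress-bar hook (assuming, per #19357, that a callable passed to `use_tqdm` is used in place of the default `tqdm` constructor; the model name below is just an example):

```python
from functools import partial

from tqdm import tqdm
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model

# Instead of use_tqdm=True/False, pass a callable that builds the progress bar.
custom_bar = partial(tqdm, colour="green", mininterval=0.5)
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(max_tokens=32),
    use_tqdm=custom_bar,
)
```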
### API Deprecations
- Disallow positional arguments other than `model` when initializing `LLM` (#18802)
- Remove `inputs` arg fallback in Engine classes (#18799)
- Remove fallbacks for Embeddings API (#18795)
- Remove mean pooling default for `Qwen2EmbeddingModel` (#18913)
- Require overriding `get_dummy_text` and `get_dummy_mm_data` (#18796)
- Remove metrics that were deprecated in 0.8 (#18837)
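For example, under the keyword-only rule from #18802 (a hedged sketch; the keyword arguments shown are standard `LLM` options, and the exact error raised for extra positional arguments may differ):

```python
from vllm import LLM

# Deprecated/removed: extra positional arguments after the model name, e.g.
#   LLM("facebook/opt-125m", "auto")   # now rejected

# Only `model` may be positional; pass everything else by keyword.
llm = LLM("facebook/opt-125m", dtype="auto", gpu_memory_utilization=0.9)
```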
### Documentation
- Add CLI doc (#18871)
- Update SECURITY.md with link to our security guide (#18961), Add security warning to bug report template (#19365)
## What's Changed
- [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
- [Neuron] Support quantization on neuron by @aws-satyajith in #18283
- Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in #18566
- [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
- [Build] Fixes for CMake install by @ProExpertProg in #18570
- [Core] Improve Tensor serialisation by @lgeiger in #18774
- [rocm] Fix wrong attention log by @fxmarty-amd in #18764
- [Bugfix] Fix nomic max_model_len by @noooop in #18755
- [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
- [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
- [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
- [Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_data` by @DarkLight1337 in #18796
- [Deprecation] Remove unused sync methods in `async_timeout` by @DarkLight1337 in #18792
- [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
- [CI] improve embed testing by @noooop in #18747
- Fix PiecewiseCompileInterpreter by @zou3519 in #17338
- [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
- [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
- Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
- [Frontend] add run batch to CLI by @reidliu41 in #18804
- decrement server_load on listen for disconnect by @daniel-salib in #18784
- [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
- [Chore] update ty configuration by @aarnphm in #18839
- [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
- [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
- [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
- [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
- Remove checks for `None` for fields which should never be `None` by @hmellor in #17985
- [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
- [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
- Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
- Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
- [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
- [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
- Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
- Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
- [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
- [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
- [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
- [doc] add CLI doc by @reidliu41 in #18871
- [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
- [Misc] Replace TODO in serving transcription by @NickLucche in #18895
- [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
- [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
- Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
- [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
- [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
- [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
- [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
- [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
- [Deprecation] Disallow pos-args other than `model` when initializing `LLM` by @DarkLight1337 in #18802
- [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
- [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
- [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
- [P/D] NixlConnector DP fixes by @wseaton in #18903
- Use standalone_compile by default in torch >= 2.8.0 by @zou3519 in #18846
- [TPU] remove transpose ops in moe kernel by @yaochengji in #18923
- [Bugfix] Fix PP default fallback behavior for V1 by @mgoin in #18915
- [Misc] Update type annotation for rotary embedding `base` by @DarkLight1337 in #18914
- [TPU][CI/CD] Clean up docker for TPU tests. by @CAROLZXYZXY in #18926
- improve the robustness of parsing vlms config in AutoRound by @wenhuach21 in #18894
- [Bugfix] Consistent ascii handling in tool parsers by @chaunceyjiang in #18883
- [Model] Use AutoWeightsLoader for mamba2 by @jinyouzhi in #18918
- [docs] fix: fix markdown syntax by @eric-haibin-lin in #18927
- [ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. by @vllmellm in #18938
- [Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy by @mgoin in #18861
- [Deprecation] Remove mean pooling default for `Qwen2EmbeddingModel` by @DarkLight1337 in #18913
- [Misc] Fix benchmarks/README.md for speculative decoding by @rabi in #18897
- [doc] add mkdocs doc by @reidliu41 in #18930
- [Model] Use in-place adds in SigLIP by @lgeiger in #18922
- [Bugfix][Failing Test] Fix test_vllm_port.py by @rabi in #18618
- [Misc]Fix typo by @Always-Naive in #18947
- [Bugfix][TPU] Fix tpu model runner testcase failure by @CAROLZXYZXY in #18810
- [CI/Build] remove regex from build dependencies by @dtrifiro in #18945
- [Feature] minicpm eagle support by @huangyuxiang03 in #18943
- [doc] show the count for fork and watch by @reidliu41 in #18950
- [Docs] Update SECURITY.md with link to our security guide by @russellb in #18961
- Improve "failed to get the hash of the compiled graph" error by @zou3519 in #18956
- [Perf] API-server scaleout with many-to-many server-engine comms by @njhill in #17546
- Benchmark script for fp8 vs bf16 gemm by @mgoin in #17126
- [VLM] Add PP support and fix GPTQ inference for Ovis models by @Isotr0py in #18958
- [Misc] add group_size is -1 in awq quantization by @lengrongfu in #18910
- Tool parser regex timeout handling by @wseaton in #18960
- [Docs] Correct multiprocessing design doc by @lgeiger in #18964
- create util function for batched arange by @yuguo68 in #18937
- [Frontend] Add rerank support to run_batch endpoint by @pooyadavoodi in #16278
- [Misc] Fix estimated max model len msg by @sarckk in #18966
- [Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled by @chaunceyjiang in #18879
- fix security issue of logging llm output by @luccafong in #18980
- [Neuron] Add Multi-Modal model support for Neuron by @aws-satyajith in #18921
- [doc] fix the list rendering issue - security.md by @reidliu41 in #18982
- [BugFix] Pydantic part 2 by @ProExpertProg in #18911
- [FEAT][ROCm] Add AITER grouped topk for DeepSeekV2 by @vllmellm in #18825
- [Bugfix] Fix for issue 17396 by @frreiss in #18773
- [ROCm][Kernel] Add gfx950 support for skinny gemms by @charlifu in #18010
- [P/D] NixlConnector use cache device index for memory registration by @ptarasiewiczNV in #18969
- [BugFix] Fix multi-node offline data-parallel by @njhill in #18981
- [Misc] add return token strs for tokenize by @reidliu41 in #18941
- [Misc][Benchmark] Add support for CustomDataset by @ekagra-ranjan in #18511
- [Bugfix] Fix EAGLE3 broken logits by @benchislett in #18909
- [Core] Rework dtype resolution by @DarkLight1337 in #18751
- [LoRA] Support dynamically initialize `packed_modules_mapping` for VLM with arbitrary components by @Isotr0py in #18987
- [doc] small fix - mkdocs by @reidliu41 in #18996
- Let max_num_batched_tokens use human_readable_int for large numbers by @mgoin in #18968
- [BugFix] fix data parallel construct ipv6 url addres by @lengrongfu in #18991
- [BugFix] Fix incorrect metrics shutdown error log message by @njhill in #18992
- [doc] wrong output by @reidliu41 in #19000
- [Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context by @izhuhaoran in #18935
- [Bugfix][Nixl] Fix DP Metadata Handshake by @robertgshaw2-redhat in #19008
- [Core] Support inplace model weights loading by @22quinn in #18745
- [doc] add pytest tips by @reidliu41 in #19010
- [Model] enable data parallel for Llama4 vision encoder by @jennyyyyzhen in #18368
- [Frontend] enable custom logging for the uvicorn server (OpenAI API server) by @fpaupier in #18403
- [Bugfix][Model] Attempt to fix eagle in V0. by @gshtras in #18978
- add an absolute path for run.sh by @calvin0327 in #18258
- [Hardware][TPU] Initial support of model parallelism with single worker using SPMD by @lsy323 in #18011
- [Doc] Remove duplicate TOCs during MkDocs migration by @Zerohertz in #19021
- [Bugfix][EP+DP] Use pplx-kernel internode instead of intranode by @tlrmchlsmth in #19034
- Adding "LoRA Test %N" to AMD production tests by @Concurrensee in #18929
- [CPU][CI] Re-enable the CPU CI tests by @bigPYJ1151 in #19046
- [ROCm][Build] Clean up the ROCm build by @gshtras in #19040
- [V1] Support DP with Ray by @ruisearch42 in #18779
- Add tarsier model support by @princepride in #18985
- [bugfix] small fix logic issue by @reidliu41 in #18999
- Reduce logs in CLI scripts and plugin loader by @mgoin in #18970
- [Bugfix] Use cmake 3.26.1 instead of 3.26 to avoid build failure by @houseroad in #19019
- [v1][KVCacheManager] Rename BlockHashType to BlockHash by @heheda12345 in #19015
- Update docker docs with ARM CUDA cross-compile by @mgoin in #19037
- [Doc] Add InternVL LoRA support by @jeejeelee in #19055
- [Misc] Update `WeightsMapper` for qwen2-vl/qwen2.5-vl by @Isotr0py in #19054
- [Doc] Update V1 user guide for embedding and enc-dec models by @DarkLight1337 in #19060
- [doc] clarify windows support by @youkaichao in #19088
- [CI/Build] Remove V0 LoRA test by @jeejeelee in #19066
- Fix underscores in dict keys passed via CLI by @hmellor in #19030
- [Bugfix] disable processor cache by @zucchini-nlp in #19068
- [Doc] Improve the Pull Request template with key components by @houseroad in #19086
- [Misc] Add missing `_Backend` enums by @NickLucche in #19081
- [Misc] fix: add miss best_of param validation by @googs1025 in #18555
- [Misc] Add SPDX-FileCopyrightText by @simon-mo in #19100
- [Doc] Readme standardization by @SorenDreano in #18695
- [doc] update docker version by @reidliu41 in #19074
- [Kernel] DeepEP dispatch-combine kernel integration by @varun-sundar-rabindranath in #18434
- [V1] Support cross-layer KV sharing by @sarckk in #18212
- [Perf] Tune `scaled_fp8_quant` by increasing vectorization by @mgoin in #18844
- Fix interaction between `Optional` and `Annotated` in CLI typing by @hmellor in #19093
- [v1] Re-init input batch for multiple kv cache groups by @heheda12345 in #18654
- [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix by @ekagra-ranjan in #18971
- [Bugfix] get_num_blocks_to_allocate with null_block by @heheda12345 in #19031
- [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled by @chaunceyjiang in #19075
- [Bugfix][P/D] Fix Prefix Cache Bug by @NickLucche in #18411
- [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers by @heheda12345 in #19029
- feat: add data parallel rank to KVEventBatch by @PeaBrane in #18925
- [Misc] Fix path and python alias errors in disagg_prefill exmaples by @Jeffwan in #18919
- [Docs] Add developer doc about CI failures by @russellb in #18782
- [CPU] V1 support for the CPU backend by @bigPYJ1151 in #16441
- [Core] Cast multimodal input in hf processor by @lgeiger in #18862
- [KERNEL] Sampler. CUDA kernel for applying repetition penalty by @vadiklyutiy in #18437
- [Cleanup][v1]:remote guided-decoding-backend for example by @calvin0327 in #19059
- [NVIDIA] Add Cutlass MLA backend by @kaixih in #17625
- [Bugfix] Fix FA3 full cuda graph correctness by @WoosukKwon in #19106
- Fix #19130 by @princepride in #19132
- [TPU] Skip hanging tests by @lsy323 in #19115
- Fix ValueError: Missing value for tag key(s): model_name,engine. by @eicherseiji in #19113
- [Misc] Add packages for benchmark as extra dependency by @Isotr0py in #19089
- Improve the output precision of embedding models by @noooop in #19092
- [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 by @DarkLight1337 in #18678
- Add DeepSeek-R1-0528 function call chat template by @Xu-Wenqing in #18874
- Sm100 blockwise fp8 swap ab by @IwakuraRein in #18564
- [Doc] Update V1 Guide for embedding models by @DarkLight1337 in #19141
- Allow AsyncLLMEngine.generate to target a specific DP rank by @jmswen in #19102
- [Bugfix][EP+DP] Fix internode check by @tlrmchlsmth in #19112
- [Perf] Tunings for SM100 FP8 CUTLASS kernel by @mgoin in #18778
- [TPU] Update dynamo dump file name in compilation test by @lsy323 in #19108
- [Bugfix] fix v1 cpu worker fails on macOS by @kebe7jun in #19121
- [Kernel] Integrate batched/masked deepgemm kernel by @varun-sundar-rabindranath in #19111
- [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM by @googs1025 in #18817
- [P/D] Heterogeneous TP by @NickLucche in #18833
- [doc] small fix by @reidliu41 in #19167
- [Bugfix][Nixl] Fix full prefix cache hit bug by @robertgshaw2-redhat in #18632
- [Bugfix] Fix port handling in make_zmq_path by @mgoin in #19117
- [Torch Nightly]add missing dependency by @yangw-dev in #18770
- Handle non-serializable objects when dumping benchmark results by @huydhn in #19114
- [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 by @WoosukKwon in #19171
- [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled by @chaunceyjiang in #19135
- [Build] Annotate wheel and container path for release workflow by @simon-mo in #19162
- [Misc] Remove unnecessary fallback to prefill-decode attention by @vllmellm in #19138
- [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly by @22quinn in #19105
- [Frontend] improve vllm run-batch --help display by @reidliu41 in #19187
- [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided by @gcalmettes in #19202
- [mistral_common] Add v11 tokenizer by @patrickvonplaten in #19193
- Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 by @Xu-Wenqing in #19205
- [Hardware][NVIDIA] FP4 MoE kernel optimization by @dubcyfor3 in #19110
- [MISC][Bugfix] Use less CPU when message queue has been empty for some time by @p12tic in #16226
- [P/D][NixlConnector] Enable FlashInfer backend by @NickLucche in #19090
- [Quantization] Skip Fp4 Test for `compressed-tensors` by @dsikka in #19217
- [V1] Use FlashInfer by default on Blackwell GPUs by @mgoin in #19118
- [Model] NemotronH support by @vegaluisjose in #18863
- Fix AOPerModuleConfig name changes by @jerryzh168 in #18869
- [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B by @benchislett in #19033
- [v1] Hybrid Memory Allocator by @heheda12345 in #17996
- [TPU] update torch_xla pin by @yaochengji in #19231
- Support allowed_token_ids in ChatCompletionRequest by @xu-song in #19143
- [Chore] update CODEOWNERS by @aarnphm in #19247
- [v1][P/D] Fix a edge case in kv cache schedule by @KingsleyZhang123 in #19182
- [TPU] fix kv cache dtype in model runner by @yaochengji in #19244
- [Quantization] Bump compressed-tensors version; update NVFP4A16 test model by @dsikka in #19224
- [Docs] Improve V1 KVConnector interface documentation by @njhill in #19172
- Fix CompilationConfig repr by @zou3519 in #19091
- Unit Test for run_dp_sharded_vision_model by @cryptopic in #19103
- [Model] Optimize nemotron_h implementation by @jeejeelee in #19249
- [Core] Raise when non-multi-instance DP clients target a DP rank by @jmswen in #19227
- improve logits bias by @yuguo68 in #19041
- Fixed ppc build when it runs on non-RHEL based linux distros by @npanpaliya in #18422
- [BugFix] Fix MultiConnector test after HMA changes by @njhill in #19291
- [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits by @Adolfo-Karim in #19225
- [Core] Fix abrupt request abort by @NickLucche in #18485
- [BugFix] Fix tpu_model_runner block_id concatenation by @njhill in #19228
- [Misc][Tools][Benchmark] Fix and improve auto tune script by @Chenyaaang in #19163
- [Build][ROCm] Update Dockerfile.rocm by @Alexei-V-Ivanov-AMD in #19296
- [Easy][Test] Simplify test_function_tool_use with multiple parametrizes by @houseroad in #19269
- [Kernel] Integrate CUTLASS MoE kernel with PPLX by @ElizaWszola in #18762
- [TPU][Test] Add script to run benchmark on TPU for buildkite by @QiliangCui in #19039
- [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py by @AaruniAggarwal in #19253
- Add FlexAttention to V1 by @drisspg in #16078
- [Misc] refactor context extension by @reidliu41 in #19246
- [CI/Build] Improve Llama GGUF test robustness by @Isotr0py in #19287
- [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py by @draftbk in #19311
- [AMD] Update compatible packaging version by @pramenku in #19309
- [BugFix][V1] Fix memory profiling bug by @ProExpertProg in #18974
- [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer by @chaunceyjiang in #19283
- [Bugfix] Re-enable use_cudagraph in vLLM v1 by @zou3519 in #19299
- [Misc] Change tests/compile to use VLLM_V1 by default by @zou3519 in #19302
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B by @Xu-Wenqing in #19315
- [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection by @Akashcodes732 in #19082
- [Quantization] Add compressed-tensors NVFP4 support by @dsikka in #18312
- [Multi Modal] Add an env var for message queue max chunk bytes by @jennyyyyzhen in #19242
- [Bugfix] model_max_length should consider max_model_len in tokenizer_config by @noooop in #19201
- [Deprecation] Remove `inputs` arg fallback in Engine classes by @DarkLight1337 in #18799
- [Misc] Add documentation update reminder to PR template by @Isotr0py in #19289
- [Frontend] Remove unreachable code from llm.py by @KsuParkhamchuk in #19288
- [Misc] Cleanup compilation tests by @zou3519 in #19343
- [doc] improve ci doc by @reidliu41 in #19307
- [Doc] Fix description in the Automatic Prefix Caching design doc by @cr7258 in #19333
- [CI/Build] Fix LoRA test by @jeejeelee in #19350
- [Fix] Allow kernel compilation for CUDA capability 8.7 by @conroy-cheers in #19328
- [CI] Introduce rules for llama auto-label by @houseroad in #19323
- [Docs] Fix a bullet list in usage/security.md by @windsonsea in #19358
- [full_graph] Fix query_start_loc padding by @yinghai in #19321
- [v1] Add fp32 support to v1 engine through flex attn by @Isotr0py in #19319
- [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. by @varun-sundar-rabindranath in #19298
- [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 by @22quinn in #19348
- [Quantization] Bump compressed-tensors version by @kylesayrs in #19295
- [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var by @liusiqian-tal in #18472
- [TPU]Fix KV cache sharing tests by @lsy323 in #19371
- [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend by @pavanimajety in #19374
- [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration by @lsy323 in #19383
- [V1] Reuse V0's memory_profiling util for gpu worker memory profiling by @yeqcharlotte in #19312
- [Bugfix] Fix benchmark_moe.py by @gty111 in #19016
- Use xla flag to improve the quantized model performance by @vanbasten23 in #19303
- Fix docs/mkdocs/hooks/remove_announcement.py by @hmellor in #19382
- [Frontend] Make use_tqdm accept a callable for custom progress bars by @reidliu41 in #19357
- [Core] Use tuple for kv cache group block ids by @njhill in #19175
- [Bugfix] Fix modelscope token passed in by @Potabk in #19389
- [Core] Batch multi modal input using pinned memory by @lgeiger in #19169
- Add security warning to bug report template by @russellb in #19365
- [Misc] refactor neuron_multimodal and profiling by @reidliu41 in #19397
- Add clear documentation around the impact of debugging flag by @annapendleton in #19369
- Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. by @louie-tsai in #17930
- Revert "[v1] Add fp32 support to v1 engine through flex attn" by @Isotr0py in #19404
- [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` by @YUNQIUGUO in #19134
- [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral by @bigPYJ1151 in #19411
- Simplify ep kernels installation by @youkaichao in #19412
- [Misc] Slight improvement of the BNB by @jeejeelee in #19418
## New Contributors
- @nerdalert made their first contribution in #18856
- @Duyi-Wang made their first contribution in #18692
- @jinyouzhi made their first contribution in #18918
- @eric-haibin-lin made their first contribution in #18927
- @Always-Naive made their first contribution in #18947
- @yuguo68 made their first contribution in #18937
- @ptarasiewiczNV made their first contribution in #18969
- @izhuhaoran made their first contribution in #18935
- @jennyyyyzhen made their first contribution in #18368
- @zucchini-nlp made their first contribution in #19068
- @SorenDreano made their first contribution in #18695
- @PeaBrane made their first contribution in #18925
- @jmswen made their first contribution in #19102
- @dubcyfor3 made their first contribution in #19110
- @p12tic made their first contribution in #16226
- @KingsleyZhang123 made their first contribution in #19182
- @cryptopic made their first contribution in #19103
- @Adolfo-Karim made their first contribution in #19225
- @QiliangCui made their first contribution in #19039
- @draftbk made their first contribution in #19311
- @pramenku made their first contribution in #19309
- @KsuParkhamchuk made their first contribution in #19288
- @cr7258 made their first contribution in #19333
- @liusiqian-tal made their first contribution in #18472
- @annapendleton made their first contribution in #19369
- @louie-tsai made their first contribution in #17930
- @YUNQIUGUO made their first contribution in #19134
**Full Changelog**: v0.9.0...v0.9.1