## Highlights
This release features 274 commits from 123 contributors (27 new contributors!).
- Progress in large-scale serving
  - DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762)
  - Heterogeneous TP (#18833), NixlConnector: enable FlashInfer backend (#19090)
  - DP: API-server scaleout with many-to-many server-engine comms (#17546), support DP with Ray (#18779), allow AsyncLLMEngine.generate to target a specific DP rank (#19102), add data parallel rank to KVEventBatch (#18925)
  - Tooling: simplify EP kernels installation (#19412)
  - RLHF workflow: support in-place model weights loading (#18745)
- Initial full support for Hybrid Memory Allocator (#17996), support cross-layer KV sharing (#18212)
- Add FlexAttention to vLLM V1 (#16078)
- Various production hardening related to full cuda graph mode (#19171, #19106, #19321)
### Model Support
- Support Magistral (#19193), LoRA support for InternVL (#18842), minicpm eagle support (#18943), NemotronH support (#18863, #19249)
- Enable data parallel for Llama4 vision encoder (#18368)
- Add DeepSeek-R1-0528 function call chat template (#18874)
### Hardware Support & Performance Optimizations
- Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205), Qwen3-235B-A22B (#19315)
- Blackwell: Add Cutlass MLA backend (#17625), Tunings for SM100 FP8 CUTLASS kernel (#18778), Use FlashInfer by default on Blackwell GPUs (#19118), Tune `scaled_fp8_quant` by increasing vectorization (#18844)
- FP4: Add compressed-tensors NVFP4 support (#18312), FP4 MoE kernel optimization (#19110)
- CPU: V1 support for the CPU backend (#16441)
- ROCm: Add AITER grouped topk for DeepSeekV2 (#18825)
- POWER: Add IBM POWER11 Support to CPU Extension Detection (#19082)
- TPU: Initial support of model parallelism with single worker using SPMD (#18011), Multi-LoRA Optimizations for the V1 TPU backend (#15655)
- Neuron: Add multi-LoRA support for Neuron. (#18284), Add Multi-Modal model support for Neuron (#18921), Support quantization on neuron (#18283)
- Platform: Make torch distributed process group extendable (#18763)
### Engine features
- Add Lora Support to Beam Search (#18346)
- Add rerank support to run_batch endpoint (#16278)
- CLI: add run batch (#18804)
- Server: custom logging (#18403), allowed_token_ids in ChatCompletionRequest (#19143)
- `LLM` API: make `use_tqdm` accept a callable for custom progress bars (#19357)
- Perf: CUDA kernel for applying repetition penalty in the sampler (#18437)
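As a minimal sketch of the new progress-bar hook (assuming, per #19357, that a callable passed to `use_tqdm` is used in place of the default `tqdm` constructor; the model name below is just an example):

```python
from functools import partial

from tqdm import tqdm
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model

# Instead of use_tqdm=True/False, pass a callable that builds the progress bar.
custom_bar = partial(tqdm, colour="green", mininterval=0.5)
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(max_tokens=32),
    use_tqdm=custom_bar,
)
```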
### API Deprecations
- Disallow positional arguments other than `model` when initializing `LLM` (#18802)
- Remove `inputs` arg fallback in Engine classes (#18799)
- Remove fallbacks for Embeddings API (#18795)
- Remove mean pooling default for `Qwen2EmbeddingModel` (#18913)
- Require overriding `get_dummy_text` and `get_dummy_mm_data` (#18796)
- Remove metrics that were deprecated in 0.8 (#18837)
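For example, under the keyword-only rule from #18802 (a hedged sketch; the keyword arguments shown are standard `LLM` options, and the exact error raised for extra positional arguments may differ):

```python
from vllm import LLM

# Deprecated/removed: extra positional arguments after the model name, e.g.
#   LLM("facebook/opt-125m", "auto")   # now rejected

# Only `model` may be positional; pass everything else by keyword.
llm = LLM("facebook/opt-125m", dtype="auto", gpu_memory_utilization=0.9)
```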
### Documentation
- Add CLI doc (#18871)
- Update SECURITY.md with link to our security guide (#18961), Add security warning to bug report template (#19365)
## What's Changed
- [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
- [Neuron] Support quantization on neuron by @aws-satyajith in #18283
- Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in #18566
- [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
- [Build] Fixes for CMake install by @ProExpertProg in #18570
- [Core] Improve Tensor serialisation by @lgeiger in #18774
- [rocm] Fix wrong attention log by @fxmarty-amd in #18764
- [Bugfix] Fix nomic max_model_len by @noooop in #18755
- [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
- [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
- [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
- [Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_data` by @DarkLight1337 in #18796
- [Deprecation] Remove unused sync methods in `async_timeout` by @DarkLight1337 in #18792
- [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
- [CI] improve embed testing by @noooop in #18747
- Fix PiecewiseCompileInterpreter by @zou3519 in #17338
- [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
- [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
- Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
- [Frontend] add run batch to CLI by @reidliu41 in #18804
- decrement server_load on listen for disconnect by @daniel-salib in #18784
- [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
- [Chore] update ty configuration by @aarnphm in #18839
- [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
- [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
- [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
- [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
- Remove checks for `None` for fields which should never be `None` by @hmellor in #17985
- [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
- [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
- Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
- Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
- [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
- [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
- Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
- Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
- [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
- [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
- [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
- [doc] add CLI doc by @reidliu41 in #18871
- [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
- [Misc] Replace TODO in serving transcription by @NickLucche in #18895
- [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
- [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
- Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
- [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
- [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
- [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
- [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
- [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
- [Deprecation] Disallow pos-args other than `model` when initializing `LLM` by @DarkLight1337 in #18802
- [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
- [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
- [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
- [P/D] NixlConnector DP fixes by @wseaton in #18903
- Use standalone_compile by default in torch >= 2.8.0 by @zou3519 in #18846
- [TPU] remove transpose ops in moe kernel by @yaochengji in #18923
- [Bugfix] Fix PP default fallback behavior for V1 by @mgoin in #18915
- [Misc] Update type annotation for rotary embedding `base` by @DarkLight1337 in #18914
- [TPU][CI/CD] Clean up docker for TPU tests. by @CAROLZXYZXY in #18926
- improve the robustness of parsing vlms config in AutoRound by @wenhuach21 in #18894
- [Bugfix] Consistent ascii handling in tool parsers by @chaunceyjiang in #18883
- [Model] Use AutoWeightsLoader for mamba2 by @jinyouzhi in #18918
- [docs] fix: fix markdown syntax by @eric-haibin-lin in #18927
- [ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. by @vllmellm in #18938
- [Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy by @mgoin in #18861
- [Deprecation] Remove mean pooling default for `Qwen2EmbeddingModel` by @DarkLight1337 in #18913
- [Misc] Fix benchmarks/README.md for speculative decoding by @rabi in #18897
- [doc] add mkdocs doc by @reidliu41 in #18930
- [Model] Use in-place adds in SigLIP by @lgeiger in #18922
- [Bugfix][Failing Test] Fix test_vllm_port.py by @rabi in #18618
- [Misc]Fix typo by @Always-Naive in #18947
- [Bugfix][TPU] Fix tpu model runner testcase failure by @CAROLZXYZXY in #18810
- [CI/Build] remove regex from build dependencies by @dtrifiro in #18945
- [Feature] minicpm eagle support by @huangyuxiang03 in #18943
- [doc] show the count for fork and watch by @reidliu41 in #18950
- [Docs] Update SECURITY.md with link to our security guide by @russellb in #18961
- Improve "failed to get the hash of the compiled graph" error by @zou3519 in #18956
- [Perf] API-server scaleout with many-to-many server-engine comms by @njhill in #17546
- Benchmark script for fp8 vs bf16 gemm by @mgoin in #17126
- [VLM] Add PP support and fix GPTQ inference for Ovis models by @Isotr0py in #18958
- [Misc] add group_size is -1 in awq quantization by @lengrongfu in #18910
- Tool parser regex timeout handling by @wseaton in #18960
- [Docs] Correct multiprocessing design doc by @lgeiger in #18964
- create util function for batched arange by @yuguo68 in #18937
- [Frontend] Add rerank support to run_batch endpoint by @pooyadavoodi in #16278
- [Misc] Fix estimated max model len msg by @sarckk in #18966
- [Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled by @chaunceyjiang in #18879
- fix security issue of logging llm output by @luccafong in #18980
- [Neuron] Add Multi-Modal model support for Neuron by @aws-satyajith in #18921
- [doc] fix the list rendering issue - security.md by @reidliu41 in #18982
- [BugFix] Pydantic part 2 by @ProExpertProg in #18911
- [FEAT][ROCm] Add AITER grouped topk for DeepSeekV2 by @vllmellm in #18825
- [Bugfix] Fix for issue 17396 by @frreiss in #18773
- [ROCm][Kernel] Add gfx950 support for skinny gemms by @charlifu in #18010
- [P/D] NixlConnector use cache device index for memory registration by @ptarasiewiczNV in #18969
- [BugFix] Fix multi-node offline data-parallel by @njhill in #18981
- [Misc] add return token strs for tokenize by @reidliu41 in #18941
- [Misc][Benchmark] Add support for CustomDataset by @ekagra-ranjan in #18511
- [Bugfix] Fix EAGLE3 broken logits by @benchislett in #18909
- [Core] Rework dtype resolution by @DarkLight1337 in #18751
- [LoRA] Support dynamically initialize `packed_modules_mapping` for VLM with arbitrary components by @Isotr0py in #18987
- [doc] small fix - mkdocs by @reidliu41 in #18996
- Let max_num_batched_tokens use human_readable_int for large numbers by @mgoin in #18968
- [BugFix] fix data parallel construct ipv6 url addres by @lengrongfu in #18991
- [BugFix] Fix incorrect metrics shutdown error log message by @njhill in #18992
- [doc] wrong output by @reidliu41 in #19000
- [Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context by @izhuhaoran in #18935
- [Bugfix][Nixl] Fix DP Metadata Handshake by @robertgshaw2-redhat in #19008
- [Core] Support inplace model weights loading by @22quinn in #18745
- [doc] add pytest tips by @reidliu41 in #19010
- [Model] enable data parallel for Llama4 vision encoder by @jennyyyyzhen in #18368
- [Frontend] enable custom logging for the uvicorn server (OpenAI API server) by @fpaupier in #18403
- [Bugfix][Model] Attempt to fix eagle in V0. by @gshtras in #18978
- add an absolute path for run.sh by @calvin0327 in #18258
- [Hardware][TPU] Initial support of model parallelism with single worker using SPMD by @lsy323 in #18011
- [Doc] Remove duplicate TOCs during MkDocs migration by @Zerohertz in #19021
- [Bugfix][EP+DP] Use pplx-kernel internode instead of intranode by @tlrmchlsmth in #19034
- Adding "LoRA Test %N" to AMD production tests by @Concurrensee in #18929
- [CPU][CI] Re-enable the CPU CI tests by @bigPYJ1151 in #19046
- [ROCm][Build] Clean up the ROCm build by @gshtras in #19040
- [V1] Support DP with Ray by @ruisearch42 in #18779
- Add tarsier model support by @princepride in #18985
- [bugfix] small fix logic issue by @reidliu41 in #18999
- Reduce logs in CLI scripts and plugin loader by @mgoin in #18970
- [Bugfix] Use cmake 3.26.1 instead of 3.26 to avoid build failure by @houseroad in #19019
- [v1][KVCacheManager] Rename BlockHashType to BlockHash by @heheda12345 in #19015
- Update docker docs with ARM CUDA cross-compile by @mgoin in #19037
- [Doc] Add InternVL LoRA support by @jeejeelee in #19055
- [Misc] Update `WeightsMapper` for qwen2-vl/qwen2.5-vl by @Isotr0py in #19054
- [Doc] Update V1 user guide for embedding and enc-dec models by @DarkLight1337 in #19060
- [doc] clarify windows support by @youkaichao in #19088
- [CI/Build] Remove V0 LoRA test by @jeejeelee in #19066
- Fix underscores in dict keys passed via CLI by @hmellor in #19030
- [Bugfix] disable processor cache by @zucchini-nlp in #19068
- [Doc] Improve the Pull Request template with key components by @houseroad in #19086
- [Misc] Add missing `_Backend` enums by @NickLucche in #19081
- [Misc] fix: add miss best_of param validation by @googs1025 in #18555
- [Misc] Add SPDX-FileCopyrightText by @simon-mo in #19100
- [Doc] Readme standardization by @SorenDreano in #18695
- [doc] update docker version by @reidliu41 in #19074
- [Kernel] DeepEP dispatch-combine kernel integration by @varun-sundar-rabindranath in #18434
- [V1] Support cross-layer KV sharing by @sarckk in #18212
- [Perf] Tune `scaled_fp8_quant` by increasing vectorization by @mgoin in #18844
- Fix interaction between `Optional` and `Annotated` in CLI typing by @hmellor in #19093
- [v1] Re-init input batch for multiple kv cache groups by @heheda12345 in #18654
- [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix by @ekagra-ranjan in #18971
- [Bugfix] get_num_blocks_to_allocate with null_block by @heheda12345 in #19031
- [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled by @chaunceyjiang in #19075
- [Bugfix][P/D] Fix Prefix Cache Bug by @NickLucche in #18411
- [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers by @heheda12345 in #19029
- feat: add data parallel rank to KVEventBatch by @PeaBrane in #18925
- [Misc] Fix path and python alias errors in disagg_prefill exmaples by @Jeffwan in #18919
- [Docs] Add developer doc about CI failures by @russellb in #18782
- [CPU] V1 support for the CPU backend by @bigPYJ1151 in #16441
- [Core] Cast multimodal input in hf processor by @lgeiger in #18862
- [KERNEL] Sampler. CUDA kernel for applying repetition penalty by @vadiklyutiy in #18437
- [Cleanup][v1]:remote guided-decoding-backend for example by @calvin0327 in #19059
- [NVIDIA] Add Cutlass MLA backend by @kaixih in #17625
- [Bugfix] Fix FA3 full cuda graph correctness by @WoosukKwon in #19106
- Fix #19130 by @princepride in #19132
- [TPU] Skip hanging tests by @lsy323 in #19115
- Fix ValueError: Missing value for tag key(s): model_name,engine. by @eicherseiji in #19113
- [Misc] Add packages for benchmark as extra dependency by @Isotr0py in #19089
- Improve the output precision of embedding models by @noooop in #19092
- [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 by @DarkLight1337 in #18678
- Add DeepSeek-R1-0528 function call chat template by @Xu-Wenqing in #18874
- Sm100 blockwise fp8 swap ab by @IwakuraRein in #18564
- [Doc] Update V1 Guide for embedding models by @DarkLight1337 in #19141
- Allow AsyncLLMEngine.generate to target a specific DP rank by @jmswen in #19102
- [Bugfix][EP+DP] Fix internode check by @tlrmchlsmth in #19112
- [Perf] Tunings for SM100 FP8 CUTLASS kernel by @mgoin in #18778
- [TPU] Update dynamo dump file name in compilation test by @lsy323 in #19108
- [Bugfix] fix v1 cpu worker fails on macOS by @kebe7jun in #19121
- [Kernel] Integrate batched/masked deepgemm kernel by @varun-sundar-rabindranath in #19111
- [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM by @googs1025 in #18817
- [P/D] Heterogeneous TP by @NickLucche in #18833
- [doc] small fix by @reidliu41 in #19167
- [Bugfix][Nixl] Fix full prefix cache hit bug by @robertgshaw2-redhat in #18632
- [Bugfix] Fix port handling in make_zmq_path by @mgoin in #19117
- [Torch Nightly]add missing dependency by @yangw-dev in #18770
- Handle non-serializable objects when dumping benchmark results by @huydhn in #19114
- [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 by @WoosukKwon in #19171
- [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled by @chaunceyjiang in #19135
- [Build] Annotate wheel and container path for release workflow by @simon-mo in #19162
- [Misc] Remove unnecessary fallback to prefill-decode attention by @vllmellm in #19138
- [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly by @22quinn in #19105
- [Frontend] improve vllm run-batch --help display by @reidliu41 in #19187
- [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided by @gcalmettes in #19202
- [mistral_common] Add v11 tokenizer by @patrickvonplaten in #19193
- Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 by @Xu-Wenqing in #19205
- [Hardware][NVIDIA] FP4 MoE kernel optimization by @dubcyfor3 in #19110
- [MISC][Bugfix] Use less CPU when message queue has been empty for some time by @p12tic in #16226
- [P/D][NixlConnector] Enable FlashInfer backend by @NickLucche in #19090
- [Quantization] Skip Fp4 Test for `compressed-tensors` by @dsikka in #19217
- [V1] Use FlashInfer by default on Blackwell GPUs by @mgoin in #19118
- [Model] NemotronH support by @vegaluisjose in #18863
- Fix AOPerModuleConfig name changes by @jerryzh168 in #18869
- [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B by @benchislett in #19033
- [v1] Hybrid Memory Allocator by @heheda12345 in #17996
- [TPU] update torch_xla pin by @yaochengji in #19231
- Support allowed_token_ids in ChatCompletionRequest by @xu-song in #19143
- [Chore] update CODEOWNERS by @aarnphm in #19247
- [v1][P/D] Fix a edge case in kv cache schedule by @KingsleyZhang123 in #19182
- [TPU] fix kv cache dtype in model runner by @yaochengji in #19244
- [Quantization] Bump compressed-tensors version; update NVFP4A16 test model by @dsikka in #19224
- [Docs] Improve V1 KVConnector interface documentation by @njhill in #19172
- Fix CompilationConfig repr by @zou3519 in #19091
- Unit Test for run_dp_sharded_vision_model by @cryptopic in #19103
- [Model] Optimize nemotron_h implementation by @jeejeelee in #19249
- [Core] Raise when non-multi-instance DP clients target a DP rank by @jmswen in #19227
- improve logits bias by @yuguo68 in #19041
- Fixed ppc build when it runs on non-RHEL based linux distros by @npanpaliya in #18422
- [BugFix] Fix MultiConnector test after HMA changes by @njhill in #19291
- [Bugfix][Core] Update cancellation logic in `generate()` to handle Generator exits by @Adolfo-Karim in #19225
- [Core] Fix abrupt request abort by @NickLucche in #18485
- [BugFix] Fix tpu_model_runner block_id concatenation by @njhill in #19228
- [Misc][Tools][Benchmark] Fix and improve auto tune script by @Chenyaaang in #19163
- [Build][ROCm] Update Dockerfile.rocm by @Alexei-V-Ivanov-AMD in #19296
- [Easy][Test] Simplify test_function_tool_use with multiple parametrizes by @houseroad in #19269
- [Kernel] Integrate CUTLASS MoE kernel with PPLX by @ElizaWszola in #18762
- [TPU][Test] Add script to run benchmark on TPU for buildkite by @QiliangCui in #19039
- [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py by @AaruniAggarwal in #19253
- Add FlexAttention to V1 by @drisspg in #16078
- [Misc] refactor context extension by @reidliu41 in #19246
- [CI/Build] Improve Llama GGUF test robustness by @Isotr0py in #19287
- [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py by @draftbk in #19311
- [AMD] Update compatible packaging version by @pramenku in #19309
- [BugFix][V1] Fix memory profiling bug by @ProExpertProg in #18974
- [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer by @chaunceyjiang in #19283
- [Bugfix] Re-enable use_cudagraph in vLLM v1 by @zou3519 in #19299
- [Misc] Change tests/compile to use VLLM_V1 by default by @zou3519 in #19302
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B by @Xu-Wenqing in #19315
- [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection by @Akashcodes732 in #19082
- [Quantization] Add compressed-tensors NVFP4 support by @dsikka in #18312
- [Multi Modal] Add an env var for message queue max chunk bytes by @jennyyyyzhen in #19242
- [Bugfix] model_max_length should consider max_model_len in tokenizer_config by @noooop in #19201
- [Deprecation] Remove `inputs` arg fallback in Engine classes by @DarkLight1337 in #18799
- [Misc] Add documentation update reminder to PR template by @Isotr0py in #19289
- [Frontend] Remove unreachable code from llm.py by @KsuParkhamchuk in #19288
- [Misc] Cleanup compilation tests by @zou3519 in #19343
- [doc] improve ci doc by @reidliu41 in #19307
- [Doc] Fix description in the Automatic Prefix Caching design doc by @cr7258 in #19333
- [CI/Build] Fix LoRA test by @jeejeelee in #19350
- [Fix] Allow kernel compilation for CUDA capability 8.7 by @conroy-cheers in #19328
- [CI] Introduce rules for llama auto-label by @houseroad in #19323
- [Docs] Fix a bullet list in usage/security.md by @windsonsea in #19358
- [full_graph] Fix query_start_loc padding by @yinghai in #19321
- [v1] Add fp32 support to v1 engine through flex attn by @Isotr0py in #19319
- [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. by @varun-sundar-rabindranath in #19298
- [Bugfix][Core] Prevent token lengths exceeding `max_model_len` in V0 by @22quinn in #19348
- [Quantization] Bump compressed-tensors version by @kylesayrs in #19295
- [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var by @liusiqian-tal in #18472
- [TPU]Fix KV cache sharing tests by @lsy323 in #19371
- [HOT-FIX] Add `kv_sharing_target_layer_name` argument to cutlass_mla backend by @pavanimajety in #19374
- [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration by @lsy323 in #19383
- [V1] Reuse V0's memory_profiling util for gpu worker memory profiling by @yeqcharlotte in #19312
- [Bugfix] Fix benchmark_moe.py by @gty111 in #19016
- Use xla flag to improve the quantized model performance by @vanbasten23 in #19303
- Fix docs/mkdocs/hooks/remove_announcement.py by @hmellor in #19382
- [Frontend] Make use_tqdm accept a callable for custom progress bars by @reidliu41 in #19357
- [Core] Use tuple for kv cache group block ids by @njhill in #19175
- [Bugfix] Fix modelscope token passed in by @Potabk in #19389
- [Core] Batch multi modal input using pinned memory by @lgeiger in #19169
- Add security warning to bug report template by @russellb in #19365
- [Misc] refactor neuron_multimodal and profiling by @reidliu41 in #19397
- Add clear documentation around the impact of debugging flag by @annapendleton in #19369
- Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. by @louie-tsai in #17930
- Revert "[v1] Add fp32 support to v1 engine through flex attn" by @Isotr0py in #19404
- [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword `use_irope` by @YUNQIUGUO in #19134
- [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral by @bigPYJ1151 in #19411
- Simplify ep kernels installation by @youkaichao in #19412
- [Misc] Slight improvement of the BNB by @jeejeelee in #19418
## New Contributors
- @nerdalert made their first contribution in #18856
- @Duyi-Wang made their first contribution in #18692
- @jinyouzhi made their first contribution in #18918
- @eric-haibin-lin made their first contribution in #18927
- @Always-Naive made their first contribution in #18947
- @yuguo68 made their first contribution in #18937
- @ptarasiewiczNV made their first contribution in #18969
- @izhuhaoran made their first contribution in #18935
- @jennyyyyzhen made their first contribution in #18368
- @zucchini-nlp made their first contribution in #19068
- @SorenDreano made their first contribution in #18695
- @PeaBrane made their first contribution in #18925
- @jmswen made their first contribution in #19102
- @dubcyfor3 made their first contribution in #19110
- @p12tic made their first contribution in #16226
- @KingsleyZhang123 made their first contribution in #19182
- @cryptopic made their first contribution in #19103
- @Adolfo-Karim made their first contribution in #19225
- @QiliangCui made their first contribution in #19039
- @draftbk made their first contribution in #19311
- @pramenku made their first contribution in #19309
- @KsuParkhamchuk made their first contribution in #19288
- @cr7258 made their first contribution in #19333
- @liusiqian-tal made their first contribution in #18472
- @annapendleton made their first contribution in #19369
- @louie-tsai made their first contribution in #17930
- @YUNQIUGUO made their first contribution in #19134
**Full Changelog**: v0.9.0...v0.9.1