chore(deps-dev): update vllm requirement from >=0.22.1 to >=0.24.0 by dependabot[bot] · Pull Request #458 · OpenBMB/UltraRAG

dependabot · 2026-07-01T11:27:07Z

Updates the requirements on vllm to permit the latest version.
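In concrete terms, the specifier change in the PR title raises the minimum accepted vllm release from 0.22.1 to 0.24.0. The following is a minimal sketch of what that means for a few candidate versions. It assumes plain dotted numeric versions only (no pre-release or local tags), and the helper names `parse_version` and `satisfies_minimum` are hypothetical, not part of pip or Dependabot.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain dotted release such as '0.24.0' into a comparable tuple.

    Assumption: only numeric dotted versions are handled; anything else
    (for example '0.24.0rc1') is rejected rather than guessed at.
    """
    if not re.fullmatch(r"\d+(\.\d+)*", text):
        raise ValueError(f"unsupported version string: {text!r}")
    parts = [int(part) for part in text.split(".")]
    # Drop trailing zeros so that '0.24' and '0.24.0' compare as equal.
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def satisfies_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` meets a '>=minimum' requirement."""
    return parse_version(installed) >= parse_version(minimum)


OLD_MINIMUM = "0.22.1"  # bound before this PR
NEW_MINIMUM = "0.24.0"  # bound after this PR

if __name__ == "__main__":
    for candidate in ("0.22.1", "0.23.0", "0.24.0"):
        print(
            candidate,
            "old bound:", satisfies_minimum(candidate, OLD_MINIMUM),
            "new bound:", satisfies_minimum(candidate, NEW_MINIMUM),
        )
```

Under these assumptions, an environment pinned to 0.23.0 satisfied the old bound but would no longer satisfy the new one.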

Release notes

v0.24.0

vLLM v0.24.0 Release Notes

Highlights

This release features 571 commits from 256 contributors (77 new)!

MiniMax-M3: Added support for the new MiniMax-M3 model (#45381), with a fast follow-on of BF16/FP8 indexer via MSA (#45892), MXFP4 support (#45896), FP8 sparse GQA (#45744), and extensive AMD/ROCm tuning — mxfp8 MoE/linear on gfx950 (#45725), fp8_per_channel for bf16 weights on MI300X (#45854), FP8 KV-cache fix (#45720), and packed-modules mapping (#45794). A MiniMax-M2 perf regression was also fixed (#45935).

DeepSeek-V4 keeps maturing: Following its debut, DeepSeek-V4 received another large optimization pass — a FlashInfer sparse index cache (2–4% TTFT) (#45863), prefill chunk-planning optimization (4% E2E throughput) (#45061), a cluster-cooperative topK kernel for low-latency (#43008), contiguous per-block KV allocations (#44577), TEP=16 for the block-FP8 shared expert (#46001), and native DSA indexer decode for next_n > 2 on SM100 (#45322). It is now enabled on SM120 alongside GLM-5.1 (#43477), with XPU (#44144, #44517, #45240) and ROCm (#44899, #45103, #45681) attention/MoE paths added.

Model Runner V2 (MRv2) continues to expand: MRv2 now supports quantized models by default (#44446), enables GraniteMoE by default (#45461), and gained migration of Qwen + DeepSeek-V2 MoE models (#42667), DFlash speculative decoding (#44586), and more accurate FP32 Gumbel sampling (#45996).

Streaming Parser Engine: A new streaming parser engine unifies tool-call/reasoning parsing across models, with parsers for Qwen3 (#45413), MiniMax-M2 (#45701), GLM-4.7/5.1/5.2 (#45915), and Nemotron V3 (#45755).

Diffusion LLMs: Added DiffusionGemma (#45163), including a CPU path (#45690) and structured-output guardrails for diffusion decoders (#45468).

WideEP / DeepEP v2: Integrated DeepEP v2 for expert parallelism (#41183), with follow-on robustness fixes (#46404, #46432).

Rust frontend matures further: Added API-key authentication (#44321), CORS (#45753), /tokenize + /detokenize (#44222), /pause /resume /is_paused (#44499), /abort_requests (#44382), /get_world_size (#44801), thinking_token_budget (#46137), a Python bridge for Rust tool parsers (#44624), and many new parsers and validation paths.

Device selection change: vLLM no longer sets CUDA_VISIBLE_DEVICES internally; a new device_ids argument is provided instead (#45026). On ROCm, a deprecation window for CUDA_VISIBLE_DEVICES has begun (#46636).
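For callers that previously relied on CUDA_VISIBLE_DEVICES, one possible migration step is to turn the variable into an explicit list of device indices. The sketch below does only that. The release notes do not show how the new device_ids argument is passed, so no vLLM call appears here; the helper name `visible_device_ids` is hypothetical, and only integer indices are handled (GPU UUID entries are rejected).

```python
import os
from typing import Mapping, Optional


def visible_device_ids(env: Mapping[str, str] = os.environ) -> Optional[list[int]]:
    """Parse CUDA_VISIBLE_DEVICES into a list of integer device indices.

    Returns None when the variable is unset (no restriction requested)
    and an empty list when it is set but empty (no devices visible).
    Assumption: entries are plain integer indices; UUID-style entries
    such as 'GPU-...' raise ValueError instead of being guessed at.
    """
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    ids: list[int] = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        if not (item.isascii() and item.isdigit()):
            raise ValueError(f"non-integer device entry: {item!r}")
        ids.append(int(item))
    return ids


if __name__ == "__main__":
    # The resulting list is what a caller might hand to a device_ids-style
    # argument; the exact parameter shape is not given in the notes above.
    print(visible_device_ids({"CUDA_VISIBLE_DEVICES": "0, 2,3"}))
```

Keeping the parsing separate from the engine call makes it easy to adapt once the exact form of device_ids is confirmed against the vLLM documentation.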

Model Support

New models: MiniMax-M3 (#45381), DiffusionGemma (#45163) + Gemma Diffusion on CPU (#45690), Hierarchical Reasoning Model — Text / HrmTextForCausalLM (#43098), OpenMOSS (#44124).

Gemma 4: Unified FlashAttention (FA4) across all layers + mm_prefix support (#42175); many parser/serving fixes — forced-JSON skip for required/named tool choice (#45795), parsing with thinking disabled (#45832), streaming reasoning-state init (#45852), reasoning rendering on assistant turns (#45867), offline-parser truncation/token-leak fix (#45553); legacy Gemma4 parsers replaced with an engine-based implementation (#45588).

DeepSeek-V4: OOM fix (#44914), MTP projection prefixing (#44821), supported KV-cache dtypes (#44892).

Qwen / multimodal: Qwen3-VL video loader (#44412), Qwen2-VL/Qwen2.5-VL processor-mapped video loader (#45555), Qwen3-VL multi-video processing optimization (#46026) and multi-video crash fix (#46305), Qwen3-Omni VIT cu_seqlens device fix (#44264), fused qk-rmsnorm-rope-gate for Qwen3.5 (#44176), Qwen3.5 EP weight-loading fix (#45002).

ViT full CUDA graph: GLM-4.1V (#40576), DeepSeek-OCR dual-path (#43586), Kimi-VL (#41992), mllama4 (#40660), Lfm2VL encoder (#44930).

Other model fixes: Llama4 weight loading (#45047) and streamed loading to avoid host-OOM (#44645), MiMo v2.x QKV TP sharding + FP4 (#45200), ColQwen3.5 retrieval correctness (#46108), EXAONE-4.5 vision encoder (#45073), MiDashengLM TP>1 audio-encoder crash (#44408), MiniCPM-o/V device-placement and image-size fixes (#43844, #42332, #44980, #45244), Cohere2 MoE weight loading + parser (#44747, #44907), Nemotron V3 reasoning-as-content (#39091), ColBERT AutoWeightsLoader + query/document embedding io processor (#44999, #45210).

Kernels: GLM-5 TRT-LLM ragged MLA prefill dimensions (#43525), GLM-5 router GEMM (#46385).

Engine Core

Model Runner V2: Quantized models by default (#44446), GraniteMoE default (#45461), Qwen/DSv2 MoE migration (#42667), DFlash (#44586), simplified async output handling (#45442), attention-group split on num_heads_q (#45564), LoRA warmup fix (#35536), more accurate FP32 Gumbel sampling (#45996), min_tokens off-by-one fix in the V2 GPU sampler (#46243), plus assorted model/config compatibility fixes (#45868).

Speculative decoding: Dynamic SD (#32374); DFlash with FlashInfer (#43081), mixed KV page sizes (#45181), and Qwen3Next targets (#45319); EAGLE3 support for Qwen3 (#43132); reduced TP communication for large-vocab drafts (#39419); race fix in async accepted counts (#45100); EAGLE multimodal encoder cache fixes (#46315).

KV cache & scheduler: KV-cache watermark to reduce preemptions (#44594), two-phase allocation for cross-group prefix-cache hits (#44409), Marconi-style admission policy for hybrid cache (#37898), prefix-cache retention for Mamba/linear attention (#45845), DS Mamba tail-copy for MTP align mode (#45473), reduced scheduler copy overhead (#45840).

Attention: Re-enabled cross-layer KV cache layout for MLA via stride-aware kernels (#45111), MLA prefill FA4 fp8 output (#43050), FlexAttention custom mask mods made fully cudagraphable (#45232), triton diff-kv backend for MiMo (#41797), FlashMLA sparse accuracy fix (#36616).

Weight loading & core: fastsafetensors ParallelLoader for weight loading (#40183), release of cached device memory under pressure on UMA GPUs (#45179), structured outputs for beam search (#35022), device_ids arg / no internal CUDA_VISIBLE_DEVICES (#45026), graceful fallback when numactl --membind is blocked (#45438), config-class registration before tokenizer init (#40299), async scheduling with prompt embeds for multimodal models (#45673).

Large Scale Serving & Distributed

Expert parallel: DeepEP v2 integration (#41183) with token-bound and topk-index fixes (#46404, #46432); NIXL EP — DBO with NIXL EP (#45275), top-k index dtype query (#45298), NVFP4 post-receive quantization skip (#45606), elastic-EP communicator (#45013); reject NCCL-based EPLB with async EPLB (#44978).

KV connectors / disaggregated serving: KV push from prefill to decode via NIXL (#35264); per-region KV transfer classification for mixed full-attn + MLA groups (#44583); Mooncake pipeline-parallel PD support (#44528), async lookup (#45659), compact chunk-hash zero-copy lookup (#45969), SWA-block skipping (#45444); P/D fixes with DP supervisor (#46628) and DSV4 disaggregation (#45831); removed P2pNcclConnector (#44854).

KV offloading: Multi-tier async batched lookup (#44193), packed HMA KV-cache layout (#46205, gated #46252), parallel-agnostic fs-tier cache (#44733), offloading-manager stats (#35669) and labeled/CPU-usage metrics (#45957, #45737), self-describing KV events (#43468), non-blocking idle flush (#45595), and numerous correctness/race fixes (#44784, #45823, #46231, #46278).

Distributed core: Prefill step cadence for better non-PD DP balancing (#44558), KV-event map encoding (#42892), one-shot fused all-reduce PDL NaN fix (#45448).

Hardware & Performance

NVIDIA / kernels: SM90 CUTLASS FP8 mm odd-M support via swap_ab (180–290% kernel speedup) (#44572), tuned fused_moe FP8 for Qwen3-Next-80B on H100 (+25%) (#44830), native DSA indexer decode on SM100 (#45322), cluster-cooperative topK for DeepSeek low-latency (#43008), PDL support for DeepGEMM (#46006), FlashInfer cutedsl NVFP4 GEMM (#42235) and cute-dsl MXFP8 linear kernel (#46393), new Helion kernels for FP8/RMSNorm quant (#36902, #33790, #36895, #34432).

torch stable ABI: Continued (and completed) migration of kernels to the libtorch stable ABI — MoE [10c/n] (#44565), Marlin [11a/n] (#45176), Machete [11b/n] (#45304), final _C library migration [12/n] (#45415).

AMD ROCm: Torch 2.11 (#45362); fused AR + RMSNorm + per-group FP8 quant (#42864), fused softplus-sqrt-topk MoE router under AITER (#44945), DSv4 flash-decode split-K kernel (#44899) and inverse-RoPE fusion (#45103), W4A16 FlyDSL MoE (#44400), A8W4 MoE CDNA4 swizzle gate for gpt-oss (#44804); deprecation window begun for CUDA_VISIBLE_DEVICES on ROCm (#46636).

Intel XPU: Sequence-parallel support (#38608), torch-xpu 2.12 (#42262), vllm-xpu-kernels v0.1.10 (#40367), W4A16 int4 group_size=32 MoE (#45136), DeepSeek-V4 attention/MoE paths (#44144, #44517, #45240), top-p sampling correctness fix (#44470).

CPU & other architectures: 2.5× faster ASR CPU preprocessing via multi-threading (#44612), CPU W4A16 INT4 MoE (#43409), cgroup memory-limit-aware KV cache sizing (#45086), RISC-V oneDNN W8A8 INT8 (#44478) and RVV micro-GEMM for WNA16 (#44324), pinned memory for WSL2 (#41496), ZenCPU runtime logging (#42726).
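The idea behind cgroup-aware sizing can be illustrated with a small sketch: inside a container, the usable memory is the smaller of the host total and the cgroup limit. This is not vLLM's implementation (#45086 is not reproduced here). It assumes the cgroup v2 `memory.max` file format, where the content is either a byte count or the literal `max`, and the function names and the `fraction` parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Standard cgroup v2 location; assumed for illustration only.
CGROUP_V2_LIMIT = Path("/sys/fs/cgroup/memory.max")


def parse_cgroup_limit(text: str) -> Optional[int]:
    """Parse the content of memory.max: a byte count, or 'max' for no limit."""
    value = text.strip()
    if value == "max":
        return None
    return int(value)


def effective_memory_bytes(host_total: int, cgroup_limit: Optional[int]) -> int:
    """Memory actually available: the host total capped by the cgroup limit."""
    if cgroup_limit is None:
        return host_total
    return min(host_total, cgroup_limit)


def kv_cache_budget(host_total: int, cgroup_limit: Optional[int], fraction: float) -> int:
    """Hypothetical budget: a fixed fraction of the effective memory."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(effective_memory_bytes(host_total, cgroup_limit) * fraction)


if __name__ == "__main__":
    limit = None
    if CGROUP_V2_LIMIT.exists():
        limit = parse_cgroup_limit(CGROUP_V2_LIMIT.read_text())
    # 64 GiB host total is an example value, not a measured one.
    print(kv_cache_budget(64 * 2**30, limit, 0.5))
```

Sizing from the host total alone would over-allocate in a container whose limit is lower than the host's physical memory, which is the situation this kind of check is meant to avoid.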

TPU: tpu-inference upgraded to v0.22.1 (#45793).

Misc perf: VLLM_TRITON_FORCE_FIRST_CONFIG to skip Triton autotuning (#42425), Triton recompile detection (#45631), fused multi-group block-table staged writes (#44944).

Quantization

Online & mixed-precision: Online FP8 per-token-per-channel (PTPC) quantization (#44132); modelopt_mixed support extended to Ampere/SM80-86 (#45306) and Turing/SM75 (#45375).

FP4 / MXFP: FlashInfer cutedsl NVFP4 GEMM backend (#42235) and cute-dsl MXFP8 linear kernel (#46393), MXFP4 W4A4 MoE CUTLASS E8M0 scale fix (#43557), SwiGLU clamp wired for NVFP4 MoE on non-Blackwell (#45836), flashinfer_cutlass allowed as a clamped NVFP4 MoE backend (#46492), NVFP4/OCP MX MoE emulation fix (#46254), FP8 MoE re-enabled on NVIDIA Thor (#46339).

... (truncated)

Commits

ee0da84 [KV-Offloading] Fix tensors_per_block stride (#46888)
217c64a [CI] Raise gsm8k startup timeout for MoE Refactor Qwen3 NVFP4 configs (#46882)
cfe8a4d [CI] Raise gsm8k startup timeout for Qwen3 NVFP4 trtllm configs (#46881)
6d37570 Fix P/D with DP Supervisor (#46628)
f85a9f1 [Bugfix] FLASHINFER_MLA_SPARSE_SM120 compatibility with GLM-5 NVFP4 (#46506)
836b5ac [ROCm] Begin Deprecation Window for CUDA_VISIBLE_DEVICES on ROCm (#46636)
b36db10 [KV Offload] Gate packed HMA KV cache on cross-layer config (#46252)
b70c13e [Bug] Fix `IndentationError: expected an indented block after 'with' statemen...
6829a6d [Bugfix] Re-enable FP8 MoE on NVIDIA Thor (#46339)
6ed56e0 [Bugfix] Fix illegal memory access from a forward during a partial wake_up (#...
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Updates the requirements on [vllm](https://github.com/vllm-project/vllm) to permit the latest version.
- [Release notes](https://github.com/vllm-project/vllm/releases)
- [Changelog](https://github.com/vllm-project/vllm/blob/main/RELEASE.md)
- [Commits](vllm-project/vllm@v0.22.1...v0.24.0)

---
updated-dependencies:
- dependency-name: vllm
  dependency-version: 0.24.0
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>

dependabot Bot added the dependencies and python labels Jul 1, 2026

dependabot Bot mentioned this pull request Jul 1, 2026

chore(deps-dev): update vllm requirement from >=0.22.1 to >=0.23.0 #450

Closed

dependabot[bot] wants to merge 1 commit into main from dependabot/pip/vllm-gte-0.24.0

dependabot Bot commented on behalf of github Jul 1, 2026