Insights: vllm-project/vllm
149 Pull requests merged by 92 people
-
[XPU][P/D] Add XPU support in NixlConnector
#22436 merged
Sep 5, 2025 -
[gpt-oss] tool parser support for /chat/completions [1/n]
#22386 merged
Sep 5, 2025 -
[Frontend] Skip unnecessary detokenization when token_id is requested
#24236 merged
Sep 4, 2025 -
[CI/Build] Reduce the number of redundant cases to test for LoRA
#24276 merged
Sep 4, 2025 -
[Bugfix][Misc] Fix silu_and_mul_nvfp4_quant issue and extract common utils for nvfp4 kernel source files
#23727 merged
Sep 4, 2025 -
[Misc] Have AsyncLLM `custom_stat_loggers` extend default logger list
#20952 merged
Sep 4, 2025 -
QWEN3 Coder Fused MoE kernels Optimization configs
#24266 merged
Sep 4, 2025 -
Upgrade FlashInfer to v0.3.0
#24086 merged
Sep 4, 2025 -
[Misc] Slightly improve deepgemm print
#24085 merged
Sep 4, 2025 -
[Doc]: fix typos in Python comments
#24173 merged
Sep 4, 2025 -
[Perf] Freeze core engine proc heap after init
#24008 merged
Sep 4, 2025 -
[Misc] Removed force_fp8_e4m3fnuz from FP8LinearOp
#23725 merged
Sep 4, 2025 -
[LoRA]: Add lora support to qwen-2.5-omni
#24231 merged
Sep 4, 2025 -
[XPU] support Triton Attention backend on Intel GPU
#24149 merged
Sep 4, 2025 -
Use hidden_size_per_head as head_size fallback
#24221 merged
Sep 4, 2025 -
[Model] Add pp support for hunyuan
#24212 merged
Sep 4, 2025 -
[Doc] Update vLLM Singapore Meetup info
#24234 merged
Sep 4, 2025 -
[Feature][Response API] Add streaming support for non-harmony
#23741 merged
Sep 4, 2025 -
[Hardware][Apple-CPU] Disable OneDNN build for Apple Silicon
#24200 merged
Sep 4, 2025 -
[Attention] FlashAttn MLA
#14258 merged
Sep 4, 2025 -
[Bugfix] Fix Incremental Detokenization with `tokenizers == 0.22.0`
#24159 merged
Sep 4, 2025 -
[Attention][Platform] Refactor MLA to support Custom Op
#23332 merged
Sep 4, 2025 -
Improve flexibility of auto_tune.sh execution.
#23766 merged
Sep 4, 2025 -
[Core][Model] Terratorch backend integration
#23513 merged
Sep 4, 2025 -
[Model] Add MiDashengLM model support
#23652 merged
Sep 4, 2025 -
[Misc] Enhance output readability of helper script
#24214 merged
Sep 4, 2025 -
[CPU] Refactor CPU unquantized linear
#24150 merged
Sep 4, 2025 -
Migrate ultravox inputs to TensorSchema
#23503 merged
Sep 4, 2025 -
[Refactor] Introduce basic Renderer for completion-style request
#24010 merged
Sep 4, 2025 -
[Kernel][Bugfix] Fix grouped topk cu
#24146 merged
Sep 4, 2025 -
[Feature][Responses API]Support MCP tools with streaming mode + background mode
#23927 merged
Sep 4, 2025 -
Remove deprecated `PyNcclConnector`
#24151 merged
Sep 3, 2025 -
[Feature][gpt-oss] Add support for num_cached_tokens and num_reasoning_tokens tracking
#23460 merged
Sep 3, 2025 -
[Bugfix][DP] DP distribution does not require ray[default]
#23822 merged
Sep 3, 2025 -
[Feature][P/D]: Optimize NIXL Connector xfer Launch
#23887 merged
Sep 3, 2025 -
[Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend
#23289 merged
Sep 3, 2025 -
Migrate whisper inputs to TensorSchema
#23505 merged
Sep 3, 2025 -
[Kernels] Overlap shared experts with send/recv
#23273 merged
Sep 3, 2025 -
[V1] v1 engine + full CUDA graph support for PLaMo2
#23998 merged
Sep 3, 2025 -
[Bugfix] Fixing division by zero in triton_attn if query_heads/kv_heads > 16
#23424 merged
Sep 3, 2025 -
FIX: Add libnuma-dev to Dockerfile for dev stage
#20388 merged
Sep 3, 2025 -
Fix MiniMax attention module prefix and remove useless code
#23982 merged
Sep 3, 2025 -
Support add_generation_prompt in embeddings endpoint with chat request
#23931 merged
Sep 3, 2025 -
[CI] Accelerate mteb test by setting SentenceTransformers mteb score to a constant
#24088 merged
Sep 3, 2025 -
[Misc] Clean up deadcode for legacy processing pipeline
#24153 merged
Sep 3, 2025 -
[CI/Build] Serve images used by multimodal tests through local HTTP Server
#23907 merged
Sep 3, 2025 -
[Nixl] Heterogeneous TP support FlashInfer
#20189 merged
Sep 3, 2025 -
[distributed][rl] remove nccl cumem env var override
#24141 merged
Sep 3, 2025 -
[BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models
#24132 merged
Sep 3, 2025 -
[Misc] Add check for dual_chunk_attention
#24070 merged
Sep 3, 2025 -
[Doc]: fix typos in Python comments
#24115 merged
Sep 3, 2025 -
[Doc]: fix typos in Python comments
#24093 merged
Sep 3, 2025 -
[Compile] Fix Compile Warning for `w4a8_mm_entry.cu`
#23660 merged
Sep 3, 2025 -
fix some typos
#24071 merged
Sep 3, 2025 -
[V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing
#23656 merged
Sep 3, 2025 -
Upgrade xgrammar to 0.1.23
#22988 merged
Sep 3, 2025 -
Update release pipeline post PyTorch 2.8.0 update
#24073 merged
Sep 3, 2025 -
[XPU] Fix the bug of LoRA logits on the XPU platform
#24081 merged
Sep 3, 2025 -
[CI/Build] Disable SiluMul NVFP4 quant fusion tests
#24121 merged
Sep 2, 2025 -
[Bug] R1 Accuracy: Fix `routed_scaling_factor` Double Mul Issue
#24119 merged
Sep 2, 2025 -
[AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault
#23692 merged
Sep 2, 2025 -
[CI] Enable all hf transformers baselines in test_hybrid
#23936 merged
Sep 2, 2025 -
[Log] Only Print Profiler Results on Rank 0
#23370 merged
Sep 2, 2025 -
Fix weights loading for Apertus
#24100 merged
Sep 2, 2025 -
[Metrics] Deprecate TPOT in favor of ITL
#24110 merged
Sep 2, 2025 -
[Bugfix] Fix packed_factor missing attribute error
#23902 merged
Sep 2, 2025 -
Run ruff format on a few files.
#24075 merged
Sep 2, 2025 -
[Bugfix] Fix transform_config parsing in Compressed Tensors
#23945 merged
Sep 2, 2025 -
[Benchmark] Add support for local hf dataset path in benchmark
#23999 merged
Sep 2, 2025 -
[docs] add SYS_NICE cap & `security-opt` for docker/k8s
#24017 merged
Sep 2, 2025 -
[CI Failure] Skip failing nvfp4 silu test
#23959 merged
Sep 2, 2025 -
[Model] Classification models support logit_bias / sigmoid_normalize
#24031 merged
Sep 2, 2025 -
[BugFix] Fix EXAONE4 rotary embeddings
#23918 merged
Sep 2, 2025 -
[Gemma3n] Fix audio batching
#24052 merged
Sep 2, 2025 -
correct LWS deployment yaml
#23104 merged
Sep 2, 2025 -
[CI]: reduce HTTP calls inside entrypoints openai tests
#23646 merged
Sep 2, 2025 -
[Model] Support dp on ViT on GLM-4.5V
#23168 merged
Sep 2, 2025 -
[Doc]: fix typos in Python comments
#24077 merged
Sep 2, 2025 -
Migrate Interns1 inputs to TensorSchema
#23510 merged
Sep 2, 2025 -
[XPU][Feature] fp8 online quantization support for XPU
#23148 merged
Sep 2, 2025 -
Migrate OvisImagePatchInputs to TensorSchema
#22024 merged
Sep 2, 2025 -
Remove runtime checks based on pooling params
#24051 merged
Sep 2, 2025 -
[Bugfix] Fix the issue that Blip2ForConditionalGeneration' object has…
#24028 merged
Sep 2, 2025 -
[V1][Mamba1] - FP32 SSM Kernel Support
#23506 merged
Sep 2, 2025 -
[Doc]: fix typos in Python comments
#24042 merged
Sep 2, 2025 -
[bugfix]fix MTP hidden states
#24056 merged
Sep 1, 2025 -
[Chore][V0 Deprecation] Move LogProb to a separate file
#24055 merged
Sep 1, 2025 -
[Model] Support DP for ViT on Kimi-VL-A3B-Thinking-2506
#23817 merged
Sep 1, 2025 -
[docs][misc] IOProcessor plugins fixes
#24046 merged
Sep 1, 2025 -
[Misc] Minor code simplification for spec decode
#24053 merged
Sep 1, 2025 -
Document multi-proc method selection for profiling
#23802 merged
Sep 1, 2025 -
[Model]: support KeyeVL-1_5-8B
#23838 merged
Sep 1, 2025 -
[Doc]: Fix CPU install docs: force torch-backend=cpu to avoid GPU torchvision errors
#24033 merged
Sep 1, 2025 -
[Frontend] Gemma3n audio `transcriptions`/`translations` endpoint
#23735 merged
Sep 1, 2025 -
[Doc]: fix typos in Python comments
#24026 merged
Sep 1, 2025 -
[Kernel] Update DeepGEMM to latest commit
#23915 merged
Sep 1, 2025 -
[Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN
#20904 merged
Sep 1, 2025 -
[Misc] Enable V1 FP16 inference on pre-Ampere GPUs
#24022 merged
Sep 1, 2025 -
[Misc] add hash_function doc string
#24014 merged
Sep 1, 2025 -
[Bugfix] Add support for `<tool_call>` format in streaming mode for XLAM Tool Parser
#22769 merged
Sep 1, 2025 -
[Misc] IO Processor plugins for pooling models
#22820 merged
Sep 1, 2025 -
Migrate Phi4 inputs to TensorSchema
#23471 merged
Sep 1, 2025 -
[Misc] refactor code by import as for torch._inductor.config
#23677 merged
Sep 1, 2025 -
[CI/Build] Improve Tensor Schema tests speed by avoid engine core initialization
#23357 merged
Sep 1, 2025 -
[Misc] Move fast prefill logic to separate method
#24013 merged
Sep 1, 2025 -
Fix the bug related to loading GPTP INT3 weights.
#23328 merged
Sep 1, 2025 -
[Misc] Avoid redundant copy for encoder-only models
#24012 merged
Sep 1, 2025 -
[BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ)
#23994 merged
Sep 1, 2025 -
v1: Support KV events from connectors
#19737 merged
Sep 1, 2025 -
[Minor] Fix some random typos in comments
#24009 merged
Aug 31, 2025 -
vllm fix check on max vocab size
#22471 merged
Aug 31, 2025 -
[Doc]: fix typos in Python comments
#24001 merged
Aug 31, 2025 -
[Core][Multimodal] Allow passing `multi_modal_uuids` as multimodal identifiers.
#23394 merged
Aug 31, 2025 -
Fix wrong truncate_prompt_tokens type hint
#22761 merged
Aug 30, 2025 -
[LoRA] Much faster startup when LoRA is enabled
#23777 merged
Aug 30, 2025 -
[Misc] enhance type hint for rearrange return value
#23519 merged
Aug 30, 2025 -
[Refactor] refactor freezing_value/cuda_event initialize outside try finally
#23758 merged
Aug 30, 2025 -
[Misc] add reorder_batch AttentionMetadataBuilder
#23798 merged
Aug 30, 2025 -
Add LoRA support for DeepSeek models (V2, V3, R1-0528)
#23971 merged
Aug 30, 2025 -
[Model] Enable encoder DP for MiniCPM-V
#23948 merged
Aug 30, 2025 -
[UT] fix unify_kv_cache_configs when kv cache config needs sort
#23843 merged
Aug 30, 2025 -
[Bugfix] Fix test_lora_resolvers.py
#23984 merged
Aug 30, 2025 -
[V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba
#23831 merged
Aug 30, 2025 -
[Core] Cleanup TPU model runner for MM
#23894 merged
Aug 30, 2025 -
[CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant fusion
#23973 merged
Aug 30, 2025 -
[CI] Move testing image from remote URL to S3
#23980 merged
Aug 30, 2025 -
Add routed_scaling_factor to MoE grouped topk
#23123 merged
Aug 30, 2025 -
[Bugfix] Fix --config arg expansion called from api_server.py
#23944 merged
Aug 30, 2025 -
[CI] Fix unavailable image remote URL
#23966 merged
Aug 29, 2025 -
[Misc] Make `download_weights_from_hf` more reliable
#23863 merged
Aug 29, 2025 -
Revert gemma3n fast prefill changes
#23897 merged
Aug 29, 2025 -
[Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-based models
#23824 merged
Aug 29, 2025 -
Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj
#23939 merged
Aug 29, 2025 -
[RL][BugFix] Fix missing tokenizer error for token-in-token-out
#23904 merged
Aug 29, 2025 -
[BUGFIX] fix undefined silu_and_mul_nvfp4_quant
#23929 merged
Aug 29, 2025 -
[CI] Add `aiter` to matching list of issue auto labeller for `rocm` tag
#23942 merged
Aug 29, 2025 -
[BugFix] Async scheduling and PP compatibility with DP
#23770 merged
Aug 29, 2025 -
[Models] Use in-place adds in Idefics2Vision
#23932 merged
Aug 29, 2025 -
[MODEL] `Apertus` and `XIELU`
#23068 merged
Aug 29, 2025 -
Adds `json_count_leaves` utility function
#23899 merged
Aug 29, 2025 -
Update PyTorch to 2.8.0
#20358 merged
Aug 29, 2025 -
[Multimodal] Consolidate mm inputs into MultiModalFeatureSpec
#23779 merged
Aug 29, 2025 -
[Performance] V1 Classify Models E2E Performance Optimization
#23541 merged
Aug 29, 2025 -
[CPU] Enable data parallel for CPU backend
#23903 merged
Aug 29, 2025 -
[V0 Deprecation] Remove pooling model support in V0
#23434 merged
Aug 29, 2025 -
Better errors for Transformers backend missing features
#23759 merged
Aug 29, 2025 -
[Misc] Fix warnings for mistral model
#23552 merged
Aug 29, 2025 -
[CI/Build] Clean up LoRA test
#23890 merged
Aug 29, 2025
167 Pull requests opened by 136 people
-
[feat] preserve metadata for quantized model weight reload
#23901 opened
Aug 29, 2025 -
allow calc_kv_scales
#23906 opened
Aug 29, 2025 -
[Benchmark] add benchmark for custom activation op
#23908 opened
Aug 29, 2025 -
[Model] enable data parallel for InternVL vision encoder
#23909 opened
Aug 29, 2025 -
[Attention]: Fix Torch compile error when --calculate-kv-scales is enabled
#23912 opened
Aug 29, 2025 -
[Core] Refactor EPLB
#23913 opened
Aug 29, 2025 -
[Performance] implement async_scheduling in single process mode
#23914 opened
Aug 29, 2025 -
kv_output_aggregator support heterogeneous
#23917 opened
Aug 29, 2025 -
Add automatic max model length selection
#23920 opened
Aug 29, 2025 -
[Model loader]: support multi-thread model weight loading
#23928 opened
Aug 29, 2025 -
Add actionable solutions to top 3 error messages
#23930 opened
Aug 29, 2025 -
Dequant kv_a_proj_with_mqa for DSV3
#23933 opened
Aug 29, 2025 -
[Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints
#23937 opened
Aug 29, 2025 -
[Bugfix] Handle the edge case in detokenizer where processed tokens contain both `stop` str and `eos` token
#23938 opened
Aug 29, 2025 -
[V1] [Hybrid] Mamba2 Automatic Prefix Caching
#23941 opened
Aug 29, 2025 -
fit the qwen3 moe's awq quantization for 2080Ti.
#23949 opened
Aug 29, 2025 -
[wip] allow skip media
#23950 opened
Aug 29, 2025 -
feat: Add Eagle3 speculative decoding support for Llama4
#23951 opened
Aug 29, 2025 -
[BugFix] Fix de-functionalization pass for rotary_embedding
#23953 opened
Aug 29, 2025 -
[Attention] FlashAttention MLA cudagraph support
#23958 opened
Aug 29, 2025 -
Remove old cutlass mla
#23961 opened
Aug 29, 2025 -
[Core] Add tensor analysis utility for multimodal cache debugging
#23962 opened
Aug 29, 2025 -
Enable Allgather/ReduceScatter backend for NaiveAllToAll
#23964 opened
Aug 29, 2025 -
[rocm] update pytorch rocm from 6.3 to 6.4
#23968 opened
Aug 29, 2025 -
[Kernel] Faster pre-processing time for W4A8
#23972 opened
Aug 29, 2025 -
[Hybrid Allocator] Support Pipeline Parallel
#23974 opened
Aug 30, 2025 -
Next Fix for Compile with Cuda 13
#23976 opened
Aug 30, 2025 -
Feature/vit attention unification #23880
#23978 opened
Aug 30, 2025 -
fix total_time of benchmark_hashing
#23987 opened
Aug 30, 2025 -
[Model] Add LongCat-Flash
#23991 opened
Aug 30, 2025 -
[Bugfix] Fix several issues with p2p xPyD in GET type
#23993 opened
Aug 30, 2025 -
Feature/deepseek v31 lora support
#23995 opened
Aug 30, 2025 -
[Feature]: Support Phi4Flash model in V1
#23996 opened
Aug 31, 2025 -
Feature/sampler benchmark #23977
#23997 opened
Aug 31, 2025 -
optimize serving_score loops.
#24000 opened
Aug 31, 2025 -
[V1][CUDA Graph] Fix attention metadata tensor sizes for padded batches
#24002 opened
Aug 31, 2025 -
[LoRA] Gemma3n LoRA support
#24003 opened
Aug 31, 2025 -
[V1][Metrics] Add per-request TPOT histogram
#24015 opened
Sep 1, 2025 -
Allow loading of cpatonn/InternVL3_5-14B-AWQ-4bit
#24018 opened
Sep 1, 2025 -
[Model] Add Eagle 2.5 VL
#24019 opened
Sep 1, 2025 -
[Bugfix] Fix sequence parallelism bug when enabling pipeline parallelism
#24021 opened
Sep 1, 2025 -
[Feature][Quantization] auto_round format add support for regex
#24024 opened
Sep 1, 2025 -
Support using SigLIP2 text and image embedding as standalone model
#24027 opened
Sep 1, 2025 -
Fix typo in test_attention_backends.py
#24030 opened
Sep 1, 2025 -
[BugFix] GPT-OSS Attention DP + MoE TP weight loading issue
#24032 opened
Sep 1, 2025 -
[Hardware][IBM Z] Fix Outlines Core issue for s390x
#24034 opened
Sep 1, 2025 -
[Feature][Quantization] Support Quark for mixed-precision quantized model
#24040 opened
Sep 1, 2025 -
[Docs] Enable relative links in examples to function when rendered in the docs
#24041 opened
Sep 1, 2025 -
Update to Transformers 4.55.3
#24043 opened
Sep 1, 2025 -
Issue 19007 Individual GuidedDecodingParams for each prompt in prompts
#24047 opened
Sep 1, 2025 -
[Spec Decoding] Support Spec Decoding Metrics in DP Mode
#24049 opened
Sep 1, 2025 -
[Kernels][DP/EP] Optimize Silu Kernel for R1
#24054 opened
Sep 1, 2025 -
[CI] Replace large models with tiny alternatives in tests
#24057 opened
Sep 1, 2025 -
Use Numpy array for sampled_token_ids
#24061 opened
Sep 2, 2025 -
[P/D]support for the v1/chat/completions interface to the disagg_proxy_server
#24065 opened
Sep 2, 2025 -
[BugFix] `python collect_env.py` and `vllm collect-env` compatibility with uv venv
#24066 opened
Sep 2, 2025 -
Gfx908 attn fix
#24068 opened
Sep 2, 2025 -
Reconstruct EPLB algorithm invocation method
#24069 opened
Sep 2, 2025 -
[Bugfix] Fix Qwen3-coder moe tuned config
#24072 opened
Sep 2, 2025 -
[BugFix][Model] Fix Ernie4.5-VL hanging on long inputs
#24074 opened
Sep 2, 2025 -
[Core] Remove tokenizer group in vLLM
#24078 opened
Sep 2, 2025 -
[CI] Move V1 Core tests to CPU
#24080 opened
Sep 2, 2025 -
Support LongCat-Flash-Chat tool call
#24083 opened
Sep 2, 2025 -
[Bugfix] Fix AssertionError in cache_full_blocks due to dirty blocks
#24084 opened
Sep 2, 2025 -
[Benchmarks] Add --skip-check argument to reduce wait time
#24087 opened
Sep 2, 2025 -
[V1] Add sliding window support to Flex Attention backend
#24089 opened
Sep 2, 2025 -
[Perf] EPLB optimize export_load_view update
#24091 opened
Sep 2, 2025 -
[Docs] Fix warnings in `mkdocs build` (continued)
#24092 opened
Sep 2, 2025 -
[ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops
#24097 opened
Sep 2, 2025 -
The downloaded tags directory is missing a `.git` folder, which is ca…
#24099 opened
Sep 2, 2025 -
[Bugfix] Enable swiglu oai for fused marlin moe
#24101 opened
Sep 2, 2025 -
[CI] Add nightly multiarch manifests to dockerhub
#24102 opened
Sep 2, 2025 -
[Bugfix] sliding_window AttributeError
#24103 opened
Sep 2, 2025 -
Update num_tokens_across_dp to use nccl instead of gloo
#24105 opened
Sep 2, 2025 -
[Transform] Deterministic Hadacore Transforms
#24106 opened
Sep 2, 2025 -
fixed reasoning streaming with tool_choice="required"
#24108 opened
Sep 2, 2025 -
[Kernels][AR] Enable Torch Symmetric Memory By Default
#24111 opened
Sep 2, 2025 -
[Kernel] Split moe tuned configs
#24113 opened
Sep 2, 2025 -
test_chunked_prefill_pooler referencing #23436
#24114 opened
Sep 2, 2025 -
[Models][Quantization] Add quantization configuration update in Voxtral model
#24122 opened
Sep 2, 2025 -
[Compilation][WideEP] Enable Piecewise CUDAGraph for DeepEPHT
#24123 opened
Sep 2, 2025 -
update spec decode metrics to use throughput
#24127 opened
Sep 2, 2025 -
[Core] Run garbage collector after CUDA graph capture to fix throughput regression
#24128 opened
Sep 2, 2025 -
[Hardware][Apple-CPU] Enable native bfloat16 on Apple Silicon (M2 and later)
#24129 opened
Sep 2, 2025 -
[CI Sprint] Quantization CI Cleanup
#24130 opened
Sep 2, 2025 -
[Ultravox] Fix gemma instantiation
#24131 opened
Sep 2, 2025 -
[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE
#24134 opened
Sep 3, 2025 -
WIP [Renderer] Move Processor out of AsyncLLM
#24138 opened
Sep 3, 2025 -
reduce the weight loading time
#24154 opened
Sep 3, 2025 -
[Kernel][SM100]: Enable FI FusedMoE By Default for Llama
#24157 opened
Sep 3, 2025 -
[GPT-OSS] Fix Pydantic union resolution for ResponseFunctionToolCall in Responses API
#24158 opened
Sep 3, 2025 -
[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames
#24161 opened
Sep 3, 2025 -
[Log] Per Rank Log
#24162 opened
Sep 3, 2025 -
fix some typos
#24167 opened
Sep 3, 2025 -
[logging] Refine PyNcclConnector Proxy logging
#24168 opened
Sep 3, 2025 -
Optimize detokenizer performance for long-generation sequences
#24174 opened
Sep 3, 2025 -
[Core] Exposing engine sleep & wake_up state as prometheus metrics
#24176 opened
Sep 3, 2025 -
[MacOS] skip pip-compile for pre-commit on MacOS
#24177 opened
Sep 3, 2025 -
[Bugfix] fix modelopt exclude_modules name mapping
#24178 opened
Sep 3, 2025 -
[Feature] add reasoning tokens
#24181 opened
Sep 3, 2025 -
[Misc] Harden `SamplingParams.from_optional` support
#24183 opened
Sep 3, 2025 -
[Spec Decode][Model]Add qwen2-eagle
#24187 opened
Sep 3, 2025 -
[torch.compile] Custom op matching
#24188 opened
Sep 3, 2025 -
[CI/Build] bump timm dependency
#24189 opened
Sep 3, 2025 -
[feat]: Create interface for model-specific M-RoPE
#24194 opened
Sep 3, 2025 -
[Docs] Fix install device tabs being out of sync when directly linked to
#24195 opened
Sep 3, 2025 -
[flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention
#24197 opened
Sep 3, 2025 -
Support prompt hidden states
#24202 opened
Sep 3, 2025 -
[Refactor] Refactor to extract model forward logic to allow plug-in t…
#24205 opened
Sep 4, 2025 -
[Frontend] add 'verbose_json' and 'timestamp' feature on Whisper Transcription/Translation
#24209 opened
Sep 4, 2025 -
[DO NOT MERGE] PR for testing
#24210 opened
Sep 4, 2025 -
[Docs] add eplb_config param usage docs
#24213 opened
Sep 4, 2025 -
[backends][short_conv] CUDA graph piecewise edits
#24215 opened
Sep 4, 2025 -
[Misc] fix lmcache cpu offload example
#24216 opened
Sep 4, 2025 -
Fix Auto_Round Quantization Loading on SM75 and Lower GPUs
#24217 opened
Sep 4, 2025 -
[Core] Support async scheduling with uniproc executor
#24219 opened
Sep 4, 2025 -
[UT] enhance free kv cache block queue popleft_n
#24220 opened
Sep 4, 2025 -
[Docs] add the parallel sampling usage in LLMEngine and AsyncLLM
#24222 opened
Sep 4, 2025 -
[bugfix] fix returned chunk too large bug
#24224 opened
Sep 4, 2025 -
[Benchmarks] Accelerate random dataset generation
#24225 opened
Sep 4, 2025 -
[Misc] update log level debug to warning when process port is used by
#24226 opened
Sep 4, 2025 -
[Feature] support xPyD reconnect
#24227 opened
Sep 4, 2025 -
[kv cache] update num_free_blocks in the end
#24228 opened
Sep 4, 2025 -
[Misc] rename interval to max_recent_requests
#24229 opened
Sep 4, 2025 -
Fix unknown recipient none #24170
#24233 opened
Sep 4, 2025 -
Change the default value of truncate_prompt_tokens in the embedding/rerank/pooling model to -1
#24235 opened
Sep 4, 2025 -
[Feature][Quantization] extend Quark to support mixed-precision quantized model
#24239 opened
Sep 4, 2025 -
Eagle3 that supports the Minicpm3 model
#24243 opened
Sep 4, 2025 -
[Metrics] Hide deprecated metrics with gpu_ prefix
#24245 opened
Sep 4, 2025 -
[kernel] Add stride checks for rms_norm kernels
#24247 opened
Sep 4, 2025 -
[PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds
#24248 opened
Sep 4, 2025 -
[test] make NixlConnector example more clear
#24249 opened
Sep 4, 2025 -
[Core] Add delayed batching
#24250 opened
Sep 4, 2025 -
v1: CPU offloading
#24251 opened
Sep 4, 2025 -
[Compile] Conditional compilation. Introduce compile_ranges
#24252 opened
Sep 4, 2025 -
[CI] Speed up model unit tests in CI
#24253 opened
Sep 4, 2025 -
[Kernels] Overlap shared experts with combine instead of dispatch
#24254 opened
Sep 4, 2025 -
[CI/Build] Fail test groups fast using pytest -x and bash -e
#24255 opened
Sep 4, 2025 -
[Spec Decode] Fix offline spec_decode.py
#24257 opened
Sep 4, 2025 -
[ci/testing]: ensure the gpu memory is cleaned when exiting the remote OpenAI server
#24258 opened
Sep 4, 2025 -
[CI] Small Accuracy Eval Test for Deepseek Model
#24259 opened
Sep 4, 2025 -
[CI] Add timeouts to tests
#24260 opened
Sep 4, 2025 -
[do not merge] Tokens in<>out `/generate` endpoint
#24261 opened
Sep 4, 2025 -
[IGNORE] Timing model tests in fast-check
#24262 opened
Sep 4, 2025 -
Support SeedOss Reason Parser
#24263 opened
Sep 4, 2025 -
break execute_model in gpu_model_runner into sub-functions for custom scopes
#24265 opened
Sep 4, 2025 -
[Misc] Add ReplicaId to Ray metrics
#24267 opened
Sep 4, 2025 -
Draft: make deletion atomic in nixl timeout handling
#24268 opened
Sep 4, 2025 -
[Core] Simplify and unify mm uuid handling & auto-generated mm hash overrides processing.
#24271 opened
Sep 4, 2025 -
[Tests] fix initialization of kv hash in tests
#24273 opened
Sep 4, 2025 -
AOT Compilation for torch.compile (Bundled)
#24274 opened
Sep 4, 2025 -
[ROCm][Feature] Enable Pipeline Parallelism with Ray Compiled Graph on ROCm
#24275 opened
Sep 4, 2025 -
[Core] Support configuration parsing plugin
#24277 opened
Sep 4, 2025 -
[CORE] Prompt Embeddings Support for v1 Engine
#24278 opened
Sep 4, 2025 -
[ROCm][CI/Build] Sync ROCm dockerfiles with the ROCm fork
#24279 opened
Sep 4, 2025 -
CUDAGraph partition integration
#24281 opened
Sep 4, 2025 -
[Frontend][Responses API] Support reporting tool output tokens and fix reasoning token count
#24285 opened
Sep 5, 2025 -
Add Support for Grok2
#24286 opened
Sep 5, 2025 -
Fix cmake incremental build when running "pip install --no-build-isolation -e ."
#24287 opened
Sep 5, 2025 -
[Bugfix] guard missing attn_metadata in KV scales path
#24290 opened
Sep 5, 2025 -
[Doc]: fix typos in Python comments
#24294 opened
Sep 5, 2025 -
[feat] fast inplace model update
#24295 opened
Sep 5, 2025 -
Add vllm:request_prefill_comp_speed metric to Prometheus
#24296 opened
Sep 5, 2025
120 Issues closed by 26 people
-
[Bug]: Question about logprobs output being 0.0 when using `vllm` sampling params
#17286 closed
Sep 5, 2025 -
[Feature]: LoRA support for qwen2-vl Models
#11255 closed
Sep 5, 2025 -
[RFC]: Refactor tool parsers to eliminate coding errors and allow more efficient implementations.
#11522 closed
Sep 5, 2025 -
[Usage]: Automatic Prefix Cache life cycle
#12077 closed
Sep 5, 2025 -
[Misc] [ROCm]: Build from source failure with Arch/gcc14 with ROCm 6.3
#13777 closed
Sep 5, 2025 -
[Bug]: ModuleNotFoundError: No module named 'pyarrow" in main branch
#14487 closed
Sep 5, 2025 -
[Usage]: Can AsyncLLMEngine support batch infer?
#14717 closed
Sep 5, 2025 -
[Bug]: Design flaws in the current tool parser.
#15177 closed
Sep 5, 2025 -
[Bug]: H20*TP16, can't start service, get error: Cannot allocate memory
#16142 closed
Sep 5, 2025 -
[Bug]: vLLM still runs after Ray workers crash
#16259 closed
Sep 5, 2025 -
[Feature Request]: Support data_parallel_size in offline inference mode
#16588 closed
Sep 5, 2025 -
[Doc]: state requirements for testing or update to work for CPU-only
#16920 closed
Sep 5, 2025 -
[Bug]: swap_blocks and copy_blocks functions are wrong in flashinfer.py
#17362 closed
Sep 5, 2025 -
[Bug]: A800 GPU set VLLM_USE_V1=1 ValueError: No available memory for the cache blocks
#17431 closed
Sep 5, 2025 -
[Bug]: [v1][Spec Dec] Specifying draft TP does not have any impact.
#17499 closed
Sep 5, 2025 -
[Bug]: Can't serve Q4_K_M-GGUF model (can we serve it?)
#17661 closed
Sep 5, 2025 -
[Bug]: Slight Embedding Precision Difference When Running bge-m3 in vLLM Compared to Original Model
#17713 closed
Sep 5, 2025 -
[Feature]: Support for OpenGVLab/InternVL3-38B-AWQ
#17734 closed
Sep 5, 2025 -
[Feature]: Does vLLM allow 'dropping' requests instead of preempting them?
#17736 closed
Sep 5, 2025 -
[Bug]: Interrupting inference with ctrl-c causes future requests to hang
#17738 closed
Sep 5, 2025 -
[Feature]: Support quantization for pooling model which does embedding.
#17760 closed
Sep 5, 2025 -
[Usage]: How to Truncate multi-modal tokens
#17765 closed
Sep 5, 2025 -
[Bug]: Logits processing with Lora is incorrect
#17766 closed
Sep 5, 2025 -
[Feature]: Support for IBGDA
#17774 closed
Sep 5, 2025 -
[Bug]: Large Data Parallel Size Cause Loading Safetensors Extremely Slow
#17783 closed
Sep 5, 2025 -
[Usage]: Is it possible to use CUDA Graph during the encoding for encoder-decoder models?
#17789 closed
Sep 5, 2025 -
[Usage]: Self-deployed vLLM cannot call tools; needs --enable-auto-tool-choice, then prompts to configure --chat-template-content-format, and finally errors out
#17792 closed
Sep 5, 2025 -
[Usage]: How to output metrics information from vllm?
#17795 closed
Sep 5, 2025 -
[Usage]: how to return attention_weight logits in page_attention
#17796 closed
Sep 5, 2025 -
[Installation]: How to deploy docling model on vllm
#17807 closed
Sep 5, 2025 -
[Bug]: Disaggregated Prefill in vLLM 0.8.3 Produces Incorrect/Unreasonable Outputs
#17808 closed
Sep 5, 2025 -
[Usage]: Deploy EasyOCR , Docling models on vllm
#17814 closed
Sep 5, 2025 -
[Bug]: vllm 0.8.5.dev468+g98834fefa.precompiled OOM on Qwen3-32B with 1 lora module
#17822 closed
Sep 5, 2025 -
[Performance]: why the batch-embeddings inputs are separated to small single one?
#18867 closed
Sep 5, 2025 -
[Feature]: Add LoRA adapter support for Qwen2.5-Omni models
#24193 closed
Sep 4, 2025 -
[Bug]: PLaMo2.1 does not work with v1 engine
#24204 closed
Sep 4, 2025 -
[Feature][Responses API] Support MCP tool in background mode
#23295 closed
Sep 4, 2025 -
[Bug]: responses api - no error on exceeding `max_tokens`
#24184 closed
Sep 4, 2025 -
[Bug]: PyNcclConnector is deprecated, but some docs/tests still use it
#24152 closed
Sep 3, 2025 -
[Feature][Response API] Support `num_cached_tokens` and `num_reasoning_tokens` in ResponseUsage
#23363 closed
Sep 3, 2025 -
[Feature]: Support `Plamo2` Model in V1
#23956 closed
Sep 3, 2025 -
[Bug]: plamo2 broken on main using transformers==4.55.0
#22999 closed
Sep 3, 2025 -
[Bug]: Docker build fails on dev stage due to missing libnuma-dev
#20384 closed
Sep 3, 2025 -
[CI]: Host images used by multimodal tests locally
#23594 closed
Sep 3, 2025 -
[Usage]: How to start vllm actor of ray without loading weights when RLHF?
#24064 closed
Sep 3, 2025 -
[Doc]: content in "Add models with the FSDP backend" is expired
#24143 closed
Sep 3, 2025 -
[Bug]: Tensor-parallel offline inference fails with CalledProcessError: Command '['/usr/bin/gcc'....] returned non-zero exit status 1.
#15013 closed
Sep 3, 2025 -
[Bug]: Use the latest version of the inference model and use API calls to report errors.(V0.8.5)
#17430 closed
Sep 3, 2025 -
[Bug]: failed to run LMCache example for v0
#17545 closed
Sep 3, 2025 -
[Bug]: content is null when using "chat_template_kwargs": {"enable_thinking": false} in the request.
#17609 closed
Sep 3, 2025 -
[Bug]: Qwen2.5-vl-7B stuck after loading weight and use a lot of shared GPU memory
#17611 closed
Sep 3, 2025 -
[Usage]: vLLM on multiple node GPUs
#17645 closed
Sep 3, 2025 -
[Feature]: Support for streaming N tokens at a time in AsyncLLMEngine
#17681 closed
Sep 3, 2025 -
[Bug]: R1 Accuracy Issue in Main for `deepep_high_througput`
#24118 closed
Sep 2, 2025 -
[Usage]: `get_mempolicy: Operation not permitted` in docker
#24016 closed
Sep 2, 2025 -
[Bug]: Gemma3n audio path crashes when input_features is a list not a Tensor.
#24006 closed
Sep 2, 2025 -
[Doc]: LWS deployment yaml incorrect
#23103 closed
Sep 2, 2025 -
[RFC]: Custom sampling params support in REST API
#17191 closed
Sep 2, 2025 -
[Bug]: Quantized models - NotImplementedError: Could not run '_C::machete_prepack_B'
#16131 closed
Sep 2, 2025 -
[Bug]: `cannot access local variable 'hidden_states'` while trying to enable MTP for deepseek-r1
#23773 closed
Sep 2, 2025 -
[Bug]: Does V0 support DP?
#24036 closed
Sep 1, 2025 -
[Bug]: [P/D] the nixl_connector toy_proxy_server.py will always return httpstatus 200 OK
#23981 closed
Sep 1, 2025 -
[Bug]: Outputs always miss responses if n of SamplingParams>1 with AsyncLLM!
#24029 closed
Sep 1, 2025 -
[Usage]: Is vllm actor of ray an asynchronous engine and supports continuous batching when RLHF?
#23990 closed
Sep 1, 2025 -
[RFC]: Hidden states processor
#12249 closed
Sep 1, 2025 -
[Bug]: Do we really need to implement additional functions for custom_allreduce to serve graph capture?
#18899 closed
Sep 1, 2025 -
[Bug]: HF_HUB_OFFLINE Parameter does not take effect
#22492 closed
Sep 1, 2025 -
[Doc]: Is Qwen2.5's long context YARN handled?
#8793 closed
Sep 1, 2025 -
[Performance]: vllm Eagle performance is worse than expected
#9565 closed
Sep 1, 2025 -
[Bug]: vllm serve: error: the following arguments are required: model_tag
#13150 closed
Sep 1, 2025 -
[Bug]: AssertionError - assert loaded_weight.shape[output_dim] == self.org_vocab_size
#15124 closed
Sep 1, 2025 -
[Bug]: Can't run vllm model because of the FlashAttention.
#15238 closed
Sep 1, 2025 -
[Bug]: OPEA/Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym error
#15300 closed
Sep 1, 2025 -
[Bug]: Llama-3.2-11B-Vision-Instruct has an issue in vision language embedding
#15496 closed
Sep 1, 2025 -
[Bug]: Fail to use deepseek vl2 with images, maybe need a new chat template?
#16953 closed
Sep 1, 2025 -
[Bug]: `http*` metrics missing when running with V0 engine
#17406 closed
Sep 1, 2025 -
[Bug]: 0.8.5 fails when deploying the qwen-vl model; downgrading to 0.8.4 works fine
#17456 closed
Sep 1, 2025 -
[Feature]: benchmarks for vllm, it should support OpenAI Chat Completions API
#17586 closed
Sep 1, 2025 -
[Bug]: Cannot load Gemma3 27b QAT GGUF on RTX 5090
#17587 closed
Sep 1, 2025 -
[Bug]: fp8 w8a8 quantized Qwen2.5-VL hits AssertionError
#17595 closed
Sep 1, 2025 -
[Bug]: [Precision issues] test_flash_attn.py::test_flash_attn_with_paged_kv
#17610 closed
Sep 1, 2025 -
[Usage]: NCCL error when using tow AMD GPUs ( gfx1100 )
#18805 closed
Sep 1, 2025 -
[Bug]: nrt_tensor_allocate status=4 message="Allocation Failure" on AWS Neuron
#12443 closed
Aug 31, 2025 -
[Feature]: Better systemd security feature support
#12474 closed
Aug 31, 2025 -
[Usage]: How to get "num_gpu_blocks" in V1?
#15538 closed
Aug 31, 2025 -
[Bug]: Vllm 0.8.2 + Ray 2.44 (Ray serve deployment) fallbacks to V0 Engine
#15569 closed
Aug 31, 2025 -
[Performance]:
#16342 closed
Aug 31, 2025 -
[Bug]: Why does the deployment process hang when deploying qwen2.5-vl-32b-instruct?
#17151 closed
Aug 31, 2025 -
[Bug]: [v0.8.5] Qwen3 returned reasoning content, but --enable-reasoning was not enabled.
#17346 closed
Aug 31, 2025 -
[Bug]: Can't configure VllmConfig
#17376 closed
Aug 31, 2025 -
[Bug]: fused moe lose weight_loader in verl
#17429 closed
Aug 31, 2025 -
[Bug]: Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal!
#17432 closed
Aug 31, 2025 -
[Usage]: OOM happened when running DeepSeek-R1-BF16 with 80k max model len on 16 GPUs with 90G memory
#17470 closed
Aug 31, 2025 -
[Bug]: [V1][Spec Dec] Rejection sampler accepts different tokens when TP > 1 and Temp > 0
#17498 closed
Aug 31, 2025 -
Issue attempting to serve a model from HF with base model `Llama-3.1-8B-Instruct`
#17505 closed
Aug 31, 2025 -
[Bug]:
#17516 closed
Aug 31, 2025 -
[Bug]: vllm-v0 engine Qwen2.5 Model run eagle algo, KeyError: 'norm.weight' bugfix
#17517 closed
Aug 31, 2025 -
[Bug]: Training with vllm not supports Qwen3
#17527 closed
Aug 31, 2025 -
[Usage]: understanding the vllm's gpu_memory_utilization and cuda graph memory requirement
#17549 closed
Aug 31, 2025 -
[Bug]: Possible mismatch in `truncate_prompt_tokens` value validation for `-1`
#22635 closed
Aug 30, 2025 -
[Bug]: CUDA illegal memory access error on 2x RTX PRO 6000 GPUs with --tensor-parallel-size=2
#23781 closed
Aug 30, 2025 -
[Feature]: Implement vAttention: Virtual Memory Management for KV Cache on NVIDIA GPUs
#17612 closed
Aug 30, 2025 -
[Usage]: Do vllm actor of ray an asynchronous engine and supports continuous batching?
#23989 closed
Aug 30, 2025 -
[Installation]: Dependency conflict installing vLLM 0.6.3 due to outlines → pyairports dependency
#23983 closed
Aug 30, 2025 -
[New Model]: stepfun-ai/GOT-OCR2_0
#9606 closed
Aug 30, 2025 -
[Usage]: how to run a cluster without docker
#12053 closed
Aug 30, 2025 -
[Bug]: Issue with SpecDecode when using data parallel
#17056 closed
Aug 30, 2025 -
[Renderer]: Consolidate MM classes to `MultiModalFeatureSpec`
#23872 closed
Aug 29, 2025 -
[Feature]: RuntimeError: FlashAttention only supports Ampere GPUs or newer.
#8189 closed
Aug 29, 2025 -
[Installation]: ImportError: libtorch_cuda.so: cannot open shared object file: No such file or directory
#23910 closed
Aug 29, 2025 -
[Bug]: rocm build crashes with libcuda.so.1: cannot open shared object file
#19681 closed
Aug 29, 2025 -
Tool calls not triggered properly with vLLM 0.8.5 and Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
#17821 closed
Aug 29, 2025
110 Issues opened by 98 people
-
[Bug]: Crash on --otlp-traces-endpoint=${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT} when CPU mode
#24297 opened
Sep 5, 2025 -
[RFC]: Environment variable to switch backend between CPU and GPU
#24293 opened
Sep 5, 2025 -
[Usage]: Adjusting reasoning efforts for GPT-OSS in direct sampling
#24292 opened
Sep 5, 2025 -
[Bug]: ARM V1 version dependency on OneDNN
#24291 opened
Sep 5, 2025 -
[Bug]: module 'triton.language' has no attribute 'constexpr_function'
#24289 opened
Sep 5, 2025 -
[RFC]: Support Returning Prompt Hidden States
#24288 opened
Sep 5, 2025 -
[RFC]: Support reporting tool output tokens in OutputTokensDetails
#24284 opened
Sep 5, 2025 -
[Bug]: GPT-OSS more robust way to handle messages in commentary channel
#24283 opened
Sep 5, 2025 -
[Bug]: Deployment of Apertus-Instruct-8B failed with error
#24282 opened
Sep 4, 2025 -
[Feature]: decouple attention backend block size from KVCacheManager block size
#24280 opened
Sep 4, 2025 -
[Bug]: Multi-node DeepSeek-V3-0324 errors out with CUDA Illegal Memory Access
#24272 opened
Sep 4, 2025 -
[Bug]: CPU Memory leak in P/D disaggregation (with NIXL?)
#24264 opened
Sep 4, 2025 -
[RFC]: Add a cache hit threshold to enable simple PD-Disaggregation implementations
#24256 opened
Sep 4, 2025 -
[Installation]: Warning on char conversion on aarch64
#24246 opened
Sep 4, 2025 -
[Improvement]: The fixed "language_model" prefix issue in multimodal models
#24244 opened
Sep 4, 2025 -
[Bug]: The output of the default Qwen3-reranker example is inaccurate and does not match (different ordering from) the vLLM example provided in the Qwen3-embedding repository.
#24242 opened
Sep 4, 2025 -
[Bug]: v0.10.1rc1 inference occasionally raises RuntimeError: ACL stream synchronize failed, error code:507035
#24241 opened
Sep 4, 2025 -
[Bug]: Xformers is not available, falling back, even though I have Xformers installed
#24237 opened
Sep 4, 2025 -
[Bug]: RuntimeError: There is no current event loop in thread 'MPClientEngineMonitor'.
#24230 opened
Sep 4, 2025 -
[Feature]: Propose in docs a complete example of `pyproject.toml` to be used directly with `uv sync`
#24218 opened
Sep 4, 2025 -
[Bug]: Detokenizer Overflow error occurred on DeepSeek-R1/V3
#24211 opened
Sep 4, 2025 -
[Bug]: KeyError: 'model.layers.60.mlp.experts.w2_weight'
#24208 opened
Sep 4, 2025 -
[Feature]: Support similar API, such as /health_generate
#24207 opened
Sep 4, 2025 -
[Installation]: fail to install in cuda 118 with v100.
#24206 opened
Sep 4, 2025 -
[Feature][gpt-oss] Responses API test enhancement
#24201 opened
Sep 3, 2025 -
[Feature][gpt-oss] Python Tool Test Enhancement
#24199 opened
Sep 3, 2025 -
[Feature][gpt-oss]: Browser Tool Test Enhancement
#24198 opened
Sep 3, 2025 -
[Feature]: Expose Componentized GPU Memory Metrics
#24196 opened
Sep 3, 2025 -
[New Model]: New model support stepfun-ai/Step-Audio-2-mini
#24192 opened
Sep 3, 2025 -
[Feature]: Model FLOPs Utilization Reporting
#24190 opened
Sep 3, 2025 -
[Usage]: how does v1 engine perform the model parameter hot update?
#24186 opened
Sep 3, 2025 -
[Feature]: Extend QuantFP8 to support per-token-group quantization
#24185 opened
Sep 3, 2025 -
FastAPI Swagger Documentation Name to be Updated to the Model Name
#24182 opened
Sep 3, 2025 -
[Feature]: Expose Engine Sleep & Wake_up Mode as Prometheus Metrics
#24175 opened
Sep 3, 2025 -
[MM processor]: Benchmark mm processor's performance
#24171 opened
Sep 3, 2025 -
[Bug]: Intermittent "Unknown recipient: None" when calling gpt-oss-20b with Responses
#24170 opened
Sep 3, 2025 -
[Performance]: MoE FP8 and Gemm FP8 for CPU
#24169 opened
Sep 3, 2025 -
[Refactor]: Let each modeling file define M-RoPE implementation
#24165 opened
Sep 3, 2025 -
[RFC]: Support fast inplace model update by shared IPC buffer
#24163 opened
Sep 3, 2025 -
[Bug]: Crash when running embedding model on CPU (kv_cache_spec_values empty)
#24156 opened
Sep 3, 2025 -
[Usage]: Add toy example for gpt-oss container tools
#24148 opened
Sep 3, 2025 -
[Bug]: model failure for OpenGVLab/InternVL3-38B-hf
#24147 opened
Sep 3, 2025 -
[CI Failure]: Flaky OOM in Entrypoints Tests
#24144 opened
Sep 3, 2025 -
[Feature]: support Hunyuan-MT-Chimera-7B and HunYuanDenseV1ForCausalLM
#24142 opened
Sep 3, 2025 -
[Bug]: DeepSeek V3.1 with tool_choice=required produces garbled output
#24140 opened
Sep 3, 2025 -
[Bug]: B200 hang on flashinfer fa2 prefill
#24139 opened
Sep 3, 2025 -
[Feature]: Decoupled Vision-Language Deployment
#24136 opened
Sep 3, 2025 -
[Usage]: vllm+ray launches extra jobs on existing cluster, and not just actors
#24135 opened
Sep 3, 2025 -
[Bug]: vLLM stuck when serving GLM-4.5 model
#24133 opened
Sep 3, 2025 -
[Bug]: Should upgrade to PyTorch's MultiOutputMatch
#24125 opened
Sep 2, 2025 -
Remove CUDA 11.8
#24124 opened
Sep 2, 2025 -
[Bug]: Running on AMD Epyc 9654 (CPU Only) always tries to use intel_extension_for_pytorch and crashes.
#24120 opened
Sep 2, 2025 -
[Feature]: Optimize DP/EP Low Batch Size Decode DeepSeek-R1
#24117 opened
Sep 2, 2025 -
[Feature]: Optimize EPLB Rearrange Experts
#24116 opened
Sep 2, 2025 -
[RFC]: Improve MoE triton kernel tuning
#24112 opened
Sep 2, 2025 -
[Bug]: DeepSeek fails with enabled VLLM_USE_FLASHINFER_MOE_FP8=1
#24109 opened
Sep 2, 2025 -
[Feature]: Support extendable configuration files
#24096 opened
Sep 2, 2025 -
[Usage]: What is the benchmark configuration?
#24095 opened
Sep 2, 2025 -
[Bug]: Running Jamba FP8 crashes with cutlass_moe_mm
#24094 opened
Sep 2, 2025 -
[New Model]: OpenCUA
#24090 opened
Sep 2, 2025 -
[Bug]: v1.10.x is slower than 0.8.5.post1 when running qwen3
#24082 opened
Sep 2, 2025 -
[Feature]: Add uccl as kvconnect provide
#24079 opened
Sep 2, 2025 -
[Bug]: how to get purely deterministic output for gpt-oss-120b?
#24067 opened
Sep 2, 2025 -
[Bug]: In `uv` venv, running `python collect_env.py` will return error.
#24063 opened
Sep 2, 2025 -
[Bug]: prevent HuggingFace access when VLLM_USE_MODELSCOPE is enabled for gpt-oss-20b
#24060 opened
Sep 2, 2025 -
[Feature][KV Connector]: Async lookup policy support for MultiConnector
#24059 opened
Sep 1, 2025 -
[Feature]: Improve `vllm bench serve` startup time with random data
#24058 opened
Sep 1, 2025 -
[Feature]: Cutlass v4.2.0 Support
#24050 opened
Sep 1, 2025 -
[Usage]: how to disable thinking for different model
#24039 opened
Sep 1, 2025 -
[Bug]: vLLM >V0.9.2 with AWQ model producing nonsense in longer context chats
#24038 opened
Sep 1, 2025 -
[MTP][PP]: Does PP mode not support MTP? Is this how it is?
#24035 opened
Sep 1, 2025 -
[Bug]: Bug in PrefixCaching for float16 dtype on RTX 8000
#24007 opened
Aug 31, 2025 -
[Doc]: why vllm bench test shows very few successful requests
#24005 opened
Aug 31, 2025 -
[Bug]: GLM-4.5V - AssertionError: 12 is not divisible by 8
#24004 opened
Aug 31, 2025 -
[Bug]: Tool content missing from streaming output when deploying the Qwen3-8B model with vLLM
#23992 opened
Aug 30, 2025 -
Accuracy Drop with OpenGVLab/InternVL3-14B when using vLLM
#23988 opened
Aug 30, 2025 -
[Bug]: lmcache server points to wrong file in entrypoint
#23986 opened
Aug 30, 2025 -
[Bug]: new version critical bug with 100% gpu util but get stuck
#23979 opened
Aug 30, 2025 -
[Feature]: Benchmark for the Sampler
#23977 opened
Aug 30, 2025 -
[gpt-oss]: Ability to set model_identity dynamically which is used in building the system prompt
#23975 opened
Aug 30, 2025 -
[Bug]: Torch Compilation Failure for Gemma3n with LoRA Support - Dynamic Shape Constraints Violated
#23970 opened
Aug 29, 2025 -
[Bug]: v1 xformers + sliding window not working
#23969 opened
Aug 29, 2025 -
[Bug]: vllm bench serve fails with CPU-only head node
#23967 opened
Aug 29, 2025 -
Model Performance Bash!
#23963 opened
Aug 29, 2025 -
[Feature]: Support `Phi4Flash` model in V1
#23957 opened
Aug 29, 2025 -
ValueError: Currently, MiniCPMV only supports versions 2.0, 2.5, 2.6, 4.0. Got version: (4, 5)
#23955 opened
Aug 29, 2025 -
[Bug]: CUDA error when serving MiniCPM-V model
#23954 opened
Aug 29, 2025 -
[Feature]: Any plans to add nvidia/parakeet-tdt-0.6b-v3 to vllm?
#23943 opened
Aug 29, 2025 -
[Bug]: No platform detected, vLLM is running on UnspecifiedPlatform in Docker with Kubernetes, Nvidia L4
#23935 opened
Aug 29, 2025 -
[Bug]: CPU Backend with GPT-OSS Failed
#23934 opened
Aug 29, 2025 -
[Bug]: Illegal memory access with 4 GPUS
#23926 opened
Aug 29, 2025 -
[Bug]: _C.abi3.so: undefined symbol: _Z24silu_and_mul_nvfp4_quantRN2at6TensorES1_S1_S1_
#23925 opened
Aug 29, 2025 -
[Feature]: Allow usage of chat_template_kwargs and add_generation_prompt in /embeddings endpoint
#23923 opened
Aug 29, 2025 -
[Bug]: Unrecognized FP8 dtype: fp8_e5m2
#23922 opened
Aug 29, 2025 -
[Bug]: 5090 Qwen3-30B-A3B-FP8 fails when TP=2!
#23921 opened
Aug 29, 2025 -
[Bug]: 'AttributeError: '_OpNamespace' '_C' object has no attribute 'silu_and_mul_nvfp4_quant'
#23916 opened
Aug 29, 2025 -
[Bug]: gpt-oss-120b has high possibility to generate response as part of reasoning by using vllm v0.10.1
#23905 opened
Aug 29, 2025 -
[Feature]: Kubernetes 1.34 support (Dynamic Resource Allocation DRA)
#23900 opened
Aug 29, 2025
423 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
`torch.compile` caching of config fields should be opt-out by default
#23134 commented on
Sep 4, 2025 • 39 new comments -
[Model] Systematic support for fp32 head, pooling models part
#23810 commented on
Sep 4, 2025 • 38 new comments -
[Feature] Support Decode Context Parallel (DCP) for MLA
#23734 commented on
Sep 5, 2025 • 37 new comments -
[Perf][V1] Fully overlap model execution
#23569 commented on
Sep 4, 2025 • 35 new comments -
Add Dual-Batch Overlap mechanism to VLLM
#23693 commented on
Sep 4, 2025 • 35 new comments -
[Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector
#23624 commented on
Sep 5, 2025 • 24 new comments -
[Core] Shared memory based object store for Multimodal data caching and IPC
#20452 commented on
Sep 4, 2025 • 19 new comments -
[Performance][EPLB] EPLB Execution Optimization
#22179 commented on
Sep 4, 2025 • 17 new comments -
[Bugfix] Merge MM embeddings by index instead of token IDs
#16229 commented on
Sep 2, 2025 • 14 new comments -
[BugFix] pp cannot run successfully under NixlConnector
#22976 commented on
Sep 4, 2025 • 13 new comments -
[P/D] Add a shutdown method to the Connector API
#22699 commented on
Sep 4, 2025 • 13 new comments -
[Feat][EPLB] A novel static EPLB placement strategy for MoE models.
#23745 commented on
Sep 5, 2025 • 13 new comments -
[V1][Spec Decode][Feature] Spec decode with probs
#20459 commented on
Sep 5, 2025 • 12 new comments -
[V1] [P/D] Add Support for KV Load Failure Recovery
#19330 commented on
Sep 4, 2025 • 11 new comments -
EVS Support (Video tokens pruning)
#22980 commented on
Sep 2, 2025 • 10 new comments -
[Bugfix] Fix mamba2 prefill chunking
#23279 commented on
Sep 1, 2025 • 10 new comments -
[Core] Use sha256 bytes instead of BlockHash to reduce GC overhead
#23673 commented on
Sep 5, 2025 • 8 new comments -
[Core] Encoder separation for Encode-Prefill-Decode Disaggregation
#21740 commented on
Sep 4, 2025 • 8 new comments -
[Frontend] User-provided uuids for medias in chat. (RFC #22044)
#23449 commented on
Sep 5, 2025 • 8 new comments -
[torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends
#19767 commented on
Sep 4, 2025 • 8 new comments -
[gpt-oss] Harmony changes with container tool support
#23386 commented on
Sep 5, 2025 • 7 new comments -
[ROCm][Bugfix] Fix Aiter RMSNorm
#23412 commented on
Sep 2, 2025 • 7 new comments -
[PERF] Allreduce Fusion tuning and compile_ranges introduction
#22086 commented on
Sep 4, 2025 • 7 new comments -
[Kernel][B200] `mxfp4` fused cutlass moe
#23696 commented on
Sep 4, 2025 • 6 new comments -
Generate _ModelInfo properties file when loading to improve loading speed
#23558 commented on
Sep 5, 2025 • 6 new comments -
[P/D][Nixl] Introduce `KVTransferMetrics` and aggregation strategy
#22188 commented on
Sep 3, 2025 • 6 new comments -
[Sampler] Support returning all prompt logprobs
#23868 commented on
Sep 3, 2025 • 6 new comments -
[v1] Add Whisper model support (encoder-decoder)
#21088 commented on
Sep 5, 2025 • 6 new comments -
[V1] implement tree sampler for draft token acceptance
#22752 commented on
Sep 5, 2025 • 6 new comments -
[Build] Split Kernels into Separate `vllm-kernels` package
#23866 commented on
Sep 3, 2025 • 6 new comments -
fix(v1/kv_cache): resolve async KV transfer bug in cascade attention
#23485 commented on
Sep 2, 2025 • 5 new comments -
[Frontend] Pass API server count to each process
#23717 commented on
Sep 1, 2025 • 5 new comments -
v1: Offloading connector
#22595 commented on
Sep 4, 2025 • 5 new comments -
[Model] New model support for Motif-1-Tiny
#23414 commented on
Sep 3, 2025 • 5 new comments -
[Perf] Warmup FlashInfer attention during startup
#23439 commented on
Sep 4, 2025 • 4 new comments -
[CI/Build] Add bc-linter to vLLM CI
#21234 commented on
Sep 5, 2025 • 4 new comments -
[RFC] allow cancelation after shutdown in blocking collective_rpc
#23390 commented on
Sep 4, 2025 • 3 new comments -
[Bugfix] Make unspecified --host bind to dual stack
#22823 commented on
Aug 30, 2025 • 3 new comments -
[Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4
#21166 commented on
Sep 5, 2025 • 3 new comments -
Migrate Qwen2 inputs to TensorSchema
#23475 commented on
Sep 4, 2025 • 3 new comments -
[XPU] Fix OOM when manually specifying ZE_AFFINITY_MASK with Ray distributed executor on XPU
#22413 commented on
Sep 2, 2025 • 2 new comments -
[Bugfix] Mistral tool parser streaming update
#19425 commented on
Sep 2, 2025 • 2 new comments -
Enable modelopt gemma3 nvfp4/fp8, make workflow more robust
#22771 commented on
Sep 3, 2025 • 2 new comments -
[Chore] Cleanup guided namespace, move to structured outputs config
#22772 commented on
Sep 4, 2025 • 2 new comments -
[Model] Activated LoRA
#19710 commented on
Sep 4, 2025 • 2 new comments -
Allows initialize TorchAOConfig object through quantization_config_file
#23014 commented on
Sep 2, 2025 • 2 new comments -
[Frontend] OpenAI Responses API supports Tool/Function calling
#20874 commented on
Sep 4, 2025 • 2 new comments -
[Feature] limit thinking tokens (hard limit)
#20859 commented on
Sep 4, 2025 • 2 new comments -
DeepSeek fix: awq x mergedreplicatedlinear
#23764 commented on
Aug 30, 2025 • 2 new comments -
Support for NemotronH Nano VLM
#23644 commented on
Sep 5, 2025 • 2 new comments -
[Core] Nanoflow-style Computation-Communication Overlap
#23592 commented on
Sep 3, 2025 • 2 new comments -
Fp8 paged attention update
#22222 commented on
Sep 4, 2025 • 1 new comment -
[V1] address post issues related to #20059 (part 1)
#23046 commented on
Sep 4, 2025 • 1 new comment -
[Feature][EPLB] Add EPLB support for hunyuan_v1
#23078 commented on
Sep 5, 2025 • 1 new comment -
[Bugfix][V1] Raise ValueError when draft max model len is too small
#22935 commented on
Sep 3, 2025 • 1 new comment -
[V1] Logits processor docs
#22919 commented on
Sep 4, 2025 • 1 new comment -
[KV Connector] More async support for `get_num_new_matched_tokens`
#23620 commented on
Sep 5, 2025 • 1 new comment -
[Bug]: v0.8.2, enable calculate_kv_scales, caught exception
#15973 commented on
Sep 3, 2025 • 0 new comments -
fix: return {} for tool arguments when no argument is needed, so that…
#21365 commented on
Sep 1, 2025 • 0 new comments -
[Core][Feat] Add max-waiting-queue-length parameter to reject requests when waiting queue is full
#21352 commented on
Sep 4, 2025 • 0 new comments -
[Usage]:
#18679 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: Qwen3 uses vllm automatic batch inference to abnormal output
#18252 commented on
Sep 5, 2025 • 0 new comments -
[Feature]: Auto tokenizer mode should detect mistral tokenizer
#18090 commented on
Sep 5, 2025 • 0 new comments -
[RFC]: Enabling Arm Neoverse CI Runners
#17720 commented on
Sep 5, 2025 • 0 new comments -
[Feature][Kernel]FusedMoE LoRA
#21229 commented on
Sep 1, 2025 • 0 new comments -
[New Model]: nemotron Super GGUF
#16944 commented on
Sep 5, 2025 • 0 new comments -
[V1][PP] Pipeline chunked prefill
#13638 commented on
Sep 5, 2025 • 0 new comments -
[Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration
#21078 commented on
Sep 4, 2025 • 0 new comments -
[Performance]: Plan to support DP attention for Deepseek models
#12871 commented on
Sep 5, 2025 • 0 new comments -
[Feature]: Compute and log the serving FLOPs
#3490 commented on
Sep 5, 2025 • 0 new comments -
Support mnnvl all2allv from Flashinfer
#21003 commented on
Sep 4, 2025 • 0 new comments -
[RFC]: Enabling Suffix Decoding, LSTM Speculator, Sequence Parallelism from Arctic Inference
#18037 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: qwq32b-128k accuracy loss compared with sglang, with proprietary business benchmark
#19245 commented on
Sep 5, 2025 • 0 new comments -
[Misc]add replicaid to ray metrics
#22159 commented on
Sep 4, 2025 • 0 new comments -
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar
#22112 commented on
Sep 5, 2025 • 0 new comments -
[Performance]: EAGLE-3: Discrepancy Between Throughput and Acceptance Length Improvements
#19226 commented on
Sep 5, 2025 • 0 new comments -
[Speculators][Speculative Decoding] Add Eagle3 Support For HunYuan Model
#22080 commented on
Sep 1, 2025 • 0 new comments -
[Core] Enable HF processing on GPU
#22070 commented on
Aug 30, 2025 • 0 new comments -
[Usage]: Is the service interface exposed by PD separation compatible with the service API of OpenAPI?
#19214 commented on
Sep 5, 2025 • 0 new comments -
[Bugfix]: Fix Possible Output Corruption in Cascade Attention Caused by Non-Contiguous LSE Tensor
#22003 commented on
Sep 5, 2025 • 0 new comments -
[Bugfix] Fix hermes tool parser handling of non-string argument types
#22002 commented on
Sep 2, 2025 • 0 new comments -
[Structured Output][Refactor] Move `apply_grammar_bitmask()` method from `ModelRunner` to structured output utils
#21999 commented on
Sep 4, 2025 • 0 new comments -
Add support for model signature verification
#21957 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: How to improve the gpu usage with Qwen-VL
#19208 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: Granite-Speech-3.3-2b hangs forever, never produces output
#19198 commented on
Sep 5, 2025 • 0 new comments -
Limit concurrent long partial prefills via max_long_partial_prefills
#21651 commented on
Sep 2, 2025 • 0 new comments -
[Bugfix] Handle None case for dt_bias and D in selective_state_update
#21532 commented on
Aug 30, 2025 • 0 new comments -
[Model] Mamba2 varlen and metadata refactor
#21467 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Quantization method specified in the model config (fp8) does not match the quantization method specified in the `quantization` argument (gguf).
#19050 commented on
Sep 5, 2025 • 0 new comments -
v1/offloading: Add worker-side CPU support
#21448 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: TorchDispatchMode does not work for vllm
#19044 commented on
Sep 5, 2025 • 0 new comments -
[ROCm] Get rid of RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES
#15246 commented on
Sep 4, 2025 • 0 new comments -
Enable Outlines with JSON Sub-Schema References
#15627 commented on
Sep 4, 2025 • 0 new comments -
[Bugfix]: fix JSON decode error when tool call argument is empty
#19428 commented on
Aug 31, 2025 • 0 new comments -
fix: can not install torch+cpu for no index url
#15822 commented on
Aug 30, 2025 • 0 new comments -
Fix #15483: Add error handling for model-dependent endpoints during sleep mode
#16536 commented on
Sep 3, 2025 • 0 new comments -
[FeatureRequest] Support Cascade Attention for Sliding Window Attention #15738
#16550 commented on
Aug 31, 2025 • 0 new comments -
Use PyTorch util for traced files instead of monkey-patching inline_call()
#19235 commented on
Sep 5, 2025 • 0 new comments -
[WIP] download config json file from modelscope
#19212 commented on
Sep 4, 2025 • 0 new comments -
Fixes crashes in vLLM v1 engine when using LMCache KV
#19194 commented on
Sep 4, 2025 • 0 new comments -
[MTIA] Add mtia as a literal in device config.
#19026 commented on
Sep 3, 2025 • 0 new comments -
[Core] Remove int32->int64->int32 overhead in FlashInfer sampling
#18920 commented on
Sep 3, 2025 • 0 new comments -
[BugFix] v0 cache evictor: priority_queue and free_table desynchronization
#18882 commented on
Sep 2, 2025 • 0 new comments -
[bugfix][v1] fixed the missing prompt value in RequestOutputs
#18880 commented on
Sep 3, 2025 • 0 new comments -
[Hardware][Intel-Gaudi] t.compile optimizations
#18137 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: reasoning_tokens in Chat Completion Response usage
#18067 commented on
Sep 5, 2025 • 0 new comments -
[WIP]: DRY sampling
#16695 commented on
Sep 1, 2025 • 0 new comments -
[Bugfix] fix: close issue #16554 to make it real async
#16557 commented on
Aug 30, 2025 • 0 new comments -
[Bugfix] Move current_platform import to avoid python import cache.
#16601 commented on
Sep 1, 2025 • 0 new comments -
fix(frontend): always include usage, when configured to do so
#20983 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: How to set reasoning_effort for gpt-oss model to "high" in vllm
#22809 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: update the kv_connector from v0 to v1 in example
#22093 commented on
Sep 5, 2025 • 0 new comments -
[Feature] Add command tool parser for Command-A model
#20800 commented on
Sep 3, 2025 • 0 new comments -
feat: Add streaming support for Mistral v11 tool format
#20503 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: CI not running all tests/compile tests
#23865 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: RuntimeError: NCCL error: unhandled cuda error
#20226 commented on
Sep 5, 2025 • 0 new comments -
[V1] feat: add engine v1 tracing
#20372 commented on
Sep 2, 2025 • 0 new comments -
[Frontend] Feature: support transcription API with language detection
#13465 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Implement `check_health` for V1
#20164 commented on
Sep 3, 2025 • 0 new comments -
v1: Introduce LRU-based CPU offloading management
#20075 commented on
Sep 4, 2025 • 0 new comments -
[Feature][Kernel] Blocked FP8 CUTLASS MoE for Hopper
#19983 commented on
Aug 30, 2025 • 0 new comments -
v1: Introduce an offloading component
#19848 commented on
Sep 4, 2025 • 0 new comments -
[CI] Make UT cases in test_comm_ops.py compatible with more devices
#14229 commented on
Sep 3, 2025 • 0 new comments -
[Model] add colqwen2_vl code & inference
#14291 commented on
Sep 2, 2025 • 0 new comments -
[V1] Logit processors for rejection sampler
#19482 commented on
Sep 2, 2025 • 0 new comments -
[Frontend] Skip `stop` in reasoning content
#14550 commented on
Sep 4, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Aug 29, 2025 • 0 new comments -
[CI] Optimize entrypoints API server tests
#23896 commented on
Sep 2, 2025 • 0 new comments -
Adding int4 and int8 models for CPU benchmarking
#23709 commented on
Sep 5, 2025 • 0 new comments -
[XPU][Feature] sleep mode support for XPU platform
#23704 commented on
Sep 5, 2025 • 0 new comments -
[V1] Support MP Distributed Executor for multi node distributed inference
#23691 commented on
Aug 30, 2025 • 0 new comments -
[Spec-decode] fix and refactor cudagraphs for spec-decode
#23679 commented on
Sep 4, 2025 • 0 new comments -
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel
#23647 commented on
Sep 5, 2025 • 0 new comments -
[Model] Add tuned fused_moe configs for H200_NVL based on H200 config
#23642 commented on
Sep 1, 2025 • 0 new comments -
Fix regex patterns in DeepSeekV31ToolParser to use non-greedy matching
#23618 commented on
Sep 4, 2025 • 0 new comments -
[Spec Decode][Benchmark] Add Blitzedit dataset
#23605 commented on
Sep 4, 2025 • 0 new comments -
Synchronize TYPE_CHECKING section with environment_variables dictionary in envs.py
#23602 commented on
Aug 30, 2025 • 0 new comments -
[Speculators][Speculative Decoding] Support gpt-oss eagle3 on blackwell
#23596 commented on
Sep 3, 2025 • 0 new comments -
[Model] Add lite-whisper model support in vLLM
#23566 commented on
Sep 2, 2025 • 0 new comments -
[Spec Decode][Benchmark] Add Spec Bench Dataset for benchmarking
#23563 commented on
Sep 3, 2025 • 0 new comments -
Model modification for EPLB
#23553 commented on
Aug 29, 2025 • 0 new comments -
[ISSUE 23474] Remove lora additional vocabulary
#23540 commented on
Sep 5, 2025 • 0 new comments -
Fix gpt-oss tool call
#23518 commented on
Sep 5, 2025 • 0 new comments -
Redesign Persistent Batch in vLLM
#23514 commented on
Sep 5, 2025 • 0 new comments -
feat: Add Grafana and Perces monitoring dashboards for vLLM
#23498 commented on
Sep 5, 2025 • 0 new comments -
[Misc] rename determine_available_memory to determine_kv_cache_availa…
#23495 commented on
Aug 31, 2025 • 0 new comments -
[Misc] refactor usage report by reuse report_usage_stats function
#23493 commented on
Aug 30, 2025 • 0 new comments -
[Bugfix] fix is_usage_stats_enabled when disabling it
#23489 commented on
Aug 30, 2025 • 0 new comments -
[Core] Support sleep mode for cuda graph
#23482 commented on
Sep 3, 2025 • 0 new comments -
[bug fix] disable memory pool to release unused `bf16` weights
#23875 commented on
Sep 4, 2025 • 0 new comments -
[Benchmark] Add ability to round robin over a set of urls for benchmarking
#23870 commented on
Aug 31, 2025 • 0 new comments -
[benchmark] add peak throughput metrics and plot
#23867 commented on
Aug 30, 2025 • 0 new comments -
[gpt-oss] Validate gpt-oss python tool during initialization
#23856 commented on
Sep 5, 2025 • 0 new comments -
Check bc linter
#23855 commented on
Sep 5, 2025 • 0 new comments -
Update v1/entrypoints test_struct_output_generate tests to use lighter models
#23850 commented on
Sep 4, 2025 • 0 new comments -
[DP][ray] Support different VLLM_RAY_DP_PACK_STRATEGY
#23849 commented on
Sep 4, 2025 • 0 new comments -
[Misc] Fix an error when enabling allreduce fusion pass
#23848 commented on
Sep 3, 2025 • 0 new comments -
[Log] Use a relative path in debug-level logs to distinguish files with identical names
#23846 commented on
Sep 3, 2025 • 0 new comments -
[Bugfix] Update Run:AI Model Streamer Loading Integration
#23845 commented on
Sep 4, 2025 • 0 new comments -
[Kernels][Nvidia] AOT compilation workflow [1/n]
#23844 commented on
Sep 4, 2025 • 0 new comments -
[Bugfix] support loading models from S3
#23842 commented on
Sep 5, 2025 • 0 new comments -
[xpu] upgrade ipex/python3.12 for xpu
#23830 commented on
Sep 5, 2025 • 0 new comments -
[Feature][Quantization] auto_round support for mixed bits quantization
#23812 commented on
Sep 2, 2025 • 0 new comments -
valley-eagle-7b (not finished yet)
#23799 commented on
Sep 1, 2025 • 0 new comments -
[CI] Fail subprocess tests with root-cause error
#23795 commented on
Sep 4, 2025 • 0 new comments -
[benchmark] add random and common prefix usage
#23788 commented on
Sep 3, 2025 • 0 new comments -
[do not merge] this is for testing ci-infra changes
#23771 commented on
Aug 29, 2025 • 0 new comments -
[Bugfix] process cannot stop when the NIXL port is bound
#23756 commented on
Sep 1, 2025 • 0 new comments -
[Misc] Use CpuGpuBuffer for FlashInfer metadata builder
#23731 commented on
Aug 30, 2025 • 0 new comments -
[Misc] Moved override for allreduce fusion thresholds from env var to config
#23722 commented on
Sep 2, 2025 • 0 new comments -
Fix several unnecessary CUDA sync points
#22875 commented on
Aug 29, 2025 • 0 new comments -
[FIXBUG] Add stop and stop_token_ids to BeamSearchParams
#22869 commented on
Sep 4, 2025 • 0 new comments -
[P/D][NIXL]NixlConnector Reliability Enhancement
#22866 commented on
Sep 3, 2025 • 0 new comments -
[V0 Deprecation] Remove V0 xFormers attention backend
#22777 commented on
Sep 2, 2025 • 0 new comments -
[Bugfix] V1 engine positional model argument handling
#22764 commented on
Aug 30, 2025 • 0 new comments -
[Frontend] Add Sentry SDK for error reporting
#22753 commented on
Sep 4, 2025 • 0 new comments -
[Feat] Support elastic KV cache memory pool for dynamic GPU memory sharing
#22706 commented on
Sep 1, 2025 • 0 new comments -
[CI] run tests/compile/test_config.py
#22682 commented on
Sep 1, 2025 • 0 new comments -
Enable Intel Gaudi accelerator for vLLM Benchmark suite
#22680 commented on
Sep 5, 2025 • 0 new comments -
Support Anthropic API /v1/messages Endpoint
#22627 commented on
Sep 5, 2025 • 0 new comments -
Vectorize RMSNorm CUDA kernel
#22602 commented on
Aug 30, 2025 • 0 new comments -
consistency between the test and final Docker image
#22490 commented on
Sep 5, 2025 • 0 new comments -
[Refactor] Refactor FP8 & INT8 Quant Folder inside `w8a8`
#22474 commented on
Aug 29, 2025 • 0 new comments -
[V1][Metrics][Plugin] Add plugin support for custom `StatLoggerBase` implementations
#22456 commented on
Sep 4, 2025 • 0 new comments -
[CI/Build] Fix ppc64le CPU build and tests
#22443 commented on
Sep 3, 2025 • 0 new comments -
[Bugfix] Simulate mxfp4 quark model execution on cdna4 until kernels are integrated
#22355 commented on
Sep 4, 2025 • 0 new comments -
`NixlConnector` Support HTTP/S metadata exchange instead of zmq
#22274 commented on
Sep 4, 2025 • 0 new comments -
[TPU][Misc] Fix TPU.device_name
#22254 commented on
Aug 29, 2025 • 0 new comments -
[Perf][Feat][Core] Workload-Aware KVCache Eviction Policy
#22236 commented on
Sep 1, 2025 • 0 new comments -
[Bugfix] Disable the statslogger if the api_server_count is greater than 1
#22227 commented on
Sep 5, 2025 • 0 new comments -
feat: Add native support for XLM-RoBERTa embedding and BAAI/bge-reranker-v2-m3
#22216 commented on
Sep 3, 2025 • 0 new comments -
[Misc] pop virtual_engine in from_broadcasted_tensor_dict
#23476 commented on
Aug 30, 2025 • 0 new comments -
[#20711] Use QuantFp8 CustomOp-abstraction for MoE layers
#23463 commented on
Aug 30, 2025 • 0 new comments -
Add Predicted Outputs API
#23450 commented on
Sep 4, 2025 • 0 new comments -
[gpt-oss][Bugfix] Fix gpt-oss toolcall
#23440 commented on
Sep 5, 2025 • 0 new comments -
[Frontend] Add unit tests for OpenAI Responses streaming IDs (item_id/content_index + delta path) #23218
#23382 commented on
Sep 4, 2025 • 0 new comments -
[EPLB] Add Asynchronous Expert Rebalancing
#23343 commented on
Sep 4, 2025 • 0 new comments -
[Refactor] Small cleanup for quantized FusedMoE
#23339 commented on
Sep 3, 2025 • 0 new comments -
[ROCm][FEAT] Integrate AITER CustomAllreduce in cuda communicator.
#23336 commented on
Aug 29, 2025 • 0 new comments -
[Bugfix] remove duplicate tokens streamed in required tool choice streaming
#23312 commented on
Sep 2, 2025 • 0 new comments -
[V0 Deprecation] Drop V0 encoder-decoder runner
#23300 commented on
Aug 29, 2025 • 0 new comments -
[Perf] Use upstream CUTLASS for SM90 Block FP8 kernel
#23280 commented on
Aug 30, 2025 • 0 new comments -
fix: response_format for completion
#23212 commented on
Sep 2, 2025 • 0 new comments -
[Misc][qwen2_5_vl] Enable `supports_torch_compile` on generic nn.Module
#23207 commented on
Aug 29, 2025 • 0 new comments -
[Misc][Feature] confidence based early stopping
#23201 commented on
Sep 5, 2025 • 0 new comments -
ON HOLD - [Core] Lazy/Delayed CUDA graph
#23184 commented on
Sep 2, 2025 • 0 new comments -
[Bugfix] Fix gemma3 with transformers backend
#23178 commented on
Sep 2, 2025 • 0 new comments -
[V1] check request priority if scheduler policy is fcfs
#23043 commented on
Aug 31, 2025 • 0 new comments -
[Core] Support weight_loader_v2 for `UnquantizedLinearMethod`
#23036 commented on
Sep 3, 2025 • 0 new comments -
[Core] Allow disabling TP sharding for parallel Linear layer
#23024 commented on
Sep 4, 2025 • 0 new comments -
Optimize MoE Token Dispatch for Tensor Parallel Configurations
#22993 commented on
Sep 4, 2025 • 0 new comments -
feat(multimodal): Add support for SigLIP pooling model
#22921 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: EngineCore died unexpectedly when running inference with llama (generate)
#23517 commented on
Sep 2, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: Memory Leak Issue in Load Testing Scenario
#22736 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: During testing of the LoRA model, the "enable-prefix-caching" feature did not take effect
#23301 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: During model execution, the error "TimeoutError: RPC call to execute_model timed out." is raised, causing the model to exit.
#19197 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: When "tool_choice": "auto" is set, there is a reasoning_content process in the output, but this process is missing when "tool_choice": "required" is used.
#19846 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: minicpm-4.5v
#23784 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Allow oot custom compiler extension via CompilerInterface and reuse backend-agnostic FX passes
#23612 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: when Nsight captures NVTX with PP>1, vllmWorkerProcess will unexpectedly terminate
#13482 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Pin vLLM process to the right NUMA Region
#13855 commented on
Sep 2, 2025 • 0 new comments -
[Installation]: RuntimeError: Unknown runtime environment
#15450 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: Deciding max-num-seqs and max-num-batched-tokens for desired throughput
#16886 commented on
Sep 2, 2025 • 0 new comments -
[Performance]: UVA vs UVM for CPU offloading on v0.8.4+
#17062 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: GGUF support for GLM4
#17069 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: failed to run latest offline PD example code
#17624 commented on
Sep 2, 2025 • 0 new comments -
[RFC]: Model Parallelism with Single Worker using SPMD
#18009 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: errors when building from source
#18691 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: AsyncLLM when DP > 1, device allocation bug
#18942 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: gpu-memory-utilization does not work
#19023 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Not Found
#19047 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: System Memory OOM after upgrading to v0.9.0.1
#19048 commented on
Sep 2, 2025 • 0 new comments -
[RFC]: Drop CUDA 11.8 Support
#19061 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: how can I use the vLLM Docker image on a platform with an arm64 CPU and an Nvidia a600 GPU
#19065 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Metal support
#19073 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: CUDA error: unknown error when running vllm serve on WSL2 Ubuntu22.04
#19077 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: intent is added for guided generation
#19107 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: 0.8.4 serve QwQ-32B-AWQ failed
#16811 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Not allowed: Wheel dist/vllm-0.9.1.dev2+ge0cbad4e3-cp38-abi3-linux_x86_64.whl is larger (824.73 MB) than the limit (400 MB)
#18786 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Model misbehaves with --tensor-parallel-size 2 on 2x Nvidia L4
#19022 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: vllm profiling result contains invalid utf-8 code
#19043 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: Implement Method to Obtain Token-Level Log Probabilities from Models with Different Weights for KL Divergence Calculation
#19127 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Error occurred while performing model inference using 0.8 H20s from the virtualized computing pool.
#19137 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: Distributed Inference Over CPU
#19142 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: asyncio_mode and not multiprocess_mode EngineCore imple
#19146 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: max-model-len + max-num-seqs is not reducing vram usage
#19148 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: KeyError: 'language_model.layers.0.self_attn.qkv_proj.weight'
#19149 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: OutofMemoryError with LMCache example and cpu_offload_gb enabled
#19154 commented on
Sep 3, 2025 • 0 new comments -
[Renderer]: Move `Processor` out of `AsyncLLM`
#23869 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: vllm, EngineCore encountered a fatal error TimeoutError
#19668 commented on
Sep 3, 2025 • 0 new comments -
[RFC]: Dynamic Expert Load Balance with Zero-like-overhead
#22246 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Support loading vision layers in VLM LoRA adapters
#16364 commented on
Sep 3, 2025 • 0 new comments -
[RFC]: Context Parallelism && Sequence Parallelism
#22693 commented on
Sep 3, 2025 • 0 new comments -
[Installation]: no version of pip install vllm works - Failed to initialize NumPy: No Module named 'numpy'
#11037 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: vllm.third_party.pynvml.NVMLError_InvalidArgument: Invalid Argument
#19071 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Optimize RoPE
#22293 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: Setting up vLLM with a multi-host for example v6e-4x4 TPU topology fails
#23860 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: There are no CI tests for chunked prefill for pooling models.
#23436 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Add LORA Model Name in Open Telemetry
#23767 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: qwen2.5 omni doesn't support bnb quantization.
#23240 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: vLLM aarch64 support (GH200)
#23350 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: openai/gpt-oss-20b breaks on data parallel
#23244 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: Unknown quantization method: mxfp4
#22276 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: VLLM_ALL2ALL_BACKEND=naive hangs/crashes on multi nodes when serving DeepSeekV3
#23448 commented on
Sep 2, 2025 • 0 new comments -
[New Model]: Google SigLip 2
#13663 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: LoRA support for Mistral 3.1
#18574 commented on
Aug 31, 2025 • 0 new comments -
[Feature]: Context Parallelism
#7519 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: Unexpected CUDA OOM with larger TP size
#22702 commented on
Aug 30, 2025 • 0 new comments -
[Feature]: add DoRA support
#10849 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: LLVM ERROR: Failed to compute parent layout for slice layout. when using fp16
#17152 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: Can't serve Qwen3-AWQ
#18156 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: vLLM v0.8.5.post1 hanging with Llama 3.3 70b
#18260 commented on
Aug 30, 2025 • 0 new comments -
[Usage]: Control whether Deepseek R1 thinks or not
#18988 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: KV cache specs are not equal across ranks
#23883 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: vLLM server crashes with CUDA illegal memory access for specific sequence lengths on B200
#23724 commented on
Aug 29, 2025 • 0 new comments -
[Doc]: clarify support for cpu-based image
#23681 commented on
Aug 29, 2025 • 0 new comments -
[Usage]: how to use built-in python tool of gpt-oss-20b after starting vllm serve --tool-server demo?
#23108 commented on
Aug 29, 2025 • 0 new comments -
[RFC]: Remove LoRA bias
#23892 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: vLLM (AsyncLLMEngine, LLM) engine initialization fails when using runai_streamer
#22843 commented on
Aug 29, 2025 • 0 new comments -
[CI]: Entrypoints tests cleanup
#23667 commented on
Aug 29, 2025 • 0 new comments -
[Feature]: Logging details about incorrect requests
#19739 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
#18455 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: Bad result with parallel generation.
#20561 commented on
Aug 29, 2025 • 0 new comments -
[MM Encoder] Add Encoder DP to Kimi-VL
#23878 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: InternVL3 FP8 missing module/parameter on model load
#19424 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: Sampling discrepancy between ollama and vLLM for gemma-3-27b-it et al.
#20060 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: GPU memory allocation problem
#23163 commented on
Aug 29, 2025 • 0 new comments -
[MM Encoder] ViT attention performance and consolidation
#23880 commented on
Aug 29, 2025 • 0 new comments -
[MM Encoder] Add Encoder DP to InternVL
#23876 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: assortment of warnings / errors coming out of vllm basic python inference script
#18634 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: FlashMLA V1 with FP8 KV cache not yet supported!
#18887 commented on
Sep 1, 2025 • 0 new comments -
[Feature]: Individual GuidedDecodingParams for each prompt in prompts.
#19007 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: Phi-4-mini-instruct / Phi-4-multimodal-instruct produces gibberish when input <4096 tokens and output is >4096 tokens
#19489 commented on
Sep 1, 2025 • 0 new comments -
[Feature][Wide EP]: Add NIXL, DeepEP, DeepGEMM, and PPLX to Docker Image
#23344 commented on
Sep 1, 2025 • 0 new comments -
[RFC]: Deprecating vLLM V0
#18571 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: calculate_kv_scales leads to dynamo compilation issue; enforce_eager=True leads to another issue
#21640 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'sinks'
#22383 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: apply_temperature may cause nan in probs
#22180 commented on
Sep 1, 2025 • 0 new comments -
[Feature][Kernel][B200]: FI MoE LL does not use `allgatherv` and `reduce-scatterv` for dispatch and combine
#22916 commented on
Sep 1, 2025 • 0 new comments -
[Performance]: Low GPU Utilization (70%) for ViT+Qwen2 VLM Model.
#18392 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: something wrong with hermes tool parser
#18791 commented on
Sep 1, 2025 • 0 new comments -
[Usage]: Is the 0.9.0 container restricted to running only on CUDA 12.8 and above?
#18813 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: 0.8.x with vllm V1 fails on loading Qwen-vl-2.5 with UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 2218: ordinal not in range(128)
#18823 commented on
Sep 1, 2025 • 0 new comments -
[Performance]: The Unstable Performance Difference between CUDA and PyTorch
#18884 commented on
Sep 1, 2025 • 0 new comments -
[Usage]: How to Retrieve Model Parameters (e.g., Supported Embedding Dimensions) for an Embedding Model (Online Service)
#18984 commented on
Sep 1, 2025 • 0 new comments -
[Feature]: support Microsoft Tutel as inference backend for Moe models
#19013 commented on
Sep 1, 2025 • 0 new comments -
[Attention]: Pad for cudagraphs before constructing attention metadata
#23789 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: RuntimeError: operator _C::marlin_qqq_gemm does not exist
#23662 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist()
#19483 commented on
Sep 1, 2025 • 0 new comments -
[MM Encoder] Investigate heuristic for enabling encoder DP by default
#23879 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: vllm/vllm-openai:gptoss AssertionError: Sinks are only supported in FlashAttention 3 (4090 48gb)
#22331 commented on
Aug 31, 2025 • 0 new comments -
[Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1
#19131 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: Unable to use Qwen/Qwen2.5-Omni-7B with --mm-processor-kwargs
#20995 commented on
Aug 31, 2025 • 0 new comments -
[RFC]: Optimize Input Media Processing in vLLM
#22044 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: NIXL disaggregation example does not work
#22532 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: After wake_up of a sleeping model in the OpenAI API server, the model generates gibberish output
#20627 commented on
Aug 31, 2025 • 0 new comments -
[Usage]: does v1 support sequence parallelism now?
#19256 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: VLLM v0.10.0 failed to deploy the qwen3-30b-moe model. The error is AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'.
#22225 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: Add OpenTelemetry API to v1
#17794 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: JSON decode error when tool call argument is empty
#19419 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: gpt-oss Intermittent 500 Internal Server Error with empty response body when using strict JSON “function router” system prompt
#23837 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: [gpt oss 20b] [tool_call] Unexpected token 12606 while expecting start token 200006
#22519 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: illegal memory access when there are multiple concurrent requests
#23814 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: When I use the Qwen3-32B with tool_choice='required' parameter, the tool calling gets stuck in a loop
#21026 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: AttributeError: Model GptOssForCausalLM does not support BitsAndBytes quantization yet. No 'packed_modules_mapping' found. Support BitsAndBytes quantization for GptOssForCausalLM?
#23632 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: tensor parallelism inference doesn't run on Nvidia Blackwell 5070ti
#21239 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: MoE models fail at startup: AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'
#18967 commented on
Sep 4, 2025 • 0 new comments -
[New Model]: OpenAI OSS model support
#22265 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: The quantization method mxfp4 is not supported for the current GPU SM75
#22288 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Performance Analysis: Significant Latency on First Inference due to Engine Warm-up (torch.compile & Graph Capture)
#23787 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: vLLM server hangs and timeouts after initial requests
#17972 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: When enabling LoRA, greedy search got different answers.
#7977 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Why does startup always hang when launching qwen2.5-vl series models?
#13651 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: 100% CPU usage when idle
#16660 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: When using tool calls, the "tool_calls" list in the response is empty and the value appears in "content" instead, which does not conform to the OpenAI standard.
#17161 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Serve Qwen3 MOE GPTQ models raise `torch._dynamo.exc.Unsupported` error
#18044 commented on
Sep 4, 2025 • 0 new comments -
[Doc]: add `--build-arg RUN_WHEEL_CHECK=false` to the "building-vllm-s-docker-image-from-source" section to avoid `check-wheel-size.py`-errors when building vllm for blackwell
#18309 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Internal Server Error: python3 openai_chat_completion_client_for_multimodal.py -c audio when using Qwen/Qwen2-Audio-7B-Instruct
#19083 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: Why is CPU usage fully utilized while GPU power stays low when deploying a model with vLLM in a multi-GPU environment?
#19133 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: error `is not a multimodal model` when serving `Qwen/Qwen3-8B` connected to `gr.load_chat(...)`
#19144 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: support soft thinking
#19180 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: How to quantize a custom model
#19190 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: File Access Error Preventing vLLM API Server from Starting
#19192 commented on
Sep 4, 2025 • 0 new comments -
[Performance]: Running LoRA inference for the llama2-0.2B model with vllm v0.9.0 on H20, PyTorch Dynamo traverses on the CPU side during the model forward pass
#19261 commented on
Sep 5, 2025 • 0 new comments -
[Performance]: The same latency of Qwen3-8B and Qwen3-8b-Fp8B
#19264 commented on
Sep 5, 2025 • 0 new comments -
[Feature]: Add background heartbeat detection for non-streaming output
#19268 commented on
Sep 5, 2025 • 0 new comments -
[New Model]: jinaai/jina-colbert-v2
#19278 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: continue_final_message + echo + prefix-caching + V0 crash the server
#19285 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: openai.LengthFinishReasonError from client.beta.chat.completions.parse
#19293 commented on
Sep 5, 2025 • 0 new comments -
[Feature]: Add token-level progress bar for `LLM.beam_search` inference
#19300 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: openai_harmony.HarmonyError: unexpected tokens remaining in message header
#23567 commented on
Sep 5, 2025 • 0 new comments -
[Feature][Responses API] Support tool_choice other than "auto"
#23227 commented on
Sep 5, 2025 • 0 new comments -
[Roadmap] vLLM Release/CI/Performance Benchmark Q2 2025
#16284 commented on
Sep 5, 2025 • 0 new comments -
[CI]: Speed up Models Tests
#23670 commented on
Sep 4, 2025 • 0 new comments -
[Feature Request]: Per-rank log files (especially per-actor for Ray)
#23761 commented on
Sep 4, 2025 • 0 new comments -
[RFC]: Address piecewise graph splitting and attention fusion incompatibility
#23261 commented on
Sep 4, 2025 • 0 new comments -
[CI]: Have CI tests fail-fast
#23453 commented on
Sep 4, 2025 • 0 new comments -
[New Model]: Grok 2
#23557 commented on
Sep 4, 2025 • 0 new comments -
[CI]: Reduce docker build time with caching
#23588 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: GPT-OSS harmony format support
#23217 commented on
Sep 4, 2025 • 0 new comments -
[RFC]: Enabling Multiple Graphs Based on pre-defined conditions
#23113 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Failed to load model from local s3 instance
#23236 commented on
Sep 4, 2025 • 0 new comments -
[Feature][Tools]: Complete Redesign of Tool Calling
#22918 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: Add Moving Average Statistics for Better Performance Monitoring
#22480 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Incorrect output throughput calculation for concurrent requests in benchmark_serving.py
#23820 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: embed prompts
#19746 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: `graph.eliminate_dead_code()` break the fx graph with `enable_fi_allreduce_fusion` when TP == 2
#23091 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: Add LoRA support for gpt-oss model
#23610 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: Load Qwen3 Moe model error when starting the vllm server on TPU
#23834 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: When running inference with the Moonlight model, the output becomes corrupted when n exceeds 1
#19206 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: vLLM server timeout due to multiprocessing communication error
#23582 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: Which dataset do you recommend using for the ngram spec decoding method?
#23611 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: Unable to see more than 20% improvement on b200 for vllm
#23609 commented on
Sep 3, 2025 • 0 new comments -
[RFC]: Disaggregated Everything - Token In <> Token Out API Server
#22817 commented on
Sep 3, 2025 • 0 new comments -
[CI]: Declarative regression tests for API parameters
#23593 commented on
Sep 3, 2025 • 0 new comments -
[Installation]: Nightly builds not available in container registry
#19335 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Multimodal Benchmarking Support (MMLM)
#21887 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Add support for Apple MPS(Metal Performance Shaders)
#22629 commented on
Sep 3, 2025 • 0 new comments -
[Doc]: update contributing guide for macOS Apple silicon
#16940 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: [P/D] P/d is incompatible with spec decoding
#21583 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: If I want gpt-oss to be able to call custom tools, how should I set the --tool-call-parser parameter during deployment?
#22308 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Devstral-Small-2507 tool parsing issue when streaming
#23180 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: When accessing the API with the 'stop' parameter, the 'qwen3-reasoning-parser' fails to function correctly.
#22412 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Qwen2_5_VLForEmbedding
#13373 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: gpt-oss model output issue
#23694 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Structured output is not correctly enforced when using GPT-OSS
#23120 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: RuntimeError: NCCL error: unhandled cuda error
#21661 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Qwen 3 2507 update models use `deepseek_r1` reasoning parser - suggest renaming
#22657 commented on
Sep 3, 2025 • 0 new comments -
[Performance]: Long startup delay due to plugin loading and subprocess spawning
#21051 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Qwen3-Reranker-vllm exhibits a large gap between offline and online inference.
#20730 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
#8177 commented on
Sep 3, 2025 • 0 new comments -
[RFC]: Refactor CI/CD
#22992 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: vLLM can't serve multi-audio input inference
#16914 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Qwen2vl grounding results with vLLM are worse than with transformers inference
#11254 commented on
Sep 3, 2025 • 0 new comments -
Issue with Mistral Small and greek characters
#14307 commented on
Sep 3, 2025 • 0 new comments -
First tpot/itl is too long?
#15106 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: how to get the hidden states
#19207 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: Can multimodal models, such as qwen2.5vl, use the PD separation feature?
#19213 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: Error when running a finetuned, quantized model with vllm.
#19218 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: Warn or auto-convert to FlexibleArgumentParser in AsyncEngineArgs.add_cli_args
#19221 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: Use `QuantFp8` `CustomOp`-abstraction for MoE layers
#20711 commented on
Sep 4, 2025 • 0 new comments -
[Feature][Chat Completion] Support builtin tools of gpt-oss
#23292 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Qwen3 Models GGUF Support
#21511 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: V1 pre-compiled graph loading much slower than V0
#20342 commented on
Sep 3, 2025 • 0 new comments -
[RFC]: Reduce Unit Test to Speed Up CI
#22041 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: FunctionDefinition missing optional param strict
#15526 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Vulkan support
#21182 commented on
Sep 3, 2025 • 0 new comments -
[CI]: Use `HF_HUB_OFFLINE=1` in CI tests
#23451 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: gpt-oss-120b tool calls
#22337 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Support `qwen3` Models in `eagle3` Speculative Decoding
#23464 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Strange error `AssertionError: failed to get the hash of the compiled graph` when running `Qwen/Qwen3-8B` via `LLM` class
#18851 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Enabling custom_op for rotary_embedding raises an error for Qwen3-4B
#21101 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: How to run model - `RedHatAI/Mixtral-8x7B-Instruct-v0.1-FP8`
#23192 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: 5090 cannot run Qwen3-30B-A3B-NVFP4!
#23826 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Stub function of moe_wna16_marlin_gemm takes less positional arguments than real implementation
#22634 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: AttributeError: module 'torch._tensor' has no attribute 'split'
#22676 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Numerics of Embedding Models
#22862 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: vLLM v1 hanging during Torch compilation
#15360 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: [V1][Spec Dec] EAGLE TP > 1 leads to errors when using --enforce_eager
#17513 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Qwen3-GPTQ | Error in inspecting model architecture 'Qwen3MoeForCausalLM'
#19504 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Consider deleting envs.VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE
#21834 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 fails with vLLM-compile in torch <= 2.7.1
#21858 commented on
Sep 3, 2025 • 0 new comments