Insights: vllm-project/vllm
149 Pull requests merged by 92 people
-
[XPU][P/D] Add XPU support in NixlConnector
#22436 merged
Sep 5, 2025 -
[gpt-oss] tool parser support for /chat/completions [1/n]
#22386 merged
Sep 5, 2025 -
[Frontend] Skip unnecessary detokenization when token_id is requested
#24236 merged
Sep 4, 2025 -
[CI/Build] Reduce the number of redundant cases to test for LoRA
#24276 merged
Sep 4, 2025 -
[Bugfix][Misc] Fix silu_and_mul_nvfp4_quant issue and extract common utils for nvfp4 kernel source files
#23727 merged
Sep 4, 2025 -
[Misc] Have AsyncLLM `custom_stat_loggers` extend default logger list
#20952 merged
Sep 4, 2025 -
QWEN3 Coder Fused MoE kernels Optimization configs
#24266 merged
Sep 4, 2025 -
Upgrade FlashInfer to v0.3.0
#24086 merged
Sep 4, 2025 -
[Misc] Slightly improve deepgemm print
#24085 merged
Sep 4, 2025 -
[Doc]: fix typos in Python comments
#24173 merged
Sep 4, 2025 -
[Perf] Freeze core engine proc heap after init
#24008 merged
Sep 4, 2025 -
[Misc] Removed force_fp8_e4m3fnuz from FP8LinearOp
#23725 merged
Sep 4, 2025 -
[LoRA]: Add lora support to qwen-2.5-omni
#24231 merged
Sep 4, 2025 -
[XPU] support Triton Attention backend on Intel GPU
#24149 merged
Sep 4, 2025 -
Use hidden_size_per_head as head_size fallback
#24221 merged
Sep 4, 2025 -
[Model] Add pp support for hunyuan
#24212 merged
Sep 4, 2025 -
[Doc] Update vLLM Singapore Meetup info
#24234 merged
Sep 4, 2025 -
[Feature][Response API] Add streaming support for non-harmony
#23741 merged
Sep 4, 2025 -
[Hardware][Apple-CPU] Disable OneDNN build for Apple Silicon
#24200 merged
Sep 4, 2025 -
[Attention] FlashAttn MLA
#14258 merged
Sep 4, 2025 -
[Bugfix] Fix Incremental Detokenization with `tokenizers == 0.22.0`
#24159 merged
Sep 4, 2025 -
[Attention][Platform] Refactor MLA to support Custom Op
#23332 merged
Sep 4, 2025 -
Improve flexibility of auto_tune.sh execution.
#23766 merged
Sep 4, 2025 -
[Core][Model] Terratorch backend integration
#23513 merged
Sep 4, 2025 -
[Model] Add MiDashengLM model support
#23652 merged
Sep 4, 2025 -
[Misc] Enhance output readability of helper script
#24214 merged
Sep 4, 2025 -
[CPU] Refactor CPU unquantized linear
#24150 merged
Sep 4, 2025 -
Migrate ultravox inputs to TensorSchema
#23503 merged
Sep 4, 2025 -
[Refactor] Introduce basic Renderer for completion-style request
#24010 merged
Sep 4, 2025 -
[Kernel][Bugfix] Fix grouped topk cu
#24146 merged
Sep 4, 2025 -
[Feature][Responses API]Support MCP tools with streaming mode + background mode
#23927 merged
Sep 4, 2025 -
Remove deprecated `PyNcclConnector`
#24151 merged
Sep 3, 2025 -
[Feature][gpt-oss] Add support for num_cached_tokens and num_reasoning_tokens tracking
#23460 merged
Sep 3, 2025 -
[Bugfix][DP] DP distribution does not require ray[default]
#23822 merged
Sep 3, 2025 -
[Feature][P/D]: Optimize NIXL Connector xfer Launch
#23887 merged
Sep 3, 2025 -
[Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend
#23289 merged
Sep 3, 2025 -
Migrate whisper inputs to TensorSchema
#23505 merged
Sep 3, 2025 -
[Kernels] Overlap shared experts with send/recv
#23273 merged
Sep 3, 2025 -
[V1] v1 engine + full CUDA graph support for PLaMo2
#23998 merged
Sep 3, 2025 -
[Bugfix] Fixing division by zero in triton_attn if query_heads/kv_heads > 16
#23424 merged
Sep 3, 2025 -
FIX: Add libnuma-dev to Dockerfile for dev stage
#20388 merged
Sep 3, 2025 -
Fix MiniMax attention module prefix and remove useless code
#23982 merged
Sep 3, 2025 -
Support add_generation_prompt in embeddings endpoint with chat request
#23931 merged
Sep 3, 2025 -
[CI] Accelerate mteb test by setting SentenceTransformers mteb score to a constant
#24088 merged
Sep 3, 2025 -
[Misc] Clean up deadcode for legacy processing pipeline
#24153 merged
Sep 3, 2025 -
[CI/Build] Serve images used by multimodal tests through local HTTP Server
#23907 merged
Sep 3, 2025 -
[Nixl] Heterogeneous TP support FlashInfer
#20189 merged
Sep 3, 2025 -
[distributed][rl] remove nccl cumem env var override
#24141 merged
Sep 3, 2025 -
[BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models
#24132 merged
Sep 3, 2025 -
[Misc] Add check for dual_chunk_attention
#24070 merged
Sep 3, 2025 -
[Doc]: fix typos in Python comments
#24115 merged
Sep 3, 2025 -
[Doc]: fix typos in Python comments
#24093 merged
Sep 3, 2025 -
[Compile] Fix Compile Warning for `w4a8_mm_entry.cu`
#23660 merged
Sep 3, 2025 -
fix some typos
#24071 merged
Sep 3, 2025 -
[V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing
#23656 merged
Sep 3, 2025 -
Upgrade xgrammar to 0.1.23
#22988 merged
Sep 3, 2025 -
Update release pipeline post PyTorch 2.8.0 update
#24073 merged
Sep 3, 2025 -
[XPU] Fix the bug of LoRA logits on the XPU platform
#24081 merged
Sep 3, 2025 -
[CI/Build] Disable SiluMul NVFP4 quant fusion tests
#24121 merged
Sep 2, 2025 -
[Bug] R1 Accuracy: Fix `routed_scaling_factor` Double Mul Issue
#24119 merged
Sep 2, 2025 -
[AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault
#23692 merged
Sep 2, 2025 -
[CI] Enable all hf transformers baselines in test_hybrid
#23936 merged
Sep 2, 2025 -
[Log] Only Print Profiler Results on Rank 0
#23370 merged
Sep 2, 2025 -
Fix weights loading for Apertus
#24100 merged
Sep 2, 2025 -
[Metrics] Deprecate TPOT in favor of ITL
#24110 merged
Sep 2, 2025 -
[Bugfix] Fix packed_factor missing attribute error
#23902 merged
Sep 2, 2025 -
Run ruff format on a few files.
#24075 merged
Sep 2, 2025 -
[Bugfix] Fix transform_config parsing in Compressed Tensors
#23945 merged
Sep 2, 2025 -
[Benchmark] Add support for local hf dataset path in benchmark
#23999 merged
Sep 2, 2025 -
[docs] add SYS_NICE cap & `security-opt` for docker/k8s
#24017 merged
Sep 2, 2025 -
[CI Failure] Skip failing nvfp4 silu test
#23959 merged
Sep 2, 2025 -
[Model] Classification models support logit_bias / sigmoid_normalize
#24031 merged
Sep 2, 2025 -
[BugFix] Fix EXAONE4 rotary embeddings
#23918 merged
Sep 2, 2025 -
[Gemma3n] Fix audio batching
#24052 merged
Sep 2, 2025 -
correct LWS deployment yaml
#23104 merged
Sep 2, 2025 -
[CI]: reduce HTTP calls inside entrypoints openai tests
#23646 merged
Sep 2, 2025 -
[Model] Support dp on ViT on GLM-4.5V
#23168 merged
Sep 2, 2025 -
[Doc]: fix typos in Python comments
#24077 merged
Sep 2, 2025 -
Migrate Interns1 inputs to TensorSchema
#23510 merged
Sep 2, 2025 -
[XPU][Feature] fp8 online quantization support for XPU
#23148 merged
Sep 2, 2025 -
Migrate OvisImagePatchInputs to TensorSchema
#22024 merged
Sep 2, 2025 -
Remove runtime checks based on pooling params
#24051 merged
Sep 2, 2025 -
[Bugfix] Fix the issue that Blip2ForConditionalGeneration' object has…
#24028 merged
Sep 2, 2025 -
[V1][Mamba1] - FP32 SSM Kernel Support
#23506 merged
Sep 2, 2025 -
[Doc]: fix typos in Python comments
#24042 merged
Sep 2, 2025 -
[bugfix]fix MTP hidden states
#24056 merged
Sep 1, 2025 -
[Chore][V0 Deprecation] Move LogProb to a separate file
#24055 merged
Sep 1, 2025 -
[Model] Support DP for ViT on Kimi-VL-A3B-Thinking-2506
#23817 merged
Sep 1, 2025 -
[docs][misc] IOProcessor plugins fixes
#24046 merged
Sep 1, 2025 -
[Misc] Minor code simplification for spec decode
#24053 merged
Sep 1, 2025 -
Document multi-proc method selection for profiling
#23802 merged
Sep 1, 2025 -
[Model]: support KeyeVL-1_5-8B
#23838 merged
Sep 1, 2025 -
[Doc]: Fix CPU install docs: force torch-backend=cpu to avoid GPU torchvision errors
#24033 merged
Sep 1, 2025 -
[Frontend] Gemma3n audio `transcriptions`/`translations` endpoint
#23735 merged
Sep 1, 2025 -
[Doc]: fix typos in Python comments
#24026 merged
Sep 1, 2025 -
[Kernel] Update DeepGEMM to latest commit
#23915 merged
Sep 1, 2025 -
[Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN
#20904 merged
Sep 1, 2025 -
[Misc] Enable V1 FP16 inference on pre-Ampere GPUs
#24022 merged
Sep 1, 2025 -
[Misc] add hash_function doc string
#24014 merged
Sep 1, 2025 -
[Bugfix] Add support for `<tool_call>` format in streaming mode for XLAM Tool Parser
#22769 merged
Sep 1, 2025 -
[Misc] IO Processor plugins for pooling models
#22820 merged
Sep 1, 2025 -
Migrate Phi4 inputs to TensorSchema
#23471 merged
Sep 1, 2025 -
[Misc] refactor code by import as for torch._inductor.config
#23677 merged
Sep 1, 2025 -
[CI/Build] Improve Tensor Schema tests speed by avoid engine core initialization
#23357 merged
Sep 1, 2025 -
[Misc] Move fast prefill logic to separate method
#24013 merged
Sep 1, 2025 -
Fix the bug related to loading GPTP INT3 weights.
#23328 merged
Sep 1, 2025 -
[Misc] Avoid redundant copy for encoder-only models
#24012 merged
Sep 1, 2025 -
[BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ)
#23994 merged
Sep 1, 2025 -
v1: Support KV events from connectors
#19737 merged
Sep 1, 2025 -
[Minor] Fix some random typos in comments
#24009 merged
Aug 31, 2025 -
vllm fix check on max vocab size
#22471 merged
Aug 31, 2025 -
[Doc]: fix typos in Python comments
#24001 merged
Aug 31, 2025 -
[Core][Multimodal] Allow passing `multi_modal_uuids` as multimodal identifiers.
#23394 merged
Aug 31, 2025 -
Fix wrong truncate_prompt_tokens type hint
#22761 merged
Aug 30, 2025 -
[LoRA] Much faster startup when LoRA is enabled
#23777 merged
Aug 30, 2025 -
[Misc] enhance type hint for rearrange return value
#23519 merged
Aug 30, 2025 -
[Refactor] refactor freezing_value/cuda_event initialize outside try finally
#23758 merged
Aug 30, 2025 -
[Misc] add reorder_batch AttentionMetadataBuilder
#23798 merged
Aug 30, 2025 -
Add LoRA support for DeepSeek models (V2, V3, R1-0528)
#23971 merged
Aug 30, 2025 -
[Model] Enable encoder DP for MiniCPM-V
#23948 merged
Aug 30, 2025 -
[UT] fix unify_kv_cache_configs when kv cache config needs sort
#23843 merged
Aug 30, 2025 -
[Bugfix] Fix test_lora_resolvers.py
#23984 merged
Aug 30, 2025 -
[V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba
#23831 merged
Aug 30, 2025 -
[Core] Cleanup TPU model runner for MM
#23894 merged
Aug 30, 2025 -
[CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant fusion
#23973 merged
Aug 30, 2025 -
[CI] Move testing image from remote URL to S3
#23980 merged
Aug 30, 2025 -
Add routed_scaling_factor to MoE grouped topk
#23123 merged
Aug 30, 2025 -
[Bugfix] Fix --config arg expansion called from api_server.py
#23944 merged
Aug 30, 2025 -
[CI] Fix unavailable image remote URL
#23966 merged
Aug 29, 2025 -
[Misc] Make `download_weights_from_hf` more reliable
#23863 merged
Aug 29, 2025 -
Revert gemma3n fast prefill changes
#23897 merged
Aug 29, 2025 -
[Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-based models
#23824 merged
Aug 29, 2025 -
Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj
#23939 merged
Aug 29, 2025 -
[RL][BugFix] Fix missing tokenizer error for token-in-token-out
#23904 merged
Aug 29, 2025 -
[BUGFIX] fix undefined silu_and_mul_nvfp4_quant
#23929 merged
Aug 29, 2025 -
[CI] Add `aiter` to matching list of issue auto labeller for `rocm` tag
#23942 merged
Aug 29, 2025 -
[BugFix] Async scheduling and PP compatibility with DP
#23770 merged
Aug 29, 2025 -
[Models] Use in-place adds in Idefics2Vision
#23932 merged
Aug 29, 2025 -
[MODEL] `Apertus` and `XIELU`
#23068 merged
Aug 29, 2025 -
Adds `json_count_leaves` utility function
#23899 merged
Aug 29, 2025 -
Update PyTorch to 2.8.0
#20358 merged
Aug 29, 2025 -
[Multimodal] Consolidate mm inputs into MultiModalFeatureSpec
#23779 merged
Aug 29, 2025 -
[Performance] V1 Classify Models E2E Performance Optimization
#23541 merged
Aug 29, 2025 -
[CPU] Enable data parallel for CPU backend
#23903 merged
Aug 29, 2025 -
[V0 Deprecation] Remove pooling model support in V0
#23434 merged
Aug 29, 2025 -
Better errors for Transformers backend missing features
#23759 merged
Aug 29, 2025 -
[Misc] Fix warnings for mistral model
#23552 merged
Aug 29, 2025 -
[CI/Build] Clean up LoRA test
#23890 merged
Aug 29, 2025
167 Pull requests opened by 136 people
-
[feat] preserve metadata for quantized model weight reload
#23901 opened
Aug 29, 2025 -
allow calc_kv_scales
#23906 opened
Aug 29, 2025 -
[Benchmark] add benchmark for custom activation op
#23908 opened
Aug 29, 2025 -
[Model] enable data parallel for InternVL vision encoder
#23909 opened
Aug 29, 2025 -
[Attention]: Fix Torch compile error when --calculate-kv-scales is enabled
#23912 opened
Aug 29, 2025 -
[Core] Refactor EPLB
#23913 opened
Aug 29, 2025 -
[Performance] implement async_scheduling in single process mode
#23914 opened
Aug 29, 2025 -
kv_output_aggregator support heterogeneous
#23917 opened
Aug 29, 2025 -
Add automatic max model length selection
#23920 opened
Aug 29, 2025 -
[Model loader]: support multi-thread model weight loading
#23928 opened
Aug 29, 2025 -
Add actionable solutions to top 3 error messages
#23930 opened
Aug 29, 2025 -
Dequant kv_a_proj_with_mqa for DSV3
#23933 opened
Aug 29, 2025 -
[Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints
#23937 opened
Aug 29, 2025 -
[Bugfix] Handle the edge case in detokenizer where processed tokens contain both `stop` str and `eos` token
#23938 opened
Aug 29, 2025 -
[V1] [Hybrid] Mamba2 Automatic Prefix Caching
#23941 opened
Aug 29, 2025 -
fit the qwen3 moe's awq quantization for 2080Ti.
#23949 opened
Aug 29, 2025 -
[wip] allow skip media
#23950 opened
Aug 29, 2025 -
feat: Add Eagle3 speculative decoding support for Llama4
#23951 opened
Aug 29, 2025 -
[BugFix] Fix de-functionalization pass for rotary_embedding
#23953 opened
Aug 29, 2025 -
[Attention] FlashAttention MLA cudagraph support
#23958 opened
Aug 29, 2025 -
Remove old cutlass mla
#23961 opened
Aug 29, 2025 -
[Core] Add tensor analysis utility for multimodal cache debugging
#23962 opened
Aug 29, 2025 -
Enable Allgather/ReduceScatter backend for NaiveAllToAll
#23964 opened
Aug 29, 2025 -
[rocm] update pytorch rocm from 6.3 to 6.4
#23968 opened
Aug 29, 2025 -
[Kernel] Faster pre-processing time for W4A8
#23972 opened
Aug 29, 2025 -
[Hybrid Allocator] Support Pipeline Parallel
#23974 opened
Aug 30, 2025 -
Next Fix for Compile with Cuda 13
#23976 opened
Aug 30, 2025 -
Feature/vit attention unification #23880
#23978 opened
Aug 30, 2025 -
fix total_time of benchmark_hashing
#23987 opened
Aug 30, 2025 -
[Model] Add LongCat-Flash
#23991 opened
Aug 30, 2025 -
[Bugfix] Fix several issues with p2p xPyD in GET type
#23993 opened
Aug 30, 2025 -
Feature/deepseek v31 lora support
#23995 opened
Aug 30, 2025 -
[Feature]: Support Phi4Flash model in V1
#23996 opened
Aug 31, 2025 -
Feature/sampler benchmark #23977
#23997 opened
Aug 31, 2025 -
optimize serving_score loops.
#24000 opened
Aug 31, 2025 -
[V1][CUDA Graph] Fix attention metadata tensor sizes for padded batches
#24002 opened
Aug 31, 2025 -
[LoRA] Gemma3n LoRA support
#24003 opened
Aug 31, 2025 -
[V1][Metrics] Add per-request TPOT histogram
#24015 opened
Sep 1, 2025 -
Allow loading of cpatonn/InternVL3_5-14B-AWQ-4bit
#24018 opened
Sep 1, 2025 -
[Model] Add Eagle 2.5 VL
#24019 opened
Sep 1, 2025 -
[Bugfix] Fix sequence parallelism bug when enabling pipeline parallelism
#24021 opened
Sep 1, 2025 -
[Feature][Quantization] auto_round format add support for regex
#24024 opened
Sep 1, 2025 -
Support using SigLIP2 text and image embedding as standalone model
#24027 opened
Sep 1, 2025 -
Fix typo in test_attention_backends.py
#24030 opened
Sep 1, 2025 -
[BugFix] GPT-OSS Attention DP + MoE TP weight loading issue
#24032 opened
Sep 1, 2025 -
[Hardware][IBM Z] Fix Outlines Core issue for s390x
#24034 opened
Sep 1, 2025 -
[Feature][Quantization] Support Quark for mixed-precision quantized model
#24040 opened
Sep 1, 2025 -
[Docs] Enable relative links in examples to function when rendered in the docs
#24041 opened
Sep 1, 2025 -
Update to Transformers 4.55.3
#24043 opened
Sep 1, 2025 -
Issue 19007 Individual GuidedDecodingParams for each prompt in prompts
#24047 opened
Sep 1, 2025 -
[Spec Decoding] Support Spec Decoding Metrics in DP Mode
#24049 opened
Sep 1, 2025 -
[Kernels][DP/EP] Optimize Silu Kernel for R1
#24054 opened
Sep 1, 2025 -
[CI] Replace large models with tiny alternatives in tests
#24057 opened
Sep 1, 2025 -
Use Numpy array for sampled_token_ids
#24061 opened
Sep 2, 2025 -
[P/D]support for the v1/chat/completions interface to the disagg_proxy_server
#24065 opened
Sep 2, 2025 -
[BugFix] `python collect_env.py` and `vllm collect-env` compatibility with uv venv
#24066 opened
Sep 2, 2025 -
Gfx908 attn fix
#24068 opened
Sep 2, 2025 -
Reconstruct EPLB algorithm invocation method
#24069 opened
Sep 2, 2025 -
[Bugfix] Fix Qwen3-coder moe tuned config
#24072 opened
Sep 2, 2025 -
[BugFix][Model] Fix Ernie4.5-VL hanging on long inputs
#24074 opened
Sep 2, 2025 -
[Core] Remove tokenizer group in vLLM
#24078 opened
Sep 2, 2025 -
[CI] Move V1 Core tests to CPU
#24080 opened
Sep 2, 2025 -
Support LongCat-Flash-Chat tool call
#24083 opened
Sep 2, 2025 -
[Bugfix] Fix AssertionError in cache_full_blocks due to dirty blocks
#24084 opened
Sep 2, 2025 -
[Benchmarks] Add --skip-check argument to reduce wait time
#24087 opened
Sep 2, 2025 -
[V1] Add sliding window support to Flex Attention backend
#24089 opened
Sep 2, 2025 -
[Perf] EPLB optimize export_load_view update
#24091 opened
Sep 2, 2025 -
[Docs] Fix warnings in `mkdocs build` (continued)
#24092 opened
Sep 2, 2025 -
[ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops
#24097 opened
Sep 2, 2025 -
The downloaded tags directory is missing a `.git` folder, which is ca…
#24099 opened
Sep 2, 2025 -
[Bugfix] Enable swiglu oai for fused marlin moe
#24101 opened
Sep 2, 2025 -
[CI] Add nightly multiarch manifests to dockerhub
#24102 opened
Sep 2, 2025 -
[Bugfix] sliding_window AttributeError
#24103 opened
Sep 2, 2025 -
Update num_tokens_across_dp to use nccl instead of gloo
#24105 opened
Sep 2, 2025 -
[Transform] Deterministic Hadacore Transforms
#24106 opened
Sep 2, 2025 -
fixed reasoning streaming with tool_choice="required"
#24108 opened
Sep 2, 2025 -
[Kernels][AR] Enable Torch Symmetric Memory By Default
#24111 opened
Sep 2, 2025 -
[Kernel] Split moe tuned configs
#24113 opened
Sep 2, 2025 -
test_chunked_prefill_pooler referencing #23436
#24114 opened
Sep 2, 2025 -
[Models][Quantization] Add quantization configuration update in Voxtral model
#24122 opened
Sep 2, 2025 -
[Compilation][WideEP] Enable Piecewise CUDAGraph for DeepEPHT
#24123 opened
Sep 2, 2025 -
update spec decode metrics to use throughput
#24127 opened
Sep 2, 2025 -
[Core] Run garbage collector after CUDA graph capture to fix throughput regression
#24128 opened
Sep 2, 2025 -
[Hardware][Apple-CPU] Enable native bfloat16 on Apple Silicon (M2 and later)
#24129 opened
Sep 2, 2025 -
[CI Sprint] Quantization CI Cleanup
#24130 opened
Sep 2, 2025 -
[Ultravox] Fix gemma instantiation
#24131 opened
Sep 2, 2025 -
[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE
#24134 opened
Sep 3, 2025 -
WIP [Renderer] Move Processor out of AsyncLLM
#24138 opened
Sep 3, 2025 -
reduce the weight loading time
#24154 opened
Sep 3, 2025 -
[Kernel][SM100]: Enable FI FusedMoE By Default for Llama
#24157 opened
Sep 3, 2025 -
[GPT-OSS] Fix Pydantic union resolution for ResponseFunctionToolCall in Responses API
#24158 opened
Sep 3, 2025 -
[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames
#24161 opened
Sep 3, 2025 -
[Log] Per Rank Log
#24162 opened
Sep 3, 2025 -
fix some typos
#24167 opened
Sep 3, 2025 -
[logging] Refine PyNcclConnector Proxy logging
#24168 opened
Sep 3, 2025 -
Optimize detokenizer performance for long-generation sequences
#24174 opened
Sep 3, 2025 -
[Core] Exposing engine sleep & wake_up state as prometheus metrics
#24176 opened
Sep 3, 2025 -
[MacOS] skip pip-compile for pre-commit on MacOS
#24177 opened
Sep 3, 2025 -
[Bugfix] fix modelopt exclude_modules name mapping
#24178 opened
Sep 3, 2025 -
[Feature] add reasoning tokens
#24181 opened
Sep 3, 2025 -
[Misc] Harden `SamplingParams.from_optional` support
#24183 opened
Sep 3, 2025 -
[Spec Decode][Model]Add qwen2-eagle
#24187 opened
Sep 3, 2025 -
[torch.compile] Custom op matching
#24188 opened
Sep 3, 2025 -
[CI/Build] bump timm dependency
#24189 opened
Sep 3, 2025 -
[feat]: Create interface for model-specific M-RoPE
#24194 opened
Sep 3, 2025 -
[Docs] Fix install device tabs being out of sync when directly linked to
#24195 opened
Sep 3, 2025 -
[flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention
#24197 opened
Sep 3, 2025 -
Support prompt hidden states
#24202 opened
Sep 3, 2025 -
[Refactor] Refactor to extract model forward logic to allow plug-in t…
#24205 opened
Sep 4, 2025 -
[Frontend] add 'verbose_json' and 'timestamp' feature on Whisper Transcription/Translation
#24209 opened
Sep 4, 2025 -
[DO NOT MERGE] PR for testing
#24210 opened
Sep 4, 2025 -
[Docs] add eplb_config param usage docs
#24213 opened
Sep 4, 2025 -
[backends][short_conv] CUDA graph piecewise edits
#24215 opened
Sep 4, 2025 -
[Misc] fix lmcache cpu offload example
#24216 opened
Sep 4, 2025 -
Fix Auto_Round Quantization Loading on SM75 and Lower GPUs
#24217 opened
Sep 4, 2025 -
[Core] Support async scheduling with uniproc executor
#24219 opened
Sep 4, 2025 -
[UT] enhance free kv cache block queue popleft_n
#24220 opened
Sep 4, 2025 -
[Docs] add the parallel sampling usage in LLMEngine and AsyncLLM
#24222 opened
Sep 4, 2025 -
[bugfix] fix returned chunk too large bug
#24224 opened
Sep 4, 2025 -
[Benchmarks] Accelerate random dataset generation
#24225 opened
Sep 4, 2025 -
[Misc] update log level debug to warning when process port is used by
#24226 opened
Sep 4, 2025 -
[Feature] support xPyD reconnect
#24227 opened
Sep 4, 2025 -
[kv cache] update num_free_blocks in the end
#24228 opened
Sep 4, 2025 -
[Misc] rename interval to max_recent_requests
#24229 opened
Sep 4, 2025 -
Fix unknown recipient none #24170
#24233 opened
Sep 4, 2025 -
Change the default value of truncate_prompt_tokens in the embedding/rerank/pooling model to -1
#24235 opened
Sep 4, 2025 -
[Feature][Quantization] extend Quark to support mixed-precision quantized model
#24239 opened
Sep 4, 2025 -
Eagle3 that supports the Minicpm3 model
#24243 opened
Sep 4, 2025 -
[Metrics] Hide deprecated metrics with gpu_ prefix
#24245 opened
Sep 4, 2025 -
[kernel] Add stride checks for rms_norm kernels
#24247 opened
Sep 4, 2025 -
[PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds
#24248 opened
Sep 4, 2025 -
[test] make NixlConnector example more clear
#24249 opened
Sep 4, 2025 -
[Core] Add delayed batching
#24250 opened
Sep 4, 2025 -
v1: CPU offloading
#24251 opened
Sep 4, 2025 -
[Compile] Conditional compilation. Introduce compile_ranges
#24252 opened
Sep 4, 2025 -
[CI] Speed up model unit tests in CI
#24253 opened
Sep 4, 2025 -
[Kernels] Overlap shared experts with combine instead of dispatch
#24254 opened
Sep 4, 2025 -
[CI/Build] Fail test groups fast using pytest -x and bash -e
#24255 opened
Sep 4, 2025 -
[Spec Decode] Fix offline spec_decode.py
#24257 opened
Sep 4, 2025 -
[ci/testing]: ensure the gpu memory is cleaned when exiting the remote OpenAI server
#24258 opened
Sep 4, 2025 -
[CI] Small Accuracy Eval Test for Deepseek Model
#24259 opened
Sep 4, 2025 -
[CI] Add timeouts to tests
#24260 opened
Sep 4, 2025 -
[do not merge] Tokens in<>out `/generate` endpoint
#24261 opened
Sep 4, 2025 -
[IGNORE] Timing model tests in fast-check
#24262 opened
Sep 4, 2025 -
Support SeedOss Reason Parser
#24263 opened
Sep 4, 2025 -
break execute_model in gpu_model_runner into sub-functions for custom scopes
#24265 opened
Sep 4, 2025 -
[Misc] Add ReplicaId to Ray metrics
#24267 opened
Sep 4, 2025 -
Draft: make deletion atomic in nixl timeout handling
#24268 opened
Sep 4, 2025 -
[Core] Simplify and unify mm uuid handling & auto-generated mm hash overrides processing.
#24271 opened
Sep 4, 2025 -
[Tests] fix initialization of kv hash in tests
#24273 opened
Sep 4, 2025 -
AOT Compilation for torch.compile (Bundled)
#24274 opened
Sep 4, 2025 -
[ROCm][Feature] Enable Pipeline Parallelism with Ray Compiled Graph on ROCm
#24275 opened
Sep 4, 2025 -
[Core] Support configuration parsing plugin
#24277 opened
Sep 4, 2025 -
[CORE] Prompt Embeddings Support for v1 Engine
#24278 opened
Sep 4, 2025 -
[ROCm][CI/Build] Sync ROCm dockerfiles with the ROCm fork
#24279 opened
Sep 4, 2025 -
CUDAGraph partition integration
#24281 opened
Sep 4, 2025 -
[Frontend][Responses API] Support reporting tool output tokens and fix reasoning token count
#24285 opened
Sep 5, 2025 -
Add Support for Grok2
#24286 opened
Sep 5, 2025 -
Fix cmake incremental build when running "pip install --no-build-isolation -e ."
#24287 opened
Sep 5, 2025 -
[Bugfix] guard missing attn_metadata in KV scales path
#24290 opened
Sep 5, 2025 -
[Doc]: fix typos in Python comments
#24294 opened
Sep 5, 2025 -
[feat] fast inplace model update
#24295 opened
Sep 5, 2025 -
Add vllm:request_prefill_comp_speed metric to Prometheus
#24296 opened
Sep 5, 2025
120 Issues closed by 26 people
-
[Bug]: Question about logprobs output being 0.0 when using `vllm` sampling params
#17286 closed
Sep 5, 2025 -
[Feature]: LoRA support for qwen2-vl Models
#11255 closed
Sep 5, 2025 -
[RFC]: Refactor tool parsers to eliminate coding errors and allow more efficient implementations.
#11522 closed
Sep 5, 2025 -
[Usage]: Automatic Prefix Cache life cycle
#12077 closed
Sep 5, 2025 -
[Misc] [ROCm]: Build from source failure with Arch/gcc14 with ROCm 6.3
#13777 closed
Sep 5, 2025 -
[Bug]: ModuleNotFoundError: No module named 'pyarrow" in main branch
#14487 closed
Sep 5, 2025 -
[Usage]: Can AsyncLLMEngine support batch infer?
#14717 closed
Sep 5, 2025 -
[Bug]: Design flaws in the current tool parser.
#15177 closed
Sep 5, 2025 -
[Bug]: H20*TP16, can't start service, get error: Cannot allocate memory
#16142 closed
Sep 5, 2025 -
[Bug]: vLLM still runs after Ray workers crash
#16259 closed
Sep 5, 2025 -
[Feature Request]: Support data_parallel_size in offline inference mode
#16588 closed
Sep 5, 2025 -
[Doc]: state requirements for testing or update to work for CPU-only
#16920 closed
Sep 5, 2025 -
[Bug]: swap_blocks and copy_blocks functions are wrong in flashinfer.py
#17362 closed
Sep 5, 2025 -
[Bug]: A800 GPU set VLLM_USE_V1=1 ValueError: No available memory for the cache blocks
#17431 closed
Sep 5, 2025 -
[Bug]: [v1][Spec Dec] Specifying draft TP does not have any impact.
#17499 closed
Sep 5, 2025 -
[Bug]: Can't serve Q4_K_M-GGUF model (can we serve it?)
#17661 closed
Sep 5, 2025 -
[Bug]: Slight Embedding Precision Difference When Running bge-m3 in vLLM Compared to Original Model
#17713 closed
Sep 5, 2025 -
[Feature]: Support for OpenGVLab/InternVL3-38B-AWQ
#17734 closed
Sep 5, 2025 -
[Feature]: Does vLLM allow 'dropping' requests instead of preempting them?
#17736 closed
Sep 5, 2025 -
[Bug]: Interrupting inference with ctrl-c causes future requests to hang
#17738 closed
Sep 5, 2025 -
[Feature]: Support quantization for pooling model which does embedding.
#17760 closed
Sep 5, 2025 -
[Usage]: How to Truncate multi-modal tokens
#17765 closed
Sep 5, 2025 -
[Bug]: Logits processing with Lora is incorrect
#17766 closed
Sep 5, 2025 -
[Feature]: Support for IBGDA
#17774 closed
Sep 5, 2025 -
[Bug]: Large Data Parallel Size Cause Loading Safetensors Extremely Slow
#17783 closed
Sep 5, 2025 -
[Usage]: Is it possible to use CUDA Graph during the encoding for encoder-decoder models?
#17789 closed
Sep 5, 2025 -
[Usage]: Self-deployed vLLM cannot call tools; needs --enable-auto-tool-choice, then prompts to configure --chat-template-content-format, and finally errors out
#17792 closed
Sep 5, 2025 -
[Usage]: How to output metrics information from vllm?
#17795 closed
Sep 5, 2025 -
[Usage]: how to return attention_weight logits in page_attention
#17796 closed
Sep 5, 2025 -
[Installation]: How to deploy docling model on vllm
#17807 closed
Sep 5, 2025 -
[Bug]: Disaggregated Prefill in vLLM 0.8.3 Produces Incorrect/Unreasonable Outputs
#17808 closed
Sep 5, 2025 -
[Usage]: Deploy EasyOCR , Docling models on vllm
#17814 closed
Sep 5, 2025 -
[Bug]: vllm 0.8.5.dev468+g98834fefa.precompiled OOM on Qwen3-32B with 1 lora module
#17822 closed
Sep 5, 2025 -
[Performance]: why the batch-embeddings inputs are separated to small single one?
#18867 closed
Sep 5, 2025 -
[Feature]: Add LoRA adapter support for Qwen2.5-Omni models
#24193 closed
Sep 4, 2025 -
[Bug]: PLaMo2.1 does not work with v1 engine
#24204 closed
Sep 4, 2025 -
[Feature][Responses API] Support MCP tool in background mode
#23295 closed
Sep 4, 2025 -
[Bug]: responses api - no error on exceeding `max_tokens`
#24184 closed
Sep 4, 2025 -
[Bug]: PyNcclConnector is deprecated, but some docs/tests still use it
#24152 closed
Sep 3, 2025 -
[Feature][Response API] Support `num_cached_tokens` and `num_reasoning_tokens` in ResponseUsage
#23363 closed
Sep 3, 2025 -
[Feature]: Support `Plamo2` Model in V1
#23956 closed
Sep 3, 2025 -
[Bug]: plamo2 broken on main using transformers==4.55.0
#22999 closed
Sep 3, 2025 -
[Bug]: Docker build fails on dev stage due to missing libnuma-dev
#20384 closed
Sep 3, 2025 -
[CI]: Host images used by multimodal tests locally
#23594 closed
Sep 3, 2025 -
[Usage]: How to start vllm actor of ray without loading weights when RLHF?
#24064 closed
Sep 3, 2025 -
[Doc]: content in "Add models with the FSDP backend" is expired
#24143 closed
Sep 3, 2025 -
[Bug]: Tensor-parallel offline inference fails with CalledProcessError: Command '['/usr/bin/gcc'....] returned non-zero exit status 1.
#15013 closed
Sep 3, 2025 -
[Bug]: Use the latest version of the inference model and use API calls to report errors.(V0.8.5)
#17430 closed
Sep 3, 2025 -
[Bug]: failed to run LMCache example for v0
#17545 closed
Sep 3, 2025 -
[Bug]: content is null when using "chat_template_kwargs": {"enable_thinking": false} in the request.
#17609 closed
Sep 3, 2025 -
[Bug]: Qwen2.5-vl-7B stuck after loading weight and use a lot of shared GPU memory
#17611 closed
Sep 3, 2025 -
[Usage]: vLLM on multiple node GPUs
#17645 closed
Sep 3, 2025 -
[Feature]: Support for streaming N tokens at a time in AsyncLLMEngine
#17681 closed
Sep 3, 2025 -
[Bug]: R1 Accuracy Issue in Main for `deepep_high_througput`
#24118 closed
Sep 2, 2025 -
[Usage]: `get_mempolicy: Operation not permitted` in docker
#24016 closed
Sep 2, 2025 -
[Bug]: Gemma3n audio path crashes when input_features is a list not a Tensor.
#24006 closed
Sep 2, 2025 -
[Doc]: LWS deployment yaml incorrect
#23103 closed
Sep 2, 2025 -
[RFC]: Custom sampling params support in REST API
#17191 closed
Sep 2, 2025 -
[Bug]: Quantized models - NotImplementedError: Could not run '_C::machete_prepack_B'
#16131 closed
Sep 2, 2025 -
[Bug]: `cannot access local variable 'hidden_states'` while trying to enable MTP for deepseek-r1
#23773 closed
Sep 2, 2025 -
[Bug]: Does V0 support DP?
#24036 closed
Sep 1, 2025 -
[Bug]: [P/D] the nixl_connector toy_proxy_server.py will always return httpstatus 200 OK
#23981 closed
Sep 1, 2025 -
[Bug]: Outputs always miss responses if n of SamplingParams>1 with AsyncLLM!
#24029 closed
Sep 1, 2025 -
[Usage]: Is vllm actor of ray an asynchronous engine and supports continuous batching when RLHF?
#23990 closed
Sep 1, 2025 -
[RFC]: Hidden states processor
#12249 closed
Sep 1, 2025 -
[Bug]: Do we really need to implement additional functions for custom_allreduce to serve graph capture?
#18899 closed
Sep 1, 2025 -
[Bug]: HF_HUB_OFFLINE Parameter does not take effect
#22492 closed
Sep 1, 2025 -
[Doc]: Is Qwen2.5's long context YARN handled?
#8793 closed
Sep 1, 2025 -
[Performance]: vllm Eagle performance is worse than expected
#9565 closed
Sep 1, 2025 -
[Bug]: vllm serve: error: the following arguments are required: model_tag
#13150 closed
Sep 1, 2025 -
[Bug]: AssertionError - assert loaded_weight.shape[output_dim] == self.org_vocab_size
#15124 closed
Sep 1, 2025 -
[Bug]: Can't run vllm model because of the FlashAttention.
#15238 closed
Sep 1, 2025 -
[Bug]: OPEA/Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym error
#15300 closed
Sep 1, 2025 -
[Bug]: Llama-3.2-11B-Vision-Instruct has an issue in vision language embedding
#15496 closed
Sep 1, 2025 -
[Bug]: Fail to use deepseek vl2 with images, maybe need a new chat template?
#16953 closed
Sep 1, 2025 -
[Bug]: `http*` metrics missing when running with V0 engine
#17406 closed
Sep 1, 2025 -
[Bug]: 0.8.5 fails when deploying the qwen-vl model; downgrading to 0.8.4 works fine
#17456 closed
Sep 1, 2025 -
[Feature]: benchmarks for vllm, it should support OpenAI Chat Completions API
#17586 closed
Sep 1, 2025 -
[Bug]: Cannot load Gemma3 27b QAT GGUF on RTX 5090
#17587 closed
Sep 1, 2025 -
[Bug]: fp8 w8a8 quantized Qwen2.5-VL hits AssertionError
#17595 closed
Sep 1, 2025 -
[Bug]: [Precision issues] test_flash_attn.py::test_flash_attn_with_paged_kv
#17610 closed
Sep 1, 2025 -
[Usage]: NCCL error when using tow AMD GPUs ( gfx1100 )
#18805 closed
Sep 1, 2025 -
[Bug]: nrt_tensor_allocate status=4 message="Allocation Failure" on AWS Neuron
#12443 closed
Aug 31, 2025 -
[Feature]: Better systemd security feature support
#12474 closed
Aug 31, 2025 -
[Usage]: How to get "num_gpu_blocks" in V1?
#15538 closed
Aug 31, 2025 -
[Bug]: Vllm 0.8.2 + Ray 2.44 (Ray serve deployment) fallbacks to V0 Engine
#15569 closed
Aug 31, 2025 -
[Performance]:
#16342 closed
Aug 31, 2025 -
[Bug]: Why does the deployment process hang when deploying qwen2.5-vl-32b-instruct?
#17151 closed
Aug 31, 2025 -
[Bug]: [v0.8.5] Qwen3 returned reasoning content, but --enable-reasoning was not enabled.
#17346 closed
Aug 31, 2025 -
[Bug]: Can't configure VllmConfig
#17376 closed
Aug 31, 2025 -
[Bug]: fused moe lose weight_loader in verl
#17429 closed
Aug 31, 2025 -
[Bug]: Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal!
#17432 closed
Aug 31, 2025 -
[Usage]: OOM happened when running DeepSeek-R1-BF16 with 80k max model len on 16 GPUs with 90G memory
#17470 closed
Aug 31, 2025 -
[Bug]: [V1][Spec Dec] Rejection sampler accepts different tokens when TP > 1 and Temp > 0
#17498 closed
Aug 31, 2025 -
Issue attempting to serve a model from HF with base model `Llama-3.1-8B-Instruct`
#17505 closed
Aug 31, 2025 -
[Bug]:
#17516 closed
Aug 31, 2025 -
[Bug]: vllm-v0 engine Qwen2.5 Model run eagle algo, KeyError: 'norm.weight' bugfix
#17517 closed
Aug 31, 2025 -
[Bug]: Training with vllm not supports Qwen3
#17527 closed
Aug 31, 2025 -
[Usage]: understanding the vllm's gpu_memory_utilization and cuda graph memory requirement
#17549 closed
Aug 31, 2025 -
[Bug]: Possible mismatch in `truncate_prompt_tokens` value validation for `-1`
#22635 closed
Aug 30, 2025 -
[Bug]: CUDA illegal memory access error on 2x RTX PRO 6000 GPUs with --tensor-parallel-size=2
#23781 closed
Aug 30, 2025 -
[Feature]: Implement vAttention: Virtual Memory Management for KV Cache on NVIDIA GPUs
#17612 closed
Aug 30, 2025 -
[Usage]: Do vllm actor of ray an asynchronous engine and supports continuous batching?
#23989 closed
Aug 30, 2025 -
[Installation]: Dependency conflict installing vLLM 0.6.3 due to outlines → pyairports dependency
#23983 closed
Aug 30, 2025 -
[New Model]: stepfun-ai/GOT-OCR2_0
#9606 closed
Aug 30, 2025 -
[Usage]: how to run a cluster without docker
#12053 closed
Aug 30, 2025 -
[Bug]: Issue with SpecDecode when using data parallel
#17056 closed
Aug 30, 2025 -
[Renderer]: Consolidate MM classes to `MultiModalFeatureSpec`
#23872 closed
Aug 29, 2025 -
[Feature]: RuntimeError: FlashAttention only supports Ampere GPUs or newer.
#8189 closed
Aug 29, 2025 -
[Installation]: ImportError: libtorch_cuda.so: cannot open shared object file: No such file or directory
#23910 closed
Aug 29, 2025 -
[Bug]: rocm build crashes with libcuda.so.1: cannot open shared object file
#19681 closed
Aug 29, 2025 -
Tool calls not triggered properly with vLLM 0.8.5 and Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
#17821 closed
Aug 29, 2025
110 Issues opened by 98 people
-
[Bug]: Crash on --otlp-traces-endpoint=${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT} when CPU mode
#24297 opened
Sep 5, 2025 -
[RFC]: Environment variable to switch backend between CPU and GPU
#24293 opened
Sep 5, 2025 -
[Usage]: Adjusting reasoning efforts for GPT-OSS in direct sampling
#24292 opened
Sep 5, 2025 -
[Bug]: ARM V1 version dependency on OneDNN
#24291 opened
Sep 5, 2025 -
[Bug]: module 'triton.language' has no attribute 'constexpr_function'
#24289 opened
Sep 5, 2025 -
[RFC]: Support Returning Prompt Hidden States
#24288 opened
Sep 5, 2025 -
[RFC]: Support reporting tool output tokens in OutputTokensDetails
#24284 opened
Sep 5, 2025 -
[Bug]: GPT-OSS more robust way to handle messages in commentary channel
#24283 opened
Sep 5, 2025 -
[Bug]: Deployment of Apertus-Instruct-8B failed with error
#24282 opened
Sep 4, 2025 -
[Feature]: decouple attention backend block size from KVCacheManager block size
#24280 opened
Sep 4, 2025 -
[Bug]: Multi-node DeepSeek-V3-0324 errors out with CUDA Illegal Memory Access
#24272 opened
Sep 4, 2025 -
[Bug]: CPU Memory leak in P/D disaggregation (with NIXL?)
#24264 opened
Sep 4, 2025 -
[RFC]: Add a cache hit threshold to enable simple PD-Disaggregation implementations
#24256 opened
Sep 4, 2025 -
[Installation]: Warning on char conversion on aarch64
#24246 opened
Sep 4, 2025 -
[Improvement]: The fixed "language_model" prefix issue in multimodal models
#24244 opened
Sep 4, 2025 -
[Bug]: The output of the default Qwen3-reranker example is inaccurate and does not match (different ordering from) the vLLM example provided in the Qwen3-embedding repository.
#24242 opened
Sep 4, 2025 -
[Bug]: v0.10.1rc1 inference occasionally raises RuntimeError: ACL stream synchronize failed, error code:507035
#24241 opened
Sep 4, 2025 -
[Bug]: Xformers is not available, falling back, even though I have Xformers installed
#24237 opened
Sep 4, 2025 -
[Bug]: RuntimeError: There is no current event loop in thread 'MPClientEngineMonitor'.
#24230 opened
Sep 4, 2025 -
[Feature]: Propose in docs a complete example of `pyproject.toml` to be used directly with `uv sync`
#24218 opened
Sep 4, 2025 -
[Bug]: Detokenizer Overflow error occurred on DeepSeek-R1/V3
#24211 opened
Sep 4, 2025 -
[Bug]: KeyError: 'model.layers.60.mlp.experts.w2_weight'
#24208 opened
Sep 4, 2025 -
[Feature]: Support similar API, such as /health_generate
#24207 opened
Sep 4, 2025 -
[Installation]: fail to install in cuda 118 with v100.
#24206 opened
Sep 4, 2025 -
[Feature][gpt-oss] Responses API test enhancement
#24201 opened
Sep 3, 2025 -
[Feature][gpt-oss] Python Tool Test Enhancement
#24199 opened
Sep 3, 2025 -
[Feature][gpt-oss]: Browser Tool Test Enhancement
#24198 opened
Sep 3, 2025 -
[Feature]: Expose Componentized GPU Memory Metrics
#24196 opened
Sep 3, 2025 -
[New Model]: New model support stepfun-ai/Step-Audio-2-mini
#24192 opened
Sep 3, 2025 -
[Feature]: Model FLOPs Utilization Reporting
#24190 opened
Sep 3, 2025 -
[Usage]: how does v1 engine perform the model parameter hot update?
#24186 opened
Sep 3, 2025 -
[Feature]: Extend QuantFP8 to support per-token-group quantization
#24185 opened
Sep 3, 2025 -
FastAPI Swagger Documentation Name to be Updated to the Model Name
#24182 opened
Sep 3, 2025 -
[Feature]: Expose Engine Sleep & Wake_up Mode as Prometheus Metrics
#24175 opened
Sep 3, 2025 -
[MM processor]: Benchmark mm processor's performance
#24171 opened
Sep 3, 2025 -
[Bug]: Intermittent "Unknown recipient: None" when calling gpt-oss-20b with Responses
#24170 opened
Sep 3, 2025 -
[Performance]: MoE FP8 and Gemm FP8 for CPU
#24169 opened
Sep 3, 2025 -
[Refactor]: Let each modeling file define M-RoPE implementation
#24165 opened
Sep 3, 2025 -
[RFC]: Support fast inplace model update by shared IPC buffer
#24163 opened
Sep 3, 2025 -
[Bug]: Crash when running embedding model on CPU (kv_cache_spec_values empty)
#24156 opened
Sep 3, 2025 -
[Usage]: Add toy example for gpt-oss container tools
#24148 opened
Sep 3, 2025 -
[Bug]: model failure for OpenGVLab/InternVL3-38B-hf
#24147 opened
Sep 3, 2025 -
[CI Failure]: Flaky OOM in Entrypoints Tests
#24144 opened
Sep 3, 2025 -
[Feature]: support Hunyuan-MT-Chimera-7B and HunYuanDenseV1ForCausalLM
#24142 opened
Sep 3, 2025 -
[Bug]: DeepSeek V3.1 with tool_choice=required produces garbled output
#24140 opened
Sep 3, 2025 -
[Bug]: B200 hang on flashinfer fa2 prefill
#24139 opened
Sep 3, 2025 -
[Feature]: Decoupled Vision-Language Deployment
#24136 opened
Sep 3, 2025 -
[Usage]: vllm+ray launches extra jobs on existing cluster, and not just actors
#24135 opened
Sep 3, 2025 -
[Bug]: vLLM stuck when serving GLM-4.5 model
#24133 opened
Sep 3, 2025 -
[Bug]: Should upgrade to PyTorch's MultiOutputMatch
#24125 opened
Sep 2, 2025 -
Remove CUDA 11.8
#24124 opened
Sep 2, 2025 -
[Bug]: Running on AMD Epyc 9654 (CPU Only) always tries to use intel_extension_for_pytorch and crashes.
#24120 opened
Sep 2, 2025 -
[Feature]: Optimize DP/EP Low Batch Size Decode DeepSeek-R1
#24117 opened
Sep 2, 2025 -
[Feature]: Optimize EPLB Rearrange Experts
#24116 opened
Sep 2, 2025 -
[RFC]: Improve MoE triton kernel tuning
#24112 opened
Sep 2, 2025 -
[Bug]: DeepSeek fails with enabled VLLM_USE_FLASHINFER_MOE_FP8=1
#24109 opened
Sep 2, 2025 -
[Feature]: Support extendable configuration files
#24096 opened
Sep 2, 2025 -
[Usage]: What is the benchmark configuration?
#24095 opened
Sep 2, 2025 -
[Bug]: Running Jamba FP8 crashes with cutlass_moe_mm
#24094 opened
Sep 2, 2025 -
[New Model]: OpenCUA
#24090 opened
Sep 2, 2025 -
[Bug]: v1.10.x is slower than 0.8.5.post1 when running qwen3
#24082 opened
Sep 2, 2025 -
[Feature]: Add uccl as kvconnect provide
#24079 opened
Sep 2, 2025 -
[Bug]: how to get purely deterministic output for gpt-oss-120b?
#24067 opened
Sep 2, 2025 -
[Bug]: In `uv` venv, running `python collect_env.py` will return error.
#24063 opened
Sep 2, 2025 -
[Bug]: prevent HuggingFace access when VLLM_USE_MODELSCOPE is enabled for gpt-oss-20b
#24060 opened
Sep 2, 2025 -
[Feature][KV Connector]: Async lookup policy support for MultiConnector
#24059 opened
Sep 1, 2025 -
[Feature]: Improve `vllm bench serve` startup time with random data
#24058 opened
Sep 1, 2025 -
[Feature]: Cutlass v4.2.0 Support
#24050 opened
Sep 1, 2025 -
[Usage]: how to disable thinking for different model
#24039 opened
Sep 1, 2025 -
[Bug]: vLLM >V0.9.2 with AWQ model producing nonsense in longer context chats
#24038 opened
Sep 1, 2025 -
[MTP][PP]: Does PP mode not support MTP? Is this how it is?
#24035 opened
Sep 1, 2025 -
[Bug]: Bug in PrefixCaching for float16 dtype on RTX 8000
#24007 opened
Aug 31, 2025 -
[Doc]: why vllm bench test shows very few successful requests
#24005 opened
Aug 31, 2025 -
[Bug]: GLM-4.5V - AssertionError: 12 is not divisible by 8
#24004 opened
Aug 31, 2025 -
[Bug]: Tool content missing from streaming output when deploying the Qwen3-8B model with vLLM
#23992 opened
Aug 30, 2025 -
Accuracy Drop with OpenGVLab/InternVL3-14B when using vLLM
#23988 opened
Aug 30, 2025 -
[Bug]: lmcache server points to wrong file in entrypoint
#23986 opened
Aug 30, 2025 -
[Bug]: new version critical bug with 100% gpu util but get stuck
#23979 opened
Aug 30, 2025 -
[Feature]: Benchmark for the Sampler
#23977 opened
Aug 30, 2025 -
[gpt-oss]: Ability to set model_identity dynamically which is used in building the system prompt
#23975 opened
Aug 30, 2025 -
[Bug]: Torch Compilation Failure for Gemma3n with LoRA Support - Dynamic Shape Constraints Violated
#23970 opened
Aug 29, 2025 -
[Bug]: v1 xformers + sliding window not working
#23969 opened
Aug 29, 2025 -
[Bug]: vllm bench serve fails with CPU-only head node
#23967 opened
Aug 29, 2025 -
Model Performance Bash!
#23963 opened
Aug 29, 2025 -
[Feature]: Support `Phi4Flash` model in V1
#23957 opened
Aug 29, 2025 -
ValueError: Currently, MiniCPMV only supports versions 2.0, 2.5, 2.6, 4.0. Got version: (4, 5)
#23955 opened
Aug 29, 2025 -
[Bug]: CUDA error when serving MiniCPM-V model
#23954 opened
Aug 29, 2025 -
[Feature]: Any plans to add nvidia/parakeet-tdt-0.6b-v3 to vllm?
#23943 opened
Aug 29, 2025 -
[Bug]: No platform detected, vLLM is running on UnspecifiedPlatform in Docker with Kubernetes, Nvidia L4
#23935 opened
Aug 29, 2025 -
[Bug]: CPU Backend with GPT-OSS Failed
#23934 opened
Aug 29, 2025 -
[Bug]: Illegal memory access with 4 GPUS
#23926 opened
Aug 29, 2025 -
[Bug]: _C.abi3.so: undefined symbol: _Z24silu_and_mul_nvfp4_quantRN2at6TensorES1_S1_S1_
#23925 opened
Aug 29, 2025 -
[Feature]: Allow usage of chat_template_kwargs and add_generation_prompt in /embeddings endpoint
#23923 opened
Aug 29, 2025 -
[Bug]: Unrecognized FP8 dtype: fp8_e5m2
#23922 opened
Aug 29, 2025 -
[Bug]: 5090 Qwen3-30B-A3B-FP8 fails when TP=2!
#23921 opened
Aug 29, 2025 -
[Bug]: 'AttributeError: '_OpNamespace' '_C' object has no attribute 'silu_and_mul_nvfp4_quant'
#23916 opened
Aug 29, 2025 -
[Bug]: gpt-oss-120b has high possibility to generate response as part of reasoning by using vllm v0.10.1
#23905 opened
Aug 29, 2025 -
[Feature]: Kubernetes 1.34 support (Dynamic Resource Allocation DRA)
#23900 opened
Aug 29, 2025
423 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
`torch.compile` caching of config fields should be opt-out by default
#23134 commented on
Sep 4, 2025 • 39 new comments -
[Model] Systematic support for fp32 head, pooling models part
#23810 commented on
Sep 4, 2025 • 38 new comments -
[Feature] Support Decode Context Parallel (DCP) for MLA
#23734 commented on
Sep 5, 2025 • 37 new comments -
[Perf][V1] Fully overlap model execution
#23569 commented on
Sep 4, 2025 • 35 new comments -
Add Dual-Batch Overlap mechanism to VLLM
#23693 commented on
Sep 4, 2025 • 35 new comments -
[Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector
#23624 commented on
Sep 5, 2025 • 24 new comments -
[Core] Shared memory based object store for Multimodal data caching and IPC
#20452 commented on
Sep 4, 2025 • 19 new comments -
[Performance][EPLB] EPLB Execution Optimization
#22179 commented on
Sep 4, 2025 • 17 new comments -
[Bugfix] Merge MM embeddings by index instead of token IDs
#16229 commented on
Sep 2, 2025 • 14 new comments -
[BugFix] pp cannot run successfully under NixlConnector
#22976 commented on
Sep 4, 2025 • 13 new comments -
[P/D] Add a shutdown method to the Connector API
#22699 commented on
Sep 4, 2025 • 13 new comments -
[Feat][EPLB] A novel static EPLB placement strategy for MoE models.
#23745 commented on
Sep 5, 2025 • 13 new comments -
[V1][Spec Decode][Feature] Spec decode with probs
#20459 commented on
Sep 5, 2025 • 12 new comments -
[V1] [P/D] Add Support for KV Load Failure Recovery
#19330 commented on
Sep 4, 2025 • 11 new comments -
EVS Support (Video tokens pruning)
#22980 commented on
Sep 2, 2025 • 10 new comments -
[Bugfix] Fix mamba2 prefill chunking
#23279 commented on
Sep 1, 2025 • 10 new comments -
[Core] Use sha256 bytes instead of BlockHash to reduce GC overhead
#23673 commented on
Sep 5, 2025 • 8 new comments -
[Core] Encoder separation for Encode-Prefill-Decode Disaggregation
#21740 commented on
Sep 4, 2025 • 8 new comments -
[Frontend] User-provided uuids for medias in chat. (RFC #22044)
#23449 commented on
Sep 5, 2025 • 8 new comments -
[torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends
#19767 commented on
Sep 4, 2025 • 8 new comments -
[gpt-oss] Harmony changes with container tool support
#23386 commented on
Sep 5, 2025 • 7 new comments -
[ROCm][Bugfix] Fix Aiter RMSNorm
#23412 commented on
Sep 2, 2025 • 7 new comments -
[PERF] Allreduce Fusion tuning and compile_ranges introduction
#22086 commented on
Sep 4, 2025 • 7 new comments -
[Kernel][B200] `mxfp4` fused cutlass moe
#23696 commented on
Sep 4, 2025 • 6 new comments -
Generate _ModelInfo properties file when loading to improve loading speed
#23558 commented on
Sep 5, 2025 • 6 new comments -
[P/D][Nixl] Introduce `KVTransferMetrics` and aggregation strategy
#22188 commented on
Sep 3, 2025 • 6 new comments -
[Sampler] Support returning all prompt logprobs
#23868 commented on
Sep 3, 2025 • 6 new comments -
[v1] Add Whisper model support (encoder-decoder)
#21088 commented on
Sep 5, 2025 • 6 new comments -
[V1] implement tree sampler for draft token acceptance
#22752 commented on
Sep 5, 2025 • 6 new comments -
[Build] Split Kernels into Separate `vllm-kernels` package
#23866 commented on
Sep 3, 2025 • 6 new comments -
fix(v1/kv_cache): resolve async KV transfer bug in cascade attention
#23485 commented on
Sep 2, 2025 • 5 new comments -
[Frontend] Pass API server count to each process
#23717 commented on
Sep 1, 2025 • 5 new comments -
v1: Offloading connector
#22595 commented on
Sep 4, 2025 • 5 new comments -
[Model] New model support for Motif-1-Tiny
#23414 commented on
Sep 3, 2025 • 5 new comments -
[Perf] Warmup FlashInfer attention during startup
#23439 commented on
Sep 4, 2025 • 4 new comments -
[CI/Build] Add bc-linter to vLLM CI
#21234 commented on
Sep 5, 2025 • 4 new comments -
[RFC] allow cancelation after shutdown in blocking collective_rpc
#23390 commented on
Sep 4, 2025 • 3 new comments -
[Bugfix] Make unspecified --host bind to dual stack
#22823 commented on
Aug 30, 2025 • 3 new comments -
[Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4
#21166 commented on
Sep 5, 2025 • 3 new comments -
Migrate Qwen2 inputs to TensorSchema
#23475 commented on
Sep 4, 2025 • 3 new comments -
[XPU] Fix OOM when manually specifying ZE_AFFINITY_MASK with Ray distributed executor on XPU
#22413 commented on
Sep 2, 2025 • 2 new comments -
[Bugfix] Mistral tool parser streaming update
#19425 commented on
Sep 2, 2025 • 2 new comments -
Enable modelopt gemma3 nvfp4/fp8, make workflow more robust
#22771 commented on
Sep 3, 2025 • 2 new comments -
[Chore] Cleanup guided namespace, move to structured outputs config
#22772 commented on
Sep 4, 2025 • 2 new comments -
[Model] Activated LoRA
#19710 commented on
Sep 4, 2025 • 2 new comments -
Allows initialize TorchAOConfig object through quantization_config_file
#23014 commented on
Sep 2, 2025 • 2 new comments -
[Frontend] OpenAI Responses API supports Tool/Function calling
#20874 commented on
Sep 4, 2025 • 2 new comments -
[Feature] limit thinking tokens (hard limit)
#20859 commented on
Sep 4, 2025 • 2 new comments -
DeepSeek fix: awq x mergedreplicatedlinear
#23764 commented on
Aug 30, 2025 • 2 new comments -
Support for NemotronH Nano VLM
#23644 commented on
Sep 5, 2025 • 2 new comments -
[Core] Nanoflow-style Computation-Communication Overlap
#23592 commented on
Sep 3, 2025 • 2 new comments -
Fp8 paged attention update
#22222 commented on
Sep 4, 2025 • 1 new comment -
[V1] address post issues related to #20059 (part 1)
#23046 commented on
Sep 4, 2025 • 1 new comment -
[Feature][EPLB] Add EPLB support for hunyuan_v1
#23078 commented on
Sep 5, 2025 • 1 new comment -
[Bugfix][V1] Raise ValueError when draft max model len is too small
#22935 commented on
Sep 3, 2025 • 1 new comment -
[V1] Logits processor docs
#22919 commented on
Sep 4, 2025 • 1 new comment -
[KV Connector] More async support for `get_num_new_matched_tokens`
#23620 commented on
Sep 5, 2025 • 1 new comment -
[Bug]: v0.8.2, enable calculate_kv_scales, caught exception
#15973 commented on
Sep 3, 2025 • 0 new comments -
fix: return {} for tool arguments when no argument is needed, so that…
#21365 commented on
Sep 1, 2025 • 0 new comments -
[Core][Feat] Add max-waiting-queue-length parameter to reject requests when waiting queue is full
#21352 commented on
Sep 4, 2025 • 0 new comments -
[Usage]:
#18679 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: Qwen3 uses vllm automatic batch inference to abnormal output
#18252 commented on
Sep 5, 2025 • 0 new comments -
[Feature]: Auto tokenizer mode should detect mistral tokenizer
#18090 commented on
Sep 5, 2025 • 0 new comments -
[RFC]: Enabling Arm Neoverse CI Runners
#17720 commented on
Sep 5, 2025 • 0 new comments -
[Feature][Kernel]FusedMoE LoRA
#21229 commented on
Sep 1, 2025 • 0 new comments -
[New Model]: nemotron Super GGUF
#16944 commented on
Sep 5, 2025 • 0 new comments -
[V1][PP] Pipeline chunked prefill
#13638 commented on
Sep 5, 2025 • 0 new comments -
[Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration
#21078 commented on
Sep 4, 2025 • 0 new comments -
[Performance]: Plan to support DP attention for Deepseek models
#12871 commented on
Sep 5, 2025 • 0 new comments -
[Feature]: Compute and log the serving FLOPs
#3490 commented on
Sep 5, 2025 • 0 new comments -
Support mnnvl all2allv from Flashinfer
#21003 commented on
Sep 4, 2025 • 0 new comments -
[RFC]: Enabling Suffix Decoding, LSTM Speculator, Sequence Parallelism from Arctic Inference
#18037 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: qwq32b-128k accuracy loss compared with sglang, with proprietary business benchmark
#19245 commented on
Sep 5, 2025 • 0 new comments -
[Misc]add replicaid to ray metrics
#22159 commented on
Sep 4, 2025 • 0 new comments -
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar
#22112 commented on
Sep 5, 2025 • 0 new comments -
[Performance]: EAGLE-3: Discrepancy Between Throughput and Acceptance Length Improvements
#19226 commented on
Sep 5, 2025 • 0 new comments -
[Speculators][Speculative Decoding] Add Eagle3 Support For HunYuan Model
#22080 commented on
Sep 1, 2025 • 0 new comments -
[Core] Enable HF processing on GPU
#22070 commented on
Aug 30, 2025 • 0 new comments -
[Usage]: Is the service interface exposed by PD separation compatible with the service API of OpenAPI?
#19214 commented on
Sep 5, 2025 • 0 new comments -
[Bugfix]: Fix Possible Output Corruption in Cascade Attention Caused by Non-Contiguous LSE Tensor
#22003 commented on
Sep 5, 2025 • 0 new comments -
[Bugfix] Fix hermes tool parser handling of non-string argument types
#22002 commented on
Sep 2, 2025 • 0 new comments -
[Structured Output][Refactor] Move `apply_grammar_bitmask()` method from `ModelRunner` to structured output utils
#21999 commented on
Sep 4, 2025 • 0 new comments -
Add support for model signature verification
#21957 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: How to improve the gpu usage with Qwen-VL
#19208 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: Granite-Speech-3.3-2b hangs forever, never produces output
#19198 commented on
Sep 5, 2025 • 0 new comments -
Limit concurrent long partial prefills via max_long_partial_prefills
#21651 commented on
Sep 2, 2025 • 0 new comments -
[Bugfix] Handle None case for dt_bias and D in selective_state_update
#21532 commented on
Aug 30, 2025 • 0 new comments -
[Model] Mamba2 varlen and metadata refactor
#21467 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Quantization method specified in the model config (fp8) does not match the quantization method specified in the `quantization` argument (gguf).
#19050 commented on
Sep 5, 2025 • 0 new comments -
v1/offloading: Add worker-side CPU support
#21448 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: TorchDispatchMode does not work for vllm
#19044 commented on
Sep 5, 2025 • 0 new comments -
[ROCm] Get rid of RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES
#15246 commented on
Sep 4, 2025 • 0 new comments -
Enable Outlines with JSON Sub-Schema References
#15627 commented on
Sep 4, 2025 • 0 new comments -
[Bugfix]: fix JSON decode error when tool call argument is empty
#19428 commented on
Aug 31, 2025 • 0 new comments -
fix: can not install torch+cpu for no index url
#15822 commented on
Aug 30, 2025 • 0 new comments -
Fix #15483: Add error handling for model-dependent endpoints during sleep mode
#16536 commented on
Sep 3, 2025 • 0 new comments -
[FeatureRequest] Support Cascade Attention for Sliding Window Attention #15738
#16550 commented on
Aug 31, 2025 • 0 new comments -
Use PyTorch util for traced files instead of monkey-patching inline_call()
#19235 commented on
Sep 5, 2025 • 0 new comments -
[WIP] download config json file from modelscope
#19212 commented on
Sep 4, 2025 • 0 new comments -
Fixes crashes in vLLM v1 engine when using LMCache KV
#19194 commented on
Sep 4, 2025 • 0 new comments -
[MTIA] Add mtia as a literal in device config.
#19026 commented on
Sep 3, 2025 • 0 new comments -
[Core] Remove int32->int64->int32 overhead in FlashInfer sampling
#18920 commented on
Sep 3, 2025 • 0 new comments -
[BugFix] v0 cache evictor: priority_queue and free_table desynchronization
#18882 commented on
Sep 2, 2025 • 0 new comments -
[bugfix][v1] fixed the missing prompt value in RequestOutputs
#18880 commented on
Sep 3, 2025 • 0 new comments -
[Hardware][Intel-Gaudi] t.compile optimizations
#18137 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: reasoning_tokens in Chat Completion Response usage
#18067 commented on
Sep 5, 2025 • 0 new comments -
[WIP]: DRY sampling
#16695 commented on
Sep 1, 2025 • 0 new comments -
[Bugfix] fix: close issue #16554 to make it real async
#16557 commented on
Aug 30, 2025 • 0 new comments -
[Bugfix] Move current_platform import to avoid python import cache.
#16601 commented on
Sep 1, 2025 • 0 new comments -
fix(frontend): always include usage, when configured to do so
#20983 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: How to set reasoning_effort for gpt-oss model to "high" in vllm
#22809 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: update the kv_connector from v0 to v1 in example
#22093 commented on
Sep 5, 2025 • 0 new comments -
[Feature] Add command tool parser for Command-A model
#20800 commented on
Sep 3, 2025 • 0 new comments -
feat: Add streaming support for Mistral v11 tool format
#20503 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: CI not running all tests/compile tests
#23865 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: RuntimeError: NCCL error: unhandled cuda error
#20226 commented on
Sep 5, 2025 • 0 new comments -
[V1] feat: add engine v1 tracing
#20372 commented on
Sep 2, 2025 • 0 new comments -
[Frontend] Feature: support transcription API with language detection
#13465 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Implement `check_health` for V1
#20164 commented on
Sep 3, 2025 • 0 new comments -
v1: Introduce LRU-based CPU offloading management
#20075 commented on
Sep 4, 2025 • 0 new comments -
[Feature][Kernel] Blocked FP8 CUTLASS MoE for Hopper
#19983 commented on
Aug 30, 2025 • 0 new comments -
v1: Introduce an offloading component
#19848 commented on
Sep 4, 2025 • 0 new comments -
[CI] Make UT cases in test_comm_ops.py compatible with more devices
#14229 commented on
Sep 3, 2025 • 0 new comments -
[Model] add colqwen2_vl code & inference
#14291 commented on
Sep 2, 2025 • 0 new comments -
[V1] Logit processors for rejection sampler
#19482 commented on
Sep 2, 2025 • 0 new comments -
[Frontend] Skip `stop` in reasoning content
#14550 commented on
Sep 4, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Aug 29, 2025 • 0 new comments -
[CI] Optimize entrypoints API server tests
#23896 commented on
Sep 2, 2025 • 0 new comments -
Adding int4 and int8 models for CPU benchmarking
#23709 commented on
Sep 5, 2025 • 0 new comments -
[XPU][Feature] sleep mode support for XPU platform
#23704 commented on
Sep 5, 2025 • 0 new comments -
[V1] Support MP Distributed Executor for multi node distributed inference
#23691 commented on
Aug 30, 2025 • 0 new comments -
[Spec-decode] fix and refactor cudagraphs for spec-decode
#23679 commented on
Sep 4, 2025 • 0 new comments -
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel
#23647 commented on
Sep 5, 2025 • 0 new comments -
[Model] Add tuned fused_moe configs for H200_NVL based on H200 config
#23642 commented on
Sep 1, 2025 • 0 new comments -
Fix regex patterns in DeepSeekV31ToolParser to use non-greedy matching
#23618 commented on
Sep 4, 2025 • 0 new comments -
[Spec Decode][Benchmark] Add Blitzedit dataset
#23605 commented on
Sep 4, 2025 • 0 new comments -
Synchronize TYPE_CHECKING section with environment_variables dictionary in envs.py
#23602 commented on
Aug 30, 2025 • 0 new comments -
[Speculators][Speculative Decoding] Support gpt-oss eagle3 on blackwell
#23596 commented on
Sep 3, 2025 • 0 new comments -
[Model] Add lite-whisper model support in vLLM
#23566 commented on
Sep 2, 2025 • 0 new comments -
[Spec Decode][Benchmark] Add Spec Bench Dataset for benchmarking
#23563 commented on
Sep 3, 2025 • 0 new comments -
Model modification for EPLB
#23553 commented on
Aug 29, 2025 • 0 new comments -
[ISSUE 23474] Remove lora additional vocabulary
#23540 commented on
Sep 5, 2025 • 0 new comments -
Fix gpt-oss tool call
#23518 commented on
Sep 5, 2025 • 0 new comments -
Redesign Persistent Batch in vLLM
#23514 commented on
Sep 5, 2025 • 0 new comments -
feat: Add Grafana and Perces monitoring dashboards for vLLM
#23498 commented on
Sep 5, 2025 • 0 new comments -
[Misc] rename determine_available_memory to determine_kv_cache_availa…
#23495 commented on
Aug 31, 2025 • 0 new comments -
[Misc] refactor usage report by reuse report_usage_stats function
#23493 commented on
Aug 30, 2025 • 0 new comments -
[Bugfix] fix is_usage_stats_enabled when disabling it
#23489 commented on
Aug 30, 2025 • 0 new comments -
[Core] Support sleep mode for cuda graph
#23482 commented on
Sep 3, 2025 • 0 new comments -
[bug fix] disable memory pool to release unused `bf16` weights
#23875 commented on
Sep 4, 2025 • 0 new comments -
[Benchmark] Add ability to round robin over a set of urls for benchmarking
#23870 commented on
Aug 31, 2025 • 0 new comments -
[benchmark] add peak throughput metrics and plot
#23867 commented on
Aug 30, 2025 • 0 new comments -
[gpt-oss] Validate gpt-oss python tool during initialization
#23856 commented on
Sep 5, 2025 • 0 new comments -
Check bc linter
#23855 commented on
Sep 5, 2025 • 0 new comments -
Update v1/entrypoints test_struct_output_generate tests to use lighter models
#23850 commented on
Sep 4, 2025 • 0 new comments -
[DP][ray] Support different VLLM_RAY_DP_PACK_STRATEGY
#23849 commented on
Sep 4, 2025 • 0 new comments -
[Misc] Fix an error when enabling allreduce fusion pass
#23848 commented on
Sep 3, 2025 • 0 new comments -
[Log] Use a relative path in debug-level logs to distinguish files with identical names
#23846 commented on
Sep 3, 2025 • 0 new comments -
[Bugfix] Update Run:AI Model Streamer Loading Integration
#23845 commented on
Sep 4, 2025 • 0 new comments -
[Kernels][Nvidia] AOT compilation workflow [1/n]
#23844 commented on
Sep 4, 2025 • 0 new comments -
[Bugfix] support loading models from S3
#23842 commented on
Sep 5, 2025 • 0 new comments -
[xpu] upgrade ipex/python3.12 for xpu
#23830 commented on
Sep 5, 2025 • 0 new comments -
[Feature][Quantization] auto_round support for mixed bits quantization
#23812 commented on
Sep 2, 2025 • 0 new comments -
valley-eagle-7b (not finished yet)
#23799 commented on
Sep 1, 2025 • 0 new comments -
[CI] Fail subprocess tests with root-cause error
#23795 commented on
Sep 4, 2025 • 0 new comments -
[benchmark] add random and common prefix usage
#23788 commented on
Sep 3, 2025 • 0 new comments -
[do not merge] this is for testing ci-infra changes
#23771 commented on
Aug 29, 2025 • 0 new comments -
[Bugfix] process cannot stop when the NIXL port is bound
#23756 commented on
Sep 1, 2025 • 0 new comments -
[Misc] Use CpuGpuBuffer for FlashInfer metadata builder
#23731 commented on
Aug 30, 2025 • 0 new comments -
[Misc] Moved override for allreduce fusion thresholds from env var to config
#23722 commented on
Sep 2, 2025 • 0 new comments -
Fix several unnecessary CUDA sync points
#22875 commented on
Aug 29, 2025 • 0 new comments -
[FIXBUG] Add stop and stop_token_ids to BeamSearchParams
#22869 commented on
Sep 4, 2025 • 0 new comments -
[P/D][NIXL]NixlConnector Reliability Enhancement
#22866 commented on
Sep 3, 2025 • 0 new comments -
[V0 Deprecation] Remove V0 xFormers attention backend
#22777 commented on
Sep 2, 2025 • 0 new comments -
[Bugfix] V1 engine positional model argument handling
#22764 commented on
Aug 30, 2025 • 0 new comments -
[Frontend] Add Sentry SDK for error reporting
#22753 commented on
Sep 4, 2025 • 0 new comments -
[Feat] Support elastic KV cache memory pool for dynamic GPU memory sharing
#22706 commented on
Sep 1, 2025 • 0 new comments -
[CI] run tests/compile/test_config.py
#22682 commented on
Sep 1, 2025 • 0 new comments -
Enable Intel Gaudi accelerator for vLLM Benchmark suite
#22680 commented on
Sep 5, 2025 • 0 new comments -
Support Anthropic API /v1/messages Endpoint
#22627 commented on
Sep 5, 2025 • 0 new comments -
Vectorize RMSNorm CUDA kernel
#22602 commented on
Aug 30, 2025 • 0 new comments -
consistency between the test and final Docker image
#22490 commented on
Sep 5, 2025 • 0 new comments -
[Refactor] Refactor FP8 & INT8 Quant Folder inside `w8a8`
#22474 commented on
Aug 29, 2025 • 0 new comments -
[V1][Metrics][Plugin] Add plugin support for custom `StatLoggerBase` implementations
#22456 commented on
Sep 4, 2025 • 0 new comments -
[CI/Build] Fix ppc64le CPU build and tests
#22443 commented on
Sep 3, 2025 • 0 new comments -
[Bugfix] Simulate mxfp4 quark model execution on cdna4 until kernels are integrated
#22355 commented on
Sep 4, 2025 • 0 new comments -
`NixlConnector` Support HTTP/S metadata exchange instead of zmq
#22274 commented on
Sep 4, 2025 • 0 new comments -
[TPU][Misc] Fix TPU.device_name
#22254 commented on
Aug 29, 2025 • 0 new comments -
[Perf][Feat][Core] Workload-Aware KVCache Eviction Policy
#22236 commented on
Sep 1, 2025 • 0 new comments -
[Bugfix] Disable the statslogger if the api_server_count is greater than 1
#22227 commented on
Sep 5, 2025 • 0 new comments -
feat: Add native support for XLM-RoBERTa embedding and BAAI/bge-reranker-v2-m3
#22216 commented on
Sep 3, 2025 • 0 new comments -
[Misc] pop virtual_engine in from_broadcasted_tensor_dict
#23476 commented on
Aug 30, 2025 • 0 new comments -
[#20711] Use QuantFp8 CustomOp-abstraction for MoE layers
#23463 commented on
Aug 30, 2025 • 0 new comments -
Add Predicted Outputs API
#23450 commented on
Sep 4, 2025 • 0 new comments -
[gpt-oss][Bugfix] Fix gpt-oss toolcall
#23440 commented on
Sep 5, 2025 • 0 new comments -
[Frontend] Add unit tests for OpenAI Responses streaming IDs (item_id/content_index + delta path) #23218
#23382 commented on
Sep 4, 2025 • 0 new comments -
[EPLB] Add Asynchronous Expert Rebalancing
#23343 commented on
Sep 4, 2025 • 0 new comments -
[Refactor] Small cleanup for quantized FusedMoE
#23339 commented on
Sep 3, 2025 • 0 new comments -
[ROCm][FEAT] Integrate AITER CustomAllreduce in cuda communicator.
#23336 commented on
Aug 29, 2025 • 0 new comments -
[Bugfix] remove duplicate tokens streamed in required tool choice streaming
#23312 commented on
Sep 2, 2025 • 0 new comments -
[V0 Deprecation] Drop V0 encoder-decoder runner
#23300 commented on
Aug 29, 2025 • 0 new comments -
[Perf] Use upstream CUTLASS for SM90 Block FP8 kernel
#23280 commented on
Aug 30, 2025 • 0 new comments -
fix: response_format for completion
#23212 commented on
Sep 2, 2025 • 0 new comments -
[Misc][qwen2_5_vl] Enable `supports_torch_compile` on generic nn.Module
#23207 commented on
Aug 29, 2025 • 0 new comments -
[Misc][Feature] confidence based early stopping
#23201 commented on
Sep 5, 2025 • 0 new comments -
ON HOLD - [Core] Lazy/Delayed CUDA graph
#23184 commented on
Sep 2, 2025 • 0 new comments -
[Bugfix] Fix gemma3 with transformers backend
#23178 commented on
Sep 2, 2025 • 0 new comments -
[V1] check request priority if scheduler policy is fcfs
#23043 commented on
Aug 31, 2025 • 0 new comments -
[Core] Support weight_loader_v2 for `UnquantizedLinearMethod`
#23036 commented on
Sep 3, 2025 • 0 new comments -
[Core] Allow disabling TP sharding for parallel Linear layer
#23024 commented on
Sep 4, 2025 • 0 new comments -
Optimize MoE Token Dispatch for Tensor Parallel Configurations
#22993 commented on
Sep 4, 2025 • 0 new comments -
feat(multimodal): Add support for SigLIP pooling model
#22921 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: EngineCore died unexpectedly when running inference with llama (generate)
#23517 commented on
Sep 2, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: Memory Leak Issue in Load Testing Scenario
#22736 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: During testing of the LoRA model, the "enable-prefix-caching" feature did not take effect
#23301 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: During model execution, the error "TimeoutError: RPC call to execute_model timed out." is raised, causing the model to exit.
#19197 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: When "tool_choice": "auto" is set, there is a reasoning_content process in the output, but this process is missing when "tool_choice": "required" is used.
#19846 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: minicpm-4.5v
#23784 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Allow oot custom compiler extension via CompilerInterface and reuse backend-agnostic FX passes
#23612 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: when Nsight captures NVTX with PP>1, vllmWorkerProcess will unexpectedly terminate
#13482 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Pin vLLM process to the right NUMA Region
#13855 commented on
Sep 2, 2025 • 0 new comments -
[Installation]: RuntimeError: Unknown runtime environment
#15450 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: Deciding max-num-seqs and max-num-batched-tokens for desired throughput
#16886 commented on
Sep 2, 2025 • 0 new comments -
[Performance]: UVA vs UVM for CPU offloading on v0.8.4+
#17062 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: GGUF support for GLM4
#17069 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: failed to run latest offline PD example code
#17624 commented on
Sep 2, 2025 • 0 new comments -
[RFC]: Model Parallelism with Single Worker using SPMD
#18009 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: errors when building from source
#18691 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: AsyncLLM when DP > 1, device allocation bug
#18942 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: gpu-memory-utilization does not work
#19023 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Not Found
#19047 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: System Memory OOM after upgrading to v0.9.0.1
#19048 commented on
Sep 2, 2025 • 0 new comments -
[RFC]: Drop CUDA 11.8 Support
#19061 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: how can I use the vLLM Docker image on a platform with an arm64 CPU and an Nvidia a600 GPU
#19065 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Metal support
#19073 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: CUDA error: unknown error when running vllm serve on WSL2 Ubuntu22.04
#19077 commented on
Sep 2, 2025 • 0 new comments -
[Usage]: intent is added for guided generation
#19107 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: 0.8.4 serve QwQ-32B-AWQ failed
#16811 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Not allowed: Wheel dist/vllm-0.9.1.dev2+ge0cbad4e3-cp38-abi3-linux_x86_64.whl is larger (824.73 MB) than the limit (400 MB)
#18786 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Model misbehaves with --tensor-parallel-size 2 on 2x Nvidia L4
#19022 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: vllm profiling result contains invalid utf-8 code
#19043 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: Implement Method to Obtain Token-Level Log Probabilities from Models with Different Weights for KL Divergence Calculation
#19127 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Error occurred while performing model inference using 0.8 H20s from the virtualized computing pool.
#19137 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: Distributed Inference Over CPU
#19142 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: asyncio_mode and not multiprocess_mode EngineCore imple
#19146 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: max-model-len + max-num-seqs is not reducing vram usage
#19148 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: KeyError: 'language_model.layers.0.self_attn.qkv_proj.weight'
#19149 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: OutofMemoryError with LMCache example and cpu_offload_gb enabled
#19154 commented on
Sep 3, 2025 • 0 new comments -
[Renderer]: Move `Processor` out of `AsyncLLM`
#23869 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: vllm, EngineCore encountered a fatal error TimeoutError
#19668 commented on
Sep 3, 2025 • 0 new comments -
[RFC]: Dynamic Expert Load Balance with Zero-like-overhead
#22246 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Support loading vision layers in VLM LoRA adapters
#16364 commented on
Sep 3, 2025 • 0 new comments -
[RFC]: Context Parallelism && Sequence Parallelism
#22693 commented on
Sep 3, 2025 • 0 new comments -
[Installation]: no version of pip install vllm works - Failed to initialize NumPy: No Module named 'numpy'
#11037 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: vllm.third_party.pynvml.NVMLError_InvalidArgument: Invalid Argument
#19071 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Optimize RoPE
#22293 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: Setting up vLLM with a multi-host for example v6e-4x4 TPU topology fails
#23860 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: There are no CI tests for chunked prefill for pooling models.
#23436 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: Add LORA Model Name in Open Telemetry
#23767 commented on
Sep 2, 2025 • 0 new comments -
[Feature]: qwen2.5 omni doesn't support bnb quantization.
#23240 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: vLLM aarch64 support (GH200)
#23350 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: openai/gpt-oss-20b breaks on data parallel
#23244 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: Unknown quantization method: mxfp4
#22276 commented on
Sep 2, 2025 • 0 new comments -
[Bug]: VLLM_ALL2ALL_BACKEND=naive hangs/crashes on multi nodes when serving DeepSeekV3
#23448 commented on
Sep 2, 2025 • 0 new comments -
[New Model]: Google SigLip 2
#13663 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: LoRA support for Mistral 3.1
#18574 commented on
Aug 31, 2025 • 0 new comments -
[Feature]: Context Parallelism
#7519 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: Unexpected CUDA OOM with larger TP size
#22702 commented on
Aug 30, 2025 • 0 new comments -
[Feature]: add DoRA support
#10849 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: LLVM ERROR: Failed to compute parent layout for slice layout. when using fp16
#17152 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: Can't serve Qwen3-AWQ
#18156 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: vLLM v0.8.5.post1 hanging with Llama 3.3 70b
#18260 commented on
Aug 30, 2025 • 0 new comments -
[Usage]: Control whether Deepseek R1 thinks or not
#18988 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: KV cache specs are not equal across ranks
#23883 commented on
Aug 30, 2025 • 0 new comments -
[Bug]: vLLM server crashes with CUDA illegal memory access for specific sequence lengths on B200
#23724 commented on
Aug 29, 2025 • 0 new comments -
[Doc]: clarify support for cpu-based image
#23681 commented on
Aug 29, 2025 • 0 new comments -
[Usage]: how to use built-in python tool of gpt-oss-20b after starting vllm serve --tool-server demo?
#23108 commented on
Aug 29, 2025 • 0 new comments -
[RFC]: Remove LoRA bias
#23892 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: vLLM (AsyncLLMEngine, LLM) engine initialization fails when using runai_streamer
#22843 commented on
Aug 29, 2025 • 0 new comments -
[CI]: Entrypoints tests cleanup
#23667 commented on
Aug 29, 2025 • 0 new comments -
[Feature]: Logging details about incorrect requests
#19739 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
#18455 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: Bad result with parallel generation.
#20561 commented on
Aug 29, 2025 • 0 new comments -
[MM Encoder] Add Encoder DP to Kimi-VL
#23878 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: InternVL3 FP8 missing module/parameter on model load
#19424 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: Sampling discrepancy between ollama and vLLM for gemma-3-27b-it et al.
#20060 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: GPU memory allocation problem
#23163 commented on
Aug 29, 2025 • 0 new comments -
[MM Encoder] ViT attention performance and consolidation
#23880 commented on
Aug 29, 2025 • 0 new comments -
[MM Encoder] Add Encoder DP to InternVL
#23876 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: assortment of warnings / errors coming out of vllm basic python inference script
#18634 commented on
Aug 29, 2025 • 0 new comments -
[Bug]: FlashMLA V1 with FP8 KV cache not yet supported!
#18887 commented on
Sep 1, 2025 • 0 new comments -
[Feature]: Individual GuidedDecodingParams for each prompt in prompts.
#19007 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: Phi-4-mini-instruct / Phi-4-multimodal-instruct produces gibberish when input <4096 tokens and output is >4096 tokens
#19489 commented on
Sep 1, 2025 • 0 new comments -
[Feature][Wide EP]: Add NIXL, DeepEP, DeepGEMM, and PPLX to Docker Image
#23344 commented on
Sep 1, 2025 • 0 new comments -
[RFC]: Deprecating vLLM V0
#18571 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: calculate_kv_scales leads to dynamo compilation issue; enforce_eager=True leads to another issue
#21640 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'sinks'
#22383 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: apply_temperature may cause nan in probs
#22180 commented on
Sep 1, 2025 • 0 new comments -
[Feature][Kernel][B200]: FI MoE LL does not use `allgatherv` and `reduce-scatterv` for dispatch and combine
#22916 commented on
Sep 1, 2025 • 0 new comments -
[Performance]: Low GPU Utilization (70%) for ViT+Qwen2 VLM Model.
#18392 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: something wrong with hermes tool parser
#18791 commented on
Sep 1, 2025 • 0 new comments -
[Usage]: Is the 0.9.0 container restricted to running only on CUDA 12.8 and above?
#18813 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: 0.8.x with vllm V1 fails on loading Qwen-vl-2.5 with UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 2218: ordinal not in range(128)
#18823 commented on
Sep 1, 2025 • 0 new comments -
[Performance]: The Unstable Performance Difference between CUDA and PyTorch
#18884 commented on
Sep 1, 2025 • 0 new comments -
[Usage]: How to Retrieve Model Parameters (e.g., Supported Embedding Dimensions) for an Embedding Model (Online Service)
#18984 commented on
Sep 1, 2025 • 0 new comments -
[Feature]: support Microsoft Tutel as inference backend for Moe models
#19013 commented on
Sep 1, 2025 • 0 new comments -
[Attention]: Pad for cudagraphs before constructing attention metadata
#23789 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: RuntimeError: operator _C::marlin_qqq_gemm does not exist
#23662 commented on
Sep 1, 2025 • 0 new comments -
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist()
#19483 commented on
Sep 1, 2025 • 0 new comments -
[MM Encoder] Investigate heuristic for enabling encoder DP by default
#23879 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: vllm/vllm-openai:gptoss AssertionError: Sinks are only supported in FlashAttention 3 (4090 48gb)
#22331 commented on
Aug 31, 2025 • 0 new comments -
[Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1
#19131 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: Unable to use Qwen/Qwen2.5-Omni-7B with --mm-processor-kwargs
#20995 commented on
Aug 31, 2025 • 0 new comments -
[RFC]: Optimize Input Media Processing in vLLM
#22044 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: NIXL disaggregation example does not work
#22532 commented on
Aug 31, 2025 • 0 new comments -
[Bug]: After wake_up of a sleeping model in the OpenAI API server, the model generates gibberish output
#20627 commented on
Aug 31, 2025 • 0 new comments -
[Usage]: does v1 support sequence parallelism now?
#19256 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: VLLM v0.10.0 failed to deploy the qwen3-30b-moe model. The error is AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'.
#22225 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: Add OpenTelemetry API to v1
#17794 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: JSON decode error when tool call argument is empty
#19419 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: gpt-oss Intermittent 500 Internal Server Error with empty response body when using strict JSON “function router” system prompt
#23837 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: [gpt oss 20b] [tool_call] Unexpected token 12606 while expecting start token 200006
#22519 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: illegal memory access when there are multiple concurrent requests
#23814 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: When I use the Qwen3-32B with tool_choice='required' parameter, the tool calling gets stuck in a loop
#21026 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: AttributeError: Model GptOssForCausalLM does not support BitsAndBytes quantization yet. No 'packed_modules_mapping' found. Support BitsAndBytes quantization for GptOssForCausalLM?
#23632 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: tensor parallelism inference doesn't run on Nvidia Blackwell 5070ti
#21239 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: MoE models fail at startup: AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'
#18967 commented on
Sep 4, 2025 • 0 new comments -
[New Model]: OpenAI OSS model support
#22265 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: The quantization method mxfp4 is not supported for the current GPU SM75
#22288 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Performance Analysis: Significant Latency on First Inference due to Engine Warm-up (torch.compile & Graph Capture)
#23787 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: vLLM server hangs and timeouts after initial requests
#17972 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: When enabling LoRA, greedy search got different answers.
#7977 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Why does startup always hang when launching qwen2.5-vl series models?
#13651 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: 100% CPU usage when idle
#16660 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: When using tool calls, the "tool_calls" list in the response is empty and the value appears in "content" instead, which does not conform to the OpenAI standard.
#17161 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Serve Qwen3 MOE GPTQ models raise `torch._dynamo.exc.Unsupported` error
#18044 commented on
Sep 4, 2025 • 0 new comments -
[Doc]: add `--build-arg RUN_WHEEL_CHECK=false` to the "building-vllm-s-docker-image-from-source" section to avoid `check-wheel-size.py`-errors when building vllm for blackwell
#18309 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Internal Server Error: python3 openai_chat_completion_client_for_multimodal.py -c audio when using Qwen/Qwen2-Audio-7B-Instruct
#19083 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: Why is CPU usage fully utilized while GPU power stays low when deploying a model with vLLM in a multi-GPU environment?
#19133 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: error `is not a multimodal model` when serving `Qwen/Qwen3-8B` connected to `gr.load_chat(...)`
#19144 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: support soft thinking
#19180 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: How to quantize a custom model
#19190 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: File Access Error Preventing vLLM API Server from Starting
#19192 commented on
Sep 4, 2025 • 0 new comments -
[Performance]: Running LoRA inference for the llama2-0.2B model with vllm v0.9.0 on H20, PyTorch Dynamo traverses on the CPU side during the model forward pass
#19261 commented on
Sep 5, 2025 • 0 new comments -
[Performance]: The same latency of Qwen3-8B and Qwen3-8b-Fp8B
#19264 commented on
Sep 5, 2025 • 0 new comments -
[Feature]: Add background heartbeat detection for non-streaming output
#19268 commented on
Sep 5, 2025 • 0 new comments -
[New Model]: jinaai/jina-colbert-v2
#19278 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: continue_final_message + echo + prefix-caching + V0 crash the server
#19285 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: openai.LengthFinishReasonError from client.beta.chat.completions.parse
#19293 commented on
Sep 5, 2025 • 0 new comments -
[Feature]: Add token-level progress bar for `LLM.beam_search` inference
#19300 commented on
Sep 5, 2025 • 0 new comments -
[Bug]: openai_harmony.HarmonyError: unexpected tokens remaining in message header
#23567 commented on
Sep 5, 2025 • 0 new comments -
[Feature][Responses API] Support tool_choice other than "auto"
#23227 commented on
Sep 5, 2025 • 0 new comments -
[Roadmap] vLLM Release/CI/Performance Benchmark Q2 2025
#16284 commented on
Sep 5, 2025 • 0 new comments -
[CI]: Speed up Models Tests
#23670 commented on
Sep 4, 2025 • 0 new comments -
[Feature Request]: Per-rank log files (especially per-actor for Ray)
#23761 commented on
Sep 4, 2025 • 0 new comments -
[RFC]: Address piecewise graph splitting and attention fusion incompatibility
#23261 commented on
Sep 4, 2025 • 0 new comments -
[CI]: Have CI tests fail-fast
#23453 commented on
Sep 4, 2025 • 0 new comments -
[New Model]: Grok 2
#23557 commented on
Sep 4, 2025 • 0 new comments -
[CI]: Reduce docker build time with caching
#23588 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: GPT-OSS harmony format support
#23217 commented on
Sep 4, 2025 • 0 new comments -
[RFC]: Enabling Multiple Graphs Based on pre-defined conditions
#23113 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Failed to load model from local s3 instance
#23236 commented on
Sep 4, 2025 • 0 new comments -
[Feature][Tools]: Complete Redesign of Tool Calling
#22918 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: Add Moving Average Statistics for Better Performance Monitoring
#22480 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: Incorrect output throughput calculation for concurrent requests in benchmark_serving.py
#23820 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: embed prompts
#19746 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: `graph.eliminate_dead_code()` break the fx graph with `enable_fi_allreduce_fusion` when TP == 2
#23091 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: Add LoRA support for gpt-oss model
#23610 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: Load Qwen3 Moe model error when starting the vllm server on TPU
#23834 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: When running inference with the Moonlight model, the output becomes corrupted when n exceeds 1
#19206 commented on
Sep 4, 2025 • 0 new comments -
[Bug]: vLLM server timeout due to multiprocessing communication error
#23582 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: Which dataset do you recommend using for the ngram spec decoding method?
#23611 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: Unable to see more than 20% improvement on b200 for vllm
#23609 commented on
Sep 3, 2025 • 0 new comments -
[RFC]: Disaggregated Everything - Token In <> Token Out API Server
#22817 commented on
Sep 3, 2025 • 0 new comments -
[CI]: Declarative regression tests for API parameters
#23593 commented on
Sep 3, 2025 • 0 new comments -
[Installation]: Nightly builds not available in container registry
#19335 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Multimodal Benchmarking Support (MMLM)
#21887 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Add support for Apple MPS(Metal Performance Shaders)
#22629 commented on
Sep 3, 2025 • 0 new comments -
[Doc]: update contributing guide for macOS Apple silicon
#16940 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: [P/D] P/d is incompatible with spec decoding
#21583 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: If I want gpt-oss to be able to call custom tools, how should I set the --tool-call-parser parameter during deployment?
#22308 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Devstral-Small-2507 tool parsing issue when streaming
#23180 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: When accessing the API with the 'stop' parameter, the 'qwen3-reasoning-parser' fails to function correctly.
#22412 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Qwen2_5_VLForEmbedding
#13373 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: gpt-oss model output issue
#23694 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Structured output is not correctly enforced when using GPT-OSS
#23120 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: RuntimeError: NCCL error: unhandled cuda error
#21661 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Qwen 3 2507 update models use `deepseek_r1` reasoning parser - suggest renaming
#22657 commented on
Sep 3, 2025 • 0 new comments -
[Performance]: Long startup delay due to plugin loading and subprocess spawning
#21051 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Qwen3-Reranker-vllm exhibits a large gap between offline and online inference.
#20730 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
#8177 commented on
Sep 3, 2025 • 0 new comments -
[RFC]: Refactor CI/CD
#22992 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: vLLM can't serve multi-audio input inference
#16914 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Qwen2vl grounding results with vLLM are worse than with transformers inference
#11254 commented on
Sep 3, 2025 • 0 new comments -
Issue with Mistral Small and greek characters
#14307 commented on
Sep 3, 2025 • 0 new comments -
First tpot/itl is too long?
#15106 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: how to get the hidden states
#19207 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: Can multimodal models, such as qwen2.5vl, use the PD separation feature?
#19213 commented on
Sep 4, 2025 • 0 new comments -
[Usage]: Error when running a finetuned, quantized model with vllm.
#19218 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: Warn or auto-convert to FlexibleArgumentParser in AsyncEngineArgs.add_cli_args
#19221 commented on
Sep 4, 2025 • 0 new comments -
[Feature]: Use `QuantFp8` `CustomOp`-abstraction for MoE layers
#20711 commented on
Sep 4, 2025 • 0 new comments -
[Feature][Chat Completion] Support builtin tools of gpt-oss
#23292 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Qwen3 Models GGUF Support
#21511 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: V1 pre-compiled graph loading much slower than V0
#20342 commented on
Sep 3, 2025 • 0 new comments -
[RFC]: Reduce Unit Test to Speed Up CI
#22041 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: FunctionDefinition missing optional param strict
#15526 commented on
Sep 3, 2025 • 0 new comments -
[Feature]: Vulkan support
#21182 commented on
Sep 3, 2025 • 0 new comments -
[CI]: Use `HF_HUB_OFFLINE=1` in CI tests
#23451 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: gpt-oss-120b tool calls
#22337 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Support `qwen3` Models in `eagle3` Speculative Decoding
#23464 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Strange error `AssertionError: failed to get the hash of the compiled graph` when running `Qwen/Qwen3-8B` via `LLM` class
#18851 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Enabling custom_op for rotary_embedding raises an error for Qwen3-4B
#21101 commented on
Sep 3, 2025 • 0 new comments -
[Usage]: How to run model - `RedHatAI/Mixtral-8x7B-Instruct-v0.1-FP8`
#23192 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: 5090 cannot run Qwen3-30B-A3B-NVFP4!
#23826 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Stub function of moe_wna16_marlin_gemm takes less positional arguments than real implementation
#22634 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: AttributeError: module 'torch._tensor' has no attribute 'split'
#22676 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Numerics of Embedding Models
#22862 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: vLLM v1 hanging during Torch compilation
#15360 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: [V1][Spec Dec] EAGLE TP > 1 leads to errors when using --enforce_eager
#17513 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Qwen3-GPTQ | Error in inspecting model architecture 'Qwen3MoeForCausalLM'
#19504 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: Consider deleting envs.VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE
#21834 commented on
Sep 3, 2025 • 0 new comments -
[Bug]: TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 fails with vLLM-compile in torch <= 2.7.1
#21858 commented on
Sep 3, 2025 • 0 new comments