Insights: vllm-project/vllm
Overview
113 Pull requests merged by 72 people
-
[BugFix][P/D] Fix for cases where _recving_transfers can be cleaned up when *all* transfer done
#19874 merged
Jun 23, 2025 -
[P/D][NixlConnector] Support `tp_size > num_kv_heads` deployments
#19691 merged
Jun 23, 2025 -
[doc] Fold long code blocks to improve readability
#19926 merged
Jun 23, 2025 -
Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor
#19643 merged
Jun 23, 2025 -
[Misc] Configurable timeout for execute_model RPC calls via env var
#19544 merged
Jun 23, 2025 -
[Core] feat: Implement Priority Scheduling in V1 Engine
#19057 merged
Jun 23, 2025 -
[Perf][CLI] Improve overall startup time
#19941 merged
Jun 22, 2025 -
[BugFix] Add an env to disable moe chunking to work around compile incompatibility
#19642 merged
Jun 22, 2025 -
[Chore] dedup logs
#19955 merged
Jun 22, 2025 -
[Misc] Simplify vllm bench cli subcommand implementation
#19948 merged
Jun 22, 2025 -
[Misc] Update model-specific PR tagging
#19949 merged
Jun 22, 2025 -
[doc] use snippets for contact us
#19944 merged
Jun 22, 2025 -
[CI/Build] Auto tag perf benchmarks related PRs
#19943 merged
Jun 22, 2025 -
[Benchmark] fix request loss if "ping" is returned
#19535 merged
Jun 22, 2025 -
[MISC] add cpu_kvcache_space_bytes to CacheConfig
#19812 merged
Jun 22, 2025 -
[Misc] add vllm_config in __init__
#19866 merged
Jun 22, 2025 -
[Docs] Add GPT2ForSequenceClassification to supported models in docs
#19932 merged
Jun 21, 2025 -
[Multimodal] Optimize Qwen2/2.5-VL startup time
#19756 merged
Jun 21, 2025 -
[doc] add contact us in community
#19922 merged
Jun 21, 2025 -
[New model support]Support Tarsier2
#19887 merged
Jun 21, 2025 -
[Bugfix] Fix bnb 8bit model weights loading
#19917 merged
Jun 21, 2025 -
Fix: Check the type of params to be a Sequence not list.
#19910 merged
Jun 20, 2025 -
[Misc] Clean up useless code
#19889 merged
Jun 20, 2025 -
[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError
#19749 merged
Jun 20, 2025 -
[CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests
#19901 merged
Jun 20, 2025 -
Export NaNs in logits to scheduler_stats if output is corrupted
#18777 merged
Jun 20, 2025 -
[custom_op][vllm-plugin] update custom_op class to use op_registry
#19164 merged
Jun 20, 2025 -
[Model] GPT2ForSequenceClassification model
#19663 merged
Jun 20, 2025 -
[Fix] import regex instead of re
#19875 merged
Jun 20, 2025 -
[Kernel] correct cpu worker function parameter type
#19745 merged
Jun 20, 2025 -
[Misc] refactor example - openai_transcription_client
#19851 merged
Jun 20, 2025 -
[Misc] update cuda version
#19526 merged
Jun 20, 2025 -
[Bugfix][Ray] Set the cuda context eagerly in the ray worker
#19583 merged
Jun 20, 2025 -
[Bugfix] Enable PP with AITER+V1
#19822 merged
Jun 20, 2025 -
[Chore]: qwen3-moe-type-hints-mistake
#19860 merged
Jun 20, 2025 -
[Benchmark] Fix `Value of type "SampleRequest" is not indexable`
#18032 merged
Jun 20, 2025 -
[CI][Neuron] Fail and exit on first error
#19622 merged
Jun 20, 2025 -
[CI/Build][Bugfix] Fix deadlock on v1 engine test CI
#19872 merged
Jun 20, 2025 -
[Benchmark][Bugfix] Fix Dataset Length Calculation
#19868 merged
Jun 20, 2025 -
[Frontend] early return chat format resolution when specified
#19735 merged
Jun 19, 2025 -
[Core][Bugfix] Fix Online MM Beam Search
#19688 merged
Jun 19, 2025 -
[CI][CPU] Improve dummy Triton interfaces and fix the CPU CI
#19838 merged
Jun 19, 2025 -
[Doc] Update V1 user guide for embedding models
#19842 merged
Jun 19, 2025 -
Fixing Chunked Prefill Test.
#19762 merged
Jun 19, 2025 -
[Frontend] Add optional token-level progress bar to `LLM.beam_search`
#19301 merged
Jun 19, 2025 -
Add xLAM tool parser support
#17148 merged
Jun 19, 2025 -
[Minor] Allow redirecting model path for HfRunner in test
#19795 merged
Jun 19, 2025 -
raise exception for pin_lora
#19809 merged
Jun 19, 2025 -
[Misc] [ROCm] Prevent surplus tensor reshape
#19803 merged
Jun 19, 2025 -
[ROCm] [AITER] [Bugfix] Patch for AITER commit 648764942e552a8bb5fe16026703716a81f05374
#18990 merged
Jun 19, 2025 -
Mark invariant normalizer in Gemma as non-persistent
#19788 merged
Jun 19, 2025 -
[Bugfix] Add check_health to v1 async client.
#19821 merged
Jun 19, 2025 -
[Bugfix] Fix the linter
#19826 merged
Jun 19, 2025 -
Support embedding models in V1
#16188 merged
Jun 19, 2025 -
[Quantization] Modify the logic of BNB double quantization
#19742 merged
Jun 19, 2025 -
[Misc][ROCm] Enforce no unused variable in ROCm C++ files
#19796 merged
Jun 19, 2025 -
Fix FA2 fallback for Blackwell V1
#19781 merged
Jun 19, 2025 -
[Frontend] Expose custom args in OpenAI APIs
#16862 merged
Jun 19, 2025 -
[BugFix] Fix use_cudagraph=False
#19612 merged
Jun 19, 2025 -
[Multimodal] Use fast processor for Qwen2/2.5-VL
#19789 merged
Jun 18, 2025 -
[Core] More fixes to MultiModalEmbeddings type handling
#19715 merged
Jun 18, 2025 -
[TPU] Update torch-xla version to include paged attention tuned block change
#19813 merged
Jun 18, 2025 -
[Core] Do not copy array during hashing
#19484 merged
Jun 18, 2025 -
Disable "Forbid direct 'import triton'" check for
vllm/triton_utils/importing.py
in an extensible way#19783 merged
Jun 18, 2025 -
docs: fix Slack bulletpoint in README
#19811 merged
Jun 18, 2025 -
[v1] Support mamba2
#19327 merged
Jun 18, 2025 -
[Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc
#19808 merged
Jun 18, 2025 -
[Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully
#19725 merged
Jun 18, 2025 -
[Hardware][AMD] integrate aiter chunked prefill into vllm
#18596 merged
Jun 18, 2025 -
[Qwen] Add tagging rule for Qwen related PRs
#19799 merged
Jun 18, 2025 -
[Platform] Allow platform use V1 Engine by default
#19792 merged
Jun 18, 2025 -
[doc] fix the incorrect label
#19787 merged
Jun 18, 2025 -
[Minor] Zero-initialize attn output buffer
#19784 merged
Jun 18, 2025 -
[V1] Decouple GPU and TPU `InputBatch`
#19778 merged
Jun 18, 2025 -
[V1][P/D] An native implementation of xPyD based on P2P NCCL
#18242 merged
Jun 18, 2025 -
[V1] Add API docs for EncoderCacheManager
#19294 merged
Jun 18, 2025 -
[Misc] Add __str__ for RequestStatus
#19780 merged
Jun 18, 2025 -
[MISC] correct DeviceConfig device field static type analysis
#19699 merged
Jun 18, 2025 -
[MISC] correct copy_blocks src_to_dists param type
#19696 merged
Jun 18, 2025 -
[TPU] Update torch version to include paged attention kernel change
#19706 merged
Jun 17, 2025 -
[Feature][ROCm] Add full graph capture support for TritonAttentionBackend
#19158 merged
Jun 17, 2025 -
[Bugfix] Fix faulty triton importing logic when using Ray for DP
#19734 merged
Jun 17, 2025 -
[Misc] Update lmcache connector with the latest connector apis
#19441 merged
Jun 17, 2025 -
Remove sm120 arch from sm100 cutlass kernel arch list
#19716 merged
Jun 17, 2025 -
[Perf] Optimize `moe_align_block_size` CUDA kernel
#19572 merged
Jun 17, 2025 -
[Bugfix] Update multimodel models mapping to fit new checkpoint after Transformers v4.52
#19151 merged
Jun 17, 2025 -
[Mis] remove duplicate engine status checks
#19647 merged
Jun 17, 2025 -
[V1][Kernel] Flashinfer HND KV cache layout
#19280 merged
Jun 17, 2025 -
[doc] split "Other AI Accelerators" tabs
#19708 merged
Jun 17, 2025 -
[doc][mkdocs] Add edit button to documentation
#19637 merged
Jun 17, 2025 -
[Kernel] Add Split-KV Support to Unified Triton Attention Kernel
#19152 merged
Jun 17, 2025 -
Add a doc on how to update PyTorch version
#19705 merged
Jun 17, 2025 -
[Doc] Add missing llava family multi-image examples
#19698 merged
Jun 17, 2025 -
[Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager
#19686 merged
Jun 17, 2025 -
Fixes IMA for TP w/ flex-attention
#19712 merged
Jun 17, 2025 -
[DOC] fix doc typos
#19600 merged
Jun 17, 2025 -
[Frontend] add chunking audio for > 30s audio
#19597 merged
Jun 17, 2025 -
[Wheel Size] Only build FA2 8.0+PTX
#19336 merged
Jun 17, 2025 -
[doc] add project flag to gcloud TPU command
#19664 merged
Jun 17, 2025 -
[Fix] Fall back to Gloo when NCCL backend is unavailable
#19641 merged
Jun 17, 2025 -
[Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100
#19563 merged
Jun 16, 2025 -
[V1] Change return type on get_multimodal_embeddings()
#19446 merged
Jun 16, 2025 -
[Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM)
#19677 merged
Jun 16, 2025 -
[Kernels] Use empty for modular MoE workspaces
#19667 merged
Jun 16, 2025 -
[Bugfix] fix missing 'finish_reason': null in streaming chat
#19662 merged
Jun 16, 2025 -
[MISC] bump huggingface_hub pkg to 0.33.0
#19547 merged
Jun 16, 2025 -
[Bugfix] Fix TP inference for Flex attention backend
#19657 merged
Jun 16, 2025 -
[Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts.
#19652 merged
Jun 16, 2025 -
[DOC] Add reasoning capability to vLLM streamlit code
#19557 merged
Jun 16, 2025 -
[BugFix] Don't catch BaseException when dumping execute_model errors
#19626 merged
Jun 16, 2025 -
[Kernel] GGUF MMVQ kernel for multiple input vectors
#18754 merged
Jun 16, 2025 -
[Docs] Move multiproc doc to v1 dir
#19651 merged
Jun 16, 2025 -
[CI] Add mteb testing for rerank models
#19344 merged
Jun 16, 2025
95 Pull requests opened by 70 people
-
[Bugfix] ensure tool_choice is popped when `tool_choice:null` is passed in json payload
#19679 opened
Jun 16, 2025 -
[V1] [Metrics] Hide deprecated metrics.
#19682 opened
Jun 16, 2025 -
Add Thor SBSA and Spark
#19685 opened
Jun 16, 2025 -
[Feat] Add enforce_include_usage option
#19695 opened
Jun 16, 2025 -
add type assertion of request_id for LLMEngine.add_request
#19700 opened
Jun 16, 2025 -
[Docs] Enhance SupportsMultiModal interface documentation
#19701 opened
Jun 16, 2025 -
Make sure the correct version of ao is installed in CI
#19704 opened
Jun 16, 2025 -
Adding "AMD: Plugin Tests" to amdproduction.
#19707 opened
Jun 16, 2025 -
[Model] Activated LoRA
#19710 opened
Jun 16, 2025 -
[Misc][Tools][Benchmark] Add profile to autotune script
#19711 opened
Jun 16, 2025 -
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs.
#19717 opened
Jun 16, 2025 -
[V1] Perf optimization for layers reusing shared KV cache
#19719 opened
Jun 17, 2025 -
[Kernel] Masked act_mul and fp8-quant Kernels for Batched MoE
#19721 opened
Jun 17, 2025 -
v1: Add Request.block_hashes
#19728 opened
Jun 17, 2025 -
[PD] let toy proxy handle /chat/completions
#19730 opened
Jun 17, 2025 -
v1: Support KV events from connectors
#19737 opened
Jun 17, 2025 -
[P/D] Handle Abort and Make Lifecycle Explicit
#19740 opened
Jun 17, 2025 -
[Feature] add quick all reduce
#19744 opened
Jun 17, 2025 -
[BugFix] fix: aot passes kvcache dtype information
#19750 opened
Jun 17, 2025 -
[v1] Re-add fp32 support to v1 engine through FlexAttention
#19754 opened
Jun 17, 2025 -
[feat]: CUTLASS block scaled group gemm for SM100
#19757 opened
Jun 17, 2025 -
Register deepgemm moe kernels to work with v1 engine
#19759 opened
Jun 17, 2025 -
[BugFix] Fix topk_softmax assert
#19764 opened
Jun 17, 2025 -
add mamba head fix
#19766 opened
Jun 17, 2025 -
[Draft][torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends
#19767 opened
Jun 17, 2025 -
BLOCK_SIZE_K fix
#19769 opened
Jun 17, 2025 -
Workaround for an integer overflow with large CHUNK_SIZE
#19770 opened
Jun 17, 2025 -
Triton-fused DeepseekScalingRotaryEmbedding
#19771 opened
Jun 17, 2025 -
[AMD][P/D] Add libamdhip64.so.6 for llmd
#19773 opened
Jun 17, 2025 -
[WIP] Splitting attention _fwd_grouped_kernel_stage1 to improve occupancy
#19774 opened
Jun 17, 2025 -
[Ray] v1 Change device str for platform compatibility
#19785 opened
Jun 18, 2025 -
[DP] Support external DP Load Balancer mode
#19790 opened
Jun 18, 2025 -
Add SM120 to the Dockerfile
#19794 opened
Jun 18, 2025 -
Allow to override KV cache memory calculation
#19804 opened
Jun 18, 2025 -
Improve quant config semantic clarity, add Nvidia ModelOpt config adaptation
#19815 opened
Jun 18, 2025 -
Introduce RayCudaCommunicator as Ray Compiled Graph communicator
#19816 opened
Jun 18, 2025 -
[Kernel] Add Conch backend for mixed-precision linear layer
#19818 opened
Jun 18, 2025 -
LoRA support on llama4
#19819 opened
Jun 18, 2025 -
[Feature] Integrate new deepgemm
#19820 opened
Jun 18, 2025 -
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100).
#19825 opened
Jun 19, 2025 -
Move Gemma's stacked_params_mapping to class scope
#19829 opened
Jun 19, 2025 -
FP8 custom ops
#19830 opened
Jun 19, 2025 -
[WIP] Async Scheduler Prototype
#19831 opened
Jun 19, 2025 -
[P/D] Asynchronously do _nixl_handshake
#19836 opened
Jun 19, 2025 -
[Do not merge] Cache model info for faster startup
#19837 opened
Jun 19, 2025 -
Add Cutlass integration for MoE FP8
#19843 opened
Jun 19, 2025 -
[BugFix][V0] Fix AssertionError for prompt_logprobs
#19844 opened
Jun 19, 2025 -
refactor example - qwen3_reranker
#19847 opened
Jun 19, 2025 -
v1: Introduce an offloading component
#19848 opened
Jun 19, 2025 -
[Chore] logging metrics rename
#19852 opened
Jun 19, 2025 -
optimze attn
#19858 opened
Jun 19, 2025 -
[Docs] Fix syntax highlighting of shell commands
#19870 opened
Jun 19, 2025 -
[V1 Scheduler] BatchScheduler to balance token-based microbatches and reduce GPU pipeline bubbles
#19873 opened
Jun 19, 2025 -
Add page-aligned prefill scheduling.
#19878 opened
Jun 19, 2025 -
[Quantization] Add compressed-tensors emulations support for NVFP4
#19879 opened
Jun 19, 2025 -
[Misc] Add type alias `ReqId` and `EngineId` for better readability
#19880 opened
Jun 19, 2025 -
[Core] Add `update_load_config` RPC method
#19884 opened
Jun 20, 2025 -
[EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case
#19885 opened
Jun 20, 2025 -
[Fix][ROCm] Remove unused variables to fix build error on GFX11/12
#19891 opened
Jun 20, 2025 -
add some examples for other benchmark scripts
#19893 opened
Jun 20, 2025 -
[Benchmark][New Dataset]Added benchmark support for Unsloth Vision Datasets
#19894 opened
Jun 20, 2025 -
[Docs] Change response symbol to json in openai_compatible_server.md
#19895 opened
Jun 20, 2025 -
[CI/Build] Push latest tag for cpu and neuron docker image
#19897 opened
Jun 20, 2025 -
[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend to enable Llama-4
#19904 opened
Jun 20, 2025 -
add smollm3 support
#19905 opened
Jun 20, 2025 -
deepep low latency + fp8 dispatch - test fixes
#19911 opened
Jun 20, 2025 -
[V1] Logits processors extensibility
#19912 opened
Jun 20, 2025 -
[Misc] make get_class check for Executor instead of ExecutorBase
#19914 opened
Jun 20, 2025 -
Track expert selection metrics
#19915 opened
Jun 20, 2025 -
Fix: Missing newline at end of file
#19916 opened
Jun 20, 2025 -
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN
#19919 opened
Jun 20, 2025 -
[doc] improve readability for long commands
#19920 opened
Jun 20, 2025 -
Use FusedMoEQuantConfig everywhere
#19921 opened
Jun 20, 2025 -
enable multiple ssm groups duplication
#19924 opened
Jun 20, 2025 -
[V1] Solve potential deadlock issue in v1 engine core client internally
#19927 opened
Jun 21, 2025 -
[TPU] add kv cache update kernel
#19928 opened
Jun 21, 2025 -
[Bugfix][Benchmark] Fix Marlin benchmark
#19929 opened
Jun 21, 2025 -
[BugFix] Fix multi-node offline data parallel
#19937 opened
Jun 21, 2025 -
[PERF] Speedup of MRoPE prepare inputs
#19939 opened
Jun 21, 2025 -
[Bugfix] fix sampling seeding being off when sequences are prempted
#19940 opened
Jun 21, 2025 -
[Perf][Frontend]: eliminate api_key and x_request_id headers middleware overhead
#19946 opened
Jun 22, 2025 -
[Doc] Update V1 status for decoder-only embedding models
#19952 opened
Jun 22, 2025 -
[Bugfix][v1] Fix step pooler implementation and step pooling usage in v1
#19956 opened
Jun 22, 2025 -
[Doc] cmd+k
#19957 opened
Jun 22, 2025 -
[CI/Build] Add basic multimodal lm eval for CI testing
#19959 opened
Jun 23, 2025 -
feat(audio): add flag for Whisper chunking (#19772)
#19961 opened
Jun 23, 2025 -
[CI/Build] Upgrade lm-eval to 0.4.9
#19962 opened
Jun 23, 2025 -
[Chore] Clarifying log messages for KV Connector
#19965 opened
Jun 23, 2025 -
feat: offload weights to cpu before fp8 online quant
#19967 opened
Jun 23, 2025 -
feat: add reward model + min_p speculative decode
#19968 opened
Jun 23, 2025 -
[Bugfix] Fix CI bitsandbytes failure
#19969 opened
Jun 23, 2025 -
Implement Async Scheduling
#19970 opened
Jun 23, 2025 -
[Core][V1] Support sharded state loading in v1 engine
#19971 opened
Jun 23, 2025 -
Enabling Safe KVConnector
#19972 opened
Jun 23, 2025 -
[doc] use MkDocs collapsible blocks - supplement
#19973 opened
Jun 23, 2025
127 Issues closed by 33 people
-
[Usage]: online server requests do not return token usage information in version 0.7.2
#15426 closed
Jun 23, 2025 -
[Usage][UT]:Why the answer is ' 0, 1'
#15380 closed
Jun 23, 2025 -
[Performance]: Performance decrease after upgrading from 0.8.5 to 0.9.2
#19954 closed
Jun 23, 2025 -
[RFC]: Hybrid Memory Allocator
#11382 closed
Jun 23, 2025 -
[Feature]: Implement Priority Scheduling In V1 Engine
#14002 closed
Jun 23, 2025 -
[Feature]: deepseek gguf support
#13665 closed
Jun 23, 2025 -
[Usage]: vllm: error: unrecognized arguments: --lora-path
#13669 closed
Jun 23, 2025 -
[Usage]: How to use a custom model with LLM without HuggingFace
#13680 closed
Jun 23, 2025 -
[Installation]: vLLM wheel names are broken in recent versions
#13692 closed
Jun 23, 2025 -
[Installation]: requirement packaging errors during pip install
#13694 closed
Jun 23, 2025 -
[V1] Define WorkerBase for V1 Workers
#13711 closed
Jun 23, 2025 -
[Feature]: Qwen3-Rerank-8B online serving score API
#19930 closed
Jun 22, 2025 -
[Bug]: Qwen3 generation degradation on ampere GPUs
#19384 closed
Jun 22, 2025 -
[RFC]: Make device agnostic for diverse hardware support
#9268 closed
Jun 22, 2025 -
[Bug]: CUDA initialization error with vLLM 0.5.4 and PyTorch 2.4.0+cu121
#12189 closed
Jun 22, 2025 -
[Usage]: Does vLLM compile draft model?
#13144 closed
Jun 22, 2025 -
[Bug]: when i use docker vllm/vllm-openai:v0.7.2 to deploy r1 awq, i got empty content
#13219 closed
Jun 22, 2025 -
[Installation]: The startup failed, and it might be related to xformers.
#13279 closed
Jun 22, 2025 -
After deploying the qwen2.5-vl model using vllm, multiple images cannot be passed in. What is going on?
#13513 closed
Jun 22, 2025 -
[New Model]: facebook/contriever support requring
#13525 closed
Jun 22, 2025 -
[Bug]: Use Qwen 2.5-VL with TP=2, the memory of one GPU card will be cleared to zero during the request.
#13581 closed
Jun 22, 2025 -
[Bug]: arm64 No module named 'xformers'
#13585 closed
Jun 22, 2025 -
[Bug]: Marlin kernel doesn't work for multi-gpus
#13590 closed
Jun 22, 2025 -
[Performance]: TTFT Spikes When QPS Increases During DeepSeek-R1 Testing with TP8 and PP2
#13610 closed
Jun 22, 2025 -
[Feature]: Support multiple models per GPU
#13633 closed
Jun 22, 2025 -
Tokens being inserted during generation?
#13639 closed
Jun 22, 2025 -
[RFC]: Implement disaggregated prefilling via KV cache transfer
#5557 closed
Jun 21, 2025 -
[Feature]: vllm support for Ascend NPU
#6728 closed
Jun 21, 2025 -
[Bug]: HFValidation
#13485 closed
Jun 21, 2025 -
[Doc]: Add list of commands for `vllm serve`
#19859 closed
Jun 20, 2025 -
[Bug]: Inductor codegen: fatal error: stddef.h: No such file or directory
#19656 closed
Jun 20, 2025 -
[Usage]: How to eliminate randomness and obtain fixed results with VLLM 0.8
#15205 closed
Jun 20, 2025 -
[Bug]: enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
#19890 closed
Jun 20, 2025 -
[Performance]: speed regression 0.6.2 => 0.6.3?
#9476 closed
Jun 20, 2025 -
[Usage]: vLLM For maximally batched use case
#9760 closed
Jun 20, 2025 -
[Bug]: Interference of Tokens in Concurrent Requests Causing Result Confusion in Version 0.6.3
#9910 closed
Jun 20, 2025 -
[Usage]: Multi-Step Scheduling with Speculative Decoding
#11917 closed
Jun 20, 2025 -
[Bug]: Nccl Test Error
#12008 closed
Jun 20, 2025 -
[Usage]: V0 Does Qwen2-VL Support torch.compile in vllm?
#12693 closed
Jun 20, 2025 -
[Bug]: Pooling request fails for classification task
#12753 closed
Jun 20, 2025 -
[Bug]: V1 engine fails with offline batched inference code in V0 engine
#12929 closed
Jun 20, 2025 -
[Bug]: empty begin steam output
#13293 closed
Jun 20, 2025 -
[Bug]: GPU drops to 0 usage when handling concurrent requests
#13422 closed
Jun 20, 2025 -
[Bug]: Guided decoding only generating single character during inference with finetuned model
#13448 closed
Jun 20, 2025 -
[Feature]: Support xTTSv2
#13457 closed
Jun 20, 2025 -
[Bug]: AWQ doesn't support 4-bit?
#13462 closed
Jun 20, 2025 -
[Feature]: How can I set a different max_pixels for each request when starting the service?
#13463 closed
Jun 20, 2025 -
[Bug]: Memory leak in 0.6 and 0.7 when setting "max_tokens=1"
#13464 closed
Jun 20, 2025 -
[Bug]: Mamba should return states in fp32
#13466 closed
Jun 20, 2025 -
[Doc]: List of Models Supported By TPU Backend
#13476 closed
Jun 20, 2025 -
[Bug]: When deploying two llm services on the same batch of GPUs. Inference will be twice as slow
#13477 closed
Jun 20, 2025 -
[Feature]: kv cahce int8:Dynamic kv cache scaling factors computation
#13478 closed
Jun 20, 2025 -
[Bug]: vLLM on TPU is broken with XLA errors
#13479 closed
Jun 20, 2025 -
[Usage]: How to write scoring script when deploying to a managed Azure machine learning real-time endpoint?
#13491 closed
Jun 20, 2025 -
[Usage]: How to use logits processors with max_num_seqs > 1?
#13553 closed
Jun 20, 2025 -
[Bug]: Increasing root volume with guided decoding
#13556 closed
Jun 20, 2025 -
[Bug]: Index Out of Range Bug in Pooler when Using returned_token_ids with hidden_states
#13559 closed
Jun 20, 2025 -
[New Model]: Qwen3-Rerank 0.6B, 4B, 8B
#19529 closed
Jun 19, 2025 -
[Bug]: ValueError: Cannot cast <zmq.Socket(zmq.ROUTER) at 0x796c63de24a0> to int
#19444 closed
Jun 19, 2025 -
[Bug]: Get NCCL_ERROR_SYSTEM_ERROR with latest Docker vLLM image (v0.9.1)
#19613 closed
Jun 19, 2025 -
[Bug]: Async Beam Search Doesn't Pass Multimodal Data Correctly
#19687 closed
Jun 19, 2025 -
[Bug]: fail to load OpenGVLab/InternVL3-78B with vllm
#19856 closed
Jun 19, 2025 -
[Usage]: Implement a custom scheduler
#16479 closed
Jun 19, 2025 -
[Usage]: `gpu_memory_utilization` backend parameter questions
#19805 closed
Jun 19, 2025 -
[Misc]: why `3B-Instruct-AWQ` takes 16G
#15204 closed
Jun 19, 2025 -
[Usage]: How to deploy DeepSeek R1 in a K8s environment
#14740 closed
Jun 19, 2025 -
vector search
#15268 closed
Jun 19, 2025 -
[Bug]: Is vllm support function call mode?
#6631 closed
Jun 19, 2025 -
Conda Forge Package
#3126 closed
Jun 19, 2025 -
[Bug]: OOM with QwQ-32B
#15258 closed
Jun 19, 2025 -
[Bug]: TypeError: Qwen2_5OmniProcessor.__init__() got multiple values for argument 'image_processor'
#19833 closed
Jun 19, 2025 -
[Bug]:
#19832 closed
Jun 19, 2025 -
[Bug]: PreemptionMode.RECOMPUTE is incorrect
#16832 closed
Jun 19, 2025 -
[Bug]: Reproduction failed when evaluate model
#19802 closed
Jun 19, 2025 -
[Bug]: Runtime error occurs when running deepseek v3
#12827 closed
Jun 19, 2025 -
[Feature]: Support custom args in OpenAI (chat) completion requests
#16802 closed
Jun 19, 2025 -
[CI Failure]: Samplers Test - samplers/test_beam_search.py::test_beam_search_passes_multimodal_data
#19736 closed
Jun 18, 2025 -
[Bug]: RAY_CGRAPH_get_timeout is not set successfully. Ray still detects default timeout value.
#19703 closed
Jun 18, 2025 -
[Usage]: How to start the vllm service and pass parameters on XPU
#19528 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/index.html`
#19755 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/google_tpu.html`
#19753 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/google_tpu.html`
#19752 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/`
#19751 closed
Jun 18, 2025 -
[Bug]: vllm running on new H20-3e Nvidia card has occasional garbled bug using Qwen 2.5 VL 72B
#19723 closed
Jun 18, 2025 -
[Feature]: do you plan to support "suffix" of "v1/completions"
#9976 closed
Jun 18, 2025 -
[Bug]: Continuous batching (OpenAI Server) with greedy search return different results
#11658 closed
Jun 18, 2025 -
[Usage]: Does DeepSeek-R1 1.58-bit Dynamic Quant work on VLLM?
#12573 closed
Jun 18, 2025 -
[Usage]: How to get access to scheduler
#12772 closed
Jun 18, 2025 -
[Usage]: How to check the corresponding functionality of operators in Llama-2-7b-hf?
#13010 closed
Jun 18, 2025 -
[Bug]: CUDA memory error with benchmark_serving.py
#13152 closed
Jun 18, 2025 -
[Bug]: Cannot pull the docker image for installation
#13330 closed
Jun 18, 2025 -
[Bug]: vllm server bad
#13340 closed
Jun 18, 2025 -
[Bug]: DeepSeek R1 deployment panics when serving requests with cuda memory access
#13389 closed
Jun 18, 2025 -
[Bug]:Lora Adapters with num-scheduler-steps doesn't work in version 0.7.2, even with VLLM_USE_V1=0
#13394 closed
Jun 18, 2025 -
[Misc]: Why do we need to explicitly pass tool parsers?
#13399 closed
Jun 18, 2025 -
[Feature]: Support token-level timestamps in whisper models
#13400 closed
Jun 18, 2025 -
[Installation]: flash-attention internal "git submodule update" problematic for offline-install
#13424 closed
Jun 18, 2025 -
[Feature]: load_weights function in JambaForSequenceClassification
#13430 closed
Jun 18, 2025 -
[CI Failure]: Distributed Tests (2 GPUs) - v1/test_async_llm_dp.py::test_load
#19731 closed
Jun 17, 2025 -
[Feature]: Optimize `moe_align_block_size` CUDA kernel
#19517 closed
Jun 17, 2025 -
[Bug]: Error when loading model(gemma-3-4b) merged after DeepSpeed training into vLLM
#19139 closed
Jun 17, 2025 -
[Bug]: guided_regex parsing error crashes the server
#19270 closed
Jun 17, 2025 -
Release dataset of bug-fixing commits and test cases on Hugging Face
#19738 closed
Jun 17, 2025 -
[Bug]: GPU Placement Group Creation Error in Multi-Node Setup with vLLM
#13388 closed
Jun 17, 2025 -
[Bug]: Strange cuda out of memory when runing llava1.5 7b on 80G A100
#19724 closed
Jun 17, 2025 -
[Bug]: Broken Structured Output (Guided Decoding) with Qwen3 models when `enable_thinking=False`
#18819 closed
Jun 17, 2025 -
[Doc]: Does llava onevision support VLM multi images?
#19521 closed
Jun 17, 2025 -
[Usage]: How to identify mix and max pixels for the image
#15034 closed
Jun 17, 2025 -
[Usage]: Can vllm multimodal generate use preprocessed image?
#14998 closed
Jun 17, 2025 -
[Usage]: Transcription "Maximum clip duration (30s) exceeded
#15012 closed
Jun 17, 2025 -
centos7 package err, is my problem?
#14750 closed
Jun 17, 2025 -
[Bug]: RuntimeError: HIP Error on vLLM ROCm Image in Kubernetes Cluster with AMD GPUs
#10855 closed
Jun 17, 2025 -
[Feature]: multiple gpus specification
#13357 closed
Jun 17, 2025 -
[CI Failure]: Spec Decoding - spec_decode/e2e/test_multistep_correctness.py
#18954 closed
Jun 16, 2025 -
[Bug]: (regression from v0.8.5) missing "finish_reason": null in streaming chat completion outputs
#19650 closed
Jun 16, 2025
98 Issues opened by 90 people
-
[Doc]: The install step is missed in the section “Build wheel from source” in the Installation of CPU.
#19974 opened
Jun 23, 2025 -
[CI Failure]: Quantization Test - quantization/test_bitsandbytes.py::test_4bit_bnb_embedding_model
#19964 opened
Jun 23, 2025 -
[New Model]: Support HCXVisionForCausalLM
#19963 opened
Jun 23, 2025 -
[Bug]: the worker node joins successfully for a few seconds and exits without a reason
#19960 opened
Jun 23, 2025 -
[Doc]: Performance dashboard down
#19958 opened
Jun 22, 2025 -
[Feature]: vllm support for mistral3.1 with no consolidated.safetensors
#19953 opened
Jun 22, 2025 -
[Bug]: Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected
#19951 opened
Jun 22, 2025 -
[Usage]: How to use vLLM to accelerate text classification with the Qwen3ForSequenceClassification model?
#19950 opened
Jun 22, 2025 -
[Installation]: Multiple errors when compiling with clang: error: no member named 'min'. fmax
#19947 opened
Jun 22, 2025 -
[Usage]: how to invoke KVCache save in one node deployment development enviroment
#19942 opened
Jun 22, 2025 -
[Bug]: openapi doesnt generate for Java cleanly AnyOf for Stop has a default value
#19938 opened
Jun 21, 2025 -
[Usage]: mac run vllm failed by docker
#19936 opened
Jun 21, 2025 -
[Usage]: How to debug in subprocess
#19935 opened
Jun 21, 2025 -
[Feature]: deploying the Qwen/Qwen3-235B-A22B-FP8 using the PD disaggregation + DP + EP + DeepEP
#19934 opened
Jun 21, 2025 -
[Feature]: Support casting lm_head to FP32 to get old logprobs in RLHF
#19925 opened
Jun 21, 2025 -
[Bug]: Unexpected Memory Usage in cutlass_moe_fp8() on Latest main 6bc7b57
#19923 opened
Jun 20, 2025 -
Issue 1: Missing type hint for `wheel` argument in `generate_index.py`
#19918 opened
Jun 20, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/gpu.html`
#19913 opened
Jun 20, 2025 -
[Bug]: file in `vllm/benchmarks/kernels/benchmark_marlin.py` cannot execute
#19909 opened
Jun 20, 2025 -
[Bug]: Deepseek R1 0528 tool calling not working
#19907 opened
Jun 20, 2025 -
[Bug]: RTX5080 got CUDA error: no kernel image is available for execution on the device
#19906 opened
Jun 20, 2025 -
[Bug]: nsys cann't open the file
#19903 opened
Jun 20, 2025 -
[Feature]: `kv_transfer_params` not returned for multiple subrequests
#19902 opened
Jun 20, 2025 -
[Doc]: The documentation should be updated to cover GPU compatibility
#19900 opened
Jun 20, 2025 -
[Feature]: EXL3 support
#19896 opened
Jun 20, 2025 -
[Bug]: AsyncLLMEngine stuck in V1
#19892 opened
Jun 20, 2025 -
[Performance]: the performance decline in fp8 inference mode
#19888 opened
Jun 20, 2025 -
[RFC]: Inplace model weights loading
#19886 opened
Jun 20, 2025 -
Issue 1: Incorrect comparison with MAIN_CUDA_VERSION for CPU target
#19882 opened
Jun 19, 2025 -
[Feature]: Implement `check_health` for V1
#19881 opened
Jun 19, 2025 -
[Feature]: Support passing token-level schedules for temperature and other sampling parameters
#19877 opened
Jun 19, 2025 -
[Bug]: InternVL3 poor (random) output with 8bit quantization
#19876 opened
Jun 19, 2025 -
[Bug]: 'IndexError: tuple index out of range' when using 8 gpu's
#19871 opened
Jun 19, 2025 -
[Usage]: missing latest tag from cpu docker registry
#19869 opened
Jun 19, 2025 -
[Bug]: AiterFlashAttentionImpl.__init__() got multiple values for argument 'use_irope' for llama4 model
#19867 opened
Jun 19, 2025 -
[Bug]: NCCL issues when running vllm v0.9.1 for the Deepseek-R1 model [B200 GPU]
#19865 opened
Jun 19, 2025 -
[Feature]: Returning embedding dimensions in /v1/models
#19864 opened
Jun 19, 2025 -
[Bug]: 5090 gemma-3-12b-it using FP8/INT8/FP16 quantization for conncurent requests DOCKER.
#19863 opened
Jun 19, 2025 -
[Bug]: Internal Server Error when use max_tokens=null
#19862 opened
Jun 19, 2025 -
[Bug]: max_completion_tokens doesn't work as max
#19861 opened
Jun 19, 2025 -
[Bug]: AttributeError: 'InferenceClient' object has no attribute 'post'
#19857 opened
Jun 19, 2025 -
[Bug]: dynamic fp8 quantization does not save memory when enable_sleep_mode=True
#19855 opened
Jun 19, 2025 -
[RFC]: KV cache offloading
#19854 opened
Jun 19, 2025 -
[Bug]: Unable to deploy NVFP4 quantized model
#19853 opened
Jun 19, 2025 -
[Feature]: Quant & TP for VIT
#19850 opened
Jun 19, 2025 -
[Bug]: Subprocess health check / automatic restart for V1 EngineCore
#19849 opened
Jun 19, 2025 -
[Bug]: Not able to run vllm cpu using Dockerfile.cpu
#19845 opened
Jun 19, 2025 -
[Usage]: Troubleshooting Inconsistencies Between VLLM and Transformer Outputs
#19841 opened
Jun 19, 2025 -
[Bug]: Qwen3 non-thinking model puts the output in reasoning_content when it should be in content
#19839 opened
Jun 19, 2025 -
[Usage]: Why does GPU memory usage keep growing after starting qwen2.5vl-7b with vLLM?
#19828 opened
Jun 19, 2025 -
[Feature]: Improve startup time UX
#19824 opened
Jun 19, 2025 -
[Feature]: `CustomOp` cleanup
#19817 opened
Jun 18, 2025 -
[Bug]: Loading Qwen3MoE using Transformers backend
#19801 opened
Jun 18, 2025 -
[Bug]: After receiving the request, the service froze
#19800 opened
Jun 18, 2025 -
[Bug]:ValueError: Exceeds max model len when embedding using bge-large-zh-v1.5
#19798 opened
Jun 18, 2025 -
[Usage]: How could vllm support token classifications(albert model)?
#19797 opened
Jun 18, 2025 -
[Bug]: MP Executor does not correctly handle device allocation for non-CUDA devices (e.g., NPUs)
#19791 opened
Jun 18, 2025 -
[Bug]: vLLM 0.7.3 sometimes throws errors with Qwen2.5VL-7b
#19786 opened
Jun 18, 2025 -
[Doc]: Update the vllm quantization support for the AMD GPU
#19782 opened
Jun 18, 2025 -
[Bug]: wrong output on L20 using fp8
#19779 opened
Jun 17, 2025 -
[Bug][Spec Decode]: TPOT in prometheus is ITL in vllm serve
#19776 opened
Jun 17, 2025 -
[Bug]: Potential bug when using speculative decoding example similar as the one from docs
#19775 opened
Jun 17, 2025 -
[Feature]: Evaluate prompt presence on subsequent audio chunks
#19772 opened
Jun 17, 2025 -
Vllm + FlexAttention Work Tracking
#19765 opened
Jun 17, 2025 -
[Bug]: Gemma3 reporting low image accuracy with v1 engine
#19763 opened
Jun 17, 2025 -
[Feature]: Upgrade base Python version of vllm/vllm-tpu docker image to 3.11+
#19761 opened
Jun 17, 2025 -
[Bug]: vllm/vllm-tpu uses Debian base but Ubuntu APT sources, causing package installation errors
#19760 opened
Jun 17, 2025 -
[Feature]: Remove cupy dependency for multi-node Ray deployment
#19758 opened
Jun 17, 2025 -
[Usage]: embed prompts
#19746 opened
Jun 17, 2025 -
[Bug]: Incorrect kernel selected when multiple GPUs
#19741 opened
Jun 17, 2025 -
[Feature]: Logging details about incorrect requests
#19739 opened
Jun 17, 2025 -
[Bug]: Fails to load llmcompressor GPTQ Qwen2.5-VL model
#19733 opened
Jun 17, 2025 -
[Performance]: very slow performance for nested list with length constraints
#19732 opened
Jun 17, 2025 -
[Bug]: tool_chat_template_deepseekr1 is no use
#19729 opened
Jun 17, 2025 -
[Feature]: AWQ DeepSeek support on MI300X
#19727 opened
Jun 17, 2025 -
[Bug]: DeepGEMM does not work with CUDA Graph
#19722 opened
Jun 17, 2025 -
[Doc]: Among the supported models, is there an embedding model whose embedding dimension can be set to 1536?
#19720 opened
Jun 17, 2025 -
[Performance]: No signficant speedup from Wfp8Afp8 (vs Wbf16Abf16) in Llama-4 Scout
#19714 opened
Jun 16, 2025 -
[Bug]: Audio transcription cannot load `preprocessor_config.json` when using runai streamer
#19713 opened
Jun 16, 2025 -
[Feature]: Simplify speculative-config format for vllm serve
#19709 opened
Jun 16, 2025 -
[RFC]: Multimodal data IPC improvement
#19702 opened
Jun 16, 2025 -
[Bug]: Truncated && Incomplete Response from LLAMA4 Scout Prefix Caching
#19697 opened
Jun 16, 2025 -
[Bug]: Enable LORA on Version 0.9.1 and RTX 5090 causes an issue
#19693 opened
Jun 16, 2025 -
[Performance]: V1 engine runs slower than V0 on the MI300X
#19692 opened
Jun 16, 2025 -
[Usage]: Changing image_feature and image_input_shape has no effect on VLM output
#19689 opened
Jun 16, 2025 -
[Bug]: deploy 32B model using vllm + ray with two nodes failed with nccl error
#19684 opened
Jun 16, 2025 -
[Bug]: rocm build crashes with libcuda.so.1: cannot open shared object file
#19681 opened
Jun 16, 2025 -
[Usage]: What is the meaning of `Avg generation throughput`
#19680 opened
Jun 16, 2025
403 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Core] Support Local Chunked Attention for Hybrid KV Cache
#19351 commented on
Jun 19, 2025 • 35 new comments -
[Feature] Support sequence parallelism for static fp8 quantization
#19181 commented on
Jun 23, 2025 • 32 new comments -
[Model] Support TP/PP/mamba2 kernel for PLaMo2
#19674 commented on
Jun 17, 2025 • 21 new comments -
[Model] Automatic conversion of score (CrossEncoding) models
#19675 commented on
Jun 23, 2025 • 19 new comments -
[V1] - Enable worker -> scheduler connector metadata
#19555 commented on
Jun 22, 2025 • 18 new comments -
[Bugfix] Move hardware-dependent configuration resolution (FlashMLA capability, `dtype: 'auto'`) to worker
#18979 commented on
Jun 18, 2025 • 15 new comments -
[Feature] microbatch tokenization
#19334 commented on
Jun 22, 2025 • 11 new comments -
[Feature][Quantization] MXFP4 support for MOE models
#17888 commented on
Jun 19, 2025 • 8 new comments -
Draft: WIP NixlConnector drop ZMQ in favor of HTTP metadata exchanges
#19447 commented on
Jun 19, 2025 • 7 new comments -
[Bug][Frontend] Fix structure of transcription's decoder_prompt
#18809 commented on
Jun 22, 2025 • 7 new comments -
[Frontend] Expand tools even if tool_choice="none"
#17177 commented on
Jun 22, 2025 • 6 new comments -
[Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load
#19619 commented on
Jun 20, 2025 • 6 new comments -
[Platform] Add custom default max tokens
#18557 commented on
Jun 19, 2025 • 6 new comments -
Sync test dependency with test.in for torch nightly
#19632 commented on
Jun 19, 2025 • 5 new comments -
[Misc] feat output content in stream response
#19608 commented on
Jun 18, 2025 • 5 new comments -
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine
#19067 commented on
Jun 19, 2025 • 4 new comments -
[Frontend] Add unix domain socket support
#18097 commented on
Jun 17, 2025 • 4 new comments -
[Bugfix][Benchmarks]Fixed async_request_deepspeed_mii() to get ttft
#18689 commented on
Jun 23, 2025 • 4 new comments -
[Core] Add Support for Default Modality Specific LoRAs [generate / chat completions]
#19126 commented on
Jun 20, 2025 • 4 new comments -
[Core] Add support for sampling penalties to v1 ngram speculative decoding
#18441 commented on
Jun 22, 2025 • 4 new comments -
[Kernel] Adding basic Triton JitCache for triton_attn
#16606 commented on
Jun 18, 2025 • 4 new comments -
[Feature] Expert Parallelism Load Balancer (EPLB)
#18343 commented on
Jun 22, 2025 • 3 new comments -
[Frontend] Add `/v1/audio/translations` OpenAI API endpoint
#19615 commented on
Jun 19, 2025 • 3 new comments -
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference
#18768 commented on
Jun 18, 2025 • 3 new comments -
[Frontend] Support image object in llm.chat
#19635 commented on
Jun 20, 2025 • 3 new comments -
[Bugfix] VLLM_V1 supports passing other compilation levels
#19340 commented on
Jun 19, 2025 • 2 new comments -
[Doc] Add inplace weights loading example
#19640 commented on
Jun 20, 2025 • 2 new comments -
[Chore] debloat some initial logs
#19438 commented on
Jun 20, 2025 • 2 new comments -
Enable CPU nightly performance benchmark and its Markdown report
#18444 commented on
Jun 22, 2025 • 2 new comments -
Add quickreduce as alternative to custom allreduce
#16804 commented on
Jun 23, 2025 • 2 new comments -
[P/D][Bugfix]: Fix the issue where the remote KVCache cannot be loaded when PP > 1
#19558 commented on
Jun 17, 2025 • 2 new comments -
[CI] bump mypy version to 1.16.0
#19548 commented on
Jun 17, 2025 • 2 new comments -
[V1] LogitsProcessor programming model
#16728 commented on
Jun 19, 2025 • 2 new comments -
[Kernels] MoE refactor
#19636 commented on
Jun 20, 2025 • 2 new comments -
[V1] Only print cudagraph tqdm on rank 0 with `is_global_first_rank`
#19516 commented on
Jun 16, 2025 • 1 new comment -
[Bugfix]fix asyncLLM test_abort
#16090 commented on
Jun 17, 2025 • 1 new comment -
[Misc] Add gemma3 chat template with pythonic-style function calling
#17149 commented on
Jun 16, 2025 • 1 new comment -
[Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0
#19346 commented on
Jun 20, 2025 • 1 new comment -
[Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model
#19598 commented on
Jun 18, 2025 • 1 new comment -
[Bugfix]: Fix messy code when using logprobs
#19209 commented on
Jun 20, 2025 • 1 new comment -
[WIP] [Core][P/D] CPU connector for PD disagg
#18332 commented on
Jun 16, 2025 • 1 new comment -
[Frontend] Add -d/--detach option for vllm serve and process management
#18065 commented on
Jun 19, 2025 • 1 new comment -
[V1][Experimental] Jump-forward decoding
#15490 commented on
Jun 19, 2025 • 0 new comments -
[Feat][Frontend] Added support for HermesToolParser for models without special tokens
#16890 commented on
Jun 20, 2025 • 0 new comments -
[Misc] Add model list API in disagg proxy
#13083 commented on
Jun 23, 2025 • 0 new comments -
set UV_PYTHON_INSTALL_DIR to a world readable/executable location
#15302 commented on
Jun 23, 2025 • 0 new comments -
Add missed ray[data] dependence in cuda.txt
#15283 commented on
Jun 22, 2025 • 0 new comments -
[Bugfix] Fix include prompt in stream response when echo=true
#15233 commented on
Jun 18, 2025 • 0 new comments -
[Perf] Optimize Qwen2/2.5-VL ViT tensor generating performance
#14684 commented on
Jun 19, 2025 • 0 new comments -
When an exception happens in multiproc, die hard and fast
#15000 commented on
Jun 18, 2025 • 0 new comments -
Support multicard for Disaggregated Prefill/Decode and provide a automatic benchmark test
#15221 commented on
Jun 23, 2025 • 0 new comments -
[WIP][TPU] Support mrope models (Qwen2VL)
#15149 commented on
Jun 19, 2025 • 0 new comments -
Metrics proposal OpenTelemetry API
#15138 commented on
Jun 22, 2025 • 0 new comments -
[Misc]Fix incorrect local IP detection in multi-network interface environments
#15071 commented on
Jun 18, 2025 • 0 new comments -
[Perf] Optimize MRoPR position preparing performance with numba
#16881 commented on
Jun 18, 2025 • 0 new comments -
[Model][Frontend] Adding timeseries modality support and Qwen2.5-ChatTS model support
#16852 commented on
Jun 19, 2025 • 0 new comments -
[Misc] benchmark supports disaggregated prefill
#16824 commented on
Jun 23, 2025 • 0 new comments -
[Misc][Benchmark]feat(benchmarks): Optimize response handling for OpenAI Chat Completions API, add processing for reasoning_content field
#16817 commented on
Jun 23, 2025 • 0 new comments -
[Kernel] Add Split-KV Attention Kernel to the triton_attn Backend
#16794 commented on
Jun 18, 2025 • 0 new comments -
[Bugfix][Disaggregated] Set min_tokens in disagg_proxy_demo.py
#16705 commented on
Jun 19, 2025 • 0 new comments -
[NIXL] vllm v0 nixl integration
#16677 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix][Model] fix Phi3Small model only support v0
#16493 commented on
Jun 17, 2025 • 0 new comments -
[Misc][Benchmark]feat(benchmarks): Add async_request_generate function to support generate endpoint
#16421 commented on
Jun 23, 2025 • 0 new comments -
[Model][VLM] Add Qwen2.5-Omni model support (end-to-end full support)
#16347 commented on
Jun 19, 2025 • 0 new comments -
[Draft] SnapKV
#16160 commented on
Jun 16, 2025 • 0 new comments -
[Kernel] Support merge attn cuda kernel
#16060 commented on
Jun 23, 2025 • 0 new comments -
[Misc][Benchmark] Remove colon from key 'request_goodput:'
#16018 commented on
Jun 23, 2025 • 0 new comments -
[V0][V1][Core] Add outlines integration for V1, and update V0 integration.
#15975 commented on
Jun 22, 2025 • 0 new comments -
[Kernel] Enable FP16 and BF16 CUTLASS MoE kernels
#15932 commented on
Jun 23, 2025 • 0 new comments -
feat: update allow_pattern
#15797 commented on
Jun 19, 2025 • 0 new comments -
[Benchmark] Fix two issues in benchmark result
#15795 commented on
Jun 22, 2025 • 0 new comments -
[Bugfix]: The sequence becomes shorter after encoding and decoding
#15516 commented on
Jun 23, 2025 • 0 new comments -
[Minor] QoL for Benchmarking
#15512 commented on
Jun 23, 2025 • 0 new comments -
Add inference_benchmark_script.sh
#15504 commented on
Jun 23, 2025 • 0 new comments -
Adding Share Expert Fusion for DeepSeek
#15502 commented on
Jun 23, 2025 • 0 new comments -
[Bugfix] Adjust tool call handling in llama template to support single tool calls only
#12938 commented on
Jun 23, 2025 • 0 new comments -
[CI/Build][v1] vLLM v1 automatic benchmarking
#12919 commented on
Jun 23, 2025 • 0 new comments -
[Doc] Update benchmarks/README.md
#12903 commented on
Jun 23, 2025 • 0 new comments -
[V1][PoC] Refactor EngineCoreOutputs
#12853 commented on
Jun 23, 2025 • 0 new comments -
[Hardware][Metal] Apple Metal support
#12640 commented on
Jun 19, 2025 • 0 new comments -
[Misc]add modules_to_not_convert attribute to gptq series
#12103 commented on
Jun 20, 2025 • 0 new comments -
[Doc] update docs for nightly benchmarks
#12022 commented on
Jun 23, 2025 • 0 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
Jun 17, 2025 • 0 new comments -
[Model] Add T5 model (2/2)
#11901 commented on
Jun 23, 2025 • 0 new comments -
[Model] LoRA with lm_head and embed_tokens fully trained - 4
#11714 commented on
Jun 23, 2025 • 0 new comments -
[Core] Rank-to-device mapping env var
#11662 commented on
Jun 23, 2025 • 0 new comments -
[Misc] Adding API Key to the benchmark
#11384 commented on
Jun 23, 2025 • 0 new comments -
[Kernel] Add ExLlamaV2 Weight Quantization Support
#11348 commented on
Jun 23, 2025 • 0 new comments -
[Kernel][Model] PagedAttention: Support custom attention bias for T5 model (1/2)
#11334 commented on
Jun 23, 2025 • 0 new comments -
[CI/Build] Adds Modal runners for performance benchmark
#11239 commented on
Jun 23, 2025 • 0 new comments -
[Core] Efficient transmission for CPU prefix caching, based on PR#10874
#11099 commented on
Jun 23, 2025 • 0 new comments -
Turn on V1 for H200 build
#10505 commented on
Jun 23, 2025 • 0 new comments -
[Frontend] Add Command-R and Llama-3 chat template
#10496 commented on
Jun 23, 2025 • 0 new comments -
[Bugfix] Generate multiple different prompts in benchmark_prefix_caching.py based on --num-prompts
#9687 commented on
Jun 23, 2025 • 0 new comments -
[Bug][Regression]: Dimension out of range when using MooncakeStoreConnector
#18834 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: `undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE` when running `0.7.3.dev57+g2ae88905.precompiled` on A100
#13047 commented on
Jun 23, 2025 • 0 new comments -
[Installation]: undefined symbol: _ZNK3c1011StorageImpl27throw_data_ptr_access_errorEv
#15010 commented on
Jun 19, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Jun 17, 2025 • 0 new comments -
[Quant] SupportsQuant handles ignored_modules
#14635 commented on
Jun 19, 2025 • 0 new comments -
[Quant] Add SupportsQuant and packed_modules_mapping to all models
#14631 commented on
Jun 19, 2025 • 0 new comments -
[Hardware][Intel GPU] Add V1 engine support and `chunked_prefill` kernel
#14612 commented on
Jun 19, 2025 • 0 new comments -
[DO NOT MERGE]Varun/v1 lora kernels tuner
#14594 commented on
Jun 23, 2025 • 0 new comments -
First working PoC for bge-m3 sparse embeddings
#14526 commented on
Jun 17, 2025 • 0 new comments -
[#14109][bug] Fix Ray placement group allocation is not respecting env VLLM_RAY_PER_WORKER_GPUS (fractional gpu)
#14521 commented on
Jun 22, 2025 • 0 new comments -
[WIP] Support models with mrope (Qwen2VL) on TPU
#14442 commented on
Jun 19, 2025 • 0 new comments -
[Misc][Minor] Benchmarks: Fix guided decoding, token sampling, and request sorting
#14368 commented on
Jun 23, 2025 • 0 new comments -
[Model] add colqwen2_vl code & inference
#14291 commented on
Jun 19, 2025 • 0 new comments -
[Doc] Create tool_chat_template_llama3.3_json.jinja
#14269 commented on
Jun 23, 2025 • 0 new comments -
Add CUDA kernel for per_token_group_quant_fp8
#14175 commented on
Jun 23, 2025 • 0 new comments -
[bug fix]: benchmark enabling torch profiler in openai chat backend
#14162 commented on
Jun 23, 2025 • 0 new comments -
[Bugfix] Make memory profiler account for speculative draft model weights
#14067 commented on
Jun 20, 2025 • 0 new comments -
[V1] Avoid false positives when warning for unimplemented methods
#14046 commented on
Jun 22, 2025 • 0 new comments -
[Model] [Quantization] Support deepseek v3/r1 w8a8 int8 block-wise quantization
#13942 commented on
Jun 22, 2025 • 0 new comments -
[WIP][Core] Support tensor parallelism with uneven heads
#13934 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Enable speculative decoding for models with nearly-identical vocab sizes
#13849 commented on
Jun 19, 2025 • 0 new comments -
[Misc] support variable remote backend for model loader
#13809 commented on
Jun 23, 2025 • 0 new comments -
[Model] Support VLMs with transformers backend
#13754 commented on
Jun 18, 2025 • 0 new comments -
[Benchmark] Add LongBench to benchmark_serving
#13350 commented on
Jun 23, 2025 • 0 new comments -
[V0][Sampler] Use raw logits for greedy argmax
#13312 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Modify the method of generating random datasets to avoid creating duplicate prompts.
#13159 commented on
Jun 23, 2025 • 0 new comments -
[Misc] add CLI completion
#19669 commented on
Jun 20, 2025 • 0 new comments -
[DO_NOT_REIVEW] autotune
#19375 commented on
Jun 23, 2025 • 0 new comments -
Add GLM4.1V model (Draft)
#19331 commented on
Jun 18, 2025 • 0 new comments -
[V1] [P/D] Add Support for KV Load Failure Recovery
#19330 commented on
Jun 22, 2025 • 0 new comments -
[Bugfix] ROCm FP8 Quantization Padding Issue
#19251 commented on
Jun 19, 2025 • 0 new comments -
[Core] Allow vLLM to stream n tokens at a time
#19240 commented on
Jun 19, 2025 • 0 new comments -
[Misc][Bugfix] specify docker registry to support podman
#19236 commented on
Jun 17, 2025 • 0 new comments -
[Bugfix] Fix Qwen2-Audio chat template for online serving
#19230 commented on
Jun 18, 2025 • 0 new comments -
[Doc]: improve CPU(x86) build instructions and fix include path
#19156 commented on
Jun 17, 2025 • 0 new comments -
Large scale bench
#19128 commented on
Jun 23, 2025 • 0 new comments -
[Tests] V1 EAGLE Tests for Acceptance Rate
#19104 commented on
Jun 23, 2025 • 0 new comments -
[Core] Add constants for CUDA compute capabilities
#19099 commented on
Jun 19, 2025 • 0 new comments -
Fix Incorrect data_parallel_rank and subsequent errors under torchrun
#19096 commented on
Jun 19, 2025 • 0 new comments -
[feature] Integrate quick allreduce into custom allreduce and select the best allreduce implementation
#19094 commented on
Jun 20, 2025 • 0 new comments -
[Bugfix]: Fix DualChunkFlashAttention for short sequences
#19084 commented on
Jun 19, 2025 • 0 new comments -
[BugFix]: Hermes tool parser stream output error in Qwen3 case #19056
#19058 commented on
Jun 19, 2025 • 0 new comments -
Create E=128,N=768,device_name=NVIDIA_A100-PCIE-40GB.json
#19049 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Improve JSON extraction in LlamaToolParser
#19024 commented on
Jun 16, 2025 • 0 new comments -
[DRAFT] Self-Speculative Decoding using LayerSkip
#18994 commented on
Jun 18, 2025 • 0 new comments -
[Benchmark] Add hf_stream arg to enable or disable datasets streaming loading
#18989 commented on
Jun 23, 2025 • 0 new comments -
[Kernel] Fix fp8 support for pplx and BatchedTritonExperts.
#18864 commented on
Jun 19, 2025 • 0 new comments -
feat(model loader): add load format 'prefetch_auto' for parallel mmap…
#19659 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Support offline expert load distribution recording
#19658 commented on
Jun 19, 2025 • 0 new comments -
When numa support is found but size is 0, divide by zero exception
#19654 commented on
Jun 21, 2025 • 0 new comments -
[Kernel] Add Conch Triton Attention backend
#19625 commented on
Jun 19, 2025 • 0 new comments -
Use the correct torch dtype in topk kernel assertion
#19614 commented on
Jun 16, 2025 • 0 new comments -
[Frontend] /metadata: Get more useful server information easily.
#19604 commented on
Jun 22, 2025 • 0 new comments -
[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend.
#19560 commented on
Jun 19, 2025 • 0 new comments -
[Core] Rationalize boolean environment variable handling
#19550 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Register reducer even if transformers_modules not available
#19510 commented on
Jun 16, 2025 • 0 new comments -
[Models] Improve iteration over layers
#19497 commented on
Jun 18, 2025 • 0 new comments -
090
#19488 commented on
Jun 18, 2025 • 0 new comments -
fix: Properly set engine_id when using multi connector in dynamo
#19487 commented on
Jun 23, 2025 • 0 new comments -
[Perf] Improve/Fix-regression for FA3 in High QPS regimes
#19463 commented on
Jun 20, 2025 • 0 new comments -
[Kernel] Integrate IBM/Applied-AI fused moe kernels
#19443 commented on
Jun 18, 2025 • 0 new comments -
[Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client.
#19423 commented on
Jun 23, 2025 • 0 new comments -
Added FP8 support quantization support to DualChunkFlashAttentionBackend
#19420 commented on
Jun 18, 2025 • 0 new comments -
[ROCm][FEAT] Integrate AITER gemm w8a8 ptpc
#19417 commented on
Jun 22, 2025 • 0 new comments -
[PD] Skip `tp_size` exchange with rank0
#19413 commented on
Jun 16, 2025 • 0 new comments -
qwen optimze
#19406 commented on
Jun 19, 2025 • 0 new comments -
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend
#19395 commented on
Jun 17, 2025 • 0 new comments -
[Misc][Benchmark] Fix error on benchmark_moe.py
#18723 commented on
Jun 23, 2025 • 0 new comments -
[Frontend] speed up import time of vllm.config
#18036 commented on
Jun 22, 2025 • 0 new comments -
[Benchmark] fixing profling for benchmark latency
#18035 commented on
Jun 23, 2025 • 0 new comments -
[kernel] integrate permute/unpermute kernel into deepgemm moe
#17934 commented on
Jun 22, 2025 • 0 new comments -
[Core] Parallel multi-modal processor
#17831 commented on
Jun 19, 2025 • 0 new comments -
[Model] Ultravox: Support Llama 4 and Gemma 3 backends
#17818 commented on
Jun 23, 2025 • 0 new comments -
Update registry.py
#17762 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Bf16 data type support for awq quantization
#17705 commented on
Jun 20, 2025 • 0 new comments -
[Misc] Refactor VLM common generation tests to support audio inputs and mix-modality tests
#17633 commented on
Jun 19, 2025 • 0 new comments -
[PERF] Speed up of prepare_inputs / mrope
#17617 commented on
Jun 19, 2025 • 0 new comments -
[Model] 1.58bits BitNet Model Support
#17588 commented on
Jun 23, 2025 • 0 new comments -
[Bugfix][ROCm] Fix incorrect casting in GPTQ GEMM kernel
#17583 commented on
Jun 23, 2025 • 0 new comments -
[Bugfix][Model] vllm-v0 engine run eagle algo with qwen2.5 model, KeyError: 'norm.weight' bugfix
#17518 commented on
Jun 19, 2025 • 0 new comments -
[benchmark][structured output] Add offline benchmark script for structured output
#17440 commented on
Jun 23, 2025 • 0 new comments -
[Frontend] Fix tool_call handling in llama3.1 and llama3.2 chat template to allow zero tool_calls
#17409 commented on
Jun 23, 2025 • 0 new comments -
enable multiple platform device in DP init
#17368 commented on
Jun 20, 2025 • 0 new comments -
[Experiment] Parallel multi-modal processor
#17361 commented on
Jun 19, 2025 • 0 new comments -
[NVIDIA] Support Cutlass w8a8 for Blackwell Geforce GPUs (sm120)
#17280 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] fix phi4-mini tool call parse in streaming mode
#17094 commented on
Jun 17, 2025 • 0 new comments -
[RFC] per module sharded weight tagging
#17001 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Fix the missing '}' issue for nested object parameters in stream function call.
#16919 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Porting triton_kernels for FusedMoE
#18595 commented on
Jun 22, 2025 • 0 new comments -
[Model][Speculative Decoding] Integrate PARD into vLLM
#18541 commented on
Jun 20, 2025 • 0 new comments -
Remove Vision FA warning
#18522 commented on
Jun 19, 2025 • 0 new comments -
Add reorder_batch to TPU V1
#18515 commented on
Jun 19, 2025 • 0 new comments -
[Misc][benchmark] add warmup; add e2el_per_concurrency and throughput; add random_output_ratio
#18475 commented on
Jun 23, 2025 • 0 new comments -
[V1] Support `LLM.apply_model`
#18465 commented on
Jun 23, 2025 • 0 new comments -
[WIP] Two batch overlap
#18415 commented on
Jun 19, 2025 • 0 new comments -
[V1] [Spec decode] Llama4 type eagle support in v1
#18369 commented on
Jun 21, 2025 • 0 new comments -
[Misc] add xgrammar for arm64
#18359 commented on
Jun 16, 2025 • 0 new comments -
[P/D] Fix minor case in example disagg_prefill_proxy_server.py
#18341 commented on
Jun 23, 2025 • 0 new comments -
[Bugfix] Use a different prompt for benchmark_serving.py's test prompt
#18311 commented on
Jun 23, 2025 • 0 new comments -
[Don't merge] Debug failing quantization test with input batch move
#18298 commented on
Jun 19, 2025 • 0 new comments -
[P/D] Support CPU Transfer in NixlConnector
#18293 commented on
Jun 19, 2025 • 0 new comments -
[Model] support dots1
#18254 commented on
Jun 21, 2025 • 0 new comments -
[Frontend] speed up import time of vllm.reasoning
#18236 commented on
Jun 19, 2025 • 0 new comments -
[WIP][Benchmarking] Benchmarking script for v1 attention backends
#18207 commented on
Jun 23, 2025 • 0 new comments -
[Misc] Remove duplicate division check between num_query_heads and num_kv_heads.
#18074 commented on
Jun 23, 2025 • 0 new comments -
[V1] feat: add engine v1 tracing
#18069 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: reasoning_tokens in Chat Completion Response usage
#18067 commented on
Jun 19, 2025 • 0 new comments -
[CI/Build] Allow hermetic builds
#18064 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Model fails to load in background thread in versions >0.8.5
#18816 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Inconsistent outputs with deterministic sampling (temperature=0) when serving Qwen3-32B with vllm-0.8.5
#17759 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: N-gram speculative decoding performs slower than Qwen3-32B-FP8 with vLLM 0.9.0.1
#19254 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Blackwell Enablement for vLLM (SM100)
#18153 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Limit thinking tokens
#15418 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Schema for checking input shapes for multi-modal models
#14764 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: ValueError out of range float values are not json compliant
#19661 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models
#17604 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: gpu-memory-utilization is not exact
#17269 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Can I get the loss of model directly?
#9750 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: vLLM CPU mode broken: Unable to get JIT kernel for brgemm
#10478 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: ValueError: Model architectures ['LlamaForCausalLM'] failed to be inspected
#11715 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Phi-4 tool support
#11985 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: V100 may not support enable-prefix-caching
#13738 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: The accuracy of multiple cards and single card is inconsistent
#13801 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Support DeepEP
#13804 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: using cpu_offload_gb with GGUF failed.
#14096 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: `Invalid attention backend for cuda` with `TORCH_SDPA` better error message
#14320 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Multi GPU inference using two RTX 5090s(TP=2)
#14628 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: EAGLE / DeepSeek MTP Handles First Input Token Incorrectly - 25% Acceptance Rate Drop
#14647 commented on
Jun 18, 2025 • 0 new comments -
[Bug] [ROCm]: RuntimeError: Calling `torch.linalg.cholesky` on a CUDA tensor requires compiling PyTorch with MAGMA. Please use PyTorch built with MAGMA support.
#14914 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: CPU inference won't work for DeepSeek-R1
#15044 commented on
Jun 18, 2025 • 0 new comments -
[New Model]: StableLMAlphaForCausalLM
#15046 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: How to use FlashMLA for DeepSeek-V2
#15079 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: can vllm run DeepSeek R1 inference natively in FP8 on H20 servers?
#15084 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: the issue of "cuda out of memory" arises
#15182 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Extra Characters in `content` When Using `enable_reasoning` with `stop` Parameter
#15188 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Is the logic order correct during the scheduler procedure?
#16982 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: no kernel image is available for execution on the device
#5547 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: prefix-caching: inconsistent completions
#5543 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: with `--enable-prefix-caching` , `/completions` crashes server with `echo=True` above certain prompt length
#5344 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: [V1][Spec Dec] EAGLE TP > 1 leads to errors when using --enforce_eager
#17513 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Image Fails to Initialize (Undetected Platform) because of LD_LIBRARY_PATH, PATH environment error with vllm >= 0.9.0
#19184 commented on
Jun 19, 2025 • 0 new comments -
ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
#13216 commented on
Jun 19, 2025 • 0 new comments -
[New Model]: CSM 1b
#18005 commented on
Jun 19, 2025 • 0 new comments -
[RFC]: Multi-modality Support on vLLM
#4194 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Logits processor extensibility
#17799 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: RuntimeError on RTX 5090: "no kernel image is available for execution on the device
#16901 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: vLLM does not serve text-only version of Llama4
#18022 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Llama4 LoRA support
#16894 commented on
Jun 18, 2025 • 0 new comments -
[New Model]: support for model: jinaai/jina-reranker-m0
#18447 commented on
Jun 18, 2025 • 0 new comments -
[TPU] Supported models for multimodal multi-image inference on TPU?
#18463 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Enhancing vLLM Plugin Architecture
#19161 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: V1 piecewise cudagraph capture size on ROCm is much higher than on cuda
#19579 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Stuck request and empty streaming for gemma3 serving with ^v0.8.5
#17658 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: vLLM sleep experiences segmentation fault when used in TRL
#16993 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Issue of Unstable Output for Identical Queries
#19403 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Fused moe config for NVIDIA RTX 6000 ADA
#17768 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Dual a6000 pros not working. Arch 120.
#19025 commented on
Jun 18, 2025 • 0 new comments -
[Doc]: Newest documentation for engine arguments is significantly worse than v0.8.5 and prior
#18707 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: Full cuda graph for vllm v1
#19607 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: error: Segmentation fault (SIGSEGV received at time)
#6918 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: ValueError: could not broadcast input array from shape (513,) into shape (512,)
#8432 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: stuck at "generating GPU P2P access cache in /home/luban/.cache/vllm/gpu_p2p_access_cache_for_0,1.json"
#8735 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Support for priority preemption with chunked-prefill
#10101 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: mlx-community/DeepSeek-R1-4bit exception: OSError: /data/coding/model-671b-MS/dir does not appear to have a file named configuration_deepseek.py;
#13283 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: terminate called after throwing an instance of 'std::system_error' what(): Operation not permitted
#14416 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Weird output when the server is under high load
#14491 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: 0.74 dev, the error occurred in the gptq_marlin_gemm function call
#14887 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: asyncio.exceptions.CancelledError and engine_client.dead_error
#14994 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: How to use vllm in parallel
#14997 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Failed to Run Qwen2.5-7B with RTX 3070 & CPU Offload (14GB) Despite Sufficient Theoretical Memory
#15004 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: PallasAttentionBackendImpl.__init__() got an unexpected keyword argument 'q_lora_rank'
#15026 commented on
Jun 17, 2025 • 0 new comments -
[Performance]: Speculative Decoder Optimization for Large-Batch Inference Overhead
#15029 commented on
Jun 17, 2025 • 0 new comments -
Precision loss occurs when using the MoE sum kernel.
#15045 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: ValueError: Model architectures ['LlamaForCausalLM'] failed to be inspected
#15058 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Bad requests are not captured as traces
#17528 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: TRACKING ISSUE: CUDA OOM with Logprobs
#5907 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: LLMEngine.add_request can't handle erroneous type of request_id
#19588 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: Optimize parallel sampling by batching add_request calls to avoid split scheduling latency
#16373 commented on
Jun 16, 2025 • 0 new comments -
[Perf]: Support non-contiguous input for `dynamic_scaled_int8_quant` and `dynamic_per_token_scaled_fp8_quant`
#19630 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL-32B, Following weights were not initialized from checkpoint
#15536 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: no kernel image is available for execution on the device with nvidia v100
#19185 commented on
Jun 16, 2025 • 0 new comments -
[Usage]: Request to Include vllm["audio,video"] Package in v0.8.0 Docker Image
#15087 commented on
Jun 18, 2025 • 0 new comments -
[Misc]: Why not sort the waiting queue before calling popleft on it?
#15091 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: The difference between 0.7.3 and 0.8.0
#15092 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: `torch.compile` is turned on, but the model LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct does not support it.
#15093 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: 0.8.0(V1) crash on NCCL when load MoE model on 16 GPUs(H20)
#15098 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: How to use VLLM added functions for torch in a separate environment?
#15108 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Improve GPTQ implementation
#15116 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: BadRequestError(400) when using completions API with stream=true and echo=true
#15119 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError
#15127 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: multi-round QA when using qwen2.5vl with the same input image
#15132 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Configurable metrics export format - Prometheus, OpenTelemetry
#15141 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Error running ShieldGemma: 'guideline' is undefined
#15147 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Can't see NCCL profiling data in nsight sys for expert parallel
#15168 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Failed to initialize the TMA descriptor 700 when using Qwen2.5 72B on H200
#15175 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1
#19131 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Failed to run model Qwen3-30B-A3B on DGX V100x4
#17392 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Deprecating vLLM V0
#18571 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Consider parallel_tool_calls parameter at the API level
#9451 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Phi-3-Small model reporting AttributeError: 'NoneType' object has no attribute 'prefill_metadata'
#19665 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Add Support for thinking_budget for Qwen3 Models
#17887 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Support Gemma 3 QAT series
#16856 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Support for RTX 5090 (CUDA 12.8)
#13306 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: How to let Whisper return timestamps in transcript?
#19556 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: why is speculative decoding slower than normal decoding?
#8439 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Engine stuck with requests are blocked, running/waiting request count and KV cache usage remain constant.
#18431 commented on
Jun 17, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q2 2025
#15735 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: Problems passing base64 values in OpenAI-format requests after starting a model with vLLM
#18890 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: benchmark_serving.py cannot reach the specified generated tokens even with the flag --ignore-eos
#18687 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: no kernel image is available for execution on the device
#19018 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: Docker image Cuda `system has unsupported display driver` error on RTX 2080 Ti
#19445 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: V0.9.0.1 Docker images, NCCL fails with Cuda failure 'operation not supported' on torchrun with vLLM
#19188 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: Illegal memory access on llama4 maverick
#19631 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: vLLM 0.84 (others as well) TypeError: unsupported operand type(s) for *: 'int' and 'NoneType' Mistral 7b
#18972 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: When running phi-4-reasoning-plus with vLLM, the model gets stuck repeating reasoning phrases
#18141 commented on
Jun 22, 2025 • 0 new comments -
[Performance]: benchmark_serving results for Qwen3-32B vs Qwen2-32B-FP8 are almost the same.
#17788 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: Unable to run Jamba 1.6 Large with Tensor Parallelism
#19638 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: Degradation of Qwen/Qwen3-30B-A3B performance depending on batch size
#17652 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: [V1] Tesla T4 does not work with V1
#15853 commented on
Jun 22, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Jun 22, 2025 • 0 new comments -
[Usage]: ValueError: The checkpoint you are trying to load has model type `qwen2_5_vl` but Transformers does not recognize this architecture
#13446 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: "Loading safetensors checkpoint shards" runs twice when serving model
#13765 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: Likely Regression - Was working in v0.6.3.post1, now using response_format parameter with "type": "bool" in v0.7.3: BadRequestError: Error code 400 - {'object': 'error', 'message': 'json_schema_converter.cc:595 Unsupported type bool in schema {"type":"bool"}\n', 'type': 'BadRequestError', 'param': None, 'code': 400}
#13864 commented on
Jun 22, 2025 • 0 new comments -
[Usage]: How to benchmark throughput of DeepSeek-R1-671B on 2 nodes
#15024 commented on
Jun 22, 2025 • 0 new comments -
[Doc]: new attention layer
#15077 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
#15327 commented on
Jun 22, 2025 • 0 new comments -
[Bug]: No available block found in 60 second.
#10661 commented on
Jun 21, 2025 • 0 new comments -
[Bug]: wake up OOM (72B model in 8*A800(40G))
#13941 commented on
Jun 21, 2025 • 0 new comments -
[Bug]: One node exits unexpectedly when run DP on 2 nodes.
#17241 commented on
Jun 21, 2025 • 0 new comments -
[Usage]: how to use prefill-decode disaggregation?
#11490 commented on
Jun 21, 2025 • 0 new comments -
[Feature]: Colocating multiple LLM engines in the same process with sleep mode.
#18975 commented on
Jun 21, 2025 • 0 new comments -
[Usage]: Gemma3 not supported on B200 w/ Flash-Infer
#19584 commented on
Jun 23, 2025 • 0 new comments -
[Usage]: Model `compute_logits` always gets None for `sampling_metadata`
#15115 commented on
Jun 23, 2025 • 0 new comments -
[Usage]: Will dynamo be on vllm main branch?
#15606 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: qwen2.5-72b-instruct-q4_K_S GGUF format Output error
#11160 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: Unsloth bitsandbytes quantized model cannot be run due to: `KeyError: 'layers.42.mlp.down_proj.weight.absmax'`
#10710 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: The service request for vllm064post1 was prematurely terminated, and it could not output a fixed number of tokens.
#13156 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: vllm cannot connect to an external ray cluster
#14349 commented on
Jun 23, 2025 • 0 new comments -
[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none.
#14435 commented on
Jun 23, 2025 • 0 new comments -
[Installation]: Cannot compile vLLM from source on XPU
#14747 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: AssertionError with Speculative Decoding in vLLM Using DeepSeek R1 Distill Qwen Models
#14939 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: Internal Server Error when using Qwen2-VL-7B with vLLM Docker Container
#15110 commented on
Jun 23, 2025 • 0 new comments -
[Usage]: relationship between embedding size and vocab_size
#15131 commented on
Jun 23, 2025 • 0 new comments -
[New Model]: support for model: https://huggingface.co/jinaai/jina-clip-v2
#18448 commented on
Jun 23, 2025 • 0 new comments -
[Feature]: Ability to warm up vLLM instances
#15225 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: working with openai-agents SDK and using Runner.run_streamed() got a function call error
#15256 commented on
Jun 23, 2025 • 0 new comments -
[Feature]: Dynamic Memory Release for GPU after idle time
#15287 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: Crashing on unsupported Sampling params
#15312 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: 0.8.0 and 0.8.1 bugs
#15365 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: VLLM Build Using Docker Error Deploy
#15376 commented on
Jun 23, 2025 • 0 new comments -
[Feature]: Support Top-nσ sampling
#15379 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: Different logprobs output behaviour under vllm 0.8.0 and 0.8.1
#15381 commented on
Jun 23, 2025 • 0 new comments -
[Feature]: Request for Support of Dense and Sparse Features in bge-m3 Embedding Model
#15384 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: AttributeError: Model PixtralForConditionalGeneration does not support BitsAndBytes quantization yet. No 'packed_modules_mapping' found.
#15396 commented on
Jun 23, 2025 • 0 new comments -
[New Model]: Baichuan-Audio
#15425 commented on
Jun 23, 2025 • 0 new comments -
[Bug]: vllm gives 400 bad request with high logprob count on Mistral-small because pydantic flags logits of byte tokens as invalid
#16540 commented on
Jun 23, 2025 • 0 new comments -
[Usage]: Cannot use FA version 2 is not supported due to FA3 is only supported on devices with compute capability >= 8 excluding 8.6 and 8.9
#13766 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: MTP Implementation Inconsistency Between DeepSeek Paper and vllm
#14137 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: Distributed inference not supported with OpenVINO?
#14933 commented on
Jun 20, 2025 • 0 new comments -
[Performance]: only 0.4 tokens/s when running 2 or more requests
#15018 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Capture CudaGraph with LoRA
#15090 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: flash_attn_with_kvcache kernel, an illegal memory access
#15113 commented on
Jun 20, 2025 • 0 new comments -
[RFC]: layer-wise kv cache offloading to enable larger batches
#15123 commented on
Jun 20, 2025 • 0 new comments -
[Performance]: online batch inference faster than offline batch inference
#15178 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: VLLM 0.7.3 with tensor parallelism outputs only exclamation marks when using multiple GPUs
#15194 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: does vllm support dialog prefix continuation?
#15198 commented on
Jun 20, 2025 • 0 new comments -
[Misc][Help]: Adding support for a Custom model with External MoE Routing
#15214 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: How to properly use vllm when serving - KeyError 'text'
#15219 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: `VLLM_USE_V1` + `TORCH_SDPA` regression in v0.8
#15251 commented on
Jun 20, 2025 • 0 new comments -
[Performance]: V0 and V1 give the same throughput number
#15253 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: --tensor-parallel-size Error
#15255 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Inconsistent Output Based on Presence of chat_template Parameter
#15272 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: vLLM declares itself healthy before it can serve requests
#15313 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: looking into adding a generation algorithm
#15315 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: rtx5060ti apply_w8a8_block_fp8_linear
#19596 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: Qwen 3 MoE Lora adapter support.
#18120 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: RuntimeError: The size of tensor a (1059) must match the size of tensor b (376) at non-singleton dimension, DeepSeek R1 H20x16 pp2, v1 engine
#15332 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: `guided_regex` not working on M2 Ultra VLLM
#19676 commented on
Jun 19, 2025 • 0 new comments -
[Usage]: How to use DeepSeek-R1-0528-Qwen3-8B with function call
#19001 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: InternVL3 image dynamic preprocess issue
#19585 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: vllm 0.8.4 whisper possible memory leak?
#16966 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Qwen/Qwen2.5-1.5B-Instruct generates out of vocabulary tokens
#13175 commented on
Jun 21, 2025 • 0 new comments -
[Bug]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'
#3900 commented on
Jun 21, 2025 • 0 new comments -
[Bug]: Qwen1.5-14B-Chat deployed with vllm==0.3.3 on a Tesla V100-PCIE-32GB outputs only exclamation marks, no results
#3998 commented on
Jun 21, 2025 • 0 new comments -
[Bug]: TRACKING ISSUE: `AsyncEngineDeadError`
#5901 commented on
Jun 21, 2025 • 0 new comments -
[Bug]: No available block found in 60 second in shm
#6614 commented on
Jun 21, 2025 • 0 new comments -
[Bug]: There is no module or parameter named '_orig_mod' in Qwen2ForCausalLM
#12783 commented on
Jun 21, 2025 • 0 new comments -
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist()
#19483 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: RTX50xx GPU is not supported for running W8A8 FP8 quant models!
#19605 commented on
Jun 20, 2025 • 0 new comments -
[Installation]: deployment failure on Kubernetes with CPU device (testing).
#17187 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Grammar error: Pointer '/$defs/xxxxx' does not exist
#16467 commented on
Jun 20, 2025 • 0 new comments -
[Feature] [ROCm]: AITER Kernel Integration
#14964 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: support reasoning output when offline batched inference
#17292 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Single-Node data parallel (--data-parallel-size=4) leads to vLLM crash
#18567 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: `uv run vllm serve` with DP results in NCCL error: two ranks use the same device
#17176 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Docker, v0.9.0.1, Gemma3-4B, "Unsupported conversion from f16 to f16" on Nvidia T4
#19203 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Outlines broken on vLLM 0.8+
#15636 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: CPU usage stays at 100% when serving Qwen/Qwen3-32B with the vLLM Docker image
#19150 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: RTX 5090 with vllm/vllm-openai docker image
#16652 commented on
Jun 20, 2025 • 0 new comments -
[Bug] TP=2 fails on dual RTX 5090: TorchInductor compile error or CUDA illegal memory access (TP=1 works)
#18814 commented on
Jun 20, 2025 • 0 new comments -
[New Model]: Support BAAI/bge-reranker-v2-gemma model
#19673 commented on
Jun 20, 2025 • 0 new comments -
Llama3.2 Vision Model: Guides and Issues
#8826 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: Enhance integration with advanced LB/gateways with better load/cost reporting and LoRA management
#10086 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Multi-Node Online Inference on TPUs Failing
#12179 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: Disaggregated Prefill on multi-node & multi-gpu
#13004 commented on
Jun 20, 2025 • 0 new comments