Update the vLLM-ATOM benchmark scope #739
Merged
Normalize model tagging to MET/OOB/AW prefixes and labels, and replace nightly A/B/C date rotation with weekday-based grouping (Mon/Wed MET, Tue/Thu AW, Fri ALL, weekends skipped) for clearer benchmark cadence control.
Add --kv-cache-dtype fp8 for MiniMax-M2.5 AW TP2/4/8 so startup flags match expected cache settings, and always pass --num-warmups=$((2 * CONC)) for vllm bench serve runs to keep warmup load consistent.
Add nightly_group=B for all AW model entries, align all gpt-oss variants to OOT_GPU_MEMORY_UTILIZATION=0.5, and restrict scheduled benchmark cron to weekdays to avoid weekend empty runs.
@zejunchen-zejun please help review the PR again.
zejunchen-zejun approved these changes on May 11, 2026
ATOM vLLM Benchmark Workflow Models
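The current nightly rotation has been updated in ATOM/.github/workflows/atom-vllm-benchmark.yaml:
Monday, Wednesday: A-MET (prefix ends with -met)
Tuesday, Thursday: B-AW (prefix contains -aw-)
Friday: C-ALL (all models)
Saturday, Sunday: SKIP-WEEKEND (no models are selected, which leaves room for Friday's full benchmark run and for manually triggered benchmarks)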
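The weekday rotation above can be sketched in a few lines of Python. This is only an illustration of the listed rules, not the workflow's actual implementation: the function names `nightly_group` and `select_prefixes` and the example prefixes are hypothetical.

```python
from datetime import date

# Weekday index (Monday=0) mapped to the nightly group from the schedule above.
GROUP_BY_WEEKDAY = {
    0: "A-MET", 2: "A-MET",                # Monday, Wednesday
    1: "B-AW", 3: "B-AW",                  # Tuesday, Thursday
    4: "C-ALL",                            # Friday
    5: "SKIP-WEEKEND", 6: "SKIP-WEEKEND",  # Saturday, Sunday
}


def nightly_group(day: date) -> str:
    """Return the nightly group label for a calendar date."""
    return GROUP_BY_WEEKDAY[day.weekday()]


def select_prefixes(prefixes: list[str], group: str) -> list[str]:
    """Filter model prefixes by the rule attached to each group."""
    if group == "A-MET":
        return [p for p in prefixes if p.endswith("-met")]
    if group == "B-AW":
        return [p for p in prefixes if "-aw-" in p]
    if group == "C-ALL":
        return list(prefixes)
    return []  # SKIP-WEEKEND: no models selected


# Hypothetical prefixes, used only to exercise the rules.
EXAMPLE_PREFIXES = [
    "deepseek-r1-fp8-tp8-met",
    "gpt-oss-120b-aw-tp1",
    "glm-5.1-fp8-tp8-oob",
]

if __name__ == "__main__":
    today = date.today()
    group = nightly_group(today)
    print(group, select_prefixes(EXAMPLE_PREFIXES, group))
```

Under these assumptions an OOB-style prefix is only picked up on Friday, because it neither ends with -met nor contains -aw-.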
Matrix rules aligned with workflow:
- […] 1024/1024, 1024/8192, 8192/1024) and apply model-level `supported_input_output_pairs` / `excluded_input_output_pairs`.
- […] (`prefix` contains `-aw-`): force ISL/OSL pairs to 1000/100, 5000/500, 10000/1000.

| Model | Model path | Env vars | ISL/OSL pairs; TP | Server args | Bench backend; bench_args | Random range ratio |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1 FP8 TP8 (MET) | `deepseek-ai/DeepSeek-R1-0528` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1024/1024, 1024/8192, 8192/1024; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| DeepSeek-R1 MXFP4 TP8 (MET) | `amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1024/1024, 1024/8192, 8192/1024; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| gpt-oss-120b TP1 (MET) | `openai/gpt-oss-120b` | `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `GPTOSS_USE_GENERIC_SWIGLU_MXFP4_LAYOUT=1` | 1024/1024, 1024/8192, 8192/1024; TP=1 | `--trust-remote-code --tensor-parallel-size 1 --gpu-memory-utilization 0.5 --max-num-batched-tokens 16384 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| GLM-5.1-FP8 TP8 (OOB) | `zai-org/GLM-5.1-FP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1024/1024, 8192/1024; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --default-chat-template-kwargs '{"enable_thinking":false}' --max-num-batched-tokens 16384 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| Kimi-K2-Thinking-MXFP4 TP4 (MET) | `amd/Kimi-K2-Thinking-MXFP4-AttnFP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1024/1024, 1024/8192, 8192/1024; TP=4 | `--trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| Kimi-K2-Thinking-MXFP4 TP8 (MET) | `amd/Kimi-K2-Thinking-MXFP4-AttnFP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1024/1024, 1024/8192, 8192/1024; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| Kimi-K2.5-MXFP4 TP4 (MET) | `amd/Kimi-K2.5-MXFP4-AttnFP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1024/1024, 1024/8192, 8192/1024; TP=4 | `--trust-remote-code --tensor-parallel-size 4` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| Qwen3.5-397B-A17B-FP8 TP8 (OOB) | `Qwen/Qwen3.5-397B-A17B-FP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `ATOM_USE_CUSTOM_ALL_GATHER=0` | 1024/1024, 8192/1024; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --attention-backend ROCM_AITER_FA --gpu-memory-utilization 0.8 --max-num-batched-tokens 16384 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| Qwen3.5-397B-A17B-FP8 TP4 (OOB) | `Qwen/Qwen3.5-397B-A17B-FP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `ATOM_USE_CUSTOM_ALL_GATHER=0` | 1024/1024, 8192/1024; TP=4 | `--trust-remote-code --tensor-parallel-size 4 --attention-backend ROCM_AITER_FA --gpu-memory-utilization 0.8 --max-num-batched-tokens 16384 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| Qwen3.5-397B-A17B TP8 (OOB) | `Qwen/Qwen3.5-397B-A17B` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `ATOM_USE_CUSTOM_ALL_GATHER=0` | 1024/1024, 8192/1024; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --attention-backend ROCM_AITER_FA --gpu-memory-utilization 0.8 --max-num-batched-tokens 16384 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP1 (MET) | `Qwen/Qwen3-Next-80B-A3B-Instruct-FP8` | `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `ATOM_USE_CUSTOM_ALL_GATHER=0`, `ATOM_USE_FLYDSL_GDR=0` | 1024/1024, 1024/8192, 8192/1024; TP=1 | `--trust-remote-code --tensor-parallel-size 1 --max-num-batched-tokens 32768 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP4 (MET) | `Qwen/Qwen3-Next-80B-A3B-Instruct-FP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `ATOM_USE_CUSTOM_ALL_GATHER=0`, `ATOM_USE_FLYDSL_GDR=0` | 1024/1024, 1024/8192, 8192/1024; TP=4 | `--trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 32768 --max-model-len 16384` | `benchmark_serving.py`; bench_args=(none) | `${RANDOM_RANGE_RATIO}` (default 0.8) |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP1 (AW) | `Qwen/Qwen3-Next-80B-A3B-Instruct-FP8` | `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `ATOM_USE_CUSTOM_ALL_GATHER=0`, `ATOM_USE_FLYDSL_GDR=0` | 1000/100, 5000/500, 10000/1000; TP=1 | `--trust-remote-code --tensor-parallel-size 1 --max-num-batched-tokens 32768 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP2 (AW) | `Qwen/Qwen3-Next-80B-A3B-Instruct-FP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `ATOM_USE_CUSTOM_ALL_GATHER=0`, `ATOM_USE_FLYDSL_GDR=0` | 1000/100, 5000/500, 10000/1000; TP=2 | `--trust-remote-code --tensor-parallel-size 2 --max-num-batched-tokens 32768 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP4 (AW) | `Qwen/Qwen3-Next-80B-A3B-Instruct-FP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `ATOM_USE_CUSTOM_ALL_GATHER=0`, `ATOM_USE_FLYDSL_GDR=0` | 1000/100, 5000/500, 10000/1000; TP=4 | `--trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 32768 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| DeepSeek-V3.2 FP8 TP4 (AW) | `deepseek-ai/DeepSeek-V3.2` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1000/100, 5000/500, 10000/1000; TP=4 | `--trust-remote-code --tensor-parallel-size 4 --kv-cache-dtype auto --block-size 1 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| DeepSeek-V3.2 FP8 TP8 (AW) | `deepseek-ai/DeepSeek-V3.2` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1000/100, 5000/500, 10000/1000; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --block-size 1 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| GLM-4.7-FP8 TP4 (AW) | `zai-org/GLM-4.7-FP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1000/100, 5000/500, 10000/1000; TP=4 | `--trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| GLM-4.7-FP8 TP8 (AW) | `zai-org/GLM-4.7-FP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1000/100, 5000/500, 10000/1000; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| gpt-oss-120b TP1 (AW) | `openai/gpt-oss-120b` | `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `GPTOSS_USE_GENERIC_SWIGLU_MXFP4_LAYOUT=1` | 1000/100, 5000/500, 10000/1000; TP=1 | `--trust-remote-code --tensor-parallel-size 1 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| gpt-oss-120b TP2 (AW) | `openai/gpt-oss-120b` | `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `GPTOSS_USE_GENERIC_SWIGLU_MXFP4_LAYOUT=1` | 1000/100, 5000/500, 10000/1000; TP=2 | `--trust-remote-code --tensor-parallel-size 2 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| gpt-oss-120b TP8 (AW) | `openai/gpt-oss-120b` | `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1`, `GPTOSS_USE_GENERIC_SWIGLU_MXFP4_LAYOUT=1`, `OOT_GPU_MEMORY_UTILIZATION=0.5` | 1000/100, 5000/500, 10000/1000; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| Kimi-K2.5-MXFP4 TP4 (AW) | `amd/Kimi-K2.5-MXFP4-AttnFP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1000/100, 5000/500, 10000/1000; TP=4 | `--trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| Kimi-K2.5-MXFP4 TP8 (AW) | `amd/Kimi-K2.5-MXFP4-AttnFP8` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4` | 1000/100, 5000/500, 10000/1000; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| MiniMax-M2.5 TP2 (AW) | `MiniMaxAI/MiniMax-M2.5` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1` | 1000/100, 5000/500, 10000/1000; TP=2 | `--trust-remote-code --tensor-parallel-size 2 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| MiniMax-M2.5 TP4 (AW) | `MiniMaxAI/MiniMax-M2.5` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1` | 1000/100, 5000/500, 10000/1000; TP=4 | `--trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
| MiniMax-M2.5 TP8 (AW) | `MiniMaxAI/MiniMax-M2.5` | `AITER_QUICK_REDUCE_QUANTIZATION=INT4`, `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1` | 1000/100, 5000/500, 10000/1000; TP=8 | `--trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384` | `vllm bench serve`; bench_args=`--random-range-ratio 1` | 1 (bench_args override) |
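
The matrix rules stated before the model list can be approximated with the Python sketch below. It rests on assumptions: the opening of the first rule is cut off in the source, so treating it as the non-AW case is a guess; `supported_input_output_pairs` is treated as an allow-list and `excluded_input_output_pairs` as a deny-list; the function name `resolve_pairs` and the example entries are hypothetical.

```python
# Default ISL/OSL pairs and the pairs forced for AW entries, as listed in the rules.
DEFAULT_PAIRS = ["1024/1024", "1024/8192", "8192/1024"]
AW_PAIRS = ["1000/100", "5000/500", "10000/1000"]


def resolve_pairs(model: dict) -> list[str]:
    """Return the ISL/OSL pairs to benchmark for one model entry."""
    # AW entries (prefix contains "-aw-") always get the forced pairs.
    if "-aw-" in model["prefix"]:
        return list(AW_PAIRS)

    pairs = list(DEFAULT_PAIRS)

    # Assumption: a non-empty supported list narrows the defaults.
    supported = model.get("supported_input_output_pairs")
    if supported:
        pairs = [p for p in pairs if p in supported]

    # Assumption: excluded pairs are removed after the supported filter.
    excluded = model.get("excluded_input_output_pairs") or []
    return [p for p in pairs if p not in excluded]


if __name__ == "__main__":
    # Hypothetical entries; only the field names come from the rules.
    oob_like = {
        "prefix": "example-tp8-oob",
        "excluded_input_output_pairs": ["1024/8192"],
    }
    aw_like = {"prefix": "example-aw-tp4"}
    print(resolve_pairs(oob_like))
    print(resolve_pairs(aw_like))
```

With an exclusion of 1024/8192, the first entry resolves to the two pairs that the OOB rows list (1024/1024 and 8192/1024), which is consistent with the exclusion reading but does not confirm it.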