
Update the vLLM-ATOM benchmark scope #739

Merged: zejunchen-zejun merged 5 commits into main from hattie/add_workload on May 11, 2026

Conversation

@wuhuikx
Collaborator

@wuhuikx wuhuikx commented May 11, 2026

ATOM vLLM Benchmark Workflow Models

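The nightly rotation has been updated in ATOM/.github/workflows/atom-vllm-benchmark.yaml:

Monday, Wednesday: A-MET (prefix ends with -met)
Tuesday, Thursday: B-AW (prefix contains -aw-)
Friday: C-ALL (all models)
Saturday, Sunday: SKIP-WEEKEND (no models are selected; the weekend is left for Friday's full benchmark run to carry over and for manually triggered benchmarks)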
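
The weekday rotation above can be sketched in Python as follows. This is only an illustration of the rule as described, not the workflow's actual implementation: the helper names and the example prefixes are hypothetical, and the real job may evaluate the weekday in a different time zone (for example, UTC on the runner).

```python
from datetime import date

# Weekday index follows date.weekday(): Monday is 0, Sunday is 6.
NIGHTLY_GROUP_BY_WEEKDAY = {
    0: "A-MET",
    1: "B-AW",
    2: "A-MET",
    3: "B-AW",
    4: "C-ALL",
    5: "SKIP-WEEKEND",
    6: "SKIP-WEEKEND",
}


def nightly_group(day: date) -> str:
    """Return the nightly group label for a calendar day."""
    return NIGHTLY_GROUP_BY_WEEKDAY[day.weekday()]


def select_models(prefixes, group):
    """Filter model prefixes according to the group's prefix rule."""
    if group == "A-MET":
        return [p for p in prefixes if p.endswith("-met")]
    if group == "B-AW":
        return [p for p in prefixes if "-aw-" in p]
    if group == "C-ALL":
        return list(prefixes)
    # SKIP-WEEKEND (or any unknown label) selects nothing.
    return []


# Hypothetical prefixes, for illustration only.
example_prefixes = ["model-x-tp8-met", "model-y-aw-tp4", "model-z-tp8-oob"]
todays_models = select_models(example_prefixes, nightly_group(date.today()))
```

Note that under this rule a prefix that neither ends with -met nor contains -aw- (such as an OOB entry) is only picked up by the Friday C-ALL run.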

Matrix rules aligned with workflow:

  • Non-AW models: use default workflow param pairs (1024/1024, 1024/8192, 8192/1024) and apply model-level supported_input_output_pairs / excluded_input_output_pairs.
  • AW models (prefix contains -aw-): force ISL/OSL pairs to 1000/100, 5000/500, 10000/1000.
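
As a rough sketch of the matrix rules above, the ISL/OSL pairs for one model entry could be resolved like this. The function name, the `prefix` key, and the shape of `supported_input_output_pairs` / `excluded_input_output_pairs` (lists of `[isl, osl]` pairs, with the supported list acting as an allow-list over the defaults) are assumptions made for illustration; the workflow may interpret these fields differently.

```python
DEFAULT_PAIRS = [(1024, 1024), (1024, 8192), (8192, 1024)]
AW_PAIRS = [(1000, 100), (5000, 500), (10000, 1000)]


def resolve_pairs(model: dict) -> list:
    """Return the (ISL, OSL) pairs to benchmark for one model entry."""
    # AW models ignore the defaults and any per-model filters.
    if "-aw-" in model["prefix"]:
        return list(AW_PAIRS)

    pairs = list(DEFAULT_PAIRS)

    # Assumed semantics: an allow-list narrows the default pairs.
    supported = model.get("supported_input_output_pairs")
    if supported is not None:
        allowed = {tuple(p) for p in supported}
        pairs = [p for p in pairs if p in allowed]

    # Assumed semantics: excluded pairs are dropped from what remains.
    excluded = {tuple(p) for p in model.get("excluded_input_output_pairs", [])}
    return [p for p in pairs if p not in excluded]


# Hypothetical entries, for illustration only.
oob_model = {"prefix": "model-z-tp8-oob", "excluded_input_output_pairs": [[1024, 8192]]}
aw_model = {"prefix": "model-y-aw-tp4"}
```

With the hypothetical `oob_model` entry, the result is the two-pair set 1024/1024 and 8192/1024, which is the shape the OOB rows in the table use.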

| Current Model | Weight Link | Env Vars (env_vars) | ISL/OSL/TP | Serve Args (extra_args) | Benchmark Args | Client Ratio |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1 FP8 TP8 (MET) | deepseek-ai/DeepSeek-R1-0528 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1024/1024, 1024/8192, 8192/1024; TP=8 | --trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| DeepSeek-R1 MXFP4 TP8 (MET) | amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1024/1024, 1024/8192, 8192/1024; TP=8 | --trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| gpt-oss-120b TP1 (MET) | openai/gpt-oss-120b | ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>GPTOSS_USE_GENERIC_SWIGLU_MXFP4_LAYOUT=1 | 1024/1024, 1024/8192, 8192/1024; TP=1 | --trust-remote-code --tensor-parallel-size 1 --gpu-memory-utilization 0.5 --max-num-batched-tokens 16384 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| GLM-5.1-FP8 TP8 (OOB) | zai-org/GLM-5.1-FP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1024/1024, 8192/1024; TP=8 | --trust-remote-code --tensor-parallel-size 8 --default-chat-template-kwargs '{"enable_thinking":false}' --max-num-batched-tokens 16384 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| Kimi-K2-Thinking-MXFP4 TP4 (MET) | amd/Kimi-K2-Thinking-MXFP4-AttnFP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1024/1024, 1024/8192, 8192/1024; TP=4 | --trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| Kimi-K2-Thinking-MXFP4 TP8 (MET) | amd/Kimi-K2-Thinking-MXFP4-AttnFP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1024/1024, 1024/8192, 8192/1024; TP=8 | --trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| Kimi-K2.5-MXFP4 TP4 (MET) | amd/Kimi-K2.5-MXFP4-AttnFP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1024/1024, 1024/8192, 8192/1024; TP=4 | --trust-remote-code --tensor-parallel-size 4 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| Qwen3.5-397B-A17B-FP8 TP8 (OOB) | Qwen/Qwen3.5-397B-A17B-FP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4<br>ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>ATOM_USE_CUSTOM_ALL_GATHER=0 | 1024/1024, 8192/1024; TP=8 | --trust-remote-code --tensor-parallel-size 8 --attention-backend ROCM_AITER_FA --gpu-memory-utilization 0.8 --max-num-batched-tokens 16384 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| Qwen3.5-397B-A17B-FP8 TP4 (OOB) | Qwen/Qwen3.5-397B-A17B-FP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4<br>ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>ATOM_USE_CUSTOM_ALL_GATHER=0 | 1024/1024, 8192/1024; TP=4 | --trust-remote-code --tensor-parallel-size 4 --attention-backend ROCM_AITER_FA --gpu-memory-utilization 0.8 --max-num-batched-tokens 16384 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| Qwen3.5-397B-A17B TP8 (OOB) | Qwen/Qwen3.5-397B-A17B | AITER_QUICK_REDUCE_QUANTIZATION=INT4<br>ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>ATOM_USE_CUSTOM_ALL_GATHER=0 | 1024/1024, 8192/1024; TP=8 | --trust-remote-code --tensor-parallel-size 8 --attention-backend ROCM_AITER_FA --gpu-memory-utilization 0.8 --max-num-batched-tokens 16384 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP1 (MET) | Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>ATOM_USE_CUSTOM_ALL_GATHER=0<br>ATOM_USE_FLYDSL_GDR=0 | 1024/1024, 1024/8192, 8192/1024; TP=1 | --trust-remote-code --tensor-parallel-size 1 --max-num-batched-tokens 32768 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP4 (MET) | Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4<br>ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>ATOM_USE_CUSTOM_ALL_GATHER=0<br>ATOM_USE_FLYDSL_GDR=0 | 1024/1024, 1024/8192, 8192/1024; TP=4 | --trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 32768 --max-model-len 16384 | benchmark_serving.py; bench_args=(none) | ${RANDOM_RANGE_RATIO} (default 0.8) |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP1 (AW) | Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>ATOM_USE_CUSTOM_ALL_GATHER=0<br>ATOM_USE_FLYDSL_GDR=0 | 1000/100, 5000/500, 10000/1000; TP=1 | --trust-remote-code --tensor-parallel-size 1 --max-num-batched-tokens 32768 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP2 (AW) | Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4<br>ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>ATOM_USE_CUSTOM_ALL_GATHER=0<br>ATOM_USE_FLYDSL_GDR=0 | 1000/100, 5000/500, 10000/1000; TP=2 | --trust-remote-code --tensor-parallel-size 2 --max-num-batched-tokens 32768 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP4 (AW) | Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4<br>ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>ATOM_USE_CUSTOM_ALL_GATHER=0<br>ATOM_USE_FLYDSL_GDR=0 | 1000/100, 5000/500, 10000/1000; TP=4 | --trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 32768 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| DeepSeek-V3.2 FP8 TP4 (AW) | deepseek-ai/DeepSeek-V3.2 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1000/100, 5000/500, 10000/1000; TP=4 | --trust-remote-code --tensor-parallel-size 4 --kv-cache-dtype auto --block-size 1 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| DeepSeek-V3.2 FP8 TP8 (AW) | deepseek-ai/DeepSeek-V3.2 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1000/100, 5000/500, 10000/1000; TP=8 | --trust-remote-code --tensor-parallel-size 8 --block-size 1 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| GLM-4.7-FP8 TP4 (AW) | zai-org/GLM-4.7-FP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1000/100, 5000/500, 10000/1000; TP=4 | --trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| GLM-4.7-FP8 TP8 (AW) | zai-org/GLM-4.7-FP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1000/100, 5000/500, 10000/1000; TP=8 | --trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| gpt-oss-120b TP1 (AW) | openai/gpt-oss-120b | ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>GPTOSS_USE_GENERIC_SWIGLU_MXFP4_LAYOUT=1 | 1000/100, 5000/500, 10000/1000; TP=1 | --trust-remote-code --tensor-parallel-size 1 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| gpt-oss-120b TP2 (AW) | openai/gpt-oss-120b | ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>GPTOSS_USE_GENERIC_SWIGLU_MXFP4_LAYOUT=1 | 1000/100, 5000/500, 10000/1000; TP=2 | --trust-remote-code --tensor-parallel-size 2 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| gpt-oss-120b TP8 (AW) | openai/gpt-oss-120b | ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1<br>GPTOSS_USE_GENERIC_SWIGLU_MXFP4_LAYOUT=1<br>OOT_GPU_MEMORY_UTILIZATION=0.5 | 1000/100, 5000/500, 10000/1000; TP=8 | --trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| Kimi-K2.5-MXFP4 TP4 (AW) | amd/Kimi-K2.5-MXFP4-AttnFP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1000/100, 5000/500, 10000/1000; TP=4 | --trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| Kimi-K2.5-MXFP4 TP8 (AW) | amd/Kimi-K2.5-MXFP4-AttnFP8 | AITER_QUICK_REDUCE_QUANTIZATION=INT4 | 1000/100, 5000/500, 10000/1000; TP=8 | --trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| MiniMax-M2.5 TP2 (AW) | MiniMaxAI/MiniMax-M2.5 | AITER_QUICK_REDUCE_QUANTIZATION=INT4<br>ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 | 1000/100, 5000/500, 10000/1000; TP=2 | --trust-remote-code --tensor-parallel-size 2 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| MiniMax-M2.5 TP4 (AW) | MiniMaxAI/MiniMax-M2.5 | AITER_QUICK_REDUCE_QUANTIZATION=INT4<br>ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 | 1000/100, 5000/500, 10000/1000; TP=4 | --trust-remote-code --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |
| MiniMax-M2.5 TP8 (AW) | MiniMaxAI/MiniMax-M2.5 | AITER_QUICK_REDUCE_QUANTIZATION=INT4<br>ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 | 1000/100, 5000/500, 10000/1000; TP=8 | --trust-remote-code --tensor-parallel-size 8 --max-num-batched-tokens 16384 --max-model-len 16384 | vllm bench serve; bench_args=--random-range-ratio 1 | 1 (bench_args override) |

@wuhuikx wuhuikx requested a review from zejunchen-zejun May 11, 2026 03:33
Normalize model tagging to MET/OOB/AW prefixes and labels, and replace nightly A/B/C date rotation with weekday-based grouping (Mon/Wed MET, Tue/Thu AW, Fri ALL, weekends skipped) for clearer benchmark cadence control.
@zejunchen-zejun
Collaborator

  1. The MiniMax launch arguments are missing --kv-cache-dtype fp8.
  2. The vllm bench command also has a warmup parameter; should it be set to twice the concurrency as well?

Add --kv-cache-dtype fp8 for MiniMax-M2.5 AW TP2/4/8 so startup flags match expected cache settings, and always pass --num-warmups=$((2 * CONC)) for vllm bench serve runs to keep warmup load consistent.
@wuhuikx
Collaborator Author

wuhuikx commented May 11, 2026

> 1. The MiniMax launch arguments are missing --kv-cache-dtype fp8.
> 2. The vllm bench command also has a warmup parameter; should it be set to twice the concurrency as well?

  1. MiniMax startup args:
    I added --kv-cache-dtype fp8 for MiniMax-M2.5 AW TP2/TP4/TP8 in oot_benchmark_models.json.
  2. vllm bench warmup:
    I set warmup to 2x concurrency by adding:
    --num-warmups "$(( 2 * CONC ))"
    to the vllm bench serve path in the workflow.
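
For readers who script the benchmark outside the workflow, the same 2x rule might be expressed in Python along these lines. The helper name is hypothetical and the sketch only builds the argument fragment; it assumes, as stated above, that `vllm bench serve` accepts `--num-warmups`, and it does not reproduce the rest of the workflow's command line.

```python
def bench_serve_warmup_args(concurrency: int) -> list:
    """Build the warmup arguments for a `vllm bench serve` invocation.

    Mirrors the shell expression --num-warmups "$(( 2 * CONC ))".
    """
    if concurrency < 1:
        raise ValueError("concurrency must be a positive integer")
    return ["--num-warmups", str(2 * concurrency)]


# Example: extend a base command (kept minimal here) for concurrency 16.
command = ["vllm", "bench", "serve"] + bench_serve_warmup_args(16)
```

Tying the warmup count to concurrency keeps the warmup load proportional across the concurrency sweep, which is the consistency the review comment asked for.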

wuhuikx added 2 commits May 11, 2026 01:57
Add nightly_group=B for all AW model entries, align all gpt-oss variants to OOT_GPU_MEMORY_UTILIZATION=0.5, and restrict scheduled benchmark cron to weekdays to avoid weekend empty runs.
@wuhuikx
Collaborator Author

wuhuikx commented May 11, 2026

@zejunchen-zejun please help review the PR again.

@zejunchen-zejun zejunchen-zejun merged commit 50461c8 into main May 11, 2026
22 of 28 checks passed
@zejunchen-zejun zejunchen-zejun deleted the hattie/add_workload branch May 11, 2026 08:17
2 participants