Description
🚀 The feature, motivation and pitch
Currently, `moe_align_block_size` is a performance bottleneck; in some shapes it is even slower than `fused_moe_kernel` itself.

I think we can reuse the kernel from SGLang (SGL) to optimize it as a first step (making it the default implementation). I will open a PR for this later.
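For context, here is a minimal pure-PyTorch sketch of what `moe_align_block_size` conceptually computes (an illustrative reference only, not vLLM's actual CUDA/Triton kernel; the helper name and exact signature below are assumptions): it groups the flattened token-to-expert assignments by expert and pads each expert's group up to a multiple of `block_size`, so that every block handed to `fused_moe_kernel` maps to a single expert.

```python
import torch

def moe_align_block_size_ref(topk_ids: torch.Tensor, block_size: int, num_experts: int):
    """Reference sketch (assumed semantics) of moe_align_block_size."""
    # Flatten (num_tokens, topk) -> one expert id per (token, slot) pair.
    flat = topk_ids.flatten()

    # Count assignments per expert, then pad each count up to a multiple of block_size.
    counts = torch.bincount(flat, minlength=num_experts)
    padded = ((counts + block_size - 1) // block_size) * block_size
    num_tokens_post_padded = int(padded.sum())

    # sorted_token_ids: indices into `flat`, grouped by expert; padding slots are
    # filled with flat.numel() so the GEMM kernel can recognize and skip them.
    sorted_token_ids = torch.full(
        (num_tokens_post_padded,), flat.numel(), dtype=torch.int32, device=flat.device
    )
    # expert_ids: which expert each block of block_size rows belongs to.
    expert_ids = torch.empty(
        num_tokens_post_padded // block_size, dtype=torch.int32, device=flat.device
    )

    offset = 0
    for e in range(num_experts):
        idx = torch.nonzero(flat == e, as_tuple=False).flatten().to(torch.int32)
        sorted_token_ids[offset : offset + idx.numel()] = idx
        expert_ids[offset // block_size : (offset + int(padded[e])) // block_size] = e
        offset += int(padded[e])

    return sorted_token_ids, expert_ids, num_tokens_post_padded
```

The expensive part is exactly this grouping/padding bookkeeping, which is why a faster kernel for it pays off even though the work per token is tiny.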
Benchmark test
moe-align-block-size-performance:
| num_tokens | num_experts | topk | VLLM | SGL |
|---|---|---|---|---|
| 1 | 16 | 1 | 20.064000 | 21.663999 |
| 1 | 16 | 2 | 19.616000 | 22.272000 |
| 1 | 16 | 8 | 19.616000 | 21.695999 |
| 1 | 64 | 1 | 25.632000 | 27.680000 |
| 1 | 64 | 2 | 26.384000 | 27.680000 |
| 1 | 64 | 8 | 26.368000 | 27.680000 |
| 1 | 224 | 1 | 62.463999 | 32.768000 |
| 1 | 224 | 2 | 62.463999 | 32.864001 |
| 1 | 224 | 8 | 62.463999 | 32.896001 |
| 1 | 256 | 1 | 71.872003 | 31.744000 |
| 1 | 256 | 2 | 71.744002 | 31.776000 |
| 1 | 256 | 8 | 71.744002 | 31.808000 |
| 16 | 16 | 1 | 21.344000 | 23.615999 |
| 16 | 16 | 2 | 21.376001 | 23.584001 |
| 16 | 16 | 8 | 21.344000 | 23.712000 |
| 16 | 64 | 1 | 25.632000 | 28.720001 |
| 16 | 64 | 2 | 25.632000 | 28.672000 |
| 16 | 64 | 8 | 26.400000 | 29.792000 |
| 16 | 224 | 1 | 62.463999 | 32.800000 |
| 16 | 224 | 2 | 62.463999 | 32.864001 |
| 16 | 224 | 8 | 62.463999 | 32.800000 |
| 16 | 256 | 1 | 72.704002 | 33.856001 |
| 16 | 256 | 2 | 72.704002 | 33.824001 |
| 16 | 256 | 8 | 72.704002 | 33.824001 |
| 256 | 16 | 1 | 19.487999 | 25.536001 |
| 256 | 16 | 2 | 21.632001 | 29.568000 |
| 256 | 16 | 8 | 33.792000 | 28.384000 |
| 256 | 64 | 1 | 27.648000 | 30.432000 |
| 256 | 64 | 2 | 27.648000 | 31.776000 |
| 256 | 64 | 8 | 31.776000 | 25.728000 |
| 256 | 224 | 1 | 62.463999 | 32.832000 |
| 256 | 224 | 2 | 62.463999 | 31.776000 |
| 256 | 224 | 8 | 62.560000 | 30.751999 |
| 256 | 256 | 1 | 70.720002 | 31.872001 |
| 256 | 256 | 2 | 72.704002 | 33.631999 |
| 256 | 256 | 8 | 74.752003 | 33.824001 |
| 4096 | 16 | 1 | 48.128001 | 27.680000 |
| 4096 | 16 | 2 | 74.784003 | 27.744001 |
| 4096 | 16 | 8 | 261.119992 | 44.128001 |
| 4096 | 64 | 1 | 44.096000 | 27.872000 |
| 4096 | 64 | 2 | 66.624001 | 33.599999 |
| 4096 | 64 | 8 | 192.959994 | 48.191998 |
| 4096 | 224 | 1 | 68.672001 | 33.760000 |
| 4096 | 224 | 2 | 80.911998 | 38.943999 |
| 4096 | 224 | 8 | 167.776003 | 54.272000 |
| 4096 | 256 | 1 | 76.800004 | 33.824001 |
| 4096 | 256 | 2 | 85.967999 | 37.888002 |
| 4096 | 256 | 8 | 131.007999 | 52.223999 |
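For reference, numbers like these can be produced with a small harness around `triton.testing.do_bench`. The sketch below assumes the two implementations are importable as `vllm_moe_align` and `sgl_moe_align` (placeholder names, not the real entry points), and uses random expert assignments rather than a real router top-k.

```python
# Hedged benchmark sketch; vllm_moe_align / sgl_moe_align are placeholder
# callables standing in for the vLLM and SGL moe_align_block_size kernels.
import torch
import triton

def bench_one(num_tokens: int, num_experts: int, topk: int, block_size: int = 128):
    # Random assignments for illustration; a real top-k would not repeat
    # experts within a token.
    topk_ids = torch.randint(
        0, num_experts, (num_tokens, topk), dtype=torch.int32, device="cuda"
    )
    # do_bench reports a runtime estimate in milliseconds.
    t_vllm = triton.testing.do_bench(lambda: vllm_moe_align(topk_ids, block_size, num_experts))
    t_sgl = triton.testing.do_bench(lambda: sgl_moe_align(topk_ids, block_size, num_experts))
    return t_vllm, t_sgl

for num_tokens in (1, 16, 256, 4096):
    for num_experts in (16, 64, 224, 256):
        for topk in (1, 2, 8):
            print(num_tokens, num_experts, topk, *bench_one(num_tokens, num_experts, topk))
```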
Throughput (fp16)
`vllm bench throughput --model Qwen/Qwen3-30B-A3B --load-format dummy --input-len 1000 --output-len 100`
Throughput: 46.03 requests/s, 50547.85 total tokens/s, 4603.43 output tokens/s (B200, current moe_align_block_size)
Throughput: 47.63 requests/s, 52312.68 total tokens/s, 4762.52 output tokens/s (B200, SGL moe_align_block_size)
Throughput (fp8)
`vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100`
Throughput: 42.28 requests/s, 46424.61 total tokens/s, 4228.17 output tokens/s (B200, current moe_align_block_size)
Throughput: 44.17 requests/s, 48497.60 total tokens/s, 4417.43 output tokens/s (B200, SGL moe_align_block_size)

Overall, the SGL kernel gives roughly a 3.5% end-to-end throughput gain for fp16 and roughly 4.5% for fp8 on B200.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.