[Feature]: Optimize moe_align_block_size CUDA kernel #19517

Closed
@yewentao256

Description

🚀 The feature, motivation and pitch

Currently, moe_align_block_size is a performance bottleneck; it is even slower than fused_moe_kernel.

As a first step, I think we can reuse the kernel from SGL and make it the default one. I will open a PR for this later.
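For context, here is a rough pure-Python sketch of what moe_align_block_size computes, as I understand it: token slots are grouped by expert, and each group is padded to a multiple of the block size so the fused MoE kernel can work on whole blocks. The name `moe_align_block_size_ref` and the list-based interface are made up for illustration; the real op is a CUDA kernel that writes into preallocated tensors, so treat this as a description of the semantics only.

```python
def moe_align_block_size_ref(topk_ids, block_size, num_experts):
    """Hypothetical reference; topk_ids is a flat list of expert indices."""
    numel = len(topk_ids)

    # Bucket flat token-slot indices by the expert they were routed to.
    buckets = [[] for _ in range(num_experts)]
    for slot, expert in enumerate(topk_ids):
        buckets[expert].append(slot)

    sorted_token_ids = []
    expert_ids = []
    for expert, slots in enumerate(buckets):
        # Round this expert's slot count up to a multiple of block_size.
        num_blocks = -(-len(slots) // block_size)
        padded = num_blocks * block_size
        # Padding entries use `numel`, an index one past the last real slot
        # (assumed sentinel convention).
        sorted_token_ids.extend(slots + [numel] * (padded - len(slots)))
        # One expert id per block of sorted_token_ids.
        expert_ids.extend([expert] * num_blocks)

    num_tokens_post_pad = len(sorted_token_ids)
    return sorted_token_ids, expert_ids, num_tokens_post_pad


# Four token slots routed to three experts, block size 2.
sorted_ids, expert_ids, total = moe_align_block_size_ref([2, 0, 2, 1], 2, 3)
print(sorted_ids, expert_ids, total)
```

The work is essentially a histogram, a prefix sum and a scatter.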

Benchmark test

moe-align-block-size-performance:
    num_tokens  num_experts  topk        VLLM        SGL
0          1.0         16.0   1.0   20.064000  21.663999
1          1.0         16.0   2.0   19.616000  22.272000
2          1.0         16.0   8.0   19.616000  21.695999
3          1.0         64.0   1.0   25.632000  27.680000
4          1.0         64.0   2.0   26.384000  27.680000
5          1.0         64.0   8.0   26.368000  27.680000
6          1.0        224.0   1.0   62.463999  32.768000
7          1.0        224.0   2.0   62.463999  32.864001
8          1.0        224.0   8.0   62.463999  32.896001
9          1.0        256.0   1.0   71.872003  31.744000
10         1.0        256.0   2.0   71.744002  31.776000
11         1.0        256.0   8.0   71.744002  31.808000
12        16.0         16.0   1.0   21.344000  23.615999
13        16.0         16.0   2.0   21.376001  23.584001
14        16.0         16.0   8.0   21.344000  23.712000
15        16.0         64.0   1.0   25.632000  28.720001
16        16.0         64.0   2.0   25.632000  28.672000
17        16.0         64.0   8.0   26.400000  29.792000
18        16.0        224.0   1.0   62.463999  32.800000
19        16.0        224.0   2.0   62.463999  32.864001
20        16.0        224.0   8.0   62.463999  32.800000
21        16.0        256.0   1.0   72.704002  33.856001
22        16.0        256.0   2.0   72.704002  33.824001
23        16.0        256.0   8.0   72.704002  33.824001
24       256.0         16.0   1.0   19.487999  25.536001
25       256.0         16.0   2.0   21.632001  29.568000
26       256.0         16.0   8.0   33.792000  28.384000
27       256.0         64.0   1.0   27.648000  30.432000
28       256.0         64.0   2.0   27.648000  31.776000
29       256.0         64.0   8.0   31.776000  25.728000
30       256.0        224.0   1.0   62.463999  32.832000
31       256.0        224.0   2.0   62.463999  31.776000
32       256.0        224.0   8.0   62.560000  30.751999
33       256.0        256.0   1.0   70.720002  31.872001
34       256.0        256.0   2.0   72.704002  33.631999
35       256.0        256.0   8.0   74.752003  33.824001
36      4096.0         16.0   1.0   48.128001  27.680000
37      4096.0         16.0   2.0   74.784003  27.744001
38      4096.0         16.0   8.0  261.119992  44.128001
39      4096.0         64.0   1.0   44.096000  27.872000
40      4096.0         64.0   2.0   66.624001  33.599999
41      4096.0         64.0   8.0  192.959994  48.191998
42      4096.0        224.0   1.0   68.672001  33.760000
43      4096.0        224.0   2.0   80.911998  38.943999
44      4096.0        224.0   8.0  167.776003  54.272000
45      4096.0        256.0   1.0   76.800004  33.824001
46      4096.0        256.0   2.0   85.967999  37.888002
47      4096.0        256.0   8.0  131.007999  52.223999
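To make the table easier to read, the small sketch below computes the VLLM/SGL ratio for three rows copied from it. It assumes both columns are latencies (lower is better), which the benchmark output does not state explicitly; the `rows` list and the variable names are only for illustration.

```python
# (num_tokens, num_experts, topk, vllm, sgl), copied from rows 0, 9 and 38.
rows = [
    (1, 16, 1, 20.064000, 21.663999),
    (1, 256, 1, 71.872003, 31.744000),
    (4096, 16, 8, 261.119992, 44.128001),
]

ratios = []
for num_tokens, num_experts, topk, vllm, sgl in rows:
    # Ratio > 1 means the SGL kernel is faster under the latency assumption.
    ratio = vllm / sgl
    ratios.append(ratio)
    print(f"tokens={num_tokens} experts={num_experts} topk={topk}: "
          f"VLLM/SGL = {ratio:.2f}x")
```

In these rows the SGL kernel is slightly slower in the smallest case and several times faster at 4096 tokens with topk 8, which matches the overall trend in the table.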

Throughput (fp16)

vllm bench throughput --model Qwen/Qwen3-30B-A3B --load-format dummy --input-len 1000 --output-len 100

Throughput: 46.03 requests/s, 50547.85 total tokens/s, 4603.43 output tokens/s (B200)
Throughput: 47.63 requests/s, 52312.68 total tokens/s, 4762.52 output tokens/s (B200 sgl moe_align_block_size)

Throughput (fp8)

vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100

Throughput: 42.28 requests/s, 46424.61 total tokens/s, 4228.17 output tokens/s (B200)
Throughput: 44.17 requests/s, 48497.60 total tokens/s, 4417.43 output tokens/s (B200 sgl moe_align_block_size)
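For a quick sense of scale, the end-to-end gain can be worked out from the requests/s figures reported for the two runs. The helper name `gain_percent` is made up for this sketch; the numbers are the ones from the throughput lines for fp16 and fp8.

```python
def gain_percent(baseline, optimized):
    """Relative throughput improvement of `optimized` over `baseline`, in %."""
    return (optimized / baseline - 1.0) * 100.0


# requests/s on B200: current kernel vs. sgl moe_align_block_size
fp16_gain = gain_percent(46.03, 47.63)
fp8_gain = gain_percent(42.28, 44.17)

print(f"fp16: +{fp16_gain:.2f}% requests/s")
print(f"fp8:  +{fp8_gain:.2f}% requests/s")
```

That comes to roughly 3.5% for fp16 and 4.5% for fp8 in these runs.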

Alternatives

No response

Additional context

No response
