Description
🚀 The feature, motivation and pitch
Currently, `moe_align_block_size` is a performance bottleneck; in some shapes it is even slower than `fused_moe_kernel` itself.

I think we can reuse the kernel from SGLang (SGL) to optimize it as a first step (making it the default implementation). I will open a PR for this later.
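For context, here is a minimal pure-PyTorch sketch of what `moe_align_block_size` conceptually computes (an illustrative reference only, not vLLM's actual CUDA/Triton kernel; the helper name and exact signature below are assumptions): it groups the flattened token-to-expert assignments by expert and pads each expert's group up to a multiple of `block_size`, so that every block handed to `fused_moe_kernel` maps to a single expert.

```python
import torch

def moe_align_block_size_ref(topk_ids: torch.Tensor, block_size: int, num_experts: int):
    """Reference sketch (assumed semantics) of moe_align_block_size."""
    # Flatten (num_tokens, topk) -> one expert id per (token, slot) pair.
    flat = topk_ids.flatten()

    # Count assignments per expert, then pad each count up to a multiple of block_size.
    counts = torch.bincount(flat, minlength=num_experts)
    padded = ((counts + block_size - 1) // block_size) * block_size
    num_tokens_post_padded = int(padded.sum())

    # sorted_token_ids: indices into `flat`, grouped by expert; padding slots are
    # filled with flat.numel() so the GEMM kernel can recognize and skip them.
    sorted_token_ids = torch.full(
        (num_tokens_post_padded,), flat.numel(), dtype=torch.int32, device=flat.device
    )
    # expert_ids: which expert each block of block_size rows belongs to.
    expert_ids = torch.empty(
        num_tokens_post_padded // block_size, dtype=torch.int32, device=flat.device
    )

    offset = 0
    for e in range(num_experts):
        idx = torch.nonzero(flat == e, as_tuple=False).flatten().to(torch.int32)
        sorted_token_ids[offset : offset + idx.numel()] = idx
        expert_ids[offset // block_size : (offset + int(padded[e])) // block_size] = e
        offset += int(padded[e])

    return sorted_token_ids, expert_ids, num_tokens_post_padded
```

The expensive part is exactly this grouping/padding bookkeeping, which is why a faster kernel for it pays off even though the work per token is tiny.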
Benchmark test
moe-align-block-size-performance:
| num_tokens | num_experts | topk | VLLM | SGL |
|---|---|---|---|---|
| 1 | 16 | 1 | 20.064000 | 21.663999 |
| 1 | 16 | 2 | 19.616000 | 22.272000 |
| 1 | 16 | 8 | 19.616000 | 21.695999 |
| 1 | 64 | 1 | 25.632000 | 27.680000 |
| 1 | 64 | 2 | 26.384000 | 27.680000 |
| 1 | 64 | 8 | 26.368000 | 27.680000 |
| 1 | 224 | 1 | 62.463999 | 32.768000 |
| 1 | 224 | 2 | 62.463999 | 32.864001 |
| 1 | 224 | 8 | 62.463999 | 32.896001 |
| 1 | 256 | 1 | 71.872003 | 31.744000 |
| 1 | 256 | 2 | 71.744002 | 31.776000 |
| 1 | 256 | 8 | 71.744002 | 31.808000 |
| 16 | 16 | 1 | 21.344000 | 23.615999 |
| 16 | 16 | 2 | 21.376001 | 23.584001 |
| 16 | 16 | 8 | 21.344000 | 23.712000 |
| 16 | 64 | 1 | 25.632000 | 28.720001 |
| 16 | 64 | 2 | 25.632000 | 28.672000 |
| 16 | 64 | 8 | 26.400000 | 29.792000 |
| 16 | 224 | 1 | 62.463999 | 32.800000 |
| 16 | 224 | 2 | 62.463999 | 32.864001 |
| 16 | 224 | 8 | 62.463999 | 32.800000 |
| 16 | 256 | 1 | 72.704002 | 33.856001 |
| 16 | 256 | 2 | 72.704002 | 33.824001 |
| 16 | 256 | 8 | 72.704002 | 33.824001 |
| 256 | 16 | 1 | 19.487999 | 25.536001 |
| 256 | 16 | 2 | 21.632001 | 29.568000 |
| 256 | 16 | 8 | 33.792000 | 28.384000 |
| 256 | 64 | 1 | 27.648000 | 30.432000 |
| 256 | 64 | 2 | 27.648000 | 31.776000 |
| 256 | 64 | 8 | 31.776000 | 25.728000 |
| 256 | 224 | 1 | 62.463999 | 32.832000 |
| 256 | 224 | 2 | 62.463999 | 31.776000 |
| 256 | 224 | 8 | 62.560000 | 30.751999 |
| 256 | 256 | 1 | 70.720002 | 31.872001 |
| 256 | 256 | 2 | 72.704002 | 33.631999 |
| 256 | 256 | 8 | 74.752003 | 33.824001 |
| 4096 | 16 | 1 | 48.128001 | 27.680000 |
| 4096 | 16 | 2 | 74.784003 | 27.744001 |
| 4096 | 16 | 8 | 261.119992 | 44.128001 |
| 4096 | 64 | 1 | 44.096000 | 27.872000 |
| 4096 | 64 | 2 | 66.624001 | 33.599999 |
| 4096 | 64 | 8 | 192.959994 | 48.191998 |
| 4096 | 224 | 1 | 68.672001 | 33.760000 |
| 4096 | 224 | 2 | 80.911998 | 38.943999 |
| 4096 | 224 | 8 | 167.776003 | 54.272000 |
| 4096 | 256 | 1 | 76.800004 | 33.824001 |
| 4096 | 256 | 2 | 85.967999 | 37.888002 |
| 4096 | 256 | 8 | 131.007999 | 52.223999 |
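For reference, numbers like these can be produced with a small harness around `triton.testing.do_bench`. The sketch below assumes the two implementations are importable as `vllm_moe_align` and `sgl_moe_align` (placeholder names, not the real entry points), and uses random expert assignments rather than a real router top-k.

```python
# Hedged benchmark sketch; vllm_moe_align / sgl_moe_align are placeholder
# callables standing in for the vLLM and SGL moe_align_block_size kernels.
import torch
import triton

def bench_one(num_tokens: int, num_experts: int, topk: int, block_size: int = 128):
    # Random assignments for illustration; a real top-k would not repeat
    # experts within a token.
    topk_ids = torch.randint(
        0, num_experts, (num_tokens, topk), dtype=torch.int32, device="cuda"
    )
    # do_bench reports a runtime estimate in milliseconds.
    t_vllm = triton.testing.do_bench(lambda: vllm_moe_align(topk_ids, block_size, num_experts))
    t_sgl = triton.testing.do_bench(lambda: sgl_moe_align(topk_ids, block_size, num_experts))
    return t_vllm, t_sgl

for num_tokens in (1, 16, 256, 4096):
    for num_experts in (16, 64, 224, 256):
        for topk in (1, 2, 8):
            print(num_tokens, num_experts, topk, *bench_one(num_tokens, num_experts, topk))
```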
Throughput (fp16)
`vllm bench throughput --model Qwen/Qwen3-30B-A3B --load-format dummy --input-len 1000 --output-len 100`
Throughput: 46.03 requests/s, 50547.85 total tokens/s, 4603.43 output tokens/s (B200, current moe_align_block_size)
Throughput: 47.63 requests/s, 52312.68 total tokens/s, 4762.52 output tokens/s (B200, SGL moe_align_block_size)
Throughput (fp8)
`vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100`
Throughput: 42.28 requests/s, 46424.61 total tokens/s, 4228.17 output tokens/s (B200, current moe_align_block_size)
Throughput: 44.17 requests/s, 48497.60 total tokens/s, 4417.43 output tokens/s (B200, SGL moe_align_block_size)

Overall, the SGL kernel gives roughly a 3.5% end-to-end throughput gain for fp16 and roughly 4.5% for fp8 on B200.
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.