[Triton] add fused_routing_from_topk switch #725

Merged
valarLip merged 4 commits into main from shaoclee/dsv4-topk on May 10, 2026

@k50112113
Contributor

This PR depends on ROCm/aiter#3096

The inter-decode-layer "bubbles" come from a poorly captured multi-stream trace in which the MoE path is not captured correctly. The topk routing sits on that path and is composed of more than 10 small torch kernels, which accounts for the ~275 us "bubble" seen in the trace:
image

After fusing all of those ops into a single Triton kernel, the gap shrinks to ~160 us:
image
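
To illustrate why an unfused topk routing path turns into many small kernel launches, here is a minimal pure-Python sketch of a generic router for one token. It is not the kernel from this PR: the softmax scoring, the renormalization step, and the name `route_topk` are assumptions chosen for illustration, and the actual model may score and scale experts differently. Each numbered step would typically be one or more separate launches when written as individual tensor ops, which is the sequence a fused kernel collapses.

```python
import math


def route_topk(logits, k, renormalize=True):
    """Pick the top-k experts for one token from raw router logits.

    Hypothetical, generic routing used only for illustration; it does not
    reproduce the fused kernel discussed in this PR.
    """
    # 1. Softmax over experts (subtract the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]

    # 2. Top-k selection: expert indices sorted by descending score.
    ids = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    # 3. Gather the scores of the selected experts.
    weights = [scores[i] for i in ids]

    # 4. Renormalize so the selected weights sum to 1.
    if renormalize:
        s = sum(weights)
        weights = [w / s for w in weights]

    return ids, weights


ids, weights = route_topk([0.1, 2.0, -1.0, 1.5], k=2)
print(ids, [round(w, 3) for w in weights])
```

With four experts and `k=2`, the call selects experts 1 and 3 and returns weights that sum to 1.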

lm_eval with fused topk

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  | 0.96|±  |0.0197|
|     |       |strict-match    |     3|exact_match|↑  | 0.96|±  |0.0197|

About +15% total token throughput end-to-end (5014.49 → 5763.84 tok/s)

baseline

Maximum request concurrency: 64
============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  209.11    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              2.45      
Output token throughput (tok/s):         2507.24   
Total Token throughput (tok/s):          5014.49   
---------------Time to First Token----------------
Mean TTFT (ms):                          1109.65   
Median TTFT (ms):                        1088.55   
P99 TTFT (ms):                           1956.70   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.46     
Median TPOT (ms):                        24.40     
P99 TPOT (ms):                           25.40     
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.43     
Median ITL (ms):                         23.98     
P99 ITL (ms):                            26.45     
==================================================

fused topk

Maximum request concurrency: 64
============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  181.92    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              2.81      
Output token throughput (tok/s):         2881.92   
Total Token throughput (tok/s):          5763.84   
---------------Time to First Token----------------
Mean TTFT (ms):                          1072.04   
Median TTFT (ms):                        1103.72   
P99 TTFT (ms):                           1811.84   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.17     
Median TPOT (ms):                        21.17     
P99 TPOT (ms):                           22.36     
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.15     
Median ITL (ms):                         20.57     
P99 ITL (ms):                            23.65     
==================================================
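
For reference, the end-to-end gain can be recomputed from the two benchmark outputs above. This is a small sketch; the helper `pct_change` and the dictionary keys are names made up here, while the numbers are copied from the baseline and fused topk results.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from the baseline and fused topk benchmark outputs.
baseline = {"total_tok_s": 5014.49, "mean_tpot_ms": 24.46, "duration_s": 209.11}
fused = {"total_tok_s": 5763.84, "mean_tpot_ms": 21.17, "duration_s": 181.92}

gains = {key: pct_change(baseline[key], fused[key]) for key in baseline}
for key, value in gains.items():
    print(f"{key}: {value:+.1f}%")
```

Total token throughput comes out just under +15%, with mean TPOT and benchmark duration both dropping by roughly 13%.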

@valarLip
Collaborator

valarLip commented May 9, 2026

Let me know once the ruff issue is fixed.

@k50112113
Contributor Author

fixed ruff

@k50112113
Contributor Author

Update: performance increased further after splitting the routing topk fusion into 3 kernels.

Maximum request concurrency: 64
============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  173.14    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              2.96      
Output token throughput (tok/s):         3028.10   
Total Token throughput (tok/s):          6056.20   
---------------Time to First Token----------------
Mean TTFT (ms):                          1102.35   
Median TTFT (ms):                        1203.95   
P99 TTFT (ms):                           1791.72   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.07     
Median TPOT (ms):                        19.97     
P99 TPOT (ms):                           20.86     
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.05     
Median ITL (ms):                         19.72     
P99 ITL (ms):                            21.53     
==================================================
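
As a quick check on the three reported runs, the total token throughput of the 3-kernel version can be expressed as a speedup over each earlier run. The run labels below are informal names chosen here; the throughput values are copied from the benchmark outputs in this thread.

```python
# Total token throughput (tok/s) copied from the three benchmark outputs.
runs = {
    "baseline": 5014.49,
    "single fused kernel": 5763.84,
    "three kernels": 6056.20,
}

latest = runs["three kernels"]
# Speedup of the 3-kernel run relative to each earlier run.
speedups = {
    name: latest / value for name, value in runs.items() if name != "three kernels"
}
for name, ratio in speedups.items():
    print(f"three kernels vs {name}: {ratio:.3f}x")
```

That works out to roughly 1.21x over the baseline and about 1.05x over the single fused kernel.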

@valarLip valarLip merged commit 303a2e4 into main May 10, 2026
19 of 28 checks passed
@valarLip valarLip deleted the shaoclee/dsv4-topk branch May 10, 2026 03:04

2 participants