[Triton] add fused_routing_from_topk switch by k50112113 · Pull Request #725 · ROCm/ATOM

k50112113 · 2026-05-09T03:11:24Z

The inter-decode-layer "bubbles" come from a poorly captured multi-stream trace in which the MoE path is not captured correctly. The top-k routing sits on that path and is composed of 10+ small torch kernels, which contributes to the ~275 us "bubble" seen in the trace.

After fusing all of those ops into a single fused Triton kernel, the gap shrinks to ~160 us.
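For readers unfamiliar with what "routing from topk" typically involves, here is a rough CPU reference sketch. It is not the kernel from this PR: it assumes the routing step takes per-token top-k expert ids and builds an expert histogram, per-expert start offsets, and gather/scatter indices that group assignments by expert. The function name `routing_from_topk_reference` and its outputs are hypothetical. Written as separate tensor ops, each step below would be at least one small kernel launch, which is the overhead fusion removes.

```python
from typing import List, Tuple


def routing_from_topk_reference(
    topk_ids: List[List[int]], num_experts: int
) -> Tuple[List[int], List[int], List[int], List[int]]:
    """Hypothetical reference: expert-sorted routing metadata from top-k ids.

    topk_ids[t][s] is the expert chosen by token t in top-k slot s.
    """
    # Flatten (token, slot) pairs row-major; flat index i maps to token i // k.
    flat = [expert for row in topk_ids for expert in row]

    # Step 1: histogram of assignments per expert.
    hist = [0] * num_experts
    for expert in flat:
        hist[expert] += 1

    # Step 2: exclusive prefix sum gives each expert's start offset.
    offsets = [0] * num_experts
    for e in range(1, num_experts):
        offsets[e] = offsets[e - 1] + hist[e - 1]

    # Step 3: stable counting sort of assignments by expert id.
    cursor = list(offsets)
    gather = [0] * len(flat)   # sorted position -> flat assignment index
    scatter = [0] * len(flat)  # flat assignment index -> sorted position
    for i, expert in enumerate(flat):
        pos = cursor[expert]
        gather[pos] = i
        scatter[i] = pos
        cursor[expert] += 1

    return hist, offsets, gather, scatter


# Example: 3 tokens, top-2 routing over 4 experts (assumed toy sizes).
hist, offsets, gather, scatter = routing_from_topk_reference(
    [[2, 0], [1, 2], [0, 2]], num_experts=4
)
print(hist, offsets, gather, scatter)
```

In this toy example, expert 2 receives three assignments and expert 3 receives none, so `hist` is `[2, 1, 3, 0]` and the sorted layout places expert 0's two assignments first.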

lm_eval results with fused topk:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  | 0.96|±  |0.0197|
|     |       |strict-match    |     3|exact_match|↑  | 0.96|±  |0.0197|

End-to-end total token throughput improves by about 15% (5014.49 → 5763.84 tok/s).

baseline

Maximum request concurrency: 64
============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  209.11    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              2.45      
Output token throughput (tok/s):         2507.24   
Total Token throughput (tok/s):          5014.49   
---------------Time to First Token----------------
Mean TTFT (ms):                          1109.65   
Median TTFT (ms):                        1088.55   
P99 TTFT (ms):                           1956.70   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.46     
Median TPOT (ms):                        24.40     
P99 TPOT (ms):                           25.40     
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.43     
Median ITL (ms):                         23.98     
P99 ITL (ms):                            26.45     
==================================================

fused topk

Maximum request concurrency: 64
============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  181.92    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              2.81      
Output token throughput (tok/s):         2881.92   
Total Token throughput (tok/s):          5763.84   
---------------Time to First Token----------------
Mean TTFT (ms):                          1072.04   
Median TTFT (ms):                        1103.72   
P99 TTFT (ms):                           1811.84   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.17     
Median TPOT (ms):                        21.17     
P99 TPOT (ms):                           22.36     
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.15     
Median ITL (ms):                         20.57     
P99 ITL (ms):                            23.65     
==================================================

valarLip · 2026-05-09T08:39:37Z

Let me know once the ruff issue is fixed.

k50112113 · 2026-05-09T17:57:18Z

Fixed the ruff issue.

k50112113 · 2026-05-09T18:35:44Z

Update: performance improved further after splitting the fused top-k routing into 3 kernels:

Maximum request concurrency: 64
============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  173.14    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              2.96      
Output token throughput (tok/s):         3028.10   
Total Token throughput (tok/s):          6056.20   
---------------Time to First Token----------------
Mean TTFT (ms):                          1102.35   
Median TTFT (ms):                        1203.95   
P99 TTFT (ms):                           1791.72   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.07     
Median TPOT (ms):                        19.97     
P99 TPOT (ms):                           20.86     
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.05     
Median ITL (ms):                         19.72     
P99 ITL (ms):                            21.53     
==================================================
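To compare the three runs on equal footing, the relative changes can be recomputed directly from the reported figures. The short script below is only an arithmetic check; the dictionary layout and the helper name `pct_change` are illustrative, and the numbers are copied from the benchmark outputs in this thread.

```python
# Figures copied from the three benchmark runs reported in this thread.
runs = {
    "baseline": {"total_tok_s": 5014.49, "mean_tpot_ms": 24.46},
    "fused topk": {"total_tok_s": 5763.84, "mean_tpot_ms": 21.17},
    "3-kernel split": {"total_tok_s": 6056.20, "mean_tpot_ms": 20.07},
}


def pct_change(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new / base - 1.0) * 100.0


base = runs["baseline"]
summary = {}
for name, r in runs.items():
    throughput_gain = pct_change(r["total_tok_s"], base["total_tok_s"])
    tpot_change = pct_change(r["mean_tpot_ms"], base["mean_tpot_ms"])
    summary[name] = (throughput_gain, tpot_change)
    print(f"{name:>15}: throughput {throughput_gain:+.1f}%, mean TPOT {tpot_change:+.1f}%")
```

Against the baseline, the single fused kernel gives roughly +14.9% total token throughput and the 3-kernel split roughly +20.8%, with mean TPOT down by about 13.5% and 17.9% respectively.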

add _aiter_fused_routing_from_topk switch

ae95f16

ruff fix

1910028

change import

5102a93

black format

abea4e1

valarLip approved these changes May 10, 2026

valarLip merged commit 303a2e4 into main May 10, 2026
19 of 28 checks passed

valarLip deleted the shaoclee/dsv4-topk branch May 10, 2026 03:04

valarLip merged 4 commits into main from shaoclee/dsv4-topk
