
[FlyDSL] fused RoPE kernel with layout APIs #300

Merged
coderfeli merged 7 commits into ROCm:main from amd-weisun:rope-layout-api
Apr 1, 2026
Conversation


@amd-weisun amd-weisun commented Mar 27, 2026

As discussed in #272. Replaces the manual byte-offset arithmetic in fused_rope_cache_kernel with the FlyDSL layout API (make_layout + crd2idx) for all structured address computations (Q/K/V, cos/sin, KV cache).

  • Declare tensor layouts as (shape, stride) tuples, use crd2idx for address computation instead of manual mul+add chains
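For illustration, the (shape, stride) to linear-index mapping that make_layout + crd2idx provide can be sketched in plain Python. This is a simplified stand-in, not the FlyDSL API itself, and the tensor extents below are made up:

```python
# Sketch of layout-based addressing: a layout is a (shape, stride) pair,
# and crd2idx is the dot product of a coordinate with the strides --
# exactly what the manual mul+add chains were computing by hand.

def crd2idx(coord, shape, stride):
    """Map a (possibly nested) coordinate to a linear index."""
    if isinstance(shape, tuple):
        return sum(crd2idx(c, s, d) for c, s, d in zip(coord, shape, stride))
    return coord * stride

# Hypothetical [num_tokens, num_heads, head_dim] Q tensor, row-major:
q_shape = (4, 8, 128)
q_stride = (8 * 128, 128, 1)

# Address of token 2, head 3, dim 5 -- same as 2*1024 + 3*128 + 5.
assert crd2idx((2, 3, 5), q_shape, q_stride) == 2437
```

With the layout declared once, every access site becomes a single crd2idx call instead of a hand-written offset expression.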

92/92 tests passed; cross-validated against AITER results.

Test Plan

Usage:

# Fast CI — correctness only (GPT-OSS 120B TP=8, 10 tests):
PYTHONPATH=./ pytest tests/kernels/test_fused_rope_cache.py -v -s

# All models × TPs (multi-model sweep):
FLYDSL_ALL_MODELS=1 PYTHONPATH=./ pytest tests/kernels/test_fused_rope_cache.py -v -s

# With benchmarking + optional AITER comparison:
FLYDSL_BENCH=1 AITER_REPO=../aiter PYTHONPATH=./ pytest tests/kernels/test_fused_rope_cache.py -v -s

# CLI — all models:
PYTHONPATH=./ python tests/kernels/test_fused_rope_cache.py --all-models

# CLI — with benchmark + AITER comparison:
FLYDSL_BENCH=1 AITER_REPO=../aiter PYTHONPATH=./ python tests/kernels/test_fused_rope_cache.py --all-models

Test Result

  • Tested on MI350: zero numerical error vs. the PyTorch reference. Performance: 1.4-1.6x faster than Triton (AITER) across all configs (GPT-OSS-120B, Qwen3, Llama-3.1); both cache layouts verified. Cross-validated against AITER output.

Submission Checklist

@amd-weisun amd-weisun changed the title [FlyDSL] Migrate RoPE kernel to layout API (make_layout + crd2idx) [FlyDSL] fused RoPE kernel with layout APIs Mar 27, 2026
@amd-weisun amd-weisun mentioned this pull request Mar 27, 2026
1 task
@amd-weisun amd-weisun marked this pull request as ready for review March 27, 2026 11:40
Copilot AI review requested due to automatic review settings March 27, 2026 11:40

Copilot AI left a comment


Pull request overview

Adds a fused RoPE + KV-cache kernel implementation that uses FlyDSL’s layout APIs for structured address calculations, along with a dedicated correctness/benchmarking test harness.

Changes:

  • Introduce kernels/fused_rope_cache_kernel.py implementing a 2-launch fused RoPE + KV-cache write using make_layout + crd2idx.
  • Add tests/kernels/test_fused_rope_cache.py covering flash/non-flash cache layouts, bf16/f16 dtypes, negative-slot behavior, and optional benchmarking/AITER cross-check.
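As a rough illustration of the negative-slot behavior the tests cover, assuming the common slot-mapping convention where a negative slot marks a padding token whose K/V must not be written to the cache (the helper below is hypothetical, not from the PR):

```python
# Sketch: scatter per-token K/V values into a flat cache, skipping any
# token whose slot index is negative (padding / no-write convention).

def write_kv(cache, slots, values):
    """Write values into cache at the given slots; slots < 0 are skipped."""
    for slot, val in zip(slots, values):
        if slot >= 0:
            cache[slot] = val
    return cache

cache = [None] * 4
write_kv(cache, [2, -1, 0], ["k0", "k1", "k2"])
assert cache == ["k2", None, "k0", None]  # slot -1 left the cache untouched
```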

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files changed:

  • kernels/fused_rope_cache_kernel.py: New fused kernel builder using layout-based indexing for Q/K/V and KV-cache address computations.
  • tests/kernels/test_fused_rope_cache.py: New correctness + optional perf/AITER validation coverage for the fused kernel across layouts/dtypes.


amd-weisun and others added 2 commits March 27, 2026 13:25
Replace manual byte-offset arithmetic in fused_rope_cache_kernel with
FlyDSL layout API (make_layout + crd2idx) for all structured address
computations (Q/K/V, cos/sin, KV cache).

- Add _crd2idx_i32 helper to unwrap int_tuple -> i32 scalar for buffer
  offset math (same pattern as mfma_preshuffle_pipeline.py)
- Declare tensor layouts as (shape, stride) tuples, use crd2idx for
  address computation instead of manual mul+add chains
- Paired-half RoPE offset stays as arith (piecewise ±half_dim is not
  expressible as an affine layout stride)

92/92 tests passed, 0 errors, no performance regression

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
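For reference, the paired-half rotation whose offset the commit above keeps as arith can be sketched in plain Python. This is a NeoX-style reference under assumed conventions (inverse-frequency base, element i paired with i + half_dim), not the kernel code itself:

```python
import math

def rope_paired_half(x, pos, base=10000.0):
    """Reference paired-half RoPE on one head vector.

    Element i (< half_dim) pairs with i + half_dim; the rotated partner
    lives at +half_dim or -half_dim depending on which half i is in --
    the piecewise offset that is not expressible as an affine stride.
    """
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        inv_freq = base ** (-2.0 * i / d)
        c, s = math.cos(pos * inv_freq), math.sin(pos * inv_freq)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out

# Position 0 rotates by angle 0, so the vector is unchanged.
assert rope_paired_half([1.0, 2.0, 3.0, 4.0], 0) == [1.0, 2.0, 3.0, 4.0]
```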
@amd-weisun amd-weisun marked this pull request as draft March 27, 2026 14:19
@amd-weisun
Contributor Author

As discussed with Felix, I am also working on converting the paired-half RoPE offset to the layout API.


Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.



Collaborator

@coderfeli coderfeli left a comment


Overall clean implementation — layout API usage is correct, test coverage is solid (flash/non-flash, bf16/f16, negative slots, multi-model). A few suggestions below.

@amd-weisun amd-weisun marked this pull request as ready for review March 31, 2026 09:16
- Remove unreachable `vec_dwords != VEC_WIDTH` branches in bitcast
  calls (vec_dwords is always 4, VEC_WIDTH is always 8)
- Remove stale comment on launch function signature
- Remove sys.path manipulation in test (use PYTHONPATH instead)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderfeli coderfeli merged commit 5a1385c into ROCm:main Apr 1, 2026
5 checks passed