[FlyDSL] fused RoPE kernel with layout APIs #300
Pull request overview
Adds a fused RoPE + KV-cache kernel implementation that uses FlyDSL’s layout APIs for structured address calculations, along with a dedicated correctness/benchmarking test harness.
Changes:
- Introduce `kernels/fused_rope_cache_kernel.py`, implementing a 2-launch fused RoPE + KV-cache write using `make_layout` + `crd2idx`.
- Add `tests/kernels/test_fused_rope_cache.py`, covering flash/non-flash cache layouts, bf16/f16 dtypes, negative-slot behavior, and optional benchmarking/AITER cross-check.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `kernels/fused_rope_cache_kernel.py` | New fused kernel builder using layout-based indexing for Q/K/V and KV-cache address computations. |
| `tests/kernels/test_fused_rope_cache.py` | New correctness + optional perf/AITER validation coverage for the fused kernel across layouts/dtypes. |
Replace manual byte-offset arithmetic in fused_rope_cache_kernel with FlyDSL layout API (make_layout + crd2idx) for all structured address computations (Q/K/V, cos/sin, KV cache).

- Add _crd2idx_i32 helper to unwrap int_tuple -> i32 scalar for buffer offset math (same pattern as mfma_preshuffle_pipeline.py)
- Declare tensor layouts as (shape, stride) tuples, use crd2idx for address computation instead of manual mul+add chains
- Paired-half RoPE offset stays as arith (piecewise ±half_dim is not expressible as an affine layout stride)

92/92 tests passed, 0 errors, no performance regression

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
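The move from manual mul+add chains to a (shape, stride) layout can be illustrated with a plain-Python sketch. It does not call FlyDSL: the `Layout`, `make_layout`, and `crd2idx` names below are hypothetical stand-ins that only model the idea, and the Q tensor shape `(num_tokens, num_heads, head_dim)` with row-major strides is an assumption chosen for illustration.

```python
from typing import NamedTuple, Tuple


class Layout(NamedTuple):
    """Hypothetical stand-in: a layout is just a (shape, stride) pair."""
    shape: Tuple[int, ...]
    stride: Tuple[int, ...]


def make_layout(shape, stride):
    # Illustrative helper, not the FlyDSL function of the same name.
    if len(shape) != len(stride):
        raise ValueError("shape and stride must have the same rank")
    return Layout(tuple(shape), tuple(stride))


def crd2idx(coord, layout):
    # Map a coordinate to a linear element index: sum(coord[i] * stride[i]).
    for c, extent in zip(coord, layout.shape):
        if not 0 <= c < extent:
            raise IndexError(f"coordinate {c} out of range for extent {extent}")
    return sum(c * s for c, s in zip(coord, layout.stride))


# Assumed Q tensor: (num_tokens, num_heads, head_dim), contiguous row-major.
num_tokens, num_heads, head_dim = 4, 8, 64
q_layout = make_layout(
    (num_tokens, num_heads, head_dim),
    (num_heads * head_dim, head_dim, 1),
)

token, head, dim = 2, 3, 16

# Manual mul+add chain, the style the commit replaces.
manual = token * num_heads * head_dim + head * head_dim + dim

# Same address derived from the declared layout.
via_layout = crd2idx((token, head, dim), q_layout)

assert manual == via_layout
```

Declaring the strides once in the layout keeps the address arithmetic in a single place, so a change of tensor layout only touches the `make_layout` call rather than every offset expression.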
8962051 to dd599d5
As discussed with Felix, I am also working on converting the paired-half RoPE offset to the layout API.
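For context, the paired-half offset pairs element `d` with `d + half_dim` in the lower half and with `d - half_dim` in the upper half. One possible way to phrase that without a branch is sketched below in plain Python: view `d` as a coordinate `(h, r)` under shape `(2, half_dim)` and flip the half selector `h`. This is only an illustration of the idea, not the approach the author settled on, and whether FlyDSL layouts can express the flipped selector is not something this sketch assumes. The function names are hypothetical.

```python
def partner_piecewise(d, half_dim):
    # Branchy form: +half_dim in the lower half, -half_dim in the upper half.
    return d + half_dim if d < half_dim else d - half_dim


def partner_via_coords(d, half_dim):
    # Decompose d into (h, r) as if under shape (2, half_dim), stride (half_dim, 1).
    h, r = divmod(d, half_dim)
    # Flip the half selector and recompose the index with the same strides.
    return (1 - h) * half_dim + r


half_dim = 32
partners = [partner_via_coords(d, half_dim) for d in range(2 * half_dim)]

# Both forms agree on every element of the head dimension.
assert all(
    partner_piecewise(d, half_dim) == partners[d] for d in range(2 * half_dim)
)
```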
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
coderfeli left a comment
Overall clean implementation — layout API usage is correct, test coverage is solid (flash/non-flash, bf16/f16, negative slots, multi-model). A few suggestions below.
…mprove layout handling in fused_rope_cache_kernel
- Remove unreachable `vec_dwords != VEC_WIDTH` branches in bitcast calls (vec_dwords is always 4, VEC_WIDTH is always 8)
- Remove stale comment on launch function signature
- Remove sys.path manipulation in test (use PYTHONPATH instead)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
As discussed in #272, this PR replaces manual byte-offset arithmetic in `fused_rope_cache_kernel` with the FlyDSL layout API (`make_layout` + `crd2idx`) for all structured address computations (Q/K/V, cos/sin, KV cache).
92/92 tests passed; results cross-validated against AITER.
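As a rough picture of what a correctness test for a fused RoPE + KV-cache write compares against, here is a minimal pure-Python reference. It is a sketch under stated assumptions, not the code in `tests/kernels/test_fused_rope_cache.py`: it assumes the paired-half rotation convention (element `d` rotates with `d + half`), a flash-style cache indexed as `[block][offset][head][dim]`, and that a negative slot means "skip the write". All function names are hypothetical.

```python
import math


def rope_paired_half(x, cos, sin):
    # Assumed convention: element d in the lower half rotates with d + half.
    half = len(x) // 2
    out = [0.0] * len(x)
    for d in range(half):
        x1, x2 = x[d], x[d + half]
        out[d] = x1 * cos[d] - x2 * sin[d]
        out[d + half] = x2 * cos[d] + x1 * sin[d]
    return out


def write_kv_cache(cache, slot, block_size, head, vec):
    # Assumed behavior: a negative slot marks a token that must not be cached.
    if slot < 0:
        return False
    block, offset = divmod(slot, block_size)
    cache[block][offset][head] = list(vec)
    return True


head_dim, block_size, num_blocks, num_heads = 4, 2, 2, 1
cache = [
    [[[0.0] * head_dim for _ in range(num_heads)] for _ in range(block_size)]
    for _ in range(num_blocks)
]

# Rotate by 90 degrees so the expected values are easy to check by hand.
theta = math.pi / 2
cos = [math.cos(theta)] * (head_dim // 2)
sin = [math.sin(theta)] * (head_dim // 2)

k = [1.0, 2.0, 3.0, 4.0]
k_rot = rope_paired_half(k, cos, sin)

wrote = write_kv_cache(cache, slot=3, block_size=block_size, head=0, vec=k_rot)
skipped = write_kv_cache(cache, slot=-1, block_size=block_size, head=0, vec=k_rot)
```

Slot 3 with `block_size = 2` lands in block 1, offset 1, while the negative slot leaves the cache untouched; a kernel test would compare the device output against values produced this way, within a tolerance suited to bf16/f16.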
Test Plan
Usage:
Test Result
Submission Checklist