Conversation
Add an algebraic simplification pass that recognizes the canonical
three-instruction swizzle sequence (andi/shrui/xori) emitted by
`applySwizzle` and peels period-aligned addends out of the swizzle:
swizzle(base + d) → swizzle(base) + d when d % period == 0
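Why the peel is sound can be checked with a few lines of Python. The mask/shift constants below follow the usual S&lt;B,M,S&gt; convention (period = 2^(B+M+S)) and are assumptions for illustration, not the pass's actual code:

```python
# Model of the canonical three-instruction swizzle (andi/shrui/xori).
# B, M, S and the derived MASK are illustrative assumptions for S<3,3,3>.
B, M, S = 3, 3, 3
MASK = ((1 << B) - 1) << (M + S)   # 0x1C0: bits 6..8
PERIOD = 1 << (B + M + S)          # 512

def swizzle(x):
    masked = x & MASK      # andi
    shifted = masked >> S  # shrui
    return x ^ shifted     # xori

# An addend d with d % PERIOD == 0 has zero low bits, so it changes neither
# the masked bits (6..8) nor the xor-target bits (3..5): it commutes with
# the swizzle, i.e. swizzle(base + d) == swizzle(base) + d.
for base in range(1024):
    for d in (0, 512, 16384, 3 * 16384):
        assert swizzle(base + d) == swizzle(base) + d
```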
In example04 (preshuffle_gemm) with SwizzleType S<3,3,3> (period=512),
the LDS address for each pipeline stage is `swizzle(thread_offset +
stage * lds_stride)`. Since lds_stride (BLOCK_M × BLOCK_K × sizeof(f16)
= 16384) is a multiple of 512, the pass peels the stage offset out:
Before: each stage recomputes the full swizzle (andi+shrui+xori) on
(thread_offset + stage_offset), duplicating 3 ALU ops per stage.
After: swizzle is computed once on thread_offset; stage offsets become
plain arith.addi with an immediate, which then fold/CSE across
stages.
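The before/after shapes can be sketched as below; the swizzle is modeled with assumed S&lt;3,3,3&gt; constants (mask `0x1C0`, shift 3), not the pass's actual code:

```python
def swizzle(x):
    # Assumed S<3,3,3> swizzle: x ^ ((x & 0x1C0) >> 3)
    return x ^ ((x & 0x1C0) >> 3)

LDS_STRIDE = 16384  # multiple of the 512-byte period

def addr_before(thread_offset, stage):
    # Swizzle recomputed per stage: andi+shrui+xori every time.
    return swizzle(thread_offset + stage * LDS_STRIDE)

def addr_after(thread_offset, stage):
    # Swizzle hoisted onto thread_offset; the stage offset becomes a
    # plain add with an immediate, which folds/CSEs across stages.
    return swizzle(thread_offset) + stage * LDS_STRIDE

# The rewrite is exact because LDS_STRIDE % 512 == 0.
assert all(addr_before(t, s) == addr_after(t, s)
           for t in range(512) for s in range(3))
```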
Concrete effects on the ds_read/ds_write address computation in the
GEMM hot loop:
- Instruction count: eliminates redundant andi+shrui+xori triplets per
pipeline stage (−6 SALU/VALU ops for 2-stage, −9 for 3-stage).
- Immediate offsets: the peeled constant addends enable ds_read_b128
with immediate byte offsets instead of a full v_add + swizzle, which
the ISA can encode directly in the ds instruction offset field.
- Register pressure: fewer live intermediates (%masked, %shifted) per
stage reduces VGPR demand in the address-computation section.
The pass also includes a divisibility lattice that propagates alignment
info both forward (through arith add/mul/shl) and backward (from
IntAttr.divisibility annotations on Make*Op results), enabling the
optimization even for dynamic strides with known alignment.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
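A minimal sketch of the forward direction of such a divisibility lattice, assuming the usual gcd-based transfer functions (the names here are illustrative, not the pass's API):

```python
from math import gcd

# Lattice value: the largest known divisor of an SSA value (1 = unknown).
def div_add(a, b):
    # x + y is divisible by gcd(a, b)
    return gcd(a, b)

def div_mul(a, b):
    # x * y is divisible by a * b
    return a * b

def div_shl(a, k):
    # x << k (k a constant) is divisible by a * 2^k
    return a << k

# Example: stage * lds_stride with lds_stride = 16384 is divisible by the
# 512-byte period even when `stage` itself has unknown alignment.
PERIOD = 512
stage_offset_div = div_mul(1, 16384)
assert stage_offset_div % PERIOD == 0    # period-aligned: peel applies
assert div_add(16384, 1) % PERIOD != 0   # unknown addend blocks the peel
```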

Pull request overview
Adds an MLIR optimization pass (fly-int-swizzle-simplify) to algebraically simplify the canonical swizzle sequence emitted by applySwizzle, peeling period-aligned addends out of the swizzle input to reduce redundant ALU ops and improve downstream folding/CSE, and wires it into the ROCm backend pipeline.
Changes:
- Introduce the `fly-int-swizzle-simplify` pass with a divisibility lattice to detect period-aligned addends and rewrite swizzle-shaped `andi`/`shrui`/`xori` patterns.
- Register/build the new transform (CMake + Passes.td) and enable it in the ROCm compilation pipeline.
- Add MLIR tests covering peel/no-peel behavior for i32 and i64.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `lib/Dialect/Fly/Transforms/IntSwizzleSimplify.cpp` | Implements the new swizzle simplification pass and divisibility propagation. |
| `include/flydsl/Dialect/Fly/Transforms/Passes.td` | Registers the new pass and documents its intent/pattern. |
| `python/flydsl/compiler/backends/rocm.py` | Inserts the pass into the ROCm lowering pipeline. |
| `tests/mlir/Transforms/int_swizzle_simplify.mlir` | Adds FileCheck coverage for the optimization and non-optimization cases. |
| `lib/Dialect/Fly/CMakeLists.txt` | Builds the new transform source file into the Fly dialect library. |
| `include/flydsl/Dialect/Fly/Utils/IntUtils.h` / `lib/Dialect/Fly/Utils/IntUtils.cpp` | Adds an `isDivisibleBy` helper for `IntAttr`. |
| `include/flydsl/Dialect/Fly/IR/FlyAttrDefs.td` | Adds a `SwizzleAttr::period()` convenience method. |
Address review comments on PR #427:
- i64 mask truncation: change mask/low/plus from unsigned/int to uint64_t so i64 swizzle masks are not silently truncated to 32 bits before the ring-of-1s check and period computation.
- Constant divisibility overflow: saturate large i64 constants to INT32_MAX instead of casting directly to int, preventing values like 0x1_0000_0000 from truncating to 0 and falsely marking everything as period-aligned.
- period <= 0 guard: bail out conservatively in peelByPeriod when period is non-positive, avoiding divide-by-zero UB in the modulo.
- dependentDialects: add gpu::GPUDialect so the pass can be run in isolation without relying on upstream dialect loading.
- Passes.td description: fix arith.xori to use valid two-operand MLIR SSA form; remove mention of subi (subtraction peeling not implemented).
- Pipeline: insert canonicalize after fly-int-swizzle-simplify to eagerly clean up dead andi/shrui ops left behind after the xori rewrite.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
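The first two fixes can be illustrated with a small model of 32-bit truncation versus saturation (a Python model of the C++ behavior; the helper names are hypothetical):

```python
# Hypothetical model of the overflow fix: narrowing a large i64 constant
# to 32 bits truncates 0x1_0000_0000 to 0, which would (falsely) look
# divisible by every period. Saturating to INT32_MAX avoids that.
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def truncate_to_i32(v):
    # Models a direct C-style cast to int: keep the low 32 bits.
    v &= 0xFFFFFFFF
    return v - 2**32 if v >= 2**31 else v

def saturate_to_i32(v):
    # Models the fixed behavior: clamp instead of wrapping.
    return min(max(v, INT32_MIN), INT32_MAX)

c = 0x1_0000_0000
assert truncate_to_i32(c) == 0             # bad: 0 % period == 0 for any period
assert saturate_to_i32(c) == INT32_MAX     # safe: not period-aligned
assert saturate_to_i32(c) % 512 != 0
```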