Conversation

Commits referencing this pull request (all carry the same commit message):

- coderfeli added a commit, Feb 28, 2026
- coderfeli added a commit, Mar 2, 2026
- coderfeli added a commit, Mar 2, 2026
- coderfeli added a commit, Mar 3, 2026
- jli-melchior pushed a commit, Mar 18, 2026
- jli-melchior pushed a commit, Mar 18, 2026
- jli-melchior pushed a commit, Mar 19, 2026

Commit message:

* fix run error
* port all gemm from main
* fuix cudagraph hack
* add int4 version
* change flymemref convert
* test ok
* add build script
* fix graph2
* add files
* fix flops
* fix path
* fix local test
* fix
* clean
* update readme
Motivation

Port the preshuffle GEMM kernel infrastructure from the internal development branch to the `pre_v0.1` public branch, adding FP4/INT4/BF16 dtype support, async DMA copy, CUDA graph capture, and a streamlined build/test workflow.

Technical Details

Kernel Enhancements (`kernels/`)

- `compile_preshuffle_gemm_w4` for gfx950 FP4 preshuffle GEMM with block-scaled MXFP4 quantization, including per-block scale loading and `mfma_scale_f32_16x16x128_f8f6f4` with `cbsz`/`blgp`/`opsel` parameters.
- `compile_preshuffle_gemm_a8` handles `int4` (with nibble unpacking) and `bf16` element types.
- `mgpuSetCaptureStream` TLS mechanism in `FlirRocmRuntimeWrappers.cpp` to redirect kernel launches into a capture stream; integrated with the `test_common.py` graph capture flow.
- Layout utilities (`kernels/layout_utils.py`): Pure-arith `crd2idx`/`idx2crd`/`get` that parse static layout type strings and emit plain arith ops, avoiding fly dialect round-trips in the hot path.
- `buffer_load` vec_width=1 fix: Handle scalar `buffer_load` returning a single value (not a vector) by wrapping with `vector.from_elements` before `bitcast`.

DSL / Python Layer (`python/flydsl/`)

- `primitive.py` cleanup: Renamed the `arith` import to `_arith` to prevent an `import *` namespace collision with the `flydsl.expr.arith` wrapper module. Changed `range_constexpr` to `return range(*args)` for direct kernel usage.
- `arith.py` function-level API: Added `constant`, `index`, `index_cast`, `select`, `constant_vector`, `sitofp`, `trunc_f`, `andi`, `xori`, `shli`, `unwrap`, `_to_raw` as thin wrappers around MLIR arith ops with ArithValue support. Added `_safe_register` for idempotent value caster registration. Fixed index-type division (`divui` for index types that lack `.width`).
- `rocdl.py` MFMA scale op: Restructured `mfma_scale_f32_16x16x128_f8f6f4` to explicitly unpack `cbsz`, `blgp`, `opselA`, `scaleA`, `opselB`, `scaleB` from the operand list.
- `buffer_ops.py`: Replaced `unrealized_conversion_cast` with `fly.extract_aligned_pointer_as_index` for memref-to-pointer extraction.
- `SmemAllocator`: Added a `global_sym_name` parameter for multiple independent shared memory allocations.
- Runtime library: `libfly_jit_runtime.so` (with graph capture) is preferred over upstream `libmlir_rocm_runtime.so`. Temporarily removed the redundant `convert-vector-to-llvm` pass from the pipeline.

IR / Dialect (`include/`, `lib/`)

- `Fly_ExtractAlignedPointerAsIndexOp`: New op to extract the raw pointer from a `fly.memref` as an index value, with ROCDL lowering via `AddrSpaceCastOp`.

Test Result
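As an illustration of the `int4` nibble unpacking mentioned under Kernel Enhancements (not a test result, and not the kernel code from this PR), here is a minimal host-side sketch in plain Python. The helper name `unpack_int4` is hypothetical, and the sketch assumes two signed 4-bit values per byte with the low nibble first; the description does not state the packing order actually used by the preshuffled layout.

```python
def unpack_int4(packed: bytes) -> list[int]:
    """Unpack signed 4-bit integers from bytes (hypothetical helper).

    Assumption: each byte holds two values, low nibble first.
    """
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, (byte >> 4) & 0xF):
            # Sign-extend 4-bit two's complement: 8..15 map to -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


if __name__ == "__main__":
    print(unpack_int4(bytes([0x21, 0xF8])))
```

A kernel would typically do the same masking, shifting, and sign extension on vector registers rather than on Python integers; only the arithmetic carries over.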
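The coordinate/index mapping behind the `crd2idx`/`idx2crd` helpers listed under Kernel Enhancements can likewise be sketched with plain integer arithmetic. This is an assumption-laden illustration, not the contents of `kernels/layout_utils.py`: the signatures below are hypothetical, the layout is passed as Python tuples rather than parsed from a layout type string, no MLIR ops are emitted, and strides are assumed to be non-zero.

```python
def crd2idx(coord: tuple[int, ...], stride: tuple[int, ...]) -> int:
    """Map a coordinate to a linear index: sum of coord[i] * stride[i]."""
    return sum(c * s for c, s in zip(coord, stride))


def idx2crd(idx: int, shape: tuple[int, ...],
            stride: tuple[int, ...]) -> tuple[int, ...]:
    """Recover a coordinate from a linear index for a compact layout.

    Per mode: (idx // stride[i]) % shape[i]. Assumes non-zero strides.
    """
    return tuple((idx // s) % e for e, s in zip(shape, stride))


if __name__ == "__main__":
    shape = (4, 8)
    col_major = (1, 4)  # first mode varies fastest
    row_major = (8, 1)  # last mode varies fastest
    print(idx2crd(13, shape, col_major), idx2crd(13, shape, row_major))
```

Because the shape and stride are compile-time constants in this setting, every multiply, divide, and modulo here could be emitted directly as an arith op, which is the motivation the description gives for avoiding dialect round-trips.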