feat(native-cpu): native FFM FP32 SGEMM kernel (PR 5 of 5)#575
Merged
michalharakal merged 1 commit into develop on Apr 29, 2026
Conversation
Final PR of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Wires a real FP32 matmul into the existing matmulFp32() SPI accessor on NativeKernelProvider; the KernelRegistry now hands out the native kernel ahead of Panama Vector for FP32 SGEMM on hosts where libskainet_kernels resolves.

Native side (native/):
- src/fp32_matmul.c implements skainet_fp32_matmul as row-major C(m,n) = A(m,k) * B(k,n) with stride support. Iteration order is i-p-j (outer product into rows of C); the inner `c[j] += a*b[j]` loop streams two contiguous arrays and auto-vectorizes cleanly under -O3 -ffast-math into vfmadd231ps on x86_64 / fmla on AArch64. The caller contract matches the SPI: zero-then-accumulate ensures C is fully overwritten, k=0 zeros the block, and m=0 || n=0 is a no-op.
- include/skainet_kernels.h declares the new export. CMakeLists adds fp32_matmul.c alongside the smoke and Q4_K sources.

Kotlin side (src/jvmMain):
- NativeFp32MatmulKernel implements Fp32MatmulKernel via an FFM downcall (12-arg FunctionDescriptor: 3 ADDRESS + 9 JAVA_INT). Heap arrays are copied into Arena.ofConfined off-heap segments sized to the reach (offset + (rows - 1) * stride + cols), so non-contiguous strides just work without per-row staging.
- NativeKernelProvider.matmulFp32() now returns NativeFp32MatmulKernel when the lib loads and cascades to Panama otherwise. NativeKernelProvider's class KDoc is updated to mark PR 5 as the cursor for matmulFp32.

Tests (src/jvmTest):
- NativeFp32MatmulKernelParityTest — 10 cases mirroring the Panama fixture (small contiguous, random aligned, non-aligned k tail, strided sub-block, irregular sizes, LLM-typical 256², zero-m, zero-k, negative-dim rejection) plus a provider-handout assertion. Tolerance scales with k, matching the Panama-vs-scalar bar (1e-5 * k); 5e-5 * k at 256² absorbs -ffast-math reassociation.
- Q4KMatmulMicrobenchTest grows a bench_fp32_native_vs_panama case at 256³ / 512³ / 1024³, gated by -Dskainet.runBench=true.
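For readers without the diff open: a scalar Java rendering of the contract the C kernel implements (zero-then-accumulate, i-p-j loop order, independent row strides). This is an illustrative sketch, not the project's API; the real kernel is native/src/fp32_matmul.c.

```java
// Row-major C(m,n) = A(m,k) * B(k,n) with per-matrix offsets and row strides.
// i-p-j order: the inner loop streams one row of B and one row of C, which is
// what lets the C version auto-vectorize into contiguous FMAs.
// Names (fp32Matmul, aOff, aStride, ...) are illustrative.
class Fp32MatmulSketch {
    static void fp32Matmul(float[] a, int aOff, int aStride,
                           float[] b, int bOff, int bStride,
                           float[] c, int cOff, int cStride,
                           int m, int n, int k) {
        for (int i = 0; i < m; i++) {
            int cRow = cOff + i * cStride;
            // zero-then-accumulate: C is fully overwritten; k == 0 leaves a zeroed block
            java.util.Arrays.fill(c, cRow, cRow + n, 0f);
            for (int p = 0; p < k; p++) {
                float av = a[aOff + i * aStride + p];
                int bRow = bOff + p * bStride;
                for (int j = 0; j < n; j++) {
                    c[cRow + j] += av * b[bRow + j]; // two contiguous streams
                }
            }
        }
    }
}
```

m == 0 or n == 0 falls out as a no-op since both outer loops run zero iterations.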
- NativeFfmPipelineTest's stub-flip assertion now expects both matmulFp32() and matmulQ4K() to return non-null kernels.

Microbench numbers (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3 -ffast-math; warmup=5, samples=9, median µs):

| shape | native | panama | ratio |
|-------|--------|--------|-------|
| 256³  | 1976   | 3492   | 1.77× |
| 512³  | 17048  | 26882  | 1.58× |
| 1024³ | 142463 | 220710 | 1.55× |

Honest read: the FP32 wins are more modest than Q4_K's (4–6×). Why: Panama's FP32 path has tile-blocking, B-pack (transposed), and parallelChunks across all cores, while its Q4_K path does no parallelChunks-style decomposition. Native single-threaded scalar C still wins everywhere measured because the JVM's per-call overhead and parallelChunks dispatch overhead are nontrivial at these shapes. Hand-tuned cache-blocking and threading would push the native FP32 path further, but that is perf-tuning, not correctness.

Verification (linux-x86_64, JDK 21.0.10):
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 27/27 (3 pipeline + 5 Q4_K heap-parity + 7 Q4_K memseg-parity + 10 FP32 parity + 2 microbench-gated)
- :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
- 3 native symbols exported: skainet_smoke_double, skainet_q4k_matmul, skainet_fp32_matmul

Out of scope (deferred):
- Native Q6_K and Q8_0 matmul. Both need new SPI accessors (Q6KMatmulKernel, Q8MatmulKernel), and the existing Panama provider also needs to expose its internal Q6_K / Q8_0 paths through the new SPI. That arc is a separate plan.
- Cache-blocking, B-tile packing, and parallelChunks-style threading for the native FP32 path. Profile-driven; the current scalar C path already wins everywhere measured.
- Maven Central native classifier publishing / fat-JAR aggregation (still deferred from PR 4).

Rollout state: PR 1 (scaffolding), PR 2 (Q4_K matmul), PR 3 (Q4_K MemSeg zero-copy), PR 4 (cross-arch CI matrix), PR 5 (this commit, FP32 matmul).
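For reference, the off-heap segment-sizing rule used by the FFM wrapper (copy a heap array viewed with an offset and row stride into one segment covering its full reach, so strided sub-blocks need no per-row staging) reduces to simple arithmetic. The method name below is illustrative, not project API.

```java
// Number of floats a confined-arena segment must hold so every element a
// strided row-major view can touch lies inside it: one past the last element
// of the last row, i.e. offset + (rows - 1) * rowStride + cols.
class SegmentReachSketch {
    static long reachFloats(long offset, long rowStride, long rows, long cols) {
        if (rows == 0 || cols == 0) return 0; // m == 0 || n == 0: nothing is read
        return offset + (rows - 1) * rowStride + cols;
    }
}
```

For a contiguous block (rowStride == cols, offset == 0) this is exactly rows * cols; for a sub-block of a wider parent matrix it covers the gaps between rows without copying the parent row by row.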
The 5-PR plan from native-ffm-plan.adoc is now complete in its core scope; the optional Q6_K / Q8_0 additions and the publishing infrastructure are future work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Final PR of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Wires native FP32 SGEMM into the existing matmulFp32() SPI accessor; KernelRegistry now hands out the native kernel ahead of Panama Vector for FP32 matmul on hosts where libskainet_kernels resolves.

What changed
- native/src/fp32_matmul.c — row-major C(m,n) = A(m,k) * B(k,n) with strides. i-p-j outer-product order keeps the inner `c[j] += a*b[j]` loop streaming two contiguous arrays of FMA arithmetic (auto-vectorized into vfmadd231ps / fmla under -O3 -ffast-math). Caller contract matches the SPI: zero-then-accumulate so C is fully overwritten; k=0 zeros the block; m=0 || n=0 is a no-op.
- NativeFp32MatmulKernel — FFM Linker.downcallHandle wrapping the C symbol; Arena.ofConfined segments sized to (offset + (rows-1) * stride + cols) so non-contiguous strides just work.
- NativeKernelProvider.matmulFp32() flips from null to NativeFp32MatmulKernel when available; still cascades to Panama otherwise.

Microbench (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3 -ffast-math; warmup=5, samples=9, median µs):

| shape | native | panama | ratio |
|-------|--------|--------|-------|
| 256³  | 1976   | 3492   | 1.77× |
| 512³  | 17048  | 26882  | 1.58× |
| 1024³ | 142463 | 220710 | 1.55× |

Honest read: FP32 wins are more modest than Q4_K's (4–6×) because Panama's FP32 path is much more polished — tile-blocking, B-pack into a transposed buffer, parallelChunks across all cores — while Panama's Q4_K path has none of that. Native single-threaded scalar C still wins everywhere measured because the JVM's per-call overhead and parallelChunks dispatch overhead are nontrivial at these shapes. Hand-tuned cache-blocking and threading would push native further, but that's perf-tuning, not correctness.

Test plan
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 27/27 (3 pipeline + 5 Q4_K heap-parity + 7 Q4_K memseg-parity + 10 FP32 parity + 2 microbench-gated)
- :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
- Parity against PanamaVectorMatmulKernel within 1e-5 * k (5e-5 * k at 256² to absorb -ffast-math reassociation) across small contiguous, random aligned, non-aligned k (tail loop), strided sub-block, irregular sizes, LLM-typical 256², zero-m, zero-k, negative-dim rejection

Rollout state — staged delivery now complete
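As a footnote on the k-scaled parity bar above: FP32 accumulation over a length-k reduction drifts from an FP64 reference roughly in proportion to k, which is why a fixed absolute epsilon would fail at large k. A simplified illustration (not project test code; the helper name is made up):

```java
// Compares a length-k FP32 dot product against an FP64 reference and returns
// the absolute drift. The drift grows with k, motivating a 1e-5 * k tolerance
// rather than a constant epsilon.
class TolScaleSketch {
    static float fp32DotDrift(int k, long seed) {
        java.util.Random r = new java.util.Random(seed);
        float accF = 0f;
        double accD = 0.0;
        for (int i = 0; i < k; i++) {
            float x = r.nextFloat(), y = r.nextFloat();
            accF += x * y;          // rounds to FP32 after every step
            accD += (double) x * y; // FP64 reference accumulation
        }
        return (float) Math.abs(accD - accF);
    }
}
```

-ffast-math reassociation changes the summation order on the native side, which perturbs the result further at larger shapes; hence the looser 5e-5 * k bar at 256².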
Out of scope (deferred — not part of the 5-PR plan)
- Native Q6_K / Q8_0 matmul: both need new SPI accessors (Q6KMatmulKernel, Q8MatmulKernel), and the existing Panama provider needs to expose its internal Q6_K / Q8_0 paths through the new SPI. Separate plan.
- Cache-blocking, B-tile packing, and parallelChunks-style threading for the native FP32 path — profile-driven follow-up.

🤖 Generated with Claude Code