feat(native-cpu): native FFM FP32 SGEMM kernel (PR 5 of 5)#575
Merged
michalharakal merged 1 commit into develop on Apr 29, 2026
Conversation
Final PR of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Wires a real FP32 matmul into the existing matmulFp32() SPI accessor on NativeKernelProvider; the KernelRegistry now hands out the native kernel ahead of Panama Vector for FP32 SGEMM on hosts where libskainet_kernels resolves.

Native side (native/):
- src/fp32_matmul.c implements skainet_fp32_matmul as row-major C(m,n) = A(m,k) * B(k,n) with stride support. Iteration order is i-p-j (outer product into rows of C); the inner `c[j] += a*b[j]` loop streams two contiguous arrays and auto-vectorizes cleanly under -O3 -ffast-math into vfmadd231ps on x86_64 / fmla on AArch64. The caller contract matches the SPI: zero-then-accumulate ensures C is fully overwritten, k=0 zeros the block, and m=0 || n=0 is a no-op.
- include/skainet_kernels.h declares the new export. CMakeLists adds fp32_matmul.c alongside the smoke and Q4_K sources.

Kotlin side (src/jvmMain):
- NativeFp32MatmulKernel implements Fp32MatmulKernel via an FFM downcall (12-arg FunctionDescriptor: 3 ADDRESS + 9 JAVA_INT). Heap arrays are copied into Arena.ofConfined off-heap segments sized to the reach (offset + (rows - 1) * stride + cols), so non-contiguous strides just work without per-row staging.
- NativeKernelProvider.matmulFp32() now returns NativeFp32MatmulKernel when the lib loads and cascades to Panama otherwise. NativeKernelProvider's class KDoc is updated to mark PR 5 as the cursor for matmulFp32.

Tests (src/jvmTest):
- NativeFp32MatmulKernelParityTest — 10 cases mirroring the Panama fixture (small contiguous, random aligned, non-aligned k tail, strided sub-block, irregular sizes, LLM-typical 256², zero-m, zero-k, negative-dim rejection) plus a provider-handout assertion. Tolerance scales with k, matching the Panama-vs-scalar bar (1e-5 * k); 5e-5 * k at 256² absorbs -ffast-math reassociation.
- Q4KMatmulMicrobenchTest grows a bench_fp32_native_vs_panama case at 256³ / 512³ / 1024³, gated by -Dskainet.runBench=true.
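For readers without the diff open: a scalar Java rendering of the contract the C kernel implements (zero-then-accumulate, i-p-j loop order, independent row strides). This is an illustrative sketch, not the project's API; the real kernel is native/src/fp32_matmul.c.

```java
// Row-major C(m,n) = A(m,k) * B(k,n) with per-matrix offsets and row strides.
// i-p-j order: the inner loop streams one row of B and one row of C, which is
// what lets the C version auto-vectorize into contiguous FMAs.
// Names (fp32Matmul, aOff, aStride, ...) are illustrative.
class Fp32MatmulSketch {
    static void fp32Matmul(float[] a, int aOff, int aStride,
                           float[] b, int bOff, int bStride,
                           float[] c, int cOff, int cStride,
                           int m, int n, int k) {
        for (int i = 0; i < m; i++) {
            int cRow = cOff + i * cStride;
            // zero-then-accumulate: C is fully overwritten; k == 0 leaves a zeroed block
            java.util.Arrays.fill(c, cRow, cRow + n, 0f);
            for (int p = 0; p < k; p++) {
                float av = a[aOff + i * aStride + p];
                int bRow = bOff + p * bStride;
                for (int j = 0; j < n; j++) {
                    c[cRow + j] += av * b[bRow + j]; // two contiguous streams
                }
            }
        }
    }
}
```

m == 0 or n == 0 falls out as a no-op since both outer loops run zero iterations.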
- NativeFfmPipelineTest's stub-flip assertion now expects both matmulFp32() and matmulQ4K() to return non-null kernels.

Microbench numbers (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3 -ffast-math; warmup=5, samples=9, median µs):

| shape | native | panama | ratio |
|-------|--------|--------|-------|
| 256³  | 1976   | 3492   | 1.77× |
| 512³  | 17048  | 26882  | 1.58× |
| 1024³ | 142463 | 220710 | 1.55× |

Honest read: the FP32 wins are more modest than Q4_K's (4–6×). Why: Panama's FP32 path has tile-blocking, B-pack (transposed), and parallelChunks across all cores, while its Q4_K path does no parallelChunks-style decomposition. Native single-threaded scalar C still wins everywhere measured because the JVM's per-call overhead and parallelChunks dispatch overhead are nontrivial at these shapes. Hand-tuned cache-blocking and threading would push the native FP32 path further, but that is perf-tuning, not correctness.

Verification (linux-x86_64, JDK 21.0.10):
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 27/27 (3 pipeline + 5 Q4_K heap-parity + 7 Q4_K memseg-parity + 10 FP32 parity + 2 microbench-gated)
- :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
- 3 native symbols exported: skainet_smoke_double, skainet_q4k_matmul, skainet_fp32_matmul

Out of scope (deferred):
- Native Q6_K and Q8_0 matmul. Both need new SPI accessors (Q6KMatmulKernel, Q8MatmulKernel), and the existing Panama provider also needs to expose its internal Q6_K / Q8_0 paths through the new SPI. That arc is a separate plan.
- Cache-blocking, B-tile packing, and parallelChunks-style threading for the native FP32 path. Profile-driven; the current scalar C path already wins everywhere measured.
- Maven Central native classifier publishing / fat-JAR aggregation (still deferred from PR 4).

Rollout state: PR 1 (scaffolding), PR 2 (Q4_K matmul), PR 3 (Q4_K MemSeg zero-copy), PR 4 (cross-arch CI matrix), PR 5 (this commit, FP32 matmul).
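For reference, the off-heap segment-sizing rule used by the FFM wrapper (copy a heap array viewed with an offset and row stride into one segment covering its full reach, so strided sub-blocks need no per-row staging) reduces to simple arithmetic. The method name below is illustrative, not project API.

```java
// Number of floats a confined-arena segment must hold so every element a
// strided row-major view can touch lies inside it: one past the last element
// of the last row, i.e. offset + (rows - 1) * rowStride + cols.
class SegmentReachSketch {
    static long reachFloats(long offset, long rowStride, long rows, long cols) {
        if (rows == 0 || cols == 0) return 0; // m == 0 || n == 0: nothing is read
        return offset + (rows - 1) * rowStride + cols;
    }
}
```

For a contiguous block (rowStride == cols, offset == 0) this is exactly rows * cols; for a sub-block of a wider parent matrix it covers the gaps between rows without copying the parent row by row.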
The 5-PR plan from native-ffm-plan.adoc is now complete in its core scope; the optional Q6_K / Q8_0 additions and the publishing infrastructure are future work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Final PR of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Wires native FP32 SGEMM into the existing matmulFp32() SPI accessor; KernelRegistry now hands out the native kernel ahead of Panama Vector for FP32 matmul on hosts where libskainet_kernels resolves.

What changed
- native/src/fp32_matmul.c — row-major C(m,n) = A(m,k) * B(k,n) with strides. i-p-j outer-product order keeps the inner `c[j] += a*b[j]` loop streaming two contiguous arrays of FMA arithmetic (auto-vectorized into vfmadd231ps / fmla under -O3 -ffast-math). Caller contract matches the SPI: zero-then-accumulate so C is fully overwritten; k=0 zeros the block; m=0 || n=0 is a no-op.
- NativeFp32MatmulKernel — FFM Linker.downcallHandle wrapping the C symbol; Arena.ofConfined segments sized to (offset + (rows-1) * stride + cols) so non-contiguous strides just work.
- NativeKernelProvider.matmulFp32() flips from null to NativeFp32MatmulKernel when available; still cascades to Panama otherwise.

Microbench (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3 -ffast-math; warmup=5, samples=9, median µs):

| shape | native | panama | ratio |
|-------|--------|--------|-------|
| 256³  | 1976   | 3492   | 1.77× |
| 512³  | 17048  | 26882  | 1.58× |
| 1024³ | 142463 | 220710 | 1.55× |

Honest read: FP32 wins are more modest than Q4_K's (4–6×) because Panama's FP32 path is much more polished — tile-blocking, B-pack into a transposed buffer, parallelChunks across all cores — while Panama's Q4_K path has none of that. Native single-threaded scalar C still wins everywhere measured because the JVM's per-call overhead and parallelChunks dispatch overhead are nontrivial at these shapes. Hand-tuned cache-blocking and threading would push native further, but that's perf-tuning, not correctness.

Test plan
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 27/27 (3 pipeline + 5 Q4_K heap-parity + 7 Q4_K memseg-parity + 10 FP32 parity + 2 microbench-gated)
- :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
- Parity against PanamaVectorMatmulKernel within 1e-5 * k (5e-5 * k at 256² to absorb -ffast-math reassociation) across small contiguous, random aligned, non-aligned k (tail loop), strided sub-block, irregular sizes, LLM-typical 256², zero-m, zero-k, negative-dim rejection

Rollout state — staged delivery now complete
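As a footnote on the k-scaled parity bar above: FP32 accumulation over a length-k reduction drifts from an FP64 reference roughly in proportion to k, which is why a fixed absolute epsilon would fail at large k. A simplified illustration (not project test code; the helper name is made up):

```java
// Compares a length-k FP32 dot product against an FP64 reference and returns
// the absolute drift. The drift grows with k, motivating a 1e-5 * k tolerance
// rather than a constant epsilon.
class TolScaleSketch {
    static float fp32DotDrift(int k, long seed) {
        java.util.Random r = new java.util.Random(seed);
        float accF = 0f;
        double accD = 0.0;
        for (int i = 0; i < k; i++) {
            float x = r.nextFloat(), y = r.nextFloat();
            accF += x * y;          // rounds to FP32 after every step
            accD += (double) x * y; // FP64 reference accumulation
        }
        return (float) Math.abs(accD - accF);
    }
}
```

-ffast-math reassociation changes the summation order on the native side, which perturbs the result further at larger shapes; hence the looser 5e-5 * k bar at 256².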
Out of scope (deferred — not part of the 5-PR plan)
- Native Q6_K / Q8_0 matmul: both need new SPI accessors (Q6KMatmulKernel, Q8MatmulKernel), and the existing Panama provider needs to expose its internal Q6_K / Q8_0 paths through the new SPI. Separate plan.
- Cache-blocking, B-tile packing, and parallelChunks-style threading for the native FP32 path — profile-driven follow-up.

🤖 Generated with Claude Code