
feat(native-cpu): native FFM FP32 SGEMM kernel (PR 5 of 5)#575

Merged

michalharakal merged 1 commit into develop from feature/native-fp32-matmul on Apr 29, 2026

Conversation

@michalharakal
Contributor

Summary

Final PR of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Wires native FP32 SGEMM into the existing matmulFp32() SPI accessor; KernelRegistry now hands out the native kernel ahead of Panama Vector for FP32 matmul on hosts where libskainet_kernels resolves.

What changed

  • native/src/fp32_matmul.c — row-major C(m,n) = A(m,k) * B(k,n) with strides. i-p-j outer-product order, so the inner c[j] += a*b[j] loop streams two contiguous arrays and auto-vectorizes into FMA arithmetic (vfmadd231ps / fmla under -O3 -ffast-math). Caller contract matches the SPI: zero-then-accumulate so C is fully overwritten; k=0 zeros the block; m=0||n=0 is a no-op.
  • NativeFp32MatmulKernel — FFM Linker.downcallHandle wrapping the C symbol; Arena.ofConfined segments sized to (offset + (rows-1) * stride + cols) so non-contiguous strides Just Work.
  • NativeKernelProvider.matmulFp32() flips from null to NativeFp32MatmulKernel when available; still cascades to Panama otherwise.
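The i-p-j order described in the first bullet can be sketched in plain C. This is illustrative only: the shipped skainet_fp32_matmul's name, parameter order, and offset handling are not reproduced here.

```c
/* Row-major C(m,n) = A(m,k) * B(k,n), zero-then-accumulate, with
 * per-matrix row strides (lda/ldb/ldc). Sketch of the i-p-j loop
 * order described above, not the shipped kernel. */
#include <stddef.h>

static void fp32_matmul_sketch(float *c, const float *a, const float *b,
                               int m, int n, int k,
                               int lda, int ldb, int ldc) {
    for (int i = 0; i < m; ++i) {
        float *crow = c + (size_t)i * ldc;
        for (int j = 0; j < n; ++j)
            crow[j] = 0.0f;                 /* k == 0 leaves the block zeroed */
        for (int p = 0; p < k; ++p) {
            const float aip = a[(size_t)i * lda + p];
            const float *brow = b + (size_t)p * ldb;
            /* two contiguous streams (crow, brow): this is the loop that
             * auto-vectorizes into vfmadd231ps / fmla under -O3 -ffast-math */
            for (int j = 0; j < n; ++j)
                crow[j] += aip * brow[j];
        }
    }
}
```

Zeroing each C row up front makes the k=0 contract fall out naturally, and m=0 or n=0 simply runs no iterations.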

Microbench (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3 -ffast-math; warmup=5, samples=9, median µs)

  shape    native     panama    ratio
  256³      1976       3492     1.77×
  512³     17048      26882     1.58×
  1024³   142463     220710     1.55×

Honest read: FP32 wins are more modest than Q4_K (Q4_K was 4–6×). Panama's FP32 path is much more polished: tile-blocking, B-pack into a transposed buffer, parallelChunks across all cores. Panama's Q4_K path has none of that. Native single-threaded scalar C still wins everywhere measured because the JVM's per-call overhead and the parallelChunks dispatch overhead are nontrivial at these shapes. Hand-tuned cache-blocking and threading would push native further, but that is perf tuning, not correctness.

Test plan

  • :skainet-backends:skainet-backend-native-cpu:jvmTest — 27/27 (3 pipeline + 5 Q4_K heap-parity + 7 Q4_K memseg-parity + 10 FP32 parity + 2 microbench-gated)
  • :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
  • FP32 parity vs PanamaVectorMatmulKernel within 1e-5 * k (5e-5 * k at 256² to absorb -ffast-math reassociation) across small contiguous, random aligned, non-aligned k (tail loop), strided sub-block, irregular sizes, LLM-typical 256², zero-m, zero-k, negative-dim rejection
  • CI cross-arch matrix (PR 4's workflow) verifies macos-arm64 / linux-arm64 / windows-x86_64 build the new kernel cleanly
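The k-scaled tolerance in the parity bullet can be expressed as a small helper (the helper name and shape here are ours, not the test fixture's): a length-k dot product accumulates k roundings, so the bar grows linearly with k.

```c
/* Sketch of a k-scaled parity check: pass per_k_tol = 1e-5f for the
 * default bar, 5e-5f for the 256² case that absorbs -ffast-math
 * reassociation. Illustrative helper, not the shipped fixture. */
#include <stdbool.h>

static bool fp32_parity_ok(const float *got, const float *want,
                           int count, int k, float per_k_tol) {
    const float tol = per_k_tol * (float)k;   /* tolerance scales with k */
    for (int i = 0; i < count; ++i) {
        float diff = got[i] - want[i];
        if (diff < 0.0f)
            diff = -diff;
        if (diff > tol)
            return false;
    }
    return true;
}
```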

Rollout state — staged delivery now complete

  PR  What                                            Status
  1   Module scaffolding + smoke FFM downcall         #571
  2   Native Q4_K matmul (4–6× over Panama)           #572
  3   Q4_K MemSeg zero-copy SPI sibling               #573
  4   Cross-arch CI matrix + MSVC/Clang portability   #574
  5   Native FP32 SGEMM (this PR)                     #575

Out of scope (deferred — not part of the 5-PR plan)

  • Native Q6_K and Q8_0 kernels — need new SPI accessors (Q6KMatmulKernel, Q8MatmulKernel) and the existing Panama provider needs to expose its internal Q6_K / Q8_0 paths through the new SPI. Separate plan.
  • Cache-blocking + B-tile packing + parallelChunks-style threading for the native FP32 path — profile-driven follow-up.
  • Maven Central native classifier publishing / fat-JAR aggregation (deferred since PR 4).

🤖 Generated with Claude Code

Final PR of the staged native-FFM rollout per
docs/.../perf/native-ffm-plan.adoc. Wires a real FP32 matmul into
the existing matmulFp32() SPI accessor on NativeKernelProvider; the
KernelRegistry now hands out the native kernel ahead of Panama
Vector for FP32 SGEMM on hosts where libskainet_kernels resolves.

Native side (native/):

- src/fp32_matmul.c implements skainet_fp32_matmul as row-major
  C(m,n) = A(m,k) * B(k,n) with stride support. Iteration order is
  i-p-j (outer product into rows of C); the inner `c[j] += a*b[j]`
  loop streams two contiguous arrays and auto-vectorizes cleanly
  under -O3 -ffast-math into vfmadd231ps on x86_64 / fmla on AArch64.
  Caller contract matches the SPI: zero-then-accumulate ensures C is
  fully overwritten, k=0 zeros the block, m=0||n=0 is a no-op.

- include/skainet_kernels.h declares the new export. CMakeLists adds
  fp32_matmul.c alongside the smoke and Q4_K sources.

Kotlin side (src/jvmMain):

- NativeFp32MatmulKernel implements Fp32MatmulKernel via FFM downcall
  (12-arg FunctionDescriptor: 3 ADDRESS + 9 JAVA_INT). Heap arrays
  are copied into Arena.ofConfined off-heap segments sized to the
  reach (offset + (rows-1) * stride + cols) so non-contiguous strides
  Just Work without per-row staging.
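A sketch of that reach sizing (helper name is illustrative): the last element a row-major strided view touches sits at offset + (rows-1)*stride + (cols-1), so a segment holding offset + (rows-1)*stride + cols floats covers every access, even when stride > cols (non-contiguous rows).

```c
/* Number of float elements a strided row-major view can reach.
 * Illustrative helper, not the shipped Kotlin sizing code. */
#include <stddef.h>

static size_t fp32_reach_elems(int offset, int rows, int cols, int stride) {
    if (rows <= 0 || cols <= 0)
        return 0;   /* empty view reaches nothing */
    return (size_t)offset + (size_t)(rows - 1) * (size_t)stride + (size_t)cols;
}
```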

- NativeKernelProvider.matmulFp32() now returns
  NativeFp32MatmulKernel when the lib loads; cascades to Panama
  otherwise. NativeKernelProvider's class kdoc is updated to record
  that matmulFp32 is delivered as of PR 5.

Tests (src/jvmTest):

- NativeFp32MatmulKernelParityTest — 10 cases mirroring the Panama
  fixture (small contiguous, random aligned, non-aligned k tail,
  strided sub-block, irregular sizes, LLM-typical 256², zero-m,
  zero-k, negative-dim rejection) plus a provider-handout assertion.
  Tolerance scaled with k matching the Panama-vs-Scalar bar
  (1e-5 * k); 5e-5 * k at 256² to absorb -ffast-math reassociation.

- Q4KMatmulMicrobenchTest gains a bench_fp32_native_vs_panama case
  at 256³ / 512³ / 1024³, gated by -Dskainet.runBench=true.

- NativeFfmPipelineTest's stub-flip assertion now expects both
  matmulFp32() and matmulQ4K() to return non-null kernels.

Microbench numbers (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3
-ffast-math; warmup=5, samples=9, median µs):

  shape    native     panama    ratio
  256³      1976       3492     1.77×
  512³     17048      26882     1.58×
  1024³   142463     220710     1.55×
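For scale, an m = n = k SGEMM performs 2·n³ FLOPs, so the medians above put the native path around 17 GFLOP/s and Panama around 9.6 GFLOP/s at 256³. A back-of-envelope helper (ours, not part of the benchmark harness):

```c
/* Throughput from a cubic-shape SGEMM median: 2*n^3 FLOPs over the
 * measured time in microseconds, expressed in GFLOP/s. */
static double gflops(double n, double micros) {
    return (2.0 * n * n * n) / (micros * 1e-6) / 1e9;
}
/* gflops(256, 1976) -> ~17.0 (native)
 * gflops(256, 3492) -> ~9.6  (panama) */
```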

Honest read: FP32 wins are more modest than Q4_K (Q4_K was 4–6×).
Why: Panama's FP32 path has tile-blocking + B-pack (transposed)
+ parallelChunks across all cores; Q4_K's Panama path doesn't
do parallelChunks-style decomposition. Native single-threaded scalar
C still wins everywhere measured because the JVM's per-call
overhead and parallelChunks dispatch overhead are nontrivial at
these shapes. Hand-tuned cache-blocking + threading would push the
native FP32 path further but is perf-tuning, not correctness.

Verification (linux-x86_64, JDK 21.0.10):
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 27/27
  (3 pipeline + 5 Q4_K heap-parity + 7 Q4_K memseg-parity + 10 FP32
  parity + 2 microbench-gated)
- :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
- 3 native symbols exported: skainet_smoke_double, skainet_q4k_matmul,
  skainet_fp32_matmul

Out of scope (deferred):

- Native Q6_K and Q8_0 matmul. Both need new SPI accessors
  (Q6KMatmulKernel, Q8MatmulKernel) and the existing Panama provider
  also needs to expose its internal Q6_K / Q8_0 paths through the
  new SPI. That arc is a separate plan.

- Cache-blocking + B-tile packing + parallelChunks-style threading
  for the native FP32 path. Profile-driven; the current scalar C
  path already wins everywhere measured.

- Maven Central native classifier publishing / fat-JAR aggregation
  (still deferred from PR 4).

Rollout state: PR 1 (scaffolding), PR 2 (Q4_K matmul), PR 3 (Q4_K
MemSeg zero-copy), PR 4 (cross-arch CI matrix), PR 5 (this commit,
FP32 matmul). The 5-PR plan from native-ffm-plan.adoc is now
complete in its core scope; the optional Q6_K / Q8_0 additions and
the publishing infrastructure are future work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 9d05fc4 into develop Apr 29, 2026
9 of 10 checks passed
@michalharakal michalharakal mentioned this pull request Apr 30, 2026
@michalharakal michalharakal deleted the feature/native-fp32-matmul branch May 2, 2026 17:34