perf(q4_k): SIMD-fy matmulF32Q4_KMemSeg via ByteVector.fromMemorySegment#563

Merged

michalharakal merged 1 commit into develop from feature/jvm-q4k-memseg-simd on Apr 28, 2026
Conversation

@michalharakal
Contributor

Summary

The same fused lo+hi ByteVector SIMD pipeline as #562's PanamaVectorQ4KMatmulKernel, applied inline to the MemSeg Q4_K path (matmulF32Q4_KMemSeg). The only difference: ByteVector.fromMemorySegment instead of ByteVector.fromArray, since the weight buffer is mmap'd. A single byte load per chunk feeds both the lo and hi sub-block accumulators.
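For reference, the lo+hi split the kernel fuses can be illustrated scalar-wise: each byte of the packed qs slab carries two 4-bit quants, with the low nibble feeding one sub-block accumulator and the high nibble another, so one byte load serves both. This is a minimal illustrative sketch (class and method names are hypothetical, not from the kernel; the real code does the same AND/shift split lanewise on ByteVectors):

```java
// Hypothetical scalar illustration of the Q4_K nibble unpack that the
// SIMD kernel performs lanewise after a single ByteVector load.
public class NibbleUnpackSketch {
    // Low nibble: 4-bit quant for the "lo" sub-block accumulator.
    static int lo(byte b) {
        return b & 0x0F;
    }

    // High nibble: 4-bit quant for the "hi" sub-block accumulator.
    // Mask after the shift to discard sign-extension bits.
    static int hi(byte b) {
        return (b >>> 4) & 0x0F;
    }

    public static void main(String[] args) {
        byte packed = (byte) 0xA3; // high nibble 0xA, low nibble 0x3
        System.out.println(lo(packed)); // 3
        System.out.println(hi(packed)); // 10
    }
}
```

In the vectorized path the same two extractions become a lanewise `and(0x0F)` and a lanewise logical shift right by 4 on the loaded ByteVector, which is why one load per chunk is enough for both accumulators.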

Why this matters

The MemSeg path is the production hot path for Gemma 4 E2B Q4_K_M weights loaded via mmap, where pages should never be copied to the heap. Before this PR, the inner loop did a scalar nibble-unpack into a scratch FloatArray (dotQ4_KHalfNibbleSubBlockMemSeg). With #562 making the ByteArray path SIMD-fused, parity between the two paths motivated lifting MemSeg to the same algorithm.

Scope

  • Inline replacement in JvmQuantizedVectorKernels.matmulF32Q4_KMemSeg. No new SPI surface (sibling Q4KMemSegMatmulKernel is a fair follow-up if a native FFM provider needs to register here too).
  • Removes dotQ4_KHalfNibbleSubBlockMemSeg (now unused).

Test plan

  • ./gradlew :skainet-backends:skainet-backend-cpu:jvmTest — 213/213 passes; QuantizedMemSegMatmulTest and MemSegArenaLeakTest exercise the path end-to-end through ctx.ops.matmul.

Out of scope (M5 follow-ups)

  • Dedicated kernel-direct parity test mirroring PanamaVectorQ4KMatmulKernelTest for the MemSeg variant.
  • JMH bench variant for MemSeg-backed weights.
  • Q4KMemSegMatmulKernel SPI sibling — only worthwhile when a second concrete provider lands (likely the native FFM provider).

🤖 Generated with Claude Code

Replaces the scalar nibble-unpack helper dotQ4_KHalfNibbleSubBlockMemSeg
with the same fused lo+hi ByteVector pipeline used by
PanamaVectorQ4KMatmulKernel (PR #562) — the only difference is that
this kernel reads the qs slab via ByteVector.fromMemorySegment for
mmap'd weight buffers, vs ByteVector.fromArray for ByteArray-backed
weights. Single byte load per chunk feeds both lo and hi sub-block
accumulators.

Inline replacement; no new SPI surface. The MemSeg path is the
production hot path for Gemma 4 E2B Q4_K_M loaded via mmap, where
weight pages should never be copied to the heap.

Existing parity tests (QuantizedMemSegMatmulTest, MemSegArenaLeakTest)
exercise the path end-to-end through ctx.ops.matmul; cpu jvmTest
suite passes 213/213 unchanged. Future: a dedicated parity test for
the kernel direct (matching PanamaVectorQ4KMatmulKernelTest's shape)
is a fair follow-up if the indirect coverage proves insufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal marked this pull request as ready for review April 28, 2026 20:51
@michalharakal michalharakal merged commit 3ea9b5f into develop Apr 28, 2026
6 checks passed
@michalharakal michalharakal deleted the feature/jvm-q4k-memseg-simd branch April 28, 2026 20:51
