perf(q4_k): SIMD-fy matmulF32Q4_KMemSeg via ByteVector.fromMemorySegment #563
Merged
michalharakal merged 1 commit into develop on Apr 28, 2026
Replaces the scalar nibble-unpack helper dotQ4_KHalfNibbleSubBlockMemSeg with the same fused lo+hi ByteVector pipeline used by PanamaVectorQ4KMatmulKernel (PR #562). The only difference is that this kernel reads the qs slab via ByteVector.fromMemorySegment, for mmap'd weight buffers, where #562 uses ByteVector.fromArray for ByteArray-backed weights. A single byte load per chunk feeds both the lo and hi sub-block accumulators.

Inline replacement; no new SPI surface. The MemSeg path is the production hot path for Gemma 4 E2B Q4_K_M loaded via mmap, where weight pages should never be copied to the heap.

Existing parity tests (QuantizedMemSegMatmulTest, MemSegArenaLeakTest) exercise the path end-to-end through ctx.ops.matmul; the cpu jvmTest suite passes 213/213, unchanged.

Future: a dedicated parity test that calls the kernel directly (matching PanamaVectorQ4KMatmulKernelTest's shape) is a fair follow-up if the indirect coverage proves insufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
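To make the "single byte load feeds both accumulators" idea concrete, here is a minimal scalar sketch in Java. It is not the PR's code: the real kernel works on ByteVector lanes, and the class, method, and parameter names below are hypothetical. It also assumes the ggml-style Q4_K pairing, in which the low nibbles of a 32-byte chunk hold one 32-element sub-block and the high nibbles hold the next, each with its own scale and min.

```java
public final class FusedNibbleSketch {
    // Assumed sub-block size: 32 packed bytes carry two 32-element sub-blocks.
    static final int SUB_BLOCK = 32;

    /** Reference shape: two passes, so every packed byte is loaded twice. */
    static float twoPassDot(byte[] qs, float[] x,
                            float scaleLo, float minLo,
                            float scaleHi, float minHi) {
        float loQ = 0f, loX = 0f;
        for (int l = 0; l < SUB_BLOCK; l++) {
            int q = qs[l] & 0x0F;              // low nibble
            loQ += x[l] * q;
            loX += x[l];
        }
        float hiQ = 0f, hiX = 0f;
        for (int l = 0; l < SUB_BLOCK; l++) {
            int q = (qs[l] & 0xFF) >>> 4;      // high nibble, masked to avoid sign extension
            hiQ += x[SUB_BLOCK + l] * q;
            hiX += x[SUB_BLOCK + l];
        }
        return scaleLo * loQ - minLo * loX + scaleHi * hiQ - minHi * hiX;
    }

    /** Fused shape: one load per packed byte updates both accumulators. */
    static float fusedDot(byte[] qs, float[] x,
                          float scaleLo, float minLo,
                          float scaleHi, float minHi) {
        float loQ = 0f, loX = 0f, hiQ = 0f, hiX = 0f;
        for (int l = 0; l < SUB_BLOCK; l++) {
            int b = qs[l] & 0xFF;              // the single load
            float xLo = x[l];
            float xHi = x[SUB_BLOCK + l];
            loQ += xLo * (b & 0x0F);
            loX += xLo;
            hiQ += xHi * (b >>> 4);
            hiX += xHi;
        }
        return scaleLo * loQ - minLo * loX + scaleHi * hiQ - minHi * hiX;
    }

    public static void main(String[] args) {
        byte[] qs = new byte[SUB_BLOCK];
        float[] x = new float[2 * SUB_BLOCK];
        for (int i = 0; i < SUB_BLOCK; i++) qs[i] = (byte) (i * 7 + 3);
        for (int i = 0; i < x.length; i++) x[i] = i - 10;
        System.out.println(twoPassDot(qs, x, 0.5f, 1f, 0.25f, 2f));
        System.out.println(fusedDot(qs, x, 0.5f, 1f, 0.25f, 2f));
    }
}
```

Both methods accumulate each sub-block in the same order, so they should agree exactly; the fused form only halves the number of loads from the packed buffer. A direct parity test of the kind mentioned as a follow-up would compare the two in the same way.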
Summary

Same fused lo+hi ByteVector SIMD pipeline as #562's PanamaVectorQ4KMatmulKernel, applied inline to the MemSeg Q4_K path (matmulF32Q4_KMemSeg). Only difference: ByteVector.fromMemorySegment instead of ByteVector.fromArray, since the weight buffer is mmap'd. A single byte load per chunk feeds both the lo and hi sub-block accumulators.

Why this matters

The MemSeg path is the production hot path for Gemma 4 E2B Q4_K_M weights loaded via mmap, where pages should never be copied to the heap. Before this PR the inner loop was a scalar nibble-unpack into a scratch FloatArray (dotQ4_KHalfNibbleSubBlockMemSeg). With #562 making the ByteArray path SIMD-fused, parity between the two paths motivated lifting MemSeg to the same algorithm.

Scope

- Inline replacement in JvmQuantizedVectorKernels.matmulF32Q4_KMemSeg. No new SPI surface (a sibling Q4KMemSegMatmulKernel is a fair follow-up if a native FFM provider needs to register here too).
- Removes dotQ4_KHalfNibbleSubBlockMemSeg (now unused).

Test plan

- ./gradlew :skainet-backends:skainet-backend-cpu:jvmTest: 213/213 passes. QuantizedMemSegMatmulTest and MemSegArenaLeakTest exercise the path end-to-end through ctx.ops.matmul.

Out of scope (M5 follow-ups)

- Dedicated direct-kernel parity test, matching the shape of PanamaVectorQ4KMatmulKernelTest, for the MemSeg variant.

🤖 Generated with Claude Code