perf(q4_k): SIMD-fy matmulF32Q4_KMemSeg via ByteVector.fromMemorySegment#563

Merged

michalharakal merged 1 commit into develop from feature/jvm-q4k-memseg-simd on Apr 28, 2026
Conversation

@michalharakal
Contributor

Summary

The same fused lo+hi ByteVector SIMD pipeline as #562's PanamaVectorQ4KMatmulKernel, applied inline to the MemSeg Q4_K path (matmulF32Q4_KMemSeg). The only difference: ByteVector.fromMemorySegment instead of ByteVector.fromArray, since the weight buffer is mmap'd. A single byte load per chunk feeds both the lo and hi sub-block accumulators.
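For reference, the lo+hi split the kernel fuses can be illustrated scalar-wise: each byte of the packed qs slab carries two 4-bit quants, with the low nibble feeding one sub-block accumulator and the high nibble another, so one byte load serves both. This is a minimal illustrative sketch (class and method names are hypothetical, not from the kernel; the real code does the same AND/shift split lanewise on ByteVectors):

```java
// Hypothetical scalar illustration of the Q4_K nibble unpack that the
// SIMD kernel performs lanewise after a single ByteVector load.
public class NibbleUnpackSketch {
    // Low nibble: 4-bit quant for the "lo" sub-block accumulator.
    static int lo(byte b) {
        return b & 0x0F;
    }

    // High nibble: 4-bit quant for the "hi" sub-block accumulator.
    // Mask after the shift to discard sign-extension bits.
    static int hi(byte b) {
        return (b >>> 4) & 0x0F;
    }

    public static void main(String[] args) {
        byte packed = (byte) 0xA3; // high nibble 0xA, low nibble 0x3
        System.out.println(lo(packed)); // 3
        System.out.println(hi(packed)); // 10
    }
}
```

In the vectorized path the same two extractions become a lanewise `and(0x0F)` and a lanewise logical shift right by 4 on the loaded ByteVector, which is why one load per chunk is enough for both accumulators.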

Why this matters

The MemSeg path is the production hot path for Gemma 4 E2B Q4_K_M weights loaded via mmap, where pages should never be copied to the heap. Before this PR, the inner loop did a scalar nibble-unpack into a scratch FloatArray (dotQ4_KHalfNibbleSubBlockMemSeg). With #562 making the ByteArray path SIMD-fused, parity between the two paths motivated lifting MemSeg to the same algorithm.

Scope

  • Inline replacement in JvmQuantizedVectorKernels.matmulF32Q4_KMemSeg. No new SPI surface (sibling Q4KMemSegMatmulKernel is a fair follow-up if a native FFM provider needs to register here too).
  • Removes dotQ4_KHalfNibbleSubBlockMemSeg (now unused).

Test plan

  • ./gradlew :skainet-backends:skainet-backend-cpu:jvmTest — 213/213 passes; QuantizedMemSegMatmulTest and MemSegArenaLeakTest exercise the path end-to-end through ctx.ops.matmul.

Out of scope (M5 follow-ups)

  • Dedicated kernel-direct parity test mirroring PanamaVectorQ4KMatmulKernelTest for the MemSeg variant.
  • JMH bench variant for MemSeg-backed weights.
  • Q4KMemSegMatmulKernel SPI sibling — only worthwhile when a second concrete provider lands (likely the native FFM provider).

🤖 Generated with Claude Code

Replaces the scalar nibble-unpack helper dotQ4_KHalfNibbleSubBlockMemSeg
with the same fused lo+hi ByteVector pipeline used by
PanamaVectorQ4KMatmulKernel (PR #562) — the only difference is that
this kernel reads the qs slab via ByteVector.fromMemorySegment for
mmap'd weight buffers, vs ByteVector.fromArray for ByteArray-backed
weights. Single byte load per chunk feeds both lo and hi sub-block
accumulators.

Inline replacement; no new SPI surface. The MemSeg path is the
production hot path for Gemma 4 E2B Q4_K_M loaded via mmap, where
weight pages should never be copied to the heap.

Existing parity tests (QuantizedMemSegMatmulTest, MemSegArenaLeakTest)
exercise the path end-to-end through ctx.ops.matmul; cpu jvmTest
suite passes 213/213 unchanged. Future: a dedicated parity test for
the kernel direct (matching PanamaVectorQ4KMatmulKernelTest's shape)
is a fair follow-up if the indirect coverage proves insufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal marked this pull request as ready for review April 28, 2026 20:51
@michalharakal michalharakal merged commit 3ea9b5f into develop Apr 28, 2026
6 checks passed
@michalharakal michalharakal deleted the feature/jvm-q4k-memseg-simd branch April 28, 2026 20:51
