feat(native-cpu): zero-copy Q4_K MemSeg kernel + SPI sibling (PR 3 of 5)#573
Merged
michalharakal merged 1 commit into develop on Apr 29, 2026
Conversation
PR 3 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc.
Closes the M4↔M5 zero-copy story for mmap'd Q4_K weights: callers
that already hold off-heap weight bytes (mmap'd .gguf files, shared
arenas) skip the staging ByteArray → MemorySegment copy that
NativeQ4KMatmulKernel.matmul performs on every call.
SPI surface (skainet-backend-api/src/jvmMain):
- Q4KMemSegMatmulKernel — JVM-only sibling of Q4KMatmulKernel; same
block layout / lazy-dmin contract, but `weight` is a
java.lang.foreign.MemorySegment with a Long byte offset. KMP-safe
positioning: lives in jvmMain (not commonMain) because
java.lang.foreign isn't available on Native / JS / Wasm targets.
- MemSegKernelProvider — JVM-only sibling of KernelProvider that
exposes a `matmulQ4KMemSeg(): Q4KMemSegMatmulKernel?` accessor with
a `null`-defaulting body. Lookup pattern at the call site:
```kotlin
val kernel = (KernelRegistry.bestAvailable() as? MemSegKernelProvider)
    ?.matmulQ4KMemSeg() ?: heapFallback()
```
Doesn't fork the registry — providers opt into MemSeg surfaces by
implementing both interfaces; smart-cast does the rest. Adding
`matmulQ4KMemSeg` directly to KernelProvider would have broken
commonMain (MemorySegment is JVM-only).
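The opt-in mechanics can be sketched in plain Java (the real SPI is Kotlin in jvmMain; the types and `NarrowWrapper` class here are hypothetical stand-ins), using `instanceof` where Kotlin would smart-cast. The second check also shows the failure mode the delegation note below guards against: a forwarding wrapper that re-declares only `KernelProvider` hides the MemSeg surface even though its delegate implements both interfaces.

```java
// Hypothetical stand-ins for the Kotlin SPI types described above.
interface Q4KMemSegMatmulKernel { /* matmul over a MemorySegment weight */ }

interface KernelProvider { String name(); }

// JVM-only sibling: the accessor defaults to null, so providers that
// don't support MemorySegment weights need no changes at all.
interface MemSegKernelProvider {
    default Q4KMemSegMatmulKernel matmulQ4KMemSeg() { return null; }
}

// A provider opts in by implementing BOTH interfaces.
final class NativeProvider implements KernelProvider, MemSegKernelProvider {
    public String name() { return "native-cpu"; }
    public Q4KMemSegMatmulKernel matmulQ4KMemSeg() {
        return new Q4KMemSegMatmulKernel() {};
    }
}

// A wrapper that forwards calls but re-declares only KernelProvider:
// the instanceof check below fails even though the wrapped provider
// implements both interfaces.
final class NarrowWrapper implements KernelProvider {
    private final NativeProvider delegate = new NativeProvider();
    public String name() { return delegate.name(); }
}

public class SpiLookupSketch {
    // Call-site pattern: Java instanceof where Kotlin would smart-cast.
    static Q4KMemSegMatmulKernel lookup(KernelProvider best) {
        return (best instanceof MemSegKernelProvider p) ? p.matmulQ4KMemSeg() : null;
    }

    public static void main(String[] args) {
        System.out.println(lookup(new NativeProvider()) != null);  // opted in
        System.out.println(lookup(new NarrowWrapper()) != null);   // falls back
    }
}
```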
Native side (skainet-backend-native-cpu):
- NativeQ4KMemSegMatmulKernel reuses PR 2's skainet_q4k_matmul C
symbol — the kernel just sees `const uint8_t*` and is oblivious to
whether the bytes were staged through an arena or read directly
from a caller-owned segment. The weight pointer is forwarded
straight through; only input/output go through small confined-arena
copies (those are usually a few KB and produced/consumed on the
heap by the surrounding forward pass).
- Validates that the segment holds at least `(inputDim/256) *
outputDim * 144` bytes past the given offset and throws
IllegalArgumentException otherwise; without that check, an undersized
segment would crash the JVM with SIGSEGV from the C side.
- weightByteOffset is a Long on the Kotlin side that narrows to
int32_t at the FFM boundary; we require it to be <= Int.MAX_VALUE for
now and document the eventual int64_t-offset overload as a follow-up.
No single tensor in current LLMs exceeds 2 GB.
- NativeKernelProvider now implements both KernelProvider and
MemSegKernelProvider; NativeKernelProviderFactory delegates both
via `by NativeKernelProvider`. Without the second `by`, the factory
instance the registry hands out would fail the smart-cast even
though the underlying singleton implements both interfaces.
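The size and offset guards described above reduce to a little integer arithmetic. A hedged Java sketch (method and constant names are illustrative, not the real Kotlin API); the required-bytes product is computed in `long` so it cannot overflow before the comparison:

```java
public class Q4KSegmentChecks {
    static final int QK_K = 256;         // weights per Q4_K superblock
    static final int BLOCK_BYTES = 144;  // bytes per Q4_K superblock

    /** Bytes the C kernel will read: (inputDim / 256) * outputDim * 144. */
    static long requiredWeightBytes(int inputDim, int outputDim) {
        return (long) (inputDim / QK_K) * outputDim * BLOCK_BYTES;
    }

    /** Mirrors the guards described above: reject before the C side SIGSEGVs. */
    static void validate(long segmentByteSize, long weightByteOffset,
                         int inputDim, int outputDim) {
        if (weightByteOffset > Integer.MAX_VALUE)
            throw new IllegalArgumentException(
                "weightByteOffset exceeds int32_t range; int64_t overload is a follow-up");
        long needed = requiredWeightBytes(inputDim, outputDim);
        if (segmentByteSize - weightByteOffset < needed)
            throw new IllegalArgumentException(
                "segment too small: need " + needed + " bytes past offset " + weightByteOffset);
    }

    public static void main(String[] args) {
        // 4096x4096: 16 superblocks per row * 4096 rows * 144 B = 9,437,184 B,
        // the ~9 MB weight copy figure quoted in the microbench discussion.
        System.out.println(requiredWeightBytes(4096, 4096)); // 9437184
    }
}
```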
Tests (skainet-backend-native-cpu/src/jvmTest):
- NativeQ4KMemSegMatmulKernelParityTest — 7 tests asserting
bit-identical output (compared via Float.toRawBits, no tolerance)
to NativeQ4KMatmulKernel across single-block / multi-block /
LLM-typical shapes (256×{1,16}, 1024×64, 4096×64). The bit-identical
contract is the right bar: same C symbol, same inputs ⇒ same outputs;
any drift means the wrapper added arithmetic.
- Honors-non-zero-weight-byte-offset and rejects-undersized-segment
cases for the new validation logic.
- Provider/factory smart-cast tests confirm the SPI plumbing works
end-to-end (NativeKernelProvider as MemSegKernelProvider succeeds;
factory ditto).
- Q4KMatmulMicrobenchTest extended: heap-copy vs zero-copy at LLM
shapes. Weight segment pre-allocated in an Arena.ofShared outside
the timed region — that's the realistic load profile (mmap once,
reuse across forward passes).
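The Float.toRawBits bar used by the parity tests is stricter than `==` comparison. A small illustrative Java snippet (`Float.floatToRawIntBits` is the JVM primitive that Kotlin's `toRawBits` maps to) shows the difference: raw-bits equality distinguishes even `-0.0f` from `0.0f`, so any arithmetic added by a wrapper fails immediately instead of hiding inside a tolerance:

```java
public class RawBitsParity {
    // Bit-identical comparison: the bar used by the parity tests.
    static boolean bitIdentical(float a, float b) {
        return Float.floatToRawIntBits(a) == Float.floatToRawIntBits(b);
    }

    public static void main(String[] args) {
        // == says these are equal; the raw-bits bar does not.
        System.out.println(0.0f == -0.0f);              // true
        System.out.println(bitIdentical(0.0f, -0.0f));  // false
        // Identical computations stay identical bit-for-bit.
        System.out.println(bitIdentical(1.5f, 1.5f));   // true
    }
}
```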
Microbench numbers (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3 -ffast-math; warmup=20, samples=21, median µs):
| shape | heap (µs) | memseg (µs) | zero-copy speedup | memseg vs Panama |
|------:|----------:|------------:|------------------:|-----------------:|
| 1024² |       360 |         369 |             0.98× |            5.05× |
| 2048² |      1317 |        1284 |             1.03× |            4.66× |
| 4096² |      6206 |        5184 |             1.20× |            4.48× |
Honest read: zero-copy is noise at small shapes (the staged copy is
sub-1MB; arena allocator + memcpy throughput hide it) and a real
+20% saving at 4096² (9 MB weight copy starts to dominate cache
pressure). Production loads on actual LLMs will be larger still and
will benefit more — plus they'll save on resident memory because
the heap path materializes a copy of every weight in JVM heap on
top of the off-heap segment.
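The warmup/median protocol behind the numbers above can be sketched as a plain Java harness (the `Runnable` workload is a stand-in; the real benchmark lives in Q4KMatmulMicrobenchTest): 20 untimed warmup iterations let the JIT and caches settle, then 21 timed samples are taken and the median reported, so a single noisy sample cannot skew the result:

```java
import java.util.Arrays;

public class MedianMicrobench {
    /** Runs `work` with the warmup/samples/median protocol from the PR. */
    static long medianMicros(Runnable work, int warmup, int samples) {
        for (int i = 0; i < warmup; i++) work.run();  // untimed: JIT + caches settle
        long[] us = new long[samples];
        for (int i = 0; i < samples; i++) {
            long t0 = System.nanoTime();
            work.run();
            us[i] = (System.nanoTime() - t0) / 1_000;
        }
        Arrays.sort(us);
        return us[samples / 2];  // odd sample count -> true median
    }

    public static void main(String[] args) {
        // Stand-in workload; a real run would time heap vs memseg matmul here.
        long med = medianMicros(() -> {
            double acc = 0;
            for (int i = 0; i < 100_000; i++) acc += Math.sqrt(i);
            if (acc < 0) System.out.println(acc);  // keep the loop live
        }, 20, 21);
        System.out.println(med >= 0);
    }
}
```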
Verification (linux-x86_64, JDK 21.0.10):
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 15/15
(3 pipeline + 5 heap-parity + 7 memseg-parity, microbench skipped)
- :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
- :skainet-backends:skainet-backend-api:jvmTest — 0/0 (no tests yet)
Out of scope (deferred per asciidoc staging):
- PR 4: NEON / AVX2 intrinsics + cross-arch CI matrix
- PR 5: native FP32 / Q6_K / Q8_0 kernels
- int64_t weight offset overload (current int32_t limit hit at 2 GB
per single segment slice)
- Panama priority-50 implementation of MemSegKernelProvider — Panama
already has Q4_K MemSeg internals; exposing through the new SPI is
a small follow-up and lets the smart-cast cascade work even when
the native provider is unavailable
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>