Context
Downstream Gemma 4 inference (SKaiNET-transformers) revealed two upstream correctness bugs that prevented coherent generation from quantized GGUF checkpoints, plus a direct-memory leak that surfaced once those bugs were resolved and longer prompts could actually run end-to-end.
This issue tracks landing all three fixes together so a future SKaiNET 0.21.0 unblocks the downstream SKaiNET-transformers 0.17.0 release that ships Gemma 4.
Bugs
1. Q4_KBlockTensorData.fromRawBytes — non-canonical 4-bit layout
Q4_KBlockTensorData did not match ggml's strided 4-bit layout. For each 32-element sub-block, byte i should hold element i in the low nibble and element i + 32 in the high nibble — we were reading bytes as a flat sequence of 4-bit codes. As a result, decoded Q4_K weights were reshuffled relative to llama.cpp / HF's reference, producing multilingual noise instead of the prompt continuation.
Symptom: greedy decoding from Hi on Gemma 4 E2B Q4_K_M produced جرى ﻥ ﻦ ... instead of llama.cpp's Hi}$\n import_result = ....
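A minimal Kotlin sketch of the canonical layout, for reference (helper name and signature hypothetical; operates on one 64-element chunk of a Q4_K block's qs bytes):

```kotlin
// Hypothetical helper illustrating ggml's strided Q4_K nibble layout.
// Within each 64-element chunk, byte i packs element i (low nibble) and
// element i + 32 (high nibble). The bug read the same bytes as a flat
// 4-bit stream, i.e. out[2*i] = low, out[2*i + 1] = high.
fun unpackQ4KChunk(qs: ByteArray, chunkOffset: Int): IntArray {
    val out = IntArray(64)
    for (i in 0 until 32) {
        val b = qs[chunkOffset + i].toInt() and 0xFF
        out[i] = b and 0x0F      // element i      <- low nibble
        out[i + 32] = b ushr 4   // element i + 32 <- high nibble
    }
    return out
}
```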
2. dequantQ5KFromBytes — wrong qh indexing
dequantQ5KFromBytes indexed the high-bit byte plane as qh[idx/8] (flat output position) instead of ggml's qh[l] (the byte for position l within the 32-element group, with a shifting bitmask selecting the group's bit). This corrupted the PLE residual via per_layer_token_embd, so post-norm picked the wrong feature even when blk.N argmax matched HF.
Symptom: blk.N argmax matched HF, but small-magnitude features drifted 2–5× starting at blk.0; Gemma 4 top-1 was ل instead of HF's \n.
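For contrast, a hedged Kotlin sketch of the correct indexing for one 32-element group (names hypothetical; mirrors ggml's dequantize_row_q5_K inner loop):

```kotlin
// Hypothetical helper for one 32-element Q5_K group. The high bit for
// output element l comes from qh[l] & mask: qh is indexed by the
// position l *within* the group, and the shifting mask (ggml's u1/u2)
// selects the group's bit plane. The bug used qh[idx / 8] on the flat
// output position instead.
fun dequantQ5KGroup(
    ql: ByteArray, qlOff: Int,   // low 4 bits (nibble plane)
    qh: ByteArray,               // high-bit byte plane, 32 bytes per block
    mask: Int,                   // shifting bit selector: 0x01, 0x02, 0x04, ...
    highNibble: Boolean,         // second half of the 64-element chunk?
    d: Float, m: Float           // sub-block scale and min
): FloatArray {
    val out = FloatArray(32)
    for (l in 0 until 32) {
        val b = ql[qlOff + l].toInt() and 0xFF
        val low = if (highNibble) b ushr 4 else b and 0x0F
        val high = if ((qh[l].toInt() and mask) != 0) 16 else 0
        out[l] = d * (low + high) - m
    }
    return out
}
```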
3. FP32 MemorySegment arena leak on the per-forward path
MemorySegmentTensorDataFactory per-op outputs were created in a long-lived arena, so each forward retained transpose + matmul output segments until process exit. Under sustained Gemma 4 decode this OOMed direct memory in <1 minute. Fixed by Arena.ofAuto for per-op outputs plus liveness-based freeing in ComputeGraphExecutor.
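The shape of the allocation fix, as a sketch (factory method hypothetical; Arena.ofAuto is the real java.lang.foreign API):

```kotlin
import java.lang.foreign.Arena
import java.lang.foreign.MemorySegment
import java.lang.foreign.ValueLayout

// Sketch only -- method name hypothetical. Before: per-op outputs came
// from one long-lived arena, so every transpose/matmul segment stayed
// reachable until process exit. After: each output gets its own
// Arena.ofAuto(), whose native memory the GC reclaims once the
// executor's liveness tracking drops the last reference to the tensor.
fun allocPerOpOutput(floatCount: Long): MemorySegment {
    val arena = Arena.ofAuto() // GC-managed lifetime; never explicitly closed
    return arena.allocate(ValueLayout.JAVA_FLOAT, floatCount)
}
```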
These ride together because the arena leak only became reachable once the Q4_K/Q5_K bugs were fixed and inference could actually run for more than a handful of tokens.
Bonus: matmul perf
feat: vectorize and parallelize CPU matmul kernels — incidental, but on the same hot path. Took decode throughput from 0.124 → 0.66 tok/s on Gemma 4 E2B Q4_K_M after the layout fix, on a 6-perf-core box.
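Not the PR's kernel, but a minimal sketch of the general shape: output rows split across cores, with a contiguous stride-1 inner loop the JIT (or the Vector API) can vectorize.

```kotlin
import java.util.stream.IntStream

// Row-parallel C[m x n] += A[m x k] * B[k x n] (illustrative only, not
// the actual SKaiNET kernel). Rows of C are distributed across cores;
// the inner j-loop is stride-1 and SIMD-friendly. Assumes c is
// zero-initialized.
fun matmulParallel(a: FloatArray, b: FloatArray, c: FloatArray, m: Int, k: Int, n: Int) {
    IntStream.range(0, m).parallel().forEach { i ->
        val rowC = i * n
        for (p in 0 until k) {
            val aip = a[i * k + p]
            val rowB = p * n
            for (j in 0 until n) {
                c[rowC + j] += aip * b[rowB + j] // contiguous, auto-vectorizable
            }
        }
    }
}
```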
Scope of this PR
Seven commits, touching only skainet-backend-cpu, skainet-lang-core, and skainet-compile-dag:
- fix: GC-reclaim FP32 MemSeg transpose/matmul output arenas
- feat: liveness-based freeing of intermediate tensors in ComputeGraphExecutor
- fix: MemorySegmentTensorDataFactory uses Arena.ofAuto for per-op outputs
- perf: vectorize and parallelize CPU matmul kernels
- fix(q4_k): apply canonical ggml layout in tensor data + SIMD matmul
- fix(q5_k): use canonical ggml qh[l] indexing in dequantQ5KFromBytes
- fix(test): drop parens from Q4_K test names for Kotlin/Native
Tests added / touched
- Q4_KTensorDataTest — strided 4-bit layout reads, set, 2D indexing, ggml get_scale_min_k4 formula (ported in the sketch after this list), sub-block boundaries.
- Q4KCanonicalLayoutTest / Q5KCanonicalLayoutTest — GGUF round-trip parity against llama.cpp output.
- MemSegArenaLeakTest — sustained-forward direct-memory non-growth.
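For reference, the ggml formula that Q4_KTensorDataTest checks, ported to Kotlin (a sketch; SKaiNET's actual signature may differ):

```kotlin
// Port of ggml's get_scale_min_k4: Q4_K/Q5_K pack eight 6-bit
// (scale, min) pairs into 12 bytes. Sub-blocks 0..3 are stored whole;
// 4..7 are split across the low/high nibble of scales[j + 4] plus the
// top two bits of scales[j - 4] and scales[j].
fun getScaleMinK4(j: Int, scales: ByteArray): Pair<Int, Int> {
    fun u(i: Int) = scales[i].toInt() and 0xFF
    return if (j < 4) {
        (u(j) and 63) to (u(j + 4) and 63)
    } else {
        val d = (u(j + 4) and 0x0F) or ((u(j - 4) ushr 6) shl 4)
        val m = (u(j + 4) ushr 4) or ((u(j) ushr 6) shl 4)
        d to m
    }
}
```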
Acceptance
- clean assemble allTests green on JDK 25 (incl. JS/Wasm browser tests under headless Chromium).
- Downstream Gemma 4 E2B Q4_K_M smoke test produces coherent English from greedy decoding of Hi (no longer multilingual noise).
- No direct-memory growth during sustained decode.
Out of scope
- Tool-call format emission gap on Gemma 4 E2B (downstream SKaiNET-transformers; not an upstream concern).
- Op-level parallelism in ComputeGraphExecutor for the 1 tok/s target — left for a follow-up.