Gemma 4 / Q4_K_M correctness: canonical ggml layout for Q4_K + Q5_K, FP32 MemSeg arena leak fix #555

@michalharakal

Description

Context

Downstream Gemma 4 inference (SKaiNET-transformers) revealed two upstream correctness bugs that prevented coherent generation from quantized GGUF checkpoints, plus a direct-memory leak that surfaced once those bugs were resolved and longer prompts could actually run end-to-end.

This issue tracks landing all three fixes together so a future SKaiNET 0.21.0 unblocks the downstream SKaiNET-transformers 0.17.0 release that ships Gemma 4.

Bugs

1. Q4_KBlockTensorData.fromRawBytes — non-canonical 4-bit layout

Q4_KBlockTensorData did not match ggml's strided 4-bit layout. Within each 64-element chunk (packed into 32 bytes), byte i should hold element i in the low nibble and element i + 32 in the high nibble — we were instead reading the bytes as a flat sequence of 4-bit codes. As a result, decoded Q4_K weights were reshuffled relative to the llama.cpp / HF reference, producing multilingual noise instead of the prompt continuation.

Symptom: Hi greedy on Gemma 4 E2B Q4_K_M produced جرى ﻥ ﻦ ... instead of llama.cpp's Hi}$\n import_result = ....
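The two readings are easiest to compare side by side. A minimal Python sketch (illustrative only — the actual fix lives in the Kotlin Q4_KBlockTensorData, and function names here are invented for the example):

```python
def decode_q4_strided(qs: bytes) -> list[int]:
    """Canonical ggml layout: 32 packed bytes -> 64 4-bit codes.
    Byte i holds element i (low nibble) and element i + 32 (high nibble)."""
    out = [0] * 64
    for i, b in enumerate(qs):
        out[i] = b & 0x0F        # element i
        out[i + 32] = b >> 4     # element i + 32
    return out

def decode_q4_flat(qs: bytes) -> list[int]:
    """The buggy reading: bytes treated as a flat run of 4-bit codes,
    low nibble first. Reshuffles everything past the first element."""
    out = []
    for b in qs:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out
```

Both decoders consume identical bytes, but for any non-trivial pattern they disagree on element order — which is exactly why per-weight values looked plausible while the tensor as a whole was scrambled.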

2. dequantQ5KFromBytes — wrong qh indexing

dequantQ5KFromBytes indexed the high-bit plane as qh[idx/8] (by output position) instead of ggml's qh[l] (byte l within the 32-element group, with the group number selecting which bit of that byte to read). This corrupted the PLE residual via per_layer_token_embd, so post-norm picked the wrong feature even when blk.N argmax matched HF.

Symptom: blk.N argmax matched HF, but small-magnitude features drifted 2–5× starting at blk.0; Gemma 4 top-1 was ل instead of HF's \n.
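For reference, a Python sketch of the canonical indexing (an illustration of ggml's scheme, not the Kotlin code): qh is a fixed 32-byte plane per 256-element superblock, l indexes the byte, and the 32-element group number selects the bit via a shifted mask.

```python
def decode_q5_codes(ql: bytes, qh: bytes) -> list[int]:
    """Decode the 5-bit codes of one 256-element Q5_K superblock.
    ql: 128 bytes of low nibbles (strided like Q4_K).
    qh: 32 bytes; bit g of qh[l] is the high bit of element l
    within 32-element group g -- i.e. qh is indexed by l, never idx/8."""
    out = []
    u1, u2 = 1, 2                      # bit masks for groups 2j and 2j+1
    for j in range(4):                 # four 64-element chunks
        base = 32 * j
        for l in range(32):            # low-nibble half of the chunk
            out.append((ql[base + l] & 0x0F) | (16 if qh[l] & u1 else 0))
        for l in range(32):            # high-nibble half
            out.append((ql[base + l] >> 4) | (16 if qh[l] & u2 else 0))
        u1 <<= 2
        u2 <<= 2
    return out
```

The buggy qh[idx/8] read walked through the plane once per 8 outputs, so most elements pulled their high bit from the wrong byte — small enough per-weight to keep blk.N argmax intact, large enough to drift the low-magnitude features.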

3. FP32 MemorySegment arena leak on the per-forward path

MemorySegmentTensorDataFactory per-op outputs were created in a long-lived arena, so each forward retained transpose + matmul output segments until process exit. Under sustained Gemma 4 decode this OOMed direct memory in <1 minute. Fixed by Arena.ofAuto for per-op outputs plus liveness-based freeing in ComputeGraphExecutor.
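The liveness idea is straightforward: track the last op that reads each intermediate and release it right there, rather than letting a long-lived arena pin every per-forward output. A language-agnostic sketch in Python (names and shape are illustrative, not SKaiNET's actual ComputeGraphExecutor API):

```python
class GraphExecutor:
    """Liveness-based freeing: each intermediate is released as soon as
    its last consumer has run, instead of surviving to process exit."""

    def __init__(self, ops):
        # ops: list of (op_id, fn, input_ids); each op yields one output
        self.ops = ops
        # precompute the index of the last op consuming each tensor id
        self.last_use = {}
        for idx, (_, _, inputs) in enumerate(ops):
            for t in inputs:
                self.last_use[t] = idx

    def run(self, inputs, free):
        live = dict(inputs)
        for idx, (op_id, fn, in_ids) in enumerate(self.ops):
            live[op_id] = fn(*[live[t] for t in in_ids])
            for t in in_ids:
                if self.last_use.get(t) == idx and t in live:
                    free(live.pop(t))   # last consumer done: release now
        return live                     # only tensors with no later use remain
```

In the real fix the `free` hook corresponds to dropping the per-op Arena.ofAuto segment so the GC can reclaim it; the sketch just shows why peak liveness stays bounded per forward instead of growing with token count.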

These ride together because the arena leak only became reachable once the Q4_K/Q5_K bugs were fixed and inference could actually run for more than a handful of tokens.

Bonus: matmul perf

The feat: vectorize and parallelize CPU matmul kernels commit is incidental but sits on the same hot path. With the layout fix in place, it took the per-token rate from 0.124 to 0.66 tok/s on Gemma 4 E2B Q4_K_M on a 6-perf-core box.
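The parallel half of that commit is the classic row partition: output rows are independent, so each worker owns a disjoint slice. A pure-Python stand-in to show the decomposition (not the SIMD kernel itself; names are invented for the example):

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_rows_parallel(a, b, workers=4):
    """Row-parallel matmul: each worker computes a contiguous slice of
    output rows with no shared writes, the same split a parallelized
    CPU kernel uses across cores."""
    n, k = len(a), len(a[0])
    m = len(b[0])
    out = [[0.0] * m for _ in range(n)]

    def rows(lo, hi):
        for i in range(lo, hi):
            ai, oi = a[i], out[i]
            for p in range(k):          # i-p-j order keeps b[p] contiguous
                aip = ai[p]
                bp = b[p]
                for j in range(m):
                    oi[j] += aip * bp[j]

    step = (n + workers - 1) // workers
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(lambda lo: rows(lo, min(lo + step, n)),
                    range(0, n, step)))
    return out
```

The vectorized half maps the inner j loop onto SIMD lanes; the row split above is what makes the perf-core count the scaling knob.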

Scope of this PR

Seven commits, touching only skainet-backend-cpu, skainet-lang-core, and skainet-compile-dag:

  1. fix: GC-reclaim FP32 MemSeg transpose/matmul output arenas
  2. feat: liveness-based freeing of intermediate tensors in ComputeGraphExecutor
  3. fix: MemorySegmentTensorDataFactory uses Arena.ofAuto for per-op outputs
  4. perf: vectorize and parallelize CPU matmul kernels
  5. fix(q4_k): apply canonical ggml layout in tensor data + SIMD matmul
  6. fix(q5_k): use canonical ggml qh[l] indexing in dequantQ5KFromBytes
  7. fix(test): drop parens from Q4_K test names for Kotlin/Native

Tests added / touched

  • Q4_KTensorDataTest — strided 4-bit layout reads, set, 2D indexing, ggml get_scale_min_k4 formula, sub-block boundaries.
  • Q4KCanonicalLayoutTest / Q5KCanonicalLayoutTest — GGUF round-trip parity against llama.cpp output.
  • MemSegArenaLeakTest — sustained-forward direct-memory non-growth.

Acceptance

  • clean assemble allTests green on JDK 25 (incl. JS/Wasm browser tests under headless Chromium).
  • Downstream Gemma 4 E2B Q4_K_M smoke test produces coherent English from Hi greedy (no longer multilingual noise).
  • No direct-memory growth during sustained decode.

Out of scope

  • Tool-call format emission gap on Gemma 4 E2B (downstream SKaiNET-transformers; not an upstream concern).
  • Op-level parallelism in ComputeGraphExecutor for the 1 tok/s target — left for a follow-up.
