Context
Downstream Gemma 4 inference (SKaiNET-transformers) revealed two upstream correctness bugs that prevented coherent generation from quantized GGUF checkpoints, plus a direct-memory leak that surfaced once those bugs were resolved and longer prompts could actually run end-to-end.
This issue tracks landing all three fixes together so a future SKaiNET 0.21.0 unblocks the downstream SKaiNET-transformers 0.17.0 release that ships Gemma 4.
Bugs
1. Q4_KBlockTensorData.fromRawBytes — non-canonical 4-bit layout
Q4_KBlockTensorData did not match ggml's strided 4-bit layout. For each 32-element sub-block, byte i should hold element i in the low nibble and element i + 32 in the high nibble — we were reading bytes as a flat sequence of 4-bit codes. As a result, decoded Q4_K weights were reshuffled relative to llama.cpp / HF's reference, producing multilingual noise instead of the prompt continuation.
Symptom: greedy decoding from Hi on Gemma 4 E2B Q4_K_M produced جرى ﻥ ﻦ ... instead of llama.cpp's Hi}$\n import_result = ....
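A minimal Kotlin sketch of the canonical layout, for reference (helper name and signature hypothetical; operates on one 64-element chunk of a Q4_K block's qs bytes):

```kotlin
// Hypothetical helper illustrating ggml's strided Q4_K nibble layout.
// Within each 64-element chunk, byte i packs element i (low nibble) and
// element i + 32 (high nibble). The bug read the same bytes as a flat
// 4-bit stream, i.e. out[2*i] = low, out[2*i + 1] = high.
fun unpackQ4KChunk(qs: ByteArray, chunkOffset: Int): IntArray {
    val out = IntArray(64)
    for (i in 0 until 32) {
        val b = qs[chunkOffset + i].toInt() and 0xFF
        out[i] = b and 0x0F      // element i      <- low nibble
        out[i + 32] = b ushr 4   // element i + 32 <- high nibble
    }
    return out
}
```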
2. dequantQ5KFromBytes — wrong qh indexing
dequantQ5KFromBytes indexed the high-bit byte plane as qh[idx/8] (flat output position) instead of ggml's qh[l] (the byte for position l within the 32-element group, with a shifting bitmask selecting the group's bit). This corrupted the PLE residual via per_layer_token_embd, so post-norm picked the wrong feature even when blk.N argmax matched HF.
Symptom: blk.N argmax matched HF, but small-magnitude features drifted 2–5× starting at blk.0; Gemma 4 top-1 was ل instead of HF's \n.
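For contrast, a hedged Kotlin sketch of the correct indexing for one 32-element group (names hypothetical; mirrors ggml's dequantize_row_q5_K inner loop):

```kotlin
// Hypothetical helper for one 32-element Q5_K group. The high bit for
// output element l comes from qh[l] & mask: qh is indexed by the
// position l *within* the group, and the shifting mask (ggml's u1/u2)
// selects the group's bit plane. The bug used qh[idx / 8] on the flat
// output position instead.
fun dequantQ5KGroup(
    ql: ByteArray, qlOff: Int,   // low 4 bits (nibble plane)
    qh: ByteArray,               // high-bit byte plane, 32 bytes per block
    mask: Int,                   // shifting bit selector: 0x01, 0x02, 0x04, ...
    highNibble: Boolean,         // second half of the 64-element chunk?
    d: Float, m: Float           // sub-block scale and min
): FloatArray {
    val out = FloatArray(32)
    for (l in 0 until 32) {
        val b = ql[qlOff + l].toInt() and 0xFF
        val low = if (highNibble) b ushr 4 else b and 0x0F
        val high = if ((qh[l].toInt() and mask) != 0) 16 else 0
        out[l] = d * (low + high) - m
    }
    return out
}
```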
3. FP32 MemorySegment arena leak on the per-forward path
MemorySegmentTensorDataFactory per-op outputs were created in a long-lived arena, so each forward retained transpose + matmul output segments until process exit. Under sustained Gemma 4 decode this OOMed direct memory in <1 minute. Fixed by Arena.ofAuto for per-op outputs plus liveness-based freeing in ComputeGraphExecutor.
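The shape of the allocation fix, as a sketch (factory method hypothetical; Arena.ofAuto is the real java.lang.foreign API):

```kotlin
import java.lang.foreign.Arena
import java.lang.foreign.MemorySegment
import java.lang.foreign.ValueLayout

// Sketch only -- method name hypothetical. Before: per-op outputs came
// from one long-lived arena, so every transpose/matmul segment stayed
// reachable until process exit. After: each output gets its own
// Arena.ofAuto(), whose native memory the GC reclaims once the
// executor's liveness tracking drops the last reference to the tensor.
fun allocPerOpOutput(floatCount: Long): MemorySegment {
    val arena = Arena.ofAuto() // GC-managed lifetime; never explicitly closed
    return arena.allocate(ValueLayout.JAVA_FLOAT, floatCount)
}
```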
These ride together because the arena leak only became reachable once the Q4_K/Q5_K bugs were fixed and inference could actually run for more than a handful of tokens.
Bonus: matmul perf
feat: vectorize and parallelize CPU matmul kernels — incidental, but on the same hot path. Took decode throughput from 0.124 → 0.66 tok/s on Gemma 4 E2B Q4_K_M after the layout fix, on a 6-perf-core box.
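Not the PR's kernel, but a minimal sketch of the general shape: output rows split across cores, with a contiguous stride-1 inner loop the JIT (or the Vector API) can vectorize.

```kotlin
import java.util.stream.IntStream

// Row-parallel C[m x n] += A[m x k] * B[k x n] (illustrative only, not
// the actual SKaiNET kernel). Rows of C are distributed across cores;
// the inner j-loop is stride-1 and SIMD-friendly. Assumes c is
// zero-initialized.
fun matmulParallel(a: FloatArray, b: FloatArray, c: FloatArray, m: Int, k: Int, n: Int) {
    IntStream.range(0, m).parallel().forEach { i ->
        val rowC = i * n
        for (p in 0 until k) {
            val aip = a[i * k + p]
            val rowB = p * n
            for (j in 0 until n) {
                c[rowC + j] += aip * b[rowB + j] // contiguous, auto-vectorizable
            }
        }
    }
}
```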
Scope of this PR
Seven commits, touching only skainet-backend-cpu, skainet-lang-core, and skainet-compile-dag:
- fix: GC-reclaim FP32 MemSeg transpose/matmul output arenas
- feat: liveness-based freeing of intermediate tensors in ComputeGraphExecutor
- fix: MemorySegmentTensorDataFactory uses Arena.ofAuto for per-op outputs
- perf: vectorize and parallelize CPU matmul kernels
- fix(q4_k): apply canonical ggml layout in tensor data + SIMD matmul
- fix(q5_k): use canonical ggml qh[l] indexing in dequantQ5KFromBytes
- fix(test): drop parens from Q4_K test names for Kotlin/Native
Tests added / touched
- Q4_KTensorDataTest — strided 4-bit layout reads, set, 2D indexing, ggml get_scale_min_k4 formula (ported in the sketch after this list), sub-block boundaries.
- Q4KCanonicalLayoutTest / Q5KCanonicalLayoutTest — GGUF round-trip parity against llama.cpp output.
- MemSegArenaLeakTest — sustained-forward direct-memory non-growth.
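For reference, the ggml formula that Q4_KTensorDataTest checks, ported to Kotlin (a sketch; SKaiNET's actual signature may differ):

```kotlin
// Port of ggml's get_scale_min_k4: Q4_K/Q5_K pack eight 6-bit
// (scale, min) pairs into 12 bytes. Sub-blocks 0..3 are stored whole;
// 4..7 are split across the low/high nibble of scales[j + 4] plus the
// top two bits of scales[j - 4] and scales[j].
fun getScaleMinK4(j: Int, scales: ByteArray): Pair<Int, Int> {
    fun u(i: Int) = scales[i].toInt() and 0xFF
    return if (j < 4) {
        (u(j) and 63) to (u(j + 4) and 63)
    } else {
        val d = (u(j + 4) and 0x0F) or ((u(j - 4) ushr 6) shl 4)
        val m = (u(j + 4) ushr 4) or ((u(j) ushr 6) shl 4)
        d to m
    }
}
```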
Acceptance
- clean assemble allTests green on JDK 25 (incl. JS/Wasm browser tests under headless Chromium).
- Downstream Gemma 4 E2B Q4_K_M smoke test produces coherent English from greedy decoding of Hi (no longer multilingual noise).
- No direct-memory growth during sustained decode.
Out of scope
- Tool-call format emission gap on Gemma 4 E2B (downstream SKaiNET-transformers; not an upstream concern).
- Op-level parallelism in ComputeGraphExecutor for the 1 tok/s target — left for a follow-up.