
Feature/49 tool calling pipeline oom #51

Merged
michalharakal merged 9 commits into develop from
feature/49-tool-calling-pipeline-oom
Apr 11, 2026

Conversation

@michalharakal
Contributor

No description provided.

michalharakal and others added 9 commits on April 11, 2026 at 13:29
LlamaRuntime previously pre-transposed all weight tensors at init,
doubling peak memory (an extra ~31GB for 8B models). This caused OOM
on 48GB machines when loading Qwen3-8B-Q4_K_M.

Replace eager pre-transpose with inline .t() calls during forward pass.
The GC reclaims each temporary transpose, so only one projection's
worth of memory (~200MB) is live at a time instead of the full set.

Before: ~62GB peak (31GB weights + 31GB transposed copies)
After:  ~31GB peak (weights only + 200MB temp per layer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
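
A minimal sketch of this change, assuming a hypothetical tensor API (the interface below is illustrative, not the real skainet types):

```kotlin
// Illustrative stand-in for the real tensor type; names assumed.
interface Tensor {
    fun t(): Tensor                      // transposed copy
    fun matmul(other: Tensor): Tensor    // row-major matmul
}

// Before: eager pre-transpose at init pinned a second full copy of every
// projection weight for the model's lifetime (+~31GB on an 8B model):
//   val wqT = wq.t()   // stored in a field, never reclaimed
//
// After: transpose inline in the forward pass. The temporary becomes
// unreachable as soon as matmul returns, so only one projection's
// transpose (~200MB) is live at any moment.
fun project(x: Tensor, w: Tensor): Tensor = x.matmul(w.t())
```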
MemSegWeightConverter previously dequantized Q4_K tensors to FP32
because "no native SIMD kernel yet". But Q4_KBlockTensorData and
QuantizedMatmul.matmulQ4_K() already exist in skainet-backend-cpu.

Wire Q4_K into the SIMD path: create Q4_KBlockTensorData from raw
bytes instead of dequantizing. This keeps Q4_K weights in their
compact quantized form (~4.5 bits/param) and uses the SIMD kernel
for matmul at inference time.

Memory impact for Qwen3-8B-Q4_K_M:
  Before: ~31GB (Q4_K dequantized to FP32)
  After:  ~5GB  (Q4_K kept quantized)

Q5_K and Q6_K still dequantize to FP32 (no SIMD kernel yet).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
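
A sketch of the converter branch this commit changes. Q4_KBlockTensorData and the from-raw-bytes construction are named in this PR; the enum and the dequantizeToF32 helper are illustrative stand-ins:

```kotlin
// Illustrative stand-ins; real skainet signatures may differ.
enum class GgmlType { Q4_K, Q5_K, Q6_K }

class Q4_KBlockTensorData private constructor(val raw: ByteArray, val shape: IntArray) {
    companion object {
        fun fromRawBytes(raw: ByteArray, shape: IntArray) =
            Q4_KBlockTensorData(raw, shape)  // keeps the ~4.5 bits/param block format
    }
}

fun dequantizeToF32(type: GgmlType, raw: ByteArray, shape: IntArray): FloatArray =
    FloatArray(shape.fold(1) { a, b -> a * b })  // placeholder: 4 bytes/param in FP32

// The decision this commit changes: Q4_K stays quantized for the SIMD
// matmul kernel; Q5_K/Q6_K still expand to FP32 (no SIMD kernel yet).
fun convert(type: GgmlType, raw: ByteArray, shape: IntArray): Any = when (type) {
    GgmlType.Q4_K -> Q4_KBlockTensorData.fromRawBytes(raw, shape)
    GgmlType.Q5_K, GgmlType.Q6_K -> dequantizeToF32(type, raw, shape)
}
```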
Wire Q4_K into MemSegWeightConverter to keep weights quantized (~5GB)
instead of dequantizing to FP32 (~31GB). Add linearProject() helper
to LlamaRuntime that dispatches to quantized matmul for Q4_K weights
and standard transpose+matmul for FP32 weights.

The 8B model now loads successfully on a 48GB Mac (15s load time) and
generates tokens via the Q4_K SIMD kernel. However, the inline .t()
on Q6_K-dequantized FP32 tensors still causes direct buffer OOM
during inference -- the JVM doesn't reclaim direct memory eagerly.
Full fix requires lazy per-layer dequantization (Solution B) or
Q6_K SIMD kernel support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
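
The dispatch described above, sketched with assumed signatures. QuantizedMatmul.matmulQ4_K is the real kernel name from this PR; the stubs reuse the illustrative types from the previous sketch:

```kotlin
// Stubs with assumed signatures; matmulQ4_K is the real kernel name.
object QuantizedMatmul {
    fun matmulQ4_K(x: FloatArray, w: Q4_KBlockTensorData): FloatArray = TODO()
}
fun transposeAndMatmulF32(x: FloatArray, w: FloatArray): FloatArray = TODO()

// linearProject(): Q4_K routes to the SIMD quantized matmul; FP32 takes
// the standard transpose+matmul path (whose inline transpose on
// Q6_K-dequantized tensors is what still OOMs; the next commit fixes it).
fun linearProject(x: FloatArray, weight: Any): FloatArray = when (weight) {
    is Q4_KBlockTensorData -> QuantizedMatmul.matmulQ4_K(x, weight)
    is FloatArray -> transposeAndMatmulF32(x, weight)
    else -> error("unsupported weight representation")
}
```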
Pre-transpose ALL projection weights in MemSegWeightConverter so
LlamaRuntime never calls .t() during inference. This eliminates direct
buffer allocations that the JVM doesn't GC eagerly, which caused OOM
on 48GB machines.

- Q4_K: transposed shape passed to Q4_KBlockTensorData.fromRawBytes()
- Q6_K: dequantize to FP32 + array transpose to [in, out] layout
- FP32: pre-transpose via .t() during conversion (one-time cost)
- linearProject() auto-detects layout: [in,out] = direct, [out,in] = .t()

Qwen3-8B-Q4_K_M now runs on a 48GB Mac at 14.2GB RSS (was 45GB+ OOM).
Token generation works but output quality needs validation (Q6_K
transpose ordering may need adjustment).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
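
A sketch of the layout auto-detection on the FP32 path, assuming the runtime knows each projection's logical dimensions (the tensor surface is again illustrative):

```kotlin
// Assumed minimal tensor surface for this sketch.
interface ShapedTensor {
    val shape: IntArray
    fun t(): ShapedTensor
    fun matmul(other: ShapedTensor): ShapedTensor
}

// [in, out] means the converter already pre-transposed the weight, so it
// is used directly and no direct-buffer transpose happens at inference
// time; a legacy [out, in] tensor falls back to .t().
// (Square weights are ambiguous here; the real heuristic may differ.)
fun projectF32(x: ShapedTensor, w: ShapedTensor, inDim: Int, outDim: Int): ShapedTensor =
    if (w.shape[0] == inDim && w.shape[1] == outDim) x.matmul(w)  // pre-transposed
    else x.matmul(w.t())                                          // [out, in] fallback
```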
Revert the Q4_K block shape reinterpretation (it corrupted the block
data layout) and dequantize all K-quant types to FP32 with an array
pre-transpose.
This produces correct output at the cost of higher memory (~30GB vs
~14GB), but still fits on a 48GB Mac without runtime OOM.

Qwen3-8B-Q4_K_M now generates correct output:
  "The capital of France is Paris."

Memory: ~30GB RSS (was 45GB+ OOM before pre-transpose fix)
Speed: 0.002 tok/s (CPU-only 8B, expected for scalar FP32 matmul)

Future: native Q4_K SIMD matmul with proper block-aware transpose
would reduce to ~14GB and improve speed significantly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
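
The array pre-transpose itself is a plain copy loop over heap arrays, so nothing touches direct ByteBuffers at inference time. A self-contained sketch:

```kotlin
// One-time transpose from GGUF's row-major [out, in] order into the
// [in, out] layout that linearProject() consumes directly.
fun preTranspose(src: FloatArray, outDim: Int, inDim: Int): FloatArray {
    require(src.size == outDim * inDim)
    val dst = FloatArray(src.size)
    for (o in 0 until outDim) {
        for (i in 0 until inDim) {
            dst[i * outDim + o] = src[o * inDim + i]
        }
    }
    return dst
}
```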
Document the full numeric representation journey from GGUF file to
matmul kernel dispatch:

- Stage 1: GGUF on-disk layout (Q4_K_M block format, tensor shapes)
- Stage 2: Raw byte loading via StreamingGGUFReader
- Stage 3: MemSegWeightConverter paths (Q4_0/Q8_0 SIMD, K-quant
  dequant + pre-transpose, FP32 pre-transpose)
- Stage 4: LlamaRuntime.linearProject() auto-detection
- Stage 5: Matmul kernel dispatch (SIMD Q4/Q8, scalar FP32)
- Why Q4_K blocks cannot be trivially transposed (see the sketch after
  this commit message)
- Memory budget table for 8B model on 48GB Mac
- Future: block-aware Q4_K transpose for 5GB inference

Includes Mermaid diagram of the full data flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
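
On the "why Q4_K blocks cannot be trivially transposed" point, the block constants tell the story (sizes per the GGML Q4_K format; the constant names here are illustrative):

```kotlin
// A Q4_K super-block covers 256 consecutive elements of one row: two
// FP16 super-scales, 12 bytes of packed 6-bit sub-block scales/mins,
// and 128 bytes of 4-bit values => 144 bytes / 256 elems = 4.5 bits/param.
const val QK_K = 256                            // elements per super-block
const val Q4_K_BLOCK_BYTES = 2 + 2 + 12 + 128   // = 144

// Transposing the logical matrix would scatter each block's 256 elements
// across 256 different columns, so the packed bytes cannot simply be
// reordered without dequantizing and re-quantizing; hence "block-aware
// Q4_K transpose" is future work rather than a byte shuffle.
```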
Add runChatOnce() to AgentCli for non-interactive single-prompt
instruct testing. Wire --chat with positional prompt to single-shot
mode. Add instruct field to smoke-models.json — when true, the
smoke test uses --chat with chat template formatting instead of
raw text completion.

Fixes garbage output from instruct models (Qwen3) in smoke tests.

Refs: #49

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
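
A sketch of how the smoke runner might assemble the CLI invocation. Only --chat, the positional prompt, and the instruct field come from this commit; everything else is assumed:

```kotlin
// Assumed shape of a smoke-models.json entry; "instruct" is the new field.
data class SmokeModel(val path: String, val instruct: Boolean = false)

// instruct=true => single-shot chat mode via runChatOnce() with chat
// template formatting; otherwise raw text completion.
fun smokeArgs(model: SmokeModel, prompt: String): List<String> = buildList {
    add("--model"); add(model.path)  // flag name assumed
    if (model.instruct) add("--chat")
    add(prompt)                      // positional prompt
}
```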
Math.PI is not available on non-JVM targets (iOS, JS, WASM).
Use kotlin.math.PI which is multiplatform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
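
The fix in miniature:

```kotlin
import kotlin.math.PI

// Before (JVM-only):  val twoPi = Math.PI * 2   // java.lang.Math, absent on iOS/JS/WASM
// After (all targets):
val twoPi = PI * 2
```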
- Install npm packages to /opt/antora (not /antora which gets
  volume-mounted over)
- Set NODE_PATH=/opt/antora/node_modules so Antora finds extensions
- Use /opt/antora/node_modules/.bin/antora as entrypoint (not npx)
- Fix playbook content source URL to /antora (git repo root)

Verified locally: 19 HTML pages, 12 Mermaid SVG diagrams rendered
via Chromium inside the container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
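
A Dockerfile sketch of the resulting layout (base image and package set are assumed; only the /opt/antora paths, NODE_PATH, and the entrypoint come from this commit):

```dockerfile
FROM node:20-slim                       # base image assumed
WORKDIR /opt/antora
# Install outside /antora, which is volume-mounted over at run time.
RUN npm install antora                  # plus the Mermaid extension the docs use
ENV NODE_PATH=/opt/antora/node_modules
ENTRYPOINT ["/opt/antora/node_modules/.bin/antora"]
# The playbook's content source url points at /antora, the git repo root
# mounted into the container.
```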
michalharakal merged commit ced73dd into develop on Apr 11, 2026
3 checks passed
michalharakal deleted the feature/49-tool-calling-pipeline-oom branch on April 11, 2026 at 19:18