Feature/49 tool calling pipeline oom #51
Merged
michalharakal merged 9 commits into develop on Apr 11, 2026
Conversation
LlamaRuntime previously pre-transposed all weight tensors at init, doubling peak memory (~31GB extra for 8B models). This caused OOM on 48GB machines when loading Qwen3-8B-Q4_K_M.

Replace the eager pre-transpose with inline .t() calls during the forward pass. The GC reclaims each temporary transpose, so only one projection's worth of memory (~200MB) is live at a time instead of the full set.

Before: ~62GB peak (31GB weights + 31GB transposed copies)
After: ~31GB peak (weights only + ~200MB temporary per layer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
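The idea can be sketched in a few lines of Kotlin. This is a minimal stand-in, not the actual LlamaRuntime code: the `Matrix` type and loops are hypothetical, but they show why the inline `.t()` keeps only one transposed temporary alive at a time.

```kotlin
// Stand-in dense matrix; the real project uses skainet tensor types.
class Matrix(val rows: Int, val cols: Int, val data: FloatArray) {
    // .t() allocates a transposed copy [cols x rows]
    fun t(): Matrix {
        val out = FloatArray(rows * cols)
        for (r in 0 until rows)
            for (c in 0 until cols)
                out[c * rows + r] = data[r * cols + c]
        return Matrix(cols, rows, out)
    }
}

// Naive row-major matmul: out[i][j] = sum_k a[i][k] * b[k][j]
fun matmul(a: Matrix, b: Matrix): Matrix {
    require(a.cols == b.rows)
    val out = FloatArray(a.rows * b.cols)
    for (i in 0 until a.rows)
        for (j in 0 until b.cols) {
            var acc = 0f
            for (k in 0 until a.cols)
                acc += a.data[i * a.cols + k] * b.data[k * b.cols + j]
            out[i * b.cols + j] = acc
        }
    return Matrix(a.rows, b.cols, out)
}

// The transposed copy exists only for the duration of this call and is
// then unreachable, so the GC can reclaim it before the next projection.
fun project(x: Matrix, weight: Matrix): Matrix = matmul(x, weight.t())
```

With an eager pre-transpose, every weight matrix would carry a second transposed copy for the lifetime of the model; here the copy is a short-lived temporary per projection.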
MemSegWeightConverter previously dequantized Q4_K tensors to FP32 because "no native SIMD kernel yet". But Q4_KBlockTensorData and QuantizedMatmul.matmulQ4_K() already exist in skainet-backend-cpu.

Wire Q4_K into the SIMD path: create Q4_KBlockTensorData from raw bytes instead of dequantizing. This keeps Q4_K weights in their compact quantized form (~4.5 bits/param) and uses the SIMD kernel for matmul at inference time.

Memory impact for Qwen3-8B-Q4_K_M:
Before: ~31GB (Q4_K dequantized to FP32)
After: ~5GB (Q4_K kept quantized)

Q5_K and Q6_K still dequantize to FP32 (no SIMD kernel yet).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
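The converter decision might look like the following sketch. The types here are stand-ins; the real Q4_KBlockTensorData and QuantizedMatmul APIs in skainet-backend-cpu may have different shapes and names.

```kotlin
// Stand-in tensor storage variants (hypothetical, not the skainet API).
sealed interface TensorData
// Q4_K kept as raw super-blocks (144 bytes per 256-weight block in GGUF).
class Q4KBlockData(val raw: ByteArray, val shape: IntArray) : TensorData
// Fallback: fully dequantized FP32 at 32 bits per parameter.
class Fp32Data(val values: FloatArray, val shape: IntArray) : TensorData

fun convertTensor(
    ggmlType: String,
    raw: ByteArray,
    shape: IntArray,
    dequantize: (ByteArray) -> FloatArray,
): TensorData = when (ggmlType) {
    // SIMD matmul kernel consumes the blocks directly -> no FP32 blow-up.
    "Q4_K" -> Q4KBlockData(raw, shape)
    // Q5_K / Q6_K: no SIMD kernel yet, so dequantize to FP32.
    else -> Fp32Data(dequantize(raw), shape)
}
```

The key point is that the Q4_K branch never materializes an FP32 array at all; the raw bytes loaded from the GGUF file become the live weight storage.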
Wire Q4_K into MemSegWeightConverter to keep weights quantized (~5GB) instead of dequantizing to FP32 (~31GB). Add a linearProject() helper to LlamaRuntime that dispatches to quantized matmul for Q4_K weights and standard transpose+matmul for FP32 weights.

The 8B model now loads successfully on a 48GB Mac (15s load time) and generates tokens via the Q4_K SIMD kernel. However, the inline .t() on Q6_K-dequantized FP32 tensors still causes direct buffer OOM during inference: the JVM doesn't reclaim direct memory eagerly. A full fix requires lazy per-layer dequantization (Solution B) or Q6_K SIMD kernel support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
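A minimal sketch of the linearProject() dispatch described above; all names here are hypothetical stand-ins for the actual LlamaRuntime helper, and the quantized branch is a placeholder for the real SIMD kernel.

```kotlin
// Stand-in weight storage variants (hypothetical names).
sealed interface Weight
class QuantizedQ4K(val blocks: ByteArray) : Weight
// Row-major [out, in] FP32 weight.
class Fp32Weight(val rows: Int, val cols: Int, val data: FloatArray) : Weight

// Placeholder: the real implementation would call the Q4_K SIMD kernel.
fun quantizedMatmul(x: FloatArray, w: QuantizedQ4K): FloatArray =
    error("stand-in for QuantizedMatmul.matmulQ4_K in skainet-backend-cpu")

fun linearProject(x: FloatArray, w: Weight): FloatArray = when (w) {
    // Quantized path: kernel works on the raw Q4_K blocks directly.
    is QuantizedQ4K -> quantizedMatmul(x, w)
    // FP32 path: y[o] = sum_i x[i] * W[o][i], i.e. x times W-transposed.
    is Fp32Weight -> FloatArray(w.rows) { o ->
        var acc = 0f
        for (i in 0 until w.cols) acc += x[i] * w.data[o * w.cols + i]
        acc
    }
}
```

Dispatching on the storage type keeps the runtime oblivious to quantization details; adding a Q6_K kernel later would just add another branch.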
Pre-transpose ALL projection weights during MemSegWeightConverter so LlamaRuntime never calls .t() during inference. This eliminates direct buffer allocations that the JVM doesn't GC eagerly, which caused OOM on 48GB machines.

- Q4_K: transposed shape passed to Q4_KBlockTensorData.fromRawBytes()
- Q6_K: dequantize to FP32 + array transpose to [in, out] layout
- FP32: pre-transpose via .t() during conversion (one-time cost)
- linearProject() auto-detects layout: [in,out] = direct, [out,in] = .t()

Qwen3-8B-Q4_K_M now runs on a 48GB Mac at 14.2GB RSS (was 45GB+ OOM). Token generation works but output quality needs validation (Q6_K transpose ordering may need adjustment).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
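The layout auto-detection can be reduced to a shape check, sketched below with a hypothetical helper name. Note the inherent ambiguity: for square weights (in == out) the shape alone cannot tell the two layouts apart, so conversion-time metadata would be needed there.

```kotlin
// A weight stored as [in, out] can be multiplied directly (x @ W);
// a weight stored as [out, in] needs a transpose first.
fun needsTranspose(weightShape: IntArray, inFeatures: Int): Boolean =
    weightShape[0] != inFeatures
```

If every converter path emits [in, out], this check always returns false at inference time and the .t() allocation never happens.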
Revert the Q4_K block shape reinterpretation (it corrupted the block data layout) and dequantize all K-quant types to FP32 with an array pre-transpose. This produces correct output at the cost of higher memory (~30GB vs ~14GB), but still fits on a 48GB Mac without runtime OOM.

Qwen3-8B-Q4_K_M now generates correct output: "The capital of France is Paris."

Memory: ~30GB RSS (was 45GB+ OOM before the pre-transpose fix)
Speed: 0.002 tok/s (CPU-only 8B, expected for scalar FP32 matmul)

Future: a native Q4_K SIMD matmul with a proper block-aware transpose would reduce this to ~14GB and improve speed significantly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
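The one-time array pre-transpose used during conversion can be sketched as follows (a stand-in, not the converter's actual code): a row-major [out, in] FP32 array is rewritten into [in, out] once, so inference never allocates a transposed temporary.

```kotlin
// One-time conversion step: transpose a row-major [out, in] FP32 array
// into [in, out] layout. Runs once per weight at load time.
fun preTranspose(src: FloatArray, outDim: Int, inDim: Int): FloatArray {
    require(src.size == outDim * inDim)
    val dst = FloatArray(src.size)
    for (o in 0 until outDim)
        for (i in 0 until inDim)
            dst[i * outDim + o] = src[o * inDim + i]
    return dst
}
```

This is exactly why Q4_K cannot take the same shortcut: its 256-weight super-blocks pack scales and 4-bit values together, so element positions cannot be permuted without decoding and re-encoding each block.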
Covers the full numeric representation journey from GGUF file to matmul kernel dispatch:

- Stage 1: GGUF on-disk layout (Q4_K_M block format, tensor shapes)
- Stage 2: Raw byte loading via StreamingGGUFReader
- Stage 3: MemSegWeightConverter paths (Q4_0/Q8_0 SIMD, K-quant dequant + pre-transpose, FP32 pre-transpose)
- Stage 4: LlamaRuntime.linearProject() auto-detection
- Stage 5: Matmul kernel dispatch (SIMD Q4/Q8, scalar FP32)
- Why Q4_K blocks cannot be trivially transposed
- Memory budget table for the 8B model on a 48GB Mac
- Future: block-aware Q4_K transpose for 5GB inference

Includes a Mermaid diagram of the full data flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
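The memory figures quoted throughout this PR follow from a simple bits-per-parameter calculation; the helper below is just a back-of-envelope check, assuming an 8B model is roughly 8e9 parameters.

```kotlin
// GB needed to hold `params` weights at a given average bit width.
fun gigabytes(params: Double, bitsPerParam: Double): Double =
    params * bitsPerParam / 8.0 / 1e9

// Q4_K_M averages ~4.5 bits/param (4-bit values + per-block scales),
// so an 8B model is ~4.5GB quantized vs ~32GB dequantized to FP32 --
// consistent with the ~5GB and ~31GB figures above once activations,
// KV cache, and runtime overhead are added.
```
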
Add runChatOnce() to AgentCli for non-interactive single-prompt instruct testing. Wire --chat with a positional prompt to single-shot mode.

Add an instruct field to smoke-models.json: when true, the smoke test uses --chat with chat-template formatting instead of raw text completion. Fixes garbage output from instruct models (Qwen3) in smoke tests.

Refs: #49
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
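The new flag might look like this in smoke-models.json; only the `instruct` field comes from the commit, the other field names are illustrative.

```json
{
  "models": [
    { "name": "Qwen3-8B-Q4_K_M", "instruct": true },
    { "name": "some-base-model", "instruct": false }
  ]
}
```

Models with `instruct: true` are exercised via `--chat` with the chat template applied; the rest keep using raw text completion.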
Math.PI is not available on non-JVM targets (iOS, JS, WASM). Use kotlin.math.PI, which is multiplatform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
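For illustration (the function is hypothetical, not from the project): kotlin.math.PI resolves on every Kotlin target, while java.lang.Math.PI only exists on the JVM.

```kotlin
import kotlin.math.PI

// Compiles on JVM, iOS, JS, and WASM targets alike.
fun degToRad(deg: Double): Double = deg * PI / 180.0
```
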
- Install npm packages to /opt/antora (not /antora, which gets volume-mounted over)
- Set NODE_PATH=/opt/antora/node_modules so Antora finds extensions
- Use /opt/antora/node_modules/.bin/antora as the entrypoint (not npx)
- Fix the playbook content source url to /antora (git repo root)

Verified locally: 19 HTML pages, 12 Mermaid SVG diagrams rendered via Chromium inside the container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
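The layout above can be sketched as a Dockerfile fragment. Only the paths and the entrypoint come from the commit; the npm package name and surrounding lines are assumptions.

```dockerfile
# Install outside /antora so a volume mount over /antora cannot shadow it
WORKDIR /opt/antora
RUN npm install antora
# Let Antora resolve extensions installed outside the playbook directory
ENV NODE_PATH=/opt/antora/node_modules
# Invoke the binary directly instead of going through npx
ENTRYPOINT ["/opt/antora/node_modules/.bin/antora"]
```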
No description provided.