
Feature/49 tool calling pipeline oom #51

Merged
michalharakal merged 9 commits into develop from
feature/49-tool-calling-pipeline-oom
Apr 11, 2026

Conversation

@michalharakal
Contributor

No description provided.

michalharakal and others added 9 commits on April 11, 2026 at 13:29
LlamaRuntime previously pre-transposed all weight tensors at init,
doubling peak memory (an extra ~31GB for 8B models). This caused OOM
on 48GB machines when loading Qwen3-8B-Q4_K_M.

Replace eager pre-transpose with inline .t() calls during forward pass.
The GC reclaims each temporary transpose, so only one projection's
worth of memory (~200MB) is live at a time instead of the full set.

Before: ~62GB peak (31GB weights + 31GB transposed copies)
After:  ~31GB peak (weights only + 200MB temp per layer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
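
A minimal sketch of this change, assuming a hypothetical tensor API (the interface below is illustrative, not the real skainet types):

```kotlin
// Illustrative stand-in for the real tensor type; names assumed.
interface Tensor {
    fun t(): Tensor                      // transposed copy
    fun matmul(other: Tensor): Tensor    // row-major matmul
}

// Before: eager pre-transpose at init pinned a second full copy of every
// projection weight for the model's lifetime (+~31GB on an 8B model):
//   val wqT = wq.t()   // stored in a field, never reclaimed
//
// After: transpose inline in the forward pass. The temporary becomes
// unreachable as soon as matmul returns, so only one projection's
// transpose (~200MB) is live at any moment.
fun project(x: Tensor, w: Tensor): Tensor = x.matmul(w.t())
```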
MemSegWeightConverter previously dequantized Q4_K tensors to FP32
because "no native SIMD kernel yet". But Q4_KBlockTensorData and
QuantizedMatmul.matmulQ4_K() already exist in skainet-backend-cpu.

Wire Q4_K into the SIMD path: create Q4_KBlockTensorData from raw
bytes instead of dequantizing. This keeps Q4_K weights in their
compact quantized form (~4.5 bits/param) and uses the SIMD kernel
for matmul at inference time.

Memory impact for Qwen3-8B-Q4_K_M:
  Before: ~31GB (Q4_K dequantized to FP32)
  After:  ~5GB  (Q4_K kept quantized)

Q5_K and Q6_K still dequantize to FP32 (no SIMD kernel yet).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
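
A sketch of the converter branch this commit changes. Q4_KBlockTensorData and the from-raw-bytes construction are named in this PR; the enum and the dequantizeToF32 helper are illustrative stand-ins:

```kotlin
// Illustrative stand-ins; real skainet signatures may differ.
enum class GgmlType { Q4_K, Q5_K, Q6_K }

class Q4_KBlockTensorData private constructor(val raw: ByteArray, val shape: IntArray) {
    companion object {
        fun fromRawBytes(raw: ByteArray, shape: IntArray) =
            Q4_KBlockTensorData(raw, shape)  // keeps the ~4.5 bits/param block format
    }
}

fun dequantizeToF32(type: GgmlType, raw: ByteArray, shape: IntArray): FloatArray =
    FloatArray(shape.fold(1) { a, b -> a * b })  // placeholder: 4 bytes/param in FP32

// The decision this commit changes: Q4_K stays quantized for the SIMD
// matmul kernel; Q5_K/Q6_K still expand to FP32 (no SIMD kernel yet).
fun convert(type: GgmlType, raw: ByteArray, shape: IntArray): Any = when (type) {
    GgmlType.Q4_K -> Q4_KBlockTensorData.fromRawBytes(raw, shape)
    GgmlType.Q5_K, GgmlType.Q6_K -> dequantizeToF32(type, raw, shape)
}
```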
Wire Q4_K into MemSegWeightConverter to keep weights quantized (~5GB)
instead of dequantizing to FP32 (~31GB). Add linearProject() helper
to LlamaRuntime that dispatches to quantized matmul for Q4_K weights
and standard transpose+matmul for FP32 weights.

The 8B model now loads successfully on a 48GB Mac (15s load time) and
generates tokens via the Q4_K SIMD kernel. However, the inline .t()
on Q6_K-dequantized FP32 tensors still causes direct buffer OOM
during inference -- the JVM doesn't reclaim direct memory eagerly.
Full fix requires lazy per-layer dequantization (Solution B) or
Q6_K SIMD kernel support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
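
The dispatch described above, sketched with assumed signatures. QuantizedMatmul.matmulQ4_K is the real kernel name from this PR; the stubs reuse the illustrative types from the previous sketch:

```kotlin
// Stubs with assumed signatures; matmulQ4_K is the real kernel name.
object QuantizedMatmul {
    fun matmulQ4_K(x: FloatArray, w: Q4_KBlockTensorData): FloatArray = TODO()
}
fun transposeAndMatmulF32(x: FloatArray, w: FloatArray): FloatArray = TODO()

// linearProject(): Q4_K routes to the SIMD quantized matmul; FP32 takes
// the standard transpose+matmul path (whose inline transpose on
// Q6_K-dequantized tensors is what still OOMs; the next commit fixes it).
fun linearProject(x: FloatArray, weight: Any): FloatArray = when (weight) {
    is Q4_KBlockTensorData -> QuantizedMatmul.matmulQ4_K(x, weight)
    is FloatArray -> transposeAndMatmulF32(x, weight)
    else -> error("unsupported weight representation")
}
```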
Pre-transpose ALL projection weights in MemSegWeightConverter so
LlamaRuntime never calls .t() during inference. This eliminates direct
buffer allocations that the JVM doesn't GC eagerly, which caused OOM
on 48GB machines.

- Q4_K: transposed shape passed to Q4_KBlockTensorData.fromRawBytes()
- Q6_K: dequantize to FP32 + array transpose to [in, out] layout
- FP32: pre-transpose via .t() during conversion (one-time cost)
- linearProject() auto-detects layout: [in,out] = direct, [out,in] = .t()

Qwen3-8B-Q4_K_M now runs on a 48GB Mac at 14.2GB RSS (was 45GB+ OOM).
Token generation works but output quality needs validation (Q6_K
transpose ordering may need adjustment).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
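
A sketch of the layout auto-detection on the FP32 path, assuming the runtime knows each projection's logical dimensions (the tensor surface is again illustrative):

```kotlin
// Assumed minimal tensor surface for this sketch.
interface ShapedTensor {
    val shape: IntArray
    fun t(): ShapedTensor
    fun matmul(other: ShapedTensor): ShapedTensor
}

// [in, out] means the converter already pre-transposed the weight, so it
// is used directly and no direct-buffer transpose happens at inference
// time; a legacy [out, in] tensor falls back to .t().
// (Square weights are ambiguous here; the real heuristic may differ.)
fun projectF32(x: ShapedTensor, w: ShapedTensor, inDim: Int, outDim: Int): ShapedTensor =
    if (w.shape[0] == inDim && w.shape[1] == outDim) x.matmul(w)  // pre-transposed
    else x.matmul(w.t())                                          // [out, in] fallback
```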
Revert the Q4_K block shape reinterpretation (it corrupted the block
data layout) and dequantize all K-quant types to FP32 with an array
pre-transpose.
This produces correct output at the cost of higher memory (~30GB vs
~14GB), but still fits on a 48GB Mac without runtime OOM.

Qwen3-8B-Q4_K_M now generates correct output:
  "The capital of France is Paris."

Memory: ~30GB RSS (was 45GB+ OOM before pre-transpose fix)
Speed: 0.002 tok/s (CPU-only 8B, expected for scalar FP32 matmul)

Future: native Q4_K SIMD matmul with proper block-aware transpose
would reduce to ~14GB and improve speed significantly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
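
The array pre-transpose itself is a plain copy loop over heap arrays, so nothing touches direct ByteBuffers at inference time. A self-contained sketch:

```kotlin
// One-time transpose from GGUF's row-major [out, in] order into the
// [in, out] layout that linearProject() consumes directly.
fun preTranspose(src: FloatArray, outDim: Int, inDim: Int): FloatArray {
    require(src.size == outDim * inDim)
    val dst = FloatArray(src.size)
    for (o in 0 until outDim) {
        for (i in 0 until inDim) {
            dst[i * outDim + o] = src[o * inDim + i]
        }
    }
    return dst
}
```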
Document the full numeric representation journey from GGUF file to
matmul kernel dispatch:

- Stage 1: GGUF on-disk layout (Q4_K_M block format, tensor shapes)
- Stage 2: Raw byte loading via StreamingGGUFReader
- Stage 3: MemSegWeightConverter paths (Q4_0/Q8_0 SIMD, K-quant
  dequant + pre-transpose, FP32 pre-transpose)
- Stage 4: LlamaRuntime.linearProject() auto-detection
- Stage 5: Matmul kernel dispatch (SIMD Q4/Q8, scalar FP32)
- Why Q4_K blocks cannot be trivially transposed (see the sketch after
  this commit message)
- Memory budget table for 8B model on 48GB Mac
- Future: block-aware Q4_K transpose for 5GB inference

Includes Mermaid diagram of the full data flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
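
On the "why Q4_K blocks cannot be trivially transposed" point, the block constants tell the story (sizes per the GGML Q4_K format; the constant names here are illustrative):

```kotlin
// A Q4_K super-block covers 256 consecutive elements of one row: two
// FP16 super-scales, 12 bytes of packed 6-bit sub-block scales/mins,
// and 128 bytes of 4-bit values => 144 bytes / 256 elems = 4.5 bits/param.
const val QK_K = 256                            // elements per super-block
const val Q4_K_BLOCK_BYTES = 2 + 2 + 12 + 128   // = 144

// Transposing the logical matrix would scatter each block's 256 elements
// across 256 different columns, so the packed bytes cannot simply be
// reordered without dequantizing and re-quantizing; hence "block-aware
// Q4_K transpose" is future work rather than a byte shuffle.
```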
Add runChatOnce() to AgentCli for non-interactive single-prompt
instruct testing. Wire --chat with positional prompt to single-shot
mode. Add instruct field to smoke-models.json — when true, the
smoke test uses --chat with chat template formatting instead of
raw text completion.

Fixes garbage output from instruct models (Qwen3) in smoke tests.

Refs: #49

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
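
A sketch of how the smoke runner might assemble the CLI invocation. Only --chat, the positional prompt, and the instruct field come from this commit; everything else is assumed:

```kotlin
// Assumed shape of a smoke-models.json entry; "instruct" is the new field.
data class SmokeModel(val path: String, val instruct: Boolean = false)

// instruct=true => single-shot chat mode via runChatOnce() with chat
// template formatting; otherwise raw text completion.
fun smokeArgs(model: SmokeModel, prompt: String): List<String> = buildList {
    add("--model"); add(model.path)  // flag name assumed
    if (model.instruct) add("--chat")
    add(prompt)                      // positional prompt
}
```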
Math.PI is not available on non-JVM targets (iOS, JS, WASM).
Use kotlin.math.PI which is multiplatform.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
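
The fix in miniature:

```kotlin
import kotlin.math.PI

// Before (JVM-only):  val twoPi = Math.PI * 2   // java.lang.Math, absent on iOS/JS/WASM
// After (all targets):
val twoPi = PI * 2
```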
- Install npm packages to /opt/antora (not /antora which gets
  volume-mounted over)
- Set NODE_PATH=/opt/antora/node_modules so Antora finds extensions
- Use /opt/antora/node_modules/.bin/antora as entrypoint (not npx)
- Fix playbook content source URL to /antora (git repo root)

Verified locally: 19 HTML pages, 12 Mermaid SVG diagrams rendered
via Chromium inside the container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
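
A Dockerfile sketch of the resulting layout (base image and package set are assumed; only the /opt/antora paths, NODE_PATH, and the entrypoint come from this commit):

```dockerfile
FROM node:20-slim                       # base image assumed
WORKDIR /opt/antora
# Install outside /antora, which is volume-mounted over at run time.
RUN npm install antora                  # plus the Mermaid extension the docs use
ENV NODE_PATH=/opt/antora/node_modules
ENTRYPOINT ["/opt/antora/node_modules/.bin/antora"]
# The playbook's content source url points at /antora, the git repo root
# mounted into the container.
```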
michalharakal merged commit ced73dd into develop on Apr 11, 2026
3 checks passed
michalharakal deleted the feature/49-tool-calling-pipeline-oom branch on April 11, 2026 at 19:18