fix: Gemma-4 QuantizedKVCache + kv_bits API + Test 9 (mlx-swift-lm b440) #73
Conversation
…b440):
- Bumps mlx-swift-lm submodule to b440 (tag) / 63707c0: fix(Gemma4Text): dispatch QuantizedKVCache correctly in LLM attention (merges PR #29, closes #71)
- Server.swift: expose `kv_bits` as a per-request API field (ChatCompletionRequest.kvBits -> GenerateParameters.kvBits), enabling native MLX QuantizedKVCache without a server restart.
- run_benchmark.sh: add Test 9 — QuantizedKVCache regression suite: [1/4] kv_bits=4 short, [2/4] kv_bits=8 short, [3/4] kv_bits=4 long (KV-sharing path), [4/4] baseline.

Test 9 passed on mlx-community/gemma-4-26b-a4b-it-4bit.
Pull request overview
Updates SwiftLM to support per-request native MLX quantized KV cache selection for Gemma-4 (issue #71), and adds a new benchmark suite to regression-test kv_bits decoding.
Changes:
- Adds `kv_bits` to the chat completions request schema and threads it into `GenerateParameters.kvBits`.
- Introduces "Test 9" in `run_benchmark.sh` to exercise `kv_bits=4/8` and a non-quantized baseline.
- (Per PR description) bumps the `mlx-swift-lm` submodule to a revision that fixes Gemma-4 `QuantizedKVCache` dispatch.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| run_benchmark.sh | Adds a new interactive menu option and a Python-based regression suite for kv_bits QuantizedKVCache decoding. |
| Sources/SwiftLM/Server.swift | Extends the chat request API with kv_bits and forwards it into generation parameters. |
```sh
# - 4-bit run: prefill + ≥20 decode tokens, response is non-empty coherent text
# - 8-bit run: same
# - Multi-turn run: second turn with kv_bits=4 also succeeds (exercises sharedKV path)
```
The Test 9 header comment/documentation doesn’t match what the script actually validates: it claims “≥20 decode tokens” and a “multi-turn run”, but the checks only require 3–5 tokens and there is no multi-turn conversation (each call is a fresh request). Please align the stated pass criteria with the implemented assertions, or update the script to actually perform the multi-turn/≥20-token checks.
Suggested change:

```diff
-# - 4-bit run: prefill + ≥20 decode tokens, response is non-empty coherent text
-# - 8-bit run: same
-# - Multi-turn run: second turn with kv_bits=4 also succeeds (exercises sharedKV path)
+# - 4-bit run succeeds without server crash and returns a short non-empty response
+# - 8-bit run succeeds without server crash and returns a short non-empty response
+# - The regression is considered covered by these single-request decode checks
```
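For reference, the per-run assertion Test 9 actually performs can be sketched as a single-request decode check (a few tokens, non-empty content) rather than the multi-turn / ≥20-token check the old header claimed. This is a hypothetical sketch; the response field names assume the OpenAI chat-completions shape the script parses elsewhere.

```python
def check_decode_run(response: dict, min_tokens: int = 3) -> None:
    """Assert one kv_bits run produced a short, non-empty response."""
    content = response["choices"][0]["message"].get("content") or ""
    assert content.strip(), "response must be non-empty"
    used = response.get("usage", {}).get("completion_tokens", 0)
    assert used >= min_tokens, f"expected >= {min_tokens} decode tokens, got {used}"
```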
```sh
echo "Waiting for server (up to 180s)..."
for i in {1..180}; do
  if ! kill -0 $SERVER_PID 2>/dev/null; then
    echo "❌ Server died early. Logs:"
    print_server_log ./tmp/kvcache_regression.log
    exit 1
  fi
  if curl -sf http://127.0.0.1:5431/health > /dev/null 2>&1; then
    echo "Server ready (${i}s)"
    break
  fi
  sleep 1
done
```
The server readiness loop breaks early on success, but if the server never becomes healthy within 180s the loop just ends and the script continues into the Python test, which will fail with a less-informative connection error. Add an explicit post-loop check/flag to fail fast with a clear “server not ready after 180s” message (and print logs) when the timeout is hit.
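The fail-fast pattern being requested can be sketched in Python with an injectable probe (the actual fix in the bash script would be a `SERVER_READY` flag checked after the loop; the function below is illustrative only):

```python
import time

def wait_for_server(probe, timeout_s: int = 180, interval_s: float = 1.0) -> bool:
    """Poll `probe()` (e.g. an HTTP /health check) until it returns True.

    Returns True as soon as the server is ready; returns False when the
    timeout elapses, so the caller can print the server log and exit with
    a clear "server not ready" message instead of letting a later request
    fail with an opaque connection error.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```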
```swift
let params = GenerateParameters(
    maxTokens: tokenLimit,
    maxKVSize: config.ctxSize,
    kvBits: chatReq.kvBits,
```
kvBits is threaded into GenerateParameters, but the prompt cache currently keys only on token prefix and will happily restore a cached KV state produced with different generation parameters. With a per-request kv_bits, this can mix quantized and non-quantized cache states across requests, which is unsafe and can lead to incorrect results or runtime failures. Consider including kvBits (and any other cache-shaping params) in the prompt-cache key, or disabling prompt-cache restore/save whenever chatReq.kvBits is non-nil.
Suggested change:

```diff
-let params = GenerateParameters(
-    maxTokens: tokenLimit,
-    maxKVSize: config.ctxSize,
-    kvBits: chatReq.kvBits,
+// Do not thread per-request kvBits into generation while prompt-cache identity
+// is still based only on prompt prefix. Reusing a cached KV state created with a
+// different kvBits setting is unsafe and can produce incorrect results or failures.
+let params = GenerateParameters(
+    maxTokens: tokenLimit,
+    maxKVSize: config.ctxSize,
+    kvBits: nil,
```
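The reviewer's first alternative, making `kv_bits` part of the prompt-cache identity, can be sketched like this. The dict-based cache and function names below are illustrative only, not the server's actual prompt-cache implementation (which is in Swift):

```python
# Keep cached KV states from being reused across quantization settings by
# keying the cache on both the token prefix and the cache-shaping parameter.
prompt_cache = {}

def cache_key(prompt_tokens, kv_bits):
    # A state quantized at kv_bits=4 gets a different key from an
    # unquantized (kv_bits=None) or 8-bit state for the same prefix.
    return (tuple(prompt_tokens), kv_bits)

def restore_kv(prompt_tokens, kv_bits):
    return prompt_cache.get(cache_key(prompt_tokens, kv_bits))

def save_kv(prompt_tokens, kv_bits, kv_state):
    prompt_cache[cache_key(prompt_tokens, kv_bits)] = kv_state
```

Any other parameter that changes the shape or dtype of the cached KV tensors would belong in the key for the same reason.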
```swift
/// Number of bits for native MLX quantized KV cache (nil = no quantization, 4 or 8 typical).
/// Enables `QuantizedKVCache` instead of `KVCacheSimple`. Separate from `--turbo-kv`.
let kvBits: Int?
```
kv_bits is a new public request field, but there’s no input validation here. If a client sends an unsupported value (e.g. 0, 3, 16), it will be forwarded into GenerateParameters and may cause an internal failure (500) instead of a clear 400. Add explicit validation (accept only nil/4/8, or whatever the underlying library supports) and return a structured OpenAI-style error message for invalid values.
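The requested validation is simple to state: accept only absent/4/8 and return a structured OpenAI-style error body otherwise. Sketched in Python for clarity (the real check lives in Swift, and the accepted set should match whatever the underlying library supports):

```python
def validate_kv_bits(kv_bits):
    """Return None if kv_bits is acceptable, else an OpenAI-style error body."""
    if kv_bits is None or kv_bits in (4, 8):
        return None  # valid: no error, proceed with generation
    # Shape mirrors OpenAI's error envelope so clients get a clear 400.
    return {
        "error": {
            "message": f"Invalid kv_bits value {kv_bits!r}; expected 4, 8, or omitted.",
            "type": "invalid_request_error",
            "param": "kv_bits",
            "code": "invalid_value",
        }
    }
```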
```swift
let params = GenerateParameters(
    maxTokens: tokenLimit,
    maxKVSize: config.ctxSize,
    kvBits: chatReq.kvBits,
```
This PR adds a new per-request kv_bits execution path, but the CI integration tests (e.g. tests/test-server.sh) don’t cover it. Add at least one E2E test that sets kv_bits (and ideally alternates quantized/non-quantized requests) to ensure the server doesn’t crash and returns valid responses.
```python
elapsed = time.time() - t0
content = d["choices"][0]["message"].get("content") or ""
# Strip Gemma-4 thinking blocks
content = re.sub(r"<\|channel\|>thought.*?<channel\|>", "", content, flags=re.DOTALL).strip()
```
The “strip thinking blocks” regex only matches the <|channel|>thought ... <channel|> form. Elsewhere in this repo the tag is referenced as <|channel>thought (no trailing | after channel). To make this regression test robust across tag variants, expand the regex to handle both forms (and ideally require a correct closing token).
Suggested change:

```diff
-content = re.sub(r"<\|channel\|>thought.*?<channel\|>", "", content, flags=re.DOTALL).strip()
+content = re.sub(r"<\|channel\|?>thought.*?<\|/channel\|?>", "", content, flags=re.DOTALL).strip()
```
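The widened pattern from this suggestion makes the pipe before `>` optional and requires a proper closing token, so both tag variants are handled. A quick sketch (sample strings are illustrative):

```python
import re

# Optional '|' before '>' covers both <|channel|>thought and <|channel>thought;
# the close requires a matching <|/channel|> / <|/channel> token.
THOUGHT_RE = re.compile(r"<\|channel\|?>thought.*?<\|/channel\|?>", re.DOTALL)

def strip_thinking(content: str) -> str:
    """Remove Gemma-4 thinking blocks, in either tag variant."""
    return THOUGHT_RE.sub("", content).strip()
```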
README.md:
- Added '🔧 Per-Request API Parameters' section with kv_bits table, kv_bits vs --turbo-kv comparison table, and curl usage example
- Clarified --turbo-kv CLI entry: 'activates after 2048 tokens, server-wide'

Server.swift:
- Added kv_bits input validation (only nil/4/8 accepted; returns 400 otherwise)
- Bypass prompt cache restore when kv_bits is set (prevents unsafe mixing of QuantizedKVCache and KVCacheSimple states across requests)
- Bypass prompt cache save when kv_bits is set (same safety reason)

run_benchmark.sh (Test 9):
- Corrected header comment to match actual assertions (removed false ≥20 token and multi-turn claims; stated actual ≥3 token / non-empty checks)
- Added explicit SERVER_READY flag + post-loop failure with log dump
- Widened thinking-block regex to handle both <|channel|>thought and <|channel>thought
Thanks for the thorough review. Applied 5 of the 6 comments in the latest push.
The E2E test comment (CI test coverage) is tracked as a follow-up — wiring it into `tests/test-server.sh`.
- Replace 🧠 with 📡 heading emoji
- Rewrite as structured tables (Text / Vision / Audio) with all 50+ model families derived from the actual MLXLLM + MLXVLM model file inventory
- LLM table: Gemma, Qwen, Phi, Mistral, Llama, GLM, DeepSeek, Falcon, LFM2, OLMo, Granite, SmolLM3, InternLM2, Cohere, Jamba, Exaone, MiMo, Ernie, Baichuan, Bailing, NemotronH, Starcoder2, OpenELM, BitNet, MiniMax, Apertus/AfMoE, MiniCPM, Qwen3Next
- VLM table: Gemma4, Gemma3, Qwen3-VL, Qwen2-VL/2.5-VL, LFM2-VL, Pixtral, PaliGemma, Idefics3, Mistral3, FastVLM, SmolVLM2, GlmOcr, QwenVL
- ALM table: Gemma-4-e4b only (factually correct — Qwen2-Audio removed; it was never wired into the audio pipeline here)
…reverted from main in 50c3732)
Summary
Closes #71. Bundles three changes:
1. mlx-swift-lm submodule → b440

   Bumps the submodule from `50c3732` (b436) to `63707c0` (b440), which includes:
   - `fix(Gemma4Text)` — dispatch `QuantizedKVCache` correctly in LLM attention
   - `fix(Gemma4)` — pad token / infinite padding loop fix

2. `Server.swift` — `kv_bits` per-request API field

   Adds `kv_bits: Int?` to `ChatCompletionRequest` (JSON key `"kv_bits"`) and threads it into `GenerateParameters.kvBits`. Enables MLX-native `QuantizedKVCache` (4-bit / 8-bit) without a CLI flag or server restart.
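A request enabling the quantized cache simply carries the new field alongside the standard chat-completions fields. A minimal sketch of the request body (prompt and token limit are placeholders):

```python
import json

# The new per-request field rides in the normal chat-completions payload;
# "kv_bits": 4 selects the 4-bit QuantizedKVCache, omit it for KVCacheSimple.
payload = {
    "model": "mlx-community/gemma-4-26b-a4b-it-4bit",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32,
    "kv_bits": 4,
}
body = json.dumps(payload)
```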
3. `run_benchmark.sh` — Test 9: QuantizedKVCache Regression

   New test suite that sends decode requests with `kv_bits=4`, `kv_bits=8`, a longer prompt (exercises KV-sharing layers), and a baseline without quantization.

   Test 9 result on `gemma-4-26b-a4b-it-4bit`: