V-cache precision bug persists in v1.4.1 on head_dim=128 models (Qwen3-14b) — KV recall degradation from ctx ~1100t #11
Description
We tested v1.4.1 via the llama-cpp-2 Rust bindings (bypassing llama-cli argument parsing) on Qwen3-14b (head_dim=128) using a KV recall benchmark: a math problem at position 0, N tokens of unrelated filler, then the model is asked to recall and solve the problem. Temperature=0, so runs are deterministic.
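For clarity, the prompt layout of the recall benchmark can be sketched roughly like this (names and the filler source are illustrative only, not the actual bench script):

```python
def build_recall_prompt(problem: str, filler_tokens: int,
                        filler_word: str = "lorem") -> str:
    """Place the problem at position 0, then roughly filler_tokens of
    unrelated filler, then ask the model to recall and solve it."""
    # Crude one-word-per-token filler; the real benchmark may tokenize properly.
    filler = " ".join([filler_word] * filler_tokens)
    return (f"{problem}\n\n{filler}\n\n"
            "Now recall the problem stated at the very start and solve it.")

prompt = build_recall_prompt("Compute the determinant of [[2, 1], [1, 3]].", 1500)
```

The key property being tested is that the answer depends on tokens at position 0, so any KV-cache precision loss at long context shows up directly as recall failure.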
We tested both tbq4_0 (ID=42, blck_size=256) and tbq4_1 (ID=46, blck_size=128). The _1 suffix fixed short-context regressions but the precision bug at 1500-2000t persists in both.
Results (91 tests, filler 0–2500t):
| Config | 500-1000t | 1000-1500t | 1500-2000t | Overall |
|---|---|---|---|---|
| spiritbuun tbq4_0 (ID=42) | 100% | 90.5% | 38.9% | 85.7% |
| AmesianX v1.4.1 tbq4_0 (ID=42) | 69.2% | 81.0% | 38.9% | 78.0% |
| AmesianX v1.4.1 tbq4_1 (ID=46) | 92.3% | 85.7% | 33.3% | 81.3% |
The _1 suffix corrects the short-context regression (confirming that the head_dim=128 auto-detection does NOT run when the Rust bindings are used directly). But the 1500-2000t bucket is still only 33.3-38.9% in both versions — worse than spiritbuun.
Symptom: the model outputs off-topic text ("Finally, the…", "Starting with…") instead of the recalled matrix. This is consistent with KV-cache corruption rather than a wrong computation.
Note on the v1.4.0 fix: The release note says the IWHT FP32 fix was verified on Qwen3.5-27B-Q4_K_M. Is that model head_dim=128 or 256? We may be hitting a different code path.
Note on Rust bindings: when using the llama-cpp-2 bindings, the 6-priority cascade auto-detection does NOT run; we pass the raw GGML type ID directly. Users integrating via bindings must therefore specify the correct _0/_1/_2 suffix manually. Worth documenting.
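Until this is documented upstream, bindings users could guard the choice themselves. A minimal sketch, assuming (and this is purely an assumption on our side; the report only shows that _1 fixes head_dim=128) that the correct variant is the one whose blck_size divides head_dim:

```python
# Type ID -> blck_size pairs as observed in this report; hypothetical table,
# not an official mapping.
TBQ4_VARIANTS = {
    42: 256,  # tbq4_0
    46: 128,  # tbq4_1
}

def pick_tbq4_type(head_dim: int) -> int:
    """Return the GGML type ID whose block size evenly divides head_dim,
    preferring the largest block size that fits."""
    for type_id, blck_size in sorted(TBQ4_VARIANTS.items(),
                                     key=lambda kv: -kv[1]):
        if head_dim % blck_size == 0:
            return type_id
    raise ValueError(f"no tbq4 variant fits head_dim={head_dim}")
```

A helper like this would mimic what the in-tree cascade presumably does, and would have avoided the short-context regression we hit with ID=42 on head_dim=128.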
Reproduce: https://github.com/eullm/eullm/blob/main/bench/turboquant_math_accuracy.py

```shell
python bench/turboquant_math_accuracy.py collect \
  --label test --no-think --num-predict 2048 \
  --filler 200,500,1000,1500,2000,2500
```