V-cache precision bug persists in v1.4.1 on head_dim=128 models (Qwen3-14b) — KV recall degradation from ctx ~1100t #11
Description
We tested v1.4.1 via the llama-cpp-2 Rust bindings (bypassing llama-cli argument parsing) on Qwen3-14b (head_dim=128) using a KV recall benchmark: a math problem at position 0, N tokens of unrelated filler, then the model is asked to recall and solve the problem. Temperature=0, so runs are deterministic.
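For clarity, the prompt layout of the recall benchmark can be sketched roughly like this (names and the filler source are illustrative only, not the actual bench script):

```python
def build_recall_prompt(problem: str, filler_tokens: int,
                        filler_word: str = "lorem") -> str:
    """Place the problem at position 0, then roughly filler_tokens of
    unrelated filler, then ask the model to recall and solve it."""
    # Crude one-word-per-token filler; the real benchmark may tokenize properly.
    filler = " ".join([filler_word] * filler_tokens)
    return (f"{problem}\n\n{filler}\n\n"
            "Now recall the problem stated at the very start and solve it.")

prompt = build_recall_prompt("Compute the determinant of [[2, 1], [1, 3]].", 1500)
```

The key property being tested is that the answer depends on tokens at position 0, so any KV-cache precision loss at long context shows up directly as recall failure.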
We tested both tbq4_0 (ID=42, blck_size=256) and tbq4_1 (ID=46, blck_size=128). The _1 suffix fixed short-context regressions but the precision bug at 1500-2000t persists in both.
Results (91 tests, filler 0–2500t):
| Config | 500-1000t | 1000-1500t | 1500-2000t | Overall |
|---|---|---|---|---|
| spiritbuun tbq4_0 (ID=42) | 100% | 90.5% | 38.9% | 85.7% |
| AmesianX v1.4.1 tbq4_0 (ID=42) | 69.2% | 81.0% | 38.9% | 78.0% |
| AmesianX v1.4.1 tbq4_1 (ID=46) | 92.3% | 85.7% | 33.3% | 81.3% |
The _1 suffix corrects the short-context regression (confirming that the head_dim=128 auto-detection does NOT run when the Rust bindings are used directly). But the 1500-2000t bucket is still only 33.3-38.9% in both versions — worse than spiritbuun.
Symptom: the model outputs off-topic text ("Finally, the…", "Starting with…") instead of the recalled matrix. This is consistent with KV-cache corruption rather than a wrong computation.
Note on the v1.4.0 fix: The release note says the IWHT FP32 fix was verified on Qwen3.5-27B-Q4_K_M. Is that model head_dim=128 or 256? We may be hitting a different code path.
Note on Rust bindings: when using the llama-cpp-2 bindings, the 6-priority cascade auto-detection does NOT run; we pass the raw GGML type ID directly. Users integrating via bindings must therefore specify the correct _0/_1/_2 suffix manually. Worth documenting.
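Until this is documented upstream, bindings users could guard the choice themselves. A minimal sketch, assuming (and this is purely an assumption on our side; the report only shows that _1 fixes head_dim=128) that the correct variant is the one whose blck_size divides head_dim:

```python
# Type ID -> blck_size pairs as observed in this report; hypothetical table,
# not an official mapping.
TBQ4_VARIANTS = {
    42: 256,  # tbq4_0
    46: 128,  # tbq4_1
}

def pick_tbq4_type(head_dim: int) -> int:
    """Return the GGML type ID whose block size evenly divides head_dim,
    preferring the largest block size that fits."""
    for type_id, blck_size in sorted(TBQ4_VARIANTS.items(),
                                     key=lambda kv: -kv[1]):
        if head_dim % blck_size == 0:
            return type_id
    raise ValueError(f"no tbq4 variant fits head_dim={head_dim}")
```

A helper like this would mimic what the in-tree cascade presumably does, and would have avoided the short-context regression we hit with ID=42 on head_dim=128.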
Reproduce: https://github.com/eullm/eullm/blob/main/bench/turboquant_math_accuracy.py

```shell
python bench/turboquant_math_accuracy.py collect \
  --label test --no-think --num-predict 2048 \
  --filler 200,500,1000,1500,2000,2500
```