V-cache precision bug persists in v1.4.1 on head_dim=128 models (Qwen3-14b) — KV recall degradation from ctx ~1100t #11

@primoco

We tested v1.4.1 via llama-cpp-2 Rust bindings (bypassing llama-cli argument parsing) on Qwen3-14b (head_dim=128) using a KV recall benchmark: math problem at position 0, N tokens of unrelated filler, model asked to recall and solve. Temperature=0, deterministic.
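For readers who want to picture the prompt layout, here is a minimal sketch of how such a recall prompt could be assembled. It is not taken from the linked benchmark script: `build_recall_prompt`, the filler sentence, and the wording of the recall question are all hypothetical, and filler length is counted in whitespace-separated words as a rough stand-in for tokens.

```python
def build_recall_prompt(problem: str, filler_sentence: str, filler_words: int) -> str:
    """Put the problem first, pad with unrelated filler, then ask for recall.

    Hypothetical helper: word count is only an approximation of token count.
    """
    words = filler_sentence.split()
    if not words:
        raise ValueError("filler_sentence must contain at least one word")
    # Repeat the filler sentence often enough, then trim to the exact word count.
    repeats = -(-filler_words // len(words)) if filler_words > 0 else 0
    filler = " ".join((words * repeats)[:filler_words])
    question = "Recall the math problem stated at the very beginning and solve it."
    # Skip the filler section entirely when no padding was requested.
    return "\n\n".join(part for part in (problem, filler, question) if part)


if __name__ == "__main__":
    print(build_recall_prompt("Compute 2+2.", "The sky is blue today.", 12))
```

A real run would measure the filler with the model's tokenizer so that the bucket boundaries refer to actual token positions.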

We tested both tbq4_0 (ID=42, blck_size=256) and tbq4_1 (ID=46, blck_size=128). The _1 suffix fixed short-context regressions but the precision bug at 1500-2000t persists in both.

Results (91 tests, filler 0–2500t):

| Config | 500-1000t | 1000-1500t | 1500-2000t | Overall |
| --- | --- | --- | --- | --- |
| spiritbuun tbq4_0 (ID=42) | 100% | 90.5% | 38.9% | 85.7% |
| AmesianX v1.4.1 tbq4_0 (ID=42) | 69.2% | 81.0% | 38.9% | 78.0% |
| AmesianX v1.4.1 tbq4_1 (ID=46) | 92.3% | 85.7% | 33.3% | 81.3% |

The _1 suffix largely corrects the short-context regression (69.2% to 92.3% in the 500-1000t bucket), confirming that the head_dim=128 auto-detection does NOT apply when using Rust bindings directly. But the 1500-2000t bucket is still at 33-39% in both v1.4.1 variants: tbq4_0 matches spiritbuun at 38.9%, and tbq4_1 is lower at 33.3%.
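The per-bucket percentages can be computed along these lines. This is a sketch rather than the benchmark's own code: the `(filler_tokens, passed)` record shape and the half-open bucket boundaries (1500 counts toward 1500-2000t) are assumptions, and the sample records are made up for illustration.

```python
from collections import defaultdict

BUCKET_WIDTH = 500  # assumed bucket width, matching the table columns


def bucket_label(filler_tokens: int, width: int = BUCKET_WIDTH) -> str:
    """Map a filler length to a half-open bucket label such as '1500-2000t'."""
    lo = (filler_tokens // width) * width
    return f"{lo}-{lo + width}t"


def accuracy_by_bucket(results):
    """results: iterable of (filler_tokens, passed) pairs -> {bucket: percent}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for filler_tokens, passed in results:
        label = bucket_label(filler_tokens)
        totals[label] += 1
        passes[label] += int(passed)
    return {label: round(100.0 * passes[label] / totals[label], 1) for label in totals}


if __name__ == "__main__":
    # Illustrative records only, not the measured data from this issue.
    sample = [(1600, True), (1700, False), (1900, False), (1200, True)]
    print(accuracy_by_bucket(sample))
```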

Symptom: model outputs off-topic text ("Finally, the…", "Starting with…") instead of the recalled matrix — consistent with KV cache corruption, not wrong computation.
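One rough way to separate "lost the context" from "recalled it but computed wrongly" is sketched below. This heuristic is an assumption for illustration, not what the benchmark script does: `classify_output` and its three labels are hypothetical, and it only checks whether any number from the original problem reappears in the output.

```python
import re


def classify_output(output: str, expected_answer: str, problem: str) -> str:
    """Label an output as 'recalled', 'wrong_computation', or 'off_topic'."""
    if expected_answer in output:
        return "recalled"
    problem_numbers = set(re.findall(r"\d+", problem))
    output_numbers = set(re.findall(r"\d+", output))
    # The model still refers to the problem's operands but gets the result wrong.
    if problem_numbers & output_numbers:
        return "wrong_computation"
    # Nothing from the problem survives in the output.
    return "off_topic"


if __name__ == "__main__":
    problem = "Multiply [[1, 2], [3, 4]] by [[5, 6], [7, 8]]."
    print(classify_output("Finally, the weather improved.", "19 22", problem))
```

Outputs such as "Finally, the…" land in the `off_topic` class under this rule, which is the pattern reported above.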

Note on the v1.4.0 fix: The release note says the IWHT FP32 fix was verified on Qwen3.5-27B-Q4_K_M. Is that model head_dim=128 or 256? We may be hitting a different code path.

Note on Rust bindings: When using llama-cpp-2 bindings, the 6-priority cascade auto-detection does NOT run — we pass the raw GGML type ID directly. Users integrating via bindings must specify the correct _0/_1/_2 suffix manually. Worth documenting.
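Until this is documented, an integrator could keep a small lookup on their side. The sketch below uses only the two type IDs and block sizes quoted in this issue. The selection rule (largest block size that evenly divides head_dim) is an assumption made for illustration and is not the fork's 6-priority cascade; `TbqType` and `pick_type_for_head_dim` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TbqType:
    name: str
    ggml_type_id: int
    block_size: int


# Only the variants and IDs quoted in this issue; other suffixes are omitted.
KNOWN_TYPES = (
    TbqType("tbq4_0", 42, 256),
    TbqType("tbq4_1", 46, 128),
)


def pick_type_for_head_dim(head_dim: int) -> TbqType:
    """Assumed rule: choose the largest block size that evenly divides head_dim."""
    candidates = [t for t in KNOWN_TYPES if head_dim % t.block_size == 0]
    if not candidates:
        raise ValueError(f"no known tbq4 variant fits head_dim={head_dim}")
    return max(candidates, key=lambda t: t.block_size)


if __name__ == "__main__":
    chosen = pick_type_for_head_dim(128)
    print(chosen.name, chosen.ggml_type_id)
```

Under this assumed rule, head_dim=128 resolves to tbq4_1 (ID=46), which is the variant that avoided the short-context regression in the results above.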

Reproduce: https://github.com/eullm/eullm/blob/main/bench/turboquant_math_accuracy.py

python bench/turboquant_math_accuracy.py collect \
  --label test --no-think --num-predict 2048 \
  --filler 200,500,1000,1500,2000,2500
