TurboQuant v1.8.0 — DeepSeek-V4-Flash Full CUDA Port + MTP Self-Speculative Decoding
DeepSeek-V4-Flash (deepseek4) runs end-to-end on the TurboQuant fork — CSA/HCA compressed attention, hyper-connections (sinkhorn), the DSA lightning indexer, the 256-expert IQ2_XS MoE, and a phase-uniform decode graph with CUDA-graph capture. On top of that:
- MTP self-speculative decoding (`--spec-type draft-mtp`) from antirez's side GGUF. The MTP head ships as a separate third split shard (no requantization of the 82GB main shards); the `--spec-draft-p-min 0.75` gate is mandatory.
- `-ctk tbq3 -ctv tbq3` on DSV4: global (ratio==0) layers get TBQ3_0 @ head_dim 512; SWA + compressed side caches stay f16 by quality policy.
Performance (GB10, production tbq3 + MTP)
`-ctk tbq3 -ctv tbq3 --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2`
- ~16–20 tok/s decode, acceptance-driven (96–100% draft accept). The jitter is inherent to speculative decoding: round time is roughly constant (~25 ms, the target-verify GPU pass), while tokens per round swing between 1 and 3 with text predictability.
- `DSV4_KERNEL_PROF` shows decode is memory-bound: matmul (IQ2_XS MoE + Q8_0 projections) ~52% near the LPDDR ceiling, data-shuffle ~19%, flash-attention (D=512 compressed KV) ~2.3%.
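
To make the tokens-per-round swing concrete, here is a minimal Python sketch of one greedy draft-then-verify round. It is not the fork's implementation: `draft_next` and `target_next` are hypothetical callables standing in for the MTP head and the main model, and it assumes the p-min gate stops drafting at the first token whose draft probability falls below the threshold. A real verify step scores all drafted tokens in one batched GPU pass; the sequential calls here are a simplification.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, for illustration only:
#   DraftFn:  context -> (token, probability) proposed by the draft head
#   TargetFn: context -> token the main model would pick greedily
DraftFn = Callable[[List[int]], Tuple[int, float]]
TargetFn = Callable[[List[int]], int]


def speculative_round(
    context: List[int],
    draft_next: DraftFn,
    target_next: TargetFn,
    p_min: float = 0.75,
    n_max: int = 2,
) -> List[int]:
    """Run one draft-then-verify round and return the tokens it emits."""
    # 1. Draft up to n_max tokens, stopping at the first low-confidence one.
    drafted: List[int] = []
    for _ in range(n_max):
        token, prob = draft_next(context + drafted)
        if prob < p_min:
            break
        drafted.append(token)

    # 2. Verify: accept drafted tokens while they match the target's greedy choice.
    emitted: List[int] = []
    for token in drafted:
        expected = target_next(context + emitted)
        if token != expected:
            # First mismatch: emit the target's token instead and end the round.
            emitted.append(expected)
            return emitted
        emitted.append(token)

    # 3. Every draft was accepted (or none was made): the target still
    #    contributes one token of its own, so a round always emits at least one.
    emitted.append(target_next(context + emitted))
    return emitted
```

Under these assumptions a round emits between 1 and `n_max + 1` tokens, which is the 1 to 3 range seen with a draft limit of 2: three when both drafts are confident and correct, one when the gate rejects the first draft or the first draft mismatches.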
Phase-uniform verify-graph reuse infrastructure also landed, off by default (`DSV4_VERIFY_REUSE`). It is correct but off the critical path (graph build overlaps async GPU compute), and it must be perplexity-gated.
Environment: NVIDIA DGX Spark (GB10, 128GB) · DeepSeek-V4-Flash-IQ2_XS-XL (82GB, antirez lineage) · ctx=16384 · greedy.
See README.md / README_KO.md for full details.