Release TurboQuant v1.8.0 — DeepSeek-V4-Flash + MTP Self-Speculative Decoding · AmesianX/TurboQuant

TurboQuant v1.8.0 — DeepSeek-V4-Flash Full CUDA Port + MTP Self-Speculative Decoding

DeepSeek-V4-Flash (deepseek4) runs end-to-end on the TurboQuant fork — CSA/HCA compressed attention, hyper-connections (sinkhorn), the DSA lightning indexer, the 256-expert IQ2_XS MoE, and a phase-uniform decode graph with CUDA-graph capture. On top of that:

- MTP self-speculative decoding (`--spec-type draft-mtp`) from antirez's side GGUF. The MTP head ships as a separate third split shard (no requantization of the 82GB main shards); the `--spec-draft-p-min 0.75` gate is mandatory.
- `-ctk tbq3 -ctv tbq3` on DSV4: global (ratio==0) layers get TBQ3_0 @ head_dim 512; SWA + compressed side caches stay f16 by quality policy.

Performance (GB10, production `tbq3` + MTP)

`-ctk tbq3 -ctv tbq3 --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2`

- ~16–20 tok/s decode, acceptance-driven (96–100% draft accept). The jitter is inherent to speculative decoding: round time is roughly constant (~25 ms, the target verify GPU pass), while tokens per round swing from 1 to 3 with text predictability.
- `DSV4_KERNEL_PROF` shows decode is memory-bound: matmul (IQ2_XS MoE + Q8_0 projections) ~52% near the LPDDR ceiling, data-shuffle ~19%, flash-attention (D=512 compressed KV) ~2.3%.
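
The 1-to-3 tokens-per-round swing follows from the draft limit (`--spec-draft-n-max 2`) combined with the confidence gate (`--spec-draft-p-min 0.75`). Below is a minimal Python sketch of one greedy draft-and-verify round. It assumes the gate stops drafting as soon as the draft head's top-token probability drops below `p_min`. The names `propose_draft`, `verify_round`, and the toy models are hypothetical and for illustration only; this is not the TurboQuant code, and a real implementation would verify the whole draft in one batched target pass rather than token by token.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to
# (most likely next token, probability of that token).
TopFn = Callable[[List[int]], Tuple[int, float]]


def propose_draft(prefix: List[int], draft_top: TopFn,
                  p_min: float = 0.75, n_max: int = 2) -> List[int]:
    """Greedily draft up to n_max tokens, stopping at the first
    token whose probability falls below p_min (assumed gate semantics)."""
    draft: List[int] = []
    ctx = list(prefix)
    while len(draft) < n_max:
        tok, p = draft_top(ctx)
        if p < p_min:
            break  # draft head is not confident enough; stop drafting
        draft.append(tok)
        ctx.append(tok)
    return draft


def verify_round(prefix: List[int], draft: List[int],
                 target_top: TopFn) -> List[int]:
    """Greedy verification: keep draft tokens while they match the
    target's own choice, then emit exactly one target token."""
    out: List[int] = []
    ctx = list(prefix)
    for tok in draft:
        t, _ = target_top(ctx)
        if t != tok:
            out.append(t)  # target's token replaces the rejected draft token
            return out
        out.append(tok)
        ctx.append(tok)
    t, _ = target_top(ctx)
    out.append(t)  # all drafts accepted: one extra token from the target
    return out


# Toy models (hypothetical): the target counts upward modulo 10; the
# draft head agrees but is unsure whenever the next token is a multiple of 4.
def target_top(ctx: List[int]) -> Tuple[int, float]:
    return (ctx[-1] + 1) % 10, 1.0


def draft_top(ctx: List[int]) -> Tuple[int, float]:
    nxt = (ctx[-1] + 1) % 10
    return nxt, (0.5 if nxt % 4 == 0 else 0.9)


confident = verify_round([1], propose_draft([1], draft_top), target_top)
gated = verify_round([3], propose_draft([3], draft_top), target_top)
```

With these toy models, the confident round yields three tokens (two accepted drafts plus one target token), while the gated round drafts nothing and yields one. Since every round pays for one verify pass regardless, decode speed tracks how many tokens each round produces.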

Verify-graph phase-uniform reuse infrastructure also landed, OFF by default (`DSV4_VERIFY_REUSE`). It is correct but off the critical path (graph build overlaps async GPU compute), and it must be perplexity-gated.

Environment: NVIDIA DGX Spark (GB10, 128GB) · DeepSeek-V4-Flash-IQ2_XS-XL (82GB, antirez lineage) · ctx=16384 · greedy.

See README.md / README_KO.md for full details.
