Releases: KaiFelixBennett/gemma4-turboquant-rdna4

v1.0.0 - TurboQuant KV + HIP graphs on RDNA4, fully measured

10 Jun 12:28

First stable release.

What works (all measured on an AMD Radeon AI PRO R9700, gfx1201, 32 GB):

  • TurboQuant KV cache + HIP graphs together on RDNA4: 735 t/s prefill, crash-free decode (patches/0001 + 0002 against TheTom/llama-cpp-turboquant @ 7d9715f)
  • Full 256K context loads with ~9 GB VRAM to spare (turbo3)
  • Decode: ~22 t/s at low context, 9.38 +/- 0.93 t/s at 128K (llama-bench)
  • Quality: needle 9/9 for q8_0/turbo4 and turbo3/turbo3; full KLD study in docs/QUALITY.md
  • VS Code Copilot integration with a documented real 176K-token session (docs/VSCODE-COPILOT.md)
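
For orientation, a decode measurement at 128K depth with llama-bench might look roughly like the sketch below. This is not the exact command behind the numbers above: the model path is hypothetical, the token counts are placeholders, and it assumes the patched fork accepts turbo3 as a cache type for -ctk/-ctv and that this build's llama-bench supports -d (context depth) and -fa.

```shell
# Hypothetical model path; substitute your own GGUF file.
MODEL="${MODEL:-models/gemma4.gguf}"

# -p 0 -n 128      : skip the prompt test, time 128 generated tokens (placeholder count)
# -d 131072        : measure at 128K tokens of existing context
# -ctk/-ctv turbo3 : K and V cache types (assumed to be accepted by the patched fork)
# -fa 1            : flash attention, normally needed for a quantized V cache
BENCH_CMD="llama-bench -m $MODEL -p 0 -n 128 -d 131072 -ctk turbo3 -ctv turbo3 -fa 1"

# Printed for review only; run the printed line yourself once the values look right.
echo "$BENCH_CMD"
```

The authoritative settings and the full result tables are in docs/BENCHMARKS.md.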

Three config traps that silently slow decode by 5-10x (on any GPU): the batch scratch buffer allocated by -b 16384, the --parallel 4 default, and the llama-server session-state defaults (SWA checkpoints + prompt cache). See docs/BENCHMARKS.md.
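
A minimal sketch of a llama-server launch that sidesteps the first two traps. The model path is hypothetical, and the values are assumptions rather than the settings recommended in docs/BENCHMARKS.md: 2048 is the upstream llama.cpp default for -b, and --parallel 1 requests a single slot. The session-state flags are deliberately left out here because their names vary between llama.cpp versions.

```shell
# Hypothetical model path; substitute your own GGUF file.
MODEL="${MODEL:-models/gemma4.gguf}"

# -b 2048      : avoid the oversized scratch buffer that -b 16384 allocates
# --parallel 1 : one slot instead of the 4-slot default named above
SERVER_CMD="llama-server -m $MODEL -b 2048 --parallel 1"

# Printed for review only; run the printed line yourself once the values look right.
echo "$SERVER_CMD"
```

Comparing decode t/s before and after each change is the quickest way to confirm which trap applies to a given setup.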

One-command reproduction: scripts/setup.ps1 (clones the fork at the pinned commit, applies both patches, builds for gfx1201 with HIP graphs).