Releases: KaiFelixBennett/gemma4-turboquant-rdna4
v1.0.0 - TurboQuant KV + HIP graphs on RDNA4, fully measured
First stable release.
What works (all measured on an AMD Radeon AI PRO R9700, gfx1201, 32 GB):
- TurboQuant KV cache + HIP graphs together on RDNA4: 735 t/s prefill, crash-free decode (patches/0001 + 0002 against TheTom/llama-cpp-turboquant @ 7d9715f)
- Full 256K context loads with ~9 GB VRAM to spare (turbo3)
- Decode: ~22 t/s at low context, 9.38 +/- 0.93 t/s at 128K (llama-bench)
- Quality: needle 9/9 for q8_0/turbo4 and turbo3/turbo3; full KLD study in docs/QUALITY.md
- VS Code Copilot integration with a documented real 176K-token session (docs/VSCODE-COPILOT.md)
Three config traps that silently cut decode throughput by 5-10x (on any GPU): the -b 16384 batch scratch buffer, the --parallel 4 default, and the llama-server session-state defaults (SWA checkpoints + prompt cache). See docs/BENCHMARKS.md.
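
As an illustration of how the first two traps could be caught before launch, here is a minimal sketch of a command-line check. The function `find_config_traps` is hypothetical and not part of this repository. It assumes the traps show up as `-b`/`--batch-size 16384` and as `--parallel`/`-np` either left unset or set to 4. It does not cover the session-state trap, and it does not suggest replacement values; those are documented in docs/BENCHMARKS.md.

```python
import shlex


def find_config_traps(command: str) -> list[str]:
    """Hypothetical helper: warn about two of the traps named in the release notes."""
    args = shlex.split(command)
    warnings = []

    # Trap 1: an explicit batch size of 16384 (large batch scratch buffer).
    for i, arg in enumerate(args[:-1]):
        if arg in ("-b", "--batch-size") and args[i + 1] == "16384":
            warnings.append("batch size is 16384")

    # Trap 2: --parallel left at the default of 4 (per the release notes),
    # either because it is unset or because it is set to 4 explicitly.
    parallel = None
    for i, arg in enumerate(args[:-1]):
        if arg in ("-np", "--parallel"):
            parallel = args[i + 1]
    if parallel is None:
        warnings.append("--parallel is unset, so the default of 4 applies")
    elif parallel == "4":
        warnings.append("--parallel is 4")

    return warnings


if __name__ == "__main__":
    # Example command line; the model file name is a placeholder.
    for warning in find_config_traps("llama-server -m model.gguf -b 16384"):
        print("warning:", warning)
```

The check only inspects the argument list, so it says nothing about the actual decode speed; measuring with llama-bench remains the way to confirm a configuration.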
One-command reproduction: scripts/setup.ps1 (clones the fork at the pinned commit, applies both patches, builds for gfx1201 with HIP graphs).