Releases · KaiFelixBennett/gemma4-turboquant-rdna4

First stable release.

What works (all measured on an AMD Radeon AI PRO R9700, gfx1201, 32 GB):

TurboQuant KV cache + HIP graphs together on RDNA4: 735 t/s prefill, crash-free decode (patches/0001 + 0002 against TheTom/llama-cpp-turboquant @ 7d9715f)
Full 256K context loads with ~9 GB VRAM to spare (turbo3)
Decode: ~22 t/s at low context, 9.38 +/- 0.93 t/s at 128K (llama-bench)
Quality: needle 9/9 for q8_0/turbo4 and turbo3/turbo3; full KLD study in docs/QUALITY.md
VS Code Copilot integration with a documented real 176K-token session (docs/VSCODE-COPILOT.md)

Three config traps that silently cut decode speed by 5-10x on any GPU: the batch scratch buffer allocated by -b 16384, the --parallel 4 default, and the llama-server session-state defaults (SWA checkpoints + prompt cache). See docs/BENCHMARKS.md.
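
As a rough illustration of the first two traps, a launch command can be checked before the server starts. This is a minimal sketch, not part of the repository: the function name `find_config_traps` is hypothetical, the 16384 threshold is taken literally from the note above, and treating an unset --parallel as the 4-slot default is an assumption based on that note. The session-state defaults are left out because these notes do not name the flags that control them; docs/BENCHMARKS.md is the reference for those.

```python
from typing import Dict, List


def find_config_traps(argv: List[str]) -> List[str]:
    """Return warnings for a llama-server argument list (hypothetical helper).

    Covers only the -b and --parallel traps named in the release notes.
    """
    # Map each flag to the token that follows it ("" if the flag has no value).
    opts: Dict[str, str] = {}
    for i, tok in enumerate(argv):
        if tok.startswith("-"):
            nxt = argv[i + 1] if i + 1 < len(argv) else ""
            opts[tok] = "" if nxt.startswith("-") else nxt

    warnings: List[str] = []

    # Trap 1: a very large logical batch size (assumed threshold: 16384).
    batch = opts.get("-b", opts.get("--batch-size", ""))
    if batch.isdigit() and int(batch) >= 16384:
        warnings.append(f"-b {batch}: large batch scratch buffer")

    # Trap 2: more than one server slot, set explicitly or left at the default.
    parallel = opts.get("--parallel", opts.get("-np", ""))
    if not parallel.isdigit():
        warnings.append("--parallel not set: the server default applies")
    elif int(parallel) > 1:
        warnings.append(f"--parallel {parallel}: multiple slots configured")

    return warnings


if __name__ == "__main__":
    for line in find_config_traps(["llama-server", "-b", "16384", "--parallel", "4"]):
        print("warning:", line)
```

Whether a given setting actually costs throughput on a particular setup is something only a benchmark run can confirm; the check above just makes the settings visible.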

One-command reproduction: scripts/setup.ps1 (clones the fork at the pinned commit, applies both patches, builds for gfx1201 with HIP graphs).
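
The four steps the script is described as performing can be laid out as a dry-run plan. This is a sketch under stated assumptions, not the contents of scripts/setup.ps1: `build_plan` is a hypothetical helper, the repository URL assumes the fork is hosted on GitHub under the name given above, and the CMake options (`GGML_HIP`, `AMDGPU_TARGETS`, `GGML_HIP_GRAPHS`) are the upstream llama.cpp ones, assumed to carry over to the fork. The patch paths are passed in by the caller exactly as the notes abbreviate them.

```python
from typing import List


def build_plan(repo_url: str, commit: str, patches: List[str],
               gpu_target: str, src_dir: str = "llama-cpp-turboquant") -> List[List[str]]:
    """Return the commands for clone, checkout, patch, and build (nothing is run)."""
    plan: List[List[str]] = [
        ["git", "clone", repo_url, src_dir],
        ["git", "-C", src_dir, "checkout", commit],
    ]
    # With -C, git resolves patch paths relative to src_dir,
    # so absolute paths are the safer choice in a real run.
    for patch in patches:
        plan.append(["git", "-C", src_dir, "apply", patch])
    build_dir = f"{src_dir}/build"
    plan.append([
        "cmake", "-S", src_dir, "-B", build_dir,
        "-DGGML_HIP=ON",                    # assumed: upstream HIP backend switch
        f"-DAMDGPU_TARGETS={gpu_target}",   # assumed: upstream target selector
        "-DGGML_HIP_GRAPHS=ON",             # assumed: upstream HIP graphs switch
        "-DCMAKE_BUILD_TYPE=Release",
    ])
    plan.append(["cmake", "--build", build_dir, "--config", "Release"])
    return plan


if __name__ == "__main__":
    steps = build_plan(
        repo_url="https://github.com/TheTom/llama-cpp-turboquant",  # assumed URL
        commit="7d9715f",
        patches=["patches/0001", "patches/0002"],  # abbreviated as in the notes
        gpu_target="gfx1201",
    )
    for cmd in steps:
        print(" ".join(cmd))
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere; the script in the repository remains the supported path.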

v1.0.0 - TurboQuant KV + HIP graphs on RDNA4, fully measured
