Q1_0 repack kernels for Arm NEON+DP #34
Added I8MM specific repack:
Tradeoff:
Nice, this is good, thanks. Massive gains.
Pull request overview
Adds Q1_0 repack kernels and tensor traits for Arm NEON with the dotprod extension (4x4 layout) and MATMUL_INT8 extension (4x8 gemm), together with the supporting generic fallbacks, packing helpers, and dispatch wiring. According to the benchmarks in the PR description this yields large speedups on Snapdragon 7 gen 3 (pp128 +275%, tg32 +72%) over the current Q1_0 baseline.
Changes:
- Introduce `block_q1_0x4` (extending the `QK_0<K>` / `block<K,N>` machinery for K=1) and add `make_block_q1_0x4`, `repack_q1_0_to_q1_0_4_bl`, and the `repack`/`gemv`/`gemm` template specializations plus dispatch in `ggml_repack_get_optimal_repack_type`.
- Add generic reference implementations `ggml_gemv_q1_0_4x{4,8}_q8_0_generic` and `ggml_gemm_q1_0_4x{4,8}_q8_0_generic` in `repack.cpp`.
- Add NEON+DOTPROD `ggml_gemv_q1_0_4x4`, `ggml_gemv_q1_0_4x8`, `ggml_gemm_q1_0_4x4` kernels and a NEON+MATMUL_INT8 `ggml_gemm_q1_0_4x8` kernel in `arch/arm/repack.cpp` (with a sign-expansion LUT), plus arch-fallback aliases for all non-Arm targets.
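The sign-expansion LUT mentioned in the last item can be illustrated with a small scalar sketch. This is not the code from the PR: the helper names `make_sign_lut` and `dot_q1_i8` are hypothetical, and the bit convention (set bit means +1, clear bit means -1, least significant bit first) is an assumption made for illustration. Per-block scales are left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: for every possible byte of packed sign bits, precompute
// the eight int8 values it expands to.
// Assumed convention: bit set -> +1, bit clear -> -1, least significant bit first.
static std::array<std::array<int8_t, 8>, 256> make_sign_lut() {
    std::array<std::array<int8_t, 8>, 256> lut{};
    for (int b = 0; b < 256; ++b) {
        for (int i = 0; i < 8; ++i) {
            lut[b][i] = ((b >> i) & 1) ? 1 : -1;
        }
    }
    return lut;
}

// Scalar reference: dot product of n packed 1-bit weights with n int8 activations.
// n is assumed to be a multiple of 8; scale factors are ignored in this sketch.
static int32_t dot_q1_i8(const uint8_t* bits, const int8_t* act, size_t n) {
    static const auto lut = make_sign_lut();
    int32_t acc = 0;
    for (size_t i = 0; i < n / 8; ++i) {
        const auto& signs = lut[bits[i]];   // one table lookup per packed byte
        for (int j = 0; j < 8; ++j) {
            acc += signs[j] * act[8 * i + j];
        }
    }
    return acc;
}
```

A SIMD kernel would presumably load the expanded signs as a vector and feed them to a dot-product instruction; the table only replaces the per-bit unpacking.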
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ggml/src/ggml-cpu/repack.h | Extends QK_0 template for K=1, declares block_q1_0x4 and the new gemv/gemm/generic entry points. |
| ggml/src/ggml-cpu/repack.cpp | Generic Q1_0 gemv/gemm, repack helper, template specializations, and dispatch wiring for the new traits. |
| ggml/src/ggml-cpu/arch/arm/repack.cpp | Adds sign-LUT helper and NEON DOTPROD/MATMUL_INT8 kernels for Q1_0 4x4 and 4x8 layouts. |
| ggml/src/ggml-cpu/arch-fallback.h | Aliases new _generic symbols to the public names for all non-Arm targets (wasm block missing the 4x8 entries). |
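The aliasing described for arch-fallback.h can be sketched as follows. This is a minimal illustration under assumptions, not the header's actual contents: `DEMO_HAVE_ARCH_KERNEL`, `demo_dot`, and `demo_dot_generic` are hypothetical names. The idea is that on targets without a hand-written kernel, a macro renames the `_generic` definition so that it is compiled under the public name.

```cpp
#include <cstdint>

// Hypothetical switch standing in for "this target ships an optimized kernel".
// Left undefined here, so the fallback alias below is active.
// #define DEMO_HAVE_ARCH_KERNEL

#if !defined(DEMO_HAVE_ARCH_KERNEL)
// No arch-specific kernel: make the generic definition carry the public name.
#define demo_dot_generic demo_dot
#endif

// Portable reference implementation. With the alias active, the preprocessor
// rewrites this definition so the function is emitted as demo_dot.
int32_t demo_dot_generic(const int8_t* a, const int8_t* b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

#if defined(DEMO_HAVE_ARCH_KERNEL)
// Placeholder for an optimized kernel; here it simply defers to the reference.
int32_t demo_dot(const int8_t* a, const int8_t* b, int n) {
    return demo_dot_generic(a, b, n);
}
#endif
```

With this pattern, a target block that omits an alias and also lacks an arch-specific definition would leave the public symbol undefined at link time, which is the kind of gap the wasm note in the table points at.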
@khosravipasha I am still waiting for your tests of the repacked kernels. Thank you in advance.
Implemented 4x4 repack kernels and traits for NEON with the dotprod extension, as this is far more straightforward than on x86.
The I8MM extension is not used there, because the dotprod access pattern is more convenient.
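As a rough illustration of what a row-interleaved repack does, the sketch below takes several equally sized rows of packed bytes and interleaves them in fixed-size chunks, so that a kernel can read the matching chunk of every row from one contiguous region. The function `interleave_rows` and the chunk size in bytes are hypothetical; the actual `block_q1_0x4` layout in this PR may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: interleave equally sized byte rows in chunks of `chunk` bytes.
// Output order: chunk 0 of row 0, chunk 0 of row 1, ..., chunk 1 of row 0, ...
std::vector<uint8_t> interleave_rows(const std::vector<std::vector<uint8_t>>& rows,
                                     size_t chunk) {
    const size_t len = rows.empty() ? 0 : rows[0].size();
    std::vector<uint8_t> out;
    out.reserve(rows.size() * len);
    for (size_t off = 0; off < len; off += chunk) {
        for (const auto& row : rows) {
            for (size_t k = 0; k < chunk && off + k < len; ++k) {
                out.push_back(row[off + k]);
            }
        }
    }
    return out;
}
```

For example, two rows `{0, 1, 2, 3}` and `{10, 11, 12, 13}` with a chunk of 2 bytes become `{0, 1, 10, 11, 2, 3, 12, 13}`.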
Benchmarks were performed with:
Honor 400 (smartphone) with Snapdragon 7 Gen 3, 12 GB RAM
Android 16 via ADB
Used only the 4 performance cores, because running on all 8 cores gave unstable results
Command:
./llama-bench -m Bonsai-1.7B.gguf -p 128 -n 32 -r 3 -C 0xF0 -t 4 -fa 1 -mmp 0
Perplexity for 5x512 chunks: Mean KLD 0.00021, PPL 21.08, Same top p 99.22%
TODO: perplexity for chunks not divisible by 4 for gemv validation.
llama-completion and llama-cli are producing sane outputs.
Note: the code branch and results are separate from #33, and the baseline is from the current ggml branch; see the comments there that reference this PR.
@khosravipasha, could you please also compare the repack performance of your implementation and mine on your Mac?