Q1_0 repack kernels for Arm NEON+DP #34

Open
pl752 wants to merge 6 commits into PrismML-Eng:master from pl752:perf/q1_0_arm_repack_4x4


pl752 commented May 17, 2026

Implemented 4x4 repack kernels and tensor traits for NEON with the dotprod extension, since this is much more straightforward than on x86.
The I8MM extension is not used here because the 4x4 dotprod layout has a more convenient access pattern.
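To illustrate why the dotprod path maps naturally onto a 4x4 layout, here is a portable scalar sketch of what one SDOT instruction (`vdotq_s32`) computes: each of the four int32 lanes accumulates the dot product of its own group of four int8 values. The function name `sdot_4x4_ref` and the assumed placement of the four weight rows in consecutive 4-byte groups are illustrative only and are not taken from the PR's code.

```cpp
#include <array>
#include <cstdint>

// Scalar model of one NEON SDOT (vdotq_s32): lane r of the accumulator
// receives the dot product of bytes [4r, 4r+3] of `w` and of `x`.
// Assumption (hypothetical layout): lane r of `w` holds 4 sign-expanded
// (+1 / -1) weights of row r, and `x` holds the same 4 int8 activations
// repeated once per lane.
std::array<int32_t, 4> sdot_4x4_ref(std::array<int32_t, 4> acc,
                                    const int8_t w[16],
                                    const int8_t x[16]) {
    for (int lane = 0; lane < 4; ++lane) {
        for (int k = 0; k < 4; ++k) {
            acc[lane] += int32_t(w[4 * lane + k]) * int32_t(x[4 * lane + k]);
        }
    }
    return acc;
}
```

Under this assumption, a single instruction advances four output rows by four weights each, so a 4-row by 4-column interleave keeps every lane busy.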

Benchmarks were performed with:
Device: Honor 400 (smartphone) with Snapdragon 7 Gen 3, 12 GB RAM
OS: Android 16, run via ADB
Threads: 4 performance cores, because using all 8 cores gave unstable performance
Command: ./llama-bench -m Bonsai-1.7B.gguf -p 128 -n 32 -r 3 -C 0xF0 -t 4 -fa 1 -mmp 0
Perplexity over 5x512 chunks: mean KLD 0.00021, PPL 21.08, same top p 99.22%
TODO: measure perplexity with chunk sizes not divisible by 4 to validate the gemv path. llama-completion and llama-cli are producing sane outputs.
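As a small aid for reading the `-C 0xF0` argument in the command above, the sketch below decodes a CPU mask into core indices. It assumes the common convention that bit i of the mask selects logical CPU i; the helper name `cpus_in_mask` is hypothetical, and which logical CPUs are the performance cores is device-specific.

```cpp
#include <cstdint>
#include <vector>

// Returns the logical CPU indices selected by an affinity mask,
// assuming bit i of the mask corresponds to logical CPU i.
std::vector<int> cpus_in_mask(uint64_t mask) {
    std::vector<int> cpus;
    for (int cpu = 0; cpu < 64; ++cpu) {
        if ((mask >> cpu) & 1u) {
            cpus.push_back(cpu);
        }
    }
    return cpus;
}
```

With this convention, `0xF0` selects CPUs 4 through 7, which matches the four threads requested with `-t 4`.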

| flow | run | baseline | repack | delta |
| --- | --- | --- | --- | --- |
| NEON+DP | pp128 | 27.17 t/s | 102.12 t/s | +275.86% |
| NEON+DP | tg32 | 20.80 t/s | 35.96 t/s | +72.79% |

Note: this branch and its results are independent of #33, and the baseline is taken from the current ggml branch. See the comments in #33 that reference this PR.

@khosravipasha, could you please also compare the performance of your implementation and mine on your Mac with repack enabled?

pl752 changed the title from "Q1_0 4x4 repack kernels for Arm NEON+DP" to "Q1_0 repack kernels for Arm NEON+DP" on May 18, 2026
Author

pl752 commented May 18, 2026

Added an I8MM-specific 4x8 repack:

| flow | run | baseline | 4x8 | delta |
| --- | --- | --- | --- | --- |
| NEON+DP | pp128 | 27.17 t/s | 121.87 t/s | +348.55% |
| NEON+DP | tg32 | 20.80 t/s | 33.61 t/s | +61.59% |
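For reference, the I8MM instruction SMMLA (exposed as the `vmmlaq_s32` intrinsic) multiplies a 2x8 int8 tile by the transpose of another 2x8 int8 tile and accumulates a 2x2 int32 result. The scalar model below shows those semantics; the name `smmla_ref` is hypothetical, and how the PR actually arranges weights and activations in the two operands is not shown here.

```cpp
#include <array>
#include <cstdint>

// Scalar model of SMMLA (vmmlaq_s32):
//   `a` is a 2x8 int8 matrix stored row-major (16 bytes),
//   `b` is a 2x8 int8 matrix stored row-major (16 bytes),
//   acc (2x2 int32, row-major) += a * transpose(b).
std::array<int32_t, 4> smmla_ref(std::array<int32_t, 4> acc,
                                 const int8_t a[16],
                                 const int8_t b[16]) {
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            for (int k = 0; k < 8; ++k) {
                acc[2 * i + j] += int32_t(a[8 * i + k]) * int32_t(b[8 * j + k]);
            }
        }
    }
    return acc;
}
```

Each operand row is 8 bytes wide, which is consistent with the repacked block width growing from 4 to 8 for this path.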

Tradeoff:

| flow | run | 4x4 | 4x8 | delta |
| --- | --- | --- | --- | --- |
| NEON+DP | pp128 | 102.12 t/s | 121.87 t/s | +19.34% |
| NEON+DP | tg32 | 35.96 t/s | 33.61 t/s | -6.54% |

@khosravipasha
Collaborator

Nice, this is good, thanks. Massive gains.
I just noticed this one; I ran the other PR and will compare it with this one tomorrow.

Copilot AI left a comment

Pull request overview

Adds Q1_0 repack kernels and tensor traits for Arm NEON with the dotprod extension (4x4 layout) and MATMUL_INT8 extension (4x8 gemm), together with the supporting generic fallbacks, packing helpers, and dispatch wiring. According to the benchmarks in the PR description this yields large speedups on Snapdragon 7 gen 3 (pp128 +275%, tg32 +72%) over the current Q1_0 baseline.

Changes:

  • Introduce block_q1_0x4 (extending QK_0<K> / block<K,N> machinery for K=1) and add make_block_q1_0x4, repack_q1_0_to_q1_0_4_bl, and the repack/gemv/gemm template specializations plus dispatch in ggml_repack_get_optimal_repack_type.
  • Add generic reference implementations ggml_gemv_q1_0_4x{4,8}_q8_0_generic and ggml_gemm_q1_0_4x{4,8}_q8_0_generic in repack.cpp.
  • Add NEON+DOTPROD ggml_gemv_q1_0_4x4, ggml_gemv_q1_0_4x8, ggml_gemm_q1_0_4x4 kernels and a NEON+MATMUL_INT8 ggml_gemm_q1_0_4x8 kernel in arch/arm/repack.cpp (with a sign-expansion LUT), plus arch-fallback aliases for all non-arm targets.
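The sign-expansion LUT mentioned in the last item can be pictured with the sketch below, which expands one byte of packed 1-bit weights into eight int8 values so they can feed an int8 dot product. The names `make_sign_lut` and `dot8_signs`, the bit order (least significant bit first), and the mapping of a set bit to +1 are assumptions for illustration; the PR's actual table may differ.

```cpp
#include <array>
#include <cstdint>

using SignRow = std::array<int8_t, 8>;

// Hypothetical helper: 256-entry table mapping a byte of packed sign bits
// to eight int8 values. Assumption: bit i set -> +1, bit i clear -> -1,
// with bit 0 (least significant) first.
std::array<SignRow, 256> make_sign_lut() {
    std::array<SignRow, 256> lut{};
    for (int byte = 0; byte < 256; ++byte) {
        for (int bit = 0; bit < 8; ++bit) {
            lut[byte][bit] = ((byte >> bit) & 1) ? int8_t(1) : int8_t(-1);
        }
    }
    return lut;
}

// Dot product of 8 one-bit weights (packed in `bits`) with 8 int8 activations,
// using the table instead of per-bit branching.
int32_t dot8_signs(const std::array<SignRow, 256>& lut,
                   uint8_t bits, const int8_t x[8]) {
    int32_t sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += int32_t(lut[bits][i]) * int32_t(x[i]);
    }
    return sum;
}
```

A real kernel would typically load the expanded row directly into a vector register; the scalar loop here only demonstrates the mapping.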

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-cpu/repack.h | Extends QK_0 template for K=1, declares block_q1_0x4 and the new gemv/gemm/generic entry points. |
| ggml/src/ggml-cpu/repack.cpp | Generic Q1_0 gemv/gemm, repack helper, template specializations, and dispatch wiring for the new traits. |
| ggml/src/ggml-cpu/arch/arm/repack.cpp | Adds sign-LUT helper and NEON DOTPROD/MATMUL_INT8 kernels for Q1_0 4x4 and 4x8 layouts. |
| ggml/src/ggml-cpu/arch-fallback.h | Aliases new _generic symbols to the public names for all non-Arm targets (wasm block missing the 4x8 entries). |


Comment thread ggml/src/ggml-cpu/arch-fallback.h
Comment thread ggml/src/ggml-cpu/arch/arm/repack.cpp Outdated
Author

pl752 commented May 20, 2026

@khosravipasha, I am still waiting for your tests of the repacked kernels. Thank you in advance.
