Q1_0 repack kernels for Arm NEON+DP by pl752 · Pull Request #34 · PrismML-Eng/llama.cpp

pl752 · 2026-05-17T17:08:13Z

Implemented 4x4 repack kernels and tensor traits for NEON with the dotprod extension, since this is much more straightforward than on x86.
The I8MM extension is not used here because the dotprod access pattern is more convenient for this layout.
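
To make the 4x4 idea concrete, here is a minimal scalar sketch of an interleaved 1-bit gemv. It is not the PR's kernel: the bit order, the "set bit means +1" convention, and the names `pack_rows_4x4` and `gemv_4x4` are assumptions made for illustration, and per-block scales are left out. Each inner group of four products corresponds to what a single dotprod lane would accumulate.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kRows  = 4;  // rows interleaved together
constexpr int kGroup = 4;  // weights per row per step (one dot-product lane)

// Interleave 4 rows of sign weights (+1/-1) into one bit stream.
// Order: group 0 of rows 0..3, then group 1 of rows 0..3, and so on.
// Assumption for this sketch: a set bit means +1, a clear bit means -1.
std::vector<uint8_t> pack_rows_4x4(const std::array<std::vector<int8_t>, kRows>& w) {
    const size_t k = w[0].size();  // expected to be a multiple of kGroup
    std::vector<uint8_t> bits((k * kRows + 7) / 8, 0);
    size_t pos = 0;
    for (size_t g = 0; g < k / kGroup; ++g)
        for (int r = 0; r < kRows; ++r)
            for (int j = 0; j < kGroup; ++j, ++pos)
                if (w[r][g * kGroup + j] > 0) bits[pos / 8] |= uint8_t(1u << (pos % 8));
    return bits;
}

// Scalar reference gemv over the interleaved bits: one int32 accumulator per row.
std::array<int32_t, kRows> gemv_4x4(const std::vector<uint8_t>& bits,
                                    const std::vector<int8_t>& x) {
    std::array<int32_t, kRows> acc{};
    size_t pos = 0;
    for (size_t g = 0; g < x.size() / kGroup; ++g)
        for (int r = 0; r < kRows; ++r)
            for (int j = 0; j < kGroup; ++j, ++pos) {
                const int sign = ((bits[pos / 8] >> (pos % 8)) & 1) ? 1 : -1;
                acc[r] += sign * x[g * kGroup + j];
            }
    return acc;
}
```

Because the four rows share the same four activations at each step, a vector kernel can load the activations once and reuse them across all four accumulators.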

Benchmarks were performed with:
Honor 400 (smartphone) with Snapdragon 7 Gen 3, 12 GB RAM
Android 16 via ADB
4 performance cores only, because using all 8 cores gave unstable performance
Command: ./llama-bench -m Bonsai-1.7B.gguf -p 128 -n 32 -r 3 -C 0xF0 -t 4 -fa 1 -mmp 0
Perplexity for 5x512 chunks: Mean KLD 0.00021, PPL 21.08, Same top p 99.22%
TODO: perplexity for chunks not divisible by 4, for gemv validation. llama-completion and llama-cli produce sane outputs.
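
For readers unfamiliar with the validation metrics above, the following is a rough sketch of what "Mean KLD" and "Same top p" measure, assuming per-token probability vectors from a baseline run and a repacked run are already available. The function names are hypothetical and this is not the perplexity tool's actual code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Dist = std::vector<double>;  // one probability per vocabulary entry

// KL divergence D(p || q) in nats; p is the baseline, q the tested kernel.
double kl_divergence(const Dist& p, const Dist& q) {
    double kld = 0.0;
    for (size_t i = 0; i < p.size(); ++i)
        if (p[i] > 0.0) kld += p[i] * std::log(p[i] / q[i]);
    return kld;
}

// Mean KLD over all evaluated token positions.
double mean_kld(const std::vector<Dist>& base, const std::vector<Dist>& test) {
    double sum = 0.0;
    for (size_t t = 0; t < base.size(); ++t) sum += kl_divergence(base[t], test[t]);
    return base.empty() ? 0.0 : sum / double(base.size());
}

// Fraction of positions where both runs pick the same most likely token.
double same_top_fraction(const std::vector<Dist>& base, const std::vector<Dist>& test) {
    size_t same = 0;
    for (size_t t = 0; t < base.size(); ++t) {
        const auto a = std::max_element(base[t].begin(), base[t].end()) - base[t].begin();
        const auto b = std::max_element(test[t].begin(), test[t].end()) - test[t].begin();
        if (a == b) ++same;
    }
    return base.empty() ? 0.0 : double(same) / double(base.size());
}
```

A mean KLD near zero together with a high same-top fraction indicates that the repacked kernels reproduce the baseline distributions closely.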

flow	run	baseline	repack	delta
NEON+DP	pp128	27.17 t/s	102.12 t/s	+275.86%
NEON+DP	tg32	20.80 t/s	35.96 t/s	+72.79%

Note: this branch and its results are separate from #33, and the baseline is taken from the current ggml branch; see the comments in #33 that reference this PR.

@khosravipasha, could you please also compare the performance of your implementation and mine on your Mac for the repack kernels?

pl752 · 2026-05-18T08:29:57Z

Added an I8MM-specific repack:

flow	run	baseline	4x8	delta
NEON+DP	pp128	27.17 t/s	121.87 t/s	+348.55%
NEON+DP	tg32	20.80 t/s	33.61 t/s	+61.59%

Tradeoff:

flow	run	4x4	4x8	delta
NEON+DP	pp128	102.12 t/s	121.87 t/s	+19.34%
NEON+DP	tg32	35.96 t/s	33.61 t/s	-6.54%

khosravipasha · 2026-05-19T00:17:46Z

Nice, this is good, thanks. Massive gains.
Just noticed this one, I ran the other PR, will compare to this one tomorrow.

Copilot

Pull request overview

Adds Q1_0 repack kernels and tensor traits for Arm NEON with the dotprod extension (4x4 layout) and MATMUL_INT8 extension (4x8 gemm), together with the supporting generic fallbacks, packing helpers, and dispatch wiring. According to the benchmarks in the PR description this yields large speedups on Snapdragon 7 gen 3 (pp128 +275%, tg32 +72%) over the current Q1_0 baseline.

Changes:

Introduce block_q1_0x4 (extending QK_0<K> / block<K,N> machinery for K=1) and add make_block_q1_0x4, repack_q1_0_to_q1_0_4_bl, and the repack/gemv/gemm template specializations plus dispatch in ggml_repack_get_optimal_repack_type.
Add generic reference implementations ggml_gemv_q1_0_4x{4,8}_q8_0_generic and ggml_gemm_q1_0_4x{4,8}_q8_0_generic in repack.cpp.
Add NEON+DOTPROD ggml_gemv_q1_0_4x4, ggml_gemv_q1_0_4x8, ggml_gemm_q1_0_4x4 kernels and a NEON+MATMUL_INT8 ggml_gemm_q1_0_4x8 kernel in arch/arm/repack.cpp (with a sign-expansion LUT), plus arch-fallback aliases for all non-arm targets.
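
The review mentions a sign-expansion LUT without showing it. One possible shape for such a table is sketched below: a 256-entry lookup that turns a byte of sign bits into eight int8 lanes of +1 or -1. The names `make_sign_lut` and `expand_signs`, the bit-to-lane order, and the "set bit means +1" convention are assumptions for illustration and may differ from the PR.

```cpp
#include <array>
#include <cstdint>

// Build a table mapping one byte of sign bits to 8 int8 lanes packed in a uint64.
// Lane i holds 0x01 (+1) when bit i is set and 0xFF (-1) when it is clear.
std::array<uint64_t, 256> make_sign_lut() {
    std::array<uint64_t, 256> lut{};
    for (int b = 0; b < 256; ++b) {
        uint64_t lanes = 0;
        for (int bit = 0; bit < 8; ++bit) {
            const uint8_t lane = ((b >> bit) & 1) ? 0x01 : 0xFF;
            lanes |= uint64_t(lane) << (8 * bit);
        }
        lut[b] = lanes;
    }
    return lut;
}

// Expand one packed byte into 8 signed weights using the table.
// Lanes are extracted with shifts, so the result does not depend on endianness.
void expand_signs(const std::array<uint64_t, 256>& lut, uint8_t bits, int8_t out[8]) {
    const uint64_t lanes = lut[bits];
    for (int i = 0; i < 8; ++i) out[i] = int8_t(uint8_t(lanes >> (8 * i)));
}
```

With the weights expanded to int8 in this way, they can be fed directly to integer dot-product instructions alongside the int8 activations.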

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
ggml/src/ggml-cpu/repack.h	Extends `QK_0` template for K=1, declares `block_q1_0x4` and the new gemv/gemm/generic entry points.
ggml/src/ggml-cpu/repack.cpp	Generic Q1_0 gemv/gemm, repack helper, template specializations, and dispatch wiring for the new traits.
ggml/src/ggml-cpu/arch/arm/repack.cpp	Adds sign-LUT helper and NEON DOTPROD/MATMUL_INT8 kernels for Q1_0 4x4 and 4x8 layouts.
ggml/src/ggml-cpu/arch-fallback.h	Aliases new `_generic` symbols to the public names for all non-Arm targets (wasm block missing the 4x8 entries).
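
The aliasing described for arch-fallback.h can be illustrated with a small sketch. All names here (`DEMO_HAS_ARCH_KERNEL`, `demo_gemv`, `demo_gemv_generic`) are hypothetical stand-ins rather than symbols from the PR; the point is only to show how a preprocessor alias lets the generic definition serve as the public entry point on targets without an optimized kernel.

```cpp
#include <cstdint>

// Assumption for this sketch: the build would define DEMO_HAS_ARCH_KERNEL
// when an architecture-specific kernel is compiled in.
#if !defined(DEMO_HAS_ARCH_KERNEL)
// No optimized kernel: rename the generic definition to the public name.
#define demo_gemv_generic demo_gemv
#endif

// Public entry point used by callers.
int32_t demo_gemv(const int8_t* w, const int8_t* x, int n);

// Generic reference implementation. With the alias active, this definition
// becomes demo_gemv itself; otherwise it remains a separate fallback.
int32_t demo_gemv_generic(const int8_t* w, const int8_t* x, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) acc += int32_t(w[i]) * int32_t(x[i]);
    return acc;
}
```

In a scheme like this, a target whose alias list omits an entry is left with a declared but undefined public symbol, which would typically surface as a link error.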

pl752 · 2026-05-20T15:17:12Z

@khosravipasha, I am still waiting for your tests of the repacked kernel. Thank you in advance.

pl752 added 2 commits May 17, 2026 21:13

Implemented ARM NEON DP q1 4x4 repack

33055fc

Hoisted out scaling by b_d in gemm

b162fdc

github-actions Bot added the ggml label May 17, 2026

pl752 mentioned this pull request May 17, 2026

Optimized ARM NEON q1_0 dot #33

Open

Added 4x8 NEON I8MM repack kernels

5e677e6

pl752 changed the title ~~Q1_0 4x4 repack kernels for Arm NEON+DP~~ Q1_0 repack kernels for Arm NEON+DP May 18, 2026

Cleanup for q1 arm repack

00289b0

khosravipasha requested a review from Copilot May 19, 2026 00:17

Copilot started reviewing on behalf of khosravipasha May 19, 2026 00:17 View session

Copilot AI reviewed May 19, 2026

Comment thread ggml/src/ggml-cpu/arch-fallback.h

Comment thread ggml/src/ggml-cpu/arch/arm/repack.cpp Outdated

pl752 added 2 commits May 20, 2026 20:07

Added missing aliases for arch fallback

e4c4c4a

Corrected unused var statements

8ac75b2

pl752 wants to merge 6 commits into PrismML-Eng:master from pl752:perf/q1_0_arm_repack_4x4