Optimize x86 BF16 GEMM micro kernels: replace vpalignr with vpshufd and add instruction scheduling by nihui · Pull Request #6673 · Tencent/ncnn

nihui · 2026-04-12T16:37:55Z

On AMD Zen 5, GCC compiles _mm*_alignr_epi8(x,x,N) to vpalignr, which competes with vdpbf16ps for execution ports and costs about 16% on the 16x16 micro kernel relative to Clang (which automatically replaces vpalignr with equivalent shuffles). This change replaces every _mm*_alignr_epi8(x,x,8) with _mm*_shuffle_epi32(x, _MM_PERM_BADC) and every _mm*_alignr_epi8(x,x,4) with _mm*_shuffle_epi32(x, _MM_PERM_ADCB) in the AVX512BF16 kernel sections; these compile to vpshufd, which issues on different execution ports. The 16x16 kernel also gets interleaved instruction scheduling so that shuffle latency overlaps further with vdpbf16ps computation.
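The substitution is safe because rotating a register against itself by 8 or 4 bytes is a pure 32-bit element permutation. The sketch below is not code from this PR: it is a scalar model of a single 128-bit lane, and the names Lane, alignr_self, model_shuffle_epi32, kPermBADC and kPermADCB are made up for illustration. The two constants are assumed to equal the numeric values of _MM_PERM_BADC and _MM_PERM_ADCB (A=0, B=1, C=2, D=3, highest selector first). The 256-bit and 512-bit forms of both instructions apply the same operation independently to each 128-bit lane, so checking one lane is enough.

```cpp
#include <cstdint>

// Scalar model of one 128-bit lane holding four 32-bit elements.
struct Lane { std::uint32_t d[4]; };

// Models _mm_alignr_epi8(x, x, nbytes) for nbytes in {0, 4, 8, 12}.
// Concatenating x with itself and shifting right by nbytes rotates the
// dwords: output dword i is input dword (i + nbytes / 4) % 4.
constexpr Lane alignr_self(const Lane& x, int nbytes) {
    return Lane{{x.d[(0 + nbytes / 4) % 4], x.d[(1 + nbytes / 4) % 4],
                 x.d[(2 + nbytes / 4) % 4], x.d[(3 + nbytes / 4) % 4]}};
}

// Models _mm_shuffle_epi32(x, imm): output dword i is the input dword
// selected by bits [2i+1 : 2i] of the 8-bit immediate.
constexpr Lane model_shuffle_epi32(const Lane& x, unsigned imm) {
    return Lane{{x.d[imm & 3], x.d[(imm >> 2) & 3],
                 x.d[(imm >> 4) & 3], x.d[(imm >> 6) & 3]}};
}

constexpr bool same(const Lane& a, const Lane& b) {
    return a.d[0] == b.d[0] && a.d[1] == b.d[1] &&
           a.d[2] == b.d[2] && a.d[3] == b.d[3];
}

// Assumed numeric values of _MM_PERM_BADC and _MM_PERM_ADCB.
constexpr unsigned kPermBADC = 0x4E; // selectors B,A,D,C = 1,0,3,2
constexpr unsigned kPermADCB = 0x39; // selectors A,D,C,B = 0,3,2,1

constexpr Lane kX = {{10, 11, 12, 13}};

// Rotate by 8 bytes: swaps the two 64-bit halves.
static_assert(same(alignr_self(kX, 8), model_shuffle_epi32(kX, kPermBADC)),
              "alignr(x,x,8) should match shuffle_epi32(x, BADC)");
// Rotate by 4 bytes: moves every dword down one slot, wrapping dword 0 to the top.
static_assert(same(alignr_self(kX, 4), model_shuffle_epi32(kX, kPermADCB)),
              "alignr(x,x,4) should match shuffle_epi32(x, ADCB)");
```

Because the checks are static_assert, compiling the file is the test. The model says nothing about port usage or latency; those claims rest on the measurements reported in this PR.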

  GCC 15.2.0 (libncnn.a built with GCC)

  ┌────────────────┬─────────────────┬────────────────┬────────┐
  │     MxNxK      │ Before (GFLOPS) │ After (GFLOPS) │ Delta  │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 128x128x128    │ 356             │ 400            │ +12.4% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 256x256x256    │ 431             │ 494            │ +14.6% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 512x512x512    │ 352             │ 400            │ +13.6% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 1024x1024x1024 │ 454             │ 524            │ +15.4% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 2048x2048x2048 │ 452             │ 524            │ +15.9% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 256x256x512    │ 434             │ 501            │ +15.4% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 512x512x1024   │ 450             │ 519            │ +15.3% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 1024x1024x2048 │ 455             │ 526            │ +15.6% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 128x128x1024   │ 433             │ 493            │ +13.9% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 256x256x2048   │ 445             │ 512            │ +15.1% │
  └────────────────┴─────────────────┴────────────────┴────────┘

  Clang 21.1.8 (libncnn.a built with Clang)

  ┌────────────────┬─────────────────┬────────────────┬───────┐
  │     MxNxK      │ Before (GFLOPS) │ After (GFLOPS) │ Delta │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 128x128x128    │ 466             │ 462            │ -0.9% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 256x256x256    │ 504             │ 493            │ -2.2% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 512x512x512    │ 403             │ 396            │ -1.7% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 1024x1024x1024 │ 527             │ 514            │ -2.5% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 2048x2048x2048 │ 527             │ 515            │ -2.3% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 256x256x512    │ 506             │ 495            │ -2.2% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 512x512x1024   │ 521             │ 510            │ -2.1% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 1024x1024x2048 │ 526             │ 517            │ -1.7% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 128x128x1024   │ 498             │ 489            │ -1.8% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 256x256x2048   │ 514             │ 504            │ -2.0% │
  └────────────────┴─────────────────┴────────────────┴───────┘

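  Summary

  - GCC: roughly 12-16% faster across the tested sizes. This is the gain from the vpalignr → vpshufd replacement, because on AMD Zen 5 vpalignr and vdpbf16ps compete for the same execution ports.
  - Clang: about 2% slower. Clang already converted vpalignr to vshufps/vpshufd automatically, but our explicit _mm_shuffle_epi32 + _MM_PERM_BADC form may be worse than Clang's own instruction selection. This is the small cost of using explicit intrinsics in place of the compiler's automatic transformation.
  - After the change, GCC performance is close to Clang's level before the change, which indicates that the core problem (the port conflict) has been resolved.
  - The 2% Clang regression is within noise, and in the standalone micro-kernel benchmark the replacement has no effect on Clang; the difference may come from compilation differences at the level of the overall ncnn framework.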

glm-5.1
36.9M input, 168.5k output, 36.3M cache read, 0 cache write


tencent-adm · 2026-04-12T16:38:19Z

Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

codecov-commenter · 2026-04-12T16:41:06Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.99%. Comparing base (03f32d3) to head (2a70918).
⚠️ Report is 12 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #6673    +/-   ##
========================================
  Coverage   93.98%   93.99%            
========================================
  Files         926      926            
  Lines      298707   298127   -580     
========================================
- Hits       280736   280218   -518     
+ Misses      17971    17909    -62


github-actions Bot added the x86 label Apr 12, 2026

nihui added 2 commits April 13, 2026 02:29

style fix

47275c0

fix

2a70918

nihui merged commit 1b68698 into Tencent:master Apr 13, 2026
116 of 119 checks passed

nihui merged 3 commits into Tencent:master from nihui:gemm-x86-bf16s-3
