Optimize x86 BF16 GEMM micro kernels: replace vpalignr with vpshufd and add instruction scheduling #6673

Merged: nihui merged 3 commits into Tencent:master from nihui:gemm-x86-bf16s-3 on Apr 13, 2026

@nihui nihui commented Apr 12, 2026

On AMD Zen 5, the vpalignr instruction that GCC generates for _mm*_alignr_epi8(x,x,N) competes with vdpbf16ps for execution ports, causing a ~16% performance loss on the 16x16 micro kernel compared to Clang (which automatically replaces vpalignr with equivalent shuffles). This PR replaces every _mm*_alignr_epi8(x,x,8) with _mm*_shuffle_epi32(x, _MM_PERM_BADC) and every _mm*_alignr_epi8(x,x,4) with _mm*_shuffle_epi32(x, _MM_PERM_ADCB) in the AVX512BF16 kernel sections; these generate vpshufd instructions, which use different execution ports. It also applies interleaved instruction scheduling to the 16x16 kernel to further overlap shuffle latency with dpbf16ps computation.
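To see why the two forms are interchangeable, it may help to model one 128-bit lane as four 32-bit elements. The sketch below is a portable scalar model, not the kernel code from this PR: `Lane`, `alignr_self`, `shuffle_epi32`, `PERM_BADC`, `PERM_ADCB` and `replacements_match` are hypothetical names introduced here for illustration. The two constants mirror the immediate encodings of `_MM_PERM_BADC` and `_MM_PERM_ADCB`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One 128-bit lane viewed as four 32-bit elements; index 0 is the lowest dword.
using Lane = std::array<uint32_t, 4>;

// Scalar model of alignr(x, x, bytes) within one lane: concatenate x with
// itself and shift right by `bytes`. Only whole-dword shifts are modeled,
// which covers the two cases named in the PR (4 and 8 bytes).
Lane alignr_self(const Lane& x, int bytes)
{
    assert(bytes >= 0 && bytes <= 16 && bytes % 4 == 0);
    const int shift = bytes / 4;
    Lane r{};
    for (int i = 0; i < 4; i++)
        r[i] = x[(i + shift) % 4];
    return r;
}

// Scalar model of a 32-bit shuffle within one lane: result element i takes
// the source element selected by bits [2*i+1 : 2*i] of the 8-bit immediate.
Lane shuffle_epi32(const Lane& x, int imm)
{
    Lane r{};
    for (int i = 0; i < 4; i++)
        r[i] = x[(imm >> (2 * i)) & 3];
    return r;
}

// Immediates matching the _MM_PERM_* naming (A=0, B=1, C=2, D=3, with the
// first letter stored in the top two bits).
constexpr int PERM_BADC = 0x4E; // 01 00 11 10
constexpr int PERM_ADCB = 0x39; // 00 11 10 01

// Checks that both replacements from the PR description give the same
// element order as the self-alignr they replace.
bool replacements_match()
{
    const Lane x = {10, 11, 12, 13};
    return alignr_self(x, 8) == shuffle_epi32(x, PERM_BADC)
           && alignr_self(x, 4) == shuffle_epi32(x, PERM_ADCB);
}
```

Calling `replacements_match()` returns true under this model: rotating a lane by 8 bytes swaps its two 64-bit halves, and rotating by 4 bytes moves each dword down one position with wrap-around. Both are pure dword permutations, which is what makes the 32-bit shuffle a drop-in substitute.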

  GCC 15.2.0 (libncnn.a built with GCC)

  ┌────────────────┬─────────────────┬────────────────┬────────┐
  │     MxNxK      │ Before (GFLOPS) │ After (GFLOPS) │ Delta  │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 128x128x128    │ 356             │ 400            │ +12.4% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 256x256x256    │ 431             │ 494            │ +14.6% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 512x512x512    │ 352             │ 400            │ +13.6% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 1024x1024x1024 │ 454             │ 524            │ +15.4% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 2048x2048x2048 │ 452             │ 524            │ +15.9% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 256x256x512    │ 434             │ 501            │ +15.4% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 512x512x1024   │ 450             │ 519            │ +15.3% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 1024x1024x2048 │ 455             │ 526            │ +15.6% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 128x128x1024   │ 433             │ 493            │ +13.9% │
  ├────────────────┼─────────────────┼────────────────┼────────┤
  │ 256x256x2048   │ 445             │ 512            │ +15.1% │
  └────────────────┴─────────────────┴────────────────┴────────┘

  Clang 21.1.8 (libncnn.a built with Clang)

  ┌────────────────┬─────────────────┬────────────────┬───────┐
  │     MxNxK      │ Before (GFLOPS) │ After (GFLOPS) │ Delta │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 128x128x128    │ 466             │ 462            │ -0.9% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 256x256x256    │ 504             │ 493            │ -2.2% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 512x512x512    │ 403             │ 396            │ -1.7% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 1024x1024x1024 │ 527             │ 514            │ -2.5% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 2048x2048x2048 │ 527             │ 515            │ -2.3% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 256x256x512    │ 506             │ 495            │ -2.2% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 512x512x1024   │ 521             │ 510            │ -2.1% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 1024x1024x2048 │ 526             │ 517            │ -1.7% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 128x128x1024   │ 498             │ 489            │ -1.8% │
  ├────────────────┼─────────────────┼────────────────┼───────┤
  │ 256x256x2048   │ 514             │ 504            │ -2.0% │
  └────────────────┴─────────────────┴────────────────┴───────┘
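  For readers reproducing the tables, the Delta column is consistent with a plain relative change between the Before and After columns. The sketch below shows that arithmetic. The helper names `gemm_gflops` and `delta_percent` are hypothetical, and the 2*M*N*K operation count is the usual GEMM convention, assumed here because the PR does not say how GFLOPS was measured.

```cpp
#include <cassert>
#include <cmath>

// Assumption: one GEMM of size MxNxK counts as 2*M*N*K floating-point
// operations (one multiply and one add per inner-product term).
double gemm_gflops(long long m, long long n, long long k, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * m * n * k / seconds / 1e9;
}

// Relative change from `before` to `after`, in percent.
double delta_percent(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

  For example, `delta_percent(356, 400)` is about 12.4, matching the first GCC row. Because the tables list rounded GFLOPS values, recomputing a Delta from them can differ from the printed figure in the last digit.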

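  Summary

  - GCC: average improvement of about 14-16% (individual sizes range from +12.4% to +15.9%). This is the clear gain from the vpalignr → vpshufd replacement, because on AMD Zen 5 vpalignr and vdpbf16ps compete for the same execution ports.
  - Clang: a small drop of about 2%. Clang was already converting vpalignr to vshufps/vpshufd automatically, but our _mm_shuffle_epi32 + _MM_PERM_BADC form may be slightly worse than Clang's own instruction selection. This is a minor cost of using explicit intrinsics in place of the compiler's automatic transformation.
  - After the change, GCC performance is close to Clang's level before the change, which indicates that the core problem (the port conflict) has been resolved.
  - The 2% Clang regression is within the noise range, and in the actual micro-kernel benchmark the replacement had no effect on Clang. The difference may come from compilation differences at the level of the ncnn framework as a whole.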

glm-5.1
36.9M input, 168.5k output, 36.3M cache read, 0 cache write

@github-actions github-actions Bot added the x86 label Apr 12, 2026

codecov-commenter commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.99%. Comparing base (03f32d3) to head (2a70918).
⚠️ Report is 12 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6673    +/-   ##
========================================
  Coverage   93.98%   93.99%            
========================================
  Files         926      926            
  Lines      298707   298127   -580     
========================================
- Hits       280736   280218   -518     
+ Misses      17971    17909    -62     

@nihui nihui merged commit 1b68698 into Tencent:master Apr 13, 2026
116 of 119 checks passed