[feature] arm: speed up fp16 exp_ps floor step on aarch64 #6659
nihui merged 4 commits into Tencent:master from
Conversation
Summary: Use vcvtmq_s16_f16 for floor computation in exp_ps_f16 on aarch64 while keeping the legacy fallback path for non-aarch64 targets. This reduces the exp_ps hot-path cost on ARM without changing approximation behavior. Also reuses the floor result for 2^n construction to avoid redundant vcvt instruction.
Summary: __aarch64__ is always true for all armv8.2+ targets, so the macro condition is unnecessary.
Got it. I removed the aarch64 check; the optimization is now always used in the ARM fp16 implementations.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #6659 +/- ##
==========================================
- Coverage 93.80% 93.79% -0.01%
==========================================
Files 917 917
Lines 288669 288475 -194
==========================================
- Hits 270776 270581 -195
- Misses 17893 17894 +1
View full report in Codecov by Sentry.
I suspect that
Summary: Use vrndm/vrndmq plus vcvt for exp_ps_f16 floor conversion on AArch64 while preserving output accuracy on device tests.
Yes, I tried it with elements=1048576, rounds=10000:
--- pack4 (4 x fp16 elements) ---
legacy exp_ps_f16 : 1.962 ns/elem
vcvtm exp_ps_f16 : 1.482 ns/elem
rnd exp_ps_f16 : 1.467 ns/elem
legacy/vcvtm : 1.32x
rnd/vcvtm : 0.990x
--- pack8 (8 x fp16 elements) ---
legacy exp_ps_f16 : 1.203 ns/elem
vcvtm exp_ps_f16 : 0.861 ns/elem
rnd exp_ps_f16 : 0.858 ns/elem
legacy/vcvtm : 1.40x
rnd/vcvtm : 0.997x
--- Accuracy ---
legacy vs new avg diff : 0.000000000
legacy vs new max diff : 0.000000000
new vs rnd avg diff : 0.000000000
new vs rnd max diff : 0.000000000
Therefore, I submitted another change, using vrndm/vrndmq instead.
Summary: Keep floor value in fx via vrndm/vrndmq and perform s16 conversion only when building pow2n.
Thanks for your contribution!
This PR is an improvement upon #6657, supplementing and optimizing exp_ps for ARM fp16 scenarios and keeping it consistent with the ARM fp32 implementation. The performance improvement comes from a reduced SIMD instruction count and from replacing the original generic floor handling with native instructions:
Optimization: Use vcvtmq_s16_f16 for floor on aarch64
Performance and accuracy are as follows (tested on Android devices), and it is expected to improve speed by 20-30%.
Testing method: run the function over batches of elements using SIMD and measure the per-element throughput.