[feature] arm: speed up exp_ps floor step on aarch64 #6657
Merged
nihui merged 1 commit into Tencent:master on Apr 7, 2026
Summary: Use vrndmq_f32 for floor computation in exp_ps on aarch64 while keeping the legacy fallback path for non-aarch64 targets. This reduces the exp_ps hot-path cost on ARM without changing approximation behavior.
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage diff:

| | master | #6657 | +/- |
|---|---|---|---|
| Coverage | 93.79% | 93.79% | |
| Files | 917 | 917 | |
| Lines | 288530 | 288530 | |
| Hits | 270634 | 270637 | +3 |
| Misses | 17896 | 17893 | -3 |
Member
Thanks for your contribution!
Member
Actually, there is a floor operation in the ARM fp16 exp variant too :D
Contributor, Author
Yes, while reviewing the code later I found that there is also a variant for the fp16 case. 🤔 I'll include the floor optimization for it in the next PR; first I want to run a local benchmark to see how much the fp16 path improves. 👍
Summary
Following the discussion in #6655, this PR continues the optimization of exp_ps for ARM.
Use vrndmq_f32 for floor computation in exp_ps on aarch64 while keeping the legacy fallback path for non-aarch64 targets. This reduces the exp_ps hot-path cost on ARM without changing approximation behavior.
With this change applied, a simple performance comparison was run on an M3 Max device, and the data is as follows:
For aarch64 devices, an improvement of approximately 20% is expected.
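
For readers unfamiliar with the two code paths, here is a minimal sketch of the idea, not the actual diff. It assumes the legacy path follows the usual truncate-and-correct floor pattern (convert to int, convert back, subtract one where the result overshoots); the fallback is written in scalar C here so the sketch builds on any target. The helper names `floor_fallback`, `floor4`, and `floor4_mismatches` are hypothetical and do not come from the ncnn sources. Only `vrndmq_f32`, `vld1q_f32`, and `vst1q_f32` are real NEON intrinsics.

```c
#include <stddef.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical scalar model of a truncate-and-correct floor.
 * Assumes x fits in a 32-bit int, which is reasonable for an exp()
 * argument that has already been clamped to a finite range. */
static float floor_fallback(float x)
{
    float t = (float)(int)x;        /* the conversion truncates toward zero */
    return (t > x) ? t - 1.0f : t;  /* negative non-integers: step down by one */
}

/* Floor four floats: one rounding instruction on aarch64,
 * the multi-step fallback everywhere else. */
static void floor4(const float in[4], float out[4])
{
#if defined(__aarch64__)
    /* vrndmq_f32 rounds each lane toward minus infinity. */
    vst1q_f32(out, vrndmq_f32(vld1q_f32(in)));
#else
    for (size_t i = 0; i < 4; i++)
        out[i] = floor_fallback(in[i]);
#endif
}

/* Count lanes where floor4 disagrees with hand-computed floor values. */
static int floor4_mismatches(void)
{
    const float in[8]       = { 2.7f, -1.5f, -3.0f, -0.25f, 0.0f, 5.0f, 0.5f, -7.75f };
    const float expected[8] = { 2.0f, -2.0f, -3.0f, -1.0f,  0.0f, 5.0f, 0.0f, -8.0f };
    int mismatches = 0;

    for (size_t base = 0; base < 8; base += 4)
    {
        float out[4];
        floor4(in + base, out);
        for (size_t i = 0; i < 4; i++)
            if (out[i] != expected[base + i])
                mismatches++;
    }
    return mismatches;
}
```

Both branches are meant to produce the same values for in-range inputs, which is why the change is described as not altering approximation behavior. Calling `floor4_mismatches()` should return 0 on either kind of target.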