[feature] add fast_exp approximation support in Softmax operator #6655
crafcat7 wants to merge 2 commits into Tencent:master
Conversation
Codecov Report ❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           master    #6655    +/-   ##
==========================================
- Coverage   93.79%   93.42%    -0.37%
==========================================
  Files         917      915        -2
  Lines      288530   287391     -1139
==========================================
- Hits       270634   268507     -2127
- Misses      17896    18884      +988
```
All optimizations should reside in the subdirectories of … Based on your implementation, it seems x86 sse2/avx/avx512 already has a relevant exp_ps function; I doubt your implementation can be faster.
Summary: Use vrndmq_f32 for floor computation in exp_ps on aarch64 while keeping the legacy fallback path for non-aarch64 targets. This reduces the exp_ps hot-path cost on ARM without changing approximation behavior.
Yes... this implementation is for the common reference path. However, I also found some room for optimization on ARM: the floor phase in exp_ps on the ARM architecture is implemented in a generic way, while on aarch64 `vrndmq_f32` can do it directly. Benchmark:

```
clang++ -std=c++11 -O3 -DNDEBUG -I. -I"src/layer/arm" bench_compare_exp_ps_arm.cpp -o bench_compare_exp_ps_arm && ./bench_compare_exp_ps_arm
elements=1048576 rounds=10000
legacy exp_ps : 0.507 ns/elem
new exp_ps    : 0.413 ns/elem
speedup       : 1.23x
```
I have attached accuracy verification logs for this result:

```
legacy avg/max rel err : 0.000001% / 0.000012%
new avg/max rel err    : 0.000001% / 0.000012%
max |legacy - new|     : 0
```
Create a new pull request for your ARM optimization.
Done. The ARM optimizations have been dropped from this layer/softmax PR, as this implementation is for reference only. The newly added ARM exp_ps optimization can be discussed at #6657.
Summary
This PR originates from the discussion in issue #6633; please refer to that discussion for performance data.

This PR introduces `fast_math.h`, which contains a 4th-order minimax polynomial approximation of `expf()` (maximum relative error < 0.02%), and connects it to the Softmax scalar path via `opt.use_approximate_exp`. The changes are as follows:
- Add `src/fast_math.h`: scalar `fast_exp()` using range reduction exp(x) = 2^n * exp(r) and IEEE 754 bit manipulation for the 2^n reconstruction
- Rename `use_reserved_9` to `use_approximate_exp` in option.h/option.cpp so the flag is now a first-class Option field (default: false)
- Expose `ncnn_option_get/set_use_approximate_exp` in the C API (c_api.h / c_api.cpp)
- Wire `opt.use_approximate_exp` into net.cpp featmask gating (bit 8)
- Update softmax.cpp: propagate `use_approximate_exp` through both static softmax() helpers (contiguous and strided) and all forward_inplace call sites across 1-D to 4-D tensor layouts
- Add `test_softmax_approx_exp*` test cases in tests/test_softmax.cpp covering all dims/axis combinations with a relaxed epsilon (0.01) that accounts for the bounded approximation error of `fast_exp`
Test
Since the previous `softmax` implementation had no fast approximate `exp`, I added corresponding test cases in `test_softmax` to cover it.

During testing, `use_approximate_exp` was enabled, and a relatively lenient epsilon (0.01) was used to account for the approximation error of `fast_exp` (the error of each exp evaluation is below 0.02%; the accumulated absolute output error of the softmax reduction stays within 1e-2).

The final local test run passed:
```
Test project /Users/crafcat7/code/ai/ncnn/build
    Start 149: test_softmax
1/2 Test #149: test_softmax ..................... Passed    0.77 sec
    Start 150: test_softmax_oom
2/2 Test #150: test_softmax_oom ................. Passed    0.02 sec

100% tests passed, 0 tests failed out of 2
Total Test time (real) = 0.79 sec
```
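To illustrate how such a flag threads through a scalar softmax reference path, here is a minimal self-contained sketch. The names (`softmax_inplace`, `approx_exp`) and the crude approximation are hypothetical stand-ins, not ncnn's actual code: ncnn's flag is `opt.use_approximate_exp` and its approximation is the minimax `fast_exp` described in the summary.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical stand-in for an approximate exp; NOT ncnn's fast_exp.
// (1 + x/256)^256 -> exp(x) as the exponent limit, via 8 squarings.
static inline float approx_exp(float x)
{
    float y = 1.f + x / 256.f;
    for (int i = 0; i < 8; i++)
        y *= y;
    return y;
}

// Scalar softmax with an exp-approximation toggle, mirroring how a flag
// like opt.use_approximate_exp selects the exp kernel in a reference path.
static void softmax_inplace(std::vector<float>& v, bool use_approximate_exp)
{
    // subtract the max for numerical stability, as usual
    float maxv = *std::max_element(v.begin(), v.end());

    float sum = 0.f;
    for (float& x : v)
    {
        x = use_approximate_exp ? approx_exp(x - maxv) : std::exp(x - maxv);
        sum += x;
    }
    for (float& x : v)
        x /= sum; // normalize; outputs sum to 1 with either exp kernel
}
```

Because softmax divides each approximate exp by the sum of the same approximate values, part of the per-element error cancels in the ratio, which is one reason a relaxed output epsilon like 0.01 suffices in the tests.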