
[feature] add fast_exp approximation support in Softmax operator #6655

Closed
crafcat7 wants to merge 2 commits into Tencent:master from crafcat7:softmax-fast-expf

Conversation

Contributor

@crafcat7 crafcat7 commented Apr 7, 2026

Summary

This PR originates from the discussion in issue #6633; see that thread for the performance data.

This PR introduces fast_math.h, which contains a 4th-order minimax polynomial approximation of expf() (maximum relative error < 0.02%), and connects it to the Softmax scalar path via opt.use_approximate_exp.

The changes are as follows:

  • Add src/fast_math.h: scalar fast_exp() using range reduction exp(x) = 2^n * exp(r) and IEEE 754 bit manipulation for the 2^n reconstruction

  • Rename use_reserved_9 to use_approximate_exp in option.h/option.cpp so the flag is now a first-class Option field (default: false)

  • Expose ncnn_option_get/set_use_approximate_exp in the C API (c_api.h / c_api.cpp)

  • Wire opt.use_approximate_exp into net.cpp featmask gating (bit 8)

  • Update softmax.cpp: propagate use_approximate_exp through both static softmax() helpers (contiguous and strided) and all forward_inplace call sites across 1-D to 4-D tensor layouts

  • Add test_softmax_approx_exp* test cases in tests/test_softmax.cpp covering all dims/axis combinations with a relaxed epsilon (0.01) that accounts for the bounded approximation error of fast_exp

Test

Since the previous softmax implementation had no fast approximate exp, I added corresponding test cases in test_softmax to cover it.

During testing, use_approximate_exp was enabled and a relatively lenient epsilon (0.01) was used to absorb the approximation error of fast_exp: each individual evaluation stays within 0.02% relative error, and the accumulated absolute output error after the softmax reduction remains within 1e-2.

All local tests passed:

Test project /Users/crafcat7/code/ai/ncnn/build
    Start 149: test_softmax
1/2 Test #149: test_softmax .....................   Passed    0.77 sec
    Start 150: test_softmax_oom
2/2 Test #150: test_softmax_oom .................   Passed    0.02 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =   0.79 sec


codecov-commenter commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.42%. Comparing base (e366f48) to head (6df30a1).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/fast_math.h | 0.00% | 12 Missing ⚠️ |
| src/c_api.cpp | 0.00% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6655      +/-   ##
==========================================
- Coverage   93.79%   93.42%   -0.37%     
==========================================
  Files         917      915       -2     
  Lines      288530   287391    -1139     
==========================================
- Hits       270634   268507    -2127     
- Misses      17896    18884     +988     

Member

nihui commented Apr 7, 2026

All optimizations should reside in the subdirectories of layer/{x86,arm,...}/, with layer/softmax.cpp serving only as a reference implementation.

Based on your implementation, it seems x86 sse2/avx/avx512 already has a relevant exp_ps function. I doubt your implementation can be faster.

Summary:
Use vrndmq_f32 for floor computation in exp_ps on aarch64 while keeping the legacy fallback path for non-aarch64 targets. This reduces the exp_ps hot-path cost on ARM without changing approximation behavior.
@github-actions github-actions bot added the arm label Apr 7, 2026
Contributor Author

crafcat7 commented Apr 7, 2026

> Based on your implementation, it seems x86 sse2/avx/avx512 already has a relevant exp_ps function. I doubt your implementation can be faster.

Yes, this implementation targets the common softmax path that calls libc's expf, i.e. builds without SIMD support. Where SIMD is available, the existing exp_ps would indeed be faster.

However, I also found some room for optimization on ARM:

The floor phase in the ARM exp_ps is implemented in a generic way. On aarch64, the vrndmq_f32 instruction can replace it directly. On my M3 Max this reduced exp_ps time by about 20%.

clang++ -std=c++11 -O3 -DNDEBUG -I. -I"src/layer/arm" "bench_compare_exp_ps_arm.cpp" -o "bench_compare_exp_ps_arm" && ./bench_compare_exp_ps_arm

elements=1048576 rounds=10000
legacy exp_ps : 0.507 ns/elem
new exp_ps    : 0.413 ns/elem
speedup       : 1.23x

Contributor Author

crafcat7 commented Apr 7, 2026

> The floor phase implemented in exp_ps on the ARM architecture is done in a generic way. On aarch64, the vrndmq_f32 instruction can be used directly to replace the original implementation. On my M3Max device, this resulted in an improvement of about 20% in time reduction.

I have attached accuracy verification logs for this result.

legacy avg/max rel err : 0.000001% / 0.000012%
new    avg/max rel err : 0.000001% / 0.000012%
max |legacy - new|     : 0

Member

nihui commented Apr 7, 2026

> Yes... this implementation is for the common softmax scenario using libc's expf. [...]
>
> The floor phase implemented in exp_ps on the ARM architecture is done in a generic way. On aarch64, the vrndmq_f32 instruction can be used directly to replace the original implementation. On my M3Max device, this resulted in an improvement of about 20% in time reduction.

Please create a new pull request for your arm optimization.

Contributor Author

crafcat7 commented Apr 7, 2026

> create new pull request for your arm optimization

Done. The optimizations have been dropped from this layer/softmax PR, since that implementation serves as a reference only.

The newly added arm exp_ps optimization can be discussed at #6657.

@crafcat7 crafcat7 closed this Apr 7, 2026
