
[feature] add fast_exp approximation support in Softmax operator #6655

Closed
crafcat7 wants to merge 2 commits into Tencent:master from crafcat7:softmax-fast-expf

Conversation

Contributor

@crafcat7 crafcat7 commented Apr 7, 2026

Summary

This PR originates from the discussion in issue #6633; see that thread for the performance data.

This PR introduces fast_math.h, which contains a 4th-order minimax polynomial approximation of expf() (maximum relative error < 0.02%), and connects it to the Softmax scalar path via opt.use_approximate_exp.

The changes are as follows:

  • Add src/fast_math.h: scalar fast_exp() using range reduction exp(x) = 2^n * exp(r) and IEEE 754 bit manipulation for the 2^n reconstruction

  • Rename use_reserved_9 to use_approximate_exp in option.h/option.cpp so the flag is now a first-class Option field (default: false)

  • Expose ncnn_option_get/set_use_approximate_exp in the C API (c_api.h / c_api.cpp)

  • Wire opt.use_approximate_exp into net.cpp featmask gating (bit 8)

  • Update softmax.cpp: propagate use_approximate_exp through both static softmax() helpers (contiguous and strided) and all forward_inplace call sites across 1-D to 4-D tensor layouts

  • Add test_softmax_approx_exp* test cases in tests/test_softmax.cpp covering all dims/axis combinations with a relaxed epsilon (0.01) that accounts for the bounded approximation error of fast_exp

Test

Since the previous softmax implementation had no fast approximate exp, I added corresponding test cases in test_softmax to cover it.

During testing, use_approximate_exp was enabled and a relatively lenient epsilon (0.01) was used to absorb the approximation error of fast_exp: each individual evaluation stays within 0.02% relative error, and the accumulated absolute output error after the softmax reduction remains within 1e-2.

All local tests passed:

Test project /Users/crafcat7/code/ai/ncnn/build
    Start 149: test_softmax
1/2 Test #149: test_softmax .....................   Passed    0.77 sec
    Start 150: test_softmax_oom
2/2 Test #150: test_softmax_oom .................   Passed    0.02 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =   0.79 sec


codecov-commenter commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.42%. Comparing base (e366f48) to head (6df30a1).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/fast_math.h | 0.00% | 12 Missing ⚠️ |
| src/c_api.cpp | 0.00% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6655      +/-   ##
==========================================
- Coverage   93.79%   93.42%   -0.37%     
==========================================
  Files         917      915       -2     
  Lines      288530   287391    -1139     
==========================================
- Hits       270634   268507    -2127     
- Misses      17896    18884     +988     

Member

nihui commented Apr 7, 2026

All optimizations should reside in the subdirectories of layer/{x86,arm,...}/, with layer/softmax.cpp serving only as a reference implementation.

Based on your implementation, it seems x86 sse2/avx/avx512 already has a relevant exp_ps function. I doubt your implementation can be faster.

Summary:
Use vrndmq_f32 for floor computation in exp_ps on aarch64 while keeping the legacy fallback path for non-aarch64 targets. This reduces the exp_ps hot-path cost on ARM without changing approximation behavior.
@github-actions github-actions bot added the arm label Apr 7, 2026
Contributor Author

crafcat7 commented Apr 7, 2026

> Based on your implementation, it seems x86 sse2/avx/avx512 already has a relevant exp_ps function. I doubt your implementation can be faster.

Yes, this implementation targets the common softmax path that calls libc's expf, i.e. builds without SIMD support. Where SIMD is available, the existing exp_ps would indeed be faster.

However, I also found some room for optimization on ARM:

The floor phase in the ARM exp_ps is implemented in a generic way. On aarch64, the vrndmq_f32 instruction can replace it directly. On my M3 Max this reduced exp_ps time by about 20%.

clang++ -std=c++11 -O3 -DNDEBUG -I. -I"src/layer/arm" "bench_compare_exp_ps_arm.cpp" -o "bench_compare_exp_ps_arm" && ./bench_compare_exp_ps_arm

elements=1048576 rounds=10000
legacy exp_ps : 0.507 ns/elem
new exp_ps    : 0.413 ns/elem
speedup       : 1.23x

Contributor Author

crafcat7 commented Apr 7, 2026

> The floor phase implemented in exp_ps on the ARM architecture is done in a generic way. On aarch64, the vrndmq_f32 instruction can be used directly to replace the original implementation. On my M3Max device, this resulted in an improvement of about 20% in time reduction.

I have attached accuracy verification logs for this result.

legacy avg/max rel err : 0.000001% / 0.000012%
new    avg/max rel err : 0.000001% / 0.000012%
max |legacy - new|     : 0

Member

nihui commented Apr 7, 2026

> Yes... this implementation is for the common softmax scenario using libc's expf. [...]
>
> The floor phase implemented in exp_ps on the ARM architecture is done in a generic way. On aarch64, the vrndmq_f32 instruction can be used directly to replace the original implementation. On my M3Max device, this resulted in an improvement of about 20% in time reduction.

Please create a new pull request for your arm optimization.

Contributor Author

crafcat7 commented Apr 7, 2026

> create new pull request for your arm optimization

Done. The optimizations have been dropped from this layer/softmax PR, since that implementation serves as a reference only.

The newly added arm exp_ps optimization can be discussed at #6657.

@crafcat7 crafcat7 closed this Apr 7, 2026
