
[feature] arm: speed up exp_ps floor step on aarch64#6657

Merged: nihui merged 1 commit into Tencent:master from crafcat7:opt-arm_exp_ps on Apr 7, 2026

Conversation


crafcat7 (Contributor) commented Apr 7, 2026

Summary

Following the discussion in #6655, this continues the ARM exp_ps optimization work.

Use vrndmq_f32 for floor computation in exp_ps on aarch64 while keeping the legacy fallback path for non-aarch64 targets. This reduces the exp_ps hot-path cost on ARM without changing approximation behavior.
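A rough sketch of the change (the intrinsic names match arm_neon.h; the scalar bodies here are stand-ins for the four vector lanes so the sketch compiles on any host):

```c
#include <math.h>

/* Sketch of the floor step in exp_ps. On AArch64 a single FRINTM
 * instruction (vrndmq_f32) performs the floor; elsewhere the legacy
 * truncate-then-correct emulation is kept. */
static float exp_ps_floor_step(float fx)
{
#if defined(__aarch64__)
    return floorf(fx);                    /* vector code: fx = vrndmq_f32(fx); */
#else
    float tmp = (float)(int)fx;           /* truncate toward zero */
    return (tmp > fx) ? tmp - 1.0f : tmp; /* fix up negative non-integers */
#endif
}
```

Both branches compute the same result; the AArch64 branch just does it in one instruction per vector.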

With this change, a simple performance comparison was run on an M3 Max device; the results are as follows:

clang++ -std=c++11 -O3 -DNDEBUG -I. -I"src/layer/arm" "bench_compare_exp_ps_arm.cpp" -o "bench_compare_exp_ps_arm" && ./bench_compare_exp_ps_arm
elements=1048576 rounds=10000
legacy exp_ps : 0.515 ns/elem
new exp_ps    : 0.423 ns/elem
speedup       : 1.22x
legacy avg/max rel err : 0.000001% / 0.000012%
new    avg/max rel err : 0.000001% / 0.000012%
max |legacy - new|     : 0

For aarch64 devices, an improvement of approximately 20% is expected.


codecov-commenter commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.79%. Comparing base (e366f48) to head (cec0bb2).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6657   +/-   ##
=======================================
  Coverage   93.79%   93.79%           
=======================================
  Files         917      917           
  Lines      288530   288530           
=======================================
+ Hits       270634   270637    +3     
+ Misses      17896    17893    -3     



nihui (Member) commented Apr 7, 2026

  ---
  Summary

  The PR adds a preprocessor guard in src/layer/arm/neon_mathfun.h to use the AArch64 native vrndmq_f32 (vector round-toward-minus-infinity, i.e. floor) instruction instead of the manual
  truncate-then-correct floor emulation on 32-bit ARM. This is a clean ~20% speedup with identical accuracy per the author's benchmarks.

  Verdict: No logic bugs found. Minor observations below.

  ---
  1. Correctness of vrndmq_f32 as a floor replacement — Correct

  The legacy code does:
  tmp = vcvtq_f32_s32(vcvtq_s32_f32(fx));   // truncate toward zero
  mask = vcgtq_f32(tmp, fx);                  // if truncation went up (negative fx)
  mask = vandq_u32(mask, vreinterpretq_u32_f32(one)); // mask = 1.0 where correction needed
  fx = vsubq_f32(tmp, vreinterpretq_f32_u32(mask));   // subtract 1 to get floor

  vrndmq_f32 (ARMv8 FRINTM) performs exactly IEEE 754 round-toward-negative-infinity, which is floorf. These are semantically identical. No logic bug here.
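A scalar transcription of the legacy per-lane logic (helper names here are illustrative, not from the source) makes the equivalence easy to check:

```c
/* Scalar transcription of the legacy lane logic: truncate toward zero
 * (vcvtq_s32_f32 + vcvtq_f32_s32), then subtract 1 where the truncation
 * landed above the input (vcgtq_f32 + vandq_u32 + vsubq_f32), which
 * happens exactly for negative non-integers. */
static float legacy_floor_lane(float fx)
{
    float tmp = (float)(int)fx;           /* truncate toward zero */
    return (tmp > fx) ? tmp - 1.0f : tmp; /* correct negative non-integers */
}

/* The corrected truncation agrees with floor in every case. */
static int legacy_floor_is_floor(void)
{
    const float in[]  = { -2.5f, -2.0f, -0.5f, 0.0f, 0.5f, 2.0f, 2.5f };
    const float out[] = { -3.0f, -2.0f, -1.0f, 0.0f, 0.0f, 2.0f, 2.0f };
    for (unsigned i = 0; i < sizeof(in) / sizeof(in[0]); i++)
        if (legacy_floor_lane(in[i]) != out[i])
            return 0;
    return 1;
}
```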

  2. Edge case: NaN and Inf handling — Not a concern

  Before the floor block, x is clamped to [c_exp_lo, c_exp_hi] (~[-88.38, 88.38]), and fx is computed as a finite linear transform of x. So fx can never be NaN or Inf when reaching the floor
  operation. Both paths handle this identically anyway. No issue.

  3. Edge case: Large float → int overflow in legacy path — Not affected by this PR, but worth noting

  The legacy fallback vcvtq_s32_f32(fx) would produce undefined/saturated results if fx exceeded the int32 range (~±2.1 billion). This can't happen here due to the clamping (max |fx| ≈ 128), but it's
  a latent fragility in the legacy path. The new vrndmq_f32 path doesn't have this limitation. Slight improvement in robustness.
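The bound can be checked numerically (constants follow the usual Cephes-style exp_ps values, where c_exp_hi ≈ 88.376 and fx = x * log2(e) + 0.5):

```c
/* With x clamped to [c_exp_lo, c_exp_hi], the magnitude of
 * fx = x * log2(e) + 0.5 peaks near 128, which is nowhere near the
 * int32 limit, so the legacy float->int conversion cannot overflow. */
static int fx_stays_in_int32_range(void)
{
    const float c_exp_hi = 88.3762626647949f;
    const float log2e    = 1.44269504088896341f;
    float fx_max = c_exp_hi * log2e + 0.5f; /* ~128 */
    return fx_max < 129.0f && fx_max < 2147483648.0f;
}
```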

  4. Variable tmp declaration scope — No issue after inspection

  tmp is declared at the top of the function:
  float32x4_t tmp, fx;

  With the #if defined(__aarch64__) path, tmp is no longer used until later (tmp = vmulq_f32(fx, ...)), so the variable is still needed. However, mask and one have different usage patterns between
  paths:

  - one is declared and initialized before the #if block but is only used inside the #else floor-correction block and later when computing y = vaddq_f32(y, one). So one is still needed on both paths.
  No issue.
  - mask is declared inside the #else block, so it's properly scoped. No issue.

  No real problem here — the declarations are fine.

  5. Missing #if guard symmetry for related optimizations — Observation

  The "build 2^n" section below still uses:
  mm = vcvtq_s32_f32(fx);
  On AArch64, the compiler will likely emit FCVTZS, which is already efficient. But if the author wanted to be thorough, one could note that vcvtq_s32_f32 truncates toward zero while fx is now
  guaranteed to be an integer (from vrndmq_f32), so the truncation direction doesn't matter. No bug, just an observation.
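For context, the "build 2^n" step that consumes this integer is the standard IEEE 754 exponent-field trick; a scalar sketch (common to exp_ps-style implementations, with the vector equivalents noted in comments):

```c
#include <string.h>

/* Scalar sketch of the "build 2^n" step: add the IEEE 754 single-
 * precision bias (127) and shift into the exponent field, yielding the
 * float 2^n exactly for n in the normal range. The vector code does the
 * same with vaddq_s32 / vshlq_n_s32 / vreinterpretq_f32_s32. */
static float pow2i(int n)
{
    unsigned bits = (unsigned)(n + 127) << 23; /* exponent field of 2^n */
    float out;
    memcpy(&out, &bits, sizeof(out)); /* bit-level reinterpretation */
    return out;
}
```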

  6. Preprocessor guard choice — Acceptable but could be more precise

  __aarch64__ is the standard compiler-defined macro for AArch64 targets and is correct here. An alternative would be checking for __ARM_ARCH >= 8 combined with NEON availability, but __aarch64__ is
  the conventional choice in ncnn's codebase. No issue.

  7. No #include changes needed — Correct

  vrndmq_f32 is part of <arm_neon.h> which is already included. No issue.

  ---
  Summary Table

  ┌───┬───────────────────────────────┬─────────────────────────────────────────────┬───────────────────┐
  │   │ Finding                       │ Severity                                    │ Action needed?    │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 1 │ vrndmq_f32 correctness        │ None — correct                              │ No                │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 2 │ NaN/Inf edge cases            │ None — clamped                              │ No                │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 3 │ Legacy int overflow fragility │ Info                                        │ No (pre-existing) │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 4 │ Variable scope of tmp         │ None                                        │ No                │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 5 │ vcvtq_s32_f32(fx) after floor │ Info — works correctly since fx is integral │ No                │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 6 │ Preprocessor guard choice     │ None — idiomatic                            │ No                │
  └───┴───────────────────────────────┴─────────────────────────────────────────────┴───────────────────┘

  Overall: This is a clean, focused, correct optimization. No logic bugs or meaningful suboptimal issues found. The change is minimal (4 lines added, 0 removed from either path), preserves backward
  compatibility for 32-bit ARM, and the benchmarks confirm identical accuracy with ~20% speedup.


nihui (Member) left a comment


LGTM

@nihui nihui merged commit 16beb34 into Tencent:master Apr 7, 2026
59 of 60 checks passed

nihui commented Apr 7, 2026

Thanks for your contribution !


nihui commented Apr 7, 2026

actually, there is floor operation in arm fp16 exp variant too :D


crafcat7 commented Apr 7, 2026

> actually, there is floor operation in arm fp16 exp variant too :D

Yes, while reviewing the code later I noticed that there is still an fp16 variant with the same floor step. 🤔

I'll include the floor optimization in a follow-up PR, and I want to run a local benchmark to see how much improvement the fp16 path gains. 👍


3 participants