
[feature] arm: speed up exp_ps floor step on aarch64#6657

Merged: nihui merged 1 commit into Tencent:master from crafcat7:opt-arm_exp_ps on Apr 7, 2026

Conversation


crafcat7 (Contributor) commented Apr 7, 2026

Summary

Following the discussion in #6655, this continues the ARM exp_ps optimization work.

Use vrndmq_f32 for floor computation in exp_ps on aarch64 while keeping the legacy fallback path for non-aarch64 targets. This reduces the exp_ps hot-path cost on ARM without changing approximation behavior.
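A rough sketch of the change (the intrinsic names match arm_neon.h; the scalar bodies here are stand-ins for the four vector lanes so the sketch compiles on any host):

```c
#include <math.h>

/* Sketch of the floor step in exp_ps. On AArch64 a single FRINTM
 * instruction (vrndmq_f32) performs the floor; elsewhere the legacy
 * truncate-then-correct emulation is kept. */
static float exp_ps_floor_step(float fx)
{
#if defined(__aarch64__)
    return floorf(fx);                    /* vector code: fx = vrndmq_f32(fx); */
#else
    float tmp = (float)(int)fx;           /* truncate toward zero */
    return (tmp > fx) ? tmp - 1.0f : tmp; /* fix up negative non-integers */
#endif
}
```

Both branches compute the same result; the AArch64 branch just does it in one instruction per vector.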

With this change, a simple performance comparison was run on an M3 Max device; the results are as follows:

clang++ -std=c++11 -O3 -DNDEBUG -I. -I"src/layer/arm" "bench_compare_exp_ps_arm.cpp" -o "bench_compare_exp_ps_arm" && ./bench_compare_exp_ps_arm
elements=1048576 rounds=10000
legacy exp_ps : 0.515 ns/elem
new exp_ps    : 0.423 ns/elem
speedup       : 1.22x
legacy avg/max rel err : 0.000001% / 0.000012%
new    avg/max rel err : 0.000001% / 0.000012%
max |legacy - new|     : 0

For aarch64 devices, an improvement of approximately 20% is expected.


codecov-commenter commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.79%. Comparing base (e366f48) to head (cec0bb2).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6657   +/-   ##
=======================================
  Coverage   93.79%   93.79%           
=======================================
  Files         917      917           
  Lines      288530   288530           
=======================================
+ Hits       270634   270637    +3     
+ Misses      17896    17893    -3     



nihui (Member) commented Apr 7, 2026

  ---
  Summary

  The PR adds a preprocessor guard in src/layer/arm/neon_mathfun.h to use the AArch64 native vrndmq_f32 (vector round-toward-minus-infinity, i.e. floor) instruction instead of the manual
  truncate-then-correct floor emulation on 32-bit ARM. This is a clean ~20% speedup with identical accuracy per the author's benchmarks.

  Verdict: No logic bugs found. Minor observations below.

  ---
  1. Correctness of vrndmq_f32 as a floor replacement — Correct

  The legacy code does:
  tmp = vcvtq_f32_s32(vcvtq_s32_f32(fx));   // truncate toward zero
  mask = vcgtq_f32(tmp, fx);                  // if truncation went up (negative fx)
  mask = vandq_u32(mask, vreinterpretq_u32_f32(one)); // mask = 1.0 where correction needed
  fx = vsubq_f32(tmp, vreinterpretq_f32_u32(mask));   // subtract 1 to get floor

  vrndmq_f32 (ARMv8 FRINTM) performs exactly IEEE 754 round-toward-negative-infinity, which is floorf. These are semantically identical. No logic bug here.
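A scalar transcription of the legacy per-lane logic (helper names here are illustrative, not from the source) makes the equivalence easy to check:

```c
/* Scalar transcription of the legacy lane logic: truncate toward zero
 * (vcvtq_s32_f32 + vcvtq_f32_s32), then subtract 1 where the truncation
 * landed above the input (vcgtq_f32 + vandq_u32 + vsubq_f32), which
 * happens exactly for negative non-integers. */
static float legacy_floor_lane(float fx)
{
    float tmp = (float)(int)fx;           /* truncate toward zero */
    return (tmp > fx) ? tmp - 1.0f : tmp; /* correct negative non-integers */
}

/* The corrected truncation agrees with floor in every case. */
static int legacy_floor_is_floor(void)
{
    const float in[]  = { -2.5f, -2.0f, -0.5f, 0.0f, 0.5f, 2.0f, 2.5f };
    const float out[] = { -3.0f, -2.0f, -1.0f, 0.0f, 0.0f, 2.0f, 2.0f };
    for (unsigned i = 0; i < sizeof(in) / sizeof(in[0]); i++)
        if (legacy_floor_lane(in[i]) != out[i])
            return 0;
    return 1;
}
```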

  2. Edge case: NaN and Inf handling — Not a concern

  Before the floor block, x is clamped to [c_exp_lo, c_exp_hi] (~[-88.38, 88.38]), and fx is computed as a finite linear transform of x. So fx can never be NaN or Inf when reaching the floor
  operation. Both paths handle this identically anyway. No issue.

  3. Edge case: Large float → int overflow in legacy path — Not affected by this PR, but worth noting

  The legacy fallback vcvtq_s32_f32(fx) would produce undefined/saturated results if fx exceeded the int32 range (~±2.1 billion). This can't happen here due to the clamping (max |fx| ≈ 128), but it's
  a latent fragility in the legacy path. The new vrndmq_f32 path doesn't have this limitation. Slight improvement in robustness.
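The bound can be checked numerically (constants follow the usual Cephes-style exp_ps values, where c_exp_hi ≈ 88.376 and fx = x * log2(e) + 0.5):

```c
/* With x clamped to [c_exp_lo, c_exp_hi], the magnitude of
 * fx = x * log2(e) + 0.5 peaks near 128, which is nowhere near the
 * int32 limit, so the legacy float->int conversion cannot overflow. */
static int fx_stays_in_int32_range(void)
{
    const float c_exp_hi = 88.3762626647949f;
    const float log2e    = 1.44269504088896341f;
    float fx_max = c_exp_hi * log2e + 0.5f; /* ~128 */
    return fx_max < 129.0f && fx_max < 2147483648.0f;
}
```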

  4. Variable tmp declaration scope — No issue after inspection

  tmp is declared at the top of the function:
  float32x4_t tmp, fx;

  With the #if defined(__aarch64__) path, tmp is no longer used until later (tmp = vmulq_f32(fx, ...)), so the variable is still needed. However, mask and one have different usage patterns between
  paths:

  - one is declared and initialized before the #if block but is only used inside the #else floor-correction block and later when computing y = vaddq_f32(y, one). So one is still needed on both paths.
  No issue.
  - mask is declared inside the #else block, so it's properly scoped. No issue.

  No real problem here — the declarations are fine.

  5. Missing #if guard symmetry for related optimizations — Observation

  The "build 2^n" section below still uses:
  mm = vcvtq_s32_f32(fx);
  On AArch64, the compiler will likely emit FCVTZS, which is already efficient. But if the author wanted to be thorough, one could note that vcvtq_s32_f32 truncates toward zero while fx is now
  guaranteed to be an integer (from vrndmq_f32), so the truncation direction doesn't matter. No bug, just an observation.
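For context, the "build 2^n" step that consumes this integer is the standard IEEE 754 exponent-field trick; a scalar sketch (common to exp_ps-style implementations, with the vector equivalents noted in comments):

```c
#include <string.h>

/* Scalar sketch of the "build 2^n" step: add the IEEE 754 single-
 * precision bias (127) and shift into the exponent field, yielding the
 * float 2^n exactly for n in the normal range. The vector code does the
 * same with vaddq_s32 / vshlq_n_s32 / vreinterpretq_f32_s32. */
static float pow2i(int n)
{
    unsigned bits = (unsigned)(n + 127) << 23; /* exponent field of 2^n */
    float out;
    memcpy(&out, &bits, sizeof(out)); /* bit-level reinterpretation */
    return out;
}
```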

  6. Preprocessor guard choice — Acceptable but could be more precise

  __aarch64__ is the standard compiler-defined macro for AArch64 targets and is correct here. An alternative would be checking for __ARM_ARCH >= 8 combined with NEON availability, but __aarch64__ is
  the conventional choice in ncnn's codebase. No issue.

  7. No #include changes needed — Correct

  vrndmq_f32 is part of <arm_neon.h> which is already included. No issue.

  ---
  Summary Table

  ┌───┬───────────────────────────────┬─────────────────────────────────────────────┬───────────────────┐
  │   │ Finding                       │ Severity                                    │ Action needed?    │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 1 │ vrndmq_f32 correctness        │ None — correct                              │ No                │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 2 │ NaN/Inf edge cases            │ None — clamped                              │ No                │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 3 │ Legacy int overflow fragility │ Info                                        │ No (pre-existing) │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 4 │ Variable scope of tmp         │ None                                        │ No                │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 5 │ vcvtq_s32_f32(fx) after floor │ Info — works correctly since fx is integral │ No                │
  ├───┼───────────────────────────────┼─────────────────────────────────────────────┼───────────────────┤
  │ 6 │ Preprocessor guard choice     │ None — idiomatic                            │ No                │
  └───┴───────────────────────────────┴─────────────────────────────────────────────┴───────────────────┘

  Overall: This is a clean, focused, correct optimization. No logic bugs or meaningful suboptimal issues found. The change is minimal (4 lines added, 0 removed from either path), preserves backward
  compatibility for 32-bit ARM, and the benchmarks confirm identical accuracy with ~20% speedup.


nihui (Member) left a comment


LGTM

@nihui nihui merged commit 16beb34 into Tencent:master Apr 7, 2026
59 of 60 checks passed

nihui commented Apr 7, 2026

Thanks for your contribution !


nihui commented Apr 7, 2026

actually, there is floor operation in arm fp16 exp variant too :D


crafcat7 commented Apr 7, 2026

> actually, there is floor operation in arm fp16 exp variant too :D

Yes, while reviewing the code later I noticed that there is still an fp16 variant with the same floor step. 🤔

I'll include the floor optimization in a follow-up PR, and I want to run a local benchmark to see how much improvement the fp16 path gains. 👍


3 participants