[feature] arm: speed up fp16 exp_ps floor step on aarch64#6659

Merged
nihui merged 4 commits into Tencent:master from crafcat7:opt-arm_exp_ps_f16
Apr 8, 2026
Conversation

@crafcat7
Contributor

@crafcat7 crafcat7 commented Apr 8, 2026

This PR builds on #6657, supplementing and optimizing exp_ps for the ARM fp16 path while keeping it consistent with the ARM fp32 implementation.

The speedup comes from doing less SIMD work: the original generic floor sequence is replaced with a native rounding instruction:

Optimization: Use vcvtmq_s16_f16 for floor on aarch64

  • Legacy: 6 instructions (vcvt + vcvt + vclt + vand + vsub + vcvt)
  • New: 2 instructions (vcvtmq + vcvt) + reuse for 2^n
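
A scalar model may clarify the two floor strategies (a portable sketch only — the real code uses the NEON intrinsics above, and these function names are illustrative):

```cpp
#include <cmath>

// Model of the legacy 6-instruction floor: truncate toward zero,
// then subtract 1 wherever truncation rounded up (negative non-integers).
static float legacy_floor(float x)
{
    int n = (int)x;        // vcvt: rounds toward zero
    float fx = (float)n;   // vcvt: back to float
    if (fx > x) fx -= 1.f; // vclt + vand + vsub: fix up negatives
    return fx;
}

// Model of the new path: vcvtmq_s16_f16 rounds toward minus infinity
// in a single convert, so no fixup sequence is needed.
static float vcvtm_floor(float x)
{
    return std::floor(x);
}
```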

Performance and accuracy are as follows (measured on Android devices); the change improves speed by roughly 20-30%.

elements=1048576 rounds=10000

--- pack4 (4 x fp16 elements) ---
legacy exp_ps_f16 : 1.959 ns/elem
new exp_ps_f16 : 1.486 ns/elem
speedup : 1.32x

--- pack8 (8 x fp16 elements) ---
legacy exp_ps_f16 : 1.203 ns/elem
new exp_ps_f16 : 0.864 ns/elem
speedup : 1.39x

--- Accuracy ---
legacy vs new avg diff : 0.000000000
legacy vs new max diff : 0.000000000

Testing method: execute fn over the data in SIMD batches and measure per-element throughput. (hsum4_f16 and hsum8_f16 are horizontal-sum helpers, omitted from this excerpt.)

static float bench_fn_pack4(const std::vector<float>& data, int rounds, float16x4_t (*fn)(float16x4_t))
{
    volatile float sink = 0.f;
    const auto t0 = std::chrono::steady_clock::now();

    for (int r = 0; r < rounds; r++)
    {
        float16x4_t vacc = vdup_n_f16(0.f);
        for (size_t i = 0; i + 3 < data.size(); i += 4)
        {
            // Convert float to fp16
            float32x4_t x_f32 = vld1q_f32(&data[i]);
            float16x4_t x = vcvt_f16_f32(x_f32);
            vacc = vadd_f16(vacc, fn(x));
        }
        sink += hsum4_f16(vacc);
    }

    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> dt = t1 - t0;
    if (sink == 0.123f)
        std::printf("ignore %f\n", sink);
    return (float)(dt.count() / (double)(data.size() * (size_t)rounds));
}

static float bench_fn_pack8(const std::vector<float>& data, int rounds, float16x8_t (*fn)(float16x8_t))
{
    volatile float sink = 0.f;
    const auto t0 = std::chrono::steady_clock::now();

    for (int r = 0; r < rounds; r++)
    {
        float16x8_t vacc = vdupq_n_f16(0.f);
        for (size_t i = 0; i + 7 < data.size(); i += 8)
        {
            // Convert 8 floats to fp16 and pack into one register
            float32x4_t x_f32_0 = vld1q_f32(&data[i]);
            float32x4_t x_f32_1 = vld1q_f32(&data[i + 4]);
            float16x4_t x0 = vcvt_f16_f32(x_f32_0);
            float16x4_t x1 = vcvt_f16_f32(x_f32_1);
            float16x8_t x = vcombine_f16(x0, x1);
            vacc = vaddq_f16(vacc, fn(x));
        }
        sink += hsum8_f16(vacc);
    }

    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> dt = t1 - t0;
    if (sink == 0.123f)
        std::printf("ignore %f\n", sink);
    return (float)(dt.count() / (double)(data.size() * (size_t)rounds));
}

Summary:
  Use vcvtmq_s16_f16 for floor computation in exp_ps_f16 on aarch64 while keeping the legacy fallback path for non-aarch64 targets. This reduces the exp_ps hot-path cost on ARM without changing approximation behavior.

  Also reuses the floor result when constructing 2^n, avoiding a redundant vcvt instruction.
@github-actions github-actions bot added the arm label Apr 8, 2026
Member

@nihui nihui left a comment


#if defined(__aarch64__) is always true for all armv8.2+ targets, so the macro condition is unnecessary.

Summary:
  __aarch64__ is always true for all armv8.2+ targets, so the macro condition is unnecessary.
@crafcat7
Contributor Author

crafcat7 commented Apr 8, 2026

#if defined(__aarch64__) is always true for all armv8.2+ targets, so the macro condition is unnecessary.

Got it. I removed the aarch64 check; the optimization is now always applied in the ARM fp16 implementation.

@codecov-commenter

codecov-commenter commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.79%. Comparing base (16beb34) to head (c38d865).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6659      +/-   ##
==========================================
- Coverage   93.80%   93.79%   -0.01%     
==========================================
  Files         917      917              
  Lines      288669   288475     -194     
==========================================
- Hits       270776   270581     -195     
- Misses      17893    17894       +1     

@nihui
Member

nihui commented Apr 8, 2026

I suspect that vrndmq_f16 and vcvtq_s16_f16 would be better, because this would delay a register allocation and remove one register dependency, similar to how the FP32 implementation does it.

Summary:
  Use vrndm/vrndmq plus vcvt for exp_ps_f16 floor conversion on AArch64 while preserving output accuracy on device tests.
@crafcat7
Contributor Author

crafcat7 commented Apr 8, 2026

I suspect that vrndmq_f16 and vcvtq_s16_f16 would be better, because this would delay a register allocation and remove one register dependency, similar to how the FP32 implementation does it.

Yes, I tried vrndmq_f16 + vcvtq_s16_f16, and it is indeed slightly faster than the vcvtm version.

elements=1048576 rounds=10000

--- pack4 (4 x fp16 elements) ---
legacy exp_ps_f16 : 1.962 ns/elem
vcvtm  exp_ps_f16 : 1.482 ns/elem
rnd    exp_ps_f16 : 1.467 ns/elem

legacy/vcvtm      : 1.32x
rnd/vcvtm         : 0.990x

--- pack8 (8 x fp16 elements) ---
legacy exp_ps_f16 : 1.203 ns/elem
vcvtm  exp_ps_f16 : 0.861 ns/elem
rnd    exp_ps_f16 : 0.858 ns/elem

legacy/vcvtm  : 1.40x
rnd/vcvtm     : 0.997x

--- Accuracy ---
legacy vs new avg diff  : 0.000000000
legacy vs new max diff  : 0.000000000
new vs rnd   avg diff   : 0.000000000
new vs rnd   max diff   : 0.000000000

Therefore, I submitted another change, using vrndmq_f16 + vcvtq_s16_f16 to reduce register dependencies.

Summary:
  Keep floor value in fx via vrndm/vrndmq and perform s16 conversion only when building pow2n.
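
The pow2n construction mentioned in the summary is the usual Cephes-style trick: the integer floor n is biased into the float exponent field to produce 2^n directly. A portable fp32 sketch (inferred from the standard exp_ps structure; pow2n_f32 is an illustrative name, and the fp16 path is analogous with bias 15 and 10 mantissa bits):

```cpp
#include <cstdint>
#include <cstring>

// Build 2^n by writing (n + bias) into the IEEE-754 exponent field.
// fp32: bias 127, 23 mantissa bits; valid for n in roughly [-126, 127].
static float pow2n_f32(int n)
{
    std::uint32_t bits = (std::uint32_t)(n + 127) << 23; // exponent only, zero mantissa
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```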
@nihui nihui merged commit 7a8b9a5 into Tencent:master Apr 8, 2026
59 of 60 checks passed
@nihui
Member

nihui commented Apr 8, 2026

Thanks for your contribution!
