[feature] arm: speed up fp16 exp_ps floor step on aarch64#6659

Merged
nihui merged 4 commits into Tencent:master from crafcat7:opt-arm_exp_ps_f16
Apr 8, 2026
Conversation

@crafcat7
Contributor

@crafcat7 crafcat7 commented Apr 8, 2026

This PR builds on #6657, supplementing and optimizing exp_ps for the ARM fp16 path while keeping it consistent with the ARM fp32 implementation.

The speedup comes from doing less SIMD work: the original generic floor sequence is replaced with a native rounding instruction:

Optimization: Use vcvtmq_s16_f16 for floor on aarch64

  • Legacy: 6 instructions (vcvt + vcvt + vclt + vand + vsub + vcvt)
  • New: 2 instructions (vcvtmq + vcvt) + reuse for 2^n
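
A scalar model may clarify the two floor strategies (a portable sketch only — the real code uses the NEON intrinsics above, and these function names are illustrative):

```cpp
#include <cmath>

// Model of the legacy 6-instruction floor: truncate toward zero,
// then subtract 1 wherever truncation rounded up (negative non-integers).
static float legacy_floor(float x)
{
    int n = (int)x;        // vcvt: rounds toward zero
    float fx = (float)n;   // vcvt: back to float
    if (fx > x) fx -= 1.f; // vclt + vand + vsub: fix up negatives
    return fx;
}

// Model of the new path: vcvtmq_s16_f16 rounds toward minus infinity
// in a single convert, so no fixup sequence is needed.
static float vcvtm_floor(float x)
{
    return std::floor(x);
}
```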

Performance and accuracy are as follows (measured on Android devices); the change improves speed by roughly 20-30%.

elements=1048576 rounds=10000

--- pack4 (4 x fp16 elements) ---
legacy exp_ps_f16 : 1.959 ns/elem
new exp_ps_f16 : 1.486 ns/elem
speedup : 1.32x

--- pack8 (8 x fp16 elements) ---
legacy exp_ps_f16 : 1.203 ns/elem
new exp_ps_f16 : 0.864 ns/elem
speedup : 1.39x

--- Accuracy ---
legacy vs new avg diff : 0.000000000
legacy vs new max diff : 0.000000000

Testing method: execute fn over the data in SIMD batches and measure per-element throughput. (hsum4_f16 and hsum8_f16 are horizontal-sum helpers, omitted from this excerpt.)

static float bench_fn_pack4(const std::vector<float>& data, int rounds, float16x4_t (*fn)(float16x4_t))
{
    volatile float sink = 0.f;
    const auto t0 = std::chrono::steady_clock::now();

    for (int r = 0; r < rounds; r++)
    {
        float16x4_t vacc = vdup_n_f16(0.f);
        for (size_t i = 0; i + 3 < data.size(); i += 4)
        {
            // Convert float to fp16
            float32x4_t x_f32 = vld1q_f32(&data[i]);
            float16x4_t x = vcvt_f16_f32(x_f32);
            vacc = vadd_f16(vacc, fn(x));
        }
        sink += hsum4_f16(vacc);
    }

    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> dt = t1 - t0;
    if (sink == 0.123f)
        std::printf("ignore %f\n", sink);
    return (float)(dt.count() / (double)(data.size() * (size_t)rounds));
}

static float bench_fn_pack8(const std::vector<float>& data, int rounds, float16x8_t (*fn)(float16x8_t))
{
    volatile float sink = 0.f;
    const auto t0 = std::chrono::steady_clock::now();

    for (int r = 0; r < rounds; r++)
    {
        float16x8_t vacc = vdupq_n_f16(0.f);
        for (size_t i = 0; i + 7 < data.size(); i += 8)
        {
            // Convert 8 floats to fp16 and pack into one register
            float32x4_t x_f32_0 = vld1q_f32(&data[i]);
            float32x4_t x_f32_1 = vld1q_f32(&data[i + 4]);
            float16x4_t x0 = vcvt_f16_f32(x_f32_0);
            float16x4_t x1 = vcvt_f16_f32(x_f32_1);
            float16x8_t x = vcombine_f16(x0, x1);
            vacc = vaddq_f16(vacc, fn(x));
        }
        sink += hsum8_f16(vacc);
    }

    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> dt = t1 - t0;
    if (sink == 0.123f)
        std::printf("ignore %f\n", sink);
    return (float)(dt.count() / (double)(data.size() * (size_t)rounds));
}

Summary:
  Use vcvtmq_s16_f16 for floor computation in exp_ps_f16 on aarch64 while keeping the legacy fallback path for non-aarch64 targets. This reduces the exp_ps hot-path cost on ARM without changing approximation behavior.

  Also reuses the floor result when constructing 2^n, avoiding a redundant vcvt instruction.
@github-actions github-actions bot added the arm label Apr 8, 2026
Member

@nihui nihui left a comment


#if defined(__aarch64__) is always true for all armv8.2+ targets, so the macro condition is unnecessary.

Summary:
  __aarch64__ is always true for all armv8.2+ targets, so the macro condition is unnecessary.
@crafcat7
Contributor Author

crafcat7 commented Apr 8, 2026

#if defined(__aarch64__) is always true for all armv8.2+ targets, so the macro condition is unnecessary.

Got it. I removed the aarch64 check; the optimization is now always applied in the ARM fp16 implementation.

@codecov-commenter

codecov-commenter commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.79%. Comparing base (16beb34) to head (c38d865).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6659      +/-   ##
==========================================
- Coverage   93.80%   93.79%   -0.01%     
==========================================
  Files         917      917              
  Lines      288669   288475     -194     
==========================================
- Hits       270776   270581     -195     
- Misses      17893    17894       +1     

@nihui
Member

nihui commented Apr 8, 2026

I suspect that vrndmq_f16 and vcvtq_s16_f16 would be better, because this would delay a register allocation and remove one register dependency, similar to how the FP32 implementation does it.

Summary:
  Use vrndm/vrndmq plus vcvt for exp_ps_f16 floor conversion on AArch64 while preserving output accuracy on device tests.
@crafcat7
Contributor Author

crafcat7 commented Apr 8, 2026

I suspect that vrndmq_f16 and vcvtq_s16_f16 would be better, because this would delay a register allocation and remove one register dependency, similar to how the FP32 implementation does it.

Yes, I tried vrndmq_f16 + vcvtq_s16_f16, and it is indeed slightly faster than the vcvtm version.

elements=1048576 rounds=10000

--- pack4 (4 x fp16 elements) ---
legacy exp_ps_f16 : 1.962 ns/elem
vcvtm  exp_ps_f16 : 1.482 ns/elem
rnd    exp_ps_f16 : 1.467 ns/elem

legacy/vcvtm      : 1.32x
rnd/vcvtm         : 0.990x

--- pack8 (8 x fp16 elements) ---
legacy exp_ps_f16 : 1.203 ns/elem
vcvtm  exp_ps_f16 : 0.861 ns/elem
rnd    exp_ps_f16 : 0.858 ns/elem

legacy/vcvtm  : 1.40x
rnd/vcvtm     : 0.997x

--- Accuracy ---
legacy vs new avg diff  : 0.000000000
legacy vs new max diff  : 0.000000000
new vs rnd   avg diff   : 0.000000000
new vs rnd   max diff   : 0.000000000

Therefore, I submitted another change, using vrndmq_f16 + vcvtq_s16_f16 to reduce register dependencies.

Summary:
  Keep floor value in fx via vrndm/vrndmq and perform s16 conversion only when building pow2n.
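
The pow2n construction mentioned in the summary is the usual Cephes-style trick: the integer floor n is biased into the float exponent field to produce 2^n directly. A portable fp32 sketch (inferred from the standard exp_ps structure; pow2n_f32 is an illustrative name, and the fp16 path is analogous with bias 15 and 10 mantissa bits):

```cpp
#include <cstdint>
#include <cstring>

// Build 2^n by writing (n + bias) into the IEEE-754 exponent field.
// fp32: bias 127, 23 mantissa bits; valid for n in roughly [-126, 127].
static float pow2n_f32(int n)
{
    std::uint32_t bits = (std::uint32_t)(n + 127) << 23; // exponent only, zero mantissa
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```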
@nihui nihui merged commit 7a8b9a5 into Tencent:master Apr 8, 2026
59 of 60 checks passed
@nihui
Member

nihui commented Apr 8, 2026

Thanks for your contribution!
