Low performance with NETranspose on aarch64 #1045

Closed
yd2102 opened this issue Apr 4, 2023 · 3 comments
yd2102 commented Apr 4, 2023

Output of 'strings libarm_compute.so | grep arm_compute_version':
arm_compute_version=v23.02.1 Build options: {'Werror': '1', 'debug': '0', 'neon': '1', 'opencl': '0', 'os': 'linux', 'openmp': '1', 'cppthreads': '0', 'arch': 'armv8.2-a', 'multi_isa': '1', 'build': 'native'} Git hash=b'd8bf9b53752a4f573120cf51b31055de8b3c7d29'

Platform:
AWS Graviton3 aarch64 (ARMv8.4-a)

Operating System:
23~22.04.1-Ubuntu

Problem description:

Hi,

I am experiencing low performance when trying to compute ABᵀ, where A and B are matrices of shapes [M, K] and [N, K] respectively. This pattern of computation is very common in modern transformer-based ML models, so it is important that we compute it efficiently.

What I've found so far in ACL's repo is that, in order to compute ABᵀ, we need to compute Bᵀ first and then compute the matrix product of A and Bᵀ.

The Linux profiler shows that more than 60% of the time is spent on the matrix transpose, which is unexpected because a transpose is a much more lightweight operation than the GEMM itself.

So the questions are:

  1. Is there a more optimized NETranspose kernel in ACL, other than "transpose_32bit_elements", that I can configure (my processor supports Arm SVE)?
  2. I think an even more optimized approach would be to handle ABᵀ inside GEMM's tiled kernel, without having to compute the transpose and the GEMM separately (see the sketch after this list for the kind of interface I mean). Does ACL support this kind of fused computation?
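
To illustrate the kind of fused interface meant in question 2 — a BLAS-style sgemm where the transpose of B is folded into the kernel via a flag — here is a short sketch using standard CBLAS (purely illustrative of the pattern; this is not an ACL API):

#include <cblas.h>

// C[M x N] = alpha * A[M x K] * B[N x K]^T + beta * C[M x N], with no explicit transpose of B.
void sgemm_abt(const float *A, const float *B, float *C,
               int M, int N, int K, float alpha, float beta)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                M, N, K,
                alpha,
                A, K,   // A is M x K, leading dimension K
                B, K,   // B is N x K, leading dimension K; CblasTrans folds the transpose into the kernel
                beta,
                C, N);  // C is M x N, leading dimension N
}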

Thanks!

The Linux profiler output shows that more than 60% of the time is spent on the matrix transpose:

# Overhead  Command     Shared Object        Symbol                                                                                                                                                    
# ........  ..........  ...................  ..........................................................................................................................................................
#
    68.80%  neon_sgemm  libarm_compute.so    [.] arm_compute::cpu::kernels::(anonymous namespace)::transpose_32bit_elements
    30.54%  neon_sgemm  libarm_compute.so    [.] arm_gemm::sve_hybrid_fp32_mla_6x4VL
     0.21%  neon_sgemm  libarm_compute.so    [.] arm_gemm::GemmHybridIndirect<arm_gemm::cls_sve_hybrid_fp32_mla_6x4VL, float, float, arm_gemm::Nothing, false, false>::execute
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_compute::cpu::kernels::CpuTransposeKernel::run_op
     0.04%  neon_sgemm  libstdc++.so.6.0.30  [.] std::__detail::_Prime_rehash_policy::_M_need_rehash
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_compute::cpu::CpuGemm::run
     0.04%  neon_sgemm  libc.so.6            [.] _int_free
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_compute::Scheduler::get
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_gemm::(anonymous namespace)::run_hybrid_kernel<arm_gemm::Nothing, false, false>::run<arm_gemm::cls_sve_hybrid_fp32_mla_6x4VL, float, float, float>
     0.02%  neon_sgemm  libarm_compute.so    [.] arm_compute::ITensorPack::get_const_tensor@plt

Here's a short version of my code that reproduces the problem (I use a single thread):

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/NEON/NEScheduler.h"
#include "utils/Utils.h"

#include <cstdlib>
#include <chrono>

using namespace arm_compute;

static const size_t M = 10;
static const size_t N = 768;
static const size_t K = 768;
static const size_t iterations = 100000;
static const float alpha = 1.f;
static const float beta = 0.f;

void benchmark(IFunction *trans, IFunction *gemm, const int threads)
{
    printf("Using %d threads...\n", threads);

    // Use specified number of threads
    NEScheduler::get().set_num_threads(threads);

    // Warm up kernel
    for (size_t i = 0; i < 100; i++)
    {
        if (trans)
        {
            trans->run();
        }

        gemm->run();
    }

    size_t total = threads * iterations;
    auto start = std::chrono::steady_clock::now();

    // Execute kernel
    for (size_t i = 0; i < total; i++)
    {
        if (trans)
        {
            trans->run();
        }

        gemm->run();
    }

    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> diff = stop - start;
    double time = diff.count();

    printf("%f ms/iter\n", 1e3 * time / total);
}

void test_gemm()
{
    printf("M=%zu, N=%zu, K=%zu\n", M, N, K);

    Tensor      src0;
    Tensor      src1;
    Tensor      dst;
    NEGEMM      fgemm;

    // Populate tensor information
    src0.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    src1.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32));

    // Configure kernel
    fgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

    // Allocate all tensors
    src0.allocator()->allocate();
    src1.allocator()->allocate();
    dst.allocator()->allocate();

    // Initialize random inputs
    utils::fill_random_tensor(src0, -1.f, 1.f);
    utils::fill_random_tensor(src1, -1.f, 1.f);

    // Run benchmarking
    benchmark(nullptr, &fgemm, 1);
}

void test_gemm_transpose()
{
    printf("M=%zu, N=%zu, K=%zu\n", M, N, K);

    Tensor      src0;
    Tensor      src1;
    Tensor      src1t;
    Tensor      dst;
    NETranspose trans;
    NEGEMM      fgemm;

    // Populate tensor information
    src0.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    src1.allocator()->init(TensorInfo(TensorShape(K, N), 1, DataType::F32));
    src1t.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32));

    // Configure kernel
    trans.configure(&src1, &src1t);
    fgemm.configure(&src0, &src1t, nullptr, &dst, alpha, beta);

    // Allocate all tensors
    src0.allocator()->allocate();
    src1.allocator()->allocate();
    src1t.allocator()->allocate();
    dst.allocator()->allocate();

    // Initialize random inputs
    utils::fill_random_tensor(src0, -1.f, 1.f);
    utils::fill_random_tensor(src1, -1.f, 1.f);

    // Run benchmarking
    benchmark(&trans, &fgemm, 1);
}

int main(int argc, char **argv)
{
    (void)argc;
    (void)argv;

    // test_gemm();
    test_gemm_transpose();

    return 0;
}
@GGGGxxxxxxxxr

NETranspose and NEReshaping are both relatively slow on Armv8 according to my own tests.
Same issue here.


yd2102 commented Apr 7, 2023

There appear to be fixed-format GEMM kernels suitable for computing ABᵀ, but I'm not sure whether such kernels are usable in this case. Is there example code that shows how I can compute ABᵀ using the fixed-format Neon kernels?

@nSircombe

@yd2102

I think the only practical example at present is within this oneDNN PR: oneapi-src/oneDNN#1590 (the changes to matmul and the supporting acl_matmul_utils.cpp show how the NETranspose of the "B" matrix is absorbed into a re-order into the memory format expected by the fixed-format kernels).
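
For reference, a rough, untested sketch of the flow that PR uses to drive the fixed-format path from ACL directly. Treat every call below (GEMMInfo::set_fixed_format, GEMMInfo::set_weight_format, NEGEMM::has_opt_impl and its exact signature) as an assumption to be checked against the headers in your ACL version:

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"

using namespace arm_compute;

// Hypothetical helper: query and configure a fixed-format GEMM for dst = alpha * src0 * src1 + beta * dst.
// Every ACL call in here is an assumption based on the oneDNN PR above, not verified against v23.02.1.
bool try_fixed_format_gemm(Tensor &src0, Tensor &src1, Tensor &dst, NEGEMM &fgemm, float alpha, float beta)
{
    GEMMInfo gemm_info{};
    gemm_info.set_fixed_format(true);               // request the fixed-format kernel path
    gemm_info.set_weight_format(WeightFormat::ANY); // let ACL pick the blocked weight layout

    // Ask ACL which weight layout the selected kernel expects for these shapes
    // (signature assumed -- check NEGEMM::has_opt_impl in NEGEMM.h).
    WeightFormat expected_wf = WeightFormat::ANY;
    Status status = NEGEMM::has_opt_impl(expected_wf, src0.info(), src1.info(), nullptr, dst.info(),
                                         alpha, beta, gemm_info);
    if (status.error_code() != ErrorCode::OK)
    {
        return false; // no fixed-format kernel available for this problem
    }

    // src1 ("B") must be re-ordered into expected_wf before running -- that re-order is where
    // the oneDNN PR absorbs the transpose, instead of calling NETranspose separately.
    gemm_info.set_weight_format(expected_wf);
    fgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
    return true;
}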

morgolock self-assigned this Apr 17, 2023
junzhuimx pushed a commit to nxp-imx/arm-computelibrary-imx that referenced this issue Mar 13, 2024
The existing 4x4 tiling for the 32-bit transpose is not efficient on aarch64, given that there are many more Neon registers available. Increasing the tile size to 8x8 therefore greatly improves NETranspose latency.

For example, on AWS Graviton3 processors, with this change I have observed that transposing a 768x768 matrix improves in latency from 0.32ms down to 0.19ms. Improvements can also be seen across different matrix sizes.

Further enlarging the tile size to 8x16 or 16x16 does not perform as well as 8x8 due to register pressure.

This change is to mitigate the issue reported at:
ARM-software/ComputeLibrary#1045

Signed-off-by: Ethan Doe <yidoe@amazon.com>
Change-Id: Ia09859cdf2f6d312e67219a9d95a3a3bf1db1999
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/9448
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
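
For reference, here is a minimal sketch (not ACL's actual kernel) of the register-tiling idea behind that change: transpose one 4x4 block of 32-bit floats entirely in Neon registers. The patch applies the same pattern with 8x8 tiles, which makes better use of the 32 vector registers available on aarch64.

#include <arm_neon.h>
#include <cstddef>

// Transpose a 4x4 tile of floats held in a larger matrix; strides are in elements.
static inline void transpose_4x4_f32(const float *src, size_t src_stride,
                                     float *dst, size_t dst_stride)
{
    // Load four rows of the source tile.
    float32x4_t r0 = vld1q_f32(src + 0 * src_stride);
    float32x4_t r1 = vld1q_f32(src + 1 * src_stride);
    float32x4_t r2 = vld1q_f32(src + 2 * src_stride);
    float32x4_t r3 = vld1q_f32(src + 3 * src_stride);

    // Interleave 32-bit lanes, then 64-bit lanes, to form the transposed rows.
    float32x4_t t0 = vtrn1q_f32(r0, r1); // a0 b0 a2 b2
    float32x4_t t1 = vtrn2q_f32(r0, r1); // a1 b1 a3 b3
    float32x4_t t2 = vtrn1q_f32(r2, r3); // c0 d0 c2 d2
    float32x4_t t3 = vtrn2q_f32(r2, r3); // c1 d1 c3 d3

    float32x4_t c0 = vreinterpretq_f32_f64(vtrn1q_f64(vreinterpretq_f64_f32(t0), vreinterpretq_f64_f32(t2))); // a0 b0 c0 d0
    float32x4_t c1 = vreinterpretq_f32_f64(vtrn1q_f64(vreinterpretq_f64_f32(t1), vreinterpretq_f64_f32(t3))); // a1 b1 c1 d1
    float32x4_t c2 = vreinterpretq_f32_f64(vtrn2q_f64(vreinterpretq_f64_f32(t0), vreinterpretq_f64_f32(t2))); // a2 b2 c2 d2
    float32x4_t c3 = vreinterpretq_f32_f64(vtrn2q_f64(vreinterpretq_f64_f32(t1), vreinterpretq_f64_f32(t3))); // a3 b3 c3 d3

    // Store the transposed rows.
    vst1q_f32(dst + 0 * dst_stride, c0);
    vst1q_f32(dst + 1 * dst_stride, c1);
    vst1q_f32(dst + 2 * dst_stride, c2);
    vst1q_f32(dst + 3 * dst_stride, c3);
}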