Low performance with NETranspose on aarch64 #1045

Closed
yd2102 opened this issue Apr 4, 2023 · 3 comments
yd2102 commented Apr 4, 2023

Output of 'strings libarm_compute.so | grep arm_compute_version':
arm_compute_version=v23.02.1 Build options: {'Werror': '1', 'debug': '0', 'neon': '1', 'opencl': '0', 'os': 'linux', 'openmp': '1', 'cppthreads': '0', 'arch': 'armv8.2-a', 'multi_isa': '1', 'build': 'native'} Git hash=b'd8bf9b53752a4f573120cf51b31055de8b3c7d29'

Platform:
AWS Graviton3 aarch64 (ARMv8.4-a)

Operating System:
23~22.04.1-Ubuntu

Problem description:

Hi,

I am experiencing low performance when trying to compute ABᵀ, where A and B are matrices of shapes [M, K] and [N, K] respectively. This pattern of computation is very common in modern transformer-based ML models, so it is important that we compute it efficiently.

What I've found so far in ACL's repo is that, in order to compute ABᵀ, we need to compute Bᵀ first and then compute the matrix product of A and Bᵀ.

The Linux profiler shows that more than 60% of the time is spent on the matrix transpose, which is unexpected because a transpose is a much more lightweight operation than the GEMM itself.

So the questions are:

  1. Is there a more optimized NETranspose kernel in ACL, other than "transpose_32bit_elements", that I can configure (my processor supports Arm SVE)?
  2. I think an even more optimized approach would be to handle ABᵀ inside GEMM's tiled kernel, without having to compute the transpose and the GEMM separately (see the sketch after this list for the kind of interface I mean). Does ACL support this kind of fused computation?
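
To illustrate the kind of fused interface meant in question 2 — a BLAS-style sgemm where the transpose of B is folded into the kernel via a flag — here is a short sketch using standard CBLAS (purely illustrative of the pattern; this is not an ACL API):

#include <cblas.h>

// C[M x N] = alpha * A[M x K] * B[N x K]^T + beta * C[M x N], with no explicit transpose of B.
void sgemm_abt(const float *A, const float *B, float *C,
               int M, int N, int K, float alpha, float beta)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                M, N, K,
                alpha,
                A, K,   // A is M x K, leading dimension K
                B, K,   // B is N x K, leading dimension K; CblasTrans folds the transpose into the kernel
                beta,
                C, N);  // C is M x N, leading dimension N
}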

Thanks!

The Linux profiler output shows that more than 60% of the time is spent on the matrix transpose:

# Overhead  Command     Shared Object        Symbol                                                                                                                                                    
# ........  ..........  ...................  ..........................................................................................................................................................
#
    68.80%  neon_sgemm  libarm_compute.so    [.] arm_compute::cpu::kernels::(anonymous namespace)::transpose_32bit_elements
    30.54%  neon_sgemm  libarm_compute.so    [.] arm_gemm::sve_hybrid_fp32_mla_6x4VL
     0.21%  neon_sgemm  libarm_compute.so    [.] arm_gemm::GemmHybridIndirect<arm_gemm::cls_sve_hybrid_fp32_mla_6x4VL, float, float, arm_gemm::Nothing, false, false>::execute
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_compute::cpu::kernels::CpuTransposeKernel::run_op
     0.04%  neon_sgemm  libstdc++.so.6.0.30  [.] std::__detail::_Prime_rehash_policy::_M_need_rehash
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_compute::cpu::CpuGemm::run
     0.04%  neon_sgemm  libc.so.6            [.] _int_free
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_compute::Scheduler::get
     0.04%  neon_sgemm  libarm_compute.so    [.] arm_gemm::(anonymous namespace)::run_hybrid_kernel<arm_gemm::Nothing, false, false>::run<arm_gemm::cls_sve_hybrid_fp32_mla_6x4VL, float, float, float>
     0.02%  neon_sgemm  libarm_compute.so    [.] arm_compute::ITensorPack::get_const_tensor@plt

Here's a short version of my code that reproduces the problem (I use a single thread):

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/NEON/NEScheduler.h"
#include "utils/Utils.h"

#include <cstdlib>
#include <chrono>

using namespace arm_compute;

static const size_t M = 10;
static const size_t N = 768;
static const size_t K = 768;
static const size_t iterations = 100000;
static const float alpha = 1.f;
static const float beta = 0.f;

void benchmark(IFunction *trans, IFunction *gemm, const int threads)
{
    printf("Using %d threads...\n", threads);

    // Use specified number of threads
    NEScheduler::get().set_num_threads(threads);

    // Warm up kernel
    for (size_t i = 0; i < 100; i++)
    {
        if (trans)
        {
            trans->run();
        }

        gemm->run();
    }

    size_t total = threads * iterations;
    auto start = std::chrono::steady_clock::now();

    // Execute kernel
    for (size_t i = 0; i < total; i++)
    {
        if (trans)
        {
            trans->run();
        }

        gemm->run();
    }

    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> diff = stop - start;
    double time = diff.count();

    printf("%f ms/iter\n", 1e3 * time / total);
}

void test_gemm()
{
    printf("M=%zu, N=%zu, K=%zu\n", M, N, K);

    Tensor      src0;
    Tensor      src1;
    Tensor      dst;
    NEGEMM      fgemm;

    // Populate tensor information
    src0.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    src1.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32));

    // Configure kernel
    fgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

    // Allocate all tensors
    src0.allocator()->allocate();
    src1.allocator()->allocate();
    dst.allocator()->allocate();

    // Initialize random inputs
    utils::fill_random_tensor(src0, -1.f, 1.f);
    utils::fill_random_tensor(src1, -1.f, 1.f);

    // Run benchmarking
    benchmark(nullptr, &fgemm, 1);
}

void test_gemm_transpose()
{
    printf("M=%zu, N=%zu, K=%zu\n", M, N, K);

    Tensor      src0;
    Tensor      src1;
    Tensor      src1t;
    Tensor      dst;
    NETranspose trans;
    NEGEMM      fgemm;

    // Populate tensor information
    src0.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
    src1.allocator()->init(TensorInfo(TensorShape(K, N), 1, DataType::F32));
    src1t.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32));

    // Configure kernel
    trans.configure(&src1, &src1t);
    fgemm.configure(&src0, &src1t, nullptr, &dst, alpha, beta);

    // Allocate all tensors
    src0.allocator()->allocate();
    src1.allocator()->allocate();
    src1t.allocator()->allocate();
    dst.allocator()->allocate();

    // Initialize random inputs
    utils::fill_random_tensor(src0, -1.f, 1.f);
    utils::fill_random_tensor(src1, -1.f, 1.f);

    // Run benchmarking
    benchmark(&trans, &fgemm, 1);
}

int main(int argc, char **argv)
{
    (void)argc;
    (void)argv;

    // test_gemm();
    test_gemm_transpose();

    return 0;
}
@GGGGxxxxxxxxr

NETranspose and NEReshaping are both relatively slow on Armv8 according to my own tests.
Same issue here.


yd2102 commented Apr 7, 2023

There appear to be fixed-format GEMM kernels suitable for computing ABᵀ, but I'm not sure whether such kernels are usable in this case. Is there example code that shows how I can compute ABᵀ using the fixed-format Neon kernels?

@nSircombe

@yd2102

I think the only practical example at present is within this oneDNN PR: oneapi-src/oneDNN#1590 (the changes to matmul and the supporting acl_matmul_utils.cpp show how the NETranspose of the "B" matrix is absorbed into a re-order into the memory format expected by the fixed-format kernels).
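
For reference, a rough, untested sketch of the flow that PR uses to drive the fixed-format path from ACL directly. Treat every call below (GEMMInfo::set_fixed_format, GEMMInfo::set_weight_format, NEGEMM::has_opt_impl and its exact signature) as an assumption to be checked against the headers in your ACL version:

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"

using namespace arm_compute;

// Hypothetical helper: query and configure a fixed-format GEMM for dst = alpha * src0 * src1 + beta * dst.
// Every ACL call in here is an assumption based on the oneDNN PR above, not verified against v23.02.1.
bool try_fixed_format_gemm(Tensor &src0, Tensor &src1, Tensor &dst, NEGEMM &fgemm, float alpha, float beta)
{
    GEMMInfo gemm_info{};
    gemm_info.set_fixed_format(true);               // request the fixed-format kernel path
    gemm_info.set_weight_format(WeightFormat::ANY); // let ACL pick the blocked weight layout

    // Ask ACL which weight layout the selected kernel expects for these shapes
    // (signature assumed -- check NEGEMM::has_opt_impl in NEGEMM.h).
    WeightFormat expected_wf = WeightFormat::ANY;
    Status status = NEGEMM::has_opt_impl(expected_wf, src0.info(), src1.info(), nullptr, dst.info(),
                                         alpha, beta, gemm_info);
    if (status.error_code() != ErrorCode::OK)
    {
        return false; // no fixed-format kernel available for this problem
    }

    // src1 ("B") must be re-ordered into expected_wf before running -- that re-order is where
    // the oneDNN PR absorbs the transpose, instead of calling NETranspose separately.
    gemm_info.set_weight_format(expected_wf);
    fgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
    return true;
}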

morgolock self-assigned this Apr 17, 2023
junzhuimx pushed a commit to nxp-imx/arm-computelibrary-imx that referenced this issue Mar 13, 2024
The existing 4x4 tiling for the 32-bit transpose is not efficient on aarch64, given that there are many more Neon registers available. Increasing the tile size to 8x8 therefore greatly improves NETranspose latency.

For example, on AWS Graviton3 processors, with this change I have observed that transposing a 768x768 matrix improves in latency from 0.32ms down to 0.19ms. Improvements can also be seen across different matrix sizes.

Further enlarging the tile size to 8x16 or 16x16 does not perform as well as 8x8 due to register pressure.

This change is to mitigate the issue reported at:
ARM-software/ComputeLibrary#1045

Signed-off-by: Ethan Doe <yidoe@amazon.com>
Change-Id: Ia09859cdf2f6d312e67219a9d95a3a3bf1db1999
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/9448
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
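
For reference, here is a minimal sketch (not ACL's actual kernel) of the register-tiling idea behind that change: transpose one 4x4 block of 32-bit floats entirely in Neon registers. The patch applies the same pattern with 8x8 tiles, which makes better use of the 32 vector registers available on aarch64.

#include <arm_neon.h>
#include <cstddef>

// Transpose a 4x4 tile of floats held in a larger matrix; strides are in elements.
static inline void transpose_4x4_f32(const float *src, size_t src_stride,
                                     float *dst, size_t dst_stride)
{
    // Load four rows of the source tile.
    float32x4_t r0 = vld1q_f32(src + 0 * src_stride);
    float32x4_t r1 = vld1q_f32(src + 1 * src_stride);
    float32x4_t r2 = vld1q_f32(src + 2 * src_stride);
    float32x4_t r3 = vld1q_f32(src + 3 * src_stride);

    // Interleave 32-bit lanes, then 64-bit lanes, to form the transposed rows.
    float32x4_t t0 = vtrn1q_f32(r0, r1); // a0 b0 a2 b2
    float32x4_t t1 = vtrn2q_f32(r0, r1); // a1 b1 a3 b3
    float32x4_t t2 = vtrn1q_f32(r2, r3); // c0 d0 c2 d2
    float32x4_t t3 = vtrn2q_f32(r2, r3); // c1 d1 c3 d3

    float32x4_t c0 = vreinterpretq_f32_f64(vtrn1q_f64(vreinterpretq_f64_f32(t0), vreinterpretq_f64_f32(t2))); // a0 b0 c0 d0
    float32x4_t c1 = vreinterpretq_f32_f64(vtrn1q_f64(vreinterpretq_f64_f32(t1), vreinterpretq_f64_f32(t3))); // a1 b1 c1 d1
    float32x4_t c2 = vreinterpretq_f32_f64(vtrn2q_f64(vreinterpretq_f64_f32(t0), vreinterpretq_f64_f32(t2))); // a2 b2 c2 d2
    float32x4_t c3 = vreinterpretq_f32_f64(vtrn2q_f64(vreinterpretq_f64_f32(t1), vreinterpretq_f64_f32(t3))); // a3 b3 c3 d3

    // Store the transposed rows.
    vst1q_f32(dst + 0 * dst_stride, c0);
    vst1q_f32(dst + 1 * dst_stride, c1);
    vst1q_f32(dst + 2 * dst_stride, c2);
    vst1q_f32(dst + 3 * dst_stride, c3);
}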