NEGEMM performance issue #93

cyberfire · 2017-05-05T10:01:54Z

Hi, Guys,

I compared the single-core SGEMM performance of ACL and OpenBLAS on an A72 and found that ACL is much slower.
The ACL version takes 116 ms per matrix multiplication (MM), while the OpenBLAS version takes just 8.8 ms.

The parameter settings are: M=32, N=30000, K=9, alpha=1, beta=0.

I did a little debugging by adding logs in TensorAllocator::allocate() and found that ACL requests a lot of memory for temporary tensors.

MEMBLOCK: request 3478608 //for _interleave_kernel
MEMBLOCK: request 899640000 //for _transpose_kernel NEARLY 900M!!!!
MEMBLOCK: request 1536 //Matrix A: 32x9
MEMBLOCK: request 1080000 //Matrix B: 9x30000
MEMBLOCK: request 3840000 // Matrix D: 32x30000

The memory requested for _tmp_a and _tmp_b looks excessive, and I believe both allocations contribute heavily to the performance drop.
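As a sanity check on the numbers above, the raw FP32 sizes of the three user tensors can be worked out from M, N and K. The sketch below is not part of ACL: `f32_matrix_bytes` and `f32_matrix_bytes_padded` are hypothetical helpers, and rounding each row of A up to a multiple of 4 elements is only an assumption that happens to reproduce the 1536-byte request.

```cpp
#include <cstdint>

// Raw (unpadded) size in bytes of a dense rows x cols FP32 matrix.
constexpr std::uint64_t f32_matrix_bytes(std::uint64_t rows, std::uint64_t cols)
{
    return rows * cols * sizeof(float);
}

// Size in bytes when each row is rounded up to a multiple of `align` elements.
// The rounding rule is an assumption for illustration, not ACL's documented padding policy.
constexpr std::uint64_t f32_matrix_bytes_padded(std::uint64_t rows, std::uint64_t cols, std::uint64_t align)
{
    return rows * ((cols + align - 1) / align * align) * sizeof(float);
}
```

With these helpers, B (9x30000) and D (32x30000) come out at 1080000 and 3840000 bytes, matching the log, so the 899640000-byte temporary is 833 times the size of B itself.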

I've posted my test code below; please check whether you can reproduce the same issue.


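#include <chrono>
#include <cstdlib>
#include <iostream>
#include <unistd.h> // getopt, optarg

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

// This helper was not included in the original post.
// Assumption: it returns the current wall-clock time in milliseconds.
static unsigned long get_cur_time()
{
    using namespace std::chrono;
    return static_cast<unsigned long>(
        duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count());
}

int main(int argc, char *argv[])
{
    unsigned int M = 32;
    unsigned int N = 30000;
    unsigned int K = 9;

    int rep = 10;
    int res;

    // -r <count> overrides the number of timed repetitions
    while((res = getopt(argc, argv, "r:")) != -1)
    {
        switch(res)
        {
            case 'r':
                rep = strtoul(optarg, NULL, 10);
                break;
            default:
                break;
        }
    }

    // TensorShape takes (width, height), i.e. (columns, rows)
    TensorShape AShape(K, M);
    TensorShape BShape(N, K);
    TensorShape OShape(N, M);

    Tensor ATensor, BTensor, OTensor;

    ATensor.allocator()->init(TensorInfo(AShape, Format::F32));
    BTensor.allocator()->init(TensorInfo(BShape, Format::F32));
    OTensor.allocator()->init(TensorInfo(OShape, Format::F32));

    // O = alpha * A * B, with no C matrix (beta = 0)
    NEGEMM armGemm;
    armGemm.configure(&ATensor, &BTensor, nullptr, &OTensor, 1.0f, 0.0f);

    // Allocate the tensors only after the function has been configured
    ATensor.allocator()->allocate();
    BTensor.allocator()->allocate();
    OTensor.allocator()->allocate();

    // Warm-up run
    armGemm.run();

    unsigned long start = get_cur_time();

    for(int i = 0; i < rep; i++)
        armGemm.run();

    unsigned long end = get_cur_time();

    std::cout << "SGEMM settings: M=" << M << " K=" << K << " N=" << N << std::endl;
    std::cout << "repetion: " << rep << " used time: " << end - start << std::endl;

    return 0;
}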

Thanks,

Cyber

gmiodice · 2017-05-05T11:58:37Z

Hi @cyberfire,

Could you please tell us which release of the Compute Library you are using? If you are not using the latest one, could you re-run your test with it?

Also, may I ask you to test SGEMM with different values of M, N, K (e.g. 12544, 64, 147; 3136, 64, 64; ...)?

Many thanks,
Gian Marco

cyberfire · 2017-05-05T14:40:22Z

@gmiodice

I'm using 17.04. I just noticed that 17.05 is online. I will try this new version.

Thanks,
Cyber

cyberfire · 2017-05-05T15:58:43Z

I've tried 17.05. The performance is better: each loop iteration now takes about 72 ms, and the memory usage is much lower than in 17.04.

MEMBLOCK: request 959616
MEMBLOCK: request 1080000
MEMBLOCK: request 1536
MEMBLOCK: request 1080000
MEMBLOCK: request 3840000
SGEMM settings: M=32 K=9 N=30000
repetion: 10 used time: 720

Thanks,

Cyber

cyberfire · 2017-05-05T23:02:55Z

Hi, all
By setting debug=0 when building ACL, the time cost is reduced to ~14 ms per MM.
I had compared the performance of debug=0 and debug=1 in 17.04 before and saw no big difference,
whereas in 17.05 there is a big gap between these two modes.

Thanks,

Cyber

AnthonyBarbier · 2017-05-09T13:58:11Z

I'm not sure I understand what the issue is.
Nothing has changed between 17.04 and 17.05 in terms of debug.

gfursin · 2017-05-09T20:58:51Z

By the way, @cyberfire, if it's of interest, I just added your benchmark to the CK workflow framework. The idea is to make it simpler to build and run both ACL and such benchmarks on different hosts (Windows, Linux) and targets (such as Android). CK also "auto-calibrates" such small programs, i.e. it automatically increases your "rep" variable until the program runs for around 5 seconds (and then divides the total execution time by "rep"). If you have the Android NDK, SDK, Git and Python installed, you can check it out as follows:

$ (sudo) pip install ck
$ ck pull repo:ck-math
$ ck install package:lib-acl-master-universal --target_os=android21-arm64 --env.USE_NEON=ON

And if you have an Android device connected via adb to your Linux or Windows machine, you can run your benchmark as follows:
$ ck compile program:acl-sgemm-neon-example --target_os=android21-arm64
$ ck run program:acl-sgemm-neon-example --target_os=android21-arm64

The idea is to gradually provide a unified way to run such benchmarks and share results ... Hope it will be of some use ;) ...

cyberfire · 2017-05-12T03:40:24Z

There is no pending issue ....

GeorgeARM added the Help wanted label May 10, 2017

cyberfire closed this as completed May 12, 2017

developer-compute mentioned this issue Jul 12, 2021

Execution of Inference Workloads on Hikey970 with layer splitting #882

Closed
