NEGEMM performance issue #93

Closed
cyberfire opened this issue May 5, 2017 · 7 comments

@cyberfire

Hi, Guys,

I compared the single-core SGEMM performance of ACL and OpenBLAS on an A72 and found that ACL is much slower.
The ACL version takes 116 ms per matrix multiplication (MM), while the OpenBLAS version takes just 8.8 ms.

The parameter settings are: M=32, N=30000, K=9, alpha=1, beta=0.
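For scale, the timings above can be converted into throughput. The sketch below counts each multiply-add as two floating-point operations, so one call performs 2*M*N*K operations; the function names are illustrative and are not part of ACL or OpenBLAS.

```cpp
#include <cstdint>

// Floating-point operations for one SGEMM call, counting each
// multiply-add as two operations: 2 * M * N * K.
constexpr std::uint64_t sgemm_flops(std::uint64_t m, std::uint64_t n, std::uint64_t k)
{
    return 2 * m * n * k;
}

// Achieved throughput in GFLOP/s for a call that took `ms` milliseconds.
constexpr double gflops(std::uint64_t flops, double ms)
{
    return static_cast<double>(flops) / (ms * 1e6);
}

// Shape from this report: M=32, N=30000, K=9.
static_assert(sgemm_flops(32, 30000, 9) == 17280000, "17.28 MFLOP per call");
```

With these numbers, 116 ms per call corresponds to roughly 0.15 GFLOP/s and 8.8 ms to roughly 1.96 GFLOP/s.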

I did some debugging by adding logs to TensorAllocator::allocate() and found that ACL requests a lot of memory for temporary tensors:

MEMBLOCK: request 3478608 //for _interleave_kernel
MEMBLOCK: request 899640000 //for _transpose_kernel NEARLY 900M!!!!
MEMBLOCK: request 1536 //Matrix A: 32x9
MEMBLOCK: request 1080000 //Matrix B: 9x30000
MEMBLOCK: request 3840000 // Matrix D: 32x30000
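As a sanity check on the logged requests, the unpadded F32 footprint of each matrix can be computed directly. This is a minimal sketch with an illustrative helper name; it assumes a 4-byte float and no padding.

```cpp
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");

// Unpadded size in bytes of a rows x cols F32 matrix.
constexpr std::size_t f32_bytes(std::size_t rows, std::size_t cols)
{
    return rows * cols * sizeof(float);
}

static_assert(f32_bytes(9, 30000) == 1080000, "Matrix B matches the logged request");
static_assert(f32_bytes(32, 30000) == 3840000, "Matrix D matches the logged request");
static_assert(f32_bytes(32, 9) == 1152, "Matrix A unpadded; the log shows 1536");
```

B and D match the log exactly. The 1536 bytes logged for A would be consistent with each 9-float row being padded to 12 floats (32 * 12 * 4), although nothing in this thread confirms that. The 899640000-byte request is exactly 833 times the size of B.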

The memory requested for _tmp_a and _tmp_b looks excessive, and I believe both allocations cause a large performance drop.

I have posted my test code here; please check whether you can reproduce the issue:

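#include <unistd.h> // getopt, optarg

#include <chrono>
#include <cstdlib> // strtoul
#include <iostream>

#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

// get_cur_time() was not included in the original post. This stand-in is an
// assumption: it returns a monotonic timestamp in milliseconds, which matches
// the units of the "used time" output reported later in this thread.
static unsigned long get_cur_time()
{
    using namespace std::chrono;
    return static_cast<unsigned long>(
        duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count());
}

int main(int argc, char *argv[])
{
    unsigned int M = 32;
    unsigned int N = 30000;
    unsigned int K = 9;

    int rep = 10;
    int res;

    // -r <count> overrides the number of timed repetitions.
    while ((res = getopt(argc, argv, "r:")) != -1)
    {
        switch (res)
        {
            case 'r':
                rep = strtoul(optarg, NULL, 10);
                break;
            default:
                break;
        }
    }

    // TensorShape takes (width, height), i.e. (columns, rows).
    TensorShape AShape(K, M);
    TensorShape BShape(N, K);
    TensorShape OShape(N, M);

    Tensor ATensor, BTensor, OTensor;

    ATensor.allocator()->init(TensorInfo(AShape, Format::F32));
    BTensor.allocator()->init(TensorInfo(BShape, Format::F32));
    OTensor.allocator()->init(TensorInfo(OShape, Format::F32));

    // O = alpha * A * B, with no C matrix (beta = 0).
    NEGEMM armGemm;
    armGemm.configure(&ATensor, &BTensor, nullptr, &OTensor, 1.0, 0.0);

    ATensor.allocator()->allocate();
    BTensor.allocator()->allocate();
    OTensor.allocator()->allocate();

    // Warm-up run, excluded from the timing.
    armGemm.run();

    unsigned long start = get_cur_time();

    for (int i = 0; i < rep; i++)
        armGemm.run();

    unsigned long end = get_cur_time();

    std::cout << "SGEMM settings: M=" << M << " K=" << K << " N=" << N << std::endl;
    std::cout << "repetion: " << rep << " used time: " << end - start << std::endl;

    return 0;
}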

Thanks,

Cyber

@gmiodice
Contributor

gmiodice commented May 5, 2017

Hi @cyberfire,

Could you please tell us which release of the Compute Library you are using? If you are not using the latest one, could you re-run your test with it?

Also, may I ask you to test SGEMM with different values of M, N, K (e.g. 12544, 64, 147; 3136, 64, 64; ...)?

Many thanks,
Gian Marco
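A sweep over several M, N, K combinations, as requested above, could be structured like the sketch below. It is library-independent: reference_sgemm is a naive stand-in workload and is not the ACL or OpenBLAS kernel, and GemmShape, time_ms and bench_shape are hypothetical names introduced here for illustration. To measure a real library, the lambda in bench_shape would call that library's GEMM instead.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct GemmShape
{
    std::size_t m, n, k;
};

// Naive row-major reference: C[MxN] = A[MxK] * B[KxN].
void reference_sgemm(const GemmShape &s, const std::vector<float> &a,
                     const std::vector<float> &b, std::vector<float> &c)
{
    for (std::size_t i = 0; i < s.m; ++i)
    {
        for (std::size_t j = 0; j < s.n; ++j)
        {
            float acc = 0.0f;
            for (std::size_t p = 0; p < s.k; ++p)
                acc += a[i * s.k + p] * b[p * s.n + j];
            c[i * s.n + j] = acc;
        }
    }
}

// Average milliseconds per call over `reps` timed runs, after one warm-up run.
template <typename Fn>
double time_ms(Fn &&fn, int reps)
{
    fn();
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        fn();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / reps;
}

// Allocate inputs for one shape and time the reference kernel on it.
double bench_shape(const GemmShape &s, int reps)
{
    std::vector<float> a(s.m * s.k, 1.0f), b(s.k * s.n, 1.0f), c(s.m * s.n, 0.0f);
    return time_ms([&] { reference_sgemm(s, a, b, c); }, reps);
}
```

Calling bench_shape once for {32, 30000, 9}, {12544, 64, 147} and {3136, 64, 64} would show whether a slowdown is specific to the small-K, large-N case.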

@cyberfire
Author

@gmiodice

I'm using 17.04. I just noticed that 17.05 is online. I will try this new version.

Thanks,
Cyber

@cyberfire
Author

I've tried 17.05 and the performance is better. Each run now takes about 72 ms, and memory usage is much lower than in 17.04.

MEMBLOCK: request 959616
MEMBLOCK: request 1080000
MEMBLOCK: request 1536
MEMBLOCK: request 1080000
MEMBLOCK: request 3840000
SGEMM settings: M=32 K=9 N=30000
repetion: 10 used time: 720

Thanks,

Cyber

@cyberfire
Author

Hi all,
Building ACL with debug=0 reduces the time to ~14 ms per MM.
I had compared debug=0 and debug=1 in 17.04 before and saw no big difference,
whereas in 17.05 there is a big gap between the two modes.

Thanks,

Cyber

@AnthonyBarbier
Contributor

I'm not sure I understand what the issue is.
Nothing has changed between 17.04 and 17.05 in terms of debug.

@gfursin

gfursin commented May 9, 2017

By the way, @cyberfire, if it's of interest, I just added your benchmark to the CK workflow framework. The idea is to make it simpler to build and run both ACL and such benchmarks on different hosts (Windows, Linux) and targets (such as Android). CK also "auto-calibrates" such small programs, i.e. it automatically increases your "rep" variable until the program runs for around 5 seconds (and then divides the total execution time by "rep"). If you have the Android NDK, SDK, Git and Python installed, you can check it out as follows:

$ (sudo) pip install ck
$ ck pull repo:ck-math
$ ck install package:lib-acl-master-universal --target_os=android21-arm64 --env.USE_NEON=ON

And if you have an Android device connected via adb to your Linux or Windows machine, you can run your benchmark as follows:
$ ck compile program:acl-sgemm-neon-example --target_os=android21-arm64
$ ck run program:acl-sgemm-neon-example --target_os=android21-arm64

The idea is to gradually provide a unified way to run such benchmarks and share results ... Hope it will be of some use ;) ...
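The auto-calibration idea described in this comment (grow the repetition count until a run is long enough, then divide the total time by the count) can be sketched as follows. This is not CK's actual implementation: the doubling strategy, the max_reps cap and all names here are assumptions made for illustration.

```cpp
#include <chrono>

struct Calibrated
{
    int    reps;        // repetition count of the final batch
    double ms_per_call; // final batch time divided by reps
};

// Doubles `reps` until one batch of calls lasts at least `target_ms`
// (or `max_reps` is reached), then reports the average time per call.
template <typename Fn>
Calibrated auto_calibrate(Fn &&fn, double target_ms, int max_reps = 1 << 20)
{
    int reps = 1;
    for (;;)
    {
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i)
            fn();
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start;
        if (elapsed.count() >= target_ms || reps >= max_reps)
            return {reps, elapsed.count() / reps};
        reps *= 2;
    }
}
```

For the benchmark in this thread, the callable would wrap armGemm.run() and target_ms would be about 5000 to match the 5 seconds mentioned above.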

@cyberfire
Author

There is no pending issue ....
