NEGEMM performance issue #93
Comments
Hi @cyberfire, could you please tell us which release of the Compute Library you are using? If it is not the latest one, could you re-run your test? May I also ask you to test SGEMM with different values of M, N, K (e.g. 12544, 64, 147; 3136, 64, 64; ...)? Many thanks,
I'm using 17.04. I just noticed that 17.05 is online. I will try this new version. Thanks,
I've tried 17.05. The performance is better: each loop now takes about 72 ms, and memory usage is much lower than with 17.04.
MEMBLOCK: request 959616
Thanks, Cyber
Hi all, Thanks, Cyber
I'm not sure I understand what the issue is?
By the way, @cyberfire, in case it's of interest, I just added your benchmark to the CK workflow framework. The idea is to make it simpler to build and run both ACL and such benchmarks on different hosts (Windows, Linux) and targets (such as Android). CK also "auto-calibrates" such small programs, i.e. it automatically increases your "rep" variable until the program runs for around 5 seconds (and then divides the total execution time by "rep"). If you have the Android NDK, SDK, Git and Python installed, you can check it out as follows:
$ (sudo) pip install ck
And if you have an Android device connected via adb to your Linux or Windows machine, you can run your benchmark as follows: The idea is to gradually provide a unified way to run such benchmarks and share results ... Hope it will be of some use ;) ...
There is no pending issue ...
Hi guys,
I compared the single-core SGEMM performance of ACL and OpenBLAS on an A72 and found that ACL is much slower.
The ACL version takes 116 ms per matrix multiplication, while the OpenBLAS version takes just 8.8 ms.
The parameter settings are: M=32, N=30000, K=9, alpha=1, beta=0.
I did a little debugging and, by adding logs in TensorAllocator::allocate(), found that ACL requests a lot of memory for temporary tensors:
MEMBLOCK: request 3478608 //for _interleave_kernel
MEMBLOCK: request 899640000 //for _transpose_kernel NEARLY 900M!!!!
MEMBLOCK: request 1536 //Matrix A: 32x9
MEMBLOCK: request 1080000 //Matrix B: 9x30000
MEMBLOCK: request 3840000 // Matrix D: 32x30000
The memory requested for _tmp_a and _tmp_b looks excessive, and I believe both allocations greatly reduce performance.
I have posted my test code here; please check whether you can reproduce the same issue:
int main(int argc, char * argv[]) {
Thanks,
Cyber