HGEMM performance in Adreno(tm) 740 is not faster than SGEMM #513

Open
cunyangwei opened this issue Oct 31, 2023 · 1 comment
Comments

@cunyangwei

I built CLBlast for Android. Although it runs on the Adreno(tm) 740, I found that HGEMM does not show a significant speedup over SGEMM. For example, when I use

/clblast_client_xgemm --m 4096 --n 4096 --k 4096 --precision 16 --device 0 --platform 0 ,

the performance is 604.8 GFLOPS.

However, when I use

/clblast_client_xgemm --m 4096 --n 4096 --k 4096 --precision 32 --device 0 --platform 0 ,

the performance is 462.8 GFLOPS.

Is that correct? I expected HGEMM to reach around 1 TFLOPS.
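To put the two numbers side by side, here is a small sketch of the arithmetic. It assumes the client reports GFLOPS using the usual GEMM convention of 2·m·n·k floating-point operations per call, which is not confirmed in this thread; the function names are only illustrative.

```python
# Rough comparison of the two reported results.
# Assumption: GFLOPS is computed as 2*m*n*k / time (standard GEMM convention).

def gemm_flops(m, n, k):
    """Floating-point operations for one GEMM: one multiply and one add per term."""
    return 2 * m * n * k

def implied_seconds(m, n, k, gflops):
    """Runtime of one GEMM call implied by a reported GFLOPS figure."""
    return gemm_flops(m, n, k) / (gflops * 1e9)

M = N = K = 4096
HGEMM_GFLOPS = 604.8   # reported with --precision 16
SGEMM_GFLOPS = 462.8   # reported with --precision 32
TARGET_GFLOPS = 1000.0 # the hoped-for 1 TFLOPS

speedup = HGEMM_GFLOPS / SGEMM_GFLOPS
gap_to_target = TARGET_GFLOPS / HGEMM_GFLOPS

print(f"FLOPs per GEMM:        {gemm_flops(M, N, K):.3e}")
print(f"HGEMM implied runtime: {implied_seconds(M, N, K, HGEMM_GFLOPS) * 1e3:.1f} ms")
print(f"SGEMM implied runtime: {implied_seconds(M, N, K, SGEMM_GFLOPS) * 1e3:.1f} ms")
print(f"HGEMM over SGEMM:      {speedup:.2f}x")
print(f"Needed for 1 TFLOPS:   {gap_to_target:.2f}x more")
```

Under that assumption, HGEMM is roughly 1.3x faster than SGEMM here, and would need to be about 1.65x faster still to reach 1 TFLOPS.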

@CNugteren
Owner

It could well be that your hardware is slower in FP16 than in FP32, even though using less data saves memory bandwidth. However, it could also be that the CLBlast FP16 code is sub-optimal. One thing I suggest is to compile and run the tuners (see the docs), in particular for FP16, and perhaps even for the 4K x 4K matrices you are interested in. That should reveal whether you can reach 1 TFLOPS on your device.
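As a rough illustration of the bandwidth point, the sketch below estimates the minimum data volume for the 4096 x 4096 x 4096 case. It assumes each of the three matrices is moved between memory and the GPU exactly once, which is a lower bound only; a real kernel re-reads tiles and moves more. The helper name is hypothetical.

```python
# Lower-bound memory traffic for one GEMM (C = A * B), assuming A, B and C
# are each transferred exactly once. Real kernels move more data than this.

def gemm_min_bytes(m, n, k, bytes_per_element):
    """Bytes for A (m x k), B (k x n) and C (m x n) at the given element size."""
    return (m * k + k * n + m * n) * bytes_per_element

M = N = K = 4096
flops = 2 * M * N * K  # standard GEMM operation count

for name, size in (("FP16", 2), ("FP32", 4)):
    total = gemm_min_bytes(M, N, K, size)
    print(f"{name}: {total / 2**20:.0f} MiB minimum, "
          f"{flops / total:.0f} FLOPs per byte")
```

FP16 halves the minimum data volume, but at this size the operation count dwarfs the data volume in either precision, so with good tiling the result is likely to depend more on the device's FP16 arithmetic throughput and on kernel tuning than on bandwidth alone.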
