HGEMM performance in Adreno(tm) 740 is not faster than SGEMM #513

cunyangwei · 2023-10-31T07:20:44Z

I built CLBlast for Android. Although it runs on the Adreno(tm) 740, I found that HGEMM does not show a significant speedup over SGEMM. For example, when I use

/clblast_client_xgemm --m 4096 --n 4096 --k 4096 --precision 16 --device 0 --platform 0 ,

the performance is 604.8 GFLOPS.

However, when I use

/clblast_client_xgemm --m 4096 --n 4096 --k 4096 --precision 32 --device 0 --platform 0 ,

the performance is 462.8 GFLOPS.

Is that correct? I expected HGEMM to reach around 1 TFLOPS.
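To put the two numbers side by side, here is a small sketch of the arithmetic. It assumes the client counts a GEMM as 2·m·n·k floating-point operations (the usual convention, not confirmed in this thread); the function and variable names are made up for illustration.

```python
# Sketch only: compares the two reported results from this issue.
# Assumption: GFLOPS is computed from 2*m*n*k operations per GEMM.

def gemm_flop_count(m: int, n: int, k: int) -> int:
    """Floating-point operations for C = A*B (one multiply and one add per term)."""
    return 2 * m * n * k

M = N = K = 4096
HGEMM_GFLOPS = 604.8   # reported with --precision 16
SGEMM_GFLOPS = 462.8   # reported with --precision 32
TARGET_GFLOPS = 1000.0 # the hoped-for 1 TFLOPS

total_flop = gemm_flop_count(M, N, K)

# Implied time per GEMM call at each reported rate.
hgemm_seconds = total_flop / (HGEMM_GFLOPS * 1e9)
sgemm_seconds = total_flop / (SGEMM_GFLOPS * 1e9)

# Observed FP16-over-FP32 speedup and how far HGEMM is from the target.
speedup = HGEMM_GFLOPS / SGEMM_GFLOPS
fraction_of_target = HGEMM_GFLOPS / TARGET_GFLOPS

print(f"FLOP per GEMM:        {total_flop}")
print(f"HGEMM time per call:  {hgemm_seconds * 1e3:.1f} ms")
print(f"SGEMM time per call:  {sgemm_seconds * 1e3:.1f} ms")
print(f"HGEMM / SGEMM:        {speedup:.2f}x")
print(f"Fraction of 1 TFLOPS: {fraction_of_target:.0%}")
```

So HGEMM is faster here, by roughly 1.3x, but well short of the 2x that halving the data size might suggest.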

CNugteren · 2023-11-03T20:33:52Z

It could well be that your hardware is slower in FP16 compared to FP32, even though there are memory bandwidth savings by using less data. However, it can also be that the CLBlast FP16 code is sub-optimal. One thing I suggest is to compile and run the tuners (see the docs), in particular for FP16, and perhaps even for the 4Kx4K matrices you are interested in. That should reveal whether you can achieve 1 TFLOPS on your device.
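As a rough illustration of the memory bandwidth point, the sketch below estimates the minimum data traffic for a 4096x4096x4096 GEMM in both precisions. It assumes an idealized case where A and B are each read once and C is written once, which real kernels will not achieve exactly; the helper names are hypothetical.

```python
# Sketch only: idealized memory traffic for one GEMM, FP32 vs FP16.
# Assumption: A (m x k) and B (k x n) are read once, C (m x n) is written once.

def min_gemm_bytes(m: int, n: int, k: int, bytes_per_element: int) -> int:
    """Lower bound on bytes moved to and from memory for C = A*B."""
    return (m * k + k * n + m * n) * bytes_per_element

def arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: int) -> float:
    """FLOP per byte under the idealized traffic model above."""
    return (2 * m * n * k) / min_gemm_bytes(m, n, k, bytes_per_element)

M = N = K = 4096

fp32_bytes = min_gemm_bytes(M, N, K, 4)  # 4 bytes per FP32 element
fp16_bytes = min_gemm_bytes(M, N, K, 2)  # 2 bytes per FP16 element

fp32_intensity = arithmetic_intensity(M, N, K, 4)
fp16_intensity = arithmetic_intensity(M, N, K, 2)

print(f"FP32 minimum traffic: {fp32_bytes / 1e6:.1f} MB, {fp32_intensity:.0f} FLOP/byte")
print(f"FP16 minimum traffic: {fp16_bytes / 1e6:.1f} MB, {fp16_intensity:.0f} FLOP/byte")
```

Under this model FP16 halves the data moved, but at this matrix size both precisions already perform hundreds of operations per byte, so the result depends mostly on how fast the device and the kernel can do FP16 arithmetic. That is what the tuners help to find out.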

CNugteren added the performance label Oct 31, 2023

