Sub-optimal performance with Vega FE in FP32 SGEMM #350
Comments
Thanks for reporting. Indeed, there seems to be a speed issue on Vega hardware, as #327 also reports. I assume this is after tuning? What kind of GFLOPS numbers do the CLBlast GEMM tuners report? I don't have access to Vega hardware myself, so it is a bit tricky to play around with the kernels. If you can point me to some OpenCL code that is fast (e.g. the code used by TensorFlow), then I can try to implement that in CLBlast as well. By the way, you can also compile with |
Thanks for the reply. I am new to using CLBlast (and Linux in general), so I could be missing something. Regarding "What kind of GFLOPS numbers do the CLBlast gemm tuners report?" and "If you can point me to some OpenCL code that is fast (e.g. the code used by TensorFlow), then I can try to implement that in CLBlast as well": the performance was checked using this code (on the same system). This code tries to run the two GPUs in parallel (I tested by disabling one GPU; there was no significant change in overall performance):
Output:
By the way, you can also compile with |
When I tried to compile the performance measurement clients, I encountered an error. The build got stuck at 39%, while linking the CXX executable clblast_client_xaxpybatched.
On the other hand, I got a chance to test a Radeon VII with PyCLBlast, and it does perform better than the Vega, even though the two have similar peak TFLOPS. As a comparison, |
Thanks for your comments, let me react a bit:
|
Thanks for the reply, I will try to fix the compilation and run the CLBlast benchmark later. Once I figure that out I will benchmark using rocBLAS to see how GEMM performs there and get back to you. |
I can confirm the same poor performance on a Vega 64, which is even a little slower than my old R9 290X (ROCm OpenCL). |
I am using PyOpenCL with the PyCLBlast wrapper on Ubuntu 18.04.1 with Python 3.5. The GPU is a Vega FE (x2, but only one was used in the test).
When testing the SGEMM GFLOPS on the Vega FE using the supplied script:
CLBlast/src/pyclblast/samples/sgemm.py
In my case, on the Vega FE, after applying the tuning results and restarting the Jupyter notebook server, I am still getting at most 3.5 TFLOPS of SGEMM performance, which is far below the theoretical peak of 12 TFLOPS.
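As a sanity check on the gap being described (using the figures quoted in this issue, which are assumptions here rather than independent measurements), the achieved rate works out to under a third of the card's theoretical peak:

```python
# Figures quoted in this issue (assumed, not re-measured):
achieved_tflops = 3.5   # best SGEMM rate observed with PyCLBlast on Vega FE
peak_tflops = 12.0      # theoretical FP32 peak claimed for the Vega FE

efficiency = 100.0 * achieved_tflops / peak_tflops
print("%.0f%% of peak" % efficiency)  # prints: 29% of peak
```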
Also, in TensorFlow 1.12 I am able to get around 7 TFLOPS or a bit more for SGEMM. However, there the initialization of variables has too much overhead, which makes it slow overall and unsuitable for matrix math. So CLBlast does seem quite a bit slower. Is there anything I am missing regarding the performance hit?
Below is the code I used for the test, the modifications are to make the matrices large enough and to compute the time:
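The exact snippet was not preserved in this thread. Below is a minimal sketch of such a benchmark along the lines described (large matrices, timed over several repetitions), assuming the `gemm` call signature used in the PyCLBlast sgemm sample; the matrix sizes and repetition count are illustrative, not the reporter's actual values:

```python
import time
import numpy as np

def sgemm_gflops(m, n, k, seconds):
    # An (m x k) by (k x n) SGEMM performs 2*m*n*k floating-point operations.
    return 2.0 * m * n * k / seconds / 1e9

def benchmark(m=4096, n=4096, k=4096, repetitions=10):
    # Imports are local so the helper above works without an OpenCL stack.
    import pyopencl as cl
    from pyopencl.array import Array
    import pyclblast

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # Host matrices, then device buffers.
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    c = np.empty((m, n), dtype=np.float32)
    cla, clb, clc = (Array(queue, x.shape, x.dtype) for x in (a, b, c))
    cla.set(a); clb.set(b); clc.set(c)

    # One warm-up call, then time the average of several repetitions.
    pyclblast.gemm(queue, m, n, k, cla, clb, clc, a_ld=k, b_ld=n, c_ld=n)
    queue.finish()
    start = time.time()
    for _ in range(repetitions):
        pyclblast.gemm(queue, m, n, k, cla, clb, clc, a_ld=k, b_ld=n, c_ld=n)
    queue.finish()
    elapsed = (time.time() - start) / repetitions
    print("SGEMM: %.1f GFLOPS" % sgemm_gflops(m, n, k, elapsed))
```

Calling `benchmark()` on a tuned installation should report a rate in the ballpark of the numbers discussed above.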
Output: