Sub-optimal performance with Vega FE in FP32 SGEMM #350

Open · SandboChang opened this issue Feb 14, 2019 · 6 comments

SandboChang commented Feb 14, 2019

I am using PyOpenCL and the PyCLBlast wrapper on Ubuntu 18.04.1 with Python 3.5. The GPU is a Vega FE (I have two, but only one was used in this test).

I tested SGEMM GFLOPS on the Vega FE using the supplied sample script:
CLBlast/src/pyclblast/samples/sgemm.py

After applying the tuning results and restarting the Jupyter Notebook server, I am still getting at most about 3.5 TFLOPS of SGEMM performance, which is far below the theoretical peak of 12 TFLOPS.
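(Since the system has two GPUs and the tuned parameters are stored per device, one thing worth ruling out is that create_some_context() picked a different device than the one that was tuned. A minimal sketch using standard PyOpenCL calls to list the devices and select one explicitly; the indices below are placeholders to adjust for the Vega FE:)

import pyopencl as cl

# List every OpenCL platform/device visible to PyOpenCL
for p_idx, platform in enumerate(cl.get_platforms()):
    for d_idx, device in enumerate(platform.get_devices()):
        print("platform %d, device %d: %s / %s" % (p_idx, d_idx, platform.name, device.name))

# Pick one device explicitly instead of relying on create_some_context()
device = cl.get_platforms()[0].get_devices()[0]  # placeholder indices
ctx = cl.Context([device])
queue = cl.CommandQueue(ctx)
print("Using device:", device.name)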

Also, in TensorFlow 1.12 I am able to get around 7 TFLOPS or a bit more for SGEMM. However, variable initialization there has too much overhead, which makes it slow overall and unsuitable for plain matrix math. So CLBlast does seem quite a bit slower; is there anything I might be missing that explains the performance gap?

Below is the code I used for the test; the only modifications to the sample are to make the matrices large enough and to time the GEMM call:

#!/usr/bin/env python

# This file is part of the CLBlast project. The project is licensed under Apache Version 2.0.
# This file follows the PEP8 Python style guide and uses a max-width of 100 characters per line.
#
# Author(s):
#   Cedric Nugteren <www.cedricnugteren.nl>

import numpy as np
import pyopencl as cl
from pyopencl.array import Array
import pyclblast
import time as ti

# Settings for this sample
dtype = 'float32'

print("# Setting up OpenCL")
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

print("# Setting up Numpy arrays")
m, n, k = 16384, 16384, 16384
a = np.random.rand(m, k).astype(dtype=dtype)
b = np.random.rand(k, n).astype(dtype=dtype)
c = np.random.rand(m, n).astype(dtype=dtype)

print("# Setting up OpenCL arrays")
cla = Array(queue, a.shape, a.dtype)
clb = Array(queue, b.shape, b.dtype)
clc = Array(queue, c.shape, c.dtype)
cla.set(a)
clb.set(b)
clc.set(c)

print("# Example level-3 operation: GEMM")
# start timer
start = ti.time()
pyclblast.gemm(queue, m, n, k, cla, clb, clc, a_ld=k, b_ld=n, c_ld=n)
queue.finish()
# stop timer
end = ti.time()
print("GPUTakes: ", end - start)
print("GPU_GFLOPS: ", (m*n*(2*k+2) / (1E9 * (end-start))))
# print("# Matrix C result: %s" % clc.get())
# print("# Expected result: %s" % (np.dot(a, b)))

Output:

# Setting up OpenCL
# Setting up Numpy arrays
# Setting up OpenCL arrays
# Example level-3 operation: GEMM
GPUTakes:  2.4747931957244873
GPU_GFLOPS:  3554.4908998122637
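
(One caveat with timing a single call: CLBlast compiles its OpenCL kernels the first time a routine is called in a process, so the 2.47 s above likely also includes kernel compilation. A minimal sketch, reusing the variables from the script above, that does one warm-up call and then averages a few timed runs; it uses the common 2*m*n*k FLOP count, which at this size is essentially the same as the script's m*n*(2*k+2):)

# Warm-up: the first call also triggers CLBlast's OpenCL kernel compilation
pyclblast.gemm(queue, m, n, k, cla, clb, clc, a_ld=k, b_ld=n, c_ld=n)
queue.finish()

runs = 5
start = ti.time()
for _ in range(runs):
    pyclblast.gemm(queue, m, n, k, cla, clb, clc, a_ld=k, b_ld=n, c_ld=n)
queue.finish()
seconds_per_run = (ti.time() - start) / runs
print("GPU_GFLOPS (avg of %d runs): %.1f" % (runs, 2.0 * m * n * k / (1e9 * seconds_per_run)))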
CNugteren (Owner) commented:

Thanks for reporting. Indeed, there seems to be some speed issue on Vega hardware, as #327 also reports. I assume this is after tuning? What kind of GFLOPS numbers do the CLBlast gemm tuners report?

I don't have access to Vega hardware myself, so it is a bit tricky to play around with the kernels. If you can point me to some OpenCL code that is fast (e.g. the code used by TensorFlow), then I can try to implement that in CLBlast as well.

By the way, you can also compile with -DCLIENTS=ON and then run clblast_client_xgemm for a nice performance measurement. It will also compare to AMD's clBLAS if you have it installed.

SandboChang (Author) commented Feb 15, 2019

Thanks for the reply. I am new to CLBlast (and to Linux in general), so I could be missing something.
I followed the documentation to tune the GPUs:
https://github.com/CNugteren/CLBlast/blob/master/doc/tuning.md
running the steps under "Using the tuning results". Just in case, I also ran the second-to-last command once with python3 in addition to python. No errors were reported. I then rebooted and restarted the Jupyter Notebook server.

"What kind of GFLOPS numbers do the CLBlast gemm tuners report?"
My apologies, I wasn't paying full attention to that, but I remember they varied a lot, mostly between 1000 and 3000 GFLOPS.

"If you can point me to some OpenCL code that is fast (e.g. the code used by TensorFlow), then I can try to implement that in CLBlast as well."
For TensorFlow I used the package provided with the ROCm stack, installed as a Python 3 package via pip3 install tensorflow-rocm:
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream
Unfortunately, I am not familiar with how things are implemented behind the scenes over there.

Performance was checked with the code below (on the same system). It tries to run the two GPUs in parallel (I also tested with one GPU disabled; there was no significant change in overall performance):

import os
import sys
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import time

n = 8192
dtype = tf.float32

with tf.device("/gpu:0"):
    d1_matrix1 = tf.Variable(tf.ones((n, n), dtype=dtype))
    d1_matrix2 = tf.Variable(tf.ones((n, n), dtype=dtype)*2)
    d1_product = tf.matmul(d1_matrix1, d1_matrix2)
    d1_add     = tf.add(d1_matrix1, d1_matrix2)
    
with tf.device("/gpu:1"):
    d2_matrix1 = tf.Variable(tf.ones((n, n), dtype=dtype))
    d2_matrix2 = tf.Variable(tf.ones((n, n), dtype=dtype)*3)
    d2_product = tf.matmul(d2_matrix1, d2_matrix2)
    d2_add     = tf.add(d2_matrix1, d2_matrix2)
    
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

# avoid optimizing away redundant nodes (note: this config is not actually passed to the Session created below)
config = tf.ConfigProto(graph_options=tf.GraphOptions(optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)))

sess = tf.Session()
sess.run(tf.global_variables_initializer())

iters = 10
start = time.time()
for i in range(iters):
    result = sess.run([d1_product,d2_product])
end = time.time()
ops = n**3 + (n-1)*n**2 # n^2*(n-1) additions, n^3 multiplications
elapsed = (end - start)
rate = iters*ops/elapsed/10**9
print('\n %d x %d matmul took: %.2f sec, %.2f G ops/sec' % (n, n,
                                                            elapsed/iters,
                                                            rate,))

Output:

8192 x 8192 matmul took: 0.13 sec, 8247.43 G ops/sec
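
(As a sanity check on that figure: the script counts roughly 2*n^3 ops for a single n x n matmul, even though each iteration launches one matmul per GPU, so the number is per-matmul throughput rather than the aggregate over both cards. A quick back-of-the-envelope check against the timing printed above, where the 0.13 s per iteration is taken from the rounded output:)

# Rough cross-check of the reported figure using the timing printed above
n = 8192
ops = n**3 + (n - 1) * n**2      # ~2*n^3, the same count used in the script
secs_per_iter = 0.13             # "matmul took: 0.13 sec" (rounded in the printout)
print("G ops/sec ~= %.0f" % (ops / secs_per_iter / 1e9))  # roughly 8.5e3, consistent with 8247.43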

"By the way, you can also compile with -DCLIENTS=ON and then run clblast_client_xgemm for a nice performance measurement."
Sure, I will check that soon. Actually, today I will be installing the new Radeon VII in the system; I will follow up on whether that card shows any performance difference from Vega 64 with CLBlast.

SandboChang (Author) commented Feb 16, 2019

When I tried to compile the performance measurement client, I ran into an error. I used:

mkdir buildBench && cd buildBench
cmake .. -DCLIENTS=ON
make

It failed at the point (39%) where it tries to link clblast_client_xaxpybatched:

[ 38%] Built target clblast
Scanning dependencies of target clblast_client_xaxpybatched
[ 39%] Building CXX object CMakeFiles/clblast_client_xaxpybatched.dir/test/performance/routines/levelx/xaxpybatched.cpp.o
[ 39%] Linking CXX executable clblast_client_xaxpybatched
/usr/local/lib/libcblas.a(cblas_srotg.c.o): In function `cblas_srotg':
cblas_srotg.c:(.text+0x1): undefined reference to `srotg_'
/usr/local/lib/libcblas.a(cblas_srotmg.c.o): In function `cblas_srotmg':
cblas_srotmg.c:(.text+0x13): undefined reference to `srotmg_'
/usr/local/lib/libcblas.a(cblas_srot.c.o): In function `cblas_srot':
cblas_srot.c:(.text+0x3c): undefined reference to `srot_'
/usr/local/lib/libcblas.a(cblas_srotm.c.o): In function `cblas_srotm':
cblas_srotm.c:(.text+0x21): undefined reference to `srotm_'
/usr/local/lib/libcblas.a(cblas_sswap.c.o): In function `cblas_sswap':
cblas_sswap.c:(.text+0x21): undefined reference to `sswap_'
/usr/local/lib/libcblas.a(cblas_sscal.c.o): In function `cblas_sscal':
cblas_sscal.c:(.text+0x28): undefined reference to `sscal_'
/usr/local/lib/libcblas.a(cblas_scopy.c.o): In function `cblas_scopy':
cblas_scopy.c:(.text+0x21): undefined reference to `scopy_'
/usr/local/lib/libcblas.a(cblas_saxpy.c.o): In function `cblas_saxpy':
cblas_saxpy.c:(.text+0x35): undefined reference to `saxpy_'
(...)
cblas_zsyrk.c:(.text+0x13c): undefined reference to `zsyrk_'
/usr/local/lib/libcblas.a(cblas_zsyr2k.c.o): In function `cblas_zsyr2k':
cblas_zsyr2k.c:(.text+0x14c): undefined reference to `zsyr2k_'
collect2: error: ld returned 1 exit status
CMakeFiles/clblast_client_xaxpybatched.dir/build.make:102: recipe for target 'clblast_client_xaxpybatched' failed
make[2]: *** [clblast_client_xaxpybatched] Error 1
CMakeFiles/Makefile2:68: recipe for target 'CMakeFiles/clblast_client_xaxpybatched.dir/all' failed
make[1]: *** [CMakeFiles/clblast_client_xaxpybatched.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

On the other hand, I got a chance to test the Radeon VII with PyCLBlast. It does perform better than Vega, even though the two cards have similar peak TFLOPS. For comparison:

Vega FP32: 4568 GFLOPS, FP64: 609 GFLOPS
Radeon VII FP32: 5763 GFLOPS, FP64: 1437 GFLOPS

(Updated on 2019-02-17 after re-running the tuner and powering up the third Vega.)
I shared some results in a table here:
https://www.reddit.com/r/Amd/comments/ar72t0/project_one_more_vega/egl7b29/

CNugteren (Owner) commented Feb 18, 2019

Thanks for your comments; let me respond to a few points:

  • First of all, the linking issue you see is that CLBlast tries to link against a regular BLAS library (the cblas_* symbols), but apparently it cannot, even though CMake did find one. I'm not sure what went wrong; perhaps removing your CMake cache files and re-running CMake could help? It will report where it found your BLAS library; double-check that this is indeed the correct location.

  • Second, about the speed issue: I asked you for the results of the tuners (e.g. ./clblast_tuner_xgemm), because there are cases where the tuners do report high speeds but the library in the end doesn't. But that does not seem to be the case here.

  • Lastly, and most importantly, the reason I'm asking for reference code is that I don't have access to such a device, but if I see other GEMM code, I can try to reproduce it in CLBlast. So thanks for pointing me to AMD's TF port. I suspect the actual GEMM kernel used might be in AMD's MIOpen, in the separate MIOpenGEMM library, or in the regular rocBLAS. Perhaps, if you have the time, you could also try one of those three out stand-alone and see how they perform. I suspect rocBLAS, for example, ships a tool to measure performance. If not, that's also OK; I will dig through the code a bit when I have time.

SandboChang (Author) commented:

Thanks for the reply. I will try to fix the compilation and run the CLBlast benchmark later.
As for rocBLAS, I am still stuck trying to compile some of their examples, even though I installed rocBLAS from their prebuilt package, which should be a complete installation.

Once I figure that out I will benchmark using rocBLAS to see how GEMM performs there and get back to you.

blueberry commented May 31, 2019

I can confirm the same poor performance on a Vega 64, which is even a little slower than my old R9 290X. This is with ROCm OpenCL.
