GPU gemv speedup #5257

khaotik · 2016-11-20T12:01:55Z

Added call to cublasSdot in GPU gemv code.

This is a proposed solution for the first issue in #1168. Although it does not solve the whole problem, I'm getting a 4x speedup on my GPU.
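To illustrate why a dot-product call can stand in for gemv here, the following is a minimal host-side sketch, not the code from this PR. It assumes a row-major matrix with a single row, and the names `gemv_ref`, `dot_ref`, and `gemv_single_row` are hypothetical. `dot_ref` only plays the role that a BLAS sdot routine such as `cublasSdot` would play on the device. Applying `alpha` and `beta` after the dot is an assumption of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Reference gemv: y = alpha * A * x + beta * y.
// A is stored row-major with shape (rows x cols).
void gemv_ref(std::size_t rows, std::size_t cols, float alpha,
              const std::vector<float>& A, const std::vector<float>& x,
              float beta, std::vector<float>& y) {
    for (std::size_t i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < cols; ++j) {
            acc += A[i * cols + j] * x[j];
        }
        y[i] = alpha * acc + beta * y[i];
    }
}

// Plain dot product; a stand-in for what an sdot routine computes.
float dot_ref(const std::vector<float>& a, const std::vector<float>& b) {
    float acc = 0.0f;
    for (std::size_t j = 0; j < a.size(); ++j) {
        acc += a[j] * b[j];
    }
    return acc;
}

// Special case: A has exactly one row, so y has exactly one element.
// That element is a single dot product, scaled by alpha and combined
// with beta * y[0] (assumption: scaling is done after the dot).
void gemv_single_row(float alpha, const std::vector<float>& A_row,
                     const std::vector<float>& x, float beta,
                     std::vector<float>& y) {
    y[0] = alpha * dot_ref(A_row, x) + beta * y[0];
}
```

For a one-row matrix both functions write the same value into `y[0]`, so the general gemv path can be skipped in that case.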

nouiz

I have a few small changes in comments.

Otherwise, I'll accept this small PR, but we don't develop the old GPU back-end. It would be great to port this to the new back-end.

nouiz · 2016-11-21T16:35:55Z

theano/sandbox/cuda/cuda_ndarray.cu

+            // alpha and beta parameter
+            // 2. permanant solution:
+            // define a new "InnerProduct" Op, add an optimization
+            // "gemv -> inner_prod", perhaps for CPU/GPU both


The new Op trick won't always work: sometimes we won't know the shapes. So I think having it here is good, and it is easier. Just remove this comment.

nouiz · 2016-11-21T16:45:45Z

theano/sandbox/cuda/cuda_ndarray.cu

+            cublasPointerMode_t pmode;
+            cublasGetPointerMode(handle, &pmode);
+            // need to store dot result on device here
+            cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);


Don't set the pointer mode. We don't do this anywhere; we always pass them on the CPU (the default). Also, you are passing them on the host below, so I don't understand why you added this.

khaotik · 2016-11-22

If I don't add this, cuBLAS treats the dot result as a host pointer, which causes a segfault (Example). I set the pointer mode back afterwards to make sure other CUDA code still functions correctly.

nouiz · 2016-11-21T16:45:51Z

theano/sandbox/cuda/cuda_ndarray.cu

+            // 2. permanant solution:
+            // define a new "InnerProduct" Op, add an optimization
+            // "gemv -> inner_prod", perhaps for CPU/GPU both
+            float* dev_dst = CudaNdarray_DEV_DATA(C)+1-sc_0;


Why the "+1-sc_0"? I would completely remove that.

khaotik · 2016-11-22

I misunderstood the cuBLAS documentation; removed.

nouiz · 2016-11-22T13:48:38Z

jenkins test this

call to cublasSdot in gemv when for row-vector matrix

89f4614

nouiz requested changes Nov 21, 2016

khaotik added 2 commits November 22, 2016 00:49

cleanup

c27d3b5

minifix

e0917e8

nouiz approved these changes Nov 22, 2016

khaotik mentioned this pull request Nov 23, 2016

BLAS dot implementation Theano/libgpuarray#292

Closed

nouiz merged commit c89d973 into Theano:master Nov 23, 2016

khaotik deleted the gpu_gemv_speedup branch November 24, 2016 01:40

khaotik mentioned this pull request Nov 29, 2016

GPU gemv->dot speedup for new backend #5303

Merged
