Variable Length Matrix Multiplication Batched Processing #409

Open · akumaburn opened this issue Dec 22, 2020 · 14 comments

@akumaburn commented Dec 22, 2020

Currently in XgemmBatched::BatchedGemmDirect (routines/levelx), the m, n, and k values are single scalars shared by the whole batch.

This allows processing a batch of matrix multiplications in which the m, n, and k sizes do not change.

However, there are use cases where m, n, and k vary from one multiplication to the next. This forces one to use the non-batched call in a tight loop, which isn't ideal.

My suggestion is to make m, n, and k vectors, so that the dimensions of each multiplication can be specified individually.

Optionally, a vector of size 1 could be passed to indicate that all sizes are the same, instead of introducing a second function.
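
To make the proposal concrete, here is a rough sketch of what such an entry point could look like, modelled on the existing clblast::GemmBatched C++ signature. The per-entry m/n/k vectors, the "size 1 means shared" convention, and the function name are all hypothetical; nothing like this exists in CLBlast today.

#include <vector>
#include <clblast.h>

// Hypothetical variable-size batched GEMM: like clblast::GemmBatched, but the
// dimensions are given per batch entry. A vector of size 1 would mean "shared
// by all entries", so the fixed-size case needs no separate function.
template <typename T>
clblast::StatusCode GemmVariableBatched(
    const clblast::Layout layout,
    const clblast::Transpose a_transpose, const clblast::Transpose b_transpose,
    const std::vector<size_t> &ms,   // m per entry (or size 1 if shared)
    const std::vector<size_t> &ns,   // n per entry
    const std::vector<size_t> &ks,   // k per entry
    const T *alphas,
    const cl_mem a_buffer, const size_t *a_offsets, const size_t a_ld,
    const cl_mem b_buffer, const size_t *b_offsets, const size_t b_ld,
    const T *betas,
    cl_mem c_buffer, const size_t *c_offsets, const size_t c_ld,
    const size_t batch_count,
    cl_command_queue *queue, cl_event *event = nullptr);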

@CNugteren (Owner) commented

I've labeled this as a feature request, and I'm not sure I can work on this any time soon. But if anyone else wants to pick it up, feel free, and I'll do a proper code review or can help with questions.

About your suggestion: have you seen this feature in any other (GPU) BLAS library? If not, why do you think they don't have it? I can imagine the gain here might not be that large, because your m/n/k values would need to be stored in off-chip memory and retrieved by the kernel, and you'd probably have to set aside all kinds of other m/n/k-based optimisations (such as running a faster kernel when they are a multiple of 32), and so on...

Also, have you tried the loop option with a pre-allocated temporary buffer as described in the other issue, and without synchronisations? How much performance do you lose?
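
For reference, a minimal sketch of that loop option: the regular clblast::Gemm call issued back-to-back on a single in-order queue, with one synchronisation at the end. Buffer setup, error handling, and the pre-allocated temporary buffer from the other issue are omitted; the helper name and the dense leading dimensions are my own choices for the example.

#include <vector>
#include <clblast.h>

// Sketch: many GEMMs with differing m/n/k on one in-order queue.
// No clFinish between calls; the queue serialises them, and we only
// synchronise once after the whole batch.
void gemm_loop(cl_command_queue queue,
               const std::vector<size_t> &ms, const std::vector<size_t> &ns,
               const std::vector<size_t> &ks,
               const std::vector<cl_mem> &as, const std::vector<cl_mem> &bs,
               const std::vector<cl_mem> &cs) {
  for (size_t i = 0; i < ms.size(); ++i) {
    const auto m = ms[i], n = ns[i], k = ks[i];
    clblast::Gemm(clblast::Layout::kColMajor,
                  clblast::Transpose::kNo, clblast::Transpose::kNo,
                  m, n, k,
                  1.0f,
                  as[i], 0, m,   // A is m x k, lda = m
                  bs[i], 0, k,   // B is k x n, ldb = k
                  0.0f,
                  cs[i], 0, m,   // C is m x n, ldc = m
                  &queue);       // return status ignored for brevity
  }
  clFinish(queue);  // single synchronisation for the whole batch
}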

@akumaburn (Author) commented

I don't know of any other libraries that offer this feature.

However, there is a paper exploring the subject.

I'm writing a custom neural network framework which isn't finished yet, so I can't give performance numbers. But I presume that enqueuing a single kernel with an optimal tiling structure would be faster than repeatedly enqueuing kernels with different parameters.

@CNugteren (Owner) commented

I would then suggest that you first build your software using the available CLBlast routines and get back to me when you have an example use case and performance numbers. If cuBLAS/cuDNN don't implement this, I doubt that it is useful in general.

Also, if this is indeed for neural networks (training or inference?), I don't see the use case yet. Operations can either be batched or have to be sequential (e.g. two layers following each other). I don't know of a use case where you would want to run multiple small matrix multiplications in parallel with different m/n/k values.

@akumaburn (Author) commented

I am trying to build my software (in Java) using the OpenCL kernels directly, since I don't want two different libraries managing GPU memory.

My current problem is that I'm getting CL_INVALID_WORK_GROUP_SIZE when I attempt to use the suggested tuning parameters.

For example, with the following defines (with precision left at the default 32 bits):

        long WGD = 8; // Tile-size in dimension M, N, and K (e.g. 8, 16, 32, 64)
        long MDIMCD = 8; // Threads per workgroup in M-dimension (e.g. 8, 16, 32)
        long NDIMCD = 8; // Threads per workgroup in N-dimension (e.g. 8, 16, 32)
        long MDIMAD = m; // Re-shaped tile dimension of matrix A: KDIMAD * MDIMAD
        long NDIMBD = n; // Re-shaped tile dimension of matrix B: KDIMBD * NDIMBD
        long KWID = 1; // Unroll factor of the WGD loop (smaller or equal than WGD)
        long VWMD = 1; // Vector width of matrices A and C (this affects precision)
        long VWND = 1; // Vector width of matrix B
        long PADA = 0; // Local memory padding for matrix A
        long PADB = 0; // Local memory padding for matrix B

I've set the calculation for global and local work sizes as follows:

        long[] globalWorkSize = {Math.max(m, WGD) * MDIMCD / WGD, Math.max(n, WGD) * NDIMCD / WGD};
        long[] localWorkSize = {MDIMCD, NDIMCD};

This is intended to mimic what I see in xgemm.cpp (under level 3 routines):

  // Computes the global and local thread sizes
  const auto m_ceiled = Ceil(m, db_["WGD"]);
  const auto n_ceiled = Ceil(n, db_["WGD"]);

  const auto global = std::vector<size_t>{
  //  CeilDiv(m * db_["MDIMCD"], db_["WGD"]),
  //  CeilDiv(n * db_["NDIMCD"], db_["WGD"])
      (m_ceiled * db_["MDIMCD"]) / db_["WGD"],
      (n_ceiled * db_["NDIMCD"]) / db_["WGD"]
  };
  const auto local = std::vector<size_t>{db_["MDIMCD"], db_["NDIMCD"]};

However, when testing with a set of three 3x3 matrices generated with the following code:

        /*
            Matrices A
         */
        for (int i = 0; i < 3; i++) {
            MatrixF nMatrix = new MatrixF(3, 3, new float[]{0, 1, 2, 3, 4, 5, 6, 7, 8});
            matricesA.add(nMatrix);
        }

        /*
           Matrices B
        */
        for (int i = 0; i < 3; i++) {
            MatrixF nMatrix = new MatrixF(3, 3, new float[]{0, 1, 2, 3, 4, 5, 6, 7, 8});
            matricesB.add(nMatrix);
        }

I get CL_INVALID_WORK_GROUP_SIZE.

Can you please clarify exactly how the local and global work sizes should be calculated relative to the sizes of the input matrices?

@CNugteren (Owner) commented

In general, what might help is to go through the CLBlast paper and the cited paper by Matsumoto. Another thing that could help is to look at the tuner code: it also needs to run the kernels, but it is much simpler than the actual library, since there is only one kernel to run. However, the tuner does make some assumptions as well.

In particular, for the problem you describe, I would just print out the local and global values and check whether they make sense. You can optionally do the same in the CLBlast library for comparison. From the code you've pasted it looks fine in general, although I doubt that the Math.max() operation is the same as the Ceil function from CLBlast, see here:

size_t Ceil(const size_t x, const size_t y) {
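
For comparison, this is roughly what those helpers do (my reconstruction of the linked utilities code: Ceil rounds x up to the next multiple of y). Math.max(m, WGD) only coincides with Ceil(m, WGD) when m is at most WGD; for example with m = 12 and WGD = 8 it returns 12, which is not a multiple of 8, whereas Ceil returns 16.

#include <cassert>
#include <cstddef>

// Assumed to mirror CLBlast's utilities: ceiling division and rounding x up
// to the next multiple of y.
size_t CeilDiv(const size_t x, const size_t y) { return 1 + ((x - 1) / y); }
size_t Ceil(const size_t x, const size_t y) { return CeilDiv(x, y) * y; }

int main() {
  assert(Ceil(3, 8) == 8);    // happens to equal Math.max(3, 8)
  assert(Ceil(12, 8) == 16);  // Math.max(12, 8) would give 12 instead
  assert(Ceil(16, 8) == 16);
  return 0;
}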

Also, in general, be careful that there are two kinds of GEMM kernels in CLBlast, but it looks like you are handling that correctly, at least from the details you've shared.

@akumaburn (Author) commented Jan 5, 2021

I've looked at the paper and am trying to understand exactly what the rules are for tile sizing. In other words, how do I determine valid values for the defines above?

The main problem is that I don't have a working CLBlast dev environment. I've imported CLBlast via JOCLBlast, which provides the Java -> C++ bindings, but my IDE (IntelliJ) doesn't support C++ debugging AFAIK.

Looking at the XgemmDirectBatchedNN kernel,

I see it sets the following kernel attribute requesting a specific local work-group size:

__kernel __attribute__((reqd_work_group_size(MDIMCD, NDIMCD, 1)))

and then calls the XgemmDirect inline function.

Given that I've set my local work-group size using the MDIMCD and NDIMCD defines, why is it failing with CL_INVALID_WORK_GROUP_SIZE?

Can you point me in the direction of the sanity checks for this kernel?

@CNugteren (Owner) commented Jan 8, 2021

If you look in the OpenCL specification (https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html) you will see that there can be multiple reasons for CL_INVALID_WORK_GROUP_SIZE. As I said in the previous message, I believe yours relates to:

number of work-items specified by global_work_size is not evenly divisable by size of work-group given by local_work_size

But if you've tried my earlier suggestion already and that doesn't help, you should check the other conditions under which this can fail in the spec.

If you want to know about the constraints, you could check out the remainder of the source code of the library itself, or check out the tuner as I also suggested above. E.g. here you see some of the constraints of the tuner: https://github.com/CNugteren/CLBlast/blob/master/src/tuning/kernels/xgemm_direct.hpp#L118. You can also run the tuner and play around with some values. Of course, there are also device constraints, not just code-design constraints.

@akumaburn (Author) commented

> If you look in the OpenCL specification you will see that there can be multiple reasons for CL_INVALID_WORK_GROUP_SIZE. [...] If you want to know about the constraints, you could check out the remainder of the source code of the library itself, or check out the tuner as I also suggested above. [...]

I've defined the Ceil and isMultiple helpers accordingly.

I've also added asserts to match what you linked in xgemm_direct.hpp; however, even with those in place and passing, I still get CL_INVALID_WORK_GROUP_SIZE.

My device (Nvidia GTX 1070) has the following parameters:

CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES:      1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE:      1024

I use the following code for my defines as well as the local and global work sizes.

Does anything else catch your eye as being wrong here (in the context of three 3x3 matrix multiplications)?

        // Computes the global and local thread sizes

        // Tile-size in dimension M, N, and K (e.g. 8, 16, 32, 64)
        long WGD = Math.min(Math.min(m,n),k);
        // Threads per workgroup in M-dimension (e.g. 8, 16, 32)
        long MDIMCD = WGD;
        // Threads per workgroup in N-dimension (e.g. 8, 16, 32)
        long NDIMCD = WGD;
        // Re-shaped tile dimension of matrix A: KDIMAD * MDIMAD
        long MDIMAD = k * m;
        // Re-shaped tile dimension of matrix B: KDIMBD * NDIMBD
        long NDIMBD = k * n;
        // Unroll factor of the WGD loop (smaller or equal than WGD)
        long KWID = 1;
        // Vector width of matrices A and C (this affects precision)
        long VWMD = 1;
        // Vector width of matrix B
        long VWND = 1;
        // Local memory padding for matrix A
        long PADA = 0;
        // Local memory padding for matrix B
        long PADB = 0;

        // Requirement for unrolling the WGD loop
        assert(General.isMultiple(WGD,KWID));
        // Required for integer MWID and NWID
        assert(General.isMultiple(WGD,MDIMCD*VWMD));
        assert(General.isMultiple(WGD,NDIMCD*VWMD));
        // Required for integer MWIAD and NWIBD
        assert(General.isMultiple(WGD,MDIMAD*VWMD));
        assert(General.isMultiple(WGD,NDIMBD*VWND));
        // WGD has to be a multiple of KDIMAD = ((MDIMCD*NDIMCD)/(MDIMAD)) and KDIMBD = (...)
        assert(General.isMultiple(WGD, (MDIMCD * NDIMCD)/MDIMAD));
        assert(General.isMultiple(WGD, (MDIMCD * NDIMCD)/NDIMBD));

        String defines =String.format(
                        "     #define WGD %d      // Tile-size in dimension M, N, and K (e.g. 8, 16, 32, 64)  \n" +
                        "     #define MDIMCD %d    // Threads per workgroup in M-dimension (e.g. 8, 16, 32)  \n" +
                        "     #define NDIMCD %d    // Threads per workgroup in N-dimension (e.g. 8, 16, 32)  \n" +
                        "     #define MDIMAD %d    // Re-shaped tile dimension of matrix A: KDIMAD * MDIMAD  \n" +
                        "     #define NDIMBD %d    // Re-shaped tile dimension of matrix B: KDIMBD * NDIMBD  \n" +
                        "     #define KWID %d      // Unroll factor of the WGD loop (smaller or equal than WGD)  \n" +
                        "     #define VWMD %d      // Vector width of matrices A and C  \n" +
                        "     #define VWND %d      // Vector width of matrix B  \n" +
                        "     #define PADA %d      // Local memory padding for matrix A \n " +
                        "     #define PADB %d      // Local memory padding for matrix B  \n",
                WGD,
                MDIMCD,
                NDIMCD,
                MDIMAD,
                NDIMBD,
                KWID,
                VWMD,
                VWND,
                PADA,
                PADB
                );

        //TODO  figure out why these work sizes aren't working.
        long[] globalWorkSize = {General.Ceil(m, WGD) * MDIMCD / WGD,General.Ceil(n,WGD) * NDIMCD/WGD};
        long[] localWorkSize = {MDIMCD, NDIMCD};


        ObjectSet<DataBuffer> values = runProgram("XgemmDirectBatchedNN",
                defines + HWAcceleration.getKernel(HWAcceleration.KernelGroups.GEMMBATCHED, HWAcceleration.LangType.OpenCL),
                inputs,
                globalWorkSize,
                localWorkSize,
                device);

@CNugteren (Owner) commented Jan 8, 2021

I see now that you are using the batched kernel (sorry, I should have re-read earlier), but you are only supplying 2 dimensions of local and global sizes; is that perhaps another bug? The third dimension is the batch dimension (1 for the local size, the number of batches for the global size).

Other than that, I would really dig deeper. Don't just set some global/local sizes and observe the error, but actually print and inspect those values. What are they? Do they make sense? Are they computed correctly? How do they relate to the possible causes for the error as per the OpenCL spec linked above? And if all of that doesn't help: how do they compare to the ones used in CLBlast? Just modify one of the code samples in the samples directory of the repository, add some printf or debugging statements, and run it for exactly your use case.

@akumaburn (Author) commented

For the current code:
globalWorkSize = {3, 3, 1}
localWorkSize = {3, 3, 1}

Can you explain exactly how I can determine whether these make sense or not? Since I don't have detailed knowledge of the tiling algorithm, I don't know how to interpret them.

Also, even if they don't make sense, I presume that for CL_INVALID_WORK_GROUP_SIZE to be returned there must be a constraint somewhere in the CLBlast code for this kernel, aside from just __kernel __attribute__((reqd_work_group_size(MDIMCD, NDIMCD, 1))).
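
As a side note, a host-side check of the conditions the spec lists can help narrow this down. The helper below is just a sketch of my own; clGetKernelWorkGroupInfo with CL_KERNEL_COMPILE_WORK_GROUP_SIZE reports the reqd_work_group_size that was compiled into the kernel, which the driver compares against the local size at enqueue time.

#include <cstdio>
#include <CL/cl.h>

// Sketch: check the usual CL_INVALID_WORK_GROUP_SIZE conditions before the
// enqueue: local size vs. reqd_work_group_size, divisibility of global by
// local, and the kernel's maximum work-group size.
void check_sizes(cl_kernel kernel, cl_device_id device,
                 const size_t global[3], const size_t local[3]) {
  size_t reqd[3] = {0, 0, 0};
  clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_COMPILE_WORK_GROUP_SIZE,
                           sizeof(reqd), reqd, nullptr);
  size_t max_wg = 0;
  clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                           sizeof(max_wg), &max_wg, nullptr);
  for (int d = 0; d < 3; ++d) {
    if (reqd[d] != 0 && local[d] != reqd[d])
      std::printf("dim %d: local %zu != reqd_work_group_size %zu\n",
                  d, local[d], reqd[d]);
    if (local[d] != 0 && global[d] % local[d] != 0)
      std::printf("dim %d: global %zu not divisible by local %zu\n",
                  d, global[d], local[d]);
  }
  if (local[0] * local[1] * local[2] > max_wg)
    std::printf("work-group of %zu items exceeds kernel limit of %zu\n",
                local[0] * local[1] * local[2], max_wg);
}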

@CNugteren (Owner) commented Jan 11, 2021

To be honest, it feels a bit like I'm debugging your code. It should be fairly simple (even without knowledge of CLBlast, given my previous pointers) to do this debugging yourself. To illustrate, I just mimicked your problem and ran the corresponding sample:

diff --git a/samples/sgemm_batched.cpp b/samples/sgemm_batched.cpp
index 32c465c..51521b8 100644
--- a/samples/sgemm_batched.cpp
+++ b/samples/sgemm_batched.cpp
@@ -40,13 +40,13 @@ int main() {
   const auto device_id = 0;
 
   // Example arguments
-  const size_t batch_count = 261;
-  const size_t m = 1;
-  const size_t n = 1;
-  const size_t k = 40;
-  const auto a_ld = 2560;
-  const auto b_ld = 160;
-  const auto c_ld = 261;
+  const size_t batch_count = 1;
+  const size_t m = 3;
+  const size_t n = 3;
+  const size_t k = 3;
+  const auto a_ld = 3;
+  const auto b_ld = 3;
+  const auto c_ld = 3;
   std::vector<float> alphas(batch_count);
   std::vector<float> betas(batch_count);
   std::vector<size_t> a_offsets(batch_count);
diff --git a/src/routines/levelx/xgemmbatched.cpp b/src/routines/levelx/xgemmbatched.cpp
index 0d7ae5a..7346c41 100644
--- a/src/routines/levelx/xgemmbatched.cpp
+++ b/src/routines/levelx/xgemmbatched.cpp
@@ -315,6 +315,8 @@ void XgemmBatched<T>::BatchedGemmDirect(const size_t m, const size_t n, const si
     batch_count
   };
   const auto local = std::vector<size_t>{db_["MDIMCD"], db_["NDIMCD"], 1};
+  printf("Local: %d %d %d\n", local[0], local[1], local[2]);
+  printf("Global: %d %d %d\n", global[0], global[1], global[2]);
 
   // Launches the kernel
   RunKernel(kernel, queue_, device_, global, local, event_);

Then I get the following output:

Local: 8 8 1
Global: 8 8 1

Thus indeed something is different. Looking at the formulas that compute this, you can see it has to be a multiple of WGD, the tiling factor, described in a comment as: Tile-size in dimension M, N, and K (e.g. 8, 16, 32, 64). It is not documented (you are right in that sense; I should have specified this somewhere), but a tile size that is not a power of two will most likely not work. To get a definitive answer on what is supported by CLBlast, look at the tuner I referred to earlier. Here you can see which values are considered valid tuning values:
https://github.com/CNugteren/CLBlast/blob/master/src/tuning/kernels/xgemm_direct.hpp#L90

So in short: Try with WGD equal to 8.
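
Plugging the numbers in makes this concrete. The stand-alone snippet below re-creates the global/local computation from xgemmbatched.cpp for m = n = 3 and batch_count = 1, with Ceil assumed to round up to the next multiple of WGD: WGD = MDIMCD = NDIMCD = 8 reproduces the 8 8 1 / 8 8 1 output above, while WGD = 3 gives 3 3 1 / 3 3 1.

#include <cstddef>
#include <cstdio>

// Assumed to match CLBlast's Ceil: round x up to the next multiple of y.
size_t Ceil(size_t x, size_t y) { return (1 + (x - 1) / y) * y; }

int main() {
  const size_t m = 3, n = 3, batch_count = 1;
  for (size_t wgd : {size_t{8}, size_t{3}}) {
    const size_t mdimcd = wgd, ndimcd = wgd;  // threads per work-group
    const size_t global[3] = {Ceil(m, wgd) * mdimcd / wgd,
                              Ceil(n, wgd) * ndimcd / wgd, batch_count};
    const size_t local[3] = {mdimcd, ndimcd, 1};
    std::printf("WGD=%zu  Local: %zu %zu %zu  Global: %zu %zu %zu\n", wgd,
                local[0], local[1], local[2], global[0], global[1], global[2]);
  }
  return 0;  // WGD=8 -> 8 8 1 / 8 8 1; WGD=3 -> 3 3 1 / 3 3 1
}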

@akumaburn (Author) commented

> To be honest it feels a bit like I'm debugging your code. [...] Then I get the following output: Local: 8 8 1 / Global: 8 8 1. [...] So in short: Try with WGD equal to 8.

Sorry, I didn't mean to be troublesome.
I thought you might have some insight to share that would help me avoid going through CLBlast line by line.

Setting WGD to 8 resulted in the same CL_INVALID_WORK_GROUP_SIZE error as before.

I realize that this may be something that isn't clearly defined in the documentation, so I've taken your advice.

I've set up a working build environment and a test project that will let me compare exactly what is going on with the tuning values step by step.

Inshallah I will get to the bottom of this and report back.

@akumaburn (Author) commented

I found the problem shortly afterwards: in my clEnqueueNDRangeKernel call, workDim was set to 1 whereas it should have been globalWorkSize.length.
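
For anyone hitting the same thing, this is the shape of the fix in the C API (which JOCL mirrors one-to-one); the sizes shown assume the WGD = MDIMCD = NDIMCD = 8 configuration discussed above.

#include <CL/cl.h>

// The batched direct kernel declares reqd_work_group_size(MDIMCD, NDIMCD, 1),
// so the enqueue must be three-dimensional; a 1D enqueue cannot match that
// requirement and the driver returns CL_INVALID_WORK_GROUP_SIZE.
cl_int enqueue_batched_gemm(cl_command_queue queue, cl_kernel kernel,
                            size_t batch_count, cl_event *event) {
  const size_t global[3] = {8, 8, batch_count};  // multiples of the local size
  const size_t local[3]  = {8, 8, 1};            // MDIMCD, NDIMCD, 1
  return clEnqueueNDRangeKernel(queue, kernel,
                                3,        // work_dim (the bug was passing 1)
                                nullptr,  // global_work_offset
                                global, local,
                                0, nullptr, event);
}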

However, I've been wrestling with a new problem over the last week: sometimes the clEnqueueReadBuffer call returns a CL_OUT_OF_RESOURCES error, and even when it doesn't, the buffer doesn't contain the correct result.

I've tried side-stepping the Windows TDR timeout via a registry hack, but that didn't fix it.

I did some more digging and found out that CLBlast doesn't actually use the arguments as supplied to this routine, but first modifies them through Xgemm::ProcessArguments.

Hopefully something there will enlighten me as to why I'm getting this error on a read operation.

@akumaburn (Author) commented

I converted the ProcessArguments logic to Java and noticed that it actually picks the kernel name dynamically depending on the result. I also fixed some of the OpenCL kernel argument parameters.

In actuality, the kernel CLBlast was running was XgemmDirectBatchedTN, not XgemmDirectBatchedNN (apparently it is necessary in CLBlast for row-major multiplication to transpose both A and C; B is transposed in the col-major case).

However, this alone did not resolve the issue; after praying for an answer, the divine provided one.

When I was supplying the primitives to the kernel, I was doing so by providing a pointer to a cl_mem buffer. However, the kernel itself expects those values to be copied in by value, not passed as a buffer. This was the reason for the CL_OUT_OF_RESOURCES (I earnestly wish for better error codes in OpenCL). After changing the way primitives are handled and passing them directly like so:
clSetKernelArg(kernel, argIndex, Sizeof.cl_int, Pointer.to(new int[]{(int) primitive}));
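
For reference, the same distinction in the C API (the argument indices below are illustrative, not the real layout of the CLBlast kernel): scalars such as m, n, and k are copied by value from a host pointer at clSetKernelArg time, while only __global pointers take a cl_mem handle. JOCL's Pointer.to(new int[]{...}) is the Java counterpart of passing &value.

#include <CL/cl.h>

// Sketch: scalar kernel arguments are passed by value (the runtime copies
// sizeof(cl_int) bytes from the host pointer); buffers are passed as cl_mem.
void set_gemm_args(cl_kernel kernel, cl_int m, cl_int n, cl_int k,
                   cl_mem a_buffer, cl_mem b_buffer, cl_mem c_buffer) {
  clSetKernelArg(kernel, 0, sizeof(cl_int), &m);         // scalar: &value
  clSetKernelArg(kernel, 1, sizeof(cl_int), &n);
  clSetKernelArg(kernel, 2, sizeof(cl_int), &k);
  clSetKernelArg(kernel, 3, sizeof(cl_mem), &a_buffer);  // buffer: &cl_mem
  clSetKernelArg(kernel, 4, sizeof(cl_mem), &b_buffer);
  clSetKernelArg(kernel, 5, sizeof(cl_mem), &c_buffer);
}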

It works as expected at last :)
