
New routine for the stride of 0 for C in CLBlastSgemmStridedBatched() is needed #347

Open
TaihuLight opened this issue Jan 9, 2019 · 2 comments


@TaihuLight

#346
However, a stride of 0 for C in CLBlastSgemmStridedBatched() would also be useful in deep learning applications, such as the backward pass of a convolution layer. It can be implemented in two steps:
(1) use CLBlastSgemmStridedBatched() to compute the batched matrices C;
(2) add a new routine (e.g., named StridedBatchedAddMatrix) to compute the element-wise sum over the batches of C computed in step (1).
In other words, the new routine would reduce the results of CLBlastSgemmStridedBatched() by adding all batched matrices element-wise. That is, it computes the sum of the batched matrices as follows:

for (int i = 0; i < batch_count; i++) {
    SUM = SUM + β * (C + i * c_stride)
}

where c_stride is the stride between two consecutive batches of the C matrix, and the batched C matrices are computed with CLBlastSgemmStridedBatched(). (A screenshot with an example was attached here.)
Therefore, the new routine for adding matrices would be similar to an xAXPYStridedBatched routine, i.e. a StridedBatched version of AXPY for adding vectors.
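For reference, the proposed StridedBatchedAddMatrix could already be emulated with the existing AXPY routine by looping over the batches on the host, at the cost of batch_count kernel launches. Below is a minimal sketch against the CLBlast C API; the helper name sum_batched_c, the zero-initialised sum_buffer, and the c_size/c_stride arguments are illustrative assumptions, not part of CLBlast:

#include <clblast_c.h>

// Emulates the proposed reduction: SUM += beta * C_i for every batch i,
// where C_i starts at offset i * c_stride in c_buffer and has c_size elements.
// Assumes c_buffer was filled by CLBlastSgemmStridedBatched() and sum_buffer
// holds c_size floats initialised to zero.
CLBlastStatusCode sum_batched_c(cl_mem c_buffer, cl_mem sum_buffer,
                                const size_t c_size, const size_t c_stride,
                                const size_t batch_count, const float beta,
                                cl_command_queue* queue) {
  for (size_t i = 0; i < batch_count; ++i) {
    const CLBlastStatusCode status =
        CLBlastSaxpy(c_size, beta,
                     c_buffer, i * c_stride, 1,   // x = the i-th batch of C
                     sum_buffer, 0, 1,            // y = the running sum
                     queue, NULL);
    if (status != CLBlastSuccess) { return status; }
  }
  return CLBlastSuccess;
}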

@CNugteren (Owner)

Sorry for my late reply.

I don't think 'batched' is the right wording here. That is typically used to indicate an operation that is repeated multiple times but on independent data. In your case the SUM variable is shared, right? So the iterations of the 'batched' loop are not actually independent of each other.

I think what you are looking for is perhaps something like the XSUM routine from CLBlast, but then going from 3D (batches of 2D matrices) to 2D (a single 2D matrix) rather than from 1D (a vector) to 0D (a scalar). Perhaps if you view your 2D matrices as flat vectors and re-organize your data, something like #349 could fit your needs?

@CNugteren (Owner)

Could you have a look at the latest reply in #349 regarding a solution with GEMV? I think this solves your issue as well: you can just use GEMV, set the x vector to all 1's with a size equal to the number of values you want to sum, and use the other dimension (either m or n, depending on how the data is currently laid out in memory) as the size of the matrix. For example, set a_transposed = true, m = num_batches (the number of sums you want to do), and n = height_of_C * width_of_C (the matrix C flattened).

Could you let me know if this works for you?
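A minimal sketch of the GEMV-based reduction described above, using the CLBlast C API. It assumes the batched C matrices are stored contiguously (so the stride between batches equals height_c * width_c) and that ones_buffer already holds num_batches floats set to 1.0f; the helper and variable names are illustrative:

#include <clblast_c.h>

// Element-wise sum over the batches of C: sum[j] = sum_i C_i[j].
// The batched C data is viewed as a row-major num_batches x (height_c * width_c)
// matrix, so A^T * ones reduces over the batch dimension.
CLBlastStatusCode sum_batches_with_gemv(cl_mem c_buffer, cl_mem ones_buffer,
                                        cl_mem sum_buffer,
                                        const size_t num_batches,
                                        const size_t height_c, const size_t width_c,
                                        cl_command_queue* queue) {
  const size_t c_size = height_c * width_c;  // also the stride between batches
  return CLBlastSgemv(CLBlastLayoutRowMajor, CLBlastTransposeYes,
                      num_batches, c_size,   // m = num_batches, n = flattened C
                      1.0f,
                      c_buffer, 0, c_size,   // A: the batched C matrices, ld = c_size
                      ones_buffer, 0, 1,     // x: vector of num_batches ones
                      0.0f,
                      sum_buffer, 0, 1,      // y: the summed matrix, c_size elements
                      queue, NULL);
}

The β factor from the loop in the original post would correspond to the alpha argument of the GEMV call.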
