
Optimizing elementwise_add for CPU with MKL #10786

Closed
tpatejko opened this issue May 19, 2018 · 2 comments

@tpatejko

I'm working on optimizing the elementwise_add operator for CPU. The operator adds two tensors x and y element by element, and stores the result in tensor z. I'm currently focusing on the case where both operands x and y have equal dimensions.

The optimization uses MKL VML's v?Add operation that performs elementwise addition:
https://software.intel.com/en-us/mkl-developer-reference-c-v-add
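
For reference, this is roughly how the routine is called directly (vsAdd is the single-precision variant; vdAdd is the double-precision one):

#include <mkl.h>
#include <vector>

int main() {
  const MKL_INT n = 8;
  std::vector<float> x(n, 1.0f), y(n, 2.0f), z(n);

  // vsAdd computes z[i] = x[i] + y[i] for i in [0, n).
  vsAdd(n, x.data(), y.data(), z.data());

  return 0;
}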

When elementwise_add is performed on GPU, or when x and y have different dimensions, the algorithm falls back to the default implementation.

To implement the optimization, I extended the interface of the PaddlePaddle BLAS code:
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/math/blas.h

with two operations: VADD, which performs elementwise addition using the VML v?Add routine, and VCOPY, which copies one vector into another using the BLAS level 1 routine cblas_?copy. I use the VCOPY routine together with the already available SAXPY routine to implement the VADD operation for the non-MKL case.
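
For the non-MKL path, the composition is just a copy followed by an AXPY. A minimal standalone sketch with plain CBLAS calls (single precision; vadd_fallback is only an illustrative name, not the proposed interface):

#include <cblas.h>

// z = x + y without VML: copy y into z, then accumulate x into z.
void vadd_fallback(int n, const float* x, const float* y, float* z) {
  cblas_scopy(n, y, 1, z, 1);        // VCOPY: z = y
  cblas_saxpy(n, 1.0f, x, 1, z, 1);  // SAXPY: z = 1.0 * x + z
}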

Is it OK with you to extend the interface of the BLAS routines in PaddlePaddle for CPU?
The algorithm is currently as follows. What do you think about it?

x = ctx.Input<T>("X");
y = ctx.Input<T>("Y");
z = ctx.Output<T>("Z");

if (ctx.is_cpu_place() && x->dims() == y->dims()) {
    flatten(x);
    flatten(y);
    flatten(z);
    if (MKL_is_used()) {
        VADD(x->numel(), x, y, z);
    } else {
        // SAXPY implements y = alpha * x + y,
        // so the content of y is first copied to z,
        // and then x is added to z.
        VCOPY(y, z);
        SAXPY(x->numel(), 1.0 /*alpha*/, x, z);
    }
} else {
    // fall back to the default implementation
}
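
Fleshed out, the CPU kernel could look roughly like the sketch below. The Blas wrapper calls (GetBlas, VADD, VCOPY, AXPY) follow the proposed extension of blas.h and are assumptions here, not existing API:

template <typename DeviceContext, typename T>
class ElementwiseAddKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
    auto* x = ctx.Input<framework::Tensor>("X");
    auto* y = ctx.Input<framework::Tensor>("Y");
    auto* z = ctx.Output<framework::Tensor>("Z");
    z->mutable_data<T>(ctx.GetPlace());

    if (platform::is_cpu_place(ctx.GetPlace()) && x->dims() == y->dims()) {
      auto blas = math::GetBlas<DeviceContext, T>(ctx);
#ifdef PADDLE_WITH_MKLML
      // MKL path: one VML call computes z = x + y elementwise.
      blas.VADD(x->numel(), x->data<T>(), y->data<T>(), z->data<T>());
#else
      // Non-MKL path: z = y, then z += 1.0 * x.
      blas.VCOPY(y->numel(), y->data<T>(), z->data<T>());
      blas.AXPY(x->numel(), static_cast<T>(1.0), x->data<T>(), z->data<T>());
#endif
    } else {
      // Different dims or non-CPU place: fall back to the
      // default broadcast implementation (omitted in this sketch).
    }
  }
};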
@tensor-tang
Contributor

Looks OK to me. What do you think, @luotao1?

@luotao1
Contributor

luotao1 commented May 21, 2018

@tpatejko LGTM, you can continue with your code.
