
Optimizing elementwise_add for CPU with MKL #10786

Closed
tpatejko opened this issue May 19, 2018 · 2 comments

@tpatejko

I'm working on optimizing the elementwise_add operator for CPU. The operator adds two tensors x and y element by element, and stores the result in tensor z. I'm currently focusing on the case where both operands x and y have equal dimensions.

The optimization uses MKL VML's v?Add operation that performs elementwise addition:
https://software.intel.com/en-us/mkl-developer-reference-c-v-add
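
For reference, this is roughly how the routine is called directly (vsAdd is the single-precision variant; vdAdd is the double-precision one):

#include <mkl.h>
#include <vector>

int main() {
  const MKL_INT n = 8;
  std::vector<float> x(n, 1.0f), y(n, 2.0f), z(n);

  // vsAdd computes z[i] = x[i] + y[i] for i in [0, n).
  vsAdd(n, x.data(), y.data(), z.data());

  return 0;
}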

When elementwise_add is performed on GPU, or when x and y have different dimensions, the algorithm falls back to the default implementation.

To implement the optimization, I extended the interface of the PaddlePaddle BLAS code:
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/math/blas.h

with two operations: VADD, which performs elementwise addition using the VML v?Add routine, and VCOPY, which copies one vector into another using the BLAS level 1 routine cblas_?copy. I use the VCOPY routine together with the already available SAXPY routine to implement the VADD operation for the non-MKL case.
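
For the non-MKL path, the composition is just a copy followed by an AXPY. A minimal standalone sketch with plain CBLAS calls (single precision; vadd_fallback is only an illustrative name, not the proposed interface):

#include <cblas.h>

// z = x + y without VML: copy y into z, then accumulate x into z.
void vadd_fallback(int n, const float* x, const float* y, float* z) {
  cblas_scopy(n, y, 1, z, 1);        // VCOPY: z = y
  cblas_saxpy(n, 1.0f, x, 1, z, 1);  // SAXPY: z = 1.0 * x + z
}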

Is it OK with you to extend the interface of the BLAS routines in PaddlePaddle for CPU?
The algorithm is currently as follows. What do you think about it?

x = ctx.Input<T>("X");
y = ctx.Input<T>("Y");
z = ctx.Output<T>("Z");

if (ctx.is_cpu_place() && x->dims() == y->dims()) {
    flatten(x);
    flatten(y);
    flatten(z);
    if (MKL_is_used()) {
        VADD(x->numel(), x, y, z);
    } else {
        // SAXPY implements y = alpha * x + y,
        // so the content of y is first copied to z,
        // and then x is added to z.
        VCOPY(y, z);
        SAXPY(x->numel(), 1.0 /*alpha*/, x, z);
    }
} else {
    // fall back to the default implementation
}
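
Fleshed out, the CPU kernel could look roughly like the sketch below. The Blas wrapper calls (GetBlas, VADD, VCOPY, AXPY) follow the proposed extension of blas.h and are assumptions here, not existing API:

template <typename DeviceContext, typename T>
class ElementwiseAddKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
    auto* x = ctx.Input<framework::Tensor>("X");
    auto* y = ctx.Input<framework::Tensor>("Y");
    auto* z = ctx.Output<framework::Tensor>("Z");
    z->mutable_data<T>(ctx.GetPlace());

    if (platform::is_cpu_place(ctx.GetPlace()) && x->dims() == y->dims()) {
      auto blas = math::GetBlas<DeviceContext, T>(ctx);
#ifdef PADDLE_WITH_MKLML
      // MKL path: one VML call computes z = x + y elementwise.
      blas.VADD(x->numel(), x->data<T>(), y->data<T>(), z->data<T>());
#else
      // Non-MKL path: z = y, then z += 1.0 * x.
      blas.VCOPY(y->numel(), y->data<T>(), z->data<T>());
      blas.AXPY(x->numel(), static_cast<T>(1.0), x->data<T>(), z->data<T>());
#endif
    } else {
      // Different dims or non-CPU place: fall back to the
      // default broadcast implementation (omitted in this sketch).
    }
  }
};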
@tensor-tang
Contributor

Looks OK to me. What do you think, @luotao1?

@luotao1
Contributor

luotao1 commented May 21, 2018

@tpatejko LGTM, you can continue with your code.
