I'm working on optimizing the elementwise_add operator for CPU. The operator adds two tensors x and y element by element and stores the result in tensor z. I'm currently focusing on the case when both operands x and y have equal dimensions.

The optimization uses the MKL VML v?Add operation, which performs elementwise addition: https://software.intel.com/en-us/mkl-developer-reference-c-v-add

When elementwise_add is performed on GPU, and/or x and y have different dimensions, the algorithm falls back to the default implementation.

To implement the optimization, I extended the interface of the PaddlePaddle BLAS code:
https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/math/blas.h
with two operations: VADD, which performs elementwise addition via the VML vAdd routine, and VCOPY, which copies one vector to another via the BLAS level 1 routine cblas_vcopy. For the non-MKL case, I use the VCOPY routine together with the already available SAXPY routine to implement VADD.

Is it OK for you to extend the interface of the Blas routines in PaddlePaddle for CPU?

Currently the algorithm is as follows. What do you think about it?
x = ctx.Input<T>("X");
y = ctx.Input<T>("Y");
z = ctx.Output<T>("Z");
if (ctx.is_cpu_place() && x->dims() == y->dims()) {
  flatten(x);
  flatten(y);
  flatten(z);
  if (MKL_is_used()) {
    VADD(x->numel(), x, y, z);
  } else {
    // SAXPY implements y = alpha * x + y,
    // so the contents of y are first copied to z,
    // and then x is added to z.
    VCOPY(y, z);
    SAXPY(x->numel(), 1.0 /*alpha*/, x, z);
  }
} else {
  // fall back to the default implementation
}