Add fc padding to improve mkl GEMM's performance when N and K are multiples of 128. #20972
Conversation
test=develop
The PR description should state the performance improvement results.
LGTM for PADDLE_ENFORCE
LGTM
paddle/fluid/operators/math/fc.cc
Outdated
auto blas = math::GetBlas<platform::CPUDeviceContext, T>(context);
blas.MatMul(M, N, K, X, W, Y);
framework::Tensor X1, Y1;
- Don't define multiple variables on a single line.
- fc_fuse_pass only checks whether the weight has been padded; the kernel still has to check whether x needs padding. The padding of w and x is independent, so the implementation may need to handle several cases:
  - w is padded; check whether x needs padding
  - w is not padded; check whether x needs padding
  - Not every branch necessarily has to be supported, but whether x needs padding surely has to be checked, right?
  - Allocate the temporary Tensor only when padding is actually needed.
- The weight's shape is K*N; padding is done only when both are divisible by 128. The padding of the weight and X should be synchronized, so once the padding is done in the pass, X is not checked again here.
- The temporary Tensor Y1 is used in the computation below, so it has to stay outside the condition. The temporary variable X1 can be moved inside the condition.
@@ -78,6 +78,9 @@ def setUp(self):
            'Out': fc_refer(self.matrix, self.with_bias, self.with_relu)
        }

    def padding(self):
        self.attrs = {'padding_weights': False}
This will not take effect, because nothing ever calls the padding function.
Since the FC padding strategy has been revised, this can be used to test correctness when (N % 128 == 0 && K % 128 == 0).
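As a hedged illustration of such a test case (the reference implementation below is a hypothetical stand-in for the test file's `fc_refer`, and the concrete shapes are an assumption), shapes with N % 128 == 0 and K % 128 == 0 would exercise the padding path:

```python
import numpy as np

def fc_refer(x, w, b, with_relu=False):
    # Hypothetical reference FC (Y = X*W + b, optional relu) used to
    # check the padded kernel against a plain, unpadded computation.
    out = x @ w + b
    return np.maximum(out, 0) if with_relu else out

# Shapes chosen so that N % 128 == 0 and K % 128 == 0,
# which is exactly the condition that triggers padding.
M, K, N = 4, 128, 256
x = np.random.rand(M, K).astype(np.float32)
w = np.random.rand(K, N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)
out = fc_refer(x, w, b)
assert out.shape == (M, N)
```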
paddle/fluid/operators/math/fc.cc
Outdated
framework::Tensor Y1;
Y1.Resize({M * (N + 4)});
T* Y1_data = Y1.mutable_data<T>(platform::CPUPlace());
if (padding_weights) {
If the weights are not padded in advance, do x and w need to be padded here?
The padding strategy in FC has been revised.
When N % 128 == 0 && K % 128 == 0, both x and w are padded.
If the weights have already been padded in advance, the padding of w is skipped in fc.cc.
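A minimal sketch of the decision just described (the function name and return convention are hypothetical; the real logic lives in fc.cc and fc_fuse_pass):

```python
def fc_padding_plan(N, K, padding_weights):
    """Which tensors the FC kernel still has to pad at run time.

    padding_weights=True means fc_fuse_pass already padded W offline,
    so the kernel only pads X; otherwise it pads both X and W.
    """
    if N % 128 != 0 or K % 128 != 0:
        return []  # padding only pays off when both dims are multiples of 128
    return ['X'] if padding_weights else ['X', 'W']

# Examples of the three situations discussed in the review:
assert fc_padding_plan(128, 256, True) == ['X']        # W padded by the pass
assert fc_padding_plan(128, 256, False) == ['X', 'W']  # kernel pads both
assert fc_padding_plan(127, 128, False) == []          # no padding needed
```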
Please also provide, in the PR description, the performance data for padding only w without x, and for padding only x without w.
A few minor suggestions; they can be addressed in the next PR.
…tiple of 128. (PaddlePaddle#20972)
* Add fc padding to solve mkl performance test=develop
* fix gpu pass and error information test=develop
* fix fc_fuse_pass_test test=develop
* fix error information test=develop
* fix error information test=develop
* fix name and add fc op padding test test=develop
* fix attributes test=develop
* optimize fc padding test=develop
* fix test test=develop
…22198)
* Optimize the kernel implementation of layernorm with openmp (#20895)
* Add ernie c++ inference test (#21015)
* Add ernie unit test test=develop
* Add ernie unit test test=develop
* Add ernie unit test test=develop
* remove ngraph
* optimize gpu test test=develop
* optimize codes test=develop
* fix cmake fails on inference_download_and_uncompress (#21185)
* solve cmake fails on inference_download_and_uncompress test=develop
* solve cmake fails on inference_download_and_uncompress test=develop
* Add fc padding to improve mkl GEMM's performance when N and K are multiple of 128. (#20972)
* Add fc padding to solve mkl performance test=develop
* fix gpu pass and error information test=develop
* fix fc_fuse_pass_test test=develop
* fix error information test=develop
* fix error information test=develop
* fix name and add fc op padding test test=develop
* fix attributes test=develop
* optimize fc padding test=develop
* fix test test=develop
* Polish the codes of fc when needs padding (#21378) test=develop
* Add ernie large c++ inference test (#21365)
* add ernie-large test test=develop
* add ernie large c++ inference test test=develop
* Modify padding strategy: remove weight copy in fc padding (#21650) test=develop
* optimize fc jit (#21878) test=develop
Co-authored-by: Yihua Xu <yihuaxu@hotmail.com>
When the dimensions of a matrix in an MKL computation are multiples of 128, memory access time increases sharply.
The optimization: when a dimension is a multiple of 128, pad it by 4, which reduces the memory access time.
Intel MKL memory access analysis
Performance comparison of the ERNIE inference model before and after the padding optimization:
Testing showed that both W and X in the FC computation have to be padded to get a meaningful performance gain.
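The trick above can be sketched as follows (a NumPy illustration of the idea, not the actual Paddle kernel; the function name is made up): when K and N are both multiples of 128, zero-pad W from K×N to (K+4)×(N+4) and X from M×K to M×(K+4), run the GEMM on the padded buffers, and keep only the first N output columns. The zero rows/columns contribute nothing, so the result is unchanged.

```python
import numpy as np

def fc_with_padding(X, W):
    """Sketch of the FC padding trick for MKL-friendly dimensions.

    If both K and N are multiples of 128, zero-pad W to (K+4) x (N+4)
    and X to M x (K+4) before the GEMM; the extra zeros do not change
    the first N output columns, but the leading dimensions are no
    longer multiples of 128, which avoids the slow memory-access case.
    """
    M, K = X.shape
    K2, N = W.shape
    assert K == K2
    if K % 128 == 0 and N % 128 == 0:
        Xp = np.zeros((M, K + 4), dtype=X.dtype)
        Xp[:, :K] = X
        Wp = np.zeros((K + 4, N + 4), dtype=W.dtype)
        Wp[:K, :N] = W
        Y = Xp @ Wp       # GEMM on the padded buffers
        return Y[:, :N]   # strip the 4 padded output columns
    return X @ W

# The padded path matches the plain GEMM on a 128-multiple shape.
X = np.random.rand(2, 128).astype(np.float32)
W = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(fc_with_padding(X, W), X @ W, atol=1e-3)
```

Note that padding W and X together is exactly the conclusion stated above: padding only one operand leaves the other GEMM leading dimension at a multiple of 128.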