
Add fc padding to improve mkl GEMM's performance when N and K are multiple of 128. #20972

Merged: 9 commits merged into PaddlePaddle:develop on Nov 26, 2019

Conversation

@GaoWei8 (Contributor) commented Nov 1, 2019

When the sizes of the matrices computed with MKL are multiples of 128, memory access time increases sharply.
The optimization is to pad such a dimension by 4 when it is a multiple of 128, which reduces the memory access time.
Intel MKL memory access analysis
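
To make the idea concrete, here is a minimal standalone sketch (illustration only, using plain CBLAS rather than Paddle's Blas wrapper; the function name pad_and_gemm, the CBLAS header, and the fixed pad of 4 follow the description above, not the actual kernel code):

```cpp
// Standalone sketch of the padding idea, not the actual Paddle kernel.
// When both N and K are multiples of 128, X (M x K) and W (K x N) are copied
// into buffers whose row strides are padded by 4, GEMM runs on the padded
// strides, and the valid M x N block is copied back into Y.
#include <algorithm>
#include <vector>

#include <cblas.h>  // assumed CBLAS header; MKL exposes the same interface

void pad_and_gemm(int M, int N, int K,
                  const float* X, const float* W, float* Y) {
  const int kPad = 4;
  if (N % 128 != 0 || K % 128 != 0) {
    // Common path: no padding needed.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K,
                1.0f, X, K, W, N, 0.0f, Y, N);
    return;
  }
  const int K1 = K + kPad;
  const int N1 = N + kPad;
  std::vector<float> X1(M * K1, 0.0f);
  std::vector<float> W1(K1 * N1, 0.0f);
  std::vector<float> Y1(M * N1, 0.0f);
  for (int i = 0; i < M; ++i) {  // rows of X: stride K -> K1
    std::copy(X + i * K, X + (i + 1) * K, X1.begin() + i * K1);
  }
  for (int k = 0; k < K; ++k) {  // rows of W: stride N -> N1
    std::copy(W + k * N, W + (k + 1) * N, W1.begin() + k * N1);
  }
  // Same M x N x K product, but lda/ldb/ldc are K1/N1/N1, which are no longer
  // multiples of 128, avoiding the pathological memory access pattern.
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K,
              1.0f, X1.data(), K1, W1.data(), N1, 0.0f, Y1.data(), N1);
  for (int i = 0; i < M; ++i) {  // copy the valid M x N block back out
    std::copy(Y1.begin() + i * N1, Y1.begin() + i * N1 + N, Y + i * N);
  }
}
```

The extra copies only happen in the 128-multiple case, so the common path is unchanged.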

ERNIE inference model, performance before and after the padding optimization:

| Threads | Before | After | Improvement |
|---|---|---|---|
| 1 thread | 276.253 ms | 251.464 ms | 8.97% |
| 20 threads | 52.1854 ms | 29.9978 ms | 42.52% |

Testing shows that both W and X in the FC computation must be padded to get a meaningful performance gain.

| Threads | Before | Pad W only (no X padding) | Pad X only (no W padding) |
|---|---|---|---|
| 20 threads | 52.1854 ms | 53.2533 ms | 50.9526 ms |

@luotao1 (Contributor) commented Nov 14, 2019

Please state the performance improvement results in the PR description.

chenwhql previously approved these changes Nov 19, 2019
@chenwhql (Contributor) left a comment


LGTM for PADDLE_ENFORCE

@luotao1 luotao1 requested a review from Xreki November 19, 2019 03:47
luotao1 previously approved these changes Nov 19, 2019
@luotao1 (Contributor) left a comment


LGTM

paddle/fluid/framework/ir/fc_fuse_pass.cc
paddle/fluid/operators/fc_op.cc
paddle/fluid/operators/fc_op.cc
paddle/fluid/operators/mkldnn/fc_mkldnn_op.cc
paddle/fluid/operators/math/fc.cc
paddle/fluid/operators/math/fc.cc
auto blas = math::GetBlas<platform::CPUDeviceContext, T>(context);
blas.MatMul(M, N, K, X, W, Y);
framework::Tensor X1, Y1;
Contributor

  • Do not define multiple variables on one line of code.
  • fc_fuse_pass only decides whether the weight should be padded; the kernel must additionally decide whether x needs padding. The padding of w and x is independent, so the implementation may need to handle several cases:
    • w is padded; check whether x needs padding
    • w is not padded; check whether x needs padding
    • Not every branch necessarily has to be supported, but whether x is padded surely needs to be checked, right?
  • Only create the temporary Tensor when padding is actually needed.

Contributor Author

  • The weight's shape is K*N; padding is applied only when both are divisible by 128. The padding of the weight and of X should stay in sync, so once the padding has been done in the pass, X is not checked again here.

  • The temporary Tensor Y1 is used by the computation below, so it has to stay outside the condition. The temporary X1 can be moved inside the condition.

paddle/fluid/operators/math/fc.cc
paddle/fluid/framework/ir/fc_fuse_pass.cc
paddle/fluid/framework/ir/fc_fuse_pass_tester.cc
paddle/fluid/operators/math/fc.cc
paddle/fluid/operators/math/fc.cc
paddle/fluid/operators/math/fc.cc
paddle/fluid/operators/math/fc.cu
paddle/fluid/operators/mkldnn/fc_mkldnn_op.cc
@@ -78,6 +78,9 @@ def setUp(self):
            'Out': fc_refer(self.matrix, self.with_bias, self.with_relu)
        }

    def padding(self):
        self.attrs = {'padding_weights': False}
Contributor

This will not take effect, because nothing ever calls this padding function.

Contributor Author

Since the FC padding strategy has been reworked, this can test correctness for the (N % 128 == 0 && K % 128 == 0) case.

framework::Tensor Y1;
Y1.Resize({M * (N + 4)});
T* Y1_data = Y1.mutable_data<T>(platform::CPUPlace());
if (padding_weights) {
Contributor

If the weights have not been padded ahead of time, do x and w then need to be padded here?

@GaoWei8 (Contributor Author) Nov 22, 2019

The padding strategy in FC has been reworked.
When N % 128 == 0 && K % 128 == 0, both x and w are padded.
If the weights were already padded ahead of time, the w padding in fc.cc is skipped.
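
A rough standalone sketch of that split (illustration only; pad_rows, FcWeights, fuse_pass_pad and prepare_input are made-up names, not the merged Paddle code): the fuse pass pads the weight once and records padding_weights, so at run time the kernel skips the weight copy and only pads the input.

```cpp
// Illustration of the pass-time / run-time split described above; not the
// merged Paddle code (pad_rows, FcWeights, fuse_pass_pad, prepare_input are
// hypothetical names used only for this sketch).
#include <algorithm>
#include <vector>

// Copy a rows x cols matrix into a zero-filled buffer with row stride cols + pad.
std::vector<float> pad_rows(const float* src, int rows, int cols, int pad) {
  std::vector<float> dst(rows * (cols + pad), 0.0f);
  for (int r = 0; r < rows; ++r) {
    std::copy(src + r * cols, src + (r + 1) * cols,
              dst.begin() + r * (cols + pad));
  }
  return dst;
}

struct FcWeights {
  std::vector<float> w;          // K x N, or K x (N + 4) once padded
  bool padding_weights = false;  // set by the graph pass
};

// Graph pass (runs once): pad the weight and record that it was padded.
void fuse_pass_pad(FcWeights* fc, int K, int N) {
  if (N % 128 == 0 && K % 128 == 0) {
    fc->w = pad_rows(fc->w.data(), K, N, 4);
    fc->padding_weights = true;
  }
}

// Run time: the weight copy is skipped when padding_weights is set; only the
// input X still needs a padded copy before GEMM.
std::vector<float> prepare_input(const FcWeights& fc, const float* X,
                                 int M, int K) {
  if (fc.padding_weights) {
    return pad_rows(X, M, K, 4);            // X padded to row stride K + 4
  }
  return std::vector<float>(X, X + M * K);  // unpadded path
}
```

This matches the numbers in the description: padding only one of the two operands gives little benefit, so the run-time X padding is kept even when the weight was already padded offline.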

python/paddle/fluid/tests/unittests/test_fc_op.py
@Xreki (Contributor) commented Nov 22, 2019

Please add to the PR description the performance numbers for padding only w (without x) and padding only x (without w).

@Xreki (Contributor) left a comment

A few minor revision suggestions; they can be addressed in the next PR.

paddle/fluid/framework/ir/fc_fuse_pass.cc
paddle/fluid/framework/ir/fc_fuse_pass.cc
paddle/fluid/operators/math/fc.cc
paddle/fluid/operators/math/fc.cc
paddle/fluid/operators/math/fc.cc
@Xreki Xreki changed the title from "Add fc padding to solve mkl performance" to "Add fc padding to improve mkl GEMM's performance when N and K are multiple of 128." on Nov 26, 2019
@Xreki Xreki merged commit 234060f into PaddlePaddle:develop Nov 26, 2019
seiriosPlus pushed a commit to seiriosPlus/Paddle that referenced this pull request Dec 9, 2019
…tiple of 128. (PaddlePaddle#20972)

* Add fc padding to solve mkl performance
test=develop

* fix gpu pass and error information
test=develop

* fix fc_fuse_pass_test
test=develop

* fix error information
test=develop

* fix error information
test=develop

* fix name and add fc op padding test
test=develop

* fix attributes
test=develop

* optimize fc padding
test=develop

* fix test
test=develop
GaoWei8 added a commit to GaoWei8/Paddle that referenced this pull request Jan 9, 2020
…tiple of 128. (PaddlePaddle#20972)

* Add fc padding to solve mkl performance
test=develop

* fix gpu pass and error information
test=develop

* fix fc_fuse_pass_test
test=develop

* fix error information
test=develop

* fix error information
test=develop

* fix name and add fc op padding test
test=develop

* fix attributes
test=develop

* optimize fc padding
test=develop

* fix test
test=develop
Xreki pushed a commit that referenced this pull request Jan 10, 2020
…22198)

* Optimize the kernel implementation of layernorm with openmp (#20895)

* Add ernie c++ inference test (#21015)

* Add ernie unit test
test=develop

* Add ernie unit test
test=develop

* Add ernie unit test
test=develop

* remove ngraph

* optimize gpu test
test=develop

* optimize codes
test=develop

* fix cmake fails on inference_download_and_uncompress (#21185)

* solve cmake fails on inference_download_and_uncompress
test=develop

* solve cmake fails on inference_download_and_uncompress
test=develop

* Add fc padding to improve mkl GEMM's performance when N and K are multiple of 128. (#20972)

* Add fc padding to solve mkl performance
test=develop

* fix gpu pass and error information
test=develop

* fix fc_fuse_pass_test
test=develop

* fix error information
test=develop

* fix error information
test=develop

* fix name and add fc op padding test
test=develop

* fix attributes
test=develop

* optimize fc padding
test=develop

* fix test
test=develop

* Polish the codes of fc when needs padding (#21378)

test=develop

* Add ernie large c++ inference test (#21365)

* add ernie-large test
test=develop

* add ernie large c++ inference test
test=develop

* Modify padding strategy: remove weight copy in fc padding (#21650)

test=develop

* optimize fc jit (#21878)

test=develop

Co-authored-by: Yihua Xu <yihuaxu@hotmail.com>
@GaoWei8 GaoWei8 deleted the padding-fcc branch April 3, 2020 03:09