Depthwise Convolution Optimization #3718

Merged (9 commits) on Aug 30, 2017

@hedaoyuan hedaoyuan commented Aug 28, 2017

This depthwise convolution optimization was discussed with @NHZlX. It is based on the ARM NEON instruction set and can also be extended to the x86 SSE and AVX instruction sets.

The optimization works as follows: if the output size is greater than 4, each step calculates four elements of the output.
For example, with a 3x3 convolution filter, 9 instructions calculate four elements of the output:

Output[0, 1, 2, 3]   = R0[0, 1, 2, 3] * K0[0]
Output[0, 1, 2, 3] += R0[1, 2, 3, 4] * K0[1]
Output[0, 1, 2, 3] += R0[2, 3, 4, 5] * K0[2]
Output[0, 1, 2, 3] += R1[0, 1, 2, 3] * K1[0]
Output[0, 1, 2, 3] += R1[1, 2, 3, 4] * K1[1]
Output[0, 1, 2, 3] += R1[2, 3, 4, 5] * K1[2]
Output[0, 1, 2, 3] += R2[0, 1, 2, 3] * K2[0]
Output[0, 1, 2, 3] += R2[1, 2, 3, 4] * K2[1]
Output[0, 1, 2, 3] += R2[2, 3, 4, 5] * K2[2]
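
The nine steps above can be sketched in portable C++ as follows. This is only an illustration, not the code in this PR: `Vec4`, `load4`, `mla`, and `conv3x3_four_outputs` are hypothetical names, and a `std::array` of four floats stands in for one 128-bit NEON register. The sketch also starts from a zeroed accumulator and issues nine multiply-accumulates, whereas the listing above begins with a plain multiply.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Four float lanes, standing in for one 128-bit SIMD register (assumption).
using Vec4 = std::array<float, 4>;

// Load four consecutive floats starting at p, like an unaligned vector load.
inline Vec4 load4(const float* p) { return Vec4{p[0], p[1], p[2], p[3]}; }

// acc += v * k, with the scalar k broadcast to all four lanes.
inline void mla(Vec4& acc, const Vec4& v, float k) {
  for (std::size_t i = 0; i < 4; ++i) acc[i] += v[i] * k;
}

// Compute four adjacent outputs of a 3x3, stride-1 convolution.
// r0, r1, r2 point at three consecutive input rows, each with at least
// six readable floats; k is the 3x3 filter in row-major order.
inline Vec4 conv3x3_four_outputs(const float* r0, const float* r1,
                                 const float* r2, const float* k) {
  assert(r0 && r1 && r2 && k);
  const float* rows[3] = {r0, r1, r2};
  Vec4 acc = Vec4{0.0f, 0.0f, 0.0f, 0.0f};
  for (int ky = 0; ky < 3; ++ky) {
    for (int kx = 0; kx < 3; ++kx) {
      // Shifted load R_ky[kx .. kx+3], scaled by the filter tap K_ky[kx].
      mla(acc, load4(rows[ky] + kx), k[ky * 3 + kx]);
    }
  }
  return acc;
}
```

Calling `conv3x3_four_outputs` with pointers to the same column offset in three consecutive input rows yields `Output[0..3]` for that position; advancing all three pointers by four moves to the next group of outputs.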

Another implementation requires 4 instructions to calculate one element of the output. This method is slower than the previous one, but it can be used to calculate the output elements left over after the four-element steps.

V   = R0[0, 1, 2, x] * K0[0, 1, 2, x]
V += R1[0, 1, 2, x] * K1[0, 1, 2, x]
V += R2[0, 1, 2, x] * K2[0, 1, 2, x]
Output[0] = SUM(V)
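
A minimal portable C++ sketch of this single-output variant is shown below, under the same caveat that it is an illustration rather than the PR's implementation. The names `Vec4`, `load3_zero4`, `mla`, `hsum`, and `conv3x3_one_output` are hypothetical. The unused fourth lane (the `x` above) is assumed to be zeroed so that it does not contribute to the final sum.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Four float lanes, standing in for one 128-bit SIMD register (assumption).
using Vec4 = std::array<float, 4>;

// Load three floats and force the fourth ("x") lane to zero.
inline Vec4 load3_zero4(const float* p) {
  return Vec4{p[0], p[1], p[2], 0.0f};
}

// acc += a * b, lane by lane.
inline void mla(Vec4& acc, const Vec4& a, const Vec4& b) {
  for (std::size_t i = 0; i < 4; ++i) acc[i] += a[i] * b[i];
}

// Horizontal sum of all four lanes.
inline float hsum(const Vec4& v) { return (v[0] + v[1]) + (v[2] + v[3]); }

// Compute one output of a 3x3 convolution.
// r0, r1, r2 point at the three input rows of the 3x3 window;
// k is the 3x3 filter in row-major order.
inline float conv3x3_one_output(const float* r0, const float* r1,
                                const float* r2, const float* k) {
  assert(r0 && r1 && r2 && k);
  Vec4 v = Vec4{0.0f, 0.0f, 0.0f, 0.0f};
  mla(v, load3_zero4(r0), load3_zero4(k));      // V  = R0 * K0
  mla(v, load3_zero4(r1), load3_zero4(k + 3));  // V += R1 * K1
  mla(v, load3_zero4(r2), load3_zero4(k + 6));  // V += R2 * K2
  return hsum(v);                               // Output[0] = SUM(V)
}
```

Three lane-wise multiply-accumulates plus one horizontal sum produce a single output, which is why this path suits only the few elements that do not fill a group of four.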

@hedaoyuan hedaoyuan changed the title from "Convolution" to "Depthwise Convolution Optimization" Aug 28, 2017
@hedaoyuan hedaoyuan requested a review from NHZlX August 28, 2017 09:51

@NHZlX NHZlX left a comment


https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/function/DepthwiseConvOp.cpp#L21 Can the function here be removed now? Also, shouldn't some checks be added, for example that the device must be GPU?

const float*, const float*, int, int, int, int, int, int, float*)>
DepthWiseConv;

if (filterWidth == 3 && strideW() == 1) {

I think the most naive implementation should be added here, and I think an implementation like https://github.com/NHZlX/Paddle/blob/mobilenet_neon/paddle/function/neon/DepthwiseConvCpu.h#L98 would be better.


@hedaoyuan hedaoyuan Aug 29, 2017


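A naive implementation should not be added here. Adding it would mean that whenever the optimized implementation is not supported, the naive implementation runs; in practice, when the optimized implementation is not supported, it is better to fall back to the GemmConv implementation. In addition, NaiveConv already has its own Function implementation, so ConvLayer can decide which branch to take.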


NHZlX commented Aug 29, 2017

LGTM

NHZlX previously approved these changes Aug 29, 2017
@hedaoyuan hedaoyuan merged commit b45d020 into PaddlePaddle:develop Aug 30, 2017
@hedaoyuan hedaoyuan added this to Convolution Optimization in Embedded and Mobile Deployment Sep 15, 2017
heavengate pushed a commit to heavengate/Paddle that referenced this pull request Aug 16, 2021