Depthwise Convolution Optimization #3718

hedaoyuan · 2017-08-28T07:02:02Z

This depthwise convolution optimization was discussed with @NHZlX. It is based on the ARM NEON instruction set and can also be extended to the x86 SSE and AVX instruction sets.

The optimization works as follows: if the output size is greater than 4, then each step calculates four elements of the output.
For example, with a 3x3 convolution filter, 9 instructions calculate four elements of the output:

Output[0, 1, 2, 3]   = R0[0, 1, 2, 3] * K0[0]
Output[0, 1, 2, 3] += R0[1, 2, 3, 4] * K0[1]
Output[0, 1, 2, 3] += R0[2, 3, 4, 5] * K0[2]
Output[0, 1, 2, 3] += R1[0, 1, 2, 3] * K1[0]
Output[0, 1, 2, 3] += R1[1, 2, 3, 4] * K1[1]
Output[0, 1, 2, 3] += R1[2, 3, 4, 5] * K1[2]
Output[0, 1, 2, 3] += R2[0, 1, 2, 3] * K2[0]
Output[0, 1, 2, 3] += R2[1, 2, 3, 4] * K2[1]
Output[0, 1, 2, 3] += R2[2, 3, 4, 5] * K2[2]
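
The nine steps above can be sketched in portable C++. This is only an illustration, not the code from this PR: `Vec4`, `load4`, `mla`, and `conv3x3Four` are hypothetical names, and a plain four-element array stands in for a 128-bit NEON register. On NEON, each `mla` call would presumably map to a multiply-accumulate intrinsic such as `vmlaq_n_f32`, but that mapping is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Four float lanes standing in for one 128-bit SIMD register (hypothetical helper type).
using Vec4 = std::array<float, 4>;

// Load four consecutive floats starting at p.
inline Vec4 load4(const float* p) { return {p[0], p[1], p[2], p[3]}; }

// acc += v * k on every lane: models one multiply-accumulate instruction.
inline void mla(Vec4& acc, const Vec4& v, float k) {
  for (std::size_t i = 0; i < 4; ++i) acc[i] += v[i] * k;
}

// Computes four adjacent outputs of a 3x3, stride-1 depthwise convolution.
// r0, r1, r2 point at the three input rows (at least 6 readable floats each);
// k points at the 9 filter weights in row-major order (K0, K1, K2).
inline Vec4 conv3x3Four(const float* r0, const float* r1, const float* r2,
                        const float* k) {
  assert(r0 && r1 && r2 && k);
  Vec4 out = {0.f, 0.f, 0.f, 0.f};
  const float* rows[3] = {r0, r1, r2};
  for (int y = 0; y < 3; ++y) {
    for (int x = 0; x < 3; ++x) {
      // Row y shifted by x, scaled by weight K[y][x]: 9 steps in total.
      mla(out, load4(rows[y] + x), k[y * 3 + x]);
    }
  }
  return out;
}
```

Each call reads input columns 0 through 5 of the three rows, which is why the fast path needs at least four outputs left in the row.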

Another implementation requires 4 instructions to calculate one element of the output. It is slower than the previous method, but it can be used to calculate the remaining output elements.

V   = R0[0, 1, 2, x] * K0[0, 1, 2, x]
V += R1[0, 1, 2, x] * K1[0, 1, 2, x]
V += R2[0, 1, 2, x] * K2[0, 1, 2, x]
Output[0] = SUM(V)
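
A minimal portable sketch of this one-output method follows. The names `Lane4`, `load3`, and `conv3x3One` are hypothetical and do not come from the PR. The sketch assumes the unused `x` lane is zeroed so that it does not affect `SUM(V)`; a real NEON kernel could equally exclude that lane from the sum.

```cpp
#include <array>
#include <cassert>

// Four float lanes standing in for one SIMD register (hypothetical helper type).
using Lane4 = std::array<float, 4>;

// Loads three valid floats and zeroes the fourth lane (the "x" lane above).
inline Lane4 load3(const float* p) { return {p[0], p[1], p[2], 0.f}; }

// Computes a single output of a 3x3 depthwise convolution.
// r0, r1, r2 point at the three input rows (3 readable floats each);
// k points at the 9 filter weights in row-major order.
inline float conv3x3One(const float* r0, const float* r1, const float* r2,
                        const float* k) {
  assert(r0 && r1 && r2 && k);
  Lane4 v = {0.f, 0.f, 0.f, 0.f};
  const float* rows[3] = {r0, r1, r2};
  for (int y = 0; y < 3; ++y) {
    const Lane4 r = load3(rows[y]);
    const Lane4 w = load3(k + 3 * y);
    // One lane-wise multiply-accumulate per filter row (3 steps).
    for (int i = 0; i < 4; ++i) v[i] += r[i] * w[i];
  }
  // Fourth step: horizontal sum across the lanes.
  return v[0] + v[1] + v[2] + v[3];
}
```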

NHZlX

https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/function/DepthwiseConvOp.cpp#L21 Can the function here be removed now? Also, shouldn't some checks be added, for example that the device must be GPU?

NHZlX · 2017-08-28T16:25:39Z

paddle/function/neon/NeonDepthwiseConv.cpp

+        const float*, const float*, int, int, int, int, int, int, float*)>
+        DepthWiseConv;
+
+    if (filterWidth == 3 && strideW() == 1) {


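I think the most naive implementation should be added here, and I think the implementation at https://github.com/NHZlX/Paddle/blob/mobilenet_neon/paddle/function/neon/DepthwiseConvCpu.h#L98 would be better.

A naive implementation should not be added here. Adding one would mean that whenever the optimized implementation is not supported, the naive implementation runs; in practice, when the optimized implementation is not supported, it is better to fall back to the GemmConv implementation. Besides, NaiveConv already has its own Function implementation, so ConvLayer can decide which branch to take.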
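
The fallback described in this reply could be sketched as below. Everything here is hypothetical: `ConvImpl` and `chooseConvImpl` are illustrative names, not Paddle APIs, and the supported filter sizes and strides are only inferred from the commit titles of this PR.

```cpp
#include <cassert>

// Hypothetical tag for the two paths discussed in the review.
enum class ConvImpl { NeonDepthwise, Gemm };

// Picks the path for a depthwise layer: the optimized NEON kernel when the
// shape is covered, otherwise the GEMM-based convolution (no naive path).
// Assumption: filter sizes 3 and 4 with strides 1 and 2 are the covered cases.
inline ConvImpl chooseConvImpl(int filterSize, int stride) {
  assert(filterSize > 0 && stride > 0);
  const bool optimized = (filterSize == 3 || filterSize == 4) &&
                         (stride == 1 || stride == 2);
  return optimized ? ConvImpl::NeonDepthwise : ConvImpl::Gemm;
}
```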

NHZlX · 2017-08-29T05:48:39Z

LGTM

hedaoyuan added 7 commits August 24, 2017 23:45

Add NeonDepthwiseConvFunction.

0dffe68

Add DepthwiseConvKernel for filter size is 4.

b7885b0

Neon depthwise conv with filterSize = 3 and stride = 2.

6dcff9a

Neon depthwise conv with filterSize = 4 and stride = 2.

f00c411

Refine NeonDepthwiseConvFunction.

227fdfb

Fix CMakeLists.text

3a75b4b

ExpandConvLayer adds support of arm-neon acceleration.

34a92ab

hedaoyuan changed the title ~~Convolution~~ Depthwise Convolution Optimization Aug 28, 2017

hedaoyuan requested a review from NHZlX August 28, 2017 09:51

Remove NeonDepthwiseConv.h

5df384d

NHZlX reviewed Aug 29, 2017

NHZlX previously approved these changes Aug 29, 2017

Fix a small bug.

168707c

hedaoyuan dismissed NHZlX’s stale review via 168707c August 30, 2017 03:35

NHZlX approved these changes Aug 30, 2017

hedaoyuan merged commit b45d020 into PaddlePaddle:develop Aug 30, 2017

hedaoyuan added this to Convolution Optimization in Embedded and Mobile Deployment Sep 15, 2017

heavengate pushed a commit to heavengate/Paddle that referenced this pull request Aug 16, 2021

[transformer] add Deformable DETR base code (PaddlePaddle#3718)

e8aeb80
