Mobilenet gpu implementation #2776

NHZlX · 2017-07-07T13:58:16Z

The standard group convolution used for MobileNet is very slow, particularly during training. So I first implemented GPU acceleration for the depthwise convolution in MobileNet 1.0 with 224 × 224 input.

Current implementation performance

| Category | Batch 1 forward+backward (s) | Batch 40 forward+backward (s) |
|---|---|---|
| group convolution | 0.75 | 29.23 |
| cudnn convolution | 0.74 | 2.888 |
| depthwise GPU acceleration | 0.052 | 1.27 |

Future work

There is still room for further GPU acceleration.
A new PR will add CPU acceleration for MobileNet.

hedaoyuan

Add a test file modeled on ConvOpTest.cpp, for example named DepthwiseConvOpTest.cpp, and use the GemmConv function to verify the correctness of the DepthwiseConv function.
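The idea behind this request, checking a specialized depthwise kernel against a general reference, can be illustrated with a minimal CPU-only sketch. It is not Paddle's test harness: `groupedConvRef`, `depthwiseConvNaive` and `depthwiseMatchesReference` are hypothetical names, and the sketch assumes one NCHW image, stride 1, no padding, and groups equal to the channel count.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive grouped convolution for one NCHW image, stride 1, no padding.
// Filter layout: [outC][inC / groups][K][K]. Plays the role of the reference.
std::vector<float> groupedConvRef(const std::vector<float>& input,
                                  const std::vector<float>& filter,
                                  int inC, int outC, int groups,
                                  int H, int W, int K) {
  assert(inC % groups == 0 && outC % groups == 0);
  const int outH = H - K + 1, outW = W - K + 1;
  const int inPerGroup = inC / groups, outPerGroup = outC / groups;
  std::vector<float> output(static_cast<std::size_t>(outC) * outH * outW);
  for (int oc = 0; oc < outC; ++oc) {
    const int group = oc / outPerGroup;
    for (int oh = 0; oh < outH; ++oh) {
      for (int ow = 0; ow < outW; ++ow) {
        float sum = 0.0f;
        for (int i = 0; i < inPerGroup; ++i) {
          const int ic = group * inPerGroup + i;
          for (int kh = 0; kh < K; ++kh)
            for (int kw = 0; kw < K; ++kw)
              sum += input[(ic * H + oh + kh) * W + ow + kw] *
                     filter[((oc * inPerGroup + i) * K + kh) * K + kw];
        }
        output[(oc * outH + oh) * outW + ow] = sum;
      }
    }
  }
  return output;
}

// Specialized depthwise path: channel c is convolved only with filter c.
std::vector<float> depthwiseConvNaive(const std::vector<float>& input,
                                      const std::vector<float>& filter,
                                      int C, int H, int W, int K) {
  const int outH = H - K + 1, outW = W - K + 1;
  std::vector<float> output(static_cast<std::size_t>(C) * outH * outW);
  for (int c = 0; c < C; ++c)
    for (int oh = 0; oh < outH; ++oh)
      for (int ow = 0; ow < outW; ++ow) {
        float sum = 0.0f;
        for (int kh = 0; kh < K; ++kh)
          for (int kw = 0; kw < K; ++kw)
            sum += input[(c * H + oh + kh) * W + ow + kw] *
                   filter[(c * K + kh) * K + kw];
        output[(c * outH + oh) * outW + ow] = sum;
      }
  return output;
}

// Feeds both paths the same deterministic data and compares the results.
bool depthwiseMatchesReference(int C, int H, int W, int K) {
  std::vector<float> input(static_cast<std::size_t>(C) * H * W);
  std::vector<float> filter(static_cast<std::size_t>(C) * K * K);
  for (std::size_t i = 0; i < input.size(); ++i) input[i] = 0.01f * (i % 17);
  for (std::size_t i = 0; i < filter.size(); ++i)
    filter[i] = 0.1f * (i % 5) - 0.2f;
  const std::vector<float> ref =
      groupedConvRef(input, filter, C, C, C, H, W, K);
  const std::vector<float> out = depthwiseConvNaive(input, filter, C, H, W, K);
  if (ref.size() != out.size()) return false;
  for (std::size_t i = 0; i < ref.size(); ++i)
    if (std::fabs(ref[i] - out[i]) > 1e-5f) return false;
  return true;
}
```

Calling `depthwiseMatchesReference(4, 7, 7, 3)` exercises both paths on a small shape. A real test would presumably sweep batch sizes, strides and paddings as well.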

hedaoyuan · 2017-07-10T02:41:47Z

paddle/function/DepthwiseConvOpGpu.cu

+#include "ConvOp.h"
+#include "DepthwiseConvOp.h"
+#include "GemmFunctor.h"
+#include "paddle/math/MemoryHandle.h"


Remove lines 15 and 18.

hedaoyuan · 2017-07-10T02:48:46Z

paddle/function/DepthwiseConvOpGpu.cu

+#else 
+using real=float;
+#endif
+template class DepthwiseConvGradInputFunctor<DEVICE_TYPE_GPU, real>;


There is no need to define real; just instantiate the template explicitly with the float and double types.
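As a rough illustration of this suggestion, the sketch below explicitly instantiates a template for both precisions instead of going through a build-dependent `real` alias. `ScaleFunctor` is a hypothetical stand-in for the functors in this PR, and the local `DeviceType` enum only mimics the names seen in the diff.

```cpp
#include <cassert>

// Stand-in for the device tag used in the diff (assumption, not Paddle's header).
enum DeviceType { DEVICE_TYPE_CPU = 0, DEVICE_TYPE_GPU = 1 };

// Hypothetical functor: declaration as it would appear in a header.
template <DeviceType Device, class T>
class ScaleFunctor {
 public:
  T operator()(T x) const;
};

// Definition as it would appear in the .cpp/.cu file.
template <DeviceType Device, class T>
T ScaleFunctor<Device, T>::operator()(T x) const {
  return x * static_cast<T>(2);
}

// Explicit instantiation for both precisions; no `real` alias is required.
template class ScaleFunctor<DEVICE_TYPE_CPU, float>;
template class ScaleFunctor<DEVICE_TYPE_CPU, double>;
```

With both instantiations emitted, callers can choose the precision at the call site, for example `ScaleFunctor<DEVICE_TYPE_CPU, double>()(1.5)`.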

hedaoyuan · 2017-07-10T02:50:58Z

paddle/function/DepthwiseConvOp.h

+
+#pragma once
+
+#include "ConvOp.h"


This header file does not need to be included.

hedaoyuan · 2017-07-10T02:52:52Z

paddle/function/DepthwiseConvOp.h

+namespace paddle {
+
+template <DeviceType Device, class T>
+class DepthwiseConvFunctor {


Add some comments explaining how this interface is used and what the parameters mean.

hedaoyuan · 2017-07-10T02:54:24Z

paddle/function/DepthwiseConvOp.h

+};
+
+template <DeviceType Device, class T>
+class DepthwiseConvGradInputFunctor {


Add some comments.

hedaoyuan · 2017-07-10T02:54:30Z

paddle/function/DepthwiseConvOp.h

+};
+
+template <DeviceType Device, class T>
+class DepthwiseConvGradFilterFunctor {


Add some comments.

hedaoyuan · 2017-07-10T02:56:02Z

paddle/function/DepthwiseConvOpGpu.cu

+            int outputChannels,
+            int outputHeight,
+            int outputWidth,
+			int inputHeight,


Don't use tabs.

hedaoyuan · 2017-07-10T02:58:52Z

paddle/function/DepthwiseConvOp.cpp

+    size_t outputSize = batchSize * outputChannels * outputHeight * outputWidth;
+
+    DepthwiseConvFunctor<Device, real> depthwiseConv;
+    depthwiseConv(outputSize,


You can remove the outputSize parameter; it can be calculated inside depthwiseConv.

hedaoyuan · 2017-07-10T03:13:01Z

paddle/gserver/layers/DepthwiseConvLayer.cpp

+namespace paddle {
+
+/*
+ * The calculation of the exconvt(convolution transpose (deconv) operation)


What does this comment mean?

I have deleted the wrong comment.

hedaoyuan · 2017-07-10T03:25:11Z

paddle/gserver/layers/DepthwiseConvLayer.h

+ * The config file api is img_conv_layer.
+ */
+
+class DepthwiseConvLayer : public ExpandConvBaseLayer {


I think you can use the DepthwiseConv function in the forward and backward calculations of ExpandConvLayer.

They do share a lot of duplicated code, but there are still subtle differences, e.g. the weight_multiplier_ in DepthwiseConvLayer.

It may be better to put weight_multiplier_ in DepthwiseConvGradFilterFunction. The other two functions do not need weight_multiplier_.

The weight_multiplier_ operations have been replaced by sumRows(...) from paddle/math/BaseMatrix.h, with basically no impact on efficiency.

Is it necessary to implement this as a Layer?
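On the weight_multiplier_ exchange above: a buffer of ones is commonly used to reduce a matrix along one dimension with a matrix-vector product. Assuming that was its role here (the thread does not spell it out), the sketch below shows that a direct row sum gives the same result without the extra buffer. `reduceWithOnes` and `reduceDirect` are hypothetical helpers and do not reproduce the signature of Paddle's sumRows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reduction via an explicit multiplier: y[r] = sum_c m[r][c] * ones[c].
std::vector<float> reduceWithOnes(const std::vector<float>& m,
                                  std::size_t rows, std::size_t cols) {
  assert(m.size() == rows * cols);
  const std::vector<float> ones(cols, 1.0f);  // the extra buffer
  std::vector<float> y(rows, 0.0f);
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c) y[r] += m[r * cols + c] * ones[c];
  return y;
}

// Same reduction as a plain row sum; no multiplier buffer is needed.
std::vector<float> reduceDirect(const std::vector<float>& m,
                                std::size_t rows, std::size_t cols) {
  assert(m.size() == rows * cols);
  std::vector<float> y(rows, 0.0f);
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c) y[r] += m[r * cols + c];
  return y;
}
```

For a 2 × 3 matrix holding 1 through 6 in row-major order, both helpers return the row sums 6 and 15.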

hedaoyuan · 2017-07-10T03:37:22Z

Check why Travis CI is failing and fix it.

NHZlX · 2017-07-10T15:32:57Z

The clang-format version in my workspace is 5.0, which is why the Travis CI check fails. I will fix it ASAP.

hedaoyuan

Can the TESTs in DepthwiseConvOpTest.cpp be merged into ConvOpTest.cpp?

hedaoyuan · 2017-07-11T13:07:17Z

paddle/function/DepthwiseConvOpTest.cpp

+    }
+  }
+};
+


Add some TESTs in which function1 is GemmConv and function2 is DepthwiseConv.

hedaoyuan · 2017-07-11T13:07:59Z

paddle/function/DepthwiseConvOpTest.cpp

+#ifndef PADDLE_ONLY_CPU
+TEST(Forward, GEMM2) {
+  DepthwiseConvolutionTest<DEVICE_TYPE_GPU, DEVICE_TYPE_GPU> test(
+      "DepthwiseConv-GPU", "DepthwiseConv-GPU", kForwardTest);


"DepthwiseConv-GPU", "DepthwiseConv-GPU" -> "DepthwiseConv-CPU", "DepthwiseConv-GPU"

hedaoyuan · 2017-07-11T13:08:55Z

paddle/function/DepthwiseConvOpTest.cpp

+  DepthwiseConvolutionTest<DEVICE_TYPE_GPU, DEVICE_TYPE_GPU> test(
+      "DepthwiseConv-GPU", "DepthwiseConv-GPU", kForwardTest);
+  DepthwiseConvolutionTest2<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU> test2(
+      "DepthwiseConv-GPU", "DepthwiseConv-GPU", kForwardTest);


same as above

I know, but the CPU version of depthwise conv has not been implemented yet.

Will you add the CPU code in this PR? If not, I don't think these tests are necessary for now. Add a TODO explaining that the test cases need to be added once the CPU version is implemented.

There will be a new PR for the CPU code. I will add a TODO.

hedaoyuan · 2017-07-11T13:09:04Z

paddle/function/DepthwiseConvOpTest.cpp

+
+TEST(BackwardInput, GEMM) {
+  DepthwiseConvolutionTest<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU> test(
+      "DepthwiseConvGradInput-GPU",


same as above

hedaoyuan · 2017-07-11T13:09:11Z

paddle/function/DepthwiseConvOpTest.cpp

+
+TEST(BackwardFilter, GEMM) {
+  DepthwiseConvolutionTest<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU> test(
+      "DepthwiseConvGradFilter-GPU",


same as above

hedaoyuan · 2017-07-13T02:37:12Z

paddle/function/ConvOpTest.cpp

@@ -177,6 +177,156 @@ class ConvolutionTest2 {
  }
 };

+template <DeviceType DType1, DeviceType DType2>
+class DepthwiseConvolutionTest {


My main point is to check whether DepthwiseConvolutionTest can be replaced by ConvolutionTest; most of the code in the two is identical.

hedaoyuan · 2017-07-13T02:39:18Z

paddle/function/ConvOpTest.cpp

+// version of depthwiseConv is implemented.
+
+#ifndef PADDLE_ONLY_CPU
+TEST(DepthwiseConvForward, GEMM) {


Just keep the TEST stubs here and delete the implementations for now; they serve no purpose and only consume test time.

hedaoyuan · 2017-07-13T02:39:43Z

paddle/function/DepthwiseConvOp.cpp

+#include "DepthwiseConvOp.h"
+#include "ConvOp.h"
+#include "GemmFunctor.h"
+//#include "paddle/math/MemoryHandle.h"


Remove the unused code.

hedaoyuan · 2017-07-13T02:41:38Z

paddle/function/DepthwiseConvOp.cpp

+    check(inputs, outputs);
+    const TensorShape& output = inputs[0].shape();
+    const TensorShape& input = inputs[1].shape();
+    // const TensorShape& multiplier = inputs[2].shape();


Remove the unused code.

hedaoyuan · 2017-07-13T02:42:43Z

paddle/function/DepthwiseConvOp.cpp

+  }
+
+  void calc(const BufferArgs& inputs, const BufferArgs& outputs) override {
+    // CHECK_EQ(numInputs_, inputs.size());


Should this commented-out line be enabled or removed?

hedaoyuan · 2017-07-13T02:55:55Z

paddle/function/DepthwiseConvOp.cpp

+    CHECK_EQ(numInputs_, inputs.size());
+    CHECK_EQ(numOutputs_, outputs.size());
+    check(inputs, outputs);
+    // Since the implementation of Col2ImFunctor is ADD_TO,


This comment is also completely unrelated to the code here.

hedaoyuan · 2017-07-13T02:56:41Z

paddle/function/DepthwiseConvOp.cpp

+    const TensorShape& output = outputs[0].shape();
+
+    size_t batchSize = input[0];
+    // size_t inputChannels = input[1];


Why doesn't depthwiseConv need inputChannels?

hedaoyuan · 2017-07-13T03:05:19Z

paddle/function/ConvOpTest.cpp

+      for (size_t inputSize : {7, 14, 54}) {
+        for (size_t filterSize : {1, 3, 5}) {
+          for (size_t inputChannels : {64, 128}) {
+            size_t outputChannels = inputChannels;


What about the case where outputChannels != inputChannels?
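For the outputChannels != inputChannels question, a minimal sketch of the usual convention may help: when groups equals inputChannels and outputChannels is a multiple of it, each input channel feeds a fixed number of consecutive output channels. `inputChannelFor` is a hypothetical helper, and the mapping shown is the common grouped-convolution convention rather than something confirmed by this PR's kernels.

```cpp
#include <cassert>

// Returns the input channel read by output channel `oc`, assuming
// groups == inputChannels and outputChannels % inputChannels == 0.
int inputChannelFor(int oc, int inputChannels, int outputChannels) {
  assert(inputChannels > 0 && outputChannels % inputChannels == 0);
  assert(oc >= 0 && oc < outputChannels);
  const int multiplier = outputChannels / inputChannels;  // filters per channel
  return oc / multiplier;
}
```

With 64 input and 128 output channels, output channels 0 and 1 both read input channel 0, and output channel 127 reads input channel 63. When the two counts are equal, the mapping is the identity.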

hedaoyuan · 2017-07-13T03:06:40Z

paddle/function/DepthwiseConvOpGpu.cu

+		BaseMatrix filterGradMatrix(inputChannels * filterHeight * filterWidth, 1, filterGrad, false, true);
+
+        for(int i = 0; i < batchSize; i++) {
+			ConvolutionDepthwiseFilterBackward<T>


Tidy up this code; at the very least, align it.

hedaoyuan · 2017-07-13T03:07:37Z

paddle/function/DepthwiseConvOpGpu.cu

+            int paddingH,
+            int paddingW,
+            T* inputGrad){
+


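The formatter may not have caught this, so please check the code formatting manually. Some places have multiple blank lines, and some of the alignment is messy.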

hedaoyuan

The DepthwiseConv functions need checks on groups_ and channels. Also add tests for the Depthwise functions.
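A check of this kind might look like the sketch below. The exact conditions enforced in the PR are not shown in this thread, so `isDepthwiseConfig` and its rules (groups equal to the input channel count, output channels divisible by groups) are assumptions based on the usual definition of depthwise convolution.

```cpp
#include <cassert>

// Hypothetical validity check for a depthwise configuration.
bool isDepthwiseConfig(int groups, int inputChannels, int outputChannels) {
  return groups > 0 &&
         groups == inputChannels &&       // one group per input channel
         outputChannels % groups == 0;    // whole number of filters per group
}
```

Under these assumptions, `isDepthwiseConfig(64, 64, 128)` holds, while `isDepthwiseConfig(32, 64, 64)` does not and would need the general grouped path.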

hedaoyuan · 2017-07-21T03:16:25Z

paddle/function/ConvOpTest.cpp

@@ -228,24 +231,76 @@ TEST(Forward, GEMM) {
 #ifndef PADDLE_ONLY_CPU
 TEST(Forward, GEMM2) {
  ConvolutionTest<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU> test(
-      "GemmConv-CPU", "GemmConv-GPU", kForwardTest);
+      "GemmConv-CPU", "GemmConv-GPU", kForwardTest, false);


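You can give the interface parameter a default of false, so this call needs one fewer argument.

The default is true, so depthwise conv does not need to pass the argument.

That also works, but depthwise is an optimized special case, so default=false is better.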
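The default-argument suggestion can be sketched as follows. `ConvTestConfig` and the parameter name `isDepthwise` are hypothetical; the diff only shows that a trailing bool was added to the test constructor.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the test class constructor discussed above.
struct ConvTestConfig {
  std::string conv1;
  std::string conv2;
  bool isDepthwise;

  // Defaulting to false keeps existing call sites unchanged; only the
  // special-case depthwise tests pass `true` explicitly.
  ConvTestConfig(const std::string& first, const std::string& second,
                 bool depthwise = false)
      : conv1(first), conv2(second), isDepthwise(depthwise) {}
};
```

Existing calls such as `ConvTestConfig("GemmConv-CPU", "GemmConv-GPU")` then compile without the extra argument, and a depthwise test opts in with a trailing `true`.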

hedaoyuan · 2017-07-21T03:17:51Z

paddle/function/DepthwiseConvOp.cpp

+    checkShape(input, filter, output);
+  }
+
+  void calc(const BufferArgs& inputs, const BufferArgs& outputs) override {


This needs an added CHECK_EQ(outputs[0].getArgType(), ADD_TO);

hedaoyuan · 2017-07-21T03:20:07Z

paddle/function/DepthwiseConvOp.cpp

+    checkShape(input, filter, output);
+  }
+
+  void calc(const BufferArgs& inputs, const BufferArgs& outputs) override {


This needs an added CHECK_EQ(outputs[0].getArgType(), ADD_TO);
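The requested check guards how the result is written to the output buffer. As a rough illustration of why the distinction matters, the sketch below contrasts accumulating into an output with overwriting it. `WriteMode` and `writeResult` are hypothetical stand-ins and are not Paddle's BufferArg API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical write modes, mirroring the accumulate/overwrite distinction.
enum class WriteMode { AssignTo, AddTo };

// Writes `result` into `out` according to `mode`.
void writeResult(std::vector<float>& out, const std::vector<float>& result,
                 WriteMode mode) {
  assert(out.size() == result.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    if (mode == WriteMode::AddTo) {
      out[i] += result[i];  // keeps what was already accumulated
    } else {
      out[i] = result[i];   // discards previous contents
    }
  }
}
```

A kernel that always accumulates would give wrong results if a caller expected overwrite semantics, which is presumably why the reviewer asks for the mode to be asserted up front.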

NHZlX added 5 commits July 4, 2017 17:05

set depthwise conv layer interface in python

211f83f

add depthwise operation and depthwise conv layer

eeb17c2

add the mobilenet gpu acceleration, cpu is in the process

efae51c

add mobilenet gpu grad test, the test is ok

f4e7ae5

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

36e7800

… mobilenet_gpu

NHZlX requested review from Xreki and hedaoyuan July 7, 2017 13:58

hedaoyuan requested changes Jul 10, 2017

NHZlX added 4 commits July 10, 2017 16:59

add the comments for .h file and code tiny modify

064dc88

use the expandconvlayer forward and backward, add the explain for class

198164a

add depthwise conv test

a3ce6aa

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

e92f002

… mobilenet_gpu

hedaoyuan requested changes Jul 11, 2017

NHZlX added 4 commits July 12, 2017 15:25

move DepthwiseConvOpTest.cpp to ConvOpTest.cpp

fd4b113

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

433935a

… mobilenet_gpu

modify format accored with clang-format 3.8

2bc08f8

modify format accored with clang-format 3.8

ccd46d1

NHZlX mentioned this pull request Jul 12, 2017

The need of cpu acceleartion implementation of depthwise convolution in Mobilenet. #2826

Closed

NHZlX added 2 commits July 12, 2017 21:09

the groups default should be None

030a3db

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

fc8aedb

… mobilenet_gpu

hedaoyuan requested changes Jul 13, 2017

Xreki added this to In Progress in Embedded and Mobile Deployment Jul 14, 2017

NHZlX added 6 commits July 14, 2017 11:16

modify the format and delete useless comment

c43f693

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

6267312

… mobilenet_gpu

fuse the conv and depthwise conv together

02e04b4

support inputchannels != outputchannels of depthwiseconv

11588b3

add comments for python api

d43fbba

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

44927bf

… mobilenet_gpu

NHZlX added 11 commits July 18, 2017 22:57

modity the format

dbb6588

accelerate inputbackward(delete 'if' in this func) of depthwise conv

66520af

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

d50c71f

… mobilenet_gpu

delete useless .h header in DepthwiseConvOpGpu.cu

f7390d1

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

21ab0eb

… mobilenet_gpu

fuse interface of depthwise to expand in python api

77ff97a

fuse interface of depthwise to expandconv

8199886

modify format, and modify the layer grad test, op test

1f516fa

tiny modify the test

bd54eb9

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

4d6be97

… mobilenet_gpu

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

5b07d4e

… mobilenet_gpu

hedaoyuan reviewed Jul 20, 2017

NHZlX added 4 commits July 21, 2017 00:13

add depthwiseconv test and fix the little bug of the convOpTest

248149f

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

d5b0c57

… mobilenet_gpu

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

cfd4c05

… mobilenet_gpu

add check for groups and inputChannels

e8d171b

hedaoyuan approved these changes Jul 21, 2017

add check: CHECK_EQ(outputs[0].getArgType(), ADD_TO)

6c528cb

NHZlX merged commit 91d2a57 into PaddlePaddle:develop Jul 21, 2017

hedaoyuan moved this from In Progress to Convolution optimization in Embedded and Mobile Deployment Jul 24, 2017

NHZlX deleted the mobilenet_gpu branch August 3, 2017 11:52

