Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mobilenet gpu implementation #2776

Merged
merged 37 commits into from
Jul 21, 2017
Merged

Mobilenet gpu implementation #2776

merged 37 commits into from
Jul 21, 2017

Conversation

NHZlX
Copy link
Contributor

@NHZlX NHZlX commented Jul 7, 2017

The normal group convolution for mobilenet is very slow, particularly in training process. So, i firstly implement the gpu acceleration for depthwise convolution for mobilenet 1.0 with input 224 *224.

current Implementation effect

category batch 1 forwardbackward (s) batch 40 forwardbackward(s)
group convolution 0.75 29.23
cudnn convolution 0.74 2.888
depthwise gpu acceleration 0.052 1.27

future

  • there is still acceleration room for gpu.
  • a new pr for cpu acceleration of mobilenet.

@NHZlX NHZlX requested review from Xreki and hedaoyuan July 7, 2017 13:58
Copy link
Contributor

@hedaoyuan hedaoyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test file like ConvOpTest.cpp, which can be named DepthwiseConvOpTest.cpp. And use GemmConv Function to detect the correctness of DepthwiseConv Function.

#include "ConvOp.h"
#include "DepthwiseConvOp.h"
#include "GemmFunctor.h"
#include "paddle/math/MemoryHandle.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove line 15, 18.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

#else
using real=float;
#endif
template class DepthwiseConvGradInputFunctor<DEVICE_TYPE_GPU, real>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to define real, just use the float and double type to instantiate the template.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


#pragma once

#include "ConvOp.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not need to include this header file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

namespace paddle {

template <DeviceType Device, class T>
class DepthwiseConvFunctor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some comments, explain the use of this interface and the meaning of the parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

};

template <DeviceType Device, class T>
class DepthwiseConvGradInputFunctor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

};

template <DeviceType Device, class T>
class DepthwiseConvGradFilterFunctor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

int outputChannels,
int outputHeight,
int outputWidth,
int inputHeight,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use tab.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

size_t outputSize = batchSize * outputChannels * outputHeight * outputWidth;

DepthwiseConvFunctor<Device, real> depthwiseConv;
depthwiseConv(outputSize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove the outputSize parameter, this parameter can be calculated inside the depthwiseConv.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

namespace paddle {

/*
* The calculation of the exconvt(convolution transpose (deconv) operation)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this comment mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have deleted the wrong comment.

* The config file api is img_conv_layer.
*/

class DepthwiseConvLayer : public ExpandConvBaseLayer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use DepthwiseConv Function in the forward and backward calculations of ExpandConvLayer '.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although they have a lot of code duplication, but there are still subtle differences, eg, the weight_multiplier_ in depthwiseConvLayer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put the weight_multiplier_ in DepthwiseConvGradFilterFunction is better. The other two functions do not need weight_multiplier_.

Copy link
Contributor Author

@NHZlX NHZlX Jul 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the weight_multiplier_ related operations were replaced by operation sumRows(...) in paddle/math/BaseMatrix.h, and basically no impact on efficiency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个有必要实现成一个Layer吗?

@hedaoyuan
Copy link
Contributor

Check the reason why Travis CI does not pass and fix.

@NHZlX
Copy link
Contributor Author

NHZlX commented Jul 10, 2017

The clang-format version in my workspace is 5.0, so it will fail to pass the travis-ci . I will settle it ASAP.

Copy link
Contributor

@hedaoyuan hedaoyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the TEST in DepthwiseConvOpTest.cpp can be merged into ConvOpTest.cpp?

}
}
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some TEST, that the function1 is GemmConv and function2 is DepthwiseConv.

#ifndef PADDLE_ONLY_CPU
TEST(Forward, GEMM2) {
DepthwiseConvolutionTest<DEVICE_TYPE_GPU, DEVICE_TYPE_GPU> test(
"DepthwiseConv-GPU", "DepthwiseConv-GPU", kForwardTest);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"DepthwiseConv-GPU", "DepthwiseConv-GPU" -> "DepthwiseConv-CPU", "DepthwiseConv-GPU"

DepthwiseConvolutionTest<DEVICE_TYPE_GPU, DEVICE_TYPE_GPU> test(
"DepthwiseConv-GPU", "DepthwiseConv-GPU", kForwardTest);
DepthwiseConvolutionTest2<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU> test2(
"DepthwiseConv-GPU", "DepthwiseConv-GPU", kForwardTest);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I known this, but the situation now is the cpu version depthwise conv has not been implemented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you add the CPU code in this pr? If not, I think it is not necessary to add these tests now. Add a TODO, explain that when the CPU version is implemented need to increase the test case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be a new pr for the cpu code. I will add a TODO .


TEST(BackwardInput, GEMM) {
DepthwiseConvolutionTest<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU> test(
"DepthwiseConvGradInput-GPU",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above


TEST(BackwardFilter, GEMM) {
DepthwiseConvolutionTest<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU> test(
"DepthwiseConvGradFilter-GPU",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@@ -177,6 +177,156 @@ class ConvolutionTest2 {
}
};

template <DeviceType DType1, DeviceType DType2>
class DepthwiseConvolutionTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我主要的意思是,看一下DepthwiseConvolutionTest 是否可以用ConvolutionTest替换,这两个大部分代码都是一样的。

// version of depthwiseConv is implemented.

#ifndef PADDLE_ONLY_CPU
TEST(DepthwiseConvForward, GEMM) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这些留个TEST接口就行了,实现先删了吧,没有什么意义,只是消耗测试时间。

#include "DepthwiseConvOp.h"
#include "ConvOp.h"
#include "GemmFunctor.h"
//#include "paddle/math/MemoryHandle.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

去掉无用的代码。

check(inputs, outputs);
const TensorShape& output = inputs[0].shape();
const TensorShape& input = inputs[1].shape();
// const TensorShape& multiplier = inputs[2].shape();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

没有用的代码去掉。

}

void calc(const BufferArgs& inputs, const BufferArgs& outputs) override {
// CHECK_EQ(numInputs_, inputs.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个注释是要打开还是去掉?

CHECK_EQ(numInputs_, inputs.size());
CHECK_EQ(numOutputs_, outputs.size());
check(inputs, outputs);
// Since the implementation of Col2ImFunctor is ADD_TO,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个注释和这里的代码也完全不相关。

const TensorShape& output = outputs[0].shape();

size_t batchSize = input[0];
// size_t inputChannels = input[1];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depthwiseConv为什么可以不需要inputChannels?

for (size_t inputSize : {7, 14, 54}) {
for (size_t filterSize : {1, 3, 5}) {
for (size_t inputChannels : {64, 128}) {
size_t outputChannels = inputChannels;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputChannels != inputChannels的情况呢?

BaseMatrix filterGradMatrix(inputChannels * filterHeight * filterWidth, 1, filterGrad, false, true);

for(int i = 0; i < batchSize; i++) {
ConvolutionDepthwiseFilterBackward<T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这些代码整理一下,至少要对齐。

int paddingH,
int paddingW,
T* inputGrad){

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format可能没有检查出来,手动自己看一下,这些代码格式。有的多行空格,有的对齐比较乱。

@Xreki Xreki added this to In Progress in Embedded and Mobile Deployment Jul 14, 2017
Copy link
Contributor

@hedaoyuan hedaoyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DepthwiseConv相关几个函数需要添加一下对groups_和channels的检查。另外添加一下Depthwise相关的几个函数的Test。

@@ -228,24 +231,76 @@ TEST(Forward, GEMM) {
#ifndef PADDLE_ONLY_CPU
TEST(Forward, GEMM2) {
ConvolutionTest<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU> test(
"GemmConv-CPU", "GemmConv-GPU", kForwardTest);
"GemmConv-CPU", "GemmConv-GPU", kForwardTest, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以在接口参数上直接defualt=false,这里少写个参数。

Copy link
Contributor Author

@NHZlX NHZlX Jul 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

默认的是true, 所以depthwise conv 不用传参数

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

也可以,不过depthwise是优化特例,所以defualt=false更好。

checkShape(input, filter, output);
}

void calc(const BufferArgs& inputs, const BufferArgs& outputs) override {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里面需要增加CHECK_EQ(outputs[0].getArgType(), ADD_TO);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

checkShape(input, filter, output);
}

void calc(const BufferArgs& inputs, const BufferArgs& outputs) override {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

需要增加CHECK_EQ(outputs[0].getArgType(), ADD_TO);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@NHZlX NHZlX merged commit 91d2a57 into PaddlePaddle:develop Jul 21, 2017
@hedaoyuan hedaoyuan moved this from In Progress to Convolution optimization in Embedded and Mobile Deployment Jul 24, 2017
@NHZlX NHZlX deleted the mobilenet_gpu branch August 3, 2017 11:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Embedded and Mobile Deployment
Convolution Optimization
Development

Successfully merging this pull request may close these issues.

None yet

2 participants