
[Speed Compiling]: Reduce NVCC compiling files. #5573

Merged: 12 commits, Nov 16, 2017

Conversation

@qingqing01 (Contributor) commented Nov 11, 2017:

Related to #5491

Testing: local machine

  • env: centos, cuda 7.5, make -j8, WITH_GPU=ON
  • time: 31m24.320s -> 26m43.523s

Compiling PaddlePaddle is currently quite slow, about 30 minutes on a GPU server each time. I measured the NVCC and G++ compile times per operator, and NVCC is more than 1 minute slower than G++ per file. Since some operators' GPU kernel implementations can be compiled by G++, I changed about 25 .cu operator files to .cu.cc so that G++ compiles them, which speeds up the build. But I'm not sure whether this is necessary.
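
To illustrate the idea behind the rename (a minimal sketch with hypothetical names, not the actual PR code): a GPU operator file can be handed to G++ instead of NVCC as long as it defines no __global__ kernels and performs no <<<...>>> launches itself, i.e. all device work sits behind ordinary C++ functions that are compiled elsewhere (by NVCC, or inside a library such as cuBLAS). The stub below stands in for that separately compiled piece so the sketch builds and runs with a plain C++ compiler:

#include <iostream>
#include <vector>

// Declared only: in a real build this would live in a .cu file compiled once
// by NVCC, or be a library call (cuBLAS, cuDNN, an Eigen-based functor, ...).
void AddKernelLaunch(const float* x, const float* y, float* z, int n);

// The operator translation unit itself: plain C++, no device code, so the
// file can be renamed from foo_op.cu to foo_op.cu.cc and compiled by G++.
void AddOpGpu(const std::vector<float>& x, const std::vector<float>& y,
              std::vector<float>* z) {
  z->resize(x.size());
  AddKernelLaunch(x.data(), y.data(), z->data(), static_cast<int>(x.size()));
}

// Host stub standing in for the separately compiled device implementation,
// included only so this sketch is self-contained.
void AddKernelLaunch(const float* x, const float* y, float* z, int n) {
  for (int i = 0; i < n; ++i) z[i] = x[i] + y[i];
}

int main() {
  std::vector<float> z;
  AddOpGpu({1, 2, 3}, {10, 20, 30}, &z);
  for (float v : z) std::cout << v << ' ';  // 11 22 33
  std::cout << '\n';
  return 0;
}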

@qingqing01 changed the title from "[Speed Compiling Test]: Reduce NVCC compiling files." to "[Speed Compiling]: Reduce NVCC compiling files." Nov 13, 2017
hedaoyuan previously approved these changes Nov 13, 2017
[self.input_size[0]]]
self.output_represention = 8 # output feature size

#class TestSeqProjectCase1(TestSeqProject):
Contributor:

Can this commented-out code be removed?

Contributor (Author):

Restored these unit tests.

input.data<T>(), bias.data<T>(), output->data<T>(), in_dims[0], size);
}
};

Collaborator:

Why not use Eigen to do it?

Contributor (Author):

Done. Moved RowwiseAdd to paddle/operators/math/math_function and used Eigen for both the CPU and GPU implementations.
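
For reference, a minimal sketch of the row-wise add in the plain Eigen dense API (not Paddle's EigenMatrix wrappers, and not the exact functor added in this PR): the bias vector is added to every row of the input.

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXf input(2, 3);
  input << 1, 2, 3,
           4, 5, 6;
  Eigen::RowVectorXf bias(3);
  bias << 10, 20, 30;

  // RowwiseAdd: output(i, j) = input(i, j) + bias(j).
  Eigen::MatrixXf output = input.rowwise() + bias;
  std::cout << output << "\n";
  // 11 22 33
  // 14 25 36
  return 0;
}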

math::SetConstant<Place, T> set;
set(dev_ctx, &ones, static_cast<T>(1));
math::gemv<Place, T>(dev_ctx, true, m, n, 1., batch_gate_grad.data<T>(),
ones.data<T>(), 0., bias_grad->data<T>());
Collaborator:

Since this is used in many places, should we add a function similar to RowwiseAdd to do this?

Contributor (Author):

Done. Added a ColwiseSum functor in paddle/operators/math/math_function.h to compute the bias gradient.
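
As a sketch of what such a ColwiseSum helper computes (plain Eigen here, not the actual PR functor): the bias gradient is the column-wise sum of the output gradient over the batch dimension, which is exactly what the gemv call with a ones vector above evaluates.

#include <Eigen/Dense>
#include <iostream>

int main() {
  // Gradient w.r.t. the op output, shape (batch_size, width).
  Eigen::MatrixXf d_out(2, 3);
  d_out << 1, 2, 3,
           4, 5, 6;

  // ColwiseSum: d_bias(j) = sum_i d_out(i, j), i.e. ones(1 x batch) * d_out.
  Eigen::RowVectorXf d_bias = d_out.colwise().sum();
  std::cout << d_bias << "\n";  // 5 7 9
  return 0;
}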

void operator()(const platform::DeviceContext& context,
const framework::Tensor& input, const framework::Tensor& bias,
framework::Tensor* output);
};
Collaborator:

Why is this in sequence2batch.h?

Contributor (Author):

At first, this functor was only used in lstm_op and gru_op. It has now been moved to paddle/operators/math/math_function.

@emailweixu (Collaborator):

Another possibility is to merge .cc and .cu.cc into a single .cc. What does everyone think?

@hedaoyuan (Contributor):

I agree with this.

@luotao1 (Contributor) commented Nov 16, 2017:

Can this PR be merged first, since it reduces NVCC compile time and many files would conflict with it? How about merging .cc and .cu.cc into a single .cc in the next PR?

@luotao1 (Contributor) left a review:

As this PR conflicts with many files, we will merge it first. The unfinished work will be continued in the next PR.
