[Speed up compiling]: reduce the NVCC compiling (some .cu operators can be compiled by G++) #5491

Closed
qingqing01 opened this issue Nov 8, 2017 · 5 comments

Comments

@qingqing01
Contributor

qingqing01 commented Nov 8, 2017

Compiling time comparison between NVCC and G++

  1. Conclusion:

    • NVCC is slower than G++ by more than 1 min per file. For example, for elementwise_mul_op the compile time is 13s (G++) vs. 1m41s (NVCC).
    • The more CUDA gencodes (gencode: sm_xx) are targeted, the slower NVCC compiles.
  2. Experiment 1: elementwise_mul_op, this op uses Eigen to compute

    • G++

      • compile :
      time /home/dangqingqing/.jumbo/opt/gcc48/bin/c++   -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_DISABLE_PROFILER -DPADDLE_DISABLE_RDMA -DPADDLE_DISABLE_TIMER -DPADDLE_USE_DSO -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_VERSION=0.10.0rc4 -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_WITH_TESTING -mavx -std=c++11 -fPIC -fno-omit-frame-pointer -Wall -Wextra -Werror -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wno-unused-parameter -Wno-unused-function -Wno-error=literal-suffix -Wno-error=sign-compare -Wno-error=unused-local-typedefs -O2 -g -DNDEBUG -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c    -o CMakeFiles/elementwise_mul_op.dir/elementwise_mul_op.cc.o -c /home/dangqingqing/github/myfork/Paddle/paddle/operators/elementwise_mul_op.cc
      • time:
      real	0m13.116s
      user	0m12.264s
      sys	0m0.849s
    • NVCC

      • gencode: sm_30, sm_35, sm_50,sm_52
      • compile :
      time /usr/local/cuda/bin/nvcc /home/dangqingqing/github/myfork/Paddle/paddle/operators/elementwise_mul_op.cu -c -o /home/dangqingqing/github/myfork/build/paddle/operators/CMakeFiles/elementwise_mul_op.dir//./elementwise_mul_op_generated_elementwise_mul_op.cu.o -ccbin /home/dangqingqing/.jumbo/opt/gcc48/bin/g++ -m64 -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_USE_DSO -DPADDLE_WITH_TESTING -DPADDLE_DISABLE_TIMER -DPADDLE_DISABLE_PROFILER -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_DISABLE_RDMA -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_VERSION=0.10.0rc4 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcompiler -mavx -Xcompiler -Wall -Xcompiler -Wextra -Xcompiler -Werror -Xcompiler -fPIC -Xcompiler -fno-omit-frame-pointer -Xcompiler -Wno-unused-parameter -Xcompiler -Wno-unused-function -Xcompiler -Wno-error=sign-compare -Xcompiler -Wno-error=literal-suffix -Xcompiler -Wno-error=unused-local-typedefs -Xcompiler -Wno-error=unused-function -Xcompiler -Wno-error=array-bounds -std=c++11 --use_fast_math -O2 -g -DNDEBUG -DNVCC -I/usr/local/cuda/include -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include 
-I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -I/usr/include
      
      • time:
      real	1m41.708s
      user	1m30.757s
      sys   0m10.992s
    • gencode: only sm_35

      • compile :
      time /usr/local/cuda/bin/nvcc /home/dangqingqing/github/myfork/Paddle/paddle/operators/elementwise_mul_op.cu -c -o /home/dangqingqing/github/myfork/build/paddle/operators/CMakeFiles/elementwise_mul_op.dir//./elementwise_mul_op_generated_elementwise_mul_op.cu.o -ccbin /home/dangqingqing/.jumbo/opt/gcc48/bin/g++ -m64 -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_USE_DSO -DPADDLE_WITH_TESTING -DPADDLE_DISABLE_TIMER -DPADDLE_DISABLE_PROFILER -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_DISABLE_RDMA -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_VERSION=0.10.0rc4 -gencode arch=compute_35,code=sm_35 -Xcompiler -mavx -Xcompiler -Wall -Xcompiler -Wextra -Xcompiler -Werror -Xcompiler -fPIC -Xcompiler -fno-omit-frame-pointer -Xcompiler -Wno-unused-parameter -Xcompiler -Wno-unused-function -Xcompiler -Wno-error=sign-compare -Xcompiler -Wno-error=literal-suffix -Xcompiler -Wno-error=unused-local-typedefs -Xcompiler -Wno-error=unused-function -Xcompiler -Wno-error=array-bounds -std=c++11 --use_fast_math -O2 -g -DNDEBUG -DNVCC -I/usr/local/cuda/include -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle 
-I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -I/usr/include
      • time:
       real	0m34.035s
       user	0m30.629s
       sys	0m3.414s 
  3. Experiment 2: mul_op, this op uses math::matmul to compute.

    • G++
      • compile :
        time /home/dangqingqing/.jumbo/opt/gcc48/bin/c++   -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_DISABLE_PROFILER -DPADDLE_DISABLE_RDMA -DPADDLE_DISABLE_TIMER -DPADDLE_USE_DSO -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_VERSION=0.10.0rc4 -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_WITH_TESTING -mavx -std=c++11 -fPIC -fno-omit-frame-pointer -Wall -Wextra -Werror -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wno-unused-parameter -Wno-unused-function -Wno-error=literal-suffix -Wno-error=sign-compare -Wno-error=unused-local-typedefs -O2 -g -DNDEBUG -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c    -o CMakeFiles/mul_op.dir/mul_op.cc.o -c /home/dangqingqing/github/myfork/Paddle/paddle/operators/mul_op.cc
      • time:
      real	0m11.383s
      user	0m10.568s
      sys	0m0.825s
    • NVCC: gencode: sm_30, sm_35, sm_50,sm_52
      • compile:
      time /usr/local/cuda/bin/nvcc /home/dangqingqing/github/myfork/Paddle/paddle/operators/mul_op.cu -c -o /home/dangqingqing/github/myfork/build/paddle/operators/CMakeFiles/mul_op.dir//./mul_op_generated_mul_op.cu.o -ccbin /home/dangqingqing/.jumbo/opt/gcc48/bin/g++ -m64 -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_USE_DSO -DPADDLE_WITH_TESTING -DPADDLE_DISABLE_TIMER -DPADDLE_DISABLE_PROFILER -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_DISABLE_RDMA -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_VERSION=0.10.0rc4 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcompiler -mavx -Xcompiler -Wall -Xcompiler -Wextra -Xcompiler -Werror -Xcompiler -fPIC -Xcompiler -fno-omit-frame-pointer -Xcompiler -Wno-unused-parameter -Xcompiler -Wno-unused-function -Xcompiler -Wno-error=sign-compare -Xcompiler -Wno-error=literal-suffix -Xcompiler -Wno-error=unused-local-typedefs -Xcompiler -Wno-error=unused-function -Xcompiler -Wno-error=array-bounds -std=c++11 --use_fast_math -O2 -g -DNDEBUG -DNVCC -I/usr/local/cuda/include -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build 
-I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -I/usr/include
      
      • time
      real	1m31.902s
      user	1m21.839s
      sys	0m10.026s
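
To make the comparison easier to read, the "real" times reported above can be turned into ratios. The snippet below is only a small sketch: the numbers are copied from the measurements in this comment, while the helper names (`parse_time`, `slowdown`) and the configuration labels are made up for illustration.

```python
import re

def parse_time(text: str) -> float:
    """Convert a `time` value such as '1m41.708s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Wall-clock ("real") times copied from the measurements above.
# The configuration labels are hypothetical names chosen for this sketch.
REAL_TIMES = {
    ("elementwise_mul_op", "g++"): "0m13.116s",
    ("elementwise_mul_op", "nvcc, 4 gencodes"): "1m41.708s",
    ("elementwise_mul_op", "nvcc, sm_35 only"): "0m34.035s",
    ("mul_op", "g++"): "0m11.383s",
    ("mul_op", "nvcc, 4 gencodes"): "1m31.902s",
}

def slowdown(op: str, config: str) -> float:
    """Ratio of a configuration's time to the G++ time for the same op."""
    return parse_time(REAL_TIMES[(op, config)]) / parse_time(REAL_TIMES[(op, "g++")])

if __name__ == "__main__":
    for op, config in REAL_TIMES:
        if config != "g++":
            print(f"{op:20s} {config:18s} {slowdown(op, config):.1f}x slower than g++")
```

With these numbers NVCC comes out roughly 7.8x (elementwise_mul_op) and 8.1x (mul_op) slower than G++ with four gencodes, and about 2.6x slower with sm_35 only. These are single runs on one machine, so the ratios should be read as indicative rather than exact.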

The .cu operators which can be compiled by G++

The following .cu operators can be compiled by G++, since the CUDA kernels they depend on have already been compiled into the math libraries (under paddle/operators/math/). The cuDNN-based operators can also be compiled by G++.

batch_norm_op.cu
concat_op.cu
conv2d_transpose_cudnn_op.cu
conv_cudnn_op.cu
conv_op.cu
conv_transpose_op.cu
fill_constant_batch_size_like_op.cu
fill_constant_op.cu
fill_zeros_like_op.cu
gru_op.cu
linear_chain_crf_op.cu
lstm_op.cu
matmul_op.cu
mul_op.cu
nccl_op.cu
nccl_op_test.cu
pool_cudnn_op.cu
pool_op.cu
pool_with_index_op.cu
sequence_conv_op.cu
reshape_op.cu
sequence_concat_op.cu
sequence_softmax_op.cu
softmax_op.cu
split_op.cu

However, having different compiling rules for different operators may be a little confusing for developers.

Also related to #5413
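
One way to keep the two compiling rules in a single place would be an explicit allowlist that the build consults. The sketch below only illustrates the idea: `GXX_COMPILABLE` holds a few names from the list in this comment, and `split_by_compiler` is a hypothetical helper, not something that exists in the Paddle build system.

```python
# A few entries from the list in this comment; the remaining names would be
# added the same way. The set name itself is an assumption for this sketch.
GXX_COMPILABLE = {
    "batch_norm_op.cu",
    "concat_op.cu",
    "conv_cudnn_op.cu",
    "mul_op.cu",
    "softmax_op.cu",
}

def split_by_compiler(cu_files):
    """Partition .cu file names into (G++ candidates, NVCC-only files)."""
    gxx, nvcc = [], []
    for name in cu_files:
        # Anything not explicitly allowlisted stays with NVCC, the safe default.
        (gxx if name in GXX_COMPILABLE else nvcc).append(name)
    return gxx, nvcc

if __name__ == "__main__":
    gxx, nvcc = split_by_compiler(["mul_op.cu", "elementwise_mul_op.cu"])
    print("G++ :", gxx)
    print("NVCC:", nvcc)
```

Defaulting unknown files to NVCC means a newly added operator still builds correctly and only misses the speed-up until someone adds it to the list.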

@luotao1
Contributor

luotao1 commented Nov 8, 2017

The experiment seems so great!

> The .cu operators which can be compiled by G++

How can we tell which ops should be compiled by G++ and which should be compiled by NVCC?

> However, having different compiling rules for different operators may be a little confusing for developers.

Can a .cu file that doesn't contain #define EIGEN_USE_GPU be compiled by G++?

@chengduoZH
Contributor

The compile times above seem to be from Debug mode. Do you have compile times for Release mode?

@qingqing01
Contributor Author

@luotao1

  1. If the .cu file contains CUDA keywords such as __device__, __global__, or dim3, it should be compiled by NVCC.

  2. The cuDNN operators, such as conv_cudnn_op.cu, batch_norm_op.cu, and conv2d_transpose_cudnn_op.cu, can be compiled by G++.

  3. If the CUDA dependencies have already been compiled into libraries, then operators that rely only on these libraries, such as mul_op.cu and conv_op.cu, can be compiled by G++.
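
Rule 1 can be approximated with a plain text scan. The following is a minimal sketch under stated assumptions: the marker list comes from rule 1, the `<<<` kernel-launch check is an extra added here for illustration, and `needs_nvcc` is a hypothetical helper name. It only looks at the file itself, not at the headers it includes.

```python
import re

# Markers from rule 1, plus the kernel-launch syntax `<<<`
# (an addition for this sketch, not part of the original rule).
CUDA_MARKERS = re.compile(r"__device__|__global__|\bdim3\b|<<<")

def strip_comments(source: str) -> str:
    """Remove /* */ and // comments so commented-out markers are ignored.

    This is a naive regex pass: it does not understand string literals.
    """
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)

def needs_nvcc(source: str) -> bool:
    """Return True if the source text itself uses a CUDA marker."""
    return CUDA_MARKERS.search(strip_comments(source)) is not None

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path, errors="ignore") as handle:
            verdict = "NVCC" if needs_nvcc(handle.read()) else "G++ candidate"
        print(f"{path}: {verdict}")
```

A "G++ candidate" result is only a hint. Rules 2 and 3 still have to hold for the file, and a final check is to actually compile it with G++.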

@qingqing01
Contributor Author

@chengduoZH The compilation above uses the default build type, RelWithDebInfo, not Debug. After switching to Release, the compile times for mul_op are as follows:

  • G++
time /home/dangqingqing/.jumbo/opt/gcc48/bin/c++   -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_DISABLE_PROFILER -DPADDLE_DISABLE_RDMA -DPADDLE_DISABLE_TIMER -DPADDLE_USE_DSO -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_VERSION=0.10.0rc4 -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_WITH_TESTING -mavx -std=c++11 -fPIC -fno-omit-frame-pointer -Wall -Wextra -Werror -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wno-unused-parameter -Wno-unused-function -Wno-error=literal-suffix -Wno-error=sign-compare -Wno-error=unused-local-typedefs -O3 -DNDEBUG -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build -I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c    -o CMakeFiles/mul_op.dir/mul_op.cc.o -c /home/dangqingqing/github/myfork/Paddle/paddle/operators/mul_op.cc

real	0m9.541s
user	0m8.819s
sys	0m0.730s
  • NVCC
time /usr/local/cuda/bin/nvcc /home/dangqingqing/github/myfork/Paddle/paddle/operators/mul_op.cu -c -o /home/dangqingqing/github/myfork/build/paddle/operators/CMakeFiles/mul_op.dir//./mul_op_generated_mul_op.cu.o -ccbin /home/dangqingqing/.jumbo/opt/gcc48/bin/g++ -m64 -DANY_IMPL_ANY_CAST_MOVEABLE -DPADDLE_USE_DSO -DPADDLE_WITH_TESTING -DPADDLE_DISABLE_TIMER -DPADDLE_DISABLE_PROFILER -DPADDLE_WITHOUT_GOLANG -DPADDLE_WITH_CUDA -DPADDLE_DISABLE_RDMA -DPADDLE_USE_PTHREAD_SPINLOCK -DPADDLE_USE_PTHREAD_BARRIER -DPADDLE_VERSION=0.10.0rc4 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcompiler -mavx -Xcompiler -Wall -Xcompiler -Wextra -Xcompiler -Werror -Xcompiler -fPIC -Xcompiler -fno-omit-frame-pointer -Xcompiler -Wno-unused-parameter -Xcompiler -Wno-unused-function -Xcompiler -Wno-error=sign-compare -Xcompiler -Wno-error=literal-suffix -Xcompiler -Wno-error=unused-local-typedefs -Xcompiler -Wno-error=unused-function -Xcompiler -Wno-error=array-bounds -std=c++11 --use_fast_math -O3 -DNDEBUG -DNVCC -I/usr/local/cuda/include -I/home/dangqingqing/.third_party/install/zlib/include -I/home/dangqingqing/.third_party/install/gflags/include -I/home/dangqingqing/.third_party/install/glog/include -I/home/dangqingqing/.third_party/install/gtest/include -I/home/dangqingqing/.third_party/install/protobuf/include -I/home/dangqingqing/.jumbo/include/python2.7 -I/home/dangqingqing/.jumbo/lib/python2.7/site-packages/numpy/core/include -I/home/dangqingqing/.third_party/install/openblas/include -I/home/dangqingqing/.third_party/install/warpctc/include -I/home/dangqingqing/.third_party/any/src/extern_lib_any -I/home/dangqingqing/.third_party/eigen3/src/extern_eigen3 -I/home/dangqingqing/.third_party/pybind/src/extern_pybind/include -I/home/dangqingqing/.third_party/nccl/src/extern_nccl/src -I/usr/local/cuda/include -I/home/dangqingqing/github/myfork/build 
-I/home/dangqingqing/github/myfork/Paddle -I/home/dangqingqing/github/myfork/Paddle/paddle/cuda/include -I/home/dangqingqing/github/myfork/build/proto -I/home/dangqingqing/github/myfork/build/go/pserver/client/c -I/usr/include


real	1m28.979s
user	1m20.179s
sys	0m8.848s

@emailweixu
Collaborator

emailweixu commented Nov 13, 2017

Even if CUDA keywords such as __device__, __global__, and dim3 are not used directly in the .cu file, they could still be used in the header files it includes from Eigen. It seems that many of our operators are in this category (e.g., elementwise_mul_op.cu). Is there an easier way to figure this out?
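
One possible way to approach this is to follow the #include chain and scan every reachable header for the same markers. The sketch below is an assumption-laden illustration, not an existing tool: `find_cuda_usage` is a hypothetical helper, the include directories would be the -I paths from the compile command, and the scan ignores preprocessor conditionals.

```python
from __future__ import annotations

import re
from pathlib import Path

CUDA_MARKERS = re.compile(r"__device__|__global__|\bdim3\b")
INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)

def find_cuda_usage(source: Path, include_dirs: list[Path]) -> list[Path]:
    """Return files reachable from `source` via #include that contain a CUDA marker."""
    seen = set()
    hits = []
    stack = [source]
    while stack:
        path = stack.pop().resolve()
        if path in seen or not path.is_file():
            continue
        seen.add(path)
        text = path.read_text(errors="ignore")
        if CUDA_MARKERS.search(text):
            hits.append(path)
        for name in INCLUDE.findall(text):
            # Try the including file's directory first, then each include dir.
            for base in [path.parent, *include_dirs]:
                candidate = base / name
                if candidate.is_file():
                    stack.append(candidate)
                    break
    return hits

if __name__ == "__main__":
    import sys
    source, *dirs = sys.argv[1:]
    for hit in find_cuda_usage(Path(source), [Path(d) for d in dirs]):
        print(hit)
```

Because the scan does not evaluate #if/#ifdef guards, it will report headers whose device code is disabled by a macro, so it over-approximates. An empty result is a reasonable hint that G++ can handle the file, while a non-empty one still needs a trial compile to confirm.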
