
[Speed compiling]: Refine cmake about CUDA to automatically detect GPU arch by default. #5713

Merged
merged 3 commits into PaddlePaddle:develop on Nov 17, 2017

Conversation

qingqing01
Contributor

@qingqing01 qingqing01 commented Nov 16, 2017

Fix #5712

1. Automatically detect the GPU arch and specify only the detected arch by default.
   - For example, on a Tesla K40m, the sm_35 arch is automatically detected and specified.
   - `-DCUDA_ARCH_NAME=All` by default; developers can set `-DCUDA_ARCH_NAME=Auto`.
   - Use `-DCUDA_ARCH_NAME=All` in TeamCity.
2. Specify `-DCUDA_ARCH_NAME=All` when releasing a new PaddlePaddle version.
   - Supported archs: Kepler, Maxwell, Pascal, Volta.
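As a usage sketch, the two modes above are selected at configure time (the flag name is from this PR; the build directory layout is an assumption):

```shell
# Developer build: detect the local GPU and compile only for its arch (fast).
cmake -DCUDA_ARCH_NAME=Auto ..

# Release / CI build: compile for all supported archs (portable binary).
cmake -DCUDA_ARCH_NAME=All ..
```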

Speed:

Compile time interval (about 9.5 minutes from build start to unit tests in the TeamCity log):

[14:49:56]W:	 [Step 1/1] + nvidia-docker run -i  ... WITH_GPU=ON 
[14:59:19] :	 [Step 1/1]     Running unit tests ...

1. Automatically detect GPU arch by default.
2. Specify -DCUDA_ARCH_NAME=All when releasing PaddlePaddle new version
@qingqing01 qingqing01 changed the title Refine cmake about CUDA to automatically detect GPU arch by default. [Speed compiling]: Refine cmake about CUDA to automatically detect GPU arch by default. Nov 16, 2017
@emailweixu
Collaborator

emailweixu commented Nov 16, 2017

"Fix #5713" is not right. "#5713" is this PR itself.

@emailweixu
Collaborator

If the code can be compiled for one arch, is it guaranteed to compile for other archs? If not, we should still at least compile for all archs in TeamCity. We can compile only for the local arch on local machines to save dev time, though.

@hedaoyuan
Contributor

When running cmake, we can already get the CUDA architecture of the TeamCity machines.

@qingqing01
Contributor Author

@emailweixu

As @hedaoyuan said, this cmake function can automatically detect the installed GPUs and pick the local arch on the local machine, using the code below. That is to say, the arch is sm_35 on a Tesla K40 and sm_61 on a GTX 1080 Ti.

#############################################################
# A function for automatic detection of GPUs installed  (if autodetection is enabled)
# Usage:
#   detect_installed_gpus(out_variable)
function(detect_installed_gpus out_variable)
  if(NOT CUDA_gpu_detect_output)
    set(cufile ${PROJECT_BINARY_DIR}/detect_cuda_archs.cu)

    file(WRITE ${cufile} ""
      "#include <cstdio>\n"
      "int main() {\n"
      "  int count = 0;\n"
      "  if (cudaSuccess != cudaGetDeviceCount(&count)) return -1;\n"
      "  if (count == 0) return -1;\n"
      "  for (int device = 0; device < count; ++device) {\n"
      "    cudaDeviceProp prop;\n"
      "    if (cudaSuccess == cudaGetDeviceProperties(&prop, device))\n"
      "      std::printf(\"%d.%d \", prop.major, prop.minor);\n"
      "  }\n"
      "  return 0;\n"
      "}\n")

    execute_process(COMMAND "${CUDA_NVCC_EXECUTABLE}" "-ccbin=${CUDA_HOST_COMPILER}"
                    "--run" "${cufile}"
                    WORKING_DIRECTORY "${PROJECT_BINARY_DIR}/CMakeFiles/"
                    RESULT_VARIABLE nvcc_res OUTPUT_VARIABLE nvcc_out
                    ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)

    if(nvcc_res EQUAL 0)
      # Only keep the last line of nvcc_out: escape semicolons, split on newlines,
      # then take the last list element.
      string(REGEX REPLACE ";" "\\\\;" nvcc_out "${nvcc_out}")
      string(REGEX REPLACE "\n" ";" nvcc_out "${nvcc_out}")
      list(GET nvcc_out -1 nvcc_out)
      # There is no compute_21 virtual arch; sm_21 binaries use compute_20 PTX.
      string(REPLACE "2.1" "2.1(2.0)" nvcc_out "${nvcc_out}")
      set(CUDA_gpu_detect_output ${nvcc_out} CACHE INTERNAL "Returned GPU architectures from detect_installed_gpus tool" FORCE)
    endif()
  endif()

  if(NOT CUDA_gpu_detect_output)
    message(STATUS "Automatic GPU detection failed. Building for all known architectures.")
    set(${out_variable} ${paddle_known_gpu_archs} PARENT_SCOPE)
  else()
    set(${out_variable} ${CUDA_gpu_detect_output} PARENT_SCOPE)
  endif()
endfunction()

And if -DCUDA_ARCH_NAME=All is specified (Auto by default), it will compile for all archs.
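The detector prints one `major.minor` pair per GPU, so a mixed-GPU machine yields a list like `3.5 6.1`. A minimal shell sketch (variable names are illustrative, not from this PR's cmake) of how such a list maps to nvcc `-gencode` flags:

```shell
# Assumed sample detector output for a machine with a K40 and a GTX 1080 Ti.
caps="3.5 6.1"

flags=""
for cap in $caps; do
  # Drop the dot: "3.5" -> "35", matching the sm_XX naming scheme.
  sm=$(printf '%s' "$cap" | tr -d '.')
  flags="$flags -gencode arch=compute_${sm},code=sm_${sm}"
done

echo "$flags"
```

This mirrors what the cmake side ultimately does with the detected capability list when assembling nvcc arguments.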

@luotao1
Contributor

luotao1 commented Nov 17, 2017

As this PR reduces the TeamCity time from 30~33 min to 24 min, can we merge it ASAP?

@chengduoZH
Contributor

That is to say, the arch is sm_35 on Tesla K40, sm_61 on GTX 1080 Ti.

What happens if there are both a Tesla K40 and a GTX 1080 Ti in the machine that compiles the code?
Or, by default, does the machine have only one type of GPU?

@qingqing01
Contributor Author

@chengduoZH The code to detect CUDA capability is as follows; it enumerates all GPUs on the machine. If there are mixed GPU types on one machine, it can get the CUDA archs for all of them.

// detect_cuda_archs.cu -- compiled and run via "nvcc --run"; nvcc implicitly
// includes cuda_runtime.h, which declares cudaGetDeviceCount/cudaGetDeviceProperties.
#include <cstdio>
int main() {
  int count = 0;
  if (cudaSuccess != cudaGetDeviceCount(&count)) return -1;
  if (count == 0) return -1;
  for (int device = 0; device < count; ++device) {
    cudaDeviceProp prop;
    if (cudaSuccess == cudaGetDeviceProperties(&prop, device))
      std::printf("%d.%d ", prop.major, prop.minor);
  }
  return 0;
}

@qingqing01
Contributor Author

I changed CUDA_ARCH_NAME to All by default. On our local machines, we can use Auto:

cmake -DCUDA_ARCH_NAME=Auto ..

@luotao1
Contributor

luotao1 commented Nov 17, 2017

I think CUDA_ARCH_NAME should be Auto by default. And for TeamCity, you can change https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh


@reyoung
Collaborator

reyoung commented Nov 17, 2017

I think CUDA_ARCH_NAME to Auto by default.

@luotao1 It is strange if users compile a Paddle binary and cannot use it on another machine by default. So the default CUDA_ARCH_NAME should be All. Paddle users will not compile Paddle many times; only developers need to speed up the Paddle compile, and developers can easily set CUDA_ARCH_NAME=Auto.

If the code can be compiled for one arch, is it guaranteed to compile for another arch?

No, it is not. If developers use features that were introduced in a higher arch, Paddle cannot be compiled for a lower arch.

However, we can specify two archs in our CI tests: the lowest arch Paddle supports (sm_30) and the arch of the GPU cards in our CI machines. This speeds up CI tests while still guaranteeing that Paddle compiles for all archs above sm_30.
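A sketch of the nvcc flags such a two-arch CI build would generate (standard `-gencode` syntax; `kernel.cu` is a hypothetical input file, and sm_61 stands in for the CI machines' actual arch):

```shell
# Fat binary with SASS for the lowest supported arch (sm_30) and for the
# CI machines' arch (assumed sm_61 here). kernel.cu is a placeholder name.
nvcc -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_61,code=sm_61 \
     -c kernel.cu -o kernel.o
```

Compiling the sm_30 target is what catches uses of features unavailable on older archs.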

@hedaoyuan
Contributor

It will speed up CI tests and be guaranteed to compile Paddle for all arch above sm_30.

Not all; code compiled for sm_30 only supports sm_3x.

@qingqing01
Contributor Author

Use -DCUDA_ARCH_NAME=All by default. @ALL

@qingqing01 qingqing01 merged commit 2113cbf into PaddlePaddle:develop Nov 17, 2017
@emailweixu
Collaborator

We also need to document this somewhere.

@luotao1
Contributor

luotao1 commented Nov 20, 2017

How about document in #4382?

@qingqing01 qingqing01 deleted the cmake_speed branch November 14, 2019 05:25