
[Speed compiling]: Refine cmake about CUDA to automatically detect GPU arch by default. #5713

Merged
merged 3 commits into PaddlePaddle:develop on Nov 17, 2017

Conversation

qingqing01
Contributor

@qingqing01 qingqing01 commented Nov 16, 2017

Fix #5712

1. Automatically detect the GPU arch and specify only the detected arch by default.
   - For example, on a Tesla K40m, the sm_35 arch is automatically detected and specified.
   - `-DCUDA_ARCH_NAME=All` by default; developers can set `-DCUDA_ARCH_NAME=Auto`.
   - Use `-DCUDA_ARCH_NAME=All` in TeamCity.
2. Specify `-DCUDA_ARCH_NAME=All` when releasing a new PaddlePaddle version.
   - Supported archs: Kepler, Maxwell, Pascal, Volta.
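As a usage sketch, the two modes above are selected at configure time (the flag name is from this PR; the build directory layout is an assumption):

```shell
# Developer build: detect the local GPU and compile only for its arch (fast).
cmake -DCUDA_ARCH_NAME=Auto ..

# Release / CI build: compile for all supported archs (portable binary).
cmake -DCUDA_ARCH_NAME=All ..
```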

Speed:

Compile time interval (about 9.5 minutes from build start to unit tests in the TeamCity log):

[14:49:56]W:	 [Step 1/1] + nvidia-docker run -i  ... WITH_GPU=ON 
[14:59:19] :	 [Step 1/1]     Running unit tests ...

1. Automatically detect GPU arch by default.
2. Specify -DCUDA_ARCH_NAME=All when releasing PaddlePaddle new version
@qingqing01 qingqing01 changed the title Refine cmake about CUDA to automatically detect GPU arch by default. [Speed compiling]: Refine cmake about CUDA to automatically detect GPU arch by default. Nov 16, 2017
@emailweixu
Collaborator

emailweixu commented Nov 16, 2017

"Fix #5713" is not right. "#5713" is this PR itself.

@emailweixu
Collaborator

If the code can be compiled for one arch, is it guaranteed to compile for other archs? If not, we should still at least compile for all archs in TeamCity. We can compile only for the local arch on local machines to save dev time, though.

@hedaoyuan
Contributor

When running cmake, we can already get the CUDA architecture of the TeamCity machines.

@qingqing01
Contributor Author

@emailweixu

As @hedaoyuan said, this cmake function can automatically detect the installed GPUs and pick the local arch on the local machine, using the code below. That is to say, the arch is sm_35 on a Tesla K40 and sm_61 on a GTX 1080 Ti.

#############################################################
# A function for automatic detection of GPUs installed  (if autodetection is enabled)
# Usage:
#   detect_installed_gpus(out_variable)
function(detect_installed_gpus out_variable)
  if(NOT CUDA_gpu_detect_output)
    set(cufile ${PROJECT_BINARY_DIR}/detect_cuda_archs.cu)

    file(WRITE ${cufile} ""
      "#include <cstdio>\n"
      "int main() {\n"
      "  int count = 0;\n"
      "  if (cudaSuccess != cudaGetDeviceCount(&count)) return -1;\n"
      "  if (count == 0) return -1;\n"
      "  for (int device = 0; device < count; ++device) {\n"
      "    cudaDeviceProp prop;\n"
      "    if (cudaSuccess == cudaGetDeviceProperties(&prop, device))\n"
      "      std::printf(\"%d.%d \", prop.major, prop.minor);\n"
      "  }\n"
      "  return 0;\n"
      "}\n")

    execute_process(COMMAND "${CUDA_NVCC_EXECUTABLE}" "-ccbin=${CUDA_HOST_COMPILER}"
                    "--run" "${cufile}"
                    WORKING_DIRECTORY "${PROJECT_BINARY_DIR}/CMakeFiles/"
                    RESULT_VARIABLE nvcc_res OUTPUT_VARIABLE nvcc_out
                    ERROR_QUIET OUTPUT_STRIP_TRAILING_WHITESPACE)

    if(nvcc_res EQUAL 0)
      # Only keep the last line of nvcc_out: escape semicolons, split on newlines,
      # then take the last list element.
      string(REGEX REPLACE ";" "\\\\;" nvcc_out "${nvcc_out}")
      string(REGEX REPLACE "\n" ";" nvcc_out "${nvcc_out}")
      list(GET nvcc_out -1 nvcc_out)
      # There is no compute_21 virtual arch; sm_21 binaries use compute_20 PTX.
      string(REPLACE "2.1" "2.1(2.0)" nvcc_out "${nvcc_out}")
      set(CUDA_gpu_detect_output ${nvcc_out} CACHE INTERNAL "Returned GPU architectures from detect_installed_gpus tool" FORCE)
    endif()
  endif()

  if(NOT CUDA_gpu_detect_output)
    message(STATUS "Automatic GPU detection failed. Building for all known architectures.")
    set(${out_variable} ${paddle_known_gpu_archs} PARENT_SCOPE)
  else()
    set(${out_variable} ${CUDA_gpu_detect_output} PARENT_SCOPE)
  endif()
endfunction()

And if -DCUDA_ARCH_NAME=All is specified (Auto by default), it will compile for all archs.
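The detector prints one `major.minor` pair per GPU, so a mixed-GPU machine yields a list like `3.5 6.1`. A minimal shell sketch (variable names are illustrative, not from this PR's cmake) of how such a list maps to nvcc `-gencode` flags:

```shell
# Assumed sample detector output for a machine with a K40 and a GTX 1080 Ti.
caps="3.5 6.1"

flags=""
for cap in $caps; do
  # Drop the dot: "3.5" -> "35", matching the sm_XX naming scheme.
  sm=$(printf '%s' "$cap" | tr -d '.')
  flags="$flags -gencode arch=compute_${sm},code=sm_${sm}"
done

echo "$flags"
```

This mirrors what the cmake side ultimately does with the detected capability list when assembling nvcc arguments.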

@luotao1
Contributor

luotao1 commented Nov 17, 2017

As this PR reduces the TeamCity time from 30~33 min to 24 min, can we merge it ASAP?

@chengduoZH
Contributor

That is to say, the arch is sm_35 on Tesla K40, sm_61 on GTX 1080 Ti.

What happens if there are both a Tesla K40 and a GTX 1080 Ti in the machine that compiles the code?
Or, by default, does the machine have only one type of GPU?

@qingqing01
Contributor Author

@chengduoZH The code to detect CUDA capability is as follows; it enumerates all GPUs on the machine. If there are mixed GPU types on one machine, it can get the CUDA archs for all of them.

// detect_cuda_archs.cu -- compiled and run via "nvcc --run"; nvcc implicitly
// includes cuda_runtime.h, which declares cudaGetDeviceCount/cudaGetDeviceProperties.
#include <cstdio>
int main() {
  int count = 0;
  if (cudaSuccess != cudaGetDeviceCount(&count)) return -1;
  if (count == 0) return -1;
  for (int device = 0; device < count; ++device) {
    cudaDeviceProp prop;
    if (cudaSuccess == cudaGetDeviceProperties(&prop, device))
      std::printf("%d.%d ", prop.major, prop.minor);
  }
  return 0;
}

@qingqing01
Contributor Author

I changed CUDA_ARCH_NAME to All by default. On our local machines, we can use Auto:

cmake -DCUDA_ARCH_NAME=Auto ..

@luotao1
Contributor

luotao1 commented Nov 17, 2017

I think CUDA_ARCH_NAME should be Auto by default. And for TeamCity, you can change https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh


@reyoung
Collaborator

reyoung commented Nov 17, 2017

I think CUDA_ARCH_NAME to Auto by default.

@luotao1 It is strange if users compile a Paddle binary and cannot use it on another machine by default. So the default CUDA_ARCH_NAME should be All. Paddle users will not compile Paddle many times; only developers need to speed up the Paddle compile, and developers can easily set CUDA_ARCH_NAME=Auto.

If the code can be compiled for one arch, is it guaranteed to compile for another arch?

No, it is not. If developers use features that were introduced in a higher arch, Paddle cannot be compiled for a lower arch.

However, we can specify two archs in our CI tests: the lowest arch Paddle supports (sm_30) and the arch of the GPU cards in our CI machines. This speeds up CI tests while still guaranteeing that Paddle compiles for all archs above sm_30.
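A sketch of the nvcc flags such a two-arch CI build would generate (standard `-gencode` syntax; `kernel.cu` is a hypothetical input file, and sm_61 stands in for the CI machines' actual arch):

```shell
# Fat binary with SASS for the lowest supported arch (sm_30) and for the
# CI machines' arch (assumed sm_61 here). kernel.cu is a placeholder name.
nvcc -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_61,code=sm_61 \
     -c kernel.cu -o kernel.o
```

Compiling the sm_30 target is what catches uses of features unavailable on older archs.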

@hedaoyuan
Contributor

It will speed up CI tests and be guaranteed to compile Paddle for all arch above sm_30.

Not all; code compiled for sm_30 only supports sm_3x.

@qingqing01
Contributor Author

Use -DCUDA_ARCH_NAME=All by default. @ALL

@qingqing01 qingqing01 merged commit 2113cbf into PaddlePaddle:develop Nov 17, 2017
@emailweixu
Collaborator

We also need to document this somewhere.

@luotao1
Contributor

luotao1 commented Nov 20, 2017

How about document in #4382?

@qingqing01 qingqing01 deleted the cmake_speed branch November 14, 2019 05:25