
(cmake) Fix cuda arch selection #1091

Conversation

matthewdouglas (Contributor) commented on Feb 26, 2024:

I noticed a few problems with the way the compute capabilities would be applied when compiling.

Currently CMAKE_CUDA_ARCHITECTURES_ALL is used to create a default list. This variable is only defined in cmake>=3.23.0, but our minimum requirement is lower: cmake==3.22.1. The GitHub Actions runners for the Ubuntu builds use CMake 3.21.0 and therefore end up with empty lists:

Ubuntu, x86-64, CUDA 12.1.0

-- Building with backend cuda
-- NO_CUBLASLT := OFF
-- CUDA Version: 121 (12.1.105)
-- CUDA Compiler: /usr/local/cuda/bin/nvcc
-- CUDA Capabilities Available: 
-- CUDA Capabilities  Selected: 
-- CUDA NVCC Flags:  --use_fast_math
-- Configuring done
-- Generating done
-- Build files have been written to: /src

However, the Windows builds have a newer version of CMake set up, so it's closer to what we expect:

Windows, x86-64, CUDA 12.1

-- NO_CUBLASLT := ON
-- The CUDA compiler identification is NVIDIA 12.1.66
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.1/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.1/include (found version "12.1.66") 
-- CUDA Version: 121 (12.1.66)
-- CUDA Compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.1/bin/nvcc.exe
-- CUDA Capabilities Available: 50;52;53;60;61;62;70;72;75;80;86;87;89;90
-- CUDA Capabilities  Selected: 50;52;53;60;61;62;70;72;75;80;86;87;89;90

Besides this issue, there are two others resolved here:

  1. When CMAKE_CUDA_ARCHITECTURES is not user-defined, it's initialized to the nvcc default. For CUDA 12.x, this adds --generate-code=arch=compute_52,code=[sm_52] to the actual flags used when invoking nvcc.
  2. By default we're only building cubins for the "real" architectures. When a new arch is released, we'd have to rebuild to support it. Instead, by also including the "virtual" PTX build for at least the highest architecture selected, we can help ensure the PTX can run on newer cards (see the sketch after this list).
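
As a sketch of that idea (illustrative, not the exact change in this PR): since CMake 3.18, entries in CMAKE_CUDA_ARCHITECTURES can carry a -real (SASS only) or -virtual (PTX only) suffix, and an unsuffixed entry emits both, which is what yields the code=[compute_90,sm_90] pair in the flags further down:

# Sketch only. Assumes COMPUTE_CAPABILITY is a semicolon-separated list
# sorted ascending, e.g. "50;52;60;70;75;80;86;89;90".
list(GET COMPUTE_CAPABILITY -1 _highest_cc)
set(CMAKE_CUDA_ARCHITECTURES "")
foreach(_cc IN LISTS COMPUTE_CAPABILITY)
  if("${_cc}" STREQUAL "${_highest_cc}")
    # Unsuffixed: emit both sm_<cc> (SASS) and compute_<cc> (PTX),
    # so the PTX can be JIT-compiled on future architectures.
    list(APPEND CMAKE_CUDA_ARCHITECTURES "${_cc}")
  else()
    # "-real": SASS only, keeping the binary size down.
    list(APPEND CMAKE_CUDA_ARCHITECTURES "${_cc}-real")
  endif()
endforeach()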

After these changes, the generated flags look like this (Windows, CUDA 12.3, all capabilities selected):

nvcc.exe -forward-unknown-to-host-compiler -DBUILD_CUDA -Dbitsandbytes_EXPORTS -IC:\dev\bitsandbytes\csrc -IC:\dev\bitsandbytes\include -isystem "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\include" -D_WINDOWS -Xcompiler=" /EHsc" --use_fast_math -Xcompiler="-O2 -Ob2" -DNDEBUG -std=c++14 "--generate-code=arch=compute_50,code=[sm_50]" "--generate-code=arch=compute_52,code=[sm_52]" "--generate-code=arch=compute_53,code=[sm_53]" "--generate-code=arch=compute_60,code=[sm_60]" "--generate-code=arch=compute_61,code=[sm_61]" "--generate-code=arch=compute_62,code=[sm_62]" "--generate-code=arch=compute_70,code=[sm_70]" "--generate-code=arch=compute_72,code=[sm_72]" "--generate-code=arch=compute_75,code=[sm_75]" "--generate-code=arch=compute_80,code=[sm_80]" "--generate-code=arch=compute_86,code=[sm_86]" "--generate-code=arch=compute_87,code=[sm_87]" "--generate-code=arch=compute_89,code=[sm_89]" "--generate-code=arch=compute_90,code=[compute_90,sm_90]" -Xcompiler=-MD -x cu -rdc=true -c C:\dev\bitsandbytes\csrc\kernels.cu -o CMakeFiles\bitsandbytes.dir\csrc\kernels.cu.obj -Xcompiler=-FdTARGET_COMPILE_PDB,-FS

Note that in a future PR, I'd like to consider replacing the build option -DCOMPUTE_CAPABILITY=... with the more canonical CMAKE_CUDA_ARCHITECTURES option. That can be a later change stacked on top of this one.
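
For illustration only, a hypothetical invocation under that future scheme (not part of this PR) might look like:

cmake -DCOMPUTE_BACKEND=cuda -DCMAKE_CUDA_ARCHITECTURES="75;86;90" .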

@@ -125,7 +125,7 @@ jobs:
 docker run --platform linux/$build_arch -i -w /src -v $PWD:/src $image sh -c \
 "apt-get update \
 && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends cmake \
-&& cmake -DCOMPUTE_BACKEND=cuda -DNO_CUBLASLT=${NO_CUBLASLT} . \
+&& cmake -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY=\"50;52;60;61;70;75;80;86;89;90\" -DNO_CUBLASLT=${NO_CUBLASLT} . \
matthewdouglas (Author) commented:

I selected these for the CI to bring parity with what the old Makefile was selecting. When we get around to doing tests on GPU instances, we may want to narrow this down to "native" so we don't waste so much build time.
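
For reference, a minimal sketch of that narrowing (the native special value for CMAKE_CUDA_ARCHITECTURES requires CMake >= 3.24 and a GPU visible at configure time):

# Sketch: on a GPU-equipped test runner, build only for the locally
# detected architecture(s) instead of the full list.
if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.24" AND NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
  set(CMAKE_CUDA_ARCHITECTURES native)
endif()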


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Titus-von-Koeller (Collaborator) commented:

Thanks @matthewdouglas! Great work.

@rickardp @wkpark @akx Could one of you please do a first review of this?

rickardp (Contributor) left a comment:

Nice work! I left a few comments / questions

@@ -82,6 +83,31 @@ if(BUILD_CUDA)
 message(FATAL_ERROR "CUDA Version > 12 is not supported")
 endif()
 
+# CMake < 3.23.0 does not define CMAKE_CUDA_ARCHITECTURES_ALL.
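
The rest of the added hunk is elided above. A minimal sketch of such a fallback, assuming CUDAToolkit_VERSION is populated by find_package(CUDAToolkit) and with illustrative rather than authoritative toolkit-version cutoffs, could look like:

if(CMAKE_VERSION VERSION_LESS "3.23.0")
  # Sketch: capabilities available in every CUDA 11.x/12.x toolkit.
  set(CMAKE_CUDA_ARCHITECTURES_ALL 50 52 53 60 61 62 70 72 75 80)
  # Newer capabilities arrive with newer toolkits (cutoffs illustrative).
  if(CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.1")
    list(APPEND CMAKE_CUDA_ARCHITECTURES_ALL 86)
  endif()
  if(CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.4")
    list(APPEND CMAKE_CUDA_ARCHITECTURES_ALL 87)
  endif()
  if(CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.8")
    list(APPEND CMAKE_CUDA_ARCHITECTURES_ALL 89 90)
  endif()
endif()
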
rickardp (Contributor):

Better to just kill this code and bump the cmake requirement to 3.23.0?

matthewdouglas (Author):

I had thought about that, but considering Ubuntu 22.04 LTS as a popular host system, the default is 3.22.1. I believe this is what is on the Docker images that we're using too. I didn't want this PR to change too much outside of the scope of the issues I saw.

I know we could install a newer cmake on the Docker builds though, and GitHub's latest runner images all have CMake 3.28.x on them. So maybe a separate PR to bump up the minimum?

# Ensure we build the PTX for the latest version.
# This is similar to the "all" and "all-major" options in CMake >= 3.23.
# TODO: Consider bumping CMake requirement and using CMAKE_CUDA_ARCHITECTURES=[all | native] by default
list(REMOVE_DUPLICATES COMPUTE_CAPABILITY)
rickardp (Contributor):

The downside of removing unsupported archs is that we now remove them silently if the build agent no longer supports them (for example after a CUDA update). Maybe it would be better to fail if an architecture is specified that is not available?

matthewdouglas (Author):

I was thinking that CI/release builds should be specific and set COMPUTE_CAPABILITY (or CMAKE_CUDA_ARCHITECTURES if moving to use that directly). But otherwise, all we're doing here is adding the virtual architecture to the highest capability selected, e.g. generating both compute_90 and sm_90: arch=compute_90,code=[compute_90,sm_90]. This is for forward compatibility.

We already have a guard for CUDA >= 13, so that would require manual intervention. We'd find out ahead of time as the architectures to be dropped would be deprecated in 12.x first. That should emit warnings from nvcc.

rickardp (Contributor):

Yes, that makes sense
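
For reference, a fail-fast variant along the lines suggested above could look roughly like this sketch, erroring out instead of filtering silently:

# Sketch: reject any requested capability the detected toolkit does not
# support, rather than silently dropping it.
foreach(_cc IN LISTS COMPUTE_CAPABILITY)
  if(NOT "${_cc}" IN_LIST CMAKE_CUDA_ARCHITECTURES_ALL)
    message(FATAL_ERROR "CUDA capability ${_cc} is not supported by this toolkit")
  endif()
endforeach()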

Titus-von-Koeller merged commit 753df25 into bitsandbytes-foundation:main on Feb 27, 2024 (31 checks passed).
Titus-von-Koeller (Collaborator) commented:

Thanks @matthewdouglas, this is really helpful! Really glad to have your support with this :)

Would you be up to join our Slack channel to be able to coordinate about stuff every now and then? If yes, please send me an email at my firstname at huggingface dot co with the email that I should use for the invite.

I'm planning to merge some more workflow-related PRs these days and aim to do a release to test PyPI by the end of the week, in preparation for a full release next week.

Titus-von-Koeller (Collaborator) commented:

And thanks @rickardp for the review, very much appreciated as well!
