cudaPackages: point nvcc at a compatible -ccbin #218265

SomeoneSerge · 2023-02-25T14:54:05Z

EDIT of 2023-04-01: linking the old libstdc++ was a mistake; we should link the newest possible libstdc++, even if we use an older compiler; libstdc++ if backwards-compatible in the sense that if a process loads a newer libstdc++ first, and then loads a library built against an older libstdc++, everything should work just fine

This PR is a rather dirty hot-fix needed to resume caching cuda-enabled packages. I hope we can just merge this and remove the scum later

This PR specifically does not address the issue of downstream packages (like torch) consuming several major versions of gcc and libstdc++ (one from stdenv, and one from cudatoolkit.cc)

Desiredata

NVCC uses a compatible backend by default

E.g. cuda11 uses gcc11 and links to gcc11.cc.lib, even if stdenv is at gcc12. This means successful buildPhase-s
No runtime linkage errors

E.g. no libstdc++.so.6: version `GLIBCXX_3.4.30' not found after a successful build. This means successful checkPhase-s
In default configuration, a downstream package's runtime closure only includes one toolchain

E.g. we don't link cached packages against multiple versions of libstdc++.so at once, and maybe there's a warning if we accidentally try to. This means smaller closures and fewer conflicts

Description of changes

This is a hot-fix to un-break cuda-enabled packages (like tensorflow, jaxlib, faiss, opencv, ...) after the gcc11->gcc12 bump. We should probably build the whole downstream packages with a compatible stdenv (such as gcc11Stdenv for cudaPackages_11), but just pointing nvcc at the right compiler seems to do the trick

We already used this hack for non-redist cudatoolkit. Now we use it more consistently.

This commit also re-links cuda packages against libstdc++ from the same "compatible" gcc, rather than the current stdenv. We didn't test if this is necessary -> need to revise in further PRs

NOTE: long-term we should make it possible to override -ccbin and use e.g. clang
NOTE: the NVCC_PREPEND_FLAGS line pollutes build logs with warnings when e.g. cmake appends another -ccbin

Things done

Notify maintainers

@NixOS/cuda-maintainers @ConnorBaker @mcwitt

pkgs/development/libraries/science/math/nccl/default.nix

pkgs/development/compilers/cudatoolkit/redist/build-cuda-redist-package.nix

SomeoneSerge · 2023-02-25T17:08:15Z

UPD: torchvision seems to still override -ccbin, but I expect #218035 should fix it

SomeoneSerge · 2023-02-26T06:39:49Z

Result of nixpkgs-review pr 218265 run on x86_64-linux 1

3 packages marked as broken and skipped:

cudaPackages.nvidia_driver
python310Packages.caffeWithCuda
truecrack-cuda

5 packages failed to build:

cudaPackages.tensorrt (cudaPackages.tensorrt_8_4_0)
mathematica-cuda
python310Packages.tensorflowWithCuda
python310Packages.tensorrt
python311Packages.tensorrt

69 packages built:

caffeWithCuda
colmapWithCuda
cudaPackages.cuda_cccl
cudaPackages.cuda_cudart
cudaPackages.cuda_cuobjdump
cudaPackages.cuda_cupti
cudaPackages.cuda_cuxxfilt
cudaPackages.cuda_demo_suite
cudaPackages.cuda_documentation
cudaPackages.cuda_gdb
cudaPackages.cuda_memcheck
cudaPackages.cuda_nsight
cudaPackages.cuda_nvcc
cudaPackages.cuda_nvdisasm
cudaPackages.cuda_nvml_dev
cudaPackages.cuda_nvprof
cudaPackages.cuda_nvprune
cudaPackages.cuda_nvrtc
cudaPackages.cuda_nvtx
cudaPackages.cuda_nvvp
cudaPackages.cuda_sanitizer_api
cudatoolkit (cudaPackages.cudatoolkit ,cudatoolkit_11)
cudaPackages.cudnn (cudaPackages.cudnn_8_7_0)
cudaPackages.cudnn_8_4_1
cudaPackages.cudnn_8_5_0
cudaPackages.cudnn_8_6_0
cudaPackages.cutensor
cudaPackages.fabricmanager
cudaPackages.libcublas
cudaPackages.libcufft
cudaPackages.libcufile
cudaPackages.libcurand
cudaPackages.libcusolver
cudaPackages.libcusparse
cudaPackages.libnpp
cudaPackages.libnvidia_nscq
cudaPackages.libnvjpeg
cudaPackages.nccl
cudaPackages.nsight_compute
cudaPackages.nsight_systems
cudaPackages.nvidia_fs
forge
gpu-burn
gpu-screen-recorder
gpu-screen-recorder-gtk
gromacsCudaMpi
gwe
hip-nvidia
katagoWithCuda
librealsenseWithCuda
magma (magma-cuda)
nvtop
nvtop-nvidia
python310Packages.TheanoWithCuda
python310Packages.cupy
python310Packages.jaxlibWithCuda
python310Packages.numbaWithCuda
python310Packages.pycuda
python310Packages.pynvml
python310Packages.pyrealsense2WithCuda
python310Packages.torchWithCuda
python311Packages.TheanoWithCuda
python311Packages.cupy
python311Packages.jaxlibWithCuda
python311Packages.pycuda
python311Packages.pynvml
python311Packages.pyrealsense2WithCuda
xgboostWithCuda
xpraWithNvenc

SomeoneSerge · 2023-02-26T10:33:35Z

Failed derivations

Mathematica_13.2.0_BNDL_LINUX.sh.drv (derivation hash: aqghfka6isdpinz89rysidz2wavfarps)
- https://gist.github.com/18589d227e5f0d584276364467155bcc
TensorRT-8.4.0.6.Linux.x86_64-gnu.cuda-11.6.cudnn8.3.tar.gz.drv (derivation hash: 7hdh3vdgbrwxv8619k80zhn4vb5bbbj0)
- https://gist.github.com/6d86d6895fb7b91ac391a8a1d4b2cb02
tensorflow-gpu-2.11.0.drv (derivation hash: bi2cd86mhm6xfz9y4jxqphnnnlj1apvw)
- https://gist.github.com/02419e1534499948d74f4d68db5d7bea

SomeoneSerge · 2023-02-27T14:30:27Z

Rebuilding python3Packages.tensorflow, will see if this helped

pkgs/development/compilers/cudatoolkit/flags.nix

pkgs/development/python-modules/tensorflow/default.nix

pkgs/development/compilers/cudatoolkit/flags.nix

SomeoneSerge · 2023-02-28T01:08:12Z

pkgs/development/compilers/cudatoolkit/flags.nix


 in
+# A silly unit-test
+assert (formatCapabilities { cudaCapabilities = [ "7.5" "8.6" ]; }) == {


Would be more interesting to test [ "8.6" "7.5" ]. Should this preserve the order? Should this print a warning?

It's my opinion that capabilities should be sorted, so I would want the order of the output to be invariant with respect to the order of the input (which should already be sorted). Although, I'd love to hear other views!

The way we handle this parameter now, the order is significant. It's our semi-implicit convention that the last element goes into PTX. Maybe the take away is rather that we don't want this to be implicit:)

Hm, I think you're right there -- the last capability in the list shouldn't be the one which gets turned into a virtual architecture.

Although, I do like the idea of having them ordered so packages can decide what to build for. For example, Magma doesn't support 8.6/8.9, so I can imagine at some point in the future Magma iterating over the list of cuda capabilities to find the greatest lower bound (in Magma's case, 8.0) and building for that architecture.

Left as a TODO

pkgs/development/compilers/cudatoolkit/flags.nix

samuela · 2023-02-28T01:31:03Z

pkgs/development/compilers/cudatoolkit/common.nix

@@ -191,7 +209,7 @@ stdenv.mkDerivation rec {
  preFixup =
    let rpath = lib.concatStringsSep ":" [
      (lib.makeLibraryPath (runtimeDependencies ++ [ "$lib" "$out" "$out/nvvm" ]))
-      "${stdenv.cc.cc.lib}/lib64"
+      "${gcc.cc.lib}/lib64"


the gcc/stdenv distinction here is subtle enough i believe it deserves a comment

How about this one?

I guess my confusion was rather why reference gcc directly instead of accessing it through stdenv

Oh, right. So, I should explain that gcc here is what's we override in extension.nix based on versions.toml
Actually, maybe I should override stdenv too? Like so

https://github.com/SomeoneSerge/nixpkgs/blob/cc4f01552c2dca50b452170df2770edb71148555/pkgs/development/compilers/cudatoolkit/extension.nix#L11-L15

ooh yeah not a bad idea...

could we get away with only overriding stdenv and then pulling the gcc version from that stdenv?

@samuela Good. I'm thinking about exposing that stdenv in cudaPackages (rather than cudatoolkit) then. I, however, feel uneasy about exposing it as cudaPackages.stdenv because it might affect people's expectations... E.g. since clangStdenv contains clang, people might think cudaPackages.stdenv contains nvcc

Alt names I'm thinking of: cudaStdenv (might be misinterpreted the same way), backendStdenv (exactly what it is, but hard to pronounce 🤣 ). Do you like any?

Good point! I agree that cudaStdenv could be slightly misleading. It's a little tricky to name... Maybe matchingStdenv? compatibleStdenv? Idk I'm happy with whatever you feel is most appropriate

I kept the ugly name, because "backend for nvcc" seemed like the clearest description...

pkgs/development/compilers/cudatoolkit/extension.nix

pkgs/development/compilers/cudatoolkit/redist/overrides.nix

pkgs/development/libraries/science/math/nccl/default.nix

This is needed for faster builds when debugging the opencv derivation, and it's more consistent with other cuda-enabled packages -DCUDA_GENERATION seems to expect architecture names, so we refactor cudaFlags to facilitate easier extraction of the configured archnames

Make tensorflow (and a bunch of ther things) use CUDA-compatible toolchain. Introduces cudaPackages.backendStdenv

Co-authored-by: Connor Baker <ConnorBaker01@Gmail.com>

pkgs/development/libraries/science/math/magma/generic.nix

samuela · 2023-03-04T01:11:07Z

Wooo, thanks for seeing this through @SomeoneSerge ! diff LGTM. @ConnorBaker are you ok with these changes? i saw that you two still have some convos open

ConnorBaker · 2023-03-04T02:08:18Z

@samuela looks good to me!

@SomeoneSerge thank you for all the work you put into this :)

ConnorBaker

Looks good! Just questions for future reference/work

ConnorBaker · 2023-03-04T02:40:40Z

pkgs/development/libraries/science/math/magma/generic.nix

+      minArch' = builtins.head (builtins.sort builtins.lessThan cudaArchitectures);
+    in
+    # If this fails some day, something must've changed and we should re-validate our assumptions
+    assert builtins.stringLength minArch' == 2;


Nit for later: This is lexicographic sorting right? Won't we run into issues starting with Blackwell (post-Hopper) because we'll have capabilities starting with a one? E.g., "100" < "50".

Again, just for future stuff! We'd have until at least 2024 before this becomes a problem, and that's assuming they keep the same naming scheme.

My preference is to see this PR merged sooner rather than later so I can work on rebasing my PRs ;)

ConnorBaker · 2023-03-04T02:43:03Z

pkgs/development/libraries/science/math/nccl/default.nix

-  nativeBuildInputs = [ which addOpenGLRunpath ];
+  nativeBuildInputs = [
+    which
+    addOpenGLRunpath


Would the autoAddOpenGLRunpathHook also work here, or do we need to manually invoke it in postFixup?

autoAddOpenGLRunpathHook should work, but we better test

SomeoneSerge · 2023-03-06T13:26:24Z

Another note for later (but not to delay the merge):

Apparently one can also encounter the libstdc++.so.6: version `GLIBCXX_3.4.30' not found issue when cuda-enabled packages use scipy: https://gist.github.com/SomeoneSerge/7883b70e11ba4d874f65dbe39de99cf1#file-gistfile0-txt-L816 (built from https://github.com/SomeoneSerge/pkgs/blob/master/pkgs/gpflow.nix)

samuela · 2023-03-06T16:41:49Z

In the interest of keeping things moving I went ahead and merged. AFAIU there are still a 4 things left as TODOs for future PRs however:

libstdc++.so.6 issues (link)
try autoAddOpenGLRunpathHook in nccl (link)
lexicographic sorting of arch's is not future proof (link)
more arch list sorting (link)

Thanks @SomeoneSerge !

ConnorBaker · 2023-03-09T14:33:02Z

Thank you @samuela for the summary! I made issues to track those followups:

SomeoneSerge · 2023-04-01T03:24:03Z

pkgs/development/compilers/cudatoolkit/common.nix

+    cat <<EOF >> $out/nix-support/setup-hook
+    cmakeFlags+=' -DCUDA_TOOLKIT_ROOT_DIR=$out'
+    cmakeFlags+=' -DCUDA_HOST_COMPILER=${backendStdenv.cc}/bin'
+    cmakeFlags+=' -DCMAKE_CUDA_HOST_COMPILER=${backendStdenv.cc}/bin'


NOTE: nvidia/thrust treats this as a path to the executable, not parent directory
TODO: check if maybe nvidia/thrust actually does this right

SomeoneSerge added the 6.topic: cuda Parallel computing platform and API label Feb 25, 2023

SomeoneSerge commented Feb 25, 2023

View reviewed changes

pkgs/development/libraries/science/math/nccl/default.nix Outdated Show resolved Hide resolved

SomeoneSerge commented Feb 25, 2023

View reviewed changes

pkgs/development/libraries/science/math/nccl/default.nix Show resolved Hide resolved

SomeoneSerge commented Feb 25, 2023

View reviewed changes

pkgs/development/compilers/cudatoolkit/redist/build-cuda-redist-package.nix Outdated Show resolved Hide resolved

ofborg bot requested a review from mdaiter February 25, 2023 15:22

ofborg bot added 10.rebuild-darwin: 11-100 10.rebuild-linux: 11-100 labels Feb 25, 2023

SomeoneSerge marked this pull request as ready for review February 25, 2023 16:54

SomeoneSerge mentioned this pull request Feb 27, 2023

python3.pkgs.buildPython*: allow overriding of the stdenv #173411

Merged

13 tasks

github-actions bot added the 6.topic: python label Feb 27, 2023

ofborg bot added the 2.status: merge conflict This PR has merge conflicts with the target branch label Feb 27, 2023

SomeoneSerge commented Feb 27, 2023

View reviewed changes

pkgs/development/compilers/cudatoolkit/flags.nix Outdated Show resolved Hide resolved

SomeoneSerge force-pushed the hotfix-nvcc-gcc-incompatibility branch 2 times, most recently from 04142e2 to f63d5d3 Compare February 28, 2023 00:52