cudaPackages.cudatoolkit: use nix-built dependencies to avoid spurious failures #224646

Closed
SomeoneSerge opened this issue Apr 4, 2023 · 9 comments

@SomeoneSerge
Contributor

SomeoneSerge commented Apr 4, 2023

Describe the bug

As seen in #222273, #178440 introduced a regression:

nix-shell torch-shell.nix --command "python torch-compile-tutorial-01.py"
ImportError: /nix/store/chzf3k3s07wd9i7xgzg6ha667bjhpc51-cudatoolkit-11.7.0/host-linux-x64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /nix/store/q9vc4bbnpyswkl8scxaqgjlxpsh9ncb6-python3.10-torch-2.0.0/lib/python3.10/site-packages/torch/lib/libtorch_python.so)

I guess the hotfix is to either remove the bundled libstdc++.so.6 or replace it with a symlink to our own libstdc++.

Note that we did not notice this issue during the initial review of #178440, a long time before the gcc11->gcc12 update. This must be because the libstdc++ shipped by cudatoolkit was actually compatible with the pytorch we were building back then.
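One way to check this kind of mismatch is to list the GLIBCXX version tags that a given libstdc++.so.6 carries. The following is a minimal sketch, assuming GNU grep; the function names glibcxx_versions and provides_glibcxx are hypothetical helpers and not part of nixpkgs.

```shell
# List the GLIBCXX version tags embedded in a libstdc++ binary.
# grep -a treats the binary as text, -o prints only the matching part.
glibcxx_versions() {
  grep -a -o 'GLIBCXX_[0-9][0-9.]*' "$1" | sort -u
}

# Succeed only if the library carries exactly the requested tag
# (-x matches the whole line, so GLIBCXX_3.4.2 does not match GLIBCXX_3.4.29).
provides_glibcxx() {
  glibcxx_versions "$1" | grep -q -x -F "$2"
}
```

Called on the host-linux-x64/libstdc++.so.6 from the error above with GLIBCXX_3.4.29 as the second argument, provides_glibcxx would be expected to fail, which is consistent with the ImportError.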

An ls reveals there are many more dependencies in cudatoolkit that could have been symlinks to nix store paths, which probably would have been less dangerous:

ls /nix/store/chzf3k3s07wd9i7xgzg6ha667bjhpc51-cudatoolkit-11.7.0/host-linux-x64/
CrashReporter                       libGenericHierarchy.so         libQt5MultimediaWidgets.so.5  libssl.so
ImportNvtxt                         libGpuInfo.so                  libQt5Network.so.5            libssl.so.1.1
libAgentAPI.so                      libGpuTraits.so                libQt5OpenGL.so.5             libstdc++.so.6
libAnalysisData.so                  libicudata.so.56               libQt5Positioning.so.5        libStreamSections.so
libAnalysisProto.so                 libicui18n.so.56               libQt5PrintSupport.so.5       libSymbolAnalyzerLight.so
libAnalysis.so                      libicuuc.so.56                 libQt5QmlModels.so.5          libSymbolDemangler.so
libAppLibInterfaces.so              libInjectionCommunicator.so    libQt5Qml.so.5                libTimelineAssert.so
libAppLib.so                        libInterfaceData.so            libQt5QuickParticles.so.5     libTimelineCommon.so
libAssert.so                        libInterfaceSharedBase.so      libQt5Quick.so.5              libTimelineUIUtils.so
libboost_atomic.so.1.70.0           libInterfaceSharedCore.so      libQt5QuickTest.so.5          libTimelineWidget.so
libboost_chrono.so.1.70.0           libInterfaceSharedLoggers.so   libQt5QuickWidgets.so.5       libz.so
libboost_container.so.1.70.0        libInterfaceShared.so          libQt5Script.so.5             libz.so.1.2.7
libboost_date_time.so.1.70.0        libLinuxPerf.so                libQt5ScriptTools.so.5        Mesa
libboost_filesystem.so.1.70.0       libnvlog.so                    libQt5Sensors.so.5            nsys-ui
libboost_iostreams.so.1.70.0        libNvQtGui.so                  libQt5Sql.so.5                nsys-ui.bin
libboost_program_options.so.1.70.0  libpapi.so.5                   libQt5Svg.so.5                nsys-ui.desktop.template
libboost_python35.so.1.70.0         libpfm.so.4                    libQt5Test.so.5               nsys-ui.png
libboost_regex.so.1.70.0            libProcessLauncher.so          libQt5WaylandClient.so.5      NVIDIA_SLA.pdf
libboost_serialization.so.1.70.0    libProtobufCommClient.so       libQt5WaylandCompositor.so.5  nvlog.config.template
libboost_system.so.1.70.0           libProtobufCommProto.so        libQt5WebChannel.so.5         Plugins
libboost_thread.so.1.70.0           libProtobufComm.so             libQt5WebEngineCore.so.5      python
libboost_timer.so.1.70.0            libprotobuf-shared.so          libQt5WebEngine.so.5          QdstrmImporter
libCommonProtoServices.so           libQt5Charts.so.5              libQt5WebEngineWidgets.so.5   reports
libCommonProtoStreamSections.so     libQt5Concurrent.so.5          libQt5Widgets.so.5            ResolveSymbols
libCore.so                          libQt5Core.so.5                libQt5X11Extras.so.5          resources
libcrypto.so                        libQt5DBus.so.5                libQt5XcbQpa.so.5             rules
libcrypto.so.1.1                    libQt5DesignerComponents.so.5  libQt5XmlPatterns.so.5        Scripts
libCudaDrvApiWrapper.so             libQt5Designer.so.5            libQt5Xml.so.5                sqlite3
libDevicePropertyProto.so           libQt5Gui.so.5                 libQtPropertyBrowser.so       translations
libDeviceProperty.so                libQt5Help.so.5                libsqlite3-shared.so
libexec                             libQt5MultimediaQuick.so.5     libSshClient.so
libexporter.so                      libQt5Multimedia.so.5          libssh.so
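To separate the bundled copies of common third-party libraries from NVIDIA's own components in a listing like the one above, a filter along these lines might help. It is only a sketch: the function name bundled_third_party is hypothetical, and the pattern list is an illustrative guess based on the names visible in the listing, not an authoritative inventory.

```shell
# Read file names on stdin and print those that look like bundled copies
# of widely packaged third-party libraries. The prefixes are illustrative.
bundled_third_party() {
  grep -E '^lib(Qt5|boost_|icu|ssl|crypto|stdc\+\+|z\.|sqlite3|ssh\.|protobuf|papi|pfm)'
}
```

A possible invocation is `ls result/host-linux-x64/ | bundled_third_party`, where result is assumed to be a build output symlink for cudatoolkit.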

EDIT 2023-04-06: As a matter of fact it's very likely that they all can be removed: they're probably used by the profiling tools through $ORIGIN, and I would bet that we already replace $ORIGIN with the paths to respective nixpkgs packages
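Whether a given binary still relies on $ORIGIN can be checked by inspecting its Runpath. The sketch below only handles the string comparison; runpath_has_entry is a hypothetical helper, and it matches whole entries only, so an entry such as $ORIGIN/../lib would need to be passed in full.

```shell
# Succeed if a colon-separated RUNPATH string ($1) contains entry $2.
# Wrapping both sides in colons makes the match cover whole entries only.
runpath_has_entry() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Assuming patchelf is available, the Runpath string can be obtained with `patchelf --print-rpath <binary>` and passed as the first argument, for example `runpath_has_entry "$(patchelf --print-rpath result/host-linux-x64/nsys-ui.bin)" '$ORIGIN'`. The choice of nsys-ui.bin here is only illustrative.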

Notify maintainers

CC @NixOS/cuda-maintainers

Expected response

The barest minimum is that we fix the libstdc++ error

@samuela
Member

samuela commented Apr 4, 2023

How exactly did #178440 cause this? Looking at the diff, it appears that we still have backendStdenv in RPATH just as before. Perhaps I'm missing something?

@SomeoneSerge
Contributor Author

@samuela actually, I don't know for sure if #178440 is the cause, I only assumed so because this error appeared in the torch 2.0 PR only after I rebased on master. Maybe I prematurely jumped to conclusions. As for backendStdenv, it's being removed by #223664

@SomeoneSerge
Contributor Author

I was wondering if I accidentally dropped some bash that might have been removing these libraries, but looking at a pre-merge revision, we have always had them:

nix build --impure --expr '(import (builtins.getFlake github:NixOS/nixpkgs/c819f0adc75eccb71d95df4bd7c23c06471ccf43) { config.cudaSupport = true; config.allowUnfree = true; }).cudaPackages.cudatoolkit'
ls result/host-linux-x64/
CrashReporter                  libboost_iostreams.so.1.70.0        libDevicePropertyProto.so    libInterfaceSharedLoggers.so  libQt5Core.so.5                libQt5Qml.so.5                libQt5WebEngineCore.so.5     libstdc++.so.6             nsys-ui.png
ImportNvtxt                    libboost_program_options.so.1.70.0  libDeviceProperty.so         libInterfaceShared.so         libQt5DBus.so.5                libQt5QuickParticles.so.5     libQt5WebEngine.so.5         libStreamSections.so       NVIDIA_SLA.pdf
libAgentAPI.so                 libboost_python35.so.1.70.0         libexec                      libLinuxPerf.so               libQt5DesignerComponents.so.5  libQt5Quick.so.5              libQt5WebEngineWidgets.so.5  libSymbolAnalyzerLight.so  nvlog.config.template
libAnalysisData.so             libboost_regex.so.1.70.0            libexporter.so               libnvlog.so                   libQt5Designer.so.5            libQt5QuickTest.so.5          libQt5Widgets.so.5           libSymbolDemangler.so      Plugins
libAnalysisProto.so            libboost_serialization.so.1.70.0    libGenericHierarchy.so       libNvQtGui.so                 libQt5Gui.so.5                 libQt5QuickWidgets.so.5       libQt5X11Extras.so.5         libTimelineAssert.so       python
libAnalysis.so                 libboost_system.so.1.70.0           libGpuInfo.so                libpapi.so.5                  libQt5Help.so.5                libQt5Script.so.5             libQt5XcbQpa.so.5            libTimelineCommon.so       QdstrmImporter
libAppLibInterfaces.so         libboost_thread.so.1.70.0           libGpuTraits.so              libpfm.so.4                   libQt5MultimediaQuick.so.5     libQt5ScriptTools.so.5        libQt5XmlPatterns.so.5       libTimelineUIUtils.so      reports
libAppLib.so                   libboost_timer.so.1.70.0            libicudata.so.56             libProcessLauncher.so         libQt5Multimedia.so.5          libQt5Sensors.so.5            libQt5Xml.so.5               libTimelineWidget.so       ResolveSymbols
libAssert.so                   libCommonProtoServices.so           libicui18n.so.56             libProtobufCommClient.so      libQt5MultimediaWidgets.so.5   libQt5Sql.so.5                libQtPropertyBrowser.so      libz.so                    resources
libboost_atomic.so.1.70.0      libCommonProtoStreamSections.so     libicuuc.so.56               libProtobufCommProto.so       libQt5Network.so.5             libQt5Svg.so.5                libsqlite3-shared.so         libz.so.1.2.7              rules
libboost_chrono.so.1.70.0      libCore.so                          libInjectionCommunicator.so  libProtobufComm.so            libQt5OpenGL.so.5              libQt5Test.so.5               libSshClient.so              Mesa                       Scripts
libboost_container.so.1.70.0   libcrypto.so                        libInterfaceData.so          libprotobuf-shared.so         libQt5Positioning.so.5         libQt5WaylandClient.so.5      libssh.so                    nsys-ui                    sqlite3
libboost_date_time.so.1.70.0   libcrypto.so.1.1                    libInterfaceSharedBase.so    libQt5Charts.so.5             libQt5PrintSupport.so.5        libQt5WaylandCompositor.so.5  libssl.so                    nsys-ui.bin                translations
libboost_filesystem.so.1.70.0  libCudaDrvApiWrapper.so             libInterfaceSharedCore.so    libQt5Concurrent.so.5         libQt5QmlModels.so.5           libQt5WebChannel.so.5         libssl.so.1.1                nsys-ui.desktop.template

@aaronmondal
Contributor

@SomeoneSerge I think these changes are breaking cuda for us. We are using

pkgs.linuxPackages_6_1.nvidia_x11
pkgs.cudaPackages_12.cudatoolkit
pkgs.cudaPackages_12.cudatoolkit.lib

in this flake and just noticed this because it blocks nix flake update. I'm a bit confused because grepping for "not found" in the log below doesn't even find libcuda itself 😅

Error log: error.txt

Last couple lines of log:

auto-patchelf: 16 dependencies could not be satisfied
warn: auto-patchelf ignoring missing libcuda.so.1 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/target-linux-x64/nvgpucs
warn: auto-patchelf ignoring missing libcom_err.so.2 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/target-linux-x64/CollectX/libssl.so.10
error: auto-patchelf could not satisfy dependency libibumad.so.3 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/target-linux-x64/CollectX/clx
error: auto-patchelf could not satisfy dependency libucp.so.0 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/target-linux-x64/CollectX/clx
error: auto-patchelf could not satisfy dependency libuct.so.0 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/target-linux-x64/CollectX/clx
error: auto-patchelf could not satisfy dependency libucs.so.0 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/target-linux-x64/CollectX/clx
error: auto-patchelf could not satisfy dependency libucm.so.0 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/target-linux-x64/CollectX/clx
error: auto-patchelf could not satisfy dependency libibumad.so.3 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/target-linux-x64/CollectX/libclx_api.so
error: auto-patchelf could not satisfy dependency libxshmfence.so.1 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/host-linux-x64/libQt6WebEngineCore.so.6
error: auto-patchelf could not satisfy dependency libxkbfile.so.1 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/host-linux-x64/libQt6WebEngineCore.so.6
error: auto-patchelf could not satisfy dependency libQt6WlShellIntegration.so.6 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/host-linux-x64/Plugins/wayland-shell-integration/libwl-shell-plugin.so
error: auto-patchelf could not satisfy dependency libtiff.so.5 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/host-linux-x64/Plugins/imageformats/libqtiff.so
error: auto-patchelf could not satisfy dependency libmlx5.so.1 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/lib/libcufile_rdma.so.1.5.1
error: auto-patchelf could not satisfy dependency librdmacm.so.1 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/lib/libcufile_rdma.so.1.5.1
error: auto-patchelf could not satisfy dependency libibverbs.so.1 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/lib/libcufile_rdma.so.1.5.1
warn: auto-patchelf ignoring missing libcuda.so.1 wanted by /nix/store/cdab365ncgr42a1rh2g43zf0lv7x7i8y-cudatoolkit-12.0.1/lib/libcuinj64.so.12.0.146
auto-patchelf failed to find all the required dependencies.
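A log like this is easier to act on once the unsatisfied libraries are deduplicated. The following sketch assumes the log lines have exactly the "error: auto-patchelf could not satisfy dependency ... wanted by ..." shape shown above; unsatisfied_deps is a hypothetical helper name.

```shell
# Read an auto-patchelf log on stdin and print each unsatisfied library once.
# Only "error:" lines are extracted; "warn: ... ignoring missing" lines
# (such as the libcuda.so.1 ones) are deliberately skipped.
unsatisfied_deps() {
  sed -n 's/^error: auto-patchelf could not satisfy dependency \([^ ]*\) wanted by .*/\1/p' | sort -u
}
```

For example, `unsatisfied_deps < error.txt` would print one library name per line, which could then be matched against candidate nixpkgs packages.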

cc @SpamDoodler

@SomeoneSerge
Contributor Author

SomeoneSerge commented Apr 6, 2023

Hi @aaronmondal! Yup, that's my fault, I should've tested cudaPackages_12 when we were merging #178440. Will try and fix that in #224986

Note that we are trying to get rid of the legacy cudaPackages.cudatoolkit and switch to the redistributable cudaPackages.{lib,cuda_}* packages (e.g. cuda_cudart or libcublas)

"I'm a bit confused because grepping not found in the log below doesn't even find libcuda itself"

Oh 🙃 this is actually intended! Libcuda is the userspace driver which has to go hand in hand with the system's .ko module. For this reason we never link libcuda.so directly, but we deploy current driver's libcuda.so impurely at /run/opengl-driver/lib/libcuda.so in NixOS, and we add /run/opengl-driver/lib to all our binaries' Runpaths using addOpenGLRunpath or cudaPackages.autoAddOpenGLRunpathHook. This is also why we never use nvidia_x11 directly either
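The effect of adding /run/opengl-driver/lib to a Runpath can be illustrated by walking the entries in order and reporting the first directory that actually holds the library. This is a rough sketch that only loosely mimics the dynamic loader (it ignores LD_LIBRARY_PATH, the loader cache, and $ORIGIN expansion); first_dir_with_lib is a hypothetical helper name.

```shell
# Print the first directory in a colon-separated RUNPATH ($1) that contains
# the library named $2. The body runs in a subshell so that the IFS and
# globbing changes do not leak into the caller.
first_dir_with_lib() (
  IFS=:
  set -f  # disable globbing while the RUNPATH string is split
  for dir in $1; do
    if [ -e "$dir/$2" ]; then
      printf '%s\n' "$dir"
      exit 0
    fi
  done
  exit 1
)
```

On a NixOS machine with the NVIDIA driver enabled, passing a Runpath that ends in /run/opengl-driver/lib together with libcuda.so.1 would be expected to print that directory, while a Runpath made only of store paths would be expected to find nothing.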

@aaronmondal
Contributor

@SomeoneSerge Ah thanks for clearing that up! I'll try out the patch later today 😊

@SpamDoodler It's new, it's shiny, I want it 😄. I think #224646 (comment) was the insight we needed, and it explains why our non-local cuda toolchains are so fragile. We should rework our *_nvptx toolchains to use this cuda properly. I think this also explains why things didn't work with WSL.

@SomeoneSerge
Contributor Author

@aaronmondal offtopic, but it would seem that you and your colleagues are familiar with Bazel? It is my impression that there's quite a bit of struggle with Bazel in nixpkgs: platform-dependent fetchAttrs.sha256s, bazel ignoring our buildInputs and nativeBuildInputs, and probably much more. We mostly put up with this for the packages like jax and tensorflow, but I believe we could do better, only maybe we lack the expertise and motivation. I wonder if any of you would be interested in having a look at the Bazel situation in nixpkgs

P.S. I'd have asked on the discord linked on eomii.org, but it insists that I provide a phone number 😅

@aaronmondal
Contributor

@SomeoneSerge Ahh yeah of course we'd like to help 😊

We're trying to get a solid Nix/Bazel interop to work at this very moment, and also noticed that this is actually fairly hard to get working. Jax is also on the list of libraries we'll want to support well in rules_ll at some point, so this is absolutely within our scope. I'll take a look at the issues.

cc @jaroeichler @SpamDoodler @JannisFengler

@SomeoneSerge
Contributor Author

SomeoneSerge commented May 9, 2023

I don't remember if this was fixed by #224986 or somewhere else, but I don't observe the libstdc++ error on master anymore. A decision on forcefully including $ORIGIN in Runpaths was made in #226038, but it may be reverted in the future. Tracking the rest in #226165

4 participants