Use cuda_compat drivers when available #267247

Merged (1 commit) on Dec 12, 2023
9 changes: 9 additions & 0 deletions pkgs/development/cuda-modules/generic-builders/manifest.nix
@@ -1,6 +1,7 @@
{
# General callPackage-supplied arguments
autoAddOpenGLRunpathHook,
autoAddCudaCompatRunpathHook,
autoPatchelfHook,
backendStdenv,
fetchurl,
@@ -126,6 +127,14 @@ backendStdenv.mkDerivation (
# Check e.g. with `patchelf --print-rpath path/to/my/binary`
autoAddOpenGLRunpathHook
markForCudatoolkitRootHook
]
# autoAddCudaCompatRunpathHook depends on cuda_compat and would cause
# infinite recursion if applied to `cuda_compat` itself (besides the fact
# that it doesn't make sense in the first place)
++ lib.optionals (pname != "cuda_compat" && flags.isJetsonBuild) [
# autoAddCudaCompatRunpathHook must appear AFTER autoAddOpenGLRunpathHook.
# See its documentation in ./setup-hooks/extension.nix.
autoAddCudaCompatRunpathHook
];

buildInputs =
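A quick way to verify the ordering these hooks produce is to inspect the RUNPATH of a built binary directly; the binary path and store path below are illustrative, not taken from this PR:

# Inspect the RUNPATH of a binary built with both hooks applied; the
# cuda_compat entry must appear before the impure driver link to take effect.
patchelf --print-rpath ./result/bin/my-cuda-app
# Expected shape (illustrative):
#   /nix/store/<hash>-cuda_compat-<version>/compat:/run/opengl-driver/lib:...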
27 changes: 27 additions & 0 deletions pkgs/development/cuda-modules/setup-hooks/auto-add-cuda-compat-runpath.sh
@@ -0,0 +1,27 @@
# shellcheck shell=bash
# Patch all dynamically linked ELF files with the CUDA driver (libcuda.so)
# coming from the cuda_compat package by adding it to the RUNPATH.
echo "Sourcing auto-add-cuda-compat-runpath-hook"

elfHasDynamicSection() {
patchelf --print-rpath "$1" >& /dev/null
}

autoAddCudaCompatRunpathPhase() (
local outputPaths
mapfile -t outputPaths < <(for o in $(getAllOutputNames); do echo "${!o}"; done)
find "${outputPaths[@]}" -type f -executable -print0 | while IFS= read -rd "" f; do
if isELF "$f"; then
# patchelf returns an error on statically linked ELF files
if elfHasDynamicSection "$f" ; then
echo "autoAddCudaCompatRunpathHook: patching $f"
local origRpath="$(patchelf --print-rpath "$f")"
patchelf --set-rpath "@libcudaPath@:$origRpath" "$f"
elif (( "${NIX_DEBUG:-0}" >= 1 )) ; then
echo "autoAddCudaCompatRunpathHook: skipping a statically-linked ELF file $f"
fi
fi
done
)

postFixupHooks+=(autoAddCudaCompatRunpathPhase)
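As a sanity check after the hook has run, one can confirm where a patched binary resolves the driver from, at least when it links libcuda directly rather than dlopen-ing it; the binary name and store path here are illustrative:

# List dynamic dependencies and see which libcuda.so is picked up.
ldd ./result/bin/deviceQuery | grep libcuda
# On a Jetson build this should resolve into the cuda_compat directory
# rather than /run/opengl-driver/lib, e.g. (illustrative):
#   libcuda.so.1 => /nix/store/<hash>-cuda_compat-<version>/compat/libcuda.so.1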
20 changes: 20 additions & 0 deletions pkgs/development/cuda-modules/setup-hooks/extension.nix
Contributor:
I think propagating a different version of addOpenGLRunpath might cause conflicts in derivations that explicitly consume their own version.

Initially, I thought that all we'd need is patchelf --add-needed with the absolute path to ${cuda_compat}/lib/libcuda.so in https://github.com/NixOS/nixpkgs/blob/bb142a6838c823a2cd8235e1d0dd3641e7dcfd63/pkgs/development/compilers/cudatoolkit/redist/build-cuda-redist-package.nix. This would ensure that anybody trying to load, for example, libcudart.so ends up loading the cuda-compat driver first.

Now I think that might be insufficient (how do we know people don't try to dlopen libcuda directly?), and you're right about overriding addOpenGLRunpath. Except we might want to do that on an even more global scale, i.e. change pkgs.addOpenGLRunpath whenever our nixpkgs instantiation happens to target an NVIDIA Jetson (e.g. config.cudaSupport && hostPlatform.system == "aarch64-linux", but maybe also a special flag).

All in all, I think we need to ponder the available options a bit more. Thank you for starting this!
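For reference, the --add-needed alternative mentioned above would look roughly like the following; the store path and target library are illustrative, and this is a sketch of the idea rather than anything the PR implements:

# Make the compat driver an eager, absolute-path DT_NEEDED dependency,
# instead of relying on RUNPATH search order to find it first.
patchelf --add-needed \
  "/nix/store/<hash>-cuda_compat-<version>/compat/libcuda.so" \
  "$out/lib/libcudart.so"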

Contributor (Author):
Wouldn't a simple test that cudaPackages.cuda_compat exists, as we do here, be sufficient? Basically, if a cuda_compat redist package is available, we use it. Maybe we can also add a test that we're not cross-compiling.

I personally think changing pkgs.addOpenGLRunpath would make sense, as what we do here mostly "extends" the usual path /run/opengl-driver so that cuda_compat's libcuda.so is found. But we noticed that some packages use e.g. the attribute addOpenGLRunpath.driverLink, so we must be careful not to change the interface of the derivation.

Contributor:
> Wouldn't a simple test that cudaPackages.cuda_compat exists, as we do here, be sufficient?

Yes and no, e.g. cf. #266475. I'd say we probably don't want the definition of addOpenGLRunpath to depend on the internals of the cudaPackages set, but it's OK if it depends on a top-level attribute like config. I haven't thought about this much, though.

> some packages use e.g. the attribute addOpenGLRunpath.driverLink, so we must be careful not to change the interface of the derivation.

Yes

@@ -44,4 +44,24 @@ final: _: {
./auto-add-opengl-runpath-hook.sh
)
{};

# autoAddCudaCompatRunpathHook must be added AFTER `setupCudaHook`. Both
# hooks prepend a path containing `libcuda.so` to the `DT_RUNPATH` section
# of patched ELF files, but the `cuda_compat` path must take precedence
# (otherwise it has no effect) and thus appear first, meaning this hook
# must be executed last.
autoAddCudaCompatRunpathHook =
final.callPackage
(
{makeSetupHook, cuda_compat}:
makeSetupHook
{
name = "auto-add-cuda-compat-runpath-hook";
substitutions = {
libcudaPath = "${cuda_compat}/compat";
};
}
./auto-add-cuda-compat-runpath.sh
)
{};
Comment on lines +47 to +66
Contributor:
Just a thought: maybe we don't want the directly-linked cuda_compat to take precedence over the impure path. Rather, we want the impure path to also contain a cuda_compat. I.e. we want jetpack-nixos to deploy the compat libcuda, and we want the wrappers for Ubuntu to also prefer the compat library over the normal one.

Here's how it goes (well, it's all assumptions): we always want to use the newest possible cuda_compat. We're going to link one directly through DT_RUNPATH at build time, which we expect to be newer than the libcuda in the JetPack Ubuntu installation. However, a year or a few down the line, we might want to reuse a package built from an old nixpkgs revision. It's conceivable that we'd then be running a much newer kernel module, one for which the cuda_compat in the runpath would be too old. In that case we'd just deploy a newer cuda_compat on our jetpack-nixos and make the app use it. We could do that via LD_LIBRARY_PATH/LD_PRELOAD, or we could try to give the impure runpath higher priority. The latter might hypothetically break because it could rely on a newer libc/libstdc++, but in practice NVIDIA seems to always link old versions. LD_* are always an option.
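The LD_* escape hatch described above would look something like this; the application name is illustrative, and it relies on LD_LIBRARY_PATH being searched before DT_RUNPATH:

# Point the loader at a newer driver deployed on the host, overriding
# the cuda_compat baked into the RUNPATH at build time:
LD_LIBRARY_PATH=/run/opengl-driver/lib ./my-cuda-app

# Or pin the driver library explicitly:
LD_PRELOAD=/run/opengl-driver/lib/libcuda.so ./my-cuda-app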

Contributor (Author):
All in all, I think I agree with you. The driver is provided impurely in /run/opengl-driver, and as cuda_compat is supposed to be a drop-in replacement for the driver, I don't see a good reason why it should be treated differently.

That being said, jetpack-nixos is lagging behind right now (it's based on NixOS 22.11), so this PR would still be a temporary (TM) way to get cuda_compat working there without waiting for a newer JetPack? But I would be tempted to get rid of this once #256324 and the associated PRs all make it into a NixOS release, and jetpack-nixos supports that release.

Contributor:
> this PR would still be a temporary (TM) way to

I wonder if we've misunderstood each other again. To be clear, I hope that this PR's changes will stay and that we won't have to remove them: cuda_compat would essentially work the same way the ROCm drivers do (which, afaiu, we link directly and don't need tools like nvidia-docker2 for). I only wonder about special cases where the user might, hypothetically, load a different libcuda.

Contributor (Author):
> Just a thought: maybe we don't want the directly-linked cuda_compat to take precedence over the impure path. Rather, we want the impure path to also contain a cuda_compat. I.e. we want jetpack-nixos to deploy the compat libcuda, and we want the wrappers for Ubuntu to also prefer the compat library over the normal one.

I interpreted this, at least for jetpack-nixos, as "let's put cuda_compat into /run/opengl-driver because it's more flexible: jetpack-nixos becomes responsible for selecting cuda_compat or the normal drivers, and it's easy to change dynamically by toggling what we put in /run/opengl-driver, without resorting to LD_*". In that case, would this PR still be needed? It seems to me that the answer is no, but maybe I'm missing something.

}
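If jetpack-nixos were to deploy cuda_compat through the impure path as discussed, checking which driver the impure link actually exposes would be straightforward; this is a hypothetical check, not something set up by this PR:

# See whether the impure driver link already points at a cuda_compat build.
readlink -f /run/opengl-driver/lib/libcuda.so.1
# A cuda_compat store path here would mean the impure path provides the
# compat driver itself, making the build-time RUNPATH entry redundant.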