Use cuda_compat drivers when available #267247
Conversation
I think propagating a different version of `addOpenGLRunpath` might cause conflicts in derivations that explicitly consume their own version.

Initially, I thought that all we'd need is `patchelf --add-needed` with the absolute path to `${cuda_compat}/lib/libcuda.so` in https://github.com/NixOS/nixpkgs/blob/bb142a6838c823a2cd8235e1d0dd3641e7dcfd63/pkgs/development/compilers/cudatoolkit/redist/build-cuda-redist-package.nix. This would ensure that anybody trying to load, for example, `libcudart.so` ends up loading the cuda-compat driver first.
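A minimal sketch of that idea, with hypothetical paths (the store path and library location below are illustrative, not real outputs of this PR; the `patchelf` invocation is printed rather than executed, so the sketch is self-contained):

```shell
# Hypothetical sketch of the patchelf --add-needed idea. The store path
# below is illustrative, not a real output of this PR.
cuda_compat="/nix/store/xxxx-cuda_compat-12.2"
lib="result/lib/libcudart.so"

# The command the fixup phase would run (printed here, not executed):
echo patchelf --add-needed "${cuda_compat}/compat/libcuda.so" "$lib"
```

With `DT_NEEDED` pointing at the compat `libcuda.so`, the dynamic loader would resolve it before any `libcuda.so` found on the runpath.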
Now I think that might be insufficient (how do we know people don't try to `dlopen` libcuda directly?) and you're right about overriding `addOpenGLRunpath`. Except we might want to do that on an even more global scale, i.e. we might want to change `pkgs.addOpenGLRunpath` whenever our nixpkgs instantiation happens to target an NVIDIA Jetson (e.g. `config.cudaSupport && hostPlatform.system == "aarch64-linux"`, but maybe also a dedicated flag).

All in all, I think we need to ponder the available options a bit more. Thank you for starting this!
Wouldn't a simple test that `cudaPackages.cuda_compat` exists, as we do here, be sufficient? Basically, if there's a CUDA cuda_compat redist package available, we use it. Maybe we can add a test to see if we're not cross-compiling.

I personally think changing `pkgs.addOpenGLRunpath` would make sense, as what we do is mostly to "extend" the usual path `/run/opengl-driver` to find cuda_compat's `libcuda.so`. But we noticed that some packages are using e.g. the attribute `addOpenGLRunpath.driverLink`, so we must be careful not to change the interface of the derivation.
> Wouldn't a simple test that cudaPackages.cuda_compat exists, as we do here, be sufficient?

Yes and no, e.g. cf. #266475. I'd say we probably don't want the definition of `addOpenGLRunpath` to depend on the internals of the `cudaPackages` set, but it's OK if it depends on a top-level attribute like `config`. I haven't thought about this much though.
> some packages are using e.g. the attribute addOpenGLRunpath.driverLink, so we must be careful to not change the interface of the derivation.

Yes.

Actually, no, we do not want … What we really want is … Potential blockers: …
@yannham I just rebased the branch on the latest staging to fix the merge conflicts, so you'll want to rebase as well.
Force-pushed from 7f232fe to ac5faa6
There's a bunch of places, currently in …
@SomeoneSerge thanks a lot for your input! I'm not drawing any conclusion here, and you guys are the experts, but I wonder: how is the OpenGL driver different from cuda_compat? Because I believe your recommendation/alternative way of reaching all CUDA packages could apply to linking to the OpenGL driver as well. That is, instead of using … I can only guess, but I suppose it wasn't done this way because: …

Are those assumptions somehow remotely correct? And if they are, is there anything differentiating … As a side-question, if you think patching build inputs is still the way to go, it sounds like a bigger change: if this PR works as it is with a very minimal change, would that be reasonable as a first step, and then open an issue or whatever tracking method to use the right approach as a follow-up?
In fact, what I'm saying might not make a lot of sense, because I thought … Also, there is a kind of a bootstrapping issue, because …
Force-pushed from 904f0b6 to ac5faa6
Indeed, they are similar, which is why normally we deploy both impurely. The exact restrictions that OpenGL/libgbm impose are different. The … There will be a few programs that use …
Yes, it does seem like we have to deploy … As suggested before, I would try reverting the …
So I tried the proposed approach, but it doesn't work (cuda_compat is ignored and the actual version of the driver, 11.4, is reported). I can't see any … I personally don't see an obvious way of enabling cuda_compat without doing something impure, such as what we do for OpenGL drivers...
Force-pushed from 30b2433 to 85adf60
The last version is attempting to do what @SomeoneSerge suggested, by duplicating the logic of … Unfortunately, I get the following error: …
Additional update: the build works, and the binary runs without any …
I used …
The error looks like CMake trying to `enable_language(CUDA)` and failing some sanity checks... I think I had similar issues when nvcc couldn't find cudart. Could you gist the full log? PS: sorry I'm slow to respond lately.
One more thing: `addOpenGLRunpath` assigns the impure path a higher priority, so it's important that our postFixup happens last or we might load the old driver. We also need to ensure that we're prepending the compat driver to the runpath, not appending. Try inspecting `--print-rpath` of the resulting binaries.
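A small sketch of the ordering concern: `DT_RUNPATH` entries are searched left to right, so the compat directory must land first. The store path below is illustrative, and the `patchelf` commands are shown only as comments so the sketch runs anywhere.

```shell
# Prepend, don't append: the runpath is searched left to right, so the
# compat directory must come first to take precedence.
prepend_runpath() {
  compat_dir=$1
  old=$2
  printf '%s:%s\n' "$compat_dir" "$old"
}

# Illustrative paths, not real store outputs:
prepend_runpath "/nix/store/xxxx-cuda_compat/compat" "/run/opengl-driver/lib"

# With patchelf, one would do roughly:
#   old=$(patchelf --print-rpath "$bin")
#   patchelf --set-rpath "$(prepend_runpath "$compat_dir" "$old")" "$bin"
```

If the compat path were appended instead, the impure `/run/opengl-driver/lib` entry would keep winning and cuda_compat would have no effect.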
Here is the full CMake log. I also tried with … I also attached the environment (serialized with …). As mentioned above, I tried to use the exact exported … Beside trying to …
Damn! I just realized that we probably can't have the compat `libcuda.so` in `DT_NEEDED`, because then we couldn't e.g. load the reverse dependencies inside the nix sandbox. This is because libcuda from compat has the `libnvrm*` libraries in `DT_NEEDED`, but they're impure. E.g. we couldn't import torch, if it loads libcublas, and that loads libcuda, and that fails to find `libnvrm*`. Unless that's all wrong, I guess we want to just modify the runpath...
Another thought: in jetpack-nixos, … Could we do the same thing with the one from the compat library, and patch its `libnvrm*` dependencies directly with the right paths? It might be made more complicated by the fact that the …

Another solution, fully on the jetpack-nixos side: we can control what goes in the …:

```diff
  hardware.opengl.package = pkgs.nvidia-jetpack.l4t-3d-core;
  hardware.opengl.extraPackages = with pkgs.nvidia-jetpack; [
+   # l4t-core provides the libnvrm_gpu.so and libnvrm_mem.so, which are
+   # otherwise missing when using cuda_compat for the time being. Ideally,
+   # they should probably be directly set in a DT_NEEDED entry, as is the
+   # case currently with the `libcuda.so` provided by l4t-cuda
+   l4t-core
    l4t-cuda
    l4t-nvsci # cuda may use nvsci
    l4t-gbm
    l4t-wayland
```

I wonder if doing something as dumb as including …
> We'd have to get hold of the jetpack …
Indeed. If we deploy a jetpack-nixos, it probably only makes sense to deploy the compat cuda driver, since it offers pretty much a superset of the features in l4t-cuda's `libcuda.so`. However, I think we'd still want to (try to) link cuda_compat directly to the nixpkgs' apps, because we'd want them to work on vanilla/Ubuntu Jetsons as well.
Force-pushed from 38f913d to 9004a3c
Force-pushed from 9004a3c to 2d3d5ab
I rebased; now I just need to add a guard to not add this hook when we're not on a Jetson.
```nix
# The autoAddCudaCompatRunpathHook must be added AFTER `setupCudaHook`. Both
# hooks prepend a path with `libcuda.so` to the `DT_RUNPATH` section of
# patched elf files, but the `cuda_compat` path must take precedence
# (otherwise, it doesn't have any effect) and thus appear first. Meaning this
# hook must be executed last.
autoAddCudaCompatRunpathHook =
  final.callPackage
    (
      { makeSetupHook, cuda_compat }:
      makeSetupHook
        {
          name = "auto-add-cuda-compat-runpath-hook";
          substitutions = {
            libcudaPath = "${cuda_compat}/compat";
          };
        }
        ./auto-add-cuda-compat-runpath.sh
    )
    { };
```
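For reference, a hedged sketch of what the `./auto-add-cuda-compat-runpath.sh` setup hook might do (this is not the actual script from the PR; `@libcudaPath@` would be substituted by `makeSetupHook`, and the `patchelf` calls are shown as comments so the sketch stays self-contained):

```shell
# Hypothetical sketch of auto-add-cuda-compat-runpath.sh; not the actual
# script shipped in this PR. @libcudaPath@ is a makeSetupHook substitution.
autoAddCudaCompatRunpath() {
  elf=$1
  # A real implementation would prepend the compat dir to the runpath:
  #   old=$(patchelf --print-rpath "$elf")
  #   patchelf --set-rpath "@libcudaPath@:${old}" "$elf"
  echo "prepending @libcudaPath@ to runpath of ${elf}"
}

autoAddCudaCompatRunpath ./result/bin/demo
```

Running it as part of `postFixup` (after `setupCudaHook`) keeps the compat path first in the runpath, which is the ordering the comment above insists on.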
Just a thought: maybe we don't want the directly-linked cuda_compat to take precedence over the impure path. Rather, we want the impure path to also contain a cuda_compat. I.e. we want jetpack-nixos to deploy the compat libcuda, and we want the wrappers for Ubuntu to also prefer the compat library over the normal one.

Here's how it goes (well, it's all assumptions): we always want to use the newest possible cuda_compat. We're going to link one directly through `DT_RUNPATH` at build time, which we expect to be newer than libcuda in the jetpack Ubuntu installation. However, a year or a few down the line, we might want to reuse a package built from an old nixpkgs revision. It is conceivable that we'd be running a much newer kernel module, one for which the cuda_compat in the runpath would be too old. Then we just deploy a newer cuda_compat on our jetpack-nixos and make the app use it. We could do that via `LD_LIBRARY_PATH`/`LD_PRELOAD`, or we could try to give the impure runpath higher priority. The latter might hypothetically break because it could rely on a newer libc/libstdc++, but in practice nvidia seems to always link old versions. `LD_*` are always an option.
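As a hedged illustration of that `LD_*` escape hatch, overriding a too-old runpath cuda_compat without rebuilding the app (the store path and binary name are hypothetical, and the command is printed rather than executed):

```shell
# Point the dynamic loader at a newer cuda_compat at run time.
# Illustrative path and binary name, not real outputs.
newer_compat="/nix/store/yyyy-cuda_compat-12.4/compat"

# Shown with echo so the sketch runs without a CUDA app present:
echo LD_LIBRARY_PATH="${newer_compat}" ./result/bin/my-cuda-app
```

`LD_LIBRARY_PATH` is searched before `DT_RUNPATH`, which is what makes this override work even when an older compat path is baked into the binary.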
All in all, I think I agree with you. The driver is provided impurely in `/run/opengl-driver`, and as cuda_compat is supposed to be a drop-in replacement for the driver, I don't see a good reason why it should be treated differently.

That being said, I think right now jetpack-nixos is lagging behind (based on NixOS 22.11), so this PR would still be a temporary (TM) way to get cuda_compat working there without waiting for a newer jetpack? But I would be tempted to get rid of this once #256324 and associated PRs all make it into a NixOS release, and jetpack-nixos supports that release.
> this PR would still be a temporary (TM) way to …

I wonder if we still misunderstood each other. To be clear, I hope that this PR's changes will stay and that we won't have to remove them. That cuda_compat would essentially work the same way as ROCm drivers do (which, afaiu, we link directly and which we don't need tools like nvidia-docker2 for). I only wonder about special cases where the user might just hypothetically load a different libcuda.
> Just a thought: maybe we don't want the directly-linked cuda_compat to take precedence over the impure path. Rather, we want the impure path to also contain a cuda_compat. I.e. we want jetpack-nixos to deploy the compat libcuda, and we want the wrappers for ubuntu to also prefer the compat library over normal one

I interpreted this, at least on jetpack-nixos, as "let's put cuda_compat into `/run/opengl-driver` because it's more flexible: jetpack-nixos is now responsible for selecting cuda_compat or the normal drivers, and it's easy to dynamically change by toggling what we put in `/run/opengl-driver` without resorting to `LD_*`". In the latter case, would this PR still be needed? It seems to me that the answer is no, but maybe I'm missing something.
Force-pushed from 2d3d5ab to b685883
Some nvidia devices, such as the Jetson family, support the Nvidia compatibility package (cuda_compat), which allows running executables built against a higher CUDA major version on a system with an older CUDA driver. On such platforms, the consensus among CUDA maintainers is that there is no downside in always enabling it by default. This commit links the relevant cuda_compat shared libraries by patching the CUDA core packages' runpaths when cuda_compat is available, in the same way as we currently do for OpenGL drivers.
Force-pushed from 188a95b to d6c198a
This branch has been rebased, fixed and tested on Jetson. It's good to go.
Ofborg mentions …
Argh. Thanks for noticing @SomeoneSerge. Maybe the guard …
Please @ConnorBaker do not merge when ofborg-eval has not finished on such PRs. Maybe the breakage was avoidable on master here...
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: …
Description of changes

Some nvidia devices, such as the Jetson family, support a package called the nvidia compatibility package (`cudaPackages.cuda_compat`), which allows running executables built against a higher CUDA major version on a system with an older CUDA driver. On such platforms, the consensus among CUDA maintainers is that there is no downside in always enabling it by default.

This PR links the relevant cuda_compat shared libraries, when available, by patching the executables' runpaths, in the same way as we already do for OpenGL drivers.

We initially tried to upgrade and reuse `add-patch-opengl-driver`, because the script does exactly what we want, except that we would like to add several paths, and not just one, to the `runpath` of the concerned executables.

However, this derivation is used in many places (both the result but also attrs such as `driverLink`), making it harder to reuse or to make more generic without causing a mass rebuild and without breaking existing packages. Instead, we just override the `driverLink` attr, which is small, localized, and what we want.

TODO before merging
Things done

- Built on platform(s)
- For non-Linux: Is sandboxing enabled in `nix.conf`? (See Nix manual)
  - `sandbox = relaxed`
  - `sandbox = true`
- Tested, as applicable
- Tested compilation of all packages that depend on this change using `nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"`. Note: all changes have to be committed, also see nixpkgs-review usage
- Tested basic functionality of all binary files (usually in `./result/bin/`)