libcuda.so: driver mismatch on nixos-rebuild switch #255070
I've actually been looking for a solution to the "update-causes-version-mismatch" problem, to make it possible to backport nvidia driver updates. What I've been considering is a variant of a symlink indirection, like this:

```
/run/opengl-driver/lib/libcuda.so.1 -> /run/nvidia/current/lib/libcuda.so.1
/run/nvidia/current -> /run/nvidia/<version>
/run/nvidia/<version> -> /nix/store/nvidia-x11-<version>-<hash>
```

What this will require is: […]

Note that […]
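A rough shell sketch of the indirection above, using a throwaway directory in place of `/run/nvidia` (the version number and layout here are illustrative, not the actual NixOS paths):

```shell
# Simulate the proposed two-level layout in a temp dir (names illustrative).
root=$(mktemp -d)

# "Install" a driver version, as an activation script might.
mkdir -p "$root/535.86.05/lib"
touch "$root/535.86.05/lib/libcuda.so.1"

# (Re)point "current" at that version; ln -sfn replaces an existing link.
ln -sfn "$root/535.86.05" "$root/current"

# A later rebuild with the same driver version can swap the store path
# behind /run/nvidia/<version> without retargeting "current":
ls "$root/current/lib"   # prints: libcuda.so.1
```

The point of the extra `current` hop is that consumers resolve through a stable path while the versioned link can be updated on every rebuild.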
@Kiskae I was actually thinking in a similar direction! Specifically, we could keep track of every deployed configuration's drivers by exposing them in the system environment:

```nix
let
  package = pkgs.buildEnv {
    name = "drivers";
    paths = [ config.hardware.opengl.package ] ++ config.hardware.opengl.extraPaths;
    postBuild = ''
      mkdir drivers
      mv * drivers/
    '';
  };
in
{
  environment.systemPackages = [ package ];
  environment.pathsToLink = [ "/drivers" ];
}
```

With this, we'd have access to /run/booted-system/sw/drivers (NB "booted").

Observation: with this solution, people switching from […]
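To act on the booted configuration's drivers, something would also need to read the kernel module's version. A hypothetical helper (the `nvidia_kernel_version` name is mine; the format is assumed from typical `/proc/driver/nvidia/version` output, and a sample string is used here so the snippet runs anywhere):

```shell
# Hypothetical helper: pull the driver version out of the first line of
# /proc/driver/nvidia/version (format assumed; fed a sample string below).
nvidia_kernel_version() {
  printf '%s\n' "$1" | grep -o '[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*' | head -n1
}

sample='NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.86.05  Fri Jul 14 2023'
nvidia_kernel_version "$sample"   # prints: 535.86.05
```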
The risk is that by tying the nvidia driver to […]. That is why I'm considering the symlink indirection, since it will allow updates to the nvidia driver closure as long as the nvidia driver remains on the same version. In addition, you could add a warning in […].

EDIT: I seemed to recall seeing a PR related to moving […]
@Kiskae maybe I didn't make myself clear, but I was trying to suggest that we'd have both […].

EDIT: RE #158079: wonderful! I forgot that wasn't just about naming the nixos option. So we might just merge that PR, and then make the indirection in […]
I understood that part; what I'm talking about is the more complex libraries, like the vulkan driver, which depends on other libraries. Mind you, this exact thing would still happen in my solution when the version of the driver changes, but as long as the version remains the same, the nvidia driver closure can be updated in sync with the rest of the system.

Essentially there are two ways the driver can cause issues: […]
But do we need to load […]
Yup, same issue as […]:

```
❯ find -L /run/opengl-driver/lib -name "lib*535*" -exec fgrep "API mismatch" {} +
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libcuda.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvcuvid.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-allocator.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-cfg.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-eglcore.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-glcore.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-glsi.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-ml.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-opencl.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/vdpau/libvdpau_nvidia.so.535.86.05: binary file matches
```

The nvidia vulkan driver is actually […]
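The scan above works because the libraries embed the literal "API mismatch" string. A minimal, self-contained reproduction of the same technique on a synthetic "library" file:

```shell
# Reproduce the scan on a synthetic binary containing the marker string.
dir=$(mktemp -d)
printf 'garbage\0API mismatch\0more' > "$dir/libfake.so.1"
printf 'no marker here' > "$dir/libother.so.1"

# -l lists matching files instead of printing "binary file matches" lines.
grep -l "API mismatch" "$dir"/lib*   # prints only the path of libfake.so.1
```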
Related: #269419
There's one more thing we've missed: […]
These probably refer to […]. I know that cuda has official backwards- and forwards-support, but I believe that only exists between […]. The driver itself definitely has version errors:

```
[145cab8] NVIDIA: failed to load the NVIDIA kernel module.\n
[145caf0] NVIDIA: could not create the device file %s\n
[145cb20] NVIDIA: could not open the device file %s (%s).\n
[145cb58] NVIDIA: API mismatch: the NVIDIA kernel module has version %s,\n
          but this NVIDIA driver component has version %s. Please make\n
          sure that the kernel module and all NVIDIA driver components\n
          have the same version.\n
[145cc30] NVIDIA: API mismatch: this NVIDIA driver component has version\n
          %s, but the NVIDIA kernel module's version does not match.\n
          Please make sure that the kernel module and all NVIDIA driver\n
          components have the same version.\n
[145cd10] NVIDIA: could not create file for device %u\n
```
Yes: nixpkgs/pkgs/development/cuda-modules/saxpy/saxpy.cu, lines 27 to 31 in 1a6f704
Uh-huh, that's what I meant by the "userspace driver".
Right, I recall seeing that. I suppose we should change that. Do you know any reason not to?
There is some leeway for libcuda and the kernel module to diverge, which is why […]
I didn't realize that is literally the cuda userspace libraries from a newer driver release. The documentation about compatibility is quite comprehensive: https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title
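As a crude illustration of the "same driver branch" notion the discussion keeps returning to, one can compare the major component of two version strings (the `same_branch` helper is mine; the real minor-version and forward-compatibility rules in NVIDIA's docs are considerably more nuanced):

```shell
# Crude sketch: two driver versions are on the same branch if their major
# components match. This is NOT the full NVIDIA compatibility matrix.
same_branch() {
  [ "${1%%.*}" = "${2%%.*}" ]
}

same_branch 535.86.05 535.113.01 && echo "same branch"
same_branch 535.86.05 545.29.02  || echo "different branch"
```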
Issue description

We're linking both OpenGL and CUDA applications to libGL and libcuda through an impure path, /run/opengl-driver/lib, deployed by NixOS. This path is substituted on nixos-rebuild switch together with the rest of the system, in which case the userspace drivers may diverge from the respective kernel modules (e.g. after nix flake update or after updating the channels). In the case of libcuda, we want to keep using the driver from /run/booted-system rather than from /run/current-system, or the user may observe errors like […] until they reboot.
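The divergence described above can be sketched as a simple comparison of what the booted and current generations point at (stand-ins for /run/booted-system and /run/current-system are used here so the snippet is self-contained; the store paths are fake):

```shell
# Simulate detecting that the current generation's driver differs from the
# one the system booted with.
run=$(mktemp -d)
ln -s "/nix/store/aaaa-nvidia-x11-535.86.05"  "$run/booted-system"
ln -s "/nix/store/bbbb-nvidia-x11-535.113.01" "$run/current-system"

if [ "$(readlink "$run/booted-system")" != "$(readlink "$run/current-system")" ]; then
  echo "userspace driver diverged from the booted kernel module"
fi
```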
mesa vs cuda

It may not be sufficient to move /run/opengl-driver/lib to /run/booted-system. From matrix:

how mesa breaks

I'm not sure if this is the kind of error K900 was warning about; I tried approximately the following sequence: […]

I'll update with a reproducible example later.
Notify maintainers
@NixOS/cuda-maintainers