Nvidia: driver/library version mismatch #138943

Closed
alsvartr opened this issue Sep 22, 2021 · 7 comments
Labels
0.kind: bug Something is broken

Comments

@alsvartr

It seems I have a collision between two different versions of the nvidia-x11 derivation:

>:/ nvidia-smi 
Failed to initialize NVML: Driver/library version mismatch
>:/ file `which nvidia-smi`
/nix/var/nix/profiles/default/bin/nvidia-smi: symbolic link to /nix/store/q3g0bx2603v01n1sbnva321p48v2hdnp-nvidia-x11-460.73.01-5.11.21-bin/bin/nvidia-smi

>:/ modinfo nvidia | grep filename
filename:       /run/current-system/kernel-modules/lib/modules/5.13.18/misc/nvidia.ko
>:/ file /run/current-system/kernel-modules/lib/modules/5.13.18/misc/nvidia.ko
/run/current-system/kernel-modules/lib/modules/5.13.18/misc/nvidia.ko: symbolic link to /nix/store/pkqf89ah7vih1i8v09z883ck8972n9j1-nvidia-x11-470.57.02-5.13.18-bin/lib/modules/5.13.18/misc/nvidia.ko

The kernel module is from nvidia-x11-470.57.02-5.13.18, but nvidia-smi is symlinked from nvidia-x11-460.73.01-5.11.21.
My config:

extraModulePackages = [ pkgs.linuxPackages_5_13.nvidia_x11 ];
services.xserver.videoDrivers = [ "nvidia" ];
hardware.nvidia.package = pkgs.linuxPackages_5_13.nvidia_x11;

How can I debug this problem?

>:/ nix-shell -p nix-info --run "nix-info -m"
these paths will be fetched (0.05 MiB download, 0.28 MiB unpacked):
  /nix/store/p5lnl4zr45n7mf9kz9w8yz3rqh001b5c-bash-interactive-4.4-p23-dev
copying path '/nix/store/p5lnl4zr45n7mf9kz9w8yz3rqh001b5c-bash-interactive-4.4-p23-dev' from 'https://cache.nixos.org'...
 - system: `"x86_64-linux"`
 - host os: `Linux 5.13.18, NixOS, 21.05.3294.3397f0ede9e (Okapi)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.3.15`
 - channels(root): `"nixos-21.05.3294.3397f0ede9e, home-manager-20.09, nixos-hardware, nixos-unstable-21.11pre316684.79c444b5bde"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
alsvartr added the "0.kind: bug" label Sep 22, 2021
@r-burns
Contributor

r-burns commented Sep 22, 2021

In my experience this means "time to reboot". This happens when your running kernel is using a different nvidia driver than the driver you have most recently done a nixos-rebuild with.

Alternatively, we could make nvidia-smi pick up the CUDA driver from /run/opengl-driver/lib using addOpenGLRunpath. I am not sure this is the correct behavior in this case, as nvidia-smi ships with the driver.

Note that this is not NixOS-specific; you'd get the same on any other distro. If you apt-get upgrade, the nvidia-smi executable is updated to use the new driver, but the running kernel (and the nvidia driver loaded into it) is still the old one until you reboot. So nvidia-smi fails with the same message.

@alsvartr
Author

> In my experience this means "time to reboot". This happens when your running kernel is using a different nvidia driver than the driver you have most recently done a nixos-rebuild with.

No, it's not.

>:/ uptime 
 12:32:57  up   0:00,  1 user,  load average: 0,82, 0,19, 0,06
>:/ nvidia-smi 
Failed to initialize NVML: Driver/library version mismatch

@stale

stale bot commented Apr 19, 2022

I marked this as stale due to inactivity.

stale bot added the "2.status: stale" label (https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md) Apr 19, 2022
@SomeoneSerge
Contributor

To clarify: unless I'm mistaken, the message "Driver/library version mismatch" means there's a divergence between libcuda.so and the kernel module (usually nvidia.ko, NVIDIA Jetsons being the exception).

> hardware.nvidia.package = pkgs.linuxPackages_5_13.nvidia_x11

This is usually accessed through config.boot.kernelPackages to avoid precisely the sort of drift you seem to be observing, e.g. hardware.nvidia.package = config.boot.kernelPackages.nvidia_x11. I'm not sure whether this would have helped you, though.
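
A minimal sketch of that suggestion, for illustration only. The boot.kernelPackages line is an assumption (it is not shown in the original config), and I believe the NixOS nvidia module adds the matching kernel module by itself, so the explicit extraModulePackages entry is left out here:

```nix
{ config, pkgs, ... }:
{
  # Assumption: the kernel package set is chosen once, here.
  # This line does not appear in the config posted above.
  boot.kernelPackages = pkgs.linuxPackages_5_13;

  services.xserver.videoDrivers = [ "nvidia" ];

  # Derive the driver from whichever kernel set is configured above, so the
  # kernel module and the userspace tools are built from the same driver
  # version for the same kernel.
  hardware.nvidia.package = config.boot.kernelPackages.nvidia_x11;
}
```

With this shape, changing the kernel only means editing boot.kernelPackages; the driver reference follows along instead of staying pinned to a different package set.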

Have you had any luck with this issue so far?

stale bot removed the "2.status: stale" label (https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md) Jul 20, 2023
@ConnorBaker
Contributor

Closing this as stale. If you're able to provide a minimal reproducer, please re-open it!

@LoganBarnett

Just for posterity:

I ran into this problem due to having the equivalent of this:

environment.systemPackages = [
  pkgs.linuxPackages_latest.nvidia_x11
];

Removing this package made it so the correct version could be found. In hindsight, I suppose it should have been obvious this was the wrong thing to do. I'm not sure whether I'd found some advice stating I should do this, or just mistranslated something. It was also causing a host of seemingly unrelated issues when building torch.

I was able to find this out by running nvidia-smi under strace and looking through the openat calls, which showed it opening an incorrect path. From there I used nix why-depends:

[logan@lithium:~]$ nix why-depends /nix/store/3qglzj7cibqafk6mg0xc0m8abls2inv5-nixos-system-lithium-24.11.20240618.ee2f568 /nix/store/5rj5jsrn0az3j8hb9d067a7p1vsvxbc4-nvidia-x11-550.90.07-6.9.5
/nix/store/3qglzj7cibqafk6mg0xc0m8abls2inv5-nixos-system-lithium-24.11.20240618.ee2f568
└───/nix/store/na8848s1accywf1qd0rqzpmdsplbs1zg-system-path
    └───/nix/store/9v4hg0p16hk1qm7yvnb8im6j0dr2jn50-nvidia-x11-550.90.07-6.9.5-bin
        └───/nix/store/5rj5jsrn0az3j8hb9d067a7p1vsvxbc4-nvidia-x11-550.90.07-6.9.5

This put me close enough to try removing pkgs.linuxPackages_latest.nvidia_x11.

If it's helpful for reconstruction, my kernel module version is 555.52.04.

LoganBarnett mentioned this issue Jun 19, 2024
@SomeoneSerge
Contributor

@LoganBarnett We should update the manual to reiterate that referencing specific versions of linuxPackages (as opposed to config.boot.kernelPackages) and of the driver (as opposed to config.hardware.nvidia.package) is almost always a red flag.
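
A rough sketch of the contrast, under some assumptions: the environment.systemPackages use case is hypothetical, and as far as I know the nvidia module already puts nvidia-smi on the system path once the driver is enabled, so an explicit entry is often unnecessary.

```nix
{ config, pkgs, ... }:
{
  # Red flag: this names a driver built for one specific kernel package set,
  # which can drift from the kernel module that is actually loaded.
  # environment.systemPackages = [ pkgs.linuxPackages_latest.nvidia_x11 ];

  # If the driver package really has to be referenced explicitly, point at
  # the one the system is configured with (hypothetical use case).
  environment.systemPackages = [ config.hardware.nvidia.package ];
}
```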
