
libcuda.so: driver mismatch on nixos-rebuild switch #255070

Open
SomeoneSerge opened this issue Sep 14, 2023 · 12 comments
Labels: 6.topic: cuda (Parallel computing platform and API), 6.topic: nixos (Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS)

Comments


SomeoneSerge commented Sep 14, 2023

Issue description

We're linking both OpenGL and CUDA applications to libGL and libcuda through an impure path, /run/opengl-driver/lib, deployed by NixOS. This path is replaced on nixos-rebuild switch together with the rest of the system, at which point the userspace drivers may diverge (e.g. after nix flake update or after updating the channels) from the respective kernel modules. In the case of libcuda, we want to keep using the driver from /run/booted-system rather than from /run/current-system, otherwise the user may observe errors like:

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
...python
>>> import torch
>>> torch.cuda.is_available()
CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /build/source/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0

...until they reboot
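
(For reference, one way to inspect the two halves of the mismatch; the exact output is omitted here:)

$ cat /proc/driver/nvidia/version                  # version of the kernel module loaded at boot
$ readlink -f /run/opengl-driver/lib/libcuda.so.1  # userspace driver deployed by the *current* system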

mesa vs cuda

It may not be sufficient to point /run/opengl-driver/lib at /run/booted-system. From Matrix:

K900 ⚡️
Mesa needs libgbm to match the driver
And Nvidia needs the driver to match the kernelspace
But Mesa can't have a 1:1 compatibility with the kernelspace because they don't own it
Someone (UTC+3)
"the driver" meaning the userspace bit?
K900 ⚡️
Yes
Actually if we ever figure out dynamic GBM we could have it use the booted drivers for everything
But then it would just as easily be able to use the new driver
So it's like
Still weird

how mesa breaks

I'm not sure if this is the kind of error K900 was warning about; I tried approximately the following sequence:

$ nix flake update
$ nixos-rebuild switch
$ # now /run/current-system and /run/booted-system are different,
$ # in particular nvidia-smi is broken and complains about the driver mismatch,
$ # but OpenGL apps still work correctly, e.g.:
$ kitty
$ # Now let's restore the old /run/opengl-driver/lib/libcuda.so:
$ sudo /run/booted-system/activate
$ # ...after which CUDA apps work again:
$ nvidia-smi
$ # ...but OpenGL apps are broken:
$ kitty
[258 18:52:45.973274] [glfw error 65543]: GLX: Failed to create context: BadValue (integer parameter out of range for operation)
[258 18:52:45.973294] Failed to create GLFW temp window! This usually happens because of old/broken OpenGL drivers. kitty requires working OpenGL 3.3 drivers.
$ # Now whatever the difference between `activate` and `switch-to-configuration`, recover the OpenGL apps too:
$ sudo /run/booted-system/bin/switch-to-configuration switch

I'll update with a reproducible example later

Notify maintainers

@NixOS/cuda-maintainers

SomeoneSerge added the 6.topic: cuda label Sep 14, 2023

Kiskae commented Sep 14, 2023

I've actually been looking for a solution to the "update-causes-version-mismatch" problem, to make it possible to backport nvidia driver updates.

What I've been considering is a variant of the /etc/static link dance in combination with tmpfiles rules to link the active userspace library to the loaded kernel module.

Like this:

/run/opengl-driver/lib/libcuda.so.1 -> /run/nvidia/current/lib/libcuda.so.1
/run/nvidia/current -> /run/nvidia/<version>
/run/nvidia/<version> -> /nix/store/nvidia-x11-<version>-<hash>

What this will require is:

  1. A way to rewrite symlinks to a different base path to turn /nix/store/nvidia-x11-<version>-<hash> into /run/nvidia/current. This mirrored derivation then gets added to hardware.opengl.extraPackages
  2. tmpfiles rules that set up the /run/nvidia symlinks for both the version and current at boot (a sketch follows below). This can be expanded to a udev rule to support runtime upgrades of the nvidia driver.
  3. A way to link the current /run/nvidia/current symlink into a gc root so it doesn't get garbage collected.

Note that /run/nvidia is a placeholder and could probably use a more unique nix-specific name.
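
A rough sketch of what step 2 could look like as a NixOS module; config.hardware.nvidia.package (and its version attribute) is an assumption about where the driver derivation is exposed, and when exactly "current" gets (re)pointed is the open part:

{ config, ... }:
let
  nvidia = config.hardware.nvidia.package;  # assumption: the active userspace driver derivation
in
{
  systemd.tmpfiles.rules = [
    "d  /run/nvidia 0755 root root -"
    # versioned entry, named after the driver release
    "L+ /run/nvidia/${nvidia.version} - - - - ${nvidia}"
    # "current" should point at whichever version matches the loaded kernel module;
    # re-pointing it at runtime would be the job of the udev rule mentioned above
    "L+ /run/nvidia/current - - - - /run/nvidia/${nvidia.version}"
  ];
}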


SomeoneSerge commented Sep 15, 2023

@Kiskae I was actually thinking in a similar direction! Specifically, we could keep track of every deployed configuration's drivers by exposing the package derivation from nixos/modules/hardware/opengl.nix using systemPackages and pathsToLink:

let
  package = pkgs.buildEnv {
    name = "drivers";
    paths = [ config.hardware.opengl.package ] ++ config.hardware.opengl.extraPackages;
    postBuild = ''
      # Move the merged tree under a "drivers" subdirectory so that
      # pathsToLink = [ "/drivers" ] below picks it up.
      mkdir "$out/.drivers"
      mv "$out"/* "$out/.drivers/"
      mv "$out/.drivers" "$out/drivers"
    '';
  };
in
{
  environment.systemPackages = [ package ];
  environment.pathsToLink = [ "/drivers" ];
}

With this, we'd have access both to (NB "booted") /run/booted-system/sw/drivers/lib/libcuda.so and to (NB "current") /run/current-system/sw/drivers/lib/{dri,gbm,...} (whatever breaks kitty in the example above), which we could symlink to from /run/opengl-driver/lib. This feels brittle and symlink-heavy to me at a glance, but it's something we could definitely make work.
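
Roughly, /run/opengl-driver/lib would then be a mix along these lines (hypothetical; the exact set of libraries redirected to the booted system is still to be decided):

libcuda.so.1      -> /run/booted-system/sw/drivers/lib/libcuda.so.1
libnvidia-ml.so.1 -> /run/booted-system/sw/drivers/lib/libnvidia-ml.so.1
dri               -> /run/current-system/sw/drivers/lib/dri
gbm               -> /run/current-system/sw/drivers/lib/gbm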

Observation: with this solution, people switching from hardware.opengl.enable = false to true won't be able to use CUDA apps without a reboot, because /run/booted-system will stay the same, i.e. it won't contain any libcuda.so.

SomeoneSerge changed the title from "libcuda.so: driver mismatch nixos-rebuild switch" to "libcuda.so: driver mismatch on nixos-rebuild switch" Sep 15, 2023

Kiskae commented Sep 15, 2023

The risk is that by tying the nvidia driver to the booted system, its own dependencies might become outdated compared to the current active profile. As it currently exists, the only desync happens between kernelspace and userspace, which is generally stable (unless you're named NVIDIA).

That is why I'm considering the symlink indirection: it allows updates to the nvidia driver closure as long as the nvidia driver remains on the same version. In addition, switch-to-configuration could emit a warning to alert the user if the nvidia driver no longer matches.

EDIT: I seemed to recall seeing a PR related to moving /run/opengl-driver into the system closure; it appears to be #158079.


SomeoneSerge commented Sep 15, 2023

@Kiskae maybe I didn't make myself clear, but I was trying to suggest that we'd have both /run/current-system/sw/drivers and /run/booted-system/sw/drivers: the former corresponds to the last switched-to configuration, and the latter to the configuration booted from. Then we'd make all of /run/opengl-driver/lib link to /run/current-system/sw/drivers (equivalent to what we do now), except for libcuda.so (and maybe libnvidia-ml.so, as much as is required to make CUDA work), which we'd point at the old/booted system instead.

EDIT: RE: #158079

Wonderful! I forgot that wasn't just about naming the nixos option. So we might just merge that PR, and then make the indirection in addOpenGLRunpath.driverLink more granular on NixOS, by redirecting chosen libraries to /run/booted-system/drivers instead of /run/current-system/drivers.


Kiskae commented Sep 15, 2023

I understood that part; what I'm talking about is the more complex libraries, like the vulkan driver, which depend on other libraries.
So if libnvidia-vulkan* depends on libgbm and is loaded from boot, but something else is linked against a newer version of mesa which ships a different version of libgbm, then that can cause issues with dynamic loading.

Mind you, this exact thing would still happen in my solution when the version of the driver changes, but as long as the version remains the same, the nvidia driver closure can be updated in sync with the rest of the system.

Essentially there are two ways the driver can cause issues:

  1. If the driver is newer than the kernel module, it stops working.
  2. If the driver closure is older than the system closure, it might include conflicting dynamic dependencies.


SomeoneSerge commented Sep 15, 2023

libnvidia-vulkan* ... is loaded from boot

But do we need to load libnvidia-vulkan* from "boot"? Why don't we load it from /run/current-system instead? This seems to have worked so far, and there are only two libraries (cuda, nvml) I know of by now that we might want to load from /run/booted-system.


Kiskae commented Sep 15, 2023

But do we need to load libnvidia-vulkan* from "boot"?

Yup, same issue as libcuda: almost all driver libraries will crash if the kernel module doesn't match.

find -L /run/opengl-driver/lib -name "lib*535*" -exec fgrep "API mismatch" {} +
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libcuda.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvcuvid.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-allocator.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-cfg.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-eglcore.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-glcore.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-glsi.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-ml.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-opencl.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/vdpau/libvdpau_nvidia.so.535.86.05: binary file matches

The nvidia vulkan driver is actually lib(GLX|EGL)_nvidia, which depend on libnvidia-e?glcore.


Atemu commented Nov 23, 2023

Related: #269419

SomeoneSerge commented

There's one more thing we've missed: nixos-rebuild switch doesn't actually break CUDA all that often (I think the heuristic is that libcuda.so needs to be at least as new as the kernel module, and it's usually OK if it's newer), but it currently does break nvidia-smi, which comes from nvidia_x11. E.g. right now I'm seeing:

❯ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 545.29
❯ nix run -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L cudaPackages.saxpy
Start
Runtime version: 11080
Driver version: 12030
Host memory initialized, copying to the device
Scheduled a cudaMemcpy, calling the kernel
Scheduled a kernel call
Max error: 0.000000


Kiskae commented Dec 10, 2023

but it currently does break nvidia-smi which comes from the nvidia_x11. E.g. right now I'm seeing:

nvidia-smi ignores /run/opengl-driver and links directly to the associated library files at the moment. So that error is coming from the 'newer' libnvidia-ml.so.

Runtime version: 11080
Driver version: 12030

These probably refer to libcuda and libcudart, not the kernel drivers.
However, the most recent update was 545.29.02 -> 545.29.06, so it might very well be that the cuda driver is the same across these releases.

I know that cuda has official backwards- and forwards-compatibility support, but I believe that only exists between libcuda and the toolkit libraries, not between libcuda and the driver itself.

The driver itself definitely has version errors:

  [145cab8]  NVIDIA: failed to load the NVIDIA kernel module.\n
  [145caf0]  NVIDIA: could not create the device file %s\n
  [145cb20]  NVIDIA: could not open the device file %s (%s).\n
  [145cb58]  NVIDIA: API mismatch: the NVIDIA kernel module has version %s,\n
            but this NVIDIA driver component has version %s.  Please make\n
            sure that the kernel module and all NVIDIA driver components\n
            have the same version.\n
  [145cc30]  NVIDIA: API mismatch: this NVIDIA driver component has version\n
            %s, but the NVIDIA kernel module's version does not match.\n
            Please make sure that the kernel module and all NVIDIA driver\n
            components have the same version.\n
  [145cd10]  NVIDIA: could not create file for device %u\n


SomeoneSerge commented Dec 10, 2023

These probably refer to libcuda and libcudart, not the kernel drivers.

Yes:

CHECK(cudaRuntimeGetVersion(&rtVersion));
CHECK(cudaDriverGetVersion(&driverVersion));
fprintf(stderr, "Runtime version: %d\n", rtVersion);
fprintf(stderr, "Driver version: %d\n", driverVersion);

libcuda

Uh-huh, that's what I meant by the "userspace driver"

nvidia-smi ignores /run/opengl-driver and links directly to the associated library files at the moment.

Right, I recall seeing that. I suppose we should change that. Do you know any reason not to?

I know that cuda has official backwards- and forwards-support, but I believe that only exists between libcuda and the toolkit libraries, not between libcuda and the driver itself

There is some leeway for libcuda and the kernel module to diverge, which is why cudaPackages.cuda_compat exists, but they only test and officially support this for chosen platforms (Jetsons and datacenters). EDIT: I suppose we could expect some software blocks in nvidia_x11 as well.


Kiskae commented Dec 10, 2023

which is why cudaPackages.cuda_compat exists

I didn't realize that it is literally the cuda userspace libraries from a newer driver release. The documentation about compatibility is quite comprehensive: https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title

samueldr added the 6.topic: nixos label Apr 22, 2024
LoganBarnett mentioned this issue Jun 19, 2024