
libcuda.so: driver mismatch on nixos-rebuild switch #255070

Open
SomeoneSerge opened this issue Sep 14, 2023 · 12 comments
Labels: 6.topic: cuda (Parallel computing platform and API), 6.topic: nixos (Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS)

Comments


SomeoneSerge commented Sep 14, 2023

Issue description

We're linking both OpenGL and CUDA applications to libGL and libcuda through an impure path, /run/opengl-driver/lib, deployed by NixOS. This path is replaced on nixos-rebuild switch together with the rest of the system, at which point the userspace drivers may diverge (e.g. after nix flake update or after updating the channels) from the respective kernel modules. In the case of libcuda, we want to keep using the driver from /run/booted-system rather than from /run/current-system, otherwise the user may observe errors like:

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
...python
>>> import torch
>>> torch.cuda.is_available()
CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /build/source/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0

...until they reboot
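
(For reference, one way to inspect the two halves of the mismatch; the exact output is omitted here:)

$ cat /proc/driver/nvidia/version                  # version of the kernel module loaded at boot
$ readlink -f /run/opengl-driver/lib/libcuda.so.1  # userspace driver deployed by the *current* system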

mesa vs cuda

It may not be sufficient to point /run/opengl-driver/lib at /run/booted-system. From Matrix:

K900 ⚡️
Mesa needs libgbm to match the driver
And Nvidia needs the driver to match the kernelspace
But Mesa can't have a 1:1 compatibility with the kernelspace because they don't own it
Someone (UTC+3)
"the driver" meaning the userspace bit?
K900 ⚡️
Yes
Actually if we ever figure out dynamic GBM we could have it use the booted drivers for everything
But then it would just as easily be able to use the new driver
So it's like
Still weird

how mesa breaks

I'm not sure if this is the kind of error K900 was warning about; I tried approximately the following sequence:

$ nix flake update
$ nixos-rebuild switch
$ # now /run/current-system and /run/booted-system are different,
$ # in particular nvidia-smi is broken and complains about the driver mismatch,
$ # but OpenGL apps still work correctly, e.g.:
$ kitty
$ # Now let's restore the old /run/opengl-driver/lib/libcuda.so:
$ sudo /run/booted-system/activate
$ # ...after which CUDA apps work again:
$ nvidia-smi
$ # ...but OpenGL apps are broken:
$ kitty
[258 18:52:45.973274] [glfw error 65543]: GLX: Failed to create context: BadValue (integer parameter out of range for operation)
[258 18:52:45.973294] Failed to create GLFW temp window! This usually happens because of old/broken OpenGL drivers. kitty requires working OpenGL 3.3 drivers.
$ # Now whatever the difference between `activate` and `switch-to-configuration`, recover the OpenGL apps too:
$ sudo /run/booted-system/bin/switch-to-configuration switch

I'll update with a reproducible example later

Notify maintainers

@NixOS/cuda-maintainers

SomeoneSerge added the 6.topic: cuda label Sep 14, 2023

Kiskae commented Sep 14, 2023

I've actually been looking for a solution to the "update-causes-version-mismatch" problem, to make it possible to backport nvidia driver updates.

What I've been considering is a variant of the /etc/static link dance in combination with tmpfiles rules to link the active userspace library to the loaded kernel module.

Like this:

/run/opengl-driver/lib/libcuda.so.1 -> /run/nvidia/current/lib/libcuda.so.1
/run/nvidia/current -> /run/nvidia/<version>
/run/nvidia/<version> -> /nix/store/nvidia-x11-<version>-<hash>

What this will require is:

  1. A way to rewrite symlinks to a different base path to turn /nix/store/nvidia-x11-<version>-<hash> into /run/nvidia/current. This mirrored derivation then gets added to hardware.opengl.extraPackages
  2. tmpfiles rules that set up the /run/nvidia symlinks for both the version and current at boot (a sketch follows below). This can be expanded to a udev rule to support runtime upgrades of the nvidia driver.
  3. A way to link the current /run/nvidia/current symlink into a gc root so it doesn't get garbage collected.

Note that /run/nvidia is a placeholder and could probably use a more unique nix-specific name.
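
A rough sketch of what step 2 could look like as a NixOS module; config.hardware.nvidia.package (and its version attribute) is an assumption about where the driver derivation is exposed, and when exactly "current" gets (re)pointed is the open part:

{ config, ... }:
let
  nvidia = config.hardware.nvidia.package;  # assumption: the active userspace driver derivation
in
{
  systemd.tmpfiles.rules = [
    "d  /run/nvidia 0755 root root -"
    # versioned entry, named after the driver release
    "L+ /run/nvidia/${nvidia.version} - - - - ${nvidia}"
    # "current" should point at whichever version matches the loaded kernel module;
    # re-pointing it at runtime would be the job of the udev rule mentioned above
    "L+ /run/nvidia/current - - - - /run/nvidia/${nvidia.version}"
  ];
}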


SomeoneSerge commented Sep 15, 2023

@Kiskae I was actually thinking in a similar direction! Specifically, we could keep track of every deployed configuration's drivers by exposing the package derivation from nixos/modules/hardware/opengl.nix using systemPackages and pathsToLink:

let
  package = pkgs.buildEnv {
    name = "drivers";
    paths = [ config.hardware.opengl.package ] ++ config.hardware.opengl.extraPackages;
    postBuild = ''
      # Move the merged tree under a "drivers" subdirectory so that
      # pathsToLink = [ "/drivers" ] below picks it up.
      mkdir "$out/.drivers"
      mv "$out"/* "$out/.drivers/"
      mv "$out/.drivers" "$out/drivers"
    '';
  };
in
{
  environment.systemPackages = [ package ];
  environment.pathsToLink = [ "/drivers" ];
}

With this, we'd have access both to (NB "booted") /run/booted-system/sw/drivers/lib/libcuda.so and to (NB "current") /run/current-system/sw/drivers/lib/{dri,gbm,...} (whatever breaks kitty in the example above), which we could symlink to from /run/opengl-driver/lib. This feels brittle and symlink-heavy to me at a glance, but it's something we could definitely make work.
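
Roughly, /run/opengl-driver/lib would then be a mix along these lines (hypothetical; the exact set of libraries redirected to the booted system is still to be decided):

libcuda.so.1      -> /run/booted-system/sw/drivers/lib/libcuda.so.1
libnvidia-ml.so.1 -> /run/booted-system/sw/drivers/lib/libnvidia-ml.so.1
dri               -> /run/current-system/sw/drivers/lib/dri
gbm               -> /run/current-system/sw/drivers/lib/gbm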

Observation: with this solution, people switching from hardware.opengl.enable = false to true won't be able to use CUDA apps without a reboot, because /run/booted-system will stay the same, i.e. it won't contain any libcuda.so.

SomeoneSerge changed the title from "libcuda.so: driver mismatch nixos-rebuild switch" to "libcuda.so: driver mismatch on nixos-rebuild switch" Sep 15, 2023

Kiskae commented Sep 15, 2023

The risk is that by tying the nvidia driver to the booted system, its own dependencies might become outdated compared to the current active profile. As it currently exists, the only desync happens between kernelspace and userspace, which is generally stable (unless you're named NVIDIA).

That is why I'm considering the symlink indirection: it allows updates to the nvidia driver closure as long as the nvidia driver remains on the same version. In addition, switch-to-configuration could emit a warning to alert the user if the nvidia driver no longer matches.

EDIT: I seemed to recall seeing a PR related to moving /run/opengl-driver into the system closure; it appears to be #158079.


SomeoneSerge commented Sep 15, 2023

@Kiskae maybe I didn't make myself clear, but I was trying to suggest that we'd have both /run/current-system/sw/drivers and /run/booted-system/sw/drivers: the former corresponds to the last switched-to configuration, and the latter to the configuration booted from. Then we'd make all of /run/opengl-driver/lib link to /run/current-system/sw/drivers (equivalent to what we do now), except for libcuda.so (and maybe libnvidia-ml.so, as much as is required to make CUDA work), which we'd point at the old/booted system instead.

EDIT: RE: #158079

Wonderful! I forgot that wasn't just about naming the nixos option. So we might just merge that PR, and then make the indirection in addOpenGLRunpath.driverLink more granular on NixOS, by redirecting chosen libraries to /run/booted-system/drivers instead of /run/current-system/drivers.


Kiskae commented Sep 15, 2023

I understood that part; what I'm talking about is the more complex libraries, like the vulkan driver, which depend on other libraries.
So if libnvidia-vulkan* depends on libgbm and is loaded from boot, but something else is linked against a newer version of mesa which ships a different version of libgbm, then that can cause issues with dynamic loading.

Mind you, this exact thing would still happen in my solution when the version of the driver changes, but as long as the version remains the same, the nvidia driver closure can be updated in sync with the rest of the system.

Essentially there are two ways the driver can cause issues:

  1. If the driver is newer than the kernel module, it stops working.
  2. If the driver closure is older than the system closure, it might include conflicting dynamic dependencies.


SomeoneSerge commented Sep 15, 2023

libnvidia-vulkan* ... is loaded from boot

But do we need to load libnvidia-vulkan* from "boot"? Why don't we load it from /run/current-system instead? This seems to have worked so far, and there are only two libraries (cuda, nvml) I know of by now that we might want to load from /run/booted-system.


Kiskae commented Sep 15, 2023

But do we need to load libnvidia-vulkan* from "boot"?

Yup, same issue as libcuda: almost all driver libraries will crash if the kernel module doesn't match.

find -L /run/opengl-driver/lib -name "lib*535*" -exec fgrep "API mismatch" {} +
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libcuda.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvcuvid.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-allocator.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-cfg.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-eglcore.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-glcore.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-glsi.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-ml.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/libnvidia-opencl.so.535.86.05: binary file matches
/nix/store/8mzvz6kk57p9aqdk72pq1adsl38bkzi6-gnugrep-3.7/bin/grep: /run/opengl-driver/lib/vdpau/libvdpau_nvidia.so.535.86.05: binary file matches

The nvidia vulkan driver is actually lib(GLX|EGL)_nvidia, which depend on libnvidia-e?glcore.


Atemu commented Nov 23, 2023

Related: #269419

SomeoneSerge commented

There's one more thing we've missed: nixos-rebuild switch doesn't actually break CUDA all that often (I think the heuristic is that libcuda.so needs to be at least as new as the kernel module, and it's usually OK if it's newer), but it currently does break nvidia-smi, which comes from nvidia_x11. E.g. right now I'm seeing:

❯ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 545.29
❯ nix run -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; cudaEnableForwardCompat = false; allowUnfree = true; }' -L cudaPackages.saxpy
Start
Runtime version: 11080
Driver version: 12030
Host memory initialized, copying to the device
Scheduled a cudaMemcpy, calling the kernel
Scheduled a kernel call
Max error: 0.000000


Kiskae commented Dec 10, 2023

but it currently does break nvidia-smi which comes from the nvidia_x11. E.g. right now I'm seeing:

nvidia-smi ignores /run/opengl-driver and links directly to the associated library files at the moment. So that error is coming from the 'newer' libnvidia-ml.so.

Runtime version: 11080
Driver version: 12030

These probably refer to libcuda and libcudart, not the kernel drivers.
However, the most recent update was 545.29.02 -> 545.29.06, so it might very well be that the cuda driver is the same across these releases.

I know that cuda has official backwards- and forwards-compatibility support, but I believe that only exists between libcuda and the toolkit libraries, not between libcuda and the driver itself.

The driver itself definitely has version errors:

  [145cab8]  NVIDIA: failed to load the NVIDIA kernel module.\n
  [145caf0]  NVIDIA: could not create the device file %s\n
  [145cb20]  NVIDIA: could not open the device file %s (%s).\n
  [145cb58]  NVIDIA: API mismatch: the NVIDIA kernel module has version %s,\n
            but this NVIDIA driver component has version %s.  Please make\n
            sure that the kernel module and all NVIDIA driver components\n
            have the same version.\n
  [145cc30]  NVIDIA: API mismatch: this NVIDIA driver component has version\n
            %s, but the NVIDIA kernel module's version does not match.\n
            Please make sure that the kernel module and all NVIDIA driver\n
            components have the same version.\n
  [145cd10]  NVIDIA: could not create file for device %u\n


SomeoneSerge commented Dec 10, 2023

These probably refer to libcuda and libcudart, not the kernel drivers.

Yes:

CHECK(cudaRuntimeGetVersion(&rtVersion));
CHECK(cudaDriverGetVersion(&driverVersion));
fprintf(stderr, "Runtime version: %d\n", rtVersion);
fprintf(stderr, "Driver version: %d\n", driverVersion);

libcuda

Uh-huh, that's what I meant by the "userspace driver"

nvidia-smi ignores /run/opengl-driver and links directly to the associated library files at the moment.

Right, I recall seeing that. I suppose we should change that. Do you know any reason not to?

I know that cuda has official backwards- and forwards-support, but I believe that only exists between libcuda and the toolkit libraries, not between libcuda and the driver itself

There is some leeway for libcuda and the kernel module to diverge, which is why cudaPackages.cuda_compat exists, but they only test and officially support this for chosen platforms (Jetsons and datacenters). EDIT: I suppose we could expect some software blocks in nvidia_x11 as well.


Kiskae commented Dec 10, 2023

which is why cudaPackages.cuda_compat exists

I didn't realize that it is literally the cuda userspace libraries from a newer driver release. The documentation about compatibility is quite comprehensive: https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title

samueldr added the 6.topic: nixos label Apr 22, 2024
LoganBarnett mentioned this issue Jun 19, 2024