
cudaPackages: GPU-enabled tests #225912

Open
8 tasks
SomeoneSerge opened this issue Apr 12, 2023 · 20 comments

Comments

@SomeoneSerge
Contributor

SomeoneSerge commented Apr 12, 2023

Description

  • Introduce the new "cuda"/"expose-cuda" value for nix.settings.system-features
    • Describe a way to handle different cuda capabilities? E.g. at least a way to distinguish between discrete GPUs and Jetsons
    • E.g. we could request and declare features such as cuda-50-60-61-70-75-80-86-89-90 (the set of architectures for which nvidia ships device code in cuda 12.2 for x86_64-linux) and cuda-53-61-62-70-72-75-80-86-87 (same for aarch64-linux, a.k.a. Jetson). This way the only testable packages would be the ones built to support a wide range of GPU architectures
  • PR1: Introduce into nixpkgs a pre-build-hook script for conditional exposure of CUDA devices, cudaPackages.preBuildHook. Make the last new-line terminator optional. At this point users may manage their pre-build-hook directly as:
    nix.settings.pre-build-hook = pkgs.writeScript "nix-pre-build-hook.sh" ''
      ${lib.getExe pkgs.cudaPackages.preBuildHook} --dont-terminate
      # Do more work: ...
      # Exit:
      echo
    '';
    
    The job of this preBuildHook would be to expose cuda devices to those derivations marked with requiredSystemFeatures = [ "cuda" ], and only to them. 2023-03-26: Cf. an example, also 2023-07-20: as a flake. Thanks to @thufschmitt for pointing out that this behaviour may be implemented using pre-build hooks.
    • It may be a good idea to provide an .override-able List str parameter to the hook's derivation, so that non-NixOS users may specify custom locations for libcuda.so different from addOpenGLRunpath.driverLink
    • It may be a good idea to devote this hook a paragraph in the nixpkgs manual
  • PR2: Introduce a NixOS module for managing the nix pre-build-hook (a sketch of such a module follows this list).
    • Introduce a bool option for conditionally exposing cuda devices. The option would enable the hook and extend system-features
    • Introduce a way to test the hook before applying the new generation
    • Outline the path forward for implementing a more generic module for a composable pre-build-hook (not specific to cuda), and point out that it doesn't have to come in the same PR
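
A minimal sketch of what the PR2 module could look like, assuming a single boolean option that toggles both settings at once; the option path and cudaPackages.preBuildHook are proposals from this issue, not existing nixpkgs attributes:

{ config, lib, pkgs, ... }:

{
  options.nix.exposeCudaDevices = lib.mkEnableOption "exposing CUDA devices to sandboxed builds that request the cuda system feature";

  config = lib.mkIf config.nix.exposeCudaDevices {
    # Derivations with requiredSystemFeatures = [ "cuda" ] become schedulable on this machine
    # (note: defining system-features replaces the default feature list, hence the extra entries)...
    nix.settings.system-features = [ "cuda" "kvm" "big-parallel" "nixos-test" "benchmark" ];
    # ...and the hook mounts /dev and the driver libraries for exactly those derivations.
    nix.settings.pre-build-hook = lib.getExe pkgs.cudaPackages.preBuildHook;
  };
}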

Old description:

  • Test GPU functionality in passthru.tests.
  • Mark GPU tests with something like requiredSystemFeatures = [ "cuda" ].
  • Conditionally expose /dev and /run/opengl-driver/lib (and/or whatever is required to make GPU tests work) in extra-sandbox-paths for derivations marked with "cuda" in requiredSystemFeatures.
  • Ensure normal derivations cannot see these extra paths.
  • Set up a PoC CI that would run these tests.

@ConnorBaker
Contributor

@SomeoneSerge would you envision these as tests which would ideally be run during checkPhase if we were able to?

Or is this something different?

@SomeoneSerge
Contributor Author

SomeoneSerge commented May 24, 2023

Consider something like this:

{ buildPythonPackage, torch, ... }:

buildPythonPackage {
  pname = "torch";
  # ...
  passthru.tests.gpuTests = torch.overridePythonAttrs (_: {
    requiredSystemFeatures = [ "expose-cuda" ];
  });
  passthru.tests.cudaAvailable = buildPythonPackage {
    # ...
    requiredSystemFeatures = [ "expose-cuda" ];
    checkPhase = ''
      python << EOF
      import torch
      assert torch.cuda.is_available()
      EOF
    '';
  };
  # ...
}

Any normal Nix deployment should refuse to build python3Packages.torch.tests.gpuTests because of the unknown system feature. Meanwhile, I'm hoping we could either deploy remote builders with system-features = ... expose-cuda that manually mount /dev/... and /run/opengl-driver/lib/libcuda.so*, or otherwise maybe even get a special branch into Nix itself so that it would expose these paths conditionally, depending on whether a derivation asks for the "feature". Presumably, these dedicated builders should be able to use GPUs just in the checkPhase.

Pros: an easy way to maintain a basic test-suite for our packages' GPU functionality within and synchronized with nixpkgs?
Cons: ugly ad hoc hack

@thufschmitt
Member

maybe even get a special branch into Nix itself so that it would expose these paths conditionally depending on whether a derivation asks for the "feature"

You can actually do that using the pre-build-hook feature. I haven't used it myself so take that with a grain of salt, but I suspect that something like the one below would work:

#!/bin/sh

DRV="$1"

# Do we have the "expose-cuda" required feature?
if nix derivation show "$DRV" \
    | jq --exit-status '.["'"$DRV"'"].env.requiredSystemFeatures | contains("expose-cuda")' \
    > /dev/null; then # silence jq: Nix parses everything the hook prints to stdout
  echo "extra-sandbox-paths"
  echo "/run/opengl-driver/lib=/run/opengl-driver/lib"
  echo "/dev=/dev"
  # The list of extra paths must be terminated by an empty line
  echo
fi
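
If that works out, wiring it up on a NixOS builder might look roughly like this (a sketch: expose-cuda-hook.sh is assumed to contain the script above, and defining nix.settings.system-features replaces the default feature set, hence the extra entries):

{ pkgs, ... }:

{
  # Advertise the feature so derivations with requiredSystemFeatures = [ "expose-cuda" ]
  # get scheduled on this machine at all:
  nix.settings.system-features = [ "expose-cuda" "kvm" "big-parallel" "nixos-test" "benchmark" ];

  # The hook then decides, per derivation, whether to add the extra sandbox paths:
  nix.settings.pre-build-hook =
    "${pkgs.writeScript "expose-cuda-hook" (builtins.readFile ./expose-cuda-hook.sh)}";
}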

@SomeoneSerge
Contributor Author

I think this works! Cf. https://gist.github.com/SomeoneSerge/4832997ab09e4e71301e5469eec3066a

On a correctly configured builder, declaring an expose-cuda feature:

nix build --file with-cuda.nix python3Packages.pynvml.tests.testNvmlInit  -L
...
python3.10-pynvml> running install tests
python3.10-pynvml> enter: nvmlInit
python3.10-pynvml> pynvml.nvmlInit()=None
python3.10-pynvml> exit: nvmlInit
...
python3.10-pynvml> Check whether the following modules can be imported: pynvml pynvml.smi

A builder that doesn't declare expose-cuda:

nix build --file with-my-cuda.nix python3Packages.pynvml.tests.testNvmlInit  -L --rebuild
error: a 'x86_64-linux' with features {expose-cuda} is required to build '/nix/store/94vw78sgh3y92bx3rmk62cdgg9nakkrx-python3.10-pynvml-11.5.0.drv', but I am a 'x86_64-linux' with features {benchmark, big-parallel, ca-derivations, kvm, nixos-test}

Behaviour when expose-cuda is set but /dev mounts are misconfigured:

nix build --file with-cuda.nix python3Packages.pynvml.tests.testNvmlInit  -L
python3.10-pynvml> enter: nvmlInit
python3.10-pynvml> Driver Not Loaded
python3.10-pynvml> exit: nvmlInit
...
python3.10-pynvml> pynvml.nvml.NVMLError_Uninitialized: Uninitialized

@SomeoneSerge
Contributor Author

The next step could be to prepare a PR, introducing

  • A NixOS module that adds the required nix.settings.system-features and adds/extends the nix.settings.pre-build-hook
  • The very basic "GPU can be detected" passthru tests (a sketch of one follows this list):
    • nvmlInit()
    • torch.cuda.is_available()
    • tf.config.list_physical_devices('GPU')
    • ...
    • [Optional] Just to make the case more appealing to people and to test how far we can go: it would be really nice to code up a NixOS test that runs blender in a virtual machine, goes Edit -> Preferences -> System -> CUDA and confirms that there's a GPU in the list 🙃
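
A sketch of what one such passthru test could look like, assuming runCommand and python3 are in scope (e.g. via callPackage) and reusing the expose-cuda feature name from the examples above; the attribute name is illustrative:

passthru.tests.tensorflow-gpu-available = runCommand "tensorflow-gpu-available"
  {
    nativeBuildInputs = [ (python3.withPackages (ps: [ ps.tensorflow ])) ];
    # Only builders that expose a GPU may build this derivation:
    requiredSystemFeatures = [ "expose-cuda" ];
  } ''
    python -c "import tensorflow as tf; assert tf.config.list_physical_devices('GPU')"
    touch $out
  '';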

We should expect a "why don't you do it in a flake" response, among other things.
Personally, I think it's very interesting to have everything set in nixpkgs and to benefit from its monorepo structure.
Without it the effort gets sparse

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/cuda-team-roadmap-and-call-for-sponsors/29495/1

@ogoid
Contributor

ogoid commented Jul 20, 2023

We should expect a "why don't you do it in a flake" response, among other things. Personally, I think it's very interesting to have everything set in nixpkgs and to benefit from its monorepo structure. Without it the effort gets sparse

Well... I created a NixOS module flake here anyway, but I also prefer this to be available from the main repo.

I also tried to do this from a normal/package flake by setting nixConfig.extra-sandbox-paths, which sort of works. However, the paths in /run/opengl-driver/* are symbolic links whose targets are not granted access by the nix daemon. As a result, an explicit entry for the resolved driver path inside the Nix store is also necessary:

{
  nixConfig = {
    extra-sandbox-paths = [
      "/dev/nvidia0"
      "/dev/nvidiactl"
      "/dev/nvidia-modeset"
      "/dev/nvidia-uvm"
      "/dev/nvidia-uvm-tools"
      
      "/run/opengl-driver"

      # the resolved target of the /run/opengl-driver symlinks;
      # the build will fail without this:
      "/nix/store/fq7vp75q1f1yd5ypd0mxv1c935xl4j2b-nvidia-x11-535.54.03-6.1.38/"
    ];
  };

  inputs.nixpkgs.url = "github:nixos/nixpkgs/nixpkgs-unstable";

  outputs = { self, nixpkgs }:
  let
    system = "x86_64-linux";

    pkgs = import nixpkgs {
      inherit system;
      config.allowUnfree = true;
      config.cudaSupport = true;
    };

    pytorch = pkgs.python3.withPackages (p: [p.pytorch]);

    package = pkgs.stdenvNoCC.mkDerivation {
      name = "cuda-test";

      unpackPhase = "true";

      buildInputs = [ pytorch ];

      buildPhase = ''
      echo == torch check:
      python -c "import torch; print(torch.cuda.is_available())"
      
      echo
      echo == link check
      ls -l /run/opengl-driver/lib/libcuda.so

      echo
      echo == readlink will error if target is not accessible
      readlink -f /run/opengl-driver/lib/libcuda.so
      '';
    };

  in {
    packages.${system} = {
      default = package;
    };
  };
}

Does the sandbox exclude access to symlink targets due to security/performance concerns, or is this a feature it should have?

@SomeoneSerge
Contributor Author

I created a NixOS module flake here anyway

🎉

Does the sandbox exclude access to symlink targets due to security/performance concerns, or is this a feature it should have?

Not even that, I'd say recursive resolution of symlinks would be non-trivial extra work and potentially surprising behaviour on the sandbox's part, which is a good enough reason not to have it?

Nix's sandboxed builds forbids access to some files necessary for CUDA applications ... Hopefully this will not be necessary in the future

Hiding the hardware by default really is more of a feature. Many builds do auto-detect cuda capabilities and behave differently depending on the results

I also prefer this to be available from the main repo

I updated the issue description to reflect how I think this could be merged into nixpkgs in smaller steps. I thought I'd work on this issue myself this week, but I'm lagging behind schedule and now won't be able to act on this until August. I'd be happy if you kept going with your work and got it accepted upstream.

@ogoid
Contributor

ogoid commented Jul 20, 2023

I don't have enough experience with CUDA/Nix to help with these more granular features.

But one thing I just noticed while experimenting with the jax library is that it also tries to access paths under /sys:

2023-07-20 12:36:28.734602: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.

@ogoid
Contributor

ogoid commented Jul 20, 2023

Is there any way a derivation can refer to the driver currently in use?

In my previous example, I realized we can remove the extra path to /nix/store/fq7vp75q1f1yd5ypd0mxv1c935xl4j2b-nvidia-x11-535.54.03-6.1.38/ if we add a reference to pkgs.linuxPackages.nvidia_x11 in the derivation. But this depends on the system using the same nixpkgs version as the flake.

@SomeoneSerge
Contributor Author

SomeoneSerge commented Jul 20, 2023

"the driver currently in use" is a runtime (e.g. it makes sense in NixOS, but not in nixpkgs) concept, nix packages don't know anything about those at build time; this is the reason NixOS deploys driver-related libraries impurely at /run/opengl-driver/lib (otherwise it'd just have to rebuild each package on every system against its particular drivers). In short, you don't want to do that, the reliable solution is to resolve the symlinks in the pre-build-hook and in turn instruct nix to mount all the required paths, including /run/opengl-driver/lib and the nix store locations it points to

EDIT: you can reference "the (nvidia) driver currently in use" in NixOS as config.boot.kernelPackages.nvidia_x11/config.hardware.nvidia.package, and sorry this was a pretty lazy answer

@SomeoneSerge
Contributor Author

I think this works! Cf. https://gist.github.com/SomeoneSerge/4832997ab09e4e71301e5469eec3066a

I forgot to mention this earlier, but I had experienced some issues when using the machine with that hook set up as a remote builder. IIRC the show-derivation call was failing due to the .drv file somehow not being present on the remote node. I need to check this again, but I think this raises a question about the pre-build-hook's semantics: when exactly is it meant to run between the eval and the build, etc.

@Kiskae
Contributor

Kiskae commented Sep 20, 2023

An alternative approach is to have the tests provide their own CUDA drivers by defining them as NixOS tests. Then you could use PCI passthrough to provide the VM with a real GPU to run the actual tests.

Configuring the qemu VM for passthrough could be done with a build hook to ensure only a single test per GPU can run, or, if you formulate the tests as a flake, you could use flake overrides to inject the configuration during CI.

@the-furry-hubofeverything
Contributor

The next step could be to prepare a PR, introducing

* [ ]  A NixOS module that adds the required `nix.settings.system-features` and adds/extends the `nix.settings.pre-build-hook`

* [ ]  The very basic "GPU can be detected" passthru tests:
  
  * [ ]  `nvmlInit()`
  * [ ]  `torch.cuda.is_available()`
  * [ ]  `tf.config.list_physical_devices('GPU')`
  * [ ]  ...
  * [ ]  [Optional] Just to make the case more appealing to people and to test how far we can go: it would be really nice to code up a NixOS test that runs blender in a virtual machine, goes `Edit -> Preferences -> System -> CUDA` and confirms that there's a GPU in the list 🙃

We should expect a "why don't you do it in a flake" response, among other things. Personally, I think it's very interesting to have everything set in nixpkgs and to benefit from its monorepo structure. Without it the effort gets sparse

Blender has Python integration; you could list the GPU devices that Blender sees with a Python script: https://blender.stackexchange.com/questions/208135/way-to-detect-if-a-valid-gpu-device-is-available-from-python

@ogoid
Contributor

ogoid commented Oct 13, 2023

I'm thinking these GPU-enabled builds would also be useful for general machine learning tasks/pipelines (reproducible model building, etc.), not just for package testing.

@Kiskae
Contributor

Kiskae commented Oct 14, 2023

It might be an idea to work on creating runnable tests for GPU-enabled applications as a separate issue from actually running those tests.

Suppose you define a list of tests under passthru.gpuTests that, when executed, check whether the GPU functionality is working correctly. You could then execute these tests manually on the host GPU at first, and later transition to either executing them in a derivation with a build hook or in a NixOS test with a passed-through GPU.
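
For instance (a sketch: passthru.gpuTests is the hypothetical attribute proposed here, writeShellScript and python3 are assumed to be in scope, and the check mirrors the torch test from earlier in the thread):

passthru.gpuTests = {
  # Runnable directly on a host with a GPU now, and later wrappable in a derivation
  # or a NixOS test without changing the test itself:
  torchCudaAvailable = writeShellScript "torch-cuda-available" ''
    ${python3.withPackages (ps: [ ps.torch ])}/bin/python -c \
      "import torch; assert torch.cuda.is_available()"
  '';
};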

The tests themselves would remain the same regardless of the final solution.

@SomeoneSerge
Contributor Author

@Kiskae have you got an example of how to do the PCI passthrough, or an understanding of how to integrate that with the NixOS test framework?

@Kiskae
Contributor

Kiskae commented Oct 17, 2023

https://ubuntu.com/server/docs/gpu-virtualization-with-qemu-kvm or https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF cover it pretty well.

The required qemu arguments can be set through https://search.nixos.org/options?channel=23.05&show=virtualisation.qemu.options and in theory that would be enough to bind the gpu to the VM.
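
For instance, the test node could pass the GPU through with something like this (a sketch: 01:00.0 stands in for the host GPU's PCI address, and the host must already have that device bound to vfio-pci):

{
  virtualisation.qemu.options = [
    # Hand the host GPU over to the test VM
    "-device vfio-pci,host=01:00.0"
  ];
}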

However, there is little documentation about the other requirements, so it might take some trial and error to get it working.

@SomeoneSerge
Contributor Author

@Kiskae AFAIU, to use PCI passthrough in NixOS tests we'd first have to expose devices to the sandbox (e.g. using #256230 or __impureHostDeps), and then expose them to (one of) the VM(s) running inside the sandbox. Does that sound about correct?

@Kiskae
Contributor

Kiskae commented Oct 19, 2023

@SomeoneSerge yeah, it looks like vfio/iommu have their own device nodes that are required:

https://docs.kernel.org/driver-api/vfio.html#vfio-usage-example

I guess PCI passthrough usually gets used by privileged users, so the required system access isn't well documented.
