
cudaPackages: GPU-enabled tests #225912

Open
8 tasks
SomeoneSerge opened this issue Apr 12, 2023 · 20 comments

Comments

@SomeoneSerge
Contributor

SomeoneSerge commented Apr 12, 2023

Description

  • Introduce the new "cuda"/"expose-cuda" value for nix.settings.system-features
    • Describe a way to handle different cuda capabilities? E.g. at least a way to distinguish between discrete GPUs and Jetsons
    • E.g. we could request and declare features such as cuda-50-60-61-70-75-80-86-89-90 (the set of architectures for which nvidia ships device code in cuda 12.2 for x86_64-linux) and cuda-53-61-62-70-72-75-80-86-87 (same for aarch64-linux, a.k.a. Jetson). This way the only testable packages would be the ones built to support a wide range of GPU architectures
  • PR1: Introduce into nixpkgs a pre-build-hook script for conditional exposure of CUDA devices, cudaPackages.preBuildHook. Make the last new-line terminator optional. At this point users may manage their pre-build-hook directly as:
    nix.settings.pre-build-hook = pkgs.writeScript "nix-pre-build-hook.sh" ''
      ${lib.getExe pkgs.cudaPackages.preBuildHook} --dont-terminate
      # Do more work: ...
      # Exit:
      echo
    '';
    
    The job of this preBuildHook would be to expose cuda devices to those derivations marked with requiredSystemFeatures = [ "cuda" ], and only to them. 2023-03-26: Cf. an example, also 2023-07-20: as a flake. Thanks to @thufschmitt for pointing out that this behaviour may be implemented using pre-build hooks.
    • It may be a good idea to provide an .override-able List str parameter to the hook's derivation, so that non-NixOS users may specify custom locations for libcuda.so different from addOpenGLRunpath.driverLink
    • It may be a good idea to devote this hook a paragraph in the nixpkgs manual
  • PR2: Introduce a NixOS module for managing the nix pre-build-hook (a sketch of such a module follows this list).
    • Introduce a bool option for conditionally exposing cuda devices. The option would enable the hook and extend system-features
    • Introduce a way to test the hook before applying the new generation
    • Outline the path forward for implementing a more generic module for a composable pre-build-hook (not specific to cuda), and point out that it doesn't have to come in the same PR
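
A minimal sketch of what the PR2 module could look like, assuming a single boolean option that toggles both settings at once; the option path and cudaPackages.preBuildHook are proposals from this issue, not existing nixpkgs attributes:

{ config, lib, pkgs, ... }:

{
  options.nix.exposeCudaDevices = lib.mkEnableOption "exposing CUDA devices to sandboxed builds that request the cuda system feature";

  config = lib.mkIf config.nix.exposeCudaDevices {
    # Derivations with requiredSystemFeatures = [ "cuda" ] become schedulable on this machine
    # (note: defining system-features replaces the default feature list, hence the extra entries)...
    nix.settings.system-features = [ "cuda" "kvm" "big-parallel" "nixos-test" "benchmark" ];
    # ...and the hook mounts /dev and the driver libraries for exactly those derivations.
    nix.settings.pre-build-hook = lib.getExe pkgs.cudaPackages.preBuildHook;
  };
}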

Old description:

  • Test GPU functionality in passthru.tests.
  • Mark GPU tests with something like requiredSystemFeatures = [ "cuda" ].
  • Conditionally expose /dev and /run/opengl-driver/lib (and/or whatever is required to make GPU tests work) in extra-sandbox-paths for derivations marked with "cuda" in requiredSystemFeatures.
  • Ensure normal derivations cannot see these extra paths.
  • Set up a PoC CI that would run these tests.

@ConnorBaker
Contributor

@SomeoneSerge would you envision these as tests which would ideally be run during checkPhase if we were able to?

Or is this something different?

@SomeoneSerge
Contributor Author

SomeoneSerge commented May 24, 2023

Consider something like this:

{ buildPythonPackage, torch, ... }:

buildPythonPackage {
  pname = "torch";
  # ...
  passthru.tests.gpuTests = torch.overridePythonAttrs (_: {
    requiredSystemFeatures = [ "expose-cuda" ];
  });
  passthru.tests.cudaAvailable = buildPythonPackage {
    # ...
    requiredSystemFeatures = [ "expose-cuda" ];
    checkPhase = ''
      python << EOF
      import torch
      assert torch.cuda.is_available()
      EOF
    '';
  };
  # ...
}

Any normal Nix deployment should refuse to build python3Packages.torch.tests.gpuTests because of the unknown system feature. Meanwhile, I'm hoping we could either deploy remote builders with system-features = ... expose-cuda that manually mount /dev/... and /run/opengl-driver/lib/libcuda.so*, or otherwise maybe even get a special branch into Nix itself so that it would expose these paths conditionally, depending on whether a derivation asks for the "feature". Presumably, these dedicated builders should be able to use GPUs just in the checkPhase.

Pros: an easy way to maintain a basic test-suite for our packages' GPU functionality within and synchronized with nixpkgs?
Cons: ugly ad hoc hack

@thufschmitt
Member

maybe even get a special branch into Nix itself so that it would expose these paths conditionally depending on whether a derivation asks for the "feature"

You can actually do that using the pre-build-hook feature. I haven't used it myself so take that with a grain of salt, but I suspect that something like the one below would work:

#!/bin/sh

DRV="$1"

# Do we have the "expose-cuda" required feature?
if nix derivation show "$DRV" \
    | jq --exit-status '.["'"$DRV"'"].env.requiredSystemFeatures | contains("expose-cuda")' \
    > /dev/null; then # silence jq: Nix parses everything the hook prints to stdout
  echo "extra-sandbox-paths"
  echo "/run/opengl-driver/lib=/run/opengl-driver/lib"
  echo "/dev=/dev"
  # The list of extra paths must be terminated by an empty line
  echo
fi
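
If that works out, wiring it up on a NixOS builder might look roughly like this (a sketch: expose-cuda-hook.sh is assumed to contain the script above, and defining nix.settings.system-features replaces the default feature set, hence the extra entries):

{ pkgs, ... }:

{
  # Advertise the feature so derivations with requiredSystemFeatures = [ "expose-cuda" ]
  # get scheduled on this machine at all:
  nix.settings.system-features = [ "expose-cuda" "kvm" "big-parallel" "nixos-test" "benchmark" ];

  # The hook then decides, per derivation, whether to add the extra sandbox paths:
  nix.settings.pre-build-hook =
    "${pkgs.writeScript "expose-cuda-hook" (builtins.readFile ./expose-cuda-hook.sh)}";
}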

@SomeoneSerge
Contributor Author

I think this works! Cf. https://gist.github.com/SomeoneSerge/4832997ab09e4e71301e5469eec3066a

On a correctly configured builder, declaring an expose-cuda feature:

nix build --file with-cuda.nix python3Packages.pynvml.tests.testNvmlInit  -L
...
python3.10-pynvml> running install tests
python3.10-pynvml> enter: nvmlInit
python3.10-pynvml> pynvml.nvmlInit()=None
python3.10-pynvml> exit: nvmlInit
...
python3.10-pynvml> Check whether the following modules can be imported: pynvml pynvml.smi

A builder that doesn't declare expose-cuda:

nix build --file with-my-cuda.nix python3Packages.pynvml.tests.testNvmlInit  -L --rebuild
error: a 'x86_64-linux' with features {expose-cuda} is required to build '/nix/store/94vw78sgh3y92bx3rmk62cdgg9nakkrx-python3.10-pynvml-11.5.0.drv', but I am a 'x86_64-linux' with features {benchmark, big-parallel, ca-derivations, kvm, nixos-test}

Behaviour when expose-cuda is set but /dev mounts are misconfigured:

nix build --file with-cuda.nix python3Packages.pynvml.tests.testNvmlInit  -L
python3.10-pynvml> enter: nvmlInit
python3.10-pynvml> Driver Not Loaded
python3.10-pynvml> exit: nvmlInit
...
python3.10-pynvml> pynvml.nvml.NVMLError_Uninitialized: Uninitialized

@SomeoneSerge
Contributor Author

The next step could be to prepare a PR, introducing

  • A NixOS module that adds the required nix.settings.system-features and adds/extends the nix.settings.pre-build-hook
  • The very basic "GPU can be detected" passthru tests (a sketch of one follows this list):
    • nvmlInit()
    • torch.cuda.is_available()
    • tf.config.list_physical_devices('GPU')
    • ...
    • [Optional] Just to make the case more appealing to people and to test how far we can go: it would be really nice to code up a NixOS test that runs blender in a virtual machine, goes Edit -> Preferences -> System -> CUDA and confirms that there's a GPU in the list 🙃
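
A sketch of what one such passthru test could look like, assuming runCommand and python3 are in scope (e.g. via callPackage) and reusing the expose-cuda feature name from the examples above; the attribute name is illustrative:

passthru.tests.tensorflow-gpu-available = runCommand "tensorflow-gpu-available"
  {
    nativeBuildInputs = [ (python3.withPackages (ps: [ ps.tensorflow ])) ];
    # Only builders that expose a GPU may build this derivation:
    requiredSystemFeatures = [ "expose-cuda" ];
  } ''
    python -c "import tensorflow as tf; assert tf.config.list_physical_devices('GPU')"
    touch $out
  '';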

We should expect a "why don't you do it in a flake" response, among other things.
Personally, I think it's very interesting to have everything set in nixpkgs and to benefit from its monorepo structure.
Without it the effort gets sparse

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/cuda-team-roadmap-and-call-for-sponsors/29495/1

@ogoid
Contributor

ogoid commented Jul 20, 2023

We should expect a "why don't you do it in a flake" response, among other things. Personally, I think it's very interesting to have everything set in nixpkgs and to benefit from its monorepo structure. Without it the effort gets sparse

Well... I created a NixOS module flake here anyway, but I also prefer this to be available from the main repo.

I also tried to do this from a normal/package flake by setting nixConfig.extra-sandbox-paths, which sort of works. However, the paths in /run/opengl-driver/* are symbolic links whose targets are not granted access by the nix daemon. As a result, an explicit entry for the resolved driver path inside the Nix store is also necessary:

{
  nixConfig = {
    extra-sandbox-paths = [
      "/dev/nvidia0"
      "/dev/nvidiactl"
      "/dev/nvidia-modeset"
      "/dev/nvidia-uvm"
      "/dev/nvidia-uvm-tools"
      
      "/run/opengl-driver"

      # the resolved target of the /run/opengl-driver symlinks;
      # the build will fail without this:
      "/nix/store/fq7vp75q1f1yd5ypd0mxv1c935xl4j2b-nvidia-x11-535.54.03-6.1.38/"
    ];
  };

  inputs.nixpkgs.url = "github:nixos/nixpkgs/nixpkgs-unstable";

  outputs = { self, nixpkgs }:
  let
    system = "x86_64-linux";

    pkgs = import nixpkgs {
      inherit system;
      config.allowUnfree = true;
      config.cudaSupport = true;
    };

    pytorch = pkgs.python3.withPackages (p: [p.pytorch]);

    package = pkgs.stdenvNoCC.mkDerivation {
      name = "cuda-test";

      unpackPhase = "true";

      buildInputs = [ pytorch ];

      buildPhase = ''
      echo == torch check:
      python -c "import torch; print(torch.cuda.is_available())"
      
      echo
      echo == link check
      ls -l /run/opengl-driver/lib/libcuda.so

      echo
      echo == readlink will error if target is not accessible
      readlink -f /run/opengl-driver/lib/libcuda.so
      '';
    };

  in {
    packages.${system} = {
      default = package;
    };
  };
}

Does the sandbox exclude access to symlink targets due to security/performance concerns, or is this a feature it should have?

@SomeoneSerge
Contributor Author

I created a NixOS module flake here anyway

🎉

Does the sandbox exclude access to symlink targets due to security/performance concerns, or is this a feature it should have?

Not even that, I'd say recursive resolution of symlinks would be non-trivial extra work and potentially surprising behaviour on the sandbox's part, which is a good enough reason not to have it?

Nix's sandboxed builds forbids access to some files necessary for CUDA applications ... Hopefully this will not be necessary in the future

Hiding the hardware by default really is more of a feature. Many builds do auto-detect cuda capabilities and behave differently depending on the results

I also prefer this to be available from the main repo

I updated the issue description to reflect how I think this could be merged into nixpkgs in smaller steps. I thought I'd work on this issue myself this week, but I'm lagging behind schedule and now won't be able to act on this until August. I'd be happy if you kept going with your work and got it accepted upstream.

@ogoid
Contributor

ogoid commented Jul 20, 2023

I don't have enough experience with CUDA/Nix to help with these more granular features.

But one thing I just noticed while experimenting with the jax library is that it also tries to access paths under /sys:

2023-07-20 12:36:28.734602: I external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.

@ogoid
Contributor

ogoid commented Jul 20, 2023

Is there any way a derivation can refer to the driver currently in use?

In my previous example, I realized we can remove the extra path to /nix/store/fq7vp75q1f1yd5ypd0mxv1c935xl4j2b-nvidia-x11-535.54.03-6.1.38/ if we add a reference to pkgs.linuxPackages.nvidia_x11 in the derivation. But this depends on the system using the same nixpkgs version as the flake.

@SomeoneSerge
Contributor Author

SomeoneSerge commented Jul 20, 2023

"the driver currently in use" is a runtime (e.g. it makes sense in NixOS, but not in nixpkgs) concept, nix packages don't know anything about those at build time; this is the reason NixOS deploys driver-related libraries impurely at /run/opengl-driver/lib (otherwise it'd just have to rebuild each package on every system against its particular drivers). In short, you don't want to do that, the reliable solution is to resolve the symlinks in the pre-build-hook and in turn instruct nix to mount all the required paths, including /run/opengl-driver/lib and the nix store locations it points to

EDIT: you can reference "the (nvidia) driver currently in use" in NixOS as config.boot.kernelPackages.nvidia_x11/config.hardware.nvidia.package, and sorry this was a pretty lazy answer

@SomeoneSerge
Contributor Author

I think this works! Cf. https://gist.github.com/SomeoneSerge/4832997ab09e4e71301e5469eec3066a

I forgot to mention this earlier, but I had experienced some issues when using the machine with that hook set up as a remote builder. IIRC the show-derivation call was failing due to the .drv file somehow not being present on the remote node. I need to check this again, but I think this raises a question about the pre-build-hook's semantics: when exactly is it meant to run between the eval and the build, etc.

@Kiskae
Contributor

Kiskae commented Sep 20, 2023

An alternative approach is to have the tests provide their own CUDA drivers by defining them as NixOS tests. Then you could use PCI passthrough to provide the VM with a real GPU to run the actual tests.

Configuring the qemu VM for passthrough could be done with a build hook to ensure only a single test per GPU can run, or, if you formulate the tests as a flake, you could use flake overrides to inject the configuration during CI.

@the-furry-hubofeverything
Contributor

The next step could be to prepare a PR, introducing

* [ ]  A NixOS module that adds the required `nix.settings.system-features` and adds/extends the `nix.settings.pre-build-hook`

* [ ]  The very basic "GPU can be detected" passthru tests:
  
  * [ ]  `nvmlInit()`
  * [ ]  `torch.cuda.is_available()`
  * [ ]  `tf.config.list_physical_devices('GPU')`
  * [ ]  ...
  * [ ]  [Optional] Just to make the case more appealing to people and to test how far we can go: it would be really nice to code up a NixOS test that runs blender in a virtual machine, goes `Edit -> Preferences -> System -> CUDA` and confirms that there's a GPU in the list 🙃

We should expect a "why don't you do it in a flake" response, among other things. Personally, I think it's very interesting to have everything set in nixpkgs and to benefit from its monorepo structure. Without it the effort gets sparse

Blender has Python integration; you could list the GPU devices that Blender sees with a Python script: https://blender.stackexchange.com/questions/208135/way-to-detect-if-a-valid-gpu-device-is-available-from-python

@ogoid
Contributor

ogoid commented Oct 13, 2023

I'm thinking these GPU-enabled builds would also be useful for general machine learning tasks/pipelines (reproducible model building, etc.), not just for package testing.

@Kiskae
Contributor

Kiskae commented Oct 14, 2023

It might be an idea to work on creating runnable tests for GPU-enabled applications as a separate issue from actually running those tests.

Suppose you define a list of tests under passthru.gpuTests that, when executed, check whether the GPU functionality is working correctly. You could then execute these tests manually on the host GPU at first, and later transition to either executing them in a derivation with a build hook or in a NixOS test with a passed-through GPU.
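
For instance (a sketch: passthru.gpuTests is the hypothetical attribute proposed here, writeShellScript and python3 are assumed to be in scope, and the check mirrors the torch test from earlier in the thread):

passthru.gpuTests = {
  # Runnable directly on a host with a GPU now, and later wrappable in a derivation
  # or a NixOS test without changing the test itself:
  torchCudaAvailable = writeShellScript "torch-cuda-available" ''
    ${python3.withPackages (ps: [ ps.torch ])}/bin/python -c \
      "import torch; assert torch.cuda.is_available()"
  '';
};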

The tests themselves would remain the same regardless of the final solution.

@SomeoneSerge
Contributor Author

@Kiskae have you got an example of how to do the PCI passthrough, or an understanding of how to integrate that with the NixOS test framework?

@Kiskae
Contributor

Kiskae commented Oct 17, 2023

https://ubuntu.com/server/docs/gpu-virtualization-with-qemu-kvm or https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF cover it pretty well.

The required qemu arguments can be set through https://search.nixos.org/options?channel=23.05&show=virtualisation.qemu.options and in theory that would be enough to bind the gpu to the VM.
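
For instance, the test node could pass the GPU through with something like this (a sketch: 01:00.0 stands in for the host GPU's PCI address, and the host must already have that device bound to vfio-pci):

{
  virtualisation.qemu.options = [
    # Hand the host GPU over to the test VM
    "-device vfio-pci,host=01:00.0"
  ];
}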

However, there is little documentation about the other requirements, so it might take some trial and error to get it working.

@SomeoneSerge
Contributor Author

@Kiskae AFAIU, to use PCI passthrough in NixOS tests we'd first have to expose devices to the sandbox (e.g. using #256230 or __impureHostDeps), and then expose them to (one of) the VM(s) running inside the sandbox. Does that sound about correct?

@Kiskae
Contributor

Kiskae commented Oct 19, 2023

@SomeoneSerge yeah, it looks like vfio/iommu have their own device nodes that are required:

https://docs.kernel.org/driver-api/vfio.html#vfio-usage-example

I guess PCI passthrough usually gets used by privileged users, so the required system access isn't well documented.
