Interaction between the accelerator back-end (CUDA and ROCm) support flags #268919

Open

SomeoneSerge opened this issue Nov 21, 2023 · 4 comments
Labels: 6.topic: cuda, 6.topic: rocm

@SomeoneSerge (Contributor) commented Nov 21, 2023

Issue description

At this point we independently expose the global config.cudaSupport and config.rocmSupport options. We also have a number of package expressions similar to:

{ config
, cudaSupport ? config.cudaSupport
, rocmSupport ? config.rocmSupport
, ...
}:

... {
  meta.broken =
    (cudaSupport && rocmSupport)      # Sometimes the back-ends are mutually exclusive
    || !(cudaSupport || rocmSupport); # Sometimes at least one is required
}

There are many variations, both in the signature, e.g. { config, gpuBackend ? if config.cudaSupport then ... else if ... else ... }: ..., and in the way incompatible combinations are handled, e.g. assert (builtins.elem gpuBackend [ ... ]) instead of broken = ....
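For illustration, a hypothetical variant of the second style might look like this (the gpuBackend argument and the "none" value are assumptions based on the description above, not a specific expression from nixpkgs):

{ config
, gpuBackend ?
    if config.cudaSupport then "cuda"
    else if config.rocmSupport then "rocm"
    else "none"
}:

# Fails evaluation outright if an unsupported back-end is requested,
# instead of marking the package broken.
assert builtins.elem gpuBackend [ "cuda" "rocm" "none" ];

{
  # ... the rest of the derivation, selecting flags based on gpuBackend ...
}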

We might want to find a more consistent approach.

Preliminary proposal

  • Handle interaction at the individual package level, do nothing about the config.xSupport combinations.
  • Rewrite the package expressions with { ..., gpuBackend }: ...
  • In the first cycle, keep the existing {cuda,rocm}Support and {with,enable}{Cuda,Rocm} arguments. Set defaults to null. Whenever they aren't null, display a deprecation warning and set gpuBackend appropriately.
  • Do not assert, use broken instead. This way it's always the end-user's decision as to what builds to attempt (maybe the users extend patches and manage to relax the interaction rules). A sketch of this migration pattern follows the list.
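A minimal sketch of how a single package could implement both points (the gpuBackend argument, the warning text, and the "at least one back-end required" rule are assumptions, not an agreed-upon interface):

{ lib
, config
, gpuBackend ? null
  # Deprecated aliases, kept around for one release cycle.
, cudaSupport ? null
, rocmSupport ? null
}:

let
  # Map the legacy flags onto gpuBackend, warning whenever they are used.
  effectiveBackend =
    if cudaSupport != null || rocmSupport != null then
      lib.warn "cudaSupport/rocmSupport are deprecated; pass gpuBackend instead"
        (if cudaSupport == true then "cuda"
         else if rocmSupport == true then "rocm"
         else "none")
    else if gpuBackend != null then gpuBackend
    else "none";
in
{
  # Mark unsupported combinations broken rather than asserting, so the
  # end-user can still force the build (e.g. after adding patches).
  meta.broken = !(builtins.elem effectiveBackend [ "cuda" "rocm" ]);
}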

CC @NixOS/cuda-maintainers @NixOS/rocm-maintainers

@samuela (Member) commented Nov 21, 2023

For packages that do support building with both cudaSupport and rocmSupport enabled, I could imagine that some users may want to do so, e.g. to manage a single package cache across multiple machines. Unfortunately, packages that disallow building with both, like magma, add a wrinkle to this.

Perhaps we ought to treat it like cudaCapabilities? E.g. config.gpuBackends is a list (not sure if nix has a set type?) that can contain nothing, just "cuda", just "rocm", or both.
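To make that concrete, a hypothetical configuration and consumer could look like the following (config.gpuBackends is not an existing nixpkgs option; this is only a sketch of the proposal):

# Hypothetical: the user enables both back-ends globally.
# pkgs = import <nixpkgs> { config.gpuBackends = [ "cuda" "rocm" ]; };

# In a package expression, the per-backend flags would be derived from the list:
{ config
, gpuBackends ? config.gpuBackends or [ ]
}:

let
  cudaSupport = builtins.elem "cuda" gpuBackends;
  rocmSupport = builtins.elem "rocm" gpuBackends;
in
{
  # A package like magma, which cannot build with both, would then use:
  meta.broken = cudaSupport && rocmSupport;
}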

@SomeoneSerge (Contributor, Author)

Lists sound good! There's a natural designation for the CPU-only builds, and a representation for "enable both backends".

RE: config.gpuBackends, set types

This leads us back to the config.accelerators.{cuda,rocm} question. We might actually prefer lists over sets, with the order representing priority! E.g. import nixpkgs { config.accelerators.enabledBackends = [ "rocm" "cuda" ]; } could mean "enable rocm in every package that supports it, and build cuda too if that's possible".
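As a sketch of the priority idea (config.accelerators.enabledBackends is hypothetical, as is the package-side selection logic):

# Hypothetical global configuration: prefer rocm, and also build cuda where possible.
# pkgs = import nixpkgs { config.accelerators.enabledBackends = [ "rocm" "cuda" ]; };

# A package that supports only one back-end at a time could pick the first
# entry of the ordered list that it knows how to build with:
{ lib
, config
, enabledBackends ? config.accelerators.enabledBackends or [ ]
}:

let
  supportedBackends = [ "rocm" "cuda" ];  # what this package can build with
  chosenBackend =
    lib.findFirst (b: builtins.elem b supportedBackends) "none" enabledBackends;
in
{
  # For a package that requires at least one GPU back-end:
  meta.broken = chosenBackend == "none";
}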

fazo96 mentioned this issue Nov 21, 2023
@Madouura (Contributor)
> could mean "enable rocm in every package that supports it, and build cuda too if that's possible"

To be really honest, that seems a bit useless, because 99.999999999% of users aren't going to have both AMD and NVIDIA GPUs on their system.
Despite its arguable uselessness though, we should definitely prioritize by order.
I like this idea a lot.

@samuela (Member) commented Nov 22, 2023

Hmm, I'm not sure I understand the priority thing. What's the use case for having priority flags on build options? Shouldn't the build always execute with the configuration it's given?
