
cudaPackages: keep unsupported attributes, mark unsupported, explain why #266475

ConnorBaker opened this issue Nov 9, 2023 · 6 comments

ConnorBaker commented Nov 9, 2023

A fair number of tools we use in cudaPackages depend on specific versions of CUDA. Sometimes this is a strict dependency due to API changes or other breakages; other times, it's merely because it is not known whether the code is forward (or backward!) compatible.

The approach that I (@ConnorBaker) have taken so far in cudaPackages has been that when a package is unavailable for some release of CUDA, or is known to be incompatible, it is not present in that CUDA release's package set (e.g., cudaPackages_12_2).

However, this limits the flexibility of downstream users and forecloses a larger pattern: attempting to build against newer or older versions of dependencies.

@SomeoneSerge raised this issue in a comment on a PR about cuda-samples, which we use as a sort of sanity check to make sure that the cudatoolkit builds: #266115 (comment)

> To clarify, it's not "break" in the sense that the derivation has `broken = true` -- I'm replacing `builtins.throw`, which would "break" evaluation entirely.

> Right, we definitely do not want to throw. But `broken = true` is still an alternative to `optionalAttrs`; what do you think about those?

For context: cudaPackages.cuda-samples has a release for each recent release of CUDA except for 11.7. I made the decision to exclude cuda-samples from cudaPackages_11_7 by wrapping it in optionalAttrs.
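The pattern in question looks roughly like this (a minimal sketch, not the actual cuda-modules code; the file path and surrounding arguments are hypothetical):

```nix
# Hypothetical sketch of the exclusion pattern. When the predicate is false,
# lib.optionalAttrs returns {}, so the cuda-samples attribute simply does not
# exist in that CUDA release's package set.
{ lib, callPackage, cudaVersion }:

lib.optionalAttrs (cudaVersion != "11.7") {
  cuda-samples = callPackage ./cuda-samples.nix { };
}
```

Because the attribute is absent rather than marked broken, any expression that references it fails at evaluation time instead of at build time.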

This issue proposes two mechanisms in abstract:

  • A way to build a variable version of a package against a fixed version of a package set
    • e.g., use an older cuda-samples release (11.6) with cudaPackages_11_7
    • e.g., use a newer cuda-samples release (11.8) with cudaPackages_11_7
  • A way to build a fixed version of a package against a variable version of a package set
    • e.g., use the same version of TensorRT with different versions of cudaPackages even if compatibility is not guaranteed by NVIDIA's support matrix
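In terms of the existing override machinery, the two mechanisms might look something like this (a sketch only; the override points and the `tensorrtGeneric` builder are hypothetical names, not existing attributes):

```nix
# Sketch only: these override points and helper names are hypothetical.
{
  # (1) Variable package version against a fixed package set:
  # build an older cuda-samples release inside cudaPackages_11_7.
  cuda-samples_11_6 = cudaPackages_11_7.cuda-samples.overrideAttrs (old: {
    version = "11.6"; # hypothetical: assumes src selection keys off version
  });

  # (2) Fixed package version against a variable package set:
  # instantiate the same TensorRT expression with a different cudaPackages.
  tensorrt_for_12_2 = cudaPackages_12_2.callPackage tensorrtGeneric { };
}
```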

Taken together, these mechanisms would allow users to attempt to build things which would otherwise not be possible (e.g., using a newer release of CUDNN which might support a specific version of CUDA, even if it is not listed in the support matrix).

It's worth keeping in mind that support matrices (specifically as we're talking about them here, with respect to NVIDIA's software) essentially guarantee that the listed combinations work. They do not guarantee that combinations outside those enumerated are broken, only that they are unsupported.

@SomeoneSerge @samuela thoughts?


samuela commented Nov 9, 2023

Is this currently blocking any packages/PRs? It looks like #266115 has already been merged.

@ConnorBaker

No, not blocking anything. Just had this thought since I’ve seen this pattern around as I’ve been rewriting cuda-modules. It could simplify things internally and perhaps make it easier for users to try out officially unsupported combinations without needing to do a number of deep overrides — I mean, assuming that’s something we’d want to let them do!


SomeoneSerge commented Nov 14, 2023

A further advantage of `broken = true` is that, in addition to `.override`, you can still attempt the following:

```console
❯ nix eval -f '<nixpkgs>' --argstr crossSystem aarch64-linux cudaPackages.cuda_compat.meta.name
...
❯ nix build -f '<nixpkgs>' --argstr crossSystem aarch64-linux cudaPackages.cuda_compat.src
...
```

@SomeoneSerge

I just couldn't get torchvision to even evaluate in #269639 because of

```nix
defaultBuild = {
  "tensorrt" =
    if allBuilds ? ${computeName tensorRTDefaultVersion}
    then allBuilds.${computeName tensorRTDefaultVersion}
    else throw "tensorrt-${tensorRTDefaultVersion} does not support your cuda version ${cudaVersion}";
};
```

even though torch/torchvision doesn't actually use tensorrt by default.

I'm pretty sure `throw` is not the tool we need.
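A toy example of the failure mode (hypothetical attribute names; evaluable with `nix-instantiate --eval`):

```nix
# A throw stored in an attrset is harmless until something forces that value,
# but any evaluation that walks all values (e.g. a nix-env -qa style query)
# will abort, even for users who never asked for the broken package.
let
  pkgs = {
    tensorrt = throw "tensorrt does not support this CUDA version";
    torchvision = "builds fine";
  };
in
{
  ok = pkgs.torchvision;            # evaluates: the throw is never forced
  # all = builtins.attrValues pkgs; # uncommenting this aborts evaluation
}
```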

@SomeoneSerge

Some more reflection on the TRT situation: it's different from cuda_compat.

We only enable cuda_compat for aarch64-linux hosts, and we presume them to be Jetsons. We do so because NVIDIA just chooses not to ship cuda_compat for consumer GPUs and x86_64 hosts, although in principle the same hack would work on any platform. Any recent cudatoolkit release clearly corresponds to a version of cuda_compat, so it "exists", albeit buildable only on certain platforms.

With TRT, there are cudatoolkit releases for which TRT isn't a thing (yet). We couldn't even assign the broken/fake derivation any meaningful version. However, I still don't feel like omitting the attribute is a good thing, because it breaks eval so easily: if we leave out cudaPackages.tensorrt, we've got to remove python3Packages.tensorrt as well. I just don't think that's the kind of coupling we want to maintain.
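One way to keep the attribute without a meaningful version would be a placeholder marked broken (a sketch, not an agreed-upon design; the version string and description are invented):

```nix
# Hypothetical placeholder derivation: the attribute stays in the set, eval
# succeeds, and the refusal to build carries an explanation.
{ lib, stdenv, cudaVersion }:

stdenv.mkDerivation {
  pname = "tensorrt";
  version = "0-unsupported"; # invented: no upstream release exists
  dontUnpack = true;
  meta = {
    description = "TensorRT (NVIDIA ships no release for CUDA ${cudaVersion})";
    broken = true; # nixpkgs refuses to build unless allowBroken is set
  };
}
```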

@SomeoneSerge

  • cuda-library-samples

Going through #280386 I see that cuda-library-samples is still wrapped in optionalAttrs (e.g. it's confusingly missing from nix-instantiate in the linked PR) even after #273794.

@SomeoneSerge SomeoneSerge changed the title cudaPackages: implement an allow(Newer|Older) mechanism cudaPackages: keep unsupported attributes, mark unsupported, explain why Jan 12, 2024