Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

rocm: 4.0.1 -> 4.1.0 #117450

Merged
merged 9 commits into from Mar 25, 2021
Merged

rocm: 4.0.1 -> 4.1.0 #117450

merged 9 commits into from Mar 25, 2021

Conversation

danieldk
Copy link
Contributor

Motivation for this change

Update to ROCm 4.1.0. I cannot test this since my Radeon card is on a shelf 馃槈 .

Do not merge this PR unless it is approved by both @acowley and @Flakebi .

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@r-rmcgibbo
Copy link

r-rmcgibbo commented Mar 24, 2021

Result of nixpkgs-review pr 117450 at cc852010 run on aarch64-linux 1

2 packages marked as broken and skipped:
  • rocm-opencl-icd
  • rocm-opencl-runtime
1 package failed to build:
2 packages skipped due to time constraints:
  • rocm-comgr
  • rocm-device-libs
1 package built successfully:
  • rocm-cmake
1 suggestion:
  • warning: unused-argument

    Unused argument: buildPythonApplication.
    Near pkgs/tools/system/rocm-smi/default.nix:1:16:

      |
    1 | { lib, stdenv, buildPythonApplication, fetchFromGitHub, cmake, python3 }:
      |                ^
    

Note that build failures may predate this PR, and could be nondeterministic or hardware dependent.
Please exercise your independent judgement.


Result of nixpkgs-review pr 117450 at cc852010 run on x86_64-linux 1

6 packages skipped due to time constraints:
  • rocclr
  • rocm-comgr
  • rocm-device-libs
  • rocm-opencl-icd
  • rocm-opencl-runtime
  • rocm-runtime
3 packages built successfully:
  • rocm-cmake
  • rocm-smi
  • rocm-thunk
1 suggestion:
  • warning: unused-argument

    Unused argument: buildPythonApplication.
    Near pkgs/tools/system/rocm-smi/default.nix:1:16:

      |
    1 | { lib, stdenv, buildPythonApplication, fetchFromGitHub, cmake, python3 }:
      |                ^
    

This also migrates this derivation from the old deprecated ROC-smi
repository to rocm_smi_lib.
Copy link
Member

@Flakebi Flakebi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested OpenCL and rocm-smi and they are still working. Thanks!

Comment on lines +16 to 22
postPatch = ''
# Upstream ROCm is installed in an /opt directory. For this reason,
# it does not completely follow FHS layout, creating top-level
# rocm_smi, oam, and bindings top-level directories. Since rocm-smi
# is a package that is typically installed, we change the paths to
# follow FHS more closely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
postPatch = ''
# Upstream ROCm is installed in an /opt directory. For this reason,
# it does not completely follow FHS layout, creating top-level
# rocm_smi, oam, and bindings top-level directories. Since rocm-smi
# is a package that is typically installed, we change the paths to
# follow FHS more closely.
# Upstream ROCm is installed in an /opt directory. For this reason,
# it does not completely follow FHS layout, creating top-level
# rocm_smi, oam, and bindings top-level directories. Since rocm-smi
# is a package that is typically installed, we change the paths to
# follow FHS more closely.
postPatch = ''

This reduces the final derivation size a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But than half of the comments are in the indented block and the others outside?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are written in the phase the are treated as part of the command and updating them causes a rebuild because they are included in the drv. This also grows the size of the drv file which should be avoided especially for bigger comments. If they comments are just a line or two this is not a big problem but for bigger ones this can be avoided.

Copy link
Contributor Author

@danieldk danieldk Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the size of the drv is much of an issue, given that the whole derivation is 3KiB. Rebuilds when changing comments is annoying, but these comments don't really get updated anyway. Unless the comments are not relevant or accurate anymore, but then other parts of the derivation will change as well.

Of course, if a phase has a single comment, I would agree that it is better to put it outside the phase itself. But when a phase has several comments that are part of a 'discourse', I think it is nicer to have them in the same scope.

Copy link
Contributor

@acowley acowley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clinfo looks good for me, and hip-4.1.0 in the overlay seems to be working, too.

@danieldk
Copy link
Contributor Author

danieldk commented Mar 25, 2021

Thanks for the reviews and testing the changes on actual hardware!

@danieldk danieldk merged commit 445ff0c into NixOS:master Mar 25, 2021
@danieldk danieldk deleted the rocm-4.1.0 branch March 25, 2021 07:53
@acowley
Copy link
Contributor

acowley commented Mar 25, 2021

Just as a heads up, there may be issues on Vega 20 hardware. I only have an RX580, so I don't know if we're hitting this.

ETA: Right, so there are patches that are yet to make it into a kernel that are needed for ROCm 4.1 on Radeon VII hardware. Users must either stay on ROCm 4.0, or use an older kernel that works with the 4.1 rock-dkms. We do not have updated packaging for the dkms route, but I think the overlay used to support that. It could be revived if there is a need and the kernel patch upstreaming takes much longer.

@danieldk
Copy link
Contributor Author

danieldk commented Mar 28, 2021

Just as a heads up, there may be issues on Vega 20 hardware. I only have an RX580, so I don't know if we're hitting this.

ETA: Right, so there are patches that are yet to make it into a kernel that are needed for ROCm 4.1 on Radeon VII hardware. Users must either stay on ROCm 4.0, or use an older kernel that works with the 4.1 rock-dkms. We do not have updated packaging for the dkms route, but I think the overlay used to support that. It could be revived if there is a need and the kernel patch upstreaming takes much longer.

Thanks for the heads-up! I am not sure how we could best solve this. I guess that people who have Radeon VII hardware could use an older revision of nixpkgs for rocm-opencl-icd.

I am also a bit worried about future testing, since Polaris GPUs are not officially supported anymore, I guess support could break in the future. Does anyone in the Nix ecosystem have a Vega 7nm, Vega 10, or MI100?

@acowley
Copy link
Contributor

acowley commented Mar 28, 2021

I don鈥檛 want to drop a burden on anyone, but let me just gently ping @athas and @Wulfsta as two interested parties.

A reasonable(?) hope is to stay the course for now, hope that ROCm picks up RDNA2 support within a month or two, and that one of the people who won the able-to-buy-a-GPU lottery comes along to help out by running each new release.

@athas
Copy link
Contributor

athas commented Mar 28, 2021

I'm using ROCm from Nixpkgs these days, although I may switch back to this repository for HIP.

I don't have a Vega 20-based GPU, though, so I wouldn't have any opinion anyway.

I hope to get an MI100 soon, but it'll probably go in a CentOS system, and I doubt the rest of my lab would be pleased to have me Nixify it!

@Wulfsta
Copy link
Member

Wulfsta commented Mar 28, 2021

@acowley Please, feel free to ping me whenever - I have certainly done it to you enough! I can try to test this sometime in the next few days, but I am particularly busy right now and likely will end up delaying. Also, I have to figure out how to use this right from nixpkgs - I am still using some rocm-nixos fork or branch at this point...

Edit: I have a Radeon VII if anyone ever needs me to test something for nixpkgs - but I can't promise it will happen quickly due to my IRL schedule.

@acowley
Copy link
Contributor

acowley commented Mar 28, 2021

@Wulfsta Thanks, I think everything is known for this release. The main thing is that when new releases come around, we want people to be able to run them at least once before we update nixpkgs.

I feel as though OpenCL will be fine since they鈥檙e using ROCm for that in every driver. But in the event that Polaris / gfx803 / RX580 totally loses support, we鈥檇 need to figure out something. I don鈥檛 know for certain, but I think the support issue on AMD鈥檚 side is mostly the specialized kernels for ML middleware that they don鈥檛 want to do for Polaris because it鈥檚 different enough from Vega and no HPC customers are using Polaris.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants