
python3Packages.torch: Fix performance problem on darwin #220017

Merged
merged 2 commits into from Mar 14, 2023

Conversation

@azuwis
Contributor

@azuwis azuwis commented Mar 7, 2023

Description of changes

Enable Accelerate.framework, disable mkldnn.

In my test on a MacBook Pro M1 2020, using pianotrans to transcribe
cut_liszt.opus, transcription time:

Without this patch: ~88s

Disable mkldnn: ~21s

Disable mkldnn and enable Accelerate.framework: ~9s

The final result is close to using torch-bin.

See also #219104
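For anyone who wants to try the effect locally before this lands, the flags can be toggled through an override along these lines (a sketch following the packageOverrides pattern used elsewhere in this thread; `mklDnnSupport` is a real argument of the torch derivation, the rest is ordinary nixpkgs plumbing):

```nix
# Sketch: a Python environment whose torch is built without mkldnn,
# mirroring the default this PR introduces for darwin.
let
  pkgs = import <nixpkgs> { };
  python = pkgs.python3.override {
    packageOverrides = self: super: {
      torch = super.torch.override { mklDnnSupport = false; };
    };
  };
in
python.withPackages (ps: [ ps.torch ])
```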

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.05 Release Notes (or backporting 22.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@azuwis azuwis mentioned this pull request Mar 7, 2023
@ofborg ofborg bot added the 6.topic: darwin Running or building packages on Darwin label Mar 7, 2023
@SomeoneSerge
Contributor

Awesome! So, was it CoreML and MLCompute that made the difference, compared to the previous run?

@azuwis
Contributor Author

azuwis commented Mar 7, 2023

Disabling mkldnn makes the big difference.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-ready-for-review/3032/1917

@@ -1,6 +1,6 @@
 { stdenv, lib, fetchFromGitHub, fetchpatch, buildPythonPackage, python,
   cudaSupport ? false, cudaPackages, magma,
-  mklDnnSupport ? true, useSystemNccl ? true,
+  mklDnnSupport ? stdenv.isLinux, useSystemNccl ? true,
Contributor

Hmm, this poses the question of whether we should touch x86_64-darwin, if we were only looking at aarch64-darwin so far.

  1. Does x86_64-darwin lose performance too, with mkldnn on?
  2. If there's an Apple-specific library that provides the same functionality as mkldnn and that pytorch ends up consuming in your build, we should probably always use it.

I'm looking at 'wheel-py3_10-cpu-build/10_Build PyTorch binary.txt' from one of the x86 darwin actions upstream, they seem to have these flags set:

--   USE_MKL               : ON
--   USE_MKLDNN            : ON
--   USE_MKLDNN_ACL        : OFF
--   USE_MKLDNN_CBLAS      : OFF

Ultimately, if we can manually check that, with the rest of your change merged, mklDnnSupport = false doesn't harm performance on x86_64-darwin, I think we're good to keep isLinux.
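One quick way to verify which backends a given torch build actually enabled is to print its compile-time configuration: `torch.__config__.show()` lists the USE_MKLDNN flag and the BLAS backend. A sketch of the check (the nix-shell invocation is illustrative):

```ShellSession
$ nix-shell -p python3Packages.torch --run \
    'python -c "import torch; print(torch.__config__.show())"'
# Look for USE_MKLDNN and the BLAS/LAPACK backend (Accelerate on darwin)
# in the printed build flags.
```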

Contributor

Also, let's add a comment referring to this PR and explaining that MKLDNN negatively impacts performance on M1

Member

Yes, please add a comment otherwise this gets forgotten again

Contributor Author

Also, let's add a comment referring to this PR and explaining that MKLDNN negatively impacts performance on M1

Done in the new push.

@SomeoneSerge
Contributor

To anyone possessing an x86_64-darwin machine, I think we need to see the difference between:

❯ cat << EOF > pianotrans-tests.nix
let
  # github:azuwis/nixpkgs/62b4df51ee3c01968ac2ac30831b1f989ab917f9
  nixpkgs = builtins.getFlake "github:azuwis/nixpkgs/torch";
  pkgs = import nixpkgs { system = "x86_64-darwin"; };
  mkPy = mklDnnSupport: pkgs.python3.override
    {
      packageOverrides = self: super: { torch = super.torch.override { inherit mklDnnSupport; }; };
    };
in
{
  mklDnnOn = (pkgs.pianotrans.override {
    python3 = mkPy true;
  });
  mklDnnOff = (pkgs.pianotrans.override {
    python3 = mkPy false;
  });
}
EOF
❯ nix eval -f pianotrans-tests.nix mklDnnOff.outPath
"/nix/store/6zlk47rgzv6vck640vl5qi0fqmi4n18c-pianotrans-1.0.1"
❯ nix eval -f pianotrans-tests.nix mklDnnOn.outPath
"/nix/store/58mxcypwkn2idk89d6z9as66pm28j08h-pianotrans-1.0.1"
❯ wget https://github.com/azuwis/pianotrans/raw/master/test/cut_liszt.opus
❯ nix shell -f pianotrans-tests.nix mklDnnOn --command pianotrans cut_liszt.opus
...
Transcribe time: XX.XXX s
...
❯ nix shell -f pianotrans-tests.nix mklDnnOff --command pianotrans cut_liszt.opus
...
Transcribe time: YY.YYY s
...


@azuwis
Contributor Author

azuwis commented Mar 9, 2023

I've set up a GitHub Action to test x86_64-darwin: https://github.com/azuwis/pianotrans/actions/runs/4373108055/jobs/7650862384

The result:

Torch-bin:

Transcribe time: 30.519 s
Transcribe time: 30.065 s
Transcribe time: 29.004 s
Transcribe time: 28.713 s
Transcribe time: 28.911 s
Transcribe time: 45.366 s
Transcribe time: 101.648 s
Transcribe time: 70.682 s
Transcribe time: 70.235 s
Transcribe time: 26.910 s

Enable mklDnn, disable Accelerate.framework (current nixpkgs):

Transcribe time: 64.889 s
Transcribe time: 59.743 s
Transcribe time: 61.396 s
Transcribe time: 59.938 s
Transcribe time: 60.521 s
Transcribe time: 60.098 s
Transcribe time: 59.536 s
Transcribe time: 63.212 s
Transcribe time: 61.648 s
Transcribe time: 63.184 s

Disable mklDnn, enable Accelerate.framework:

Transcribe time: 54.248 s
Transcribe time: 52.206 s
Transcribe time: 51.521 s
Transcribe time: 54.340 s
Transcribe time: 54.419 s
Transcribe time: 50.019 s
Transcribe time: 50.763 s
Transcribe time: 47.536 s
Transcribe time: 47.214 s
Transcribe time: 48.254 s

Enable mklDnn, enable Accelerate.framework:

Transcribe time: 46.777 s
Transcribe time: 46.982 s
Transcribe time: 46.966 s
Transcribe time: 46.838 s
Transcribe time: 45.889 s
Transcribe time: 46.136 s
Transcribe time: 47.243 s
Transcribe time: 46.703 s
Transcribe time: 47.070 s
Transcribe time: 46.289 s

We should enable mklDnn and enable Accelerate.framework for x86_64-darwin, but torch performance is still not as good as torch-bin.

@SomeoneSerge
Contributor

@azuwis amazing! To finish the comparison against torch-bin, can you also add a version with

        blas = prev.blas.override {
          blasProvider = final.mkl;
        };

It negatively impacts performance; this is also what the official pytorch
build does.

In my test on a MacBook Pro M1 2020, using [pianotrans][1] to transcribe
[cut_liszt.opus][2], transcription time:

Enable mkldnn and disable Accelerate.framework: ~88s

Disable mkldnn and disable Accelerate.framework: ~21s

Disable mkldnn and enable Accelerate.framework: ~9s

The final result is close to using torch-bin.

See also NixOS#219104

[1]: https://github.com/azuwis/pianotrans
[2]: https://github.com/azuwis/pianotrans/raw/master/test/cut_liszt.opus
@azuwis
Contributor Author

azuwis commented Mar 9, 2023

Please review the new changes.

@azuwis
Contributor Author

azuwis commented Mar 10, 2023

@azuwis amazing! To finish the comparison against torch-bin, can you also add a version with

        blas = prev.blas.override {
          blasProvider = final.mkl;
        };

I've tried overriding blas in https://github.com/azuwis/pianotrans/blob/61b1e61b9eaea4f926e9bf4f265bdb826e1f11da/flake.nix#L15-L17, but the build failed on both x86_64-linux and x86_64-darwin: https://github.com/azuwis/pianotrans/actions/runs/4382622983/jobs/7671881870

@SomeoneSerge
Contributor

but build failed on both x86_64-linux and x86_64-darwin

Turns out that unless I'm logged in, I can't see the workflow logs in the browser. But I see that you override blas globally. Just in case the build fails before torch (in some of the dependencies): it could be sufficient for this test to pass the mkl blas only to torch (we did that once on x86 Linux in the linked issue, and it worked).
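A torch-only override of that shape could look like the following (a sketch; it reuses the packageOverrides pattern from the test script earlier in the thread, so only torch is rebuilt against the mkl-backed blas and an mkl-incompatible dependency cannot break the build):

```nix
# Sketch: pass an mkl-backed blas to torch only, not to the whole
# package set; everything else keeps the default blas provider.
let
  pkgs = import <nixpkgs> { config.allowUnfree = true; };
  python = pkgs.python3.override {
    packageOverrides = self: super: {
      torch = super.torch.override {
        blas = pkgs.blas.override { blasProvider = pkgs.mkl; };
      };
    };
  };
in
pkgs.pianotrans.override { python3 = python; }
```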

@azuwis
Contributor Author

azuwis commented Mar 10, 2023

@wegank Can you please have a look at this PR? I think it's ready.

@SomeoneSerge
Contributor

override blas

Ok, AFAIU mkl is simply broken on Darwin. No reason not to merge this right away

@wegank
Member

wegank commented Mar 14, 2023

@ofborg build python3Packages.torch

@wegank wegank merged commit 1474943 into NixOS:master Mar 14, 2023
@azuwis
Contributor Author

azuwis commented Mar 15, 2023

@SomeoneSerge I can confirm that the performance difference between torch and torch-bin on Linux is due to mkl.

I did a test on x86_64-linux.

Torch:

$ nix-shell -p pianotrans
$ pianotrans ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus | grep 'Transcribe time:'
Transcribe time: 38.323 s
Transcribe time: 38.269 s
Transcribe time: 38.305 s

Torch-bin:

$ nix-shell -p '(pianotrans.override { python3 = python3.override { packageOverrides = self: super: { torch = super.torch-bin; }; }; })'
$ pianotrans --cli ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus | grep 'Transcribe time:'
Transcribe time: 27.919 s
Transcribe time: 27.461 s
Transcribe time: 27.497 s

Torch with mkl by overriding blasProvider = mkl:

$ nix-shell -p 'with import <nixpkgs> { config.allowUnfree = true;}; (pianotrans.override { python3 = python3.override { packageOverrides = self: super: { torch = super.torch.override { blas = blas.override { blasProvider = mkl; }; } ; }; }; })'
$ pianotrans --cli ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus | grep 'Transcribe time:'
Transcribe time: 27.646 s
Transcribe time: 27.868 s
Transcribe time: 27.610 s

Torch with mkl by using LD_LIBRARY_PATH:

$ NIXPKGS_ALLOW_UNFREE=1 nix build --impure --no-link --print-out-paths nixpkgs#mkl 
/nix/store/5aayf907z0v4z5b6bx4z7j5gg9xh4cz6-mkl-2023.0.0.25398
$ LD_LIBRARY_PATH=/nix/store/5aayf907z0v4z5b6bx4z7j5gg9xh4cz6-mkl-2023.0.0.25398/lib nix-shell -p pianotrans
$ pianotrans ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus | grep 'Transcribe time:'
Transcribe time: 38.585 s
Transcribe time: 38.662 s
Transcribe time: 38.816 s

The LD_LIBRARY_PATH method mentioned in the nixpkgs documentation:
This overlay uses Intel's MKL library for both BLAS and LAPACK interfaces. Note that the same can be accomplished at runtime using `LD_LIBRARY_PATH` of `libblas.so.3` and `liblapack.so.3`. For instance:
```ShellSession
$ LD_LIBRARY_PATH=$(nix-build -A mkl)/lib${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH nix-shell -p octave --run octave
```
does not work for torch.
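A plausible explanation, and a way to check it (a sketch; the library path is illustrative): the LD_LIBRARY_PATH trick only works when a program resolves BLAS through the generic libblas.so.3 soname at load time, so if torch's libtorch_cpu.so links its BLAS statically or under a different name, the substitution has no effect.

```ShellSession
$ torch=$(nix-build '<nixpkgs>' -A python3Packages.torch --no-out-link)
$ ldd $torch/lib/python3.*/site-packages/torch/lib/libtorch_cpu.so \
    | grep -i 'blas\|lapack'
# If libblas.so.3 does not appear here, replacing it via LD_LIBRARY_PATH
# cannot redirect torch's BLAS calls.
```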

Is there a way to use torch with mkl without recompiling?

Another thing I noticed is that mkl in nixpkgs is not distributed because it's unfree.

Yes, mkl is unfree, but redistribution is allowed according to https://www.intel.com/content/www/us/en/developer/articles/tool/onemkl-license-faq.html:

Can I redistribute oneMKL?
Yes, redistribution is allowed per the terms of the ISSL.

Debian also redistributes mkl https://packages.debian.org/source/sid/intel-mkl (in non-free section). I wonder if nixpkgs can also redistribute mkl and make using torch a better experience.

@SomeoneSerge
Contributor

Torch with mkl by using LD_LIBRARY_PATH

I would assume torch links bits of MKL statically, and there are probably more differences.

I wonder if nixpkgs can also redistribute mkl and make using torch a better experience

Wouldn't that be just awesome! This has been a recurring subject. What's available right now is community-run binary caches (cf. the cachix instructions).

Yes, mkl is unfree, but redistribution is allowed

Yes... One (though not the only) nitpick with redistribution is that we use patchelf, which technically modifies the binaries. That means someone has to do the lawyer's work, and the foundation has to take a risk, before we get an official unfree binary cache. For that, in turn, to happen we have to overcome a certain inertia. I do agree that an unfree cache is worth the investment, and that the lack of one is a hindrance to the project's wider acceptance.

For what people say about this cf. e.g. https://discourse.nixos.org/t/petition-to-build-and-cache-unfree-packages-on-cache-nixos-org/17440

@azuwis azuwis deleted the torch branch April 11, 2023 06:19