
python3Packages.torch: Fix performance problem on darwin #220017

Merged
merged 2 commits into from Mar 14, 2023

Conversation

@azuwis
Contributor

@azuwis azuwis commented Mar 7, 2023

Description of changes

Enable Accelerate.framework, disable mkldnn.

In my test on a MacBook Pro M1 2020, using pianotrans to transcribe
cut_liszt.opus, transcription time:

Without this patch: ~88s

Disable mkldnn: ~21s

Disable mkldnn and enable Accelerate.framework: ~9s

The final result is close to using torch-bin.

See also #219104
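For anyone who wants to try the effect locally before this lands, the flags can be toggled through an override along these lines (a sketch following the packageOverrides pattern used elsewhere in this thread; `mklDnnSupport` is a real argument of the torch derivation, the rest is ordinary nixpkgs plumbing):

```nix
# Sketch: a Python environment whose torch is built without mkldnn,
# mirroring the default this PR introduces for darwin.
let
  pkgs = import <nixpkgs> { };
  python = pkgs.python3.override {
    packageOverrides = self: super: {
      torch = super.torch.override { mklDnnSupport = false; };
    };
  };
in
python.withPackages (ps: [ ps.torch ])
```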

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.05 Release Notes (or backporting 22.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@azuwis azuwis mentioned this pull request Mar 7, 2023
@ofborg ofborg bot added the 6.topic: darwin Running or building packages on Darwin label Mar 7, 2023
@SomeoneSerge
Contributor

Awesome! So, was it CoreML and MLCompute that made the difference, compared to the previous run?

@azuwis
Contributor Author

azuwis commented Mar 7, 2023

Disabling mkldnn makes the big difference.

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-ready-for-review/3032/1917

@@ -1,6 +1,6 @@
 { stdenv, lib, fetchFromGitHub, fetchpatch, buildPythonPackage, python,
   cudaSupport ? false, cudaPackages, magma,
-  mklDnnSupport ? true, useSystemNccl ? true,
+  mklDnnSupport ? stdenv.isLinux, useSystemNccl ? true,
Contributor

Hmm, this poses the question of whether we should touch x86_64-darwin, if we were only looking at aarch64-darwin so far.

  1. Does x86_64-darwin lose performance too, with mkldnn on?
  2. If there's an Apple-specific library that provides the same functionality as mkldnn and that pytorch ends up consuming in your build, we should probably always use it.

I'm looking at 'wheel-py3_10-cpu-build/10_Build PyTorch binary.txt' from one of the x86 darwin actions upstream, they seem to have these flags set:

--   USE_MKL               : ON
--   USE_MKLDNN            : ON
--   USE_MKLDNN_ACL        : OFF
--   USE_MKLDNN_CBLAS      : OFF

Ultimately, if we can manually check that, with the rest of your change merged, mklDnnSupport = false doesn't harm performance on x86_64-darwin, I think we're good to keep isLinux.
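One quick way to verify which backends a given torch build actually enabled is to print its compile-time configuration: `torch.__config__.show()` lists the USE_MKLDNN flag and the BLAS backend. A sketch of the check (the nix-shell invocation is illustrative):

```ShellSession
$ nix-shell -p python3Packages.torch --run \
    'python -c "import torch; print(torch.__config__.show())"'
# Look for USE_MKLDNN and the BLAS/LAPACK backend (Accelerate on darwin)
# in the printed build flags.
```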

Contributor

Also, let's add a comment referring to this PR and explaining that MKLDNN negatively impacts performance on M1

Member

Yes, please add a comment otherwise this gets forgotten again

Contributor Author

Also, let's add a comment referring to this PR and explaining that MKLDNN negatively impacts performance on M1

Done in the new push.

@SomeoneSerge
Contributor

To anyone possessing an x86_64-darwin machine, I think we need to see the difference between:

❯ cat << EOF > pianotrans-tests.nix
let
  # github:azuwis/nixpkgs/62b4df51ee3c01968ac2ac30831b1f989ab917f9
  nixpkgs = builtins.getFlake "github:azuwis/nixpkgs/torch";
  pkgs = import nixpkgs { system = "x86_64-darwin"; };
  mkPy = mklDnnSupport: pkgs.python3.override
    {
      packageOverrides = self: super: { torch = super.torch.override { inherit mklDnnSupport; }; };
    };
in
{
  mklDnnOn = (pkgs.pianotrans.override {
    python3 = mkPy true;
  });
  mklDnnOff = (pkgs.pianotrans.override {
    python3 = mkPy false;
  });
}
EOF
❯ nix eval -f pianotrans-tests.nix mklDnnOff.outPath
"/nix/store/6zlk47rgzv6vck640vl5qi0fqmi4n18c-pianotrans-1.0.1"
❯ nix eval -f pianotrans-tests.nix mklDnnOn.outPath
"/nix/store/58mxcypwkn2idk89d6z9as66pm28j08h-pianotrans-1.0.1"
❯ wget https://github.com/azuwis/pianotrans/raw/master/test/cut_liszt.opus
❯ nix shell -f pianotrans-tests.nix mklDnnOn --command pianotrans cut_liszt.opus
...
Transcribe time: XX.XXX s
...
❯ nix shell -f pianotrans-tests.nix mklDnnOff --command pianotrans cut_liszt.opus
...
Transcribe time: YY.YYY s
...


@azuwis
Contributor Author

azuwis commented Mar 9, 2023

I've set up a GitHub Action to test x86_64-darwin: https://github.com/azuwis/pianotrans/actions/runs/4373108055/jobs/7650862384

The result:

Torch-bin:

Transcribe time: 30.519 s
Transcribe time: 30.065 s
Transcribe time: 29.004 s
Transcribe time: 28.713 s
Transcribe time: 28.911 s
Transcribe time: 45.366 s
Transcribe time: 101.648 s
Transcribe time: 70.682 s
Transcribe time: 70.235 s
Transcribe time: 26.910 s

Enable mklDnn, disable Accelerate.framework (current nixpkgs):

Transcribe time: 64.889 s
Transcribe time: 59.743 s
Transcribe time: 61.396 s
Transcribe time: 59.938 s
Transcribe time: 60.521 s
Transcribe time: 60.098 s
Transcribe time: 59.536 s
Transcribe time: 63.212 s
Transcribe time: 61.648 s
Transcribe time: 63.184 s

Disable mklDnn, enable Accelerate.framework:

Transcribe time: 54.248 s
Transcribe time: 52.206 s
Transcribe time: 51.521 s
Transcribe time: 54.340 s
Transcribe time: 54.419 s
Transcribe time: 50.019 s
Transcribe time: 50.763 s
Transcribe time: 47.536 s
Transcribe time: 47.214 s
Transcribe time: 48.254 s

Enable mklDnn, enable Accelerate.framework:

Transcribe time: 46.777 s
Transcribe time: 46.982 s
Transcribe time: 46.966 s
Transcribe time: 46.838 s
Transcribe time: 45.889 s
Transcribe time: 46.136 s
Transcribe time: 47.243 s
Transcribe time: 46.703 s
Transcribe time: 47.070 s
Transcribe time: 46.289 s

We should enable mklDnn and enable Accelerate.framework for x86_64-darwin, but torch performance is still not as good as torch-bin.

@SomeoneSerge
Contributor

@azuwis amazing! To finish the comparison against torch-bin, can you also add a version with

        blas = prev.blas.override {
          blasProvider = final.mkl;
        };

It negatively impacts performance; this is also what the official pytorch
build does.

In my test on a MacBook Pro M1 2020, using [pianotrans][1] to transcribe
[cut_liszt.opus][2], transcription time:

Enable mkldnn and disable Accelerate.framework: ~88s

Disable mkldnn and disable Accelerate.framework: ~21s

Disable mkldnn and enable Accelerate.framework: ~9s

The final result is close to using torch-bin.

See also NixOS#219104

[1]: https://github.com/azuwis/pianotrans
[2]: https://github.com/azuwis/pianotrans/raw/master/test/cut_liszt.opus
@azuwis
Contributor Author

azuwis commented Mar 9, 2023

Please review the new changes.

@azuwis
Contributor Author

azuwis commented Mar 10, 2023

@azuwis amazing! To finish the comparison against torch-bin, can you also add a version with

        blas = prev.blas.override {
          blasProvider = final.mkl;
        };

I've tried overriding blas in https://github.com/azuwis/pianotrans/blob/61b1e61b9eaea4f926e9bf4f265bdb826e1f11da/flake.nix#L15-L17, but the build failed on both x86_64-linux and x86_64-darwin: https://github.com/azuwis/pianotrans/actions/runs/4382622983/jobs/7671881870

@SomeoneSerge
Contributor

but build failed on both x86_64-linux and x86_64-darwin

Turns out that unless I'm logged in, I can't see the workflow logs in the browser. But I see that you override blas globally. Just in case the build fails before torch (in some of the dependencies): it could be sufficient for this test to pass the mkl blas only to torch (we did that once on x86 Linux in the linked issue, and it worked).
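A torch-only override of that shape could look like the following (a sketch; it reuses the packageOverrides pattern from the test script earlier in the thread, so only torch is rebuilt against the mkl-backed blas and an mkl-incompatible dependency cannot break the build):

```nix
# Sketch: pass an mkl-backed blas to torch only, not to the whole
# package set; everything else keeps the default blas provider.
let
  pkgs = import <nixpkgs> { config.allowUnfree = true; };
  python = pkgs.python3.override {
    packageOverrides = self: super: {
      torch = super.torch.override {
        blas = pkgs.blas.override { blasProvider = pkgs.mkl; };
      };
    };
  };
in
pkgs.pianotrans.override { python3 = python; }
```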

@azuwis
Contributor Author

azuwis commented Mar 10, 2023

@wegank Can you please have a look at this PR? I think it's ready.

@SomeoneSerge
Contributor

override blas

Ok, AFAIU mkl is simply broken on Darwin. No reason not to merge this right away

@wegank
Member

wegank commented Mar 14, 2023

@ofborg build python3Packages.torch

@wegank wegank merged commit 1474943 into NixOS:master Mar 14, 2023
@azuwis
Contributor Author

azuwis commented Mar 15, 2023

@SomeoneSerge I can confirm that the performance difference between torch and torch-bin on Linux is due to mkl.

I did a test on x86_64-linux.

Torch:

$ nix-shell -p pianotrans
$ pianotrans ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus | grep 'Transcribe time:'
Transcribe time: 38.323 s
Transcribe time: 38.269 s
Transcribe time: 38.305 s

Torch-bin:

$ nix-shell -p '(pianotrans.override { python3 = python3.override { packageOverrides = self: super: { torch = super.torch-bin; }; }; })'
$ pianotrans --cli ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus | grep 'Transcribe time:'
Transcribe time: 27.919 s
Transcribe time: 27.461 s
Transcribe time: 27.497 s

Torch with mkl by overriding blasProvider = mkl:

$ nix-shell -p 'with import <nixpkgs> { config.allowUnfree = true;}; (pianotrans.override { python3 = python3.override { packageOverrides = self: super: { torch = super.torch.override { blas = blas.override { blasProvider = mkl; }; } ; }; }; })'
$ pianotrans --cli ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus | grep 'Transcribe time:'
Transcribe time: 27.646 s
Transcribe time: 27.868 s
Transcribe time: 27.610 s

Torch with mkl by using LD_LIBRARY_PATH:

$ NIXPKGS_ALLOW_UNFREE=1 nix build --impure --no-link --print-out-paths nixpkgs#mkl 
/nix/store/5aayf907z0v4z5b6bx4z7j5gg9xh4cz6-mkl-2023.0.0.25398
$ LD_LIBRARY_PATH=/nix/store/5aayf907z0v4z5b6bx4z7j5gg9xh4cz6-mkl-2023.0.0.25398/lib nix-shell -p pianotrans
$ pianotrans ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus ~/src/pianotrans/test/cut_liszt.opus | grep 'Transcribe time:'
Transcribe time: 38.585 s
Transcribe time: 38.662 s
Transcribe time: 38.816 s

The LD_LIBRARY_PATH method mentioned in the nixpkgs documentation:
This overlay uses Intel's MKL library for both BLAS and LAPACK interfaces. Note that the same can be accomplished at runtime using `LD_LIBRARY_PATH` of `libblas.so.3` and `liblapack.so.3`. For instance:
```ShellSession
$ LD_LIBRARY_PATH=$(nix-build -A mkl)/lib${LD_LIBRARY_PATH:+:}$LD_LIBRARY_PATH nix-shell -p octave --run octave
```
does not work for torch.
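A plausible explanation, and a way to check it (a sketch; the library path is illustrative): the LD_LIBRARY_PATH trick only works when a program resolves BLAS through the generic libblas.so.3 soname at load time, so if torch's libtorch_cpu.so links its BLAS statically or under a different name, the substitution has no effect.

```ShellSession
$ torch=$(nix-build '<nixpkgs>' -A python3Packages.torch --no-out-link)
$ ldd $torch/lib/python3.*/site-packages/torch/lib/libtorch_cpu.so \
    | grep -i 'blas\|lapack'
# If libblas.so.3 does not appear here, replacing it via LD_LIBRARY_PATH
# cannot redirect torch's BLAS calls.
```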

Is there a way to use torch with mkl without recompiling?

Another thing I noticed is that mkl in nixpkgs is not distributed because it's unfree.

Yes, mkl is unfree, but redistribution is allowed according to https://www.intel.com/content/www/us/en/developer/articles/tool/onemkl-license-faq.html:

Can I redistribute oneMKL?
Yes, redistribution is allowed per the terms of the ISSL.

Debian also redistributes mkl https://packages.debian.org/source/sid/intel-mkl (in non-free section). I wonder if nixpkgs can also redistribute mkl and make using torch a better experience.

@SomeoneSerge
Contributor

Torch with mkl by using LD_LIBRARY_PATH

I would assume torch links bits of MKL statically, and there are probably more differences.

I wonder if nixpkgs can also redistribute mkl and make using torch a better experience

Wouldn't that be just awesome! This has been a recurring subject. What's available right now is community-run binary caches (cf. the cachix instructions).

Yes, mkl is unfree, but redistribution is allowed

Yes... One (though not the only) nitpick with redistribution is that we use patchelf, which technically modifies the binaries. That means someone has to do the lawyer's work, and the foundation has to take a risk, before we get an official unfree binary cache. For that, in turn, to happen we have to overcome a certain inertia. I do agree that an unfree cache is worth the investment, and that the lack of one is a hindrance to the project's wider acceptance.

For what people say about this cf. e.g. https://discourse.nixos.org/t/petition-to-build-and-cache-unfree-packages-on-cache-nixos-org/17440

@azuwis azuwis deleted the torch branch April 11, 2023 06:19