Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python3Packages.pytorch: 1.2.0 -> 1.4.1, python3Packages.ignite: 0.2.1 -> 0.3.0 #75827

Merged
merged 4 commits into from May 9, 2020

Conversation

@stites
Copy link
Contributor

stites commented Dec 17, 2019

Motivation for this change

Update pytorch to 1.3.1. A detailed commit history can be seen here.
Check phase verified on python36.pytorch, python36.pytorchWithMkl, python36.pytorchWithCuda10, python36.pytorchWithCuda10Mkl.

Cachix pending (my machine with the keys is down).

Relevant changelog:

  • buildDocs flag added
  • buildNamedTensor is now true by default
  • adds experimental useNixProtobuf but disables this functionality.
  • uses CMake directly for pytorch builds
To build

I believe the following should work:

git remote add --fetch stites git@github.com:stites/nixpkgs.git
git checkout --track stites pytorch-1.3.1
nix-shell -I nixpkgs=$(pwd) -p "python37.withPackages(ps: with ps; [ pytorch ])"
Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.
Notify maintainers

cc @teh @thoughtpolice @stites @tscholak @bhipple

@stites stites marked this pull request as ready for review Dec 18, 2019
@stites stites requested review from FRidh and jonringer as code owners Dec 18, 2019
@stites stites force-pushed the stites:pytorch-1.3.1 branch 2 times, most recently from cb1d7a0 to f138203 Dec 18, 2019
@stites stites force-pushed the stites:pytorch-1.3.1 branch from f138203 to 71c812e Dec 20, 2019
Copy link
Contributor Author

stites left a comment

Removed the nix-protobuf code (it will just live in pytorch-world for now). Rebuilding everything again -- I just had to deal with some faulty RAM slots so updates will have a slightly faster turnaround again.

@bcdarwin
Copy link
Member

bcdarwin commented Feb 2, 2020

fwiw, 1.4 is now out.

@danieldk
Copy link
Member

danieldk commented Feb 3, 2020

fwiw, 1.4 is now out.

There may soon also be a 1.4.1 release that is compatible with gcc 9:

pytorch/pytorch#32277

@stites stites force-pushed the stites:pytorch-1.3.1 branch from 71c812e to 9cc2d7d Feb 3, 2020
@stites stites changed the title pytorch: 1.2.0 -> 1.3.1 pytorch: 1.2.0 -> 1.4.0 Feb 3, 2020
@stites
Copy link
Contributor Author

stites commented Feb 3, 2020

@bcdarwin @danieldk -- I've bumped this PR to 1.4.0. I'm fairly confident this derivation remains bug-free since I ran the full test suite on the alpha branch a few days before the 1.4.0 release. Unfortunately, I don't have time to run the entire build matrix again.

@jonringer
Copy link
Contributor

jonringer commented Feb 3, 2020

@GrahamcOfBorg build python2Packages.pytorch
@GrahamcOfBorg build python3Packages.pytorch
@GrahamcOfBorg build python38Packages.pytorch

@jonringer
Copy link
Contributor

jonringer commented Feb 3, 2020

please address failures :(

@jonringer
Copy link
Contributor

jonringer commented Feb 3, 2020

looks like a lot of failures are related to -fpermissive. I forgot the exact steps, but you can lower it from an error to a warning.

../c10/util/LeftRight.h:67:10: error: ‘typename std::result_of<F(const T&)>::type c10::LeftRight<T>::read(F&&) const [with F = c10::Dispatcher::doCallUnboxedOnly(const c10::DispatchTable&, const c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction> >&, Args ...) const [with Return = std::tuple<at::Tensor, at::Tensor>; Args = {const at::Tensor&, long int, long int, bool, bool}]::<lambda(const ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>&)>; T = ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>; typename std::result_of<F(const T&)>::type = std::tuple<at::Tensor, at::Tensor>]’, declared using local type ‘c10::Dispatcher::doCallUnboxedOnly(const c10::DispatchTable&, const c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction> >&, Args ...) const [with Return = std::tuple<at::Tensor, at::Tensor>; Args = {const at::Tensor&, long int, long int, bool, bool}]::<lambda(const ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>&)>’, is used but never defined [-fpermissive]
../c10/util/LeftRight.h:67:10: error: ‘typename std::result_of<F(const T&)>::type c10::LeftRight<T>::read(F&&) const [with F = c10::Dispatcher::doCallUnboxed(const c10::DispatchTable&, const c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction> >&, Args ...) const [with Return = at::Tensor; Args = {const at::Tensor&, c10::Scalar, long int, c10::Scalar}]::<lambda(const ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>&)>; T = ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>; typename std::result_of<F(const T&)>::type = at::Tensor]’, declared using local type ‘c10::Dispatcher::doCallUnboxed(const c10::DispatchTable&, const c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction> >&, Args ...) const [with Return = at::Tensor; Args = {const at::Tensor&, c10::Scalar, long int, c10::Scalar}]::<lambda(const ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction>&)>’, is used but never defined [-fpermissive]
@bcdarwin
Copy link
Member

bcdarwin commented Feb 3, 2020

On x86 Ubuntu with Cuda support, the package builds but is broken:

$ nix-shell -I nixpkgs=~/nixpkgs -p 'python3.withPackages (ps: with ps; [ pytorch ])' --pure --run 'python3 "import torch"'
...
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/nix/store/3mkr7x854wslwpdl0zv9kg4k1k8icirc-python3-3.7.5-env/lib/python3.7/site-packages/torch/__init__.py", line 81, in <module>
    from torch._C import *
ImportError: /nix/store/3mkr7x854wslwpdl0zv9kg4k1k8icirc-python3-3.7.5-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t

Possibly setting USE_SYSTEM_NCCL=true doesn't work with the current NCCL/Pytorch.

Seems fine without Cuda but didn't run the test suite.

@bhipple
Copy link
Contributor

bhipple commented May 3, 2020

@GrahamcOfBorg build python37Packages.pytorch

@stites I pushed an update to your PR with the following changes:

  • Pass blas.provider into buildInputs, so that CMake can find the actual
    mkl for inspection of its cmake files and headers.

  • Add USE_MKL correctly when the blas provider is mkl.

  • Use the MKLDNN and MKLDNN_CBLAS flags by default, since mkldnn is FOSS and
    always available..

  • Remove a patch for MKL 2019, since we've moved to 2020.

  • Set doCheck to true; this does not actually appear to take as long as it
    used to. (10-15 minutes for me.)

  • Removed some unused variables at the top of the file

I've managed to build this successfully with both the FOSS stack and with mkl as the blas provider; take a look and LMK what you think.

@bhipple
bhipple approved these changes May 3, 2020
Copy link
Contributor

jonringer left a comment

not sure about regressions, but at the very least, we should disable for python38

builder for '/nix/store/a70hd44hmh22wxsr0w7sl3wnrfgfxpjb-python3.7-ignite-0.2.1.drv' failed with exit code 1; last 10 log lines:
  tests/ignite/contrib/handlers/test_param_scheduler.py::test_lr_scheduler
  tests/ignite/contrib/handlers/test_param_scheduler.py::test_lr_scheduler
  tests/ignite/contrib/handlers/test_param_scheduler.py::test_lr_scheduler
  tests/ignite/contrib/handlers/test_param_scheduler.py::test_lr_scheduler
  tests/ignite/contrib/handlers/test_param_scheduler.py::test_lr_scheduler
    /nix/store/m4i2i1ax2ch5y1ql8ng7f014mrmk63ja-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:342: DeprecationWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
      "please use `get_last_lr()`.", DeprecationWarning)

  -- Docs: https://docs.pytest.org/en/latest/warnings.html
  ===== 4 failed, 275 passed, 4 skipped, 72 deselected, 20 warnings in 6.45s =====
builder for '/nix/store/zyxn5nxx22jwdj5xhvxx173ngkms6ryi-python3.8-pytorch-1.4.1.drv' failed with exit code 1; last 10 log lines:
  ----------------------------------------------------------------------
  Ran 2276 tests in 93.630s

  FAILED (failures=1, skipped=66, expected failures=1)
  Traceback (most recent call last):
    File "test/run_test.py", line 455, in <module>
      main()
    File "test/run_test.py", line 448, in main
      raise RuntimeError(message)
  RuntimeError: test_jit failed!
cannot build derivation '/nix/store/iraap11hf3k4yax6bzpmnmrif3pzli5l-python3.8-ignite-0.2.1.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/8qdsp4ga63vmwbq8w0b1vf33avb848rx-python3.8-tensorly-0.4.5.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/qkmv9p848yh1l62cn9yy3wb6id0hh1ac-python3.8-torchvision-0.2.1.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/11waqaqv5xq4276f1npv1dmpj31hgprx-python3.8-pywick-0.5.6.drv': 2 dependencies couldn't be built
builder for '/nix/store/c8jvyrqblrk4a95rlnv5ji142d7sv33m-python3.7-pytorch-1.4.1.drv' failed with exit code 1; last 10 log lines:
  wrapping `/nix/store/28cbbgvg7mm81l4q7qac7pmmqycjizj2-python3.7-pytorch-1.4.1/bin/convert-caffe2-to-onnx'...
  Executing pythonRemoveTestsDir
  Finished executing pythonRemoveTestsDir
  running install tests
  Traceback (most recent call last):
    File "test/run_test.py", line 14, in <module>
      import torch
    File "/nix/store/28cbbgvg7mm81l4q7qac7pmmqycjizj2-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/__init__.py", line 81, in <module>
      from torch._C import *
  ImportError: /nix/store/28cbbgvg7mm81l4q7qac7pmmqycjizj2-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t
builder for '/nix/store/f2lxq1zcw1zbc4q3snnskdaadm5vyw2x-python3.8-pytorch-1.4.1.drv' failed with exit code 1; last 10 log lines:
  wrapping `/nix/store/q9diyfdsl34cy1iqcnkg7c6g650aif52-python3.8-pytorch-1.4.1/bin/convert-caffe2-to-onnx'...
  Executing pythonRemoveTestsDir
  Finished executing pythonRemoveTestsDir
  running install tests
  Traceback (most recent call last):
    File "test/run_test.py", line 14, in <module>
      import torch
    File "/nix/store/q9diyfdsl34cy1iqcnkg7c6g650aif52-python3.8-pytorch-1.4.1/lib/python3.8/site-packages/torch/__init__.py", line 81, in <module>
      from torch._C import *
  ImportError: /nix/store/q9diyfdsl34cy1iqcnkg7c6g650aif52-python3.8-pytorch-1.4.1/lib/python3.8/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t
cannot build derivation '/nix/store/rqcqydv29f29gn957kfb5vysx4ppgnsv-env.drv': 8 dependencies couldn't be built
[0 built (4 failed), 0.0 MiB DL]
error: build of '/nix/store/rqcqydv29f29gn957kfb5vysx4ppgnsv-env.drv' failed
https://github.com/NixOS/nixpkgs/pull/75827
2 package marked as broken and skipped:
python37Packages.pyro-ppl python38Packages.pyro-ppl

8 package failed to build:
python37Packages.ignite python37Packages.pytorchWithCuda python38Packages.ignite python38Packages.pytorch python38Packages.pytorchWithCuda python38Packages.pywick python38Packages.tensorly python38Packages.torchvision

4 package built:
python37Packages.pytorch python37Packages.pywick python37Packages.tensorly python37Packages.torchvision
@bhipple bhipple force-pushed the stites:pytorch-1.3.1 branch from 8f67760 to 1908c70 May 5, 2020
@bhipple
Copy link
Contributor

bhipple commented May 5, 2020

Hmm, I think I'll go back to just not running the test suite -- particularly since users of MKL may be running without a binary cache and need to wait for the package to build themselves.

@jonringer
Copy link
Contributor

jonringer commented May 5, 2020

or just reduce the checkphase to something simple. I would be fine with some pythonImportsCheck = [ "torch" ];.

As long as the maintainer verified that it worked for more in-depth cases locally

@bhipple bhipple force-pushed the stites:pytorch-1.3.1 branch from 1908c70 to 5d0e295 May 5, 2020
@stites
Copy link
Contributor Author

stites commented May 8, 2020

👍 thanks for bumping this pr @bhipple !

stites and others added 2 commits Dec 17, 2019
Co-authored-by: Benjamin Hipple <bhipple@protonmail.com>
- Pass `blas.provider` into `buildInputs`, so that CMake can find the actual
  `mkl` for inspection of its cmake files and headers.

- Add `USE_MKL` correctly when the blas provider is `mkl`.

- Use the MKLDNN and MKLDNN_CBLAS flags by default, since `mkldnn` is FOSS and
  always available..

- Remove a patch for MKL 2019, since we've moved to 2020.

- Add a pythonImportsCheck for "torch" as a basic sanity-check

- Removed some unused variables at the top of the file
@bhipple
Copy link
Contributor

bhipple commented May 9, 2020

Result of nixpkgs-review pr 75827 1

2 packages marked as broken and skipped:
- python37Packages.pyro-ppl
- python38Packages.pyro-ppl
4 packages failed to build:
- python37Packages.ignite
- python37Packages.pytorchWithCuda
- python38Packages.ignite
- python38Packages.pytorchWithCuda
8 packages built:
- python37Packages.pytorch (python37Packages.pytorchWithoutCuda)
- python37Packages.pywick
- python37Packages.tensorly
- python37Packages.torchvision
- python38Packages.pytorch (python38Packages.pytorchWithoutCuda)
- python38Packages.pywick
- python38Packages.tensorly
- python38Packages.torchvision
@bhipple bhipple force-pushed the stites:pytorch-1.3.1 branch from 5d0e295 to 07dd250 May 9, 2020
@bhipple bhipple changed the title pytorch: 1.2.0 -> 1.4.1 python3Packages.pytorch: 1.2.0 -> 1.4.1, python3Packages.ignite: 0.2.1 -> 0.3.0 May 9, 2020
@bhipple bhipple force-pushed the stites:pytorch-1.3.1 branch from 07dd250 to 47105a1 May 9, 2020
@bhipple
Copy link
Contributor

bhipple commented May 9, 2020

pytorchWithCuda is broken on master; help requested, but since it's not a regression relative to this PR, and since this PR has significant improvements to pytorch, I propose merging as-is and fixing cuda in another PR.

ignite required an upgrade to keep it building.

@GrahamcOfBorg eval

@FRidh
FRidh approved these changes May 9, 2020
@bhipple
Copy link
Contributor

bhipple commented May 9, 2020

One last change: in #85839 I changed the name of dnnl to oneDNN to track an upstream rename. That PR has a back-compat alias as dnnl, but it looks like OfBorg builds its eval without aliases (which is probably a good thing in general).

Updated pytorch to reference dnnl by the new name, but made no other material change.

@ofborg ofborg bot requested a review from bcdarwin May 9, 2020
@bhipple
Copy link
Contributor

bhipple commented May 9, 2020

Result of nixpkgs-review pr 75827 1

2 packages failed to build:
- python37Packages.pytorchWithCuda
- python38Packages.pytorchWithCuda
10 packages built:
- python37Packages.ignite
- python37Packages.pytorch (python37Packages.pytorchWithoutCuda)
- python37Packages.pywick
- python37Packages.tensorly
- python37Packages.torchvision
- python38Packages.ignite
- python38Packages.pytorch (python38Packages.pytorchWithoutCuda)
- python38Packages.pywick
- python38Packages.tensorly
- python38Packages.torchvision
@bhipple bhipple merged commit 3d9f3c3 into NixOS:master May 9, 2020
14 checks passed
14 checks passed
Evaluation Performance Report Evaluator Performance Report
Details
grahamcofborg-eval ^.^!
Details
grahamcofborg-eval-check-maintainers matching changed paths to changed attrs...
Details
grahamcofborg-eval-check-meta config.nix: checkMeta = true
Details
grahamcofborg-eval-darwin nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ecef297"; rev="ecef2975e3420b560b017c8b856313df6556e556"; } ./pkgs/t
Details
grahamcofborg-eval-lib-tests nix-build --arg pkgs import ./. {} ./lib/tests/release.nix
Details
grahamcofborg-eval-nixos nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ecef297"; rev="ecef2975e3420b560b017c8b856313df6556e556"; } ./nixos/
Details
grahamcofborg-eval-nixos-manual nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ecef297"; rev="ecef2975e3420b560b017c8b856313df6556e556"; } ./nixos/
Details
grahamcofborg-eval-nixos-options nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ecef297"; rev="ecef2975e3420b560b017c8b856313df6556e556"; } ./nixos/
Details
grahamcofborg-eval-nixpkgs-manual nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ecef297"; rev="ecef2975e3420b560b017c8b856313df6556e556"; } ./pkgs/t
Details
grahamcofborg-eval-nixpkgs-tarball nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ecef297"; rev="ecef2975e3420b560b017c8b856313df6556e556"; } ./pkgs/t
Details
grahamcofborg-eval-nixpkgs-unstable-jobset nix-instantiate --arg nixpkgs { outPath=./.; revCount=999999; shortRev="ecef297"; rev="ecef2975e3420b560b017c8b856313df6556e556"; } ./pkgs/t
Details
grahamcofborg-eval-package-list nix-env -qa --json --file .
Details
grahamcofborg-eval-package-list-no-aliases nix-env -qa --json --file . --arg config { allowAliases = false; }
Details
@bhipple bhipple mentioned this pull request May 9, 2020
0 of 10 tasks complete
@Zhen-hao
Copy link

Zhen-hao commented May 9, 2020

great work! thank you all!
I've been waiting for this

@Zhen-hao
Copy link

Zhen-hao commented May 10, 2020

what is the best way to get pytorchWithCuda working on NixOS?
currently, the build fails with

    from torch._C import *
ImportError: /nix/store/a29i85licwqsbvysansiyd67zwlkianb-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t

for now I will try to follow #75827 (comment)

@jonringer
Copy link
Contributor

jonringer commented May 10, 2020

that looks like a linking error to cuda. Not super familiar with the cuda toolchain to know where an assumption is being broken

@bhipple
Copy link
Contributor

bhipple commented May 10, 2020

If you manage to get it working, please do send a PR!

@bhipple
Copy link
Contributor

bhipple commented May 12, 2020

Interesting note: when we upgraded pytorch from 1.0.0 -> 1.2.0, we must've leaked a proprietary dependency into the default expression, because Hydra stopped building it. That's now been fixed with this PR, so we once again have binary cache builds of the default pytorch:

https://hydra.nixos.org/job/nixpkgs/trunk/python38Packages.pytorch.x86_64-linux/all

@Zhen-hao
Copy link

Zhen-hao commented May 16, 2020

what is the best way to get pytorchWithCuda working on NixOS?
currently, the build fails with

    from torch._C import *
ImportError: /nix/store/a29i85licwqsbvysansiyd67zwlkianb-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t

for now I will try to follow #75827 (comment)

I tried a few things but couldn't make it work with the current version.
However, it builds with CUDA when I change the version to 1.5.0 and remove the patches list. I tested it with the examples in transformers' quickstart guide and it worked fine with my GPU. My build uses cudnn_cudatoolkit_10_2.
I think it was bug in pytorch 1.4.1 and they fixed in 1.5.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.