Can't build Pytorch #89403

Closed
GuillaumeDesforges opened this issue Jun 3, 2020 · 6 comments

@GuillaumeDesforges
Contributor

Describe the bug
PyTorch fails to build with `cudaSupport = true`: the `pythonImportsCheckPhase` cannot import `torch`.

Executing pythonImportsCheckPhase
Check whether the following modules can be imported: torch
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <lambda>
  File "/nix/store/2dcsn57cgaxs92ha5swihrab0g3l2h6g-python3-3.7.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/nix/store/i6fkycsn1q8vk4190hslrvh256wilw5i-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/__init__.py", line 81, in <module>
    from torch._C import *
ImportError: /nix/store/i6fkycsn1q8vk4190hslrvh256wilw5i-python3.7-pytorch-1.4.1/lib/python3.7/site-packages/torch/lib/libtorch_python.so: undefined symbol: _ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t
builder for '/nix/store/lia86jj07nygz23ca3s1gxahx62ipls2-python3.7-pytorch-1.4.1.drv' failed with exit code 1
cannot build derivation '/nix/store/ia0jgpq5w2myqpp8zha2xj7h3f5cj5nn-python3-3.7.7-env.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/hxp00cp2lvhhw677d1v8zbp5iv3cqpy1-home-manager-path.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/gcpmscdc0d5h3fg5m6ik411q6hhsa0mz-home-manager-generation.drv': 1 dependencies couldn't be built
error: build of '/nix/store/gcpmscdc0d5h3fg5m6ik411q6hhsa0mz-home-manager-generation.drv' failed
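The failing `pythonImportsCheckPhase` tries to import each listed module (here `torch`) after the build; the traceback above (a `<lambda>` calling `importlib.import_module`) is consistent with that. Below is a minimal Python sketch of an equivalent check. `check_imports` is a hypothetical name, and this is an approximation for local debugging, not the nixpkgs hook's actual code.

```python
import importlib


def check_imports(modules):
    """Try to import each module and return (name, message) pairs for failures.

    Hypothetical helper approximating the import check seen in the log;
    it is not the nixpkgs hook implementation.
    """
    failures = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError as exc:
            # A missing symbol in a compiled extension (e.g. torch._C) is
            # reported by the dynamic loader and surfaces as an ImportError.
            failures.append((name, str(exc)))
    return failures
```

On a working build, `check_imports(["torch"])` should return an empty list; on the broken build above it would return the `undefined symbol` message for `torch`. The sketch collects every failure instead of stopping at the first one, which helps when checking several modules at once.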

To Reproduce
Steps to reproduce the behavior:

  1. Set .config/nixpkgs/config.nix to:
     {
       allowUnfree = true;
       cudaSupport = true;
     }
  2. Switch to nixpkgs version 20.09pre227577.135073a87b7
  3. Run nix-build '<nixpkgs>' -A python3Packages.pytorch

Expected behavior
PyTorch should build without failure.

Notify maintainers

@teh @thoughtpolice @tscholak

Metadata

$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.4.35, NixOS, 20.09pre227577.135073a87b7 (Nightingale)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.3.5`
 - channels(theo): `"home-manager"`
 - channels(root): `"nixos-20.09pre227577.135073a87b7"`
 - channels(arsleust): `"home-manager"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
 - nvidia-smi: working (NVIDIA-SMI 440.82, Driver Version: 440.82, CUDA Version: 10.2)

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
GuillaumeDesforges added the "0.kind: bug" label on Jun 3, 2020
@evanjs
Member

evanjs commented Jun 7, 2020

I wonder if this is the same kind of issue I was seeing during my many nixpkgs-review runs for #77714.
Attaching the error I encountered, just in case:
python37Packages.pytorchWithCuda.log

There were also some failures in a nixpkgs-review log that @jonringer posted, but I wasn't sure what to make of them, or whether they were just more builds failing when built with 128 cores, as mentioned here.

@GuillaumeDesforges
Contributor Author

Your log (python37Packages.pytorchWithCuda.log) gives exactly the same error for the same symbol...

Since you now say:

> Looks like pytorch and batchgenerators build fine on master with Pillow 7.1.2.

I'll try to rerun a build now.
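Checking whether two build logs fail on "the same symbol" can be done mechanically by extracting the name from the loader's `undefined symbol:` message. A minimal Python sketch, assuming the logs are available as plain text; `undefined_symbols` and `same_missing_symbols` are hypothetical helper names.

```python
import re

# The dynamic loader reports missing symbols as "undefined symbol: <name>".
UNDEFINED_SYMBOL = re.compile(r"undefined symbol: (\S+)")


def undefined_symbols(log_text):
    """Return the set of undefined-symbol names mentioned in a build log."""
    return set(UNDEFINED_SYMBOL.findall(log_text))


def same_missing_symbols(log_a, log_b):
    """True if both logs report at least one undefined symbol and the sets match."""
    a = undefined_symbols(log_a)
    b = undefined_symbols(log_b)
    return bool(a) and a == b
```

Passing the text of the two logs (for example, the output above and python37Packages.pytorchWithCuda.log) to `same_missing_symbols` returns `True` when both fail on `_ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t` alone.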

@evanjs
Member

evanjs commented Jun 7, 2020

> Your log (python37Packages.pytorchWithCuda.log) gives exactly the same error for the same symbol...
>
> So if you now say:
>
> Looks like pytorch and batchgenerators build fine on master with Pillow 7.1.2.
>
> I'll try to rerun a build now

I need to stop posting so late.
I did not realize the same symbol was there.

And yes, everything seemed to build fine on Hydra.
I suppose this might be an issue with the build environment, though I can't imagine what that might be.

@GuillaumeDesforges
Contributor Author

Same error on 20.09pre228204.467ce5a9f45

To my knowledge, Hydra builds PyTorch without CUDA; I believe @tbenst was working on a Hydra for CUDA-related libraries. Since `cuda` and `nccl` are part of the symbol name `_ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t`, I think this issue is related to CUDA.
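Decoding the symbol supports that reading. Under the Itanium C++ ABI, `_ZN ... E` wraps a nested name built from length-prefixed identifiers, followed by the parameter types; `c++filt` prints this one as `torch::cuda::nccl::detail::throw_nccl_error(ncclResult_t)`, a function in PyTorch's NCCL wrapper. Below is a deliberately minimal Python sketch (the name `demangle_nested` is hypothetical) that handles only this simple shape; `c++filt` remains the right tool for real symbols.

```python
import re


def demangle_nested(symbol):
    """Demangle a simple Itanium C++ ABI name: _ZN <source-name>+ E <source-name>*.

    Minimal sketch: handles a function in nested namespaces whose parameters
    are plain named types. It does not support templates, builtin types
    (i, v, ...), cv-qualifiers, or substitutions (S_, S0_, St).
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested-name symbol: " + symbol)
    pos = 3

    def read_source_name():
        # <source-name> ::= <decimal length> <identifier>
        nonlocal pos
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError(f"expected <length><identifier> at offset {pos}")
        length = int(match.group())
        start = pos + len(match.group())
        name = symbol[start:start + length]
        if len(name) != length:
            raise ValueError("truncated identifier")
        pos = start + length
        return name

    # Nested name: length-prefixed components up to the terminating "E".
    components = []
    while pos < len(symbol) and symbol[pos] != "E":
        components.append(read_source_name())
    if pos >= len(symbol):
        raise ValueError("missing 'E' terminator")
    pos += 1  # skip "E"

    # Parameter types, assumed here to be plain length-prefixed type names.
    params = []
    while pos < len(symbol):
        params.append(read_source_name())

    return "::".join(components) + "(" + ", ".join(params) + ")"
```

For example, `demangle_nested("_ZN5torch4cuda4nccl6detail16throw_nccl_errorE12ncclResult_t")` yields `torch::cuda::nccl::detail::throw_nccl_error(ncclResult_t)`.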

@GuillaumeDesforges
Contributor Author

OK, I have found pytorch/pytorch#32638.

It was fixed by pytorch/pytorch@58cffbf, which is part of milestone 1.4.1 (https://github.com/pytorch/pytorch/milestone/14), but 1.5.0 seems to be on the way and the issue is still not closed.

Anyway, I will investigate upstream and open a PR to update the package accordingly.

jonringer pushed a commit that referenced this issue Jun 8, 2020
Fixes previous bugs that required a patch
Fixes CUDA build, see #89403
@jonringer
Contributor

Since #89802 was merged and I'm able to build this on master:

[16:46:18] jon@nixos ~/projects/nixpkgs (master)
$ nix-build -A python3Packages.pytorch
/nix/store/nr2p2w6mrs8rqnj43ngzw3jq5gixsjai-python3.7-pytorch-1.5.0

I'm going to consider this issue resolved.

4 participants