SIGILL in pytorch #115425

Closed
roberth opened this issue Mar 8, 2021 · 5 comments

Comments

@roberth
Member

roberth commented Mar 8, 2021

Describe the bug

wendy cannot build pytorch-lightning on staging-20.09 because of:

/nix/store/3vvij73hlqny9lyv4bal1kaxsaqirwzv-python-imports-check-hook.sh/nix-support/setup-hook: line 9:   138 Illegal instruction     (core dumped) /nix/store/xibghivp12jgk2xrwykpxxhy8wbmr5zi-python3-3.8.8/bin/python3.8 -c 'import os; import importlib; list(map(lambda mod: importlib.import_module(mod), os.environ["pythonImportsCheck"].split()))'

This indicates that pytorch-lightning, or one of its dependencies, ships a binary containing instructions that wendy's CPU does not support.
It could be caused by a wrong compiler flag or by assembly code that runs unconditionally, without a CPU feature check.

See https://hydra.nixos.org/build/138503219/nixlog/1
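
A crash like the one in this log can be told apart from an ordinary import error by running the import in a child process and inspecting its exit status. The following is a minimal sketch, not the actual nixpkgs hook; the helper name `import_status` is hypothetical.

```python
import signal
import subprocess
import sys


def import_status(module: str) -> str:
    """Import `module` in a child interpreter and classify the outcome.

    Hypothetical helper for illustration; not part of nixpkgs.
    """
    proc = subprocess.run(
        [sys.executable, "-c",
         f"import importlib; importlib.import_module({module!r})"],
        capture_output=True,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was
        # terminated by that signal number (e.g. SIGILL).
        return signal.Signals(-proc.returncode).name
    # Positive exit code: a regular Python exception such as ImportError.
    return "error"
```

On hardware like wendy, `import_status("torch")` would be expected to report `SIGILL` rather than `error`, which separates an unsupported-instruction crash from a packaging mistake.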

pytorch-metric-learning fails in a similar way:

WARNING: Testing via this command is deprecated and will be removed in a future version. Users looking for a generic test entry point independent of test runner are encouraged to use tox.
running egg_info
writing pytorch_metric_learning.egg-info/PKG-INFO
writing dependency_links to pytorch_metric_learning.egg-info/dependency_links.txt
writing requirements to pytorch_metric_learning.egg-info/requires.txt
writing top-level names to pytorch_metric_learning.egg-info/top_level.txt
reading manifest file 'pytorch_metric_learning.egg-info/SOURCES.txt'
writing manifest file 'pytorch_metric_learning.egg-info/SOURCES.txt'
running build_ext
ERROR:root:The testing module requires faiss. You can install the GPU version with the command 'conda install faiss-gpu -c pytorch' 
                        or the CPU version with 'conda install faiss-cpu -c pytorch'. Learn more at https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
/nix/store/m2b73qq2wmxwmy6anb08nm0pxj1b2ahb-setuptools-check-hook/nix-support/setup-hook: line 4:   270 Illegal instruction     (core dumped) /nix/store/xibghivp12jgk2xrwykpxxhy8wbmr5zi-python3-3.8.8/bin/python3.8 nix_run_setup test

https://hydra.nixos.org/build/138492596/nixlog/1

Perhaps a good next step is to check if pytorch itself is binary reproducible on different hardware.
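
One possible way to run that check, sketched under the assumption that the build outputs from two machines have been copied into local directories: compute a digest over every file in each tree and compare the results. `tree_digest` is a hypothetical helper, not a Nix tool.

```python
import hashlib
from pathlib import Path


def tree_digest(root: str) -> str:
    """Return a SHA-256 digest over the relative paths and contents
    of all regular files under `root` (hypothetical helper)."""
    h = hashlib.sha256()
    base = Path(root)
    # Sort so the digest does not depend on directory listing order.
    # Note: is_file() follows symlinks; dangling symlinks are skipped.
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        h.update(path.relative_to(base).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()
```

If the digests of the two trees differ, the build is probably picking up properties of the build machine, for example through CPU detection at compile time.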

To Reproduce

Build python38Packages.pytorch-lightning on wendy or similar hardware, nixpkgs da85159

Expected behavior

Build succeeds.

Notify maintainers

@tbenst
@bcdarwin
@danieldk
@teh
@thoughtpolice
@tscholak

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
 - python38Packages.pytorch-lightning
 - python38Packages.pytorch-metric-learning
# a list of nixos modules affected by the problem
module:
@danieldk
Contributor

danieldk commented Mar 8, 2021

The problem is that some of the Hydra machines are Opteron 6100 machines that do not even support SSE4.1.

I don't think it is feasible to compile PyTorch for a pre-SSE4.1 baseline, since at least one of its dependencies (oneDNN) does not support pre-SSE4.2 machines (or at least it did not the last time I tried to fix SIGILL in the oneDNN tests). Also, compiling machine learning libraries and their dependencies against such an old baseline would be pretty detrimental. (Of course, many of these libraries select kernels dynamically based on the instruction set.)

I think the preferred solution would be to be able to exclude these Hydra builders based on the platform. I have recently submitted a PR to Nix to add the new platform levels defined in the x86_64 ELF ABI as extra platforms, but I haven't had time yet to look at how this could be adopted in nixpkgs.
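
For illustration, here is a rough sketch of how a machine could be classified into those levels from its Linux CPU flags. The flag sets are my own mapping of the psABI feature lists onto `/proc/cpuinfo` names and should be treated as an approximation (for example, `xsave` stands in for OSXSAVE); the function names are hypothetical and this is not what the Nix PR implements.

```python
# Features each level adds on top of the previous one, using
# /proc/cpuinfo flag names (approximate mapping, see note above).
LEVELS = [
    ("x86-64-v2", {"cx16", "lahf_lm", "popcnt", "pni",
                   "sse4_1", "sse4_2", "ssse3"}),
    ("x86-64-v3", {"avx", "avx2", "bmi1", "bmi2", "f16c",
                   "fma", "abm", "movbe", "xsave"}),
    ("x86-64-v4", {"avx512f", "avx512bw", "avx512cd",
                   "avx512dq", "avx512vl"}),
]


def x86_64_level(flags) -> str:
    """Return the highest level whose requirements, and those of
    all lower levels, are contained in `flags`."""
    have = set(flags)
    level = "x86-64"  # baseline that every x86_64 CPU satisfies
    for name, required in LEVELS:
        if not required <= have:
            break
        level = name
    return level


def read_cpu_flags(path: str = "/proc/cpuinfo") -> set:
    """Read the flag set of the first CPU listed (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()
```

On a Linux builder, `x86_64_level(read_cpu_flags())` gives the level. A CPU without `sse4_1`, like the Opteron 6100 machines, stays at the baseline `x86-64` level, so a package that requires `x86-64-v2` could in principle be kept off such builders.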

@domenkozar
Member

NixOS/infra#146 has been resolved, so this should be fixed now.

@mweinelt
Member

This is still relevant today, as wendy is yet again failing with SIGILL on pytorch in release-21.05.

https://hydra.nixos.org/build/149565546

@mweinelt mweinelt reopened this Aug 12, 2021
domenkozar referenced this issue Aug 17, 2021
This compiles in usually about 2h15m with a 2-core build, but about 10m
on a big-parallel machine.
@vcunat
Member

vcunat commented Apr 10, 2022

Generally it's nice when stuff works even on older HW, but Opteron machines surely won't be coming back to hydra.nixos.org anymore.

@vcunat vcunat closed this as completed Apr 10, 2022
@mweinelt
Member

mweinelt commented Apr 10, 2022

Wendy and Ike are long gone, so closing this issue was overdue.
