SIGILL in pytorch #115425

roberth · 2021-03-08T16:37:09Z

Describe the bug

wendy cannot build pytorch-lightning on staging-20.09 because of

/nix/store/3vvij73hlqny9lyv4bal1kaxsaqirwzv-python-imports-check-hook.sh/nix-support/setup-hook: line 9:   138 Illegal instruction     (core dumped) /nix/store/xibghivp12jgk2xrwykpxxhy8wbmr5zi-python3-3.8.8/bin/python3.8 -c 'import os; import importlib; list(map(lambda mod: importlib.import_module(mod), os.environ["pythonImportsCheck"].split()))'

This indicates that it, or a dependency, produces a binary that contains instructions that are not available on wendy.
It could be caused by a wrong compiler flag or some assembly code that runs unconditionally.

See https://hydra.nixos.org/build/138503219/nixlog/1
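A crash like the one in the log can be told apart from an ordinary import error by looking at how the child process died. The following is a minimal sketch, not part of the import check hook: the helper name `killed_by_sigill` is hypothetical, and it assumes a POSIX system, where `subprocess` reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys


def killed_by_sigill(argv):
    """Run argv as a child process; return True if it was killed by SIGILL.

    Hypothetical helper for illustration only. On POSIX, a return code of -N
    means the child was terminated by signal N.
    """
    result = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == -signal.SIGILL


if __name__ == "__main__":
    # Import the module in a fresh interpreter so that a crash does not take
    # down the calling process. The module name comes from the command line.
    module = sys.argv[1] if len(sys.argv) > 1 else "os"
    crashed = killed_by_sigill([sys.executable, "-c", f"import {module}"])
    print(f"{module}: {'SIGILL' if crashed else 'no SIGILL'}")
```

Running this on the affected builder with `torch` as the argument would be one way to confirm that the import alone triggers the illegal instruction.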

pytorch-metric-learning ends with a similar failure

WARNING: Testing via this command is deprecated and will be removed in a future version. Users looking for a generic test entry point independent of test runner are encouraged to use tox.
running egg_info
writing pytorch_metric_learning.egg-info/PKG-INFO
writing dependency_links to pytorch_metric_learning.egg-info/dependency_links.txt
writing requirements to pytorch_metric_learning.egg-info/requires.txt
writing top-level names to pytorch_metric_learning.egg-info/top_level.txt
reading manifest file 'pytorch_metric_learning.egg-info/SOURCES.txt'
writing manifest file 'pytorch_metric_learning.egg-info/SOURCES.txt'
running build_ext
ERROR:root:The testing module requires faiss. You can install the GPU version with the command 'conda install faiss-gpu -c pytorch' 
                        or the CPU version with 'conda install faiss-cpu -c pytorch'. Learn more at https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
/nix/store/m2b73qq2wmxwmy6anb08nm0pxj1b2ahb-setuptools-check-hook/nix-support/setup-hook: line 4:   270 Illegal instruction     (core dumped) /nix/store/xibghivp12jgk2xrwykpxxhy8wbmr5zi-python3-3.8.8/bin/python3.8 nix_run_setup test

https://hydra.nixos.org/build/138492596/nixlog/1

Perhaps a good next step is to check if pytorch itself is binary reproducible on different hardware.
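One rough way to do that comparison, assuming the same derivation has been built on two machines and both outputs have been copied to local directories, is to hash every regular file and list the ones that differ. This is only a sketch: the function names are hypothetical, symlinks are skipped, and file metadata is ignored.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def tree_digest(root: str) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its SHA-256 digest."""
    base = Path(root)
    digests = {}
    for path in sorted(base.rglob("*")):
        # Symlinks are skipped; only real file contents are compared.
        if path.is_file() and not path.is_symlink():
            digests[str(path.relative_to(base))] = hashlib.sha256(
                path.read_bytes()
            ).hexdigest()
    return digests


def differing_files(root_a: str, root_b: str) -> list[str]:
    """Return relative paths that are missing on one side or differ in content."""
    a, b = tree_digest(root_a), tree_digest(root_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty result from `differing_files` would suggest the build does not depend on the builder's CPU, which would point at unconditional use of newer instructions rather than build-time CPU detection.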

To Reproduce

Build python38Packages.pytorch-lightning on wendy or similar hardware, nixpkgs da85159

Expected behavior

Build succeeds.

Notify maintainers

@tbenst
@bcdarwin
@danieldk
@teh
@thoughtpolice
@tscholak

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
 - python38Packages.pytorch-lightning
 - python38Packages.pytorch-metric-learning
# a list of nixos modules affected by the problem
module:

danieldk · 2021-03-08T18:47:17Z

The problem is that some of the Hydra machines are Opteron 6100 machines that do not even support SSE4.1.

I don't think it is feasible to compile PyTorch for a pre-SSE4.1 baseline, since at least one of its dependencies (oneDNN) does not support pre-SSE4.2 machines (or at least it did not the last time I tried to fix SIGILL in the oneDNN tests). It would also be pretty detrimental for machine learning libraries to compile PyTorch and its dependencies against such an old baseline. (Of course, many of the libraries dynamically select kernels based on the instruction set.)

I think the preferred solution would be to exclude these Hydra builders based on the platform. I recently submitted a PR to Nix to add the new platform levels defined in the x86_64 ELF ABI as extra platforms, but I haven't had time yet to look at how this could be adopted in nixpkgs.
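To illustrate what such a platform level means for a given builder, here is a minimal sketch that classifies a Linux machine from the `flags` line of `/proc/cpuinfo`. It is not the Nix implementation. The mapping from the psABI feature lists to Linux flag names (for example `pni` for SSE3 and `abm` for LZCNT) is my assumption, and so are the function names.

```python
# Features each level adds, written as Linux /proc/cpuinfo flag names.
# Assumption: this mapping mirrors the x86-64 psABI microarchitecture levels.
V2_FLAGS = {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}
V3_FLAGS = {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"}
V4_FLAGS = {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}


def parse_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def x86_64_level(flags):
    """Return the highest level (1 to 4) whose requirements are all present.

    Level 1 is taken as the baseline that every x86_64 CPU satisfies.
    Levels are cumulative, so the scan stops at the first unmet level.
    """
    level = 1
    for candidate, required in ((2, V2_FLAGS), (3, V3_FLAGS), (4, V4_FLAGS)):
        if not required <= flags:
            break
        level = candidate
    return level


if __name__ == "__main__":
    with open("/proc/cpuinfo") as handle:
        print(f"x86-64-v{x86_64_level(parse_flags(handle.read()))}")
```

A CPU without `sse4_1`, as described above for the Opteron 6100 machines, comes out at level 1 here, so a package that requires level 2 or higher could be kept off such builders.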

domenkozar · 2021-04-12T16:10:53Z

Due to NixOS/infra#146 being solved, this should be fixed now.

mweinelt · 2021-08-12T11:29:23Z

This is still relevant today, as wendy is yet again failing with SIGILL on pytorch in release-21.05.

https://hydra.nixos.org/build/149565546

This usually compiles in about 2h15m with a 2-core build, but in about 10m on a big-parallel machine.

vcunat · 2022-04-10T19:59:26Z

Generally it's nice when stuff works even on older HW, but Opteron machines surely won't be coming back to hydra.nixos.org anymore.

mweinelt · 2022-04-10T23:29:28Z

Wendy and Ike are long gone, so closing this issue was overdue.

roberth added the 0.kind: bug label Mar 8, 2021

veprbl added the 6.topic: python label Mar 8, 2021

domenkozar mentioned this issue Apr 12, 2021

Remove wendy from the machine list NixOS/infra#146

Closed

domenkozar closed this as completed Apr 12, 2021

mweinelt reopened this Aug 12, 2021

domenkozar referenced this issue Aug 17, 2021

python3Packages.pytorch: require big-parallel

0d4abe5

This usually compiles in about 2h15m with a 2-core build, but in about 10m on a big-parallel machine.

vcunat mentioned this issue Feb 1, 2022

python3Packages.zarr: Illegal instruction #157674

Closed

vcunat closed this as completed Apr 10, 2022
