BUG: Segmentation fault in temporary eliding (backtrace() fails, probably related to pthread locking) #13042

Closed
UKeyboard opened this issue Feb 26, 2019 · 4 comments

Comments

@UKeyboard

UKeyboard commented Feb 26, 2019

In my PyTorch project, my test code fails with a segmentation fault (core dump). The code worked one week ago; the problem appeared after I updated my conda env. Unfortunately, I did not back up the previous conda env, so I cannot roll back, and the error persists.

pdb shows numpy.sum in the following code causes the problem:

        source = numpy.expand_dims(source,axis=-2) # [M,1,2]
        target = numpy.expand_dims(target,axis=-3) # [1,K,2]
        d = numpy.sum((source - target)**2, axis=-1) # [M,K] after broadcasting
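For reference, the snippet can be wrapped into a self-contained sketch. The function name `pairwise_sq_dist` and the sample points are made up for illustration and are not part of my project; on a healthy install this simply prints the `[M,K]` shape.

```python
import numpy

def pairwise_sq_dist(source, target):
    """Squared Euclidean distance between every source and target point."""
    source = numpy.expand_dims(source, axis=-2)  # [M,1,2]
    target = numpy.expand_dims(target, axis=-3)  # [1,K,2]
    # (source - target) broadcasts to [M,K,2]; summing the last axis gives [M,K]
    return numpy.sum((source - target) ** 2, axis=-1)

# Hypothetical sample points with M = 3 and K = 2
src = numpy.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
tgt = numpy.array([[0.0, 0.0], [3.0, 4.0]])
d = pairwise_sq_dist(src, tgt)
print(d.shape)
```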

The failure also produces the following syslog entry: [2523411.260096] python[1501]: segfault at 38 ip 00007fe1c927c73c sp 00007fffce6262b0 error 4 in ld-2.23.so[7fe1c9270000+26000]

`gdb` gives the following info:
gdb python
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
(gdb) run "debug/test_network.py"
Starting program: /home/me/.conda/envs/pytorch/bin/python "debug/test_network.py"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7de373c in ?? () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0  0x00007ffff7de373c in ?? () from /lib64/ld-linux-x86-64.so.2
#1  0x00007ffff7dec851 in ?? () from /lib64/ld-linux-x86-64.so.2
#2  0x00007ffff7de7564 in ?? () from /lib64/ld-linux-x86-64.so.2
#3  0x00007ffff7debda9 in ?? () from /lib64/ld-linux-x86-64.so.2
#4  0x00007ffff79335ad in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007ffff7de7564 in ?? () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff7933664 in __libc_dlopen_mode () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007ffff7905a85 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x00007ffff7bc8a99 in __pthread_once_slow () from /lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007ffff7905ba4 in backtrace () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x00007ffff5d2c89f in check_callers.part ()
   from /home/me/.conda/envs/pytorch/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#11 0x00007ffff5d2ceed in can_elide_temp_unary ()
   from /home/me/.conda/envs/pytorch/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#12 0x00007ffff5d1bb67 in fast_scalar_power ()
   from /home/me/.conda/envs/pytorch/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#13 0x00007ffff5d1bff8 in array_power ()
   from /home/me/.conda/envs/pytorch/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so
#14 0x000055555566239d in PyNumber_Power ()
#15 0x00005555557160aa in _PyEval_EvalFrameDefault ()
#16 0x00005555556e9eab in fast_function ()
#17 0x00005555556f01e5 in call_function ()
#18 0x000055555571441a in _PyEval_EvalFrameDefault ()
#19 0x00005555556e9366 in _PyEval_EvalCodeWithName ()
#20 0x00005555556ea5bb in _PyFunction_FastCallDict ()
...
...

That is all I could find out, and the problem is still there.

Here is my workspace env:
# Name                    Version                   Build  Channel
_tflow_1100_select        0.0.3                       mkl
absl-py                   0.4.0            py36h28b3542_0
asn1crypto                0.24.0                   py36_0
astor                     0.7.1                    py36_0
blas                      1.0                         mkl
bzip2                     1.0.6                h14c3975_5
ca-certificates           2019.1.23                     0
cairo                     1.14.12              h8948797_3
certifi                   2018.11.29               py36_0
cffi                      1.11.5           py36he75722e_1
chardet                   3.0.4                    py36_1
cloudpickle               0.5.5                    py36_0
cryptography              2.3.1            py36hc365091_0
cudatoolkit               9.0                  h13b8566_0
cudnn                     7.3.1                 cuda9.0_0
cycler                    0.10.0                   py36_0
cython                    0.29             py36he6710b0_0
dask-core                 0.18.2                   py36_0
dbus                      1.13.2               h714fa37_1
decorator                 4.3.0                    py36_0
expat                     2.2.6                he6710b0_0
ffmpeg                    4.0                  hcdf2ecd_0
fontconfig                2.13.0               h9420a91_0
freeglut                  3.0.0                hf484d3e_5
freetype                  2.9.1                h8a8886c_1
gast                      0.2.0                    py36_0
glib                      2.56.2               hd408876_0
graphite2                 1.3.13               h23475e2_0
grpcio                    1.12.1           py36hdbcaa40_0
gst-plugins-base          1.14.0               hbbd80ab_1
gstreamer                 1.14.0               hb453b48_1
h5py                      2.8.0            py36h989c5e5_3
harfbuzz                  1.8.8                hffaf4a1_0
hdf5                      1.10.2               hba1933b_1
icu                       58.2                 h9c2bf20_1
idna                      2.7                      py36_0
imageio                   2.3.0                    py36_0
intel-openmp              2018.0.3                      0
jasper                    2.0.14               h07fcdf6_1
jpeg                      9b                   h024ee3a_2
kiwisolver                1.0.1            py36hf484d3e_0
libedit                   3.1.20170329         h6b74fdf_2
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 8.2.0                hdf63c60_1
libgfortran-ng            7.3.0                hdf63c60_0
libglu                    9.0.0                hf484d3e_1
libopencv                 3.4.2                hb342d67_1
libopus                   1.3                  h7b6447c_0
libpng                    1.6.34               hb9fc6fc_0
libprotobuf               3.6.0                hdbcaa40_0
libstdcxx-ng              8.2.0                hdf63c60_1
libtiff                   4.0.9                he85c1e1_2
libuuid                   1.0.3                h1bed415_2
libvpx                    1.7.0                h439df22_0
libxcb                    1.13                 h1bed415_1
libxml2                   2.9.8                h26e45fe_1
markdown                  2.6.11                   py36_0
matplotlib                2.2.3            py36hb69df0a_0
mkl                       2018.0.3                      1
mkl_fft                   1.0.6            py36h7dd41cf_0
mkl_random                1.0.1            py36h4414c95_1
nccl                      1.3.5                 cuda9.0_0
ncurses                   6.1                  hf484d3e_0
networkx                  2.1                      py36_0
ninja                     1.8.2            py36h6bb024c_1
numpy                     1.15.2           py36h1d66e8a_1
numpy-base                1.15.2           py36h81de0dd_1
olefile                   0.45.1                   py36_0
opencv                    3.4.2            py36h6fd60c2_1
openssl                   1.0.2p               h14c3975_0
pcre                      8.42                 h439df22_0
pillow                    5.2.0            py36heded4f4_0
pip                       10.0.1                   py36_0
pixman                    0.36.0               h7b6447c_0
protobuf                  3.6.0            py36hf484d3e_0
py-opencv                 3.4.2            py36hb342d67_1
pycparser                 2.18                     py36_1
pyopenssl                 18.0.0                   py36_0
pyparsing                 2.2.0                    py36_1
pyqt                      5.9.2            py36h22d08a2_1
pysocks                   1.6.8                    py36_0
python                    3.6.4                hc3d631a_3
python-dateutil           2.7.3                    py36_0
pytorch                   0.4.1            py36ha74772b_0
pytz                      2018.5                   py36_0
pywavelets                0.5.2            py36h035aef0_2
pyzmq                     17.1.2                    <pip>
qt                        4.8.7                         2
readline                  7.0                  h7b6447c_5
requests                  2.19.1                   py36_0
scikit-image              0.14.0           py36hf484d3e_1
scipy                     1.1.0            py36hfa4b5c9_1
setuptools                40.2.0                   py36_0
sip                       4.19.12          py36he6710b0_0
six                       1.11.0                   py36_1
sqlite                    3.26.0               h7b6447c_0
termcolor                 1.1.0                    py36_1
tk                        8.6.8                hbc83047_0
toolz                     0.9.0                    py36_0
torchfile                 0.1.0                     <pip>
torchvision               0.2.1                    py36_1    pytorch
tornado                   5.1              py36h14c3975_0
tqdm                      4.25.0           py36h28b3542_0
urllib3                   1.23                     py36_0
visdom                    0.1.8.5                   <pip>
websocket-client          0.53.0                    <pip>
werkzeug                  0.14.1                   py36_0
wheel                     0.31.1                   py36_0
xz                        5.2.4                h14c3975_4
zlib                      1.2.11               ha838bed_2

I can reproduce the error on Ubuntu 16.04 with kernels 4.4.0-141-generic, 4.15.0-43-generic, and 4.15.0-45-generic.

Any help?

@seberg
Member

seberg commented Feb 26, 2019

Hmmm, if you look at the backtrace, the actual problem already occurs during the power operation (although it might be hard to trigger). It seems to be related to temporary elision. However, that code has not changed in a very long time, and I do not remember it ever causing crashes. It might be worth a shot to uninstall mkl and reinstall numpy to see whether there is some interaction (I would be surprised, but...).

This: https://stackoverflow.com/questions/16196897/segmentation-fault-when-calling-backtrace-on-linux-x86 sounds like it is likely related. The last comment there suggests that it may additionally have to do with the glibc version behaving badly together with a locked pthread (only when a lock exists), so it is possible that it can only be reproduced with certain glibc versions such as 2.3

@UKeyboard more importantly can you help us to narrow it down/simplify a bit:

  1. Can you reproduce this without keras?
    • If not, maybe try a threaded environment and, considering the post above, run the code while holding a pthread mutex (I think the normal Python mutexes should do the trick).
  2. Even if you need keras, can you create a minimal self-contained example so that others can run the code and help narrow it down?
  3. To poke at it a bit more, can you check what glibc version you have with ldd --version?
    EDIT: Oh, saw you probably got glibc 2.23
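
If shell access is awkward, the C library version can also be read from within Python. This is a small sketch using the standard-library `platform.libc_ver()`; it reports the libc that the running Python executable is linked against and may return empty strings on systems where it cannot tell.

```python
import platform

# Returns a (lib, version) tuple, for example ('glibc', '<version>') on a
# glibc-based Linux; both strings may be empty if detection fails.
lib, version = platform.libc_ver()
print(lib, version)
```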

So mostly, a minimal test would be very useful, preferably one that does not require a package like keras, where (at least for me) it is hard to judge how it influences everything.
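
A starting point for such a minimal test could look like the sketch below. It is only a guess at a trigger, not a confirmed reproducer: the worker function, the array sizes, and the use of `threading.Lock` as a stand-in for "holding a mutex" are all assumptions. On an unaffected setup it just prints the result shapes.

```python
import threading
import numpy

lock = threading.Lock()
results = {}

def worker(name, m, k):
    # Hypothetical inputs, shaped like the arrays in the report
    rng = numpy.random.RandomState(0)
    source = rng.rand(m, 1, 2)  # [M,1,2]
    target = rng.rand(1, k, 2)  # [1,K,2]
    with lock:
        # The temporary (source - target) feeds straight into **2, which is
        # the operation that appears in the reported backtrace.
        results[name] = numpy.sum((source - target) ** 2, axis=-1)

threads = [
    threading.Thread(target=worker, args=("t%d" % i, 512, 512))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print({name: arr.shape for name, arr in sorted(results.items())})
```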


Quick fix, replace your code with:

power = (source - target)**2
d = numpy.sum(power, axis=-1)

but I expect it will simply crash later, since temporary eliding is quite common (which is why it is useful…).
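
A variant that binds every intermediate to a name may be worth trying as well. The sketch below rests on an assumption about the elision check, namely that it is only attempted for unnamed temporaries passed to the arithmetic operators, so named arrays and direct ufunc calls such as `numpy.square` would not reach the `backtrace()` call. That has not been verified on an affected machine, and the function name is made up.

```python
import numpy

def pairwise_sq_dist_named(source, target):
    source = numpy.expand_dims(source, axis=-2)  # [M,1,2]
    target = numpy.expand_dims(target, axis=-3)  # [1,K,2]
    # Assumption: keeping a named reference to the difference means it is
    # no longer a candidate for temporary elision.
    diff = source - target
    # Call the ufunc directly instead of using the ** operator.
    sq = numpy.square(diff)
    return numpy.sum(sq, axis=-1)  # [M,K]

# Hypothetical sample points
src = numpy.array([[0.0, 0.0], [1.0, 0.0]])
tgt = numpy.array([[3.0, 4.0]])
print(pairwise_sq_dist_named(src, tgt))
```

The result is numerically the same as the one-line version; only the way the intermediates are held differs.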

@seberg seberg changed the title Segmentation fault in numpy.sum method Segmentation fault in temporary eliding (backtrace() fails, probably related to pthread locking) Feb 26, 2019
@seberg seberg changed the title Segmentation fault in temporary eliding (backtrace() fails, probably related to pthread locking) BUG: Segmentation fault in temporary eliding (backtrace() fails, probably related to pthread locking) Feb 26, 2019
@seberg
Member

seberg commented Feb 27, 2019

@juliantaylor just a quick ping in case you know why backtrace() (in the temporary elision check) causes a segfault, probably in conjunction with some pthread locking.

@UKeyboard
Author

@seberg Sorry for the late reply. I tried the following code:

power = (source - target)**2
d = numpy.sum(power, axis=-1)

before noticing your suggestion, and it did crash (as you predicted) at power = (source - target)**2 for the same reason.

As you can see, I hit this problem in my PyTorch project; I don't need keras. The problem occurs only in my CPU code that uses numpy. After I rewrote that code with PyTorch functions, it works.

I also tried the numpy ndarray power operation in an interactive Python session, and it did not cause any problem.

Maybe it has something to do with glibc, but I cannot test that because this is a shared machine and I am not an administrator.

Thanks

@seberg
Member

seberg commented Jan 30, 2024

Closing this: it is old, and without more details I doubt there is anything to do, since we do not even know whether there is an issue.

@seberg seberg closed this as completed Jan 30, 2024