
Segmentation fault after OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable using blas 1.1 through LSF #1668

Closed

gerritholl opened this issue Jul 5, 2018 · 7 comments

gerritholl commented Jul 5, 2018

When I import the Python package numpy, with blas pinned to 1.1-openblas, in a script running through LSF, Python raises a SystemError and then segfaults after repeated instances of OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable and OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max. When I run the same script outside LSF, on the same machine with the same environment, it succeeds. It also succeeds (inside or outside LSF) when I use blas 1.0, either 1.0-openblas or 1.0-mkl.

I run bsub as follows to submit a job to LSF:

bsub -q short-serial -W 00:01 -R "rusage[mem=1000]" -M 1000 -cwd $HOME -oo ~/test.lsf.out -eo ~/test.lsf.err -J test $HOME/test.sh

test.sh is a wrapper that runs test2.sh with a cleared environment, so that the circumstances are identical whether I run inside or outside LSF:

$ cat test.sh
#!/bin/sh
env -i ~/test2.sh --noprofile --norc

In test2.sh, I write out some environment information, set up conda, and run Python to import numpy:

$ cat test2.sh
export
ulimit -a
ldconfig -v
export PATH= # needed to avoid https://github.com/conda/conda/issues/7486
.  /group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/etc/profile.d/conda.sh
conda activate
conda activate FCDR
python -c "import numpy; print('Success 1')"
python ~/mwe.py

Running this through LSF results in the following stdout:

$ cat test.lsf.out
Sender: LSF System <lsfadmin@host334.jc.rl.ac.uk>
Subject: Job 2475445: <test> in cluster <lotus> Exited

Job <test> was submitted from host <host293.jc.rl.ac.uk> by user <gholl> in cluster <lotus> at Wed Jul  4 18:44:42 2018.
Job was executed on host(s) <host334.jc.rl.ac.uk>, in queue <short-serial>, as user <gholl> in cluster <lotus> at Wed Jul  4 18:44:42 2018.
</home/users/gholl> was used as the home directory.
</home/users/gholl> was used as the working directory.
Started at Wed Jul  4 18:44:42 2018.
Terminated at Wed Jul  4 18:44:44 2018.
Results reported at Wed Jul  4 18:44:44 2018.

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/home/users/gholl/test.sh
------------------------------------------------------------

Exited with exit code 139.

Resource usage summary:

    CPU time :                                   0.60 sec.
    Max Memory :                                 -
    Average Memory :                             -
    Total Requested Memory :                     1000.00 MB
    Delta Memory :                               -
    Max Swap :                                   -
    Max Processes :                              -
    Max Threads :                                -
    Run time :                                   2 sec.
    Turnaround time :                            2 sec.

The output (if any) follows:

export OLDPWD
export PWD="/home/users/gholl"
export SHLVL="1"
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1032189
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8589930496
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1032189
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

(omitted output of ldconfig -v for brevity)

PS:

Read file </home/users/gholl/test.lsf.err> for stderr output of this job.

And to stderr:

$ cat test.lsf.err
/sbin/ldconfig: /etc/ld.so.conf.d/kernel-2.6.32-696.23.1.el6.x86_64.conf:6: duplicate hwcap 1 nosegneg
/sbin/ldconfig: /etc/ld.so.conf.d/kernel-2.6.32-754.el6.x86_64.conf:6: duplicate hwcap 1 nosegneg
/sbin/ldconfig: /opt/platform_mpi/lib/linux_amd64/libhpmpi.so is not an ELF file - it has the wrong magic bytes at the start.

/sbin/ldconfig: Can't create temporary cache file /etc/ld.so.cache~: Permission denied
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/python3.6/site-packages/numpy/__init__.py", line 142, in <module>
    from . import add_newdocs
  File "/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/python3.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/python3.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/python3.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/python3.6/site-packages/numpy/core/__init__.py", line 16, in <module>
    from . import multiarray
SystemError: initialization of multiarray raised unreported exception
/home/users/gholl/test2.sh: line 8: 52786 Segmentation fault      (core dumped) python -c "import numpy; print('Success 1')"
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1032189 current, 1032189 max
Traceback (most recent call last):
  File "/home/users/gholl/mwe.py", line 2, in <module>
    import numpy
  File "/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/python3.6/site-packages/numpy/__init__.py", line 142, in <module>
    from . import add_newdocs
  File "/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/python3.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/python3.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/python3.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/python3.6/site-packages/numpy/core/__init__.py", line 16, in <module>
    from . import multiarray
SystemError: initialization of multiarray raised unreported exception
/home/users/gholl/test2.sh: line 9: 52789 Segmentation fault      (core dumped) python ~/mwe.py

I also studied the output of ldconfig -v, but I don't know what to look for and it's too long to include here. However, I did compare the sorted outputs from the runs with and without LSF:

$ diff -u <(sort test.lsf.out_core) <(sort test.nolsf.out_core)
--- /dev/fd/63  2018-07-04 18:56:05.405440986 +0100
+++ /dev/fd/62  2018-07-04 18:56:05.405440986 +0100
@@ -1,4 +1,4 @@
-core file size          (blocks, -c) unlimited
+core file size          (blocks, -c) 0
 cpu time               (seconds, -t) unlimited
 data seg size           (kbytes, -d) unlimited
 export OLDPWD
@@ -102,9 +102,12 @@
        libBrokenLocale.so.1 -> libBrokenLocale-2.12.so
        libBrokenLocale.so.1 -> libBrokenLocale-2.12.so
        libbtf.so.1 -> libbtf.so.1.1.0
+       libbtparser.so.2 -> libbtparser.so.2.2.2
        libbz2.so.1 -> libbz2.so.1.0.4
        libcairo.so.2 -> libcairo.so.2.10800.8
        libcamd.so.2 -> libcamd.so.2.2.0
+       libcanberra-gtk.so.0 -> libcanberra-gtk.so.0.1.5
+       libcanberra.so.0 -> libcanberra.so.0.2.1
        libcanna16.so.1 -> libcanna16.so.1.2.0
        libcanna.so.1 -> libcanna.so.1.2.0
        libcap-ng.so.0 -> libcap-ng.so.0.0.0
@@ -116,7 +119,7 @@
        libcdt.so.5 -> libcdt.so.5.0.0
        libcfitsio.so.0 -> libcfitsio.so.0
        libcgraph.so.6 -> libcgraph.so.6.0.0
-       libcgroup.so.1 -> libcgroup.so.1.0.40
+       libCharLS.so.1 -> libCharLS.so.1.0
        libcholmod.so.1 -> libcholmod.so.1.7.1
        libcidn.so.1 -> libcidn-2.12.so
        libcidn.so.1 -> libcidn-2.12.so
@@ -188,6 +191,7 @@
        libeggdbus-1.so.0 -> libeggdbus-1.so.0.0.0
        libEGL.so.1 -> libEGL.so.1.0.0
        libelf.so.1 -> libelf-0.164.so
+       libenchant.so.1 -> libenchant.so.1.5.0
        libepoxy.so.0 -> libepoxy.so.0.0.0
        libesoobS.so.2 -> libesoobS.so.2.0.0
        libevent-1.4.so.2 -> libevent-1.4.so.2.1.3
@@ -263,6 +267,7 @@
        libgmodule-2.0.so.0 -> libgmodule-2.0.so.0.2800.8
        libgmp.so.3 -> libgmp.so.3.5.0
        libgmpxx.so.4 -> libgmpxx.so.4.1.0
+       libgnomecanvas-2.so.0 -> libgnomecanvas-2.so.0.2600.0
        libgnutls-extra.so.26 -> libgnutls-extra.so.26.22.6
        libgnutls.so.26 -> libgnutls.so.26.22.6
        libgnutlsxx.so.26 -> libgnutlsxx.so.26.14.12
@@ -357,6 +362,7 @@
        libgstvideo-0.10.so.0 -> libgstvideo-0.10.so.0.20.0
        libgta.so.0 -> libgta.so.0.0.1
        libgthread-2.0.so.0 -> libgthread-2.0.so.0.2800.8
+       libgtksourceview-2.0.so.0 -> libgtksourceview-2.0.so.0.0.0
        libgtk-x11-2.0.so.0 -> libgtk-x11-2.0.so.0.2400.23
        libgtrtst.so.2 -> libgtrtst.so.2.0.0
        libgudev-1.0.so.0 -> libgudev-1.0.so.0.0.1
@@ -378,9 +384,9 @@
        libhwloc.so.4 -> libhwloc.so
 /lib/i686: (hwcap: 0x0008000000000000)
 /lib/i686/nosegneg: (hwcap: 0x0028000000000000)
-       libibmad.so.5 -> libibmad.so.5.4.0
+       libibmad.so.5 -> libibmad.so.5.5.0
        libibnetdisc.so.5 -> libibnetdisc.so.5.3.0
-       libibumad.so.3 -> libibumad.so.3.0.4
+       libibumad.so.3 -> libibumad.so.3.1.0
        libibverbs.so.1 -> libibverbs.so.1.0.0
        libICE.so.6 -> libICE.so.6.3.0
        libicudata.so.42 -> libicudata.so.42.1
@@ -499,6 +505,7 @@
        libnih.so.1 -> libnih.so.1.0.0
        libnl.so.1 -> libnl.so.1.1.4
        libnn.so.2 -> libnn.so.2.0.0
+       libnotify.so.1 -> libnotify.so.1.2.3
        libnsl.so.1 -> libnsl-2.12.so
        libnsl.so.1 -> libnsl-2.12.so
        libnspr4.so -> libnspr4.so
@@ -542,17 +549,17 @@
        libopcodes-2.20.51.0.2-5.48.el6.so -> libopcodes-2.20.51.0.2-5.48.el6.so
        libopenjp2.so.7 -> libopenjp2.so.2.3.0
        libopenjpeg.so.2 -> libopenjpeg.so.2.1.3.0
-       libopensm.so.5 -> libopensm.so.5.2.0
+       libopensm.so.12 -> libopensm.so.12.0.0
        liboplodbcS.so.2 -> liboplodbcS.so.2.0.0
        liboraodbcS.so.2 -> liboraodbcS.so.2.0.0
        libORBit-2.so.0 -> libORBit-2.so.0.1.0
        libORBitCosNaming-2.so.0 -> libORBitCosNaming-2.so.0.1.0
        libORBit-imodule-2.so.0 -> libORBit-imodule-2.so.0.0.0
-       libosmcomp.so.3 -> libosmcomp.so.3.0.8
+       libosmcomp.so.3 -> libosmcomp.so.3.0.6
        libOSMesa16.so.6 -> libOSMesa16.so.6.5.3
        libOSMesa32.so.6 -> libOSMesa32.so.6.5.3
        libOSMesa.so.6 -> libOSMesa.so.6.5.3
-       libosmvendor.so.3 -> libosmvendor.so.3.0.9
+       libosmvendor.so.3 -> libosmvendor.so.3.0.8
        libossp-uuid.so.16 -> libossp-uuid.so.16.0.21
        libotf.so.0 -> libotf.so.0.0.0
        libp11-kit.so.0 -> libp11-kit.so.0.0.0
@@ -676,7 +683,6 @@
        libsensors.so.4 -> libsensors.so.4.2.0
        libsepol.so.1 -> libsepol.so.1
        libserf-1.so.1 -> libserf-1.so.1.3.0
-       libsgutils2.so.2 -> libsgutils2.so.2.0.0
        libshiboken-python2.7.so.1.2 -> libshiboken-python2.7.so.1.2.1
        libshp.so.1 -> libshp.so.1.0.1
        libslang.so.2 -> libslang.so.2.2.1
@@ -686,6 +692,7 @@
        libsndfile.so.1 -> libsndfile.so.1.0.20
        libsnmp.so.20 -> libsnmp.so.20.0.0
        libsoftokn3.so -> libsoftokn3.so
+       libspatialite.so.2 -> libspatialite.so.2.0.4
        libspqr.so.1 -> libspqr.so.1.1.2
        libsqlite3.so.0 -> libsqlite3.so.0.8.6
        libssh2.so.1 -> libssh2.so.1.0.1
@@ -754,6 +761,7 @@
        libvorbisenc.so.2 -> libvorbisenc.so.2.0.6
        libvorbisfile.so.3 -> libvorbisfile.so.3.3.2
        libvorbis.so.0 -> libvorbis.so.0.4.3
+       libvpx.so.1 -> libvpx.so.1.3.0
        libvte.so.9 -> libvte.so.9.2501.0
        libwbclient.so.0 -> libwbclient.so.0
        libwebpdecoder.so.1 -> libwebpdecoder.so.1.0.3
@@ -764,6 +772,7 @@
        libwlm-nosched.so -> libwlm-nosched.so
        libwmf-0.2.so.7 -> libwmf-0.2.so.7.1.0
        libwmflite-0.2.so.7 -> libwmflite-0.2.so.7.0.1
+       libwnck-1.so.22 -> libwnck-1.so.22.3.23
        libwrap.so.0 -> libwrap.so.0.7.6
        libwx_baseu-2.8.so.0 -> libwx_baseu-2.8.so.0.8.0
        libwx_baseu-3.0.so.0 -> libwx_baseu-3.0.so.0.2.0
@@ -800,6 +809,7 @@
        libX11.so.6 -> libX11.so.6.3.0
        libX11-xcb.so.1 -> libX11-xcb.so.1.0.0
        libX11-xcb.so.1 -> libX11-xcb.so.1.0.0
+       libx86.so.1 -> libx86.so.1
        libXau.so.6 -> libXau.so.6.0.0
        libXau.so.6 -> libXau.so.6.0.0
        libXaw3d.so.7 -> libXaw3d.so.7.0
@@ -842,6 +852,7 @@
        libxcb.so.1 -> libxcb.so.1.1.0
        libxcb-sync.so.1 -> libxcb-sync.so.1.0.0
        libxcb-sync.so.1 -> libxcb-sync.so.1.0.0
+       libxcb-util.so.1 -> libxcb-util.so.1.0.0
        libxcb-xevie.so.0 -> libxcb-xevie.so.0.0.0
        libxcb-xevie.so.0 -> libxcb-xevie.so.0.0.0
        libxcb-xf86dri.so.0 -> libxcb-xf86dri.so.0.0.0
@@ -889,6 +900,7 @@
        libXp.so.6 -> libXp.so.6.2.0
        libXrandr.so.2 -> libXrandr.so.2.2.0
        libXrender.so.1 -> libXrender.so.1.3.0
+       libXRes.so.1 -> libXRes.so.1.0.0
        libxslt.so.1 -> libxslt.so.1.1.26
        libxtables.so.4 -> libxtables.so.4.0.0-1.4.7
        libXt.so.6 -> libXt.so.6.0.0
@@ -897,20 +909,21 @@
        libXxf86dga.so.1 -> libXxf86dga.so.1.0.0
        libXxf86misc.so.1 -> libXxf86misc.so.1.1.0
        libXxf86vm.so.1 -> libXxf86vm.so.1.0.0
-       libyaml-0.so.2 -> libyaml-0.so.2.0.4
        libz.so.1 -> libz.so.1.2.3
 max locked memory       (kbytes, -l) unlimited
 max memory size         (kbytes, -m) unlimited
-max user processes              (-u) 1032189
-open files                      (-n) 4096
+max user processes              (-u) 1024
+open files                      (-n) 48000
 /opt/platform_mpi/lib/linux_amd64:
        p11-kit-trust.so -> libnssckbi.so
-pending signals                 (-i) 1032189
+pending signals                 (-i) 515955
 pipe size            (512 bytes, -p) 8
 POSIX message queues     (bytes, -q) 819200
 real-time priority              (-r) 0
 scheduling priority             (-e) 0
-stack size              (kbytes, -s) 8589930496
+stack size              (kbytes, -s) 2097151
+Success
+Success 1
 /usr/lib:
 /usr/lib64:
 /usr/lib64/atlas:

When I run outside LSF, the stdout is

$ env -i ~/test2.sh --noprofile --norc
export OLDPWD
export PWD="/home/users/gholl"
export SHLVL="1"
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515955
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 48000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 2097151
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

(output of ldconfig -v omitted for brevity)

Success 1
Success

and the output to stderr is limited to the same ldconfig errors as before:

/sbin/ldconfig: /etc/ld.so.conf.d/kernel-2.6.32-696.23.1.el6.x86_64.conf:6: duplicate hwcap 1 nosegneg
/sbin/ldconfig: /etc/ld.so.conf.d/kernel-2.6.32-754.el6.x86_64.conf:6: duplicate hwcap 1 nosegneg
/sbin/ldconfig: /opt/platform_mpi/lib/linux_amd64/libhpmpi.so is not an ELF file - it has the wrong magic bytes at the start.

/sbin/ldconfig: Can't create temporary cache file /etc/ld.so.cache~: Permission denied

I'm running Python 3.6.3 with a conda environment sourced primarily from anaconda and conda-forge. I have noticed previously that when I set a tight ulimit -v, importing numpy fails with the same OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable. Presently, however, no ulimit -v is set. The only ulimit differences between the LSF and non-LSF cases are that for several properties the limits within LSF are much more generous than outside it, so I cannot see how ulimit restrictions would be causing the failures within LSF here. In that earlier case, moreover, I managed to reproduce the problem outside LSF as well.
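That earlier reproduction looked roughly like the sketch below; the 500000 KiB value is illustrative only, not a measured threshold:

# Sketch: outside LSF, a tight virtual-memory limit provokes the same
# blas_thread_init failure at numpy import time (limit value is illustrative).
( ulimit -v 500000; python -c "import numpy" )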

As stated, everything appears to work fine with a numpy built on blas 1.0 (either openblas or mkl) rather than blas 1.1 (for which I can find only openblas, no mkl). There must be some difference in environment between running inside and outside LSF that matters for blas 1.1 (openblas), but I can't pin it down. What else could I look at?

I do not have LSF administrator access.
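For reference, what I can still check from inside a job without importing numpy at all is roughly the following; the glob for the multiarray extension filename is an assumption:

# Diagnostic sketch: which BLAS the failing extension links against, and the
# limits and CPU count the job actually sees (extension filename glob assumed).
ENV=/group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR
ldd "$ENV"/lib/python3.6/site-packages/numpy/core/multiarray.*.so | grep -i -E "blas|lapack"
ulimit -a   # limits the job actually sees
nproc       # CPU count OpenBLAS sizes its thread pool from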

gerritholl (Author) commented:

Everything runs fine when I set:

export OMP_NUM_THREADS=1
export USE_SIMPLE_THREADED_LEVEL3=1
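In context, these go in test2.sh before the python invocations; adding OPENBLAS_NUM_THREADS as well is an assumption on my part (it is the OpenBLAS-specific knob for the same thing), not something this run needed:

# Sketch: single-threaded BLAS workaround placed in test2.sh before the python runs.
# OPENBLAS_NUM_THREADS=1 is an extra, assumed-equivalent setting; the working run
# above only used the two variables listed in the comment.
export OMP_NUM_THREADS=1
export USE_SIMPLE_THREADED_LEVEL3=1
export OPENBLAS_NUM_THREADS=1
python -c "import numpy; print('Success 1')"
python ~/mwe.py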

martin-frbg (Collaborator) commented:

Is there any information about which OpenBLAS versions the 1.1-openblas and 1.0-openblas correspond to? The error suggests that creation of new threads actually failed due to a temporary system limitation (which may or may not be related to the ulimit that is printed for information; perhaps you are actually hitting some memory limit imposed by LSF).
This message (and checking for success of pthread_create in general) was added more than two years ago (version 0.2.15 or thereabouts) in the context of issue #668.
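One way to check might be along these lines; the environment name is taken from the tracebacks above, and the exact conda package names are an assumption:

# Sketch: see which OpenBLAS build each conda metapackage actually pulls in.
conda list -n FCDR | grep -i -E "blas|openblas|numpy"
ls /group_workspaces/cems2/fiduceo/Users/gholl/anaconda3/envs/FCDR/lib/libopenblas*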

brada4 (Contributor) commented Jul 5, 2018

Twice the stack size already exceeds the gigabyte you ordered. Probably tune it down to 10 MB or so, to stay within the LSF quota (and tell the LSF admin about it).

gerritholl (Author) commented:

It does appear that thread creation fails, because when I use 1.0-openblas I instead get a failure where matplotlib fails to start a thread.

The conda blas versions are metapackages, but I don't actually understand the difference between 1.1-openblas and 1.0-openblas. In either case it is libopenblasp-r0.2.20.so that ends up installed, so there must be something else going on that I don't understand.

@brada4 Do you mean tune down the stack size, or tune down the gigabyte I ordered? I don't understand ulimit terribly well; can imposing a lower limit reduce the risk of running out of resources?

brada4 (Contributor) commented Jul 7, 2018

You may need to set a lower stack size, e.g. ulimit -s 8192 ; test.sh
A 2 GB or 8 GB stack is enormous, as if for recursions millions of levels deep.
The 1 GB memory limit in LSF may apply to the address space of the submitted process, for example.
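Concretely, that could look like the following variation of the wrapper; 8192 KiB is just the example value above, not a tuned number:

#!/bin/sh
# Sketch of test.sh with a modest per-thread stack limit set before clearing the
# environment and running test2.sh (8192 KiB is the example value from this thread).
ulimit -s 8192
env -i ~/test2.sh --noprofile --norc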

brada4 (Contributor) commented Jul 7, 2018

@martin-frbg blas-1.1 and blas-1.0 are dlopen() configuration wrappers for OpenBLAS 0.2.20.
In conda-forge one may find 0.3.1 wrapped by the same wrappers.
Too bad the failing setup cannot be debugged easily.

martin-frbg (Collaborator) commented:

@gerritholl did you try with a smaller stack size (or a higher memory limit in LSF, depending on what your LSF admin allows)?
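For example, the same submission with a larger memory request, if the queue allows it; the 4000 MB figure is arbitrary:

# Sketch: the original bsub line with only the memory request raised (4000 MB is arbitrary).
bsub -q short-serial -W 00:01 -R "rusage[mem=4000]" -M 4000 -cwd $HOME \
     -oo ~/test.lsf.out -eo ~/test.lsf.err -J test $HOME/test.sh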
