CrayBLAS: segfault after shutdown on Perlmutter #3160

Open
RTSandberg opened this issue Jun 3, 2022 · 2 comments
Labels
bug: affects latest release (Bug also exists in latest release version)
bug (Something isn't working)
component: spectral (Spectral solvers (PSATD, IGF))
component: third party (Changes in WarpX that reflect a change in a third-party library)
geometry: RZ (axisymmetric 2D and quasi-3D)
machine / system (Machine or system-specific issue)


RTSandberg commented Jun 3, 2022

The following command runs a complete simulation:

$Home/src/WarpX/build_perlmutter/bin> ./warpx.RZ.MPI.CUDA.DP.PDP.OPMD.PSATD.QED.DEBUG ../../Examples/Physics_applications/laser_acceleration/inputs_rz

[New Thread 0x7fffd474f000 (LWP 201253)]
Initializing CUDA...
[New Thread 0x7fffca428000 (LWP 202882)]
[New Thread 0x7fffc9c27000 (LWP 202883)]
CUDA initialized with 1 GPU per MPI rank; 1 GPU(s) used in total
MPI initialized with 1 MPI processes
MPI initialized with thread support level 3
AMReX (22.05-37-gb78921a2d80d) initialized
WarpX (22.05-64-g81619e11b45c)
PICSAR (2becfe066559)
Level 0: dt = 4.112304655e-16 ; dx = 4.6875e-07 ; dz = 1.328125e-07
...

but then segfaults during shutdown, after AMReX has been finalized:

END REGION WarpX::Evolve()
Total GPU global memory (MB) spread across MPI: [40536 ... 40536]
Free  GPU global memory (MB) spread across MPI: [32209 ... 32209]
[The         Arena] space (MB) allocated spread across MPI: [30402 ... 30402]
[The         Arena] space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] space (MB) used      spread across MPI: [0 ... 0]
AMReX (22.05-37-gb78921a2d80d) finalized
[Thread 0x7fffd474f000 (LWP 201253) exited]

Thread 1 "warpx.RZ.MPI.CU" received signal SIGSEGV, Segmentation fault.
__freeBlasMemPool (numa_mask=<optimized out>, tag=<optimized out>) at ./src/crayblas_util.c:353
353	./src/crayblas_util.c: No such file or directory.
Missing separate debuginfos, use: zypper install cuda-compat-11-5-debuginfo-495.29.05-1.x86_64 libbz2-1-debuginfo-1.0.6-5.9.1.x86_64 libffi7-debuginfo-3.2.1.git259-10.8.x86_64 libibverbs-debuginfo-51mlnx1-1.51258.060.x86_64 libnl3-200-debuginfo-3.3.0-1.29.x86_64 libpng16-16-debuginfo-1.6.34-3.9.1.x86_64 librdmacm-debuginfo-51mlnx1-1.51258.060.x86_64
(gdb) bt
#0  __freeBlasMemPool (numa_mask=<optimized out>, tag=<optimized out>) at ./src/crayblas_util.c:353
#1  0x00007fffe6530275 in __crayblas_shutdown () at ./src/crayblas_conf.c:232
#2  0x00007ffff7de35a3 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#3  0x00007fffd8f6fb09 in __run_exit_handlers () from /lib64/libc.so.6
#4  0x00007fffd8f6fc9a in exit () from /lib64/libc.so.6
#5  0x00007fffd8f572c4 in __libc_start_main () from /lib64/libc.so.6
#6  0x000000000042e52a in _start () at ../sysdeps/x86_64/start.S:120

Running without PSATD enabled, by contrast, works fine.


ax3l commented Jun 6, 2022

We reported this upstream with @mgates3 and are tracking it in
https://bitbucket.org/icl/blaspp/issues/18/segfault-on-exit-with-cray-libsci

ax3l added the labels bug, component: spectral, component: third party, bug: affects latest release, geometry: RZ, and machine / system on Jun 6, 2022
ax3l self-assigned this on Jun 7, 2022
ax3l changed the title from "crayblas segfault on Perlmutter" to "CrayBLAS: segfault after shutdown on Perlmutter" on Jun 7, 2022

ax3l commented Jun 20, 2022

As documented in the BLAS++ issue linked above, NERSC appears to have compiled and run successfully with BLAS++ built like this:

export CXX=CC   # or your preferred C++ compiler
export FC=ftn   # or your preferred Fortran compiler
export LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/21.11/math_libs/11.5/lib64:$LIBRARY_PATH

cmake ../ \
  -DCMAKE_INSTALL_PREFIX=../install \
  -DCMAKE_CUDA_ARCHITECTURES="80" \
  -Dblas=libsci

We still need to check whether this helps in our case as well.
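
One way to start that check is to confirm which BLAS/LAPACK libraries the rebuilt executable actually resolves at runtime. The following is only a minimal sketch, assuming `ldd` is available on the system. `list_blas_libs` is a hypothetical helper name, and the search pattern is a guess at the relevant library names (Cray LibSci, cuBLAS, LAPACK), not something taken from the BLAS++ or WarpX documentation:

```shell
# Hypothetical helper: filter `ldd`-style output (read from stdin) for
# BLAS/LAPACK-related shared libraries. The pattern is an assumption;
# extend it if your system uses differently named libraries.
list_blas_libs() {
    # `|| true` keeps the exit status at 0 when nothing matches
    grep -i -E 'libsci|blas|lapack' || true
}

# Usage (executable name taken from the report above; adjust the path):
#   ldd ./warpx.RZ.MPI.CUDA.DP.PDP.OPMD.PSATD.QED.DEBUG | list_blas_libs
```

The result depends on the modules and library paths loaded in the current shell, so it should be run in the same environment as the failing job.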
