UCX backtrace in the avx512 version #74

Closed
mboisson opened this issue May 3, 2021 · 4 comments

@mboisson
Member

mboisson commented May 3, 2021

From OTRS tickets 0119977 and possibly 0119977

With OpenFOAM v2012 compiled for avx512, we get a crash in UCX, with the following backtrace:

(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  <signal handler called>
#2  ucp_ep_config_get_zcopy_auto_thresh (iovcnt=iovcnt@entry=1,
reg_cost=reg_cost@entry=0x27c41e0, context=context@entry=0x27c1930,
bandwidth=2.8611350507370929e+21) at core/ucp_ep.c:1956
#3  0x00002b3d0f888124 in ucp_ep_config_init_attrs (worker=<optimized out>,
rsc_index=<optimized out>, config=0x29db6e0, max_short=<optimized out>,
max_bcopy=<optimized out>, max_zcopy=<optimized out>, max_iov=3, short_flag=1,
bcopy_flag=2, zcopy_flag=4, hdr_len=8, adjust_min_val=18446744073709551615) at
core/ucp_context.h:405
#4  0x00002b3d0f888a28 in ucp_ep_config_init (worker=worker@entry=0x29da160,
config=config@entry=0x29db500, key=key@entry=0x7fffbd8be6b0) at
core/ucp_ep.c:1546
#5  0x00002b3d0f893c3b in ucp_worker_get_ep_config
(worker=worker@entry=0x29da160, key=key@entry=0x7fffbd8be6b0,
print_cfg=print_cfg@entry=1, config_idx_p=config_idx_p@entry=0x7fffbd8be68e)
at core/ucp_worker.c:1626
#6  0x00002b3d0f8cabda in ucp_wireup_init_lanes (ep=0x2b3d0fae6b40,
ep_init_flags=ep_init_flags@entry=0,
local_tl_bitmap=local_tl_bitmap@entry=18446744073709551615,
remote_address=remote_address@entry=0x7fffbd8be7c0,
addr_indices=addr_indices@entry=0x7fffbd8be750) at wireup/wireup.c:980
#7  0x00002b3d0f885808 in ucp_ep_create_to_worker_addr (worker=<optimized out>,
local_tl_bitmap=18446744073709551615, remote_address=0x7fffbd8be7c0,
ep_init_flags=0, message=<optimized out>, ep_p=0x7fffbd8be7b8) at
core/ucp_ep.c:342
#8  0x00002b3d0f886188 in ucp_ep_create_api_to_worker_addr
(worker=worker@entry=0x29da160, params=params@entry=0x7fffbd8be8a0,
ep_p=ep_p@entry=0x7fffbd8be848) at core/ucp_ep.c:616
#9  0x00002b3d0f88657c in ucp_ep_create (worker=0x29da160,
params=0x7fffbd8be8a0, ep_p=0x7fffbd8be898) at core/ucp_ep.c:681
#10 0x00002b3d092b1143 in mca_pml_ucx_add_proc_common () from
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Compiler/gcc9/openmpi/4.0.3/lib/openmpi/mca_pml_ucx.so
#11 0x00002b3d092b20ff in mca_pml_ucx_add_proc () from
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Compiler/gcc9/openmpi/4.0.3/lib/openmpi/mca_pml_ucx.so
#12 0x00002b3d092b2445 in mca_pml_ucx_send () from
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Compiler/gcc9/openmpi/4.0.3/lib/openmpi/mca_pml_ucx.so
#13 0x00002b3d093cd2dc in ompi_coll_base_sendrecv_actual () from
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40
#14 0x00002b3d093cda9a in ompi_coll_base_allreduce_intra_recursivedoubling ()
from
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40
#15 0x00002b3d093801f8 in PMPI_Allreduce () from
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40
#16 0x00002b3d09b25052 in void Foam::allReduce<double, Foam::sumOp<double>
>(double&, int, ompi_datatype_t*, ompi_op_t*, Foam::sumOp<double> const&, int,
int) () from
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/openmpi4/openfoam/v2012/OpenFOAM-v2012/platforms/linux64GccDPInt32Opt/lib/libPstream.so
#17 0x00002b3d09b204e2 in Foam::reduce(double&, Foam::sumOp<double> const&,
int, int) () from
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/openmpi4/openfoam/v2012/OpenFOAM-v2012/platforms/linux64GccDPInt32Opt/lib/libPstream.so
#18 0x00002b3d08e01db9 in Foam::Time::setControls() () from
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/openmpi4/openfoam/v2012/OpenFOAM-v2012/platforms/linux64GccDPInt32Opt/lib/libOpenFOAM.so
#19 0x00002b3d08e059ae in Foam::Time::Time(Foam::word const&, Foam::argList
const&, Foam::word const&, Foam::word const&, bool, bool) () from
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/MPI/gcc9/openmpi4/openfoam/v2012/OpenFOAM-v2012/platforms/linux64GccDPInt32Opt/lib/libOpenFOAM.so
#20 0x000000000042edb5 in Foam::Time::Time(Foam::word const&, Foam::argList
const&, bool, bool) ()
#21 0x00000000004271f6 in main ()
@mboisson mboisson added the bug Something isn't working label May 3, 2021
@mboisson mboisson added this to To Do in Compute Canada software stack via automation May 3, 2021
@bartoldeman
Contributor

This looks like a GCC compiler bug that triggers for avx512 when FP exceptions are enabled. I stripped down the UCX function and reported it here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101683

@bartoldeman
Contributor

unset FOAM_SIGFPE should also be a workaround for OpenFOAM

@bartoldeman
Contributor

unset FOAM_SIGFPE for openfoam/8 (etc.)
but
export FOAM_SIGFPE=false for openfoam/v2006 (etc.)
It is always good to know which OpenFOAM we are talking about.
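
For job scripts that load different OpenFOAM modules, the two cases above could be wrapped in a small shell helper. This is only a sketch: the function name `foam_sigfpe_workaround` is hypothetical, and it assumes that "v"-prefixed versions (v2006, v2012, ...) behave like openfoam/v2006 while everything else behaves like openfoam/8.

```shell
# Hypothetical helper: apply the FOAM_SIGFPE workaround for a given module name.
# Assumption: "v"-prefixed versions (v2006, v2012, ...) need FOAM_SIGFPE=false,
# while other versions (8, ...) need the variable to be unset.
foam_sigfpe_workaround() {
    # Strip a leading "openfoam/" so both "openfoam/v2012" and "v2012" work.
    case "${1#openfoam/}" in
        v*) export FOAM_SIGFPE=false ;;
        *)  unset FOAM_SIGFPE ;;
    esac
}
```

Calling `foam_sigfpe_workaround openfoam/v2012` before launching the solver would then set `FOAM_SIGFPE=false` for the run reported in this issue.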

@bartoldeman
Contributor

GCC has been recompiled with the upstream fix patch and pushed, and UCX has been recompiled and pushed as well. Closing.

Compute Canada software stack automation moved this from To Do to Done Aug 30, 2021