build fails on 32-bit architectures: compute_nonlocal_dual_graph: max_num_vertices_per_facet=-1 #1735

Closed
drew-parsons opened this issue Oct 2, 2021 · 8 comments

@drew-parsons
Contributor

dolfinx 0.3.0 is failing to build on 32-bit architectures (i386, armhf, armel), see https://buildd.debian.org/status/package.php?p=fenics-dolfinx&suite=experimental
e.g. i386
https://buildd.debian.org/status/fetch.php?pkg=fenics-dolfinx&arch=i386&ver=1%3A0.3.0-3&stamp=1633022115&raw=0

There is a segfault, apparently triggered in openmpi's mca_btl_vader.so (backtrace reported at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995599 ).

The point of handover from dolfinx to MPI before the segfault is mesh/graphbuild.cpp l.143-144

  graph::AdjacencyList<std::int64_t> recvd_buffer
      = dolfinx::MPI::all_to_all(comm, send_buffer);

Noting that the segfault happens on 32-bit arches, and that the dolfinx code uses int64_t to index the MPI buffers, could this be the origin of the segfault? Or is it more likely some other bug in the OpenMPI implementation (in vader)?
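
On the int64_t question, a quick sanity check may help narrow things down: std::int64_t is 8 bytes on every architecture, so using it as the buffer element type is not by itself a 32-bit hazard. What changes on i386 is the width of pointers and size_t. The sketch below uses Python's ctypes as a stand-in for C++ sizeof and can be run in the failing chroot. It does not rule out a mismatch elsewhere, for example between a buffer's element type and the MPI datatype passed alongside it.

```python
import ctypes

# Fixed-width 64-bit integer: the same size on every architecture.
int64_bytes = ctypes.sizeof(ctypes.c_int64)

# Pointer and size_t widths follow the architecture (4 on i386, 8 on amd64).
pointer_bytes = ctypes.sizeof(ctypes.c_void_p)
size_t_bytes = ctypes.sizeof(ctypes.c_size_t)

print(f"int64_t: {int64_bytes} bytes")
print(f"void*:   {pointer_bytes} bytes")
print(f"size_t:  {size_t_bytes} bytes")
```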

@drew-parsons
Contributor Author

A sample backtrace looks like

(experimental_i386-dchroot)barriere$ mpiexec -n 2 ./demo_poisson -start_in_debugger 
PETSC: Attaching gdb to ./demo_poisson of pid 5638 on display :0.0 on machine barriere
PETSC: Attaching gdb to ./demo_poisson of pid 5639 on display :0.0 on machine barriere
Unable to start debugger in xterm: No such file or directory
Unable to start debugger in xterm: No such file or directory
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: PetscAbortErrorHandler: User provided function() line 0 in  unknown file (null)
  To prevent termination, change the error handler using PetscPushErrorHandler()
[barriere:05638] *** Process received signal ***
[barriere:05638] Signal: Aborted (6)
[barriere:05638] Signal code:  (-6)
[barriere:05638] [ 0] linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf7f32090]
[barriere:05638] [ 1] linux-gate.so.1(__kernel_vsyscall+0x9)[0xf7f32069]
[barriere:05638] [ 2] /lib/i386-linux-gnu/libc.so.6(gsignal+0xc6)[0xf5f00f36]
[barriere:05638] [ 3] /lib/i386-linux-gnu/libc.so.6(abort+0x125)[0xf5ee9312]
[barriere:05638] [ 4] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x153d26)[0xf653dd26]
[barriere:05638] [ 5] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscError+0xd0)[0xf653a3b0]
[barriere:05638] [ 6] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscSignalHandlerDefault+0x1a0)[0xf653e790]
[barriere:05638] [ 7] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x154979)[0xf653e979]
[barriere:05638] [ 8] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f32080]
[barriere:05638] [ 9] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh16build_dual_graphEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEEi+0xd7f)[0xf7ead68f]
[barriere:05638] [10] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeERKSt8functionIFNS4_IiEES2_iS7_ibEE+0x21d)[0xf7ebeb9d]
[barriere:05638] [11] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeE+0x59)[0xf7ebece9]
[barriere:05638] [12] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZNSt17_Function_handlerIFKN7dolfinx5graph13AdjacencyListIiEEP19ompi_communicator_tiiRKNS2_IxEENS0_4mesh9GhostModeEEPFS3_S6_iiS9_SB_EE9_M_invokeERKSt9_Any_dataOS6_OiSK_S9_OSB_+0x35)[0xf7dff8b5]
[barriere:05638] [13] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh11create_meshEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEERKNS_3fem17CoordinateElementERKN2xt17xtensor_containerINSC_7uvectorIdSaIdEEELj2ELNSC_11layout_typeE1ENSC_22xtensor_expression_tagEEENS0_9GhostModeERKSt8functionIFKNS4_IiEES2_iiS7_SM_EE+0x163)[0xf7e96d63]
[barriere:05638] [14] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(+0x10fdab)[0xf7dfedab]
[barriere:05638] [15] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKSt8functionIFKNS_5graph13AdjacencyListIiEES3_iiRKNSF_IxEESC_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb8)[0xf7dff7c8]
[barriere:05638] [16] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5f)[0xf7dff83f]
[barriere:05638] [17] ./demo_poisson(+0x19953)[0x565ed953]
[barriere:05638] [18] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0x106)[0xf5eeafd6]
[barriere:05638] [19] ./demo_poisson(+0x18451)[0x565ec451]
[barriere:05638] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[1]PETSC ERROR: to get more information on the crash.
[1]PETSC ERROR: PetscAbortErrorHandler: User provided function() line 0 in  unknown file (null)
  To prevent termination, change the error handler using PetscPushErrorHandler()
[barriere:05639] *** Process received signal ***
[barriere:05639] Signal: Aborted (6)
[barriere:05639] Signal code:  (-6)
[barriere:05639] [ 0] linux-gate.so.1(__kernel_rt_sigreturn+0x0)[0xf7f66090]
[barriere:05639] [ 1] linux-gate.so.1(__kernel_vsyscall+0x9)[0xf7f66069]
[barriere:05639] [ 2] /lib/i386-linux-gnu/libc.so.6(gsignal+0xc6)[0xf5f34f36]
[barriere:05639] [ 3] /lib/i386-linux-gnu/libc.so.6(abort+0x125)[0xf5f1d312]
[barriere:05639] [ 4] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x153d26)[0xf6571d26]
[barriere:05639] [ 5] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscError+0xd0)[0xf656e3b0]
[barriere:05639] [ 6] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(PetscSignalHandlerDefault+0x1a0)[0xf6572790]
[barriere:05639] [ 7] /usr/lib/petscdir/petsc3.14/i386-linux-gnu-real/lib/libpetsc_real.so.3.14(+0x154979)[0xf6572979]
[barriere:05639] [ 8] linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f66080]
[barriere:05639] [ 9] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so(+0x4cd6)[0xf1354cd6]
[barriere:05639] [10] /usr/lib/i386-linux-gnu/libopen-pal.so.40(opal_progress+0x30)[0xf4dcde70]
[barriere:05639] [11] /usr/lib/i386-linux-gnu/libopen-pal.so.40(ompi_sync_wait_mt+0xbd)[0xf4dd4a5d]
[barriere:05639] [12] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_request_default_wait+0x236)[0xf7a1b2c6]
[barriere:05639] [13] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xbb)[0xf7a73b2b]
[barriere:05639] [14] /usr/lib/i386-linux-gnu/libmpi.so.40(ompi_coll_base_alltoall_intra_pairwise+0xf7)[0xf7a77b67]
[barriere:05639] [15] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_do_this+0x11d)[0xf11b28ed]
[barriere:05639] [16] /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_alltoall_intra_dec_fixed+0x99)[0xf11adca9]
[barriere:05639] [17] /usr/lib/i386-linux-gnu/libmpi.so.40(MPI_Alltoall+0x182)[0xf7a2f6d2]
[barriere:05639] [18] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx3MPI10all_to_allIxEENS_5graph13AdjacencyListIT_EEP19ompi_communicator_tRKS5_+0x15d)[0xf7eb95ed]
[barriere:05639] [19] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh16build_dual_graphEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEEi+0xdd6)[0xf7ee16e6]
[barriere:05639] [20] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeERKSt8functionIFNS4_IiEES2_iS7_ibEE+0x21d)[0xf7ef2b9d]
[barriere:05639] [21] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh21partition_cells_graphEP19ompi_communicator_tiiRKNS_5graph13AdjacencyListIxEENS0_9GhostModeE+0x59)[0xf7ef2ce9]
[barriere:05639] [22] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZNSt17_Function_handlerIFKN7dolfinx5graph13AdjacencyListIiEEP19ompi_communicator_tiiRKNS2_IxEENS0_4mesh9GhostModeEEPFS3_S6_iiS9_SB_EE9_M_invokeERKSt9_Any_dataOS6_OiSK_S9_OSB_+0x35)[0xf7e338b5]
[barriere:05639] [23] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx4mesh11create_meshEP19ompi_communicator_tRKNS_5graph13AdjacencyListIxEERKNS_3fem17CoordinateElementERKN2xt17xtensor_containerINSC_7uvectorIdSaIdEEELj2ELNSC_11layout_typeE1ENSC_22xtensor_expression_tagEEENS0_9GhostModeERKSt8functionIFKNS4_IiEES2_iiS7_SM_EE+0x163)[0xf7ecad63]
[barriere:05639] [24] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(+0x10fbb2)[0xf7e32bb2]
[barriere:05639] [25] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKSt8functionIFKNS_5graph13AdjacencyListIiEES3_iiRKNSF_IxEESC_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xb8)[0xf7e337c8]
[barriere:05639] [26] /fenics/fenics-dolfinx-0.3.0/debian/tmp-real/usr/lib/i386-linux-gnu/libdolfinx_real.so.0.3(_ZN7dolfinx10generation13RectangleMesh6createEP19ompi_communicator_tRKSt5arrayIS4_IdLj3EELj2EES4_IjLj2EENS_4mesh8CellTypeENSA_9GhostModeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x5f)[0xf7e3383f]
[barriere:05639] [27] ./demo_poisson(+0x19953)[0x5658d953]
[barriere:05639] [28] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0x106)[0xf5f1efd6]
[barriere:05639] [29] ./demo_poisson(+0x18451)[0x5658c451]
[barriere:05639] *** End of error message ***

@drew-parsons
Contributor Author

drew-parsons commented Oct 13, 2021

Further debugging shows the error is at this line in compute_nonlocal_dual_graph() in graphbuild.cpp:

buffer[pos[dest] + max_num_vertices_per_facet] += cell_offset;

In a two-process run on i386, the unmatched_facets loop here is skipped by one process and entered by the other. But the values are

pos[dest]=0
max_num_vertices_per_facet=-1

so of course it's crashing on buffer[-1].

max_num_vertices_per_facet=-1 does not sound correct. It's set at

const std::int32_t max_num_vertices_per_facet = -buffer_global_min[0];

With the explicit minus sign there, was buffer_global_min expected to hold a negative value? Evidently on i386 running 2 processes, it has buffer_global_min[0]=1.
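
A possible explanation for the minus sign, offered as an assumption rather than something traced through graphbuild.cpp: a common MPI idiom negates a quantity before an MPI_MIN reduction so that a single call returns a global maximum alongside genuine minima. Under that assumption the reduced value can never be positive, so buffer_global_min[0]=1 would mean some rank contributed an un-negated or corrupted value. The sketch below simulates the reduction in plain Python with no MPI, and every function name in it is hypothetical. Note also that Python would silently wrap a -1 index, whereas the C++ code reads out of bounds.

```python
def allreduce_min(per_rank_values):
    """Simulate an element-wise MPI_MIN all-reduce; one list of values per rank."""
    return [min(column) for column in zip(*per_rank_values)]


def global_max_via_min(local_maxima):
    """Negate before the min-reduction, negate again afterwards."""
    reduced = allreduce_min([[-value] for value in local_maxima])
    return -reduced[0]


def checked_max(reduced_min):
    """Undo the negation, rejecting values the trick can never produce."""
    if reduced_min > 0:
        raise ValueError(f"min-reduction of negated counts gave {reduced_min} > 0")
    return -reduced_min


# Hypothetical two-rank run: rank 0 has no unmatched facets, rank 1 has
# facets with 2 vertices each.
print(global_max_via_min([0, 2]))

# The value reported on i386 (buffer_global_min[0] = 1) fails the check.
try:
    checked_max(1)
except ValueError as err:
    print("rejected:", err)
```

A guard of this kind placed before the buffer indexing would turn the segfault into a readable error, which may make it easier to see which rank supplies the bad value.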

@drew-parsons drew-parsons changed the title build fails on 32-bit architectures build fails on 32-bit architectures: compute_nonlocal_dual_graph: max_num_vertices_per_facet=-1 Oct 13, 2021
@drew-parsons
Contributor Author

There are other Python test failures on 32-bit machines; I'm not certain whether they share the same underlying problem. In C++ only demo_poisson_mpi is failing, while in Python demo_helmholtz_2d.py, static-condensation-elasticity.py and demo_poisson.py all fail. See for example https://ci.debian.net/data/autopkgtest/testing/i386/f/fenics-dolfinx/16183257/log.gz

Python unit tests give other errors:

______________________________ test_cffi_assembly ______________________________

    @skip_if_complex
    def test_cffi_assembly():
        mesh = UnitSquareMesh(MPI.COMM_WORLD, 13, 13)
        V = FunctionSpace(mesh, ("Lagrange", 1))    
...
        ptrA = ffi.cast("intptr_t", ffi.addressof(lib, "tabulate_tensor_poissonA"))
        integrals = {IntegralType.cell: ([(-1, ptrA)], None)}
>       a = cpp.fem.Form([V._cpp_object, V._cpp_object], integrals, [], [], False)
E       RuntimeError: Unable to cast Python instance to C++ type (compile in debug mode for details)

and

______________________ test_compute_closest_entity_2d[0] _______________________

dim = 0

    @pytest.mark.parametrize("dim", [0, 1, 2])
    def test_compute_closest_entity_2d(dim):
        p = numpy.array([-1.0, -0.01, 0.0])
        mesh = UnitSquareMesh(MPI.COMM_WORLD, 15, 15)
        tree = BoundingBoxTree(mesh, dim)
        entity, distance = compute_closest_entity(tree, p, mesh)
...
        entities = compute_collisions_point(tree, p_c)
...
        if len(entities) > 0:
>           assert numpy.isin(entity, entities)
E           assert array(False)
E            +  where array(False) = <function isin at 0xed23d0b8>(0, array([134]))
E            +    where <function isin at 0xed23d0b8> = numpy.isin

python/test/unit/geometry/test_bounding_box_tree.py:295: AssertionError

The latter problem can be tested by hand (running the command manually). entities contains the same element array([134]) found on amd64, but entity gets set to 0 instead of 134. It's possibly relevant that on i386 entities is reported without a dtype,

array([134])

while on amd64 it gets a specific dtype,

array([134], dtype=int32)
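
One thing worth ruling out here: NumPy's repr omits the dtype when it matches the platform's default integer type, which is 32-bit on i386 and 64-bit on amd64 Linux. If that is what is happening, both arrays may well be int32 and the difference is only in how they print. A minimal sketch to check this on any platform (variable names are illustrative, not taken from the test suite):

```python
import numpy as np

# Platform default integer: 32-bit on i386, 64-bit on amd64 Linux.
default_int = np.dtype(np.int_)

# Pick an integer width that differs from the default on this platform.
other_int = np.dtype(np.int32) if default_int.itemsize == 8 else np.dtype(np.int64)

default_array = np.array([134], dtype=default_int)
other_array = np.array([134], dtype=other_int)

print(default_int, repr(default_array))  # repr carries no dtype suffix
print(other_int, repr(other_array))      # repr spells out the dtype
```

If entities.dtype turns out to be int32 on both machines, the dtype is probably a red herring and the entity = 0 result needs another explanation.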

@francesco-ballarin
Member

@drew-parsons is this still happening for 0.7.0?

@drew-parsons
Contributor Author

I'm waiting for Debian to process dolfinx 0.7.0; I'll be able to say after that.

@drew-parsons
Contributor Author

I guess the problem has cleared now. demo_poisson_mpi_2 and demo_poisson_mpi_3 are passing now. I'll run the next builds without skipping them.

@drew-parsons
Contributor Author

There was an armel error in gjk similar to the one reported in #1104, but the tests mentioned here seem to be passing.

@francesco-ballarin
Member

Great, thanks. We can close this one too then!

@francesco-ballarin francesco-ballarin closed this as not planned Oct 10, 2023