LPF mpimsg engine currently does not pass post-install checks on Ubuntu 22.04 for several reasons:
- the initialization routine breaks
- the post-install debug checks hang
- the detection of MPI with Clang fails
This issue tracks these problems. I pushed workarounds for them on the branch associated with this issue, but some of them deserve a more careful solution than what I did.
The following paragraphs detail each problem together with its current workaround.
1. the initialization routine breaks
The mpimsg engine is initialized in the routine mpi_initializer in src/MPI/init.cpp, which expects int argc, char ** argv as parameters to be passed to MPI_Init_thread(). mpi_initializer is invoked while the library is loaded via LD_PRELOAD. However, having argc/argv available on the stack at that point is a non-standard, undocumented feature of the Linux dynamic linker, probably removed in recent versions: the values received are random, so related assertions may fail and any access to argv may result in a segfault.
Current solution: do not use argc/argv; the initialization routine now takes no inputs.
Pros: the problem is solved in a robust way, with no need to rethink the solution.
Cons: implementation-specific parameters can no longer be passed to MPI initialization (not used in practice).
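To illustrate the idea, here is a minimal sketch of an initializer that never touches argc/argv and forwards null pointers instead, which MPI accepts in place of the command-line arguments. It is not the code on the branch: `initialize_without_args`, `init_thread_fn` and `fake_init_thread` are hypothetical names, and the MPI call is passed in as a function pointer only so that the sketch runs without an MPI installation.

```cpp
#include <cassert>

// Same parameter list as MPI_Init_thread(int *, char ***, int, int *).
using init_thread_fn = int (*)(int *argc, char ***argv, int required, int *provided);

// Hypothetical initializer that takes no argc/argv: it forwards null
// pointers, so nothing is ever read from the (possibly random) stack values.
// Returns the granted thread level, or -1 if initialization failed.
inline int initialize_without_args(init_thread_fn init, int required) {
    int provided = -1;
    const int rc = init(nullptr, nullptr, required, &provided);
    return rc == 0 ? provided : -1;  // 0 plays the role of MPI_SUCCESS here
}

// Stand-in for MPI_Init_thread (an assumption made for this sketch only).
inline int fake_init_thread(int *argc, char ***argv, int required, int *provided) {
    assert(argc == nullptr && argv == nullptr);  // no command-line data is used
    *provided = required;                        // pretend the level was granted
    return 0;
}
```

In the engine itself one would presumably call MPI_Init_thread(nullptr, nullptr, ...) directly; the indirection above exists only to keep the example self-contained.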
2. the post-install debug checks hang
The post-install check at post-install/post-install-test.cmake.in, line 96, hangs with engine = mpimsg and any nprocs greater than 1 (I manually tried 1, which works, but any bigger value does not): the MPI-spawned processes hang. This is due to the call to std::abort() at src/debug/core.cpp, line 939. Some process or library on Ubuntu 22.04 (probably MPI itself, version 4.0 on Ubuntu 22.04) installs a signal handler for SIGABRT (I checked it in the test), which causes the application to hang when the debug library calls std::abort().
Current solution: skip post-install debug checks. It is clearly just a hack.
A more refined solution would be to have an actual lpf_abort() routine calling MPI_Abort(), but I don't know whether it is in the spirit of LPF. Another possible solution is to remove the calls to std::abort() and change the test to properly handle failures. I am not an LPF expert, so I have no preference, and there may be better solutions.
Finally, one can intercept the SIGABRT in each backend to handle failures and call MPI_Abort(), although this may conflict with the underlying MPI implementation.
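As a rough sketch of the lpf_abort() idea, assuming each engine registers its own termination routine at start-up: the MPI engines would register a function that calls MPI_Abort(), while engines without such a routine keep falling back to std::abort(). All names below (`abort_hook`, `set_abort_hook`, `lpf_abort_sketch`, `recording_hook`) are hypothetical and not part of LPF; the recording hook is a test double so that the example runs without MPI.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical hook type: an engine-specific way to terminate all processes.
using abort_hook = void (*)(int error_code);

// Storage for the registered hook and for the test double's record.
inline abort_hook &current_hook() { static abort_hook hook = nullptr; return hook; }
inline int &last_abort_code() { static int code = 0; return code; }

// Would be called once by each engine; an MPI engine would register a
// function whose body is MPI_Abort(MPI_COMM_WORLD, error_code).
inline void set_abort_hook(abort_hook hook) { current_hook() = hook; }

// Hypothetical lpf_abort(): not an existing LPF routine.
inline void lpf_abort_sketch(int error_code) {
    if (abort_hook hook = current_hook()) {
        hook(error_code);  // an MPI_Abort-based hook would not return
        return;            // reached only with test doubles such as the one below
    }
    std::abort();          // no engine hook registered: current behaviour
}

// Test double that records the code instead of terminating the process.
inline void recording_hook(int error_code) { last_abort_code() = error_code; }

inline void abort_usage_example() {
    set_abort_hook(recording_hook);
    lpf_abort_sketch(3);
    assert(last_abort_code() == 3);
}
```

Compared with intercepting SIGABRT, this keeps the decision inside the library and does not touch signal handlers that the MPI implementation may have installed.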
3. the detection of MPI with Clang fails
During MPI detection (find_package(MPI) in cmake/mpi.cmake), CMake cannot find MPI if the compiler passed is Clang. Probably, the compilation of some internal tests fails because of compiler-specific options that CMake picks up. For example, MPICH 4.0 in Ubuntu 22.04 has -flto=auto -ffat-lto-objects in the variable MPI_C_COMPILE_OPTIONS to enable Link-Time Optimization (LTO). These options cause Clang to fail, since the LTO information in the MPI binary was generated by gcc.
Current solution: if the compiler is Clang, disable LTO during detection via MPI_COMPILER_FLAGS="-fno-lto", which is appended at the end of the internal compiler definitions.
Pros: binaries can now be built with Clang as well.
Cons: may cause performance degradation (probably small); implicitly assumes that MPI was built with gcc.
A robust solution may be very complex and may depend on CMake detection logic.
For item 2, discussion suggests that the debug layer should probably throw exceptions instead, which are then caught and reported back to the calling exec or hook.
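A minimal sketch of that direction, under the assumption that a failed debug check throws and the entry point converts the exception into an error code. The names `exec_status`, `debug_failure`, `debug_check` and `exec_sketch` are hypothetical stand-ins, not LPF's actual types or functions.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for the error code an exec or hook call returns.
enum class exec_status { success, fatal };

// Hypothetical exception raised by the debug layer instead of std::abort().
struct debug_failure : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// A debug-layer check: throws on a violated precondition.
inline void debug_check(bool ok, const std::string &message) {
    if (!ok) throw debug_failure(message);
}

// Entry-point wrapper: runs the user function and maps a debug failure
// to an error code, so no signal is raised and the caller can react.
template <typename Spmd>
exec_status exec_sketch(Spmd &&spmd) {
    try {
        std::forward<Spmd>(spmd)();
        return exec_status::success;
    } catch (const debug_failure &) {
        return exec_status::fatal;
    }
}

inline void exception_usage_example() {
    assert(exec_sketch([] { debug_check(true, "valid call"); }) == exec_status::success);
    assert(exec_sketch([] { debug_check(false, "invalid call"); }) == exec_status::fatal);
}
```

With several processes, a failure on one process still has to be communicated to the others, which this single-process sketch does not attempt.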
@alberto-scolari indicated he would like to clean up the MR further, so we may consider it to be in draft state. Please ping here when the PR is ready for review.