Passing post-install for mpimsg engine checks on Ubuntu 22.04 #6

Open
alberto-scolari opened this issue Apr 11, 2023 · 1 comment · May be fixed by #7

alberto-scolari commented Apr 11, 2023

The LPF mpimsg engine currently does not pass the post-install checks on Ubuntu 22.04, for three reasons:

  1. the initialization routine breaks
  2. the post-install debug checks hang
  3. the detection of MPI with Clang fails

This issue tracks these problems. I pushed workarounds for them on the branch associated with this issue, but some of them deserve more thought than I have given them so far.

The following paragraphs detail each problem together with its current workaround.

1. the initialization routine breaks

The mpimsg engine is initialized in the routine mpi_initializer in src/MPI/init.cpp, which expects int argc, char ** argv as parameters to pass on to MPI_Init_thread(). mpi_initializer is invoked while the library is loaded through LD_PRELOAD. However, initializing the stack with argc/argv at that point is a non-standard, undocumented feature of the Linux dynamic linker, and it was probably removed in recent versions: the values received are arbitrary, so the related assertions may fail or any access to argv results in a segfault.

Current solution: do not use argc/argv; the initialization routine now takes no arguments.
Pros: the problem is solved in a robust way, with no need to re-think the solution.
Cons: implementation-specific parameters can no longer be passed to MPI initialization (these are not used in practice).
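The workaround can be illustrated with a minimal sketch. It assumes the load-time hook is a GCC/Clang `__attribute__((constructor))` function, which this issue does not confirm, and it replaces the real MPI call with a hypothetical stub (`stub_init_thread`) so that it builds without MPI. The names `mpi_initializer_sketch` and `engine_is_initialized` are also illustrative only. The point is that the hook declares no parameters and hands null pointers to the initialization call, which MPI has allowed since MPI-2.

```cpp
// Hypothetical stand-in for MPI_Init_thread, so the sketch builds without MPI.
// With MPI available, the call would take the form
//   MPI_Init_thread(nullptr, nullptr, required, &provided);
static int stub_init_thread(int *argc, char ***argv) {
    // Succeeds only when no command-line data is handed over.
    return (argc == nullptr && argv == nullptr) ? 0 : 1;
}

// Zero-initialized before any load-time hook runs.
static bool engine_initialized = false;

// Assumption: the hook is a constructor function (GCC/Clang extension).
// It declares no parameters, so it never reads argc/argv from the loader.
__attribute__((constructor))
static void mpi_initializer_sketch() {
    engine_initialized = (stub_init_thread(nullptr, nullptr) == 0);
}

// Lets callers check that the load-time hook has run successfully.
bool engine_is_initialized() {
    return engine_initialized;
}
```

Because the hook reads nothing from the loader, its behaviour no longer depends on what the dynamic linker happens to place on the stack.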

2. the post-install debug checks hang

The post-install check at post-install/post-install-test.cmake.in, line 96, hangs with engine = mpimsg and any nprocs greater than 1 (I manually tried 1, which works, but any bigger value does not): the MPI-spawned processes hang. This is due to the call to std::abort() at src/debug/core.cpp, line 939. Some process or library on Ubuntu 22.04 (probably MPI itself, version 4.0 on Ubuntu 22.04) installs a signal handler for SIGABRT (I checked this in the test), which causes the application to hang when the debug library calls std::abort().
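One way to check whether a handler for SIGABRT is already installed is to query the current disposition with POSIX `sigaction` without changing it. The helper below is a sketch under that assumption. Its name, `sigabrt_handler_installed`, is hypothetical, and it is not necessarily the check that was used in the test.

```cpp
#include <signal.h>

// Returns true if SIGABRT is currently routed to a user-installed handler.
// Passing a null "new action" makes sigaction() a read-only query.
bool sigabrt_handler_installed() {
    struct sigaction current {};
    if (sigaction(SIGABRT, nullptr, &current) != 0) {
        return false;  // the query itself failed; report "no handler"
    }
    if (current.sa_flags & SA_SIGINFO) {
        return true;   // a three-argument (sa_sigaction) handler is set
    }
    // Otherwise, anything other than "default" or "ignore" is a real handler.
    return current.sa_handler != SIG_DFL && current.sa_handler != SIG_IGN;
}
```

Calling this helper in a test process before and after MPI initialization would show whether the MPI library is the component that installs the handler.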

Current solution: skip post-install debug checks. It is clearly just a hack.
A more refined solution would be to have an actual lpf_abort() routine that calls MPI_Abort(), but I don't know whether this is in the spirit of LPF. Another possible solution is to remove the calls to std::abort() and change the test to handle failures properly. I am not an LPF expert, so I have no preference, and there may be better solutions.
Finally, one can intercept the SIGABRT in each backend to handle failures and call MPI_Abort(), although this may conflict with the underlying MPI implementation.

3. the detection of MPI with Clang fails

During MPI detection (find_package(MPI) in cmake/mpi.cmake), CMake cannot find MPI if the compiler passed is Clang. Probably, the compilation of some internal tests fails due to compiler-specific options that CMake parses. For example, MPICH 4.0 in Ubuntu 22.04 has -flto=auto -ffat-lto-objects in the variable MPI_C_COMPILE_OPTIONS to enable Link-Time Optimization (LTO). These options cause Clang to fail, since the LTO information in the MPI binaries was produced by gcc.

Current solution: if the compiler is Clang, disable LTO during detection via MPI_COMPILER_FLAGS="-fno-lto", which is appended at the end of internal compiler definitions.
Pros: binaries are now built also with Clang.
Cons: may cause performance degradation (probably small); implicitly assumes MPI to be built with gcc
A robust solution may be very complex and may depend on CMake detection logic.

@anyzelman

For item two, discussion suggests that the debug layer should instead throw exceptions, which are then caught and returned to the calling exec or hook.
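A minimal sketch of that idea follows. Every name in it (`sketch_err_t`, `debug_check_failure`, `debug_require`, `guarded_call`) is hypothetical and not part of LPF. The debug check throws where it would otherwise call std::abort(), and a wrapper at the entry point converts the exception into an error code for the caller.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical error codes; these are not LPF's actual return values.
enum sketch_err_t { SKETCH_SUCCESS = 0, SKETCH_ERR_FATAL = 1 };

// Hypothetical exception type raised by failed debug checks.
struct debug_check_failure : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// A debug check that throws instead of calling std::abort().
void debug_require(bool condition, const std::string &message) {
    if (!condition) {
        throw debug_check_failure(message);
    }
}

// Entry-point wrapper: runs the body and maps a failed check to an error code.
sketch_err_t guarded_call(const std::function<void()> &body) {
    try {
        body();
        return SKETCH_SUCCESS;
    } catch (const debug_check_failure &) {
        return SKETCH_ERR_FATAL;
    }
}
```

For example, `guarded_call([] { debug_require(false, "bad argument"); })` returns `SKETCH_ERR_FATAL` without raising SIGABRT. In a multi-process run the other processes would still need to be told about the failure, which this sketch does not address.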

@alberto-scolari indicated he would like to clean up the PR further, so we may consider it to be in draft state. Please ping here when the PR is ready for review.
