
Reenabling unit tests for Pangea2 target #1245

Closed · castelletto1 opened this issue Dec 1, 2020 · 17 comments · Fixed by #1247
Labels: type: testing (Unit tests, non-regression testing, ...)

@castelletto1
Contributor

Unit test testLAOperations fails on Pangea2 target. See discussion #1239 (comment).

castelletto1 added the type: testing (Unit tests, non-regression testing, ...) and new labels on Dec 1, 2020
@TotoGaz
Contributor

TotoGaz commented Dec 1, 2020

Before it's deleted, here is the log. Only testLAOperations is failing.

         Start  78: testLAOperations
78: Test command: /data_local/sw/OpenMPI/RHEL7/2.1.5/gcc/8.3.0/bin/mpiexec "-n" "2" "--allow-run-as-root" "/tmp/build/tests/testLAOperations" "-x" "2"
78: Test timeout computed to be: 1500
78: [==========] Running 16 tests from 6 test suites.
78: [----------] Global test environment set-up.
78: [----------] 3 tests from Trilinos/LAOperationsTest/0, where TypeParam = geosx::TrilinosInterface
78: [ RUN      ] Trilinos/LAOperationsTest/0.VectorFunctions
78: [==========] Running 16 tests from 6 test suites.
78: [----------] Global test environment set-up.
78: [----------] 3 tests from Trilinos/LAOperationsTest/0, where TypeParam = geosx::TrilinosInterface
78: [ RUN      ] Trilinos/LAOperationsTest/0.VectorFunctions
78: [       OK ] Trilinos/LAOperationsTest/0.VectorFunctions (98 ms)
78: [ RUN      ] Trilinos/LAOperationsTest/0.MatrixMatrixOperations
78: [       OK ] Trilinos/LAOperationsTest/0.VectorFunctions (96 ms)
78: [ RUN      ] Trilinos/LAOperationsTest/0.MatrixMatrixOperations
78: [6cd5ab4340b7:06459] Read -1, expected 14400, errno = 1
78: [6cd5ab4340b7:06460] Read -1, expected 14400, errno = 1
78: [       OK ] Trilinos/LAOperationsTest/0.MatrixMatrixOperations (30 ms)
78: [ RUN      ] Trilinos/LAOperationsTest/0.RectangularMatrixOperations
78: [       OK ] Trilinos/LAOperationsTest/0.MatrixMatrixOperations (30 ms)
78: [ RUN      ] Trilinos/LAOperationsTest/0.RectangularMatrixOperations
78: [       OK ] Trilinos/LAOperationsTest/0.RectangularMatrixOperations (1 ms)
78: [----------] 3 tests from Trilinos/LAOperationsTest/0 (130 ms total)
78: 
78: [----------] 3 tests from Hypre/LAOperationsTest/0, where TypeParam = geosx::HypreInterface
78: [ RUN      ] Hypre/LAOperationsTest/0.VectorFunctions
78: [       OK ] Trilinos/LAOperationsTest/0.RectangularMatrixOperations (1 ms)
78: [----------] 3 tests from Trilinos/LAOperationsTest/0 (129 ms total)
78: 
78: [----------] 3 tests from Hypre/LAOperationsTest/0, where TypeParam = geosx::HypreInterface
78: [ RUN      ] Hypre/LAOperationsTest/0.VectorFunctions
78: [       OK ] Hypre/LAOperationsTest/0.VectorFunctions (3 ms)
78: [ RUN      ] Hypre/LAOperationsTest/0.MatrixMatrixOperations
78: [       OK ] Hypre/LAOperationsTest/0.VectorFunctions (2 ms)
78: [ RUN      ] Hypre/LAOperationsTest/0.MatrixMatrixOperations
78: [       OK ] Hypre/LAOperationsTest/0.MatrixMatrixOperations (10 ms)
78: [ RUN      ] Hypre/LAOperationsTest/0.RectangularMatrixOperations
78: [       OK ] Hypre/LAOperationsTest/0.MatrixMatrixOperations (9 ms)
78: [ RUN      ] Hypre/LAOperationsTest/0.RectangularMatrixOperations
78: [       OK ] Hypre/LAOperationsTest/0.RectangularMatrixOperations (1 ms)
78: [----------] 3 tests from Hypre/LAOperationsTest/0 (15 ms total)
78: 
78: [----------] 3 tests from Trilinos/SolverTestLaplace2D/0, where TypeParam = geosx::TrilinosInterface
78: [ RUN      ] Trilinos/SolverTestLaplace2D/0.Direct
78: [       OK ] Hypre/LAOperationsTest/0.RectangularMatrixOperations (1 ms)
78: [----------] 3 tests from Hypre/LAOperationsTest/0 (14 ms total)
78: 
78: [----------] 3 tests from Trilinos/SolverTestLaplace2D/0, where TypeParam = geosx::TrilinosInterface
78: [ RUN      ] Trilinos/SolverTestLaplace2D/0.Direct
78: [6cd5ab4340b7:06460] Read -1, expected 60000, errno = 1
78: [6cd5ab4340b7:06459] Read -1, expected 518384, errno = 1
78: [6cd5ab4340b7:06459] Read -1, expected 40000, errno = 1
78: [6cd5ab4340b7:06460] Read -1, expected 40000, errno = 1
78: [       OK ] Trilinos/SolverTestLaplace2D/0.Direct (669 ms)
78: [ RUN      ] Trilinos/SolverTestLaplace2D/0.GMRES_ILU
78: [       OK ] Trilinos/SolverTestLaplace2D/0.Direct (671 ms)
78: [ RUN      ] Trilinos/SolverTestLaplace2D/0.GMRES_ILU
78: [       OK ] Trilinos/SolverTestLaplace2D/0.GMRES_ILU (177 ms)
78: [ RUN      ] Trilinos/SolverTestLaplace2D/0.CG_AMG
78: [       OK ] Trilinos/SolverTestLaplace2D/0.GMRES_ILU (179 ms)
78: [ RUN      ] Trilinos/SolverTestLaplace2D/0.CG_AMG
78: [       OK ] Trilinos/SolverTestLaplace2D/0.CG_AMG (160 ms)
78: [----------] 3 tests from Trilinos/SolverTestLaplace2D/0 (1010 ms total)
78: 
78: [----------] 3 tests from Hypre/SolverTestLaplace2D/0, where TypeParam = geosx::HypreInterface
78: [ RUN      ] Hypre/SolverTestLaplace2D/0.Direct
78: [       OK ] Trilinos/SolverTestLaplace2D/0.CG_AMG (162 ms)
78: [----------] 3 tests from Trilinos/SolverTestLaplace2D/0 (1012 ms total)
78: 
78: [----------] 3 tests from Hypre/SolverTestLaplace2D/0, where TypeParam = geosx::HypreInterface
78: [ RUN      ] Hypre/SolverTestLaplace2D/0.Direct
78: [6cd5ab4340b7:06460] Read -1, expected 20000, errno = 1
78: [6cd5ab4340b7:06459] Read -1, expected 20000, errno = 1
78: [6cd5ab4340b7:06460] Read -1, expected 199192, errno = 1
78: [6cd5ab4340b7:06460] Read -1, expected 99596, errno = 1
78: [6cd5ab4340b7:06459] Read -1, expected 199192, errno = 1
78: [6cd5ab4340b7:06459] Read -1, expected 99596, errno = 1
78: [6cd5ab4340b7:06460] Read -1, expected 40000, errno = 1
78: [6cd5ab4340b7:06459] Read -1, expected 40000, errno = 1
78: Received signal 11: Segmentation fault
78: 
78: ** StackTrace of 16 frames **
78: Frame 0:  
78: Frame 1: hypre_VectorToParVector 
78: Frame 2: HYPRE_VectorToParVector 
78: Frame 3: geosx::SuiteSparseSolve(geosx::SuiteSparse&, geosx::HypreVector const&, geosx::HypreVector&, bool) 
78: Frame 4:  
78: Frame 5: geosx::HypreSolver::solve(geosx::HypreMatrix&, geosx::HypreVector&, geosx::HypreVector&, geosx::DofManager const*) 
78: Frame 6: SolverTestBase<geosx::HypreInterface>::test(geosx::LinearSolverParameters const&) 
78: Frame 7: gtest_suite_SolverTestLaplace2D_::Direct<geosx::HypreInterface>::TestBody() 
78: Frame 8: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
78: Frame 9: testing::Test::Run() 
78: Frame 10: testing::TestInfo::Run() 
78: Frame 11: testing::TestSuite::Run() 
78: Frame 12: testing::internal::UnitTestImpl::RunAllTests() 
78: Frame 13: testing::UnitTest::Run() 
78: Frame 14: main 
78: Frame 15: __libc_start_main 
78: Frame 16:  
78: =====
78: 
78: --------------------------------------------------------------------------
78: MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
78: with errorcode 1.
78: 
78: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
78: You may or may not see output from other processes, depending on
78: exactly when Open MPI kills them.
78: --------------------------------------------------------------------------
78: Received signal 15: Terminated
78: 
78: ** StackTrace of 20 frames **
78: Frame 0:  
78: Frame 1: __sched_yield 
78: Frame 2: mca_pml_ob1_recv 
78: Frame 3: ompi_coll_base_gather_intra_basic_linear 
78: Frame 4: MPI_Gather 
78: Frame 5: hypre_VectorToParVector 
78: Frame 6: HYPRE_VectorToParVector 
78: Frame 7: geosx::SuiteSparseSolve(geosx::SuiteSparse&, geosx::HypreVector const&, geosx::HypreVector&, bool) 
78: Frame 8:  
78: Frame 9: geosx::HypreSolver::solve(geosx::HypreMatrix&, geosx::HypreVector&, geosx::HypreVector&, geosx::DofManager const*) 
78: Frame 10: SolverTestBase<geosx::HypreInterface>::test(geosx::LinearSolverParameters const&) 
78: Frame 11: gtest_suite_SolverTestLaplace2D_::Direct<geosx::HypreInterface>::TestBody() 
78: Frame 12: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
78: Frame 13: testing::Test::Run() 
78: Frame 14: testing::TestInfo::Run() 
78: Frame 15: testing::TestSuite::Run() 
78: Frame 16: testing::internal::UnitTestImpl::RunAllTests() 
78: Frame 17: testing::UnitTest::Run() 
78: Frame 18: main 
78: Frame 19: __libc_start_main 
78: Frame 20:  
78: =====
78: 
78: [6cd5ab4340b7:06454] 1 more process has sent help message help-mpi-api.txt / mpi-abort
78: [6cd5ab4340b7:06454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

(Five comments from @TotoGaz and @castelletto1 have been minimized.)

@TotoGaz
Contributor

TotoGaz commented Dec 3, 2020

Alright, I mixed things up between multiple containers. The current one has the proper #define.
But I cannot reproduce the error in my container...

@TotoGaz
Contributor

TotoGaz commented Dec 3, 2020

I found open-mpi/ompi#4948 about our

79: [a071a014f61e:05550] Read -1, expected 10824, errno = 1

I get a lot of messages of this kind (on Travis there are far fewer 🤷).
I did the suggested

export OMPI_MCA_btl_vader_single_copy_mechanism=none

and the message no longer appears. Good. I shall try to understand what's going on and propose a fix for this.

Then, in my test, I can now see the following (it was already there, but hidden by all the noise):

79: [----------] 2 tests from Trilinos/SolverTestElasticity2D/0, where TypeParam = geosx::TrilinosInterface
79: [ RUN      ] Trilinos/SolverTestElasticity2D/0.Direct
79: [       OK ] Trilinos/SolverTestElasticity2D/0.Direct (377 ms)
79: [ RUN      ] Trilinos/SolverTestElasticity2D/0.GMRES_AMG
79: [       OK ] Trilinos/SolverTestElasticity2D/0.Direct (377 ms)
79: [ RUN      ] Trilinos/SolverTestElasticity2D/0.GMRES_AMG
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: [       OK ] Trilinos/SolverTestElasticity2D/0.GMRES_AMG (121 ms)
79: [----------] 2 tests from Trilinos/SolverTestElasticity2D/0 (498 ms total)

I know it's for Trilinos only (not hypre) and it happens during GMRES_AMG, not Direct, but:

  • It says ERROR, yet the test passes?
  • Could some size error explain the segfault in different contexts?

@castelletto1
Contributor Author

castelletto1 commented Dec 3, 2020

I found open-mpi/ompi#4948 about our

79: [a071a014f61e:05550] Read -1, expected 10824, errno = 1

I get a lot of messages of this kind (on Travis there are far fewer 🤷).
I did the suggested

export OMPI_MCA_btl_vader_single_copy_mechanism=none

and the message no longer appears. Good. I shall try to understand what's going on and propose a fix for this.

I think we had already done this (see discussion here):

https://github.com/GEOSX/GEOSX/blob/43fc91e51e5d9be51457d4dee49e13c613f025db/.travis.yml#L47

@castelletto1
Contributor Author

Then, in my test, I can now see the following (it was already there, but hidden by all the noise):

79: [----------] 2 tests from Trilinos/SolverTestElasticity2D/0, where TypeParam = geosx::TrilinosInterface
79: [ RUN      ] Trilinos/SolverTestElasticity2D/0.Direct
79: [       OK ] Trilinos/SolverTestElasticity2D/0.Direct (377 ms)
79: [ RUN      ] Trilinos/SolverTestElasticity2D/0.GMRES_AMG
79: [       OK ] Trilinos/SolverTestElasticity2D/0.Direct (377 ms)
79: [ RUN      ] Trilinos/SolverTestElasticity2D/0.GMRES_AMG
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: ERROR : in almalgamation 10200 => 5100(5100).  Have you specified the correct number of DOFs?
79: [       OK ] Trilinos/SolverTestElasticity2D/0.GMRES_AMG (121 ms)
79: [----------] 2 tests from Trilinos/SolverTestElasticity2D/0 (498 ms total)

I know it's for Trilinos only (not hypre) and it happens during GMRES_AMG, not Direct, but:

  • It says ERROR, yet the test passes?
  • Could some size error explain the segfault in different contexts?

This is definitely worth investigating more.

@TotoGaz
Contributor

TotoGaz commented Dec 3, 2020

Oh yes, I did not copy it 🤦

The error comes from https://github.com/trilinos/Trilinos/blob/15dfd41034f5d77060c201a22dc9ecca9cff2293/packages/ml/src/Coarsen/ml_agg_uncoupled.c#L368, in the function ML_Aggregate_CoarsenUncoupled...
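
For context, ML's aggregation amalgamates nodal DOFs according to its "PDE equations" setting, and this message typically shows up when that number does not match the matrix layout. Below is a minimal sketch of generic ML_Epetra usage showing where that knob lives; it is an illustration only (buildElasticityAMG is a hypothetical helper), not the GEOSX TrilinosInterface code path.

#include "Epetra_RowMatrix.h"
#include "Teuchos_ParameterList.hpp"
#include "ml_MultiLevelPreconditioner.h"

// Builds a smoothed-aggregation AMG preconditioner for a 2D elasticity matrix.
// "PDE equations" tells ML how many DOFs to amalgamate per node.
ML_Epetra::MultiLevelPreconditioner * buildElasticityAMG( Epetra_RowMatrix & A )
{
  Teuchos::ParameterList mlList;
  ML_Epetra::SetDefaults( "SA", mlList );   // smoothed aggregation defaults
  mlList.set( "PDE equations", 2 );         // 2 displacement DOFs per node in 2D
  return new ML_Epetra::MultiLevelPreconditioner( A, mlList );  // caller owns the preconditioner
}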

@castelletto1
Contributor Author

Another element I should have mentioned (maybe supporting your size error concern): at some point I was able to get the Pangea2 target passing on Travis by adding, for debugging purposes,

GEOSX_LOG_RANK_VAR( partitioning[0] );
GEOSX_LOG_RANK_VAR( partitioning[1] );

at line 162 below

https://github.com/GEOSX/GEOSX/blob/43fc91e51e5d9be51457d4dee49e13c613f025db/src/coreComponents/linearAlgebra/interfaces/hypre/HypreSuiteSparse.cpp#L157-L167
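
For what it's worth, one hypothetical way to probe the size-error hypothesis would be a rank-consistency check on the partitioning array right before the gather. This is only an illustrative sketch (not GEOSX code), assuming partitioning holds the same nRanks + 1 offsets on every rank:

#include <mpi.h>
#include <cassert>
#include <vector>

// Hypothetical debugging aid: check that every rank sees the same partitioning
// offsets before the serial direct solve gathers the system on one rank.
void checkPartitioningConsistency( std::vector< long long > const & partitioning,
                                   MPI_Comm comm )
{
  std::vector< long long > reference( partitioning );
  // Broadcast rank 0's view of the offsets ...
  MPI_Bcast( reference.data(), static_cast< int >( reference.size() ),
             MPI_LONG_LONG, 0, comm );
  // ... and require that the local view matches it on every rank.
  for( std::size_t i = 0; i < partitioning.size(); ++i )
  {
    assert( reference[ i ] == partitioning[ i ] );
  }
}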

@TotoGaz
Contributor

TotoGaz commented Dec 8, 2020

Hi @castelletto1
For my understanding, does it make sense to run the direct solver in parallel with the parallel parameter set to false?

LinearSolverParameters params_Direct()
{
  LinearSolverParameters parameters;
  parameters.solverType = geosx::LinearSolverParameters::SolverType::direct;
  parameters.direct.parallel = 0;
  return parameters;
}

@castelletto1
Contributor Author

parallel is a parameter used to select the direct solver type:

  • 0: umfpack (serial direct solver). The system matrix is gathered on a single process, the linear system is solved in serial, and the solution vector is then scattered back across processes (see the sketch after this list)
  • 1: SuperLU_Dist (parallel direct solver)
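
A minimal sketch of that gather/solve/scatter pattern (illustrative MPI code only, with a placeholder in place of the actual umfpack factorization; the names are not GEOSX API):

#include <mpi.h>
#include <vector>

// Option 0 in a nutshell: gather the distributed right-hand side on rank 0,
// solve there in serial, then scatter the solution back to all ranks.
void serialDirectSolve( std::vector< double > const & localRhs,
                        std::vector< double > & localSol,
                        MPI_Comm comm )
{
  int rank, nRanks;
  MPI_Comm_rank( comm, &rank );
  MPI_Comm_size( comm, &nRanks );

  int const localSize = static_cast< int >( localRhs.size() );
  localSol.resize( localRhs.size() );

  // Collect the local sizes so rank 0 can build the gather/scatter layout.
  std::vector< int > counts( nRanks ), displs( nRanks );
  MPI_Gather( &localSize, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, comm );

  std::vector< double > globalRhs, globalSol;
  if( rank == 0 )
  {
    int total = 0;
    for( int r = 0; r < nRanks; ++r ) { displs[ r ] = total; total += counts[ r ]; }
    globalRhs.resize( total );
    globalSol.resize( total );
  }

  // 1) gather the distributed right-hand side on rank 0
  MPI_Gatherv( localRhs.data(), localSize, MPI_DOUBLE,
               globalRhs.data(), counts.data(), displs.data(), MPI_DOUBLE, 0, comm );

  // 2) serial solve on rank 0 (umfpack in the real code; identity placeholder here)
  if( rank == 0 )
  {
    globalSol = globalRhs;
  }

  // 3) scatter the solution back across processes
  MPI_Scatterv( globalSol.data(), counts.data(), displs.data(), MPI_DOUBLE,
                localSol.data(), localSize, MPI_DOUBLE, 0, comm );
}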

Note that option 0 was actually the only one available for the Trilinos LAI (via the KLU solver) before PR #1169 got merged.

For testing physics solver capabilities (see integrated tests), umfpack is preferable since it can be used with default options, which is not the case for SuperLU_Dist.

@TotoGaz
Contributor

TotoGaz commented Dec 8, 2020

Thx @castelletto1

I gave parameters.direct.parallel = 1 a shot instead of 0 and (luckily) the test passed 🤷‍♂️
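
For clarity, that is the one-line change in the helper quoted earlier (a sketch mirroring that snippet):

LinearSolverParameters params_Direct()
{
  LinearSolverParameters parameters;
  parameters.solverType = geosx::LinearSolverParameters::SolverType::direct;
  parameters.direct.parallel = 1;   // was 0 (umfpack); 1 selects SuperLU_Dist
  return parameters;
}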

So there should be something wrong that is exposed by HypreSolver::solve_serialDirect (which is run iff parameters.direct.parallel is 0).

Do you have any idea about that?

@castelletto1
Contributor Author

castelletto1 commented Dec 8, 2020

Originally that unit test had parameters.direct.parallel = 1, but it didn't pass on Lassen (see issue #1238). As a temporary fix (both direct solvers should be checked in unit tests), we switched to parameters.direct.parallel = 0. As a result, the Pangea2 build now has problems.

@TotoGaz
Contributor

TotoGaz commented Dec 8, 2020

Live debugging from travis-ci, it appears that the test (ctest --extra-verbose -V -R testLAOperations) fails randomly...
It actually fails roughly 25% of the time when run multiple times in a row.
