Seissol sometimes hangs at "Mesh initialized in" #568

Open
Thomas-Ulrich opened this issue Jun 3, 2022 · 3 comments

@Thomas-Ulrich
Contributor

Thomas-Ulrich commented Jun 3, 2022

Describe the bug
SeisSol sometimes hangs right after printing "Mesh initialized in".
Log of a hanging run:

Fri Jun 03 09:35:36, Info:  Welcome to SeisSol 
Fri Jun 03 09:35:36, Info:  Copyright (c) 2012-2021, SeisSol Group 
Fri Jun 03 09:35:36, Info:  Built on: Jun  3 2022 09:33:02 
Fri Jun 03 09:35:36, Info:  Version: 202103_Sumatra-685-gd5656cd3 (modified) 
Fri Jun 03 09:35:36, Info:  Running on: i01r01c01s12 
Fri Jun 03 09:35:36, Info:  Using OMP with #threads/rank: 46 
Fri Jun 03 09:35:36, Info:  OpenMP worker affinity (this process): "0123456789|0123456789|012-------|----------|--------89|0123456789|0123456789|0---------|----------|------" 
Fri Jun 03 09:35:36, Info:  OpenMP worker affinity (this node)   : "0123456789|0123456789|012-456789|0123456789|0123456-89|0123456789|0123456789|0-23456789|0123456789|01234-" 
Fri Jun 03 09:35:36, Info:  Using MPI with #ranks: 32 
Fri Jun 03 09:35:36, Info:  Running with communication thread 
Fri Jun 03 09:35:36, Info:  Communication thread affinity: "----------|----------|---3------|----------|----------|----------|----------|-1--------|----------|------" 
Fri Jun 03 09:35:36, Info:  The stack size ulimit is  2097152 [kb]. 
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <                SeisSol MPI initialization               >
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    |  Double precision used for real.
Rank:        0 | Info    | <--------------------------------------------------------->
 INFORMATION: The assumed unit number is           6 for stdout and           0 
 for stderr.
              If no information follows, please change the value.
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <     Start ini_SeisSol ...                               >
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <  Parameters read from file: parameters_base.par              >
Rank:        0 | Info    | <                                                         >
Rank:        0 | Info    | (Drucker-Prager) plasticity assumed .
Rank:        0 | Info    | Plastic relaxation Tv is set to:   3.000000000000000E-002
Rank:        0 | Info    | No attenuation assumed. 
Rank:        0 | Info    | No adjoint wavefield generated. 
Rank:        0 | Info    | Isotropic material is assumed. 
Rank:        0 | Info    | GPwise initialization. 
Rank:        0 | Info    | Read a PUML mesh file
Rank:        0 | Warning | Ignoring min space order from parameter file, using           4
Rank:        0 | Info    | Volume output is in XDMF format (new implementation)
Rank:        0 | Info    | Output data are generated at delta T=    100.000000000000     
Rank:        0 | Info    | Use POSIX XdmfWriter backend
Rank:        0 | Info    | Refinement strategy for volume output is Equal Face Area and Face Extraction : 32 subcells per cell
Fri Jun 03 09:35:36, Info:  Running mini SeisSol to determine node weight 
Fri Jun 03 09:35:36, Info:  Node weights: mean = 20.0328  std = 0.121472  min = 19.7707  median = 20.0651  max = 20.2116 
Fri Jun 03 09:35:36, Info:  Reading PUML mesh meshes/Mesh-SSI-1.puml.h5 
Fri Jun 03 09:35:36, Info:  Found 5372975 cells 
Fri Jun 03 09:35:36, Info:  Found 961353 vertices 
Fri Jun 03 09:35:40, Info:  Computing LTS weights. 
Fri Jun 03 09:35:47, Info:  Computing LTS weights. Done.  (173527 reductions.)
      Setup: Max:   0.262, Sum:   8.382, Balance:   1.000
   Matching: Max:   0.202, Sum:   6.454, Balance:   1.000
Contraction: Max:   0.148, Sum:   4.732, Balance:   1.000
   InitPart: Max:   0.034, Sum:   1.074, Balance:   1.000
    Project: Max:   0.004, Sum:   0.103, Balance:   1.108
 Initialize: Max:   0.033, Sum:   1.046, Balance:   1.014
      K-way: Max:   0.088, Sum:   2.800, Balance:   1.000
      Remap: Max:   0.001, Sum:   0.027, Balance:   1.003
      Total: Max:   0.770, Sum:  24.646, Balance:   1.000
Fri Jun 03 09:35:57, Info:  Reading mesh. Done. 
Fri Jun 03 09:35:57, Info:  Extracting fault information 
Fri Jun 03 09:35:59, Info:  Mesh initialized in: 22.1733 (min: 22.1718, max: 22.1752)

Log of a run that does not hang (Debug build, but Debug does not always avoid the hang):

Fri Jun 03 09:30:08, Info:  Welcome to SeisSol 
Fri Jun 03 09:30:08, Info:  Copyright (c) 2012-2021, SeisSol Group 
Fri Jun 03 09:30:08, Info:  Built on: Jun  3 2022 09:27:53 
Fri Jun 03 09:30:08, Info:  Version: 202103_Sumatra-685-gd5656cd3 (modified) 
Fri Jun 03 09:30:08, Info:  Running on: i01r01c01s01 
Fri Jun 03 09:30:08, Info:  Using OMP with #threads/rank: 46 
Fri Jun 03 09:30:08, Info:  OpenMP worker affinity (this process): "0123456789|0123456789|012-------|----------|--------89|0123456789|0123456789|0---------|----------|------" 
Fri Jun 03 09:30:08, Info:  OpenMP worker affinity (this node)   : "0123456789|0123456789|012-456789|0123456789|0123456-89|0123456789|0123456789|0-23456789|0123456789|01234-" 
Fri Jun 03 09:30:09, Info:  Using MPI with #ranks: 32 
Fri Jun 03 09:30:09, Info:  Running with communication thread 
Fri Jun 03 09:30:09, Info:  Communication thread affinity: "----------|----------|---3------|----------|----------|----------|----------|-1--------|----------|------" 
Fri Jun 03 09:30:09, Info:  The stack size ulimit is  2097152 [kb]. 
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <                SeisSol MPI initialization               >
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    |  Double precision used for real.
Rank:        0 | Info    | <--------------------------------------------------------->
 INFORMATION: The assumed unit number is           6 for stdout and           0 
 for stderr.
              If no information follows, please change the value.
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <     Start ini_SeisSol ...                               >
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <  Parameters read from file: parameters_base.par              >
Rank:        0 | Info    | <                                                         >
Rank:        0 | Info    | (Drucker-Prager) plasticity assumed .
Rank:        0 | Info    | Plastic relaxation Tv is set to:   3.000000000000000E-002
Rank:        0 | Info    | No attenuation assumed. 
Rank:        0 | Info    | No adjoint wavefield generated. 
Rank:        0 | Info    | Isotropic material is assumed. 
Rank:        0 | Info    | GPwise initialization. 
Rank:        0 | Info    | Read a PUML mesh file
Rank:        0 | Warning | Ignoring min space order from parameter file, using           4
Rank:        0 | Info    | Volume output is in XDMF format (new implementation)
Rank:        0 | Info    | Output data are generated at delta T=    100.000000000000     
Rank:        0 | Info    | Use POSIX XdmfWriter backend
Rank:        0 | Info    | Refinement strategy for volume output is Equal Face Area and Face Extraction : 32 subcells per cell
Fri Jun 03 09:30:09, Info:  Running mini SeisSol to determine node weight 
Fri Jun 03 09:30:09, Info:  Node weights: mean = 1.51876  std = 0.0212266  min = 1.48025  median = 1.51984  max = 1.55558 
Fri Jun 03 09:30:09, Info:  Reading PUML mesh meshes/Mesh-SSI-1.puml.h5 
Fri Jun 03 09:30:09, Info:  Found 5372975 cells 
Fri Jun 03 09:30:09, Info:  Found 961353 vertices 
Fri Jun 03 09:30:13, Info:  Computing LTS weights. 
Fri Jun 03 09:30:20, Info:  Computing LTS weights. Done.  (173527 reductions.)
      Setup: Max:   0.251, Sum:   8.029, Balance:   1.000
   Matching: Max:   0.195, Sum:   6.246, Balance:   1.000
Contraction: Max:   0.146, Sum:   4.685, Balance:   1.000
   InitPart: Max:   0.033, Sum:   1.051, Balance:   1.000
    Project: Max:   0.004, Sum:   0.103, Balance:   1.113
 Initialize: Max:   0.033, Sum:   1.029, Balance:   1.013
      K-way: Max:   0.089, Sum:   2.862, Balance:   1.000
      Remap: Max:   0.001, Sum:   0.024, Balance:   1.006
      Total: Max:   0.752, Sum:  24.056, Balance:   1.000
Fri Jun 03 09:30:30, Info:  Reading mesh. Done. 
Fri Jun 03 09:30:30, Info:  Extracting fault information 
Fri Jun 03 09:30:31, Info:  Mesh initialized in: 22.2249 (min: 22.2066, max: 22.2422)
Fri Jun 03 09:30:54, Info:  Deriving clusters ids for min. time step width / multiRate: 1.35243e-05 / 2 
Fri Jun 03 09:30:55, Info:  Number of elements in time clusters: 
Fri Jun 03 09:30:55, Info:  0: 243 
Fri Jun 03 09:30:55, Info:  1: 867 
Fri Jun 03 09:30:55, Info:  2: 1326 
Fri Jun 03 09:30:55, Info:  3: 2310 
Fri Jun 03 09:30:55, Info:  4: 3712 
Fri Jun 03 09:30:55, Info:  5: 5099 
Fri Jun 03 09:30:55, Info:  6: 10158 
Fri Jun 03 09:30:55, Info:  7: 1059256 
Fri Jun 03 09:30:55, Info:  8: 2436420 
Fri Jun 03 09:30:55, Info:  9: 897325 
Fri Jun 03 09:30:55, Info:  10: 295918 
Fri Jun 03 09:30:55, Info:  11: 86978 
Fri Jun 03 09:30:55, Info:  12: 568276 
Fri Jun 03 09:30:55, Info:  13: 5087 
Fri Jun 03 09:30:55, Info:  maximum theoretical speedup (compared to GTS): 388.907 per cell LTS, 245.757 with the used clustering. 
Fri Jun 03 09:30:55, Info:  Number of elements in dynamic rupture time clusters: 
Fri Jun 03 09:30:55, Info:  0 (dr): 117 
Fri Jun 03 09:30:55, Info:  1 (dr): 325 
Fri Jun 03 09:30:55, Info:  2 (dr): 250 
Fri Jun 03 09:30:55, Info:  3 (dr): 469 
Fri Jun 03 09:30:55, Info:  4 (dr): 688 
Fri Jun 03 09:30:55, Info:  5 (dr): 812 
Fri Jun 03 09:30:55, Info:  6 (dr): 1307 
Fri Jun 03 09:30:55, Info:  7 (dr): 198915 
Fri Jun 03 09:30:55, Info:  8 (dr): 72405 
Fri Jun 03 09:30:55, Info:  9 (dr): 80 
Fri Jun 03 09:30:55, Info:  10 (dr): 0 
Fri Jun 03 09:30:55, Info:  11 (dr): 0 
Fri Jun 03 09:30:55, Info:  12 (dr): 0 
Fri Jun 03 09:30:55, Info:  13 (dr): 0 
Rank:        0 | Info    | Synchronizing copy cell material data.
Rank:        0 | Info    | Initializing element local matrices.
Rank:        0 | Info    | DG initial condition projection... 
Fri Jun 03 09:30:56, Info:  Using initial condition  "Zero" . 
Rank:        0 | Info    | DG initial condition projection done. 
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <     Start inioutput_SeisSol ...                         >
Rank:        0 | Info    | <--------------------------------------------------------->
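The clustering speedup reported in the log above can be cross-checked from the per-cluster cell counts. The sketch below is only a reading of the log, not SeisSol's actual implementation: it assumes that cluster i is updated once every 2^i smallest time steps (from "multiRate: 2") and that cost is proportional to the number of cell updates. The function name `clustered_lts_speedup` is made up for this illustration.

```python
# Cell counts per time cluster, copied from the log above (index = cluster id).
cluster_cells = [243, 867, 1326, 2310, 3712, 5099, 10158,
                 1059256, 2436420, 897325, 295918, 86978, 568276, 5087]
RATE = 2  # "multiRate: 2" in the log


def clustered_lts_speedup(cells, rate):
    """Ratio of GTS cell updates to clustered-LTS cell updates.

    Assumption (not taken from the SeisSol sources): with global time
    stepping every cell is updated at the smallest time step, while a cell
    in cluster i is updated only once every rate**i smallest time steps.
    """
    gts_updates = sum(cells)
    lts_updates = sum(n / rate**i for i, n in enumerate(cells))
    return gts_updates / lts_updates


speedup = clustered_lts_speedup(cluster_cells, RATE)
print(f"{sum(cluster_cells)} cells, speedup vs. GTS: {speedup:.3f}")
# prints "5372975 cells, speedup vs. GTS: 245.757"
```

Under these assumptions the result agrees with both the "Found 5372975 cells" line and the "245.757 with the used clustering" figure, so the mesh and clustering of the working run look consistent.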

Example of a crashing run (with extended logging):

Fri Jun 03 13:57:02, Info:  Node weights: mean = 19.978  std = 0.211934  min = 19.0552  median = 20.046  max = 20.2026 
Fri Jun 03 13:57:02, Info:  Reading PUML mesh meshes/Mesh-SSI-1.puml.h5 
Fri Jun 03 13:57:02, Info:  Found 5372975 cells 
Fri Jun 03 13:57:02, Info:  Found 961353 vertices 
Fri Jun 03 13:57:06, Info:  Computing LTS weights. 
Fri Jun 03 13:57:13, Info:  Computing LTS weights. Done.  (173527 reductions.)
      Setup: Max:   0.265, Sum:   8.468, Balance:   1.000
   Matching: Max:   0.200, Sum:   6.391, Balance:   1.000
Contraction: Max:   0.152, Sum:   4.879, Balance:   1.000
   InitPart: Max:   0.033, Sum:   1.055, Balance:   1.000
    Project: Max:   0.004, Sum:   0.103, Balance:   1.115
 Initialize: Max:   0.036, Sum:   1.139, Balance:   1.011
      K-way: Max:   0.086, Sum:   2.755, Balance:   1.000
      Remap: Max:   0.001, Sum:   0.041, Balance:   1.001
      Total: Max:   0.777, Sum:  24.859, Balance:   1.000
Fri Jun 03 13:57:23, Info:  Reading mesh. Done. 
Fri Jun 03 13:57:23, Info:  Extracting fault information 
Fri Jun 03 13:57:25, Info:  Mesh initialized in: 22.2336 (min: 22.2314, max: 22.2352)
Rank:       22 | Info    | <--------------------------------------------------------->
Rank:       22 | Info    | <           Calling DG Initialization level 1             >
Rank:       22 | Info    | <--------------------------------------------------------->
Rank:       22 | Info    | Interface SEISSOL successful 
(...)
Rank:       27 | Info    | <--------------------------------------------------------->
Rank:       27 | Info    | <           Calling DG Initialization level 1             >
Rank:       27 | Info    | <--------------------------------------------------------->
Rank:       27 | Info    | Interface SEISSOL successful 
malloc_consolidate(): invalid chunk size

A run that does not crash continues (with extended logging) with:

Rank:       16 | Info    | <--------------------------------------------------------->
Rank:       16 | Info    | <           Calling DG Initialization level 2             >
Rank:       16 | Info    | <--------------------------------------------------------->
Rank:       16 | Info    | iniGalerkin successful
Rank:       16 | Info    | Smallest volume found in tetraedron number :        80089
Rank:       16 | Info    | Smallest volume is                         :    155031.709965159
Rank:       16 | Info    | Smallest insphere found in tetraedron number :        80089
Rank:       16 | Info    | Smallest insphere radius is                  :   0.854538065660072
Rank:       16 | Info    | Initialising Fault output. Refinement strategy:            2  Number of subtriangles:            4
Rank:       16 | Info    | Pick fault output at        22968  points in this MPI domain.
Rank:       24 | Info    | <--------------------------------------------------------->
Rank:       24 | Info    | <           Calling DG Initialization level 2             >
Rank:       24 | Info    | <--------------------------------------------------------->
Rank:       24 | Info    | iniGalerkin successful
Rank:       16 | Info    | Allocation of remaining MPI communication structure
Rank:       16 | Info    |   General info:            6          20           9
Rank:       16 | Info    | Bnd elements for domain            1  :          369
Rank:       16 | Info    | Bnd elements for domain            2  :          204
Rank:       16 | Info    | Bnd elements for domain            3  :          682
Rank:       16 | Info    | Bnd elements for domain            4  :          377
Rank:       16 | Info    | Bnd elements for domain            5  :          641
Rank:       16 | Info    | Bnd elements for domain            6  :         1085
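With the extended log, several ranks write "Rank: N | Info | ..." lines, so one way to see how far each rank got before a hang or crash is to keep the last message per rank. The following is a minimal sketch that assumes the line format shown in the logs above; the helper name `last_message_per_rank` and the sample lines are illustrative only.

```python
import re

# Matches lines such as
# "Rank:       22 | Info    | <   Calling DG Initialization level 1   >"
RANK_LINE = re.compile(r"^Rank:\s+(\d+)\s+\|\s+(\w+)\s+\|\s?(.*)$")


def last_message_per_rank(lines):
    """Map each rank to the last non-separator message it logged."""
    last = {}
    for line in lines:
        match = RANK_LINE.match(line)
        if not match:
            continue  # timestamped C++ lines, glibc errors, etc.
        rank = int(match.group(1))
        message = match.group(3).strip()
        # Skip empty messages and pure separator lines like "<------->".
        if set(message) <= set("<-> "):
            continue
        last[rank] = message.strip("<> ")
    return last


sample = """\
Rank:       22 | Info    | <--------------------------------------------------------->
Rank:       22 | Info    | <           Calling DG Initialization level 1             >
Rank:       22 | Info    | <--------------------------------------------------------->
Rank:       22 | Info    | Interface SEISSOL successful
Rank:       16 | Info    | <           Calling DG Initialization level 2             >
Rank:       16 | Info    | iniGalerkin successful
malloc_consolidate(): invalid chunk size
""".splitlines()

last = last_message_per_rank(sample)
for rank in sorted(last):
    print(f"rank {rank:3d}: {last[rank]}")
```

Ranks whose last message is still at "DG Initialization level 1" while others have moved on would be the first candidates to inspect. This only covers ranks that actually write to the log.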

Expected behavior
SeisSol should not hang.

To Reproduce
Steps to reproduce the behavior:

  1. master, d5656cd (latest). A pre-actor state was also tested.
  2. Which build settings do you use? Which compiler version do you use?
CC=mpicc CXX=mpiCC FC=mpif90  cmake -DCMAKE_PREFIX_PATH=$SeisSolDepFolder -DCOMMTHREAD=ON -DNUMA_AWARE_PINNING=ON -DASAGI=ON -DCMAKE_BUILD_TYPE=Release -DHOST_ARCH=skx -DPRECISION=double -DORDER=4 -DCMAKE_INSTALL_PREFIX=$(pwd)/build-release -DGEMM_TOOLS_LIST=LIBXSMM,PSpaMM -DPSpaMM_PROGRAM=$SeisSolDepFolder/bin/pspamm.py ..
  3. On which machine does your problem occur? If on a cluster: Which modules are loaded?
    modules:
Currently Loaded Modulefiles:
 1) admin/1.0     3) lrz/1.0        5) intel-oneapi-compilers/2021.4.0   7) intel-mpi/2019-intel   9) cmake/3.21.4            11) libszip/2.1.1                               13) numactl/2.0.14-intel21  
 2) tempdir/1.0   4) spack/22.2.1   6) intel-mkl/2020                    8) gcc/11.2.0            10) python/3.8.11-extended  12) netcdf-hdf5-all/4.7_hdf5-1.10-intel21-impi  14) yaml-cpp/0.7.0-intel21  
  4. Provide parameter/material files.

/hppfs/work/pr63qo/di73yeq4/bug_Sandwich

SeisSol run with:

#!/bin/bash
#SBATCH -J Sandw
#SBATCH -e ./%j.%x.out
#SBATCH -o ./%j.%x.out
#SBATCH --chdir=./
#SBATCH --mail-type=END
#SBATCH --mail-user=ulrich@geophysik.uni-muenchen.de
#SBATCH --time=0:30:00
#SBATCH --no-requeue
#SBATCH --export=ALL
#SBATCH --account=pn68fi
#SBATCH --partition=test
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=2

export MP_SINGLE_THREAD=no
unset KMP_AFFINITY
#export OMP_NUM_THREADS=94
#export OMP_PLACES="cores(47)"

export OMP_NUM_THREADS=46
export OMP_PLACES="cores(23)"

export XDMFWRITER_ALIGNMENT=8388608
export XDMFWRITER_BLOCK_SIZE=8388608
export SC_CHECKPOINT_ALIGNMENT=8388608

export SEISSOL_CHECKPOINT_ALIGNMENT=8388608
export SEISSOL_CHECKPOINT_DIRECT=0
export ASYNC_MODE=THREAD
export ASYNC_BUFFER_ALIGNMENT=8388608
export SEISSOL_ASAGI_MPI_MODE=OFF
source /etc/profile.d/modules.sh


echo $SLURM_NTASKS
ulimit -Ss 2097152
mpiexec -n $SLURM_NTASKS /hppfs/work/pr45fi/di73yeq4/SeisSol/build-release/SeisSol_RelWithDebInfo_dskx_4_elastic parameters_base.par
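As a sanity check, the job script settings can be compared with the affinity masks printed at the top of the logs. The sketch below assumes that in a mask "|" is only a visual separator after every ten entries, "-" marks an unused hardware thread, and a digit marks a used one; the helper name `pinned_slots` is hypothetical.

```python
def pinned_slots(mask):
    """Indices of the hardware threads marked as used in an affinity mask."""
    slots = mask.replace("|", "")
    return [i for i, c in enumerate(slots) if c != "-"]


# Masks copied from the log ("this process" and "Communication thread").
process_mask = ("0123456789|0123456789|012-------|----------|--------89|"
                "0123456789|0123456789|0---------|----------|------")
comm_mask = ("----------|----------|---3------|----------|----------|"
             "----------|----------|-1--------|----------|------")

# Values from the job script above.
nodes, tasks_per_node, omp_num_threads = 16, 2, 46

workers = pinned_slots(process_mask)
comm = pinned_slots(comm_mask)

print("MPI ranks:", nodes * tasks_per_node)   # log: "#ranks: 32"
print("worker threads:", len(workers))        # log: "#threads/rank: 46"
print("communication thread slots:", comm)
print("overlap:", sorted(set(workers) & set(comm)))
```

Under these assumptions the script and the log agree: 16 x 2 = 32 ranks, 46 pinned worker slots matching OMP_NUM_THREADS=46, and the communication thread on slots 23 and 71, which do not overlap with the workers. So the pinning itself does not look like the source of the hang.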
Thomas-Ulrich changed the title from 'Seissol hangs at "Mesh initialized in" in Release and RelWithDebInfo (but not Debug)' to 'Seissol sometimes hangs at "Mesh initialized in", sometimes not' on Jun 3, 2022
@Thomas-Ulrich
Contributor Author

Hi,
Turning off plasticity fixes the problem.
I could not track down the error with a sanitizer (the Debug build with GCC fails with Error: Unknown ASAGI MPI mode: ""), but I could manually narrow it down to the code around:
https://github.com/SeisSol/SeisSol/blob/master/src/Solver/Interoperability.cpp#L660
It is therefore probably related to this bug:
SeisSol/easi#10

Thomas-Ulrich changed the title from 'Seissol sometimes hangs at "Mesh initialized in", sometimes not' to 'Seissol sometimes hangs at "Mesh initialized in", only with plasticity' on Jun 9, 2022
@Thomas-Ulrich
Contributor Author

In the end, it also occurs without plasticity (less often, it seems).

Thomas-Ulrich changed the title from 'Seissol sometimes hangs at "Mesh initialized in", only with plasticity' to 'Seissol sometimes hangs at "Mesh initialized in"' on Jun 9, 2022
@Thomas-Ulrich
Contributor Author

I think I've identified the block which causes the error in my setup. At some point I got the following error:

Fri Jun 10 10:09:50, Info:  Mesh initialized in: 17.8762 (min: 17.8726, max: 17.8785)
malloc(): largebin double linked list corrupted (bk)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Sandwich_rhomulambda.yaml@6: Could not find model for point [ -nan ] in group 11.
double free or corruption (!prev)
corrupted size vs. prev_size
free(): invalid next size (normal)

I then simplified Sandwich_rhomulambda.yaml, and with the simplified file SeisSol never crashes or hangs.
So this does seem to be related to SeisSol/easi#10.

di73yeq4@login03:/hppfs/work/pr63qo/di73yeq4/bug_Sandwich> diff Sandwich_rhomulambda.yaml Sandwich_rhomulambda_works.yaml 
6,34c6,10
<         map: !EvalModel
<             parameters: [depth_below_sf]
<             model: !EvalModel
<                 parameters: [elevation,z]
<                 model: !Switch
<                   [z]: !AffineMap
<                     matrix:
<                       z: [0.0, 0.0, 1.0]
<                     translation:
<                       z: 0.0
<                   [elevation]: !AffineMap
<                     matrix:
<                       u: [1.0, 0.0, 0.0]
<                       v: [0.0, 1.0, 0.0]
<                     translation:
<                       u: 0.0
<                       v: 0.0
<                     components: !ASAGI
<                       file: Mesh-SSI-1-SandwichElevation2000.nc
<                       parameters: [elevation]
<                       var: elevation
<                 components: !LuaMap
<                   returns: [depth_below_sf]
<                   function: |
<                      function f (x)
<                       return {
<                         depth_below_sf = x["z"] - x["elevation"];
<                       }
<                       end
---
>         map: !AffineMap
>           matrix:
>             z: [0.0, 0.0, 1.0]
>           translation:
>             z: 0
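The "[ -nan ]" in the error message suggests that the removed block sometimes produces a NaN value for depth_below_sf. The sketch below does not explain where the NaN comes from (the malloc messages point to memory corruption upstream), and it is not easi's implementation. It only illustrates, with made-up depth intervals, why a NaN coordinate cannot match any interval-based selection: every comparison with NaN is false.

```python
import math


def depth_below_sf(z, elevation):
    # Same arithmetic as the LuaMap in the diff above.
    return z - elevation


# Hypothetical depth intervals standing in for the layers of the real file.
LAYERS = [(-math.inf, -2000.0, "deep"), (-2000.0, math.inf, "shallow")]


def find_layer(depth):
    """Return the first layer whose half-open interval contains depth."""
    for lower, upper, name in LAYERS:
        if lower <= depth < upper:
            return name
    raise RuntimeError(f"Could not find model for point [ {depth} ]")


# A valid elevation gives a depth that falls into one of the intervals.
print(find_layer(depth_below_sf(-500.0, 100.0)))

# A NaN elevation propagates through the subtraction and matches nothing,
# even though the intervals cover the whole real line.
try:
    find_layer(depth_below_sf(-500.0, math.nan))
except RuntimeError as err:
    print("error:", err)
```

If this reading is right, checking the values returned by the ASAGI elevation lookup for NaN would be a reasonable next step.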
