
Very long initialization during texascale run #1052

Open
NicoSchlw opened this issue Feb 14, 2024 · 6 comments

@NicoSchlw
Contributor

Describe the bug
Initialization during texascale run takes 40 min

Expected behavior
Faster initialization

To Reproduce
Steps to reproduce the behavior:

  1. Which version do you use? Provide branch and commit id.
    nico/2nuc-latest-master; v1.1.3-109-g10ae92ee
  2. Which build settings do you use? Which compiler version do you use?
    SeisSol_Release_dskx_6_viscoelastic2; intel-19.1.1
  3. On which machine does your problem occur? If on a cluster: Which modules are loaded?
    Frontera
    Currently Loaded Modules:
  1. intel/19.1.1
  2. impi/19.0.9
  3. git/2.24.1
  4. autotools/1.2
  5. cmake/3.24.2
  6. pmix/3.1.4
  7. hwloc/1.11.12
  8. xalt/2.10.34
  9. TACC
  10. python3/3.9.2
  11. intel-mpi/2019.0.9-intel-19.1.1.217-plmutsc
  12. pkgconf/1.9.5-intel-19.1.1.217-orhsldj
  13. zlib-ng/2.1.4-intel-19.1.1.217-xwszawp
  14. hdf5/1.12.2-intel-19.1.1.217-sji7bdx
  15. netcdf-c/4.7.4-intel-19.1.1.217-guy2c4r
  16. numactl/2.0.14-intel-19.1.1.217-fbfn6qe
  17. asagi/1.0.1-intel-19.1.1.217-6oes3fl
  18. cxxtest/develop-intel-19.1.1.217-yavleqi
  19. impalajit/main-intel-19.1.1.217-w45mc4n
  20. curl/7.29.0-intel-19.1.1.217-kiwu5vb
  21. ncurses/6.4-intel-19.1.1.217-jkmcgrd
  22. readline/8.2-intel-19.1.1.217-rdugitv
  23. unzip/6.0-intel-19.1.1.217-s2gzz22
  24. lua/5.3.2-intel-19.1.1.217-sqpjecx
  25. yaml-cpp/0.6.2-intel-19.1.1.217-p36vrxa
  26. easi/1.2.0-intel-19.1.1.217-lwzthnz
  27. eigen/3.4.0-intel-19.1.1.217-wni5u35
  28. libxsmm/1.17-intel-19.1.1.217-knnzdf4
  29. memkind/1.13.0-intel-19.1.1.217-pvz3fkl
  30. metis/5.1.0-intel-19.1.1.217-yjhymv7
  31. parmetis/4.0.3-intel-19.1.1.217-q4zdww5
  32. python/3.9.2-intel-19.1.1.217-ixn2w76
  33. py-pspamm/develop-intel-19.1.1.217-dh5gy3d
  34. seissol-env/develop-intel-19.1.1.217-6tooxtt
  4. Provide parameter/material files.

parameters.txt
6106531.txt

Additional context
Ridgecrest large-domain setup with a mesh of 320 million elements

Important:
The branch already contains: #970

The longest initialization step takes about 30 min and happens after printing:
"Initializing Fault, using a quadrature rule with 49 points."
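As a side note, the 49 points are consistent with a tensor-product Gauss rule of (order + 1)^2 points per fault face for the order-6 build used here. A minimal sketch of that relation (the formula is an assumption about SeisSol's dynamic-rupture quadrature, not something stated in this thread):

```python
# Hedged sketch: relate the "49 points" log line to the build's
# convergence order, assuming a tensor-product Gauss rule of
# (order + 1)^2 points per fault face.
def fault_quadrature_points(convergence_order):
    """Points per fault face for a tensor-product Gauss rule (assumed)."""
    return (convergence_order + 1) ** 2

print(fault_quadrature_points(6))  # order-6 build (dskx_6) -> 49
```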

This is weird because the setup does contain relatively few DR elements and similar Ridgecrest setups with more DR elements do not have this issue.

The setup is special because it integrates a very large model domain with several low velocity sediment basins. This means that the difference between the largest and smallest seismic velocity (and time step) is quite large. This is the main difference I can think of compared to other Ridgecrest setups.
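To illustrate the point about the velocity contrast, here is a hedged sketch of how it widens the spread of CFL-limited element time steps. The numbers (h, v_p, cfl) and the rate-2 cluster formula are illustrative assumptions, not values from this setup:

```python
import math

# Hedged illustration: with dt ~ cfl * h / v_p per element, a large
# contrast between the fastest and slowest seismic velocities widens
# the range of admissible time steps. The rate-2 cluster count below
# mimics clustered local time stepping and is an assumption.
def cfl_dt(h, v_p, cfl=0.5):
    """CFL-limited time step for an element of size h and wave speed v_p."""
    return cfl * h / v_p

dt_rock = cfl_dt(h=300.0, v_p=8000.0)   # fast basement material
dt_basin = cfl_dt(h=300.0, v_p=400.0)   # low-velocity sediment basin
spread = dt_basin / dt_rock             # equals the velocity ratio: 20
clusters = math.floor(math.log2(spread)) + 1  # rate-2 clusters spanned

print(spread, clusters)
```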

The slow initialization of the setup already becomes apparent with a small mesh (2 million elements) on 20 nodes (ignore the segfault):
6100231.txt

@NicoSchlw NicoSchlw added the bug label Feb 14, 2024
@NicoSchlw
Contributor Author

I forgot to mention that the initialization time increases with the number of ranks: the 320M mesh took less time to initialize on fewer ranks.

@Thomas-Ulrich
Contributor

Does your SeisSol build include the fix from #1026?

@sebwolf-de
Contributor

Hi,
using your setup with the debug3 mesh (147k elements), I found the following:
Your stress setup is quite complicated with several different maps and models in several files.
Original setup:

Wed Feb 14 13:53:40, Info:  Initializing Fault, using a quadrature rule with  49  points.
Wed Feb 14 13:54:05, Info:  Model initialized in: 35.7449 s (min: 35.7449 s, max: 35.7449 s)

With stresses and nucleation traction as a constant map:

Wed Feb 14 13:52:32, Info:  Initializing Fault, using a quadrature rule with  49  points.
Wed Feb 14 13:52:32, Info:  Model initialized in: 8.5654 s (min: 8.5654 s, max: 8.5654 s)

I remember we encountered a similar problem recently, because easi decided to read the ASAGI files several times.
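For context, replacing the file-backed maps with a single constant map in easi could look like the sketch below; the parameter names and values are placeholders, not taken from this setup:

```yaml
# Hypothetical easi snippet: one ConstantMap fills every fault quadrature
# point without re-reading ASAGI files. Parameter names/values are
# placeholders; the actual fault parameters depend on the friction law.
!ConstantMap
map:
  s_xx: -1.0e+7
  s_yy: -1.0e+7
  s_zz: -1.0e+7
  s_xy: 0.0
  s_yz: 0.0
  s_xz: 0.0
```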

@sebwolf-de
Contributor

20M element mesh on 8 nodes:
Original setup:

Wed Feb 14 14:03:36, Info:  Model initialized in: 2 min 13.0949 s (min: 2 min 13.0849 s, max: 2 min 13.1069 s)

Constant Maps:

Wed Feb 14 14:05:44, Info:  Initializing Fault, using a quadrature rule with  49  points.
Wed Feb 14 14:05:44, Info:  Model initialized in: 39.1861 s (min: 39.0686 s, max: 39.5168 s)

@NicoSchlw
Contributor Author

Hi,

@Thomas-Ulrich yes, the branch already includes this fix.

@sebwolf-de yes, the stress setup is quite complicated. I have logs from Texascale runs last year using the same stress setup, where a 620M element mesh was initialized within 15 min.
5141479.txt

But last year, I used only one rank per node. Is it expected that doubling the number of ranks will triple the initialization time?

@sebwolf-de
Contributor

I don't see two ranks per node increasing the initialization time. Using the 20M mesh, I find:
8 nodes / 8 ranks: Model initialized in: 2 min 11.1097 s (min: 2 min 11.0979 s, max: 2 min 11.1217 s)
8 nodes / 16 ranks: Model initialized in: 2 min 24.3005 s (min: 2 min 24.2875 s, max: 2 min 24.3083 s)
