Checkpoint module makes the code crash due to "Invalid tag" in intelMPI2019 and above #307

Closed
weipengyao opened this issue Sep 30, 2020 · 28 comments

Description

Dear Developers,

I have 6 species in a typical laser-solid interaction model (3 ions & 3 electrons), and I want to have both ion-electron collisions and electron-electron collisions.
The namelist is attached here (with lines 139-153 uncommented and 156-186 commented):
input.py.txt
However, the code keeps crashing with a segmentation fault when I run it on the Niagara cluster.

Reproducing the error

I tried to investigate this with a scaled-down case (smaller box size, fewer particles per cell, and lower resolution) on my laptop, and found that the crash (or at least one of its causes) is due to having multiple species in the collision module.
In other words, if I use only a single species in species1 and species2, there is no segmentation fault and the code runs.

With this in mind, I modified the original namelist by adding several collision blocks, each of which contains only a single species (in the attached input file, with lines 139-153 commented and 156-186 uncommented).

Unfortunately, the code crashes again.

The .out and .err files are as follows:
No23_SG_Coll_eei-4225525.out.txt
No23_SG_Coll_eei-4225525.err.txt

Now I am a bit confused and lost.
Could you take a look and help me with this?
According to the documentation, a collision module with multiple species should not be a problem.
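
For reference, the multi-species form described in the documentation looks roughly like this (only a minimal sketch with hypothetical species names; my actual settings are in the attached input.py.txt):

    # Sketch of a Smilei Collisions block with several species per group
    # (species names here are hypothetical placeholders)
    Collisions(
        species1 = ["electron1", "electron2", "electron3"],
        species2 = ["ion1", "ion2", "ion3"],
        coulomb_log = 0.,   # 0. lets the code compute the Coulomb logarithm automatically
    )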

Parameters

make env gives the following results:

SMILEICXX : mpicxx
PYTHONEXE : python
MPIVERSION :
VERSION : ??-??
OPENMP_FLAG : -fopenmp -D_OMP
HDF5_ROOT_DIR : /scinet/niagara/software/2019b/opt/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5
SITEDIR : /home/a/anticipa/weipeng/.local/lib/python2.7/site-packages
PY_CXXFLAGS : -I/usr/include/python2.7 -I/usr/include/python2.7
PY_LDFLAGS : -lpython2.7 -lpthread -ldl -lutil -lm -Xlinker -export-dynamic
CXXFLAGS : -D__VERSION=\"??-??\" -D_VECTO -std=c++11 -Wall  -I/scinet/niagara/software/2019b/opt/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5/include -Isrc -Isrc/Params -Isrc/ElectroMagnSolver -Isrc/ElectroMagn -Isrc/ElectroMagnBC -Isrc/Particles -Isrc/Radiation -Isrc/Ionization -Isrc/Interpolator -Isrc/Collisions -Isrc/Merging -Isrc/Tools -Isrc/Python -Isrc/Projector -Isrc/DomainDecomposition -Isrc/MovWindow -Isrc/Profiles -Isrc/picsar_interface -Isrc/Checkpoint -Isrc/Pusher -Isrc/Field -Isrc/MultiphotonBreitWheeler -Isrc/SmileiMPI -Isrc/Species -Isrc/Diagnostic -Isrc/ParticleInjector -Isrc/Patch -Ibuild/src/Python -I/usr/include/python2.7 -I/usr/include/python2.7 -D_VECTO -O3 -g  -fopenmp -D_OMP
LDFLAGS : -L/scinet/niagara/software/2019b/opt/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5/lib   -lhdf5 -lpython2.7 -lpthread -ldl -lutil -lm -Xlinker -export-dynamic -lm -fopenmp -D_OMP
weipengyao added the bug label Sep 30, 2020

mccoys commented Sep 30, 2020

This problem is related to intel-mpi-2019. See issue #270
This particular version of intel-mpi is problematic as it has a low maximum tag number. Please check with your sysadmin to see if you can have another version.

@weipengyao

Dear @mccoys
Thanks for your timely reply.
I am trying to use another MPI version on the Niagara cluster.
However, not many options are provided on this particular cluster.

weipeng@nia-login01:~/scratch/Data_smilei$ module spider hdf5/1.10.5

-------------------------------------------------------------------------------------------------------------------------------------------
  hdf5: hdf5/1.10.5
-------------------------------------------------------------------------------------------------------------------------------------------
    Description:
      HDF5 is a data model, library, and file format for storing and managing data


    You will need to load all module(s) on any one of the lines below before the "hdf5/1.10.5" module is available to load.

      gcc/8.3.0
      intel/2019u3
      intel/2019u3  intelmpi/2019u3
      intel/2019u4
      intel/2019u4  openmpi/.old-4.0.1
      intel/2019u4  openmpi/4.0.1

I gave intel/2019u4 openmpi/4.0.1 a try and met with another MPI-related problem.
No23_SG_Coll_eei-4225922.err.txt

I am asking the sysadmin for help now and will keep you updated.

Many thanks!


mccoys commented Sep 30, 2020

You can avoid the MPI_THREAD_MULTIPLE issue by compiling Smilei with the no_mpi_tm option.
See http://www.maisondelasimulation.fr/smilei/installation.html#advanced-compilation-options

Note that this may cause some slowdown.

@weipengyao

Dear @mccoys ,

Thanks for the advice.
I've asked the sysadmin to install intelmpi/2020u2 for hdf5/1.10.7, and I am trying that right now.

Besides, strangely, after I recompiled the code with make config=no_mpi_tm (I tried several times), make env did not show the -D_NO_MPI_TM flag in CXXFLAGS, which I would expect to be there.
As a result, re-running the simulation gives the same error.
Note that for this attempt, the modules I used were intel/2019u4, openmpi/4.0.1, and hdf5/1.10.5.

I will keep you updated with the intelmpi-2020 attempt.

P.S.: As for the MPI-tag issue, I also find it strange, because my simulation only uses [32, 32] patches.
Is it possible that the error is due to something else?


mccoys commented Sep 30, 2020

There are many more tags than patches. I think we have fixed the issues we used to have with tag numbering, but one never knows; there might be unforeseen situations.

@weipengyao

I should add that I used to run the above namelist fine, with the collision module including only a single species.
Then I updated the namelist as follows:

  1. increased the resolution,
  2. added ionization with the tunnel model,
  3. added two more species and included them in the collision module,
  4. added the checkpoint module.

Then I met with the MPI-tag issue.

I am still in the queue now and I will keep you updated with the intelmpi/2020u2 results.

@weipengyao

Dear @mccoys

The simulation crashed again, but with a new and even stranger error.

srun: error: nia1354: task 6: Segmentation fault (core dumped)
srun: error: nia0367: task 0: Segmentation fault (core dumped)
srun: error: nia1456: task 8: Segmentation fault (core dumped)
srun: error: nia1457: task 9: Segmentation fault (core dumped)
srun: error: nia1077: task 2: Segmentation fault (core dumped)
srun: error: nia1304: task 4: Segmentation fault (core dumped)
srun: error: nia1078: task 3: Segmentation fault (core dumped)
srun: error: nia1353: task 5: Segmentation fault (core dumped)
srun: error: nia1020: task 1: Segmentation fault (core dumped)
srun: error: nia1455: task 7: Segmentation fault (core dumped)
[mpiexec@nia0367.scinet.local] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:527): downstream from host nia0367 exited with status 139
[mpiexec@nia0367.scinet.local] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2096): assert (exitcodes != NULL) failed

In the .out file, the simulation did not even start; it only created the directory for the run.
The directory contains the copied namelist and a couple of core.xxx files.

Any idea what is happening?
I will also ask the sysadmin for help.

@weipengyao

Since I have a working model without ionization or collisions, I repeated the procedure step by step, with the former intelmpi/2019u3.

Now I have both ionization and collisions, and the code seems fine with a reduced-scale run on Niagara.
However, as soon as I add the checkpoint module (uncommenting lines 345-351), the same MPI-tag error occurs.
DB2_SG_Collision_test.py.txt

So, I guess the problem is more about the checkpoint module than the collisions?
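
For reference, the checkpoint block I am enabling is essentially of this form (a sketch with placeholder values, not the exact lines 345-351 of the attached namelist):

    # Sketch of a Smilei Checkpoints block (placeholder values only)
    Checkpoints(
        dump_minutes = 240.,       # write a restart dump every 240 minutes of wall time
        exit_after_dump = False,   # keep running after each dump
        keep_n_dumps = 2,          # number of dump files to keep on disk
    )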

weipengyao changed the title from "collision module with multiple species makes the code crash" to "Checkpoint module makes the code crash" on Oct 1, 2020

mccoys commented Oct 2, 2020

Concerning the segfault, is there any more information on the error? No other lines? I would guess this is a compilation issue. Have you done make clean before compilation?

Concerning checkpoints, we will try to reproduce the issue first.

@weipengyao

Dear @mccoys ,

Yes, I ran make clean first, and then make config=no_mpi_tm -j 8. The compilation was successful, but make env gives:

weipeng@nia-login03:~/CODES/Smilei$ make env
SMILEICXX : mpicxx
PYTHONEXE : python3
MPIVERSION :
VERSION : b'v4.4-784-gc3f8cc8'-b'no_mpi_tm'
OPENMP_FLAG : -fopenmp -D_OMP
HDF5_ROOT_DIR : /scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5
SITEDIR : /home/a/anticipa/weipeng/.local/lib/python3.6/site-packages
PY_CXXFLAGS : -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib/python3.6/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION
PY_LDFLAGS : -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic
CXXFLAGS : -D__VERSION=\"b'v4.4-784-gc3f8cc8'-b'no_mpi_tm'\" -D_VECTO -std=c++11 -Wall  -I/scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5/include -Isrc -Isrc/Params -Isrc/ElectroMagnSolver -Isrc/ElectroMagn -Isrc/ElectroMagnBC -Isrc/Particles -Isrc/Radiation -Isrc/Ionization -Isrc/Interpolator -Isrc/Collisions -Isrc/Merging -Isrc/Tools -Isrc/Python -Isrc/Projector -Isrc/DomainDecomposition -Isrc/MovWindow -Isrc/Profiles -Isrc/picsar_interface -Isrc/Checkpoint -Isrc/Pusher -Isrc/Field -Isrc/MultiphotonBreitWheeler -Isrc/SmileiMPI -Isrc/Species -Isrc/Diagnostic -Isrc/ParticleInjector -Isrc/Patch -Ibuild/src/Python -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib/python3.6/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g  -fopenmp -D_OMP
LDFLAGS : -L/scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5/lib   -lhdf5 -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic -L/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib -lm -fopenmp -D_OMP

And there's no -D_NO_MPI_TM in CXXFLAGS.

But as you can see from the above information, I am not using the new intelmpi/2020u2, because it won't compile; it fails with the error:
Catastrophic error: could not set locale "" to allow processing of multibyte characters


iltommi commented Oct 2, 2020

You should check make config=no_mpi_tm env.

@weipengyao

Dear @iltommi ,

Yes, make config=no_mpi_tm env gives the expected result.
But it also gives the same result when I run that command for the same Smilei code compiled without config=no_mpi_tm.
That is to say, make env doesn't tell me which CXXFLAGS the code was actually compiled with?

Besides, after I compile the code with make config=no_mpi_tm, do I need to add any additional options when I run ~/XX/smilei.sh NUM_OF_CORES namelist.py to enable the -D_NO_MPI_TM feature?

Thanks!


mccoys commented Oct 4, 2020

So, I guess the problem is more about the checkpoint module, not the collision?

No. The problem is not the checkpoints. It is true that checkpoints require more MPI tags, but there is no way around that. The problem is that your version of intelmpi is problematic with Smilei.

Now, the problem is that the run with the new intelmpi2020 crashed. Could you show your make env with intelmpi2020?

That is to say, make env doesn't tell me what the CXXFLAGS the code compile with?

The command make does not remember which options you used previously. If you run make env, it shows your variables in a default environment (not with no_mpi_tm), even if you used a different config before. If you want to see which environment your code was previously compiled with, it is more complicated.

@weipengyao

Dear @mccoys ,

On Niagara, when I compile Smilei with the 2020u2 modules, sometimes (like on Saturday) it compiles successfully, while at other times (like last Friday and now) it fails with an error:

weipeng@nia-login06:~/CODES/test/Smilei$ module list

Currently Loaded Modules:
  1) NiaEnv/2019b   2) intel/2020u2   3) intelmpi/2020u2   4) hdf5/1.10.7   5) python/3.6.8



weipeng@nia-login06:~/CODES/test/Smilei$ make -j 8
Creating binary char for src/Python/pyprofiles.py
Creating binary char for src/Python/pycontrol.py
Creating binary char for src/Python/pyinit.py
Checking dependencies for src/Tools/H5.cpp
Checking dependencies for src/Tools/Tools.cpp
Checking dependencies for src/Tools/userFunctions.cpp
Checking dependencies for src/Tools/Timers.cpp
Checking dependencies for src/Tools/Timer.cpp
Checking dependencies for src/Tools/backward.cpp
Checking dependencies for src/Tools/PyTools.cpp
Checking dependencies for src/Tools/tabulatedFunctions.cpp
Catastrophic error: could not set locale "" to allow processing of multibyte characters

Catastrophic error: could not set locale "" to allow processing of multibyte characters

Catastrophic error: could not set locale "" to allow processing of multibyte characters

Since I can't compile it now with 2020u2, I can't show you the make env results.

And I can always compile it with 2019u3:

weipeng@nia-login06:~/CODES/test/Smilei$ make clean
Cleaning build
weipeng@nia-login06:~/CODES/test/Smilei$ source ~/scratch/Data_smilei/compile_smilei_niagara.sh
The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) NiaEnv/2019b
weipeng@nia-login06:~/CODES/test/Smilei$ module list

Currently Loaded Modules:
  1) NiaEnv/2019b   2) intel/2019u3   3) intelmpi/2019u3   4) hdf5/1.10.5   5) python/3.6.8



weipeng@nia-login06:~/CODES/test/Smilei$ make -j 8
Creating binary char for src/Python/pyprofiles.py
Creating binary char for src/Python/pycontrol.py
Creating binary char for src/Python/pyinit.py
Checking dependencies for src/Tools/H5.cpp
Checking dependencies for src/Tools/Tools.cpp
Checking dependencies for src/Tools/userFunctions.cpp
Checking dependencies for src/Tools/Timers.cpp
Checking dependencies for src/Tools/Timer.cpp
Checking dependencies for src/Tools/backward.cpp
Checking dependencies for src/Tools/PyTools.cpp


mccoys commented Oct 4, 2020

Why can't you show the result of make env?


iltommi commented Oct 4, 2020

Catastrophic error: could not set locale "" to allow processing of multibyte characters

It complains about your empty locale environment. Try export LANG=C before compiling.

@weipengyao

weipeng@nia-login06:~/CODES/test/Smilei$ make env
SMILEICXX : mpicxx
PYTHONEXE : python3
MPIVERSION :
VERSION : b'v4.4-784-gc3f8cc8'-b'master'
OPENMP_FLAG : -fopenmp -D_OMP
HDF5_ROOT_DIR : /scinet/niagara/software/2019b/modules/intel-2020u2-intelmpi-2020u2/hdf5/1.10.7
SITEDIR : /home/a/anticipa/weipeng/.local/lib/python3.6/site-packages
PY_CXXFLAGS : -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib/python3.6/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION
PY_LDFLAGS : -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic
CXXFLAGS : -D__VERSION=\"b'v4.4-784-gc3f8cc8'-b'master'\" -D_VECTO -std=c++11 -Wall  -I/scinet/niagara/software/2019b/modules/intel-2020u2-intelmpi-2020u2/hdf5/1.10.7/include -Isrc -Isrc/Params -Isrc/ElectroMagnSolver -Isrc/ElectroMagn -Isrc/ElectroMagnBC -Isrc/Particles -Isrc/Radiation -Isrc/Ionization -Isrc/Interpolator -Isrc/Collisions -Isrc/Merging -Isrc/Tools -Isrc/Python -Isrc/Projector -Isrc/DomainDecomposition -Isrc/MovWindow -Isrc/Profiles -Isrc/picsar_interface -Isrc/Checkpoint -Isrc/Pusher -Isrc/Field -Isrc/MultiphotonBreitWheeler -Isrc/SmileiMPI -Isrc/Species -Isrc/Diagnostic -Isrc/ParticleInjector -Isrc/Patch -Ibuild/src/Python -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib/python3.6/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g  -fopenmp -D_OMP
LDFLAGS : -L/scinet/niagara/software/2019b/modules/intel-2020u2-intelmpi-2020u2/hdf5/1.10.7/lib   -lhdf5 -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic -L/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib -lm -fopenmp -D_OMP

Sorry, I thought the results of make env only made sense if I could successfully compile Smilei.
Apparently that is not the case, and the above is the output of make env.

@weipengyao

Dear @iltommi

I tried export LANG=C, but it didn't work.
I emailed the sysadmin and provided the output of the env command as asked; it is also attached here:
out.txt

Maybe this can provide more information to help solve this issue?

Many thanks!

@weipengyao

Sorry for the late updates:

  1. The sysadmin suggested that I unset "LC_CTYPE=UTF-8", but before doing so, I made another attempt to compile the code, and it magically worked.

  2. The code still crashes, with the "Invalid tag" issue. The error message reads:

Abort(537502212) on node 781 (rank 781 in comm 0): Fatal error in PMPI_Iprobe: Invalid tag, error stack:
PMPI_Iprobe(126): MPI_Iprobe(src=0, tag=16777216, MPI_COMM_WORLD, flag=0x7ffc801954b0, status=0x7ffc8019ae80) failed
PMPI_Iprobe(91).: Invalid tag, value is 16777216
Abort(1007264260) on node 799 (rank 799 in comm 0): Fatal error in PMPI_Iprobe: Invalid tag, error stack:
PMPI_Iprobe(126): MPI_Iprobe(src=0, tag=16777216, MPI_COMM_WORLD, flag=0x7ffc4e9f4930, status=0x7ffc4e9fa300) failed
PMPI_Iprobe(91).: Invalid tag, value is 16777216
Abort(134849028) on node 788 (rank 788 in comm 0): Fatal error in PMPI_Iprobe: Invalid tag, error stack:
PMPI_Iprobe(126): MPI_Iprobe(src=0, tag=16777216, MPI_COMM_WORLD, flag=0x7ffca43e92b0, status=0x7ffca43eec80) failed
PMPI_Iprobe(91).: Invalid tag, value is 16777216

@jderouillat

Hi Yao,
Can you write a small program that compares the value of the tag used by Checkpoints.dump_minutes (16777216) with the maximum value accepted by the MPI library that you are using:

    // Query the maximum tag value allowed by the MPI library (MPI_TAG_UB attribute)
    int flag;
    int* tag_ub_ptr;
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
    cout << "Max tag of current MPI library : " << (*tag_ub_ptr) << endl;

You can also add this after line 58 of the main program.
Regards

Julien

@weipengyao

Hello @jderouillat

Thanks for your reply, I will do it now.
May I contact you directly via Element about this?

Thanks
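
(As a side note, a similar check can be done without recompiling Smilei, for instance with a short mpi4py script; this is just a sketch and assumes mpi4py is installed and built against the same Intel MPI module:)

    # check_tag.py -- hypothetical helper script, not part of Smilei
    # Compares MPI_TAG_UB with the tag used by Checkpoints.dump_minutes (16777216)
    from mpi4py import MPI

    tag_ub = MPI.COMM_WORLD.Get_attr(MPI.TAG_UB)
    print("Max tag of current MPI library :", tag_ub)
    print("Checkpoint tag 16777216 fits   :", tag_ub is not None and 16777216 <= tag_ub)

    # run with, e.g.: mpirun -n 1 python3 check_tag.py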

@weipengyao

Dear @iltommi

The locale environment problem was solved by changing LC_CTYPE=UTF-8 (set by some ssh config on my Mac laptop) to LC_CTYPE=en_US.UTF-8.

@weipengyao

Hi @jderouillat

I did what you suggested, and the output is

Max tag of current MPI library : 524287

And here is some more info:

[weipeng@nia0168 Smilei]$ make env
SMILEICXX : mpicxx
PYTHONEXE : python3
MPIVERSION :
VERSION : b'v4.4-784-gc3f8cc8'-b'ckp_tag'
OPENMP_FLAG : -fopenmp -D_OMP
HDF5_ROOT_DIR : /scinet/niagara/software/2019b/modules/intel-2020u2-intelmpi-2020u2/hdf5/1.10.7
SITEDIR : /home/a/anticipa/weipeng/.local/lib/python3.6/site-packages
PY_CXXFLAGS : -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib/python3.6/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION
PY_LDFLAGS : -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic
CXXFLAGS : -D__VERSION=\"b'v4.4-784-gc3f8cc8'-b'ckp_tag'\" -D_VECTO -std=c++11 -Wall  -I/scinet/niagara/software/2019b/modules/intel-2020u2-intelmpi-2020u2/hdf5/1.10.7/include -Isrc -Isrc/Params -Isrc/ElectroMagnSolver -Isrc/ElectroMagn -Isrc/ElectroMagnBC -Isrc/Particles -Isrc/Radiation -Isrc/Ionization -Isrc/Interpolator -Isrc/Collisions -Isrc/Merging -Isrc/Tools -Isrc/Python -Isrc/Projector -Isrc/DomainDecomposition -Isrc/MovWindow -Isrc/Profiles -Isrc/picsar_interface -Isrc/Checkpoint -Isrc/Pusher -Isrc/Field -Isrc/MultiphotonBreitWheeler -Isrc/SmileiMPI -Isrc/Species -Isrc/Diagnostic -Isrc/ParticleInjector -Isrc/Patch -Ibuild/src/Python -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib/python3.6/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g  -fopenmp -D_OMP
LDFLAGS : -L/scinet/niagara/software/2019b/modules/intel-2020u2-intelmpi-2020u2/hdf5/1.10.7/lib   -lhdf5 -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic -L/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib -lm -fopenmp -D_OMP

It seems that intelmpi/2020u2 still has a smaller maximum tag value (524287, i.e. 2^19 - 1) than the checkpoint module needs (16777216, i.e. 2^24).

@weipengyao

Some updates from the sysadmin of Scinet:

Technically, the MPI standard only guarantees you can use tag values up to 32767:

https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node210.htm#Node211

You might be able to increase it -- see the section on "upper bound exceed" here:

https://doku.lrz.de/display/PUBLIC/Intel+MPI

Hope that this helps.

And thanks to the discussion with @jderouillat:

weipeng@nia-login02:~/scratch/Data_smilei$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2019 Update 8 Build 20200624 (id: 4f16ad915)
Copyright 2003-2020, Intel Corporation.

I am trying to see if replacing dump_minutes with dump_step will bypass the issue.


mccoys commented Oct 5, 2020

We are currently trying to make checkpoints use a lower tag. Such a high number is not necessary, and lowering it may help your situation.

@jderouillat

Many thanks to your sysadmin for the second link. Did you test what is recommended there?
Assuming that you are using fewer than 16384 MPI processes, did you try running your simulation with:

$ export MPIR_CVAR_CH4_OFI_TAG_BITS=25
$ export MPIR_CVAR_CH4_OFI_RANK_BITS=14

It's of course not a long-term solution, but it could be a nice workaround.
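
As a rough sanity check of these numbers (assuming the usable tag range simply becomes 2^TAG_BITS - 1 and the rank range 2^RANK_BITS; the exact bit split depends on the OFI provider):

    # Back-of-the-envelope check of the suggested CH4/OFI bit split
    TAG_BITS, RANK_BITS = 25, 14
    max_tag   = 2**TAG_BITS - 1     # 33554431
    max_ranks = 2**RANK_BITS        # 16384
    checkpoint_tag = 16777216       # 2**24, the tag reported in the "Invalid tag" error
    print(max_tag >= checkpoint_tag)   # True: the checkpoint tag now fits
    print(max_ranks)                   # 16384 ranks available, enough for this run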

@weipengyao

Dear @jderouillat ,

Sorry for this late reply.

I tested this with a single node, and it works!
So I guess it should also work with multiple nodes.
The full-scale simulation is now waiting in the queue; I will keep you updated about the result.

Many thanks!

weipengyao changed the title from "Checkpoint module makes the code crash" to "Checkpoint module makes the code crash due to 'Invalid tag' in intelMPI2019 and above" on Oct 8, 2020
@weipengyao

The checkpoint module now works with the updated MPIR_CVAR_CH4_OFI_TAG_BITS and MPIR_CVAR_CH4_OFI_RANK_BITS settings.

I am closing this issue.

Thanks again for your support.
