
Optimize unique list of used atoms #811

Merged
merged 3 commits on Apr 28, 2022

Conversation

GiovanniBussi
Member

I tried to optimize the generation of the lists of used atoms within PLUMED.

Background: in current master, PLUMED uses std::sets to store the sets of atoms used in each CV. These sets are then merged and the resulting set is used when running with multiple MPI processes to (a) minimize communication and (b) optimize coordinate and force copy. When running with a single MPI process, the generation of the merged set is skipped since it is expensive.

Proposed change: I converted all these std::sets to sorted std::vectors, which, to my surprise, are quite a bit faster. The expected result is that (a) in parallel there might be a small speedup and (b) in serial it now makes sense to use the merged vector to optimize the copy of coordinates and forces.
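For readers less familiar with the trick, a minimal sketch of what the sorted-vector approach amounts to is shown below. This is illustrative only and not the actual PLUMED code: the function name is made up and plain unsigned indices stand in for the real AtomNumber-based interfaces.

```cpp
// Sketch: merge per-action index lists into one sorted, duplicate-free
// "unique" list using contiguous std::vectors instead of node-based std::sets.
#include <algorithm>
#include <vector>

std::vector<unsigned> mergeUnique(const std::vector<std::vector<unsigned>>& requests) {
  std::vector<unsigned> merged;
  for(const auto& r : requests) merged.insert(merged.end(), r.begin(), r.end()); // concatenate
  std::sort(merged.begin(), merged.end());                                       // one cache-friendly sort
  merged.erase(std::unique(merged.begin(), merged.end()), merged.end());         // drop duplicates in place
  return merged;
}
```

Compared with repeatedly inserting into a std::set, this keeps the data contiguous and avoids per-element allocations, which is presumably where the speedup comes from.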

The last point requires some heuristic. When using a small fraction of the atoms (e.g., solute only), it is convenient to generate the vector and use it. When using a large fraction of the atoms (e.g., all atoms), it is convenient not to do it. Since the generation of the merged vector is still the expensive part, I heuristically decided to (a) skip the merged vector when a single action requests more than half of the atoms and (b) generate and use it otherwise. This choice can be overridden with an environment variable for testing.
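As a rough illustration of this heuristic (again a sketch, not the code in this PR, and assuming the reading above that the merged vector is skipped for large single requests; all names are made up), the decision could be expressed as:

```cpp
// Sketch: build and use the merged "unique" list only if no single action
// already requests more than half of the atoms.
#include <algorithm>
#include <cstddef>
#include <vector>

bool worthBuildingUniqueList(const std::vector<std::size_t>& requestSizes, std::size_t natoms) {
  std::size_t largest = 0;
  for(std::size_t s : requestSizes) largest = std::max(largest, s);
  // If one action covers more than half of the atoms, most coordinates are
  // copied anyway, so the merged list is not expected to pay off.
  return 2*largest <= natoms;
}
```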

Based on some timing I did on a test system, this code should be faster than the one proposed in #805 (@shazj99 could you please confirm?).

@carlocamilloni can you also double check if I correctly modified the code that you optimized for parallel execution?

In this commit one can use env vars to select the behavior.
Might be changed to enforce the optimal choice.
@codecov-commenter

codecov-commenter commented Apr 11, 2022

Codecov Report

Merging #811 (b4e050b) into master (47a71a6) will increase coverage by 0.02%.
The diff coverage is 83.52%.

@@            Coverage Diff             @@
##           master     #811      +/-   ##
==========================================
+ Coverage   85.55%   85.57%   +0.02%     
==========================================
  Files         597      597              
  Lines       48863    48925      +62     
==========================================
+ Hits        41805    41868      +63     
+ Misses       7058     7057       -1     
Impacted Files Coverage Δ
src/core/ActionAtomistic.h 100.00% <ø> (ø)
src/core/Atoms.h 95.83% <ø> (ø)
src/tools/Tools.h 82.71% <60.00%> (-13.37%) ⬇️
src/core/Atoms.cpp 95.10% <95.23%> (-0.25%) ⬇️
src/cltools/pesmd.cpp 94.40% <100.00%> (ø)
src/colvar/ERMSD.cpp 97.95% <100.00%> (ø)
src/colvar/PathMSDBase.cpp 97.60% <100.00%> (ø)
src/core/ActionAtomistic.cpp 92.39% <100.00%> (+0.18%) ⬆️
src/core/MDAtoms.cpp 89.92% <100.00%> (ø)
src/tools/AtomNumber.h 100.00% <0.00%> (ø)
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


/// Force the construction of the unique list.
/// Can be used for timing the construction of the unique list.
/// export PLUMED_FORCE_UNIQUE=no
Member

this should be PLUMED_MAKE_UNIQUE=no

Member Author

This was meant for constructing the unique array without using it later.

Anyway, the code currently just checks whether the variable is set, so it will work even with PLUMED_MAKE_UNIQUE=no. The idea is to force building the array, just to check how expensive it would be to first build it and then use its size to decide whether it should be used. In my tests I saw that building it is not worth the cost, and thus I decided to base the heuristic on the size of the largest request rather than on the merged list.

@carlocamilloni
Member

At a first read it looks ok; I would need to test it.

Do we check the environment variable at every step? I don't think that would be good. Does it have an impact on performance? And could we get different environment settings on multiple nodes by mistake?

As a side note, starting from GROMACS 2022 there can be cases, based on some heuristic, where non-parallel runs still use shuffled atoms.

@GiovanniBussi
Member Author

Regarding env vars: they are checked only once, since they are used to initialize a static const variable.

It's true that you should make sure they are set consistently on all nodes. I think you should pass them as an argument to mpirun; for OpenMPI the option is -x.

This is mostly meant for testing without recompilation. In the longer term, we can remove these variables.
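For context, the "checked once" pattern mentioned above can be sketched as follows. This is illustrative only, not the exact PLUMED code; as noted in the thread, the current implementation only tests whether the variable is set at all:

```cpp
// Sketch: the environment is read the first time this function is called and
// the result is cached in a static const for the rest of the run.
#include <cstdlib>

bool makeUniqueForced() {
  static const bool forced = std::getenv("PLUMED_MAKE_UNIQUE") != nullptr;
  return forced;
}
```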

@GiovanniBussi
Member Author

GiovanniBussi commented Apr 11, 2022

@carlocamilloni as a further note, I have another idea on how to make this even faster. I would like to keep track of the previously used sets of ActionAtomistic and keep a cache of a few unique lists (say, for the last ten different activation patterns). This should be sufficient, since in most applications we have a couple of different STRIDE values, so the same lists alternate. The cache would be reset when we make a new domain decomposition and when individual actions change their request lists. In any case, there will be cases where lists are constructed anyway, so making the construction as fast as possible will still have an impact.
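A speculative sketch of this caching idea (not part of this PR; the class and all names are hypothetical) might look like:

```cpp
// Sketch: cache unique lists keyed by the current activation pattern of the
// actions; rebuild only on a cache miss and keep only the most recent patterns.
#include <algorithm>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

class UniqueListCache {
  static constexpr std::size_t maxEntries = 10;  // "the last ten different activation patterns"
  std::deque<std::pair<std::vector<bool>, std::vector<unsigned>>> entries;
public:
  const std::vector<unsigned>& get(const std::vector<bool>& active,
                                   const std::vector<std::vector<unsigned>>& requests) {
    for(const auto& e : entries) if(e.first==active) return e.second;  // cache hit
    std::vector<unsigned> merged;                                      // cache miss: rebuild
    for(std::size_t i=0; i<requests.size(); ++i)
      if(active[i]) merged.insert(merged.end(), requests[i].begin(), requests[i].end());
    std::sort(merged.begin(), merged.end());
    merged.erase(std::unique(merged.begin(), merged.end()), merged.end());
    entries.emplace_back(active, std::move(merged));
    if(entries.size()>maxEntries) entries.pop_front();
    return entries.back().second;
  }
  // To be called after a new domain decomposition or when an action changes its request list.
  void reset() { entries.clear(); }
};
```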

@shazj99

shazj99 commented Apr 15, 2022

hi @GiovanniBussi, I've tested it on my system:

  1. baseline:
PLUMED: 2 Sharing data                                100001    44.696163     0.000447     0.000376     0.033478
  2. do not make unique (with the optimization of force.zero()):
PLUMED: 2 Sharing data                                100001     5.470657     0.000055     0.000050     0.014057
  3. make unique list but do not use it:
PLUMED: 2 Sharing data                                100001    40.752097     0.000408     0.000333     0.004554
  4. make unique and use it:
PLUMED: 2 Sharing data                                100001     5.448742     0.000054     0.000050     0.011383
  5. make unique and use it, also using a priority queue:
PLUMED: 2 Sharing data                                100001     6.579670     0.000066     0.000064     0.010893

I also tested it with multiple walkers (2 nodes), which also shows a lot of improvement:

  1. baseline:
PLUMED: 2 Sharing data                                100001    43.225760     0.000432     0.000355     0.021329
  2. optimize-unique branch:
PLUMED: 2 Sharing data                                100001     7.150335     0.000072     0.000066     0.003508

So, according to these numbers, would it be possible to combine both of our optimizations?

@GiovanniBussi
Member Author

@shazj99 thanks for testing!

I am not sure it makes sense to combine both optimizations. As you have seen, in order to get the speedup on a system like yours (where only a few atoms are used) it is necessary to use the unique list (setting number 4). Once you are using the unique list, the "lazy" optimization is not needed, since we already know exactly which atoms are needed for all actions to work. I think the two optimizations do more or less the same thing, but the "unique" list (given the optimization made by @carlocamilloni a few years ago) is also very effective in parallel runs.

So, I would just merge this and close #805 if you agree.

What surprises me a bit is your result in setting number 2. In theory, this should be slow since one is not using the unique list and all atoms are copied. The run should have export PLUMED_FORCE_UNIQUE=no set.

@shazj99

shazj99 commented Apr 15, 2022

> @shazj99 thanks for testing!
>
> I am not sure it makes sense to combine both optimizations. As you have seen, in order to get the speedup on a system like yours (where only a few atoms are used) it is necessary to use the unique list (setting number 4). Once you are using the unique list, the "lazy" optimization is not needed, since we already know exactly which atoms are needed for all actions to work. I think the two optimizations do more or less the same thing, but the "unique" list (given the optimization made by @carlocamilloni a few years ago) is also very effective in parallel runs.
>
> So, I would just merge this and close #805 if you agree.

OK, I see.

> What surprises me a bit is your result in setting number 2. In theory, this should be slow since one is not using the unique list and all atoms are copied. The run should have export PLUMED_FORCE_UNIQUE=no set.

I reran it with PLUMED_MAKE_UNIQUE=no and PLUMED_FORCE_UNIQUE=no, and it seems to meet expectations:

PLUMED: 2 Sharing data                                100001    39.207612     0.000392     0.000343     0.013514

@shazj99

shazj99 commented Apr 15, 2022

@GiovanniBussi Another thing: I notice that some CVs or Actions are also slow, and PLUMED cannot run on the GPU. Do you have any plans to support running on GPUs? Some expensive operations could be sped up, and PLUMED could cooperate better with GROMACS.

@GiovanniBussi
Member Author

@shazj99 a few CVs can run on GPU already, and we have a tentative plan to port more of them in the future, but we don't have any precise timeline yet.

@carlocamilloni
Member

@GiovanniBussi I am doing a last quick test on LUMI-C right now

@carlocamilloni
Member

@GiovanniBussi ok, on parallel runs I do not see any measurable effect. Good to go for me.
