Optimize unique list of used atoms #811
Conversation
In this commit one can use env vars to select the behavior. Might be changed to enforce the optimal choice.
Codecov Report

```
@@            Coverage Diff             @@
##           master     #811      +/-   ##
==========================================
+ Coverage   85.55%   85.57%   +0.02%
==========================================
  Files         597      597
  Lines       48863    48925      +62
==========================================
+ Hits        41805    41868      +63
+ Misses       7058     7057       -1
==========================================
```
src/core/Atoms.cpp (outdated)

```cpp
/// Force the construction of the unique list.
/// Can be used for timing the construction of the unique list.
/// export PLUMED_FORCE_UNIQUE=no
```
this should be PLUMED_MAKE_UNIQUE=no
This was meant for constructing the unique array without using it later.
Anyway, currently the code just checks whether the variable is set, so it will work even with PLUMED_MAKE_UNIQUE=no. The idea is to force building the array, just to check how expensive it would be to first build it and then use its size to decide whether it should be used or not. In my tests I saw that building it is not convenient, and thus I decided to base the heuristic on the size of the largest request rather than on the merged list.
At a first read it looks OK, but I would need to test it. Do we check the environment variable at every step? I don't think that would be good: does it have an impact on performance? Could we get different environment settings on multiple nodes by mistake? As a side note, starting from GROMACS 2022 there can be cases, based on some heuristic, where non-parallel runs still use shuffled atoms.
Regarding env vars: they are checked once, since they are used to initialize a static const variable. It's true that you should make sure they are set consistently on multiple nodes; I think you should pass them as an argument to … I think this is mostly meant for testing without recompilation. In perspective, we can remove these variables.
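To illustrate the "checked once" behavior mentioned above, here is a minimal sketch (the helper name is hypothetical, not PLUMED's actual code) of reading an environment variable into a static const, so that it is evaluated only on first use and never re-read. Note that, as discussed in the thread, this checks only whether the variable is set, not its value:

```cpp
#include <cstdlib>

// Hypothetical helper: the static const is initialized on the first call
// and cached for the lifetime of the process, so the environment is read
// exactly once. Only the *presence* of the variable is tested, mirroring
// the behavior discussed above (even PLUMED_FORCE_UNIQUE=no enables it).
inline bool forceUniqueRequested() {
  static const bool value = (std::getenv("PLUMED_FORCE_UNIQUE") != nullptr);
  return value;
}
```

A consequence of this pattern is that changing the environment after the first call has no effect, which is why the settings must agree across MPI ranks from the start.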
@carlocamilloni as a further note, I have another idea on how to make this even faster. I would like to keep track of the previously used sets of ActionAtomistic and keep a cache of a few unique lists (say, the last ten different activation patterns). This should be sufficient, since in most applications we have a couple of different STRIDEs, so that we have an alternation of the same lists. The cache would be reset when we make a new domain decomposition, and when individual actions change their request list. In any case, there will be cases where lists are constructed anyway, so making the construction as fast as possible will still have an impact.
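The caching idea above could be sketched as follows. This is only an illustration of the proposal (class and member names are invented, not PLUMED's API): the merged unique list is keyed by the activation pattern of the actions, and the whole cache is cleared when requests or the domain decomposition change:

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Illustrative sketch of the proposed cache: reuse a previously merged
// unique list when the same activation pattern reoccurs (e.g. alternating
// STRIDEs), rebuilding it only for patterns not seen before.
class UniqueListCache {
  std::map<std::vector<bool>, std::vector<int>> cache_;
public:
  const std::vector<int>& get(const std::vector<bool>& active,
                              const std::vector<std::vector<int>>& requests) {
    auto it = cache_.find(active);
    if (it != cache_.end()) return it->second;  // cache hit: reuse old merge
    std::vector<int> merged;
    for (std::size_t i = 0; i < requests.size(); ++i)
      if (active[i])
        merged.insert(merged.end(), requests[i].begin(), requests[i].end());
    std::sort(merged.begin(), merged.end());
    merged.erase(std::unique(merged.begin(), merged.end()), merged.end());
    return cache_.emplace(active, std::move(merged)).first->second;
  }
  // To be called on a new domain decomposition or when any action
  // changes its request list, since cached merges are then stale.
  void clear() { cache_.clear(); }
};
```

A real implementation would likely bound the cache size (the thread suggests about ten entries) and evict old patterns, which is omitted here for brevity.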
hi @GiovanniBussi, I've tested it on my system:
I also tested it with multiple walkers (2 nodes), which also shows a lot of improvement:
So according to these numbers, is it possible to combine both of our optimizations?
@shazj99 thanks for testing! I am not sure it makes sense to combine both optimizations. As you have seen, in order to get the speedup on a system like yours (with few atoms used) it is necessary to use the unique list (setting number 4). Once you are using the unique list, the "lazy" optimization is not needed, since we already know exactly which atoms are needed for all actions to work. I think the two optimizations do more or less the same thing, but the "unique" list (given the optimization made by @carlocamilloni a few years ago) is also very effective in parallel runs. So, I would just merge this and close #805 if you agree. What surprises me a bit is your result in setting number 2. In theory, this should be slow, since one is not using the unique list and all atoms are copied. The run should have …
OK, I see.
I reran it with …
@GiovanniBussi Another thing: I notice that some CVs or Actions are also slow, and PLUMED cannot run on GPU. Do you have any plans to support running on GPU? Some expensive operations could be sped up, and PLUMED could cooperate with GROMACS better.
@shazj99 a few CVs can run on GPU already, and we have a tentative plan to port more of them in the future, but we don't have any precise timeline yet.
@GiovanniBussi I am doing a last quick test on LUMI-C right now |
@GiovanniBussi ok, on parallel runs I do not see any measurable effect. Good to go with me |
I tried to optimize the generation of the lists of used atoms within PLUMED.

Background: in current master, PLUMED uses `std::set`s to store the sets of atoms used in each CV. These sets are then merged, and the resulting set is used when running with multiple MPI processes to (a) minimize communication and (b) optimize the copy of coordinates and forces. When running with a single MPI process, the generation of the merged set is skipped, since it is expensive.

Proposed change: I converted all these `std::set`s to sorted `std::vector`s which, to my surprise, are considerably faster. The expected result is that (a) in parallel, there might be a small speedup and (b) in serial, it now makes sense to use the merged vector to optimize the copy of coordinates and forces.

The last point requires some heuristic. When using a small fraction of the atoms (e.g., solute only), it is convenient to generate the merged vector and use it. When using a large fraction of the atoms (e.g., all atoms), it is convenient not to do it. Since the generation of the merged vector is still the expensive part, I heuristically decided to (a) skip the merged vector when a single action requests more than half of the atoms and (b) generate it otherwise. This choice can be overridden with an environment variable for testing.
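The sorted-vector approach and the heuristic described above can be sketched like this (function names are illustrative, not PLUMED's actual API): per-action requests are kept as vectors, and the merged unique list is built by concatenating, sorting, and deduplicating:

```cpp
#include <algorithm>
#include <vector>

// Sketch of merging per-action atom requests into one sorted, duplicate-free
// list, the role played by the merged std::set in current master but using
// sorted std::vectors instead.
std::vector<int> mergeUnique(const std::vector<std::vector<int>>& requests) {
  std::vector<int> merged;
  for (const auto& r : requests)
    merged.insert(merged.end(), r.begin(), r.end());
  std::sort(merged.begin(), merged.end());
  merged.erase(std::unique(merged.begin(), merged.end()), merged.end());
  return merged;
}

// Sketch of the heuristic: do not build the merged vector if any single
// action requests more than half of the natoms atoms, since the merge
// itself is the expensive part and would not pay off.
bool shouldMerge(const std::vector<std::vector<int>>& requests,
                 std::size_t natoms) {
  for (const auto& r : requests)
    if (2 * r.size() > natoms) return false;
  return true;
}
```

This keys the decision on the size of the largest single request, which is cheap to compute before any merging is done.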
Based on some timing I did on a test system, this code should be faster than the one proposed in #805 (@shazj99 could you please confirm?).
@carlocamilloni can you also double check if I correctly modified the code that you optimized for parallel execution?