Optimize unique list of used atoms #811
Conversation
In this commit one can use env vars to select the behavior. Might be changed to enforce the optimal choice.
Codecov Report

```
@@            Coverage Diff             @@
##           master     #811      +/-   ##
==========================================
+ Coverage   85.55%   85.57%   +0.02%
==========================================
  Files         597      597
  Lines       48863    48925      +62
==========================================
+ Hits        41805    41868      +63
+ Misses       7058     7057       -1
==========================================
```
src/core/Atoms.cpp (outdated)

```cpp
/// Force the construction of the unique list.
/// Can be used for timing the construction of the unique list.
/// export PLUMED_FORCE_UNIQUE=no
```
this should be PLUMED_MAKE_UNIQUE=no
This was meant for constructing the unique array without using it later.
Anyway, currently the code just checks whether the variable is set, so it will work even with PLUMED_MAKE_UNIQUE=no. The idea is to force building the array, just to check how expensive it would be to first build it and then use its size to decide whether it should be used or not. In my tests I saw that building it is not convenient, and thus I decided to base the heuristic on the size of the largest request rather than on the merged list.
At a first read it looks OK, but I would need to test it. Do we check the environment variable at every step? I don't think that would be good: does it have an impact on performance? Could we get different environment settings on multiple nodes by mistake? As a side note, starting from GROMACS 2022 there can be cases, based on some heuristic, where non-parallel runs still use shuffled atoms.
Regarding env vars: they are checked once, since they are used to initialize a static const variable. It's true that you should make sure they are set consistently on multiple nodes; I think you should pass them as an argument to … I think this is mostly meant for testing without recompilation. In perspective, we can remove these variables.
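To illustrate the "checked once" behavior mentioned above, here is a minimal sketch (the helper name is hypothetical, not PLUMED's actual code) of reading an environment variable into a static const, so that it is evaluated only on first use and never re-read. Note that, as discussed in the thread, this checks only whether the variable is set, not its value:

```cpp
#include <cstdlib>

// Hypothetical helper: the static const is initialized on the first call
// and cached for the lifetime of the process, so the environment is read
// exactly once. Only the *presence* of the variable is tested, mirroring
// the behavior discussed above (even PLUMED_FORCE_UNIQUE=no enables it).
inline bool forceUniqueRequested() {
  static const bool value = (std::getenv("PLUMED_FORCE_UNIQUE") != nullptr);
  return value;
}
```

A consequence of this pattern is that changing the environment after the first call has no effect, which is why the settings must agree across MPI ranks from the start.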
@carlocamilloni as a further note, I have another idea on how to make this even faster. I would like to keep track of the previously used sets of ActionAtomistic and keep a cache of a few unique lists (say, the last ten different activation patterns). This should be sufficient, since in most applications we have a couple of different STRIDEs, so that we have an alternation of the same lists. The cache would be reset when we make a new domain decomposition, and when individual actions change their request list. In any case, there will be cases where lists are constructed anyway, so making the construction as fast as possible will still have an impact.
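The caching idea above could be sketched as follows. This is only an illustration of the proposal (class and member names are invented, not PLUMED's API): the merged unique list is keyed by the activation pattern of the actions, and the whole cache is cleared when requests or the domain decomposition change:

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Illustrative sketch of the proposed cache: reuse a previously merged
// unique list when the same activation pattern reoccurs (e.g. alternating
// STRIDEs), rebuilding it only for patterns not seen before.
class UniqueListCache {
  std::map<std::vector<bool>, std::vector<int>> cache_;
public:
  const std::vector<int>& get(const std::vector<bool>& active,
                              const std::vector<std::vector<int>>& requests) {
    auto it = cache_.find(active);
    if (it != cache_.end()) return it->second;  // cache hit: reuse old merge
    std::vector<int> merged;
    for (std::size_t i = 0; i < requests.size(); ++i)
      if (active[i])
        merged.insert(merged.end(), requests[i].begin(), requests[i].end());
    std::sort(merged.begin(), merged.end());
    merged.erase(std::unique(merged.begin(), merged.end()), merged.end());
    return cache_.emplace(active, std::move(merged)).first->second;
  }
  // To be called on a new domain decomposition or when any action
  // changes its request list, since cached merges are then stale.
  void clear() { cache_.clear(); }
};
```

A real implementation would likely bound the cache size (the thread suggests about ten entries) and evict old patterns, which is omitted here for brevity.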
hi @GiovanniBussi, I've tested it on my system:
I also tested it with multiple walkers (2 nodes), which also shows a lot of improvement:
So according to these numbers, is it possible to combine both of our optimizations?
@shazj99 thanks for testing! I am not sure it makes sense to combine both optimizations. As you have seen, in order to get the speedup on a system like yours (with few atoms used) it is necessary to use the unique list (setting number 4). Once you are using the unique list, the "lazy" optimization is not needed, since we already know exactly which atoms are needed for all actions to work. I think the two optimizations do more or less the same thing, but the "unique" list (given the optimization made by @carlocamilloni a few years ago) is also very effective in parallel runs. So, I would just merge this and close #805 if you agree. What surprises me a bit is your result in setting number 2. In theory, this should be slow, since one is not using the unique list and all atoms are copied. The run should have …
OK, I see.
I reran it with …
@GiovanniBussi Another thing: I notice that some CVs or Actions are also slow, and PLUMED cannot run on GPU. Do you have any plans to support running on GPU? Some expensive operations could be sped up, and PLUMED could cooperate with GROMACS better.
@shazj99 a few CVs can run on GPU already, and we have a tentative plan to port more of them in the future, but we don't have any precise timeline yet.
@GiovanniBussi I am doing a last quick test on LUMI-C right now |
@GiovanniBussi ok, on parallel runs I do not see any measurable effect. Good to go with me |
I tried to optimize the generation of the lists of used atoms within PLUMED.

Background: in current master, PLUMED uses `std::set`s to store the sets of atoms used in each CV. These sets are then merged, and the resulting set is used when running with multiple MPI processes to (a) minimize communication and (b) optimize the copy of coordinates and forces. When running with a single MPI process, the generation of the merged set is skipped, since it is expensive.

Proposed change: I converted all these `std::set`s to sorted `std::vector`s which, to my surprise, are considerably faster. The expected result is that (a) in parallel, there might be a small speedup and (b) in serial, it now makes sense to use the merged vector to optimize the copy of coordinates and forces.

The last point requires some heuristic. When using a small fraction of the atoms (e.g., solute only), it is convenient to generate the merged vector and use it. When using a large fraction of the atoms (e.g., all atoms), it is convenient not to do it. Since the generation of the merged vector is still the expensive part, I heuristically decided to (a) skip the merged vector when a single action requests more than half of the atoms and (b) generate it otherwise. This choice can be overridden with an environment variable for testing.
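The sorted-vector approach and the heuristic described above can be sketched like this (function names are illustrative, not PLUMED's actual API): per-action requests are kept as vectors, and the merged unique list is built by concatenating, sorting, and deduplicating:

```cpp
#include <algorithm>
#include <vector>

// Sketch of merging per-action atom requests into one sorted, duplicate-free
// list, the role played by the merged std::set in current master but using
// sorted std::vectors instead.
std::vector<int> mergeUnique(const std::vector<std::vector<int>>& requests) {
  std::vector<int> merged;
  for (const auto& r : requests)
    merged.insert(merged.end(), r.begin(), r.end());
  std::sort(merged.begin(), merged.end());
  merged.erase(std::unique(merged.begin(), merged.end()), merged.end());
  return merged;
}

// Sketch of the heuristic: do not build the merged vector if any single
// action requests more than half of the natoms atoms, since the merge
// itself is the expensive part and would not pay off.
bool shouldMerge(const std::vector<std::vector<int>>& requests,
                 std::size_t natoms) {
  for (const auto& r : requests)
    if (2 * r.size() > natoms) return false;
  return true;
}
```

This keys the decision on the size of the largest single request, which is cheap to compute before any merging is done.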
Based on some timing I did on a test system, this code should be faster than the one proposed in #805 (@shazj99 could you please confirm?).
@carlocamilloni can you also double check if I correctly modified the code that you optimized for parallel execution?