Useful walker ids #5010
Did my first pass.
Drive-by comment: I think there are only two requirements to satisfy here: (i) the scheme is performant (e.g. minimal comms) and (ii) walker IDs are unique. If it helps, and I expect it does, I think walker IDs can be completely arbitrary numbers. Rationale: postmortem analysis will be done by automation. Do we have any other actual requirements, as opposed to nice-to-have properties (e.g. sequential, increasing)?
If we use UUIDs (sorry, I think they are only called GUIDs on Windows), we would need no state, but without post-processing they would be basically human-unreadable.
Non-drive-by comment, now that I have thought about this a bit more: since walkers are created at the MPI rank (vs crowd) level, we can count modulo num_ranks and offset the counters by the rank. The only state needed is the number of walkers created by each rank, and this never has to be communicated. This is very close to what is implemented here, but I don't understand the discussion of the "golden walker". What is meant by that? It seems unnecessary.
Two-rank example with four initial walkers:
rank 0: walkers 0,2
rank 1 creates an extra walker:
rank 0: walkers 0,2
rank 0 kills walker 2, and we load balance:
rank 0: walkers 0,5
rank 0 creates an extra walker:
rank 0: walkers 0,5,4
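A minimal sketch of the counting scheme proposed in this comment, assuming zero-based IDs as in the two-rank example. `StridedIdCounter` and `next()` are hypothetical names for illustration and are not part of QMCPACK.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not QMCPACK code: hands out IDs that are unique across
// all ranks without any communication between them.
class StridedIdCounter
{
public:
  StridedIdCounter(std::int64_t rank, std::int64_t num_ranks) : rank_(rank), num_ranks_(num_ranks)
  {
    assert(num_ranks > 0 && rank >= 0 && rank < num_ranks);
  }

  // The only state is the number of walkers this rank has created so far.
  // Rank r produces r, r + num_ranks, r + 2 * num_ranks, ...
  std::int64_t next() { return num_created_++ * num_ranks_ + rank_; }

private:
  std::int64_t rank_;
  std::int64_t num_ranks_;
  std::int64_t num_created_ = 0;
};
```

With two ranks, rank 0 draws 0, 2, 4 and rank 1 draws 1, 3, 5, which matches the IDs in the example: the extra walker created on rank 1 is 5, and the walker rank 0 creates after load balancing is 4.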
Here is my expectation of the walker ID. It represents the location (rank, index within the rank) of the walker in each step. It is not carried over between steps. At initialization or after branching, walker IDs are reassigned and the old value is kept as the parent walker ID. Together, the current and parent IDs should allow us to reconstruct walker traces.
I don't get why this is even needed in the current scheme, since walkers are created and annihilated from step to step. If the intention is to make the walker ID never change after creation, I would like to see how branching and trace reconstruction are managed. I don't mean an implementation but some description before adopting this idea. Regarding golden walkers: walkers have configurations (electron positions) and contexts (ParticleSet, TWF, Ham).
Treat |
I think we need to discuss the envisioned use cases for this. What Jaron, others, and I are after is indeed "making the walker ID never change after creation", so that we can e.g. follow the history of presumed "bad" walkers. With a walker's unique ID plus that of its parent (which it would have to store), we could do that. IDs would never be reused, so the scripting would be easy. I believe this is what the legacy code attempts/attempted to do as well, not that it has been carefully verified in the modern era. Do you or anyone else have other uses in mind?
To follow the bad walker, you will need to "backtrace". If IDs are fully unique, you will need to search for the parent ID in the population at every step, which is very bad. With my ID scheme, the ID is an index and there is no need to search.
Maybe some crossed language here: once created, walkers have fixed ID independent of timestep. So if a run aborts due to population explosion (most common case), we just follow the low energy walker back to when it was moved to the bad position (assuming we were logging output). This is usually a walker with a single ID. Perhaps your index == what I was describing as a fixed ID? See my example a few comments back. |
In my indexing scheme, the index is only valid for the current step and the parent index is valid for the previous step. Since it is already an index, lookup is O(1). In your fixed ID scheme, the ID is unique but lookup is painful. Say that at step N the walker with id=5 goes bad and its parent ID is X. We backtrace to step N-1, which has 100 walkers; because the walker IDs are unsorted, we need to search all 100 walkers to find X in order to read information like coordinates/energy/weight. What if we have 10^6 walkers instead of 100? Lookup takes O(N).
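A minimal sketch of the index-based backtrace described in this comment, assuming each step's records are stored as an array and each record holds the index of its parent in the previous step. `StepRecord` and `backtraceEnergies` are hypothetical names; the real trace data holds more than an energy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record layout for illustration only.
struct StepRecord
{
  std::size_t parent_index; // index of the parent walker in the previous step
  double energy;
};

// steps[s][i] is walker i at step s. Returns the energies along the ancestry
// of walker `index` at step `step`, ordered from that step back to step 0.
std::vector<double> backtraceEnergies(const std::vector<std::vector<StepRecord>>& steps,
                                      std::size_t step,
                                      std::size_t index)
{
  std::vector<double> energies;
  for (std::size_t s = step + 1; s-- > 0;)
  {
    const StepRecord& rec = steps.at(s).at(index); // direct indexing, no search
    energies.push_back(rec.energy);
    index = rec.parent_index; // only meaningful while s > 0
  }
  return energies;
}
```

Each hop costs one array access, so a backtrace over N steps is O(N) regardless of the population size. The trade-off raised elsewhere in this thread is that the index changes every step, so it cannot serve as a permanent walker identity.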
When using my index(parent index). @prckent Please also clarify what is parent walker ID in your scheme and how to handle killing/branching. |
The only envisaged use I have for this facility is offline debugging and maybe occasional online debugging. There is zero cost to everyday simulations beyond moving a few extra bytes and one extra piece of state. See what I put in the requirements above. What are you envisaging using this for? |
I'm thinking about a fast way of backtracing a bad walker and accessing information like coordinates and energy along the evolution path in the analysis phase. This analysis is usually done offline. The walker ID is the only way of tracking walkers. The walker ID from creation is not enough to backtrace a walker to step 0 due to branching, and thus the parent walker ID is also needed. That is why I'm wondering how to reconstruct a backtrace with your labeling scheme. In particular, I need this backtrace construction to be fast and not disk I/O or memory intensive. For a walker at step N with a parent walker ID X, I need an efficient way to find X at step N-1 and read out its coordinates and energy.
Indeed, I intend for both walker and parent ID's to be immutable and carried with a walker upon/following its creation. Using hash tables in post-processing reduces search times to O(1) following an O(N) construction. The walker tracing functionality (#5019) naturally organizes the walker data by rank (one file per rank), obviating the need to track it explicitly. As Ye and I have discussed offline, a slight generalization of the current functionality could optionally segment the outputted buffer data by MC block to further reduce the construction cost if data only from particular blocks is desired. I definitely prefer using a simple strided indexing formula like the one currently in |
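A minimal sketch of the hash-table lookup mentioned in this comment, assuming the rows of one step have already been read into memory. `TraceRow`, `buildIndex`, and `findParent` are hypothetical names and do not describe the actual trace file layout from #5019.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory row for one walker at one step.
struct TraceRow
{
  long walker_id;
  long parent_id;
  double energy;
};

using WalkerIndex = std::unordered_map<long, std::size_t>;

// O(N) construction: map each immutable walker ID to its row position.
WalkerIndex buildIndex(const std::vector<TraceRow>& rows)
{
  WalkerIndex index;
  index.reserve(rows.size());
  for (std::size_t i = 0; i < rows.size(); ++i)
    index.emplace(rows[i].walker_id, i);
  return index;
}

// Average O(1) lookup of a child's parent among the previous step's rows.
// Returns nullptr if the parent is absent (e.g. parent_id == 0).
const TraceRow* findParent(const std::vector<TraceRow>& prev_rows,
                           const WalkerIndex& prev_index,
                           const TraceRow& child)
{
  auto it = prev_index.find(child.parent_id);
  return it == prev_index.end() ? nullptr : &prev_rows[it->second];
}
```

The index has to be rebuilt for every step that is inspected, which is the construction cost the following comment objects to; segmenting the output by MC block would limit how many steps need to be loaded.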
Here is the issue with an immutable walker ID. If the walker ID is based on the MPI rank where the walker was created, not the rank where the walker is actually located, looking up the walker requires loading information about all the walkers of a given step. The walker tracing functionality (#5019) is already quite unfriendly to I/O due to the large amount of data being written. To construct a backtrace, all the trace data from every MPI rank has to be read from disk and stored in memory, although this can be improved by blocking the data as datasets in the trace files. The cost of creating a hash map from walker ID to record location for each step is obviously not O(1).
Examining in more detail how the driver code actually sequences the transfers and copies, I realized we can in fact do all the necessary walker spawning so that all walker IDs are rank local (see below for the definition of that); parent IDs will allow tracking the originating walker whether it is on or off rank. See the new walker control tests for how this works in more detail. Much more of the walker control is now actually unit tested as well. I'd still consider coverage to be pretty incomplete, as the tests are easy cases and only single stage, but I do think they nail down the expected behavior more clearly than in the past. It seems logical to me to have "parentless" walkers have parent id == 0, i.e. the null walker. These are IDs and not array indexes (pointer offsets). From this it follows that no walker's ID should be 0 unless it is a blank, not yet properly initialized walker, something which happens in legacy code, i.e. locations where walkers are constructed outside of MCPopulation. At least until we refactor WalkerConfiguration, Walker and ParticleSet there will still be walkers that go through a two-stage creation where they are invalid for a time.
@PDoakORNL could you clarify "rank id's are local"? Which rank ID were you referring to? There are two ways of indexing walkers at a given step.
Method 1 eliminates the info about ranks. It is more flexible if rank info is not critical. It can also be more difficult if rank info is needed. I don't quite follow your mention of "parentless" walkers and also have difficulty understanding your two-stage comments. Here is my understanding of the code. so in a driver
I mean that on a rank all walker IDs will satisfy (walker_id - 1) % num_ranks == rank. I'll edit that to make it more clear. I believe in practice walkers should not be getting moved more than necessary, so the IDs should be preserved between steps unless the walker actually changed rank or unless it is a copy from another walker on the rank, in which case it gets a new ID. In practice I don't think a large percentage of walkers die, are born, or are transferred each step. IDs are never reused and are not based on the local index, only on the number of new walkers born on the rank. There are no gaps if you look at the sequence of a particular rank's IDs with integer division by the number of ranks. IDs should be unique for a particular walker and therefore cannot be based on the local walker index. My understanding of mw_saveWalker ... mw_loadWalker was that it was not supposed to treat the walkers and walker elements as just slots that could have a completely new identity every step as we serialize in and out of some insane untyped byte buffer, but rather that a walker that had not been killed would be a continuation of a particular walker trajectory. It's hard to be certain since this whole save/load thing moving state between walkers and particle sets is a little... 😭 But I have always intended that the indexes of walker elements will never be mixed, i.e. Walker[1], ParticleSet[1], TWF[1], Hamiltonian[1]
Here is an example of the bubble I was referring to. |
If the walker ID is only intended to be a unique ID, then the scheme in this PR is OK. However, it cannot also be used to convey more information, such as the MPI rank or index. As long as killing, birth, and migration do happen, such scenarios should be taken into account. I don't think the percentage of walkers affected changes our criteria for choosing an ID scheme.
Now @PDoakORNL is saying walker id is mutable. @prckent @jtkrogel @PDoakORNL we need a discussion. |
I should also add that any time I think about producing massive amounts of data I consider how much work getting it into a relational database would be. That is, after all, "what you do" when working with data that won't fit in memory and that you want to make structured queries against. I'd use tables like:
Walker ID is not mutable. But it is not portable to another rank. |
Unfortunately, killing is a physical term, not a technical term. Migrating a walker cannot be treated as a kill and a birth.
From my perspective there are no longer any gaps. The walker IDs come from N = num_ranks disjoint sequences of positive integers. Each value x in {0, ..., num_ranks-1} determines a sequence of walker IDs whose members are the positive integers w_id such that (w_id - 1) % num_ranks == x. Each rank uses the sequence with x == rank. On any particular rank all IDs are used sequentially for active walkers. An active walker is one that will attempt a move on the next step. A walker being overwritten must either have been freshly constructed or have been, however briefly, in an inactive state, and it receives the next rank ID.
In the above example I was talking about gaps between ranks not within a rank. ids spaced by num_ranks were not considered disjoint. |
In your example the next time rank two creates a walker it will be 10. |
At the current step, there is no walker 10. If the simulation ends here, there is no walker 10. |
Perhaps just post the pseudo-code and the ID's generated by 2-3 adjacent ranks? |
a458c1f to 613633e
docs/developing.rst
Outdated
:label: eq_walker_id

walker_id = num_walkers_created_++ * num_ranks_ + rank_ + 1

where ``num_walkers_created_`` is a member variable of the sole ```MCPopulation`` object on the rank and initially set to 0. Each walkers ``parent_id`` is set at initiation of the walkers configuration to the walker from home the configuration is assigned. If that assignment is from previous section's or run's ``WalkerConfigurations`` object then the value of the ``Walker::getWalkerID()`` is multiplied by -1. If the Walker's initial configuration comes from the golden particle set the parent_id will be 0. |
"set to 0" each driver run or once the whole lifetime?
If a walker is not assigned from a walker config walker, a transferred walker, or an amplified walker, the parent_id is left as the default, i.e. 0.
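A minimal sketch of the ID formula quoted from docs/developing.rst in this thread, together with the inverse mapping back to the creating rank. `WalkerIdSource`, `creatingRank`, and `creationOrdinal` are hypothetical names standing in for the counter held by MCPopulation; they are not the PR's actual code.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-rank counter held by MCPopulation.
struct WalkerIdSource
{
  long rank;
  long num_ranks;
  long num_walkers_created = 0;

  // walker_id = num_walkers_created_++ * num_ranks_ + rank_ + 1
  // The +1 keeps 0 free to mean "no parent" or "not yet initialized".
  long nextWalkerID() { return num_walkers_created++ * num_ranks + rank + 1; }
};

// Because of the +1 offset, the creating rank is recovered from (id - 1).
inline long creatingRank(long walker_id, long num_ranks)
{
  assert(walker_id > 0 && num_ranks > 0);
  return (walker_id - 1) % num_ranks;
}

// How many walkers the creating rank had made before this one.
inline long creationOrdinal(long walker_id, long num_ranks)
{
  assert(walker_id > 0 && num_ranks > 0);
  return (walker_id - 1) / num_ranks;
}
```

For four ranks, rank 1 draws 2, 6, 10, 14 and rank 3 draws 4, 8, 12, 16, the same per-rank ID sequences that appear in the test output in the PR description.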
What is "home"?
The above section seems to explain what happens when walker IDs and parent IDs are initialized.
Could you document what happens during branching?
I've added documentation of branching. I could not find any other discussion of our branching algorithm in the manual. Am I just missing it?
src/Particle/Walker.h
Outdated
   */
  long walker_id_ = 0;
  /** in legacy the ancients have said only:
   *  id reserved for forward walking
   *
   *  In Batched
   *  If a walker is initialized from the golden particle set it keeps the default constructed 0
We are talking about the lightweight walker here. It should not reference the "golden particle set".
I've cleaned these docs up now.
src/QMCDrivers/DMC/WalkerControl.cpp
Outdated
@@ -236,13 +208,15 @@ void WalkerControl::branch(int iter, MCPopulation& pop, bool do_not_branch)
    throw std::runtime_error("Potential bug! Population num_global_walkers mismatched!");
#endif

  // This defensive code implies that previous code left population walkers in invalid state.
This is not defensive code.
OK, so by my reading through the walker life cycle, branching, and population control, walkers must always have Multiplicity == 1.0 after amplification and/or transfer. Multiplicity is set in a small number of significant locations (except for here):
1. On construction, the default member initializer of Walker sets the initial value to 1.0.
2. During warmup, Multiplicity = 1.0.
3. To static_cast<int>(walker->Weight + rng_()), resulting in an integer value >= 0.
4. After swapping, where the on-rank multiplicity is set to the remaining number of copies on a rank.
5. After swapping, where a new walker has its multiplicity set to the number of copies received + 1 (the number of copies received is counted from 0 due to lack of discrimination between counting numbers and indexing numbers).
6. After amplification, i.e. copyHighMultiplicityWalkers and killDeadWalkers, the population will contain only walkers of Multiplicity == 1.
7. Here, where all living walkers have their multiplicity set to 1.
Several of these state transforms appear defensive in nature only, i.e. they can only cover up walkers arriving at this function in an invalid state.
- Can only erase a state transformation to a walker that left Multiplicity in an incorrect state. Fresh walkers have multiplicity == 1, and walkers from a walker configuration have multiplicity == 1. An assert should be added and this side of the branch is a no-op.
- If multiplicity != 1, something is wrong; this simply covers it.
4, 5 are only necessary because of how we construct transfer jobs, where basically walkers have a multiplicity per destination for a short span of code. During this section Walker->Multiplicity is no longer a simple value.
Multiplicity is a very important value for correct population control, so reasoning about its update and validity should be as simple as possible.
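A minimal sketch of the stochastic rounding in the third item of the list above, where Multiplicity is set from the walker weight plus a uniform random number. `multiplicityFromWeight` and `sampleMultiplicity` are hypothetical names, and `std::uniform_real_distribution` is used here as an assumed stand-in for the driver's `rng_()`.

```cpp
#include <cassert>
#include <random>

// Truncating weight + u, with u uniform in [0, 1), gives floor(weight) or
// floor(weight) + 1 such that the expected multiplicity equals the weight.
inline int multiplicityFromWeight(double weight, double u)
{
  assert(weight >= 0.0 && u >= 0.0 && u < 1.0);
  return static_cast<int>(weight + u);
}

// Convenience wrapper that draws u from any standard random engine.
template<class RNG>
int sampleMultiplicity(double weight, RNG& rng)
{
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  return multiplicityFromWeight(weight, uniform(rng));
}
```

A weight of 1.3 yields multiplicity 2 with probability 0.3 and multiplicity 1 otherwise; a weight of 0.4 yields 0 (the walker dies) with probability 0.6. This is why the result is an integer >= 0 and why every later stage needs to bring surviving walkers back to multiplicity 1.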
I'm clarifying the comments, I don't want to lose this insight but it can't really be cleaned up properly until we drop the legacy drivers.
src/QMCDrivers/DMC/WalkerControl.cpp
Outdated
// recv the number of copies from the target
myComm->comm.receive_n(&nsentcopy, 1, plus[ic]);
job_list.push_back(job(newW.size() - 1, plus[ic]));
if (plus[ic] != plus[ic + nsentcopy] || minus[ic] != minus[ic + nsentcopy])
  throw std::runtime_error("WalkerControl::swapWalkersSimple send/recv pair checking failed!");
#ifdef MCWALKERSET_MPI_DEBUG
fout << "rank " << minus[ic] << " recvs a walker with " << nsentcopy << " copies from rank " << plus[ic] | ||
std::cout << "rank " << minus[ic] << " recvs a walker with " << nsentcopy << " copies from rank " << plus[ic] |
Your temporary change?
fixed
src/QMCDrivers/DMC/WalkerControl.cpp
Outdated
@@ -462,11 +440,15 @@ void WalkerControl::swapWalkersSimple(MCPopulation& pop)
      }
      else
      {
        // Walker::copyFromBuffer copies the walker_index_ of the sent walker and the local walker does
        // not have a local id created.
        // the walker on this rank. But this replacement allows tracking this walker across the ranks.
Please improve the comment.
src/QMCDrivers/DMC/WalkerControl.cpp
Outdated
awalker.copyFromBuffer();
auto parent_id = awalker.getWalkerID();
awalker.setWalkerID(walker_id);
awalker.setParentID(parent_id);
Could you explain all the settings?
Why auto parent_id = awalker.getWalkerID();?
Are you sure the parent ID has also been wired?
I added a test and discovered the answer was "kind of". It was wired for transfer but not for reading from WalkerConfigurations.
I have since discovered that it is also not written to or read from the walker hdf5 file. I'd rather address that in a future PR.
src/QMCDrivers/DMC/WalkerControl.h
Outdated
 *  4. collect good, bad walkers
 *  5. communicate walkers
 *  6. unpack received walkers, apply walker count floor
 *  7. call MCPopulation to amplify walkers with Multiplicity > 1
Please move this back to its original location. Users calling this API don't need to know the implementation details.
Returned. I left some information for this call in the header.
src/QMCDrivers/DMC/WalkerControl.h
Outdated
 *  partial enum to access elements is the enum above dumped into WalkerControl's namespace
 *  includes many integral types converted to and from fp
 *  \todo MPI in the 21st centure allows user defined data types that would allow all this to benefit from actual
 *        C++ type safety and still can have performant collective operation efficiency.
MPI is C only. I don't think it knows C++ type safety. The derived type in MPI is intended for performance, not for safety. It is not useful in this particular case.
It looks like it's intended so everyone doesn't do things like this constantly.
Unfortunately this requires documenting branching and population control, which are traditionally not well documented.
I'm working on another rebase and force push. This has been sitting for a week while new PRs are getting merged. I'm not sure why anymore, but after this rebase I will not be keeping it current unless there is actual interest in merging it. I won't be fixing ID propagation between runs, as opposed to just between sections, until this PR goes in. It has enough in it: it tests what it adds, and it adds tests and documentation that should have always been there.
The OSX failures are also all with nexus and do not appear to have anything to do with this PR.
If it is not showing a conflict, there is no need to rebase aggressively. We rarely require PRs to be rebased before merging.
Sorry I was busy and didn't pay attention to commits closely. Sometimes it is hard for me to tell if a submitter is still working on it or it is ready for review again. It is not the best use of maintainer time monitoring PR progress. If you need a specific reviewer to respond, better to ping if the PR stalls. For me specifically, feel free to @ me on github or slack me. Unfortunately, we (submitters and reviewers) all have bandwidth limits and communication helps. |
Just noticed that you sent a new review request. I'm sorry that I missed it. |
OSX failures are all fixed and unrelated to this PR. I would like to get this merged since it adds very useful functionality. It was also helpful to read e.g. how the multiplicities are handled. |
If I don't rebase, I'm never sure what I'm going to end up with history-wise. I guess I should do some research on this. In this case I don't have dependent branches on this PR, so it probably makes no difference.
b8092cc to 3b2ad0d
I've tried to explain every code change in WalkerControl.cpp to ease concerns that any real change has been made in the DMC population control. Should I backport the unit tests and/or write additional ones to prove the behavior is the same?
{
  makeCopy(a);
  walker_id_ = walker_id;
Here I make these assignments after makeCopy. Putting them above means they get replaced by a's walker_id and parent_id, which is not the behavior needed or expected.
ScopedTimer copywalkers_timer(my_timers_[WC_copyWalkers]);
const size_t good_walkers = walkers.size();
This is entirely manipulation and calls that should be private to MCPopulation, so I refactored it into there. This also allows unit testing it with MCPopulation, which makes sense since the code can be tested without the need for the WalkerControl code.
auto unpackWalker = [](auto& awalker) {
  auto walker_id = awalker.getWalkerID();
  // Walker::copyFromBuffer overwrites the walker_id
  awalker.copyFromBuffer();
This lambda is just the repeated code for the synchronous and asynchronous receives below. Previously the walker_ids and parent_ids were just getting overwritten by the sent walker_id and parent_id. Across all ranks this resulted in duplicate IDs all over, which is not the behavior this PR is setting up. The copyFromBuffer from the current code is preserved here; the new statements only affect parent_id and walker_id, which are not used by the current code.
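A minimal sketch of the ID handling this lambda performs on a received walker, using a mock type so it stands alone. `MockWalker` and its `buffered_*` fields are hypothetical stand-ins for the real Walker and its byte buffer; only the order of operations mirrors the diff shown in this thread.

```cpp
#include <cassert>

// Hypothetical mock; the real Walker has far more state and a byte buffer.
struct MockWalker
{
  long walker_id = 0;
  long parent_id = 0;
  long buffered_walker_id = 0; // stands in for the ID serialized by the sender
  long buffered_parent_id = 0;

  // Like the behavior described above: deserializing overwrites both IDs.
  void copyFromBuffer()
  {
    walker_id = buffered_walker_id;
    parent_id = buffered_parent_id;
  }
};

inline void unpackWalker(MockWalker& awalker)
{
  const long local_id = awalker.walker_id;  // rank-local ID assigned at spawn
  awalker.copyFromBuffer();                 // now holds the sender's IDs
  const long sender_id = awalker.walker_id; // the walker this one was copied from
  awalker.walker_id = local_id;             // keep the ID unique to this rank
  awalker.parent_id = sender_id;            // record the off-rank origin
}
```

In the "Load Balance" test output from the PR description, the walker received on rank 1 ends up with walker ID 6 and parent ID 1, which is the result of this save, overwrite, and restore sequence.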
<< "for rank: " << rank_num_ << " total_multiplicity: " << TotalMultiplicity | ||
<< " fair_offset_[rank_num_ + 1] - fair_offset_[rank_num_]: " << fair_offset_[rank_num_ + 1] << " - " | ||
<< fair_offset_[rank_num_] << '\n'; | ||
throw std::runtime_error(error_msg.str()); |
This is just a less useless error message. It's in an #ifndef NDEBUG section, so this has no effect on production science.
    // keep good walker valid.
    good_walker.Multiplicity -= 1.0;
    num_copies--;
  }
This was the code that was in WalkerControl:
for (size_t iw = 0; iw < good_walkers; iw++)
{
  size_t num_copies = static_cast<int>(walkers[iw]->Multiplicity);
  while (num_copies > 1)
  {
    auto walker_elements = pop.spawnWalker();
    // save this walkers ID
    // \todo revisit Walker assignment operator after legacy drivers removed.
    // but in the modern scheme walker IDs are permanent after creation, what walker they
    // were copied from is in ParentID.
    long save_id           = walker_elements.walker.getWalkerID();
    walker_elements.walker = *walkers[iw];
    walker_elements.walker.setParentID(walker_elements.walker.getWalkerID());
    walker_elements.walker.setWalkerID(save_id);
    num_copies--;
  }
}
The significant difference here is that we continue to keep the multiplicities valid rather than just ignoring them until the end, when we set them all to 1.0 regardless of what happened to them. I'm of the opinion that if you have a state variable, its validity must be maintained through state changes, not valid for some states and invalid in others. That's asking for bugs later.
@@ -91,18 +149,6 @@ void MCPopulation::createWalkers(IndexType num_walkers, const WalkerConfiguratio

  outputManager.resume();

  int num_walkers_created = 0;
We don't have to wait until here to have valid walker and parent id's
@PDoakORNL Thanks for the additional comments. I'd like to request the following changes.
- Please revert the type change to Multiplicity. The purpose of this PR is to handle walker_id, so it will be better to handle the type change separately. I also feel Multiplicity should be an unsigned int, but all its uses need to be checked or preferably unit tested, so making a separate PR for that topic will be much better.
- Changes around walker.h and MCPopulations.h/cpp can be placed in a separate PR. They are much less exposed to the full scheme of ID handling, so the reviewing and merging turnaround should be fast.
- Then we handle the last but most complicated bits in WalkerControl.
Sorry for the additional refactoring, but that guarantees stable progress.
src/Particle/Walker.h
Outdated
   */
  FullPrecRealType Multiplicity = 1.0;
  int Multiplicity = 1.0;
Please revert this change and open an issue saying this can probably be set to int. I would actually prefer an unsigned int. My quick grep for 'Multiplicity =' seems to indicate that a more careful revisit may be needed. Changing the type in this PR is too risky.
Will do
  /** if true, this walker is either copied or tranferred from another MPI rank.
   *  significant because this walker will need a distance table recompute since we don't transfer them.
   *  So this is really a variable tracking the state of the ParticleSet associated with this walker.
   *  \todo this is a smell to be addressing in ParticleSet refactoring.
It is not just distance tables. Any data beyond coordinates needs to be recomputed, for example Slater matrices and Jastrow factors.
Proposed changes
The batched version of the drivers should now supply trackable walker IDs.
In practice, this is what it looks like after a load balance is done:
44: Test command: /usr/local/bin/mpiexec "-n" "4" "--oversubscribe" "/Users/Shared/ornldev/code/qmcpack/build_new/src/QMCDrivers/tests/test_new_drivers_mpi"
from the "MPI WalkerControl population swap walkers" test case
Section ("Load Balance")
44: count_before: { 4, 1, 1, 1, } count_after: { 1, 2, 2, 2, }
44: Walkers Per Rank (Total: 7)
44: rank: 0 walker ids: { 1, } parent ids: { 0, }
44: rank: 1 walker ids: { 2, 6, } parent ids: { 0, 1, }
44: rank: 3 walker ids: { 4, 8, } parent ids: { 0, 1, }
44: rank: 2 walker ids: { 3, 7, } parent ids: { 0, 1, }
Section("Load Balance Multiple Copy Optimization")
44: count_before: { 12, 1, 1, 1, } count_after: { 3, 4, 4, 4, }
44: Walkers Per Rank (Total: 15)
44: rank: 0 walker ids: { 1, 5, 9, } parent ids: { 0, 1, 1, }
44: rank: 1 walker ids: { 2, 6, 10, 14, } parent ids: { 0, 1, 6, 6, }
44: rank: 2 walker ids: { 3, 7, 11, 15, } parent ids: { 0, 1, 7, 7, }
44: rank: 3 walker ids: { 4, 8, 12, 16, } parent ids: { 0, 1, 8, 8, }
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
OSX laptop
Checklist
Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.