Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Useful walker ids #5010

Closed
wants to merge 20 commits into from
Closed

Conversation

PDoakORNL
Copy link
Contributor

@PDoakORNL PDoakORNL commented May 28, 2024

Proposed changes

Walker ID's in batched version of the drivers should now supply trackable walker ID's

Practically this looks like after a load balance is done:
44: Test command: /usr/local/bin/mpiexec "-n" "4" "--oversubscribe" "/Users/Shared/ornldev/code/qmcpack/build_new/src/QMCDrivers/tests/test_new_drivers_mpi"

from the "MPI WalkerControl population swap walkers" test case

Section ("Load Balance")
44: count_before: { 4, 1, 1, 1, } count_after: { 1, 2, 2, 2, }
44: Walkers Per Rank (Total: 7)
44: rank: 0 walker ids: { 1, } parent ids: { 0, }
44: rank: 1 walker ids: { 2, 6, } parent ids: { 0, 1, }
44: rank: 3 walker ids: { 4, 8, } parent ids: { 0, 1, }
44: rank: 2 walker ids: { 3, 7, } parent ids: { 0, 1, }

Section("Load Balance Multiple Copy Optimization")
44: count_before: { 12, 1, 1, 1, } count_after: { 3, 4, 4, 4, }
44: Walkers Per Rank (Total: 15)
44: rank: 0 walker ids: { 1, 5, 9, } parent ids: { 0, 1, 1, }
44: rank: 1 walker ids: { 2, 6, 10, 14, } parent ids: { 0, 1, 6, 6, }
44: rank: 2 walker ids: { 3, 7, 11, 15, } parent ids: { 0, 1, 7, 7, }
44: rank: 3 walker ids: { 4, 8, 12, 16, } parent ids: { 0, 1, 8, 8, }

What type(s) of changes does this code introduce?

  • New feature
  • Testing changes (e.g. new unit/integration/performance tests)
  • Documentation or build script changes

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

OSX laptop

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

  • Yes This PR is up to date with current the current state of 'develop'
  • Yes Code added or changed in the PR has been clang-formatted
  • Yes This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • Yes Documentation has been added (if appropriate)

Copy link
Contributor

@ye-luo ye-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did my first pass.

src/QMCDrivers/MCPopulation.h Outdated Show resolved Hide resolved
src/QMCDrivers/DMC/WalkerControl.cpp Show resolved Hide resolved
src/QMCDrivers/MCPopulation.cpp Show resolved Hide resolved
@prckent
Copy link
Contributor

prckent commented May 28, 2024

Drive-by comment: I think there are only two requirements to satisfy here (i) scheme is performant [e.g. minimal comms] (ii) walkers IDs are unique. If it helps -- and I expect it does -- i think walker IDs can be completely arbitrary numbers. Rationale: Postmortem analysis will be done by automation. Do we have any other actual requirements vs nice to have properties (e.g., sequential, increasing)?

@PDoakORNL
Copy link
Contributor Author

If we use UUID's (sorry I they are only called GUIDs in windows.) We would need no state, but without post processing it would be basically human unreadable.

@prckent
Copy link
Contributor

prckent commented May 29, 2024

Non-drive by comment, now that I have thought about this a bit more: since walkers are created at the mpi rank (vs crowd) level, we can count modulo num_ranks and offset the counters by the rank. The only state needed is the number of walkers created by each rank, and this never has to be communicated. This is very close to what is implemented here, but I don't understand the discussion of the "golden walker". What is meant by that? Seems unnecessary?

Two rank, initial four walker example:

rank 0: walkers 0,2
rank 1: walkers 1,3

rank 1 creates an extra walker:

rank 0: walkers 0,2
rank 1: walkers 1,3,5

rank 0 kills walker 2, and we load balance:

rank 0: walkers 0,5
rank 1: walkers 1,3

rank 0 creates an extra walker:

rank 0: walkers 0,5,4
rank 1: walkers 1,3

@ye-luo
Copy link
Contributor

ye-luo commented May 29, 2024

Here is my expectation of walker ID. It represents the location(rank, index within the rank) of walker in each step. It is not carried over steps. At initialization or after branching, walker ids are reassigned and the old value is kept as parent walker id. Both current and parent ids should allow us to reconstruct walker traces.

The only state needed is the number of walkers created by each rank, and this never has to be communicated.

I don't get why this is even needed in the current scheme. Since walkers are created and annihilated from step to step, I don't think number of walkers created by each rank really means anything.

If the intention is making the walker id never change after creation. I would like to see how branching and trace reconstruction is managed. I don't mean implementation but some description before adopting this idea.

Regarding golden walkers. Walkers have configurations (electron positions) and contexts (particleset, TWF, Ham).
During the initialization before hitting QMC drivers, a full context is built and later used to cloning. That is a called a golden walker. To be accurate, it should be called golden walker context. Only driver cloned contexts are used for calculation and thus the gold walker context is not affected by any drivers being created and destroyed.

  MCPopulation(int num_ranks,
               int this_rank,
               ParticleSet* elecs,
               TrialWaveFunction* trial_wf,
               QMCHamiltonian* hamiltonian_);

Treat elecs, trial_wf, hamiltonian as reference as they are owned by particleset, twf, ham pools.
For this reason, the golden walkers are not members of a population.

@prckent
Copy link
Contributor

prckent commented May 29, 2024

I think we need to discuss the envisioned use cases for this. What myself, Jaron, and others are after is indeed "making the walker ID never change after creation", so that we can e.g. follow the history of presumed "bad" walkers. With a walker's unique ID plus that of its parent (which it would have to store), we could do that. IDs would never be reused so the scripting would be easy. I believe this is what the legacy code attempts/attempted to do as well, not that it been carefully verified in the modern era.

Do you or anyone else have other uses in mind?

@ye-luo
Copy link
Contributor

ye-luo commented May 29, 2024

To follow the bad walker, you will need to "backtrace". If IDs are fully unique, you will need to search the parent id in the population at every step which is very bad. If following my ID scheme, ID is an index and there is no need of searching.

@prckent
Copy link
Contributor

prckent commented May 29, 2024

Maybe some crossed language here: once created, walkers have fixed ID independent of timestep. So if a run aborts due to population explosion (most common case), we just follow the low energy walker back to when it was moved to the bad position (assuming we were logging output). This is usually a walker with a single ID. Perhaps your index == what I was describing as a fixed ID? See my example a few comments back.

@ye-luo
Copy link
Contributor

ye-luo commented May 29, 2024

Maybe some crossed language here: once created, walkers have fixed ID independent of timestep. So if a run aborts due to population explosion (most common case), we just follow the low energy walker back to when it was moved to the bad position (assuming we were logging output). This is usually a walker with a single ID. Perhaps your index == what I was describing as a fixed ID? See my example a few comments back.

In my indexing way, the index is only valid for the current step, the parent index is valid for the previous step. Since it is already index, look up is O(1).

In your fixed ID way, the ID is unique but look up is painful. At step N, walker with id=5 goes bad and its parent id is X, we backtrace to the step N-1, it has 100 walkers, due to unsorted walker ID, need to search all the 100 walkers to find X in order to read info like coordinates/energy/weight. How about now we have 10^6 walkers instead of 100 ... Looking up takes O(N).

@ye-luo
Copy link
Contributor

ye-luo commented May 29, 2024

When using my index(parent index).
Step 0.
rank 0: walkers 0(0),2(2)
rank 1: walkers 1(1),3(3)
step 1.
rank 0: walkers 0(0),2(2)
rank 1: walkers 1(1),3(3),5(1) # both 1 and 5 are branched from walker 1 of step 0.
step 2.
rank 0: walkers 0(0),2(5) # walker 2 is killed and replaced by branching off walker 5 of step 1.
rank 1: walkers 1(3),3(5) # walker 1 of step 1 is killed and load balancing shuffled walkers.

@prckent Please also clarify what is parent walker ID in your scheme and how to handle killing/branching.

@prckent
Copy link
Contributor

prckent commented May 31, 2024

How about now we have 10^6 walkers instead of 100 ... Looking up takes O(N).

The only envisaged use I have for this facility is offline debugging and maybe occasional online debugging. There is zero cost to everyday simulations beyond moving a few extra bytes and one extra piece of state. See what I put in the requirements above. What are you envisaging using this for?

@ye-luo
Copy link
Contributor

ye-luo commented May 31, 2024

I'm thinking about a fast way of backtracing a bad walker and accessing infos like coordinates and energy on the evolution path in the analysis phase. This analysis is usually done offline.

Walker id is the only way of tracking walkers. The walker id from creation is not enough to backtrace a walker to step 0 due to branching and thus the parent walker id is also needed. That is why I'm wondering with your labeling scheme how to reconstruct a backtrace.

Particularly, I need this backtrace construction fast and not disk I/O and memory intensive. For a walker at step N with a parent walker id X, i need an efficient way to find X at step N-1 and read out coordinates and energy.

  1. preferably read only partial data from disk and work only with this partial data. Reading all the walkers of all the MPI ranks of the whole history from disk and storing in the memory is bad. Preferably data can be read per step or a step block.
  2. With the partially read data, looking up still needs to be fast. If there are a million record in a step block, a skip forward method should be used to locate X. Linear searching X is bad.

@jtkrogel
Copy link
Contributor

jtkrogel commented Jun 4, 2024

Indeed, I intend for both walker and parent ID's to be immutable and carried with a walker upon/following its creation.

Using hash tables in post-processing reduces search times to O(1) following an O(N) construction.

The walker tracing functionality (#5019) naturally organizes the walker data by rank (one file per rank), obviating the need to track it explicitly. As Ye and I have discussed offline, a slight generalization of the current functionality could optionally segment the outputted buffer data by MC block to further reduce the construction cost if data only from particular blocks is desired.

I definitely prefer using a simple strided indexing formula like the one currently in nextWalkerID() over anything more complicated or obscure.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 4, 2024

Here is the issue of immutable walker id. If walker id is based on the creation MPI rank not the actual rank were the walker locates, looking up the walker requires loading information of all the walkers of a given step. The walker tracing functionality (#5019) is already quite unfriendly to I/O due to its intensive amount of data being written. To construct backtrace, all the trace data from every MPI rank has to be read from disk and stored in memory although this can be improved by blocking as datasets in the trace files. The cost of creating hash map for the walker id and record location for each step is obviously not O(1).

@PDoakORNL
Copy link
Contributor Author

PDoakORNL commented Jun 4, 2024

Examining in more detail how the driver code actually sequences the transfers and copies I realized we can in fact do all the necessary walker spawning so that all walker ID's are rank local (see below for definition of that), parent ids will allow tracking the originating walker whether it be on or off rank. See the new walker control tests for how this works in more detail. Much more of the walker control is now actually unit tested as well. I'd still consider coverage to be pretty incomplete as the tests are easy cases and only single stage, but I do think they nail down the expected behavior more clearly than in the past.

It seems logical to me to have "parentless" walkers have parent id == 0 i.e. the null walker. These are ID's and not array indexes (pointer offsets).

From this it follows that no walkers ID should be 0 unless it is a blank not yet properly initialized. Something which happens in legacy code i.e. locations where walkers are constructed outside of MCPopulation. At least until we refactor WalkerConfiguration, Walker and Particle set there will still be walkers that go through a two stage creation where the invalid for a time.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 4, 2024

@PDoakORNL could you clarify "rank id's are local"? What is rank id you were referring?
My understanding is that walker id is defined in each step and thus meaningful for that step only. At a given step, each walker on every MPI rank has a unique id. parent id is for the previous step. This allows tracing a walker back to the previous step. Is that what you meant?

There are two ways of indexing walkers at a given step.

  1. contiguously indexing. rank 0 (walker 0, walker 1...) rank 1(walker 0, walker 1...)
  2. more or less the scheme used in the current PR. non contiguously indexing. gid = Np * local_index + RankID. NP is the number of MPI ranks, local_index is the index of a walker on its own MPI rank.

Method 1 eliminates the info about ranks. It is more flexible if rank info is not critical. It can also be more difficult if rank info is needed.
Method 2 requires to know the number of walkers on each rank but more friendly to handling data by ranks.
I'm open to both schemes defining walker IDs.

I don't quite follow you mention of ""parentless" walkers" and also have difficulty to understand your two stage comments.

Here is my understanding of the code.
WalkerConfiguration carries Walkers not ParticleSet. It carries data cross QMC drivers.
MCPopulation is a driver concept. It captures a WalkerConfiguration to start a walker population. With a population created, walker id makes sense. Once a QMC driver ends, the population is technically gone and walker id is not meaningful anymore.

so in a driver

create a population based on loaded WalkerConfiguration
assign initial walker id. parallel id is basically undefined.
loop step
   random walking
   branching.
   parent id = walker id. Reassign walker id
end loop step

@PDoakORNL
Copy link
Contributor Author

I mean that on a rank all walkers id's will be walker_id % (num_ranks) = rank + 1. I'll edit that to make it more clear.

I believe in practice walkers should not be getting moved more than necessary so the ID's should be presevered between steps unless the walker actually changed rank or unless it is a copy from another walker on the rank in which case it gets a new id. In practice I don't think a large % of walker die, are born, or transferred each step.

IDs are never reused and are not based on local index only on number of new walkers born on the rank. There are no gaps if you look at the sequence of particular ranks IDs with integer division by number of ranks.

IDs should be unique for a particular walker and there cannot be based on the local walker index

My understanding of mw_saveWalker ... mw_loadWalker was that it was not supposed to just treat the walkers and walker elements as just slots that could have a completely new identity every step as we serialize in and out of some insane untyped byte buffer.

But that rather a walker than had not been killed would be a continuation of a particular walker trajectory. Its hard to be certain since this whole save load thing moving state between walkers and particle sets is a little... 😭 But I have always intended that a indexes of walker elements will never be mixed. I.e. Walker[1], ParticleSet[1], TWF[1], Hamiltonian[1]

@ye-luo
Copy link
Contributor

ye-luo commented Jun 5, 2024

Here is an example of the bubble I was referring to.
When starting 3 walkers on rank 0, 1 and 2 walkers on rank 2, 3 for total request 10 walkers.
Rank 0 ids: 0, 4, 8
Rank 1 ids: 1, 5, 9
Rank 2 ids: 2, 6
Rank 3 ids: 3, 7
at this point there is no gap. Now rank 3 creates a walker during branching.
Rank 3 ids: 3, 7, 11
There is no walker with id 10.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 5, 2024

If the walker id is only intended to generate a unique id, then the scheme in this PR is ok. However, it cannot be used to be more informative indicating MPI rank or index. As long as killed/born/migrating do happen, such scenarios should be taken into account. I don't think how large % of walker affects our criteria for choosing a scheme for id.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 5, 2024

I believe in practice walkers should not be getting moved more than necessary so the ID's should be presevered between steps unless the walker actually changed rank or unless it is a copy from another walker on the rank in which case it gets a new id.

Now @PDoakORNL is saying walker id is mutable. @prckent @jtkrogel @PDoakORNL we need a discussion.

@PDoakORNL
Copy link
Contributor Author

I should also add that any time I'm think about producing massive amounts of data I consider how much work getting it into a relational database would be. That is after all "what you do" if working with data that won't fit in memory that you want to make structured queries against.

Id use tables like:
Walker

WalkerKey (Primary Key)
Block
Walker Id
Parent ID

Step

stepkey (primary key)
walker index
step
block
kinetic energy
potential energy
...

Conveniently if you limited your database to just one block per instance you could just user walker ID as the key to the first table and step+walkerID as the key second table.

To track a particular walker select all steps for a particular walker ID say (12). Then I'd select all walkers IDs with a parent ID (12). Then query each of those IDs and their steps. You would continue this process until you got no more walkerID's connected to your original ID. Merge all the resulting rows together you could plot a nice graph of the trajectory of some value with respect to the originating walker. If it every reached multiplicity high enough to get copied you would get multiple values per step otherwise you would just recover the trajectory even as it moved between ranks.

Conceptually you can accomplish the same thing with a scripting language just hammering on the output files but I'd rather have WalkerIDs that made processing those files into a DB easy. The issue of primary keys is basically the same as in memory hash.

@PDoakORNL
Copy link
Contributor Author

PDoakORNL commented Jun 5, 2024

Walker ID is not mutable. But it is not portable to another rank.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 5, 2024

Walker ID is not mutable. I walker that migrates is killed and born again.

Unfortunately killing a physical term not a technical term. Migranting a walker cannot be treated as kill and born.

@PDoakORNL
Copy link
Contributor Author

From my perspective there are no longer any gaps.

The walker IDs come from a N = num_ranks disjoint sequences from the positive integers with each value x = {0,..., num_ranks-1} determines a sequence of walker ids such that members are those integers such that w_id -> P_INT % num_rankss == x. Each rank uses the set of IDs such the x defining the set has x == rank. on any particular rank All ID's are used sequentially for active walkers. An active walker is one that will on the next step attempt a move. A walker being overwritten must have either been freshly constructed or have been however briefly in an inactive state and receives the next rank ID.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 6, 2024

In the above example I was talking about gaps between ranks not within a rank. ids spaced by num_ranks were not considered disjoint.

@PDoakORNL
Copy link
Contributor Author

In your example the next time rank two creates a walker it will be 10.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 6, 2024

In your example the next time rank two creates a walker it will be 10.

At the current step, there is no walker 10. If the simulation ends here, there is no walker 10.

@jtkrogel
Copy link
Contributor

jtkrogel commented Jun 7, 2024

Perhaps just post the pseudo-code and the ID's generated by 2-3 adjacent ranks?

:label: eq_walker_id
walker_id = walker_id = num_walkers_created_++ * num_ranks_ + rank_ + 1

where ``num_walkers_created_`` is a member variable of the sole ```MCPopulation`` object on the rank and initially set to 0. Each walkers ``parent_id`` is set at initiation of the walkers configuration to the walker from home the configuration is assigned. If that assignment is from previous section's or run's ``WalkerConfigurations`` object then the value of the ``Walker::getWalkerID()`` is multiplied by -1. If the Walker's initial configuration comes from the golden particle set the parent_id will be 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"set to 0" each driver run or once the whole lifetime?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If walker is not assigned from a walker config walker, a transfered walker, or an amplified walker, the parent_id is left as the default i.e. 0.

docs/developing.rst Show resolved Hide resolved
:label: eq_walker_id
walker_id = walker_id = num_walkers_created_++ * num_ranks_ + rank_ + 1

where ``num_walkers_created_`` is a member variable of the sole ```MCPopulation`` object on the rank and initially set to 0. Each walkers ``parent_id`` is set at initiation of the walkers configuration to the walker from home the configuration is assigned. If that assignment is from previous section's or run's ``WalkerConfigurations`` object then the value of the ``Walker::getWalkerID()`` is multiplied by -1. If the Walker's initial configuration comes from the golden particle set the parent_id will be 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "home"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above section seems explaining what happens when walkerIDs and parent IDs are initialized.
Could you document what happens during branching?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm added documentation of branching. I could not find other discussion of our branching algorithm in the manual. Am I just missing it?

*/
long walker_id_ = 0;
/** in legacy the ancients have said only:
* id reserved for forward walking
*
* In Batched
* If a walker is initialized from the golden particle set it keeps the default constructed 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are talking about lightweight walker here. Should not reference "golden particle set"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've cleaned these docs up now.

@@ -236,13 +208,15 @@ void WalkerControl::branch(int iter, MCPopulation& pop, bool do_not_branch)
throw std::runtime_error("Potential bug! Population num_global_walkers mismatched!");
#endif

// This defensive code implies that previous code left population walkers in invalid state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a defensive code.

Copy link
Contributor Author

@PDoakORNL PDoakORNL Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so by my reading through the walker life cycle, branching, and population control walkers must always have Multiplicty == 1.0 after amplification and/or transfer. Multiplicity is set in a small number of significant locations (except for here).

  1. On construction the default member initializer of Walker sets the initial value to 1.0
  2. During warmup = Multiplicity = 1.0
  3. To static_cast<int>(walker->Weight + rng_()) resulting in a integer value of >= 0
  4. After swapping where on rank multiplicity is set to the remaining number of copies on a rank.
  5. After swapping where a new walker has its multiplicity set to the number of copies received + 1 (number of copies received is counted from 0 due to lack of discrimination between counting numbers and indexing numbers).
  6. After amplificiation i.e. copyHighMultiplicityWalkers and killDeadWalkers the population will contain only walkers of Multiplicity == 1.
  7. Here where all living walkers have their multiplicity set to 1.

Several off these state transforms appear defensive in nature only, i.e. they can only cover up Walkers arriving to this function in an invalid state.

  1. Can only erase a state transformation to walker that left Multiplicity in an incorrect state. Fresh walkers have multiplicity == 1 walkers from walker configuration have multiplicity == 1. An assert should be added and this side of the branch is noop.
  2. If multiplicity != something is wrong, this simply covers it.

4, 5 are only necessary because of how we construct transfer jobs. Where basically walker have a multiplicity per destination for a short span of code. During this section Walker->Multiplicty is no long a simple value.

Multiplicity is a very important value for correct population control so reasoning about its update and validity should be as simple as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm clarifying the comments, I don't want to lose this insight but it can't really be cleaned up properly until we drop the legacy drivers.

// recv the number of copies from the target
myComm->comm.receive_n(&nsentcopy, 1, plus[ic]);
job_list.push_back(job(newW.size() - 1, plus[ic]));
if (plus[ic] != plus[ic + nsentcopy] || minus[ic] != minus[ic + nsentcopy])
throw std::runtime_error("WalkerControl::swapWalkersSimple send/recv pair checking failed!");
#ifdef MCWALKERSET_MPI_DEBUG
fout << "rank " << minus[ic] << " recvs a walker with " << nsentcopy << " copies from rank " << plus[ic]
std::cout << "rank " << minus[ic] << " recvs a walker with " << nsentcopy << " copies from rank " << plus[ic]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your temporary change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -462,11 +440,15 @@ void WalkerControl::swapWalkersSimple(MCPopulation& pop)
}
else
{
// Walker::copyFromBuffer copies the walker_index_ of the sent walker and the local walker does
// not have a local id created.
// the walker on this rank. But this replacement allows tracking this walker across the ranks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please improve the comment.

awalker.copyFromBuffer();
auto parent_id = awalker.getWalkerID();
awalker.setWalkerID(walker_id);
awalker.setParentID(parent_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain all the setting?
auto parent_id = awalker.getWalkerID(); why?
Are you sure the parent id also has been wired?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test and discovered the answer was kinda. It was for transfer but not for reading from WalkerConfigurations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have since discovered that it is also not written or read from walker hdf5 file. I'd rather address that in a future PR.

* 4. collect good, bad walkers
* 5. communicate walkers
* 6. unpack received walkers, apply walker count floor
* 7. call MCPopulation to amplify walkers with Multiplicity > 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this back to its original location. Users calling this API doesn't need to know the implementation detail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returned. I left some information for this call in the header.

* partial enum to access elements is the enum above dumped into WalkerControl's namespace
* includes many integral types converted to and from fp
* \todo MPI in the 21st centure allows user defined data types that would allow all this to benefit from actual
* C++ type safety and still can have performant collective operation efficiency.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MPI is C only. I don't think it knows C++ type safety. The derived type in MPI is intended for performance, not for safety. It is not useful in this particular case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looks like its intended so everyone doesn't do things like this constantly.

@PDoakORNL
Copy link
Contributor Author

PDoakORNL commented Jun 24, 2024

I'm working on another rebase and forced push. This has been sitting for a week and new PR's getting merged. I'm not sure why anymore, but after this rebase I will not be keeping it current unless there is actual interest in merging it. I won't be fixing ID propagation between runs and not just between sections until this PR goes in, it has enough in it, tests what it adds and adds tests and documentation that should have always been there.

@PDoakORNL
Copy link
Contributor Author

The osx failures are also all with nexus and do not appear to have anything to do with this PR.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 24, 2024

I'm working on another rebase and forced push.

If it is not showing conflict, there is no need to aggressively do rebasing. We rarely requires PRs being rebased before merging.

This has been sitting for a week I'm not sure why anymore.

Sorry I was busy and didn't pay attention to commits closely. Sometimes it is hard for me to tell if a submitter is still working on it or it is ready for review again. It is not the best use of maintainer time monitoring PR progress. If you need a specific reviewer to respond, better to ping if the PR stalls. For me specifically, feel free to @ me on github or slack me.

Unfortunately, we (submitters and reviewers) all have bandwidth limits and communication helps.

@ye-luo
Copy link
Contributor

ye-luo commented Jun 24, 2024

Just noticed that you sent a new review request. I'm sorry that I missed it.

@prckent
Copy link
Contributor

prckent commented Jun 24, 2024

OSX failures are all fixed and unrelated to this PR.

I would like to get this merged since it adds very useful functionality. It was also helpful to read e.g. how the multiplicities are handled.

@PDoakORNL
Copy link
Contributor Author

I'm never sure if I don't rebase if I'm going to end up with history wise. I guess I should do some research on this. In this case I don't have dependent branches on this PR so it probably makes no difference.

Copy link
Contributor Author

@PDoakORNL PDoakORNL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to explain every code change in WalkerControl.cpp to ease concerns that any real change has been made in the DMC population control. Should I back port the unit tests and/or right additional ones to prove the behavior is the same?

{
makeCopy(a);
walker_id_ = walker_id;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I make these assignments after makeCopy, putting them above means they get replace by a's walker_id and parent_id which is not the behavior needed or expected.

ScopedTimer copywalkers_timer(my_timers_[WC_copyWalkers]);
const size_t good_walkers = walkers.size();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is entirely manipulation and calls that should be private to MCPopulation so I refactored it into there. This also allows unit testing this with MCPopulation which makes sense since code can be tested without the need for the WalkerControl code.

auto unpackWalker = [](auto& awalker) {
auto walker_id = awalker.getWalkerID();
// Walker::copyFromBuffer overwrites the walker_id
awalker.copyFromBuffer();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lambda is just the repeated code for the synchronous and asynchronous receive below. Previously the walker_ids and parent_itds were just getting overwritten by the sent walker_id and parent_id. From the context of all ranks this resulted in duplicate id's all over which is not the behavior this PR is setting up. The copyFromBuffer from the current code is preserved here the new statements only effect parent_id and walker_id which are not used by the current code.

<< "for rank: " << rank_num_ << " total_multiplicity: " << TotalMultiplicity
<< " fair_offset_[rank_num_ + 1] - fair_offset_[rank_num_]: " << fair_offset_[rank_num_ + 1] << " - "
<< fair_offset_[rank_num_] << '\n';
throw std::runtime_error(error_msg.str());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a less useless error message. It's in an

#ifndef NDEBUG

section so this has no effect on production science

// keep good walker valid.
good_walker.Multiplicity -= 1.0;
num_copies--;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the code that was in WalkerControl

for (size_t iw = 0; iw < good_walkers; iw++)
    {
      size_t num_copies = static_cast<int>(walkers[iw]->Multiplicity);
      while (num_copies > 1)
      {
        auto walker_elements = pop.spawnWalker();
        // save this walkers ID
        // \todo revisit Walker assignment operator after legacy drivers removed.
        // but in the modern scheme walker IDs are permanent after creation, what walker they
        // were copied from is in ParentID.
        long save_id                    = walker_elements.walker.getWalkerID();
        walker_elements.walker          = *walkers[iw];
        walker_elements.walker.setParentID(walker_elements.walker.getWalkerID());
        walker_elements.walker.setWalkerID(save_id);
        num_copies--;
      }
    }

The significant diferrence here is that we continue to make the multiplicities valid rather than just ignore them until the end when we set them all 1.0 regardless of what happened to them. I'm of the opinion if you have a state variable its validity must be maintained through state changes not valid for some states and invalid in others. That's asking for bugs later.

@@ -91,18 +149,6 @@ void MCPopulation::createWalkers(IndexType num_walkers, const WalkerConfiguratio

outputManager.resume();

int num_walkers_created = 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to wait until here to have valid walker and parent id's

Copy link
Contributor

@ye-luo ye-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PDoakORNL Thanks for the additional comments. I'd like to make the following request of changes.

  1. Please revert the type change to Multiplicity. The purpose of this PR is to handle walker_id. It will be better to handle it separately. I also feel Multiplicity should be an unsigned it but all its use needs to be checked or preferably unit tested. So making a separate PR for that topics will be much better.
  2. Changes around walker.h and MCPopulations.h/cpp can be placed on a separate PR. They are much less exposed to the full scheme of id handling. reviewing and merging turnaround should be fast.
  3. Then we handle the last but most complicated bits in WalkerControl.

Sorry for the additional refactoring but that warranties stable progress.

*/
FullPrecRealType Multiplicity = 1.0;
int Multiplicity = 1.0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change and open an issue saying this can probably set to int. I would actually prefer an unsigned int. My quick grep 'Multiplicity =' seems indicating a more careful revisit can be needed. Changing type in this PR is too risky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

/** if true, this walker is either copied or tranferred from another MPI rank.
* significant because this walker will need a distance table recompute since we don't transfer them.
* So this is really a variable tracking the state of the ParticleSet associated with this walker.
* \todo this is a smell to be addressing in ParticleSet refactoring.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not just distance tables. Any data beyond coordinates needs to be recomputed for example Slater matrices, Jastrow factors.

@ye-luo
Copy link
Contributor

ye-luo commented Jul 25, 2024

superseded by #5063 and #5089

@ye-luo ye-luo closed this Jul 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants