Useful walker ids #5010

PDoakORNL · 2024-05-28T21:11:22Z

Proposed changes

Walker ID's in batched version of the drivers should now supply trackable walker ID's

Practically this looks like after a load balance is done:
44: Test command: /usr/local/bin/mpiexec "-n" "4" "--oversubscribe" "/Users/Shared/ornldev/code/qmcpack/build_new/src/QMCDrivers/tests/test_new_drivers_mpi"

from the "MPI WalkerControl population swap walkers" test case

Section ("Load Balance")
44: count_before: { 4, 1, 1, 1, } count_after: { 1, 2, 2, 2, }
44: Walkers Per Rank (Total: 7)
44: rank: 0 walker ids: { 1, } parent ids: { 0, }
44: rank: 1 walker ids: { 2, 6, } parent ids: { 0, 1, }
44: rank: 3 walker ids: { 4, 8, } parent ids: { 0, 1, }
44: rank: 2 walker ids: { 3, 7, } parent ids: { 0, 1, }

Section("Load Balance Multiple Copy Optimization")
44: count_before: { 12, 1, 1, 1, } count_after: { 3, 4, 4, 4, }
44: Walkers Per Rank (Total: 15)
44: rank: 0 walker ids: { 1, 5, 9, } parent ids: { 0, 1, 1, }
44: rank: 1 walker ids: { 2, 6, 10, 14, } parent ids: { 0, 1, 6, 6, }
44: rank: 2 walker ids: { 3, 7, 11, 15, } parent ids: { 0, 1, 7, 7, }
44: rank: 3 walker ids: { 4, 8, 12, 16, } parent ids: { 0, 1, 8, 8, }

What type(s) of changes does this code introduce?

New feature
Testing changes (e.g. new unit/integration/performance tests)
Documentation or build script changes

Does this introduce a breaking change?

No

What systems has this change been tested on?

OSX laptop

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

Yes This PR is up to date with current the current state of 'develop'
Yes Code added or changed in the PR has been clang-formatted
Yes This PR adds tests to cover any new code, or to catch a bug that is being fixed
Yes Documentation has been added (if appropriate)

ye-luo

Did my first pass.

src/QMCDrivers/MCPopulation.h

src/QMCDrivers/DMC/WalkerControl.cpp

src/QMCDrivers/MCPopulation.cpp

prckent · 2024-05-28T22:31:38Z

Drive-by comment: I think there are only two requirements to satisfy here (i) scheme is performant [e.g. minimal comms] (ii) walkers IDs are unique. If it helps -- and I expect it does -- i think walker IDs can be completely arbitrary numbers. Rationale: Postmortem analysis will be done by automation. Do we have any other actual requirements vs nice to have properties (e.g., sequential, increasing)?

PDoakORNL · 2024-05-29T15:23:16Z

If we use UUID's (sorry I they are only called GUIDs in windows.) We would need no state, but without post processing it would be basically human unreadable.

prckent · 2024-05-29T15:45:56Z

Non-drive by comment, now that I have thought about this a bit more: since walkers are created at the mpi rank (vs crowd) level, we can count modulo num_ranks and offset the counters by the rank. The only state needed is the number of walkers created by each rank, and this never has to be communicated. This is very close to what is implemented here, but I don't understand the discussion of the "golden walker". What is meant by that? Seems unnecessary?

Two rank, initial four walker example:

rank 0: walkers 0,2
rank 1: walkers 1,3

rank 1 creates an extra walker:

rank 0: walkers 0,2
rank 1: walkers 1,3,5

rank 0 kills walker 2, and we load balance:

rank 0: walkers 0,5
rank 1: walkers 1,3

rank 0 creates an extra walker:

rank 0: walkers 0,5,4
rank 1: walkers 1,3

ye-luo · 2024-05-29T17:17:48Z

Here is my expectation of walker ID. It represents the location(rank, index within the rank) of walker in each step. It is not carried over steps. At initialization or after branching, walker ids are reassigned and the old value is kept as parent walker id. Both current and parent ids should allow us to reconstruct walker traces.

The only state needed is the number of walkers created by each rank, and this never has to be communicated.

I don't get why this is even needed in the current scheme. Since walkers are created and annihilated from step to step, I don't think number of walkers created by each rank really means anything.

If the intention is making the walker id never change after creation. I would like to see how branching and trace reconstruction is managed. I don't mean implementation but some description before adopting this idea.

Regarding golden walkers. Walkers have configurations (electron positions) and contexts (particleset, TWF, Ham).
During the initialization before hitting QMC drivers, a full context is built and later used to cloning. That is a called a golden walker. To be accurate, it should be called golden walker context. Only driver cloned contexts are used for calculation and thus the gold walker context is not affected by any drivers being created and destroyed.

  MCPopulation(int num_ranks,
               int this_rank,
               ParticleSet* elecs,
               TrialWaveFunction* trial_wf,
               QMCHamiltonian* hamiltonian_);

Treat elecs, trial_wf, hamiltonian as reference as they are owned by particleset, twf, ham pools.
For this reason, the golden walkers are not members of a population.

prckent · 2024-05-29T18:50:05Z

I think we need to discuss the envisioned use cases for this. What myself, Jaron, and others are after is indeed "making the walker ID never change after creation", so that we can e.g. follow the history of presumed "bad" walkers. With a walker's unique ID plus that of its parent (which it would have to store), we could do that. IDs would never be reused so the scripting would be easy. I believe this is what the legacy code attempts/attempted to do as well, not that it been carefully verified in the modern era.

Do you or anyone else have other uses in mind?

ye-luo · 2024-05-29T18:58:09Z

To follow the bad walker, you will need to "backtrace". If IDs are fully unique, you will need to search the parent id in the population at every step which is very bad. If following my ID scheme, ID is an index and there is no need of searching.

prckent · 2024-05-29T20:23:46Z

Maybe some crossed language here: once created, walkers have fixed ID independent of timestep. So if a run aborts due to population explosion (most common case), we just follow the low energy walker back to when it was moved to the bad position (assuming we were logging output). This is usually a walker with a single ID. Perhaps your index == what I was describing as a fixed ID? See my example a few comments back.

ye-luo · 2024-05-29T20:41:54Z

Maybe some crossed language here: once created, walkers have fixed ID independent of timestep. So if a run aborts due to population explosion (most common case), we just follow the low energy walker back to when it was moved to the bad position (assuming we were logging output). This is usually a walker with a single ID. Perhaps your index == what I was describing as a fixed ID? See my example a few comments back.

In my indexing way, the index is only valid for the current step, the parent index is valid for the previous step. Since it is already index, look up is O(1).

In your fixed ID way, the ID is unique but look up is painful. At step N, walker with id=5 goes bad and its parent id is X, we backtrace to the step N-1, it has 100 walkers, due to unsorted walker ID, need to search all the 100 walkers to find X in order to read info like coordinates/energy/weight. How about now we have 10^6 walkers instead of 100 ... Looking up takes O(N).

ye-luo · 2024-05-29T21:15:34Z

When using my index(parent index).
Step 0.
rank 0: walkers 0(0),2(2)
rank 1: walkers 1(1),3(3)
step 1.
rank 0: walkers 0(0),2(2)
rank 1: walkers 1(1),3(3),5(1) # both 1 and 5 are branched from walker 1 of step 0.
step 2.
rank 0: walkers 0(0),2(5) # walker 2 is killed and replaced by branching off walker 5 of step 1.
rank 1: walkers 1(3),3(5) # walker 1 of step 1 is killed and load balancing shuffled walkers.

@prckent Please also clarify what is parent walker ID in your scheme and how to handle killing/branching.

prckent · 2024-05-31T18:21:54Z

How about now we have 10^6 walkers instead of 100 ... Looking up takes O(N).

The only envisaged use I have for this facility is offline debugging and maybe occasional online debugging. There is zero cost to everyday simulations beyond moving a few extra bytes and one extra piece of state. See what I put in the requirements above. What are you envisaging using this for?

ye-luo · 2024-05-31T18:56:36Z

I'm thinking about a fast way of backtracing a bad walker and accessing infos like coordinates and energy on the evolution path in the analysis phase. This analysis is usually done offline.

Walker id is the only way of tracking walkers. The walker id from creation is not enough to backtrace a walker to step 0 due to branching and thus the parent walker id is also needed. That is why I'm wondering with your labeling scheme how to reconstruct a backtrace.

Particularly, I need this backtrace construction fast and not disk I/O and memory intensive. For a walker at step N with a parent walker id X, i need an efficient way to find X at step N-1 and read out coordinates and energy.

preferably read only partial data from disk and work only with this partial data. Reading all the walkers of all the MPI ranks of the whole history from disk and storing in the memory is bad. Preferably data can be read per step or a step block.
With the partially read data, looking up still needs to be fast. If there are a million record in a step block, a skip forward method should be used to locate X. Linear searching X is bad.

jtkrogel · 2024-06-04T06:35:31Z

Indeed, I intend for both walker and parent ID's to be immutable and carried with a walker upon/following its creation.

Using hash tables in post-processing reduces search times to O(1) following an O(N) construction.

The walker tracing functionality (#5019) naturally organizes the walker data by rank (one file per rank), obviating the need to track it explicitly. As Ye and I have discussed offline, a slight generalization of the current functionality could optionally segment the outputted buffer data by MC block to further reduce the construction cost if data only from particular blocks is desired.

I definitely prefer using a simple strided indexing formula like the one currently in nextWalkerID() over anything more complicated or obscure.

ye-luo · 2024-06-04T14:42:42Z

Here is the issue of immutable walker id. If walker id is based on the creation MPI rank not the actual rank were the walker locates, looking up the walker requires loading information of all the walkers of a given step. The walker tracing functionality (#5019) is already quite unfriendly to I/O due to its intensive amount of data being written. To construct backtrace, all the trace data from every MPI rank has to be read from disk and stored in memory although this can be improved by blocking as datasets in the trace files. The cost of creating hash map for the walker id and record location for each step is obviously not O(1).

PDoakORNL · 2024-06-04T21:57:05Z

Examining in more detail how the driver code actually sequences the transfers and copies I realized we can in fact do all the necessary walker spawning so that all walker ID's are rank local (see below for definition of that), parent ids will allow tracking the originating walker whether it be on or off rank. See the new walker control tests for how this works in more detail. Much more of the walker control is now actually unit tested as well. I'd still consider coverage to be pretty incomplete as the tests are easy cases and only single stage, but I do think they nail down the expected behavior more clearly than in the past.

It seems logical to me to have "parentless" walkers have parent id == 0 i.e. the null walker. These are ID's and not array indexes (pointer offsets).

From this it follows that no walkers ID should be 0 unless it is a blank not yet properly initialized. Something which happens in legacy code i.e. locations where walkers are constructed outside of MCPopulation. At least until we refactor WalkerConfiguration, Walker and Particle set there will still be walkers that go through a two stage creation where the invalid for a time.

ye-luo · 2024-06-04T23:23:53Z

@PDoakORNL could you clarify "rank id's are local"? What is rank id you were referring?
My understanding is that walker id is defined in each step and thus meaningful for that step only. At a given step, each walker on every MPI rank has a unique id. parent id is for the previous step. This allows tracing a walker back to the previous step. Is that what you meant?

There are two ways of indexing walkers at a given step.

contiguously indexing. rank 0 (walker 0, walker 1...) rank 1(walker 0, walker 1...)
more or less the scheme used in the current PR. non contiguously indexing. gid = Np * local_index + RankID. NP is the number of MPI ranks, local_index is the index of a walker on its own MPI rank.

Method 1 eliminates the info about ranks. It is more flexible if rank info is not critical. It can also be more difficult if rank info is needed.
Method 2 requires to know the number of walkers on each rank but more friendly to handling data by ranks.
I'm open to both schemes defining walker IDs.

I don't quite follow you mention of ""parentless" walkers" and also have difficulty to understand your two stage comments.

Here is my understanding of the code.
WalkerConfiguration carries Walkers not ParticleSet. It carries data cross QMC drivers.
MCPopulation is a driver concept. It captures a WalkerConfiguration to start a walker population. With a population created, walker id makes sense. Once a QMC driver ends, the population is technically gone and walker id is not meaningful anymore.

so in a driver

create a population based on loaded WalkerConfiguration
assign initial walker id. parallel id is basically undefined.
loop step
   random walking
   branching.
   parent id = walker id. Reassign walker id
end loop step

PDoakORNL · 2024-06-05T00:02:58Z

I mean that on a rank all walkers id's will be walker_id % (num_ranks) = rank + 1. I'll edit that to make it more clear.

I believe in practice walkers should not be getting moved more than necessary so the ID's should be presevered between steps unless the walker actually changed rank or unless it is a copy from another walker on the rank in which case it gets a new id. In practice I don't think a large % of walker die, are born, or transferred each step.

IDs are never reused and are not based on local index only on number of new walkers born on the rank. There are no gaps if you look at the sequence of particular ranks IDs with integer division by number of ranks.

IDs should be unique for a particular walker and there cannot be based on the local walker index

My understanding of mw_saveWalker ... mw_loadWalker was that it was not supposed to just treat the walkers and walker elements as just slots that could have a completely new identity every step as we serialize in and out of some insane untyped byte buffer.

But that rather a walker than had not been killed would be a continuation of a particular walker trajectory. Its hard to be certain since this whole save load thing moving state between walkers and particle sets is a little... 😭 But I have always intended that a indexes of walker elements will never be mixed. I.e. Walker[1], ParticleSet[1], TWF[1], Hamiltonian[1]

ye-luo · 2024-06-05T00:20:45Z

Here is an example of the bubble I was referring to.
When starting 3 walkers on rank 0, 1 and 2 walkers on rank 2, 3 for total request 10 walkers.
Rank 0 ids: 0, 4, 8
Rank 1 ids: 1, 5, 9
Rank 2 ids: 2, 6
Rank 3 ids: 3, 7
at this point there is no gap. Now rank 3 creates a walker during branching.
Rank 3 ids: 3, 7, 11
There is no walker with id 10.

ye-luo · 2024-06-05T00:28:11Z

If the walker id is only intended to generate a unique id, then the scheme in this PR is ok. However, it cannot be used to be more informative indicating MPI rank or index. As long as killed/born/migrating do happen, such scenarios should be taken into account. I don't think how large % of walker affects our criteria for choosing a scheme for id.

ye-luo · 2024-06-05T00:35:15Z

I believe in practice walkers should not be getting moved more than necessary so the ID's should be presevered between steps unless the walker actually changed rank or unless it is a copy from another walker on the rank in which case it gets a new id.

Now @PDoakORNL is saying walker id is mutable. @prckent @jtkrogel @PDoakORNL we need a discussion.

PDoakORNL · 2024-06-05T00:51:47Z

I should also add that any time I'm think about producing massive amounts of data I consider how much work getting it into a relational database would be. That is after all "what you do" if working with data that won't fit in memory that you want to make structured queries against.

Id use tables like:
Walker

WalkerKey (Primary Key)
Block
Walker Id
Parent ID

Step

stepkey (primary key)
walker index
step
block
kinetic energy
potential energy
...

Conveniently if you limited your database to just one block per instance you could just user walker ID as the key to the first table and step+walkerID as the key second table.

To track a particular walker select all steps for a particular walker ID say (12). Then I'd select all walkers IDs with a parent ID (12). Then query each of those IDs and their steps. You would continue this process until you got no more walkerID's connected to your original ID. Merge all the resulting rows together you could plot a nice graph of the trajectory of some value with respect to the originating walker. If it every reached multiplicity high enough to get copied you would get multiple values per step otherwise you would just recover the trajectory even as it moved between ranks.

Conceptually you can accomplish the same thing with a scripting language just hammering on the output files but I'd rather have WalkerIDs that made processing those files into a DB easy. The issue of primary keys is basically the same as in memory hash.

PDoakORNL · 2024-06-05T00:53:09Z

Walker ID is not mutable. But it is not portable to another rank.

ye-luo · 2024-06-05T01:02:57Z

Walker ID is not mutable. I walker that migrates is killed and born again.

Unfortunately killing a physical term not a technical term. Migranting a walker cannot be treated as kill and born.

PDoakORNL · 2024-06-06T18:31:10Z

From my perspective there are no longer any gaps.

The walker IDs come from a N = num_ranks disjoint sequences from the positive integers with each value x = {0,..., num_ranks-1} determines a sequence of walker ids such that members are those integers such that w_id -> P_INT % num_rankss == x. Each rank uses the set of IDs such the x defining the set has x == rank. on any particular rank All ID's are used sequentially for active walkers. An active walker is one that will on the next step attempt a move. A walker being overwritten must have either been freshly constructed or have been however briefly in an inactive state and receives the next rank ID.

ye-luo · 2024-06-06T18:50:48Z

In the above example I was talking about gaps between ranks not within a rank. ids spaced by num_ranks were not considered disjoint.

PDoakORNL · 2024-06-06T18:55:34Z

In your example the next time rank two creates a walker it will be 10.

ye-luo · 2024-06-06T18:59:09Z

In your example the next time rank two creates a walker it will be 10.

At the current step, there is no walker 10. If the simulation ends here, there is no walker 10.

jtkrogel · 2024-06-07T13:33:27Z

Perhaps just post the pseudo-code and the ID's generated by 2-3 adjacent ranks?

ye-luo · 2024-06-11T16:31:10Z

docs/developing.rst

+  :label: eq_walker_id
+  walker_id = walker_id = num_walkers_created_++ * num_ranks_ + rank_ + 1
+
+where ``num_walkers_created_`` is a member variable of the sole ```MCPopulation`` object on the rank and initially set to 0.  Each walkers ``parent_id`` is set at initiation of the walkers configuration to the walker from home the configuration is assigned.  If that assignment is from previous section's or run's ``WalkerConfigurations`` object then the value of the ``Walker::getWalkerID()`` is multiplied by -1. If the Walker's initial configuration comes from the golden particle set the parent_id will be 0.


"set to 0" each driver run or once the whole lifetime?

If walker is not assigned from a walker config walker, a transfered walker, or an amplified walker, the parent_id is left as the default i.e. 0.

docs/developing.rst

ye-luo · 2024-06-11T16:32:10Z

docs/developing.rst

+  :label: eq_walker_id
+  walker_id = walker_id = num_walkers_created_++ * num_ranks_ + rank_ + 1
+
+where ``num_walkers_created_`` is a member variable of the sole ```MCPopulation`` object on the rank and initially set to 0.  Each walkers ``parent_id`` is set at initiation of the walkers configuration to the walker from home the configuration is assigned.  If that assignment is from previous section's or run's ``WalkerConfigurations`` object then the value of the ``Walker::getWalkerID()`` is multiplied by -1. If the Walker's initial configuration comes from the golden particle set the parent_id will be 0.


What is "home"?

The above section seems explaining what happens when walkerIDs and parent IDs are initialized.
Could you document what happens during branching?

I'm added documentation of branching. I could not find other discussion of our branching algorithm in the manual. Am I just missing it?

ye-luo · 2024-06-11T16:34:43Z

src/Particle/Walker.h

   */
  long walker_id_ = 0;
  /** in legacy the ancients have said only:
   *  id reserved for forward walking
+   *
+   *  In Batched
+   *  If a walker is initialized from the golden particle set it keeps the default constructed 0


We are talking about lightweight walker here. Should not reference "golden particle set"

I've cleaned these docs up now.

ye-luo · 2024-06-11T16:41:01Z

src/QMCDrivers/DMC/WalkerControl.cpp

@@ -236,13 +208,15 @@ void WalkerControl::branch(int iter, MCPopulation& pop, bool do_not_branch)
    throw std::runtime_error("Potential bug! Population num_global_walkers mismatched!");
 #endif

+  // This defensive code implies that previous code left population walkers in invalid state.


This is not a defensive code.

Ok so by my reading through the walker life cycle, branching, and population control walkers must always have Multiplicty == 1.0 after amplification and/or transfer. Multiplicity is set in a small number of significant locations (except for here).

On construction the default member initializer of Walker sets the initial value to 1.0

During warmup = Multiplicity = 1.0

To static_cast<int>(walker->Weight + rng_()) resulting in a integer value of >= 0

After swapping where on rank multiplicity is set to the remaining number of copies on a rank.

After swapping where a new walker has its multiplicity set to the number of copies received + 1 (number of copies received is counted from 0 due to lack of discrimination between counting numbers and indexing numbers).

After amplificiation i.e. copyHighMultiplicityWalkers and killDeadWalkers the population will contain only walkers of Multiplicity == 1.

Here where all living walkers have their multiplicity set to 1.

Several off these state transforms appear defensive in nature only, i.e. they can only cover up Walkers arriving to this function in an invalid state.

Can only erase a state transformation to walker that left Multiplicity in an incorrect state. Fresh walkers have multiplicity == 1 walkers from walker configuration have multiplicity == 1. An assert should be added and this side of the branch is noop.

If multiplicity != something is wrong, this simply covers it.

4, 5 are only necessary because of how we construct transfer jobs. Where basically walker have a multiplicity per destination for a short span of code. During this section Walker->Multiplicty is no long a simple value.

Multiplicity is a very important value for correct population control so reasoning about its update and validity should be as simple as possible.

I'm clarifying the comments, I don't want to lose this insight but it can't really be cleaned up properly until we drop the legacy drivers.

ye-luo · 2024-06-11T16:43:00Z

src/QMCDrivers/DMC/WalkerControl.cpp

      // recv the number of copies from the target
      myComm->comm.receive_n(&nsentcopy, 1, plus[ic]);
      job_list.push_back(job(newW.size() - 1, plus[ic]));
      if (plus[ic] != plus[ic + nsentcopy] || minus[ic] != minus[ic + nsentcopy])
        throw std::runtime_error("WalkerControl::swapWalkersSimple send/recv pair checking failed!");
 #ifdef MCWALKERSET_MPI_DEBUG
-      fout << "rank " << minus[ic] << " recvs a walker with " << nsentcopy << " copies from rank " << plus[ic]
+      std::cout << "rank " << minus[ic] << " recvs a walker with " << nsentcopy << " copies from rank " << plus[ic]


Your temporary change?

ye-luo · 2024-06-11T16:44:44Z

src/QMCDrivers/DMC/WalkerControl.cpp

@@ -462,11 +440,15 @@ void WalkerControl::swapWalkersSimple(MCPopulation& pop)
  }
  else
  {
+    // Walker::copyFromBuffer copies the walker_index_ of the sent walker and the local walker does
+    // not have a local id created.
+    // the walker on this rank.  But this replacement allows tracking this walker across the ranks.


Please improve the comment.

ye-luo · 2024-06-11T16:48:37Z

src/QMCDrivers/DMC/WalkerControl.cpp

+	awalker.copyFromBuffer();
+	auto parent_id = awalker.getWalkerID();
+	awalker.setWalkerID(walker_id);
+	awalker.setParentID(parent_id);


Could you explain all the setting?
auto parent_id = awalker.getWalkerID(); why?
Are you sure the parent id also has been wired?

Added test and discovered the answer was kinda. It was for transfer but not for reading from WalkerConfigurations

I have since discovered that it is also not written or read from walker hdf5 file. I'd rather address that in a future PR.

ye-luo · 2024-06-11T16:50:22Z

src/QMCDrivers/DMC/WalkerControl.h

+   *  4. collect good, bad walkers
+   *  5. communicate walkers
+   *  6. unpack received walkers, apply walker count floor
+   *  7. call MCPopulation to amplify walkers with Multiplicity > 1


Please move this back to its original location. Users calling this API doesn't need to know the implementation detail.

returned. I left some information for this call in the header.

ye-luo · 2024-06-11T16:52:34Z

src/QMCDrivers/DMC/WalkerControl.h

+   *  partial enum to access elements is the enum above dumped into WalkerControl's namespace
+   *  includes many integral types converted to and from fp
+   *  \todo MPI in the 21st centure allows user defined data types that would allow all this to benefit from actual
+   *        C++ type safety and still can have performant collective operation efficiency.


MPI is C only. I don't think it knows C++ type safety. The derived type in MPI is intended for performance, not for safety. It is not useful in this particular case.

I looks like its intended so everyone doesn't do things like this constantly.

unfortunately this requires documenting branching and population control which are traditionally not well documented.

PDoakORNL · 2024-06-24T19:33:38Z

I'm working on another rebase and forced push. This has been sitting for a week and new PR's getting merged. I'm not sure why anymore, but after this rebase I will not be keeping it current unless there is actual interest in merging it. I won't be fixing ID propagation between runs and not just between sections until this PR goes in, it has enough in it, tests what it adds and adds tests and documentation that should have always been there.

PDoakORNL · 2024-06-24T19:50:13Z

The osx failures are also all with nexus and do not appear to have anything to do with this PR.

ye-luo · 2024-06-24T19:51:03Z

I'm working on another rebase and forced push.

If it is not showing conflict, there is no need to aggressively do rebasing. We rarely requires PRs being rebased before merging.

This has been sitting for a week I'm not sure why anymore.

Sorry I was busy and didn't pay attention to commits closely. Sometimes it is hard for me to tell if a submitter is still working on it or it is ready for review again. It is not the best use of maintainer time monitoring PR progress. If you need a specific reviewer to respond, better to ping if the PR stalls. For me specifically, feel free to @ me on github or slack me.

Unfortunately, we (submitters and reviewers) all have bandwidth limits and communication helps.

ye-luo · 2024-06-24T19:55:13Z

Just noticed that you sent a new review request. I'm sorry that I missed it.

prckent · 2024-06-24T20:10:21Z

OSX failures are all fixed and unrelated to this PR.

I would like to get this merged since it adds very useful functionality. It was also helpful to read e.g. how the multiplicities are handled.

PDoakORNL · 2024-06-25T13:10:45Z

I'm never sure if I don't rebase if I'm going to end up with history wise. I guess I should do some research on this. In this case I don't have dependent branches on this PR so it probably makes no difference.

PDoakORNL

I've tried to explain every code change in WalkerControl.cpp to ease concerns that any real change has been made in the DMC population control. Should I back port the unit tests and/or right additional ones to prove the behavior is the same?

PDoakORNL · 2024-06-25T18:23:22Z

src/Particle/Walker.h

  {
    makeCopy(a);
+    walker_id_ = walker_id;


Here I make these assignments after makeCopy, putting them above means they get replace by a's walker_id and parent_id which is not the behavior needed or expected.

PDoakORNL · 2024-06-25T18:25:18Z

src/QMCDrivers/DMC/WalkerControl.cpp

    ScopedTimer copywalkers_timer(my_timers_[WC_copyWalkers]);
-    const size_t good_walkers = walkers.size();


This is entirely manipulation and calls that should be private to MCPopulation so I refactored it into there. This also allows unit testing this with MCPopulation which makes sense since code can be tested without the need for the WalkerControl code.

PDoakORNL · 2024-06-25T18:30:56Z

src/QMCDrivers/DMC/WalkerControl.cpp

+    auto unpackWalker = [](auto& awalker) {
+      auto walker_id = awalker.getWalkerID();
+      // Walker::copyFromBuffer overwrites the walker_id
+      awalker.copyFromBuffer();


This lambda is just the repeated code for the synchronous and asynchronous receive below. Previously the walker_ids and parent_itds were just getting overwritten by the sent walker_id and parent_id. From the context of all ranks this resulted in duplicate id's all over which is not the behavior this PR is setting up. The copyFromBuffer from the current code is preserved here the new statements only effect parent_id and walker_id which are not used by the current code.

PDoakORNL · 2024-06-25T18:33:05Z

src/QMCDrivers/DMC/WalkerControl.cpp

+              << "for rank: " << rank_num_ << " total_multiplicity: " << TotalMultiplicity
+              << "  fair_offset_[rank_num_ + 1] - fair_offset_[rank_num_]: " << fair_offset_[rank_num_ + 1] << " - "
+              << fair_offset_[rank_num_] << '\n';
+    throw std::runtime_error(error_msg.str());


This is just a less useless error message. It's in an

#ifndef NDEBUG

section so this has no effect on production science

PDoakORNL · 2024-06-25T18:37:43Z

src/QMCDrivers/MCPopulation.cpp

+      // keep good walker valid.
+      good_walker.Multiplicity -= 1.0;
+      num_copies--;
+    }


This was the code that was in WalkerControl

for (size_t iw = 0; iw < good_walkers; iw++) { size_t num_copies = static_cast<int>(walkers[iw]->Multiplicity); while (num_copies > 1) { auto walker_elements = pop.spawnWalker(); // save this walkers ID // \todo revisit Walker assignment operator after legacy drivers removed. // but in the modern scheme walker IDs are permanent after creation, what walker they // were copied from is in ParentID. long save_id = walker_elements.walker.getWalkerID(); walker_elements.walker = *walkers[iw]; walker_elements.walker.setParentID(walker_elements.walker.getWalkerID()); walker_elements.walker.setWalkerID(save_id); num_copies--; } }

The significant diferrence here is that we continue to make the multiplicities valid rather than just ignore them until the end when we set them all 1.0 regardless of what happened to them. I'm of the opinion if you have a state variable its validity must be maintained through state changes not valid for some states and invalid in others. That's asking for bugs later.

PDoakORNL · 2024-06-25T18:41:57Z

src/QMCDrivers/MCPopulation.cpp

@@ -91,18 +149,6 @@ void MCPopulation::createWalkers(IndexType num_walkers, const WalkerConfiguratio

  outputManager.resume();

-  int num_walkers_created = 0;


We don't have to wait until here to have valid walker and parent id's

ye-luo

@PDoakORNL Thanks for the additional comments. I'd like to make the following request of changes.

Please revert the type change to Multiplicity. The purpose of this PR is to handle walker_id. It will be better to handle it separately. I also feel Multiplicity should be an unsigned it but all its use needs to be checked or preferably unit tested. So making a separate PR for that topics will be much better.
Changes around walker.h and MCPopulations.h/cpp can be placed on a separate PR. They are much less exposed to the full scheme of id handling. reviewing and merging turnaround should be fast.
Then we handle the last but most complicated bits in WalkerControl.

Sorry for the additional refactoring but that warranties stable progress.

ye-luo · 2024-06-24T21:29:35Z

src/Particle/Walker.h

   */
-  FullPrecRealType Multiplicity = 1.0;
+  int Multiplicity = 1.0;


Please revert this change and open an issue saying this can probably set to int. I would actually prefer an unsigned int. My quick grep 'Multiplicity =' seems indicating a more careful revisit can be needed. Changing type in this PR is too risky.

ye-luo · 2024-06-24T21:31:07Z

src/Particle/Walker.h

+  /** if true, this walker is either copied or tranferred from another MPI rank.
+   *  significant because this walker will need a distance table recompute since we don't transfer them.
+   *  So this is really a variable tracking the state of the ParticleSet associated with this walker.
+   *  \todo this is a smell to be addressing in ParticleSet refactoring.


It is not just distance tables. Any data beyond coordinates needs to be recomputed for example Slater matrices, Jastrow factors.

ye-luo · 2024-07-25T22:32:40Z

superseded by #5063 and #5089

ye-luo reviewed May 28, 2024

View reviewed changes

src/QMCDrivers/MCPopulation.h Outdated Show resolved Hide resolved

src/QMCDrivers/DMC/WalkerControl.cpp Show resolved Hide resolved

src/QMCDrivers/MCPopulation.cpp Show resolved Hide resolved

PDoakORNL force-pushed the useful_walker_id_2 branch from a458c1f to 613633e Compare June 7, 2024 14:45

ye-luo reviewed Jun 11, 2024

View reviewed changes

PDoakORNL added 16 commits June 24, 2024 15:19

restore essential integration test for WalkerControl

623a821

refactor expanding multiplicity > 1 walkers to MCPopulation

3f75a25

meaningful tests for walker control with arbitrary rank count

a58bf1c

in response to discussion some improvement of the walker ids

f2e92c1

additional stage of population adjustment test

e51d012

cleanup. multistage testing

e73fd7d

fix rebase damage

74edb42

safety commit before generalize refactor in test_WalkerControl

dd603ab

another safety commit primarily of documentation

ffeb330

documentation added to the manual, chen ERD of WalkerID vicinity

653d640

NULL and nullptr aren't the same thing its a typed language

0e9a883

Another draft on the way to documenting walker_id and parent_id

deeec87

unfortunately this requires documenting branching and population control which are traditionally not well documented.

more changes in response to review

1313086

walker were created from walker_confs but not the converse

92c2d9a

fixing up some commentary in code

cd9b687

more documentation improvement

3b2ad0d

PDoakORNL force-pushed the useful_walker_id_2 branch from b8092cc to 3b2ad0d Compare June 25, 2024 13:12

PDoakORNL commented Jun 25, 2024

View reviewed changes

ye-luo reviewed Jun 25, 2024

View reviewed changes

Multiplicity reverted to FullPrecRealType

efe2f93

PDoakORNL mentioned this pull request Jun 26, 2024

MCPopulation now gives walkers useful walker and parent ids #5063

Merged

PDoakORNL mentioned this pull request Jul 16, 2024

updates to WalkerControl to have useful walker_ids #5089

Merged

ye-luo closed this Jul 25, 2024

		ScopedTimer copywalkers_timer(my_timers_[WC_copyWalkers]);
		const size_t good_walkers = walkers.size();

		@@ -91,18 +149,6 @@ void MCPopulation::createWalkers(IndexType num_walkers, const WalkerConfiguratio

		outputManager.resume();

		int num_walkers_created = 0;

Useful walker ids #5010

Useful walker ids #5010

Conversation

PDoakORNL commented May 28, 2024 • edited Loading

Proposed changes

What type(s) of changes does this code introduce?

Does this introduce a breaking change?

What systems has this change been tested on?

Checklist

ye-luo left a comment

Choose a reason for hiding this comment

prckent commented May 28, 2024 • edited Loading

PDoakORNL commented May 29, 2024

prckent commented May 29, 2024

ye-luo commented May 29, 2024 • edited Loading

prckent commented May 29, 2024

ye-luo commented May 29, 2024

prckent commented May 29, 2024 • edited Loading

ye-luo commented May 29, 2024 • edited Loading

ye-luo commented May 29, 2024

prckent commented May 31, 2024 • edited by ye-luo Loading

ye-luo commented May 31, 2024 • edited Loading

jtkrogel commented Jun 4, 2024 • edited Loading

ye-luo commented Jun 4, 2024 • edited Loading

PDoakORNL commented Jun 4, 2024 • edited Loading

ye-luo commented Jun 4, 2024 • edited Loading

PDoakORNL commented Jun 5, 2024

ye-luo commented Jun 5, 2024

ye-luo commented Jun 5, 2024

ye-luo commented Jun 5, 2024

PDoakORNL commented Jun 5, 2024

Id use tables like: Walker

Step

PDoakORNL commented Jun 5, 2024 • edited Loading

ye-luo commented Jun 5, 2024

PDoakORNL commented Jun 6, 2024

ye-luo commented Jun 6, 2024

PDoakORNL commented Jun 6, 2024

ye-luo commented Jun 6, 2024 • edited Loading

jtkrogel commented Jun 7, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PDoakORNL Jun 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PDoakORNL commented Jun 24, 2024 • edited Loading

PDoakORNL commented Jun 24, 2024

ye-luo commented Jun 24, 2024 • edited Loading

ye-luo commented Jun 24, 2024

prckent commented Jun 24, 2024

PDoakORNL commented Jun 25, 2024

PDoakORNL left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ye-luo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PDoakORNL commented May 28, 2024 •

edited

Loading

prckent commented May 28, 2024 •

edited

Loading

ye-luo commented May 29, 2024 •

edited

Loading

prckent commented May 29, 2024 •

edited

Loading

ye-luo commented May 29, 2024 •

edited

Loading

prckent commented May 31, 2024 •

edited by ye-luo

Loading

ye-luo commented May 31, 2024 •

edited

Loading

jtkrogel commented Jun 4, 2024 •

edited

Loading

ye-luo commented Jun 4, 2024 •

edited

Loading

PDoakORNL commented Jun 4, 2024 •

edited

Loading

ye-luo commented Jun 4, 2024 •

edited

Loading

Id use tables like:
Walker

PDoakORNL commented Jun 5, 2024 •

edited

Loading

ye-luo commented Jun 6, 2024 •

edited

Loading

PDoakORNL Jun 12, 2024 •

edited

Loading

PDoakORNL commented Jun 24, 2024 •

edited

Loading

ye-luo commented Jun 24, 2024 •

edited

Loading