Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spin density is incorrect in the batched driver port #4533

Closed
aannabe opened this issue Mar 28, 2023 · 19 comments · Fixed by #4552
Closed

Spin density is incorrect in the batched driver port #4533

aannabe opened this issue Mar 28, 2023 · 19 comments · Fixed by #4552
Assignees

Comments

@aannabe
Copy link
Contributor

aannabe commented Mar 28, 2023

Describe the bug
The integrated charge from the spin density is not correct when using the batched driver.

To Reproduce
An isolated Boron atom (N_up = 2, N_down = 1) was simulated in a large box using QE. The spin density was accumulated using the legacy and batched drivers. In the legacy case, N_up and N_down are coming out correctly. In the batched case, N_up is also correct, but N_down is incorrect. In this case, N_down values are identical to N_up, so N_up is erroneously accumulated for both spin channels. See the attached zip for inputs and partial output files. Any ideas?

See the below summary and plots:

LEGACY
===========

Up:
    Norm:   tot             = 2.0001063099999996
    Non-Cumulative Value of B Species at Cutoff 2.0 is: 0.2399443019469809

Down:
    Norm:   tot             = 1.0000710799999997
    Non-Cumulative Value of B Species at Cutoff 2.0 is: 0.12560081495173797

BATCHED
==========

Up:
    Norm:   tot             = 2.000220629999999
    Non-Cumulative Value of B Species at Cutoff 2.0 is: 0.2444495587812607

Down:
    Norm:   tot             = 2.000220629999999
    Non-Cumulative Value of B Species at Cutoff 2.0 is: 0.2444495587812607

Legacy Up:
legacy_up
Legacy Down:
legacy_down

Batched Up:
batched_up
Batched Down:
batched_down

Expected behavior
Both spin channels should integrate to correct N_e.

System:
NERSC/Perlmutter. Latest develop version. See attached files for further info.

spin_density_issue.zip

@jtkrogel
Copy link
Contributor

I don't see any stat.h5 files in the zip. What does e.g. h5ls give as the contents of the file and of the SpinDensity entry? I am most curious to see if both spin up and spin down components are present in the file itself.

@aannabe
Copy link
Contributor Author

aannabe commented Mar 29, 2023

I didn't include the stat.h5 files due to the large size. However all expected components are there:

h5ls -r vmc_J123.s000.stat.h5
/                        Group
/ElecElec                Group
/ElecElec/value          Dataset {10/Inf, 1}
/IonIon                  Group
/IonIon/value            Dataset {10/Inf, 1}
/Kinetic                 Group
/Kinetic/value           Dataset {10/Inf, 1}
/LocalECP                Group
/LocalECP/value          Dataset {10/Inf, 1}
/LocalEnergy             Group
/LocalEnergy/value       Dataset {10/Inf, 1}
/LocalEnergy_sq          Group
/LocalEnergy_sq/value    Dataset {10/Inf, 1}
/LocalPotential          Group
/LocalPotential/value    Dataset {10/Inf, 1}
/MPC                     Group
/MPC/value               Dataset {10/Inf, 1}
/NonLocalECP             Group
/NonLocalECP/value       Dataset {10/Inf, 1}
/SpinDensity             Group
/SpinDensity/d           Group
/SpinDensity/d/value     Dataset {10/Inf, 3375000}
/SpinDensity/u           Group
/SpinDensity/u/value     Dataset {10/Inf, 3375000}

@jtkrogel
Copy link
Contributor

jtkrogel commented Mar 29, 2023

Thanks. Just as a final sanity check can you verify that the data values inside the file are identical for up and down for the batched case? (this removes any possibility of issues with the intervening script)

@aannabe
Copy link
Contributor Author

aannabe commented Mar 29, 2023

I compared the values of a specific, random point in space with respect to the block number. The values are identical.

Up:
2.9296875e-05,1.953125e-05,1.953125e-05,3.90625e-05,4.8828125e-05,6.8359375e-05,5.859375e-05,2.9296875e-05,9.765625e-05,6.8359375e-05
up

Down:
2.9296875e-05,1.953125e-05,1.953125e-05,3.90625e-05,4.8828125e-05,6.8359375e-05,5.859375e-05,2.9296875e-05,9.765625e-05,6.8359375e-05
down

@jtkrogel
Copy link
Contributor

OK, that's pretty convincing. Clearly there is an odd plumbing or evaluation issue in QMCPACK.

@PDoakORNL
Copy link
Contributor

I've expanded the the SpinDensityNew estimator unit tests and the problem does not appear to be with it.

@aannabe
Copy link
Contributor Author

aannabe commented Mar 30, 2023

Are you able to reproduce the problem? Where could I look (or what tests can I run) to isolate the source of the issue?

@jtkrogel
Copy link
Contributor

@PDoakORNL Perhaps the issue is with the internal accumulation or similar?

@aannabe Perhaps put all the files (wavefunction, etc) in a shared space at OLCF? This way others can rerun simply to reproduce.

@jtkrogel
Copy link
Contributor

The current spin density integration tests won't catch this one because those tests have identical up/down electron counts.

@PDoakORNL
Copy link
Contributor

No I have not reproduced it yet. The problem does not appear to be at the estimator level, I'll PR the unit test with the bugfix right now it just reveals that the estimator has no problem accumulating unequal numbers of up and down electrons. But there is a lot of plumbing between the actual particle movses and between there and the hdf5 write. I'm getting q-e working right now to produce some wavefunctions.

@aannabe
Copy link
Contributor Author

aannabe commented Mar 30, 2023

Thanks. I transferred my files to OLCF in case they would be needed:
/gpfs/alpine/mat151/proj-shared/qmcpack_issue_4533

Please let me know if there are problems with permissions etc.

@camelto2
Copy link
Contributor

camelto2 commented Apr 4, 2023

I have been taking a look recently at some of the estimators, so I decided to take a stab at this. I found the bug, which is a simple one line fix. I don't have any unit tests written up, but I can PR the fix unless @PDoakORNL already has a PR ready

@PDoakORNL
Copy link
Contributor

I do have a PR ready, I have new unit testing to cover this issue.

@PDoakORNL
Copy link
Contributor

Unfortunately I'm still debugging the integration test with hdfarchive which has different requirements for successfully reading than I would expect and segfaults. Easy to say this was caused by me getting away with inadequate testing. And I have no idea yet what is wrong with hdfarchive if actually anything.

@camelto2
Copy link
Contributor

@aannabe how quickly do you need this resolved? I recall you saying you have some jobs to submit very soon that need the spin density working?

If this is time critical, there is a simple one line fix to correct the bug that can used as a quick fix for now while @PDoakORNL works on changing the estimators and removing ObservableHelper from Estimators.

The problem with the current code is that

oh.set_dimensions(ng, 0);

should be
https://github.com/camelto2/qmcpack/blob/8a0a50401a7830d36cc047f8b769c0a851cf8117/src/Estimators/SpinDensityNew.cpp#L230

This would allow @aannabe to run now while the estimators get cleaned up in #4540

@aannabe
Copy link
Contributor Author

aannabe commented Apr 12, 2023

@camelto2 Thanks, I will use this fix to run my calculations for now. However, in my opinion, it might be better to get this one-liner in the develop branch so that others don't encounter the same problem. We can then close this issue and create a new one around removing ObservableHelper.

@jtkrogel
Copy link
Contributor

I agree. A one-line fix is always the right size. I would recommend someone just PR the one-liner; the rest can be prioritized against other needs.

@PDoakORNL
Copy link
Contributor

PDoakORNL commented Apr 12, 2023

@camelto2 if you want to make a PR with that one liner I will approve. Test coverage here is poor so I've gotten a bit bogged down in the integration testing between the estimators and hdf5. None of that output was validated at all except through application tests.

If I didn't have a second PR coming I think its important to note that the one line fix with no accompanying unit test to catch regression or prove its correctness is insufficient.

@camelto2
Copy link
Contributor

If I didn't have a second PR coming I think its important to note that the one line fix with no accompanying unit test to catch regression or prove its correctness is insufficient.

completely agree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants