
Support for NC pseudos in UPF format #741

Merged: 78 commits, Oct 21, 2022

Conversation

@azadoks (Contributor) commented Oct 3, 2022

Based on @unkcpz's PR.

To do:

  • Non-linear core correction
  • Implement initial guesses
    • Charge density from atomic charge density (PP_RHO)
    • Wavefunction from pseudo-atomic wave functions (PP_CHI)
  • Improve documentation
    • Example script
    • Docstrings
  • Improve performance
    • Parallelize over G-vectors / k-points?
    • Calculate only for unique values of |G|?
    • Precompute common integrand terms
    • Create interpolators at parse time
  • Clean up unit handling
  • Settle on an integrator / integrator interface (Not needed)
  • Settle on local potential correction / correction interface (Only Coulomb)
  • Add / remove included UPF pseudopotentials (Kept a few for testing)
  • Register PseudoPotentialIO
  • Improve testing
    • Figure out why quadgk references used for HGH don't agree well with UPFs (Needed correction term)
    • Refine HGH comparison tolerances

@antoine-levitt (Member):

If the rab are already the integration weights, why do you do a quadrature (ctrap) rather than just a dot product?

@azadoks (Contributor, author) commented Oct 6, 2022

I wasn't sure if they account for the half-weight at zero. I did a bit of testing comparing the dot product against ctrap with the half-weight, and they seem to be in similar agreement with HGH.

I've changed to using a dot product; it's probably a good step for better performance, assuming it's the right thing to do.
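
To illustrate the equivalence being tested here, a minimal sketch (with hypothetical r, rab, and f; not code from the PR): if the rab weights already encode the trapezoidal rule, including the half-weights at the endpoints, the quadrature collapses to a dot product.

using LinearAlgebra

# Trapezoidal quadrature written out explicitly
trapezoid(r, f) = sum((f[i] + f[i+1]) / 2 * (r[i+1] - r[i]) for i in 1:length(r)-1)

r = range(0, 10; length=1001)  # hypothetical radial grid
f = exp.(-r) .* r.^2           # hypothetical integrand samples

# Trapezoid weights on a uniform grid: full step inside, half-steps at the ends
rab = fill(step(r), length(r))
rab[1] /= 2
rab[end] /= 2

@assert trapezoid(r, f) ≈ dot(rab, f)  # the dot product is the same quadrature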

@azadoks (Contributor, author) commented Oct 6, 2022

It looks like ~78% of the time spent initializing the PlaneWaveBasis (small Ecut, few k-points, UPF pseudos) is taken up by _besselj, which calls into the Amos library.

[profiling screenshot]

Allocations could still be improved:

function init_si(pseudo_type, n)
    a = 10.26  # Silicon lattice constant in Bohr
    lattice = a / 2 * [[0 1 1.];
                       [1 0 1.];
                       [1 1 0.]]
    Si = ElementPsp(:Si, psp=load_psp("$pseudo_type/pbe/si-q4.$pseudo_type"))
    atoms     = [Si, Si]
    positions = [ones(3)/8, -ones(3)/8]

    model = model_PBE(lattice, atoms, positions)
    for _ = 1:n-1
        PlaneWaveBasis(model; Ecut=15, kgrid=[4, 4, 4])
    end
    PlaneWaveBasis(model; Ecut=15, kgrid=[4, 4, 4])
end

basis = @btime init_si("upf", 1);     # benchmark
  6.753 s (26559047 allocations: 865.72 MiB)

Initializing the PlaneWaveBasis for, e.g., Ecut=60, kgrid=[12, 12, 12] takes quite a few minutes. Maybe we need to implement an approximate but optimized sphericalbesselj, like what is done in QE?
I'm waiting on the profiling + benchmarking for this setup, but I'm guessing that reducing allocations will also be a top priority to get this down to a reasonable time.

@antoine-levitt (Member):

If the profiling clearly shows the besselj function, working on allocations isn't going to do anything.
Note https://discourse.julialang.org/t/ann-bessels-jl-bessel-hankel-and-airy-functions/87941, but that's only a factor of 2. Note also that those packages target general (non-integer) order, and the integer-order Bessel functions are considerably simpler to compute: https://en.wikipedia.org/wiki/Bessel_function#Spherical_Bessel_functions:_jn,_yn . So at worst we could just implement them by hand for low orders (only up to l = 3 is needed, right?)
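
For reference, the closed forms from that Wikipedia page, hand-coded for the low orders needed here (valid for x != 0; the limits at zero are j_0(0) = 1 and j_l(0) = 0 for l > 0):

# Closed-form spherical Bessel functions j_l(x) for l = 0..3 and x != 0
sphericalbesselj0(x) = sin(x) / x
sphericalbesselj1(x) = sin(x) / x^2 - cos(x) / x
sphericalbesselj2(x) = (3 / x^2 - 1) * sin(x) / x - 3cos(x) / x^2
sphericalbesselj3(x) = (15 / x^3 - 6 / x) * sin(x) / x - (15 / x^2 - 1) * cos(x) / x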

@antoine-levitt (Member):

With judicious use of @inline and @simd, it should be possible to get the code to vectorize. If that still does not result in acceptable performance, we bring out the big guns (LoopVectorization)
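
As a sketch of what the @inline + @simd route could look like (hand-coded j_2 as above; an illustration, not the PR's code):

# Hand-coded j_2, inlined so the loop body can vectorize
@inline sphericalbesselj2(x) = (3 / x^2 - 1) * sin(x) / x - 3cos(x) / x^2

# SIMD-friendly accumulation of the radial integral
function compute_simd(q, rgrid, drgrid, rprojs)
    s = 0.0
    @inbounds @simd for ir in eachindex(rgrid)
        s += drgrid[ir] * sphericalbesselj2(q * rgrid[ir]) * rprojs[ir]
    end
    s
end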

@antoine-levitt (Member) commented Oct 6, 2022

Just did some quick tests integrating an array of size 100000 with l = 2:

  • Naive version (what you have, essentially): 15 ms
  • Hand-coded spherical Bessel of order 2: 1.5 ms
  • Hand-coded + @inline + @simd: 1 ms
  • Hand-coded + @turbo: 400 μs
  • Hand-coded + not building x but just summing up the result: 300 μs

So no need for complicated optimizations, it looks like :-) Here's the final version:

using LoopVectorization
using BenchmarkTools

# Hand-coded spherical Bessel function of order 2 (valid for x != 0)
@inline function sphericalbesselj2(x)
    (3 / x^2 - 1) * sin(x) / x - 3cos(x) / x^2
end

# Accumulate the radial integral directly instead of building an
# intermediate array (the "not building x" variant from the timings above)
function compute(q, rgrid, drgrid, rprojs)
    N = length(rgrid)
    s = 0.0
    @turbo for ir = 1:N
        s += drgrid[ir] * sphericalbesselj2(q * rgrid[ir]) * rprojs[ir]
    end
    s
end

N = 100000
rgrid  = rand(N)
drgrid = rand(N)
rprojs = rand(N)
q = 2.3
@btime compute($q, $rgrid, $drgrid, $rprojs)

@antoine-levitt (Member):

Sorry, there was a mistake in the previous post (if you read this via email notifications), updated now.

@azadoks (Contributor, author) commented Oct 6, 2022

Ok sweet, I did write up the hand-coded spherical Bessels, which was already much faster. I thought that LoopVectorization wasn't a dependency of DFTK, so I didn't give it a go.

I'll work a bit on implementing it like this with @turbo and @inline and check back in, thanks!

@azadoks (Contributor, author) commented Oct 6, 2022

# For Silicon, 2-atom cell, si-q4.upf
Ecut = 80
kgrid = [12, 12, 12]
julia> basis = @benchmark PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid)  # benchmark
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 32.746 s (0.65% GC) to evaluate,
 with a memory estimate of 1.41 GiB, over 3165129 allocations.

Much better!

[profiling screenshot]

@antoine-levitt (Member):

Still looks expensive, but combined with q interpolation it should be OK. You can avoid the code duplication by using Val types and function barriers: make the order a Val type, and convert from Int to Val in a function separate from the loop, so that the loop is compiled with static type information for the order and propagates it to the inlined Bessel function. There are examples of such tricks in the "performance tips" section of the Julia manual.

@azadoks (Contributor, author) commented Oct 6, 2022

I think this is as you suggest(?), but it seems to break @turbo.

@inline function sphericalbesselj(order::Val{0}, x)
    sin(x) / x
end
@inline function sphericalbesselj(order::Val{1}, x)
    (sin(x) / x - cos(x)) / x
end
...
function psp_projector_fourier_inner_loop(order::Val, psp::PspUpf, i, l, q::T)::T where {T <: Real}
    N = length(psp.rprojs[l+1][i])
    s = zero(T)
    @turbo for ir = 1:N
        s += sphericalbesselj(order, q * psp.rgrid[ir]) * psp.rprojs[l+1][i][ir] * psp.drgrid[ir]
    end
    s
end
function eval_psp_projector_fourier(psp::PspUpf, i, l, q::T)::T where {T <: Real}
    if iszero(q)
        s = l == 0 ? one(T) : zero(T)
    else
        s = psp_projector_fourier_inner_loop(Val(l), psp, i, l, q)
    end
    return s
end
julia> PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid);                    # warm up
┌ Warning: #= /data/azadoks/source/git/DFTK.jl/src/pseudo/PspUpf.jl:172 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.

@antoine-levitt (Member):

Hm, curious (and annoying). That looks like a limitation of LoopVectorization, and I'm not sure how to work around it... That said, if hand-coded + SIMD already gives reasonable enough performance, we might just as well not use LV...

@azadoks (Contributor, author) commented Oct 7, 2022

It seems like @simd also doesn't work with the Val{l} dispatch.
With @simd:

# For Silicon, 2-atom cell, si-q4.upf
Ecut = 80
kgrid = [12, 12, 12]
julia> basis = @time PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid);      # benchmark
 70.301417 seconds (29.15 M allocations: 2.576 GiB, 0.61% gc time, 0.08% compilation time)

With @inbounds + @fastmath:

# For Silicon, 2-atom cell, si-q4.upf
Ecut = 80
kgrid = [12, 12, 12]
julia> basis = @time PlaneWaveBasis(model; Ecut=Ecut, kgrid=kgrid);      # benchmark
 69.722348 seconds (29.08 M allocations: 2.573 GiB, 0.43% gc time)

A glance through the LoopVectorization documentation makes me think that @simd and @turbo might give us problems down the line with dual numbers anyway. It would be a bit messy and repetitive, but I'd propose having both (see the sketch below):

  • The verbose @turbo version for Float64 + Float32, which can be turbocharged easily without messing with array-of-structs -> array-of-numbers transforms
  • The safe and succinct Val{l}-dispatched @inbounds + @fastmath version for everything else
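
A minimal sketch of how that two-version split could be wired up via dispatch (all function names hypothetical; the @turbo kernel inlines the Bessel function by hand per l, since Val dispatch breaks @turbo as seen above):

using LoopVectorization

# Verbose @turbo kernel for plain floats, one per angular momentum (l = 0 shown)
function inner_loop_l0_turbo(r, dr, rproj, q::T) where {T <: Union{Float32, Float64}}
    s = zero(T)
    @turbo for ir in eachindex(r)
        x = q * r[ir]
        s += sin(x) / x * rproj[ir] * dr[ir]  # j_0 written out by hand
    end
    s
end

# Safe Val-dispatched kernel for any real number type (e.g. dual numbers)
@inline sphericalbesselj(::Val{0}, x) = sin(x) / x

function inner_loop(order::Val, r, dr, rproj, q::T) where {T <: Real}
    s = zero(T)
    @inbounds @fastmath for ir in eachindex(r)
        s += sphericalbesselj(order, q * r[ir]) * rproj[ir] * dr[ir]
    end
    s
end

# Plain floats get routed to the turbocharged kernel
inner_loop(::Val{0}, r, dr, rproj, q::Union{Float32, Float64}) =
    inner_loop_l0_turbo(r, dr, rproj, q)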

@antoine-levitt (Member):

DFTK does pride itself on simplicity, so I'd say let's forget about LV and just use @simd and Val. The timings you report are reasonable, I'd say. In any event, if we want more performance, we just do interpolation in q space (which, even with a very fine mesh, should give an extremely significant boost, since we are basically sampling all of 3D q space here, and these should be very smooth functions of q).
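
A minimal sketch of the q-space interpolation idea, hand-rolled to stay dependency-free (all names hypothetical; mesh spacing and upper bound would need testing):

# Tabulate the smooth form factor F(|q|) once on a fine mesh, then answer
# every |G + k| query by cheap linear interpolation.
struct QInterpolator{T <: Real}
    dq::T
    table::Vector{T}
end

function QInterpolator(eval_exact, qmax; dq=1e-2)
    table = [eval_exact(q) for q in 0:dq:qmax]
    QInterpolator{eltype(table)}(eltype(table)(dq), table)
end

function (itp::QInterpolator)(q)
    i = floor(Int, q / itp.dq)  # left grid point (0-based)
    i >= length(itp.table) - 1 && return itp.table[end]
    t = q / itp.dq - i          # fractional position within the cell
    (1 - t) * itp.table[i + 1] + t * itp.table[i + 2]
end

# Hypothetical usage: tabulate up to the largest |q| needed on the grid
# itp = QInterpolator(q -> eval_psp_projector_fourier(psp, i, l, q), 2sqrt(2Ecut))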

@antoine-levitt (Member):

Also, LV is kind of in flux and due for a rewrite from what I understood, so hopefully in a year or so we can just slap @turbo on the Val version and it'll just work.

@mfherbst (Member) left a review:

Some notes.

@mfherbst (Member) commented Oct 8, 2022

Another thing that could be worth trying is to parallelise over the evaluation of the projectors, or over the evaluation of Fourier components of the local potential. This will certainly help for larger calculations on clusters.
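
A minimal sketch of such a parallelisation over q values using Julia threads (not the PR's code; eval_psp_projector_fourier as in the snippets above):

# Evaluate the Fourier-space projector for many q values in parallel
function eval_projector_batch(psp, i, l, qs::AbstractVector)
    out = similar(qs)
    Threads.@threads for iq in eachindex(qs)
        out[iq] = eval_psp_projector_fourier(psp, i, l, qs[iq])
    end
    out
end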

@mfherbst (Member) commented Oct 8, 2022

I agree with Antoine regarding the simplicity/speed tradeoff. I think what we have now is OK in terms of speed, and I would avoid making the code more complicated if possible.

@mfherbst (Member) commented Oct 8, 2022

> Add / remove included UPF pseudopotentials

On that point: I think we should certainly not try to have a comprehensive UPF pseudo library in our code base. I see two options:

  • Introduce a data artefact containing all sorts of pseudo libraries (which could then be referenced by whoever needs it: UPF.jl, DFTK, other people ...).
  • Have only a very stripped-down version in DFTK (implying potential false promises about their quality, what works, etc.) and beyond that have people manage their pseudos themselves (this can already be done now: load_psp takes an arbitrary path to load your custom pseudos).

I lean towards the first option, but I don't have an overview of how many pseudo libraries are actually used, i.e. whether there is a combinatorial explosion sitting behind this solution that one had better not buy into.

@mfherbst (Member) commented Oct 8, 2022

Also: I think there are still some stability issues. Just out of curiosity I tried a UPF for a pretty big atom (Hafnium) and got greeted with NaNs. I have not investigated further as I think we should slowly make our way up the complexity ladder, but just to warn everyone :).

@antoine-levitt (Member):

Agreed, we should not have these in the repo. GTH are fine because they are super small, but beyond that it's fine to require people to get the UPFs themselves (and having a repo with many of them is a separate project).

@azadoks (Contributor, author) commented Oct 8, 2022

> Just out of curiosity I tried a UPF for a pretty big atom (Hafnium) and got greeted with NaNs. I have not investigated further as I think we should slowly make our way up the complexity ladder, but just to warn everyone :).

I suspect this is because it's on a linear grid (is it from ONCVPSP?), and linear grids usually contain r = 0. The sphericalbesselj functions divide by q * r, and we only check for iszero(q), not iszero(psp.rgrid[ir]).
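
A minimal sketch of the missing guard, using the analytic limits j_0(0) = 1 and j_l(0) = 0 for l > 0 (signatures follow the Val-dispatch snippet above):

@inline function sphericalbesselj(order::Val{0}, x::T) where {T <: Real}
    iszero(x) && return one(T)   # j_0(0) = 1
    sin(x) / x
end
@inline function sphericalbesselj(order::Val{1}, x::T) where {T <: Real}
    iszero(x) && return zero(T)  # j_l(0) = 0 for l > 0
    (sin(x) / x - cos(x)) / x
end

The extra branch would again get in the way of @turbo/@simd; an alternative is to drop the r = 0 grid point from the quadrature before the loop, since the r-weighted projector vanishes there anyway.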

@azadoks (Contributor, author) commented Oct 8, 2022

> Agreed, we should not have these in the repo. GTH are fine because they are super small, but beyond that it's fine to require people to get the UPFs themselves (and having a repo with many of them is a separate project).

Sounds good. It could be nice eventually to have something à la aiida-pseudo which packages the pseudos, recommended cutoffs, and other metadata from projects like the SSSP and PseudoDojo along with the UPFs / PSPs. We can kick this down the road for now.

Shall we leave in a few pseudos for testing purposes? Maybe si-q4.upf for testing consistency with HGH, and the PseudoDojo silicon pseudo for testing ONCVPSP (i.e. linear-grid) pseudos?

There might be some other pseudo generators with their own quirks for which we could keep an example for testing; I should probably look into this...

@azadoks (Contributor, author) commented Oct 11, 2022

I made a bit of a mess yesterday trying to rebase, but I've fixed up a few things:

  • Fixed (I hope) the NaN problem
  • Moved the sphericalbesselj functions to common
  • Refactored eval_psp_projector_fourier
  • Fixed the real-Fourier consistency tests
    • Instantiate real-space interpolators at parse time so that tests take a reasonable amount of time
  • Trimmed down the number of UPF files to a few used for testing
  • Renamed the UPF.jl package to PseudoPotentialIO.jl

What do you guys think are the remaining key points to address to wrap this all up?

@mfherbst (Member) commented Oct 11, 2022

Re PseudoPotentialIO.jl: why so complicated? Would not PseudoPotential be sufficient? I mean, long-term I guess we should move the PspUpf and PspHgh business from DFTK there as well, don't you think?

> Trimmed down the number of UPF files to a few used for testing

I'm not sure we should have them in data as that could suggest too much. Perhaps just have them in the test subfolder somewhere?

> What do you guys think are the remaining key points to address to wrap this all up?

From my point of view there are two blocking things:

  • The unit issue
  • In case not yet done: Some careful test against QE or ABINIT in an example (i.e. show that you can reproduce the numbers over a reasonable range of Ecuts). Maybe with something non-HGH?

A nice example could also be good (I guess you can just download the UPF file from e.g. https://pseudopotentials.quantum-espresso.org/legacy_tables/hartwigesen-goedecker-hutter-pp/si in the test and do the SCF once with the UPF and once with the analytical HGH?)

@azadoks (Contributor, author) commented Oct 11, 2022

> Re PseudoPotentialIO.jl: why so complicated? Would not PseudoPotential be sufficient? I mean, long-term I guess we should move the PspUpf and PspHgh business from DFTK there as well, don't you think?

I agree, but PseudoPotentials.jl was already registered. I guess I could have registered PseudoPotential.jl (maybe a bit too close?); I can always change it.

> I'm not sure we should have them in data as that could suggest too much. Perhaps just have them in the test subfolder somewhere?

True, I'll move them into test/

> In case not yet done: Some careful test against QE or ABINIT in an example (i.e. show that you can reproduce the numbers over a reasonable range of Ecuts). Maybe with something non-HGH?

Ah yeah, I think this will definitely take some careful thought to try to get all the inputs lined up as well as we can. I'll give it a first shot and share the input files.

@mfherbst (Member):

> Ah yeah, I think this will definitely take some careful thought to try to get all the inputs lined up as well as we can. I'll give it a first shot and share the input files.

Oh yes, this is tricky. Don't try to be too fancy at first (i.e. use a simple functional and setup; first one projector, then many, etc.). But it's worth doing. I found quite a few smaller errors along the way when I did this with ABINIT back in the day.

@azadoks (Contributor, author) commented Oct 11, 2022

> The unit issue

Yeah, this is quite messy.

  • rgrid -- Always in Bohr
  • vloc -- Always in Ry
  • beta -- Either in Bohr^{-1/2} (Vanderbilt USPP) or in Ry Bohr^{-1/2} ("some converters")
  • Dion / Dij / h -- Either in Ry (Vanderbilt USPP) or in Ry^{-1} ("some converters")

> A note on units: the beta functions and D matrix are defined as in Vanderbilt's US PP code and should have Bohr^{-1/2} and Ry units, respectively. Since they enter the calculation only as (beta · D · beta), one might as well have "dion" in Ry^{-1} units and "beta" in Ry Bohr^{-1/2} units, in line with what is suggested by the Kleinman-Bylander transformation. Some converters actually do exactly that.

Because the units of beta and D are undefined, I'm not sure how to deal with this generally. For the pseudos I've tested, it seems like the Kleinman-Bylander units are being used.

For the local potential, I don't understand why, but if I convert before the FT, V(q) has strong oscillations. I also still don't completely understand why the valence number Z needs to be multiplied by two. I'll spend some time staring at it again, but if anyone can see why all this is happening more easily, let me know!
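
A minimal sketch of the Ry -> Hartree normalisation at parse time (all variable names hypothetical). One possible source of the factor of two on Z: in Rydberg units e^2 = 2, so the Coulomb tail of the local potential is -2*Zval/r in Ry but -Zval/r in Hartree, which DFTK uses.

# Hypothetical parse-time unit normalisation. UPF stores the local potential
# in Ry and the radial grid in Bohr; DFTK works in Hartree, and 1 Ry = 0.5 Ha.
Zval  = 4.0                              # e.g. 4 valence electrons (Si)
rgrid = range(0.01, 20.0; length=2000)   # Bohr
vloc_ry = -2Zval ./ rgrid                # pure-Coulomb stand-in for PP_LOCAL, in Ry
vloc_ha = vloc_ry ./ 2                   # Ry -> Hartree

# Far from the nucleus the Hartree tail must be -Zval/r; in Ry units it is
# -2*Zval/r, i.e. "Z multiplied by two".
@assert vloc_ha[end] ≈ -Zval / rgrid[end]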

@azadoks (Contributor, author) commented Oct 21, 2022

I've just committed a (non-working) draft of the example: the idea is to run an Ecut convergence study on an HGH and a PseudoDojo UPF pseudo (to show the difference in how hard they are).

Then, use the converged Ecut to plot a band structure (hopefully showing that the results are quite similar).

I chose magnesium because it's one of the few heavier elements (i.e. not H, He, Li) in PseudoDojo that doesn't contain NLCC (which we don't yet support). I can't seem to get the mg-q10.hgh pseudo to converge (up to 76 Ha; I haven't tried higher).
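
A minimal sketch of the convergence study described here, written against DFTK's public API (silicon stands in for magnesium, and the UPF path is a placeholder):

using DFTK

# Total energy of bulk silicon at a given Ecut, for an arbitrary pseudo
function silicon_energy(psp; Ecut)
    a = 10.26
    lattice = a / 2 * [[0 1 1.]; [1 0 1.]; [1 1 0.]]
    Si = ElementPsp(:Si, psp=psp)
    model = model_PBE(lattice, [Si, Si], [ones(3)/8, -ones(3)/8])
    basis = PlaneWaveBasis(model; Ecut, kgrid=[4, 4, 4])
    self_consistent_field(basis; tol=1e-8).energies.total
end

psp_upf = load_psp("./Si.upf")  # placeholder path to a PseudoDojo UPF

Ecuts  = 10:5:50
E_ref  = silicon_energy(psp_upf; Ecut=60)  # crude reference at a high cutoff
errors = [abs(silicon_energy(psp_upf; Ecut) - E_ref) for Ecut in Ecuts]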

@mfherbst (Member):

Hmm. I think this is too large an Ecut for the docs to handle in CI. E.g. for https://docs.dftk.org/stable/examples/convergence_study/ we only run rather moderate Ecuts on the CI and show static images for the more reasonable sizes.

Regarding the large Ecuts: I'm not super surprised, as HGHs are indeed quite hard. I also know other cases where we had to go beyond 100 Ha to converge to 4-5 digits.

I'm not sure it's necessary to do a full convergence + bandstructure example here. I would literally just show how to obtain the pseudos (for now, just download them from the git repo, I guess) and then run an SCF at a moderate Ecut. Surely the results will not be 100% the same, but that's fine.

@azadoks (Contributor, author) commented Oct 21, 2022

> I'm not sure it's necessary to do a full convergence + bandstructure example here. I would literally just show how to obtain the pseudos (for now, just download them from the git repo, I guess) and then run an SCF at a moderate Ecut. Surely the results will not be 100% the same, but that's fine.

I guess this is probably more reasonable; I still think it's interesting to show the convergence, though, so I could do that with static images.

Running the SCF won't really show anything on its own, because the total energies will be quite different. Something like a band structure or an equation of state should show that the pseudos give results that agree, at least somewhat.
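
For the equation-of-state idea, a self-contained sketch (again with silicon as a stand-in): scan the lattice constant with each pseudo and compare the shapes and minima of the E(a) curves rather than absolute energies.

using DFTK

# Energy vs. lattice constant for one pseudo
function silicon_eos(psp; Ecut=40, scales=0.96:0.02:1.04)
    map(scales) do s
        a = 10.26 * s
        lattice = a / 2 * [[0 1 1.]; [1 0 1.]; [1 1 0.]]
        Si = ElementPsp(:Si, psp=psp)
        model = model_PBE(lattice, [Si, Si], [ones(3)/8, -ones(3)/8])
        basis = PlaneWaveBasis(model; Ecut, kgrid=[4, 4, 4])
        self_consistent_field(basis; tol=1e-8).energies.total
    end
end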

@mfherbst (Member):

> Running the SCF won't really show anything on its own, because the total energies will be quite different. Something like a band structure or an equation of state should show that the pseudos give results that agree, at least somewhat.

Makes sense. In any case: If you create static images, please always add the code to generate the images to the https://github.com/JuliaMolSim/DFTK.jl/blob/master/docs/src/assets/0_pregenerate.jl file to ensure we can easily update them.

Comment on lines 53 to 54 of examples/pseudopotentials.jl:
plot!(plt, conv_hgh.Ecuts, conv_hgh.errors, label="HGH")
plot!(plt, conv_upf.Ecuts, conv_upf.errors, label="PseudoDojo UPF")
Member:

Show convergence on a log plot, with thicker lines.

Contributor (author):

Done!

Contributor (author):

[updated convergence plot]

Member:

Very fishy. Why is there such a big plateau? Is it because of the tolerance? If so, can you just not display that part of the curve?

Member:

Otherwise very interesting! I knew the HGHs were kind of hard, but the difference is huge!

Member:

😱 That does look like a numerical grid issue. Do you by any chance have graphs of |V(q)| and |beta(q)| on a semilog scale at the ready (so we can see at what rate they decay)? Otherwise I can take a look.

Contributor (author):

I can probably get it in a few mins.

Member:

Yup, it's a grid issue. I'm opening a separate issue; it shouldn't hold back the merge.

@azadoks (Contributor, author), Oct 21, 2022:

[semilog plots of |V(q)| and |beta(q)|]

Member:

Note the potential seems stagnant around q = 10, which corresponds to Ecut = q^2/2 = 50 Ha, so that's consistent.

Member (same thread):

Yes, it's the SCF tolerance. By default in this example it is tol/100, where tol is the convergence target we are aiming for (so 1e-3).

@mfherbst (Member):

Apart from the cosmetic nits on the plot script and the pregenerated plot, this LGTM.

@mfherbst enabled auto-merge (squash), October 21, 2022 12:56
@mfherbst (Member):

@azadoks Great PR, thanks!

auto-merge was automatically disabled October 21, 2022 15:26 (head branch was pushed to by a user without write access)

@mfherbst enabled auto-merge (squash), October 21, 2022 16:34
@mfherbst merged commit 3b1bfff into JuliaMolSim:master, Oct 21, 2022