
FFTs only work one band at a time #96

Merged
mfherbst merged 14 commits into master from FFTs on Jan 26, 2020
Conversation

antoine-levitt
Member

First part of #36 (comment)

@mfherbst
Member

I'd avoid using Ψ and ψ for different things, because the symbols look far too similar in most fonts. I now use Ψik instead of ψk to make the distinction clearer.

@mfherbst
Member

Also, in general I see why you did this: it simplifies the code. On the other hand we now need to deal with threading in a few more places (density computation, LDOS computation, ...), whereas otherwise it would all live in one centralised spot. I think that's OK for now, but if we look into MPI or hybrid GPU/CPU parallelisation at some point we will need to reconsider ... but by then we will probably have a bunch of other problems, too.

I'll add threading for the LDOS and the density (it actually slows down the calculations I run noticeably to have this single-threaded!).

@antoine-levitt
Member Author

I'll add threading for the LDOS and the density (it actually slows down the calculations I run noticeably to have this single-threaded!).

Really? Might be worth trying to optimize a bit then (e.g. dot fusion, making sure there's no @code_warntype funny business, etc.). It should definitely be lower-order for large systems, but for smallish ones there might be an impact.
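
As a rough illustration of the kind of fix meant here (a minimal sketch with an invented `accumulate_density!` helper, not DFTK's actual routine), fusing the elementwise operations into a single broadcast avoids allocating temporaries in the hot loop, and `@code_warntype` can be used to check for type instabilities:

```julia
# Hypothetical sketch, not DFTK's actual density accumulation: a single fused
# broadcast computes occupation * |ψ|^2 and accumulates it in place, so no
# intermediate arrays are allocated per band.
function accumulate_density!(ρ, ψk_real, occupation)
    ρ .+= occupation .* abs2.(ψk_real)
    ρ
end

# To look for "@code_warntype funny business" (type instabilities), run e.g.:
#   using InteractiveUtils
#   @code_warntype accumulate_density!(ρ, ψk_real, occupation)
```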

@antoine-levitt
Member Author

Also, in general I see why you did this: it simplifies the code. On the other hand we now need to deal with threading in a few more places (density computation, LDOS computation, ...), whereas otherwise it would all live in one centralised spot. I think that's OK for now, but if we look into MPI or hybrid GPU/CPU parallelisation at some point we will need to reconsider ... but by then we will probably have a bunch of other problems, too.

Hm, that I don't know. For optimal performance we also want to minimize the number of temporary buffers of size Psi we allocate. All of this will have to change anyway, so let's see when we get to it.

I'd avoid using Ψ and ψ for different things, because the symbols look far too similar in most fonts. I now use Ψik instead of ψk to make the distinction clearer.

We have way too many notations for these already. We sometimes use Psi, sometimes \Psi, sometimes \psi, and sometimes X. To complicate matters, they are really the u (the periodic part of the Bloch waves), not the ψ (the Bloch waves themselves). Similarly for orben. Let's brainstorm all the notations sometime and update them consistently?
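
For reference (this is just standard Bloch theory, nothing specific to this PR), the relation between the two quantities is

```latex
\psi_{nk}(r) = e^{\mathrm{i} k \cdot r}\, u_{nk}(r),
\qquad u_{nk}(r + R) = u_{nk}(r) \quad \text{for every lattice vector } R,
```

so the coefficients stored per k-point are the plane-wave expansion of u_{nk}, not of ψ_{nk}.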

@mfherbst
Member

mfherbst commented Jan 22, 2020

I'll profile the threading aspect of the calculations once this is in master and see what can be done about it. In any case, threading certainly should not hurt.

We have way too many notations for these already. We sometimes use Psi, sometimes \Psi, sometimes \psi, and sometimes X. To complicate matters, they are really the u (the periodic part of the Bloch waves), not the ψ (the Bloch waves themselves). Similarly for orben. Let's brainstorm all the notations sometime and update them consistently?

I agree. I've noticed that too when going through the code for this PR.

@antoine-levitt
Member Author

I'll profile the threading aspect of the calculations once this is in master and see what can be done about it. In any case, threading certainly should not hurt.

I think just adding dots and views to the LDOS will already improve things massively.
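
To spell out what "dots and views" means in practice, here is a minimal sketch with invented names (`ldos_contribution!`, `ψk_real`), not the actual DFTK LDOS code:

```julia
# @views turns the slices into non-copying views, and the fused broadcast
# (.+=, .*, abs2.) accumulates w * |ψ_n|^2 in place without temporaries.
function ldos_contribution!(ldos, ψk_real, weights)
    @views for (n, w) in enumerate(weights)
        ldos .+= w .* abs2.(ψk_real[:, :, :, n])
    end
    ldos
end
```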

@mfherbst
Member

mfherbst commented Jan 23, 2020

OK ... I got around to doing some small-scale testing with the density computation function and I agree: extra threading (on top of what FFTW does itself) really makes little sense here and only makes the code more complicated (you need to pre-allocate thread-local buffers for the accumulation and take care not to run into false sharing ... it's a bit messy). So I agree we should skip that.

But there clearly are some issues with threading in general, and that's probably what I picked up before: how many cores are used on the cluster seems to be a bit random. For example, I ran the graphene calculation (committed) on a cluster node with 32 virtual / 16 physical cores by just including the script. The load stayed at 100% for a pretty long time (10 minutes), with only a short spike to 3200% for a couple of seconds in between. So I interrupted the execution and re-included the script in the same Julia shell. Now the load is mostly 1600%, with short periods (ca. 5 seconds) down to 100%, and the calculation terminates within about 4 minutes (but the maximal load was around 1600%). Re-including the script once more gives the same behaviour (1600% most of the time, short drops to 100%). I'm not sure, but it looks to me like there is some interference between the FFTW threading and the Julia @threads macro. I'll do some more testing ...

@mfherbst
Member

mfherbst commented Jan 23, 2020

Here is a log of the processor usage from the run. One line is one second.
dftk_FFTs_clustern10.log
Any ideas?

@antoine-levitt
Member Author

What did you use for JULIA_NUM_THREADS? I'm not sure how many threads FFTW uses by default; possibly we should do a fftw_set_num_threads(Threads.nthreads()) by default? Can you also record a profile? It might be that a lot of time is spent outside the FFTs (e.g. in a recent profile of supercell silicon, a lot of time was spent computing the external potential, which we should cache; again dependent on the term refactor).
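
For reference, in FFTW.jl the call alluded to is spelled `FFTW.set_num_threads`; a minimal way to align FFTW's thread count with Julia's would be:

```julia
using FFTW

# Make FFTW use as many threads as Julia was started with (JULIA_NUM_THREADS).
FFTW.set_num_threads(Threads.nthreads())
```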

@mfherbst
Member

JULIA_NUM_THREADS = 32, obviously, and FFTW enables threading automatically as well, see here. I can do some profiling, but I do not think that's the problem, because that was just one particularly odd run. Sometimes things look better (e.g. closer to 3200% usage). Also, on my laptop I have not encountered such behaviour much. I'll try to produce another case demonstrating it.

@antoine-levitt
Member Author

The first run looks like a bug, so it would be good to find out where that comes from and whether it is reproducible. I have also seen weirdness in the thread usage, so probably there is something fishy going on. One thing to try is whether running a dummy threaded loop (@threads for i=1:10 println(i) end or something) before including the script changes anything. Possibly FFTW's thread initialization is buggy when Julia threads have not been used before, or something like that?
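
The suggested dummy loop, spelled out (purely a warm-up to force Julia's threading machinery to start before the script is included):

```julia
using Base.Threads

# Trivial threaded loop; its only purpose is to exercise the thread
# infrastructure once before the actual calculation starts.
@threads for i in 1:10
    println("iteration ", i, " on thread ", threadid())
end
```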

@mfherbst
Member

Now, rerunning, I encountered a GC bug ... fun. I should say I've been using v1.2 on the cluster ... I'm now switching to 1.3.1 to make sure that's not the problem.

@antoine-levitt
Member Author

Huh, cool! We don't do anything particularly fishy with memory, so we might be triggering a bug here... Switching to 1.3 is a good idea.

@mfherbst
Member

OK ... with Julia 1.3.1 things are a lot more reliable, fortunately. We still see drops in performance in between (from a baseline of around 2700% down to around 400%), but things are obviously much better already. I'll do some profiling on the remaining performance drop.

@mfherbst
Member

OK, I found the bottleneck: it is the single-threaded, element-wise processing of the density when dealing with the symmetry operations.

I'll add a heuristic to capture the case where the symmetry operation is just the identity (the most common case, especially for supercell approaches) and add some threading to the loop to parallelise the whole processing (i.e. both computing the partial densities per k-point and the element-wise processing due to symmetry); see the sketch below.
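
A minimal sketch of the identity shortcut (names invented for illustration; `apply_symmetry!` stands in for the general element-wise remapping and is not an actual DFTK function):

```julia
using LinearAlgebra: I

# The symmetry operation (S, τ) is trivial when S is the identity and τ = 0;
# in that case the Fourier coefficients need no remapping at all.
is_identity_symop(S, τ) = S == I && iszero(τ)

# Intended use inside the density accumulation loop:
#     if is_identity_symop(S, τ)
#         ρ_accu .+= ρk_fourier                       # fast path: plain accumulation
#     else
#         apply_symmetry!(ρ_accu, ρk_fourier, S, τ)   # general element-wise path
#     end
```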

That should be OK for now, but perhaps we should investigate whether there is a cleverer way to handle symmetry than fiddling with the Fourier coefficients element-wise ...

@antoine-levitt
Member Author

It looks like G_vectors is pretty inefficient, by the way. Did you try putting a collect in there to materialize the generator?
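
The idea, sketched with a stand-in generator rather than DFTK's actual G_vectors:

```julia
# If G_vectors returns a lazy generator, every pass over it redoes the work;
# collecting it once materialises an ordinary array that is cheap to re-iterate
# and to index.
lazy_Gs   = ([i, j, k] for i in -4:4, j in -4:4, k in -4:4)  # stand-in generator
cached_Gs = collect(lazy_Gs)                                 # materialise once

# Reuse cached_Gs in the hot loop instead of recreating the generator each time.
total_norm = sum(G -> sum(abs, G), cached_Gs)
```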

@mfherbst
Member

Nope, but for this case I agree it's a good idea to collect it once.

@antoine-levitt
Member Author

Also, we can (1) precompute S^-1, (2) add an @inbounds annotation, (3) see if index_G_vectors is the bottleneck; see the sketch below.
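
Roughly what (1)–(3) amount to, as a sketch with invented names (`remap_density!`, `index_of_G`), not the actual DFTK loop:

```julia
function remap_density!(ρ_out, ρ_in, S, Gs, index_of_G)
    Sinv = inv(S)                           # (1) compute S^-1 once, outside the loop
    @inbounds for (iG, G) in enumerate(Gs)  # (2) skip bounds checks in the hot loop
        iG_mapped = index_of_G(Sinv * G)    # (3) candidate bottleneck: the index lookup
        ρ_out[iG_mapped] += ρ_in[iG]
    end
    ρ_out
end
```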

@antoine-levitt
Member Author

This shouldn't really be a rate-limiting operation, so I'd micro-optimize a bit before threading.

@mfherbst
Member

Indeed, about half of the time of the compute_density function is currently spent in the line with the inv and the index_G_vectors call.

@mfherbst
Member

OK, so I pushed the threading stuff, since I had done it now anyway. We can see what we can find in index_G_vectors.

@antoine-levitt
Member Author

Oof. That's some heavy thread-specific logic that seems a bit convoluted, and it is pretty sure to increase the probability of hitting the infamous closure performance bug (@threads creates a closure). Also, code that is unoptimized and allocates a lot (like this one) typically does not thread well. Let's optimize the single-threaded version first and do some benchmarking. A simpler possibility that does not have load-balancing issues is to thread over the G points only.
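
What threading over the G points only could look like, as a sketch (invented names, written as a gather so each thread writes only its own output entries):

```julia
using Base.Threads

# Parallelise over the G components of a single density; every iteration
# writes a distinct ρ_out entry, so no thread-local buffers or reductions
# are needed.
function symmetrize_over_G!(ρ_out, ρ_in, mapping)
    @threads for iG in eachindex(ρ_out)
        ρ_out[iG] = ρ_in[mapping[iG]]
    end
    ρ_out
end
```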

@mfherbst
Member

I'm still working on the single-threaded version anyway ... I just wanted to have it out and get it checked ... we can decide later what to use.

@antoine-levitt
Member Author

With the above modifications it doesn't seem to be performance-critical for me. OTOH, I got a weird "corrupted size vs. prev_size" error when running with threads; did you ever get that?

@antoine-levitt
Member Author

It looks like ProfileView can export and import JLD files, so that might be a good way to exchange profiles.

@antoine-levitt
Member Author

Well, that's fun:

https://gist.github.com/antoine-levitt/41cc5480235d9d4c7e853354ccdb9a45

I got that by running silicon on this branch. It's in fftw_mkplan, and it has the "signal (11): Segmentation fault" look characteristic of a race. This is possibly what would happen if we planned FFTs on threads (plan_fft is not thread-safe, I believe), but we don't do that, do we?

@mfherbst
Member

Yikes. Not that I'm aware of. The planning is done when the PlaneWaveBasis is constructed, i.e. here: https://github.com/JuliaMolSim/DFTK.jl/blob/master/src/PlaneWaveBasis.jl#L101.

@antoine-levitt
Member Author

Got it: inv(plan) is not thread-safe.
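
A sketch of the resulting rule of thumb: create the plan and its inverse once, serially, and only apply them from threads (planning or inverting concurrently is what triggers the crash above):

```julia
using FFTW, Base.Threads

grid  = zeros(ComplexF64, 32, 32, 32)
plan  = plan_fft(grid)   # plan once, on the main thread
iplan = inv(plan)        # ... and invert it once, before any @threads region

@threads for n in 1:8
    data = rand(ComplexF64, 32, 32, 32)
    roundtrip = iplan * (plan * data)   # only *applying* plans inside the threads
end
```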

@antoine-levitt
Member Author

Opened an issue there: JuliaMath/FFTW.jl#134

@mfherbst
Member

mfherbst commented Jan 26, 2020

The failure in Travis is caused by JuliaNLSolvers/NLSolversBase.jl#117. I'll pin NLSolversBase to the last working version.

@mfherbst mfherbst merged commit 896de08 into master Jan 26, 2020
@mfherbst mfherbst deleted the FFTs branch January 26, 2020 10:35
@mfherbst
Member

Phew ... glad that one's finally in 😄.

@antoine-levitt
Member Author

Heh, yeah! Hopefully the other PRs are simpler.
