
FFTs only work one band at a time #96

Merged
mfherbst merged 14 commits into master from FFTs on Jan 26, 2020
Conversation

antoine-levitt
Member

First part of #36 (comment)

@mfherbst
Member

I'd avoid using Ψ and ψ for different things, because the symbols look far too similar in most fonts. I now use Ψik instead of ψk to make the distinction clearer.

@mfherbst
Member

Also, in general I see why you did this: it simplifies the code. On the other hand we now need to deal with threading in a few more places (density computation, LDOS computation, ...), whereas otherwise it would all live in one centralised spot. I think that's OK for now, but if we look into MPI or hybrid GPU/CPU parallelisation at some point we will need to reconsider ... but by then we will probably have a bunch of other problems, too.

I'll add threading for the LDOS and the density (it actually slows down the calculations I run noticeably to have this single-threaded!).

@antoine-levitt
Member Author

I'll add threading for the LDOS and the density (it actually slows down the calculations I run noticeably to have this single-threaded!).

Really? Might be worth trying to optimize a bit then (e.g. dot fusion, making sure there's no @code_warntype funny business, etc.). It should definitely be lower-order for large systems, but for smallish ones there might be an impact.
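
As a rough illustration of the kind of fix meant here (a minimal sketch with an invented `accumulate_density!` helper, not DFTK's actual routine), fusing the elementwise operations into a single broadcast avoids allocating temporaries in the hot loop, and `@code_warntype` can be used to check for type instabilities:

```julia
# Hypothetical sketch, not DFTK's actual density accumulation: a single fused
# broadcast computes occupation * |ψ|^2 and accumulates it in place, so no
# intermediate arrays are allocated per band.
function accumulate_density!(ρ, ψk_real, occupation)
    ρ .+= occupation .* abs2.(ψk_real)
    ρ
end

# To look for "@code_warntype funny business" (type instabilities), run e.g.:
#   using InteractiveUtils
#   @code_warntype accumulate_density!(ρ, ψk_real, occupation)
```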

@antoine-levitt
Member Author

Also, in general I see why you did this: it simplifies the code. On the other hand we now need to deal with threading in a few more places (density computation, LDOS computation, ...), whereas otherwise it would all live in one centralised spot. I think that's OK for now, but if we look into MPI or hybrid GPU/CPU parallelisation at some point we will need to reconsider ... but by then we will probably have a bunch of other problems, too.

Hm, that I don't know. For optimal performance we also want to minimize the number of temporary buffers of size Psi we allocate. All of this will have to change anyway, so let's see when we get to it.

I'd avoid using Ψ and ψ for different things, because the symbols look far too similar in most fonts. I now use Ψik instead of ψk to make the distinction clearer.

We have way too many notations for these already. We sometimes use Psi, sometimes \Psi, sometimes \psi, and sometimes X. To complicate matters, they are really the u (the periodic part of the Bloch waves), not the ψ (the Bloch waves themselves). Similarly for orben. Let's brainstorm all the notations sometime and update them consistently?
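
For reference (this is just standard Bloch theory, nothing specific to this PR), the relation between the two quantities is

```latex
\psi_{nk}(r) = e^{\mathrm{i} k \cdot r}\, u_{nk}(r),
\qquad u_{nk}(r + R) = u_{nk}(r) \quad \text{for every lattice vector } R,
```

so the coefficients stored per k-point are the plane-wave expansion of u_{nk}, not of ψ_{nk}.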

@mfherbst
Member

mfherbst commented Jan 22, 2020

I'll profile the threading aspect of the calculations once this is in master and see what can be done about it. In any case, threading certainly should not hurt.

We have way too many notations for these already. We sometimes use Psi, sometimes \Psi, sometimes \psi, and sometimes X. To complicate matters, they are really the u (the periodic part of the Bloch waves), not the ψ (the Bloch waves themselves). Similarly for orben. Let's brainstorm all the notations sometime and update them consistently?

I agree. I've noticed that too when going through the code for this PR.

@antoine-levitt
Member Author

I'll profile the threading aspect of the calculations once this is in master and see what can be done about it. In any case, threading certainly should not hurt.

I think just adding dots and views to the LDOS will already improve things massively.
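
To spell out what "dots and views" means in practice, here is a minimal sketch with invented names (`ldos_contribution!`, `ψk_real`), not the actual DFTK LDOS code:

```julia
# @views turns the slices into non-copying views, and the fused broadcast
# (.+=, .*, abs2.) accumulates w * |ψ_n|^2 in place without temporaries.
function ldos_contribution!(ldos, ψk_real, weights)
    @views for (n, w) in enumerate(weights)
        ldos .+= w .* abs2.(ψk_real[:, :, :, n])
    end
    ldos
end
```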

@mfherbst
Member

mfherbst commented Jan 23, 2020

OK ... I got around to doing some small-scale testing with the density computation function and I agree: extra threading (on top of what FFTW does itself) really makes little sense here and only makes the code more complicated (you need to pre-allocate thread-local buffers for the accumulation and take care not to run into false sharing ... it's a bit messy). So I agree we should skip that.

But there clearly are some issues with threading in general, and that's probably what I picked up before: how many cores are used on the cluster seems to be a bit random. For example, I ran the graphene calculation (committed) on a cluster node with 32 virtual / 16 physical cores by just including the script. The load stayed at 100% for a pretty long time (10 minutes), with only a short spike to 3200% for a couple of seconds in between. So I interrupted the execution and re-included the script in the same Julia shell. Now the load is mostly 1600%, with short periods (ca. 5 seconds) down to 100%, and the calculation terminates within about 4 minutes (but the maximal load was around 1600%). Re-including the script once more gives the same behaviour (1600% most of the time, short drops to 100%). I'm not sure, but it looks to me like there is some interference between the FFTW threading and the Julia @threads macro. I'll do some more testing ...

@mfherbst
Member

mfherbst commented Jan 23, 2020

Here is a log of the processor usage from the run. One line is one second.
dftk_FFTs_clustern10.log
Any ideas?

@antoine-levitt
Member Author

What did you use for JULIA_NUM_THREADS? I'm not sure how many threads FFTW uses by default; possibly we should do a fftw_set_num_threads(Threads.nthreads()) by default? Can you also record a profile? It might be that a lot of time is spent outside the FFTs (e.g. in a recent profile of supercell silicon, a lot of time was spent computing the external potential, which we should cache; again dependent on the term refactor).
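
For reference, in FFTW.jl the call alluded to is spelled `FFTW.set_num_threads`; a minimal way to align FFTW's thread count with Julia's would be:

```julia
using FFTW

# Make FFTW use as many threads as Julia was started with (JULIA_NUM_THREADS).
FFTW.set_num_threads(Threads.nthreads())
```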

@mfherbst
Member

JULIA_NUM_THREADS = 32, obviously, and FFTW enables threading automatically as well, see here. I can do some profiling, but I do not think that's the problem, because that was just one particularly odd run. Sometimes things look better (e.g. closer to 3200% usage). Also, on my laptop I have not encountered such behaviour much. I'll try to produce another case demonstrating it.

@antoine-levitt
Member Author

The first run looks like a bug, so it would be good to find out where that comes from and whether it is reproducible. I have also seen weirdness in the thread usage, so probably there is something fishy going on. One thing to try is whether running a dummy threaded loop (@threads for i=1:10 println(i) end or something) before including the script changes anything. Possibly FFTW's thread initialization is buggy when Julia threads have not been used before, or something like that?
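
The suggested dummy loop, spelled out (purely a warm-up to force Julia's threading machinery to start before the script is included):

```julia
using Base.Threads

# Trivial threaded loop; its only purpose is to exercise the thread
# infrastructure once before the actual calculation starts.
@threads for i in 1:10
    println("iteration ", i, " on thread ", threadid())
end
```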

@mfherbst
Member

Now, rerunning, I encountered a GC bug ... fun. I should say I've been using v1.2 on the cluster ... I'm now switching to 1.3.1 to make sure that's not the problem.

@antoine-levitt
Member Author

Huh, cool! We don't do anything particularly fishy with memory, so we might be triggering a bug here... Switching to 1.3 is a good idea.

@mfherbst
Member

OK ... with Julia 1.3.1 things are a lot more reliable, fortunately. We still see drops in performance in between (from a baseline of around 2700% down to around 400%), but things are obviously much better already. I'll do some profiling on the remaining performance drop.

@mfherbst
Member

OK, I found the bottleneck: it is the single-threaded, element-wise processing of the density when dealing with the symmetry operations.

I'll add a heuristic to capture the case where the symmetry operation is just the identity (the most common case, especially for supercell approaches) and add some threading to the loop to parallelise the whole processing (i.e. both computing the partial densities per k-point and the element-wise processing due to symmetry); see the sketch below.
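
A minimal sketch of the identity shortcut (names invented for illustration; `apply_symmetry!` stands in for the general element-wise remapping and is not an actual DFTK function):

```julia
using LinearAlgebra: I

# The symmetry operation (S, τ) is trivial when S is the identity and τ = 0;
# in that case the Fourier coefficients need no remapping at all.
is_identity_symop(S, τ) = S == I && iszero(τ)

# Intended use inside the density accumulation loop:
#     if is_identity_symop(S, τ)
#         ρ_accu .+= ρk_fourier                       # fast path: plain accumulation
#     else
#         apply_symmetry!(ρ_accu, ρk_fourier, S, τ)   # general element-wise path
#     end
```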

That should be OK for now, but perhaps we should investigate whether there is a cleverer way to handle symmetry than fiddling with the Fourier coefficients element-wise ...

@antoine-levitt
Member Author

It looks like G_vectors is pretty inefficient, by the way. Did you try putting a collect in there to materialize the generator?
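
The idea, sketched with a stand-in generator rather than DFTK's actual G_vectors:

```julia
# If G_vectors returns a lazy generator, every pass over it redoes the work;
# collecting it once materialises an ordinary array that is cheap to re-iterate
# and to index.
lazy_Gs   = ([i, j, k] for i in -4:4, j in -4:4, k in -4:4)  # stand-in generator
cached_Gs = collect(lazy_Gs)                                 # materialise once

# Reuse cached_Gs in the hot loop instead of recreating the generator each time.
total_norm = sum(G -> sum(abs, G), cached_Gs)
```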

@mfherbst
Member

Nope, but for this case I agree it's a good idea to collect it once.

@antoine-levitt
Member Author

Also, we can (1) precompute S^-1, (2) add an @inbounds annotation, (3) see if index_G_vectors is the bottleneck; see the sketch below.
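
Roughly what (1)–(3) amount to, as a sketch with invented names (`remap_density!`, `index_of_G`), not the actual DFTK loop:

```julia
function remap_density!(ρ_out, ρ_in, S, Gs, index_of_G)
    Sinv = inv(S)                           # (1) compute S^-1 once, outside the loop
    @inbounds for (iG, G) in enumerate(Gs)  # (2) skip bounds checks in the hot loop
        iG_mapped = index_of_G(Sinv * G)    # (3) candidate bottleneck: the index lookup
        ρ_out[iG_mapped] += ρ_in[iG]
    end
    ρ_out
end
```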

@antoine-levitt
Member Author

This shouldn't really be a rate-limiting operation, so I'd micro-optimize a bit before threading.

@mfherbst
Member

Indeed, about half of the time of the compute_density function is currently spent in the line with the inv and the index_G_vectors call.

@mfherbst
Member

OK, so I pushed the threading stuff, since I had done it now anyway. We can see what we can find in index_G_vectors.

@antoine-levitt
Member Author

Oof. That's some heavy thread-specific logic that seems a bit convoluted, and it is pretty sure to increase the probability of hitting the infamous closure performance bug (@threads creates a closure). Also, code that is unoptimized and allocates a lot (like this one) typically does not thread well. Let's optimize the single-threaded version first and do some benchmarking. A simpler possibility that does not have load-balancing issues is to thread over the G points only.
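
What threading over the G points only could look like, as a sketch (invented names, written as a gather so each thread writes only its own output entries):

```julia
using Base.Threads

# Parallelise over the G components of a single density; every iteration
# writes a distinct ρ_out entry, so no thread-local buffers or reductions
# are needed.
function symmetrize_over_G!(ρ_out, ρ_in, mapping)
    @threads for iG in eachindex(ρ_out)
        ρ_out[iG] = ρ_in[mapping[iG]]
    end
    ρ_out
end
```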

@mfherbst
Member

I'm still working on the single-threaded version anyway ... I just wanted to have it out and get it checked ... we can decide later what to use.

@antoine-levitt
Member Author

With the above modifications it doesn't seem to be performance-critical for me. OTOH, I got a weird "corrupted size vs. prev_size" error when running with threads; did you ever get that?

@antoine-levitt
Member Author

It looks like ProfileView can export and import JLD files, so that might be a good way to exchange profiles.

@antoine-levitt
Member Author

Well, that's fun:

https://gist.github.com/antoine-levitt/41cc5480235d9d4c7e853354ccdb9a45

I got that by running silicon on this branch. It's in fftw_mkplan, and it has the "signal (11): Segmentation fault" look characteristic of a race. This is possibly what would happen if we planned FFTs on threads (plan_fft is not thread-safe, I believe), but we don't do that, do we?

@mfherbst
Member

Yikes. Not that I'm aware of. The planning is done when the PlaneWaveBasis is constructed, i.e. here: https://github.com/JuliaMolSim/DFTK.jl/blob/master/src/PlaneWaveBasis.jl#L101.

@antoine-levitt
Member Author

Got it: inv(plan) is not thread-safe.
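
A sketch of the resulting rule of thumb: create the plan and its inverse once, serially, and only apply them from threads (planning or inverting concurrently is what triggers the crash above):

```julia
using FFTW, Base.Threads

grid  = zeros(ComplexF64, 32, 32, 32)
plan  = plan_fft(grid)   # plan once, on the main thread
iplan = inv(plan)        # ... and invert it once, before any @threads region

@threads for n in 1:8
    data = rand(ComplexF64, 32, 32, 32)
    roundtrip = iplan * (plan * data)   # only *applying* plans inside the threads
end
```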

@antoine-levitt
Member Author

Opened an issue there: JuliaMath/FFTW.jl#134

@mfherbst
Member

mfherbst commented Jan 26, 2020

The failure in Travis is caused by JuliaNLSolvers/NLSolversBase.jl#117. I'll pin NLSolversBase to the last working version.

@mfherbst mfherbst merged commit 896de08 into master Jan 26, 2020
@mfherbst mfherbst deleted the FFTs branch January 26, 2020 10:35
@mfherbst
Member

Phew ... glad that one's finally in 😄.

@antoine-levitt
Member Author

Heh, yeah! Hopefully the other PRs are simpler.
