FFTs only work one band at a time #96
Conversation
I'd avoid using Ψ and ψ for different things, because the symbols look far too similar in most fonts. I now use Ψik instead of ψk to make the distinction clearer.
Also, in general I see why you did this: it simplifies the code. On the other hand, we now need to deal with threading in a few more places (density computation, LDOS computation, ...), whereas otherwise it would be handled in one centralised spot. I think that's ok for now, but if we look into MPI / hybrid GPU/CPU parallelisation at some point we'll need to reconsider ... though by then we'll probably have a bunch of other problems, too. I'll add threading for LDOS and density (it actually slows down the calculations I run noticeably to have this single-threaded!).
Really? Might be worth trying to optimize a bit then (e.g. dot fusion, making sure there's no …).
Hm, that I don't know. For optimal performance we also want to minimize the number of temporary buffers of size Psi we allocate. All of this will have to change anyway, so let's see when we get to it.
We have way too many notations for these already. We sometimes use Psi, sometimes \Psi, sometimes \psi, and sometimes X. To complicate matters, they are really the u (the periodic part of the Bloch waves), not the psi (the Bloch waves themselves). Similarly for the orben. Let's brainstorm all the notations sometime and update them consistently?
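For reference, the distinction being made is standard Bloch theory, just restating the convention:

```latex
% Bloch wave psi_{nk} versus its periodic part u_{nk}: storing u (plus k)
% is equivalent to storing psi, but the two should not share a symbol.
\[
  \psi_{nk}(r) = e^{\mathrm{i} k \cdot r}\, u_{nk}(r),
  \qquad u_{nk}(r + R) = u_{nk}(r) \quad \text{for all lattice vectors } R.
\]
```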
I'll profile the threading aspect of these calculations once this is in master and see what can be done. In any case, threading should not hurt.
I agree. I've noticed that, too, when going through the code wrt this PR.
I think just adding dots and views to LDOS will already improve things massively.
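Something like the following is presumably what is meant (an illustrative sketch only; `ρ`, `occ` and `ψk` are made-up names, not the actual LDOS code):

```julia
# Sketch: fused, allocation-free accumulation using views.
function accumulate_ldos!(ρ, ψk, occ)
    @views for n in axes(ψk, 2)
        # Broadcast fusion: occ[n] .* abs2.(ψk[:, n]) is computed and
        # added into ρ in a single pass, without temporaries; @views
        # makes the column slice a view instead of a copy.
        ρ .+= occ[n] .* abs2.(ψk[:, n])
    end
    ρ
end
```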
Ok ... I got round to doing some smallish testing with the density computation function, and I agree that extra threading (on top of what is done by FFTW itself) really makes little sense here and only makes the code more complicated: you need to pre-allocate thread-local buffers for the accumulation and take care not to run into false sharing ... it's a bit messy. So I agree we should skip that.

But there clearly are some issues with threading in general, and that's probably what I picked up before: how many cores are used on the cluster seems to be a bit random. For example, I ran the graphene calculation (committed) on a cluster node with 32 virtual / 16 physical cores by just including the script. The load stayed at 100% for a pretty long time (10 minutes), with only a short spike to 3200% for a couple of seconds in between. So I interrupted the execution and re-included the script in the same Julia shell. Now the load was mostly at 1600%, with short periods (ca. 5 seconds) down to 100%, and the calculation terminated within about 4 minutes (maximal load around 1600%). Re-including the script once again gives the same behaviour (1600% most of the time, short drops to 100%). I'm not sure, but it seems to me there is some interference between the FFTW threading and the Julia threading.
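For the record, the pattern deemed too messy here is roughly the following (a sketch with made-up names; each thread accumulates into its own full-size buffer so the hot loop never writes to memory shared with another thread, and the partial results are merged at the end):

```julia
using Base.Threads

function threaded_density(ψ_per_k, weights)
    # One full-size accumulation buffer per thread.
    partials = [zeros(size(first(ψ_per_k))) for _ in 1:nthreads()]
    @threads for ik in eachindex(ψ_per_k)
        # Each thread only ever touches its own buffer.
        partials[threadid()] .+= weights[ik] .* abs2.(ψ_per_k[ik])
    end
    reduce(+, partials)  # merge the thread-local partial densities
end
```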
Here is a log of the processor usage from the run. One line is one second.
What did you use for …?
The first run looks like a bug, so it would be good to find where that comes from and whether it is reproducible. I have also seen weirdness in the thread usage, so probably there's something fishy going on there. One thing to try is whether doing a dummy threaded loop (…) first changes anything.
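i.e. presumably something as trivial as this (my guess at what was meant; the point is only to force Julia to spin up its worker threads before FFTW gets involved):

```julia
using Base.Threads

# Dummy threaded loop: touch the thread pool once before the real
# computation, then rerun the calculation and compare the core usage.
hits = zeros(Int, nthreads())
@threads for i in 1:nthreads()
    hits[threadid()] += 1
end
@show sum(hits)  # == nthreads()
```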
Now rerunning I encountered a GC bug ... fun. I should say I've been using v1.2 on the cluster ... I'm now switching to 1.3.1 to make sure that's not the problem.
Huh, cool! We don't do anything particularly fishy with memory, so we might be triggering a bug here... Switching to 1.3 is a good idea.
Ok ... with Julia 1.3.1 things are a lot more reliable, fortunately. We still get a drop in performance in between (from a baseline of around 2700% down to around 400%), but things are obviously much better already. I'll do some profiling wrt. the remaining performance drop.
Ok, I found the bottleneck: it is the single-threaded elementwise processing of the density when dealing with the symmetry operations. I'll add a heuristic to capture the case where the symmetry operation is just the identity (the most common case, especially for supercell approaches) and add some threading to the loop to parallelise the whole processing (i.e. both computing partial densities per k-point and the elementwise processing due to symmetry); see the sketch below. That should be ok for now, but perhaps we should investigate whether there is a more clever way to handle symmetry than fiddling with the Fourier coefficients elementwise ...
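A sketch of the heuristic (hypothetical names throughout; `symops` is assumed to hold rotation matrices in a field `S`, and the real code would also have to handle the G-vector remapping in the non-identity branch):

```julia
using LinearAlgebra, Base.Threads

is_identity_only(symops) = length(symops) == 1 && first(symops).S == I

function accumulate_density!(ρ, ρk_partials, symops)
    if is_identity_only(symops)
        # Fast path (common for supercells): plain accumulation,
        # no elementwise symmetrisation of Fourier coefficients.
        for ρk in ρk_partials
            ρ .+= ρk
        end
    else
        # Threaded over k-points; each partial density would be
        # symmetrised elementwise before accumulation (omitted here).
        @threads for ik in eachindex(ρk_partials)
            # ... symmetrise ρk_partials[ik] and accumulate safely ...
        end
    end
    ρ
end
```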
It looks like …
Nope, but for this case I agree it's a good idea to collect it once.
Also, we can 1) precompute S^-1, 2) add an …
this shouldn't really be a rate-limiting operation so I'd micro-optimize a bit before threading.
Indeed, about half of the time of the `compute_density` function is currently spent in the line with the `inv` and the `index_G_vectors` call.
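Concretely, that micro-optimisation could look roughly like this (a sketch; `symop.S`, `G_vectors` and the basis object are assumptions about the surrounding code): hoist the `inv` out of the loop and replace the repeated `index_G_vectors` calls by a lookup table built once.

```julia
function precompute_symop_indices(basis, symop)
    Sinv = inv(symop.S)  # invert once instead of once per G vector
    # Build the G -> linear index map once ...
    index = Dict(G => i for (i, G) in enumerate(G_vectors(basis)))
    # ... so each lookup becomes a Dict access instead of a search;
    # `nothing` marks rotated G vectors that fall outside the grid.
    [get(index, round.(Int, Sinv * G), nothing) for G in G_vectors(basis)]
end
```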
Ok, so I pushed the threading stuff, since I had done it anyway. We can see what we can find in …
Oof. That's some heavy thread-specific logic that seems a bit convoluted, and is pretty sure to increase the probability of hitting the infamous closure performance bug (…).
I'm still working on the single-threaded version anyway ... I just wanted to have it out and get it checked ... we can decide later what to use.
With the above modifications it doesn't seem to be performance-critical for me. OTOH I got a weird "corrupted size vs. prev_size" error when running with threads; did you ever get that?
ProfileView looks like it can export and import profiles via JLD, so that might be a good way to exchange profiles.
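For instance (a sketch based on the generic `Profile.retrieve()` + JLD route, not necessarily ProfileView's own format):

```julia
using Profile, JLD

# After running @profile on the code of interest, grab the raw samples
# and the line-information dictionary and save them to a JLD file:
data, lidict = Profile.retrieve()
@save "density_profile.jld" data lidict

# The other side can then do:
#   using JLD, ProfileView
#   @load "density_profile.jld"
#   ProfileView.view(data, lidict=lidict)
```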
Well that's fun: https://gist.github.com/antoine-levitt/41cc5480235d9d4c7e853354ccdb9a45
Yikes. Not that I'm aware of. It is done when …
Got it: …
Opened an issue there: JuliaMath/FFTW.jl#134 |
The failure in Travis is caused by JuliaNLSolvers/NLSolversBase.jl#117. I'll pin NLSolversBase to the last working version. |
Phew ... glad that one's finally in 😄.
Heh, yeah! Hopefully the other PRs are simpler.
First part of #36 (comment)