
exploiting symmetries #16

Open
SRHudson opened this issue Jul 30, 2017 · 20 comments

@SRHudson
Collaborator

Dear Speculators,

I am still working on reducing the computation time in ma00aa and other subroutines. I see that the ma00aa thread has vanished, but not to worry: I am becoming quite adept at opening new issues!

If you examine
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
you will notice that DToocc, DToocs, DToosc and DTooss do not depend on geometry. They need to be calculated only once during the entire calculation. (The Chebyshev resolution may differ in each volume, so some re-indexing of these arrays will be required.)
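For illustration, a minimal sketch of the compute-once pattern being proposed (the module and subroutine names here are hypothetical; in SPEC these arrays live in the global modules, and as noted above some re-indexing would be needed to cope with per-volume Chebyshev resolution):

module precomputed_integrals
  ! Sketch: compute the geometry-independent Chebyshev-Fourier
  ! integrals once, before the volume loop, instead of on every
  ! call to ma00aa.
  implicit none
  logical :: Loointegrals = .false.          ! set once the arrays are filled
  real(8), allocatable :: DToocc(:,:,:,:)    ! and DToocs, DToosc, DTooss alike
contains
  subroutine preset_oo_integrals( Lrad, mn )
    integer, intent(in) :: Lrad, mn          ! Chebyshev and Fourier resolution
    if( Loointegrals ) return                ! already computed: nothing to do
    allocate( DToocc(0:Lrad,0:Lrad,1:mn,1:mn) )
    ! ... fill DToocc etc. exactly as ma00aa does now; no metric
    !     information enters, so this never needs to be repeated ...
    Loointegrals = .true.
  end subroutine preset_oo_integrals
end module precomputed_integrals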

Also, there are various symmetries of the other arrays constructed in ma00aa that could be exploited. Now that I am getting back into the fine print of SPEC, this should not be too difficult, and I intend to pursue this task. It may be that I can further reduce the time of ma00aa by a factor of 4.

However, I would like to measure timing before and after the changes. You are already aware, I believe, of the timing improvements that I made to ma00aa. This was no big achievement really, as this routine had not yet exploited stellarator symmetry. Also, I believe that the l2stell input file is too tiny to really see the impact of the speed improvements that I intend to pursue, and whether the bottleneck in SPEC when we use large Fourier resolution and a large number of interfaces is really with ma00aa or somewhere else. We need to get SPEC fast on experimentally relevant input files. (Eventually, we want every part of the code to be maximally efficient. This is just a question of priority.)

So, I suggest that Joaquim consider the following:

  1. create an appropriately named subdirectory, e.g. 2016PoPLoizu, of the horrendously named TESTCASES directory, and into this directory place some of the input files for W7-X that were used in "Verification of the SPEC code in stellarator geometries", J. Loizu, S.R. Hudson & C. Nührenberg, Phys. Plasmas 23, 112505 (2016);
  2. construct one or more multi-volume W7-X input files. At this stage, it does not even matter if these files converge.

I will proceed with exploiting all the symmetries and so on associated with ma00aa and the related routines.

@jloizu
Collaborator

jloizu commented Jul 30, 2017

OK I agree with the workflow, but then we should first merge the ma00aa branch with master (I will do some final testing) and then create a new branch called something like "fastspec" and work on these matters from there. I will get this set up tomorrow and let you know.

@jloizu
Collaborator

jloizu commented Aug 1, 2017

@SRHudson wrote:

I believe that the l2stell input file is too tiny to really see the impact of the speed improvements that I
intend to pursue, and whether the bottleneck in SPEC when we use large Fourier resolution and a
large number of interfaces is really with ma00aa or somewhere else.

I performed a study with the l2stell.sp testcase, increasing Fourier resolution (Mpol=Ntor=4,6,8), measuring the total computation time, and comparing the two SPEC versions ("fast" is from ma00aa branch, "slow" is from master branch). Find attached the result of this exercise.

l2stell_cputime_versus_fourier.pdf

For example, at m=n=8, the slow code takes ~1558 sec and the fast code takes ~645 sec, a factor of ~2.5 speedup, which seems to remain constant with increasing Fourier resolution.

@jloizu
Collaborator

jloizu commented Aug 1, 2017

Moreover, along with these runs I did some profiling, and found that the ma00aa routine takes an increasing percentage of total CPU time as the Fourier resolution increases. This means that code speedup relies on ma00aa optimization even more at high Fourier resolution.

To illustrate this point, here are some numbers:

"slow spec"
ma00aa 74% at m=n=4 ; 85% at m=n=8

"fast spec"
ma00aa 35% at m=n=4 ; 52% at m=n=8

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

Comment One) What you are calling "slow spec" is actually non-stellarator-symmetric spec, so comparisons between "fast spec", which exploits stellarator symmetry, and "slow spec" are not particularly meaningful. Recall that the only speed improvement that I made to ma00aa (beyond the re-looping and eliminating divisions as suggested by Sam) was to not compute the non-stellarator-symmetric terms, which were nowhere being used.

Comment Two) It is interesting that at higher M, N ma00aa takes a higher percentage of the total time. I will continue to work on this, as I have described earlier (by moving the calculation of DToocc, DToocs, DToosc and DTooss from ma00aa to preset).

@jloizu
Collaborator

jloizu commented Aug 1, 2017

@SRHudson not sure I understand your Comment One,

the only speed improvement that I made to ma00aa (beyond the re-looping and eliminating divisions as suggested by Sam) was to not compute the non-stellarator-symmetric terms, which were nowhere being used.

Does this mean that you modified ma00aa.h in such a way that IF the system is stellarator symmetric THEN those terms are not calculated, but that in general they are still calculated and thus not much speedup is expected?

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

Ultimately, it would be nice to determine a scaling law, e.g.,
T \sim L^a M^b N^c Nv^d
for each subroutine, where L, M and N are the radial, poloidal and toroidal resolutions, Nv is the number of volumes, and a, b, c and d are exponents.
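
For example, holding L, N and Nv fixed and timing two runs at poloidal resolutions M_1 and M_2, the exponent b follows from the log-log slope,
b \approx ln( T_2 / T_1 ) / ln( M_2 / M_1 ),
and similarly for a, c and d by varying one resolution at a time.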

It is very interesting that matrix seems to be taking a lot of time. There is NO computation performed in matrix. It is just loops and assigning matrix elements. I guess that by re-looping the time of matrix can be reduced significantly.

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

In
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
I have color-coded the stellarator symmetric terms and the non-stellarator symmetric terms.

As of last week, all the terms were being calculated regardless of Istellsym. All I did was to not compute the non-stellarator-symmetric terms if Istellsym = 1.

For non-stellarator-symmetric equilibria, as you can see from
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
there will be four times as many quantities to compute. We can expect that non-stellarator-symmetric equilibria will take much longer... You can test this: just set Istellsym = 0 in the l2stell.sp input file and take a look. If the boundary is stellarator symmetric, then the equilibria computed with Istellsym = 1 and Istellsym = 0 should be identical.
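
As a minimal sketch of the change (only Istellsym is a real SPEC input variable here; the rest is illustrative):

program istellsym_guard
  ! Sketch: compute the non-stellarator-symmetric terms only
  ! when they are needed, i.e. when Istellsym = 0.
  implicit none
  integer :: Istellsym
  Istellsym = 1                  ! 1 = stellarator symmetric (the l2stell case)
  ! ... compute the stellarator-symmetric terms here, unconditionally ...
  if( Istellsym.eq.0 ) then
    ! ... compute the non-stellarator-symmetric (red) terms here ...
    continue
  endif
end program istellsym_guard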

@jloizu
Collaborator

jloizu commented Aug 1, 2017

@SRHudson OK I understand, although somehow I do not see any red terms when I open

http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

How are you doing the timing?
I implemented a timing capability. It might be clumsy, and I don't know if it works. Set Ltiming=1 in the &diagnos namelist and the screen output should give the time taken by each routine. When running on a single CPU (with Linitialize=1 in l2stell.sp) I see:
mp00ac : 45.35s
ma00aa : 21.96s
tr00ab : 7.68s
You will see that each call to a subroutine is performed using the macro WCALL, which automatically adds time differences to a T(filename) variable. See also the macro BEGIN, which also adds time.
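
For anyone comparing the two approaches, here is roughly what such a timing wrapper does (a self-contained sketch only; the actual WCALL and GETTIME macros in the SPEC source differ in detail):

program timing_sketch
  ! Bracket a call with MPI_WTIME() and accumulate the difference
  ! into a per-routine total, as the WCALL macro does.
  use mpi
  implicit none
  integer  :: ierr
  real(8)  :: tstart, tfinish
  real(8)  :: Troutine = 0.0d0       ! accumulated time, printed at finish
  call MPI_INIT( ierr )
  tstart  = MPI_WTIME()
  call some_routine()                ! stands in for e.g. ma00aa
  tfinish = MPI_WTIME()
  Troutine = Troutine + ( tfinish - tstart )
  write(*,'(a,f10.2,a)') ' time spent in some_routine = ', Troutine, ' ;'
  call MPI_FINALIZE( ierr )
contains
  subroutine some_routine()          ! dummy payload for the sketch
    integer :: i
    real(8) :: s
    s = 0.0d0
    do i = 1, 10000000
      s = s + sqrt( dble(i) )
    enddo
    if( s.lt.0.0d0 ) write(*,*) s    ! prevents optimizing the loop away
  end subroutine some_routine
end program timing_sketch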

@lazersos
Collaborator

lazersos commented Aug 1, 2017

@SRHudson I believe @jloizu is doing the timing using the standard GNU profiling extension (gprof). This is useful as it automatically calculates timing for all routines.

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

So, do the results from the different timings agree?

@jloizu
Collaborator

jloizu commented Aug 1, 2017

No, the times do not agree. This is how I do it, you can try:

  1. compile the code with the profiling option, e.g.

make CC=intel_prof

  2. run the code normally; after execution, that should generate a file called gmon.out

  3. finally, type the command

gprof xspec > somename.out

  4. you can see all the profiling information inside somename.out

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

Ok, I made with CC=intel_prof, ran xspec, gprofed xspec and got this:

  %    cumulative     self                self     total
 time     seconds   seconds    calls    s/call    s/call   name
73.54       22.10     22.10      410      0.05      0.06   ma00aa_
 8.92       24.78      2.68      410      0.01      0.01   matrix_
 4.36       26.09      1.31       43      0.03      0.69   dforce_
 3.76       27.22      1.13      412      0.00      0.00   volume_

which I think is what Joaquim has been reporting: that ma00aa is where all the time is. This method also shows that the code took about 30 sec to complete.

My timing shows that mp00ac was taking all the time, and that the code took about 83 sec. I guess that most of you will regard the standard gnu profiling extension as being the more reliable, but my timing agrees with my watch!!

Perhaps the difference is because I am using MPI_WTIME() to provide the time. What is more important: the "timing" provided by the so-called standard gnu profiling extension, or the time that the user has to wait to get the result?

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

I am still curious about the timing. Can the gnu profiling handle the calls to the NAG library?

Also, is there a way to "profile" the memory requirements? I have absolutely no clue how much memory SPEC uses, but I am sure that the memory can be reduced significantly.

@lazersos
Collaborator

lazersos commented Aug 2, 2017

@SRHudson, no, it can't tell you the time spent in those subroutines because we don't compile those libraries, we only link to them. However, while gprof is nice, there are better tools. And the discrepancy between gprof and wall-clock timing is well documented.

Regarding memory, you just need to compile with the -g option and run with valgrind as a first pass. The tool is good for detecting uninitialized variables, failures to deallocate variables, and other memory-related issues. At some level memory reduction begins to interfere with the readability of codes. I suspect this is why many legacy codes read more like letter salad as they attempt to reuse variables.
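
As a minimal, hypothetical illustration (not taken from SPEC) of the kind of defect such a first pass catches, compile something like the following with -g and run it under valgrind; the pointer allocation that becomes unreachable is reported as "definitely lost":

program leak_sketch
  ! A pointer allocation that is never deallocated and becomes
  ! unreachable: valgrind flags this as a memory leak.
  implicit none
  real(8), pointer :: work(:)
  allocate( work(1000) )
  work = 0.0d0
  nullify( work )                    ! the allocation is now unreachable
end program leak_sketch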

@jloizu
Collaborator

jloizu commented Aug 2, 2017

OK I tried the Ltiming option and also got that mp00ac takes most of the time (by far: 10 times more than any other subroutine). However, it is weird because if I rerun the same case it prints out different numbers, e.g. for the mp00ac subroutine:
FIRST RUN

finish : 24.02 : time spent in mp00ac = 17.33 ;

SECOND RUN

finish : 75.73 : time spent in mp00ac = 54.48 ;

while both runs took the same "watch time"... Also, on the IPP cluster I get all zeros if I try Ltiming... Perhaps some environment variable related to timing is messing things up?

In any case, it looks like Ltiming says that mp00ac dominates CPU time, while gprof says that ma00aa dominates: maybe the latter neglects the NAG calls; that could be the reason. If this is the case, then we should trust the Ltiming output and try to optimize mp00ac.

@SRHudson
Collaborator Author

SRHudson commented Aug 2, 2017

  1. I think that the Ltiming option will only be reliable if only one CPU is used. Could this explain your strange result?

  2. mp00ac does little more than call a NAG routine to solve a linear equation for the Beltrami field in each volume (see the sketch after this list), so it is entirely consistent that if gprof ignores NAG calls then mp00ac will not show up. Also, remember that the primary calculation that SPEC performs is to compute the Beltrami field in each volume, so it is entirely expected that this will be a significant cost. Also, ma00aa computes the volume integrals of the Chebyshev-Fourier-metric products, so it is similarly expected that this will be costly.

  3. If you have a look around in the source, you will see GETTIME and WCALL macros, which call MPI_WTIME(). If this is not appropriate on a given machine, then Ltiming will be meaningless.

  4. Yesterday, I identified that the linear solver in mp00ac was dominating, and Josh has kindly begun work on this. Initial timing improvements are very encouraging, and there is more that Josh and I will do over the next few days to further reduce the computation.

  5. I will not concentrate on memory at present, but I hope that Joaquim will in the not-too-distant future investigate SPEC from a memory perspective and identify any areas of concern.
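
To make item 2 concrete, here is a sketch of the kind of dense linear solve involved, with LAPACK's dgesv as a stand-in (the actual NAG routine that mp00ac calls, and the matrix contents and sizes, are of course different; this only illustrates the operation that dominates the cost):

program solve_sketch
  ! Solve A x = b for a small dense system; in mp00ac the matrix
  ! couples the Chebyshev-Fourier coefficients of the Beltrami field.
  implicit none
  integer, parameter :: n = 3
  integer  :: ipiv(n), info
  real(8)  :: A(n,n), b(n)
  A = reshape( (/ 4d0, 1d0, 0d0,  1d0, 4d0, 1d0,  0d0, 1d0, 4d0 /), (/ n, n /) )
  b = (/ 1d0, 2d0, 3d0 /)
  call dgesv( n, 1, A, n, ipiv, b, n, info )   ! on exit, b holds x
  if( info.ne.0 ) stop 'dgesv failed'
  write(*,'(a,3f10.5)') ' x = ', b
end program solve_sketch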

@jloizu
Collaborator

jloizu commented Aug 3, 2017

If I understand correctly, there is an ongoing "speedup effort" which translates into modifications of the ma00aa branch (which should change to a more generic name, since it seems that other routines like mp00ac are crucial). If that is the case, then we should cancel the "pull request" that has been open for a while with the initial idea of merging that branch into master - but that should take quite some time.

@SRHudson
Collaborator Author

SRHudson commented Aug 3, 2017

I have pushed my changes to ma00aa. I will take a break on this topic now, so this would be a good time for you to pull and merge this branch with the master. We should also merge NAG_REPLACE with the master. I expect that we should not let different branches get too far removed from each other, and it is helpful for checking the modifications to NAG_REPLACE if the executable is fast.

Speed improvements to spec will be an ongoing exercise, which suggests that we close ma00aa and open another branch called "rapido", or "azogue", or "relampago". I will let our Spanish expert have the final say on this.

@jloizu
Collaborator

jloizu commented Aug 7, 2017

The ma00aa branch has been merged into master and deleted.

I would wait until the NAG_REPLACE branch reaches a point at which the code runs well (from the discussions, it seems that some debugging is underway). Then I can perform some tests and compare the master and NAG_REPLACE branches, before considering merging.

If some new "speedup" improvements are to be made, e.g. by exploiting symmetries in mp00ac, then the person starting such changes should create a new branch,

git branch relampago

make some initial changes, and then,

git add -A
git commit -m "initial commit message"
git push -u origin relampago
