
exploiting symmetries #16

Open
SRHudson opened this issue Jul 30, 2017 · 20 comments

@SRHudson
Collaborator

Dear Speculators,

I am still working on reducing the computation time in ma00aa and other subroutines. I see that the ma00aa thread has vanished, but not to worry: I am becoming quite adept at opening new issues!

If you examine
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
you will notice that DToocc, DToocs, DToosc and DTooss do not depend on geometry. They need to be calculated only once during the entire calculation. (The Chebyshev resolution may differ in each volume, so some re-indexing of these arrays will be required.)
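For illustration, a minimal sketch of the compute-once pattern being proposed (the module and subroutine names here are hypothetical; in SPEC these arrays live in the global modules, and as noted above some re-indexing would be needed to cope with per-volume Chebyshev resolution):

module precomputed_integrals
  ! Sketch: compute the geometry-independent Chebyshev-Fourier
  ! integrals once, before the volume loop, instead of on every
  ! call to ma00aa.
  implicit none
  logical :: Loointegrals = .false.          ! set once the arrays are filled
  real(8), allocatable :: DToocc(:,:,:,:)    ! and DToocs, DToosc, DTooss alike
contains
  subroutine preset_oo_integrals( Lrad, mn )
    integer, intent(in) :: Lrad, mn          ! Chebyshev and Fourier resolution
    if( Loointegrals ) return                ! already computed: nothing to do
    allocate( DToocc(0:Lrad,0:Lrad,1:mn,1:mn) )
    ! ... fill DToocc etc. exactly as ma00aa does now; no metric
    !     information enters, so this never needs to be repeated ...
    Loointegrals = .true.
  end subroutine preset_oo_integrals
end module precomputed_integrals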

Also, there are various symmetries of the other arrays constructed in ma00aa that could be exploited. Now that I am getting back into the fine print of SPEC, this should not be too difficult, and I intend to pursue this task. It may be that I can further reduce the time of ma00aa by a factor of 4.

However, I would like to measure timing before and after the changes. You are already aware, I believe, of the timing improvements that I made to ma00aa. This was no big achievement really, as this routine had not yet exploited stellarator symmetry. Also, I believe that the l2stell input file is too tiny to really see the impact of the speed improvements that I intend to pursue, and whether the bottleneck in SPEC when we use large Fourier resolution and a large number of interfaces is really with ma00aa or somewhere else. We need to get SPEC fast on experimentally relevant input files. (Eventually, we want every part of the code to be maximally efficient. This is just a question of priority.)

So, I suggest that Joaquim consider the following:

  1. create an appropriately named subdirectory, e.g. 2016PoPLoizu, of the horrendously named TESTCASES directory, and into this directory place some of the input files for W7-X that were used in "Verification of the SPEC code in stellarator geometries", J. Loizu, S.R. Hudson & C. Nührenberg, Phys. Plasmas 23, 112505 (2016);
  2. construct one or more multi-volume W7-X input files. At this stage, it does not even matter if these files converge.

I will proceed with exploiting all the symmetries and so on associated with ma00aa and the related routines.

@jloizu
Collaborator

jloizu commented Jul 30, 2017

OK I agree with the workflow, but then we should first merge the ma00aa branch with master (I will do some final testing) and then create a new branch called something like "fastspec" and work on these matters from there. I will get this set up tomorrow and let you know.

@jloizu
Collaborator

jloizu commented Aug 1, 2017

@SRHudson wrote:

I believe that the l2stell input file is too tiny to really see the impact of the speed improvements that I
intend to pursue, and whether the bottleneck in SPEC when we use large Fourier resolution and a
large number of interfaces is really with ma00aa or somewhere else.

I performed a study with the l2stell.sp testcase, increasing Fourier resolution (Mpol=Ntor=4,6,8), measuring the total computation time, and comparing the two SPEC versions ("fast" is from ma00aa branch, "slow" is from master branch). Find attached the result of this exercise.

l2stell_cputime_versus_fourier.pdf

For example, at m=n=8, the slow code takes ~1558 sec and the fast code takes ~645 sec, a factor of ~2.5 speedup, which seems to remain constant with increasing Fourier resolution.

@jloizu
Collaborator

jloizu commented Aug 1, 2017

Moreover, along with these runs I did some profiling, and found that the ma00aa routine takes an increasing percentage of total CPU time as the Fourier resolution increases. This means that code speedup relies on ma00aa optimization even more at high Fourier resolution.

To illustrate this point, here are some numbers:

"slow spec"
ma00aa 74% at m=n=4 ; 85% at m=n=8

"fast spec"
ma00aa 35% at m=n=4 ; 52% at m=n=8

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

Comment One) What you are calling "slow spec" is actually non-stellarator-symmetric spec, so comparisons between "fast spec", which exploits stellarator symmetry, and "slow spec" are not particularly meaningful. Recall that the only speed improvement that I made to ma00aa (beyond the re-looping and eliminating divisions as suggested by Sam) was to not compute the non-stellarator-symmetric terms, which were nowhere being used.

Comment Two) It is interesting that at higher M, N ma00aa takes a higher percentage of the total time. I will continue to work on this, as I have described earlier (by moving the calculation of DToocc, DToocs, DToosc and DTooss from ma00aa to preset).

@jloizu
Collaborator

jloizu commented Aug 1, 2017

@SRHudson not sure I understand your Comment One,

the only speed improvement that I made to ma00aa (beyond the re-looping and eliminating divisions as suggested by Sam) was to not compute the non-stellarator-symmetric terms, which were nowhere being used.

Does this mean that you modified ma00aa.h in such a way that IF the system is stellarator symmetric THEN those terms are not calculated, but that in general they are still calculated and thus not much speedup is expected?

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

Ultimately, it would be nice to determine a scaling law, e.g.,
T \sim L^a M^b N^c Nv^d
for each subroutine, where L, M and N are the radial, poloidal and toroidal resolutions, Nv is the number of volumes, and a, b, c and d are exponents.
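
For example, holding L, N and Nv fixed and timing two runs at poloidal resolutions M_1 and M_2, the exponent b follows from the log-log slope,
b \approx ln( T_2 / T_1 ) / ln( M_2 / M_1 ),
and similarly for a, c and d by varying one resolution at a time.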

It is very interesting that matrix seems to be taking a lot of time. There is NO computation performed in matrix. It is just loops and assigning matrix elements. I guess that by re-looping the time of matrix can be reduced significantly.

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

In
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
I have color-coded the stellarator symmetric terms and the non-stellarator symmetric terms.

As of last week, all the terms were being calculated regardless of Istellsym. All I did was to not compute the non-stellarator-symmetric terms if Istellsym = 1.

For non-stellarator-symmetric equilibria, as you can see from
http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf
there will be four times as many quantities to compute. We can expect that non-stellarator-symmetric equilibria will take much longer... You can test this: just set Istellsym = 0 in the l2stell.sp input file and take a look. If the boundary is stellarator symmetric, then the equilibria computed with Istellsym = 1 and Istellsym = 0 should be identical.
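
As a minimal sketch of the change (only Istellsym is a real SPEC input variable here; the rest is illustrative):

program istellsym_guard
  ! Sketch: compute the non-stellarator-symmetric terms only
  ! when they are needed, i.e. when Istellsym = 0.
  implicit none
  integer :: Istellsym
  Istellsym = 1                  ! 1 = stellarator symmetric (the l2stell case)
  ! ... compute the stellarator-symmetric terms here, unconditionally ...
  if( Istellsym.eq.0 ) then
    ! ... compute the non-stellarator-symmetric (red) terms here ...
    continue
  endif
end program istellsym_guard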

@jloizu
Collaborator

jloizu commented Aug 1, 2017

@SRHudson OK I understand, although somehow I do not see any red terms when I open

http://w3.pppl.gov/~shudson/Spec/ma00aa.pdf

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

How are you doing the timing?
I implemented a timing capability. It might be clumsy, and I don't know if it works. Set Ltiming=1 in the &diagnos namelist and the screen output should give the time taken by each routine. When running on a single CPU (with Linitialize=1 in l2stell.sp) I see:
mp00ac : 45.35s
ma00aa : 21.96s
tr00ab : 7.68s
You will see that each call to a subroutine is performed using the macro WCALL, which automatically adds time differences to a T(filename) variable. See also the macro BEGIN, which also adds time.
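
For anyone comparing the two approaches, here is roughly what such a timing wrapper does (a self-contained sketch only; the actual WCALL and GETTIME macros in the SPEC source differ in detail):

program timing_sketch
  ! Bracket a call with MPI_WTIME() and accumulate the difference
  ! into a per-routine total, as the WCALL macro does.
  use mpi
  implicit none
  integer  :: ierr
  real(8)  :: tstart, tfinish
  real(8)  :: Troutine = 0.0d0       ! accumulated time, printed at finish
  call MPI_INIT( ierr )
  tstart  = MPI_WTIME()
  call some_routine()                ! stands in for e.g. ma00aa
  tfinish = MPI_WTIME()
  Troutine = Troutine + ( tfinish - tstart )
  write(*,'(a,f10.2,a)') ' time spent in some_routine = ', Troutine, ' ;'
  call MPI_FINALIZE( ierr )
contains
  subroutine some_routine()          ! dummy payload for the sketch
    integer :: i
    real(8) :: s
    s = 0.0d0
    do i = 1, 10000000
      s = s + sqrt( dble(i) )
    enddo
    if( s.lt.0.0d0 ) write(*,*) s    ! prevents optimizing the loop away
  end subroutine some_routine
end program timing_sketch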

@lazersos
Collaborator

lazersos commented Aug 1, 2017

@SRHudson I believe @jloizu is doing the timing using the standard GNU profiling extension (gprof). This is useful as it automatically calculates timing for all routines.

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

So, do the results from the different timings agree?

@jloizu
Collaborator

jloizu commented Aug 1, 2017

No, the times do not agree. This is how I do it, you can try:

  1. compile the code with the profiling option, e.g.

make CC=intel_prof

  2. run the code normally; after execution, that should generate a file called gmon.out

  3. finally, type the command

gprof xspec > somename.out

  4. you can see all the profiling information inside somename.out

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

Ok, I made with CC=intel_prof, ran xspec, gprofed xspec and got this:

  %    cumulative     self                self     total
 time     seconds   seconds    calls    s/call    s/call   name
73.54       22.10     22.10      410      0.05      0.06   ma00aa_
 8.92       24.78      2.68      410      0.01      0.01   matrix_
 4.36       26.09      1.31       43      0.03      0.69   dforce_
 3.76       27.22      1.13      412      0.00      0.00   volume_

which I think is what Joaquim has been reporting: that ma00aa is where all the time is. This method also shows that the code took about 30 sec to complete.

My timing shows that mp00ac was taking all the time, and that the code took about 83 sec. I guess that most of you will regard the standard gnu profiling extension as being the more reliable, but my timing agrees with my watch!!

Perhaps the difference is because I am using MPI_WTIME() to provide the time. What is more important: the "timing" provided by the so-called standard gnu profiling extension, or the time that the user has to wait to get the result?

@SRHudson
Collaborator Author

SRHudson commented Aug 1, 2017

I am still curious about the timing. Can the gnu profiling handle the calls to the NAG library?

Also, is there a way to "profile" the memory requirements? I have absolutely no clue how much memory SPEC uses, but I am sure that the memory can be reduced significantly.

@lazersos
Collaborator

lazersos commented Aug 2, 2017

@SRHudson, no, it can't tell you the time spent in those subroutines because we don't compile those libraries, we only link to them. However, while gprof is nice, there are better tools. And the discrepancy between gprof and wall-clock timing is well documented.

Regarding memory, you just need to compile with the -g option and run with valgrind as a first pass. The tool is good for detecting uninitialized variables, failures to deallocate variables, and other memory-related issues. At some level memory reduction begins to interfere with the readability of codes. I suspect this is why many legacy codes read more like letter salad as they attempt to reuse variables.
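
As a minimal, hypothetical illustration (not taken from SPEC) of the kind of defect such a first pass catches, compile something like the following with -g and run it under valgrind; the pointer allocation that becomes unreachable is reported as "definitely lost":

program leak_sketch
  ! A pointer allocation that is never deallocated and becomes
  ! unreachable: valgrind flags this as a memory leak.
  implicit none
  real(8), pointer :: work(:)
  allocate( work(1000) )
  work = 0.0d0
  nullify( work )                    ! the allocation is now unreachable
end program leak_sketch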

@jloizu
Collaborator

jloizu commented Aug 2, 2017

OK I tried the Ltiming option and also got that mp00ac takes most of the time (by far: 10 times more than any other subroutine). However, it is weird because if I rerun the same case it prints out different numbers, e.g. for the mp00ac subroutine:
FIRST RUN

finish : 24.02 : time spent in mp00ac = 17.33 ;

SECOND RUN

finish : 75.73 : time spent in mp00ac = 54.48 ;

while both runs took the same "watch time"... Also, on the IPP cluster I get all zeros if I try Ltiming... Perhaps some environment variable related to timing is messing things up?

In any case, it looks like Ltiming says that mp00ac dominates CPU time, while gprof says that ma00aa dominates: maybe the latter neglects the NAG calls; that could be the reason. If this is the case, then we should trust the Ltiming output and try to optimize mp00ac.

@SRHudson
Collaborator Author

SRHudson commented Aug 2, 2017

  1. I think that the Ltiming option will only be reliable if only one CPU is used. Could this explain your strange result?

  2. mp00ac does little more than call a NAG routine to solve a linear equation for the Beltrami field in each volume (see the sketch after this list), so it is entirely consistent that if gprof ignores NAG calls then mp00ac will not show up. Also, remember that the primary calculation that SPEC performs is to compute the Beltrami field in each volume, so it is entirely expected that this will be a significant cost. Also, ma00aa computes the volume integrals of the Chebyshev-Fourier-metric products, so it is similarly expected that this will be costly.

  3. If you have a look around in the source, you will see GETTIME and WCALL macros, which call MPI_WTIME(). If this is not appropriate on a given machine, then Ltiming will be meaningless.

  4. Yesterday, I identified that the linear solver in mp00ac was dominating, and Josh has kindly begun work on this. Initial timing improvements are very encouraging, and there is more that Josh and I will do over the next few days to further reduce the computation.

  5. I will not concentrate on memory at present, but I hope that Joaquim will in the not-too-distant future investigate SPEC from a memory perspective and identify any areas of concern.
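
To make item 2 concrete, here is a sketch of the kind of dense linear solve involved, with LAPACK's dgesv as a stand-in (the actual NAG routine that mp00ac calls, and the matrix contents and sizes, are of course different; this only illustrates the operation that dominates the cost):

program solve_sketch
  ! Solve A x = b for a small dense system; in mp00ac the matrix
  ! couples the Chebyshev-Fourier coefficients of the Beltrami field.
  implicit none
  integer, parameter :: n = 3
  integer  :: ipiv(n), info
  real(8)  :: A(n,n), b(n)
  A = reshape( (/ 4d0, 1d0, 0d0,  1d0, 4d0, 1d0,  0d0, 1d0, 4d0 /), (/ n, n /) )
  b = (/ 1d0, 2d0, 3d0 /)
  call dgesv( n, 1, A, n, ipiv, b, n, info )   ! on exit, b holds x
  if( info.ne.0 ) stop 'dgesv failed'
  write(*,'(a,3f10.5)') ' x = ', b
end program solve_sketch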

@jloizu
Collaborator

jloizu commented Aug 3, 2017

If I understand correctly, there is an ongoing "speedup effort" which translates into modifications of the ma00aa branch (which should change to a more generic name, since it seems that other routines like mp00ac are crucial). If that is the case, then we should cancel the "pull request" that has been open for a while with the initial idea of merging that branch into master - but that should take quite some time.

@SRHudson
Collaborator Author

SRHudson commented Aug 3, 2017

I have pushed my changes to ma00aa. I will take a break on this topic now, so this would be a good time for you to pull and merge this branch with the master. We should also merge NAG_REPLACE with the master. I expect that we should not let different branches get too far removed from each other, and it is helpful for checking the modifications to NAG_REPLACE if the executable is fast.

Speed improvements to spec will be an ongoing exercise, which suggests that we close ma00aa and open another branch called "rapido", or "azogue", or "relampago". I will let our Spanish expert have the final say on this.

@jloizu
Collaborator

jloizu commented Aug 7, 2017

The ma00aa branch has been merged into master and deleted.

I would wait until the NAG_REPLACE branch reaches a point at which the code runs well (from the discussions, it seems that some debugging is underway). Then I can perform some tests and compare the master and NAG_REPLACE branches, before considering merging.

If some new "speedup" improvements are to be made, e.g. by exploiting symmetries in mp00ac, then the person starting such changes should create a new branch,

git branch relampago

make some initial changes, and then,

git add -A
git commit -m "initial commit message"
git push -u origin relampago
