
Enable parallel processing for large-array transformations #1057

karlmsmith commented Nov 23, 2017

Reported by @karlmsmith on 5 Jan 2011 22:39 UTC
Enable parallel processing for large-array transformations defined in do_4d_trans.F. Using threads (pthread_create, pthread_join, pthread_exit in C) should minimize the overhead of starting the parallel processing and simplify communication.

Migrated-From: http://dunkel.pmel.noaa.gov/trac/ferret/ticket/1785


karlmsmith commented Nov 23, 2017

Comment by @karlmsmith on 8 Jan 2011 00:40 UTC
Hi Steve,

After reviewing the code again, I see where you are going. This would be parallel computation of each result data point. So if it were a time-average of SST and we had a huge number of processors, each processor would compute a time-averaged SST at a given lat-lon point.

However, threads are expensive to create and we have a limited number of processors. So instead, num_parallel (new) threads are created, and the parent (the original process) hands out "lat-lon" points to each thread at the start and again whenever one finishes a computation.

Definite yes on a ferret command setting the number of threads to create, including zero == no parallelization.

Some of the gory details, if interested:

The parent will have an array of num_parallel integers for telling each thread which "lat-lon" to do. Each thread reads only its own value from this array, does the computation, and writes its result directly to the results matrix. It then sets its "lat-lon" value to the "idle" value and waits for the parent to give it a new "lat-lon" value or the "quit" value.

There will also be some flag/counter (a condition variable) which wakes the parent when any thread is ready for another input point. This keeps the parent from wasting cycles by constantly polling for an idle thread. Each element of the num_parallel integer array would also act as a flag to wake its thread once the parent has assigned a new value.

Karl

On 1/6/2011 8:59 AM, Steve Hankin wrote:

Hi Karl,

Some thoughts to set initial directions here.

The key routine is doo/DO_4D_TRANS. It performs @ngd, @nbd, @sum, @din,
@ave and @var over regions of up to 4D. In the initialization phase the
routine sets up logical variables yes_xax, yes_yax, ... that tell which
axes will be transformed (see the equivalenced array). The
complementary array no_ax (provided for readability only) tells you
which axes are suitable to break up into parallel operations.

The block of code just below "* LOOP OVER THE FULL RANGE OF THE RESULT"
(setting up the "300" loop) is the point where major surgery for
parallelization needs to occur. I think if I were taking on this task
the first thing I might do is to take all of the code lying inside of
the 300 loop and reduce it to a single subroutine call. Then I would
test to make sure that I had not degraded performance too much just by
doing this (because subroutine calls are pretty inefficient in FORTRAN
in my experience). If performance is not too degraded this would put you
at a clean point to decompose the 300 loop into parallel calls. AFAIK
each point in the multi-dimensional space defined by the no_ax array
(and corresponding index limits for those axes) represents a calculation
that can be performed independently of any other point in that space.

Discussions?

Steve


karlmsmith commented Nov 23, 2017

Comment by @karlmsmith on 9 Mar 2011 19:56 UTC
Using an approximately 8 GB file, and forcing clearing of the OS cache, it appears that using 4 processors roughly doubles the speed when averaging over some combinations of the axes (e.g., [Z=@ave,T=@ave]). If Ferret detects that it can do a partial read from file and compute a result, then another partial read and compute, and so on (e.g., [X=@ave,Y=@ave,Z=@ave]), the parallelization is skipped, since Ferret is repeatedly calling to compute only one result. So more reprogramming will be necessary to force Ferret to read in all the data and make a single call that computes all the results, in order to enable parallelization. (Multiple results are computed in parallel with each other, but the computation of a single result is not broken into parallel pieces.)

So we need to decide the priority of further development effort for these simple transformations.


karlmsmith commented Nov 23, 2017

Comment by @karlmsmith on 20 Apr 2011 22:26 UTC

Here's a pretty startling (and illuminating) result:

As a pragmatic matter I realized that the common averaging cases were XY, XYZ, and XYZT, plus T alone and Z alone ... seldom YZT. Since the cases starting with X seemed OK as-is in do_4d_trans, and the cases of T alone and Z alone do not even use do_4d_trans, it seemed like a poor use of time to worry about permuting arrays to optimize that routine. So instead I compared the speed of averaging in 1D only -- an average along X (the sequential axis) versus an average along T. (In the test code here, the T average has to access every 5th point:)

set memory/size=2000

! xaverage test
let myvar = x[x=1:`200*200*200*200`]*t[t=1:5]
load myvar
sp date; load myvar[x=@ave]; sp date
canc mem/all

! T average test
let myvar = t[t=1:`200*200*200*200`]*x[x=1:5]
load myvar
sp date; load myvar[t=@ave]; sp date

Here's the thing: both of these averages SCREAM. They are orders of magnitude faster than the best case of do_4d_trans.F.

What do we make of this? My hunch is that it comes from the inability to optimize the loops, given the way the increments are handled in do_4d_trans.F. By comparison, do_ave_int.F handles the different looping directions with altogether separate loops, selected by cases (IF ... ELSE IF ... ELSE IF ...).

So here's what I propose as a test: create a hack-modified version of do_4d_trans, in which it sets up a special loop for an XYZ average -- replacing loop 300 with something that is patterned after do_ave_int.F.

IF (its an XYZ average) THEN
    DO 300 l = cx_lo_s4, cx_hi_s4    ! T range
        <set up variables>
        DO 300 k = cx_lo_s3, cx_hi_s3
        DO 300 j = cx_lo_s2, cx_hi_s2
        DO 300 i = cx_lo_s1, cx_hi_s1
            <compute an average>
 300 CONTINUE
ENDIF

My hope would be that in this form the optimizer can get its teeth into the loops and they'll run at a speed similar to the 1-D X average.


Adding @AndrewWittenberg
