Enable parallel processing for large-array transformations #1057
Comments
Comment by @karlmsmith on 8 Jan 2011 00:40 UTC After reviewing the code again, I see where you are going. This would be parallel computation of each result data point. So if it were a time-average of SST and we had a huge number of processors, each processor would compute a time-averaged SST at a given lat-lon point. However, threads are expensive to create and we have a limited number of processors. So instead num_parallel (new) threads are created, and the parent (original process) hands out "lat-lon" points to each thread at the start and whenever a thread finishes a computation.

Definite yes on a ferret command setting the number of threads to create, including zero == no parallelization.

Some of the gory details, if interested: The parent will have an array of num_parallel integers for telling each thread which "lat-lon" to do. Each thread reads only its own value from this array, does the computation, and writes its result directly to the results matrix. It then sets its "lat-lon" value to the "idle" value and waits for the parent to give it a new "lat-lon" value or the "quit" value. There will also be some flag/counter (condition variable) which wakes the parent up when any thread is ready for another input point. This keeps the parent from wasting cycles by constantly checking for an idle thread. Each element of the num_parallel integer array would also be a flag to wake up the thread once the parent has assigned a new value.

Karl

On 1/6/2011 8:59 AM, Steve Hankin wrote:
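(Editorial sketch, not part of the original thread.) The hand-out scheme Karl describes could look roughly like the C code below. It is only a sketch under stated assumptions: the names `time_average`, `pool_t`, `SLOT_IDLE`, and `SLOT_QUIT` are hypothetical, the data layout (`data[point * nt + t]`) is invented for illustration, and a single shared condition variable with broadcast stands in for the per-thread wake-up flags mentioned in the comment. None of this is taken from do_4d_trans.F.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define SLOT_IDLE (-1)  /* worker is waiting for a point (hypothetical value) */
#define SLOT_QUIT (-2)  /* worker should exit (hypothetical value) */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;  /* broadcast whenever any slot changes */
    int            *slot;     /* slot[i]: point for worker i, or IDLE/QUIT */
    const double   *data;     /* assumed layout: data[point * nt + t] */
    double         *result;   /* result[point] */
    int             nt;
} pool_t;

typedef struct { pool_t *pool; int id; } worker_arg_t;

static double average_point(const double *data, int nt, int point)
{
    double sum = 0.0;
    for (int t = 0; t < nt; t++)
        sum += data[(size_t)point * nt + t];
    return sum / nt;
}

static void *worker(void *p)
{
    worker_arg_t *arg = p;
    pool_t *pool = arg->pool;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->slot[arg->id] == SLOT_IDLE)   /* wait for an assignment */
            pthread_cond_wait(&pool->changed, &pool->lock);
        int point = pool->slot[arg->id];
        pthread_mutex_unlock(&pool->lock);
        if (point == SLOT_QUIT)
            return NULL;
        /* Each worker writes a distinct element, so no lock is held here. */
        pool->result[point] = average_point(pool->data, pool->nt, point);
        pthread_mutex_lock(&pool->lock);
        pool->slot[arg->id] = SLOT_IDLE;
        pthread_cond_broadcast(&pool->changed);    /* wake the parent */
        pthread_mutex_unlock(&pool->lock);
    }
}

/* Returns 0 on success, -1 if allocation fails or no thread could start. */
int time_average(const double *data, int npoints, int nt,
                 int num_parallel, double *result)
{
    assert(npoints >= 0 && nt > 0 && num_parallel >= 0);
    if (num_parallel == 0) {                /* zero == no parallelization */
        for (int p = 0; p < npoints; p++)
            result[p] = average_point(data, nt, p);
        return 0;
    }
    pool_t pool = { .data = data, .result = result, .nt = nt };
    pthread_t *tid = malloc(num_parallel * sizeof *tid);
    worker_arg_t *args = malloc(num_parallel * sizeof *args);
    pool.slot = malloc(num_parallel * sizeof *pool.slot);
    if (!tid || !args || !pool.slot) {
        free(tid); free(args); free(pool.slot);
        return -1;
    }
    pthread_mutex_init(&pool.lock, NULL);
    pthread_cond_init(&pool.changed, NULL);
    int started = 0;
    for (int i = 0; i < num_parallel; i++) {
        pool.slot[i] = SLOT_IDLE;
        args[i] = (worker_arg_t){ &pool, i };
        if (pthread_create(&tid[i], NULL, worker, &args[i]) != 0)
            break;                          /* carry on with fewer threads */
        started++;
    }
    int next = 0, quit_sent = 0;
    pthread_mutex_lock(&pool.lock);
    while (quit_sent < started) {
        for (int i = 0; i < started; i++) { /* hand work to idle workers */
            if (pool.slot[i] != SLOT_IDLE)
                continue;
            if (next < npoints) {
                pool.slot[i] = next++;
            } else {
                pool.slot[i] = SLOT_QUIT;
                quit_sent++;
            }
        }
        pthread_cond_broadcast(&pool.changed);
        if (quit_sent < started)            /* sleep until a worker is idle */
            pthread_cond_wait(&pool.changed, &pool.lock);
    }
    pthread_mutex_unlock(&pool.lock);
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    pthread_cond_destroy(&pool.changed);
    pthread_mutex_destroy(&pool.lock);
    free(tid); free(args); free(pool.slot);
    return started > 0 ? 0 : -1;
}
```

Calling `time_average(data, npoints, nt, 3, result)` would use three worker threads, while passing 0 runs the plain serial loop. The parent sleeps on the condition variable rather than polling, which is the point of the flag/counter in the description; whether one shared condition variable or one per thread performs better would need to be measured.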
Comment by @karlmsmith on 9 Mar 2011 19:56 UTC So we need to decide the priority of further development effort for these simple transformations.
Comment by @karlmsmith on 20 Apr 2011 22:26 UTC Here's a pretty startling (and illuminating) result: As a pragmatic matter I realized that the common averaging cases were XY, XYZ, and XYZT, plus T alone and Z alone; seldom YZT. Since the cases starting with X seemed OK as-is in do_4d_trans.F, and the cases of T alone and Z alone do not even use do_4d_trans.F, it seemed like a poor use of time to worry about permuting arrays to optimize that routine. So instead I tried a comparison of the speed of averaging in 1D only, comparing an average along X (the sequential axis) with an average along T. (In the test code here, the T average has to access every 5th point:)
Here's the thing: both of these averages SCREAM. They are orders of magnitude faster than the best case of do_4d_trans.F. What do we make of this? My hunch is that it comes from the compiler's inability to optimize the loops, given the way the increments are handled in do_4d_trans.F. By comparison, the different looping directions are handled in do_ave_int.F with altogether separate loops selected by cases (IF .... ELSE IF ... ELSE IF ...). So here's what I propose as a test: create a hack-modified version of do_4d_trans.F in which it sets up a special loop for an XYZ average, replacing loop 300 with something that is patterned after do_ave_int.F.
My hope would be that in this form the optimizer can get its teeth into the loops and they'll run at a speed similar to the 1X average.
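(Editorial sketch, not part of the original thread.) To make the two loop shapes concrete, here is a small C illustration of a generic loop driven by run-time increments next to a version with a separate loop per averaging direction. It is not the Fortran from do_4d_trans.F or do_ave_int.F; the function names, the two-axis X-T array, and the layout `a[i + nx * l]` (X varies fastest) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

enum ave_axis { AVE_ALONG_X, AVE_ALONG_T };  /* hypothetical selector */

/* Generic form: one loop nest, with both step sizes known only at run time.
 * out[j] is the mean of nave elements starting at a[j * out_step]. */
void ave_generic(const double *a, size_t nout, size_t nave,
                 size_t out_step, size_t ave_step, double *out)
{
    assert(nave > 0);
    for (size_t j = 0; j < nout; j++) {
        double sum = 0.0;
        for (size_t k = 0; k < nave; k++)
            sum += a[j * out_step + k * ave_step];
        out[j] = sum / (double)nave;
    }
}

/* Case-by-case form: a separate, explicit loop for each direction.
 * Assumed layout: a[i + nx * l], with i along X and l along T. */
void ave_by_case(const double *a, size_t nx, size_t nt,
                 enum ave_axis axis, double *out)
{
    assert(nx > 0 && nt > 0);
    if (axis == AVE_ALONG_X) {
        /* Average along X: the inner loop walks contiguous memory. */
        for (size_t l = 0; l < nt; l++) {
            const double *row = a + l * nx;
            double sum = 0.0;
            for (size_t i = 0; i < nx; i++)
                sum += row[i];
            out[l] = sum / (double)nx;
        }
    } else {
        /* Average along T: the inner loop touches every nx-th element. */
        for (size_t i = 0; i < nx; i++) {
            double sum = 0.0;
            for (size_t l = 0; l < nt; l++)
                sum += a[i + l * nx];
            out[i] = sum / (double)nt;
        }
    }
}
```

With `nx = 5`, the T branch reads every 5th element, matching the stride mentioned in the comment. Both forms produce the same numbers; whether the case-by-case version actually runs faster depends on the compiler and flags, so it would have to be timed rather than assumed.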
Adding @AndrewWittenberg
Reported by @karlmsmith on 5 Jan 2011 22:39 UTC
Enable parallel processing for the large-array transformations defined in do_4d_trans.F. Using threads (pthread_create, pthread_join, pthread_exit in C) should minimize the overhead of starting the parallel processing and simplify communications.
Migrated-From: http://dunkel.pmel.noaa.gov/trac/ferret/ticket/1785