# Project 2 - Distributed Memory Sorting

**Due November 14, 2019**

The same logistical rules as Project 1 apply.

This notebook can be developed on a single core of any type, but will be graded on 4 nodes with 28 cores per node.

In [1]:
module use $CSE6230_DIR/modulefiles
module unload cse6230/core
module load cse6230/gcc-omp-gpu



## Objectives

- The goal of this project is to *use profiling* to optimize the performance of
  an MPI-based library for distributed memory sorting.
  - The main library interface is (as declared in [proj2sorter.h](proj2sorter.h)):

``` C
    /* This is the default implementation of sorting:
     * \param[in] sorter       The sorting context.  Put all of your customizations
     *                         in this object.  Defined in proj2sorter_impl.h, where
     *                         you can change the struct to include more data
     * \param[in] numKeysLocal The number of keys on this process.
     * \param[in] uniform      True if there are the same number of keys on each process
     * \param[in/out] keys     The input array.  On output, should be globally
     *                         sorted in ascending order.
     * \return                 Non-zero if an error occurred.
     */
    int Proj2SorterSort(Proj2Sorter sorter, size_t numKeysLocal, int uniform, uint64_t *keys);
```
  - The small library comes with some logging and error macros (see
      [proj2.h](proj2.h)) as well as an interface for obtaining/restoring workspace
      arrays (see `Proj2SorterGetWorkArray()` and `Proj2SorterRestoreWorkArray()` in
      [proj2sorter.h](proj2sorter.h)).  To be memory neutral, restore every
      workspace that you get.

  - Functioning parallel implementations have been provided, one based on
      [quicksort](https://en.wikipedia.org/wiki/Quicksort#Parallelization) and
      one based on [bitonic mergesort](https://en.wikipedia.org/wiki/Bitonic_sorter).

  - A [template library](https://github.com/swenson/sort) originated by Chris
      Swenson has been imported for a quicksort implementation that is faster
      than `qsort` from the standard library.  The template library includes
      other implementations that you are welcome to explore.
      
      If you go through the work of bringing in a serial sorting library, you're welcome to use it,
      as long as you put it somewhere the TA and I can access it.

  - Indeed, as with previous assignments, the implementation details are up
      to you.  There is a test program
      ([test_proj2.c](test_proj2.c)), which may not be edited, that calls your
      library.  It will test the sorting bandwidth (bytes sorted per second) of
      your code on random data at varying numbers of *keys
      per MPI process* (a *key* in our library is just a `uint64_t`: a large
      integer).  You may incorporate additional files into your library by
      adding a `Makefile.inc` file to your project.

```
    ./test_proj2 MIN_KEYS_PER_PROCESS MAX_KEYS_PER_PROCESS MULTIPLIER SEED NUM_REPS UNIFORM_SIZE UNIFORM_KEYS PARTIALLY_SORTED
```

This means that the test program seeds the random number generator with `SEED`,
starts with `MIN_KEYS_PER_PROCESS`, runs each problem size `NUM_REPS` times to get an average,
and gets the next problem size by multiplying by `MULTIPLIER`, continuing while the size is at most
`MAX_KEYS_PER_PROCESS`.  The last three arguments control the input:

- If `UNIFORM_SIZE` is `0`, the number of keys per process will vary between `MIN_KEYS_PER_PROCESS` and `2*MIN_KEYS_PER_PROCESS`.
- If `UNIFORM_KEYS` is `0`, each process will have keys in its own randomly chosen interval.
- If `PARTIALLY_SORTED` is `1`, the input array is random but generally increasing; if `PARTIALLY_SORTED` is `-1`, the input array is random but generally decreasing.

In your testing, you will probably want to test **one problem size** at a time.  If you are
having problems with correctness (segmentation faults, hangs/deadlocks, etc.),
it is best to work those out on your workstation/laptop if possible before
using SUs on Stampede2.  You are starting (knock on wood) from a correct
implementation: try to work in small changes, testing for correctness at each change.
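
To see which problem sizes a given command line will visit, the size progression described above can be sketched as follows.  This is only an illustration of the description, not code taken from `test_proj2.c`: the function names `proj2_sketch_sizes` and `proj2_sketch_total_bytes` are hypothetical, and the handling of a `MULTIPLIER` below 2 is an assumption made so that the sketch terminates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill sizes[] with the base per-process key counts,
 * assuming the loop is "start at min_keys, multiply by multiplier, continue
 * while the size is at most max_keys".  Returns the number of sizes written
 * (never more than cap).  A multiplier below 2 would never advance, so this
 * sketch then reports only the first size (an assumption, see lead-in). */
size_t proj2_sketch_sizes(size_t min_keys, size_t max_keys, size_t multiplier,
                          size_t *sizes, size_t cap)
{
  size_t count = 0;

  for (size_t n = min_keys; n <= max_keys && count < cap; n *= multiplier) {
    sizes[count++] = n;
    if (multiplier < 2) break;
  }
  return count;
}

/* Each key is a uint64_t, so a global key count maps to bytes like this. */
uint64_t proj2_sketch_total_bytes(uint64_t num_keys_global)
{
  return num_keys_global * (uint64_t) sizeof(uint64_t);
}
```

For example, the arguments `160 400000 32` give the base sizes 160, 5120, and 163840, and a global count of 669804 keys corresponds to 5358432 bytes, which agrees with the `total bytes` value the test program prints for `numKeysGlobal 669804`.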

In [3]:
make clean
make test_proj2

rm -f libproj2.so  proj2.o proj2sorter.o local.o bitonic.o quicksort.o test_proj2.o test_proj2
mpicc -I../../utils/Random123/include -g -Wall -std=c99 -fopenmp -fpic -O3 -c -o test_proj2.o test_proj2.c
mpicc -I../../utils/Random123/include -g -Wall -std=c99 -fopenmp -fpic -O3 -c -o proj2.o proj2.c
mpicc -I../../utils/Random123/include -g -Wall -std=c99 -fopenmp -fpic -O3 -c -o proj2sorter.o proj2sorter.c
mpicc -I../../utils/Random123/include -g -Wall -std=c99 -fopenmp -fpic -O3 -c -o local.o local.c
mpicc -I../../utils/Random123/include -g -Wall -std=c99 -fopenmp -fpic -O3 -c -o bitonic.o bitonic.c
mpicc -I../../utils/Random123/include -g -Wall -std=c99 -fopenmp -fpic -O3 -c -o quicksort.o quicksort.c
quicksort.c: In function ‘Proj2SorterSort_quicksort_recursive’:
mpicc -fopenmp -shared -o libproj2.so proj2.o proj2sorter.o local.o bitonic.o quicksort.o -lm
mpicc -fopenmp -L./ -Wl,-rpath,./ -o test_proj2 test_proj2.o -lproj2


In [7]:
mpirun  -f ${PBS_NODEFILE} -n ${PBS_NP} ./test_proj2 16000 16000 1 0 1 0 0 0 

[0] ./test_proj2 minKeys 16000 maxKeys 16000 mult 2 seed 0 uniform size 0 uniform distribution 0 partially sorted 0
[0] Testing numKeysLocal 19156, numKeysGlobal 669804, total bytes 5358432
Rank: 16, commSize: 28, localSize: 16682, Local Sorting time: 0.00
Rank: 5, commSize: 28, localSize: 17345, Local Sorting time: 0.00
Rank: 0, commSize: 28, localSize: 19156, Local Sorting time: 0.00
Rank: 15, commSize: 28, localSize: 19432, Local Sorting time: 0.00
Rank: 19, commSize: 28, localSize: 18713, Local Sorting time: 0.00
Rank: 22, commSize: 28, localSize: 19781, Local Sorting time: 0.00
Rank: 13, commSize: 28, localSize: 19640, Local Sorting time: 0.00
Rank: 8, commSize: 28, localSize: 21072, Local Sorting time: 0.00
Rank: 23, commSize: 28, localSize: 20598, Local Sorting time: 0.00
Rank: 21, commSize: 28, localSize: 21088, Local Sorting time: 0.00
Rank: 25, commSize: 28, localSize: 22662, Local Sorting time: 0.00
Rank: 7, commSize: 28, localSize: 24041, Local Sorting time: 0.00
Rank: 6, c

Rank: 0, commSize: 3, localSize: 36601, Local Sorting time: 0.00
Rank: 0, commSize: 4, localSize: 34057, Local Sorting time: 0.00
Rank: 0, commSize: 1, localSize: 9732, Local Sorting time: 0.00
Rank: 0, commSize: 1, localSize: 9958, Local Sorting time: 0.00
Rank: 0, commSize: 1, localSize: 9761, Local Sorting time: 0.00
Rank: 0, commSize: 1, localSize: 9743, Local Sorting time: 0.00
Rank: 1, commSize: 2, localSize: 2959, Local Sorting time: 0.00
Rank: 1, commSize: 2, localSize: 7088, Local Sorting time: 0.00
Rank: 1, commSize: 3, localSize: 81088, Local Sorting time: 0.00
Rank: 1, commSize: 2, localSize: 8529, Local Sorting time: 0.00
Rank: 1, commSize: 2, localSize: 3201, Local Sorting time: 0.00
Rank: 0, commSize: 2, localSize: 18199, Local Sorting time: 0.00
Rank: 0, commSize: 2, localSize: 18003, Local Sorting time: 0.00
Rank: 0, commSize: 1, localSize: 24880, Local Sorting time: 0.00
Rank: 0, commSize: 2, localSize: 22580, Local Sorting time: 0.00
Rank: 0, commSize: 2, localSize: 

Rank: 0, commSize: 1, localSize: 5492, Local Sorting time: 0.00
Rank: 0, commSize: 1, localSize: 5371, Local Sorting time: 0.00
Rank: 5, commSize: 7, localSize: 0, Local Sorting time: 0.00
Rank: 5, commSize: 7, localSize: 0, Local Sorting time: 0.00
Rank: 4, commSize: 7, localSize: 2771, Local Sorting time: 0.00
Rank: 2, commSize: 7, localSize: 4222, Local Sorting time: 0.00
Rank: 2, commSize: 7, localSize: 5037, Local Sorting time: 0.00
Rank: 4, commSize: 7, localSize: 5703, Local Sorting time: 0.00
Rank: 6, commSize: 7, localSize: 8607, Local Sorting time: 0.00
Rank: 1, commSize: 7, localSize: 10700, Local Sorting time: 0.00
Rank: 1, commSize: 7, localSize: 20000, Local Sorting time: 0.00
Rank: 6, commSize: 7, localSize: 10955, Local Sorting time: 0.00
Rank: 0, commSize: 7, localSize: 24478, Local Sorting time: 0.00
Rank: 5, commSize: 7, localSize: 92707, Local Sorting time: 0.00
Rank: 0, commSize: 7, localSize: 26783, Local Sorting time: 0.00
Rank: 3, commSize: 7, localSize: 32296, 

In [5]:
hpcstruct ./test_proj2
mpirun  -f ${PBS_NODEFILE} -n ${PBS_NP} hpcrun -t ./test_proj2 400000 400000 32 0 1 0 0 0 

[0] ./test_proj2 minKeys 400000 maxKeys 400000 mult 32 seed 0 uniform size 0 uniform distribution 0 partially sorted 0
[0] Testing numKeysLocal 620439, numKeysGlobal 16310081, total bytes 130480648
[0] Tested numKeysLocal 620439, numKeysGlobal 16310081, total bytes 130480648: average bandwidth 1.845820e+08
[0] Harmonic average bandwidth: 1.845820e+08


In [6]:
hpcprof -S test_proj2.hpcstruct hpctoolkit-test_proj2-measurements-129587.ice-sched.pace.gatech.edu

msg: STRUCTURE: /nv/coc-ice/zjiang333/cse6230-hw/projects/2-sorting/test_proj2
msg: Line map : /nv/coc-ice/tisaac3/opt/pace-ice/hpctoolkit-gcc/lib/hpctoolkit/libhpcrun.so.0.0.0
msg: Line map : /nv/coc-ice/tisaac3/opt/pace-ice/hpctoolkit-gcc/lib/hpctoolkit/ext-libs/libmonitor.so.0.0.0
msg: Line map : /nv/coc-ice/zjiang333/cse6230-hw/projects/2-sorting/libproj2.so
msg: Line map : /nv/coc-ice/tisaac3/opt/pace-ice/mvapich2/2.3/lib/libmpi.so.12.1.1
msg: Line map : /lib64/libpthread-2.12.so
msg: Line map : /lib64/libc-2.12.so
msg: Line map : /lib64/ld-2.12.so
msg: Line map : /usr/lib64/libxml2.so.2.7.6
msg: Line map : /usr/lib64/librdmacm.so.1.0.0
msg: Line map : /usr/lib64/libibverbs.so.1.0.0
msg: Line map : /usr/lib64/libmlx4-rdmav2.so
msg: Populating Experiment database: /nv/coc-ice/zjiang333/cse6230-hw/projects/2-sorting/hpctoolkit-test_proj2-database-129587.ice-sched.pace.gatech.edu


In [3]:
for uniform_size in 0 1; do
for uniform_keys in 0 1; do
for partially_sorted in 0 1 -1; do
mpirun  -f ${PBS_NODEFILE} -n ${PBS_NP} hpcrun -t ./test_proj2 160 400000 32 0 5 ${uniform_size} ${uniform_keys} ${partially_sorted} 
done
done
done
  

[0] ./test_proj2 minKeys 160 maxKeys 400000 mult 32 seed 0 uniform size 0 uniform distribution 0 partially sorted 0
[0] Testing numKeysLocal 197, numKeysGlobal 6492, total bytes 51936
[0] Tested numKeysLocal 197, numKeysGlobal 6492, total bytes 51936: average bandwidth 1.529272e+08
[0] Testing numKeysLocal 7305, numKeysGlobal 226251, total bytes 1810008
[0] Tested numKeysLocal 7305, numKeysGlobal 226251, total bytes 1810008: average bandwidth 2.645595e+08
[0] Testing numKeysLocal 299023, numKeysGlobal 7195096, total bytes 57560768
[0] Tested numKeysLocal 299023, numKeysGlobal 7195096, total bytes 57560768: average bandwidth 1.156562e+08
[0] Harmonic average bandwidth: 1.581841e+08
[0] ./test_proj2 minKeys 160 maxKeys 400000 mult 32 seed 0 uniform size 0 uniform distribution 0 partially sorted 1
[0] Testing numKeysLocal 197, numKeysGlobal 6492, total bytes 51936
[0] Tested numKeysLocal 197, numKeysGlobal 6492, total bytes 51936: average bandwidth 1.914444e+08
[0] Testing numKeysLocal 54

[0] Harmonic average bandwidth: 2.694343e+08
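
The `Harmonic average bandwidth` line summarizes the per-size averages of a run.  For the first run above, the three reported averages (1.529272e+08, 2.645595e+08, and 1.156562e+08) combine to the reported 1.581841e+08 under the usual harmonic mean, sketched below.  The function name `proj2_sketch_harmonic_mean` is hypothetical, and this is a reconstruction from the printed numbers, not the code in `test_proj2.c`.

```c
#include <stddef.h>

/* Hypothetical helper: harmonic mean of n positive values, n / sum(1/x_i).
 * Returns 0.0 for an empty list or if any value is not positive. */
double proj2_sketch_harmonic_mean(const double *values, size_t n)
{
  double sum_of_reciprocals = 0.0;

  if (n == 0) return 0.0;
  for (size_t i = 0; i < n; i++) {
    if (values[i] <= 0.0) return 0.0;
    sum_of_reciprocals += 1.0 / values[i];
  }
  return (double) n / sum_of_reciprocals;
}
```

Because the harmonic mean is pulled toward the smallest value, the problem size with the lowest bandwidth has the most influence on this summary.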


In [5]:
hpcprof -S test_proj2.hpcstruct hpctoolkit-test_proj2-measurements-*.ice-sched.pace.gatech.edu

msg: Directory 'hpctoolkit-test_proj2-database-129279.ice-sched.pace.gatech.edu' already exists. Trying 'hpctoolkit-test_proj2-database-129279.ice-sched.pace.gatech.edu-129279.ice-sched.pace.gatech.edu'
msg: Created directory: hpctoolkit-test_proj2-database-129279.ice-sched.pace.gatech.edu-129279.ice-sched.pace.gatech.edu
msg: STRUCTURE: /nv/coc-ice/zjiang333/cse6230-hw/projects/2-sorting/test_proj2
msg: Line map : /nv/coc-ice/tisaac3/opt/pace-ice/hpctoolkit/lib/hpctoolkit/ext-libs/libmonitor.so.0.0.0
msg: Line map : /nv/coc-ice/zjiang333/cse6230-hw/projects/2-sorting/libproj2.so
msg: Line map : /nv/usr-local-rhel6.7/pacerepov1/intel/compiler/16.0/compilers_and_libraries_2016.0.109/linux/mpi/intel64/lib/release_mt/libmpi.so.12.0
msg: Line map : /lib64/libpthread-2.12.so
msg: Line map : /lib64/libc-2.12.so
msg: Line map : /lib64/ld-2.12.so
msg: Populating Experiment database: /nv/coc-ice/zjiang333/cse6230-hw/projects/2-sorting/hpctoolkit-test_proj2-database-129279.ice-sched.pace.gatech.

## Grading

- 0-4 points for hassle-free usage: maximized if the Python script made from the notebook runs the first time.
    * Points lost if we have to figure out how to reproduce your reported results.
- 0-6 points for correctness:
    * Whether the notebook runs to
      completion (it will abort if a list of keys is not properly sorted).
    * You lose half the points if your code is not correct; subsequent points
      can be lost for poor code organization.
- 0-2 points for your prediction and the reasoning that goes into it.
- 0-8 points for the notebook:
    * 0-2 points for how well the notebook tracks your `git` history: did we find the commits
      used to generate the entries?  Is there an entry for all the major
      aspects of your development?
    * 0-3 points for your profiling evidence: is it present?  Does it seem to
      indicate what you say it indicates?
    * 0-3 points for your planning: do your proposed code changes follow
      logically from the evidence?
- **1 Bonus point** for the closest prediction to the actual highest bandwidth achieved.
- **1 Bonus point** for versatile performance: your code achieves a bandwidth within 50% of the highest bandwidth achieved on that test for at least 50% of the tests.
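
One way to read the versatility bonus is sketched below.  It assumes that "within 50%" means your bandwidth on a test is at least half of the highest bandwidth anyone achieved on that same test; that reading and the name `proj2_sketch_is_versatile` are assumptions for illustration, not the official grading script.

```c
#include <stddef.h>

/* Hypothetical helper: mine[i] is your bandwidth on test i and best[i] is the
 * highest bandwidth achieved by anyone on test i.  Returns 1 if at least half
 * of the tests satisfy mine[i] >= 0.5 * best[i], and 0 otherwise. */
int proj2_sketch_is_versatile(const double *mine, const double *best,
                              size_t num_tests)
{
  size_t hits = 0;

  for (size_t i = 0; i < num_tests; i++) {
    if (mine[i] >= 0.5 * best[i]) hits++;
  }
  return num_tests > 0 && 2 * hits >= num_tests;
}
```

Under this reading, a code that is very fast on one input type but slow on the rest would not qualify, so it is worth checking all combinations of `UNIFORM_SIZE`, `UNIFORM_KEYS`, and `PARTIALLY_SORTED`.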
      
## Prediction

Predict, without going under, the highest bandwidth on any individual test that any student will achieve on this assignment.  Assume only the CPUs are used.  Justify your prediction.

  
## Notebook

Please put your notebook documenting your measurements, thought processes, models, etc. from your work on this project