# OpenMP

In this lab you will be working with **OpenMP**. This lab has four parts of increasing difficulty.

You will be parallelizing a code that resizes a regular cube (model) through linear interpolation. The basic algorithm assumes that the output (data) space is larger than the input (model) space by an integer scaling factor. The outer loops of the **forward** operator run over the cells of the smaller input domain. For each input cell, the inner loops fill in the corresponding block of the data by linear interpolation between the current and next model values along each axis.
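The interpolation inside a single input cell can be sketched as a small standalone function. This is a minimal illustration, not part of the lab library: the function name `bilinear` is hypothetical, and the argument names simply mirror the `b11`, `b21`, `b12`, `b22`, `f1`, and `f2` variables used in `forwardS`.

```cpp
#include <cassert>

// Bilinear interpolation inside one input cell (hypothetical helper).
// b11, b21: model values at (i2, i1) and (i2, i1 + 1)
// b12, b22: model values at (i2 + 1, i1) and (i2 + 1, i1 + 1)
// f1, f2:   fractional positions in [0, 1] along axes 1 and 2
float bilinear(float b11, float b21, float b12, float b22, float f1, float f2) {
  assert(f1 >= 0.f && f1 <= 1.f && f2 >= 0.f && f2 <= 1.f);
  return (1.f - f1) * (1.f - f2) * b11 + f1 * (1.f - f2) * b21 +
         (1.f - f1) * f2 * b12 + f1 * f2 * b22;
}
```

At `f1 = f2 = 0` the result is the corner value `b11`, and at the cell center all four corners contribute equally.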

The next cell compiles the code.

In [None]:
!mkdir -p build; cd build; rm -rf *;cmake -DCMAKE_CXX_COMPILER=g++ -DgenericIO_DIR=/opt/SEP/lib/cmake -DCMAKE_CXX_FLAGS="-O3" -DCMAKE_INSTALL_PREFIX=/opt/SEP/local ../src; make install

Below is the code to create a simple 2D model and plot one corner of it. You should not need to edit it.

In [None]:
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline
import SepVector
import LabOpenMP
import Hypercube
import matplotlib.pyplot as plt

vecSmall = SepVector.getSepVector(Hypercube.hypercube(ns=[10001,10001]))
vecBig = SepVector.getSepVector(Hypercube.hypercube(ns=[50000,50000]))

mat = vecSmall.getNdArray()

for i2 in range(mat.shape[0]):
    for i1 in range(mat.shape[1]):
        mat[i2][i1] = (500 - i2) * (500 - i2) + (500 - i1) * (500 - i1)

x = mat[:100,:100]
x = x - 320000
plt.imshow(x)

Next we create our `rescale2D` operator.

In [None]:
scale = LabOpenMP.rescale2D(5)

We can then run and time the forward operator.

In [None]:
%%time
scale.forwardS(vecSmall, vecBig)

And plot an equivalent portion of the regridded model.

In [None]:
%matplotlib inline

m = vecBig.getNdArray()[:500,:500]
m = m - 320000

plt.imshow(m)

The code below performs the "dot product test", which checks that your adjoint operator is actually the adjoint of the forward operator. The two printed values should agree very closely (at least to the 5th decimal place).

In [None]:
x = vecSmall.clone()
y = vecBig.clone()
x.rand()
y.rand()
yp = y.clone()
xp = x.clone()
scale.forwardS(x, yp)
scale.adjointS(xp, y)
print(x.dot(xp))
print(y.dot(yp))
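
The dot product test can also be illustrated on a toy 1-D operator, independent of the lab library. The sketch below is an assumption-laden illustration: the names `forward1D`, `adjoint1D`, `dot`, and `dotTestMismatch` are hypothetical, and the operator is a 1-D analogue of the interpolation in `rescale.cc` rather than the lab code itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Toy 1-D analogue of the lab operator (hypothetical names):
// mod has n points, dat has (n - 1) * r points.
void forward1D(const std::vector<double>& mod, std::vector<double>& dat, int r) {
  const int nCells = static_cast<int>(mod.size()) - 1;
  assert(r > 0 && static_cast<int>(dat.size()) == nCells * r);
  for (int i = 0; i < nCells; i++) {
    for (int ir = 0; ir < r; ir++) {
      const double f = static_cast<double>(ir) / r;
      dat[i * r + ir] = (1. - f) * mod[i] + f * mod[i + 1];
    }
  }
}

// Adjoint of forward1D: spreads each data value back onto two model points.
void adjoint1D(std::vector<double>& mod, const std::vector<double>& dat, int r) {
  const int nCells = static_cast<int>(mod.size()) - 1;
  assert(r > 0 && static_cast<int>(dat.size()) == nCells * r);
  std::fill(mod.begin(), mod.end(), 0.);
  for (int i = 0; i < nCells; i++) {
    for (int ir = 0; ir < r; ir++) {
      const double f = static_cast<double>(ir) / r;
      mod[i] += (1. - f) * dat[i * r + ir];
      mod[i + 1] += f * dat[i * r + ir];
    }
  }
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
  assert(a.size() == b.size());
  double sum = 0.;
  for (std::size_t i = 0; i < a.size(); i++) sum += a[i] * b[i];
  return sum;
}

// Returns |<A x, y> - <x, A^T y>|; near zero for a true adjoint pair.
double dotTestMismatch(const std::vector<double>& x,
                       const std::vector<double>& y, int r) {
  std::vector<double> Ax(y.size()), ATy(x.size());
  forward1D(x, Ax, r);
  adjoint1D(ATy, y, r);
  return std::fabs(dot(Ax, y) - dot(x, ATy));
}
```

For any `x` and `y` of compatible sizes, the mismatch should be at the level of floating-point round-off.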

Below is a cell for editing the code.

In [None]:
%%writefile src/lib/rescale.cc
#include <rescale.h>
#include <cassert>
#include <omp.h>

using namespace gp257;

void rescale2D::forwardS(const std::shared_ptr<float2DReg> mod,
                         std::shared_ptr<float2DReg> dat) {
  const std::shared_ptr<hypercube> hM = mod->getHyper(), hD = dat->getHyper();
  assert((hM->getAxis(1).n - 1) * _rescale == hD->getAxis(1).n);
  assert((hM->getAxis(2).n - 1) * _rescale == hD->getAxis(2).n);
  float db = 1. / (float)_rescale;
    
  for (int i2 = 0; i2 < hM->getAxis(2).n - 1; i2++) {
    for (int i1 = 0; i1 < hM->getAxis(1).n - 1; i1++) {
      float b11 = (*mod->_mat)[i2][i1], b21 = (*mod->_mat)[i2][i1 + 1],
            b12 = (*mod->_mat)[i2 + 1][i1], b22 = (*mod->_mat)[i2 + 1][i1 + 1];
      float f2 = 0;
      for (int ir2 = 0, id2 = i2 * _rescale; ir2 < _rescale;
           ir2++, id2++, f2 += db) {
        float f1 = 0;
        for (int ir1 = 0, id1 = i1 * _rescale; ir1 < _rescale;
             ir1++, id1++, f1 += db) {
          (*dat->_mat)[id2][id1] = (1. - f1) * (1. - f2) * b11 +
                                   (f1) * (1. - f2) * b21 +
                                   (1. - f1) * (f2)*b12 + f1 * f2 * b22;
        }
      }
    }
  }
}
        
void rescale2D::forwardP(const std::shared_ptr<float2DReg> mod,
                         std::shared_ptr<float2DReg> dat) {}

void rescale2D::forwardP2(const std::shared_ptr<float2DReg> mod,
                          std::shared_ptr<float2DReg> dat) {}
        
void rescale2D::adjointS(std::shared_ptr<float2DReg> mod,
                         const std::shared_ptr<float2DReg> dat) {
  const std::shared_ptr<hypercube> hM = mod->getHyper(), hD = dat->getHyper();
  assert((hM->getAxis(1).n - 1) * _rescale == hD->getAxis(1).n);
  assert((hM->getAxis(2).n - 1) * _rescale == hD->getAxis(2).n);
  float db = 1. / (float)_rescale;

  for (int i2 = 0; i2 < hM->getAxis(2).n; i2++) {
    for (int i1 = 0; i1 < hM->getAxis(1).n; i1++) {
      (*mod->_mat)[i2][i1] = 0;
    }
  }

  for (int i2 = 0; i2 < hM->getAxis(2).n - 1; i2++) {
    for (int i1 = 0; i1 < hM->getAxis(1).n - 1; i1++) {
      float f2 = 0;
      for (int ir2 = 0, id2 = i2 * _rescale; ir2 < _rescale;
           ir2++, id2++, f2 += db) {
        float f1 = 0;
        for (int ir1 = 0, id1 = i1 * _rescale; ir1 < _rescale;
             ir1++, id1++, f1 += db) {
          (*mod->_mat)[i2][i1] +=
              (1. - f1) * (1. - f2) * (*dat->_mat)[id2][id1];
          (*mod->_mat)[i2][i1 + 1] += (f1) * (1. - f2) * (*dat->_mat)[id2][id1];
          (*mod->_mat)[i2 + 1][i1] += (1. - f1) * (f2) * (*dat->_mat)[id2][id1];
          (*mod->_mat)[i2 + 1][i1 + 1] += (f1) * (f2) * (*dat->_mat)[id2][id1];
        }
      }
    }
  }
}

void rescale2D::adjointP(std::shared_ptr<float2DReg> mod,
              const std::shared_ptr<float2DReg> dat) {}

void rescale2D::adjointP2(std::shared_ptr<float2DReg> mod,
               const std::shared_ptr<float2DReg> dat) {}

# Problem 1
Compile the code and record the time to execute the serial code. Introduce OpenMP parallelization to the forward operator in `forwardP`. Begin by parallelizing over the `i1` axis. Vary the number of threads over 1, 2, 4, 8, and 16 by setting the environment variable with the line magic `%env OMP_NUM_THREADS=<number of threads to use>`. Calculate the speedup in each case.
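
As a reminder of the basic syntax, here is a minimal sketch of an OpenMP work-sharing loop. It is deliberately not the lab operator: the function `axpyParallel` is a hypothetical example in which every iteration writes a distinct output element, so the iterations are independent.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example (not the lab operator): y[i] = a * x[i] + y[i].
// Each iteration writes a distinct y[i], so iterations can run concurrently.
void axpyParallel(float a, const std::vector<float>& x, std::vector<float>& y) {
  assert(x.size() == y.size());
  const long n = static_cast<long>(x.size());
  // The loop iterations are divided among the threads of the team.
#pragma omp parallel for
  for (long i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
  }
}
```

With GCC the pragma takes effect only when OpenMP is enabled at compile time (for example with `-fopenmp`); otherwise it is ignored and the loop runs serially. The number of threads is then taken from `OMP_NUM_THREADS` at run time.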

Parallelize over the `i2` axis in `forwardP2`. Run the same scaling test. Did the choice of loop parallelization matter? Why or why not? How close did you come to perfect scaling? Speculate on why you achieved the scaling you observed.
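
Speedup and parallel efficiency follow directly from the measured wall-clock times. The helpers below are a small sketch with hypothetical names (`speedup`, `efficiency`); perfect scaling corresponds to an efficiency of 1.

```cpp
#include <cassert>

// Speedup of a parallel run relative to the serial (or 1-thread) run.
double speedup(double tSerial, double tParallel) {
  assert(tSerial > 0. && tParallel > 0.);
  return tSerial / tParallel;
}

// Parallel efficiency: speedup divided by thread count; 1.0 is perfect scaling.
double efficiency(double tSerial, double tParallel, int nThreads) {
  assert(nThreads > 0);
  return speedup(tSerial, tParallel) / nThreads;
}
```

For instance, a run that drops from 8 s to 2 s on 8 threads has a speedup of 4 and an efficiency of 0.5 (illustrative numbers only, not measurements from this lab).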

Parallelizing the adjoint operator is a more challenging problem. If you use the same approach as you did for the forward operator, you are likely to get an incorrect result. Why?

# Problem 2

One way to parallelize the adjoint operator is by introducing locks and/or single statements. Choose one of these approaches to parallelize the adjoint in `adjointP`. Try to at least beat the run time of the serial code. Perform the same scaling test and comment on the results.
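
To show what protecting a shared update looks like, here is a sketch that uses `#pragma omp critical`, which behaves like a single global lock around the enclosed statement. This is a swapped-in illustration rather than the `omp_lock_t` API, and the function `histogramCritical` is a hypothetical example unrelated to the lab operator.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: count how many keys fall in each bin.
// Different iterations may update the same bin, so the update is protected.
std::vector<int> histogramCritical(const std::vector<int>& keys, int nBins) {
  assert(nBins > 0);
  std::vector<int> bins(nBins, 0);
  const long n = static_cast<long>(keys.size());
#pragma omp parallel for
  for (long i = 0; i < n; i++) {
    const int b = keys[i] % nBins;  // assumes non-negative keys
    // Only one thread at a time may execute this block.
#pragma omp critical
    { bins[b] += 1; }
  }
  return bins;
}
```

Because a critical section serializes every protected update, contention can easily erase the benefit of extra threads, which is worth keeping in mind when interpreting the scaling results.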

# Problem 3
One approach to avoiding the race condition of the previous section is to use what is often referred to as red-black ordering or, more generally, a colored ordering. The basic concept of these methods is to break a parallel problem that has race conditions into multiple parallel problems, none of which has a race condition. Parallelize the adjoint using a red-black ordering scheme in `adjointP2`.

## Hint
To understand how a red-black ordering might work, imagine we are attempting 1-D interpolation using the same algorithm. Our first cell updates model points 0 and 1, our second cell updates points 1 and 2, and our third cell updates points 2 and 3. We could therefore modify our loop to parallelize over all odd-numbered cells without introducing a race condition, then parallelize over all of the even-numbered cells.
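
The 1-D idea in this hint can be sketched as follows. This is an illustration under stated assumptions, not the 2-D solution: the function name `adjoint1DRedBlack` is hypothetical, the model has `n` points, and the data has `(n - 1) * r` points.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// 1-D adjoint of linear interpolation with a two-color ordering.
// Cell i updates mod[i] and mod[i + 1], so two cells of the same parity
// never touch the same model point.
void adjoint1DRedBlack(std::vector<double>& mod,
                       const std::vector<double>& dat, int r) {
  const int nCells = static_cast<int>(mod.size()) - 1;
  assert(r > 0 && nCells >= 0 && static_cast<int>(dat.size()) == nCells * r);
  std::fill(mod.begin(), mod.end(), 0.);
  for (int color = 0; color < 2; color++) {
    // All cells of one color run in parallel; the implicit barrier at the
    // end of the parallel loop separates the two colors.
#pragma omp parallel for
    for (int i = color; i < nCells; i += 2) {
      for (int ir = 0; ir < r; ir++) {
        const double f = static_cast<double>(ir) / r;
        const double d = dat[i * r + ir];
        mod[i] += (1. - f) * d;
        mod[i + 1] += f * d;
      }
    }
  }
}
```

In 2-D each cell touches four model points, so more than two colors may be needed; working that out is part of the problem.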

# Problem 4

By introducing thread parallelism you have changed our view of the roofline model.  When doing parallel reads you can achieve memory read speeds of 72 GB/s.  The machine you are working on has 24 cores. Remake the roofline model from the "roofline" lab for this machine. 
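
As a reminder, the roofline bound is the smaller of the compute peak and the product of memory bandwidth and operational intensity. The sketch below uses hypothetical function names (`roofline`, `crossover`); the peak flop rate is left as a parameter because it comes from the "roofline" lab and is not stated here.

```cpp
#include <algorithm>
#include <cassert>

// Attainable performance (GFLOP/s) under the roofline model.
// peakGflops:   compute peak in GFLOP/s
// bandwidthGBs: memory bandwidth in GB/s
// flopsPerByte: operational intensity in FLOP/byte
double roofline(double peakGflops, double bandwidthGBs, double flopsPerByte) {
  assert(peakGflops > 0. && bandwidthGBs > 0. && flopsPerByte >= 0.);
  return std::min(peakGflops, bandwidthGBs * flopsPerByte);
}

// Operational intensity (FLOP/byte) at which the two limits meet.
double crossover(double peakGflops, double bandwidthGBs) {
  assert(peakGflops > 0. && bandwidthGBs > 0.);
  return peakGflops / bandwidthGBs;
}
```

Calling `crossover(peak, 72.)` with the machine's multi-core peak gives the new crossover point for the 72 GB/s parallel read bandwidth.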

How has the crossover point between bandwidth and flop limit changed?

What is the operational intensity of the algorithm?

How close did you come to peak performance? Speculate why.