# Vectorization

In this lab you will parallelize a library that performs a [discrete wavelet transform](https://en.wikipedia.org/wiki/Discrete_wavelet_transform). The discrete wavelet transform is used in compression and noise removal, among many other applications. It filters a dataset to split it into low-frequency and high-frequency wavelet components.
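
To make the low/high split concrete, here is a minimal sketch of one level of a 1-D Haar transform in plain Python. The Haar filter is an assumption chosen for its simplicity; the lab library may use a different wavelet, and the function names `haar_forward` and `haar_inverse` are hypothetical and not part of `LabISPC`.

```python
import math

def haar_forward(signal):
    """One level of the orthonormal Haar transform (assumed filter, for illustration).

    Takes an even-length list and returns (low, high): scaled pairwise
    sums (smooth part) and scaled pairwise differences (detail part).
    """
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high

def haar_inverse(low, high):
    """Rebuild the original signal from its low and high components."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * s)  # first sample of the pair
        out.append((l - h) * s)  # second sample of the pair
    return out

low, high = haar_forward([4.0, 4.0, 2.0, 6.0])
restored = haar_inverse(low, high)
```

Each output element depends only on one input pair, so every pair can be processed independently. That independence is what makes this kind of loop a natural target for vectorization.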

In the next two cells we download one of the classic image datasets and display it.

In [None]:
!touch leena.HH; rm -rf leena.HH; wget http://sep.stanford.edu/sep/bob/data/leena.HH 

In [None]:
%matplotlib inline
import SepVector
import Hypercube
import genericIO
import matplotlib.pyplot as plt

leena = genericIO.defaultIO.getVector("leena.HH")
plt.imshow(leena.getNdArray(), cmap="gray")

You will edit C++ code to introduce parallelism using ISPC. The next cell configures the CMake build, and the cell after it compiles and installs the code.

In [None]:
!mkdir -p build; cd build; rm -rf *; cmake -DCMAKE_INSTALL_PREFIX=/opt/SEP/local -DCMAKE_CXX_FLAGS="-O3 -fno-tree-vectorize" -DgenericIO_DIR=/opt/SEP/lib/cmake ../src

In [None]:
!cd build; make install

Below is an example of how ISPC can speed up calculations. The next cell creates two random vectors of one million elements each.

In [None]:
import SepVector
import Hypercube
a = SepVector.getSepVector(Hypercube.hypercube(ns=[1000000]))
b = SepVector.getSepVector(Hypercube.hypercube(ns=[1000000]))
a.rand()
b.rand()

We will use a simple element-wise vector multiplication as the test case and measure its execution speed with the **timeit** cell magic.

In [None]:
%cat src/lib/mult.cc

In [None]:
%%timeit 
import LabISPC
LabISPC.mult(a,b)

We can then vectorize the loop using ISPC.

In [None]:
%cat src/lib/mult.h

In [None]:
%%writefile src/lib/kernel.ispc
export void multISPC(uniform float a[], uniform float b[], uniform int n) {
  foreach (i = 0 ... n) {
    a[i]=a[i]*b[i];
  }
}

We can test the speed of our vectorized kernel.

In [None]:
%%timeit
import LabISPC
LabISPC.multISPC(a,b)

# Part 1: Vectorization

Your assignment is to vectorize the 2-D wavelet transform code. The next cell shows how to run the forward transform using the serial version of the code.

In [None]:
import LabISPC
test = leena.clone()
op = LabISPC.wavelet2D()
op.forwardTransformS(leena, test)
plt.imshow(test.getNdArray(), cmap="gray")

We can also run the inverse of the wavelet transform.

In [None]:
%matplotlib inline
leena2 = leena.clone()
op.inverseTransformS(test, leena2)
plt.imshow(leena2.getNdArray(), cmap="gray")

To get a more reliable timing estimate, we will tile a 16 x 16 grid (256 copies) of our image into one larger image.

In [None]:
%matplotlib inline
leenaBig = SepVector.getSepVector(Hypercube.hypercube(ns=[512*16,512*16]))
leenaWave = leenaBig.clone()
leenaBigOut = leenaBig.clone()

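def bigMap(sm, bg):
    # Tile the 512x512 image into a 16x16 grid inside the big array.
    big = bg.getNdArray()
    small = sm.getNdArray()
    for i4 in range(16):
        for i3 in range(16):
            # Slice assignment copies a whole tile at once instead of one element at a time.
            big[i4*512:(i4+1)*512, i3*512:(i3+1)*512] = small[:512, :512]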

bigMap(leena, leenaBig)
plt.imshow(leenaBig.getNdArray(), cmap="gray")

In [None]:
%%timeit
op.forwardTransformS(leenaBig, leenaWave)

Your job is to vectorize the forward and inverse transforms. You will need to add ISPC code to the `kernel.ispc` file, then fill in the `forwardTransformV` and `inverseTransformV` routines in `wavelet2D.cc` with the vectorized versions.

# Part 2: Parallelization and vectorization

Fill in the `forwardTransformVP` and `inverseTransformVP` routines with parallelized versions of the forward and inverse transforms that call your vectorized routines.
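
One common way to combine the two levels of parallelism is to split the rows of the image into contiguous blocks and let each worker run the vectorized kernel on its own block. The Python sketch below illustrates only that partitioning idea. It is not the lab's C++ threading mechanism, and `row_blocks`, `scale_rows`, and `parallel_apply` are hypothetical names, with `scale_rows` standing in for a real vectorized kernel.

```python
from concurrent.futures import ThreadPoolExecutor

def row_blocks(n_rows, n_workers):
    """Split range(n_rows) into at most n_workers contiguous (start, stop) blocks."""
    base, extra = divmod(n_rows, n_workers)
    blocks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # spread the remainder over the first blocks
        if size == 0:
            continue
        blocks.append((start, start + size))
        start += size
    return blocks

def scale_rows(image, start, stop):
    """Hypothetical stand-in for a vectorized per-row kernel: doubles each value in place."""
    for r in range(start, stop):
        image[r] = [2 * v for v in image[r]]

def parallel_apply(image, n_workers):
    """Run the stand-in kernel on disjoint row blocks, one task per block."""
    blocks = row_blocks(len(image), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(scale_rows, image, s, e) for s, e in blocks]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return image

result = parallel_apply([[1, 2], [3, 4], [5, 6]], n_workers=2)
```

Because the blocks do not overlap, no two workers write to the same row, so no locking is needed within a single pass over the rows.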

# Part 3: Roofline

The machine you are working on has 24 cores, a vector length of 8, and a parallel memory bandwidth of 72 GB/s. Remake the roofline model from the "roofline" lab. Figure out the new crossing point between the memory-bandwidth-limited and flops-limited regimes.
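
As a starting point, the crossing point (ridge point) of a roofline is the peak compute rate divided by the memory bandwidth. The sketch below shows the arithmetic. The clock frequency and the number of flops per lane per cycle are not given in this lab, so the values used here are placeholder assumptions that you should replace with the real numbers for your machine.

```python
def peak_gflops(cores, vector_lanes, clock_ghz, flops_per_lane_per_cycle=1):
    """Peak compute rate in GFLOP/s for the given machine parameters."""
    return cores * vector_lanes * clock_ghz * flops_per_lane_per_cycle

def ridge_point(peak, bandwidth_gbs):
    """Arithmetic intensity (flops/byte) at which the two roofs meet."""
    return peak / bandwidth_gbs

def attainable_gflops(intensity, peak, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak, intensity * bandwidth_gbs)

# From the lab text: 24 cores, vector length 8, 72 GB/s.
CORES, LANES, BANDWIDTH_GBS = 24, 8, 72.0
# ASSUMPTION: placeholder clock, not taken from the lab; substitute the real value.
ASSUMED_CLOCK_GHZ = 2.0

peak = peak_gflops(CORES, LANES, ASSUMED_CLOCK_GHZ)
ridge = ridge_point(peak, BANDWIDTH_GBS)
print(f"peak = {peak:.1f} GFLOP/s, ridge point = {ridge:.2f} flops/byte")
```

Kernels whose arithmetic intensity falls below the ridge point are limited by memory bandwidth, and those above it are limited by compute.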