# Introduction

The **Threading Building Blocks (TBB)** library provides thread-level parallelism for C++.  The library is object-oriented in nature, but with C++11 lambda functions the syntax is not too difficult.  In this lab you will be parallelizing a library that performs a [discrete wavelet transform](https://en.wikipedia.org/wiki/Discrete_wavelet_transform).  The discrete wavelet transform is used in compression and noise removal (among many other applications). It applies a pair of filters to a dataset to split it into low- and high-frequency wavelet components.
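
To make the low/high split concrete, here is a minimal sketch of a single-level transform on a 1D signal. It uses the Haar filter pair, which is an assumption chosen for brevity; the filter in the lab's C++ library may differ. The names `haar_forward` and `haar_inverse` are illustrative and are not part of the lab code.

```python
import math

def haar_forward(signal):
    """One level of a Haar transform on an even-length list of floats."""
    s = 1.0 / math.sqrt(2.0)
    # Low-pass: scaled pairwise sums (smooth part of the signal)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    # High-pass: scaled pairwise differences (detail part of the signal)
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high

def haar_inverse(low, high):
    """Rebuild the original samples from the low and high components."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * s)
        out.append((l - h) * s)
    return out

signal = [4.0, 4.0, 5.0, 5.0, 9.0, 1.0, 2.0, 2.0]
low, high = haar_forward(signal)
restored = haar_inverse(low, high)
print("low: ", low)
print("high:", high)
```

Wherever neighbouring samples are equal the high component is zero, which is why smooth data compresses well after the transform.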

In the next two cells we download one of the classic image datasets and display it.

In [None]:
!rm -f leena.HH; wget http://sep.stanford.edu/sep/bob/data/leena.HH

In [None]:
%matplotlib inline
import SepVector
import Hypercube
import genericIO
import matplotlib.pyplot as plt

leena = genericIO.defaultIO.getVector("leena.HH")
plt.imshow(leena.getNdArray(), cmap="gray")

You will be editing C++ code to introduce parallelism using TBB.  The next cell configures the build with CMake. The cell after it compiles and installs the code.

In [None]:
!mkdir -p build; cd build; rm -rf *; cmake -DCMAKE_INSTALL_PREFIX=/opt/SEP/local -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS=-fPIC -DgenericIO_DIR=/opt/SEP/lib/cmake ../src

In [None]:
!cd build; make install

We can run the serial version of the forward transform using the following cell.

In [None]:
import LabTBB

test = leena.clone()
op = LabTBB.wavelet2D()
op.forwardTransformS(leena, test)
plt.imshow(test.getNdArray(), cmap="gray")

In this display the low-low component is in the top left, high-low in the top right,
low-high in the bottom left, and high-high in the bottom right. Note how most of the energy is concentrated in the top-left panel.
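
The quadrant layout comes from applying a one-level transform along each axis in turn. The sketch below shows this on a tiny 4x4 array, again assuming a Haar filter (the lab library's filter and axis convention may differ, and `haar_1d`, `haar_2d`, and `energy` are illustrative names only).

```python
import math

def haar_1d(values):
    """Return [low half | high half] for an even-length list."""
    s = 1.0 / math.sqrt(2.0)
    pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return [(a + b) * s for a, b in pairs] + [(a - b) * s for a, b in pairs]

def haar_2d(image):
    """Transform every row, then every column, of a square list of lists."""
    rows = [haar_1d(row) for row in image]              # pass along each row
    cols = [haar_1d(list(col)) for col in zip(*rows)]   # pass along each column
    return [list(row) for row in zip(*cols)]            # transpose back

def energy(block):
    """Sum of squared values."""
    return sum(v * v for row in block for v in row)

image = [[10.0, 10.0, 12.0, 12.0],
         [10.0, 10.0, 12.0, 12.0],
         [11.0, 11.0, 13.0, 14.0],
         [11.0, 11.0, 13.0, 13.0]]

out = haar_2d(image)
half = len(image) // 2
top_left = [row[:half] for row in out[:half]]   # low-pass on both axes
total = energy(out)
fraction_low_low = energy(top_left) / total
print(f"fraction of energy in the low-low quadrant: {fraction_low_low:.4f}")
```

For a smooth input like this one, nearly all of the energy lands in the low-low quadrant, which matches what the image display shows.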

We can now run the inverse transform and recover the image.

In [None]:
%matplotlib inline

leena2 = leena.clone()
op.inverseTransformS(test, leena2)
plt.imshow(leena2.getNdArray(), cmap="gray")

To get a good estimate of the execution speed, let's tile 256 copies of our image into a 16x16 grid.

In [None]:
%matplotlib inline
leenaBig = SepVector.getSepVector(Hypercube.hypercube(ns=[512*16,512*16]))
leenaWave = leenaBig.clone()
leenaBigOut = leenaBig.clone()

def bigMap(sm, bg):
    """Tile the 512x512 image `sm` into a 16x16 grid inside `bg`."""
    big = bg.getNdArray()
    small = sm.getNdArray()
    for i4 in range(16):
        for i3 in range(16):
            # Copy one whole tile with a NumPy slice assignment
            big[i4*512:(i4+1)*512, i3*512:(i3+1)*512] = small[:512, :512]

bigMap(leena, leenaBig)
plt.imshow(leenaBig.getNdArray(), cmap="gray")

Now let's run the forward transform to see how fast it is.

In [None]:
%%time

op.forwardTransformS(leenaBig,leenaWave)
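
A single `%%time` measurement can be noisy. One common alternative is to repeat the call a few times and keep the shortest wall-clock time. The helper below is a sketch of that idea; `best_time` is a hypothetical name, and the stand-in workload exists only so the block runs on its own.

```python
import time

def best_time(func, repeats=3):
    """Call func() `repeats` times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload so this sketch is self-contained. In the notebook the
# call would instead be: lambda: op.forwardTransformS(leenaBig, leenaWave)
def workload():
    return sum(i * i for i in range(100_000))

elapsed = best_time(workload, repeats=3)
print(f"best of 3 runs: {elapsed:.4f} s")
```

Taking the minimum rather than the mean reduces the influence of other processes that happen to run during one of the repeats.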

# Assignment

Your job is to parallelize the C++ code using TBB, then test and report the speeds. Test it with
outer-loop parallelism (easy), inner-loop parallelism (harder), and parallelism over both loops.
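
As a conceptual aid, the sketch below shows the decomposition behind outer-loop parallelism: the row range is split into chunks and each chunk is handled by a separate task, much as TBB's `parallel_for` does over a `blocked_range`. This is a Python analogue and not TBB code. The per-row function is a stand-in for the real filter, all names are hypothetical, and Python threads will not show a real speedup here; the point is only that rows can be processed independently.

```python
from concurrent.futures import ThreadPoolExecutor

def transform_row(row):
    """Stand-in for the per-row work: pairwise sums followed by pairwise differences."""
    pairs = [(row[i], row[i + 1]) for i in range(0, len(row), 2)]
    return [a + b for a, b in pairs] + [a - b for a, b in pairs]

def serial(image):
    """Reference version: a plain loop over rows."""
    return [transform_row(row) for row in image]

def outer_parallel(image, workers=4, grain=2):
    """Split the outer (row) loop into chunks of `grain` rows and run the chunks concurrently."""
    chunks = [range(lo, min(lo + grain, len(image)))
              for lo in range(0, len(image), grain)]
    out = [None] * len(image)

    def do_chunk(rows):
        for r in rows:
            out[r] = transform_row(image[r])   # each row is written by exactly one task

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(do_chunk, chunks))       # list() waits for all chunks to finish
    return out

image = [[float(r * 8 + c) for c in range(8)] for r in range(6)]
result = outer_parallel(image)
print("matches serial:", result == serial(image))
```

Inner-loop parallelism would instead split the work inside `transform_row`, which gives each task much less to do and makes scheduling overhead more visible.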

The machine you are working on has 24 cores and an aggregate (parallel) memory bandwidth of 72 GB/s. Remake the roofline model from the **Roofline** lab and find the new crossing point between the memory-bandwidth-limited and compute-limited (FLOP-limited) regimes.
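
The crossing point of a roofline model is the arithmetic intensity at which bandwidth times intensity equals peak compute. The sketch below computes it. The 72 GB/s figure comes from the text above, but the per-core peak is a placeholder assumption that you should replace with the value you derived in the **Roofline** lab.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and bandwidth * intensity."""
    return min(peak_gflops, bandwidth_gbs * intensity)

def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs

BANDWIDTH_GBS = 72.0        # from the lab text
CORES = 24                  # from the lab text
GFLOPS_PER_CORE = 10.0      # PLACEHOLDER assumption, not a measured value
PEAK_GFLOPS = CORES * GFLOPS_PER_CORE

ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)
print(f"ridge point: {ridge:.2f} FLOP/byte")
for intensity in (0.5, 1.0, 4.0, 8.0):
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"intensity {intensity:4.1f} FLOP/byte -> at most {bound:6.1f} GFLOP/s")
```

Kernels with an intensity below the ridge point are limited by memory bandwidth; kernels above it are limited by peak compute.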

Calculate how close you got to optimal performance and speculate how you could have done better.
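
One way to organize this calculation is sketched below: convert a measured run time into achieved GFLOP/s and divide by the roofline bound at the kernel's arithmetic intensity. Every number other than the 72 GB/s bandwidth and the image size is a hypothetical placeholder (FLOPs per sample, bytes per sample, peak compute, and the timing); substitute your own counts and measurements.

```python
def fraction_of_roofline(flops, bytes_moved, seconds, peak_gflops, bandwidth_gbs):
    """Return (achieved GFLOP/s, roofline bound in GFLOP/s, achieved / bound)."""
    achieved = flops / seconds / 1e9
    intensity = flops / bytes_moved                 # FLOP per byte
    bound = min(peak_gflops, bandwidth_gbs * intensity)
    return achieved, bound, achieved / bound

n_samples = (512 * 16) ** 2        # size of the tiled image used for timing
flops = 8 * n_samples              # HYPOTHETICAL: 8 FLOPs per sample
bytes_moved = 8 * n_samples        # HYPOTHETICAL: one 4-byte read + one 4-byte write per sample
seconds = 0.05                     # HYPOTHETICAL measured run time
peak_gflops = 240.0                # HYPOTHETICAL peak; use your Roofline lab value
bandwidth_gbs = 72.0               # from the lab text

achieved, bound, fraction = fraction_of_roofline(
    flops, bytes_moved, seconds, peak_gflops, bandwidth_gbs)
print(f"achieved {achieved:.1f} GFLOP/s, bound {bound:.1f} GFLOP/s, "
      f"{100 * fraction:.0f}% of roofline")
```

A low fraction with a bandwidth-limited bound suggests looking at memory access patterns (for example strided column passes) before adding more threads.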