# About

*by Dr Paul Richmond ([ICCS](https://iccs.cam.ac.uk/) Engineering Lead at University of Cambridge)*

This is an "Introduction to CUDA" lab designed to be executed inside a Jupyter notebook. It follows on from a series of lectures. You can use the notebook functionality to edit files and run code cells.

Some hints are provided in hidden markdown cells. If you are struggling with a particular exercise then click the three dots "..." to show the hint (if one is available).

*Note: If you are running this lab on Google Colab then you will need to run the following code cell to obtain the source files.* 

In [None]:
!git init .
!git remote add -f origin https://github.com/Cambridge-ICCS/CUDALabAfternoon.git
!git checkout main

## Introduction to Task

For this session we are going to start by improving the performance of an existing CUDA program [`boxblur.cu`](./boxblur.cu) (you can open the file in the jupyter-lab editor by clicking on it). The starting code provided contains an implementation of a simple box blur. The box blur (also known as a box linear filter) is an operation which samples the neighbouring pixels of an input image and outputs their average value. When applied iteratively to an image, the box filter can be used to approximate a more complicated Gaussian blur ([wiki link](https://en.wikipedia.org/wiki/Gaussian_blur)). The box blur can be described as follows.

$$Out_{x,y}=\frac{I_{x-1,y-1} +I_{x,y-1} +I_{x+1,y-1} +I_{x-1,y} +I_{x,y} +I_{x+1,y} +I_{x-1,y+1} +I_{x,y+1} +I_{x+1,y+1}}{9} $$

In the implementation provided, pixel values outside the bounds of the input image are treated as `0`. The code works for fixed-size square images. An image `input.ppm` is provided in the PPM format, and code is provided for image reading and writing. You can use your own image, but make sure that the `IMAGE_SIZE` macro is changed to reflect your image size.
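As a rough illustration of the formula and the zero-valued border, the following is a minimal CPU sketch in Python. It assumes a single-channel image stored as a list of rows of floats; the function name `box_blur` and this data layout are chosen for illustration only and are not taken from `boxblur.cu`.

```python
def box_blur(image):
    """Apply one 3x3 box blur pass; pixels outside the image count as 0."""
    height = len(image)
    width = len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            # Sum the 3x3 neighbourhood centred on (x, y).
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    # Out-of-bounds neighbours contribute 0, so they are skipped.
                    if 0 <= ny < height and 0 <= nx < width:
                        total += image[ny][nx]
            # The divisor is always 9, even at the image border.
            output[y][x] = total / 9.0
    return output


# A uniform 3x3 image (assumed example data).
image = [[9.0, 9.0, 9.0] for _ in range(3)]
blurred = box_blur(image)
print(blurred[1][1])  # centre: 9.0
print(blurred[0][1])  # edge:   6.0
print(blurred[0][0])  # corner: 4.0
```

Because the divisor stays at 9 while out-of-bounds neighbours count as zero, pixels at the border become darker than interior pixels, even for a uniform image.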
   
![Box Blur](./blur.png)

*Figure 1 - Result of applying the box filter for 0, 50 and 100 iterations on a photograph of the trainer's dog (note: he looks cuter than he is...)*

Try compiling and running the code using the code cells below (or via a jupyter-lab Terminal) and examine the output of the blurred image. Make a note of the execution time reported.

*Note: The preprocessor definition (`-D`) in the compilation is used to change code paths so that different versions of the main box blur loop can be easily switched between. Examine the starting code so that you can see where the switch takes place within the `main` function. There is a separate switch case for each exercise in this lab.*




In [None]:
# compile
!nvcc boxblur.cu -D EXERCISE_MODE=STARTING_CODE -o boxblur

In [None]:
# execute
!./boxblur

## Displaying the starting image

You can view the original (starting) image for this exercise by running the code cell below. The file is in the `ppm` format, which is an uncompressed, human-readable format (opening it in a text editor is not recommended, as the original is `2048 x 2048` pixels in size!).
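Since the header of a PPM file is plain text, the image dimensions can be checked without loading any pixel data, which may be handy if you swap in your own image. The sketch below is an assumption-laden illustration: `read_ppm_header` is a hypothetical helper (not part of the lab code), and the demonstration writes a tiny `2 x 2` file called `tiny.ppm` to a temporary directory rather than touching `input.ppm`.

```python
import os
import tempfile


def read_ppm_header(path):
    """Return (magic, width, height, max_value) from a PPM file header."""
    tokens = []
    with open(path, "rb") as f:
        for line in f:
            # Comments start with '#' and run to the end of the line.
            line = line.split(b"#", 1)[0]
            tokens.extend(line.split())
            if len(tokens) >= 4:
                break  # the header is complete; do not read pixel data
    magic, width, height, max_value = tokens[:4]
    return magic.decode("ascii"), int(width), int(height), int(max_value)


# Write a tiny plain-text (P3) image to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.ppm")
    with open(path, "w") as f:
        f.write("P3\n# a 2x2 example image\n2 2\n255\n")
        f.write("255 0 0  0 255 0\n0 0 255  255 255 255\n")
    header = read_ppm_header(path)

print(header)  # ('P3', 2, 2, 255)
```

Calling `read_ppm_header('input.ppm')` in the notebook should report the width and height of the provided image in the same way.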

Note: The code cell resizes the image to `512 x 512` for display and requires that the Python environment has the `pillow` package. You can install pillow using the code cell below if it is not already available in your Python environment.

In [None]:
!pip install pillow

In [None]:
from IPython.display import display
from PIL import Image as PILImage
display(PILImage.open('input.ppm').resize((512,512)))

## Displaying the output image

You can view the output image of the CUDA box blur program by running the code cell below. You can come back and run this cell any time you have run the program to ensure that the output is as you would expect.

In [None]:
from IPython.display import display
from PIL import Image as PILImage
display(PILImage.open('output.ppm').resize((512,512)))

## Exercise 1

The code has a number of inefficiencies. We will first consider the transfer bottleneck. For each iteration of applying the box filter/blur, the algorithm performs the following steps:

1. Copy the previous iteration's output (or the input) image from the host to the device
2. Apply the box blur GPU kernel
3. Copy the results back to the host and repeat the above.

It is not necessary to copy the results of each filter operation back to the host. We can simply pass the pointer to the previous iteration's output as the input for the next iteration. This will drastically reduce memory movements via PCIe. To implement pointer swapping, complete the following steps.

### Step 1

Copy the code from the `STARTING_CODE` switch case into the `EXERCISE_01` switch case.

### Step 2

Move the memory copy of the input image outside of the `ITERATIONS` loop so that the host data is copied to the device only once.


### Step 3

A pointer `d_image_temp` has been defined for you. Use this as a temporary pointer to swap the areas of memory pointed to by `d_image` and `d_image_output` after the box blur kernel is applied.


### Step 4

Move the memory copy of the output image outside of the `ITERATIONS` loop so that the device data is copied back to the host only once. *Note: Be careful that you copy back from the correct device pointer if you have swapped them!*
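The caution about copying back from the correct pointer can be illustrated with a small host-side model in Python. This is only a sketch of the idea, not the CUDA code: the dictionaries stand in for device allocations, the `passes` counter stands in for the blur kernel, and it assumes the swap is performed after every kernel launch, including the last one.

```python
def run_iterations(iterations):
    """Model pointer swapping with two labelled stand-in buffers."""
    # Stand-ins for the device pointers; "passes" counts blur passes applied.
    d_image = {"name": "buffer A", "passes": 0}  # starts holding the input
    d_image_output = {"name": "buffer B", "passes": 0}

    for _ in range(iterations):
        # Stand-in for the kernel: read d_image, write d_image_output.
        d_image_output["passes"] = d_image["passes"] + 1

        # Swap so this iteration's output becomes the next iteration's input.
        d_image_temp = d_image
        d_image = d_image_output
        d_image_output = d_image_temp

    # After the final swap, the newest result is referenced by d_image.
    return d_image, d_image_output


latest, stale = run_iterations(5)
print(latest["name"], latest["passes"])  # buffer B 5
print(stale["name"], stale["passes"])    # buffer A 4
```

Under this arrangement the fully blurred image ends up behind `d_image`, while `d_image_output` refers to the result of the previous pass, so copying back from the wrong one would silently lose an iteration.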

### Step 5

Compile and execute your code. Ensure that the variable `exercise` is set to `EXERCISE_01` so that your modified code is executed. Make a note of the execution time. It should be considerably faster than previously.

In [None]:
# compile
!nvcc boxblur.cu -D EXERCISE_MODE=EXERCISE_01 -o boxblur

In [None]:
# execute
!./boxblur

## Exercise 2

The `image_blur_columns` kernel currently has a poor memory access pattern. Let us consider why this is. Each thread that is launched iterates over a unique row of `IMAGE_DIM` pixels, blurring each pixel in turn. At any given step, neighbouring threads therefore load from addresses that are `IMAGE_DIM` elements apart. CUDA code is much more efficient when sequential threads read from sequential values in memory (memory coalescing). To improve the code, we can implement a row-wise version of the kernel by completing the following steps.
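The difference between the two access patterns can be seen by listing the element offsets that the first 32 threads (one warp) touch at the same loop step. The sketch below is plain Python arithmetic rather than CUDA, and it assumes a row-major layout in which pixel `(x, y)` lives at offset `y * IMAGE_DIM + x`; the function names are illustrative only.

```python
IMAGE_DIM = 2048  # matches the 2048 x 2048 input image
WARP_SIZE = 32    # threads that issue a memory instruction together


def addresses_row_per_thread(step):
    """Thread t owns row t and is at column `step` of its loop."""
    return [t * IMAGE_DIM + step for t in range(WARP_SIZE)]


def addresses_column_per_thread(step):
    """Thread t owns column t and is at row `step` of its loop."""
    return [step * IMAGE_DIM + t for t in range(WARP_SIZE)]


def stride(addresses):
    """Distance in elements between the accesses of neighbouring threads."""
    return addresses[1] - addresses[0]


def span(addresses):
    """Number of elements between the lowest and highest access, inclusive."""
    return max(addresses) - min(addresses) + 1


print(stride(addresses_row_per_thread(0)))     # 2048
print(stride(addresses_column_per_thread(0)))  # 1
print(span(addresses_row_per_thread(0)))       # 63489
print(span(addresses_column_per_thread(0)))    # 32
```

With one row per thread, a single warp's loads are scattered across tens of thousands of elements; with one column per thread, the same 32 loads fall on 32 adjacent elements, which the hardware can service far more efficiently.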

### Step 1

Copy the `image_blur_columns` kernel and call the new kernel `image_blur_rows`.


### Step 2

Modify the `image_blur_rows` kernel so that each thread operates on a unique column (rather than a unique row) of the image. This will ensure that sequential threads read sequential values within a row from memory.


### Step 3

Implement the `EXERCISE_02` switch case (by copying the previous one), ensuring that your host code calls your new kernel.

### Step 4

Compile and execute your code. Ensure that the variable `exercise` is set to `EXERCISE_02` so that your modified code is executed. Make a note of the execution time. It should be considerably faster than previously.

In [None]:
# compile
!nvcc boxblur.cu -D EXERCISE_MODE=EXERCISE_02 -o boxblur

In [None]:
# execute
!./boxblur

## Exercise 3

Our previous implementations of the blur kernel have a limited amount of parallelism. There are in total `IMAGE_DIM` threads launched and each of the threads is responsible for calculating a unique row or column. Whilst this number of threads might seem reasonably large it is unlikely that it is sufficient to occupy all of the Streaming Multiprocessors of the device. To increase the level of parallelism and improve the occupancy it is possible to launch a unique thread for each pixel of the image. To implement this, complete the following steps.

### Step 1

Make a copy of the `image_blur_rows` kernel and call it `image_blur_2d`. Modify the new kernel so that the `x` and `y` locations are determined from the thread and block indices. You can then remove the row loop, as each thread is responsible for calculating only a single pixel value.

### Step 2

Implement the `EXERCISE_03` switch case (by copying the previous one). You will need to change the block and grid dimensions so that they launch `IMAGE_DIM²` threads in total.
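The launch arithmetic can be checked in Python before writing any CUDA. The sketch below assumes a block of `16 x 16` threads (a common choice, but not one prescribed by this lab) and mirrors the usual `blockIdx * blockDim + threadIdx` index calculation for one axis; `grid_dim` and `pixel_for_thread` are hypothetical helper names.

```python
IMAGE_DIM = 2048  # matches the 2048 x 2048 input image
BLOCK_DIM = 16    # assumption: 16 x 16 = 256 threads per block


def grid_dim(image_dim, block_dim):
    """Blocks needed per axis, rounding up so the whole image is covered."""
    return (image_dim + block_dim - 1) // block_dim


def pixel_for_thread(block_idx, thread_idx, block_dim):
    """Mirror of blockIdx * blockDim + threadIdx for one axis."""
    return block_idx * block_dim + thread_idx


blocks_per_axis = grid_dim(IMAGE_DIM, BLOCK_DIM)
total_threads = (blocks_per_axis * BLOCK_DIM) ** 2

print(blocks_per_axis)                                   # 128
print(total_threads == IMAGE_DIM ** 2)                   # True
print(pixel_for_thread(0, 0, BLOCK_DIM))                 # 0
print(pixel_for_thread(blocks_per_axis - 1, 15, BLOCK_DIM))  # 2047
```

Here the image size divides evenly by the block size, so exactly one thread is launched per pixel. For a size that does not divide evenly (for example `grid_dim(2050, 16)` gives 129 blocks, or 2064 threads per axis), the kernel would need a bounds check so that the surplus threads do nothing.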

### Step 3

Compile and execute your code. Ensure that the variable `exercise` is set to `EXERCISE_03` so that your modified code is executed. Make a note of the execution time. It should be considerably faster than previously.

In [None]:
# compile
!nvcc boxblur.cu -D EXERCISE_MODE=EXERCISE_03 -o boxblur

In [None]:
# execute
!./boxblur

## Solutions

If you get stuck with the code, you can view the solutions by checking out the solution file using the code cell below. The command checks out a single file from the solutions branch of the repository, but it will overwrite any changes you have made.

In [None]:
!git checkout origin/solutions -- boxblur.cu