### **CUDA Programming Applications**

This mini-lab offers hands-on practice implementing common, recurring real-world tasks in CUDA. We also aim to compare the results of our low-level implementations with the built-in functions of popular frameworks such as PyTorch. We'll revisit how you can fool CUDA by passing a 2D array (for easier indexing)! Then we'll go straight to implementing our Conv3D kernel function!

### **Requirement**

A) Write a CUDA program that performs a 3D convolution over RGB images and saves the output images. The program receives the path of the folder containing the input images and the path of the folder that should contain the outputs as its 1st and 2nd command-line arguments, respectively. Implement the following kernels:

1.   kernel1: basic implementation (no tiling)
2.   kernel2: tiling where each block matches the input tile size.
3.   kernel3: tiling where each block matches the output tile size.
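The two tiled variants differ mainly in tile geometry: a block that matches the input tile produces fewer outputs than it has threads, while a block that matches the output tile must load a larger, halo-extended input tile. The relationship can be sketched in plain C as below. This is a minimal sketch assuming stride 1 and a square n x n mask; the function names are illustrative and not part of the lab's required interface.

```c
/* Width of the halo-extended input tile needed to produce
   `out_tile` outputs along one axis (stride 1, mask width n). */
int input_tile_dim(int out_tile, int n) {
    return out_tile + n - 1;
}

/* Number of outputs along one axis that an input tile of
   width `in_tile` can produce (stride 1, mask width n). */
int output_tile_dim(int in_tile, int n) {
    return in_tile - (n - 1);
}

/* Blocks needed along one axis to cover `size` output pixels
   when each block produces `out_tile` of them (ceiling division). */
int blocks_needed(int size, int out_tile) {
    return (size + out_tile - 1) / out_tile;
}
```

For instance, with a 3 x 3 mask, a 16 x 16 block in the kernel2 style yields a 14 x 14 output tile, whereas a 16 x 16 block in the kernel3 style has to stage an 18 x 18 input tile.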

Notes:
*   Add the necessary padding so that the output image has the same size as the input image when stride = 1.

*   The kernel should be able to handle a batch of images at a time, the batch size is passed as the 3rd argument.
*   The mask is given in a .txt file whose path is passed as the 4th argument. The first line contains its dimension n (a single number, since the mask is square); the following n lines contain the mask rows, one row per line. Repeat the mask 3 times for the 3 channels of the image.
* (BONUS) Handle stride values other than 1.

  Ex: ./a.out input_folder_path output_folder_path 4 mask.txt stride
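The mask file format and the padding rule from the notes above can be sketched on the host side as follows. This is a rough sketch under stated assumptions: the mask values are taken to be whitespace-separated floats (the handout does not specify the delimiter), n is assumed odd so that symmetric padding works, and `read_mask`, `same_padding`, and `conv_output_size` are hypothetical helper names.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads n from the first line, then n*n
   whitespace-separated values (assumed delimiter). Returns a malloc'd
   row-major n*n array that the caller frees, or NULL on error. */
float *read_mask(FILE *fp, int *n_out) {
    int n;
    if (fscanf(fp, "%d", &n) != 1 || n <= 0) return NULL;
    float *mask = (float *)malloc((size_t)n * n * sizeof *mask);
    if (!mask) return NULL;
    for (int i = 0; i < n * n; i++) {
        if (fscanf(fp, "%f", &mask[i]) != 1) {
            free(mask);
            return NULL;
        }
    }
    *n_out = n;
    return mask;
}

/* Padding per side that keeps output size == input size
   when stride is 1 (assumes odd mask width n). */
int same_padding(int n) {
    return (n - 1) / 2;
}

/* Output length along one axis for a given padding and stride. */
int conv_output_size(int in, int n, int pad, int stride) {
    return (in + 2 * pad - n) / stride + 1;
}
```

With stride 1 and `pad = same_padding(n)`, `conv_output_size` returns the input size unchanged; with a larger stride the same padding gives a proportionally smaller output.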

B) Implement the same program in Python, using the built-in convolution functions in PyTorch.

C) Profile each program carefully and run enough experiments to compare them and draw insightful conclusions. Organise your results in tabular form and prepare a comprehensive report explaining all of your findings. Also report the impact of declaring the mask as constant on execution time, and elaborate on it in your report.

#### **Helpers**

This section contains some helpers that could be needed for the requirement. Check it frequently.

**Helper1**: Read RGB images in C

In [None]:
# Fetch stb_image library

!git clone https://github.com/nothings/stb.git
!cp stb/stb_image.h /usr/local/include/

Cloning into 'stb'...
remote: Enumerating objects: 8031, done.
remote: Counting objects: 100% (163/163), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 8031 (delta 99), reused 104 (delta 78), pack-reused 7868
Receiving objects: 100% (8031/8031), 5.59 MiB | 12.25 MiB/s, done.
Resolving deltas: 100% (5324/5324), done.


In [None]:
%%writefile read_image.c
// Read the image dimensions and pixels
#define STB_IMAGE_IMPLEMENTATION

#include <stdio.h>
#include "stb_image.h"

const size_t NUM_PIXELS_TO_PRINT = 10;

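int main(void) {
    int width, height, comp;
    // Passing 0 as the last argument keeps the file's own channel count.
    unsigned char *data = stbi_load("image.jpeg", &width, &height, &comp, 0);
    if (!data) {
        fprintf(stderr, "Failed to load image.jpeg: %s\n", stbi_failure_reason());
        return 1;
    }
    printf("width = %d, height = %d, comp = %d (channels)\n", width, height, comp);
    // Channels are interleaved (R G B R G B ...); print one pixel per line.
    for (size_t i = 0; i < NUM_PIXELS_TO_PRINT * comp; i++) {
        printf("%d%s", data[i], ((i + 1) % comp) ? " " : "\n");
    }
    printf("\n");
    stbi_image_free(data);  // Release the buffer allocated by stbi_load.
    return 0;
}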

Overwriting read_image.c


In [None]:
!g++ read_image.c -o readImage.out
!./readImage.out

width = 989, height = 1280, comp = 3 (channels)
153 161 161
153 161 161
153 161 161
153 161 161
153 161 161
153 161 161
153 161 161
153 161 161
152 160 160
152 160 160
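The output above prints one pixel per line because `stbi_load` returns the channels interleaved, row by row starting from the top-left pixel. When indexing this buffer from a kernel or from host code, the flat offset of a given channel of a given pixel can be computed as in the sketch below; `pixel_offset` is a hypothetical helper name, not something provided by stb_image.

```c
#include <stddef.h>

/* Flat offset of channel c of pixel (x, y) in an interleaved,
   row-major buffer with `width` pixels per row and `comp` channels. */
size_t pixel_offset(int x, int y, int c, int width, int comp) {
    return ((size_t)y * width + x) * comp + c;
}
```

For the 989-pixel-wide, 3-channel image above, the blue value of the second pixel in the first row is at `data[pixel_offset(1, 0, 2, 989, 3)]`, which is `data[5]`.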

