# EE 599 HW 3 Part 1: Kernel Design

Your task in this Colab notebook is to fill in the sections marked **TODO** (search for the keyword `TODO` to make sure you do not miss any).

## Im2col Algorithm


`im2col` is a method used in CNNs to transform the input data and filters into a form in which the convolution can be expressed as a single matrix multiplication. This transformation simplifies the implementation of convolution and lets it use highly optimized matrix multiplication routines, such as those provided by BLAS libraries.

In class and in discussion, we covered the `im2col` algorithm for a 2D input with a 2D filter.
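As a brief recap of that 2D case, here is a minimal pure-Python sketch, assuming stride 1 and no padding. The helper name `im2col_2d` and the sample values are illustrative and not taken from the class material. Each column of `patches` is one flattened sliding window, so the convolution reduces to a row vector times a matrix.

```python
def im2col_2d(image, kh, kw):
    """Return a (kh*kw) x (oh*ow) matrix whose columns are flattened patches."""
    h, w = len(image), len(image[0])
    oh, ow = h - kh + 1, w - kw + 1  # stride 1, no padding (assumption)
    cols = []
    for i in range(oh):
        for j in range(ow):
            # Flatten the kh x kw window whose top-left corner is (i, j).
            patch = [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            cols.append(patch)
    # Transpose so that each patch becomes one column.
    return [list(row) for row in zip(*cols)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 2],
          [3, 4]]

patches = im2col_2d(image, 2, 2)
flat_kernel = [v for row in kernel for v in row]

# (1 x 4) flattened kernel times (4 x 4) patch matrix gives the 2 x 2 output, flattened.
output = [sum(f * p for f, p in zip(flat_kernel, col)) for col in zip(*patches)]
print(patches)
print(output)
```

As in most CNN frameworks, the kernel is applied without flipping (cross-correlation).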

In this section, we extend the `im2col` algorithm to a 4D input with 4D filters and implement it.

First, let's import packages, configure the convolution operation, and randomly initialize the input and filters.

In [None]:
import torch
import torch.nn.functional as F

padding = 0
stride = 1

N, C, H, W = 4, 3, 5, 5
K, C, KH, KW = 2, 3, 3, 3

input = torch.randn(N, C, H, W)
filter = torch.randn(K, C, KH, KW)

### **TODO 1:**

Calculate the output height `OH` and output width `OW` for the convolution operation, based on the input dimensions, padding, stride, and kernel size.

In [None]:
#TODO: compute OH and OW

OH = (H - KH + 2 * padding) // stride + 1
OW = (W - KW + 2 * padding) // stride + 1

print(OH)
print(OW)

3
3
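To see how the output size responds to other settings, the formula can be wrapped in a small helper. This is a sketch: the function name `conv_output_size` is hypothetical, and it assumes no dilation and the same padding on both sides of a dimension.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Output length along one spatial dimension of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


print(conv_output_size(5, 3))                       # the setting used in this notebook
print(conv_output_size(5, 3, padding=1))            # padding that preserves the input size
print(conv_output_size(7, 3, stride=2))             # strided convolution
print(conv_output_size(5, 3, padding=1, stride=2))  # padding and stride combined
```

Integer floor division matters when `stride` does not evenly divide `size + 2 * padding - kernel`: the leftover rows or columns are simply not covered by any window.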


### **TODO 2**:

Implement the `im2col` operation to transform the 4D input and filter tensors into two 2D matrices. After matrix multiplication, reshape the result back into a 4D tensor to simulate the convolution operation. Compare your implementation's result `output_im2col` with PyTorch's `conv2d` function to verify correctness.

In [None]:
#TODO: implement im2col for 4D input and filter

block_size = KH * KW  # number of elements one channel contributes to a patch

# Filter matrix: row k holds filter k flattened in (C, KH, KW) order.
toeplitz_filter = torch.zeros(K, C * block_size)
for k in range(K):
  for c in range(C):
    block = filter[k, c, :, :].flatten()
    toeplitz_filter[k, c * block_size : (c + 1) * block_size] = block
# print(toeplitz_filter.size())

# Zero-pad the spatial dimensions so the patch indexing below also holds for padding > 0.
padded = F.pad(input, (padding, padding, padding, padding))

# Input matrix: column `col` holds the patch under output position (i, j).
toeplitz_input = torch.zeros(C * block_size, OH * OW)

outputs = []
for n in range(N):
  col = 0
  for i in range(OH):
    for j in range(OW):
      top, left = i * stride, j * stride  # top-left corner of the patch
      for c in range(C):
        block = padded[n, c, top : top + KH, left : left + KW].flatten()
        toeplitz_input[c * block_size : (c + 1) * block_size, col] = block
      col += 1
  # print(toeplitz_input.size())

  # (K, C*KH*KW) x (C*KH*KW, OH*OW) -> (K, OH*OW), reshaped to one output image.
  temp = torch.matmul(toeplitz_filter, toeplitz_input)
  outputs.append(temp.view(1, K, OH, OW))

output_im2col = torch.cat(outputs, dim=0)

print(output_im2col)

tensor([[[[-11.3348,  -9.5617,  -7.9537],
          [  6.9408,  -3.1637,   9.1057],
          [  1.8903,  -4.2099,  -2.9147]],

         [[ -5.5125,  -1.3172,   2.6162],
          [  2.7180,  -3.5951,   5.0873],
          [ -0.7070,   0.2077,  -3.5802]]],


        [[[ 12.0494,  -2.2278,   1.7364],
          [  8.8223,  -1.1798,   5.5721],
          [  5.0131,   2.6206,   9.0075]],

         [[ -5.3580,   6.4889,   1.7364],
          [  3.3177,   4.6009,  -3.4258],
          [  0.6937,   2.3984,  -5.4314]]],


        [[[ -3.1373,  -3.8238,   9.4231],
          [  0.3158,  -9.6695,   0.2259],
          [  3.5829,  -7.3342,  -2.6818]],

         [[ -5.1266,   2.6498,  10.4558],
          [ -1.4673,  -5.7154,   3.7507],
          [ -0.6576,  -4.8772,  -4.0104]]],


        [[[ -0.6623,  -5.4126,  -2.8191],
          [  3.9214,   3.2033,   2.1995],
          [  6.7370,   2.2119,  -1.6412]],

         [[ -6.0497,   5.9986, -10.4230],
          [  2.8818,  -7.1503,   2.0619],
          [  3

The reference result, computed with PyTorch's `conv2d` function, is shown below.

In [None]:
output_conv2d = F.conv2d(input, filter, stride=stride, padding=padding)
print(output_conv2d)

tensor([[[[-11.3348,  -9.5617,  -7.9537],
          [  6.9408,  -3.1637,   9.1057],
          [  1.8903,  -4.2099,  -2.9147]],

         [[ -5.5125,  -1.3172,   2.6162],
          [  2.7180,  -3.5951,   5.0873],
          [ -0.7070,   0.2077,  -3.5802]]],


        [[[ 12.0494,  -2.2278,   1.7364],
          [  8.8223,  -1.1798,   5.5721],
          [  5.0131,   2.6206,   9.0075]],

         [[ -5.3580,   6.4889,   1.7364],
          [  3.3177,   4.6009,  -3.4258],
          [  0.6937,   2.3984,  -5.4314]]],


        [[[ -3.1373,  -3.8238,   9.4231],
          [  0.3158,  -9.6695,   0.2259],
          [  3.5829,  -7.3342,  -2.6818]],

         [[ -5.1266,   2.6498,  10.4558],
          [ -1.4673,  -5.7154,   3.7507],
          [ -0.6576,  -4.8772,  -4.0104]]],


        [[[ -0.6623,  -5.4125,  -2.8191],
          [  3.9214,   3.2033,   2.1995],
          [  6.7370,   2.2119,  -1.6412]],

         [[ -6.0497,   5.9986, -10.4230],
          [  2.8818,  -7.1503,   2.0619],
          [  3

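We can use the following code to count how many elements match. `torch.isclose` compares values within a tolerance, so small floating-point differences (such as `-5.4126` versus `-5.4125` in the printouts above) still count as matches.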

In [None]:
# print sum of the True values
print("Number of matched elements =", torch.isclose(output_im2col, output_conv2d).sum().item())

# print the total number of values
print("Number of total elements =", output_im2col.numel())


Number of matched elements = 72
Number of total elements = 72
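The same patch extraction can also be written without explicit Python loops by using `torch.nn.functional.unfold`. The sketch below is one possible cross-check rather than the expected solution. It is self-contained, so it draws its own random tensors, and the names `x`, `w`, `cols`, and `out_unfold` are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

N, C, H, W = 4, 3, 5, 5
K, KH, KW = 2, 3, 3
padding, stride = 0, 1

x = torch.randn(N, C, H, W)    # illustrative input
w = torch.randn(K, C, KH, KW)  # illustrative filters

OH = (H + 2 * padding - KH) // stride + 1
OW = (W + 2 * padding - KW) // stride + 1

# unfold extracts every sliding patch: shape (N, C*KH*KW, OH*OW).
cols = F.unfold(x, kernel_size=(KH, KW), padding=padding, stride=stride)

# Flatten each filter into one row: shape (K, C*KH*KW).
w_mat = w.reshape(K, -1)

# matmul broadcasts the 2D filter matrix over the batch: (N, K, OH*OW).
out_unfold = torch.matmul(w_mat, cols).reshape(N, K, OH, OW)

reference = F.conv2d(x, w, stride=stride, padding=padding)
print(out_unfold.shape)
print(torch.allclose(out_unfold, reference, atol=1e-4))
```

The rows of `cols` are ordered by channel first and then by kernel position, which is the same order produced by flattening a filter of shape `(C, KH, KW)`, so no reordering is needed before the multiplication.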


## Matrix Multiplication Optimization

Matrix multiplication is a foundational operation, but it is computationally expensive: the standard algorithm for two n×n matrices performs on the order of n³ operations.
Today's compilers cannot fully exploit the complexity of modern processor architectures, so there is a wide gap between the performance the hardware can deliver and the performance software actually achieves. Closing that gap requires performance tuning, and tuning even a simple matrix multiplication is challenging. In this section, we discuss some optimization techniques that gave us substantial improvements.

To evaluate and improve the performance of matrix multiplication implementations, it's beneficial to use low-level programming languages like C or C++, which offer closer control over hardware resources. Within a notebook environment, we can facilitate the development, compilation, and execution of C code by using specific commands. The `%%writefile` command allows us to save the content of a notebook cell directly into a file, which can then be compiled and executed using command-line instructions.

In the cell below, we provide a naive matrix multiplication implementation and measure its throughput in floating-point operations (FLOPs) per second.

In [1]:
%%writefile naive_mm.c
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    int i, j, k;
    struct timespec start, stop;
    double time;
    int n = 1024; // Matrix size is n*n

    // Allocate memory for matrices A, B, and C
    double **A = (double**) malloc(sizeof(double*) * n);
    double **B = (double**) malloc(sizeof(double*) * n);
    double **C = (double**) malloc(sizeof(double*) * n);
    for (i = 0; i < n; i++) {
        A[i] = (double*) malloc(sizeof(double) * n);
        B[i] = (double*) malloc(sizeof(double) * n);
        C[i] = (double*) malloc(sizeof(double) * n);
    }

    // Initialize matrices A and B
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            A[i][j] = i;
            B[i][j] = i + j;
            C[i][j] = 0;
        }
    }

    // Start timer
    if (clock_gettime(CLOCK_REALTIME, &start) == -1) {
        perror("clock gettime");
    }

    // Naive Matrix Multiplication
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }

    // Stop timer
    if (clock_gettime(CLOCK_REALTIME, &stop) == -1) {
        perror("clock gettime");
    }
    time = (stop.tv_sec - start.tv_sec) + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;

    // Print results (count FLOPs in double: 2 * 1024^3 = 2^31 does not fit in a signed int)
    double flops = 2.0 * n * n * n;
    printf("Number of FLOPs = %.0f, Execution time = %f sec,\n%lf MFLOPs per sec\n", flops, time, flops / time / 1e6);
    printf("C[100][100]=%f\n", C[100][100]);

    // Release memory
    for (i = 0; i < n; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
    }
    free(A);
    free(B);
    free(C);

    return 0;
}

Writing naive_mm.c


Compile and execute the code.

In [None]:
!g++ naive_mm.c -o naive_mm && ./naive_mm

Number of FLOPs = 2147483648, Execution time = 22.409222 sec,
95.830352 MFLOPs per sec
C[100][100]=62617600.000000
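The reported numbers can be reproduced by hand. The sketch below assumes the usual counting convention of one multiply and one add per inner-loop iteration. The helper names are hypothetical, and the timing is copied from the output above.

```python
def matmul_flops(n):
    """FLOPs for an n x n matrix product: n**3 iterations, 2 operations each."""
    return 2 * n ** 3


def mflops_per_sec(n, seconds):
    """Throughput in millions of floating-point operations per second."""
    return matmul_flops(n) / seconds / 1e6


n = 1024
naive_seconds = 22.409222  # execution time reported in the output above

print(matmul_flops(n))
print(round(mflops_per_sec(n, naive_seconds), 2))
```

For `n = 1024` the count is exactly 2^31, which is one more than the largest value a 32-bit signed integer can hold, so a wider or floating-point type is needed when computing it in C.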


### **TODO 3:**

Blocked matrix multiplication, also known as tiled matrix multiplication, is an optimization technique used to improve the performance of matrix multiplication operations, especially on modern hardware with hierarchical memory systems. This approach involves dividing the input matrices into smaller sub-matrices or "blocks" and then performing the multiplication on these blocks rather than on individual elements.

In the cell below, fill out the missing code.

In [2]:
%%writefile blocking_mm.c
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char *argv[]) {
    int i, j, k, ii, jj, kk;
    struct timespec start, stop;
    double time;
    int n = 1024; // Matrix size is n*n
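    if (argc < 2) {
        fprintf(stderr, "Usage: %s <block size>\n", argv[0]);
        return 1;
    }
    int b = atoi(argv[1]); // Block size
    if (b <= 0) { // A non-positive step would make the block loops never terminate
        fprintf(stderr, "Block size must be a positive integer\n");
        return 1;
    }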

    // Allocate memory for matrices A, B, and C
    double **A = (double**) malloc(sizeof(double*) * n);
    double **B = (double**) malloc(sizeof(double*) * n);
    double **C = (double**) malloc(sizeof(double*) * n);
    for (i = 0; i < n; i++) {
        A[i] = (double*) malloc(sizeof(double) * n);
        B[i] = (double*) malloc(sizeof(double) * n);
        C[i] = (double*) malloc(sizeof(double) * n);
    }

    // Initialize matrices A and B
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            A[i][j] = i;
            B[i][j] = i + j;
            C[i][j] = 0;
        }
    }

    // Start timer
    if (clock_gettime(CLOCK_REALTIME, &start) == -1) {
        perror("clock gettime");
    }

    // TODO: Blocking Matrix Multiplication
    // Your code goes here
    //*******************************//
    for (ii = 0; ii < n; ii += b) {
      for (jj = 0; jj < n; jj += b) {
        for (kk = 0; kk < n; kk += b) {
            for (i = ii; i < ii + b && i < n; i++) {
                for (j = jj; j < jj + b && j < n; j++) {
                    for (k = kk; k < kk + b && k < n; k++) {
                        C[i][j] = C[i][j] + A[i][k] * B[k][j];
                    }
                }
            }
        }
      }
    }

    //*******************************//

    // Stop timer
    if (clock_gettime(CLOCK_REALTIME, &stop) == -1) {
        perror("clock gettime");
    }
    time = (stop.tv_sec - start.tv_sec) + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;

    // Print results (count FLOPs in double: 2 * 1024^3 = 2^31 does not fit in a signed int)
    double flops = 2.0 * n * n * n;
    printf("Number of FLOPs = %.0f, Execution time = %f sec,\n%lf MFLOPs per sec\n", flops, time, flops / time / 1e6);
    printf("C[100][100]=%f\n", C[100][100]);

    // Release memory
    for (i = 0; i < n; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
    }
    free(A);
    free(B);
    free(C);

    return 0;
}

Writing blocking_mm.c


Compile and execute the code with different block sizes.

In [3]:
!g++ blocking_mm.c -o blocking_mm && ./blocking_mm 4 && ./blocking_mm 8 && ./blocking_mm 16

Number of FLOPs = 2147483648, Execution time = 11.591916 sec,
185.257011 MFLOPs per sec
C[100][100]=62617600.000000
Number of FLOPs = 2147483648, Execution time = 9.187708 sec,
233.734438 MFLOPs per sec
C[100][100]=62617600.000000
Number of FLOPs = 2147483648, Execution time = 8.284142 sec,
259.228257 MFLOPs per sec
C[100][100]=62617600.000000
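When experimenting with block sizes, it can help to validate the loop bounds on a small problem first, especially when `n` is not a multiple of `b`. The following pure-Python sketch mirrors the six-loop structure and the initialization used above. It is meant only as a correctness check, not as a performance reference, and the function names are illustrative.

```python
def naive_matmul(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def blocked_matmul(A, B, b):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    # Outer three loops pick a (b x b) block; inner three loops multiply within it.
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                # min(...) clips the last block when b does not divide n.
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        for k in range(kk, min(kk + b, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C


n = 6
A = [[float(i) for _ in range(n)] for i in range(n)]      # A[i][j] = i
B = [[float(i + j) for j in range(n)] for i in range(n)]  # B[i][j] = i + j

# b = 4 does not divide n = 6, so the clipped edge blocks are exercised.
print(blocked_matmul(A, B, 4) == naive_matmul(A, B))
```

Blocking changes only the order in which the products are visited, not which products are accumulated, so both versions produce the same matrix.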


### **TODO 4:**

OpenMP is a powerful API designed for parallel programming in C, C++, and Fortran, enabling efficient utilization of multicore and multiprocessor systems. It simplifies the development of parallel applications by providing a set of straightforward compiler directives, library routines, and environment variables that abstract away the complexities of thread management and synchronization. By allowing code to be parallelized with minimal modifications, OpenMP fosters portability and scalability across various platforms.

In the cell below, use the proper pragma configuration to execute your for loops in parallel. Make sure the loop indices are private to each thread; otherwise, race conditions can occur. We use the default number of threads in the Colab environment.

In [4]:
%%writefile blocking_mt_mm.c
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int i, j, k, ii, jj, kk;
    struct timespec start, stop;
    double time;
    int n = 1024; // Matrix size is n*n
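    if (argc < 2) {
        fprintf(stderr, "Usage: %s <block size>\n", argv[0]);
        return 1;
    }
    int b = atoi(argv[1]); // Block size
    if (b <= 0) { // A non-positive step would make the block loops never terminate
        fprintf(stderr, "Block size must be a positive integer\n");
        return 1;
    }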

    // Allocate memory for matrices A, B, and C
    double **A = (double**) malloc(sizeof(double*) * n);
    double **B = (double**) malloc(sizeof(double*) * n);
    double **C = (double**) malloc(sizeof(double*) * n);
    for (i = 0; i < n; i++) {
        A[i] = (double*) malloc(sizeof(double) * n);
        B[i] = (double*) malloc(sizeof(double) * n);
        C[i] = (double*) malloc(sizeof(double) * n);
    }

    // Initialize matrices A and B
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            A[i][j] = i;
            B[i][j] = i + j;
            C[i][j] = 0;
        }
    }

    // Start timer
    if (clock_gettime(CLOCK_REALTIME, &start) == -1) {
        perror("clock gettime");
    }

    // TODO: Blocking Matrix Multiplication with OpenMP
    // Your code goes here
    //*******************************//
    #pragma omp parallel for private(i, j, k, ii, jj, kk) shared(A, B, C)
    for (ii = 0; ii < n; ii += b) {
        for (jj = 0; jj < n; jj += b) {
            for (kk = 0; kk < n; kk += b) {
                for (i = ii; i < ii + b && i < n; i++) {
                    for (j = jj; j < jj + b && j < n; j++) {
                        for (k = kk; k < kk + b && k < n; k++) {
                            C[i][j] += A[i][k] * B[k][j];
                        }
                    }
                }
            }
        }
    }

    //*******************************//

    // Stop timer
    if (clock_gettime(CLOCK_REALTIME, &stop) == -1) {
        perror("clock gettime");
    }
    time = (stop.tv_sec - start.tv_sec) + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;

    // Print results (count FLOPs in double: 2 * 1024^3 = 2^31 does not fit in a signed int)
    double flops = 2.0 * n * n * n;
    printf("Number of FLOPs = %.0f, Execution time = %f sec,\n%lf MFLOPs per sec\n", flops, time, flops / time / 1e6);
    printf("C[100][100]=%f\n", C[100][100]);

    // Release memory
    for (i = 0; i < n; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
    }
    free(A);
    free(B);
    free(C);

    return 0;
}

Writing blocking_mt_mm.c
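Parallelizing the outermost block loop is free of data races because each block-row of `C` is written by exactly one thread, and the loop indices are private to that thread. The Python sketch below is only an analogy for that work decomposition, not OpenMP. It uses the standard-library `ThreadPoolExecutor`, the function names are hypothetical, and because of Python's global interpreter lock it illustrates the partitioning rather than any speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_row_block(A, B, C, ii, b):
    """Compute rows ii .. ii+b-1 of C. No other task writes these rows."""
    n = len(A)
    # jj, kk, i, j, k are local variables, so every task has its own copies.
    for jj in range(0, n, b):
        for kk in range(0, n, b):
            for i in range(ii, min(ii + b, n)):
                for j in range(jj, min(jj + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        C[i][j] += A[i][k] * B[k][j]


def parallel_blocked_matmul(A, B, b, workers=4):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]  # shared output, disjoint rows per task
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(multiply_row_block, A, B, C, ii, b)
                   for ii in range(0, n, b)]
        for future in futures:
            future.result()  # wait, and re-raise any exception from a worker
    return C


n = 8
A = [[float(i) for _ in range(n)] for i in range(n)]      # A[i][j] = i
B = [[float(i + j) for j in range(n)] for i in range(n)]  # B[i][j] = i + j

C = parallel_blocked_matmul(A, B, b=3)
print(C[1][0], C[7][7])
```

If the `kk` loop were the one split across threads instead, several threads would accumulate into the same `C[i][j]`, and the updates would need a reduction or other synchronization.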


Compile and execute the code with different block sizes.

In [5]:
!g++ -fopenmp blocking_mt_mm.c -o blocking_mt_mm && ./blocking_mt_mm 4 && ./blocking_mt_mm 8 && ./blocking_mt_mm 16

Number of FLOPs = 2147483648, Execution time = 9.577955 sec,
224.211083 MFLOPs per sec
C[100][100]=62617600.000000
Number of FLOPs = 2147483648, Execution time = 8.163059 sec,
263.073407 MFLOPs per sec
C[100][100]=62617600.000000
Number of FLOPs = 2147483648, Execution time = 6.489922 sec,
330.895120 MFLOPs per sec
C[100][100]=62617600.000000
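To compare the three variants, the execution times printed above can be turned into speedups over the naive version. This is a small sketch: the timings are copied from the outputs in this notebook and will differ from run to run and from machine to machine.

```python
naive_seconds = 22.409222

# Execution times in seconds, copied from the outputs above, keyed by block size.
blocked = {4: 11.591916, 8: 9.187708, 16: 8.284142}
blocked_omp = {4: 9.577955, 8: 8.163059, 16: 6.489922}


def speedup(baseline_seconds, seconds):
    """How many times faster a run is than the baseline."""
    return baseline_seconds / seconds


for b in sorted(blocked):
    print(f"b={b:2d}  blocked: {speedup(naive_seconds, blocked[b]):.2f}x  "
          f"blocked+OpenMP: {speedup(naive_seconds, blocked_omp[b]):.2f}x")
```

In these runs, both blocked variants improve as the block size grows from 4 to 16, and the OpenMP version is faster than the single-threaded blocked version at every block size tested.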
