Name: Rahul Prabhu NetID: rprabhu5 Section: ECE408 AL/AB

## ECE 408/CS483 Milestone 3 Report

0. List Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images from your basic forward convolution kernel in milestone 2. This will act as your baseline this milestone.

| Batch Size | Op Time 1     | Op Time 2     | Total<br>Execution<br>Time | Accuracy |
|------------|---------------|---------------|----------------------------|----------|
| 100        | 1.73674<br>ms | 2.34653<br>ms | 1.231 secs                 | 0.86     |
| 1000       | 1.79239<br>ms | 6.92718<br>ms | 10.162 secs                | 0.886    |
| 10000      | 17.738<br>ms  | 69.3869<br>ms | 1 min<br>43.661 secs       | 0.8714   |

## 1. Optimization 1: Using Streams to overlap computation with data transfer

| a. | Which optimization did you choose to implement? Chose from the optimization below |
|----|-----------------------------------------------------------------------------------|
|    | by clicking on the check box and explain why did you choose that optimization     |
|    | technique.                                                                        |

| ☐ Tiled shared memory convolution (2 points)                                |
|-----------------------------------------------------------------------------|
| ☐ Shared memory matrix multiplication and input matrix unrolling (3 points) |
| ☐ Kernel fusion for unrolling and matrix-multiplication (2 points)          |
| ☐ Weight matrix in constant memory (1 point)                                |
| ☐ Tuning with restrict and loop unrolling (3 points)                        |
| ☐ Sweeping various parameters to find best values (1 point)                 |
| ☐ Multiple kernel implementations for different layer sizes (1 point)       |
| ☐ Input channel reduction: tree ( <b>3 point</b> )                          |
| ☐ Input channel reduction: atomics (2 point)                                |
| ☐ Fixed point (FP16) arithmetic. (4 points)                                 |

| Using Streams to overlap computation with data transfer (4 points)         |
|----------------------------------------------------------------------------|
| $\square$ An advanced matrix multiplication algorithm (5 points)           |
| ☐ Using Tensor Cores to speed up matrix multiplication ( <b>5 points</b> ) |
| ☐ Overlap-Add method for FFT-based convolution (8 points)                  |
| ☐ Other optimizations: please explain                                      |
|                                                                            |

I choose this optimization because I thought it was really cool how you are able to divide the trivial tasks of copy memory and executing a kernel even further into streams of execution queues

b. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations?

This optimization works by allowing the device overlap which simultaneously executes a kernel while performing a copy between device and host memory. These streams divide the data sets into batches and assign each batch to a stream. Within a stream, there are three major operations that are added to its queue of execution: Memcpy to device, execute kernel function, and Memcpy back to host. I believed this would increase performance significantly by allowing even further parallelism between redundant and independent tasks. This optimization synergizes with additional optimizations as most optimizations require two memcpy's and a kernel call.

 List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used).

| Batch Size | Op Time 1      | Op Time 2      | Total<br>Execution<br>Time | Accuracy |
|------------|----------------|----------------|----------------------------|----------|
| 100        | 0.001047<br>ms | 0.001274m<br>s | 1.224 secs                 | 0.86     |
| 1000       | 0.000816<br>ms | 0.000689m<br>s | 10.702 sec                 | 0.886    |
| 10000      | 0.000741<br>ms | 0.00086 ms     | 1min42 secs                | 0.8714   |

d. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your

answer, directly comparing to your baseline (or the previous optimization this one is built off of

The streams were successful in that they allowed for more parallelism within the GPU, making the usage of the device more efficient. This can be shown through the screenshots below in that they show that the device and streaming multiprocessor use was much more efficient.

| Time(%) | Total Time | Instances | Average    | Minimum | Maximum   | Name                      |
|---------|------------|-----------|------------|---------|-----------|---------------------------|
|         |            |           |            |         |           |                           |
| 99.9    | 8937865    | 20        | 446893.3   | 184383  | 711131    | conv_forward_kernel       |
| 0.0     | 2688       |           | 1344.0     | 1312    | 1376      | prefn_marker_kernel       |
| 0.0     | 2528       |           | 1264.0     | 1248    | 1280      | do_not_remove_this_kernel |
|         |            |           |            |         |           |                           |
| Time(%) | Total Time | Calls     | Average    | Minimum | Maximum   | Name                      |
|         |            |           |            |         |           |                           |
| 58.6    | 166552971  |           | 20819121.4 | 9262    | 165654369 | cudaMalloc                |
| 41.0    | 116642963  | 42        | 2777213.4  | 11360   | 6167556   | cudaMemcpyAsync           |
| 0.2     | 531569     | 24        | 22148.7    | 17549   | 33077     | cudaLaunchKernel          |
| 0.1     | 185335     |           | 23166.9    | 379     | 94358     | cudaFree                  |
| 0.0     | 135517     | 20        | 6775.9     | 1419    | 78754     | cudaStreamCreate          |
| 0.0     | 79871      | 20        | 3993.6     | 1776    | 23260     | cudaStreamDestroy         |
| 0.0     | 65715      |           | 32857.5    | 25158   | 40557     | cudaMemcpy                |
| 0.0     | 31066      |           | 3883.3     | 1084    | 8421      | cudaDeviceSynchronize     |
|         |            |           |            |         |           |                           |



e. What references did you use when implementing this technique?

I used lecture 22 from class as well as this NVIDIA blog post for an additional example with multiple streams:

https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/.

f. Please Paste your kernel code for this optimization. Your code should include the non-trivial code that you have changed for this optimization.
For example, it can be the complete kernel code for Tiled shared memory convolution several lines of code for Weight matrix in constant memory, or the "for" loop for loop unrolling

```
const int Height out = Height - K + 1;
  const int Width_out = Width - K + 1;
  int H_grid = ceil(1.0 * Height_out / TILE_WIDTH);
  int W grid = ceil(1.0 * Width out / TILE WIDTH);
  int Y = H_grid * W_grid;
  dim3 blockDim(TILE WIDTH, TILE WIDTH, 1);
  dim3 gridDim(Map_out, Y, Batch/numstreams);
  cudaMalloc((void**)device_output_ptr, Batch * Map_out * Height_out * Width_out * sizeof(float));
  cudaMalloc((void**)device mask ptr, Channel * Map out * K * K * sizeof(float));
  cudaMalloc((void**)device_input_ptr, Batch * Channel * Height * Width * sizeof(float));
  cudaStream_t streams[numstreams];
 int i:
 for (i = 0; i < numstreams; i++) {
    cudaStreamCreate(&streams[i]);
  cudaMemcpyAsync(*device_mask_ptr, host_mask, Channel * Map_out * K * K * sizeof(float),
cudaMemcpyHostToDevice, streams[0]);
 int instreamdiv = Batch * Channel * Height * Width;
  instreamdiv /= numstreams;
 int outstreamdiv = Batch * Map_out * Height_out * Width_out;
  outstreamdiv /= numstreams;
 float * device input = *device input ptr;
 float * device_output = *device_output_ptr;
 float * device_mask = *device_mask_ptr;
 for (i = 0; i < numstreams; i++) {
```

```
cudaMemcpyAsync(device_input + instreamdiv * i, host_input + instreamdiv * i, instreamdiv *
sizeof(float), cudaMemcpyHostToDevice, streams[i]);
    conv_forward_kernel<<<gri>gridDim, blockDim, 0, streams[i]>>>(device_output + outstreamdiv * i, device_input + instreamdiv * i, device_mask, Batch, Map_out, Channel, Height, Width, K);
    cudaMemcpyAsync((float*)host_output + outstreamdiv * i, device_output + outstreamdiv * i, outstreamdiv * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();

for (i = 0; i < numstreams; i++) {
    cudaStreamDestroy(streams[i]);
}

cudaFree(device_input_ptr);
cudaFree(device_mask_ptr);
cudaFree(device_output_ptr);</pre>
```

## 2. Optimization 2: Input channel reduction: tree

a. Which optimization did you choose to implement? Chose from the optimization below by clicking on the check box and explain why did you choose that optimization technique.

| ☐ Tiled shared memory convolution (2 points)                                        |
|-------------------------------------------------------------------------------------|
| $\square$ Shared memory matrix multiplication and input matrix unrolling (3 points) |
| ☐ Kernel fusion for unrolling and matrix-multiplication (2 points)                  |
| ☐ Weight matrix in constant memory (1 point)                                        |
| ☐ Tuning with restrict and loop unrolling ( <b>3 points</b> )                       |
| ☐ Sweeping various parameters to find best values (1 point)                         |
| ☐ Multiple kernel implementations for different layer sizes (1 point)               |
| ■ Input channel reduction: tree (3 point)                                           |
| ☐ Input channel reduction: atomics (2 point)                                        |
| ☐ Fixed point (FP16) arithmetic. (4 points)                                         |
| ☐ Using Streams to overlap computation with data transfer (4 points)                |
| ☐ An advanced matrix multiplication algorithm (5 points)                            |
| ☐ Using Tensor Cores to speed up matrix multiplication (5 points)                   |
| ☐ Overlap-Add method for FFT-based convolution (8 points)                           |
| ☐ Other optimizations: please explain                                               |
|                                                                                     |

I chose this optimization because I was really comfortable with applying the reduction tree to other concepts as I felt I learned it well in lecture and the readings.

b. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations?

This optimization works by using the reduction tree method we learned and applying it to the channel dimension by reducing the samples based on their channels as we are adding them eventually. I thought it would increase performance by reducing the amount of overhead as well as making it work efficient. This optimization synergizes with only the optimizations that preserve the convolution kernel code we used in milestone 2.

c. List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used).

| Batch Size | Op Time 1 | Op Time 2 | Total<br>Execution<br>Time | Accuracy |
|------------|-----------|-----------|----------------------------|----------|
| 100        | 0.225ms   | 1.05ms    | 1.497sec                   | 0.86     |
| 1000       | 2.06ms    | 10.93ms   | 9.636sec                   | 0.886    |
| 10000      | 20.44ms   | 109.3ms   | 1min34sec                  | 0.8714   |

d. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your answer, directly comparing to your baseline (or the previous optimization this one is built off of.

The tree reduction was not as successful as I hoped as the operation time increased from 1.73 seconds to 2.06. Additionally, the memory accesses were inefficient as the hit rates for the cashes decreased significantly. I had hoped that the reduction would allow for a much faster speed of data movement from the accumulated variable to the output, but the results show that the tradeoffs are not effective.

| Time(%) | Total Time | Calls | Average    | Minimum | Maximum   | Name                  |
|---------|------------|-------|------------|---------|-----------|-----------------------|
|         |            |       |            |         |           |                       |
| 54.7    |            | 8     | 21758161.2 | 74670   | 173044306 | cudaMalloc            |
| 33.2    | 105498763  | 8     | 13187345.4 | 18058   | 57027982  | cudaMemcpy            |
| 11.7    | 37372028   | 6     | 6228671.3  | 3231    | 35285237  | cudaDeviceSynchronize |
| 0.3     | 1022861    | 8     | 127857.6   | 64176   | 245535    | cudaFree              |
| 0.0     | 133165     | 6     | 22194.2    | 16095   | 27141     | cudaLaunchKernel      |
|         |            |       |            |         |           |                       |

| Generating | Generating CUDA Kernel Statistics                                                |           |            |         |          |                           |  |  |  |
|------------|----------------------------------------------------------------------------------|-----------|------------|---------|----------|---------------------------|--|--|--|
|            | Generating CUDA Memory Operation Statistics CUDA Kernel Statistics (nanoseconds) |           |            |         |          |                           |  |  |  |
| Time(%)    | Total Time                                                                       | Instances | Average    | Minimum | Maximum  | Name                      |  |  |  |
|            |                                                                                  |           |            |         |          |                           |  |  |  |
| 100.0      | 37354744                                                                         | 2         | 18677372.0 | 2073011 | 35281733 | conv_forward_kernel       |  |  |  |
| 0.0        | 2784                                                                             | 2         | 1392.0     | 1280    | 1504     | prefn_marker_kernel       |  |  |  |
| 0.0        | 2496                                                                             | 2         | 1248.0     | 1216    | 1280     | do_not_remove_this_kernel |  |  |  |
|            |                                                                                  |           |            |         |          |                           |  |  |  |



e. What references did you use when implementing this technique?

I used the lecture slides and the reading going over tree reduction.

f. Please Paste your kernel code for this optimization. Your code should include the non-trivial code that you have changed for this optimization.
For example, it can be the complete kernel code for Tiled shared memory convolution several lines of code for Weight matrix in constant memory, or the "for" loop for loop unrolling

```
const int Height_out = Height - K + 1;
  const int Width_out = Width - K + 1;
  int H grid = (Height out + TILE WIDTH - 1) / TILE WIDTH;
  int W_grid = (Width_out + TILE_WIDTH - 1) / TILE_WIDTH;
  //(void)Height_out; // silence declared but never referenced warning, remove this line when you
start working
  //(void)Width out; // silence declared but never referenced warning. remove this line when you
start working
  // We have some nice #defs for you below to simplify indexing. Feel free to use them, or create your
own.
  // An example use of these macros:
  // float a = in 4d(0,0,0,0)
  // \text{ out}_4d(0,0,0,0) = a
  extern __shared__ float tree[];
  #define out_4d(i3, i2, i1, i0) output[(i3) * (Map_out * Height_out * Width_out) + (i2) * (Height_out
* Width out) + (i1) * (Width out) + i0]
  #define in_4d(i3, i2, i1, i0) input[(i3) * (Channel * Height * Width) + (i2) * (Height * Width) + (i1) *
(Width) + i0
  #define mask 4d(i3, i2, i1, i0) mask[(i3) * (Channel * K * K) + (i2) * (K * K) + (i1) * (K) + i0]
  #define tree_3d(i2, i1, i0) tree[(i2) * (Channel * TILE_WIDTH) + (i1) * (Channel) + i0]
  // Insert your GPU convolution kernel code here
  int m = blockIdx.x;
  int h = (blockldx.z / W grid) * TILE WIDTH + threadIdx.y;
  int w = (blockIdx.z % W grid) * TILE WIDTH + threadIdx.x;
  float acc = 0.0:
  if (h < Height out && w < Width out) {
      for (int p = 0; p < K; p++) {
         for (int q = 0; q < K; q++) {
           acc += in_4d(blockldx.x, threadldx.z, h + p, w + q) * mask_4d(blockldx.y, threadldx.z, p, q);
       }
```

```
}
tree_3d(threadIdx.y, threadIdx.x, threadIdx.z) = acc;
for (unsigned int stride = 1; stride < Channel; stride <<= 1) {
    __syncthreads();
    if ((threadIdx.z + stride < Channel) && (threadIdx.z % (2 * stride) == 0)) {
        tree_3d(threadIdx.y, threadIdx.x, threadIdx.z) += tree_3d(threadIdx.y, threadIdx.x, threadIdx.z + stride);
    }
}
__syncthreads();
out_4d(blockIdx.x, blockIdx.y, h, w) = tree_3d(threadIdx.y, threadIdx.x, 0);
}
</pre>
```

## 3. Optimization 3: Fixed point (FP16) arithmetic

a. Which optimization did you choose to implement? Chose from the optimization below by clicking on the check box and explain why did you choose that optimization technique.

| ☐ Tiled shared memory convolution (2 points)                                        |
|-------------------------------------------------------------------------------------|
| $\square$ Shared memory matrix multiplication and input matrix unrolling (3 points) |
| ☐ Kernel fusion for unrolling and matrix-multiplication (2 points)                  |
| ☐ Weight matrix in constant memory (1 point)                                        |
| ☐ Tuning with restrict and loop unrolling (3 points)                                |
| $\square$ Sweeping various parameters to find best values (1 point)                 |
| ☐ Multiple kernel implementations for different layer sizes (1 point)               |
| ☐ Input channel reduction: tree (3 point)                                           |
| ☐ Input channel reduction: atomics (2 point)                                        |
| ■ Fixed point (FP16) arithmetic. (4 points)                                         |
| ☐ Using Streams to overlap computation with data transfer (4 points)                |
| ☐ An advanced matrix multiplication algorithm (5 points)                            |
| ☐ Using Tensor Cores to speed up matrix multiplication ( <b>5 points</b> )          |
| ☐ Overlap-Add method for FFT-based convolution (8 points)                           |
| $\square$ Other optimizations: please explain                                       |
|                                                                                     |

I chose this optimization as I come from a hardware background so it was really interesting to me how we are tampering with the amount of data stored in a float and data structure.

b. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations?

The fp16 performance optimization works by reducing the time needed for each floating point operation. Typically, floats are represented by 32 bits but fp16 operations use the \_\_half data type which are 16 bits. The floats fed in are converted to \_\_half as they are passed in. Special operations are used to perform multiplication and addition on this data type. This makes computations faster along with making memory transfers more efficient. This optimization will increase the performance of forward convolution at the cost of a slight decrease in accuracy.

c. List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used).

| Batch Size | Op Time 1 | Op Time 2 | Total<br>Execution<br>Time | Accuracy |
|------------|-----------|-----------|----------------------------|----------|
| 100        | 0.184ms   | 0.696ms   | 1.365sec                   | 0.77     |
| 1000       | 1.704ms   | 6.722ms   | 10.767sec                  | 0.829    |
| 10000      | 16.884ms  | 66.922ms  | 1min45sec                  | 0.84     |

d. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your answer, directly comparing to your baseline (or the previous optimization this one is built off of

Yes, this performance optimization was successful. The operation time decreased from 1.77 seconds to 1.69 seconds after the optimization. Memory usage was also greatly decreased. This is likely a result of the 16 bit half data type which reduced memory transfers and space allocated.

| Time(%) | Total Time | Calls | Average    | Minimum | Maximum   | Name                |
|---------|------------|-------|------------|---------|-----------|---------------------|
|         |            |       |            |         |           |                     |
| 74.6    | 195425124  | 8     | 24428140.5 | 72807   | 194226307 | cudaMalloc          |
| 21.7    | 56756869   | 8     | 7094608.6  | 20809   | 30225595  | cudaMemcpy          |
| 3.2     | 8445037    | 6     | 1407506.2  | 2936    | 6725825   | cudaDeviceSynchron  |
| 0.4     | 1127017    | 8     | 140877.1   | 61346   | 289483    | cudaFree            |
| 0.1     | 188136     | 6     | 31356.0    | 24384   | 41877     | cudaLaunchKernel    |
| 0.1     | 100130     |       | 31330.0    | 21301   | 11077     | cadaladiiciikeriiee |

| 99.9 8423380 2 4211690.0 1698775 6724605 conv_for | ward_kernel       |
|---------------------------------------------------|-------------------|
| 0.0 2720 2 1360.0 1344 1376 prefn_ma              | rker_kernel       |
| 0.0 2624 2 1312.0 1248 1376 do_not_re             | emove_this_kernel |



e. What references did you use when implementing this technique?

I used this link to learn more about the data types that NVIDIA commonly uses

https://docs.nvidia.com/cuda/cuda-math-api/group CUDA MATH HALF MISC.html

f. Please Paste your kernel code for this optimization. Your code should include the non-trivial code that you have changed for this optimization.
For example, it can be the complete kernel code for Tiled shared memory convolution several lines of code for Weight matrix in constant memory, or the "for" loop for loop unrolling

```
__half2 load1;
__half2 load2;
__half2 acc = __half2half2(0);
 half2 halfmul;
if (h < Height out && w < Width out) {
  for (int c = 0; c < Channel; c++) {
    for (int p = 0; p < K; p++) {
       for (int q = 0; q < K; q+=2) {
         load2 = \underline{\hspace{0.5cm}} halves2half2(mask_4d(m,c,p,q),mask_4d(m,c,p,q+1));
         load1 = __halves2half2(in_4d(blockldx.z,c,h+p,w+q),in_4d(blockldx.z,c,h+p,w+q+1));
         halfmul = __hmul2(load1,load2);
         acc = __hadd2(acc,halfmul);
       }
    }
  out_4d(blockldx.z, m, h, w) = __hadd(__high2half(acc),__low2half(acc));
}
```