|  |  |
| --- | --- |
| **Name:** | *Marius Juston* |
| **NetID:** | *mjuston2* |
| **Section:** | *AL2* |

**ECE 408/CS483 Milestone 3 Report**

|  |
| --- |
| 1. List Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images from your basic forward convolution kernel in milestone 2. This will act as your baseline this milestone. |
| |  |  |  |  |  | | --- | --- | --- | --- | --- | | Batch Size | Op Time 1 | Op Time 2 | Total Execution Time | Accuracy | | 100 | *0.219294ms* | *0.818139ms* | *1.037433ms* | *0.86* | | 1000 | *2.07119ms* | *8.07871ms* | *10.1499ms* | *0.886* | | 10000 | *20.4125ms* | *78.6866ms* | *99.0991ms* | *0.8714* | |
| 1. **Optimization 1: *Tiled Shared Memory Convolution + Constant kernel memory*** |
| * 1. Which optimization did you choose to implement and why did you choose that optimization technique? |
| For the first optimization, I implemented tiled shared-memory convolution and, in addition, placed the kernel weights in constant memory. I chose this technique because it was the easiest next step from the unoptimized baseline kernel. |
| * 1. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations? |
| This optimization first places the kernel weights in constant memory. Each thread then loads one, two, or three elements of the input matrix into a shared memory array, and the convolution is computed from that array. A thread may load up to three elements because the shared memory tile must also hold the border elements needed by the convolutions at the tile edges. Since the tile is 2D, some threads also load data to the right of and below the tile, and another group loads the bottom-right corner. I expected this optimization to improve the performance of the forward convolution because shared memory lets neighboring threads reuse input data, which reduces the number of accesses to global memory. It also pairs well with placing the kernel in constant memory, since that aims for a similar effect. |
| * 1. List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used). |
| |  |  |  |  |  | | --- | --- | --- | --- | --- | | Batch Size | Op Time 1 | Op Time 2 | Total Execution Time | Accuracy | | 100 | 0.174969ms | 0.623188ms | 0.798157ms | *0.86* | | 1000 | 1.68267ms | 6.12279ms | 7.80546ms | *0.886* | | 10000 | 16.5975ms | 60.8619ms | 77.4594ms | *0.8714* | |
| * 1. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your answer, directly comparing to your baseline (or the previous optimization this one is built off of).   This optimization did improve performance. Because the kernel spends less time waiting on global memory accesses, it executes faster. The Op Times are consistently lower than the baseline. In the profiling results, the kernel duration is 16.51ms compared to the baseline 20.32ms for Layer 1, and 59.99ms compared to 81.79ms for Layer 2. The optimization also makes slightly better use of the SMs: the SOL SM % is 81.51% and 78.82% for the two baseline layers, while the optimized version reaches 87.45% and 84.98%. |
|  |
|  |
| * 1. What references did you use when implementing this technique? |
| *Lecture notes* |
| 1. **Optimization 2:** **Kernel fusion for unrolling and matrix-multiplication + Constant Kernel memory *+ Thread block optimization*** |
| 1. Which optimization did you choose to implement and why did you choose that optimization technique? |
| For this optimization, I implemented kernel fusion of the unrolling and matrix multiplication steps, using shared memory for the input in the matrix multiplication and constant memory for the kernel. I chose matrix unrolling because this formulation should help with global memory bandwidth and possibly speed up larger convolutions. |
| 1. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations? |
| This optimization converts the traditional convolution, which slides a kernel across the input, into a matrix multiplication. To do so, the input matrix must be “unrolled” so that the input elements needed by each convolution form one column of the unrolled matrix. Each thread remaps its part of the input matrix into shared memory for later use. Once the cached data is in place, a standard dot product computes the convolution output. I expected this optimization to improve the performance of the forward convolution, with more benefit on the larger layer, Layer 2, because this method helps with memory bandwidth and is thus more efficient on larger operations. It also works well with the constant memory optimization and is an add-on to shared-memory matrix multiplication and input matrix unrolling. Layer 1 used a tile size of 16 and Layer 2 used a tile size of 32. |
| 1. List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used). |
| |  |  |  |  |  | | --- | --- | --- | --- | --- | | Batch Size | Op Time 1 | Op Time 2 | Total Execution Time | Accuracy | | 100 | 0.393132ms | 0.527104ms | 0.920236ms | *0.86* | | 1000 | 3.79919ms | 5.31126ms | 9.11045ms | *0.886* | | 10000 | 37.782ms | 53.5231ms | 91.3051ms | *0.8714* | |
| 1. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your answer, directly comparing to your baseline (or the previous optimization this one is built off of).   This optimization only partially improved performance. As expected, compared with Optimization 1 (the tiled traditional convolution), the Op Time for Layer 1 is slower (37.78ms versus 16.60ms at a batch size of 10000), while the larger matrix multiplication in Layer 2, which can make better use of the memory bandwidth improvements, is faster (53.52ms versus 60.86ms). For Layer 1, the overhead of the unrolling operation probably outweighs its benefits. The SOL Memory % also shows that this optimization reduces memory utilization by about 20-30% compared to Optimization 1. The final implementation uses the tiled shared-memory convolution for Layer 1 and the unrolled, tiled matrix multiplication for Layer 2. |
|  |
| 1. What references did you use when implementing this technique? |
| *Lecture 13, NVIDIA blogs* |
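
The data reuse argument made for Optimization 1 can be put into numbers with a small CPU-side calculation. This is only a sketch: the function name `tile_stats`, the square output tile with one thread per output element, and the example sizes (a 16 x 16 tile and a 7 x 7 kernel) are assumptions chosen for illustration and are not taken from the kernel in this report. Weight reads are not counted because the weights are assumed to sit in constant memory.

```python
def tile_stats(tile_width, kernel_width):
    """Per-block, per-channel input load counts for a stride-1, unpadded convolution.

    Assumes a square output tile with one thread per output element. These
    are illustrative assumptions, not the report's exact launch configuration.
    """
    # The shared tile must cover the output tile plus a halo of K - 1
    # extra rows and columns (the right, bottom, and corner regions).
    input_tile_width = tile_width + kernel_width - 1
    shared_loads = input_tile_width ** 2
    halo_loads = shared_loads - tile_width ** 2
    # Without tiling, every output element reads its full K x K window
    # from global memory.
    untiled_loads = tile_width ** 2 * kernel_width ** 2
    return {
        "input_tile_width": input_tile_width,
        "shared_loads": shared_loads,
        "halo_loads": halo_loads,
        "untiled_loads": untiled_loads,
        "reuse_factor": untiled_loads / shared_loads,
    }


if __name__ == "__main__":
    # Hypothetical sizes: 16 x 16 output tile, 7 x 7 kernel.
    print(tile_stats(16, 7))
```

The `reuse_factor` entry is the ratio of global reads without tiling to global reads with tiling, so a larger value means each staged input element serves more output elements.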

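The unrolling described under Optimization 2 can be checked against a direct convolution with a small CPU reference. The sketch below is not the fused CUDA kernel from this report: it materializes the unrolled matrix in ordinary Python lists, whereas the report's kernel builds tiles of it in shared memory. The function names (`conv_direct`, `unroll`, `conv_unrolled`) and the layouts (`x` as `[C][H][W]`, `w` as `[M][C][K][K]`, stride 1, no padding, no kernel flip) are assumptions made for illustration. Each column of the unrolled matrix holds the `C*K*K` inputs needed by one output pixel, so multiplying the flattened filters by that matrix should reproduce the direct result.

```python
def conv_direct(x, w):
    """Direct convolution of one image: x is [C][H][W], w is [M][C][K][K]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    M, K = len(w), len(w[0][0])
    H_out, W_out = H - K + 1, W - K + 1
    y = [[[0.0] * W_out for _ in range(H_out)] for _ in range(M)]
    for m in range(M):
        for h in range(H_out):
            for col in range(W_out):
                acc = 0.0
                for c in range(C):
                    for p in range(K):
                        for q in range(K):
                            acc += x[c][h + p][col + q] * w[m][c][p][q]
                y[m][h][col] = acc
    return y


def unroll(x, K):
    """Build the (C*K*K) x (H_out*W_out) matrix: one column per output pixel."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    H_out, W_out = H - K + 1, W - K + 1
    rows = []
    for c in range(C):
        for p in range(K):
            for q in range(K):
                rows.append([x[c][h + p][col + q]
                             for h in range(H_out)
                             for col in range(W_out)])
    return rows


def conv_unrolled(x, w):
    """Same convolution expressed as (M x C*K*K) times (C*K*K x H_out*W_out)."""
    M, C, K = len(w), len(w[0]), len(w[0][0])
    H_out = len(x[0]) - K + 1
    W_out = len(x[0][0]) - K + 1
    x_unrolled = unroll(x, K)
    # Flatten each filter in the same (c, p, q) order used for the rows above.
    w_flat = [[w[m][c][p][q]
               for c in range(C) for p in range(K) for q in range(K)]
              for m in range(M)]
    y_flat = [[sum(w_flat[m][r] * x_unrolled[r][j]
                   for r in range(len(x_unrolled)))
               for j in range(H_out * W_out)]
              for m in range(M)]
    # Reshape each flat output row back to H_out x W_out.
    return [[y_flat[m][h * W_out:(h + 1) * W_out] for h in range(H_out)]
            for m in range(M)]
```
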
|  |
| --- |
| 1. **Optimization 3: *FP16 Optimization + Thread block optimization + Unrolled Matrix*** |
| * 1. Which optimization did you choose to implement and why did you choose that optimization technique?   This optimization uses FP16 operations in the kernel, applied to the unrolled matrix multiplication from Optimization 2. Again, Layer 1 has a tile size of 16 and Layer 2 has a tile size of 32. I chose this technique because it should further reduce the time spent on floating-point operations: it converts the usual 32-bit floats into 16-bit floats, which are more memory efficient and faster to compute with. |
|  |
| * 1. How does the optimization work? Did you think the optimization would increase performance of the forward convolution? Why? Does the optimization synergize with any of your previous optimizations? |
| This optimization works by converting the float data types to \_\_half or \_\_half2. These are floating-point data types with only 16 bits instead of float’s 32 bits. Halving the width saves memory, since each variable is half the size, and, because there are fewer bits, operations such as addition and multiplication are normally much faster. I expected this to improve the performance of the forward convolution. This optimization should also be usable alongside all the other optimizations for a small additional performance boost. However, this boost comes at the cost of a drop in accuracy: FP16 has much lower precision than its 32-bit counterpart and thus accumulates more error. |
| * 1. List the Op Times, whole program execution time, and accuracy for batch size of 100, 1k, and 10k images using this optimization (including any previous optimizations also used). |
| |  |  |  |  |  | | --- | --- | --- | --- | --- | | Batch Size | Op Time 1 | Op Time 2 | Total Execution Time | Accuracy | | 100 | 0.223222ms | 0.602509ms | 0.825731ms | 0.84 | | 1000 | 2.12114ms | 6.01286ms | 8.134ms | 0.87 | | 10000 | 20.9049ms | 60.0765ms | 80.9814ms | 0.8636 | |
| * 1. Was implementing this optimization successful in improving performance? Why or why not? Include profiling results from *nsys* and *Nsight-Compute* to justify your answer, directly comparing to your baseline (or the previous optimization this one is built off of).   Sadly, this optimization does not appear to have worked, though I believe that may be due to how I implemented the computation and conversion. I did not find much good documentation on how to implement FP16 kernels or how to convert floating-point arrays to half2/half format, so a different implementation might achieve the expected results. The FP16 times for Layer 2 are slightly slower than Optimization 2 with matrix unrolling; however, FP16 does seem to have improved the performance of Layer 1. Strangely, memory usage changed very little. The SOL Memory % of Layer 1 was 32.25% with FP16 compared to 30.80% without it; I expected this number to be smaller. For Layer 2, however, a small memory reduction was observed, with 28.15% used instead of 32.24%. It is also strange that the SOL SM % decreased for Layer 1, from 83.91% to 78.19%, though this might be attributed to the increased memory usage. |
|  |
|  |
| * 1. What references did you use when implementing this technique? |
| *CUDA Blogs and library* |
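
The accuracy drop noted for Optimization 3 can be reproduced on the CPU by emulating FP16 rounding. This is a sketch under stated assumptions, not the `__half` kernel from this report: it uses Python's `struct` format code `"e"` (IEEE 754 binary16) to round every product and partial sum, and the helper names `to_half`, `to_single`, and `dot` are made up for illustration. With 4096 products of 1.0, the FP16 running sum stalls at 2048, because 2049 is not representable in FP16 and rounds back to 2048, while the FP32 sum reaches 4096.

```python
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE 754 binary16 (FP16) value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def to_single(value):
    """Round a Python float to the nearest IEEE 754 binary32 (FP32) value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def dot(a, b, round_fn):
    """Dot product in which every product and partial sum is rounded by round_fn."""
    acc = round_fn(0.0)
    for x, y in zip(a, b):
        product = round_fn(round_fn(x) * round_fn(y))
        acc = round_fn(acc + product)
    return acc


if __name__ == "__main__":
    ones = [1.0] * 4096
    print("FP32 accumulation:", dot(ones, ones, to_single))
    print("FP16 accumulation:", dot(ones, ones, to_half))
```

The same effect is smaller but still present for typical activations and weights, which is consistent with the slightly lower accuracy values in the FP16 table.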