# **408: Applied Parallel Programming**

### Fall 2018 - Midterm Exam 2

December 4th, 2018

- 1. This is a closed book exam except for 1 sheet of hand-written notes
- 2. You may not use any personal electronic devices except for a calculator
- 3. Please write legibly!! We are using OCR to help grade your exam
- 4. Absolutely no interaction between students is allowed
- 5. Illegible answers will likely be graded as incorrect

### **Good Luck!**

| Name:                       | SOLUTION |
|-----------------------------|----------|
| NetID:                      |          |
| Exam Room:                  |          |
|                             |          |
| Question 1 (20 points):     |          |
| Question 2 (20 points):     |          |
| Question 3 (15 points):     |          |
| Question 4 (20 points):     |          |
| Question 5 (20 points):     |          |
| Question 6 ( 5 points):     |          |
|                             |          |
|                             |          |
| Total Score (100 points): _ |          |

| Name: |
|-------|
|-------|

## Problem 1 (20 points): Multiple Choice

Choose the proper response, and if multiple responses are correct, choose all. No partial credit will be provided if the answer is partially correct, or wrong.

**Part 1a (3 points)** For the following reduction kernel fragment, if the block size is 512 and warp size is 32, how many warps in a block will have control divergence during the iteration where stride is equal to 64?

```
float partialSum[2 * blockDim.x];
     shared
   unsigned int t = threadIdx.x;
2.
   unsigned int start = 2 * blockIdx.x * blockDim.x;
   partialSum[t] = input[start + t];
   partialSum[blockDim.x + t] = input[start + blockDim.x + t];
6. for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
7.
8.
        syncthreads();
      if (t % stride == 0)
9.
          partialSum[2 * t] += partialSum[2 * t + stride];
10.
11. }
```

**Part 1b (3 points)** For the Brent-Kung scan kernel based on reduction trees and inverse reduction trees, assume that we have 2048 elements in each section and warp size is 32, how many warps in each block will have control divergence during the reduction tree phase iteration where stride is 64?

□ 0
□ 1
□ 16
□ 32

| Name:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <b>Part 1c (3 points)</b> Suppose we need to run Brent-Kung scan algorithm on a very large input consisting of $2^{30}$ elements. For our CUDA device, the maximum number of threads in a block is $2^{10}$ and the maximum number of blocks in the x-dimension of the grid is $2^{11}$ . Further, suppose that we are using a one-dimensional grid along the x dimension. If we choose to use hierarchical parallel scan to process the input, what is the minimum number of times we need to launch the scan kernel?                                                                                                       |
| $ \begin{array}{c}                                     $                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| <b>Part 1d (3 points)</b> Suppose a processor supports atomic operations in L2 cache, assume that each atomic operation takes 5ns to complete in L2 cache and 180ns to complete in DRAM. Assume that 80% of the atomic operations hit the L2 cache. Further, assume that the kernel performs 10 floating-point operations per atomic operation. What is the floating-point throughput of the kernel execution as limited by the throughput of the atomic operations?                                                                                                                                                         |
| <ul> <li>□ 0.0025 GFLOPS</li> <li>□ 0.054 GFLOPS</li> <li>□ 0.25 GFLOPS</li> <li>□ 18.52 GFLOPS</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| Part 1e (3 points) To transfer data from the device to host (or vice versa), we use cudaMemcpy which essentially requires a pointer to a data block and the size of that block. To transfer data from nodes in an MPI system, we use MPI_Send (and MPI_Recv). One key difference to CUDA is that MPI_Send requires a pointer to the data block, the number of elements in the block, and the datatype of these elements. Which of the following accounts for this difference?  MPI_Send is non-blocking  cudaMemcpy only works on floating point data  MPI needs to work on heterogeneous systems with different endian-ness |
| <ul> <li>(byte ordering).</li> <li>□ MPI_Send can work on non-contiguous data blocks, whereas cudaMemcpy moves a single contiguous block</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |

| Name: _ |  |  |
|---------|--|--|
|         |  |  |

**Part 1f (3 points)** We need to calculate the histogram of an array with  $10^9$  elements. The histogram has four bins. Assume that each atomic operation in the global memory has a constant total latency of 100ns and each atomic operation in a shared memory has a constant total latency of 1ns. Further, assume that when we launch the kernel, blockDim =  $10^3$  and gridDim =  $10^5$ , thus thread is responsible for 10 elements. If we only consider the latencies caused by atomic operations, what is the <u>theoretical minimum</u> runtime if privatization is implemented and the elements of the array has a distribution of (50%, 30%, 10%, 10%)? Suppose that there is only one global histogram and one shared memory histogram in each block.

 $\square$  (10<sup>7</sup> + 1000) ns  $\square$  (10<sup>7</sup> + 5000) ns  $\square$  (10<sup>7</sup> + 10<sup>4</sup>) ns  $\square$  (4 \* 10<sup>7</sup> + 1000) ns  $\square$  (4 \* 10<sup>7</sup> + 5000) ns  $\square$  (4 \* 10<sup>7</sup> + 10<sup>4</sup>) ns  $\square$  None of the above

**Part 1g (2 points)** Given a sparse matrix of integers with m rows, n non-zero elements in the row with the largest number of non-zeros, and k non-zeros. How many integers are needed to represent the matrix in JDS-T? Recall that JDS-T has a transposed representation. Remember that we need to keep track of the number of non-zeros in each row, to assist the SpMV calculation.

☐ 2m+1 ☐ m+n+k ☐ 2k+m+n+1 ☐ 2k+m+1

#### **Solution:**

1a. D, 8
1b. B, 1
1c. B, 2^8 + 2
1d. C, 0.25 GFLOPS
1e. C, need to work on systems with different endian-ness
1f. B, (10^7 + 5000) ns
1g. C, 2k+m+n+1

| Name: |  |
|-------|--|
|       |  |

## Problem 2 (20 points): Histogramming

For this question, we'll consider a histogram with  $2^{18}$  bins (much larger than shared memory), and an input data stream (with values in the range  $[0, 2^{18} - 1]$ ) that needs to be binned. We want to take advantage of privatization, and observe that the input stream only has 256 bins that contain a significant fraction of the data (hot bins). Say we are provided a vector that's also  $2^{18}$  in length like the histogram, but contains a -1 for bins that are infrequently used, and a unique number between 0 and 255 for each of the hot bins. Our algorithm uses this unique value to create a private histogram in shared memory.

The following kernel will use shared memory for the hot bins and global memory for the not-so-hot bins. Once the work for a thread block is completed, the histogram will be updated in global memory.

Part 2a (12 Points) Complete the following kernel to implement the hot bin histogram. Note that each thread block will have its own privatized histogram of the hot bins in the shared memory with name histo\_private. The vector data contains the input data stream of length size, and the vector freq contains the hot bin vector.

For this problem, pay close attention to the array **global\_index**. It serves as a mapping back from the private histogram bins in **histo\_private** to the global histogram bins in **histo**.

| Name: |  |
|-------|--|
|       |  |

```
// histogram is launched with the following parameters.
2
   dim3 gridDim(1000);
3
   dim3 blockDim(256);
 4
   histo kernel << gridDim, blockDim>>> (data, freq, length, histo)
5
 6
   // Kernel Code
    global
7
   void histo kernel (unsigned int *data, int *freq, long size, unsigned int *histo)
8
9
10
        shared unsigned int histo private[256];
      shared unsigned int global index[256];
11
12
      //reset histogram
13
14
      histo private[threadIdx.x] = 0;
      global index[threadIdx.x] = 0;
15
16
17
       syncthreads();
18
      unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
      unsigned int stride = blockDim.x * gridDim.x;
19
20
      while(i < _____){
         if (______ freq[data[i]]____ > 0) {
22
           atomicAdd(&histo_private[_____freq[data[i]]_____],1);
23
           global_index[_____freq[data[i]]_____] = data[i];
24
25
         }
26
         else
           27
28
         i += stride;
29
      }
30
        syncthreads();
31
      //contribute to global histogram
      atomicAdd(     &histo[global_index[threadIdx.x]] , histo private[threadIdx.x]);
32
33 }
```

Part 2b (4 Points) How many shared-memory atomic operations and global-memory atomic operations are being performed by all the threads in the kernel if the stream contains  $10^7$  data items and 80% of the elements are in the "hot" bins? Write down the expression for your answer. Pay attention to the code structure.

shared-memory atomics: \_\_\_\_\_0.8 \* 10^7\_\_\_\_\_ global-memory atomics: \_\_\_\_\_0.2 \* 10^7 + 256\*1000\_\_\_\_

| Name: |  |  |  |
|-------|--|--|--|
|       |  |  |  |

**Part 2c (4 Points)** For this part, we will use a very simple performance model with the following assumptions:

- the input stream **data** contains 32 data items and 25 (approx 80%) of the elements are in the "hot" bins
- There is 1 thread block with 32 threads (1 warp) in this execution.
- Further assume each shared-memory atomic operation requires 1 ns, whereas each global-memory atomic operation requires 100 ns.
- All atomic operations are blocking, which means the next instruction cannot execution until the atomic operation is completed.
- All other operations require 0 ns.
- You may assume the full warp starts execution at time 0, and executes every cycle.

With these assumptions, what is the execution time in nanoseconds for this thread block?

#### **Your Answer Here:**

| Name: |  |
|-------|--|
|       |  |

# Problem 3 (15 points): Multiple Prefix Sums

For this problem, we will implement a CUDA kernel for generating the prefix sums of multiple lists of data. Data from a biology experiment recorded the amount of growth of a plant specimen in millimeters in a given day for some number of plants. From this data, we want to generate the plant size in millimeters on each day for each plant. That is we need to perform a prefix sum for each plant. For this, we will devise a CUDA kernel based on the Brent-Kung approach for prefix sum. The table below provides an example of the input data for 3 plants across 4 days.

|       | Plant 0 | Plant 1 | Plant 2 |
|-------|---------|---------|---------|
| Day 1 | 30      | 10      | 1       |
| Day 2 | 20      | 5       | 5       |
| Day 3 | 15      | 20      | 4       |
| Day 4 | 17      | 4       | 6       |

The input data <code>input</code> is a matrix, with <code>num\_plants</code> as width and <code>num\_days</code> as height. The matrix will be stored as one-dimensional array in the row-major layout. We've used a grid with 2D thread blocks to solve this task. All threads in the same x dimension will process the data of one plant. Threads in the same x dimension in a block will process <code>2\*BLOCK SIZE</code> data.

**Part 3a (3 points)**: Why does Brent-kung Prefix Sum algorithm not require double-buffering? Please answer the question in one line.

There is no read after write dependency among threads.

| You | r Answe | r Here: |  |  |  |  |
|-----|---------|---------|--|--|--|--|
|     |         |         |  |  |  |  |
|     |         |         |  |  |  |  |
|     |         |         |  |  |  |  |

| N T    |  |  |  |
|--------|--|--|--|
| Name:  |  |  |  |
| maine. |  |  |  |

Part 3b (12 points) Fill in the blanks to complete the prefix sum kernel described above.

```
#define BLOCK SIZE 32
 0
 1
     global
     void PlantScan(float *input, float *output,
 2
 3
                     int num plants, int num days)
 4
     {
 5
          shared float partialSum[BLOCK SIZE] [BLOCK SIZE*2];
 6
        int tx = threadIdx.x;
 7
        int ty = threadIdx.y;
        if (blockIdx.x * BLOCK_SIZE + tx < num_plants) {</pre>
 8
 9
           if (2 * blockIdx.y * BLOCK SIZE + ty < num days) {</pre>
              partialSum[tx][ty] =
10
                         input[(2 * blockIdx.y * BLOCK_SIZE + ty) *
11
                         num plants + blockIdx.x * BLOCK SIZE + tx];
12
13
           else partialSum[tx][ty] = 0;
14
15
           if ((2 * blockIdx.y + 1) * BLOCK SIZE + ty < num days) {</pre>
              partialSum[tx][ty+BLOCK SIZE] =
16
17
                        input[((2 * blockIdx.y + 1) * BLOCK SIZE + ty) *
                        num plants + blockIdx.x * BLOCK SIZE + tx];
18
19
           }
20
           else partialSum[tx][ty] = 0;
21
22
        else partialSum[tx][ty] = 0;
23
```

Name: \_\_\_\_\_

```
24
      int stride = 1;
25
      while(stride < 2 * BLOCK_SIZE) {</pre>
         syncthreads();
26
         int index = ____(ty+1)*stride*2-1____;
27
         if (index < 2 * BLOCK_SIZE && ___index-stride >= 0)
28
           partialSum[tx][index] +=
29
                      30
           stride = stride * 2;
31
32
         }
33
      stride = BLOCK SIZE/2;
34
      while(stride > 0){
35
36
         syncthreads();
         int index = (tx + 1) * stride * 2 - 1;
37
         if (____index+stride____ < 2 * BLOCK_SIZE) {</pre>
38
39
           40
         }
41
        stride = stride / 2;
42
      }
43
      syncthreads();
44
      . . .
45
    }
46
47
    // Host Code
    int main ()
48
49
    {
50
51
      // Invoke Kernel Here
52
      dim3 dimGrid(num plants/BLOCK WIDTH, num days/BLOCK WIDTH, 1);
53
      dim3 dimBlock(BLOCK WIDTH, BLOCK WIDTH, 1);
      PlantScan<<<dimGrid, dimBlock>>>
54
55
              (input, output, num plants, num days);
56
      . . .
57
```

| Name: |  |  |
|-------|--|--|
|       |  |  |

# Problem 4 (20 points): Sparse Matrix Multiplication

This question tests your knowledge of sparse matrix representation and operation. You are given a sparse matrix representation in CSR (Compressed Sparse Row) format:

| Nonzero values: | data      | [2, | 5, | 3, | 4, | 1, | 2, | 2, | 3] |
|-----------------|-----------|-----|----|----|----|----|----|----|----|
| Column indices: | col_index | [0, | 3, | 1, | 2, | Ο, | 1, | 2, | 3] |
| Row pointers:   | row_ptr   | [0, | 2, | 4, | 4, | 8] |    |    |    |

Part 4a (2 points): Write down the dense matrix (4 x 4) of the given CSR format

For the following parts, use \* to represent zero element

**Part 4b (2 points):** Provide the COO representation of the same matrix:

COO:

| Non zero values: | data [      | ] |
|------------------|-------------|---|
| Column indices:  | col_index [ | ] |
| Row indices:     | row index [ | ] |

**Part 4c (2 points)** Provide the JDS representation of the same matrix:

| Non zero values: | data      | [ | ] |
|------------------|-----------|---|---|
| Column indices:  | col_index | [ | ] |
| Row pointer:     | row_ptr   | [ | ] |
| Row indices:     | row_index | [ | ] |

**Part 4d (2 points)** Provide the JDS\_T representation of the same matrix:

| Non zero values: | data [      | ] |
|------------------|-------------|---|
| Column indices:  | col_index [ | ] |
| Column pointer:  | col_ptr [   | ] |
| Row indices:     | row_index [ | ] |

**Part 4e (2 points)** For sparse matrix-vector multiply, what is the major drawback of the COO representation, and why don't the other representations suffer from the same limitation?

| Name: |  |
|-------|--|
|       |  |

**Part 4f (10 points)** CSC (Compressed Sparse Column) format is similar to CSR, in which the non-zero values in column are stored continuously in the memory. CSC format of matrix is equivalent to the transpose of CSR format.

**CSC Format Example:** 

$$A = \begin{bmatrix} 0 & 4 & 1 & 5 \\ 1 & 0 & 4 & 0 \\ 0 & 2 & 0 & 2 \\ 3 & 0 & 0 & 0 \end{bmatrix}$$

Nonzero value: data [ 1, 3, 4, 2, 1, 4, 5, 2 ]
Row indices: row\_index [ 1, 3, 0, 2, 0, 1, 0, 2 ]
Column pointer: col ptr [ 0, 2, 4, 6, 8 ]

Given the information above, please fill in blank for the SpMV\_CSC kernel. Note: data – input matrix; x – input vector; y – output vector

```
global void SpMV_CSC (int num cols, float *data, int
*col ptr, int *row index, float *x, float *y)
1. {
2.
     int col = blockIdx.y*blockDim.y+threadIdx.y;
     if (col < num cols) {</pre>
4.
        int col start = col ptr[ ];
5.
        int col end = col ptr[ ];
        for (int elem = col start; elem < col end; elem++) {
6.
           float dot = data[____]*x[___];
7.
           atomicAdd( &(y[____]), dot );
8.
9.
        }
10.
     }
11. }
```

Answer:

A. 
$$\begin{bmatrix} 2 & 0 & 0 & 5 \\ 0 & 3 & 4 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 2 & 2 & 3 \end{bmatrix}$$

Name:

```
B. COO:
  values:
                       data
                                  [ 2, 5, 3, 4, 1, 2, 2, 3]
                        col index [ 0, 3, 1, 2, 0, 1, 2, 3]
  Column indices:
                        row index [ 0, 0, 1, 1, 3, 3, 3, 3]
  Row indices:
C. JDS:
  values:
                        data
                                   [ 1, 2, 2, 3, 2, 5, 3, 4 ]
  Column indices:
                       col index [ 0, 1, 2, 3, 0, 3, 1, 2 ]
  Row pointer:
                       row ptr [
                                       0, 4, 6, 8, 8
  Row indices:
                       row index [
                                          3, 0, 1, 2
                                                             1
D. JDS T:
  values:
                       data
                                  [ 1, 2, 3, 2, 5, 4, 2, 3 ]
                        col index [ 0, 0, 1, 1, 3, 2, 2, 3 ]
  Column indices:
  Column pointer:
                       col ptr [ 0, 3, 6, 7, 8
                                                             1
                                       3, 0, 1, 2
  Row indices:
                       row index [
                                                             1
```

E. Drawbacks of COO: Need atomic operation, therefore less efficient. Each thread process a portion of the data elements and use an atomic operation to accumulate result into output.

This is because the threads in SpMV kernel with COO format are no longer mapped to a particular row. COO comes with the cost of additional storage for the row\_index array.

```
F.
1. global void SpMV CSC (int num cols, float *data, int
  *col ptr, int *row index, float *x, float *y)
2. {
     int col = blockIdx.y*blockDim.y+threadIdx.y;
3.
4.
     if (col < num cols) {</pre>
        int col start = col ptr[ col ];
5.
        int col_end = col_ptr[_col+1__];
6.
        for (int elem = col start; elem < col end; elem++) {</pre>
7.
            float dot = data[__elem__] * x[ col ];
8.
9.
            atomicAdd(&(y[ row index[elem] ]), dot);
10.
              }
11.
       }
12.
```

| N T   |  |  |
|-------|--|--|
| Name: |  |  |
|       |  |  |

### Problem 5 (20 points): Convolutional Neural Network (CNN)

A basic convolution layer in a CNN consists of filter W, input X, and output Y. In this question, we want to accelerate the forward propagation of convolution layers in the training process.

W is the convolution filter weight tensor, organized as a tensor W[M, C, K, K], where M is the number of output feature maps, C is the number of input feature maps, K is the height and width of each filter. Tensors are stored as multi-dimensional arrays in the memory.

X is the input feature map, organized as a tensor X[B, C, H, W], where B is the number of images, H is the height of each input feature map, and W is the width of each input feature map.

Y is the output feature map, organized as a tensor Y[B, M, H\_out, W\_out], where H\_out = H-K+1 is the height of each output feature map and W\_out = W-K+1 is the width of each output feature map.

One way to implement the forward propagation in CUDA is to reduce the convolution into the general matrix multiplication (GEMM). The diagram below shows the relationship between basic convolution and how that can be done by using GEMM. Note that the diagram only shows a single mini-batch, e.g. batch 0.



**Part 5a (2 points):** From the lecture, we learned that before applying the GEMM, we first need to unroll the input feature map (X) into the correct shape. Your friend told you that you need to unroll the weight matrix (W) as well. Is this necessary? Why or why not?

Ans: No need for conversion. W is stored in the correct form for GEMM in the device memory.

**Part 5b (2 points):** How many times on average will be each X element be replicated after the unrolling? Provide your answer as an expression using tensor dimensions.

Ans: The size of the unrolled matrix will be  $C^*K^*K \times H_{out}^*W_{out}$ . The size of the input feature maps is  $C^*(H_{out}^*K^-1)^*(W_{out}^*K^-1)$ . The ratio of the two gives the answer:  $K^*K^*(H_{out}^*W_{out}^*)/((H_{out}^*K^-1)^*(W_{out}^*K^-1)$ .

**Part 5c (16 points):** After finishing the unrolling kernel and matrix multiplication kernel, you realized the performance could be better if we use only the tiled matrix multiplication kernel without actual unrolling. That is, instead of having a separate unrolling kernel, we perform unrolling when loading the tile into shared memory by correctly calculating the data indices. You need to fill in the missing parts so that the convolution layer is complete. (Note that we use multidimensional indexing notation for simplicity.)

```
// The code will be launched by using the following configuration.
dim3 gridDim(ceil(H out*W out/(1.0*TILE WIDTH)),
             ceil(M/(1.0*TILE WIDTH)),B);
dim3 blockDim(TILE WIDTH,TILE WIDTH,1);
// Kernel code.
01: global void ConvLayerForward(int C, int K, int W out, int H out,
float* X, float* W, float* Y) {
      __shared__ float tileMatA[TILE_WIDTH][TILE_WIDTH];
02:
      shared float tileMatB[TILE WIDTH] [TILE WIDTH];
03:
04:
05:
     int b = blockIdx.z;
06:
07:
      int tx = threadIdx.x, ty = threadIdx.y;
      int row = blockIdx.y * TILE WIDTH + ty;
180
      int column = blockIdx.x * TILE WIDTH + tx;
09:
      int numMatAColumns = C*K*K; // This is the same as numMatBRows.
10:
11:
12:
      float acc = 0.0;
13:
      int num iterations = ceil(numMatAColumns/(1.0*TILE WIDTH));
14:
15:
16:
     for (int i = 0; i < num iterations; i++) {</pre>
```

Name: \_\_\_\_\_

```
17:
         int temp col = i*TILE WIDTH + tx, temp row = i*TILE WIDTH + ty;
18:
         tileMatA[ty][tx] = 0;
19:
         tileMatB[ty][tx] = 0;
20:
         // Original indices in the filter tensor.
21:
22:
         int W m = row;
         int W_c = \underline{\qquad} temp\_col/(K*K)\underline{\qquad};
23:
         int W h = (\text{temp\_col}\%(K^*K))/K , W w = (\text{temp\_col}\%(K^*K))\%K ;
24:
25:
26:
         if (temp col < numMatAColumns && row < M)
27:
           tileMatA[ty][tx] = W[W m, W c, W h, W w];
28:
         else
29:
           tileMatA[ty][tx] = 0;
30:
31:
         // Original indices in the input tensor.
32:
         int X b = b;
         int x_c = \underline{\qquad} temp_row/(K*K)\underline{\qquad};
33:
         int x_p = \underline{ } temp_row%(K*K)/K \underline{ }, x_q = \underline{ } (temp_row%(K*K))%K ;
34:
         int X_h = ___column/W_out ___, X_w = __column%W_out___;
35:
36:
37:
         if (temp row < numMatAColumns && column < H out*W out)</pre>
           tileMatB[ty][tx] = X[X b, X c, X h + X p, X w + X q];
38:
39:
         else
40:
           tileMatB[ty][tx] = 0;
41:
         __syncthreads();
42:
43:
        for (int q = 0; q < TILE_WIDTH; q++)
44:
           acc += tileMatA[ty][q] * tileMatB[q][tx];
45:
          syncthreads();
46:
47:
       }
48:
49:
      // Original indices in the output tensor.
50:
      int Y b = b;
51:
      int Y m = row;
      int Y h = column / W_out, Y_w = column % W_out;
52:
53:
54:
      if (row < M && column < W out*H out)</pre>
55:
         Y[Y b, Y m, Y h, Y w] = acc;
56: }
```

| Name:                                                                                                                                                                                                                                                                                                                                                                              |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Problem 6 (5 points): Profiling                                                                                                                                                                                                                                                                                                                                                    |
| One fine Monday morning at Pied Piper, you get an email from Richard Hendricks! The email is as follows:                                                                                                                                                                                                                                                                           |
| "Greetings from Richard Hendricks!                                                                                                                                                                                                                                                                                                                                                 |
| I learned that you have experience optimizing CUDA code through your ECE408 coursework! I would like to take your inputs to identify performance bottlenecks in our next generation machine learning model. Why don't you educate me as I am new to it                                                                                                                             |
| Regards<br>R.H."                                                                                                                                                                                                                                                                                                                                                                   |
| After reading this email, you are delighted to know that you can educate Richard and you meet him in the afternoon. Richard has already done his homework and has some profiling data and it's your time to explain the bottlenecks in his code.                                                                                                                                   |
| <b>Part 6a (3 points)</b> I have a code base with 1 kernel and 3 different optimizations. I profiled my kernel and saw the following in the profiler's visual analysis tool. See the charts on the following page. For each of the optimization, identify as specifically as possible the utilization limiting factor (i.e., memory bound, compute latency bound, resource bound). |
| a. Optimization 1: Memory bound                                                                                                                                                                                                                                                                                                                                                    |
| b. Optimization 2:Compute latency bound                                                                                                                                                                                                                                                                                                                                            |
| c. Optimization 3:Resource or latency bound                                                                                                                                                                                                                                                                                                                                        |
| <b>Part 6b (2 points)</b> Richard asks you to guess the optimization that allowed him to improve his performance from optimization 1 to optimization 2. There are several major optimizations that could have achieved such results. Provide two below.                                                                                                                            |
| a                                                                                                                                                                                                                                                                                                                                                                                  |
| b                                                                                                                                                                                                                                                                                                                                                                                  |

### Possible answers:

- i. Shared memory
- ii. reduced the memory divergence or remove uncoalesced accesses
  iii. register tiling or reducing the redundant or repeated usage of memory operations

| Name:   |  |  |
|---------|--|--|
| maille. |  |  |

Optimization 3



This blank page is provided as extra space for calculations