

## ECE 408 Exam 2, Spring 2018

### April 23, 2018

- You are allowed one 8.0x11.5 cheat sheet with notes on both sides. The minimal font size for your text on the cheat sheet should be 8pts.
- No interactions with humans other than course staff are allowed.
- This exam is designed to take 150 minutes to complete. To allow for any unforeseen difficulties, we will be giving everyone up to 180 minutes.
- This exam is based on lectures, textbook chapters, as well as lab MPs/projects.
- The questions are randomly selected from the topics we covered.
- You can write down the reasoning behind your answers for possible partial credit.
- You must write your answers with pen in order to request regrade.

### Good luck!

| Name:                   |  |
|-------------------------|--|
| Netid:                  |  |
| UIN:                    |  |
| Question 1: (25 points) |  |
| Question 2: (15 points) |  |
| Question 3: (15 points) |  |
| Question 4: (15 points) |  |
| Question 5: (15 points) |  |
| Question 6: (15 points) |  |

**Question 1 (25 points, 30 minutes):** multiple-choice and short-answer questions. If you get more than 25 points by answering all questions (1-9), your score will saturate at 25 points.

For multiple-choice questions, give a concise explanation for your answer for possible partial credit. Answer each of the short-answer questions in as few words as you can. Your answer will be graded based on completeness, correctness, and conciseness.

- 1. (3 points) For the Brent-Kung scan kernel based on reduction trees and inverse reduction trees, assume that we have 2048 elements, which of the following values is the total levels of the reduction tree and the inverse reduction tree?
  - (A)7
  - (B) 11
  - (C) 21
  - (D) 25
- 2. (3 points) For the Brent-Kung scan kernel based on reduction trees and inverse reduction trees, assume that we have 2048 elements in each section and warp size is 32, how many warps in each block will have control divergence during the reduction tree phase iteration where stride is 16?
  - (A) 0
  - (B) 1
  - (C) 16
  - (D) 32

- 3. (3 points) For a processor that supports atomic operations in L2 cache, assume that each atomic operation takes 5ns to complete in L2 cache and 500ns to complete in DRAM. What is the approximate hit rate of the L2 cache needed to achieve a throughput of 1/250 G operations per second?
  - (A) 99%
  - (B) 75%
  - (C) 50%
  - (D) 25%
- 4. (3 points) Given a sparse matrix of integers with R original rows, L non-zero elements in the original row with the largest number of non-zeros, and a total of N non-zeros. How many more integers are needed to represent the matrix in JDS as compared to CSR?
  - (A) R
  - (B) 2N
  - (C) L
  - (D) N
- 5. (3 points) For a sparse matrix-vector multiplication (SpMV) with 100,000 rows, a total of 300,000 non-zero elements, and a maximal of 10 non-zeros in each row, how many additional zero elements will be added when we convert the matrix from CSR to ELL?
  - (A) 5
  - (B) 300,000
  - (C) 700,000
  - (D) 1,000,000

| Name: |  |
|-------|--|
|       |  |

- 6. (3 points) For a sparse matrix-vector multiplication (SpMV) with 100,000 rows, 10,000 columns, a total of 300,000 non-zero elements, how many times on average will each vector element be used?
  - (A) 1
  - (B) 3 on average
  - (C) 30 on average
  - (D) 100,000
- 7. (3 points) Keven has a 640MB array that he would like to process with GPU. He measured that the execution time of the code on CPU was 0.02 seconds. He also implemented a kernel and measured that the kernel execution on the GPU with the data in the GPU memory was 0.0004 seconds, a 50x speedup! However, he needs to transfer the data into the GPU memory. There is negligible data to be transferred back to the CPU. His system has a PCIe Gen3 x16 interconnect. What would the real speedup be?
  - (A) 40x speedup
  - (B) 20x speedup
  - (C) 5x speedup
  - (D) 0.5x speedup (100% slow down)
- 8. (3 points) For the following host code sequence:

```
1) cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float),.., stream0);
2) cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float),.., stream1);
3) cudaMemcpyAsync(d_A2, h_A+i+2*SegSize, SegSize*sizeof(float),.., stream2);
4) cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float),.., stream0);
5) cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float),.., stream1);
6) cudaMemcpyAsync(h_C+i+2*SegSize, d_C2, SegSize*sizeof(float),.., stream2);
```

Which of the statements could be executed in parallel on the GPU

- (A) 1) and 2)
- (B) 2) and 3)
- (C) 2) and 4)
- (D) 4) and 5)
- 9. (3 points) In the following OpenACC code, which of the following is false?
  - (A) Statement 1 will be executed redundantly executed by the the 32 gangs.
  - (B) The n iterations of loop i will be divided and distributed to the 32 gangs for execution.
  - (C) Statement 2 will be executed a total of n times.
  - (D) Statement 3 will be executed a total of m times.

```
#pragma acc parallel num_gangs(32)
{
    Statement 1;
    #pragma acc loop gang
    for (int i=0; i<n; i++) {
        Statement 2;
    }
    for (int j=0; j<m; j++) {
        Statement 3;
    }
}</pre>
```

| Name: |
|-------|
|-------|

**Question 2 (15 points, suggested time allocation 25 minutes):** This question tests your understanding of parallel histogram computation and privatization.

You are part of a team that tallies up the voting result of a president's election. The electron was between **four** different candidates. You have 300 million of votes that you need to process. Knowing GPUs can process the mass amount of data, you decided to create a **modified** version of the histogram kernel you learned in class to count how many votes each candidate got.

In the modified version of the histogram kernel, instead of having one global histogram, you have an array of global histograms so that you can have less contention on the global histograms. The global histogram array index is calculated as blockldx.x % NUM\_HISTOGRAM. Carefully analyze the host code as it will help. Below is a visualization of the modified kernel for a small example of three candidates, three thread blocks and an array of two global histograms.

(A) (3 Points) Complete the following kernel to implement the modified histogram. Note that each thread block will still have its own privatized histogram in the shared memory.

Name: \_\_\_\_\_

```
unsigned int histogram_array[NUM_HISTOGRAM * NUM_BIN] = {0};
  unsigned int histogram[NUM_BIN] = {0};
  dim3 gridDim(10000);
  dim3 blockDim(1000);
  histo kernel<<<DimGrid, DimBlock>>>(votes, NUM_VOTERS, histogram_array);
  for(int i = 0; i <NUM_HISTOGRAM; i++) {</pre>
      for(int j = 0; j < NUM_BIN; j++) {</pre>
          histogram[j] += histogram_array[i*NUM_BIN + j];
      }
  }
 global void histo kernel(unsigned int *votes, long size, unsigned int
*histo array){
   __shared__ unsigned int histo_private[____];
// Each element of the votes[] array contains an integer that selects one of
// the candidates. It value ranges from 0 to NUM_BIN-1
   int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = _____;
   // Reset histogram
   if( threadIdx.x <NUM BIN)</pre>
       histo_private[threadidx.x] = 0;
   __syncthreads();
   while (i < size) {</pre>
    atomicAdd(&(histo_private[____]) , 1);
    i += stride;
   }
   syncthreads();
  // contribute to one of the global histograms
  if( threadIdx.x <NUM BIN){</pre>
    int histo index = blockIdx.x % NUM HISTOGRAM;
    atomicAdd(&(histo_array[______
                                                             ____]),
               histo_private[_____]);
  }
}
```

(B) (2 Point) How many times does the 500th thread of the 25th block( threadIdx.x = 500, blockIdx.x = 25) iterate in the **while loop**?

| Name:                                                                                                                                                                                                                                                                                                                             |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| (C) (2 Points) How many non-atomic global memory reads/writes and shared-memory/global-memory atomic operations are being performed by all the threads executing the kernel?  Non-atomic Global Memory reads:  Non-atomic Global Memory writes:  Shared-memory atomic operations:  Global-memory atomic operations:  Explanation: |
| For the following questions, consider only the atomic operations in the process of analyzing the kernel code. Assume that  - each atomic operation in the global memory has a constant total latency of 100ns.  - each atomic operation in a shared memory has a total latency of 1ns.                                            |
| (D) (2 Points) If the votes were evenly spread out ( 25% each), what is the theoretical minimum runtime of the <b>original histogram kernel</b> (i.e., NUM_HISTOGRAM = 1)?                                                                                                                                                        |
| (E) (2 Points) If the votes were evenly spread out ( 25% each), what is the theoretical minimum runtime of the <b>modified histogram kernel</b> ?                                                                                                                                                                                 |
| (F) (2 Points)If the vote resulted in a distribution of (90%, 5%, 5%, 0%), what is the theoretical minimum runtime of the <b>original histogram kernel</b> ?                                                                                                                                                                      |
| (G) (2 Points) If the vote resulted in a distribution of (80%, 5%,5%,10%), what is the theoretical minimum runtime of the <b>modified histogram kernel</b> ?                                                                                                                                                                      |

| Name: |
|-------|
|       |

(H) (Extra Credit 1 Point) What technique could further help with a dataset that has a large concentration of identical data values in localized areas in the histogram kernel? (Hint: This technique was discussed in the textbook.)

#### Question 3: Scan (15 points, suggested time allocation 25 minutes):

To further optimize the scan kernel, we can perform the scan operations more efficiently by further dividing the first kernel of the prefix-sum MP 5.2 into three steps. Note that all the three steps are performed within one kernel.

- In Step 1 (see the figure below), each thread operates on its own small section and perform sequential scan in the shared memory. The number of these small sections are the same as the number of threads in a block. (The figure below shows a small example where each block has 4 threads, each thread handles 4 elements, and each block is handling 16 elements).
- II. In Step 2, all threads preform Kogge-Stone scan **only** on every last element of the small sections in the shared memory.
- III. In Step 3, each threads will add the last element (partial sum of its own section) to the elements in the next section, except for the last one since it's already computed in the second step.

This algorithm only completes the scan on the data belongs each block. In order to perform a complete scan, there will be another two kernels adding partial sums to data of other blocks (like the hierarchical scan in MP5.2). This is just an attempt to make the per-block scan kernel more efficient by having each thread to handle more than one element.

For this question, we will be focusing only on the scan kernel described above. You can also assume that the number of input data elements is a multiple of the number of threads or the size of the shared memory (i.e, don't worry about the boundary conditions).

| Name:    |  |  |  |
|----------|--|--|--|
| inallie. |  |  |  |

(A) (6 points) Fill in the blanks to complete the scan kernel described above. (Hint: use the figure when you analyze the kernel code.)

```
#define TILE_WIDTH 1024 // Number of threads in each block
#define SHARE_LIMIT 4*1024 // Total number of elements being scanned in each block
// Number of elements in each small section to be processed by each thread
#define NUM_PER_SEC (SHARE_LIMIT/TILE_WIDTH)
__global__ void scan(float* arr)
   // Shared memory to hold input elements for each block
   __shared__ float scanShare[SHARE_LIMIT];
   // Explicitly store the last element of each small sub-section
   __shared__ float sectionEndShare[TILE_WIDTH];
   size_t tx = threadIdx.x;
   size_t offset = blockIdx.x * SHARE_LIMIT;
   size_t subsection_start =_____
   // Step 1
   // - Each thread will perform a scan on its own small section
   for (size_t i = 0; i < NUM_PER_SEC; i++) {</pre>
       size_t currIdx = subsection_start + i;
       scanShare[currIdx] = arr[offset + currIdx];
       if (i > 0){
           scanShare[currIdx] += scanShare[currIdx - 1];
   }
   sectionEndShare[tx] = scanShare[(tx+1)*NUM_PER_SEC - 1];
   // Step 2
   // - Kogge-Stone scan on the end of each section
   for (size_t stride = 1; stride < TILE_WIDTH; stride *= 2) {</pre>
       __syncthreads();
       if (tx >= stride) {
           sectionEndShare[tx] += sectionEndShare[tx - stride];
       }
   }
   scanShare[(tx+1)*NUM_PER_SEC - 1] = sectionEndShare[tx];
   // Step 3
   // Add partial sums to necessary elements
   __syncthreads();
   for (size_t i = 0; i < _____; i++) {
```

```
Name:
       if (tx > 0){
           scanShare[subsection_start + i] += sectionEndShare[_____];
       }
   }
   // Write results back to global memory
   __syncthreads();
   for (size_t i = 0; i < NUM_PER_SEC; i++) {</pre>
       size_t currIdx = subsection_start + i;
       arr[offset + currIdx] = scanShare[currIdx];
   }
}
// Code on host side:
// num is the number elements in the input array
int numBlocks = (num-1) / (SHARE_LIMIT) + 1;
dim3 dimBlock(TILE_WIDTH, 1, 1);
dim3 dimGrid(numBlocks, 1, 1);
scan<<<dimGrid, dimBlock>>>(dev_arr);
// Assume there will be other hierarchical scans to complete scan on the entire
array
. . .
```

(B) (5 points) After completing this kernel, you found out that the scan results is incorrect. It turns out that the implementation of Kogge-Stone in Step 2 is buggy. Explain what the problem is and rewrite the Kogge-Stone part such that it is correct.

Explanation:

New Kogge-Stone:

```
}
    _____;
sectionEndShare[tx] = temp;
}
```

(C) (4 points) When reading (step 1) and writing data (step 3) from/to global memory, there are still room for improvement. In order to further optimize it, we can first load all the data in a coalesced manner and then perform scan (this is what we call corner turning technique in the class). We can then have better memory coalescing and fewer control divergence.

Fill in the blanks such that the access pattern for reading memory (step 1) is coalesced. (Writing to memory in step 3 can be done similarly.) Note that all of the following code is still in the kernel.

(D) (3 points) Calculate the total number of arithmetic FLOPs in step 2 per block (i.e, only considering the Kogge-Stone part). Choose only one answer. Below are some equations that you might need:

| Name: |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|-------|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|-------|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|

$$log_2(1024) = 10, log_2(4096) = 12$$
  
1 + 2 + ... + N/2 = N - 1

- a. 4096 \* 10
- b. 4096 \* 12
- c. 4096 \* 1024
- d. 4096 \* 2 2 12
- e. 4096 \* 10 (4096 1)
- f. 4096 \* 12 (4096 1)

Is it possible perform scan on the same amount of data (4\*1024 floating point numbers) using Kogge-Stone kernel directly in a single block? Why or why not?

(E) (2 points) We already learned in class that both Kogge-Stone and Brent-Kung scan kernel can speedup the scan operations. What is the main drawback of the Kogge-Stone scan kernel? What is the main drawback of the Brent-Kung scan kernel?

# Question 4. Sparse Matrix Multiplication (15 points, suggested time allocation 25 minutes):

This question tests your knowledge of Sparse Matrix representation and operation. We first give you the kernel codes for CSR, ELL, JDS, and JDS-T formats (You may not need to read through all the kernels. They are here help you to recall what we discussed in lecture). Based on the given code, please answer the multiple-choice questions and short answer questions below. Please note that in this problem, you have to give explanation for each question for full points.

```
CSR Kernel:
```

6. 7.

```
    __global__ void SpMV_CSR(int num_rows, float *data,

       int *col_index, int *row_ptr, float *x, float *y) {
       int row = blockIdx.x * blockDim.x + threadIdx.x;
2.
3.
      if (row < num_rows) {</pre>
4.
         float dot = 0;
         int row_start = row_ptr[row];
6.
         int row_end = row_ptr[row+1];
         for (int elem = row_start; elem < row_end; elem++) {</pre>
7.
           dot += data[elem] * x[col_index[elem]];
8.
         }
9.
         y[row] = dot;
       }
    }
ELL Kernel:

    __global__ void SpMV_ELL (int num_rows, float *data, int *col_index, int

num_elem, float *x, float *y) {
2. int row = blockIdx.x * blockDim.x + threadIdx.x;
3. if (row < num_rows) {</pre>
4.
       float dot = 0;
5.
       for (int i = 0; i < num_elem; i++) {</pre>
6.
              dot += data[row+i*num_rows] * x[col_index[row+i*num_rows]];
7.
       }
8.
      y[row] = dot;
9.
    }
10. }
JDS Kernel:

    __global__ void SpMV_JDS(int num_rows, float *data,

 int *col_index, int *jds_row_ptr,int *jds_row_perm,
 float *x, float *y) {
2.
       int row = blockIdx.x * blockDim.x + threadIdx.x;
3.
       if (row < num_rows) {</pre>
         float dot = 0;
4.
         int row_start = jds_row_ptr[row];
```

int row\_end = jds\_row\_ptr[row+1];

for (int elem = row\_start; elem < row\_end; elem++) {</pre>

```
Name:
8.
           dot += data[elem] * x[col_index[elem]];
9.
        y[jds_row_perm[row]] = dot;
      }
     }
JDS-T Kernel:
1.__global__ void SpMV_JDS_T(int num_rows, float *data,
 int *col_index, int *jds_t_col_ptr, int *jds_row_perm,
  float *x, float *y) {
      int row = blockIdx.x * blockDim.x + threadIdx.x;
2.
      if (row < num_rows) {</pre>
        float dot = 0;
4.
         unsigned in sec = 0;
        while (jds_t_col_ptr[sec+1]-jds_t_col_ptr[sec] > row){
5.
           dot += data[jds_t_col_ptr[sec]+row]*
6.
                  x[col_index[jds_t_col_ptr[sec]+row]];
7.
           sec++;
        y[jds_row_perm[row]] = dot;
8.
      }
     }
```

(A) (2 points) If we want to have a sparse matrix multiplication kernel which minimizes the control divergence between threads (given number of threads generated is a multiple of 32), which kernel above is our **best** option and give explanation?

CSR

**ELL** 

**JDS** 

JDS-T

**Explanation:** 

(B) (3 points) In terms of memory accessing, generally speaking, which kernel (or kernels) access the memory for the input matrix (the data array in this problem) in a coalesced manner? (Circle all possible choices and give explanation)

**CSR** 

ELL

**JDS** 

JDS-T

Explanation:

For the following questions, consider the original dense form of matrix = Assuming all the kernels are invoked using:

```
dim3 BlockDim (4, 1, 1);
```

| Name:                                                                                                                                                                           |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| dim3 GridDim (1, 1, 1);                                                                                                                                                         |
| Draw a diagram for each of the following sparse matrix format to show how the data is stored and the access pattern of all threads in first two iterations.                     |
| We give you an example for the <b>CSR</b> format. You need to draw similar diagrams for <b>JDS</b> , and <b>JDS-T</b> formats. For each part, give explanation for full points. |
| (Example) (0 point) <b>CSR</b> format: Diagram:                                                                                                                                 |
|                                                                                                                                                                                 |
|                                                                                                                                                                                 |
| Explanation:                                                                                                                                                                    |
| (C) (5 points) <b>JDS</b> format: Diagram:                                                                                                                                      |
|                                                                                                                                                                                 |
|                                                                                                                                                                                 |
|                                                                                                                                                                                 |
|                                                                                                                                                                                 |
| Explanation:                                                                                                                                                                    |
| (D) (5 points) <b>JDS-T</b> format: Diagram:                                                                                                                                    |

|              | Name: |
|--------------|-------|
|              |       |
|              |       |
|              |       |
|              |       |
|              |       |
|              |       |
| Explanation: |       |

# Question 5. Convolutional Neural Network(CNN) (15 points, suggested time allocation 25 minutes):

A basic convolution layer consists of filter W, input X and output Y. We want to accelerate the forward propagation of convolution layers in the training process.

- W is the convolution filter weight tensor, organized a tensor W[M, C, K, K], where
  - M is the number of output feature maps,
  - o C is the number of input feature maps,
  - K is the height and width of each filter.

Tensors are stored as multi-dimensional arrays in the memory.

- X is the input feature map, organized as a tensor X[B, C, H, W], where
  - B is the number of images in one mini-batch (recall that mini-batch is used to efficiently updates gradients while keeping relatively fast convergence),
  - H is the height of each input feature map and
  - W is the width of each input feature map.
- Y is the output feature map, organized as a tensor Y[B, M, H\_out, W\_out], where
  - H\_out = H K + 1 is the height of each output feature map and
  - W\_out = W K + 1 is the width of each output feature map.

For all questions below assume the CNN is implemented using convolution.

- (A) (2 points) Consider the following 2 declarations for the block dimension: dim3 WHDim(W\_out, H\_out, 1); dim3 HWDim(H\_out, W\_out, 1); Which declaration would cause the CNN to execute quicker and why?
- (B) (3 points) Consider the following 2 declarations for the block dimension where the first one maps threads to the output feature map elements and the second one maps threads to the input feature map elements:

```
dim3 outputDim(W_out, H_out, 1);
dim3 inputDim(W, H, 1);
```

If you use shared memory in your kernel, name one advantage of each declaration.

(C) (3 points) Consider the case where the convolution filter weight tensor W is too big to fit into constant memory. We decide to instead put it into shared memory and call it Wshared. Assuming the weights are single-precision floating-point numbers, what is the minimum amount of shared memory in bytes we need to allocate for each block and how many copies of Wshared will be allocated and loaded in the kernel execution with the following block and grid dimensions.

```
dim3 gridDim(B, M, C);
dim3 blockDim(W_out, H_out, 1);
```

(D) (3 points) Consider the case where we decide to use 3 CUDA Streams and are not given the memory already copied on the device for us like in the project. In the case where the memory transfer of the required input for a given computation takes longer than the computation itself and the time to transfer the output is even shorter, which of the following input memory section sizes will have the lowest execution time assuming no time overhead for launching of kernels for each section and that the weight tensor W is already in the device constant memory? Explain why. This is illustrated for you below.

- a) sizeof(float) \* B \* C \* H \* W bytes
- b) sizeof(float) \* B/3 \* C \* H \* W bytes
- c) sizeof(float) \* 1 \* C \* H \* W bytes
- d) all of the above are equivalent

| Name: |  |
|-------|--|
|       |  |

(E) (4 points) Consider now that we have a constant cost overhead to begin a computation and it is large enough that the the computation plus the overhead for an input section takes longer than the memory transfer of the input section. Assuming B is 15 and the time to transfer the entire input is 45 seconds and the time to transfer the entire output is 15 seconds, will your answer change for certain sized overhead? Explain why and how long the time the overhead must be if yes or why not if no. This is illustrated for you below.

| Name:                                                                                                                                                                                                                                                                                         |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Question 6. Use of visual profiler to analyze the performance of kernel execution. (15 points, suggested time allocation 20 minutes):  The profile shown on the next page was generated using nvprof and the NVIDIA Visual Profiler:                                                          |
| <ul> <li>It is annotated with two timestamps and four durations.</li> <li>It is a matrix multiplication of a [250x80] [80x1000] = [250x1000] matrix of floats, requiring approximately 40 million floating-point operations.</li> </ul>                                                       |
| The times have been adjusted to make the math easier without a calculator.                                                                                                                                                                                                                    |
| You may find the following equalities useful when computing the answers $8 / 7 = 1.14$ $32/7 = 4.58$ $8 / 2.5 = 3.2$                                                                                                                                                                          |
| (A) (3 points) How many GFLOPS (billion floating-point operations per second) does the GPU kernel achieve?                                                                                                                                                                                    |
| (B) (4 points) From the perspective of the host, the <i>performance</i> is the number of floating-point operations divided by the elapsed time between the beginning of the first cudaMemcpy to the end of the last cudaMemcpy. What is the <i>performance</i> of this matrix multiplication? |
|                                                                                                                                                                                                                                                                                               |

(D) (4 points) Estimate the host-to-device performance on the CPU to GPU link, assuming the [250x80] floatmatrix was copied using the annotated (0.025ms) cudaMemcpy