# Convolution

**Profiling and Optimizations** 



### What is a Convolution?





#### **Formal Definition**

If we assume that the dimension of the filter is  $(2r_x + 1)$  in the x dimension and  $(2r_y + 1)$  in the y dimension, the calculation of each P element can be expressed as follows

$$P_{y,x} = \sum_{j=-r_y}^{r_y} \sum_{k=-r_x}^{r_x} f_{y+j,x+k} \times N_{y,x}$$











## Why Convolutions matter?

- Image processing
  - Blurring, sharpening, embossing, edge detection
  - A good explanation can be found <u>here</u>
- Artificial Intelligence and Machine Learning
  - Convolutional neural network
  - You can visualize the process <u>here</u>







## **Naive Implementation**

- Every thread is assigned to calculate an output pixel
  - Thread's *x* and *y* indices are the same as the thread's *x* and *y* indices.
- The ratio of floating-point arithmetic calculation to global memory accesses is only about 0.25 OP/B to global memory (DRAM)
  - 2 operations for every 8 bytes loaded





## **Problems and Optimizations**

- Constant Memory
- Coarsening
- Tiling
- Caching

| Optimization                                          | Benefit to compute cores                                               | Benefit to memory                                                                 | Strategies Tuning usage of SM resources such as threads per block, shared memory per block, and registers per thread                                                                                                               |  |
|-------------------------------------------------------|------------------------------------------------------------------------|-----------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| Maximizing occupancy                                  | More work to hide<br>pipeline latency                                  | More parallel<br>memory accesses<br>to hide DRAM<br>latency                       |                                                                                                                                                                                                                                    |  |
| Enabling<br>coalesced<br>global<br>memory<br>accesses | Fewer pipeline<br>stalls waiting for<br>global memory<br>accesses      | Less global memory<br>traffic and better<br>utilization of bursts/<br>cache lines | Transfer between global memory and shared memory in a coalesced manner and performing uncoalesced accesses in shared memory (e.g., corner turning)  Rearranging the mapping of threads to data  Rearranging the layout of the data |  |
| Minimizing<br>control<br>divergence                   | High SIMD<br>efficiency (fewer<br>idle cores during<br>SIMD execution) | -                                                                                 | Rearranging the mapping<br>of threads to work and/or<br>data<br>Rearranging the layout of<br>the data                                                                                                                              |  |
| Tiling of reused data                                 | Fewer pipeline<br>stalls waiting for<br>global memory<br>accesses      | Less global memory traffic                                                        | Placing data that is reused within a block in shared memory or registers so that it is transferred between global memory and the SM only once                                                                                      |  |
| Privatization<br>(covered<br>later)                   | Fewer pipeline<br>stalls waiting for<br>atomic updates                 | Less contention and<br>serialization of<br>atomic updates                         | Applying partial updates to<br>a private copy of the data<br>and then updating the<br>universal copy when done                                                                                                                     |  |
| Thread coarsening                                     | Less redundant<br>work, divergence,<br>or synchronization              | ice, global memory parallelism to each thread                                     |                                                                                                                                                                                                                                    |  |

### **Optimizations - Constant Memory**

#### Constant Memory

- A constant memory cannot be modified by threads during kernel execution
- The size of the constant memory is quite small (64KB) and efficient
  - It is a read only cache
    - Less complex hardware to manage it
  - **ATTENTION:** it is near the device memory, and access to different parts of it will be <u>serialized</u>
    - good performance only when each thread accesses the same data
- Since everything is in constant memory cache, access in global memory is not needed, thus the Arithmetic Intensity becomes 0.5 OP/B (2 operations for every 4 bytes loaded)



## Optimizations - Coarsening

#### Coarsening

- We use the same caching and tiling approach as before
- We increase efficiency by coarsening threads
  - Each block will process multiple tile
  - Thus each block's thread will produce multiple output
  - A stride access has to be defined.



## **Optimizations - Tiling**

- Tiling (shared memory)
  - It still leverages constant memory
  - Threads load the input tile into the shared memory
  - They calculate output tiles throughout the shared memory.
- Threads will be dimensioned based on the input tile sizes
  - Thus during computation, the halo threads will be disabled





## **Optimizations Analysis**

- The Arithmetic Intensity analysis becomes more complex
  - $\circ$  Operations: OUT TILE DIM<sup>2</sup> \* (2 \* FILTER RADIUS + 1)<sup>2</sup> \* 2
  - Since floating point operations are executed 4 bytes are considered
  - Memory Access: IN\_TILE\_DIM<sup>2</sup> \* 4=(OUT\_TILE\_DIM+2 \* FILTER\_RADIUS)<sup>2</sup> \* 4
- Asymptotically if OUT\_TILE\_DIM >> FILTER\_RADIUS
  - $Operations/Memory\ Access = (2 * FILTER\_RADIUS+1)^2 / 2$
- For example

| IN_TILE_DIM                       |              | 8             | 16    | 32    | Bound |
|-----------------------------------|--------------|---------------|-------|-------|-------|
| 5x5 filter                        | OUT_TILE_DIM | 4             | 12    | 28    |       |
| (FILTER_RADIUS = 2)               | Ratio        | 3.13          | 7.03  | 9.57  | 12.5  |
| 9x9 filter<br>(FILTER_RADIUS = 4) | OUT_TILE_DIM | : <b>-</b> :: | 8     | 24    | -     |
|                                   | Ratio        | 17.0          | 10.13 | 22.78 | 40.5  |

## **Optimizations - Caching**

#### Caching Halo cells

- Similar to the previous approach about tiling
- Halo cells are data that come from other tiles
- Upon running it is very likely that these data are already available in the L2 cache
- We load in shared memory only the internal part of the tile

• We rely on the fact that halo cells are already available in L2 from previous

requests



## Code Hands-on



#### Bonus

- <u>cuDNN</u> can be used to perform convolutions
  - You have a reference to the CNN sub-library at the following <u>link</u>
- It can be used more in general to implement Convolutional neural network
- <u>cuBLAS</u> could be used to perform a matrix multiplication
  - Between each input tile and the filter



## Thank you for your attention!

Gianmarco Accordi gianmarco.accordi@polimi.it

