ECE408/CS483/CSE408 Fall 2022

**Applied Parallel Programming** 

Lecture 8: Tiled Convolution

## Course Reminders

- Lab 3 is due this Friday
- Lab 4 will be posted soon, it is due next week
- Midterm 1 is coming up

## Objective

- To learn about tiled convolution algorithms
  - Some intricate aspects of tiling algorithms
  - Output tiles versus input tiles
  - Three different styles of input tile loading
  - To prepare for Lab 4

## Tiled 1D Convolution Basic Idea



## What Shall We Parallelize?

In other words,

What should one thread do?

#### One answer:

- (same as with vector sum and matrix multiply)
- compute an output element!

## Should We Use Shared Memory?

In other words,

Can we reuse data read from global memory?

Let's look at the computation again...



Reuse reduces global memory bandwidth, so let's use shared memory.

## How Much Reuse is Possible?

MASK\_WIDTH is 5

tile 2 3 4 5 6 7 8 9

- Element 2 is used by thread 4 (1x)
- Element 3 is used by threads 4, 5 (2×)
- Element 4 is used by threads 4, 5, 6 (3x)
- Element 5 is used by threads 4, 5, 6, 7 (4x)
- Element 6 is used by threads 4, 5, 6, 7 (4x)
- Element 7 is used by threads 5, 6, 7 (3×)
- Element 8 is used by threads 6, 7 (2×)
- Element 9 is used by thread 7 (1x)

## What About the Halos?

In other words,

Do we also copy halos into shared memory?



Let's consider both possible answers.

## Can Access Halo from Global Memory

### Approach:

- threads read halo values
- directly from global memory.

#### Advantage:

- optimize reuse of shared memory
- (halo reuse is smaller).

### Disadvantages:

- Branch divergence! (shared vs. global reads)
- Halo too narrow to fill a memory burst

## Can Load Halo to Shared Memory

### Approach:

load halos to shared memory.

#### Advantages:

- Coalesce global memory accesses.
- No branch divergence during computation.

#### Disadvantages:

- Some threads must do >1 load, so
   some branch divergence in reading data.
- Slightly more shared memory needed.

# Three Tiling Strategies



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018 ECE408/CS483/ University of Illinois at Urbana-Champaign

# Strategy 1: Variable Meanings for a Block



## Loading the left halo



```
int radius = Mask_Width / 2;
int halo_index_left = (blockIdx.x - 1) * blockDim.x + threadIdx.x;
if (threadIdx.x >= (blockDim.x - radius)) {
    N_ds[threadIdx.x - (blockDim.x - radius)] =
        (halo_index_left < 0) ? 0 : N[halo_index_left];
}</pre>
```

## Loading the internal elements



```
int index = blockIdx.x * blockDim.x + threadIdx.x;
if ((blockIdx.x * blockDim.x + threadIdx.x) < Width)
    N_ds[radius + threadIdx.x] = N[index];
else
    N_ds[radius + threadIdx.x] = 0.0f;</pre>
```

# Loading the right halo



```
N_ds 2 3 4 5 6 7 8 9
```

```
int halo_index_right = (blockIdx.x + 1)*blockDim.x + threadIdx.x;
if (threadIdx.x < radius) {
    N_ds[radius + blockDim.x + threadIdx.x] =
        (halo_index_right >= Width) ? 0 : N[halo_index_right];
}
```

```
global void convolution 1D tiled kernel(float *N, float *P, int Mask Width, int Width) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int radius = Mask Width / 2;
 shared float N ds[TILE SIZE + MAX MASK WIDTH - 1];
int halo index left = (blockIdx.x - 1) * blockDim.x + threadIdx.x;
if (threadIdx.x >= (blockDim.x - radius)) {
  N ds[threadIdx.x - (blockDim.x - radius)] =
    (halo index left < 0) ? 0 : N[halo index left];</pre>
N ds[radius + threadIdx.x] = N[i]; // bounds check is needed
int halo index right = (blockIdx.x + 1) * blockDim.x + threadIdx.x;
if (threadIdx.x < radius) {</pre>
  N ds[radius + blockDim.x + threadIdx.x] =
    (halo index right >= Width) ? 0 : N[halo index right];
syncthreads();
                                                                     Strategy 1
float Pvalue = 0;
for (int j = 0; j < Mask Width; <math>j++) {
  Pvalue += N ds[threadIdx.x + j]*M[j];
P[i] = Pvalue;
```

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018 ECE408/CS483/ University of Illinois at Urbana-Champaign

# Alternative implementation of Strategy 1: Variable Meanings for a Block



## Load the Input Data – step 1



## Load the Input Data – step 2



```
global void convolution 1D tiled kernel float *N, float *P, int Width) {
 int i = blockIdx.x * blockDim.x + threadIdx.x;
 int radius = MASK WIDTH / 2;
 int start = i - radius;
 shared float N ds[TILE SIZE + MASK WIDTH - 1];
 if (0 <= start && Width > start) { // all threads
   N ds[threadIdx.x] = N[start];
 else
   N ds[threadIdx.x] = 0.0f;
 if (MASK WIDTH - 1 > threadIdx.x) {      // some threads
   start += TILE SIZE;
   if (Width > start) {
     N ds[threadIdx.x + TILE SIZE] = N[start];
   else
     N ds[threadIdx.x + TILE SIZE] = 0.0f;
 syncthreads();
                                                                         Alt.
 float Pvalue = 0.0f;
                                                                  Strategy 1
 for (int j = 0; MASK WIDTH > j; j++) {
   Pvalue += N ds[threadIdx.x + j] * Mc[j];
 P[i] = Pvalue;
```

```
global
void convolution 1D tiled cache kernel(float *N, float *P, int Mask Width, int Width) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  shared float N ds[TILE WIDTH];
 N ds[threadIdx.x] = N[i];
  syncthreads();
  int radius = Mask Width / 2;
  int This tile start point = blockIdx.x * blockDim.x;
  int Next tile start point = (blockIdx.x + 1) * blockDim.x;
  int N start point = i - radius;
 float Pvalue = 0;
  for (int j = 0; j < Mask Width; <math>j ++) {
    int N index = N start point + j;
    if (N index >= 0 && N index < Width) {
      if ((N index >= This tile start point) && (N index < Next tile start point))
         Pvalue += N ds[threadIdx.x-radius+j] * M[j];
       else
        Pvalue += N[N index] * M[j];
                                                                    Strategy 3
 P[i] = Pvalue;
```

## Review: What Shall We Parallelize?

In other words,

What should one thread do?

#### One answer:

- (same as with vector sum and matrix multiply)
- compute an output element!
  - Strategy 1 & 3

Is that our only choice? (What about Strategy 2?)

# Strategy 2: Parallelize Loading of a Tile

### Alternately,

- each thread loads one input element, and
- some threads compute an output.

(compared with previous approach)

### Advantage:

- No branch divergence for load (high latency).
- Avoid narrow global access (2 × halo width).

## Disadvantage:

Branch divergence for compute (low latency).

# 2D Example of Loading Parallelization

Let's do an example for 2D convolution

- Thread block matches input tile size
- Each thread loads one element of input tile
- Some threads do not participate in calculating output (Strategy 2)

# Parallelizing Tile Loading

- Load a tile of N into shared memory
  - All threads participate in loading
  - A subset of threads then use each N element in shared memory



# Output Tiles Still Cover the Output!



# Input tiles need to be larger than output tiles



# **Setting Block Dimensions**

There need to be enough thread blocks to generate all P elements.

There need to be enough threads to load entire tile of input.

# Shifting from output coordinates to input coordinates



# Shifting from output coordinates to input coordinates

```
int tx = threadIdx.x;
int ty = threadIdx.y;
int row o = blockIdx.y * TILE WIDTH + ty;
int col o = blockIdx.x * TILE WIDTH + tx;
int row i = row o-2; // MASK WIDTH / 2
int col i = col o-2; // (radius in
                        prev. code)
```

# Threads that loads halos outside N should return 0.0



## Taking Care of Boundaries

```
float Pvalue = 0.0f;
if((row i >= 0) && (row i < Width) &&
   (col i >= 0) && (col i < Width)) {
 tile[ty][tx] = N[row i*Width + col i];
} else {
 tile[ty][tx] = 0.0f;
 syncthreads (); // wait for tile
```

## Not All Threads Calculate Output

```
if(ty < TILE WIDTH && tx <TILE WIDTH) {
  for (i = 0; i < 5; i++) {
    for (j = 0; j < 5; j++) {
      Pvalue += Mc[i][j] * tile[i+ty][j+tx];
  // if continues on next page
```

# Not All Threads Write Output

```
if (row o < Width && col o < Width)
        P[row o * Width + col o] = Pvalue;
   } // end of if selecting output
     // tile threads
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
```

227x27

# Alternatively

- You can extend the 1D strategy 3 tiled convolution into a 2D strategy 3 tiled convolution.
  - Each input tile matches its corresponding output tile
  - All halo elements will be loaded from global memory
  - If condition and divergence during inner product computation

mxn 71



## ANY MORE QUESTIONS? READ CHAPTER 7

## ANY MORE QUESTIONS? READ CHAPTER 7