ECE408/CS483/CSE408 Fall 2022

Applied Parallel Programming

Lecture 4: CUDA Memory Model

#### Course Reminders

- Lab 1 submission deadline is coming up
  - Submission functionality (--submit) is still being worked on...
  - Be sure to submit AND do the MP1 quiz on Canvas
- Lab 2 will be out soon, it is due next Friday

#### Objective

- To learn the basic features of the memories accessible by CUDA threads
- To prepare for MP-2 basic matrix multiplication
- To learn to evaluate the performance implications of global memory accesses

#### **Executing Thread Blocks**



- Threads run concurrently
  - SM maintains thread/block id #s
  - SM manages/schedules thread execution

# Thread Scheduling (1/2)

- Each block is executed as 32-thread warps
  - An implementation decision, not part of the CUDA programming model
  - Warps are divided based on their linearized thread index
    - Threads 0-31: warp 0
    - Threads 32-63: warp 1, etc.
    - X-dimension first, then Y, then Z
  - Warps are scheduling units in SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM?
  - Each block is divided into 256/32 = 8 warps
  - 8 warps/blk \* 3 blks = 24 warps





#### Thread Scheduling (2/2)

- SM implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected



Example execution timing of an SM

### Control (branch) Divergence

- Main performance concern with branching is divergence
  - Threads within a single warp take different paths
  - Different execution paths are serialized in current GPUs
- A common case: divergence when a branch condition is a function of thread ID
  - if (threadIdx.x % 2) { }
    - This creates two different control paths for threads in a warp
    - Has divergence (50% of threads do nothing)
  - if ((threadIdx.x / WARP SIZE) % 2) { }
    - Also creates two different control paths, but...
    - Branch granularity is a whole multiple of warp size;
    - All threads in any given warp follow the same path
    - No divergence

#### **Block Granularity Considerations**

- For RGBToGrayscale, should one use 8X8, 16X16 or 32X32 blocks? Assume that in the GPU used, each SM can take up to 1536 threads and up to 8 blocks.
  - For 8X8, we have 64 threads per block. Each SM can take up to 1536 threads, which is 24 blocks. But each SM can only take up to 8 Blocks, only 512 threads (16 warps) will go into each SM!
  - For 16X16, we have 256 threads per block. Since each SM can take up to 1536 threads (48 warps), which is 6 blocks (within the 8 block limit). Thus we use the full thread capacity of an SM.
  - For 32X32, we would have 1024 threads per Block. Only one block can fit into an SM, using only 2/3 of the thread capacity of an SM.

#### Programmer View of CUDA Memories

#### Each thread can:

- Read/write per-thread registers (~1 cycle)
- Read/write per-block shared memory (~5 cycles)
- Read/write per-grid global memory (~500 cycles)
- Read/only per-grid constant memory (~5 cycles with caching)



# CUDA Variable Type Qualifiers

|        | Memory    | Scope                    | Lifetime |        |             |
|--------|-----------|--------------------------|----------|--------|-------------|
|        |           | <pre>int LocalVar;</pre> | register | thread | thread      |
| device | shared    | int SharedVar;           | shared   | block  | block       |
| device |           | int GlobalVar;           | global   | арр.   | application |
| device | constant_ | int ConstantVar;         | constant | арр.   | application |

- device
  - optional with <u>shared</u> or <u>constant</u>
  - not allowed by itself within functions
- Automatic variables with no qualifiers
  - in registers for primitive types and structures
  - in global memory for per-thread arrays

# Next Application: Matrix Multiplication

- Given two square matrices, M and N, dimensions Width × Width
  - we can multiply M by N
  - to compute a third Width × Width matrix, P:
  - P = MN

In terms of the elements of P, matrix multiplication implies computing...

$$P_{ij} = \sum_{k=1}^{Width} M_{ik} N_{kj}$$

#### Matrix Multiplication -- Simple CPU Version

```
// Matrix multiplication on the (CPU) host in single precision
void MatrixMul(float *M, float *N, float *P, int Width)
   for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
               float a = M[i * Width + k];
               float b = N[k * Width + j];
               sum += a * b;
           P[i * Width + j] = sum;
```

### Kernel Function - A Small Example

- Have each 2D thread block to compute a (BLOCK\_WIDTH)<sup>2</sup> sub-matrix of the result matrix
  - Each block has (BLOCK\_WIDTH)<sup>2</sup> threads
- Generate a 2D Grid of (WIDTH/BLOCK\_WIDTH)<sup>2</sup> blocks
- This concept is called tiling. Each block represents a tile.



# A Slightly Bigger Example (BLOCK\_WIDTH =2)

| P <sub>0,0</sub> | P <sub>0,1</sub> | P <sub>0,2</sub> | P <sub>0,3</sub> | P <sub>0,4</sub> | P <sub>0,5</sub> | P <sub>0,6</sub> | P <sub>0,7</sub> |
|------------------|------------------|------------------|------------------|------------------|------------------|------------------|------------------|
| P <sub>1,0</sub> | P <sub>1,1</sub> | P <sub>1,2</sub> | P <sub>1,3</sub> | P <sub>1,4</sub> | P <sub>1,5</sub> | P <sub>1,6</sub> | P <sub>1,7</sub> |
| P <sub>2,0</sub> | P <sub>2,1</sub> | P <sub>2,2</sub> | P <sub>2,3</sub> | P <sub>2,4</sub> | P <sub>2,5</sub> | P <sub>2,6</sub> | P <sub>2,7</sub> |
| P <sub>3,0</sub> | P <sub>3,1</sub> | P <sub>3,2</sub> | P <sub>3,3</sub> | P <sub>3,4</sub> | P <sub>3,5</sub> | P <sub>3,6</sub> | P <sub>3,7</sub> |
| P <sub>4,0</sub> | P <sub>4,1</sub> | P <sub>4,2</sub> | P <sub>4,3</sub> | P <sub>4,4</sub> | P <sub>4,5</sub> | P <sub>4,6</sub> | P <sub>4,7</sub> |
| P <sub>5,0</sub> | P <sub>5,1</sub> | P <sub>5,2</sub> | P <sub>5,3</sub> | P <sub>5,4</sub> | P <sub>5,5</sub> | P <sub>5,6</sub> | P <sub>5,7</sub> |
| P <sub>6,0</sub> | P <sub>6,1</sub> | P <sub>6,2</sub> | P <sub>6,3</sub> | P <sub>6,4</sub> | P <sub>6,5</sub> | P <sub>6,6</sub> | P <sub>6,7</sub> |
| P <sub>7,0</sub> | P <sub>7,1</sub> | P <sub>7,2</sub> | P <sub>7,3</sub> | P <sub>7,4</sub> | P <sub>7,5</sub> | P <sub>7,6</sub> | P <sub>7,7</sub> |

WIDTH = 8; BLOCK\_WIDTH = 2 Each block has 2\*2 = 4 threads

WIDTH/BLOCK\_WIDTH = 4 Use 4\* 4 = 16 blocks

# A Slightly Bigger Example (cont.) (BLOCK\_WIDTH = 4)

| P <sub>0,0</sub> | P <sub>0,1</sub> | P <sub>0,2</sub> | P <sub>0,3</sub> | P <sub>0,4</sub> | P <sub>0,5</sub> | P <sub>0,6</sub> | P <sub>0,7</sub> |
|------------------|------------------|------------------|------------------|------------------|------------------|------------------|------------------|
| P <sub>1,0</sub> | P <sub>1,1</sub> | P <sub>1,2</sub> | P <sub>1,3</sub> | P <sub>1,4</sub> | P <sub>1,5</sub> | P <sub>1,6</sub> | P <sub>1,7</sub> |
| P <sub>2,0</sub> | P <sub>2,1</sub> | P <sub>2,2</sub> | P <sub>2,3</sub> | P <sub>2,4</sub> | P <sub>2,5</sub> | P <sub>2,6</sub> | P <sub>2,7</sub> |
| P <sub>3,0</sub> |                  |                  |                  |                  |                  |                  |                  |
|                  |                  |                  |                  |                  |                  |                  |                  |
| P <sub>4,0</sub> | P <sub>4,1</sub> | P <sub>4,2</sub> | P <sub>4,3</sub> | P <sub>4,4</sub> | P <sub>4,5</sub> | P <sub>4,6</sub> | P <sub>4,7</sub> |
| P <sub>4,0</sub> |                  |                  |                  |                  |                  |                  |                  |
|                  | P <sub>5,1</sub> | P <sub>5,2</sub> | P <sub>5,3</sub> | P <sub>5,4</sub> | P <sub>5,5</sub> | P <sub>5,6</sub> | P <sub>5,7</sub> |

WIDTH = 8; BLOCK\_WIDTH = 4 Each block has 4\*4 = 16 threads

WIDTH/BLOCK\_WIDTH = 2 Use 2\* 2 = 4 blocks

#### Kernel Invocation (Host-side Code)

#### **Kernel Function**

```
// Matrix multiplication kernel - per thread code
__global__
void MatrixMulKernel(float *d_M, float *d_N, float *d_P, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
```

## Work for Block (0,0) for TILE\_WIDTH = 2



### Work for Block (0,1)



Row = 0Row = 1

| Moo              | M <sub>0,1</sub> | Moja             | M0,3             | $P_{0,0}$        | $P_{0,1}$        | P <sub>1</sub>   | P <sub>(,3</sub> |
|------------------|------------------|------------------|------------------|------------------|------------------|------------------|------------------|
| M <sub>1,0</sub> | M <sub>1,1</sub> | M <sub>1,2</sub> | M <sub>1,2</sub> | P <sub>0,1</sub> | $P_{1,1}$        | P <sub>1,2</sub> | P <sub>1,3</sub> |
| M <sub>2,0</sub> | M <sub>2,1</sub> | M <sub>2,2</sub> | M <sub>2,3</sub> | P <sub>2,0</sub> | P <sub>2,1</sub> | P <sub>2,2</sub> | P <sub>2,3</sub> |
| M <sub>3,0</sub> | M <sub>3,1</sub> | M <sub>3,2</sub> | M <sub>3,3</sub> | P <sub>3,0</sub> | P <sub>3,1</sub> | P <sub>3,2</sub> | P <sub>3,3</sub> |



| N <sub>0,0</sub> | N <sub>0,1</sub> | N | J <sub>,,2</sub> | M | ),3 |
|------------------|------------------|---|------------------|---|-----|
| N <sub>1,0</sub> | N <sub>1,1</sub> | N | J <sub>.,2</sub> | N | .,3 |
| N <sub>2,0</sub> | N <sub>2,1</sub> | N | J <sub>1,2</sub> | N | 2,3 |
| N <sub>3,0</sub> | N <sub>3,1</sub> | N | J <sub>2,3</sub> | N | 3,3 |
|                  |                  |   |                  |   |     |

# A Simple Matrix Multiplication Kernel

```
global
void MatrixMulKernel(float *d_M, float *d_N, float *d_P, int Width)
   // Calculate the column index of d P and d N
   int Col = blockIdx.x * blockDim.x + threadIdx.x;
   // Calculate the row index of d P and d M
   int Row = blockIdx.y * blockDim.y + threadIdx.y;
   if ((Row < Width) && (Col < Width)) {
      float Pvalue = 0;
      // each thread computes one element of d P
      for (int k = 0; k < Width; ++k)
          Pvalue += d M[Row*Width+k] * d N[k*Width+Col];
      d P[Row*Width+Col] = Pvalue;
```

# How about performance on a device with 150 GB/s memory bandwidth?

- All threads access global memory for their input matrix elements
  - Two memory accesses (8 bytes) per floating point multiply-add (2 fp ops)
  - 4B of memory for each FLOP
  - 150 GB/s limits the code at 37.5 GFLOPS
- The actual code runs at about 25 GFLOPS
- Need to drastically cut down memory accesses to get closer to the peak of more than 1,000 GFLOPS



# ANY MORE QUESTIONS? READ CHAPTER 4!