# Lecture 3. Performance guidelines.

In this notebook we describe the most important performance guidelines for programming NVIDIA GPUs.

We will consider the following aspects of optimizing CUDA kernels:
- memory access patterns: *memory coalescing* for the best throughput,
- control flow: how code branching affects the performance,
- multiprocessor occupancy: experimenting with different block sizes,
- instruction-level optimizations: avoiding data type conversion, using floating-point *intrinsics*.

In [None]:
! pip install --upgrade --force-reinstall git+https://github.com/pjarosik/ius-2021-gpu-short-course.git

Collecting git+https://github.com/pjarosik/ius-2021-gpu-short-course.git
  Cloning https://github.com/pjarosik/ius-2021-gpu-short-course.git to /tmp/pip-req-build-hqfc5k4j
  Running command git clone -q https://github.com/pjarosik/ius-2021-gpu-short-course.git /tmp/pip-req-build-hqfc5k4j
Building wheels for collected packages: gpu-short-course
  Building wheel for gpu-short-course (setup.py) ... done
  Created wheel for gpu-short-course: filename=gpu_short_course-0.0.1-py3-none-any.whl size=3119 sha256=34d57b30f61b6bfbb52418e76cdacc7ef9a58b3cae3d8dc1aea6fcd82fbc58c7
  Stored in directory: /tmp/pip-ephem-wheel-cache-gus5xq4n/wheels/4f/07/fc/9537d8ac1b84ce9cde4db4fcebd10fd77e93ea2c18fcb8a656
Successfully built gpu-short-course
Installing collected packages: gpu-short-course
  Attempting uninstall: gpu-short-course
    Found existing installation: gpu-short-course 0.0.1
    Uninstalling gpu-short-course-0.0.1:
      Successfully uninstalled gpu-short-course-0.0.1
Success

## Exercise 3.1. Memory access patterns.

According to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.

This means that the number of useful memory accesses performed by our kernel, and thus its performance, largely depends on its memory access pattern.

Our goal is to implement a GPU kernel that only loads **useful** data from global memory that will then be used in the calculations. We can achieve this with **coalesced memory accesses**.

To achieve coalesced memory accesses in our kernel, we need to meet the following conditions:
- the number of threads per block is a multiple of 32 threads,
- sequential threads in a warp access memory that is sequential.


For example, our baseline `add_vectors_gpu` and `convolve_gpu` implementations satisfy the above conditions:
- the number of threads per block was equal to 256,
- adjacent threads were reading adjacent memory areas, e.g. thread `i` read the `a[i]` and `b[i]`, and thread `i+1` read `a[i+1]` and `b[i+1]`.  


Below we discuss the reasons for both conditions.

### Exercise 3.1.1. Impact of misaligned accesses.

According to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> The number of threads per block should be a multiple of 32 threads, because this provides optimal computing efficiency and facilitates coalescing.

Recall that a warp reads global memory using a sequence of **32-byte** segment transactions.

Note that, according to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> Memory allocated through the CUDA Runtime API (...) is guaranteed to be aligned to at least 256 bytes.

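This means that if the block size is a multiple of the warp size, every block's data chunk starts on a 32-byte segment boundary (for 4-byte floats, 32 threads read 128 bytes, i.e. exactly four segments), so each block will load only its own data chunk from memory.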


#### Example

As an example, we will consider here:
- `add_vectors_gpu` function,
- block size = 13. 

Now let's take a look at the memory accesses performed by thread blocks 0, 1, etc.

**Block 0**

- reads memory area [0, 52), size: 52 bytes (13 x 4-byte floats)

```
[ Segment 0 (32 bytes) ][ Segment 1 (32 bytes) ][ Segment 2 (32 bytes) ] ...
[       block 0 data (52 bytes)       ]
```

- This requires 2 x 32-byte transfers (64 bytes).
- However, only 52 of these 64 bytes will be used (81%).

**Block 1**

- Reads memory area [52, 104), size: 52 bytes.
```
[ Segment 0 (32 bytes) ][ Segment 1 (32 bytes) ][ Segment 2 (32 bytes) ] ...
                                       [       block 1 data (52 bytes)       ]
```
- This requires 3 x 32-byte transfers (96 bytes).
- However, only 52 of these 96 bytes will be used (54%).


**And so on...**


As we can see, we are transferring a large amount of (theoretically) useless data.
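
The transfer counts above can be reproduced with a few lines of plain Python. This is only a sketch of the simplified model used in this example (one contiguous chunk of 4-byte floats per block, 32-byte segments, array starting at offset 0); the helper name `block_transfers` is made up for illustration and is not part of the course package.

```python
SEGMENT_BYTES = 32  # size of one global memory transaction (assumed, as in the text)
ITEM_BYTES = 4      # size of one float32 value


def block_transfers(block_idx, block_size):
    """Return (number of 32-byte segments touched, fraction of transferred bytes used)."""
    start = block_idx * block_size * ITEM_BYTES   # first byte read by the block
    end = start + block_size * ITEM_BYTES         # one past the last byte
    first_segment = start // SEGMENT_BYTES
    last_segment = (end - 1) // SEGMENT_BYTES
    n_segments = last_segment - first_segment + 1
    used = (end - start) / (n_segments * SEGMENT_BYTES)
    return n_segments, used


for block_idx in range(2):
    n, used = block_transfers(block_idx, block_size=13)
    print(f"block {block_idx}: {n} x 32-byte transfers, {used:.0%} used")
# block 0: 2 x 32-byte transfers, 81% used
# block 1: 3 x 32-byte transfers, 54% used
```

For a block size that is a multiple of 32 (e.g. 256), the same helper reports that every transferred byte is used.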

Let's see if there is any observable performance difference between scripts using 256 and 261 threads in a **block**:

In [None]:
%%writefile 3_1_1_aligned.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 256


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()

gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_1_1_aligned.py


In [None]:
! nvprof --trace gpu python 3_1_1_aligned.py

==120747== NVPROF is profiling process 120747, command: python 3_1_1_aligned.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0136 seconds (+/- 0.0338), median: 0.0091
==120747== Profiling application: python 3_1_1_aligned.py
==120747== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   55.47%  411.13ms       300  1.3704ms  1.2867ms  2.6277ms  [CUDA memcpy DtoH]
                   41.17%  305.19ms       200  1.5259ms  1.3844ms  2.9887ms  [CUDA memcpy HtoD]
                    3.36%  24.899ms       100  248.99us  247.87us  267.11us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.


In [None]:
%%writefile 3_1_1_misaligned.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 261


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()

gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_1_1_misaligned.py


In [None]:
! nvprof --trace gpu python 3_1_1_misaligned.py

==120777== NVPROF is profiling process 120777, command: python 3_1_1_misaligned.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0135 seconds (+/- 0.0362), median: 0.0092
==120777== Profiling application: python 3_1_1_misaligned.py
==120777== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   55.17%  403.34ms       300  1.3445ms  1.2913ms  2.8496ms  [CUDA memcpy DtoH]
                   41.37%  302.47ms       200  1.5123ms  1.3749ms  2.1925ms  [CUDA memcpy HtoD]
                    3.46%  25.261ms       100  252.61us  250.91us  275.04us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.


The difference in performance of the aligned and misaligned versions is rather minor (on some devices even negligible).

Why? 

In this particular case, adjacent warps **reuse the cached data** their neighbors fetched. 

Anyway, setting the block size to a multiple of the warp size is a good rule of thumb: it facilitates coalescing and (as we will discuss later) helps to avoid wasting multiprocessor computation time on under-populated warps.
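
A rough way to see why caching can hide the misalignment is to count, for every warp, the 32-byte segments it touches, and to compare the total with the number of distinct segments in the array. The sketch below assumes 4-byte floats and an array that starts on a segment boundary, and it says nothing about how the real cache hierarchy behaves; `segment_requests` is a hypothetical helper written only for this illustration.

```python
def segment_requests(n_items, block_size, item_bytes=4, segment_bytes=32, warp_size=32):
    """Return (total segment requests summed over all warps, number of distinct segments)."""
    requests = 0
    distinct = set()
    n_blocks = -(-n_items // block_size)  # ceiling division
    for block in range(n_blocks):
        block_start = block * block_size
        block_end = min(block_start + block_size, n_items)
        # Warps are formed from consecutive threads within a block.
        for first in range(block_start, block_end, warp_size):
            last = min(first + warp_size, block_end) - 1
            first_seg = first * item_bytes // segment_bytes
            last_seg = ((last + 1) * item_bytes - 1) // segment_bytes
            requests += last_seg - first_seg + 1
            distinct.update(range(first_seg, last_seg + 1))
    return requests, len(distinct)


n = 256 * 261  # a length divisible by both block sizes
for block_size in (256, 261):
    requests, distinct = segment_requests(n, block_size)
    print(f"block size {block_size}: {requests} requests, {distinct} distinct segments")
```

With 256 threads per block every segment is requested exactly once. With 261 threads the warps request more segments in total, but the set of distinct segments stays the same, so the extra requests target data that a neighboring warp needs anyway.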

### Exercise 3.1.2. Impact of strided accesses.

Non-unit-strided global memory accesses may reduce the effective memory bandwidth.

We say that a GPU kernel performs a unit-strided memory access if threads with successive identifiers read data from successive memory locations; in other words, the following access pattern is respected:

```
x = data[(some custom offset) + threadIdx.x]
```

When using the data access notation for a multidimensional array, make sure that the last axis is addressed using `threadIdx.x`:

```
x = data[(other dimensions...), threadIdx.x]
```


The degradation in performance can be especially apparent when working with multidimensional arrays: the choice of the axis along which a specific operation is performed can affect the effective bandwidth.
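
To get a feeling for the cost of a stride, the following sketch counts how many 32-byte segments a single warp touches when thread `t` reads element `t * stride`. It assumes 4-byte elements and an array that starts on a segment boundary; `warp_load_efficiency` is an illustrative name, not an API of Numba or CUDA.

```python
def warp_load_efficiency(stride_items, item_bytes=4, warp_size=32, segment_bytes=32):
    """Return (segments touched by one warp, useful bytes / transferred bytes)."""
    # Byte address read by each thread of the warp.
    addresses = [t * stride_items * item_bytes for t in range(warp_size)]
    segments = {address // segment_bytes for address in addresses}
    useful = warp_size * item_bytes
    transferred = len(segments) * segment_bytes
    return len(segments), useful / transferred


for stride in (1, 2, 8, 1024):
    n_segments, efficiency = warp_load_efficiency(stride)
    print(f"stride {stride:4d}: {n_segments:2d} segments, efficiency {efficiency:.1%}")
```

Under this model, any stride of 8 or more floats makes every thread pull in its own 32-byte segment while using only 4 bytes of it.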

#### Example

Let's consider a 1D convolution along one of the axes of a 2D array.

```
         axis 1
     ---------------
x = [[0,  1,  2,  3], |
     [4,  5,  6,  7], | axis 0
     [8,  9, 10, 11]] |

h = [1, 1]

```

NumPy stores arrays in row-major order by default, so the above array is actually kept in the computer's memory as the following 1D array:

```
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] 
```

Let's consider doing convolution along axis 0 and 1.

**Convolve along axis 1**:

```
         axis 1
     ---------------
x = [[0,  1,  2,  3], *  [1, 1] = [0,  1,  3,  5]
     [4,  5,  6,  7], *  [1, 1] = [4,  9, 11, 13]
     [8,  9, 10, 11]] *  [1, 1] = [8, 17, 19, 21]
```

For the first output row:

```
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  
    [1]      y = [0]
    [1, 1]       [1]
       [1, 1]    [3]
          [1, 1] [5]

```

`y[0]`, `y[1]`, `y[2]` and `y[3]` are computed by threads with `threadIdx.x` equal to `0`, `1`, `2` and `3`, respectively.

**Convolve along axis 0**:

```
x = [[0,  1,   2,   3], |
     [4,  5,   6,   7], | axis 0
     [8,  9,  10,  11]] | 
      *   *    *    *
     [1] [1]  [1]  [1]
     [1] [1]  [1]  [1]
   y  =   =    =    =
    [ 0] [ 1] [ 2] [ 3]
    [ 4] [ 6] [ 8] [10]
    [12] [14] [16] [18]
```

For the first output column:

```
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  
    [1]                                    y = [ 0]
    [1,          1]                            [ 4]
                [1,         1]                 [12]
```

`y[0]`, `y[1]` and `y[2]` are computed by threads with `threadIdx.x` equal to `0`, `1` and `2`, respectively.
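
The worked examples above can be checked with a small pure-Python reference implementation. It is a sketch that uses nested lists instead of NumPy arrays and the same offset convention as the kernels in this notebook (`o = ceil(N/2) - 1`); the function names are illustrative only.

```python
import math


def convolve_rows(x, h):
    """Convolve every row of a 2D nested list with the 1D filter h."""
    n = len(h)
    o = math.ceil(n / 2) - 1
    result = []
    for row in x:
        out_row = []
        for i in range(len(row)):
            value = 0
            for k in range(n):
                l = i + o - k
                if 0 <= l < len(row):   # skip samples outside the row
                    value += row[l] * h[k]
            out_row.append(value)
        result.append(out_row)
    return result


def transpose(x):
    return [list(column) for column in zip(*x)]


def convolve_2d(x, h, axis):
    """Convolve along axis 1 (rows) or axis 0 (columns, via a transpose)."""
    if axis == 1:
        return convolve_rows(x, h)
    return transpose(convolve_rows(transpose(x), h))


x = [[0, 1, 2, 3],
     [4, 5, 6, 7],
     [8, 9, 10, 11]]
h = [1, 1]
print(convolve_2d(x, h, axis=1))  # [[0, 1, 3, 5], [4, 9, 11, 13], [8, 17, 19, 21]]
print(convolve_2d(x, h, axis=0))  # [[0, 1, 2, 3], [4, 6, 8, 10], [12, 14, 16, 18]]
```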

As we can see in the above example, the stride is much larger for the convolution along axis 0. Will it impact the bandwidth?

In [None]:
%%writefile 3_1_2_convolve_strided_access.py
import math
import numpy as np
from numba import cuda, float32
import cupy as cp
import gpu_short_course

@cuda.jit
def convolve_axis0_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y*cuda.blockDim.y + cuda.threadIdx.y

    N = len(h)
    o = int(math.ceil(N/2)-1)
    HEIGHT = x.shape[0]
    WIDTH = x.shape[1]
    if i >= HEIGHT or j >= WIDTH:
        return
    
    value = float32(0.0)
    for k in range(N):
        l = i + o - k
        if l >= 0 and l < HEIGHT:

            # --- Get data along the first (0) axis.
            value += x[l, j] * h[k]
            
    y[i, j] = value
    
    
def convolve_axis0_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
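    block = (32, 32)
    height, width = x.shape
    block_h, block_w = block
    # In this kernel threadIdx.x indexes rows (axis 0), so the grid's
    # x dimension must cover the height and its y dimension the width.
    grid = (math.ceil(height/block_h), 
            math.ceil(width/block_w))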
    convolve_axis0_gpu_kernel[grid, block](y, x, h)
    return y.copy_to_host()


@cuda.jit
def convolve_axis1_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y*cuda.blockDim.y + cuda.threadIdx.y

    N = len(h)
    o = int(math.ceil(N/2)-1)
    
    HEIGHT = x.shape[0]
    WIDTH = x.shape[1]
    
    if i >= WIDTH or j >= HEIGHT:
        return
    
    value = float32(0.0)
    for k in range(N):
        l = i + o - k
        if l >= 0 and l < WIDTH:
            
            # --- Get data along the second (1) axis.
            value += x[j, l]*h[k]
            
    y[j, i] = value
    
    
def convolve_axis1_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block = (32, 32)
    height, width = x.shape
    block_h, block_w = block
    grid = (math.ceil(width/block_w), 
            math.ceil(height/block_h))
    convolve_axis1_gpu_kernel[grid, block](y, x, h)
    return y.copy_to_host()


gpu_short_course.run_convolve_2d_input(convolve_axis0_gpu, axis=0)
gpu_short_course.run_convolve_2d_input(convolve_axis1_gpu, axis=1)

Writing 3_1_2_convolve_strided_access.py


In [None]:
! python 3_1_2_convolve_strided_access.py --mode test

GPU:0: b'GeForce MX250'
All tests passed.
All tests passed.


In [None]:
! nvprof --trace gpu python 3_1_2_convolve_strided_access.py --mode benchmark quiet=1

==120832== NVPROF is profiling process 120832, command: python 3_1_2_convolve_strided_access.py --mode benchmark quiet=1
GPU:0: b'GeForce MX250'
Benchmarking, please wait...
Benchmarking, please wait...
==120832== Profiling application: python 3_1_2_convolve_strided_access.py --mode benchmark quiet=1
==120832== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   60.57%  6.42933s       100  64.293ms  50.388ms  97.451ms  cudapy::__main__::convolve_axis0_gpu_kernel$241(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                   31.24%  3.31607s       100  33.161ms  25.843ms  46.395ms  cudapy::__main__::convolve_axis1_gpu_kernel$242(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                    5.27%  559.35ms       600  932.25us     768ns  2.9319ms  [CUDA me

On my GPU (NVIDIA GeForce MX250), convolution along axis 1 takes roughly half the time of convolution along axis 0 (about 33 ms vs. 64 ms per kernel call on average).

We can use profiler metrics to check the global memory access efficiency in both cases:

In [None]:
! nvprof --metrics gld_efficiency,gst_efficiency python 3_1_2_convolve_strided_access.py --mode benchmark n=1 quiet=1 2>&1 | grep -v "^="

GPU:0: b'GeForce MX250'
Benchmarking, please wait...
Benchmarking, please wait...
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce MX250 (0)"
    Kernel: cudapy::__main__::convolve_axis0_gpu_kernel$241(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
          1                            gld_efficiency             Global Memory Load Efficiency      12.50%      12.50%      12.50%
          1                            gst_efficiency            Global Memory Store Efficiency      12.24%      12.24%      12.24%
    Kernel: cudapy::__main__::convolve_axis1_gpu_kernel$242(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
          1                            gld_efficiency             Global Memory Load Efficiency      70.05%      70.05%  

## Exercise 3.2. Control flow: how code branching affects the performance.

Due to the SIMT architecture of CUDA multiprocessors, it is recommended to avoid divergent execution paths within the same warp.

According to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> Flow control instructions (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be executed separately; this increases the total number of instructions executed for this warp.

### Example

Let's implement the following function:

```
y[i] = r[i]*a[i] + b[i]
```

where `r[i] = (i mod 8) + 1`.

We can implement it in one of the two ways:
1. directly by definition (see `add_vectors_mod8_kernel`),
2. by doing a sequence of `if ... elif ... elif ... else` blocks (see `add_vectors_mod8_branches_kernel`).

In [None]:
%%writefile 3_2_control_flow.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests


block_size = 256


@cuda.jit
def add_vectors_mod8_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    r = i % 8 + 1
    result[i] = r*a[i] + b[i]


def add_vectors_mod8(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_mod8_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


@cuda.jit
def add_vectors_mod8_branches_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    if i % 8 == 0:
        result[i] = a[i] + b[i]
    elif i % 8 == 1:
        result[i] = 2*a[i] + b[i]
    elif i % 8 == 2:
        result[i] = 3*a[i] + b[i]
    elif i % 8 == 3:
        result[i] = 4*a[i] + b[i]
    elif i % 8 == 4:
        result[i] = 5*a[i] + b[i]
    elif i % 8 == 5:
        result[i] = 6*a[i] + b[i]
    elif i % 8 == 6:
        result[i] = 7*a[i] + b[i]
    elif i % 8 == 7:
        result[i] = 8*a[i] + b[i]


def add_vectors_mod8_branches(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_mod8_branches_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


gpu_short_course.tests.benchmark_add_vectors(add_vectors_mod8)
gpu_short_course.tests.benchmark_add_vectors(add_vectors_mod8_branches)

Writing 3_2_control_flow.py


Let's check how much time it takes to execute each of the kernels:

In [None]:
! nvprof --trace gpu python 3_2_control_flow.py

==120889== NVPROF is profiling process 120889, command: python 3_2_control_flow.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0150 seconds (+/- 0.0370), median: 0.0106
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0168 seconds (+/- 0.0178), median: 0.0122
==120889== Profiling application: python 3_2_control_flow.py
==120889== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   41.58%  837.90ms       600  1.3965ms  1.2846ms  2.8637ms  [CUDA memcpy DtoH]
                   31.36%  631.89ms       400  1.5797ms  1.3816ms  3.3680ms  [CUDA memcpy HtoD]
                   22.81%  459.63ms       100  4.5963ms  1.4817ms  38.123ms  cudapy::__main__::add_vectors_mod8_branches_kernel$242(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
           

Let's also measure the `branch_efficiency` metric, defined as:
> Ratio of non-divergent branches to total branches expressed as percentage.

In [None]:
! nvprof --metrics branch_efficiency python 3_2_control_flow.py

==120924== NVPROF is profiling process 120924, command: python 3_2_control_flow.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0190 seconds (+/- 0.0341), median: 0.0144
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0193 seconds (+/- 0.0165), median: 0.0146
==120924== Profiling application: python 3_2_control_flow.py
==120924== Profiling result:
==120924== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce MX250 (0)"
    Kernel: cudapy::__main__::add_vectors_mod8_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
        100                         branch_efficiency                         Branch Efficiency     100.00%     100.00%     100.00%
    Kernel: cudapy::__main__::add_vectors_

Of course, the above example has been artificially complicated just to show the effect of complex kernel logic on the kernel's performance.
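
A simple way to reason about divergence is to count how many distinct branches the threads of one warp take. The sketch below does this in plain Python for an arbitrary branch selector function. It is a toy model (real hardware and compilers may, for instance, predicate short branches instead of diverging), and `paths_per_warp` is a name invented for this example.

```python
WARP_SIZE = 32


def paths_per_warp(branch_of_thread, n_threads):
    """For each warp, count the distinct branches taken by its threads."""
    counts = []
    for warp_start in range(0, n_threads, WARP_SIZE):
        warp = range(warp_start, min(warp_start + WARP_SIZE, n_threads))
        counts.append(len({branch_of_thread(i) for i in warp}))
    return counts


# Branch chosen per thread, as in the if/elif chain above.
per_thread = paths_per_warp(lambda i: i % 8, n_threads=256)
# Hypothetical alternative: branch chosen per warp (computes a different function).
per_warp = paths_per_warp(lambda i: (i // WARP_SIZE) % 8, n_threads=256)

print(per_thread)  # [8, 8, 8, 8, 8, 8, 8, 8]
print(per_warp)    # [1, 1, 1, 1, 1, 1, 1, 1]
```

Selecting the branch by `i % 8` sends the threads of every warp down eight different paths, whereas a selector that is constant within a warp keeps each warp on a single path.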

## Exercise 3.3. Multiprocessor occupancy: thread block size.

Recall that:
> The number of threads per block should be a **multiple of 32 threads**, because this provides optimal computing efficiency and facilitates coalescing.

[CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) also gives some other suggestions how to choose the proper number of threads per block:

> There are many such factors involved in selecting block size, and inevitably some experimentation is required. However, a few rules of thumb should be followed:
> 1. Threads per block should be **a multiple of warp size** to avoid wasting computation on under-populated warps and to facilitate coalescing.
> 2. A **minimum of 64 threads** per block should be used, and only if there are multiple concurrent blocks per multiprocessor.
> 3. Between **128 and 256 threads** per block is a good initial range for experimentation with different block sizes.
> 4. Use several smaller thread blocks rather than one large thread block per multiprocessor if latency affects performance. This is particularly beneficial to kernels that frequently call __syncthreads().

### Example

Let's measure the occupancy of `add_vectors` for different numbers of threads per block:


In [None]:
%%writefile 3_3_occupancy_16.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 16


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_3_occupancy_16.py


In [None]:
! nvprof --trace gpu python 3_3_occupancy_16.py

==120971== NVPROF is profiling process 120971, command: python 3_3_occupancy_16.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0163 seconds (+/- 0.0370), median: 0.0108
==120971== Profiling application: python 3_3_occupancy_16.py
==120971== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   43.63%  416.23ms       300  1.3874ms  1.2930ms  2.6358ms  [CUDA memcpy DtoH]
                   31.96%  304.87ms       200  1.5243ms  1.4158ms  2.7896ms  [CUDA memcpy HtoD]
                   24.42%  232.96ms       100  2.3295ms  642.53us  16.229ms  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.


According to NVIDIA documentation, `achieved_occupancy` measures:
> Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor.

In [None]:
! nvprof --metrics achieved_occupancy python 3_3_occupancy_16.py

==120999== NVPROF is profiling process 120999, command: python 3_3_occupancy_16.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0236 seconds (+/- 0.0420), median: 0.0174
==120999== Profiling application: python 3_3_occupancy_16.py
==120999== Profiling result:
==120999== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce MX250 (0)"
    Kernel: cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
        100                        achieved_occupancy                        Achieved Occupancy    0.302620    0.336189    0.322717


In [None]:
%%writefile 3_3_occupancy_256.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 256


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_3_occupancy_256.py


In [None]:
! nvprof --trace gpu python 3_3_occupancy_256.py
! nvprof --metrics achieved_occupancy python 3_3_occupancy_256.py

==121026== NVPROF is profiling process 121026, command: python 3_3_occupancy_256.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0126 seconds (+/- 0.0315), median: 0.0090
==121026== Profiling application: python 3_3_occupancy_256.py
==121026== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   54.84%  392.64ms       300  1.3088ms  1.2852ms  1.5636ms  [CUDA memcpy DtoH]
                   41.69%  298.46ms       200  1.4923ms  1.3814ms  2.6135ms  [CUDA memcpy HtoD]
                    3.47%  24.828ms       100  248.28us  247.39us  257.15us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.
==121054== NVPROF is profiling process 121054, command: python 3_3_occupancy_256.py
GPU:0: b'GeForce MX250'
Benchmarking

In [None]:
%%writefile 3_3_occupancy_1024.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 1024


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()

gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_3_occupancy_1024.py


In [None]:
! nvprof --trace gpu python 3_3_occupancy_1024.py
! nvprof --metrics achieved_occupancy python 3_3_occupancy_1024.py

==121085== NVPROF is profiling process 121085, command: python 3_3_occupancy_1024.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0143 seconds (+/- 0.0316), median: 0.0105
==121085== Profiling application: python 3_3_occupancy_1024.py
==121085== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   53.86%  422.55ms       300  1.4085ms  1.2888ms  2.6570ms  [CUDA memcpy DtoH]
                   40.29%  316.07ms       200  1.5804ms  1.3943ms  3.2166ms  [CUDA memcpy HtoD]
                    5.86%  45.945ms       100  459.45us  249.19us  1.1231ms  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.
==121120== NVPROF is profiling process 121120, command: python 3_3_occupancy_1024.py
GPU:0: b'GeForce MX250'
Benchmark

Finding the right number of threads per block requires some experimentation, but generally 128 or 256 threads is a good starting point.
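
Before experimenting, a back-of-the-envelope upper bound on occupancy can be helpful. The sketch below considers only two per-multiprocessor limits, the maximum number of resident warps and of resident blocks, and ignores registers and shared memory. The default limits (64 warps, 32 blocks) are assumptions chosen for illustration, so look up the values for your device; `theoretical_occupancy` is a hypothetical helper, not a Numba function.

```python
import math


def theoretical_occupancy(block_size, max_warps_per_sm=64, max_blocks_per_sm=32,
                          warp_size=32):
    """Upper bound on occupancy from the warp and block limits only."""
    warps_per_block = math.ceil(block_size / warp_size)
    # Resident blocks are limited by both the block limit and the warp limit.
    blocks = min(max_blocks_per_sm, max_warps_per_sm // warps_per_block)
    return blocks * warps_per_block / max_warps_per_sm


for block_size in (16, 256, 1024):
    print(block_size, theoretical_occupancy(block_size))
# 16 0.5
# 256 1.0
# 1024 1.0
```

Under these assumptions, very small blocks run into the block limit before they can fill the multiprocessor with warps, and a 16-thread block additionally leaves half of the lanes of its only warp unused.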

## Exercise 3.4. Instruction-level optimizations.

This chapter covers the following:

1. Impact of the data type selection and conversion on the kernel's performance.
2. Floating-point intrinsics.

### Exercise 3.4.1. Data types.

When implementing a CUDA GPU kernel, keep the following in mind:

- the choice of the input data type may affect the kernel performance,
- data type conversion in kernel implementation may affect kernel performance.

Let's compare `float32` and `float64` data, using a one-dimensional convolution as an example.

First, let's recall our baseline implementation of the convolution operator.

NOTE:
- our benchmark function generates `float32` input data,
- in the kernel implementation, we used the `float32(...)` cast to enforce the proper data type of the `value` variable.

In [None]:
%%writefile 3_4_1_convolve_float32.py
import math
import numpy as np
from numba import cuda, float32
from gpu_short_course.tests import benchmark_convolve


@cuda.jit
def convolve_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(y):
        return
    M, N = len(x), len(h)
    
    o = int(math.ceil(N/2)-1)

    value = float32(0.0)
    for j in range(N):
        k = i+o-j
        if k >= 0 and k < M:
            value += x[k]*h[j]
    y[i] = value
    

def convolve_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block_size = 256
    grid_size = math.ceil(len(y)/block_size)
    convolve_gpu_kernel[grid_size, block_size](y, x, h)
    return y.copy_to_host()


benchmark_convolve(convolve_gpu, dtype=np.float32)

Writing 3_4_1_convolve_float32.py


In [None]:
! nvprof --trace gpu python 3_4_1_convolve_float32.py

==121149== NVPROF is profiling process 121149, command: python 3_4_1_convolve_float32.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.1231 seconds (+/- 0.0903), median: 0.0957
==121149== Profiling application: python 3_4_1_convolve_float32.py
==121149== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   96.46%  11.3062s       100  113.06ms  19.506ms  348.87ms  cudapy::__main__::convolve_gpu_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                    2.26%  265.12ms       300  883.73us  1.1840us  2.6277ms  [CUDA memcpy DtoH]
                    1.27%  149.27ms       200  746.35us  1.0240us  1.6264ms  [CUDA memcpy HtoD]
No API activities were profiled.


Now let's remove the `float32(...)` cast from the initialization of `value` (line `17` of the cell above) and see if it has any effect on performance.

In [None]:
%%writefile 3_4_1_convolve_float32_and_float64.py
import math
import numpy as np
from numba import cuda, float32
from gpu_short_course.tests import benchmark_convolve


@cuda.jit
def convolve_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(y):
        return
    M, N = len(x), len(h)
    
    o = int(math.ceil(N/2)-1)

    value = 0.0
    for j in range(N):
        k = i+o-j
        if k >= 0 and k < M:
            value += x[k]*h[j]
    y[i] = value
    

def convolve_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block_size = 256
    grid_size = math.ceil(len(y)/block_size)
    convolve_gpu_kernel[grid_size, block_size](y, x, h)
    return y.copy_to_host()


benchmark_convolve(convolve_gpu, dtype=np.float32)

Writing 3_4_1_convolve_float32_and_float64.py


In [None]:
! nvprof --trace gpu python 3_4_1_convolve_float32_and_float64.py

==121192== NVPROF is profiling process 121192, command: python 3_4_1_convolve_float32_and_float64.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.1373 seconds (+/- 0.1268), median: 0.0682
==121192== Profiling application: python 3_4_1_convolve_float32_and_float64.py
==121192== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   96.84%  12.7187s       100  127.19ms  30.170ms  532.16ms  cudapy::__main__::convolve_gpu_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                    2.03%  266.36ms       300  887.85us  1.2160us  2.6352ms  [CUDA memcpy DtoH]
                    1.13%  148.28ms       200  741.41us     960ns  2.5882ms  [CUDA memcpy HtoD]
No API activities were profiled.


The processing time may increase because `0.0` is a `float64` literal, so each product is promoted from `float32` to `float64` before it is accumulated, i.e.:

`value += (float64)(x[k]*h[j])`

The accumulated value is then converted back from `float64` to `float32` when it is stored:

`y[i] = (float32)value`.

(Note: the effect may vary between GPUs.)
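The promotion described above can be illustrated on the CPU with NumPy scalars. This is only a sketch of the type rules under the assumption that NumPy's promotion mirrors what happens in the kernel; it does not run on the GPU and does not inspect the code Numba actually generates. The variable names (`x_k`, `h_j`, `value64`, `value32`, `stored`) are made up for this illustration.

```python
import numpy as np

# Stand-ins for single elements x[k] and h[j] of float32 arrays.
x_k = np.float32(0.1)
h_j = np.float32(0.2)

# float32 * float32 stays float32.
product = x_k * h_j

# A float64 accumulator forces the product to be promoted to float64.
value64 = np.float64(0.0) + product

# A float32 accumulator keeps everything in float32 (no promotion).
value32 = np.float32(0.0) + product

# Storing into a float32 output array converts the float64 value back.
stored = np.float32(value64)

print(product.dtype, value64.dtype, value32.dtype, stored.dtype)
```

With the `float32(0.0)` initializer the whole loop stays in single precision; with a plain `0.0` each iteration involves one extra conversion, plus one more when the result is stored.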

Let's check what results we get when we use only `float64` values in the computations.

In [None]:
%%writefile 3_4_1_convolve_float64.py

import math
import numpy as np
from numba import cuda, float32
from gpu_short_course.tests import benchmark_convolve


@cuda.jit
def convolve_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(y):
        return
    M, N = len(x), len(h)
    
    o = int(math.ceil(N/2)-1)

    value = 0.0
    for j in range(N):
        k = i+o-j
        if k >= 0 and k < M:
            value += x[k]*h[j]
    y[i] = value
    

def convolve_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block_size = 256
    grid_size = math.ceil(len(y)/block_size)
    convolve_gpu_kernel[grid_size, block_size](y, x, h)
    return y.copy_to_host()


benchmark_convolve(convolve_gpu, dtype=np.float64)

Writing 3_4_1_convolve_float64.py


In [None]:
! nvprof --trace gpu python 3_4_1_convolve_float64.py

==121243== NVPROF is profiling process 121243, command: python 3_4_1_convolve_float64.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.1755 seconds (+/- 0.0391), median: 0.1882
==121243== Profiling application: python 3_4_1_convolve_float64.py
==121243== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   94.90%  16.0141s       100  160.14ms  20.212ms  245.10ms  cudapy::__main__::convolve_gpu_kernel$241(Array<double, int=1, C, mutable, aligned>, Array<double, int=1, C, mutable, aligned>, Array<double, int=1, C, mutable, aligned>)
                    3.24%  547.14ms       300  1.8238ms  1.3120us  5.1162ms  [CUDA memcpy DtoH]
                    1.86%  313.50ms       200  1.5675ms  1.1840us  6.1230ms  [CUDA memcpy HtoD]
No API activities were profiled.


The above results may vary from one GPU card to another.

## Exercise 3.4.2. Floating point intrinsics.

The CUDA Math API provides a set of intrinsic functions specialized for various calculations. Sometimes, calling an intrinsic function explicitly in your code may improve its performance.

A complete list of intrinsic functions for CUDA C/C++ is available [here](https://docs.nvidia.com/cuda/cuda-math-api/).

A complete list of floating-point intrinsics in Numba is available [here](https://numba.pydata.org/numba-doc/latest/cuda-reference/kernel.html#floating-point-intrinsics).

Note: when optimizing the machine code, the compiler may decide to use an intrinsic function (the same or a similar one) regardless of whether we used it in our implementation. Still, if you want to be sure that the intrinsic function is used, you should call it explicitly.
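For context, a fused multiply-add evaluates `x*y + z` and rounds the result only once, whereas a separate multiplication and addition round twice. The sketch below emulates this on the CPU in pure Python using exact rational arithmetic. `fma_reference` is a hypothetical helper written for this illustration, not part of Numba or CUDA, and it works in `float64` rather than the `float32` used by the kernels in this exercise.

```python
from fractions import Fraction


def fma_reference(a, b, c):
    """Hypothetical helper: compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the single rounding step


# Inputs chosen so that the exact product a*b = 1 - 2**-54
# is not representable in float64.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

separate = a * b + c            # the product is rounded, then the sum is rounded
fused = fma_reference(a, b, c)  # only the final result is rounded

print("separate multiply and add:", separate)
print("fused multiply-add:       ", fused)
```

Here the separately rounded product becomes exactly `1.0`, so the two-step result is `0.0`, while the fused result keeps the small term `-2**-54`. An explicit fused multiply-add can therefore change the numerical result slightly, in addition to any effect on performance.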

Let's check whether using the `cuda.fma` intrinsic in our convolution kernel gives any improvement:

In [None]:
%%writefile 3_4_2_convolve_intrinsics.py
import math
from numba import cuda, float32
import numpy as np
import gpu_short_course.tests
import cupy as cp


@cuda.jit
def convolve_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(y):
        return

    M, N = len(x), len(h)
    o = int(math.ceil(N/2)-1)
    
    value = float32(0.0)
    for j in range(N):
        k = i + o - j
        if k >= 0 and k < M:
            value += x[k]*h[j]
    y[i] = value


def convolve(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block_size = 256
    grid_size = math.ceil(len(y)/block_size)
    convolve_gpu_kernel[grid_size, block_size](y, x, h)
    return y.copy_to_host()


@cuda.jit
def convolve_fma_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(y):
        return

    M, N = len(x), len(h)
    o = int(math.ceil(N/2)-1)
    
    value = float32(0.0)
    for j in range(N):
        k = i + o - j
        if k >= 0 and k < M:
            value = cuda.fma(x[k], h[j], value)
    y[i] = value


def convolve_fma(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block_size = 256
    grid_size = math.ceil(len(y)/block_size)
    convolve_fma_gpu_kernel[grid_size, block_size](y, x, h)
    return y.copy_to_host()


gpu_short_course.tests.benchmark_convolve(convolve_fma)
gpu_short_course.tests.benchmark_convolve(convolve)

Overwriting 3_4_2_convolve_intrinsics.py


In [None]:
!nvprof --trace gpu python 3_4_2_convolve_intrinsics.py

==121341== NVPROF is profiling process 121341, command: python 3_4_2_convolve_intrinsics.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0323 seconds (+/- 0.0499), median: 0.0270
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0322 seconds (+/- 0.0107), median: 0.0313
==121341== Profiling application: python 3_4_2_convolve_intrinsics.py
==121341== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   44.91%  2.38789s       100  23.879ms  18.964ms  30.330ms  cudapy::__main__::convolve_gpu_kernel$242(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                   37.76%  2.00778s       100  20.078ms  17.149ms  25.120ms  cudapy::__main__::convolve_fma_gpu_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, a