# Lecture 3. Performance guidelines.

In this notebook we describe the most important guidelines for writing code for NVIDIA GPUs.

We will consider the following aspects of optimizing CUDA kernels:
- memory access patterns: *memory coalescing* for the best throughput,
- control flow: how code branching affects the performance,
- multiprocessor occupancy: experimenting with different block sizes,
- instruction-level optimizations: avoiding data type conversion, using floating point *intrinsics*.

In [1]:
! pip install --upgrade --force-reinstall git+https://github.com/us4useu/ius-2021-gpu-short-course.git

Collecting git+https://github.com/us4useu/ius-2021-gpu-short-course.git
  Cloning https://github.com/us4useu/ius-2021-gpu-short-course.git to /tmp/pip-req-build-wbr_nmpi
  Running command git clone -q https://github.com/us4useu/ius-2021-gpu-short-course.git /tmp/pip-req-build-wbr_nmpi
Building wheels for collected packages: gpu-short-course
  Building wheel for gpu-short-course (setup.py) ... done
  Created wheel for gpu-short-course: filename=gpu_short_course-0.0.1-py3-none-any.whl size=3119 sha256=c75a350312997dad55e8fdbbd05d52f80052a9f42ac13d200433f0061bfbd6dd
  Stored in directory: /tmp/pip-ephem-wheel-cache-fahb98fv/wheels/2e/0b/02/0f96bd15166b9aa28455baa3c75c129be82031e9b820986f2d
Successfully built gpu-short-course
Installing collected packages: gpu-short-course
  Attempting uninstall: gpu-short-course
    Found existing installation: gpu-short-course 0.0.1
    Uninstalling gpu-short-course-0.0.1:
      Successfully uninstalled gpu-short-course-0.0.1
Successfully ins

## Exercise 3.1. Memory access patterns.

According to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.

This means that the fraction of transferred data that is actually useful to our kernel, and thus its performance, largely depends on the kernel's memory access pattern.

Our goal is to implement a GPU kernel that only loads **useful** data from global memory that will then be used in the calculations. We can achieve this with **coalesced memory accesses**.

To achieve coalesced memory accesses in our kernel, we need to meet the following conditions:
- the number of threads per block is a multiple of 32 threads,
- sequential threads in a warp access memory that is sequential.


For example, our baseline `add_vectors_gpu` and `convolve_gpu` implementations satisfy the above conditions:
- the number of threads per block was equal to 256,
- adjacent threads were reading adjacent memory areas, e.g. thread `i` read the `a[i]` and `b[i]`, and thread `i+1` read `a[i+1]` and `b[i+1]`.  


Below we discuss the reasons for both conditions.

### Exercise 3.1.1. Impact of misaligned accesses.

According to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> The number of threads per block should be a multiple of 32 threads, because this provides optimal computing efficiency and facilitates coalescing.

Recall that a warp reads global memory through a sequence of **32-byte** segment transactions.

Note that, according to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> Memory allocated through the CUDA Runtime API (...) is guaranteed to be aligned to at least 256 bytes.

This means that if the block size is a multiple of the warp size, each block will load only its own data chunk from memory. 


#### Example

As an example, we will consider here:
- `add_vectors_gpu` function,
- block size = 13. 

Now let's take a look at the memory accesses performed by thread blocks 0, 1, etc.

**Block 0**

- reads memory area [0, 52), size: 52 bytes (13 x 4-byte floats)

```
[ Segment 0 (32 bytes) ][ Segment 1 (32 bytes) ][ Segment 2 (32 bytes) ] ...
[       block 0 data (52 bytes)       ]
```

- This requires 2 x 32-byte transfers (64 bytes).
- However, only 52 of those bytes will be used (81%).

**Block 1**

- Reads memory area [52, 104), size: 52 bytes.
```
[ Segment 0 (32 bytes) ][ Segment 1 (32 bytes) ][ Segment 2 (32 bytes) ] ...
                                       [       block 1 data (52 bytes)       ]
```
- This requires 3 x 32-byte transfers (96 bytes).
- However, only 52 of those bytes will be used (54%).


**And so on...**


As we can see, we are transferring a large amount of (theoretically) useless data.
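
The per-block arithmetic above can be sketched in plain Python. This is a simplified model, not a description of the actual hardware behavior: the helper name `block_transfers` is made up for this illustration, caching is ignored, and each block's reads of one `float32` array are treated as a single contiguous byte range.

```python
SEGMENT_BYTES = 32   # size of one global-memory transaction
ITEM_BYTES = 4       # float32


def block_transfers(block_idx, block_size):
    """Return (segments, efficiency) for one array read by one thread block.

    Simplified model: the block reads the contiguous byte range
    [start, stop) and every 32-byte segment it touches is transferred whole.
    """
    start = block_idx * block_size * ITEM_BYTES
    stop = start + block_size * ITEM_BYTES
    first_segment = start // SEGMENT_BYTES
    last_segment = (stop - 1) // SEGMENT_BYTES
    segments = last_segment - first_segment + 1
    efficiency = (stop - start) / (segments * SEGMENT_BYTES)
    return segments, efficiency


for block_idx in range(4):
    segments, efficiency = block_transfers(block_idx, block_size=13)
    print(f"block {block_idx}: {segments} x 32-byte transfers, "
          f"{efficiency:.0%} of the transferred bytes used")
```

In the same model, a block size that is a multiple of 32 (for example `block_transfers(5, 256)`) always starts on a segment boundary, so every transferred byte is used.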

Let's see if there is any observable performance difference between scripts using 256 and 261 threads in a **block**:

In [2]:
%%writefile 3_1_1_aligned.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 256


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()

gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Overwriting 3_1_1_aligned.py


In [3]:
! nvprof --trace gpu python 3_1_1_aligned.py

==7062== NVPROF is profiling process 7062, command: python 3_1_1_aligned.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0151 seconds (+/- 0.0345), median: 0.0115
==7062== Profiling application: python 3_1_1_aligned.py
==7062== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   57.19%  460.05ms       300  1.5335ms  1.2867ms  2.7827ms  [CUDA memcpy DtoH]
                   39.68%  319.20ms       200  1.5960ms  1.4214ms  2.6505ms  [CUDA memcpy HtoD]
                    3.12%  25.116ms       100  251.16us  248.04us  273.07us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.


In [4]:
%%writefile 3_1_1_misaligned.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 261


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()

gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Overwriting 3_1_1_misaligned.py


In [5]:
! nvprof --trace gpu python 3_1_1_misaligned.py

==7320== NVPROF is profiling process 7320, command: python 3_1_1_misaligned.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0154 seconds (+/- 0.0427), median: 0.0104
==7320== Profiling application: python 3_1_1_misaligned.py
==7320== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   52.98%  414.09ms       300  1.3803ms  1.2899ms  2.8600ms  [CUDA memcpy DtoH]
                   43.77%  342.10ms       200  1.7105ms  1.3977ms  5.3995ms  [CUDA memcpy HtoD]
                    3.24%  25.359ms       100  253.59us  250.92us  274.86us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.


The difference in performance of the aligned and misaligned versions is rather minor (on some devices even negligible).

Why? 

In this particular case, adjacent warps **reuse the cached data** their neighbors fetched. 

Still, setting the block size to a multiple of the warp size is a good rule of thumb: it facilitates coalescing and (as we will discuss later) helps to avoid wasting multiprocessor computation time on under-populated warps.

### Exercise 3.1.2. Impact of strided accesses.

Non-unit-strided global memory accesses may reduce the effective memory bandwidth.

We say that a GPU kernel performs unit-strided memory access if threads with successive identifiers read data from successive memory locations; in other words, the following access pattern is respected:

```
x = data[(some custom offset) + threadIdx.x]
```

When using the data access notation for a multidimensional array, make sure that the last axis is addressed using `threadIdx.x`:

```
x = data[(other dimensions...), threadIdx.x]
```


The performance degradation can be especially apparent when working with multidimensional arrays: the choice of the axis along which an operation is performed can affect the effective bandwidth.

#### Example

Let's consider a 1D convolution along one of the axes of a 2D array.

```
         axis 1
     ---------------
x = [[0,  1,  2,  3], |
     [4,  5,  6,  7], | axis 0
     [8,  9, 10, 11]] |

h = [1, 1]

```

NumPy stores arrays in row-major order by default, so the above array is actually kept in the computer's memory as the following 1D array:

```
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] 
```
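
As a small illustration of row-major storage, the sketch below computes the position of element `[row, col]` in the flattened array. The helper name `flat_index` is hypothetical and only serves this example.

```python
def flat_index(row, col, width):
    """Offset of element [row, col] in a row-major (C-order) 2D array."""
    return row * width + col


WIDTH = 4  # number of columns of the 3 x 4 example array

# Consecutive elements along axis 1 (fixed row 0): stride of 1 element.
along_axis1 = [flat_index(0, col, WIDTH) for col in range(3)]
# Consecutive elements along axis 0 (fixed column 0): stride of WIDTH elements.
along_axis0 = [flat_index(row, 0, WIDTH) for row in range(3)]

print(along_axis1)  # [0, 1, 2]
print(along_axis0)  # [0, 4, 8]
```

For a real NumPy array the same information is available, in bytes, from the array's `strides` attribute.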

Let's consider doing convolution along axis 0 and 1.

**Convolve along axis 1**:

```
         axis 1
     ---------------
x = [[0,  1,  2,  3], *  [1, 1] = [0,  1,  3,  5]
     [4,  5,  6,  7], *  [1, 1] = [4,  9, 11, 13]
     [8,  9, 10, 11]] *  [1, 1] = [8, 17, 19, 21]
```

For the first output row:

```
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  
    [1]      y = [0]
    [1, 1]       [1]
       [1, 1]    [3]
          [1, 1] [5]

```

`y[0]`, `y[1]`, `y[2]` and `y[3]` are computed by threads with `threadIdx.x` equal to `0`, `1`, `2` and `3`, respectively.

**Convolve along axis 0**:

```
x = [[0,  1,   2,   3], |
     [4,  5,   6,   7], | axis 0
     [8,  9,  10,  11]] | 
      *   *    *    *
     [1] [1]  [1]  [1]
     [1] [1]  [1]  [1]
   y  =   =    =    =
    [ 0] [ 1] [ 2] [ 3]
    [ 4] [ 6] [ 8] [10]
    [12] [14] [16] [18]
```

For the first output column:

```
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  
    [1]                                    y = [ 0]
    [1,          1]                            [ 4]
                [1,         1]                 [12]
```

`y[0]`, `y[1]` and `y[2]` are computed by threads with `threadIdx.x` equal to `0`, `1` and `2`, respectively.
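
The two worked results can be reproduced on the CPU with a plain-Python reference. This is only a sketch for checking the numbers: `convolve_1d` is a name chosen for this illustration, and it assumes the same offset convention (`o = ceil(N/2) - 1`) as the convolution kernels in this course.

```python
import math


def convolve_1d(x, h):
    """Same-size 1D convolution with the offset o = ceil(N/2) - 1."""
    n = len(h)
    o = math.ceil(n / 2) - 1
    y = []
    for i in range(len(x)):
        value = 0
        for k in range(n):
            l = i + o - k
            if 0 <= l < len(x):  # skip samples outside the input
                value += x[l] * h[k]
        y.append(value)
    return y


x = [[0, 1, 2, 3],
     [4, 5, 6, 7],
     [8, 9, 10, 11]]
h = [1, 1]

# Axis 1: convolve every row.
y_axis1 = [convolve_1d(row, h) for row in x]
# Axis 0: convolve every column, then transpose the result back to rows.
columns = [list(col) for col in zip(*x)]
y_axis0 = [list(row) for row in zip(*(convolve_1d(col, h) for col in columns))]

print(y_axis1)
print(y_axis0)
```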

As we can see in the above example, the stride is much larger for the convolution along axis 0. Will it impact the bandwidth?

In [6]:
%%writefile 3_1_2_convolve_strided_access.py
import math
import numpy as np
from numba import cuda, float32
import cupy as cp
import gpu_short_course

@cuda.jit
def convolve_axis0_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y*cuda.blockDim.y + cuda.threadIdx.y

    N = len(h)
    o = int(math.ceil(N/2)-1)
    HEIGHT = x.shape[0]
    WIDTH = x.shape[1]
    if i >= HEIGHT or j >= WIDTH:
        return
    
    value = float32(0.0)
    for k in range(N):
        l = i + o - k
        if l >= 0 and l < HEIGHT:

            ## --- Get data along the first (0) axis.
            value += x[l, j] * h[k]
            
    y[i, j] = value
    
    
def convolve_axis0_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block = (32, 32)
    height, width = x.shape
    block_h, block_w = block
    # In this kernel the x-dimension indexes axis 0 (i) and the y-dimension axis 1 (j).
    grid = (math.ceil(height/block_h), 
            math.ceil(width/block_w))
    convolve_axis0_gpu_kernel[grid, block](y, x, h)
    return y.copy_to_host()


@cuda.jit
def convolve_axis1_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y*cuda.blockDim.y + cuda.threadIdx.y

    N = len(h)
    o = int(math.ceil(N/2)-1)
    
    HEIGHT = x.shape[0]
    WIDTH = x.shape[1]
    
    if i >= WIDTH or j >= HEIGHT:
        return
    
    value = float32(0.0)
    for k in range(N):
        l = i + o - k
        if l >= 0 and l < WIDTH:
            
            ## --- Get data along the second (1) axis.
            value += x[j, l]*h[k]
            
    y[j, i] = value
    
    
def convolve_axis1_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block = (32, 32)
    height, width = x.shape
    block_h, block_w = block
    grid = (math.ceil(width/block_w), 
            math.ceil(height/block_h))
    convolve_axis1_gpu_kernel[grid, block](y, x, h)
    return y.copy_to_host()


gpu_short_course.run_convolve_2d_input(convolve_axis0_gpu, axis=0)
gpu_short_course.run_convolve_2d_input(convolve_axis1_gpu, axis=1)

Overwriting 3_1_2_convolve_strided_access.py


In [7]:
! python 3_1_2_convolve_strided_access.py --mode test

GPU:0: b'GeForce MX250'
All tests passed.
All tests passed.


In [8]:
! nvprof --trace gpu python 3_1_2_convolve_strided_access.py --mode benchmark quiet=1

==7372== NVPROF is profiling process 7372, command: python 3_1_2_convolve_strided_access.py --mode benchmark quiet=1
GPU:0: b'GeForce MX250'
Benchmarking, please wait...
Benchmarking, please wait...
==7372== Profiling application: python 3_1_2_convolve_strided_access.py --mode benchmark quiet=1
==7372== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   61.20%  5.56611s       100  55.661ms  51.235ms  84.812ms  cudapy::__main__::convolve_axis0_gpu_kernel$241(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                   29.55%  2.68800s       100  26.880ms  25.369ms  38.138ms  cudapy::__main__::convolve_axis1_gpu_kernel$242(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                    5.88%  534.98ms       600  891.63us  1.0880us  2.9331ms  [CUDA memcpy Dto

On my GPU (NVIDIA GeForce MX250), convolution along axis 1 takes about half the time of convolution along axis 0 (on average 26.9 ms vs. 55.7 ms per kernel call).
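
A rough model helps to see where the difference comes from. In `convolve_axis0_gpu_kernel` consecutive threads (`threadIdx.x`) read consecutive rows, so their addresses are one full row apart, while in `convolve_axis1_gpu_kernel` they read neighbouring elements. The sketch below counts the 32-byte segments touched by a single warp-wide `float32` load. It is an idealized estimate under stated assumptions: caches and boundary checks are ignored, the base address is assumed to be segment-aligned, the name `load_efficiency` is made up, and the 1024-column row width is hypothetical (the benchmark's input size is not stated here).

```python
WARP_SIZE = 32
SEGMENT_BYTES = 32  # size of one global-memory transaction
ITEM_BYTES = 4      # float32


def load_efficiency(stride_items):
    """Requested bytes / transferred bytes for one warp-wide load.

    Thread t of the warp reads the float32 at element offset t * stride_items.
    """
    addresses = [t * stride_items * ITEM_BYTES for t in range(WARP_SIZE)]
    segments = {address // SEGMENT_BYTES for address in addresses}
    requested = WARP_SIZE * ITEM_BYTES
    transferred = len(segments) * SEGMENT_BYTES
    return requested / transferred


print(load_efficiency(1))     # unit stride: every transferred byte is used
print(load_efficiency(1024))  # one thread per segment: 4 of every 32 bytes used
```

In this model any stride of at least 8 elements puts every thread in its own segment, which is the worst case for a `float32` load.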

We can use profiler metrics to check the memory access efficiency in both cases:

In [9]:
! nvprof --metrics gld_efficiency,gst_efficiency python 3_1_2_convolve_strided_access.py --mode benchmark n=1 quiet=1 2>&1 | grep -v "^="

GPU:0: b'GeForce MX250'
Benchmarking, please wait...
Benchmarking, please wait...
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce MX250 (0)"
    Kernel: cudapy::__main__::convolve_axis0_gpu_kernel$241(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
          1                            gld_efficiency             Global Memory Load Efficiency      12.50%      12.50%      12.50%
          1                            gst_efficiency            Global Memory Store Efficiency      12.50%      12.50%      12.50%
    Kernel: cudapy::__main__::convolve_axis1_gpu_kernel$242(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
          1                            gld_efficiency             Global Memory Load Efficiency      70.05%      70.05%    

## Exercise 3.2. Control flow: how code branching affects the performance.

Due to the SIMT architecture of CUDA multiprocessors, it is recommended to avoid divergent execution paths within the same warp.

According to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> Flow control instructions (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be executed separately; this increases the total number of instructions executed for this warp.
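
To make the idea of divergence concrete, the sketch below counts how many distinct execution paths the threads of each warp take. It is a CPU-side illustration only: the helper name `paths_per_warp` and the two path selectors are illustrative choices, and real hardware or compilers may reduce the cost of short branches (for example through predication).

```python
WARP_SIZE = 32


def paths_per_warp(path_of_thread, n_threads):
    """For each warp, count the distinct execution paths taken by its threads."""
    counts = []
    for warp_start in range(0, n_threads, WARP_SIZE):
        lanes = range(warp_start, min(warp_start + WARP_SIZE, n_threads))
        counts.append(len({path_of_thread(i) for i in lanes}))
    return counts


# Path selected per thread: neighbouring threads take different paths.
divergent = paths_per_warp(lambda i: i % 8, n_threads=256)
# Path selected per warp: all 32 threads of a warp take the same path.
uniform = paths_per_warp(lambda i: (i // WARP_SIZE) % 8, n_threads=256)

print(divergent)  # [8, 8, 8, 8, 8, 8, 8, 8]
print(uniform)    # [1, 1, 1, 1, 1, 1, 1, 1]
```

When the branch condition depends on the thread index modulo a small number, every warp has to execute all the paths one after another; when it is constant within a warp, each warp executes only one.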

### Example

Let's implement the following function:

```
y[i] = r[i]*a[i] + b[i]
```

where `r[i] = (i mod 8) + 1`.

We can implement it in one of the two ways:
1. directly by definition (see `add_vectors_mod8_kernel`),
2. by doing a sequence of `if ... elif ... elif ... else` blocks (see `add_vectors_mod8_branches_kernel`).

In [10]:
%%writefile 3_2_control_flow.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests


block_size = 256


@cuda.jit
def add_vectors_mod8_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    r = i % 8 + 1
    result[i] = r*a[i] + b[i]


def add_vectors_mod8(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_mod8_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


@cuda.jit
def add_vectors_mod8_branches_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    if i % 8 == 0:
        result[i] = a[i] + b[i]
    elif i % 8 == 1:
        result[i] = 2*a[i] + b[i]
    elif i % 8 == 2:
        result[i] = 3*a[i] + b[i]
    elif i % 8 == 3:
        result[i] = 4*a[i] + b[i]
    elif i % 8 == 4:
        result[i] = 5*a[i] + b[i]
    elif i % 8 == 5:
        result[i] = 6*a[i] + b[i]
    elif i % 8 == 6:
        result[i] = 7*a[i] + b[i]
    elif i % 8 == 7:
        result[i] = 8*a[i] + b[i]


def add_vectors_mod8_branches(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_mod8_branches_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


gpu_short_course.tests.benchmark_add_vectors(add_vectors_mod8)
gpu_short_course.tests.benchmark_add_vectors(add_vectors_mod8_branches)

Writing 3_2_control_flow.py


Let's check how much time it takes to execute each of the kernels:

In [11]:
! nvprof --trace gpu python 3_2_control_flow.py

==7429== NVPROF is profiling process 7429, command: python 3_2_control_flow.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0136 seconds (+/- 0.0409), median: 0.0090
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0124 seconds (+/- 0.0186), median: 0.0101
==7429== Profiling application: python 3_2_control_flow.py
==7429== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   50.61%  789.60ms       600  1.3160ms  1.2820ms  2.5836ms  [CUDA memcpy DtoH]
                   37.83%  590.28ms       400  1.4757ms  1.3868ms  1.6767ms  [CUDA memcpy HtoD]
                    9.54%  148.84ms       100  1.4884ms  1.4873ms  1.4910ms  cudapy::__main__::add_vectors_mod8_branches_kernel$242(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                   

Let's also measure the `branch_efficiency` metric, defined as:
> Ratio of non-divergent branches to total branches expressed as percentage.

In [12]:
! nvprof --metrics branch_efficiency python 3_2_control_flow.py

==7458== NVPROF is profiling process 7458, command: python 3_2_control_flow.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0166 seconds (+/- 0.0331), median: 0.0121
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0158 seconds (+/- 0.0164), median: 0.0133
==7458== Profiling application: python 3_2_control_flow.py
==7458== Profiling result:
==7458== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce MX250 (0)"
    Kernel: cudapy::__main__::add_vectors_mod8_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
        100                         branch_efficiency                         Branch Efficiency     100.00%     100.00%     100.00%
    Kernel: cudapy::__main__::add_vectors_mod8_branc

Of course, the above example has been artificially complicated just to show the effect of complex kernel logic on the kernel's performance.

## Exercise 3.3. Multiprocessor occupancy: thread block size.

Recall that:
> The number of threads per block should be a **multiple of 32 threads**, because this provides optimal computing efficiency and facilitates coalescing.

[CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) also gives some other suggestions how to choose the proper number of threads per block:

> There are many such factors involved in selecting block size, and inevitably some experimentation is required. However, a few rules of thumb should be followed:
> 1. Threads per block should be **a multiple of warp size** to avoid wasting computation on under-populated warps and to facilitate coalescing.
> 2. A **minimum of 64 threads** per block should be used, and only if there are multiple concurrent blocks per multiprocessor.
> 3. Between **128 and 256 threads** per block is a good initial range for experimentation with different block sizes.
> 4. Use several smaller thread blocks rather than one large thread block per multiprocessor if latency affects performance. This is particularly beneficial to kernels that frequently call `__syncthreads()`.
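
The first rules of thumb can be explored with a back-of-the-envelope estimate. The sketch below is an upper bound derived from the block size alone; the function names are made up for this illustration, the default limits (64 resident warps and 32 resident blocks per multiprocessor) are assumptions that should be checked against your device's compute capability, and register and shared-memory limits are ignored.

```python
import math

WARP_SIZE = 32


def theoretical_occupancy(block_size, max_warps_per_sm=64, max_blocks_per_sm=32):
    """Upper bound on occupancy from block-size limits alone.

    The default limits are assumptions; register and shared-memory
    limits are not taken into account.
    """
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    resident_blocks = min(max_blocks_per_sm, max_warps_per_sm // warps_per_block)
    return resident_blocks * warps_per_block / max_warps_per_sm


def lane_utilization(block_size):
    """Fraction of warp lanes that hold an actual thread."""
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    return block_size / (warps_per_block * WARP_SIZE)


for block_size in (16, 256, 261, 1024):
    print(block_size,
          theoretical_occupancy(block_size),
          round(lane_utilization(block_size), 3))
```

Under these assumptions a block of 16 threads fills only half of its single warp and cannot reach more than half of the maximum number of resident warps, whereas 256 or 1024 threads per block can reach the maximum. Achieved occupancy measured by a profiler can be lower than this bound.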

### Example

Let's measure the occupancy of `add_vectors` for different numbers of threads per block:


In [13]:
%%writefile 3_3_occupancy_16.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 16


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_3_occupancy_16.py


In [14]:
! nvprof --trace gpu python 3_3_occupancy_16.py

==7510== NVPROF is profiling process 7510, command: python 3_3_occupancy_16.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0147 seconds (+/- 0.0326), median: 0.0109
==7510== Profiling application: python 3_3_occupancy_16.py
==7510== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   51.11%  414.19ms       300  1.3806ms  1.2831ms  2.9075ms  [CUDA memcpy DtoH]
                   40.67%  329.56ms       200  1.6478ms  1.4194ms  3.6617ms  [CUDA memcpy HtoD]
                    8.21%  66.563ms       100  665.63us  653.01us  755.47us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.


According to NVIDIA documentation, `achieved_occupancy` measures:
> Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor.

In [15]:
! nvprof --metrics achieved_occupancy python 3_3_occupancy_16.py

==7546== NVPROF is profiling process 7546, command: python 3_3_occupancy_16.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0156 seconds (+/- 0.0288), median: 0.0123
==7546== Profiling application: python 3_3_occupancy_16.py
==7546== Profiling result:
==7546== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce MX250 (0)"
    Kernel: cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
        100                        achieved_occupancy                        Achieved Occupancy    0.331074    0.335876    0.333732


In [16]:
%%writefile 3_3_occupancy_256.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 256


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_3_occupancy_256.py


In [17]:
! nvprof --trace gpu python 3_3_occupancy_256.py
! nvprof --metrics achieved_occupancy python 3_3_occupancy_256.py

==7572== NVPROF is profiling process 7572, command: python 3_3_occupancy_256.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0126 seconds (+/- 0.0319), median: 0.0089
==7572== Profiling application: python 3_3_occupancy_256.py
==7572== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   55.14%  393.50ms       300  1.3117ms  1.2834ms  1.6101ms  [CUDA memcpy DtoH]
                   41.38%  295.29ms       200  1.4764ms  1.4039ms  1.7570ms  [CUDA memcpy HtoD]
                    3.48%  24.842ms       100  248.42us  247.81us  254.41us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.
==7600== NVPROF is profiling process 7600, command: python 3_3_occupancy_256.py
GPU:0: b'GeForce MX250'
Benchmarking the functio

In [18]:
%%writefile 3_3_occupancy_1024.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 1024


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()

gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_3_occupancy_1024.py


In [19]:
! nvprof --trace gpu python 3_3_occupancy_1024.py
! nvprof --metrics achieved_occupancy python 3_3_occupancy_1024.py

==7626== NVPROF is profiling process 7626, command: python 3_3_occupancy_1024.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0140 seconds (+/- 0.0363), median: 0.0093
==7626== Profiling application: python 3_3_occupancy_1024.py
==7626== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   55.28%  419.03ms       300  1.3968ms  1.2898ms  2.8975ms  [CUDA memcpy DtoH]
                   41.41%  313.87ms       200  1.5693ms  1.3803ms  4.3742ms  [CUDA memcpy HtoD]
                    3.32%  25.140ms       100  251.40us  249.25us  273.77us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.
==7653== NVPROF is profiling process 7653, command: python 3_3_occupancy_1024.py
GPU:0: b'GeForce MX250'
Benchmarking the func

Finding the right number of threads per block requires some experimentation, but generally 128 or 256 threads is a good starting point.

## Exercise 3.4. Instruction-level optimizations.

This chapter covers the following:

1. Impact of the data type selection and conversion on the kernel's performance.
2. Floating-point intrinsics.

### Exercise 3.4.1. Data types.

When implementing a CUDA GPU kernel, keep the following in mind:

- the choice of the input data type may affect the kernel performance,
- data type conversion in kernel implementation may affect kernel performance.

Let's compare `float32` and `float64` data, using one-dimensional convolution as an example.

First, let's recall our baseline implementation of the convolution operator.

NOTE:
- our benchmark function generates `float32` input data,
- in the kernel implementation we used the `float32(...)` cast to enforce the proper data type of the `value` variable.

In [20]:
%%writefile 3_4_1_convolve_float32.py
import math
import numpy as np
from numba import cuda, float32
from gpu_short_course.tests import benchmark_convolve


@cuda.jit
def convolve_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(y):
        return
    M, N = len(x), len(h)
    
    o = int(math.ceil(N/2)-1)

    value = float32(0.0)
    for j in range(N):
        k = i+o-j
        if k >= 0 and k < M:
            value += x[k]*h[j]
    y[i] = value
    

def convolve_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block_size = 256
    grid_size = math.ceil(len(y)/block_size)
    convolve_gpu_kernel[grid_size, block_size](y, x, h)
    return y.copy_to_host()


benchmark_convolve(convolve_gpu, dtype=np.float32)

Writing 3_4_1_convolve_float32.py


In [21]:
! nvprof --trace gpu python 3_4_1_convolve_float32.py

==7679== NVPROF is profiling process 7679, command: python 3_4_1_convolve_float32.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0291 seconds (+/- 0.0366), median: 0.0238
==7679== Profiling application: python 3_4_1_convolve_float32.py
==7679== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   81.13%  1.88550s       100  18.855ms  17.327ms  29.681ms  cudapy::__main__::convolve_gpu_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                   12.16%  282.55ms       300  941.84us  1.1200us  2.8355ms  [CUDA memcpy DtoH]
                    6.71%  155.99ms       200  779.97us     896ns  3.6081ms  [CUDA memcpy HtoD]
No API activities were profiled.


Now let's get rid of the `float32` cast on line `17` and see if it has any effect on performance.

In [22]:
%%writefile 3_4_1_convolve_float32_and_float64.py
import math
import numpy as np
from numba import cuda, float32
from gpu_short_course.tests import benchmark_convolve


@cuda.jit
def convolve_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(y):
        return
    M, N = len(x), len(h)
    
    o = int(math.ceil(N/2)-1)

    value = 0.0
    for j in range(N):
        k = i+o-j
        if k >= 0 and k < M:
            value += x[k]*h[j]
    y[i] = value
    

def convolve_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block_size = 256
    grid_size = math.ceil(len(y)/block_size)
    convolve_gpu_kernel[grid_size, block_size](y, x, h)
    return y.copy_to_host()


benchmark_convolve(convolve_gpu, dtype=np.float32)

Writing 3_4_1_convolve_float32_and_float64.py


In [23]:
! nvprof --trace gpu python 3_4_1_convolve_float32_and_float64.py

==7707== NVPROF is profiling process 7707, command: python 3_4_1_convolve_float32_and_float64.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0398 seconds (+/- 0.0367), median: 0.0341
==7707== Profiling application: python 3_4_1_convolve_float32_and_float64.py
==7707== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   87.04%  2.95362s       100  29.536ms  26.907ms  43.880ms  cudapy::__main__::convolve_gpu_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                    8.28%  280.85ms       300  936.16us  1.1200us  2.6610ms  [CUDA memcpy DtoH]
                    4.69%  159.02ms       200  795.08us     896ns  3.6370ms  [CUDA memcpy HtoD]
No API activities were profiled.


The processing time may increase because the literal `0.0` is a `float64`, which makes `value` a `float64` accumulator. Each `float32` product is therefore promoted to `float64` before it is added:

`value += (float64)(x[k]*h[j])`

and the final result is converted back from `float64` to `float32` when it is stored:

`y[i] = (float32)value`.

(Note: the size of this effect may vary between GPUs.)
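
To make the two conversion points concrete, here is a minimal CPU-only sketch that emulates both accumulation strategies in plain Python. It is an illustration, not the code Numba generates: the helper `to_float32` and the two `accumulate_*` functions are hypothetical names introduced here, and `float32` rounding is emulated with the standard `struct` module.

```python
import struct


def to_float32(value):
    """Round a Python float (a float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def accumulate_float32(x, h):
    """Accumulator kept in float32: each product and each sum is rounded to float32."""
    value = to_float32(0.0)
    for a, b in zip(x, h):
        product = to_float32(a * b)          # float32 * float32 -> float32
        value = to_float32(value + product)  # float32 + float32 -> float32
    return value


def accumulate_mixed(x, h):
    """Accumulator kept in float64, inputs and output in float32."""
    value = 0.0                      # bare literal: a float64 accumulator
    for a, b in zip(x, h):
        product = to_float32(a * b)  # float32 * float32 -> float32
        value += product             # product is promoted to float64 for the add
    return to_float32(value)         # converted back to float32 on the final store


# Example inputs, rounded to float32 first, as if they came from float32 arrays.
x = [to_float32(0.1 * i) for i in range(1, 6)]
h = [to_float32(1.0 / 3.0)] * 5

result_f32 = accumulate_float32(x, h)
result_mixed = accumulate_mixed(x, h)
print(result_f32, result_mixed)
```

In the mixed variant every loop iteration contains a `float32` to `float64` conversion, and the store contains one more conversion in the opposite direction. These are the extra instructions the kernel has to execute once the cast is removed.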

Let's check what results we get when we use only `float64` values in the computations.

In [24]:
%%writefile 3_4_1_convolve_float64.py

import math
import numpy as np
from numba import cuda, float32
from gpu_short_course.tests import benchmark_convolve


@cuda.jit
def convolve_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(y):
        return
    M, N = len(x), len(h)
    
    o = int(math.ceil(N/2)-1)

    value = 0.0
    for j in range(N):
        k = i+o-j
        if k >= 0 and k < M:
            value += x[k]*h[j]
    y[i] = value
    

def convolve_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block_size = 256
    grid_size = math.ceil(len(y)/block_size)
    convolve_gpu_kernel[grid_size, block_size](y, x, h)
    return y.copy_to_host()


benchmark_convolve(convolve_gpu, dtype=np.float64)

Writing 3_4_1_convolve_float64.py


In [25]:
! nvprof --trace gpu python 3_4_1_convolve_float64.py

==7735== NVPROF is profiling process 7735, command: python 3_4_1_convolve_float64.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0342 seconds (+/- 0.0381), median: 0.0294
==7735== Profiling application: python 3_4_1_convolve_float64.py
==7735== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   70.00%  1.97212s       100  19.721ms  17.682ms  29.016ms  cudapy::__main__::convolve_gpu_kernel$241(Array<double, int=1, C, mutable, aligned>, Array<double, int=1, C, mutable, aligned>, Array<double, int=1, C, mutable, aligned>)
                   19.14%  539.18ms       300  1.7973ms  1.1840us  5.2818ms  [CUDA memcpy DtoH]
                   10.86%  306.04ms       200  1.5302ms     992ns  4.6555ms  [CUDA memcpy HtoD]
No API activities were profiled.


The above results may differ between GPU cards.
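
To compare the three runs side by side, the short sketch below divides the average kernel times and the total copy times by the `float32` baseline. The numbers are copied by hand from the nvprof reports above (GeForce MX250); the dictionary names are only illustrative.

```python
# Average kernel time in ms, from the "Avg" column of the three nvprof reports.
kernel_avg_ms = {
    "float32": 18.855,
    "float32 data, float64 accumulator": 29.536,
    "float64": 19.721,
}

# Total memcpy time in ms (DtoH + HtoD), from the "Time" column.
memcpy_total_ms = {
    "float32": 282.55 + 155.99,
    "float64": 539.18 + 306.04,
}

baseline = kernel_avg_ms["float32"]
kernel_ratio = {name: avg / baseline for name, avg in kernel_avg_ms.items()}
for name, ratio in kernel_ratio.items():
    print(f"{name}: {kernel_avg_ms[name]:.3f} ms ({ratio:.2f}x the float32 kernel)")

memcpy_ratio = memcpy_total_ms["float64"] / memcpy_total_ms["float32"]
print(f"memcpy, float64 vs float32: {memcpy_ratio:.2f}x")
```

On this card the mixed-precision kernel is roughly 1.6 times slower than the pure `float32` one, and the pure `float64` kernel is only about 5% slower. The `float64` transfers take almost twice as long, which matches the doubled size of the data.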

## Exercise 3.4.2. Floating point intrinsics.

The CUDA Math API provides a set of intrinsic functions specialized for particular calculations. Sometimes, calling an intrinsic function explicitly in your code may improve its performance.

A complete list of intrinsic functions for CUDA C/C++ is available [here](https://docs.nvidia.com/cuda/cuda-math-api/).

A complete list of floating-point intrinsics in Numba is available [here](https://numba.pydata.org/numba-doc/latest/cuda-reference/kernel.html#floating-point-intrinsics).

Note: When optimizing the machine code, the compiler may decide to use an intrinsic function (the same or a similar one) regardless of whether we used it in our implementation. Still, if you want to be sure that the intrinsic function is used, you should call it explicitly.

Let's check whether using the `cuda.fma` intrinsic in our convolution kernel gives any improvement:

In [26]:
%%writefile 3_4_2_convolve_intrinsics.py
import math
from numba import cuda, float32
import numpy as np
import gpu_short_course.tests
import cupy as cp


@cuda.jit
def convolve_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(y):
        return

    M, N = len(x), len(h)
    o = int(math.ceil(N/2)-1)
    
    value = float32(0.0)
    for j in range(N):
        k = i + o - j
        if k >= 0 and k < M:
            value += x[k]*h[j]
    y[i] = value


def convolve(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block_size = 256
    grid_size = math.ceil(len(y)/block_size)
    convolve_gpu_kernel[grid_size, block_size](y, x, h)
    return y.copy_to_host()


@cuda.jit
def convolve_fma_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(y):
        return

    M, N = len(x), len(h)
    o = int(math.ceil(N/2)-1)
    
    value = float32(0.0)
    for j in range(N):
        k = i + o - j
        if k >= 0 and k < M:
            value = cuda.fma(x[k], h[j], value)
    y[i] = value


def convolve_fma(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block_size = 256
    grid_size = math.ceil(len(y)/block_size)
    convolve_fma_gpu_kernel[grid_size, block_size](y, x, h)
    return y.copy_to_host()


gpu_short_course.tests.benchmark_convolve(convolve_fma)
gpu_short_course.tests.benchmark_convolve(convolve)

Writing 3_4_2_convolve_intrinsics.py
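
The two kernels above differ only in how the multiply-add is written: `value += x[k]*h[j]` versus `cuda.fma(x[k], h[j], value)`. A fused multiply-add computes `a*b + c` with a single rounding at the end, instead of rounding the product first and the sum second. The sketch below emulates this on the CPU with exact rational arithmetic. It assumes nothing about the GPU, uses `float64` rather than `float32` for simplicity, and `fused_multiply_add` is a hypothetical helper written for this illustration.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate fma(a, b, c): compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the only rounding step


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Separate multiply and add: a*b = 1 - 2**-60 is rounded to 1.0 first,
# so the following addition of -1.0 yields exactly 0.0.
separate = a * b + c

# Fused: the exact value -2**-60 is formed before the single rounding.
fused = fused_multiply_add(a, b, c)

print(separate)  # 0.0
print(fused)     # -2**-60 (about -8.67e-19)
```

Besides the difference in rounding, a fused multiply-add is a single instruction. As the note above says, the compiler may already emit it for the plain `value += x[k]*h[j]` form, in which case calling `cuda.fma` explicitly changes little.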


In [27]:
!nvprof --trace gpu python 3_4_2_convolve_intrinsics.py

==7764== NVPROF is profiling process 7764, command: python 3_4_2_convolve_intrinsics.py
GPU:0: b'GeForce MX250'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0282 seconds (+/- 0.0360), median: 0.0241
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0263 seconds (+/- 0.0114), median: 0.0246
==7764== Profiling application: python 3_4_2_convolve_intrinsics.py
==7764== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   41.20%  1.89466s       100  18.947ms  17.413ms  23.234ms  cudapy::__main__::convolve_gpu_kernel$242(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                   40.47%  1.86077s       100  18.608ms  17.414ms  28.650ms  cudapy::__main__::convolve_fma_gpu_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>,