# Lecture 3. Performance guidelines.

In this notebook we describe the most important performance guidelines for programming NVIDIA GPUs.

We will consider the following aspects of optimizing CUDA kernels:
- memory access patterns: *memory coalescing* for the best throughput,
- control flow: how code branching affects the performance,
- multiprocessor occupancy: experimenting with different block sizes.

## Exercise 3.1. Memory access patterns.

According to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> For devices of compute capability 6.0 or higher, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.

This means that the number of useful memory accesses performed by our kernel, and thus its performance, largely depends on its memory access pattern.
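
To make the quoted rule concrete, the sketch below counts how many 32-byte segments one warp touches for a given list of byte addresses. It is a simplified model written for illustration (the helper name `count_transactions` is hypothetical), not a description of how the hardware actually schedules transactions.

```python
SEGMENT_BYTES = 32  # transaction size quoted from the Best Practices Guide


def count_transactions(byte_addresses, access_bytes=4):
    """Number of distinct 32-byte segments touched by one warp's accesses."""
    segments = set()
    for addr in byte_addresses:
        first = addr // SEGMENT_BYTES
        last = (addr + access_bytes - 1) // SEGMENT_BYTES
        segments.update(range(first, last + 1))
    return len(segments)


# 32 threads reading consecutive 4-byte floats: 128 bytes in total.
unit_stride = [4 * t for t in range(32)]
# 32 threads reading floats that are 32 elements apart.
strided = [4 * 32 * t for t in range(32)]

print(count_transactions(unit_stride))  # 4
print(count_transactions(strided))      # 32
```

In this model the warp requests the same 128 bytes in both cases, but the strided pattern needs eight times as many transactions.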

Our goal is to implement a GPU kernel that only loads **useful** data from global memory that will then be used in the calculations. We can achieve this with **coalesced memory accesses**.

To achieve coalesced memory accesses in our kernel, we need to meet the following conditions:
- the number of threads per block is a multiple of 32 threads,
- sequential threads in a warp access memory that is sequential.


For example, our baseline `add_vectors_gpu` and `convolve_gpu` implementations satisfy the above conditions:
- the number of threads per block was equal to 256,
- adjacent threads read adjacent memory locations, e.g. thread `i` read `a[i]` and `b[i]`, and thread `i+1` read `a[i+1]` and `b[i+1]`.


Below we discuss the reasons for both conditions.

### Exercise 3.1.1. Impact of misaligned accesses.

According to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> The number of threads per block should be a multiple of 32 threads, because this provides optimal computing efficiency and facilitates coalescing.

Recall that a warp reads global memory through a sequence of **32-byte** segment transactions.

Note that, according to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> Memory allocated through the CUDA Runtime API (...) is guaranteed to be aligned to at least 256 bytes.

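Combined with this alignment guarantee, a block size that is a multiple of the warp size means that (for 4-byte elements) every block starts on a 32-byte segment boundary, so each block loads only its own chunk of data from memory.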


#### Example

As an example, we will consider here:
- `add_vectors_gpu` function,
- block size = 13. 

Now let's look at the memory accesses performed by thread blocks 0, 1, etc.

**Block 0**

- Reads memory area [0, 52), size: 52 bytes (13 x 4-byte floats).

```
[ Segment 0 (32 bytes) ][ Segment 1 (32 bytes) ][ Segment 2 (32 bytes) ] ...
[       block 0 data (52 bytes)       ]
```

- This requires 2 x 32-byte transfers (64 bytes).
- However, only 52 of those bytes will be used (81%).

**Block 1**

- Reads memory area [52, 104), size: 52 bytes.
```
[ Segment 0 (32 bytes) ][ Segment 1 (32 bytes) ][ Segment 2 (32 bytes) ] ...
                                       [       block 1 data (52 bytes)       ]
```
- This requires 3 x 32-byte transfers (96 bytes).
- However, only 52 of those bytes will be used (54%).


**And so on...**


As we can see, we are transferring a large amount of (theoretically) useless data.
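
The segment counts for the 13-thread blocks can be reproduced with a few lines of arithmetic. The sketch below assumes each block holds at most one warp and reads consecutive `float32` values from a buffer that starts on a segment boundary; `segments_for_block` is a hypothetical helper name.

```python
SEGMENT_BYTES = 32
FLOAT_BYTES = 4


def segments_for_block(block_id, block_size):
    """Return (segments touched, fraction of fetched bytes actually used)."""
    start = block_id * block_size * FLOAT_BYTES  # first byte read by the block
    end = start + block_size * FLOAT_BYTES       # one past the last byte read
    n_segments = (end - 1) // SEGMENT_BYTES - start // SEGMENT_BYTES + 1
    used_fraction = (end - start) / (n_segments * SEGMENT_BYTES)
    return n_segments, used_fraction


for block_size in (13, 32):
    for block_id in range(3):
        n, used = segments_for_block(block_id, block_size)
        print(f"block_size={block_size:2d} block={block_id}: "
              f"{n} segments, {used:.0%} of fetched bytes used")
```

With a block size of 32 every block starts on a segment boundary and, in this model, uses all the bytes it fetches.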

Let's see if there is any observable performance difference between scripts using 256 and 261 threads in a **block**:

In [1]:
%%writefile 3_1_1_aligned.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 256


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()

gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_1_1_aligned.py


In [2]:
! nvprof --trace gpu python 3_1_1_aligned.py

==23182== NVPROF is profiling process 23182, command: python 3_1_1_aligned.py
GPU:0: b'GeForce GTX TITAN X'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0088 seconds (+/- 0.0374), median: 0.0040
==23182== Profiling application: python 3_1_1_aligned.py
==23182== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   60.56%  115.60ms       300  385.35us  338.89us  900.86us  [CUDA memcpy DtoH]
                   36.77%  70.185ms       200  350.92us  346.22us  395.60us  [CUDA memcpy HtoD]
                    2.67%  5.0905ms       100  50.905us  49.506us  52.609us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.


In [3]:
%%writefile 3_1_1_misaligned.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 261


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()

gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_1_1_misaligned.py


In [4]:
! nvprof --trace gpu python 3_1_1_misaligned.py

==23268== NVPROF is profiling process 23268, command: python 3_1_1_misaligned.py
GPU:0: b'GeForce GTX TITAN X'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0086 seconds (+/- 0.0340), median: 0.0040
==23268== Profiling application: python 3_1_1_misaligned.py
==23268== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   60.54%  116.19ms       300  387.30us  339.34us  711.03us  [CUDA memcpy DtoH]
                   36.79%  70.600ms       200  353.00us  346.03us  406.54us  [CUDA memcpy HtoD]
                    2.67%  5.1324ms       100  51.324us  50.113us  52.833us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.


The difference in performance of the aligned and misaligned versions is rather minor (on some devices even negligible).

Why? 

In this particular case, adjacent warps **reuse the cached data** their neighbors fetched. 

Still, setting the block size to a multiple of the warp size is a good rule of thumb: it facilitates coalescing and (as we will discuss later) helps to avoid wasting multiprocessor computation time on under-populated warps.
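
As a rough illustration of what "under-populated warps" means, the sketch below computes how many warps a block occupies and how many lanes of its last warp have no thread assigned. The helper name `warp_padding` is hypothetical; the warp size of 32 is the value used throughout this notebook.

```python
import math

WARP_SIZE = 32


def warp_padding(block_size):
    """Return (warps per block, idle lanes in the last warp)."""
    warps = math.ceil(block_size / WARP_SIZE)
    idle_lanes = warps * WARP_SIZE - block_size
    return warps, idle_lanes


for block_size in (13, 256, 261):
    warps, idle = warp_padding(block_size)
    print(f"block_size={block_size}: {warps} warp(s), {idle} idle lane(s)")
```

A block of 261 threads needs 9 warps, and the last of them carries only 5 threads.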

### Exercise 3.1.2. Impact of strided accesses.

Non-unit-strided global memory accesses may reduce the effective memory bandwidth.

A GPU kernel performs unit-strided memory accesses if threads with successive identifiers read data from successive memory locations; in other words, it follows this access pattern:

```
x = data[(some custom offset) + threadIdx.x]
```

When using the data access notation for a multidimensional array, make sure that the last axis is addressed using `threadIdx.x`:

```
x = data[(other dimensions...), threadIdx.x]
```


The performance degradation can be especially apparent when working with multidimensional arrays: the choice of the axis along which an operation is performed can affect the effective bandwidth.

#### Example

Let's consider a 1D convolution along one of the axes of a 2D array.

```
         axis 1
     ---------------
x = [[0,  1,  2,  3], |
     [4,  5,  6,  7], | axis 0
     [8,  9, 10, 11]] |

h = [1, 1]

```

NumPy stores arrays in row-major order by default, so the above array is actually kept in memory as the following 1D array:

```
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] 
```
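
The mapping from a 2D index to a position in this flat layout can be written out explicitly. The following sketch uses plain Python lists rather than NumPy; `flat_index` is a hypothetical helper.

```python
WIDTH = 4  # number of columns in the example array


def flat_index(row, col, width=WIDTH):
    """Offset of x[row, col] in the row-major 1D layout."""
    return row * width + col


x_flat = list(range(12))
assert x_flat[flat_index(1, 2)] == 6  # x[1, 2] in the 2D picture

stride_axis1 = flat_index(0, 1) - flat_index(0, 0)  # step to the next column
stride_axis0 = flat_index(1, 0) - flat_index(0, 0)  # step to the next row
print(stride_axis1, stride_axis0)  # 1 4
```

Moving along axis 1 advances by one element in memory, while moving along axis 0 jumps by a whole row.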

Let's consider doing convolution along axis 0 and 1.

**Convolve along axis 1**:

```
         axis 1
     ---------------
x = [[0,  1,  2,  3], *  [1, 1] = [0,  1,  3,  5]
     [4,  5,  6,  7], *  [1, 1] = [4,  9, 11, 13]
     [8,  9, 10, 11]] *  [1, 1] = [8, 17, 19, 21]
```

For the first output row:

```
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  
    [1]      y = [0]
    [1, 1]       [1]
       [1, 1]    [3]
          [1, 1] [5]

```

`y[0]`, `y[1]`, `y[2]` and `y[3]` are computed by threads with `threadIdx.x` equal to `0`, `1`, `2` and `3`, respectively.

**Convolve along axis 0**:

```
x = [[0,  1,   2,   3], |
     [4,  5,   6,   7], | axis 0
     [8,  9,  10,  11]] | 
      *   *    *    *
     [1] [1]  [1]  [1]
     [1] [1]  [1]  [1]
   y  =   =    =    =
    [ 0] [ 1] [ 2] [ 3]
    [ 4] [ 6] [ 8] [10]
    [12] [14] [16] [18]
```

For the first output column:

```
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  
    [1]                                    y = [ 0]
    [1,          1]                            [ 4]
                [1,         1]                 [12]
```

`y[0]`, `y[1]` and `y[2]` are computed by threads with `threadIdx.x` equal to `0`, `1` and `2`, respectively.
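
The numbers in both diagrams can be checked with a small CPU reference. This is a plain-Python sketch that follows the same indexing convention as the GPU kernels in this exercise (output of the same length as the input, out-of-range samples skipped); `convolve_1d` is a hypothetical helper, not part of the course package.

```python
import math


def convolve_1d(x, h):
    """Same-length 1D convolution; samples outside the input are skipped."""
    n = len(h)
    o = math.ceil(n / 2) - 1
    y = []
    for i in range(len(x)):
        value = 0
        for k in range(n):
            l = i + o - k
            if 0 <= l < len(x):
                value += x[l] * h[k]
        y.append(value)
    return y


x = [[0, 1, 2, 3],
     [4, 5, 6, 7],
     [8, 9, 10, 11]]
h = [1, 1]

# Axis 1: convolve every row.
along_axis1 = [convolve_1d(row, h) for row in x]
# Axis 0: convolve every column, then transpose back to row-major form.
columns = [convolve_1d(list(col), h) for col in zip(*x)]
along_axis0 = [list(row) for row in zip(*columns)]

print(along_axis1)  # [[0, 1, 3, 5], [4, 9, 11, 13], [8, 17, 19, 21]]
print(along_axis0)  # [[0, 1, 2, 3], [4, 6, 8, 10], [12, 14, 16, 18]]
```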

As we can see in the above example, the stride is much larger for the convolution along axis 0. Will it impact the bandwidth?

In [5]:
%%writefile 3_1_2_convolve_strided_access.py
import math
import numpy as np
from numba import cuda, float32
import cupy as cp
import gpu_short_course

@cuda.jit
def convolve_axis0_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y*cuda.blockDim.y + cuda.threadIdx.y

    N = len(h)
    o = int(math.ceil(N/2)-1)
    HEIGHT = x.shape[0]
    WIDTH = x.shape[1]
    if i >= HEIGHT or j >= WIDTH:
        return
    
    value = float32(0.0)
    for k in range(N):
        l = i + o - k
        if l >= 0 and l < HEIGHT:

            ## --- Walk along the first (0) axis: l depends on threadIdx.x,
            ## --- so neighbouring threads read elements from different rows.
            value += x[l, j] * h[k]
            
    y[i, j] = value
    
    
def convolve_axis0_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
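    block = (32, 32)
    height, width = x.shape
    block_x, block_y = block
    # In this kernel the x grid dimension covers the rows (axis 0)
    # and the y grid dimension covers the columns (axis 1).
    grid = (math.ceil(height/block_x), 
            math.ceil(width/block_y))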
    convolve_axis0_gpu_kernel[grid, block](y, x, h)
    return y.copy_to_host()


@cuda.jit
def convolve_axis1_gpu_kernel(y, x, h):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y*cuda.blockDim.y + cuda.threadIdx.y

    N = len(h)
    o = int(math.ceil(N/2)-1)
    
    HEIGHT = x.shape[0]
    WIDTH = x.shape[1]
    
    if i >= WIDTH or j >= HEIGHT:
        return
    
    value = float32(0.0)
    for k in range(N):
        l = i + o - k
        if l >= 0 and l < WIDTH:
            
            ## --- Walk along the second (1) axis: l depends on threadIdx.x,
            ## --- so neighbouring threads read neighbouring elements of a row.
            value += x[j, l]*h[k]
            
    y[j, i] = value
    
    
def convolve_axis1_gpu(x, h):
    y = cuda.device_array(x.shape, dtype=x.dtype)
    block = (32, 32)
    height, width = x.shape
    block_h, block_w = block
    grid = (math.ceil(width/block_w), 
            math.ceil(height/block_h))
    convolve_axis1_gpu_kernel[grid, block](y, x, h)
    return y.copy_to_host()


gpu_short_course.run_convolve_2d_input(convolve_axis0_gpu, axis=0)
gpu_short_course.run_convolve_2d_input(convolve_axis1_gpu, axis=1)

Writing 3_1_2_convolve_strided_access.py


In [6]:
! python 3_1_2_convolve_strided_access.py --mode test

GPU:0: b'GeForce GTX TITAN X'
All tests passed.
All tests passed.


In [7]:
! nvprof --trace gpu python 3_1_2_convolve_strided_access.py --mode benchmark quiet=1

==23411== NVPROF is profiling process 23411, command: python 3_1_2_convolve_strided_access.py --mode benchmark quiet=1
GPU:0: b'GeForce GTX TITAN X'
Benchmarking, please wait...
Benchmarking, please wait...
==23411== Profiling application: python 3_1_2_convolve_strided_access.py --mode benchmark quiet=1
==23411== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   65.28%  1.59769s       100  15.977ms  15.617ms  18.364ms  cudapy::__main__::convolve_axis0_gpu_kernel$241(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                   25.61%  626.83ms       100  6.2683ms  6.1722ms  6.3813ms  cudapy::__main__::convolve_axis1_gpu_kernel$242(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
                    6.24%  152.66ms       600  254.44us  1.5360us  850.36us  [CUDA 

In the run above (GeForce GTX TITAN X), convolution along axis 1 takes much less time than along axis 0: about 6.3 ms versus 16.0 ms per kernel call on average.
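
A back-of-the-envelope estimate suggests where the gap comes from. The sketch below computes the ratio of requested to fetched bytes for a single warp under an idealized model: 32-byte transactions, `float32` elements, the first access aligned to a segment boundary, and no caching. The row length `width` is a hypothetical value, and real profiler figures can differ from this model.

```python
SEGMENT_BYTES = 32
FLOAT_BYTES = 4
WARP_SIZE = 32


def warp_load_efficiency(element_stride):
    """Requested bytes / fetched bytes for one warp whose threads read
    float32 values `element_stride` elements apart, starting at offset 0."""
    addresses = [t * element_stride * FLOAT_BYTES for t in range(WARP_SIZE)]
    segments = {a // SEGMENT_BYTES for a in addresses}
    requested = WARP_SIZE * FLOAT_BYTES
    fetched = len(segments) * SEGMENT_BYTES
    return requested / fetched


width = 1024  # hypothetical row length, in elements

print(warp_load_efficiency(1))      # 1.0   (threadIdx.x walks along a row)
print(warp_load_efficiency(width))  # 0.125 (threadIdx.x walks down a column)
```

When successive threads are a full row apart, each 32-byte transaction delivers only one useful 4-byte value.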

We can use profiler metrics to check the memory access efficiency in both cases:

In [8]:
! nvprof --metrics gld_efficiency,gst_efficiency python 3_1_2_convolve_strided_access.py --mode benchmark n=1 quiet=1 2>&1 | grep -v "^="

GPU:0: b'GeForce GTX TITAN X'
Benchmarking, please wait...
Benchmarking, please wait...
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX TITAN X (0)"
    Kernel: cudapy::__main__::convolve_axis0_gpu_kernel$241(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
          1                            gld_efficiency             Global Memory Load Efficiency      12.50%      12.50%      12.50%
          1                            gst_efficiency            Global Memory Store Efficiency      12.50%      12.50%      12.50%
    Kernel: cudapy::__main__::convolve_axis1_gpu_kernel$242(Array<float, int=2, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
          1                            gld_efficiency             Global Memory Load Efficiency      70.05%    

## Exercise 3.2. Control flow: how code branching affects the performance.

Because of the SIMT architecture of CUDA multiprocessors, it is recommended to avoid divergent execution paths within the same warp.

According to [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html):
> Flow control instructions (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be executed separately; this increases the total number of instructions executed for this warp.

### Example

Let's implement the following function:

```
y[i] = r[i]*a[i] + b[i]
```

where `r[i] = (i mod 8) + 1`.

We can implement it in one of the two ways:
1. directly by definition (see `add_vectors_mod8_kernel`),
2. by doing a sequence of `if ... elif ... elif ... else` blocks (see `add_vectors_mod8_branches_kernel`).
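
Before profiling, it can help to count how many different branches the threads of a single warp would take. The sketch below does this on the CPU for a block of 256 threads; `paths_per_warp` is a hypothetical helper, and the second branch key is an illustrative alternative that computes a different function from the one defined above.

```python
WARP_SIZE = 32


def paths_per_warp(branch_key, n_threads=256):
    """For each warp, count how many different branches its threads take."""
    counts = []
    for warp_start in range(0, n_threads, WARP_SIZE):
        lanes = range(warp_start, warp_start + WARP_SIZE)
        counts.append(len({branch_key(i) for i in lanes}))
    return counts


# Branch selected per thread, as in the if/elif implementation:
# every warp contains all 8 paths.
print(paths_per_warp(lambda i: i % 8))

# Hypothetical alternative: branch selected per warp,
# so all threads of a warp follow the same path.
print(paths_per_warp(lambda i: (i // WARP_SIZE) % 8))
```

The first call prints a list of eights and the second a list of ones: divergence depends on whether the branch condition varies within a warp, not on the number of branches in the code.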

In [9]:
%%writefile 3_2_control_flow.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests


block_size = 256


@cuda.jit
def add_vectors_mod8_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    r = i % 8 + 1
    result[i] = r*a[i] + b[i]


def add_vectors_mod8(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_mod8_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


@cuda.jit
def add_vectors_mod8_branches_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    if i % 8 == 0:
        result[i] = a[i] + b[i]
    elif i % 8 == 1:
        result[i] = 2*a[i] + b[i]
    elif i % 8 == 2:
        result[i] = 3*a[i] + b[i]
    elif i % 8 == 3:
        result[i] = 4*a[i] + b[i]
    elif i % 8 == 4:
        result[i] = 5*a[i] + b[i]
    elif i % 8 == 5:
        result[i] = 6*a[i] + b[i]
    elif i % 8 == 6:
        result[i] = 7*a[i] + b[i]
    elif i % 8 == 7:
        result[i] = 8*a[i] + b[i]


def add_vectors_mod8_branches(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_mod8_branches_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


gpu_short_course.tests.benchmark_add_vectors(add_vectors_mod8)
gpu_short_course.tests.benchmark_add_vectors(add_vectors_mod8_branches)

Writing 3_2_control_flow.py


Let's check how much time it takes to execute each of the kernels:

In [10]:
! nvprof --trace gpu python 3_2_control_flow.py

==23580== NVPROF is profiling process 23580, command: python 3_2_control_flow.py
GPU:0: b'GeForce GTX TITAN X'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0092 seconds (+/- 0.0403), median: 0.0041
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0076 seconds (+/- 0.0230), median: 0.0042
==23580== Profiling application: python 3_2_control_flow.py
==23580== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   56.09%  232.18ms       600  386.96us  339.79us  1.1761ms  [CUDA memcpy DtoH]
                   34.01%  140.76ms       400  351.90us  346.22us  607.63us  [CUDA memcpy HtoD]
                    8.20%  33.932ms       100  339.32us  336.97us  341.90us  cudapy::__main__::add_vectors_mod8_branches_kernel$242(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
         

Let's also measure the `branch_efficiency` metric, defined as:
> Ratio of non-divergent branches to total branches expressed as percentage.

In [11]:
! nvprof --metrics branch_efficiency python 3_2_control_flow.py

==23664== NVPROF is profiling process 23664, command: python 3_2_control_flow.py
GPU:0: b'GeForce GTX TITAN X'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0184 seconds (+/- 0.0313), median: 0.0142
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0173 seconds (+/- 0.0192), median: 0.0142
==23664== Profiling application: python 3_2_control_flow.py
==23664== Profiling result:
==23664== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX TITAN X (0)"
    Kernel: cudapy::__main__::add_vectors_mod8_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
        100                         branch_efficiency                         Branch Efficiency     100.00%     100.00%     100.00%
    Kernel: cudapy::__main__::add_v

Of course, the above example has been artificially complicated just to show the effect of complex kernel logic on the kernel's performance.

## Exercise 3.3. Multiprocessor occupancy: thread block size.

Recall that:
> The number of threads per block should be a **multiple of 32 threads**, because this provides optimal computing efficiency and facilitates coalescing.

[CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) also gives some other suggestions on how to choose the number of threads per block:

> There are many such factors involved in selecting block size, and inevitably some experimentation is required. However, a few rules of thumb should be followed:
> 1. Threads per block should be **a multiple of warp size** to avoid wasting computation on under-populated warps and to facilitate coalescing.
> 2. A **minimum of 64 threads** per block should be used, and only if there are multiple concurrent blocks per multiprocessor.
> 3. Between **128 and 256 threads** per block is a good initial range for experimentation with different block sizes.
> 4. Use several smaller thread blocks rather than one large thread block per multiprocessor if latency affects performance. This is particularly beneficial to kernels that frequently call `__syncthreads()`.

### Example

Let's measure the occupancy of `add_vectors` for different numbers of threads per block:


In [12]:
%%writefile 3_3_occupancy_16.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 16


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_3_occupancy_16.py


In [13]:
! nvprof --trace gpu python 3_3_occupancy_16.py

==23747== NVPROF is profiling process 23747, command: python 3_3_occupancy_16.py
GPU:0: b'GeForce GTX TITAN X'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0085 seconds (+/- 0.0320), median: 0.0041
==23747== Profiling application: python 3_3_occupancy_16.py
==23747== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   57.99%  117.13ms       300  390.42us  340.68us  1.0115ms  [CUDA memcpy DtoH]
                   34.72%  70.120ms       200  350.60us  345.99us  391.11us  [CUDA memcpy HtoD]
                    7.29%  14.717ms       100  147.17us  147.01us  148.68us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.


According to NVIDIA documentation, `achieved_occupancy` measures:
> Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor.
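
For comparison, an upper bound on occupancy can be estimated from the block size alone. The sketch below assumes a limit of 64 resident warps and 32 resident blocks per multiprocessor (values assumed here for compute capability 5.x devices; check the limits of your own GPU) and ignores register and shared-memory limits. `theoretical_occupancy` is a hypothetical helper.

```python
import math

WARP_SIZE = 32


def theoretical_occupancy(block_size, max_warps_per_sm=64, max_blocks_per_sm=32):
    """Upper bound on occupancy implied by the block size alone.

    The two per-multiprocessor limits are assumptions; register and
    shared-memory usage are not taken into account.
    """
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    resident_blocks = min(max_blocks_per_sm, max_warps_per_sm // warps_per_block)
    return resident_blocks * warps_per_block / max_warps_per_sm


for block_size in (16, 256, 1024):
    print(block_size, theoretical_occupancy(block_size))
```

Under these assumptions a 16-thread block cannot exceed 50% occupancy, because the block limit is reached while each block contributes only one (half-empty) warp. The achieved occupancy reported by the profiler can be lower than this bound.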

In [14]:
! nvprof --metrics achieved_occupancy python 3_3_occupancy_16.py

==23833== NVPROF is profiling process 23833, command: python 3_3_occupancy_16.py
GPU:0: b'GeForce GTX TITAN X'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0215 seconds (+/- 0.0311), median: 0.0172
==23833== Profiling application: python 3_3_occupancy_16.py
==23833== Profiling result:
==23833== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GeForce GTX TITAN X (0)"
    Kernel: cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
        100                        achieved_occupancy                        Achieved Occupancy    0.184555    0.189857    0.186089


In [15]:
%%writefile 3_3_occupancy_256.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 256


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()


gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_3_occupancy_256.py


In [16]:
! nvprof --trace gpu python 3_3_occupancy_256.py
! nvprof --metrics achieved_occupancy python 3_3_occupancy_256.py

==23917== NVPROF is profiling process 23917, command: python 3_3_occupancy_256.py
GPU:0: b'GeForce GTX TITAN X'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0086 seconds (+/- 0.0346), median: 0.0041
==23917== Profiling application: python 3_3_occupancy_256.py
==23917== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   60.73%  116.27ms       300  387.56us  335.31us  1.1607ms  [CUDA memcpy DtoH]
                   36.62%  70.107ms       200  350.53us  345.90us  395.50us  [CUDA memcpy HtoD]
                    2.65%  5.0808ms       100  50.807us  49.250us  54.273us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.
==24010== NVPROF is profiling process 24010, command: python 3_3_occupancy_256.py
GPU:0: b'GeForce GTX TITAN X'
Benchm

In [17]:
%%writefile 3_3_occupancy_1024.py

from numba import cuda
import math
import numpy as np
import gpu_short_course.tests

block_size = 1024


@cuda.jit
def add_vectors_kernel(result, a, b):
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    if i >= len(result):
        return
    result[i] = a[i] + b[i]


def add_vectors_gpu(a, b):
    result = cuda.device_array(shape=a.shape, dtype=a.dtype)
    grid_size = math.ceil(len(a)/block_size)
    add_vectors_kernel[grid_size, block_size](result, a, b)
    return result.copy_to_host()

gpu_short_course.tests.benchmark_add_vectors(add_vectors_gpu)

Writing 3_3_occupancy_1024.py


In [18]:
! nvprof --trace gpu python 3_3_occupancy_1024.py
! nvprof --metrics achieved_occupancy python 3_3_occupancy_1024.py

==24096== NVPROF is profiling process 24096, command: python 3_3_occupancy_1024.py
GPU:0: b'GeForce GTX TITAN X'
Benchmarking the function, please wait...
Benchmark result: 
Average processing time: 0.0086 seconds (+/- 0.0350), median: 0.0040
==24096== Profiling application: python 3_3_occupancy_1024.py
==24096== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   60.54%  115.47ms       300  384.89us  337.51us  1.0125ms  [CUDA memcpy DtoH]
                   36.81%  70.213ms       200  351.07us  346.06us  404.56us  [CUDA memcpy HtoD]
                    2.65%  5.0476ms       100  50.475us  49.505us  51.874us  cudapy::__main__::add_vectors_kernel$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>)
No API activities were profiled.
==24180== NVPROF is profiling process 24180, command: python 3_3_occupancy_1024.py
GPU:0: b'GeForce GTX TITAN X'
Ben

Finding the right number of threads per block requires some experimentation, but generally 128 or 256 threads is a good starting point.
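
Such an experiment can be organized as a simple sweep over candidate block sizes. The sketch below is a generic timing harness, not part of the course package: `sweep_block_sizes` and `fake_run` are hypothetical names, and the stand-in workload exists only so the example runs without a GPU. For real kernels, `run` should wait for the GPU to finish (for example by copying the result back to the host), otherwise only the launch is timed.

```python
import time


def sweep_block_sizes(run, block_sizes=(64, 128, 256, 512, 1024), repeats=10):
    """Time `run(block_size)` for each candidate.

    Returns a dict mapping block size to the best observed time in seconds.
    """
    timings = {}
    for block_size in block_sizes:
        run(block_size)  # warm-up call, so one-time setup costs are not timed
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(block_size)
            best = min(best, time.perf_counter() - start)
        timings[block_size] = best
    return timings


def fake_run(block_size):
    """Stand-in workload; replace with a function that launches a kernel."""
    sum(range(100_000 // block_size))


timings = sweep_block_sizes(fake_run)
fastest = min(timings, key=timings.get)
print(f"fastest candidate: {fastest} threads per block")
```

Taking the minimum over several repeats reduces the influence of outliers, which matter here given the large spread reported by the benchmarks in this notebook.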