# 1.3. Numba and CUDA Toolkit

## Debugging with the Numba debugger

## Debugging with cuda-memcheck

## Measuring Python code execution time
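
As a minimal sketch of wall-clock timing in Python, the helper below repeats a call several times and keeps the best time. It uses only the standard library. The names `measure` and `add_vectors` are hypothetical and introduced here for illustration, and the pure-Python vector addition is just a stand-in workload.

```python
import time


def measure(fn, *args, repeats=5):
    """Return the best wall-clock time (in seconds) over `repeats` calls of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def add_vectors(a, b):
    """Stand-in workload (hypothetical): element-wise sum of two Python lists."""
    return [x + y for x, y in zip(a, b)]


n = 100000
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]

best = measure(add_vectors, a, b)
print(f"best of 5 runs: {best * 1e3:.3f} ms")
```

When the same approach is applied to a Numba CUDA kernel, two things should be kept in mind: kernel launches are asynchronous, so `cuda.synchronize()` has to be called before the clock is read, and the first call of a `@cuda.jit` function includes JIT compilation time, so it is usually excluded from the measurement.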

## Profiling kernels with nvprof / nsys from the command line

In [1]:
%%writefile 1.3.x-add-vectors.py

import math
import numpy as np
from numba import cuda

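# CUDA kernel: each thread adds one pair of elements.
@cuda.jit
def add(c, a, b):
    # Global thread index within the 1D grid.
    i = cuda.blockIdx.x*cuda.blockDim.x + cuda.threadIdx.x
    # Guard against threads past the end of the array.
    if i < c.shape[0]:
        c[i] = a[i] + b[i]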

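# Test data: two random float32 input vectors and an output buffer.
n = 100000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.zeros(n)  # np.zeros defaults to float64

# Launch configuration: 16 threads per block, enough blocks to cover n elements.
block_size = (16, )
grid_size = (math.ceil(n/block_size[0]), )

# Run the kernel 100 times so the profiler collects several samples,
# checking the result against NumPy after each launch.
for i in range(100):
    add[grid_size, block_size](c, a, b)
    np.testing.assert_almost_equal(a+b, c)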

Overwriting 1.3.x-add-vectors.py


In [3]:
! nsys profile --stats=true python 1.3.x-add-vectors.py

Collecting data...
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-4423-0565-1255-752d.qdstrm" file to disk...
Creating final output files...

Saved report file to "/tmp/nsys-report-4423-0565-1255-752d.qdrep"

Exported successfully to
/tmp/nsys-report-4423-0565-1255-752d.sqlite

Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   54.2        31536014         300        105120.0           70328          328218  cuMemcpyDtoH_v2                                                                 
   17.4        10118072         300         33726.9           22459          104826  cuMemcpyHtoD_v2                   
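
The two rows above can be cross-checked with a little arithmetic. The sketch below copies the `Total Time` and `Calls` values from the table and recomputes the `Average` column. It also relates the 300 calls to the 100 loop iterations of the script. This assumes the default Numba behaviour of copying every host NumPy argument to the device before a launch and back to the host afterwards, which for the three arguments `c`, `a` and `b` gives three copies in each direction per launch.

```python
# Rows copied from the nsys "CUDA API Statistics" table: (name, total_ns, calls).
rows = [
    ("cuMemcpyDtoH_v2", 31536014, 300),
    ("cuMemcpyHtoD_v2", 10118072, 300),
]

iterations = 100  # number of kernel launches in the profiled script

summary = {}
for name, total_ns, calls in rows:
    avg_ns = total_ns / calls               # should match the Average column
    copies_per_launch = calls // iterations  # one copy per array argument
    summary[name] = (round(avg_ns, 1), copies_per_launch)
    print(f"{name}: {avg_ns:.1f} ns on average, "
          f"{copies_per_launch} copies per kernel launch")
```

Since each launch triggers implicit transfers in both directions, one way to reduce this overhead is to move the arrays to the device once with `cuda.to_device` before the loop and copy the result back explicitly only when it is needed.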
