Numba is a just-in-time (JIT), type-specializing function compiler.
Numba is one of the most widely used libraries for speeding up Python code. Decorating an existing function with a Numba decorator is often enough to make numerical code run much faster, and Numba provides several decorators for different use cases.

The most important decorators in Numba are:

- @jit and @njit - to compile general Python functions, especially numerical, loop-heavy code.
- @vectorize     - to create NumPy-like universal functions (ufuncs) from scalar functions.
- @guvectorize   - extended version of @vectorize whose functions operate on whole arrays instead of scalars.
- @stencil       - to speed up functions performing stencil kernel operations (e.g. convolutions).

The Numba @jit decorator fundamentally operates in two compilation modes: nopython mode and object mode.

In nopython mode, the decorated function is compiled so that it runs entirely without the involvement of the Python interpreter. This is the recommended, best-practice way to use the jit decorator because it gives the best performance. In object mode, values are handled as Python objects through the interpreter's C API, so the speed-up over plain Python is usually small.

The @vectorize decorator takes a list of signatures that specify the possible input and output data types of the function, and it creates a compiled version for each of them.

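Apart from the signatures, it accepts further arguments, including:

-   target - string selecting the hardware the function is compiled for (see below).
-   cache - boolean parameter specifying whether the compiled code is cached on disk, so that later runs of the program can skip recompilation. It does not store results for repeated inputs.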


The target argument accepts a string specifying which hardware resources the function should use:

- 'cpu' - This is the default. The function runs on a single CPU core (single-threaded).
- 'parallel' - The function runs in parallel on the cores of a multi-core CPU (multi-threaded).
- 'cuda' - The function runs on a CUDA-capable GPU.


Example:

```python
@vectorize([ret_datatype1(input1_datatype1, input2_datatype1, ...),
            ret_datatype2(input1_datatype2, input2_datatype2, ...), ...],
           target='cpu', cache=False)
def func(x):
    return x*x
```

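NOTE: The signatures should be listed from the most specific data type to the least specific one (for example float32 before float64); otherwise type-based dispatch may not pick the version you expect.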



In [None]:
import numba

print("Numba Version : {}".format(numba.__version__))

Numba Version : 0.58.1


In [21]:

@numba.njit
def plus1(x):
    return x + 1


import numpy as np
plus1(np.arange(10))
plus1.signatures


#Inspecting the LLVM control-flow graph (CFG) of the compiled function
plus1.inspect_cfg(plus1.signatures[0]).display()


[Output: SVG rendering of the control-flow graph of plus1, truncated]

In [None]:
from numba import njit

@njit
def foo(x):
    if x < 3:
        return x + 1
    return x + 2

foo(10)

#Note: inspect_disasm_cfg relies on the optional r2pipe package (radare2)
print(foo.inspect_disasm_cfg(signature=foo.signatures[0]))

In [15]:
import numpy as np
import timeit
from numba import jit
from numba import vectorize, int64, float32, float64

def cube_formula(x):
    return x**3 + 3*x**2 + 3

#cube_formula_jitted = jit(cube_formula)

print(cube_formula(5))


#NUMPY VECTORIZE
vectorized_cube_formula = np.vectorize(cube_formula)
arr = np.arange(1, 1000000, dtype=np.int64)


%timeit vectorized_cube_formula(arr)

#in a python script, pass an explicit number of runs (the timeit default is 1000000)
#print("The time taken with numpy vectorize is ", timeit.timeit(stmt='vectorized_cube_formula(arr)', globals=globals(), number=10))

203
793 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
arr = np.arange(1, 1000000, dtype=np.int64)


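#Note: in Numba 0.58, nopython=False still tries nopython mode first and only falls back
#to object mode if that fails, so this version likely ends up compiled like the one below.
#Passing forceobj=True instead would force object mode.
@jit(nopython=False)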
def cube_formula_jitted(x):
    xs = []
    for i in x:
        xs.append(i**3 + 3*i**2 + 3)
    return xs

res = cube_formula_jitted(arr)


print("The time taken with Numba jit is:\n")
%timeit cube_formula_jitted(arr)
#print(timeit.timeit('cube_formula_jitted(arr)', globals=globals()))

@jit(nopython=True)
def new_cube_formula_jitted(x):
    xs = []
    for i in x:
        xs.append(i**3 + 3*i**2 + 3)
    return xs

print("The time taken with Numba jit, in nopython mode, is \n")
%timeit new_cube_formula_jitted(arr)

#in python script you can use
#print("The time taken with Numba jit is ",timeit.timeit('arr = np.arange(1, 1000000, dtype=np.int64); new_cube_formula_jitted(arr)', setup="from __main__ import new_cube_formula_jitted"))
#print("The time taken with Numba jit, in nopython mode, is ",timeit.timeit(stmt='new_cube_formula_jitted(arr)', globals=globals()))





The time taken with Numba jit is:

25.7 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The time taken with Numba jit, in nopython mode, is 

27 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [16]:
@vectorize([int64(int64), float32(float32), float64(float64)])
def cube_formula_numba_vec(x):
    return x**3 + 3*x**2 + 3


print("The time taken with Numba vectorize is: \n ")

%timeit cube_formula_numba_vec(arr)


The time taken with Numba vectorize is: 
 
633 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [17]:
#NUMBA PARALLELIZED with multithreading

@vectorize([int64(int64), float32(float32), float64(float64)], target="parallel")
def cube_formula_numba_vec_paralleled(x):
    return x**3 + 3*x**2 + 3



print("The time taken with Numba vectorize parallelized is: \n ")

%timeit cube_formula_numba_vec_paralleled(arr)

#print("The time taken with Numba vectorize parallelized is ", timeit.timeit(stmt='cube_formula_numba_vec_paralleled(arr)', globals=globals(), number=1000))


The time taken with Numba vectorize parallelized is: 
 
170 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [19]:
#Caching in Numba
@vectorize([int64(int64), float32(float32), float64(float64)], cache=True)
def cube_formula_numba_vec_cached(x):
    return x**3 + 3*x**2 + 3

print("The time taken with Numba vectorize cached is: \n ")

%timeit cube_formula_numba_vec_cached(arr)

#print("The time taken with Numba vectorize cached is ", timeit.timeit(stmt='cube_formula_numba_vec_cached(arr)', globals=globals(), number=1000))

The time taken with Numba vectorize cached is: 
 
646 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [None]:
#EXERCISE: Given the serial recursive function that computes the Fibonacci sequence, defined below, use Numba decorators to speed up the function and time the different decorated versions.

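def Fibonacci(n):
    #Assumed standard recursive definition; the original function body is missing
    if n < 2:
        return n
    return Fibonacci(n - 1) + Fibonacci(n - 2)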


#Question: Is the caching utility useful in this case? Why?


# Numba on GPU

In [None]:
#Importing necessary libraries

from numba import cuda, float32
import numpy as np
import math

In [None]:
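#Vector addition with Numba on GPU

@cuda.jit
def f(a, b, c):
    # cuda.grid(1) is the global thread index, like threadIdx.x + (blockIdx.x * blockDim.x)
    tid = cuda.grid(1)
    size = len(c)

    # guard against threads that fall beyond the end of the array
    if tid < size:
        c[tid] = a[tid] + b[tid]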

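We have two ways to launch this kernel:
- by defining the CUDA grid explicitly (the number of blocks and the number of threads per block)
- by using the forall construct, which picks a 1D launch configuration for a given number of elements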

In [None]:
#WAY 1


In [1]:
#Matrix multiplication
@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp




In [3]:
x_h = np.arange(16).reshape([4, 4])
y_h = np.ones([4, 4])
z_h = np.zeros([4, 4])


#Moving numpy arrays to GPU
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)


#defining the grid on the GPU, using 32 x 32 = 1024 threads per block
threadsperblock = (32, 32)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

#performing the matmul
matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)

#copying the output array back to the CPU
z_h = z_d.copy_to_host()

print(z_h)

print(x_h @ y_h)

[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]




In [4]:
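#Note: kernel launches are asynchronous, so without a cuda.synchronize() call this timing may not include the full kernel execution time
%timeit matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)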

59.2 µs ± 658 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [None]:
#COMPARE the execution time with CuPy matmul