In [1]:
import numba

### How to measure the performance of Numba?

First, recall that Numba has to compile your function for the argument types given before it executes the machine code version of your function.   
This takes time!  


#### Common mistake

A common mistake when measuring performance is to time the code only once with a simple timer, so that the reported execution time also includes the time taken to compile the function.

In [2]:
from numba import jit
import numpy as np
import time

x = np.arange(100).reshape(10, 10)

@jit(nopython=True)
def go_fast(a): # Function is compiled and runs in machine code
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

# DO NOT REPORT THIS... COMPILATION TIME IS INCLUDED IN THE EXECUTION TIME!
start = time.perf_counter()
go_fast(x)
end = time.perf_counter()
print("Elapsed (with compilation) = {}s".format((end - start)))

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time.perf_counter()
go_fast(x)
end = time.perf_counter()
print("Elapsed (after compilation) = {}s".format((end - start)))


Elapsed (with compilation) = 1.1794615999999678s
Elapsed (after compilation) = 3.629999997656341e-05s
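The warm-up-then-measure pattern above can be wrapped in a small helper. The following is only a sketch: `time_first_and_steady` and `fake_compiled` are hypothetical names introduced here, and `fake_compiled` is a plain Python stand-in that merely imitates a one-off compilation cost per argument type, so the block runs without Numba installed.

```python
import time

def time_first_and_steady(func, *args, repeats=5):
    """Return (first_call_seconds, best_later_call_seconds) for func(*args)."""
    start = time.perf_counter()
    func(*args)  # for a jitted function, this first call includes compilation
    first = time.perf_counter() - start

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    # Report the best of the later calls as the steady-state time.
    return first, min(timings)

# Hypothetical stand-in for a jitted function: slow once per argument type.
_seen_types = set()

def fake_compiled(n):
    if type(n) not in _seen_types:
        time.sleep(0.05)  # pretend this is the compilation step
        _seen_types.add(type(n))
    return n * 2

first, steady = time_first_and_steady(fake_compiled, 21)
print("first call = {:.4f}s, steady state = {:.6f}s".format(first, steady))
```

With the notebook's own function, the call would presumably look like `time_first_and_steady(go_fast, x)`; only the second number is the one worth reporting as execution time.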


### Extra options in some decorators

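Some decorators accept extra options:

- parallel = True - enable automatic parallelization of the function (for example, of array operations and prange loops).

- fastmath = True - enable fast-math behaviour for the function, relaxing strict IEEE 754 floating-point semantics in exchange for speed.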

### Numba version

In [4]:
numba.__version__

'0.56.0'

In [10]:
!numba -s

System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time)                   : 2022-09-19 17:33:28.600817
UTC start time                                : 2022-09-19 15:33:28.600817
Running time (s)                              : 5.835758

__Hardware Information__
Machine                                       : AMD64
CPU Name                                      : tigerlake
CPU Count                                     : 8
Number of accessible CPUs                     : 8
List of accessible CPUs cores                 : 0 1 2 3 4 5 6 7
CFS Restrictions (CPUs worth of runtime)      : None

CPU Features                                  : 64bit adx aes avx avx2
                                                avx512bitalg avx512bw avx512cd
                                                avx512dq avx512f avx512ifma
                                                avx512vbmi avx512vbmi2 avx512vl
                      

### Lazy compilation

In [11]:
from numba import jit

@jit
def f(x, y):
    # A somewhat trivial example
    return x + y

In [12]:
f(1,2)

3

### Eager compilation

In [13]:
from numba import jit, int32

@jit(int32(int32, int32))
def f(x, y):
    # A somewhat trivial example
    return x + y

### Calling other functions

Numba-compiled functions can call other compiled functions.

In [15]:
import math

@jit
def square(x):
    return x ** 2

@jit
def hypot(x, y):
    # square() is itself jitted, so it can be called from compiled code
    return math.sqrt(square(x) + square(y))

### Compilation options

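A number of keyword-only arguments can be passed to the @jit decorator:
- nopython: compile the function without falling back to the Python interpreter,
- nogil: release the global interpreter lock (GIL) while the compiled code runs,
- cache: store the compiled function in a file-based cache to avoid recompiling in later sessions,
- parallel: enable automatic parallelization of the function.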

### Why @vectorize?

Numba’s vectorize allows Python functions taking scalar input arguments to be used as NumPy ufuncs. NumPy ufuncs automatically get other features such as reduction, accumulation or broadcasting.

In [23]:
from numba import vectorize, float64, int64, float32, int32

@vectorize([int32(int32, int32),
            int64(int64, int64),
            float32(float32, float32),
            float64(float64, float64)])
def f(x, y):
    return x + y

In [58]:
a = np.arange(100000)

In [59]:
a

array([    0,     1,     2, ..., 99997, 99998, 99999])

In [60]:
f(a,a)



array([     0,      2,      4, ..., 199994, 199996, 199998], dtype=int64)

In [63]:
f(a,a).reshape(50000,2)

array([[     0,      2],
       [     4,      6],
       [     8,     10],
       ...,
       [199988, 199990],
       [199992, 199994],
       [199996, 199998]])

The vectorize() decorator supports multiple ufunc targets:
- cpu (Single-threaded CPU)
- parallel (Multi-core CPU)
- cuda (CUDA GPU)

In [64]:
from numba import vectorize, float64, int64, float32, int32

@vectorize([int32(int32, int32),
            int64(int64, int64),
            float32(float32, float32),
            float64(float64, float64)], target='cpu')
def f(x, y):
    return x + y

In [65]:
%timeit f(a,a).reshape(50000,2)

13.3 µs ± 773 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [66]:
from numba import vectorize, float64, int64, float32, int32

@vectorize([int32(int32, int32),
            int64(int64, int64),
            float32(float32, float32),
            float64(float64, float64)], target='parallel')
def f(x, y):
    return x + y

In [67]:
%timeit f(a,a).reshape(50000,2)

169 µs ± 4.84 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [84]:
from numba import vectorize, float64, int64, float32, int32

@vectorize([int32(int32, int32),
            int64(int64, int64),
            float32(float32, float32),
            float64(float64, float64)], target='cuda')
def f(x, y):
    return x + y

f.max_blocksize = 32
print(dir(f))

['_CUDAUFuncDispatcher__reduce', '__call__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_max_blocksize', '_maxblocksize', 'functions', 'max_blocksize', 'reduce']


In [85]:
%timeit f(a,a).reshape(50000,2)



1.54 ms ± 72.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### The @guvectorize decorator

While vectorize() allows you to write ufuncs that work on one element at a time, the guvectorize() decorator takes the concept one step further and allows you to write ufuncs that will work on an arbitrary number of elements of input arrays, and take and return arrays of differing dimensions.

Contrary to vectorize() functions, guvectorize() functions don’t return their result value: they take it as an array argument, which must be filled in by the function.

In [89]:
from numba import guvectorize

@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, res):
    for i in range(x.shape[0]):
        res[i] = x[i] + y

The symbolic declaration of the input and output layouts, (n),()->(n), tells NumPy that the function takes an n-element one-dimensional array and a scalar (symbolically denoted by the empty tuple ()) and returns an n-element one-dimensional array.