# numba

**Table of contents**<a id='toc0_'></a>    
- 1. [Introduction](#toc1_)    
- 2. [Further speed-up](#toc2_)    


This notebook introduces the **numba** package and shows how to use it to speed up your code.

In [1]:
import time
import numpy as np
import numba as nb

import matplotlib.pyplot as plt
plt.rcParams.update({"axes.grid":True,"grid.color":"black","grid.alpha":"0.25","grid.linestyle":"--"})
plt.rcParams.update({'font.size': 14})

## 1. <a id='toc1_'></a>[Introduction](#toc0_)

Writing **vectorized code can be cumbersome**, and in some cases it is impossible. Instead, we can use the **numba** module.

Adding the decorator `nb.njit` on top of a function tells numba to compile this function **to machine code just-in-time**.

This takes some time when the function is called the first time, but subsequent calls are then a lot faster. 

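*Numba infers the input types on the first call and compiles the function for exactly those types. Calling it later with different input types triggers a new compilation, so keep the input types fixed between calls.*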

In [2]:
def myfun_numpy_vec(x1,x2):
    y = np.empty((1,x1.size))
    I = x1 < 0.5
    y[I] = np.sum(np.exp(x2*x1[I]),axis=0)
    y[~I] = np.sum(np.log(x2*x1[~I]),axis=0)
    return y

# setup
x1 = np.random.uniform(size=10**6)
x2 = np.random.uniform(size=np.int64(100)) # adjust the size of the problem
x1_np = x1.reshape((1,x1.size))
x2_np = x2.reshape((x2.size,1))

# timing
%timeit myfun_numpy_vec(x1_np,x2_np)

1.6 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


**Numba:** The first call is slower because it includes the compilation, but the result is the same, and subsequent calls are faster:

In [3]:
@nb.njit
def myfun_numba(x1,x2):
    y = np.empty(x1.size)
    for i in range(x1.size):
        if x1[i] < 0.5:
            y[i] = np.sum(np.exp(x2*x1[i]))
        else:
            y[i] = np.sum(np.log(x2*x1[i]))
    return y

# call to just-in-time compile
%time myfun_numba(x1,x2)

# actual measurement
%timeit myfun_numba(x1,x2)

assert np.allclose(myfun_numpy_vec(x1_np,x2_np),myfun_numba(x1,x2))

Wall time: 2.09 s
550 ms ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


You can also call the original, uncompiled Python version through the `py_func` attribute:

In [None]:
%time myfun_numba.py_func(x1,x2)

**Caveats:** Only a subset of Python and NumPy features is supported inside just-in-time compiled functions.

- [Supported Python features](https://numba.pydata.org/numba-doc/dev/reference/pysupported.html)
- [Supported NumPy features](https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html)


## 2. <a id='toc2_'></a>[Further speed-up](#toc0_)

**Further speed-up:** Use

1. parallelization (with ``parallel=True`` and ``nb.prange``), and 
1. faster but less precise math (with ``fastmath=True``)

In [4]:
@nb.njit(parallel=True)
def myfun_numba_par(x1,x2):
    y = np.empty(x1.size)
    for i in nb.prange(x1.size): # in parallel across threads
        if x1[i] < 0.5:
            y[i] = np.sum(np.exp(x2*x1[i]))
        else:
            y[i] = np.sum(np.log(x2*x1[i]))
    return y

# the assert triggers the compilation, so %time below measures an already compiled call
assert np.allclose(myfun_numpy_vec(x1_np,x2_np),myfun_numba_par(x1,x2))
%time myfun_numba_par(x1,x2)
%timeit myfun_numba_par(x1,x2)

Wall time: 224 ms
277 ms ± 46.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
@nb.njit(parallel=True,fastmath=True)
def myfun_numba_par_fast(x1,x2):
    y = np.empty(x1.size)
    for i in nb.prange(x1.size): # in parallel across threads
        if x1[i] < 0.5:
            y[i] = np.sum(np.exp(x2*x1[i]))
        else:
            y[i] = np.sum(np.log(x2*x1[i]))
    return y

# the assert triggers the compilation, so %time below measures an already compiled call
assert np.allclose(myfun_numpy_vec(x1_np,x2_np),myfun_numba_par_fast(x1,x2))
%time myfun_numba_par_fast(x1,x2)
%timeit myfun_numba_par_fast(x1,x2)

Wall time: 241 ms
239 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
