# Numba

**Table of contents**<a id='toc0_'></a>    
- 1. [Numpy code](#toc1_)    
- 2. [Numba](#toc2_)    
- 3. [Paralization](#toc3_)    
- 4. [Extra: Calling an optimizer](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
import numpy as np
import numba as nb

**Python** is inherently slow:

1. It interpret the code line by line (no compiling)
1. It runs the code line by line (serial)

**Numba** package allow for simple **just-in-time compilation**. 

Just add decorator `numba.njit` on top of a function.

1. First run is slower because code is analyzed.
1. But subsequent calls are then a lot faster. 

*The input types can, however, not change between calls because numba infer them on the first call.*

## 1. <a id='toc1_'></a>[Numpy code](#toc0_)

Basic implementation:

In [2]:
def myfun_numpy(x1,x2):

    y = np.empty(x1.size)

    for i in range(x1.size):
        if x1[i] < 0.5:
            y[i] = np.sum(np.exp(x2*x1[i]))
        else:
            y[i] = np.sum(np.log(x2*x1[i]))

    return y

Vectorized implementation:

In [3]:
def myfun_numpy_vec(x1,x2):

    y = np.empty((1,x1.size))
    I = x1 < 0.5

    y[I] = np.sum(np.exp(x2*x1[I]),axis=0)
    y[~I] = np.sum(np.log(x2*x1[~I]),axis=0)
    
    return y

Inputs:

In [4]:
rng = np.random.default_rng(1234)
x1 = rng.uniform(size=10**6)
x2 = rng.uniform(size=np.int64(100))
x1_np = x1.reshape((1,x1.size))
x2_np = x2.reshape((x2.size,1))

Same results:

In [5]:
assert np.allclose(myfun_numpy_vec(x1_np,x2_np),myfun_numpy(x1,x2))

Timing:

In [6]:
%timeit myfun_numpy(x1,x2)

6.53 s ± 54.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%timeit myfun_numpy_vec(x1_np,x2_np)

1.05 s ± 32.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## 2. <a id='toc2_'></a>[Numba](#toc0_)

In [8]:
@nb.njit
def myfun_numba(x1,x2):

    y = np.empty(x1.size)

    for i in range(x1.size):
        
        if x1[i] < 0.5:
            y[i] = np.sum(np.exp(x2*x1[i]))
        else:
            y[i] = np.sum(np.log(x2*x1[i]))
            
    return y

# call to just-in-time compile
%time myfun_numba(x1,x2)

# actual measurement
%timeit myfun_numba(x1,x2)

assert np.allclose(myfun_numpy_vec(x1_np,x2_np),myfun_numba(x1,x2))

CPU times: total: 469 ms
Wall time: 4.4 s
933 ms ± 50 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


You can also call the Python-version:

In [9]:
%time myfun_numba.py_func(x1,x2)

CPU times: total: 2.72 s
Wall time: 7.92 s


array([-105.4802591 ,  120.24914509, -111.10859202, ...,  100.9689567 ,
       -120.51811646,  103.39232248])

**Caveats:** Only a limited number of Python and Numpy features are supported inside just-in-time compiled functions.

- [Supported Python features](https://numba.pydata.org/numba-doc/dev/reference/pysupported.html)
- [Supported Numpy features](https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html)


## 3. <a id='toc3_'></a>[Paralization](#toc0_)

In [10]:
@nb.njit(parallel=True)
def myfun_numba_par(x1,x2):

    y = np.empty(x1.size)
    
    for i in nb.prange(x1.size): # in parallel across threads
        if x1[i] < 0.5:
            y[i] = np.sum(np.exp(x2*x1[i]))
        else:
            y[i] = np.sum(np.log(x2*x1[i]))
            
    return y

assert np.allclose(myfun_numpy_vec(x1_np,x2_np),myfun_numba_par(x1,x2))

myfun_numba_par(x1,x2)
%timeit myfun_numba_par(x1,x2)

252 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## 4. <a id='toc4_'></a>[Extra: Calling an optimizer](#toc0_)

Using solver from `QuantEcon` (see [documentation](https://quanteconpy.readthedocs.io/en/latest/index.html)).

In [11]:
import quantecon as qe

In [12]:
n = 4000
alphas = rng.uniform(size=n)
betas = rng.uniform(size=n)
gammas = rng.uniform(size=n)

In [13]:
@nb.njit
def solver_nb(alpha,beta,gamma):

    def obj(x,alpha,beta,gamma):
        return (x[0]-alpha)**2 + (x[1]-beta)**2 + (x[2]-gamma)**2

    res = qe.optimize.nelder_mead(obj,np.array([0.0,0.0,0.0]),args=(alpha,beta,gamma))

    return res.x

**Serial version:**

In [14]:
@nb.njit
def serial_solver_nb(alphas,betas,gammas):

    n = alphas.size
    xopts = np.zeros((n,3))

    for i in range(n):
        xopts[i,:] = solver_nb(alphas[i],betas[i],gammas[i])

%time serial_solver_nb(alphas,betas,gammas)
%time serial_solver_nb(alphas,betas,gammas)

CPU times: total: 10.4 s
Wall time: 16.1 s
CPU times: total: 2.38 s
Wall time: 3.4 s


**Parallel version:**

In [15]:
@nb.njit(parallel=True)
def parallel_solver_nb(alphas,betas,gammas):

    n = alphas.size
    xopts = np.zeros((n,3))

    for i in nb.prange(n):
        xopts[i,:] = solver_nb(alphas[i],betas[i],gammas[i])

%time parallel_solver_nb(alphas,betas,gammas)
%time parallel_solver_nb(alphas,betas,gammas)

CPU times: total: 13.5 s
Wall time: 9.08 s
CPU times: total: 8.45 s
Wall time: 1.13 s
