# Working with Numba

This notebook provides some examples of how to work with **Numba** and compares the speed-up with C++.

From the **consav** package we will use the **runtools** module to control the behavior of **Numba**.

**Links:**

- [Supported Python features](https://numba.pydata.org/numba-doc/dev/reference/pysupported.html)
- [Supported Numpy features](https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html)

**Requirements:** You must have these two compilers installed:

* **vs:** Free *Microsoft Visual Studio 2017 Community Edition* ([link](https://visualstudio.microsoft.com/downloads/))
* **intel:** Costly *Intel Parallel Studio 2018 Composer Edition* ([link](https://software.intel.com/en-us/parallel-studio-xe))

**Computer used for timings:** Windows 10 computer with two Intel(R) Xeon(R) Gold 6154 3.00 GHz CPUs (18 cores, 36 logical processors each) and 192 GB of RAM.

In [1]:
THREADS = [1,4,8,16,32,64]

# Decorating Python functions

Imports and numba settings:

In [2]:
import time
import numpy as np

from consav import runtools
runtools.write_numba_config(threads=8,threading_layer='tbb')
import numba as nb # must be imported after write_numba_config!
#nb.config.__dict__ # see all config options

## Functions

In [3]:
def test_standard(X,Y,Z,NX,NY):

    # X has length NX
    # Y has length NY
    # Z has length NX

    for i in range(NX):
        Z[i] = 0
        for j in range(NY):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1

@nb.njit
def test(X,Y,Z,NX,NY):
    for i in nb.prange(NX): # without parallel=True, prange behaves like range
        Z[i] = 0
        for j in range(NY):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1
            
@nb.njit(parallel=True)
def test_par(X,Y,Z,NX,NY):
    for i in nb.prange(NX):
        Z[i] = 0
        for j in range(NY):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1

@nb.njit(parallel=True,fastmath=True)
def test_par_fast(X,Y,Z,NX,NY):
    for i in nb.prange(NX):
        Z[i] = 0
        for j in range(NY):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1
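
All four functions share the same inner expression. As a hedged sanity check (not part of the original notebook), the sketch below compares the double loop with a vectorized NumPy version on a small input, and with the algebraic simplification `0.001/(X[i]*Y[j])`. The names `kernel_loop` and `kernel_vectorized` and the small test arrays are assumptions made for illustration:

```python
import numpy as np

def kernel_loop(X, Y):
    """Same double loop as the functions above, returning Z instead of filling it."""
    Z = np.zeros(X.size)
    for i in range(X.size):
        for j in range(Y.size):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1
    return Z

def kernel_vectorized(X, Y):
    """Vectorized version: build all products X[i]*Y[j] and sum over j."""
    XY = np.outer(X, Y)  # shape (NX, NY)
    return np.sum(np.exp(np.log(XY+0.001))/XY-1, axis=1)

# small illustrative input, bounded away from zero
rng = np.random.default_rng(1998)
X_small = rng.uniform(0.1, 1.0, size=5)
Y_small = rng.uniform(0.1, 1.0, size=50)

Z_loop = kernel_loop(X_small, Y_small)
Z_vec = kernel_vectorized(X_small, Y_small)

# algebraically, exp(log(a+0.001))/a - 1 = 0.001/a
Z_closed = np.sum(0.001/np.outer(X_small, Y_small), axis=1)

print(np.allclose(Z_loop, Z_vec), np.allclose(Z_loop, Z_closed))
```

The expression is deliberately expensive (an `exp` and a `log` per term), which makes it a reasonable stress test for the compiler.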

## Settings

Choose settings and make random draws:

In [4]:
# a. settings
NX = 100
NY = 20000

# b. random draws
np.random.seed(1998)
X = np.random.sample(NX)
Y = np.random.sample(NY)
Z = np.zeros(NX)

## Examples

In [5]:
tic = time.time()
test_standard(X,Y,Z,NX,NY)
toc = time.time()
print(f'{"python":10s} {np.sum(Z):.8f} in {toc-tic:4.2f} secs')

test(X,Y,Z,NX,NY) # test run
tic = time.time()
test(X,Y,Z,NX,NY)
toc = time.time()
print(f'{"numba":10s} {np.sum(Z):.8f} in {toc-tic:4.2f} secs')

test_par(X,Y,Z,NX,NY) # test run
tic = time.time()
test_par(X,Y,Z,NX,NY)
toc = time.time()
print(f'{"numba par":10s} {np.sum(Z):.8f} in {toc-tic:4.2f} secs')

test_par_fast(X,Y,Z,NX,NY) # test run
tic = time.time()
test_par_fast(X,Y,Z,NX,NY)
toc = time.time()
print(f'{"numba fast":10s} {np.sum(Z):.8f} in {toc-tic:4.2f} secs')

python     182878.74978038 in 6.39 secs
numba      182878.74978038 in 0.03 secs
numba par  182878.74978038 in 0.00 secs
numba fast 182878.74978038 in 0.00 secs


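**Conclusion:** Huge speed-up from Numba: 6.39 secs in pure Python versus 0.03 secs with `@nb.njit`, and less than 0.005 secs with `parallel=True`. The `fastmath=True` option may give a further gain through additional compiler optimizations, but both parallel timings print as 0.00 secs, so it cannot be seen at this resolution.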
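
The timings above call each Numba function once before timing it, because the first call includes compilation. A minimal sketch of a reusable helper for the same pattern is shown below. The names `time_call` and `slow_sum` are hypothetical, and it uses `time.perf_counter` and more printed decimals so that very short run times remain visible:

```python
import time

def time_call(func, *args, warmup=True, repeats=3):
    """Return the best wall-clock time (in secs) over `repeats` calls of func(*args)."""
    if warmup:
        func(*args)  # untimed test run (for Numba this triggers compilation)
    best = float('inf')
    for _ in range(repeats):
        tic = time.perf_counter()
        func(*args)
        toc = time.perf_counter()
        best = min(best, toc-tic)
    return best

def slow_sum(n):
    """Stand-in workload in pure Python (illustration only)."""
    total = 0.0
    for i in range(n):
        total += 0.5*i
    return total

secs = time_call(slow_sum, 100_000)
print(f'{"python":10s} in {secs:.6f} secs')
```

With the notebook's functions, the call would look like `time_call(test_par, X, Y, Z, NX, NY)`.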

## jitclass

To pass around a large number of variables, **jitclasses** are useful. Their attributes and types need to be declared up front, but they can then be used in Numba functions.

In [6]:
# a. setup jit class
parlist = [
    ('X',nb.double[:]),
    ('Y',nb.double[:]),    
    ('Z',nb.double[:]),   
    ('N',nb.int64),    
    ('a',nb.double),
    ('b',nb.double),
    ('threads',nb.int64)
]

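# note: in Numba >= 0.49 jitclass lives in numba.experimental
# (from numba.experimental import jitclass); nb.jitclass is the older location
@nb.jitclass(parlist)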
class ParClass():
    def __init__(self):
        pass

# b. create
par = ParClass()
par.N = 10
par.X = np.zeros(par.N)
par.Y = np.zeros(par.N)
par.Z = np.zeros(0)
par.a = 2
par.b = 1
par.threads = 4

# c. call function
@nb.jit
def test(par):
    par.Z = np.zeros(par.X.size)
    for i in range(par.N):
        par.Z[i] = par.X[i] + par.Y[i]
    
test(par)
print(par.Z)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


# Test parallelization in Numba and C++

Compile C++ function for comparison:

In [7]:
from consav import cpptools
cpptools.compile('cppfuncs/test_numba',compiler='vs',dllfilename='test_numba_vs')
cpptools.compile('cppfuncs/test_numba',compiler='intel',dllfilename='test_numba_intel')

cpp files compiled
cpp files compiled


Run tests with **different number of threads**:

In [8]:
for threads in THREADS:

    print(f'threads = {threads}')

    print(' threading_layer = tbb')
    runtools.write_numba_config(threads=threads,threading_layer='tbb')
    !python test_numba.py

    print(' threading_layer = omp')
    runtools.write_numba_config(threads=threads,threading_layer='omp')
    !python test_numba.py

    print('')

threads = 1
 threading_layer = tbb
  numba      326725974.7 in 30.2 secs
 threading_layer = omp
  numba      326725974.7 in 26.3 secs
  c++, vs    326725974.7 in 26.7 secs
  c++, intel 326725974.7 in 25.6 secs

threads = 4
 threading_layer = tbb
  numba      326725974.7 in 8.2 secs
 threading_layer = omp
  numba      326725974.7 in 7.5 secs
  c++, vs    326725974.7 in 6.9 secs
  c++, intel 326725974.7 in 7.3 secs

threads = 8
 threading_layer = tbb
  numba      326725974.7 in 4.2 secs
 threading_layer = omp
  numba      326725974.7 in 3.5 secs
  c++, vs    326725974.7 in 3.5 secs
  c++, intel 326725974.7 in 3.0 secs

threads = 16
 threading_layer = tbb
  numba      326725974.7 in 2.3 secs
 threading_layer = omp
  numba      326725974.7 in 1.7 secs
  c++, vs    326725974.7 in 1.8 secs
  c++, intel 326725974.7 in 1.5 secs

threads = 32
 threading_layer = tbb
  numba      326725974.7 in 1.5 secs
 threading_layer = omp
  numba      326725974.7 in 1.5 secs
  c++, vs    326725974.7 in 1.6 se
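
Each configuration above is run in a fresh Python process (`!python test_numba.py`), presumably because Numba reads its thread settings when it is imported. As a hedged alternative sketch that does not rely on `runtools`, the same settings can be passed through Numba's `NUMBA_NUM_THREADS` and `NUMBA_THREADING_LAYER` environment variables. The helper name `run_with_numba_env` is hypothetical, and the child process here only echoes the variables instead of running `test_numba.py`:

```python
import os
import subprocess
import sys

def run_with_numba_env(threads, threading_layer, code):
    """Run `code` in a fresh Python process with Numba's environment variables set."""
    env = os.environ.copy()
    env['NUMBA_NUM_THREADS'] = str(threads)
    env['NUMBA_THREADING_LAYER'] = threading_layer
    result = subprocess.run(
        [sys.executable, '-c', code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# stand-in for the real test script: just report what the child process sees
child_code = (
    'import os; '
    'print(os.environ["NUMBA_NUM_THREADS"], os.environ["NUMBA_THREADING_LAYER"])'
)

for threads in [1, 4]:
    for layer in ['tbb', 'omp']:
        print(run_with_numba_env(threads, layer, child_code))
```

Replacing `[sys.executable, '-c', code]` with `[sys.executable, 'test_numba.py']` would launch the actual test script under the chosen settings.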

**Conclusion:**

1. Numba is roughly as efficient as pure C++ in this test
2. Numba with `threading_layer=tbb` delivers the same speed-up as the intel C++ compiler
3. Numba with `threading_layer=omp` delivers the same speed-up as the vs C++ compiler
4. When there are many threads the intel C++ compiler (or numba with `threading_layer=tbb`) performs best
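
The speed-up behind these conclusions can be computed directly from the printed timings. Here is a minimal sketch using the Numba `threading_layer = omp` timings for 1 to 32 threads, copied from the output above. The helper name `speedup_table` is hypothetical, and efficiency is defined here as speed-up divided by the number of threads:

```python
# Numba timings (secs) with threading_layer = omp, copied from the output above
threads_list = [1, 4, 8, 16, 32]
secs_omp = [26.3, 7.5, 3.5, 1.7, 1.5]

def speedup_table(threads_list, secs):
    """Return (threads, speed-up, efficiency) relative to the first entry."""
    base = secs[0]
    rows = []
    for n, t in zip(threads_list, secs):
        speedup = base/t        # how many times faster than the first entry
        efficiency = speedup/n  # 1.0 would be perfect linear scaling
        rows.append((n, speedup, efficiency))
    return rows

rows = speedup_table(threads_list, secs_omp)
for n, s, e in rows:
    print(f'threads = {n:2d}: speed-up = {s:5.2f}, efficiency = {e:4.2f}')
```

The same function can be applied to the `tbb` and C++ timings to compare how each scales.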

# Reset

In [9]:
runtools.write_numba_config(threads=8,threading_layer='omp') # reset to omp and 8 threads