# Working with Numba

This notebook provides some examples of how to work with **Numba** and compares the speed-up with C++.

From the **consav** package we will use the **runtools** module to control the behavior of **Numba**.

**Links:**

- [Supported Python features](https://numba.pydata.org/numba-doc/dev/reference/pysupported.html)
- [Supported Numpy features](https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html)

**Requirements:** You must have these two compilers installed:

* **vs**: Free *Microsoft Visual Studio 2017 Community Edition* ([link](https://visualstudio.microsoft.com/downloads/))
* **intel:** Costly *Intel Parallel Studio 2018 Composer Edition* ([link](https://software.intel.com/en-us/parallel-studio-xe))

**Computer used for timings:** Windows 10 computer with two Intel(R) Xeon(R) Gold 6154 3.00 GHz CPUs (18 cores, 36 logical processors each) and 192 GB of RAM.

In [1]:
THREADS = [1,4,8,16,32,64]

# Decorating Python functions

Imports and numba settings:

In [2]:
import time
import numpy as np

from consav import runtools
runtools.write_numba_config(threads=8,threading_layer='tbb')
import numba as nb # must be imported after write_numba_config!
#nb.config.__dict__ # see all config options

## Functions

In [3]:
def test_standard(X,Y,Z,NX,NY):

    # X is length NX
    # Y is length NY
    # Z is length NX

    for i in range(NX):
        Z[i] = 0
        for j in range(NY):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1

# note: without parallel=True, nb.prange behaves like range (serial loop)
@nb.njit
def test(X,Y,Z,NX,NY):
    for i in nb.prange(NX):
        Z[i] = 0
        for j in range(NY):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1
            
@nb.njit(parallel=True)
def test_par(X,Y,Z,NX,NY):
    for i in nb.prange(NX):
        Z[i] = 0
        for j in range(NY):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1

@nb.njit(parallel=True,fastmath=True)
def test_par_fast(X,Y,Z,NX,NY):
    for i in nb.prange(NX):
        Z[i] = 0
        for j in range(NY):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1

## Settings

Choose settings and make random draws:

In [4]:
# a. settings
NX = 100
NY = 20000

# b. random draws
np.random.seed(1998)
X = np.random.sample(NX)
Y = np.random.sample(NY)
Z = np.zeros(NX)

## Examples

In [5]:
tic = time.time()
test_standard(X,Y,Z,NX,NY)
toc = time.time()
print(f'{"python":10s} {np.sum(Z):.8f} in {toc-tic:4.2f} secs')

test(X,Y,Z,NX,NY) # test run
tic = time.time()
test(X,Y,Z,NX,NY)
toc = time.time()
print(f'{"numba":10s} {np.sum(Z):.8f} in {toc-tic:4.2f} secs')

test_par(X,Y,Z,NX,NY) # test run
tic = time.time()
test_par(X,Y,Z,NX,NY)
toc = time.time()
print(f'{"numba par":10s} {np.sum(Z):.8f} in {toc-tic:4.2f} secs')

test_par_fast(X,Y,Z,NX,NY) # test run
tic = time.time()
test_par_fast(X,Y,Z,NX,NY)
toc = time.time()
print(f'{"numba fast":10s} {np.sum(Z):.8f} in {toc-tic:4.2f} secs')

python     182878.74978038 in 14.73 secs
numba      182878.74978038 in 0.07 secs
numba par  182878.74978038 in 0.02 secs
numba fast 182878.74978038 in 0.00 secs


**Conclusion:** Numba delivers a huge speed-up over pure Python (from 14.73 secs to 0.07 secs without parallelization). The `fastmath=True` option seems to be able to do some pure compiler magic on top of this.
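
The timing pattern used above (one untimed call to trigger compilation, then a timed call) can be wrapped in a small helper. This is only a sketch: `time_call` is a hypothetical name, and `time.perf_counter` is swapped in for `time.time` because it has a higher resolution, which matters for timings that round to 0.00 secs:

```python
import time

def time_call(func, *args, reps=5):
    """Return the best wall-clock time (in secs) of `reps` calls to func(*args)."""
    func(*args)  # warm-up call (for a jitted function this triggers compilation)
    best = float('inf')
    for _ in range(reps):
        tic = time.perf_counter()
        func(*args)
        toc = time.perf_counter()
        best = min(best, toc-tic)
    return best

# illustrative usage with a plain Python function
def sum_of_squares(n):
    return sum(i*i for i in range(n))

secs = time_call(sum_of_squares, 100_000)
print(f'{"python":10s} in {secs:.6f} secs')
```

With the functions from this notebook the call would be, e.g., `time_call(test_par_fast,X,Y,Z,NX,NY)`.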

## namedtuple

To pass around a large number of variables, **namedtuples** are useful. They need to be declared, but can then be used in Numba functions. They can be derived from e.g. a **SimpleNamespace** (or a dictionary).

In [6]:
from types import SimpleNamespace
from collections import namedtuple

In [7]:
# a. setup SimpleNamespace
par = SimpleNamespace()
par.N = 10
par.X = np.random.uniform(size=par.N)
par.Y = np.random.uniform(size=par.N)
par.Z = np.zeros(par.N)

# b. namedtuple
ParClass = namedtuple('ParClass',list(par.__dict__.keys()))
par_ = ParClass(**par.__dict__)

# c. define function
@nb.jit
def test(par):
    for i in range(par.N):
        par.Z[i] = par.X[i] + par.Y[i]
    
# d. call function and print output
test(par_)
print(par.Z) # par.Z and par_.Z refer to the same array

[1.04038069 1.09288527 1.38327145 0.19575595 0.90414522 1.46588398
 0.9836273  1.0594155  0.0704249  0.86774585]
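
The result is printed from `par.Z` even though the function received `par_`. A namedtuple is immutable, but its fields hold references to the same arrays as the namespace, so in-place writes are visible through both. A minimal sketch without Numba (`to_namedtuple` and the `par_demo` names are hypothetical):

```python
from types import SimpleNamespace
from collections import namedtuple
import numpy as np

def to_namedtuple(ns, name='ParClass'):
    """Build a namedtuple with the same fields as a SimpleNamespace."""
    fields = list(vars(ns))
    return namedtuple(name, fields)(**vars(ns))

par_demo = SimpleNamespace()
par_demo.N = 3
par_demo.X = np.array([1.0, 2.0, 3.0])
par_demo.Z = np.zeros(par_demo.N)

par_demo_nt = to_namedtuple(par_demo)

par_demo_nt.Z[:] = 2*par_demo_nt.X   # in-place write through the namedtuple
print(par_demo.Z)                    # the namespace sees the same array

try:
    par_demo_nt.N = 5                # rebinding a field is not allowed
except AttributeError as e:
    print('AttributeError:', e)
```

The flip side is that scalars such as `N` cannot be updated through the namedtuple; a new namedtuple has to be built from the namespace after changing them.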


# Test parallelization in Numba and C++

Compile C++ function for comparison:

In [8]:
from consav import cpptools
cpptools.compile('cppfuncs/example_numba',compiler='vs',dllfilename='example_numba_vs')
#cpptools.compile('cppfuncs/example_numba',compiler='intel',dllfilename='example_numba_intel')

cpp files compiled


Run tests with **different number of threads**:

In [None]:
for threads in THREADS:
    
    print(f'threads = {threads}')

    print(f' threading_layer = tbb')
    runtools.write_numba_config(threads=threads,threading_layer='tbb')
    !python test_numba.py

    print(f'\n threading_layer = omp')
    runtools.write_numba_config(threads=threads,threading_layer='omp')
    !python test_numba.py

    print('')

threads = 1
 threading_layer = tbb
  numba      326725974.7 in 57.8 secs

 threading_layer = omp
  numba      326725974.7 in 54.7 secs
  C++, vs    326725974.7 in 27.4 secs

threads = 4
 threading_layer = tbb
  numba      326725974.7 in 8.6 secs

 threading_layer = omp
  numba      326725974.7 in 9.1 secs
  C++, vs    326725974.7 in 8.4 secs

threads = 8
 threading_layer = tbb
  numba      326725974.7 in 7.0 secs

 threading_layer = omp
  numba      326725974.7 in 7.4 secs
  C++, vs    326725974.7 in 7.7 secs

threads = 16
 threading_layer = tbb
  numba      326725974.7 in 7.3 secs

 threading_layer = omp
  numba      326725974.7 in 6.7 secs
  C++, vs    326725974.7 in 7.1 secs

threads = 32
 threading_layer = tbb
  numba      326725974.7 in 7.0 secs

 threading_layer = omp
  numba      326725974.7 in 7.1 secs
  C++, vs    326725974.7 in 6.8 secs

threads = 64
 threading_layer = tbb
  numba      326725974.7 in 6.8 secs

 threading_layer = omp



**Conclusion:**

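1. With 4 or more threads, Numba is about as fast as pure C++; with a single thread, the vs-compiled C++ code is about twice as fast (27.4 vs. 54.7 secs)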
2. Numba with `threading_layer=tbb` delivers the same speed-up as the intel C++ compiler
3. Numba with `threading_layer=omp` delivers the same speed-up as the vs C++ compiler
4. When there are many threads the intel C++ compiler (or numba with `threading_layer=tbb`) performs best
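
To put numbers on the scaling, the Numba timings printed above can be converted into speed-ups relative to the single-thread run. The dictionary below only copies the reported timings; the 64-thread `omp` run is left out because its timing does not appear in the output:

```python
# Numba timings in secs, copied from the output above: {threads: secs}
timings = {
    'tbb': {1: 57.8, 4: 8.6, 8: 7.0, 16: 7.3, 32: 7.0, 64: 6.8},
    'omp': {1: 54.7, 4: 9.1, 8: 7.4, 16: 6.7, 32: 7.1},
}

# speed-up = single-thread time / multi-thread time
speedup = {}
for layer, secs_by_threads in timings.items():
    base = secs_by_threads[1]
    speedup[layer] = {t: base/secs for t, secs in secs_by_threads.items()}

for layer, values in speedup.items():
    print(f'threading_layer = {layer}')
    for t, s in values.items():
        print(f' threads = {t:2d}: speed-up = {s:4.1f}')
```

Most of the gain is realized by 4 to 8 threads; beyond that the timings stay flat at roughly 7 secs.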

# Reset

In [None]:
runtools.write_numba_config(threads=8,threading_layer='omp') # reset to omp and 8 threads