# Comparing Numba and C++

This notebook provides an example of how **Numba** compares to C++ in terms of speed.

**Computer used for timings:** Windows 10 computer with two Intel(R) Xeon(R) Gold 6254 3.10 GHz CPUs (18 cores and 36 logical processors each) and 768 GB of RAM.

# Imports

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import time
import ctypes as ct
import numpy as np
import numba as nb

In [3]:
from consav import cpptools

In [4]:
DO_INTEL = False

# Numba

In [5]:
print(f'This computer has {nb.config.NUMBA_DEFAULT_NUM_THREADS} CPUs')
print(f'Numba is using {nb.config.NUMBA_NUM_THREADS} CPUs')

threads_list = [x for x in np.arange(1,nb.config.NUMBA_NUM_THREADS+1) if x in [1,2,4,8] or x%8 == 0]
compilers = ['vs','intel'] if DO_INTEL else ['vs']

This computer has 8 CPUs
Numba is using 8 CPUs


In [6]:
nb.config.THREADING_LAYER = 'tbb' # alternative: 'omp'

**Test function:**

In [7]:
# a. test function
@nb.njit(parallel=True)
def test_func(X,Y,Z):
    for i in nb.prange(X.size):
        Z[i] = 0
        for j in range(Y.size):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1
            
# b. settings
NX = 40000
NY = 40000

# c. random draws
np.random.seed(1998)
X = np.random.sample(NX)
Y = np.random.sample(NY)
Z = np.zeros(NX)
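
As a sanity check on the checksums reported in the timed runs, note that `np.exp(np.log(a+0.001))` equals `a+0.001` up to rounding, so each term in the inner loop reduces to `0.001/(X[i]*Y[j])`, and the sum over all of `Z` factorizes into `0.001*sum(1/X)*sum(1/Y)`. The following is a minimal pure-Python sketch on small hypothetical inputs; the names `brute_force`, `closed_form`, `X_small` and `Y_small` are illustrative and not part of the notebook:

```python
import math
import random

def brute_force(X, Y):
    """Sum of Z as computed by the double loop in test_func."""
    total = 0.0
    for x in X:
        for y in Y:
            total += math.exp(math.log(x*y + 0.001))/(x*y) - 1
    return total

def closed_form(X, Y):
    """Same sum, using exp(log(a+0.001))/a - 1 = 0.001/a."""
    return 0.001*sum(1/x for x in X)*sum(1/y for y in Y)

random.seed(1998)
# assumption: draws are kept away from zero so the small example is well-conditioned
X_small = [random.uniform(0.1, 1.0) for _ in range(50)]
Y_small = [random.uniform(0.1, 1.0) for _ in range(60)]

bf = brute_force(X_small, Y_small)
cf = closed_form(X_small, Y_small)
print(math.isclose(bf, cf, rel_tol=1e-6))
```

The test function is thus deliberately wasteful: the `exp` and `log` calls only serve to create a compute-heavy inner loop for the benchmark.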

**Test runs** (the first call compiles the function, so it is kept out of the timings):

In [8]:
NYtest = 2
Ytest = np.random.sample(NYtest)
test_func(X,Ytest,Z)

**Timed runs:**

In [9]:
for threads in threads_list:

    # a. set threads
    nb.set_num_threads(threads)

    # b. run
    tic = time.time()
    test_func(X,Y,Z)
    toc = time.time()

    print(f'{nb.threading_layer()} with {threads:2d} threads in {toc-tic:4.1f} secs [checksum: {np.sum(Z):.1f}]')

tbb with  1 threads in 29.4 secs [checksum: 326725974.7]
tbb with  2 threads in 15.3 secs [checksum: 326725974.7]
tbb with  4 threads in  8.9 secs [checksum: 326725974.7]
tbb with  8 threads in  7.5 secs [checksum: 326725974.7]
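
To make the scaling easier to read, the timings above can be converted into a speed-up relative to one thread and a parallel efficiency (speed-up divided by the number of threads). This is a small sketch using the printed numbers; the helper `speedup_table` is hypothetical and not part of the notebook:

```python
# timings in seconds copied from the Numba output above
timings = {1: 29.4, 2: 15.3, 4: 8.9, 8: 7.5}

def speedup_table(timings):
    """Return {threads: (speed-up, parallel efficiency)} relative to 1 thread."""
    base = timings[1]
    table = {}
    for threads, secs in timings.items():
        speedup = base/secs
        table[threads] = (speedup, speedup/threads)
    return table

numba_table = speedup_table(timings)
for threads, (speedup, efficiency) in numba_table.items():
    print(f'{threads:2d} threads: speed-up {speedup:4.2f}, efficiency {efficiency:4.2f}')
```

In these runs the efficiency falls from about 0.96 with 2 threads to about 0.49 with 8 threads, so most of the gain is realized by 4 threads.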


# C++

**Link** C++ functions (`cppfuncs/compare_with_numba.cpp`):

In [10]:
filename = 'cppfuncs/compare_with_numba.cpp'
compare_with_numba_vs = cpptools.link_to_cpp(filename,options={'compiler':'vs','dllfilename':'example_numba_vs.dll'})
if DO_INTEL: compare_with_numba_intel = cpptools.link_to_cpp(filename,options={'compiler':'intel','dllfilename':'example_numba_intel.dll'})
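
`cpptools.link_to_cpp` hides the details of calling into the compiled DLL. As a rough illustration of the general mechanism only (an assumption about how such wrappers typically work, not a description of what `consav` does internally), a NumPy array can be exposed to C code as a raw `double*` via `ctypes`, so that writes through the pointer modify the array in place:

```python
import ctypes as ct
import numpy as np

# hypothetical small array standing in for Z
Z_demo = np.zeros(4)

# raw pointer to the array's underlying buffer (no copy is made)
p_Z = Z_demo.ctypes.data_as(ct.POINTER(ct.c_double))

# write through the pointer, as a C function receiving double* would do
for i in range(Z_demo.size):
    p_Z[i] = 2.0*i

print(Z_demo)
```

This direct approach assumes a contiguous array whose dtype matches the C type (here `float64` and `double`).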

**Timed runs:**

In [11]:
for compiler in compilers:    
    for threads in threads_list:    
        
        tic = time.time()
        if compiler == 'vs':
            compare_with_numba_vs.test_func(X,Y,Z,NX,NY,threads)
        else:
            compare_with_numba_intel.test_func(X,Y,Z,NX,NY,threads)    
        toc = time.time()
        
        print(f'{compiler} with {threads:2d} in {toc-tic:4.1f} secs [checksum: {np.sum(Z):.1f}]')
    
    print('')

vs with  1 in 27.4 secs [checksum: 326725974.7]
vs with  2 in 14.6 secs [checksum: 326725974.7]
vs with  4 in  9.1 secs [checksum: 326725974.7]
vs with  8 in  7.4 secs [checksum: 326725974.7]
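
To compare the two sets of timings directly, the ratio of Numba time to C++ time can be computed per thread count. This is a sketch based only on the numbers printed above; the variable names are illustrative:

```python
# timings in seconds copied from the outputs above
numba_secs = {1: 29.4, 2: 15.3, 4: 8.9, 8: 7.5}
vs_secs = {1: 27.4, 2: 14.6, 4: 9.1, 8: 7.4}

# a ratio above 1 means Numba was slower than the C++ code compiled with 'vs'
ratios = {threads: numba_secs[threads]/vs_secs[threads] for threads in numba_secs}

for threads, ratio in ratios.items():
    print(f'{threads:2d} threads: Numba/C++ time ratio = {ratio:4.2f}')
```

In these runs the two differ by less than 10 percent at every thread count. Each timing is a single run, so small differences should not be over-interpreted.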



**Clean-up:**

In [12]:
compare_with_numba_vs.clean_up()
if DO_INTEL: compare_with_numba_intel.clean_up()

# Conclusions

1. Numba is almost as efficient as pure C++ (7.5 vs. 7.4 secs with 8 threads in the runs above)
2. In C++, performance is best with `compiler='intel'`, especially with more than 36 threads
3. With `nb.config.THREADING_LAYER = 'tbb'`, performance is similar to `compiler='intel'`
4. With `nb.config.THREADING_LAYER = 'omp'`, performance is similar to `compiler='vs'`