# Comparing Numba and C++

This notebook provides an example of how **Numba** compares with C++ in terms of speed-up.

**Computer used for timings:** Windows 10 computer with 2x Intel(R) Xeon(R) Gold 6254 3.10 GHz CPUs (18 cores, 36 logical processors each) and 768 GB of RAM.

# Imports

In [1]:
%load_ext autoreload
%autoreload 2

import time
import ctypes as ct
import numpy as np
import numba as nb

In [2]:
from consav import cpptools

In [3]:
DO_INTEL = True

# Numba

In [4]:
print(f'This computer has {nb.config.NUMBA_DEFAULT_NUM_THREADS} CPUs')
print(f'Numba is using {nb.config.NUMBA_NUM_THREADS} CPUs')

threads_list = [x for x in np.arange(1,nb.config.NUMBA_NUM_THREADS+1) if x in [1,4,8] or x%8 == 0]
compilers = ['vs','intel'] if DO_INTEL else ['vs']

This computer has 72 CPUs
Numba is using 72 CPUs


In [5]:
nb.config.THREADING_LAYER = 'omp' # alternative: 'tbb'

**Test function:**

In [6]:
# a. test function
@nb.njit(parallel=True)
def test_func(X,Y,Z):
    for i in nb.prange(X.size):
        Z[i] = 0
        for j in range(Y.size):
            Z[i] += np.exp(np.log(X[i]*Y[j]+0.001))/(X[i]*Y[j])-1
            
# b. settings
NX = 40000
NY = 40000

# c. random draws
np.random.seed(1998)
X = np.random.sample(NX)
Y = np.random.sample(NY)
Z = np.zeros(NX)
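
A side note on the test function: since `exp(log(a)) = a`, each inner term reduces algebraically to `0.001/(X[i]*Y[j])`, so the `exp` and `log` calls only add computational work. The sketch below checks this with the standard library alone. The helper names `term_as_written` and `term_simplified` are hypothetical, and the draws are restricted to `[0.1, 1.0]` (an assumption made here, unlike `np.random.sample` above) to keep the products away from zero.

```python
import math
import random

def term_as_written(x, y):
    # inner term exactly as it appears in test_func
    return math.exp(math.log(x*y + 0.001))/(x*y) - 1

def term_simplified(x, y):
    # algebraic simplification: (x*y + 0.001)/(x*y) - 1 = 0.001/(x*y)
    return 0.001/(x*y)

# small illustrative sample (assumption: values kept in [0.1, 1.0])
random.seed(1998)
xs = [random.uniform(0.1, 1.0) for _ in range(5)]
ys = [random.uniform(0.1, 1.0) for _ in range(5)]

# largest relative difference between the two forms over all pairs
max_rel_diff = max(
    abs(term_as_written(x, y) - term_simplified(x, y))/term_simplified(x, y)
    for x in xs for y in ys
)
print(f'max relative difference: {max_rel_diff:.2e}')
```

The two forms should agree up to floating-point rounding, which gives a cheap way to sanity-check a checksum on a small input.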

**Test runs:**

In [7]:
NYtest = 2
Ytest = np.random.sample(NYtest)
test_func(X,Ytest,Z)

**Timed runs:**

In [8]:
for threads in threads_list:

    # a. set threads
    nb.set_num_threads(threads)

    # b. run
    tic = time.time()
    test_func(X,Y,Z)
    toc = time.time()

    print(f'{nb.threading_layer()} with {threads:2d} threads in {toc-tic:4.1f} secs [checksum: {np.sum(Z):.1f}]')

omp with  1 threads in 24.6 secs [checksum: 326725974.7]
omp with  4 threads in  6.3 secs [checksum: 326725974.7]
omp with  8 threads in  3.1 secs [checksum: 326725974.7]
omp with 16 threads in  1.6 secs [checksum: 326725974.7]
omp with 24 threads in  1.6 secs [checksum: 326725974.7]
omp with 32 threads in  1.2 secs [checksum: 326725974.7]
omp with 40 threads in  1.4 secs [checksum: 326725974.7]
omp with 48 threads in  1.4 secs [checksum: 326725974.7]
omp with 56 threads in  1.3 secs [checksum: 326725974.7]
omp with 64 threads in  1.3 secs [checksum: 326725974.7]
omp with 72 threads in  1.4 secs [checksum: 326725974.7]
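
To read the scaling more easily, the printed timings can be turned into speed-up and parallel efficiency relative to the single-thread run. This is a minimal sketch: the numbers are copied by hand from the output above, and `speedup_table` is a hypothetical helper, not part of Numba.

```python
# wall-clock seconds copied from the Numba output above: threads -> secs
timings = {1: 24.6, 4: 6.3, 8: 3.1, 16: 1.6, 24: 1.6, 32: 1.2,
           40: 1.4, 48: 1.4, 56: 1.3, 64: 1.3, 72: 1.4}

def speedup_table(timings):
    """Return (threads, speed-up, efficiency) rows relative to 1 thread."""
    base = timings[1]
    rows = []
    for threads, secs in sorted(timings.items()):
        speedup = base/secs           # how many times faster than 1 thread
        efficiency = speedup/threads  # share of ideal linear scaling
        rows.append((threads, speedup, efficiency))
    return rows

for threads, speedup, efficiency in speedup_table(timings):
    print(f'{threads:2d} threads: speed-up {speedup:5.1f}x, efficiency {efficiency:4.0%}')
```

Because the timings are rounded to one decimal, the ratios at high thread counts are only approximate.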


# C++

**Link** C++ functions (`cppfuncs/compare_with_numba.cpp`):

In [9]:
filename = 'cppfuncs/compare_with_numba.cpp'
compare_with_numba_vs = cpptools.link_to_cpp(filename,options={'compiler':'vs','dllfilename':'example_numba_vs.dll'})
if DO_INTEL: compare_with_numba_intel = cpptools.link_to_cpp(filename,options={'compiler':'intel','dllfilename':'example_numba_intel.dll'})

**Timed runs:**

In [10]:
for compiler in compilers:    
    for threads in threads_list:    
        
        tic = time.time()
        if compiler == 'vs':
            compare_with_numba_vs.test_func(X,Y,Z,NX,NY,threads)
        else:
            compare_with_numba_intel.test_func(X,Y,Z,NX,NY,threads)    
        toc = time.time()
        
        print(f'{compiler} with {threads:2d} in {toc-tic:4.1f} secs [checksum: {np.sum(Z):.1f}]')
    
    print('')

vs with  1 in 24.7 secs [checksum: 326725974.7]
vs with  4 in  6.3 secs [checksum: 326725974.7]
vs with  8 in  3.2 secs [checksum: 326725974.7]
vs with 16 in  1.6 secs [checksum: 326725974.7]
vs with 24 in  1.6 secs [checksum: 326725974.7]
vs with 32 in  1.3 secs [checksum: 326725974.7]
vs with 40 in  1.3 secs [checksum: 326725974.7]
vs with 48 in  1.3 secs [checksum: 326725974.7]
vs with 56 in  1.3 secs [checksum: 326725974.7]
vs with 64 in  1.3 secs [checksum: 326725974.7]
vs with 72 in  1.3 secs [checksum: 326725974.7]

intel with  1 in 22.1 secs [checksum: 326725974.7]
intel with  4 in  5.7 secs [checksum: 326725974.7]
intel with  8 in  2.9 secs [checksum: 326725974.7]
intel with 16 in  1.4 secs [checksum: 326725974.7]
intel with 24 in  1.4 secs [checksum: 326725974.7]
intel with 32 in  1.2 secs [checksum: 326725974.7]
intel with 40 in  0.9 secs [checksum: 326725974.7]
intel with 48 in  0.8 secs [checksum: 326725974.7]
intel with 56 in  0.7 secs [checksum: 326725974.7]
intel with 6

**Clean-up:**

In [11]:
compare_with_numba_vs.clean_up()
if DO_INTEL: compare_with_numba_intel.clean_up()

# Conclusions

1. Numba is almost as efficient as pure C++.
2. In C++, performance is best with `compiler='intel'`, especially with more than 36 threads.
3. With `nb.config.THREADING_LAYER = 'tbb'`, performance is similar to `compiler='intel'`.
4. With `nb.config.THREADING_LAYER = 'omp'`, performance is similar to `compiler='vs'`.
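
The comparisons behind these conclusions can be made explicit by dividing the Numba timings by the C++ timings at matching thread counts. This is a rough sketch using the rounded numbers printed above. Only thread counts up to 56 are included, since the `intel` output shown is cut off after that, and `relative_time` is a hypothetical helper.

```python
# wall-clock seconds copied from the printed output: threads -> secs
numba_omp = {1: 24.6, 4: 6.3, 8: 3.1, 16: 1.6, 24: 1.6, 32: 1.2, 40: 1.4, 48: 1.4, 56: 1.3}
cpp_vs    = {1: 24.7, 4: 6.3, 8: 3.2, 16: 1.6, 24: 1.6, 32: 1.3, 40: 1.3, 48: 1.3, 56: 1.3}
cpp_intel = {1: 22.1, 4: 5.7, 8: 2.9, 16: 1.4, 24: 1.4, 32: 1.2, 40: 0.9, 48: 0.8, 56: 0.7}

def relative_time(a, b):
    """Return {threads: a/b} for the thread counts present in both dicts."""
    return {t: a[t]/b[t] for t in sorted(a.keys() & b.keys())}

numba_vs_vs = relative_time(numba_omp, cpp_vs)
numba_vs_intel = relative_time(numba_omp, cpp_intel)

# a ratio above 1 means Numba (omp) was slower than the C++ build
for t in numba_vs_vs:
    print(f'{t:2d} threads: numba/vs = {numba_vs_vs[t]:.2f}, numba/intel = {numba_vs_intel[t]:.2f}')
```

Ratios close to 1 against `vs` at every thread count, and clearly above 1 against `intel` beyond 36 threads, are what conclusions 1, 2, and 4 describe.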