# Parallel functionality of Numba

 - Elwin van 't Wout
 - Pontificia Universidad Católica de Chile
 - IMT3870
 - 28-8-2023

Sum the values of a vector and compare the timings of pure Python, NumPy, and serial and parallel Numba versions.

In [1]:
import numpy as np
from numba import jit, prange

In [2]:
def sum_vector_python(a):
    s = 0
    for i in range(a.size):
        s += a[i]
    return s   

In [3]:
def sum_vector_numpy(a):
    s = np.sum(a)
    return s   

For Numba, we can use exactly the same function as before but with the Numba decorator added. As the first version, we use the Numba optimisation (```nopython=True```) but without parallelisation (```parallel=False```).

In [4]:
@jit(nopython=True, parallel=False)
def sum_vector_numba_serial(a):
    s = 0
    for i in range(a.size):
        s += a[i]
    return s   

Adding the option ```parallel=True``` to the Numba decorator makes Numba search for parts of the code that can be parallelised automatically. This only works when ```nopython=True```.

In [5]:
@jit(nopython=True, parallel=True)
def sum_vector_numba_parallel(a):
    s = 0
    for i in range(a.size):
        s += a[i]
    return s   

Instead of letting Numba search for parallelisation opportunities, you can explicitly state that a for loop needs to be parallelised. Use the function ```prange()``` instead of the standard ```range()``` in the for loop. In this case, Numba automatically detects that the variable ```s``` is updated in every iteration and treats it as a reduction variable shared between the threads.

In [6]:
@jit(nopython=True, parallel=True)
def sum_vector_numba_prange(a):
    s = 0
    for i in prange(a.size):
        s += a[i]
    return s   
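Conceptually, a parallel reduction lets each thread accumulate a partial sum over its own part of the vector and combines the partial sums at the end. The sketch below imitates this idea serially with plain NumPy. It only illustrates the principle and is not how Numba implements ```prange()``` internally; the function ```chunked_sum``` and its parameter ```n_chunks``` are hypothetical names.

```python
import numpy as np


def chunked_sum(a, n_chunks):
    """Sum `a` by adding per-chunk partial sums, mimicking a reduction."""
    # Split the vector into (almost) equally sized chunks, one per "thread".
    chunks = np.array_split(a, n_chunks)
    # Each chunk gets its own private partial sum.
    partial_sums = [int(chunk.sum(dtype=np.int64)) for chunk in chunks]
    # The final result combines the partial sums.
    return sum(partial_sums)


vec_small = np.arange(1000)
print(chunked_sum(vec_small, 4))  # 499500
```

Because integer addition is associative, the result does not depend on the number of chunks. For floating-point data, the order of the additions can change the rounding slightly.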

Let us create a vector with elements $0,1,2,\dots,n-1$ and calculate the sum.

In [7]:
n = int(1e7)
vec = np.arange(n)
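The expected result is known in closed form, since $0+1+\dots+(n-1) = n(n-1)/2$. A short check, using only Python integers so that no overflow can occur:

```python
n = int(1e7)

# Closed-form sum of 0, 1, ..., n-1.
expected_sum = n * (n - 1) // 2
print("Expected sum:", expected_sum)  # 49999995000000
```

This value can be used to verify the output of each implementation.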

Before performing the timings, call the Numba functions once so that they are compiled and the compilation time is not included in the measurements.

In [8]:
print("Sum of vector with serial Numba:", sum_vector_numba_serial(vec))
print("Sum of vector with parallel Numba:", sum_vector_numba_parallel(vec))
print("Sum of vector with prange Numba:", sum_vector_numba_prange(vec))

Sum of vector with serial Numba: 49999995000000
Sum of vector with parallel Numba: 49999995000000


The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see https://numba.readthedocs.io/en/stable/user/parallel.html#diagnostics for help.

File "C:\Users\Shesc\AppData\Local\Temp\ipykernel_3988\1566478512.py", line 2:
@jit(nopython=True, parallel=True)
def sum_vector_numba_parallel(a):
^


Sum of vector with prange Numba: 49999995000000


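Numba issues a warning when it is not able to perform the requested optimisation of the code. Here, the option ```parallel=True``` found nothing to parallelise in ```sum_vector_numba_parallel``` because its loop uses the standard ```range()```.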

In [9]:
%%timeit
sum_vector_python(vec)

  s += a[i]


1.43 s ± 73.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
%%timeit
sum_vector_numpy(vec)

3.81 ms ± 61.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
%%timeit
sum_vector_numba_serial(vec)

2.38 ms ± 98.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [12]:
%%timeit
sum_vector_numba_parallel(vec)

2.36 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [13]:
%%timeit
sum_vector_numba_prange(vec)

1.51 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
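To compare the versions, the mean timings reported above can be turned into speed-up factors. The numbers below are copied from this particular run and will differ on other machines.

```python
# Mean timings reported by %%timeit above, in seconds.
timings = {
    "python": 1.43,
    "numpy": 3.81e-3,
    "numba_serial": 2.38e-3,
    "numba_parallel": 2.36e-3,
    "numba_prange": 1.51e-3,
}

# Speed-up of each version relative to the pure Python loop.
speedups = {name: timings["python"] / t for name, t in timings.items()}
for name, factor in speedups.items():
    print(f"{name:>15}: {factor:7.1f}x relative to pure Python")

# Gain of the prange version over the serial Numba version.
prange_gain = timings["numba_serial"] / timings["numba_prange"]
print(f"prange vs serial Numba: {prange_gain:.2f}x")
```

In this run, the version with ```parallel=True``` but without ```prange()``` is as fast as the serial Numba version, which is consistent with the warning that no parallel transformation was possible. The ```prange()``` version is about 1.6 times faster than the serial Numba version.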


The number of CPUs detected by Numba and the number of threads it uses are stored in global variables of the module ```numba.config```.

In [14]:
from numba import config
print("The number of available CPUs detected by Numba is:", config.NUMBA_DEFAULT_NUM_THREADS)
print("The number of threads used by Numba is:", config.NUMBA_NUM_THREADS)

The number of available CPUs detected by Numba is: 6
The number of threads used by Numba is: 6


The number of threads used by Numba can be changed manually with ```set_num_threads()```. The value must not exceed ```config.NUMBA_NUM_THREADS```.

In [15]:
from numba import set_num_threads, get_num_threads
set_num_threads(2)
print("The current number of threads used by Numba is:", get_num_threads())

The current number of threads used by Numba is: 2


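Numba can report which loops it parallelised through its optimisation diagnostics. The following function nests four ```prange()``` loops: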

In [38]:
@jit(nopython=True, parallel=True)
def multiple_prange(size):
    s = 0
    for i in prange(size):
        for j in prange(size):
            for l in prange(size):
                for m in prange(size):
                    s += i*j*l*m
    return s   

In [39]:
# The first call includes the compilation time; %timeit then measures the compiled function.
%time multiple_prange(100000000)
%timeit multiple_prange(100000000)


CPU times: total: 78.1 ms
Wall time: 469 ms
3.79 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [40]:
multiple_prange.parallel_diagnostics(level=4)

 
 Parallel Accelerator Optimizing:  Function multiple_prange, 
C:\Users\Shesc\AppData\Local\Temp\ipykernel_3988\3333088043.py (1)  


Parallel loop listing for  Function multiple_prange, C:\Users\Shesc\AppData\Local\Temp\ipykernel_3988\3333088043.py (1) 
------------------------------------------|loop #ID
@jit(nopython=True, parallel=True)        | 
def multiple_prange(size):                | 
    s = 0                                 | 
    for i in prange(size):----------------| #20
        for j in prange(size):------------| #19
            for l in prange(size):--------| #18
                for m in prange(size):----| #17
                    s += i*j*l*m          | 
    return s                              | 
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
----------------------------- Before Optimisation ------------------------------
Parallel region 0:
+--20 (paralle