## Slow Loops

Pure-Python loops are slow for element-wise numeric work because every iteration goes through the interpreter.

In [1]:
import numpy as np

Here's a function that computes the sum of the log of all non-zero values.

In [2]:
def sum_log_nz(ary):
    res = np.zeros(ary.shape[0])
    for i in range(ary.shape[0]):
        v = ary[i] 
        if v != 0:
            res[i] = np.log(v)
    return res.sum()
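
To make the behavior concrete, here is a small hand-checkable case. The input values are chosen for illustration and are not from the original notebook. Zeros are skipped, so only `log(1)` and `log(e)` contribute and the result should be approximately 1.

```python
import numpy as np

def sum_log_nz(ary):
    # Same function as in the cell above, repeated so this snippet runs on its own.
    res = np.zeros(ary.shape[0])
    for i in range(ary.shape[0]):
        v = ary[i]
        if v != 0:
            res[i] = np.log(v)
    return res.sum()

# Illustrative input (an assumption, not notebook data).
small = np.array([1.0, 0.0, np.e, 0.0])
total = sum_log_nz(small)
print(total)  # log(1) + log(e); the two zeros contribute nothing
```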

Test the function

In [3]:
a = np.random.random(5_000_000)

In [4]:
a

array([0.18467213, 0.33784349, 0.08921073, ..., 0.06290937, 0.97608245,
       0.87281367])

In [5]:
sum_log_nz(a)

-4999748.670592799

Time the function

In [6]:
%%time 
sum_log_nz(a)

CPU times: user 20.8 s, sys: 69.8 ms, total: 20.9 s
Wall time: 21 s


-4999748.670592799

## SIMD Loops

Numba can compile the inefficient pure-Python loop into a SIMD-vectorized native loop.

In [7]:
import numba

Try compiling the function with Numba.

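Compare the timings with `fastmath=True` and `fastmath=False`. With `fastmath=True`, the compiler may relax strict IEEE 754 semantics, for example by reordering floating-point additions, which makes it easier to vectorize the loop.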

In [8]:
fast_sum_log_nz = numba.njit(fastmath=True)(sum_log_nz)
fast_sum_log_nz

CPUDispatcher(<function sum_log_nz at 0x7f378b8bd630>)

In [9]:
fast_sum_log_nz(a)

-4999748.67059264

Notice the improved performance: roughly 0.4 s of wall time here versus 21 s for the pure-Python loop.

In [10]:
%%time

fast_sum_log_nz(a)

CPU times: user 371 ms, sys: 28.6 ms, total: 400 ms
Wall time: 371 ms


-4999748.67059264

In [11]:
slow_sum_log_nz = numba.njit(fastmath=False)(sum_log_nz)


In [12]:
%%time
slow_sum_log_nz(a)

CPU times: user 837 ms, sys: 15.5 ms, total: 853 ms
Wall time: 859 ms


-4999748.67059264

In [13]:
fast_sum_log_nz.inspect_cfg(fast_sum_log_nz.signatures[0]).display()

ModuleNotFoundError: No module named 'graphviz'

## Parallel Loops

Numba can auto-parallelize the function to leverage multiple threads.

In [14]:
par_sum_log_nz = numba.njit(parallel=True)(sum_log_nz)

In [15]:
par_sum_log_nz(a)

-4999748.670592664

Use `.parallel_diagnostics()` to inspect what the compiler has done to optimize the function.

Note:
* the manually written `for` loop is not recognized as a parallel loop; only the `np.zeros` call and the `.sum()` call receive loop IDs.

In [16]:
par_sum_log_nz.parallel_diagnostics()

 
 Parallel Accelerator Optimizing:  Function sum_log_nz, 
/tmp/ipykernel_12394/958321262.py (1)  


Parallel loop listing for  Function sum_log_nz, /tmp/ipykernel_12394/958321262.py (1) 
-------------------------------------|loop #ID
def sum_log_nz(ary):                 | 
    res = np.zeros(ary.shape[0])-----| #0
    for i in range(ary.shape[0]):    | 
        v = ary[i]                   | 
        if v != 0:                   | 
            res[i] = np.log(v)       | 
    return res.sum()-----------------| #1
------------------------------ After Optimisation ------------------------------
Parallel structure is already optimal.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 


Use `numba.prange` to mark a loop for parallelization.

In [17]:
@numba.njit(parallel=True, fastmath=True)
def par_sum_log_nz(ary):
    res = np.zeros(ary.shape[0])
    for i in numba.prange(ary.shape[0]):
        v = ary[i] 
        if v != 0:
            res[i] = np.log(v)
    return res.sum()

In [18]:
par_sum_log_nz(a)

-4999748.670592664

In [19]:
%%time
par_sum_log_nz(a)

CPU times: user 260 ms, sys: 38.2 ms, total: 298 ms
Wall time: 112 ms


-4999748.670592664

Compare the output of `.parallel_diagnostics()` with the previous version.

Note:
* three loops are recognized (#2, #3, and #4), now including the `prange` loop.
* the loops are fused into a single parallel region because they iterate over the same domain.

In [20]:
par_sum_log_nz.parallel_diagnostics()

 
 Parallel Accelerator Optimizing:  Function par_sum_log_nz, 
/tmp/ipykernel_12394/1255685608.py (1)  


Parallel loop listing for  Function par_sum_log_nz, /tmp/ipykernel_12394/1255685608.py (1) 
---------------------------------------------|loop #ID
@numba.njit(parallel=True, fastmath=True)    | 
def par_sum_log_nz(ary):                     | 
    res = np.zeros(ary.shape[0])-------------| #2
    for i in numba.prange(ary.shape[0]):-----| #4
        v = ary[i]                           | 
        if v != 0:                           | 
            res[i] = np.log(v)               | 
    return res.sum()-------------------------| #3
------------------------------ After Optimisation ------------------------------
Parallel region 0:
+--2 (parallel, fused with loop(s): 3, 4)


 
Parallel region 0 (loop #2) had 2 loop(s) fused.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
