# Writing efficient code in Python: vectorization and `numba`

In this example, we will learn how code that operates on large data can be optimized through vectorization and the Numba package.

<iframe width="1280" height="720" src="https://www.youtube.com/embed/d6YGiS-ZhJ0" title="2_2 HPC workshop01" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

```{eval-rst}
:download:`Download the slides. <./slides/Python_numba.pdf>`
```

### Interpreter and compiler
Python uses an interpreter. This means that the interpreter program (for example, `python.exe` on Windows) parses your script (i.e., a text file with Python code) and executes it statement by statement. This also means you need a Python interpreter on every system where you want to run your scripts.

On the other hand, languages like C++ or Fortran need a compiler to make executable files (your application or library) from the source code. A compiler analyzes the code in several passes, performs many optimizations, and outputs machine code, i.e., the application.

Python is a higher-level language. As such, a lot happens under the hood that Python programmers do not have to care about, for instance memory allocation and deallocation, or everything that allows existing modules to be used in a plug-and-play fashion. The disadvantage of Python is lower performance, which is why most numerical packages are written in C/C++/Fortran.

For example, big loops are inefficient in Python. However, there are several ways to overcome this issue, such as *vectorization* and the `numba` library.
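
To get a feel for the work the interpreter does on every loop iteration, one can inspect the bytecode that CPython generates for a small loop using the standard `dis` module. This is only an illustrative sketch; the function name `sum_adult_ages` is made up for this example, and the exact instructions printed depend on the Python version.

```python
import dis

def sum_adult_ages(ages):
    """Plain-Python loop, similar in shape to the loops used later on."""
    total = 0
    for age in ages:
        if age >= 18:
            total += age
    return total

# Every instruction listed here is dispatched by the interpreter at run time;
# the instructions in the loop body are repeated for each element.
instructions = list(dis.get_instructions(sum_adult_ages))
print(len(instructions), "bytecode instructions")
dis.dis(sum_adult_ages)
```

A compiled language would translate the same loop into a handful of machine instructions ahead of time, which is the gap that the techniques in this example try to close.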

We will try different Python implementations of a simple task and compare their performance.

### Let's start...


First, let's generate a large volume of data: 10 million random values representing the ages of a population.

In [None]:
# length of array
N = 10000000

In [None]:
from numpy import random

# generate a random array of N integers from 0 to 99 (the upper bound 100 is exclusive)
ages = random.randint(0, 100, N)

In [None]:
# calculate the average age (without any condition)
average_age = ages.mean()
print("Average age is", average_age)

Next, we are going to find the average age of adults (aged 18 or older).
For this, we will first use a `for` loop.

In [None]:
# function returns the average age of adults only (age >= 18) using a loop
def calc_average_adult_age_loop(ages):
    average_adult_age = 0 
    adult_counter = 0
    for i in ages:
        if i >= 18:
            average_adult_age += i
            adult_counter += 1
    if (adult_counter > 0):
        average_adult_age /= adult_counter
    return average_adult_age

In [None]:
# calculate and measure the calculation time using loop
import time

timers = dict()
start = time.perf_counter()
average_adult_age = calc_average_adult_age_loop(ages)
end = time.perf_counter()

print("Time (loop):", end - start, "sec.")
print("Average adult age is", average_adult_age)

timers["loop"] = end - start
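
The start/stop timing pattern used above recurs throughout this example. As a minimal sketch, it could be wrapped in a small helper; the name `time_call` and its `repeats` parameter are assumptions made for this illustration and are not used elsewhere in this example. Taking the best of several runs reduces the influence of other activity on the machine.

```python
import time

def time_call(func, *args, repeats=3):
    """Call func(*args) `repeats` times; return (last result, best time in seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return result, best

# Usage with a small stand-in workload (built-in sum over a list)
total, seconds = time_call(sum, list(range(1000)))
print("Result:", total, "best time:", seconds, "sec.")
```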

### Vectorization in numpy

For large arrays, looping over each element can be slow in high-level languages like Python due to overhead.

Vectorization speeds up Python code by replacing explicit loops with operations on whole arrays, which can greatly reduce the running time. Various `numpy` operations are vectorized, such as i) the dot product of vectors, also known as the scalar product; ii) the outer product, which for two vectors of length n yields a square n × n matrix; and iii) element-wise multiplication of two matrices, where each element of the first matrix is multiplied by the corresponding element of the second matrix (the matrices must have the same dimensions).
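
The three operations listed above can be tried on tiny arrays. The values below are arbitrary and chosen only so that the results are easy to check by hand.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# i) dot (scalar) product: 1*4 + 2*5 + 3*6
dot_uv = np.dot(u, v)

# ii) outer product: a 3 x 3 matrix with entries u[i] * v[j]
outer_uv = np.outer(u, v)

# iii) element-wise multiplication of two matrices of the same shape
A = np.array([[1, 2], [3, 4]])
B = np.array([[10, 20], [30, 40]])
elementwise = A * B

print(dot_uv)          # 32.0
print(outer_uv.shape)  # (3, 3)
print(elementwise)
```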

Vectorized operations in `numpy` delegate the looping internally to highly optimized C and Fortran functions. 
This also makes your Python code cleaner. Indexing of `numpy.ndarray` is also extremely efficient. Let's see how we can exploit it to make the computation of the average adult age much faster.

In [None]:
# calculate the average age of adults only (age >= 18) using numpy vectorization
start = time.perf_counter()
average_adult_age = ages[ages>=18].mean()
end = time.perf_counter()

print("Time (vectorized):", end - start, "sec.")
print("Average adult age is", average_adult_age)

timers["numpy vectorized"] = end - start
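
The one-line expression `ages[ages>=18].mean()` combines three steps. The sketch below unpacks them on a tiny, made-up sample (the names `ages_small`, `is_adult` and `adult_ages` are introduced only for this illustration).

```python
import numpy as np

ages_small = np.array([5, 18, 42, 17, 70])  # tiny illustrative sample

# Step 1: the comparison is applied to every element at once
is_adult = ages_small >= 18        # [False, True, True, False, True]

# Step 2: indexing with the boolean mask keeps only the True positions
adult_ages = ages_small[is_adult]  # [18, 42, 70]

# Step 3: the mean is computed inside NumPy's compiled code
print(adult_ages.mean())           # (18 + 42 + 70) / 3
```

Note that boolean indexing creates a new array holding the selected values, so this approach trades some extra memory for speed.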

### Vectorization using `numba`

It is not always possible to implement code with complicated conditions in a `numpy` vectorized way.
In such cases, the [`numba` Python library](https://numba.pydata.org/) can help to speed up Python loops.

`numba` analyzes your function and compiles it into fast machine code using the LLVM (Low Level Virtual Machine) compiler. LLVM is an open source compiler which supports JIT (Just-In-Time) code generation, where the code is compiled during the execution.

To use `numba` JIT support, you only need to import `jit`
```python
from numba import jit
```
and add the JIT decorator before the function definition:
```python
@jit(nopython=True)
```
> Note: Python *decorators* allow us to modify the behaviour of a function or a class without permanently modifying it. You can find more about them [here](https://www.geeksforgeeks.org/decorators-in-python/).

By default, `numba` might perform the compilation and optimization for only part of the code (the exact default behaviour depends on the Numba version). Specifying `nopython=True` forces the whole decorated function to be compiled, so that it runs entirely without the involvement of the Python interpreter; if that is not possible, compilation fails with an error. This is the recommended, best-practice way to use the `numba` JIT decorator, as it leads to the best performance.
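
Compilation happens lazily: Numba compiles a separate specialization of the function for each combination of argument types it encounters, which is why the first call with a new type is slower. The sketch below illustrates this with a made-up function `add_squares`. As an assumption made only so that the sketch runs everywhere, it falls back to a no-op decorator when Numba is not installed (in which case nothing is compiled).

```python
try:
    from numba import jit
except ImportError:
    # Assumption for this sketch only: without Numba, use a decorator that
    # returns the function unchanged (no compilation, no speedup).
    def jit(*args, **kwargs):
        return lambda func: func

@jit(nopython=True)
def add_squares(a, b):
    return a * a + b * b

# First call with integers: a version for integer arguments is compiled
print(add_squares(3, 4))      # 25

# First call with floats: a second version for float arguments is compiled
print(add_squares(0.5, 1.5))  # 2.5

# With Numba installed, the dispatcher lists the compiled type signatures
print(getattr(add_squares, "signatures", "Numba not available"))
```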

:::{figure-md} markdown-fig
<img src="./slides/Slide8.PNG" alt="numba explanation" class="bg-primary mb-1">

Schematic overview of how Numba works.
:::

Let's add `numba` to our original loop implementation for computing the average adult age.

In [None]:
# calculate and measure the calculation time using loop with numba decorator
from numba import jit

@jit(nopython=True)
# calculate the average age of adults only (age >= 18) using a loop
def calc_average_adult_age_loop_numba(ages):
    average_adult_age = 0 
    adult_counter = 0
    for i in ages:
        if i >= 18:
            average_adult_age += i
            adult_counter += 1
    if (adult_counter > 0):
        average_adult_age /= adult_counter
    return average_adult_age

The first run will still require a comparable amount of time since it includes the compilation time.

In [None]:
start = time.perf_counter()
# first call will include the compilation time
average_adult_age = calc_average_adult_age_loop_numba(ages)
end = time.perf_counter()

print("Time (loop numba):", end - start, "sec.")
print("Average adult age is", average_adult_age)

Subsequent runs of the same function will be extremely fast. This is because the code has been compiled (translated to machine code) rather than interpreted (executed by the Python interpreter).

In [None]:
# measure the time
start = time.perf_counter()
average_adult_age = calc_average_adult_age_loop_numba(ages)
end = time.perf_counter()

print("Time (loop numba):", end - start, "sec.")
print("Average adult age is", average_adult_age)

timers["loop numba"] = end - start

Of course, we get the speedup even if we use a completely different dataset.

In [None]:
# create new dataset
ages_new = random.randint(0, 100, N)

# measure the time
start = time.perf_counter()
average_adult_age = calc_average_adult_age_loop_numba(ages_new)
end = time.perf_counter()

print("Time (loop numba):", end - start, "sec.")
print("Average adult age is", average_adult_age)

timers["loop numba"] = end - start

### Parallelization on multiple cores

Modern computers have multiple cores, so we can ask `numba` to parallelize the loop automatically by adding the `parallel=True` argument.

Parallelization changes the loop execution order, so it is not possible to directly parallelize the loop if it has a dependency on the previous iteration values (for example, `x[i+1]= f(x[i])`).
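
The difference between independent iterations and a loop-carried dependency can be seen in plain Python. In this sketch, the recurrence `f(x) = 0.5 * x + 1` is a hypothetical example chosen only for illustration.

```python
values = [1, 2, 3, 4, 5]

# Independent iterations: each output depends only on its own input,
# so any execution order yields the same result.
squares_forward = [0] * len(values)
for i in range(len(values)):
    squares_forward[i] = values[i] ** 2

squares_backward = [0] * len(values)
for i in reversed(range(len(values))):
    squares_backward[i] = values[i] ** 2

print(squares_forward == squares_backward)  # True

# Loop-carried dependency: x[i + 1] needs x[i], so iterations must run in order.
x = [1.0] + [0.0] * 4
for i in range(len(x) - 1):
    x[i + 1] = 0.5 * x[i] + 1.0  # hypothetical f(x) = 0.5 * x + 1
print(x)
```

Only the first kind of loop can safely be split across cores without restructuring the algorithm.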

:::{figure-md} markdown-fig
<img src="./slides/Slide10.PNG" alt="numba explanation" class="bg-primary mb-1">

Parallelization on multiple cores.
:::

In [None]:
# with numba decorator and auto-parallelization
@jit(nopython=True, parallel=True)
# calculate the average age of adults only (age >= 18) using a loop
def calc_average_adult_age_loop_numba_parallel(ages):
    average_adult_age = 0 
    adult_counter = 0
    for i in ages:
        if i >= 18:
            average_adult_age += i
            adult_counter += 1
    if (adult_counter > 0):
        average_adult_age /= adult_counter
    return average_adult_age

In [None]:
start = time.perf_counter()
# first call will include the compilation time
average_adult_age = calc_average_adult_age_loop_numba_parallel(ages)
end = time.perf_counter()

print("Time (loop parallel):", end - start, "sec.")
print("Average adult age is", average_adult_age)

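Running this cell produces a `NumbaPerformanceWarning` stating that `parallel=True` was specified but no transformation for parallel execution was possible. This message means that auto-parallelization has not been applied.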

### How to implement parallel loops correctly?


Another feature of the code transformation pass (when `parallel=True`) is support for explicit parallel loops. One can use Numba's `prange` instead of Python's standard `range` to specify that a loop can be parallelized. The user is required to make sure that the loop does not have cross-iteration dependencies, except for supported reductions such as accumulating a sum with `+=`.

:::{figure-md} markdown-fig
<img style="float: left;" src="slides/Slide11.PNG" width="100%">

Schematic explanation of parallelization.
:::

Let's use `prange` in the `for` loop to specify that our loop has no cross-iteration dependencies and can be parallelized.

In [None]:
from numba import prange

# with numba decorator and an explicit parallel loop (prange)
@jit(nopython=True, parallel=True)
# calculate the average age of adults only (age >= 18) using a loop
def calc_average_adult_age_loop_numba_parallel(ages):
    average_adult_age = 0 
    adult_counter = 0
    for i in prange(ages.shape[0]):
        if ages[i] >= 18:
            average_adult_age += ages[i]
            adult_counter += 1
    if (adult_counter > 0):
        average_adult_age /= adult_counter
    return average_adult_age

In [None]:
start = time.perf_counter()
# first call will include the compilation time
average_adult_age = calc_average_adult_age_loop_numba_parallel(ages)
end = time.perf_counter()

print("Time (loop parallel):", end - start, "sec.")
print("Average adult age is", average_adult_age)

In [None]:
# measure the time
start = time.perf_counter()
average_adult_age = calc_average_adult_age_loop_numba_parallel(ages)
end = time.perf_counter()

print("Time (loop numba parallel):", end - start, "sec.")
print("Average adult age is", average_adult_age)

timers["loop numba parallel"] = end - start
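
The two accumulators in the `prange` loop above (`average_adult_age` and `adult_counter`) are reductions: each worker can accumulate its own partial result, and the partial results are combined at the end. The pure-Python sketch below illustrates only this idea; it is not how Numba implements it internally, the helper names are made up, and Python threads will not give a real speedup here.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_and_count(chunk):
    """Sum and count of the adult ages (>= 18) within one chunk."""
    adults = [age for age in chunk if age >= 18]
    return sum(adults), len(adults)

def average_adult_age_chunked(ages, n_chunks=4):
    # Split the data into roughly equal chunks, one per worker
    size = max(1, -(-len(ages) // n_chunks))  # ceiling division
    chunks = [ages[i:i + size] for i in range(0, len(ages), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    # Combine the per-chunk results; the order of combination does not matter
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count > 0 else 0

sample = [5, 18, 42, 17, 70, 33, 12, 90]
print(average_adult_age_chunked(sample))
```

Because addition can be regrouped freely (up to floating-point rounding), splitting the sum across workers does not change the result in any meaningful way.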

In [None]:
# print timers
max_col_width = 20
for t in timers.keys():
    tab_spaces = ' ' * (max_col_width - len(t))
    print(t, tab_spaces, timers[t], 'sec.')

### Another example of numpy vectorization 

The dot product (also called the inner or scalar product) is an algebraic operation that takes two vectors of equal length and returns a single number: the sum of the products of their corresponding elements. If `a` and `b` are treated as column vectors of the same length, the dot product equals the matrix product of the transpose of `a` with `b`.
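
Written out for two vectors of length $N$ (the symbols $a_i$ and $b_i$ denote the individual elements), the definition and a small worked example read:

```{math}
\mathbf{a} \cdot \mathbf{b} = \mathbf{a}^{\mathsf{T}} \mathbf{b} = \sum_{i=1}^{N} a_i b_i,
\qquad
(1, 2, 3) \cdot (4, 5, 6) = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32
```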

> Note: run the cell twice if you don't witness a significant speedup the first time.

In [None]:
# Dot product
import time
import numpy as np

timers = dict()

N = 1000000

a = np.random.rand(N)  # two 1-D vectors of equal length
b = np.random.rand(N)

# classic dot product of vectors implementation
# (loop over all N elements; with a (1, N) array, len(a) would be 1)
start_time = time.perf_counter()
dot = 0.0

for i in range(N):
    dot += a[i] * b[i]
  
end_time = time.perf_counter()

timers["loop"] = end_time - start_time
  
start_time = time.perf_counter()
n_dot_product = np.dot(a, b)
end_time = time.perf_counter()

timers["numpy vectorized"] = end_time - start_time
  
max_col_width = 20
for t in timers.keys():
    tab_spaces = ' ' * (max_col_width - len(t))
    print(t, tab_spaces, timers[t], 'sec.')
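
When comparing timings, it is worth confirming that both implementations return the same value. The sketch below does this on smaller, separately generated vectors (the names `a_small` and `b_small` and the seed are assumptions made for this illustration).

```python
import math
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only to make the sketch reproducible
a_small = rng.random(1000)
b_small = rng.random(1000)

# explicit loop
dot_loop = 0.0
for i in range(a_small.shape[0]):
    dot_loop += a_small[i] * b_small[i]

# vectorized
dot_numpy = float(np.dot(a_small, b_small))

# Floating-point sums may differ in the last digits because the order of
# additions differs, so compare with a tolerance rather than with ==.
print(math.isclose(dot_loop, dot_numpy, rel_tol=1e-9))
```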
    

### More data and examples on Numpy vectorization

More examples and explanations can be found at [vectorized algebraic operations](https://www.geeksforgeeks.org/vectorization-in-python/) and [vectorized mathematical functions](https://www.geeksforgeeks.org/vectorized-operations-in-numpy/).