# Lecture on Numba
## JIT Compilation in Python

![](images/028.png)

## What is Numba?

![](images/030.PNG)

Numba is a just-in-time compiler for Python.

Numba works best on code that uses loops, NumPy arrays and NumPy functions.

Some of its properties are:
1. It compiles Python code into an intermediate representation that LLVM (Low Level Virtual Machine) turns into machine code.
2. It can vectorize functions.
3. It can run functions in parallel on multi-core CPUs and on GPUs.
4. It is easy to use thanks to decorators.

Here, we want to gain speed without having to change many lines in our programs. 

Because Python is a high-level, interpreted language, a significant speedup requires compiling the code into a lower-level representation.



The most common way to use Numba is through its collection of decorators, which are applied to ordinary Python functions.



Numba can be installed
inside a conda environment with:
```
$ conda install numba
```
or with pip:
```
$ pip install numba
```

Numba is often used as a core package, so its dependencies are kept to an
absolute minimum. However, extra packages can be installed to provide
additional functionality:

* ``scipy`` - enables support for compiling ``numpy.linalg`` functions.
* ``colorama`` - enables support for color highlighting in backtraces/error
  messages.
* ``pyyaml`` - enables configuration of Numba via a YAML config file.
* ``intel-cmplr-lib-rt`` - allows the use of the Intel SVML (high performance
  short vector math library, x86_64 only). Installation instructions are in the
  performance tips section of the Numba documentation.



You may ask yourself: will Numba work for my code?

Numba works well on code that looks like this:
```python
    from numba import jit
    import numpy as np

    x = np.arange(100).reshape(10, 10)

    @jit(nopython=True) # Set "nopython" mode for best performance, equivalent to @njit
    def go_fast(a): # Function is compiled to machine code when called the first time
        trace = 0.0
        for i in range(a.shape[0]):   # Numba likes loops
            trace += np.tanh(a[i, i]) # Numba likes NumPy functions
        return a + trace              # Numba likes NumPy broadcasting

    print(go_fast(x))
```

It won't work very well, if at all, on code that looks like this:
```python
    from numba import jit
    import pandas as pd

    x = {'a': [1, 2, 3], 'b': [20, 30, 40]}

    @jit
    def use_pandas(a): # Function will not benefit from Numba jit
        df = pd.DataFrame.from_dict(a) # Numba doesn't know about pd.DataFrame
        df += 1                        # Numba doesn't understand what this is
        return df.cov()                # or this!

    print(use_pandas(x))

```

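Note that Numba does not understand Pandas. With the Numba version used in this
lecture (0.58.1), a bare `@jit` falls back to object mode, so this code simply
runs via the interpreter, with the added cost of Numba's internal overheads!
In Numba 0.59 and later, where `@jit` defaults to nopython mode, compilation
fails with an error instead.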

How to measure the performance of Numba?
----------------------------------------
We need to remember that Numba has to compile your function for the argument types
given before it executes the machine code version of your function. This takes
time. However, once the compilation has taken place Numba caches the machine
code version of your function for the particular types of arguments presented.
If it is called again with the same types, it can reuse the cached version
instead of having to compile again.

A really common mistake when measuring performance is to time the code only once with a simple timer, so that the
time taken to compile the function is included in the measured execution time.



In [None]:
# TIMING PITFALL: the first measurement below includes the compilation time
from numba import jit
import numpy as np
import time

x = np.arange(100).reshape(10, 10)

@jit(nopython=True)
def go_fast(a):  # Function is compiled to machine code on the first call
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

# THE FIRST TIME WE TIME THE FUNCTION, THE COMPILATION TIME IS INCLUDED IN THE EXECUTION TIME!
start = time.perf_counter()
go_fast(x)
end = time.perf_counter()
print("Elapsed (with compilation) = {}s".format((end - start)))

# NOW THE FUNCTION IS COMPILED, IF WE RE-TIME IT:
start = time.perf_counter()
go_fast(x)
end = time.perf_counter()
print("Elapsed (without compilation) = {}s".format((end - start)))


Elapsed (with compilation) = 0.23587280600000327s
Elapsed (without compilation) = 8.881700000529236e-05s


A good way to time a Numba function is the IPython magic command `%timeit`, which runs the function many times; the following examples use it.

# NUMBA DECORATORS

The most important decorators in Numba are:

- `@jit` and `@njit` - Decorators to speed up almost any Python function.
- `@vectorize` - Decorator to create NumPy-like universal functions (ufuncs) from scalar functions.
- `@guvectorize` - An extended version of `@vectorize` that operates on whole arrays (generalized ufuncs) instead of scalars.
- `@stencil` - Decorator that speeds up functions performing stencil kernel operations such as convolution, correlation, etc.



The Numba `@jit` decorator fundamentally operates in two compilation modes: nopython mode and object mode.

If we pass the parameter `nopython=True` to the `@jit` decorator (or use its shorthand `@njit`), the decorated function runs entirely without the involvement of the Python interpreter.

This is the recommended, best-practice way to use Numba.


The `@vectorize` decorator takes a list of signatures specifying the possible input and output data types of the function, so that a compiled version is created for each of them. The signatures should be listed from the most specific data type to the least specific one (for example, `float32` before `float64`).
Apart from the signatures, it accepts, among others, two keyword arguments:

- `target` - One of the three strings below, specifying how to further speed up the code based on the available resources.
  - `'cpu'` - The default. The function runs on a single core (single thread) of the CPU.
  - `'parallel'` - The function runs in parallel on a multi-core (multi-threaded) CPU.
  - `'cuda'` - The function runs on the GPU.
- `cache` - A boolean. When `True`, the compiled code is stored in an on-disk cache, so that later runs of the program can skip recompilation. It does not cache the results computed for particular inputs.

How fast can Numba be?
---------------
Assuming Numba can operate in ``nopython`` mode, or at least compile some loops,
it will target compilation to your specific CPU. The speedup varies depending on
the application but can be one to two orders of magnitude.

In [None]:
import numba

print("Numba Version : {}".format(numba.__version__))

import numpy as np
import timeit
from numba import jit

def cube_formula(x):
    return x**3 + 3*x**2 + 3

#cube_formula_jitted = jit(cube_formula)

@jit(nopython=False)
def cube_formula_jitted(x):
    xs = []
    for i in x:
        xs.append(i**3 + 3*i**2 + 3)
    return xs


arr = np.arange(1, 1000000, dtype=np.int64)

#print("The time taken with Numba jit, in Python mode is ",timeit.timeit(stmt='cube_formula_jitted(arr)'))
print("The time taken with Numba jit, in Python mode, it is \n")

%timeit cube_formula_jitted(arr)

@jit(nopython=True)
def new_cube_formula_jitted(x):
    xs = []
    for i in x:
        xs.append(i**3 + 3*i**2 + 3)

    return xs

print("The time taken with Numba jit, in no Python mode, it is \n")

%timeit new_cube_formula_jitted(arr)
#timeit.timeit(stmt='new_cube_formula_jitted(arr)', globals=globals()))




Numba Version : 0.58.1
The time taken with Numba jit, in Python mode, it is 



  @jit(nopython=False)


37.2 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The time taken with Numba jit, in no Python mode, it is 

55.5 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:

from numba import vectorize, int64, float32, float64

#COMPARING Numpy vectorize VS NUMBA vectorize

vectorized_cube_formula = np.vectorize(cube_formula)

print("The time taken with Numpy vectorize applied to the original formula is: \n")

%timeit vectorized_cube_formula(arr)
#print(timeit.timeit(stmt='vectorized_cube_formula(arr)', globals=globals()))



@vectorize([int64(int64), float32(float32), float64(float64)])
def cube_formula_numba_vec(x):
    return x**3 + 3*x**2 + 3


print("The time taken with Numba jit is : \n ")
%timeit cube_formula_numba_vec(arr)


#NUMBA PARALLELIZED with multithreading

@vectorize([int64(int64), float32(float32), float64(float64)], target="parallel")
def cube_formula_numba_vec_paralleled(x):
    return x**3 + 3*x**2 + 3

print("The time taken with Numba vectorized parallelized is ")

%timeit cube_formula_numba_vec_paralleled(arr)


#Caching in Numba
@vectorize([int64(int64), float32(float32), float64(float64)], cache=True)
def cube_formula_numba_vec_cached(x):
    return x**3 + 3*x**2 + 3



print("The time taken with Numba vectorized cached is ")

%timeit cube_formula_numba_vec_cached(arr)



The time taken with Numpy vectorize applied to the original formula is: 

916 ms ± 63.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The time taken with Numba jit is : 
 
1.07 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The time taken with Numba vectorized parallelized is 
763 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The time taken with Numba vectorized cached is 
1.08 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# STENCIL KERNELS in NUMBA

Stencils are a common computational pattern in which array elements are updated according to some fixed pattern called the stencil kernel. Numba provides the `@stencil` decorator so that users may easily specify a stencil kernel.

In [None]:
import numpy as np
from numba import stencil


#Convolution in 1D
def conv_op(a, b):
    for i in range(a.shape[0]):
        if i-1 < 0 or i+1 >= a.shape[0]:
            b[i] = 0
        else:
            b[i] = a[i-1] + a[i] + a[i+1]

input_arr = np.arange(1_000_000)
output_arr = np.empty_like(input_arr)

%timeit conv_op(input_arr,output_arr)

@stencil
def conv_op(a):
    return a[-1] + a[0] + a[1]

%timeit conv_op(input_arr)



#Convolution in 2D

def conv_op(a, b):
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if i+1 == a.shape[0] or j+1 == a.shape[1]:
                b[i,j] = 0
            elif i-1 < 0 or j-1 < 0:
                b[i,j] = 0
            else:
                b[i,j] = a[i, j+1] + a[i+1, j] + a[i, j-1] + a[i-1, j]

input_arr = np.arange(1000_000).reshape((1000, 1000))
output_arr = np.empty_like(input_arr)

%timeit conv_op(input_arr, output_arr)


@stencil
def conv_op(a):
    return a[0, 1] + a[1, 0] + a[0, -1] + a[-1, 0]


%timeit output_arr = conv_op(input_arr)


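What if you do not know the types of the arguments your function will receive until compile time?
Or, more generally, what if you want a different implementation depending on the input types? The `generated_jit` decorator lets the function select an implementation at compile time. Note that `generated_jit` is deprecated since Numba 0.57 in favour of `numba.extending.overload`, so it may not be available in newer releases.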

In [None]:
import numpy as np

from numba import generated_jit, types

@generated_jit(nopython=True)
def is_missing(x):
    """
    Return True if the value is missing, False otherwise.
    """
    if isinstance(x, types.Float):
        return lambda x: np.isnan(x)
    elif isinstance(x, (types.NPDatetime, types.NPTimedelta)):
        # The corresponding Not-a-Time value
        missing = x('NaT')
        return lambda x: x == missing
    else:
        return lambda x: False

Exercise 1
--------------------------

1. Given two methods, `py_sum(x, y)` and `np.sum(x, y)`, create a method `numba_sum(x, y)` that computes the L1 distance between two input arrays `x` and `y` of length $N$ (i.e., $\sum_{i=0}^{N-1} |x_i - y_i|$).

   Be sure to _use only Python's built-in functions_ in the `numba_sum(x, y)` function.

   Finally, measure the computational time of each method and compare them!

2. **(Bonus):** Run 10 repetitions of the same computation, store the computation times, and show the mean and standard deviation for `py_sum(x, y)`, `np.sum(x, y)`, and `numba_sum(x, y)`.



Exercise 2
--------------------------

Optimize the Jacobi code using Numba, in particular using the vectorize decorators.

# NUMBA for FOURIER ANALYSIS

-----------------------------------------



In [None]:
# Given the following function for the 1D fast Fourier transform:
import numpy as np

def FFT(x):
    """
    A recursive implementation of the 1D Cooley-Tukey FFT.
    The input length must be a power of 2.
    """
    N = len(x)

    if N == 1:
        return x

    X_even = FFT(x[::2])
    X_odd = FFT(x[1::2])
    # Twiddle factors exp(-2*pi*1j*k/N) for k = 0, ..., N-1
    factor = np.exp(-2j * np.pi * np.arange(N) / N)

    # The second half of `factor` is the negative of the first half,
    # which gives the two halves of the butterfly.
    X = np.concatenate([X_even + factor[:N // 2] * X_odd,
                        X_even + factor[N // 2:] * X_odd])
    return X

# Measure the speedup obtained when this function is applied to a large random array.