## Install Numba
Numba can be installed with either `conda` or `pip`:

In [None]:
!pip install numba numpy

Note that Numba is designed to be used together with NumPy, so it may not work as well with other libraries such as pandas.

## How compiling makes it faster
In the example below, we will see how compiling a function can make it run faster.

In [None]:
from numba import jit
import numpy as np
import time

x = np.arange(100).reshape(10, 10)

def normal(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

@jit(nopython=True)
def go_fast(a): # Function is compiled and runs in machine code
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

# TIME WITHOUT COMPILING
start = time.perf_counter()
normal(x)
end = time.perf_counter()
print("Elapsed (without compilation) = {}s".format((end - start)))

# DO NOT REPORT THIS... COMPILATION TIME IS INCLUDED IN THE EXECUTION TIME!
start = time.perf_counter()
go_fast(x)
end = time.perf_counter()
print("Elapsed (with compilation) = {}s".format((end - start)))

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time.perf_counter()
go_fast(x)
end = time.perf_counter()
print("Elapsed (after compilation) = {}s".format((end - start)))

As you can see, the first run of the compiled function, which includes the compilation itself, is much slower than running the function without compilation. Once it has been compiled, however, it is much faster.

**Question: Why is it slower the first time?**

So there is always a trade-off. If the function is only called a couple of times, it may well be better to skip the compilation.

## Vector operations vs. loop operations
Before we move on and talk more about how Numba can make things faster, we should also know that NumPy loves vectors. Let's compare the same trigonometric identity operation, `cos(x)^2 + sin(x)^2`, computed once as a vector operation and once by looping through the array's elements.

In [None]:
arr = np.arange(1.e7)

In [None]:
def ident_np(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

In [None]:
%%timeit
ident_np(arr)

In [None]:
def ident_loops(x):
    r = np.empty_like(x)
    n = len(x)
    for i in range(n):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

In [None]:
%%timeit
ident_loops(arr)  # warning: really slow, can take a few minutes

As we can see, the same operation is almost 10 times slower when it is performed element by element.
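If waiting a few minutes is not an option, the same comparison can be sketched on a much smaller array with the standard `timeit` module. The array size of `1e4` and the repeat count of 10 are arbitrary choices for illustration. The measured ratio will differ from the full-size run and from machine to machine.

```python
import timeit
import numpy as np

def ident_np(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

def ident_loops(x):
    r = np.empty_like(x)
    for i in range(len(x)):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

small = np.arange(1.e4)  # much smaller than the 1e7 array used above

# Both versions should produce the same values (all close to 1)
same = np.allclose(ident_np(small), ident_loops(small))
print("Results agree:", same)

# Time 10 calls of each version
t_vec = timeit.timeit(lambda: ident_np(small), number=10)
t_loop = timeit.timeit(lambda: ident_loops(small), number=10)
print("vectorized: {:.4f}s, loop: {:.4f}s".format(t_vec, t_loop))
```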

## Loops are where Numba really shines
So it seems Numba loves repetitive operations. What happens if we speed up the above operations with `@njit`? (`@njit` is shorthand for `@jit(nopython=True)`.)

In [None]:
from numba import njit

In [None]:
@njit
def ident_np_njit(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

In [None]:
# compiling it
ident_np_njit(arr)

In [None]:
%%timeit
ident_np_njit(arr)

In [None]:
@njit
def ident_loops_njit(x):
    r = np.empty_like(x)
    n = len(x)
    for i in range(n):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

In [None]:
# compiling it
ident_loops_njit(arr)

In [None]:
%%timeit
ident_loops_njit(arr)

As you can see, there is now almost no difference between the vector operation and the elementwise loop. In other words, `@njit` speeds up the loop version almost 10-fold.

The vector operation also improves, but not by as much as the elementwise loop does.

**Exercise: Can you change the following into an elementwise operation?**

In [None]:
def julia_fast(mesh, c=-1, num_iter=10, radius=2):

    z = mesh.copy()
    diverge_len = np.zeros(z.shape)

    for i in range(num_iter):
        conv_mask = np.abs(z) < radius
        z[conv_mask] = np.square(z[conv_mask]) + c
        diverge_len[conv_mask] += 1

    return diverge_len

In [None]:
# Uncomment the line below to see how we do it.
#%load element_op.py

## Fast math
Floating-point calculations normally follow [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) to make sure the result is accurate to the specified precision. This fixes the order of the operations and limits the speed of some computations. When precision is not a major concern, we can relax the compliance and use fastmath to gain some speed.
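To see why the order of operations matters at all, note that floating-point addition is not associative. The small pure-Python sketch below adds the same three numbers in two different groupings and gets two slightly different answers. That is the kind of reordering a strictly compliant compiler is not allowed to do on its own.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # add a and b first
right = a + (b + c)  # add b and c first

print(left, right)
print(left == right)  # False: the grouping changes the rounded result
```

The difference is tiny here, but it means a reordered sum is not guaranteed to give bit-identical results.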

In [None]:
arr = np.arange(1.e7)

@njit(fastmath=False)
def do_sum(A):
    acc = 0.
    # without fastmath, this loop must accumulate in strict order
    for x in A:
        acc += np.sqrt(x)
    return acc

@njit(fastmath=True)
def do_sum_fast(A):
    acc = 0.
    # with fastmath, the reduction can be vectorized as floating point
    # reassociation is permitted.
    for x in A:
        acc += np.sqrt(x)
    return acc

In [None]:
# compile the functions
do_sum(arr)
do_sum_fast(arr)

In [None]:
%%timeit
do_sum(arr)

In [None]:
%%timeit
do_sum_fast(arr)

As you can see, the fast math version takes only about half the time.

## Make use of multiple cores

In [None]:
@njit(parallel=True)
def ident_parallel(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

In [None]:
# compiling it
ident_parallel(arr)

In [None]:
%%timeit
ident_parallel(arr)

Depending on how many cores you have on your computer, you will see varying degrees of improvement. On my 3.1 GHz Dual-Core Intel Core i7 early 2015 MacBook Pro, it is slightly faster than the one using `njit` and almost 7 times faster than without compilation. On my new 8-core M2 2022 MacBook Air, it is 10 times faster.

**Question: What about yours?**

Now that we have a feeling for how Numba makes things faster and when it can make things really fast, let's go to the second notebook and see what can go wrong and how to fix it.