## SWD 6 Notebook 3: More on Numba

[Official documentation](https://numba.pydata.org/)

Although Numba appears to give a great speed-up for very little programming overhead (compared to Cython, say), it works best in certain scenarios:

* code that is numerically oriented
* code that makes a lot of use of numpy arrays and numpy functions
* code that has a lot of loops

It works poorly with code that:

* handles a lot of strings
* uses Pandas

Cython, by contrast, gives some speed-up in almost all circumstances, whereas Numba is most effective on numerical code of the kind described above.

Numba will work with:

* **OS**: Windows (32 and 64 bit), OSX and Linux (32 and 64 bit)
* **Architecture**: x86, x86_64, ppc64le. Experimental on armv7l, armv8l (aarch64).
* **GPUs**: Nvidia CUDA. Experimental on AMD ROC.
* **Python**: CPython
* **NumPy**: 1.15 - latest


## A first Numba example

As we mentioned earlier, Numba is most effective on numerically intensive codes. From the Numba documentation:

In [None]:
%%timeit
from numba import jit
import numpy as np

x = np.arange(100).reshape(10, 10)

@jit(nopython=True) # Set "nopython" mode for best performance, equivalent to @njit
def go_fast(a): # Function is compiled to machine code when called the first time
    trace = 0.0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

go_fast(x)

1 loop, best of 5: 218 ms per loop


Numba's `@jit` decorator defines a function to be compiled by Numba.

It operates in two modes:
* `nopython` mode
* `object` mode

In `nopython` mode, Numba compiles the decorated function so that it runs without any involvement of the Python interpreter. This should give the best performance and is the preferred mode.

The fallback is `object` mode. Here, Numba attempts to compile loops into machine code functions, but for everything else it falls back to the Python interpreter. Although this will often give some speed-up, it isn't the optimal mode and should be avoided.

## Measuring performance

In the timed cells above, you may have seen some odd results. Perhaps the Numba code seemed slower?

We need to understand just how Numba works.

* The first time a Numba function is called, Numba compiles it for the argument types given and then runs the machine code version.
* This first compilation adds an overhead.
* **Subsequent** calls with the same argument types use the compiled version and are much faster.

We can see this in the code below:

In [None]:
from numba import jit
import numpy as np

@jit(nopython=True)
def go_fast(a): # Function is compiled and runs in machine code
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

In [None]:
x = np.arange(100).reshape(10, 10)

On the first run, there's a compilation overhead so the function takes a while to run:

In [None]:
%%timeit -n 1 -r 1
go_fast(x)

1 loop, best of 1: 444 ms per loop


However, on the second and subsequent calls to the function, the compiled version is used and it's **much** faster.

In [None]:
%%timeit -n 1 -r 1
go_fast(x)

1 loop, best of 1: 27.5 µs per loop


On my runs, for example:

| First run | Second run |
|-----------|------------|
| 444 ms    | 27.5 µs    |

## A more involved example

In the following code example, we use several Numba concepts:

**nopython mode** [recommended, best-practice mode]. 
In nopython mode, the decorated function runs entirely without the involvement of the Python interpreter. For this to work, native Python objects have to be replaced with Numba-supported data structures and types.
Use the `@njit` or `@jit(nopython=True)` decorator to JIT-compile your functions in this mode.

**Object mode**   
In object mode, Numba identifies loops that contain only nopython operations and compiles them into machine code. The rest of the code runs in the Python interpreter. Use `@jit(forceobj=True)` to request object mode explicitly; in Numba releases before 0.59, a plain `@jit` also falls back to object mode when nopython compilation fails.

**Run code in parallel**.  
Enabled by adding `parallel=True` to the `@njit` or `@jit` decorator. Numba then automatically parallelises the operations it recognises as safe, such as array expressions, and you can mark a loop for parallel execution explicitly by using `prange` in place of `range`.
The optimisations applied can be viewed with `numba_func.parallel_diagnostics(level=4)`, where `level` sets the amount of detail from 1 (minimum) to 4 (maximum).

In [13]:
from numba import njit
from numba import prange # used in place of range to run a loop in parallel
import numpy as np

# example 1 Native Python code
def trace_normal(a): # native python function
  trace = 0.0
  for i in range(a.shape[0]):
    trace += a[i, i]

  return a + trace

# example 2 numpy code
# baseline operation that we will replicate in numba
def pure_numpy_trace(a):
  return (np.trace(a) + a)

# Example 3 numba optimized code
@njit
def trace_numba(a): # Function is compiled to machine code when called the first time
  trace = 0.0
  for i in range(a.shape[0]): # Numba likes loops
    trace += a[i, i] # Numba likes NumPy array indexing
  return a + trace # Numba likes NumPy broadcasting

@njit(parallel=True)
def trace_numba_parallel(a):
  trace = 0.0
  for i in prange(a.shape[0]): # prange runs the iterations in parallel; trace is a reduction
    trace += a[i, i]
  return a + trace


In [8]:
# Create one large and one small test array
large_x = np.arange(1000000).reshape(1000, 1000)
small_x = np.arange(10000).reshape(100, 100)

In [10]:
%%timeit
# Use timeit to return some results (the cell magic must be the first line of the cell)
trace_normal(large_x)

1000 loops, best of 5: 1.63 ms per loop


In [11]:
%%timeit 
trace_normal(small_x)

The slowest run took 32.34 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 46.8 µs per loop


In [14]:
%%timeit 
pure_numpy_trace(large_x)

The slowest run took 8.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 981 µs per loop
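
In a plain Python script, where the `%%timeit` magic is not available, the standard library `timeit` module gives comparable measurements. The sketch below is one way to do it, reusing the two uncompiled functions from above; the repeat and loop counts are arbitrary choices.

```python
import timeit

import numpy as np


def trace_normal(a):
    trace = 0.0
    for i in range(a.shape[0]):
        trace += a[i, i]
    return a + trace


def pure_numpy_trace(a):
    return np.trace(a) + a


large_x = np.arange(1000000).reshape(1000, 1000)

for func in (trace_normal, pure_numpy_trace):
    # Best of 5 repeats of 20 calls each, reported per call
    times = timeit.repeat(lambda: func(large_x), repeat=5, number=20)
    print(f"{func.__name__}: {min(times) / 20 * 1e6:.1f} µs per call")
```

A Numba function can be timed the same way, provided it is called once beforehand so that compilation is not included in the measurement.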


## Exercise: 

Try timing all four versions of the code (`trace_normal`, `pure_numpy_trace`, `trace_numba` and `trace_numba_parallel`) with both the large and small arrays.

What are your observations?

## Signatures

It is also possible to specify the signature of the Numba function. A **function signature** describes the types of the arguments and the return type of the function. This can produce slightly faster code as the compiler does not need to infer the types. However, the function is then no longer able to accept other types.



In [1]:
from numba import jit, int32, float64

@jit(float64(int32, int32))
def f(x, y):
    # A somewhat trivial example
    return (x + y) / 3.14

In this example, `float64(int32, int32)` is the function’s signature specifying a function that takes two 32-bit integer arguments and returns a double precision float. Numba provides a shorthand notation, so the same signature can be specified as `f8(i4, i4)`.

This specialisation will be compiled by the `@jit` decorator, and no other specialisation will be allowed. This is useful if you want fine-grained control over the types chosen by the compiler (for example, to use single-precision floats).

If you omit the return type, e.g. by writing `(int32, int32)` instead of `float64(int32, int32)`, Numba will try to infer it for you. Function signatures can also be strings, and you can pass several of them as a list; see the `numba.jit()` documentation for more details.

The new compiled function gives the expected results.

In [2]:
f(1, 3)

1.2738853503184713

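But passing floating-point values gives an unexpected result, because the arguments are converted to `int32` (truncating 1.1 to 1 and 3.2 to 3) before the function runs: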

In [4]:
f(1.1, 3.2)

1.2738853503184713
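
The repeated value is no coincidence. As a rough illustration in plain NumPy (not Numba's actual conversion code), truncating each argument to a 32-bit integer before the arithmetic reproduces it:

```python
import numpy as np

# Mimic what the int32 signature does to floating-point arguments
x = np.int32(1.1)    # fractional part is dropped
y = np.int32(3.2)

truncated = (int(x) + int(y)) / 3.14   # same arithmetic as f(1, 3)
intended = (1.1 + 3.2) / 3.14          # what the caller probably wanted
print(truncated, intended)
```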

For ease of use and further comparison, Numba also retains a copy of the uncompiled version of each function.

This can be accessed via the `.py_func` attribute of the function:

In [15]:
f.py_func(1.1, 3.2)

1.3694267515923568

which gives the expected result.

## Exercise:

Using the `julia.py` code in the GitHub repository, compare its execution speed using a range of Numba tools.

What are your observations?