# Introduction to Numba
Part of this lecture is based on the material by [Dr. Gregory Watson](https://nyu-cds.github.io/python-itertools/).

What we will learn:
- What is Numba
- On-the-fly code generation 
- Native code generation for the CPU (default) and GPU hardware

You will need the Numba package for this lecture (Anaconda already installs it).


----
Numba provides the ability to speed up applications with high-performance functions written directly in Python, rather than using language extensions such as Cython.

Numba allows the compilation of selected portions of pure Python code to native code, and generates optimized machine code.

With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time (JIT) optimized to achieve performance similar to C and C++, without having to switch languages or Python interpreters.

Numba works at the function level. Numba can generate native code for functions as well as the wrapper code needed to call them directly from Python. This compilation is done on-the-fly and in-memory.

----
Numba’s central feature is the **numba.jit()** decorator (take a moment to recap the function decorators we learned before), which marks a function for optimization by Numba’s JIT compiler.

Let's start with a simple example:

In [2]:
import numpy as np

original = np.arange(0.0, 10.0, 0.01, dtype='float')
shuffled = original.copy()
np.random.shuffle(shuffled)

sorted = shuffled.copy()

In [3]:
# bubblesort as pure python code

def bubblesort(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp

In [4]:
%timeit -n 10 sorted[:] = shuffled[:]; bubblesort(sorted)
print(original[:10])
print(shuffled[:10])
print(sorted[:10])

251 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[ 0.    0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09]
[ 3.35  1.98  8.11  8.44  4.65  8.05  8.46  1.12  2.19  9.7 ]
[ 0.    0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09]


Now we incorporate Numba to optimize the same function:

In [5]:
from numba import jit
@jit
def bubblesort_numba(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp

In [6]:
%timeit -n 10 sorted[:] = shuffled[:]; bubblesort_numba(sorted)

The slowest run took 17.64 times longer than the fastest. This could mean that an intermediate result is being cached.
5.19 ms ± 8.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


It is also possible to specify the signature of the Numba function. A function _signature_ describes the types of the arguments and the return type of the function, and is passed to @jit as its first argument. This can produce **slightly** faster code, as the compiler does not need to infer the types. However, the function is no longer able to accept other types.

In [7]:
from numba import jit, float64

@jit("void(float64[:])")
def bubblesort_numba_argtypes(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp

In [8]:
%timeit -n 10 sorted[:] = shuffled[:]; bubblesort_numba_argtypes(sorted)

1.62 ms ± 78.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


---
### Compilation Modes
Numba has two compilation modes: 
- nopython mode 
- object mode

**nopython mode**: 
```
The Numba compiler generates code that does not access the Python C API. This mode produces the highest-performance code, but requires that the native types of all values in the function can be inferred.
```

**object mode**:
```
The Numba compiler generates code that handles all values as Python objects and uses the Python C API to perform all operations on those objects. Code compiled in object mode will often run no faster than interpreted Python code. This mode is used when the type of some variables cannot be inferred.
```

A typical approach is to force **nopython** mode, which triggers an error message when compilation in that mode is not possible.

In [9]:
import numpy as np

original = np.arange(0.0, 10.0, 0.01, dtype='float')
shuffled = original.copy()
np.random.shuffle(shuffled)

sorted = shuffled.copy()

In [13]:
from numba import jit, float64

@jit("void(float64[:])",nopython=True)
def bubblesort_nopython_flag(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp

In [14]:
%timeit -n 10 sorted[:] = shuffled[:]; bubblesort_nopython_flag(sorted)

1.09 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Notice that this code compiles cleanly. However, if we introduce an object whose type cannot be inferred, an error message shows up.

In [12]:
from decimal import Decimal

@jit("void(float64[:])",nopython=True)
def bubblesort(X):
    N = len(X)
    val = Decimal(100)  # just to force an error
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp

TypingError: Failed at nopython (nopython frontend)
Untyped global name 'Decimal': cannot determine Numba type of <class 'type'>
File "<ipython-input-12-bcb05d76a9f0>", line 6

### Calling other functions
Numba functions can call other Numba functions. Both functions must have the **@jit** decorator; otherwise the call either fails to compile in nopython mode or runs as much slower object-mode code.

In [None]:
import numpy as np

original = np.arange(0.0, 10.0, 0.01, dtype='float')
shuffled = original.copy()
np.random.shuffle(shuffled)

sorted = shuffled.copy()

In [42]:
from numba import jit, float64

@jit("void(float64[:])",nopython=True)
def bubblesort_ff(X):
    N = len(X)
    for end in range(N, 1, -1):
        for i in range(end - 1):
            cur = X[i]
            if cur > X[i + 1]:
                tmp = X[i]
                X[i] = X[i + 1]
                X[i + 1] = tmp
               
@jit("void(float64[:])",nopython=True)
def do_sort(sorted):
    bubblesort_ff(sorted)
    

In [44]:
%timeit -n 10 sorted[:]=shuffled[:]; do_sort(sorted)

841 µs ± 83.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


---
### NumPy Universal Functions ([ufunc](https://docs.scipy.org/doc/numpy-1.10.0/reference/ufuncs.html#universal-functions-ufunc))
```
Examples of NumPy ufuncs include add(), multiply(), and sin().
```
Numba’s **@vectorize** decorator allows Python functions taking scalar input arguments to be used as **NumPy ufuncs**. Creating a traditional NumPy ufunc is not the most straightforward process and involves writing some C code. Numba makes this easy. Using the @vectorize decorator, Numba can compile a pure Python function into a ufunc that operates over NumPy arrays as fast as traditional ufuncs written in C.

The @vectorize decorator has two modes of operation:

- **Eager**, or decoration-time, compilation. If you pass one or more type signatures to the decorator, you will be building a NumPy ufunc. We’re just going to consider eager compilation here.
- **Lazy**, or call-time, compilation. When not given any signatures, the decorator will give you a Numba dynamic universal function (DUFunc) that dynamically compiles a new kernel when called with a previously unsupported input type.

Using @vectorize, you write your function as operating over input scalars, rather than arrays. Numba will generate the surrounding loop (or kernel) allowing efficient iteration over the actual inputs. 

In [52]:
import numpy as np
from numba import vectorize, int64

@vectorize([int64(int64, int64)])
def vec_add_vectorize(x, y):
    return x + y

In [57]:
a = np.arange(6, dtype=np.int64)
b = np.linspace(0, 10, 6, dtype=np.int64)
print(vec_add_vectorize(a, a))
print(vec_add_vectorize(b, b))

[ 0  2  4  6  8 10]
[ 0  4  8 12 16 20]


In [58]:
@jit("int64[:](int64[:], int64[:])")
def vec_add_jit(x, y):
    return x + y

In [59]:
print(vec_add_jit(a, a))
print(vec_add_jit(b, b))

[ 0  2  4  6  8 10]
[ 0  4  8 12 16 20]


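The difference between **@vectorize** and **@jit** is that the former builds a new NumPy ufunc from a scalar function, with Numba generating the element-wise loop, while the latter compiles a function that receives whole arrays and relies on NumPy-style array addition (`x + y`) inside its body.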