# Numba

_________________________
## *Intro*
_________________________

Numba can optimize code in ways that we may not be used to in typical Python.

In [1]:
def func_one(n):
    result = 0
    for i in range(n):
        squared = n * n
        result += squared
    return result

def func_two(n):
    result = 0
    squared = n * n
    for i in range(n):
        result += squared
    return result



These two functions demonstrate this.
The one core difference is that `func_one` computes the square inside the `for` loop, while `func_two` computes it once, outside the loop.

Let's time the functions.


In [2]:
%%timeit
func_one(10000)

826 µs ± 7.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [3]:
%%timeit
func_two(10000)

514 µs ± 999 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Now let's decorate both functions with `numba` and see the performance differences.

In [4]:
import numba as nb

@nb.njit
def func_one(n):
    result = 0
    for i in range(n):
        squared = n * n
        result += squared
    return result

@nb.njit
def func_two(n):
    result = 0
    squared = n * n
    for i in range(n):
        result += squared
    return result

# Call each function once so that compilation happens before the timing runs
func_one(1); func_two(2);

In [5]:
%%timeit
func_one(10000)

138 ns ± 1.35 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [6]:
%%timeit
func_two(10000)

137 ns ± 0.376 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


Both functions now run in about 138 ns instead of hundreds of microseconds, and they are about equally fast.



Numba uses a just-in-time (JIT) compiler to analyze the functions and convert them into more efficient machine code.

* The compiler can make assumptions about the types of variables
* In the example above, the compiler recognizes that `squared = n * n` inside the `for` loop is loop-invariant, so it can be computed once outside the loop
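
The hoisting described above can be written out by hand. The following plain-Python sketch (no Numba required) shows the loop as written, the hand-hoisted equivalent, and a closed form. The function names are hypothetical, and the closed form is an assumption about how far an optimizer could simplify the loop, not a statement of what Numba actually emits.

```python
def square_in_loop(n):
    """Recomputes n * n on every iteration, as func_one does."""
    result = 0
    for _ in range(n):
        squared = n * n  # loop-invariant: does not depend on the loop variable
        result += squared
    return result


def square_hoisted(n):
    """Computes n * n once before the loop, as func_two does."""
    squared = n * n
    result = 0
    for _ in range(n):
        result += squared
    return result


def closed_form(n):
    """For n >= 0, adding n * n a total of n times equals n * n * n."""
    return n * n * n


# All three agree, so moving the computation does not change the result.
assert square_in_loop(100) == square_hoisted(100) == closed_form(100)
```

Because the result is identical either way, an optimizing compiler is free to pick the cheapest form, which is consistent with both compiled functions running at the same speed.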

_________________________
## *compile*
_________________________

*Not everything is faster!*

The first call to a Numba-decorated function is actually slower, because Numba has to compile it first.

Numba also doesn't support all Python code out there, so it can't always improve performance. It can only compile, and therefore speed up, a subset of Python.


In [7]:
@nb.njit
def func_test(n):
    result = {}
    for i in range(n):
        new_dict = {'a' * n: n}
        result[squared] = new_dict
    return result

In [8]:
func_test(10)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
NameError: name 'squared' is not defined

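Note that the error above comes from the undefined name `squared`, which Numba reports at compile time as a `TypingError`. More generally, Numba is built for numeric code, so we should limit ourselves to that if we're going to use it.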

_________________________
## *benchmark*
_________________________

## Numba Settings

In [9]:
import numpy as np

In [15]:
@nb.njit()
def hypot_n(x, y):
    return (x**2 + y**2)**0.5

@nb.njit(parallel=True, fastmath=True)
def hypot_p(x, y):
    return (x**2 + y**2)**0.5

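The `parallel=True` flag means that if you pass a NumPy array and the operation is parallelizable, Numba will try to use more than one core.

The `fastmath` flag is a little more involved: it allows the compiler to relax strict IEEE 754 floating-point rules in exchange for speed, so results may differ slightly in the last digits. Documentation is [here](https://llvm.org/docs/LangRef.html#fast-math-flags)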

In [12]:
r1, r2 = np.random.random(size=(2000, 2000)), np.random.random(size=(2000, 2000))
hypot_n(r1, r2); hypot_p(r1, r2)

array([[0.62883245, 0.99587588, 0.72639647, ..., 1.02122589, 0.48601202,
        0.95756198],
       [1.01008962, 1.12699009, 0.74899116, ..., 0.74018585, 0.91445294,
        0.50736059],
       [1.03262022, 0.46906692, 1.00489487, ..., 0.72117873, 0.97040632,
        0.89808593],
       ...,
       [0.65087832, 0.85938314, 0.54353816, ..., 0.80766781, 0.98841149,
        0.4878289 ],
       [0.24019373, 0.70292642, 0.91528304, ..., 1.04550116, 0.87566152,
        0.90827608],
       [1.22438166, 0.82707268, 0.19680433, ..., 0.97183508, 1.27621754,
        1.16466602]])

In [13]:
%%timeit
hypot_n(r1, r2)

11.5 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
%%timeit
hypot_p(r1, r2)

7.77 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Let's see how long the same computation takes with plain NumPy, without Numba.

In [16]:
%%timeit
(r1**2 + r2**2)**0.5

27.7 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


We can see that Numba's JIT compiler can sometimes be faster than plain NumPy: here both compiled versions beat the 27.7 ms NumPy expression.

_________________________
## *types*
_________________________

We can also be explicit about the types of inputs the function accepts, which may speed things up a little more.

In [20]:
from numba import float64

float64[:, :](float64[:, :], float64[:, :])

(array(float64, 2d, A), array(float64, 2d, A)) -> array(float64, 2d, A)

The signature above is what we now pass as the first argument to the `nb.njit` decorator.

We've imported the `float64` type. The signature describes a function that accepts two two-dimensional `float64` NumPy arrays and returns a two-dimensional `float64` array.

In [22]:
@nb.njit(float64[:, :](float64[:, :], float64[:, :]), parallel=True, fastmath=True)
def hypot_t(x, y):
    return (x**2 + y**2)**0.5

hypot_t(r1, r2);

In [23]:
%%timeit
hypot_n(r1, r2)

11.5 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [24]:
%%timeit
hypot_t(r1, r2)

7.47 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


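We again see a speed improvement over `hypot_n`. Note, however, that `hypot_t` also uses `parallel=True` and `fastmath=True`; compared with `hypot_p` (7.77 ms), the gain from the explicit signature is small (7.47 ms).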

_________________________
## *vectorize*
_________________________

## Numpy axes

NumPy functions that take an `axis` argument allow us to apply an operation along a specific axis of an array.


In [25]:
arr = np.ones((5, 4)) * 1.01
arr

array([[1.01, 1.01, 1.01, 1.01],
       [1.01, 1.01, 1.01, 1.01],
       [1.01, 1.01, 1.01, 1.01],
       [1.01, 1.01, 1.01, 1.01],
       [1.01, 1.01, 1.01, 1.01]])

In [26]:
arr.cumsum(axis=0)

array([[1.01, 1.01, 1.01, 1.01],
       [2.02, 2.02, 2.02, 2.02],
       [3.03, 3.03, 3.03, 3.03],
       [4.04, 4.04, 4.04, 4.04],
       [5.05, 5.05, 5.05, 5.05]])

This takes the cumulative sum vertically, down each column.

In [27]:
arr.cumsum(axis=1)

array([[1.01, 2.02, 3.03, 4.04],
       [1.01, 2.02, 3.03, 4.04],
       [1.01, 2.02, 3.03, 4.04],
       [1.01, 2.02, 3.03, 4.04],
       [1.01, 2.02, 3.03, 4.04]])

This takes the cumulative sum horizontally, across each row.
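
Since every entry of `arr` is the same, the two directions can be easier to tell apart on an array with distinct values. A small sketch with a hypothetical 2 x 3 array `grid`:

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)
# grid is [[0, 1, 2],
#          [3, 4, 5]]

# axis=0: running total down each column
down_columns = grid.cumsum(axis=0)  # [[0, 1, 2], [3, 5, 7]]

# axis=1: running total across each row
across_rows = grid.cumsum(axis=1)   # [[0, 1, 3], [3, 7, 12]]

print(down_columns)
print(across_rows)
```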

Functions built with Numba's `vectorize` decorator can also take an `axis` argument.

In [33]:
from numba import vectorize, float64, float32

@vectorize([float64(float64, float64)])
def cumprod(x, y):
    return x * y

This `cumprod` function is now a NumPy universal function (ufunc).

This means we can use it to accumulate along an axis or reduce along an axis.

In [29]:
cumprod.accumulate(arr, axis=0)

array([[1.01      , 1.01      , 1.01      , 1.01      ],
       [1.0201    , 1.0201    , 1.0201    , 1.0201    ],
       [1.030301  , 1.030301  , 1.030301  , 1.030301  ],
       [1.04060401, 1.04060401, 1.04060401, 1.04060401],
       [1.05101005, 1.05101005, 1.05101005, 1.05101005]])

In [30]:
cumprod.accumulate(arr, axis=1)

array([[1.01      , 1.0201    , 1.030301  , 1.04060401],
       [1.01      , 1.0201    , 1.030301  , 1.04060401],
       [1.01      , 1.0201    , 1.030301  , 1.04060401],
       [1.01      , 1.0201    , 1.030301  , 1.04060401],
       [1.01      , 1.0201    , 1.030301  , 1.04060401]])

In [31]:
cumprod.reduce(arr, axis=0)

array([1.05101005, 1.05101005, 1.05101005, 1.05101005])

In [32]:
cumprod.reduce(arr, axis=1)

array([1.04060401, 1.04060401, 1.04060401, 1.04060401, 1.04060401])
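
As a cross-check that needs no Numba: the custom `cumprod` ufunc multiplies two numbers, which is also what NumPy's built-in `np.multiply` ufunc does. The sketch below reproduces the same kind of `accumulate` and `reduce` calls with `np.multiply` and compares them with `np.cumprod` and `np.prod`.

```python
import numpy as np

arr = np.ones((5, 4)) * 1.01

# np.multiply is NumPy's built-in elementwise product ufunc.
accumulated = np.multiply.accumulate(arr, axis=0)
reduced = np.multiply.reduce(arr, axis=1)

# These match the dedicated convenience functions.
assert np.allclose(accumulated, np.cumprod(arr, axis=0))
assert np.allclose(reduced, np.prod(arr, axis=1))
```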

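We can also allow float32. The more specific signature (float32) should be listed before the less specific one (float64); otherwise float32 inputs are dispatched to the float64 version.

In [34]:
@vectorize([float32(float32, float32), float64(float64, float64)])
def cumprod(x, y):
    return x * y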

Numba does not support every Python type or language feature; async functions, for example, are unsupported.

Documentation is [here](https://numba.pydata.org/numba-doc/dev/reference/pysupported.html)