<img src='img/anaconda-logo.png' align='left' style="padding:10px">
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# Accelerate: Faster Arrays with Numba

## Numba Overview

Numba can speed up your applications with high performance functions written directly in Python. 

With a few simple annotations, array-oriented computationally-intensive Python code can be optimized just-in-time to perform as well as, and sometimes better than, pre-compiled C, C++ and Fortran code.

## Numba Features

* intended to accelerate mathematical and scientific Python code.
* integration with the Python scientific software stack (thanks to Numpy)
* on-the-fly code generation (at import time or runtime, at the user’s preference)
* native code generation for the CPU (default) and GPU hardware

## Table of Contents
* [Accelerate: Faster Arrays with Numba](#Accelerate:-Faster-Arrays-with-Numba)
	* [Numba Overview](#Numba-Overview)
	* [Numba Features](#Numba-Features)
	* [Numba Set-up](#Numba-Set-up)
* [Why is Numba Needed?](#Why-is-Numba-Needed?)
	* [Python, Numpy, and Memory](#Python,-Numpy,-and-Memory)
	* [Numba and Memory](#Numba-and-Memory)
* [Numba JIT](#Numba-JIT)
	* [Numba ``@jit`` decorator](#Numba-@jit-decorator)
	* [Exercise: JIT a for loop](#Exercise:-JIT-a-for-loop)
	* [Example: JIT a 2D Sum](#Example:-JIT-a-2D-Sum)
	* [Example: JIT a Cumulative Sum](#Example:-JIT-a-Cumulative-Sum)
	* [Exercise: Compute $\pi$ Faster](#Exercise:-Compute-$\pi$-Faster)
	* [What is Numba Doing? LLVM and JIT Compilation](#What-is-Numba-Doing?-LLVM-and-JIT-Compilation)
	* [Inspecting LLVM and JIT Outputs](#Inspecting-LLVM-and-JIT-Outputs)
* [Numba Vectorize](#Numba-Vectorize)
	* [Review of Numpy Ufuncs](#Review-of-Numpy-Ufuncs)
	* [Numpy Example: Computing a Signal](#Numpy-Example:-Computing-a-Signal)
	* [NumPy Broadcasting](#NumPy-Broadcasting)
	* [Creating Ufuncs with Numba](#Creating-Ufuncs-with-Numba)
	* [Exercise: Vectorize the Signal](#Exercise:-Vectorize-the-Signal)
* [Numba CUDA](#Numba-CUDA)
	* [Introduction to CUDA](#Introduction-to-CUDA)
	* [Set-up and Test](#Set-up-and-Test)
	* [CUDA and Ufuncs](#CUDA-and-Ufuncs)
	* [CUDA and Trigonometry](#CUDA-and-Trigonometry)
* [Numba Strategies](#Numba-Strategies)
	* [Math and Science](#Math-and-Science)
	* [Specify Types](#Specify-Types)
	* [Targeted Optimization](#Targeted-Optimization)


## Numba Set-up

In [None]:
import numpy as np
from numba import jit
from numba import vectorize

# Why is Numba Needed?

Computationally intensive operations usually involve...

* arrays and loops
* same operation is applied to a large number of data elements
* elements stored in an array or container

## Python, Numpy, and Memory

Repeated access to array elements within a loop can be expensive because of the overhead of interpreted code.

* The most popular array container in Python is the NumPy ndarray.
* It is an n-dimensional array object with its data stored in a single memory buffer.
* In Python, access to the memory buffer of an ndarray is inefficient.
* The interpreter must go through several layers of method calls and indirection before reaching the C code that directly reads the memory.
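
To make the "single memory buffer" point concrete, here is a minimal sketch that inspects the layout of a small array. The array `a` and the indices are made up for illustration; `strides`, `itemsize`, `nbytes` and `flags` are standard ndarray attributes.

```python
import numpy as np

# A small 3x4 array of 8-byte floats, stored in one contiguous buffer
a = np.arange(12, dtype=np.float64).reshape(3, 4)

print(a.flags['C_CONTIGUOUS'])  # row-major layout in a single buffer
print(a.itemsize, a.nbytes)     # bytes per element, bytes in total
print(a.strides)                # bytes to step along each axis

# Byte offset of element [i, j] inside the buffer
i, j = 2, 1
offset = i * a.strides[0] + j * a.strides[1]

# Read the same element from a flat copy of the buffer
flat = np.frombuffer(a.tobytes(), dtype=np.float64)
same_value = flat[offset // a.itemsize]
print(a[i, j], same_value)
```

Compiled code can compute this offset and read the value directly, whereas the interpreter reaches the same bytes only after several layers of dispatch.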
    


This is where Numba comes in ...

## Numba and Memory

* uses LLVM as a JIT compiler to emit machine code
* directly accesses the underlying memory buffer
* eliminates the inefficiency of the interpreted code
* resulting code can perform as fast as the equivalent C code 
* retains the flexibility of high-level Python code

# Numba JIT

## Numba ``@jit`` decorator

Numba provides just-in-time (JIT) compiling via a function decorator `@jit`.

In [None]:
from numba import jit

In [None]:
@jit
def add(a,b):
    c = a + b
    return c

In [None]:
%timeit add(1,2)

## Exercise: JIT a for loop

Define a function that sums the integers from 0 to `n - 1`, using a for loop.

* add an input parameter `n` that determines the number of iterations in the for loop
* in the first implementation, do **not** use `@jit`
* in the second version, add the `@jit` decorator
* use `%timeit` to compare the performance of both implementations with `n=10` iterations
* use `%timeit` to compare the performance of both implementations with `n=1000000` iterations

In [None]:
def func1(n=10):
    total = 0
    for item in range(n):
        total+=item
    return total

In [None]:
@jit
def func2(n=10):
    total = 0
    for item in range(n):
        total+=item
    return total

In [None]:
%timeit func1(10)

In [None]:
%timeit func2(10)

In [None]:
%timeit func1(1000000)

In [None]:
%timeit func2(1000000)

*Using pure python `for` loops is almost always slower than any other implementation. Use numpy or numba jit when you can. But which is better, numpy or numba?...*

## Example: JIT a 2D Sum

Here we compare 4 different implementations of a 2-dimensional sum
* python
* python + numba `@jit`
* python + numba `@jit` + input type specification
* numpy

We will time each and compare.

Pure python implementation

In [None]:
def sum2d(arr):
    M, N = arr.shape
    total = 0.0
    for i in range(M):
        for j in range(N):
            total += arr[i,j]
    return total

Pure python with numba `@jit`

In [None]:
@jit
def sum2d_jit(arr):
    M, N = arr.shape
    total = 0.0
    for i in range(M):
        for j in range(N):
            total += arr[i,j]
    return total

Pure python with numba `@jit`, and with type specification

In [None]:
# arr is a 2-D float64 array, so the signature must say so
@jit('float64(float64[:,:])')
def sum2d_jit_typed(arr):
    M, N = arr.shape
    total = 0.0
    for i in range(M):
        for j in range(N):
            total += arr[i,j]
    return total

Finally, a numpy implementation

In [None]:
def sum2d_numpy(arr):
    M, N = arr.shape
    total = arr.sum()
    return total

Now, the timing comparisons...

In [None]:
dim_size = 1000
arr = np.random.random((dim_size,dim_size))

print("\n Timing the non-numba run")
%timeit sum2d(arr)
print("\n Timing the numba run")
%timeit sum2d_jit(arr)
print("\n Timing the numba typed run")
%timeit sum2d_jit_typed(arr)
print("\n Timing the numpy run")
%timeit sum2d_numpy(arr)

***Numba is not always the best answer. You have to try to find out.***

* Numba provided an improvement over pure python, but in this case, is still slower than numpy. 
* In the next example, the nature of the computation will benefit more from numba.

## Example: JIT a Cumulative Sum

Let's consider a different problem and see how numba performs in this case.

Implement the cumulative sum of an array (also known as an inclusive scan).

$$ y_i = \sum^{i}_{j=0}{x_j}  $$

Every element of the output is the sum of all previous elements in the input including the element at the current index.
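
As a small worked instance of the definition above, the sketch below uses the standard library's `itertools.accumulate` on a made-up input list `x`. It is only meant to show what an inclusive scan produces, not how the implementations below compute it.

```python
from itertools import accumulate

x = [1, 2, 3, 4]

# Running totals: y[i] = x[0] + ... + x[i]
y = list(accumulate(x))
print(y)  # [1, 3, 6, 10]
```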

In this example, our implementations will be:

* python
* numpy
* python with ``@jit``
* numpy with ``@jit``

First, define the python implementation:

In [None]:
def cumsum(arr):
    "Perform a cumulative reduction over addition"
    assert arr.ndim == 1
    accum = 0                      # accumulator (identity on domain)
    out = np.zeros_like(arr)       # allocate output array
    for i in range(arr.shape[0]):  # loop over every element
        accum += arr[i]            # accumulate values from the input 
        out[i] = accum             # store the accumulator to the current output
    return out

Apply ``@jit`` to our pure python implementation.

In [None]:
from numba import jit

# Identical code as above, just showing use of decorator
# We could equivalently write `cumsum_jit = jit(cumsum)`
@jit
def cumsum_jit(arr):
    "Perform a cumulative reduction over addition"
    assert arr.ndim == 1
    accum = 0                      # accumulator (identity on domain)
    out = np.zeros_like(arr)       # allocate output array
    for i in range(arr.shape[0]):  # loop over every element
        accum += arr[i]            # accumulate values from the input 
        out[i] = accum             # store the accumulator to the current output
    return out

Defining our pure numpy implementation

In [None]:
def cumsum_np(arr):
    return np.cumsum(arr)

Define our numpy with `@jit` implementation:

In [None]:
@jit
def cumsum_np_jit(arr):
    return np.cumsum(arr)

Test the numerical outputs of all implementations to verify that they match the expected numerical outputs.

In [None]:
arr = np.random.randint(1, 4, 10)      # test with random array
print('arr', arr)

In [None]:
expected = np.cumsum(arr)
out_py     = cumsum(arr)
out_py_jit = cumsum_jit(arr)
out_np     = cumsum_np(arr)
out_np_jit = cumsum_np_jit(arr)

print('out_py',     out_py)
print('out_py_jit', out_py_jit)
print('out_np',     out_np)
print('out_np_jit', out_np_jit)

assert np.all(out_py == expected)
assert np.all(out_py_jit == expected)
assert np.all(out_np == expected)
assert np.all(out_np_jit == expected)

Compare the speed of the 4 implementations

In [None]:
arr = np.random.randint(1, 4, 1000)

print("\n Python")
%timeit cumsum(arr)
print("\n NumPy")
%timeit cumsum_np(arr)
print("\n Python with JIT")
%timeit cumsum_jit(arr)
print("\n Numpy with JIT")
%timeit cumsum_np_jit(arr)

In this case, numba wins. In many cases, it helps to test different implementations such as the examples given above.

But what changed? How do you decide when to `@jit` and when not to `@jit`?

## Exercise: Compute $\pi$ Faster

Recall that we previously used the Accelerate profiler to look at different implementations of the [Wallis product](https://en.wikipedia.org/wiki/Wallis_product) for estimating the value of $\pi$. 

In 1655, John Wallis determined that $\pi$ could be computed as a product of ratios:

$$\pi = 2\prod_{i=1}^{\infty}\frac{4i^2}{4i^2-1}$$
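
Before timing anything, it can help to see how the truncated product behaves. The sketch below is a plain-Python illustration; the helper name `wallis_pi` and the chosen term counts are assumptions for this example and are not part of the exercise code.

```python
import math

def wallis_pi(n_terms):
    """Approximate pi using the first n_terms factors of the Wallis product."""
    estimate = 2.0
    for i in range(1, n_terms + 1):
        tmp = 4.0 * i * i
        estimate *= tmp / (tmp - 1.0)
    return estimate

# The estimate approaches pi from below as more factors are included
for n_terms in (10, 1000, 100000):
    approx = wallis_pi(n_terms)
    print(n_terms, approx, math.pi - approx)
```

The product converges slowly, so many iterations are needed for a good estimate, which is what makes it a useful benchmark.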

Use Numba to develop two additional implementations, one with Python and one with NumPy, and then compare all 4 with the Accelerate profiler.

In [None]:
from numba import jit

In [None]:
# Python implementation
def compute_pi_v1(n):
    pi = 2.0
    for i in range(1,n):
        tmp = 4*i**2
        pi *= tmp/(tmp-1)
    return pi

In [None]:
# Numpy implementation
def compute_pi_v2(n):
    series = 4.0*np.arange(1,n)**2
    series /= (series-1)
    return 2.0*series.prod()

In [None]:
# Python and @jit

@jit
def compute_pi_v3(n):
    pi = 2.0
    for i in range(1,n):
        tmp = 4*i**2
        pi *= tmp/(tmp-1)
    return pi

In [None]:
# Numpy and @jit

@jit
def compute_pi_v4(n):
    series = 4.0*np.arange(1,n)**2
    series /= (series-1)
    return 2.0*series.prod()

Here we will perform `%timeit` profiling:

In [None]:
n = int(1e6)
print('Version 1 Profiled')
%timeit compute_pi_v1(n)
print('Version 2 Profiled')
%timeit compute_pi_v2(n)
print('Version 3 Profiled')
%timeit compute_pi_v3(n)
print('Version 4 Profiled')
%timeit compute_pi_v4(n)

In [None]:
# Use the Accelerate profiler to compare all 4 implementations:
from accelerate import profiler

In [None]:
p1 = profiler.Profile()
p1.enable()
compute_pi_v1(n)
p1.disable()
p1.print_stats()

In [None]:
p2 = profiler.Profile()
p2.enable()
compute_pi_v2(n)
p2.disable()
p2.print_stats()

In [None]:
p3 = profiler.Profile()
p3.enable()
compute_pi_v3(n)
p3.disable()
p3.print_stats()

In [None]:
p4 = profiler.Profile()
p4.enable()
compute_pi_v4(n)
p4.disable()
p4.print_stats()

## What is Numba Doing? LLVM and JIT Compilation

Some understanding of how numba works can help select the best strategies to test.

<img src="img/numba-llvm.png" width="45%" align="right"/>
---

**Numba works by compilation in two stages**:
* Numba converts Python to LLVM (low-level virtual machine) intermediate code
* The LLVM JIT compiler converts the LLVM code to machine-specific assembly code

**Numba JIT'ed code can be faster than precompiled C code**
* The LLVM JIT compiler can emit specialized instructions for the specific host CPU.
* Precompiled code, by contrast, typically uses generic instructions for maximum portability.
* JIT'ed code paths are also specialized over argument types to get maximum benefit from JIT'ing.

**Numba often (not always) improves code already written using NumPy**.
* The array type in NumPy is accessible to Numba's intermediate representation (IR)
* Often more specific datatypes or code paths can be identified than those in the general ufuncs of NumPy.
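
The specialization over argument types can be observed directly. This is a minimal sketch: the function `scale` is a made-up example, and the `try`/`except` fallback to plain Python is an assumption added so the cell still runs where Numba is not installed.

```python
try:
    from numba import jit
except ImportError:
    # Assumption for this sketch: without Numba, run as plain Python
    def jit(func):
        return func

@jit
def scale(a, b):
    return a * b

scale(2, 3)        # first call with integers compiles an integer version
scale(2.0, 3.0)    # first call with floats compiles a separate float version

# With Numba installed, each compiled specialization is listed here
print(getattr(scale, "signatures", "Numba not installed"))
```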

## Inspecting LLVM and JIT Outputs

**Compilation Step 1**: It is possible to inspect the LLVM code:

In [None]:
for k, v in cumsum_np_jit.inspect_llvm().items():  # loop through each overload
    print('signature', k)
    print(v)

**Compilation Step 2**: It is even possible to inspect the JIT-generated assembly code:

In [None]:
for k, v in cumsum_np_jit.inspect_asm().items():  # loop through each overload
    print('signature', k)
    print(v)

# Numba Vectorize

You can create your own vectorized universal functions with numba!

## Review of Numpy Ufuncs

Most ***numpy*** array operations are **implemented as ufuncs** (*"universal functions"*). 

These are flexible functions that operate on arrays with compatible shapes, performing element-by-element calculations in a vectorized fashion.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Numpy Example: Computing a Signal

In [None]:
# Amplitudes, Frequencies, Phases, and times, for two components of a signal
a1 = 1.0
a2 = 3.0
f1 = 2.0
f2 = 4.0
p1 = 0
p2 = np.pi*np.random.random()
num_times = 10001
t = np.linspace(-2*np.pi, +2*np.pi, num_times)

# vectorized operations on numpy arrays with numpy ufunc operators {*, +, /}
t1 = t*(2*np.pi/f1) + p1
t2 = t*(2*np.pi/f2) + p2

# adding some noise to the amplitudes
n1 = 1 + 0.2*np.random.random(num_times)
n2 = 1 + 0.2*np.random.random(num_times)

# vectorized operations on numpy arrays with numpy ufunc functions {sin(), cos()}

# computing the signal component
s1 = a1*n1*np.sin(t1)
s2 = a2*n2*np.cos(t2)

# computing the total signal
y  = s1*s2

In [None]:
# Visualize the result

fig, ax = plt.subplots()
ax.plot(t,y)
ax.set_xlim(-2*np.pi, +2*np.pi)

## NumPy Broadcasting

Numpy uses a _**broadcasting**_ rule to adjust the shapes and dimensions of the operands.
* Broadcasting is a pairwise operation on the shapes of the arguments.
* Its result is a new shape that is compatible with both arguments.
* Each value in the shape corresponds to the size of a dimension.
* Any dimension of size 1 can be stretched to match a larger dimension by repeating its values; a missing leading dimension is treated as size 1.
* By treating scalars as 0-dimensional arrays, scalars can be broadcast to arrays as well.
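
The rules above can be checked on a small case. In this sketch the arrays `col` and `row` are made-up examples; `np.broadcast` is the standard NumPy helper that reports the broadcast shape.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4)                 # shape (4,), treated as (1, 4)

result = col + row                 # size-1 dimensions are stretched
print(result.shape)                # (3, 4)

scaled = 10 * row                  # the scalar acts as a 0-dimensional array
print(scaled.tolist())             # [0, 10, 20, 30]

# The broadcast shape can also be obtained without doing the arithmetic
print(np.broadcast(col, row).shape)
```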

## Creating Ufuncs with Numba

For example, the expression $y' = ax + y$ on NumPy arrays is evaluated with ufuncs, because arithmetic operations on NumPy arrays are implemented as ufuncs.

In [None]:
a = np.random.random(10)
x = np.random.random(10)
y = np.random.random(10)

yprime = a * x + y
print(yprime)

Numba can JIT array expressions like this:

In [None]:
@jit
def axpy(a, x, y):
    return a * x + y

In [None]:
yprime_numba = axpy(a, x, y)
print(yprime_numba)
assert np.all(yprime_numba == yprime)

**Numba can also generate ufuncs directly.** This is accomplished using the **``@vectorize``** decorator.

The function being "vectorized" is the kernel, a scalar function that is applied element-wise.  

In [None]:
from numba import vectorize

In [None]:
# Specify signatures to compile, i.e. single and double precision versions.
# The return type is written explicitly before the argument types.
signatures = ['float32(float32, float32, float32)', 
              'float64(float64, float64, float64)']

In [None]:
@vectorize(signatures)
def axpy_ufunc(a, x, y):
    # This function receives scalar arguments
    return a * x + y

In [None]:
yprime_ufunc = axpy_ufunc(a, x, y)
print(yprime_ufunc)
assert np.all(yprime_ufunc == yprime)

Numba ufuncs are powerful because they can **target multicore execution** (this example) and **GPU execution** (next section below).

In [None]:
@vectorize(signatures, target='parallel')  # CPU threads
def axpy_ufunc_par(a, x, y):
    return a * x + y

In [None]:
yprime_ufunc_par = axpy_ufunc_par(a, x, y)
print(yprime_ufunc_par)
assert np.all(yprime_ufunc_par == yprime)

Profiling them shows the relative performance:

In [None]:
n = 10 ** 7
a = np.random.random(n)
x = np.random.random(n)
y = np.random.random(n)

In [None]:
print('NumPy array expression')
%timeit a * x + y
print('\nNumba jit')
%timeit axpy(a, x, y)
print('\nNumba vectorize serial')
%timeit axpy_ufunc(a, x, y)
print('\nNumba vectorize multithreaded')
%timeit axpy_ufunc_par(a, x, y)

## Exercise: Vectorize the Signal

In a previous example, we constructed a signal using numpy trigonometry functions. 

* Create a function called `signal_numpy()` using the code from the numpy example above.
* Create a function called `signal_math()` that does the same thing, but using `math.sin()` and `math.cos()` instead of the numpy functions.
* Create a function called `signal_math_ufunc()` that uses `signal_math()` and `@vectorize` to create a `ufunc`
* Use `%timeit` or another profiler to compare run time performance of the three

All implementations should take the following as input parameters, with the defaults shown below:

```
a1 = 1.0
a2 = 3.0
f1 = 2.0
f2 = 4.0
p1 = 0
p2 = np.pi
num_times = 10001
```

*Hint: it's okay to use dictionary unpacking, i.e. construct a dictionary of keyword arguments and pass it in as `**kwargs`*

In [None]:
import math
import numpy as np

In [None]:
def signal_math():
    # code here
    pass


In [None]:
# decorator here
def signal_math_ufunc():
    # code here
    pass


In [None]:
def signal_numpy():
    # code here
    pass


In [None]:
# Profile run times and compare all three here. 
# How do you think signal_math_ufunc() will compare with signal_numpy()?


# Numba CUDA

* Numba contains support for CUDA GPU programming. 
* Numba provides a Python dialect for low-level programming on the CUDA GPU hardware. 
* It provides full control over the hardware for fine-tuning the performance of CUDA kernels.

## Introduction to CUDA

**Reference**: 
http://numba.pydata.org/numba-doc/0.13/CUDAintro.html

> *A CUDA GPU contains one or more streaming multiprocessors (SMs). Each SM is a many-core processor that is optimized for high throughput. The manycore architecture is very different from the common multicore CPU architecture. Instead of having a large cache and complex logic for instruction level optimization, a manycore processor achieves high throughput by executing many threads in parallel on many simpler cores. It overcomes latency due to cache miss or long operations by using zero-cost context switching. It is common to launch a CUDA kernel with hundreds or thousands of threads to keep the GPU busy.*

> *The CUDA programming model is similar to the SIMD vector model in modern CPUs. A CUDA SM schedules the same instruction from a warp of 32-threads at each issuing cycle. The advantage of CUDA is that the programmer does not need to handle the divergence of execution path in a warp, whereas a SIMD programmer would be required to properly mask and shuffle the vectors. The CUDA model decouples the data structure from the program logic.*

To learn more about CUDA, please refer to the NVIDIA CUDA C Programming Guide.

## Set-up and Test

The examples in this section of the lesson will not run if your computer does not have a CUDA compatible GPU.

* the notes below outline the requirements for the CUDA examples
* there is a cuda support test below which will bypass CUDA examples if your hardware cannot run them

**Requirements**

* A CUDA-Enabled GPU: http://numba.pydata.org/numba-doc/0.13/CUDASupport.html
* set the path `NUMBAPRO_CUDA_DRIVER` to your CUDA driver

The following error message will occur if you have not set the driver path:

```
CUDA driver library cannot be found.
If you are sure that a CUDA driver is installed,
try setting environment variable NUMBAPRO_CUDA_DRIVER
with the file path of the CUDA driver shared library.
```

Test your system for CUDA support by running the following cell:

In [None]:
try:
    from numba import cuda
    cuda.detect()
    assert cuda.is_available()
    cuda_capable = True
except Exception as e:
    # Report the exception type and message, then mark CUDA as unavailable
    print("Error: %s: %s" % (type(e).__name__, e))
    cuda_capable = False
print("CUDA capable =", cuda_capable)

## CUDA and Ufuncs

As mentioned above, Numba-created ufuncs can be targeted for `parallel` or `cuda` execution.

Below are multiple implementations that we will profile, using a numpy implementation as our baseline for performance comparisons.

* Numpy
* Numba JIT
* Numba Vectorize (serial)
* Numba Vectorize (parallel, multithreaded on CPU)
* Numba Vectorize (parallel, on CUDA GPU)

In [None]:
# Numpy Ufunc
def axpy_numpy_ufunc(a, x, y):
    return a * x + y

In [None]:
# NUMBA JIT
@jit
def axpy_numba_jit(a, x, y):
    return a * x + y

In [None]:
# NUMBA Ufunc
@vectorize(signatures)
def axpy_numba_vectorize(a, x, y):
    return a * x + y

In [None]:
# NUMBA Ufunc targeted at CPU threads
@vectorize(signatures, target='parallel')
def axpy_numba_parallel(a, x, y):
    return a * x + y

In [None]:
# NUMBA Ufunc targeted at GPU execution
# Only defined when the CUDA test above succeeded
if cuda_capable:
    @vectorize(signatures, target='cuda')
    def axpy_numba_cuda(a, x, y):
        return a * x + y

In [None]:
n = 10 ** 7
a = np.random.random(n)
x = np.random.random(n)
y = np.random.random(n)

In [None]:
print('NumPy array expression')
%timeit axpy_numpy_ufunc(a, x, y)
print('\nNumba jit')
%timeit axpy_numba_jit(a, x, y)
print('\nNumba vectorize serial')
%timeit axpy_numba_vectorize(a, x, y)
print('\nNumba vectorize parallel/multithreaded')
%timeit axpy_numba_parallel(a, x, y)
print('\nNumba vectorize CUDA/GPU')
if cuda_capable:
    %timeit axpy_numba_cuda(a, x, y)
else:
    print('\nSystem not CUDA capable. Did not run.')

## CUDA and Trigonometry

This example demonstrates when the GPU ufunc can speed things up.

GPUs have dedicated special function units for computing transcendental functions like sine and cosine.  

These operations can be a lot faster on the GPU, even when the data has to be transferred between the CPU and GPU over PCI Express.

In [None]:
import math

def trig(x, y):
    return math.sin(x) + math.cos(y)

Define the function signatures (input and return types) to compile

In [None]:
trig_sig = ['float32(float32, float32)', 'float64(float64, float64)']

Define the different implementations to compare when profiling performance:

In [None]:
# specialize for multicore CPU version
trig_par = vectorize(trig_sig, target='parallel')(trig)

In [None]:
# specialize for CUDA version (only when the CUDA test above succeeded)
if cuda_capable:
    trig_gpu = vectorize(trig_sig, target='cuda')(trig)

In [None]:
# specialize for default serial CPU version
trig_serial = vectorize(trig_sig)(trig)

Define the input and test the numerical output of the GPU implementation against a trusted baseline

In [None]:
n = 10 ** 7
x = np.random.random(n).astype(np.float32).reshape((100,-1))
y = np.random.random(n).astype(np.float32).reshape((100,-1))

if cuda_capable:
    assert np.allclose(np.sin(x) + np.cos(y), trig_gpu(x, y))

Profile all implementations and compare performance

In [None]:
print('NumPy trig')
%timeit np.sin(x) + np.cos(y)
print('\nNumba ufunc serial')
%timeit trig_serial(x, y)
print('\nNumba ufunc multithread')
%timeit trig_par(x, y)
print('\nNumba ufunc gpu')
if cuda_capable:
    %timeit trig_gpu(x, y)
else:
    print('\nSystem not CUDA capable. Did not run.')

# Numba Strategies

All the general advice regarding optimization still applies. However, here are some Numba-specific strategies:

## Math and Science

* Numba is largely intended to accelerate mathematical and scientific Python code.
* Applying Numba to very short, trivial functions or to arbitrary Python objects is unlikely to help.

## Specify Types

* If the function is only called once, specify the types so that the function is compiled ahead of the call, when it is defined.
* If a type signature is not specified, Numba infers the datatypes and specializes the function when it is first executed.
* This means that the first execution will be slow because Numba has to compile the function.
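
The first-call cost described above can be measured with a rough timing sketch. The function `sum_squares` is a made-up example, and the fallback to plain Python is an assumption added so the cell runs without Numba, in which case both calls take similar time.

```python
import time

try:
    from numba import jit
except ImportError:
    # Assumption for this sketch: without Numba, run as plain Python
    def jit(func):
        return func

@jit
def sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.perf_counter()
first = sum_squares(1000)      # includes compilation when Numba is present
first_call = time.perf_counter() - start

start = time.perf_counter()
second = sum_squares(1000)     # reuses the already compiled machine code
second_call = time.perf_counter() - start

print("first call:  %.6f s" % first_call)
print("second call: %.6f s" % second_call)
```

Exact timings depend on the machine and the Numba version, so treat the numbers as indicative only.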

## Targeted Optimization

* Only try to compile the critical paths in your code.
* If you have a piece of performance-critical computational code amongst some higher-level code, refactor it into a separate function.
* Factoring out the performance-critical code allows you to compile just that function with Numba.
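
One way this refactoring might look is sketched below. The names `_row_norms`, `summarize` and `records` are hypothetical, and the fallback to plain Python is an assumption added so the cell runs without Numba.

```python
import math
import numpy as np

try:
    from numba import jit
except ImportError:
    # Assumption for this sketch: without Numba, run as plain Python
    def jit(func):
        return func

@jit
def _row_norms(data, out):
    # Hot loop: plain numeric code over arrays that Numba can compile
    for i in range(data.shape[0]):
        acc = 0.0
        for j in range(data.shape[1]):
            acc += data[i, j] * data[i, j]
        out[i] = math.sqrt(acc)

def summarize(records):
    # High-level code stays ordinary Python: dicts, lists, strings
    data = np.array([r["values"] for r in records], dtype=np.float64)
    norms = np.empty(data.shape[0])
    _row_norms(data, norms)
    return {r["name"]: float(norm) for r, norm in zip(records, norms)}

records = [{"name": "a", "values": [3.0, 4.0]},
           {"name": "b", "values": [6.0, 8.0]}]
print(summarize(records))
```

Only the numeric kernel is decorated; the dictionary handling around it is left to the interpreter, where Numba would offer little benefit.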
