# Table of Contents
* [Learning Objectives](#Learning-Objectives)
* [Pandas and Performance](#Pandas-and-Performance)
	* [Set-Up](#Set-Up)
* [Numba & Cython](#Numba-&-Cython)
* [Dask](#Dask)


# Learning Objectives

After this notebook, the learner will be able to:
* Use Numba and Cython to improve the computational speed of operations on pandas containers
* Use Dask to perform simple parallel processing tasks on pandas containers

# Pandas and Performance

Here we are focusing on the storage/execution part of the PyData Stack. Pandas will often defer certain computations to various engines. 

The most familiar is ``numpy`` (which serves as the storage back-end as well). 

For reductions, we use ``bottleneck``, which is quite efficient at things like ``nansum`` (summing across a 1-d array with nans).
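As a small illustration of what a NaN-aware reduction does, the sketch below uses NumPy's own ``np.nansum`` as a stand-in, since ``bottleneck`` may not be installed everywhere. The array values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 1-d array with two missing values.
arr = np.array([1.0, np.nan, 2.5, np.nan, 4.0])

plain_total = np.sum(arr)            # an ordinary sum propagates NaN
nan_aware_total = np.nansum(arr)     # NaNs are skipped
series_total = pd.Series(arr).sum()  # pandas skips NaN by default (skipna=True)

print(plain_total, nan_aware_total, series_total)
```

The pandas ``sum`` and the NaN-aware NumPy reduction agree here because both skip the missing entries.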

When using ``.query()`` or ``.eval()``, pandas defers the computation to ``numexpr``, which evaluates an expression given as a string and can combine multiple computations into a single pass over the data. It can also use multiple cores.
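A minimal sketch of string-based evaluation, using made-up column names ``a``, ``b`` and ``c``. If ``numexpr`` is not installed, pandas falls back to a plain Python engine, so the results are the same either way.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three random columns.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "a": rng.standard_normal(1000),
    "b": rng.standard_normal(1000),
    "c": rng.standard_normal(1000),
})

# The whole arithmetic expression is handed over as one string.
via_eval = frame.eval("a + b * c")
via_pandas = frame["a"] + frame["b"] * frame["c"]

# .query() filters rows with a boolean expression written as a string.
via_query = frame.query("a > 0 and b < 0")
via_mask = frame[(frame["a"] > 0) & (frame["b"] < 0)]

print(np.allclose(via_eval, via_pandas), via_query.equals(via_mask))
```

Any speed benefit tends to show up on large frames; on small ones the parsing overhead can dominate.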

![PyData ecosystem](img/pydata-ecosystem.png)

## Set-Up

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_rows = 8
pd.options.display.max_columns = 8

# Numba & Cython

http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html

We are going to use ``numba``, a JIT (just-in-time) compiler, and ``cython``, a static compiler. Both allow one to write Python-like code and have it run at C-like speeds.

In [None]:
from numba import jit
import cython
%load_ext cython

np.random.seed(1234)
pd.set_option('display.max_rows', 12)
# randn requires an integer size; 1e5 is a float
s = pd.Series(np.random.randn(int(1e5)))
com = 20.0

We are going to compare four different expressions of the same computation, an exponentially weighted moving average (EWMA). This calculation cannot be easily vectorized because it is a recurrence relation: the result for each row depends on the result for the previous row, so the rows have to be processed in order.

- python
- cython1
- cython2 (the actual pandas impl)
- numba
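To make the recurrence concrete, here is a pure-Python sketch that checks it against the explicit definition of the adjusted EWMA (a weighted mean with weights ``(1 - alpha) ** k`` for the value ``k`` steps back). The function names and the tiny input list are made up for the illustration.

```python
def ewma_recurrence(values, com):
    """Adjusted EWMA computed row by row, carrying state forward."""
    alpha = 1.0 / (1.0 + com)
    old_weight = 1.0
    avg = values[0]
    out = [avg]
    for v in values[1:]:
        old_weight *= 1.0 - alpha
        avg = (old_weight * avg + v) / (old_weight + 1.0)
        old_weight += 1.0
        out.append(avg)
    return out


def ewma_direct(values, com):
    """Adjusted EWMA from the explicit weighted-mean definition (quadratic cost)."""
    alpha = 1.0 / (1.0 + com)
    out = []
    for t in range(len(values)):
        # The weight decays by (1 - alpha) for every step back in time.
        weights = [(1.0 - alpha) ** (t - i) for i in range(t + 1)]
        numerator = sum(w * x for w, x in zip(weights, values[: t + 1]))
        out.append(numerator / sum(weights))
    return out


data = [1.0, 2.0, 4.0, 8.0]
rec = ewma_recurrence(data, com=1.0)
direct = ewma_direct(data, com=1.0)
print(rec)
print(direct)
```

The direct form recomputes every weight for every row, which is why the row-by-row recurrence is the form worth compiling.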

The Python implementation is straightforward and serves as a reference implementation.

In [None]:
def python(s):
    # Pre-allocate a float Series; an explicit dtype avoids an object-dtype default
    output = pd.Series(index=range(len(s)), dtype='float64')

    alpha = 1. / (1. + com)
    old_weight = 1.0
    new_weight = 1.0
    weighted_avg = s[0]
    output[0] = weighted_avg
    
    for i in range(1,len(s)):
        v = s[i]
        old_weight *= (1-alpha)
        weighted_avg = ((old_weight * weighted_avg) + 
                        (new_weight * v)) / (old_weight + new_weight)
        old_weight += new_weight
        output[i] = weighted_avg
        
    return output

We can take the Python implementation and type the variables in Cython. We expose this Cython function via a Python wrapper function that creates and returns a Series.

In [None]:
%%cython
cimport cython
@cython.wraparound(False)
@cython.boundscheck(False)
def _cython(double[:] arr, double com, double[:] output):
    cdef:
        double alpha, old_weight, new_weight, weighted_avg, v
        int i
    
    alpha = 1. / (1. + com)
    old_weight = 1.0
    new_weight = 1.0
    weighted_avg = arr[0]
    output[0] = weighted_avg
    
    for i in range(1,arr.shape[0]):
        v = arr[i]
        old_weight *= (1-alpha)
        weighted_avg = ((old_weight * weighted_avg) + 
                        (new_weight * v)) / (old_weight + new_weight)
        old_weight += new_weight
        output[i] = weighted_avg
        
    return output

In [None]:
def cython1(s):
    output = np.empty(len(s),dtype='float64')
    _cython(s.values, com, output)
    return pd.Series(output)

This is the pandas implementation of ewma.

In [None]:
def cython2(s):
    return s.ewm(com=com,adjust=True).mean()

This is the numba implementation. It looks very similar to the Python implementation, just with the ``@jit`` decorator. We also wrap it in a function that takes and returns a Series. In newer numba versions, we can also do the array allocation inside the jitted function.

In [None]:
@jit
def _numba(arr, output):
    alpha = 1. / (1. + com)
    old_weight = 1.0
    new_weight = 1.0
    weighted_avg = arr[0]
    output[0] = weighted_avg
    
    for i in range(1,arr.shape[0]):
        v = arr[i]
        old_weight *= (1-alpha)
        weighted_avg = ((old_weight * weighted_avg) + 
                        (new_weight * v)) / (old_weight + new_weight)
        old_weight += new_weight
        output[i] = weighted_avg
    

def numba(s):
 
    output = np.empty(len(s),dtype='float64')
    _numba(s.values, output)
    return pd.Series(output)
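As a sketch of the variant that allocates the output inside the compiled function, the example below uses ``numba.njit`` and passes ``com`` as an argument rather than reading it from a global. The names ``_ewma_alloc`` and ``numba_alloc`` are hypothetical, and the ``ImportError`` fallback is an assumption added only so the example still runs (uncompiled) where numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without numba, run the same code uncompiled.
    def njit(func):
        return func


@njit
def _ewma_alloc(arr, com):
    # The output array is allocated inside the compiled function.
    output = np.empty(arr.shape[0], dtype=np.float64)
    alpha = 1.0 / (1.0 + com)
    old_weight = 1.0
    new_weight = 1.0
    weighted_avg = arr[0]
    output[0] = weighted_avg
    for i in range(1, arr.shape[0]):
        old_weight *= 1.0 - alpha
        weighted_avg = ((old_weight * weighted_avg) +
                        (new_weight * arr[i])) / (old_weight + new_weight)
        old_weight += new_weight
        output[i] = weighted_avg
    return output


def numba_alloc(s, com=20.0):
    # Hand a plain float64 array to the compiled function, wrap the result.
    return pd.Series(_ewma_alloc(s.to_numpy(dtype="float64"), com))


demo = numba_alloc(pd.Series([1.0, 2.0, 4.0, 8.0]), com=1.0)
print(demo)
```

Passing ``com`` explicitly avoids relying on a global that numba would otherwise treat as a constant at compile time.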

The most important thing in comparing performance is correctness!! We need to be sure that we are comparing the SAME things.

In [None]:
result1 = python(s)
result2 = cython1(s)
result3 = cython2(s)
result4 = numba(s)
# Compare with a floating-point tolerance; exact equality can fail on tiny rounding differences
all(np.allclose(result1, r) for r in (result2, result3, result4))
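When comparing floating-point results from different implementations, exact equality can be brittle. The sketch below uses made-up values to show the difference between ``Series.equals`` and a tolerance-based check with ``np.allclose``.

```python
import numpy as np
import pandas as pd

# 0.1 + 0.2 is not exactly 0.3 in binary floating point.
left = pd.Series([0.1 + 0.2, 1.0])
right = pd.Series([0.3, 1.0])

exact_match = left.equals(right)        # element-wise exact comparison
close_match = np.allclose(left, right)  # comparison within a small tolerance

print(exact_match, close_match)
```

If an exact check fails between two implementations, a tolerance-based check helps separate rounding noise from a genuine bug.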

In [None]:
%timeit -n 1 -r 1 python(s)

In [None]:
%timeit cython1(s)

In [None]:
%timeit cython2(s)

In [None]:
%timeit numba(s)

It might be surprising that the pandas implementation takes a bit longer than cython/numba. The pandas implementation does additional work, though: it also checks for ``NaN`` values.

# Dask

Dask is a library that helps with out-of-core calculations and parallelization. In this case we use all of the available cores to do the computation. pandas now releases the global interpreter lock (GIL) in some operations, which facilitates this type of multi-threaded, multi-core computation.

https://dask.readthedocs.org/en/latest/
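The idea of partitioned computation can be sketched without Dask at all. The example below is not how Dask is implemented; it is a hand-rolled illustration, using the standard-library ``ThreadPoolExecutor``, of splitting a frame into partitions, reducing each one in a thread, and combining the partial results. The helper ``partial_sum`` and the data sizes are made up.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)
n_rows = 100_000
frame = pd.DataFrame({
    "key": rng.integers(0, 1000, size=n_rows),
    "value": rng.standard_normal(n_rows),
})


def partial_sum(chunk):
    # Per-partition reduction: group sums within one slice of rows.
    return chunk.groupby("key")["value"].sum()


# Split the rows into 8 contiguous partitions.
n_partitions = 8
bounds = np.linspace(0, n_rows, n_partitions + 1, dtype=int)
chunks = [frame.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

# Reduce each partition in its own thread.
with ThreadPoolExecutor(max_workers=n_partitions) as pool:
    partials = list(pool.map(partial_sum, chunks))

# Combine: sums of partial sums give the overall group sums.
combined = pd.concat(partials).groupby(level=0).sum()
expected = frame.groupby("key")["value"].sum()

print(np.allclose(combined, expected))
```

This works because a sum can be split into partial sums; reductions without that property need a different combine step.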

In [None]:
import dask.dataframe as dd
from dask import threaded, multiprocessing

In [None]:
np.random.seed(1234)
N = int(1e7)
df = pd.DataFrame({'key' : np.random.randint(0,1000,size=N), 
                'value' : np.random.randn(N)})
ddf = dd.from_pandas(df, npartitions=8)
ddf

The Dask code below looks remarkably like the equivalent pandas code, with the exception of the ``.compute()`` method call. This is by design.

In [None]:
%timeit df.groupby('key').value.sum()

In [None]:
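# Recent Dask versions select the scheduler with scheduler=...; the older get= keyword was removed
%timeit ddf.groupby('key').value.sum().compute(scheduler='threads')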