### Install HPAT
```bash
conda create -n HPAT python=3.6
source activate HPAT
conda install pandas
conda install numba -c numba
conda install pyarrow mpich -c conda-forge
conda install hpat -c ehsantn
```

### Pi Example
Let's run a simple Python numerical function that estimates Pi using a Monte Carlo method:

```python
import numpy as np
import time

def calc_pi(n):
    t1 = time.time()
    x = 2 * np.random.ranf(n) - 1
    y = 2 * np.random.ranf(n) - 1
    pi = 4 * np.sum(x**2 + y**2 < 1) / n
    print("Execution time:", time.time()-t1, "\nresult:", pi)
    return pi

calc_pi(2 * 10**8)
```

```
Execution time: 4.863769292831421
result: 3.14138024

3.14138024
```

Now let's run the same function with the `hpat.jit` decorator:

```python
import hpat

@hpat.jit
def calc_pi(n):
    t1 = time.time()
    x = 2 * np.random.ranf(n) - 1
    y = 2 * np.random.ranf(n) - 1
    pi = 4 * np.sum(x**2 + y**2 < 1) / n
    print("Execution time:", time.time()-t1, "\nresult:", pi)
    return pi

calc_pi(2 * 10**8)
```

```
Execution time: 1.8851549625396729
result: 3.1416436

3.1416436
```

The `hpat.jit` decorator hands the function to HPAT for compilation. HPAT generates an efficient binary version, which replaces the original function in the Python environment. In this case, the HPAT version is ~2.5x faster.

However, this function was run on a single core. Let's save this code in `pi.py` and run it on multiple cores using MPI:

```bash
mpiexec -n 4 python pi.py
Execution time: 0.9372961521148682 
result: 3.1415531
```

`mpiexec` launches 4 Python processes that run the same Python program. HPAT parallelizes the decorated function across these processes, using the MPI library for communication. It divides the problem space (`n`) and manages communication between cores. In this case, the `np.sum` call requires a reduction. The generated function is scalable and can run on any number of cores (e.g. `mpiexec -n 10000`).
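
The division of work and the final reduction can be imitated in plain NumPy. The sketch below is only an illustration, not HPAT's generated code: the loop over `nranks` stands in for MPI processes, and the helper names `local_hits` and `calc_pi_blocked`, the per-rank seeding, and the chunking rule are all assumptions made for the example.

```python
import numpy as np

def local_hits(n_local, seed):
    """Count random points inside the unit circle for one simulated rank."""
    rng = np.random.default_rng(seed)
    x = 2 * rng.random(n_local) - 1
    y = 2 * rng.random(n_local) - 1
    return int(np.sum(x**2 + y**2 < 1))

def calc_pi_blocked(n, nranks):
    # Split n into equal chunks; the last non-empty chunk may be smaller
    chunk = -(-n // nranks)  # ceiling division
    counts = [max(0, min(chunk, n - r * chunk)) for r in range(nranks)]
    # Each simulated rank works only on its own share of the samples
    partial = [local_hits(c, seed=r) for r, c in enumerate(counts)]
    # Adding the partial counts plays the role of the MPI reduction
    return 4 * sum(partial) / n

print(calc_pi_blocked(10**6, 4))
```

Only one integer per rank has to be communicated, regardless of `n`, which is why the reduction scales to many processes.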

HPAT supports threading within a single process as well. Let's enable HPAT threading for this program:

```python
hpat.multithread_mode = True

@hpat.jit
def calc_pi(n):
    t1 = time.time()
    x = 2 * np.random.ranf(n) - 1
    y = 2 * np.random.ranf(n) - 1
    pi = 4 * np.sum(x**2 + y**2 < 1) / n
    print("Execution time:", time.time()-t1, "\nresult:", pi)
    return pi

calc_pi(2 * 10**8)
```

```
Execution time: 0.46541595458984375
result: 3.14165164

3.14165164
```

HPAT can compile a subset of Python which includes Pandas and Numpy operations. See HPAT documentation [here](https://intellabs.github.io/hpat/). HPAT is built on top of Numba, which means Numba's restrictions apply as well. See Numba's documentation [here](http://numba.pydata.org/numba-doc/latest/index.html).

### Automatic Parallelization
HPAT parallelizes programs automatically based on the map-reduce parallel pattern. Put simply, this means the compiler analyzes the program to determine whether each array should be distributed or not.

To demonstrate parallelization, let's use a simple example. First generate some random number data:

```python
import h5py
f = h5py.File("data.h5", "w")
n = 16
f.create_dataset('A', data=np.random.ranf(n))
f.close()
```

Here is an example HPAT function that reads this data and sums the values:

```python
@hpat.jit
def example_1D():
    f = h5py.File("data.h5", "r")
    A = f['A'][:]
    return np.sum(A)

r = example_1D()
print(r)
```

```
9.022310532208113
```


Array `A` is the output of an I/O operation and is the input to `np.sum`. Based on the semantics of I/O and `np.sum`, HPAT determines that `A` can be distributed, since I/O can output a distributed array and `np.sum` can take a distributed array as input. In map-reduce terminology, `A` is the output of a map operator and the input to a reduce operator. Hence, HPAT distributes `A` and all operations associated with it (i.e. I/O and `np.sum`).
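
The same pattern can be written out by hand to see why the result does not change. The following is a plain NumPy imitation under stated assumptions: an in-memory array stands in for the HDF5 dataset, `nranks` simulated ranks replace real MPI processes, and the slicing scheme is illustrative rather than HPAT's actual I/O code.

```python
import numpy as np

n, nranks = 16, 4
data = np.random.default_rng(0).random(n)  # stand-in for f['A'] on disk

chunk = -(-n // nranks)  # ceiling division: elements per rank

# Map: each simulated rank reads only its own slice of the dataset
local_arrays = [data[r * chunk:(r + 1) * chunk] for r in range(nranks)]

# Reduce: the local sums are combined into one global result
local_sums = [a.sum() for a in local_arrays]
total = sum(local_sums)

# The distributed result agrees with the sum over the whole array
print(total, np.isclose(total, data.sum()))
```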

#### Distribution Report
The distributions found by HPAT can be printed using the hpat.distribution_report() function:

```python
hpat.distribution_report()
```

```
Array distributions:
   $A.688               1D_Block

Parfor distributions:
   18                   1D_Block
```


This report shows that the function has one array distributed in 1D_Block fashion. The optimization passes rename the variable, here from `A` to `$A.688`. The report also shows that there is a parfor (data-parallel for loop) that is 1D_Block distributed.

Arrays are distributed in a one-dimensional block (1D_Block) manner among processors. This means that processors own equal chunks of each distributed array (except possibly the last processor). Multi-dimensional arrays are distributed along their first dimension by default. For example, chunks of rows are distributed for a 2D matrix. The figure below illustrates the distribution of a 9-element one-dimensional Numpy array, as well as a 9 by 2 array, on three processors.
<img src="https://intellabs.github.io/hpat/_build/html/_images/dist.jpg" width="35%">
HPAT replicates the arrays that are not distributed. This is called REP distribution for consistency.
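
The block ownership described above can be sketched with a small helper. This is an illustration only: `block_bounds` is a hypothetical function, and the ceiling-sized chunk rule is an assumption that matches the description (equal chunks, with a possibly smaller last one), not necessarily HPAT's exact formula.

```python
import numpy as np

def block_bounds(n, nprocs, rank):
    """Return the [start, end) range along the first dimension owned by `rank`."""
    chunk = -(-n // nprocs)          # ceiling division
    start = min(rank * chunk, n)
    end = min(start + chunk, n)
    return start, end

A = np.arange(9)                     # 9-element one-dimensional array
B = np.arange(18).reshape(9, 2)      # 9 by 2 array, split along rows

for rank in range(3):
    s, e = block_bounds(len(A), 3, rank)
    # Each processor holds three elements of A and three rows of B
    print(rank, A[s:e], B[s:e].shape)
```

With 10 elements on 4 processors, the same rule gives chunks of 3, 3, 3 and 1, so only the last processor holds a smaller piece.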

Checking the distribution report is generally useful for ensuring that the function is parallelized as expected. Note, however, that HPAT can fuse operations and eliminate intermediate arrays, so some arrays may not appear in the report. For example:

```python
@hpat.jit
def example_elim():
    A = np.arange(10)
    return A.sum()

r = example_elim()
print(r)
hpat.distribution_report()
```

```
45
Array distributions:

Parfor distributions:
   22                   1D_Block
```


In this case, the parfors for `arange` and `sum` are fused, and the intermediate array `A` is eliminated. Hence, the program has no arrays left after optimization.
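
Conceptually, fusion turns two passes over an array into a single loop. The sketch below shows the idea in plain Python. The function names `unfused` and `fused` are made up for the example, and the hand-written loop only approximates what the compiler produces.

```python
import numpy as np

def unfused(n):
    A = np.arange(n)    # first pass: the intermediate array is materialized
    return A.sum()      # second pass: the array is read back and reduced

def fused(n):
    total = 0
    for i in range(n):  # one loop produces and consumes each element
        total += i
    return total        # no array was ever allocated

print(unfused(10), fused(10))
```

Both functions return 45 for `n = 10`, but the fused version has no array left to distribute, which is why the report above lists only a parfor.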