PySpark
-------

Spark can manage "big data" collections with a small set of high-level primitives like `map`, `filter`, `groupby`, and `join`. With these common patterns we can often handle computations that are more complex than a plain `map` but still structured.


*Note: PySpark requires a little additional setup. Usually the `JAVA_HOME` environment variable needs to be set (on OS X via `/usr/libexec/java_home`; on Linux, point it at the JDK install directory)*:

```bash
export JAVA_HOME="$(/usr/libexec/java_home)"
```
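Before creating a `SparkContext`, it can help to confirm from Python that `JAVA_HOME` points at an existing directory. The helper below is a minimal sketch; `java_home_ok` is a hypothetical name used only for illustration and is not part of PySpark.

```python
import os

def java_home_ok(environ=os.environ):
    """Return True if JAVA_HOME is set and points to an existing directory."""
    java_home = environ.get("JAVA_HOME", "")
    return bool(java_home) and os.path.isdir(java_home)

if not java_home_ok():
    # Only a hint: Spark may still locate a Java runtime some other way
    print("JAVA_HOME is missing or invalid; PySpark may not find a Java runtime.")
```

The `environ` parameter defaults to the real environment, so the check can also be exercised with a plain dictionary.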


![PySpark Internals](http://i.imgur.com/YlI8AqEl.png)

- Apache Spark is a fast and general-purpose cluster computing system. 
- It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. 
- It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

## Simple example

In [1]:
from pyspark import SparkContext
import os
os.environ["PYSPARK_PYTHON"]="python3"
sc = SparkContext('local[2]')  # Run Spark locally with 2 worker threads

We now have a Spark context `sc` backed by a tiny local Spark instance with 2 worker threads (this works fine on a multicore machine).

In [2]:
print(sc)  # sc plays a role similar to a process-pool executor

<SparkContext master=local[2] appName=pyspark-shell>


In [3]:
rdd = sc.parallelize(range(8))  # create collection
rdd

PythonRDD[1] at RDD at PythonRDD.scala:48

In [4]:
rdd.collect()  # Gather results back to local process

[0, 1, 2, 3, 4, 5, 6, 7]

In [5]:
# Square each element
rdd.map(lambda x: x ** 2)

PythonRDD[2] at RDD at PythonRDD.scala:48

In [7]:
# Square each element and collect results
rdd.map(lambda x: x ** 2).collect()

[0, 1, 4, 9, 16, 25, 36, 49]
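The cells above show that `map` on its own only returns a new RDD; nothing is computed until an action such as `collect` runs. Python's built-in `map` is lazy in a loosely similar way, which allows a Spark-free sketch of the idea. The names `calls` and `square` are illustrative only.

```python
calls = []  # records every time the function actually runs

def square(x):
    calls.append(x)
    return x ** 2

lazy = map(square, range(8))   # like rdd.map(...): describes the work, runs nothing
assert calls == []             # no element has been squared yet

result = list(lazy)            # like .collect(): forces the computation
print(result)                  # [0, 1, 4, 9, 16, 25, 36, 49]
print(len(calls))              # 8
```

The analogy is loose: Spark additionally distributes the work across workers once the action is triggered.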

In [8]:
# Map-Reduce operation 
from operator import add
rdd.map(lambda x: x ** 2).reduce(add)

140
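`reduce` combines elements pairwise. Spark applies the operator within each partition and then combines the partial results, so the operator should be associative and commutative, as `add` is. Below is a plain-Python sketch of that two-stage pattern. The two-way split is an assumption made for illustration; Spark chooses its own partitioning.

```python
from functools import reduce
from operator import add

data = [x ** 2 for x in range(8)]   # the mapped values
partitions = [data[:4], data[4:]]   # hypothetical 2-way split

# Stage 1: reduce each partition independently
partials = [reduce(add, part) for part in partitions]
# Stage 2: combine the per-partition results
total = reduce(add, partials)

print(partials)  # [14, 126]
print(total)     # 140
```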

In [10]:
# Select only the even elements
rdd.filter(lambda x: x % 2 == 0).collect()

[0, 2, 4, 6]

In [14]:
# Cartesian product of each pair of elements in two sequences 
# (or the same sequence in this case)
rdd.cartesian(rdd).collect()[5:]

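AttributeError: 'NoneType' object has no attribute 'setCallSite'

*Note: this error typically appears when an RDD is used after its Spark context has been stopped. Run the cell while `sc` is still active to get the list of pairs.*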

In [15]:
# Stop the local spark cluster
sc.stop()

## Pi computation example

In [18]:
from random import random
from operator import add

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = 4
n = 1000000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n+1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()

Pi is roughly 3.140908
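The estimate works because a point drawn uniformly from the square [-1, 1] x [-1, 1] lands inside the unit circle with probability pi/4, so `4 * count / n` approximates pi. A sequential sketch of the same computation, without Spark, is shown below. The fixed seed and the name `estimate_pi` are assumptions added here so that the run is repeatable.

```python
from random import Random

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi from n random points in [-1, 1] x [-1, 1]."""
    rng = Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        if x ** 2 + y ** 2 <= 1:   # point falls inside the unit circle
            inside += 1
    return 4.0 * inside / n

pi_hat = estimate_pi(100_000)
print("Pi is roughly %f" % pi_hat)
```

In the Spark version, each partition counts its own hits and `reduce(add)` sums those counts.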


## Application

We again start with the following sequential code:

```python
series = {}
for fn in filenames:   # Simple map over filenames
    series[fn] = pd.read_hdf(fn)['x']

results = {}

for a in filenames:    # Doubly nested loop over the same collection
    for b in filenames:  
        if a != b:     # Filter out bad elements
            results[a, b] = series[a].corr(series[b])  # Apply function

((a, b), corr) = max(results.items(), key=lambda kv: kv[1])  # Reduction
```

### Spark/Dask.bag methods

We can construct most of the above computation with the following Spark/Dask.bag methods:

*  `collection.map(function)`: Apply function to each element in collection
*  `collection.cartesian(collection)`: Create new collection with every pair of inputs (called `product` in Dask.bag)
*  `collection.filter(predicate)`: Keep only elements of collection that match the predicate function
*  `collection.max()`: Compute maximum element

We use these briefly in isolated exercises and then combine them to rewrite the previous computation from the `submit` section.
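For intuition, each of the four methods has a close counterpart in plain Python. The sketch below runs sequentially on a small range and is not a replacement for the distributed versions; the variable names are illustrative.

```python
from itertools import product

collection = range(5)

squares = [x ** 2 for x in collection]          # collection.map(lambda x: x ** 2)
pairs = list(product(collection, collection))   # collection.cartesian(collection)
evens = [x for x in collection if x % 2 == 0]   # collection.filter(lambda x: x % 2 == 0)
largest = max(collection)                       # collection.max()

print(squares)     # [0, 1, 4, 9, 16]
print(len(pairs))  # 25
print(evens)       # [0, 2, 4]
print(largest)     # 4
```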

In [1]:
from pyspark import SparkContext
import os
os.environ["PYSPARK_PYTHON"]="python3"
sc = SparkContext('local[4]')

In [2]:
rdd = sc.parallelize(range(5))  # create collection
rdd

PythonRDD[1] at RDD at PythonRDD.scala:48

In [3]:
rdd.collect()  # Gather results back to local process

[0, 1, 2, 3, 4]

### `map`

In [9]:
# Square each element

rdd.map(lambda x: x ** 2)

PythonRDD[2] at RDD at PythonRDD.scala:48

In [10]:
# Square each element and collect results

rdd.map(lambda x: x ** 2).collect()

[0, 1, 4, 9, 16]

In [11]:
# Select only the even elements

rdd.filter(lambda x: x % 2 == 0).collect()

[0, 2, 4]

In [12]:
# Cartesian product of each pair of elements in two sequences (or the same sequence in this case)

rdd.cartesian(rdd).collect()

[(0, 0),
 (0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (1, 0),
 (1, 1),
 (1, 2),
 (1, 3),
 (1, 4),
 (2, 0),
 (2, 1),
 (2, 2),
 (2, 3),
 (2, 4),
 (3, 0),
 (4, 0),
 (3, 1),
 (4, 1),
 (3, 2),
 (4, 2),
 (3, 3),
 (3, 4),
 (4, 3),
 (4, 4)]

In [13]:
# Chain operations to construct more complex computations

(rdd.map(lambda x: x ** 2)
    .cartesian(rdd)
    .filter(lambda tup: tup[0] % 2 == 0)
    .collect())

[(0, 0),
 (0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (4, 0),
 (4, 1),
 (4, 2),
 (4, 3),
 (4, 4),
 (16, 0),
 (16, 1),
 (16, 2),
 (16, 3),
 (16, 4)]

### Exercise: Parallelize pairwise correlations with PySpark

To make this a bit easier we're just going to compute the maximum correlation and not try to keep track of the stocks that yielded this maximal result.

In [14]:
from glob import glob
import os
import pandas as pd

filenames = sorted(glob(os.path.join('..', 'data', 'json', '*.h5')))  # ../data/json/*.h5
filenames[:5]

['../data/json/hal.h5',
 '../data/json/hp.h5',
 '../data/json/hpq.h5',
 '../data/json/ibm.h5',
 '../data/json/jbl.h5']

In [15]:
%%time

### Sequential Code

series = []
for fn in filenames:   # Simple map over filenames
    series.append(pd.read_hdf(fn)['close'])

results = []

for a in series:    # Doubly nested loop over the same collection
    for b in series:  
        if not (a == b).all():     # Filter out comparisons of the same series 
            results.append(a.corr(b))  # Apply function

result = max(results)

CPU times: user 903 ms, sys: 187 ms, total: 1.09 s
Wall time: 1.23 s


In [None]:
%%time

### Parallel code

rdd = sc.parallelize(filenames)

# TODO

result = corr

In [None]:
result

In [22]:
%%time
### Parallel code

rdd = sc.parallelize(filenames)
series = rdd.map(lambda fn: pd.read_hdf(fn)['close'])

corr = (series.cartesian(series)
              .filter(lambda ab: not (ab[0] == ab[1]).all())
              .map(lambda ab: ab[0].corr(ab[1]))
              .max())

result = corr


CPU times: user 9.52 ms, sys: 3.15 ms, total: 12.7 ms
Wall time: 5.09 s


In [None]:
# The parallel version is slower here: starting workers and moving data
# between processes dominates, and each correlation is too cheap to amortize that overhead

In [18]:
result

0.94625064960703875
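Part of the work above is redundant: the Cartesian product yields both (a, b) and (b, a), and correlation is symmetric. One possible refinement, which is not part of the original exercise, is to visit each unordered pair once. The sketch below shows the idea in plain Python with a hand-written Pearson correlation and made-up toy series; `pearson` and `toy` are hypothetical names. It also keeps track of which pair produced the maximum.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up toy data standing in for the 'close' series
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 9.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}

# combinations() yields each unordered pair once: 3 pairs instead of 6
results = {(p, q): pearson(toy[p], toy[q]) for p, q in combinations(toy, 2)}
best_pair, best_corr = max(results.items(), key=lambda kv: kv[1])
print(best_pair)  # ('a', 'b')
```

On an RDD, a similar effect could presumably be obtained by attaching an index to each series (for example with `zipWithIndex`) and filtering the Cartesian product on `i < j`.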

### Dask.bag

In [23]:
%%time
### Parallel Code

import dask.bag as db

b = db.from_sequence(filenames)
series = b.map(lambda fn: pd.read_hdf(fn)['close'])

corr = (series.product(series)
              .filter(lambda ab: not (ab[0] == ab[1]).all())
              .map(lambda ab: ab[0].corr(ab[1]))
              .max())

result = corr.compute()


CPU times: user 1.27 s, sys: 1.32 s, total: 2.59 s
Wall time: 5.64 s


In [24]:
%%time

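# Run on the single-threaded (synchronous) scheduler;
# older Dask releases spelled this `get=dask.local.get_sync`
result = corr.compute(scheduler='synchronous')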

CPU times: user 905 ms, sys: 225 ms, total: 1.13 s
Wall time: 1.13 s


### Conclusion

*  Higher-level collections provide methods for common patterns such as `map`, `filter`, and `cartesian`
*  Move data into a collection, build a lazy computation, and trigger it at the end
*  We used PySpark (`cartesian + map`) and Dask.bag (`product + map`) to replace the nested for loop

In [25]:
sc.stop()