### Getting started with PySpark
#### Create spark context and other python imports

In [None]:
%matplotlib inline
from __future__ import print_function, division
from pyspark import SparkConf
from pyspark import SparkContext 
from StringIO import StringIO
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint

conf = SparkConf()
conf.set('spark.executor.instances', '2')
sc = SparkContext(conf=conf)  # pass conf in, otherwise the setting above is ignored

### A few preliminaries

#### lambda expressions
It is helpful to know lambda expressions and how to unpack arguments for them.

<code>lambda</code> is an expression that creates a small anonymous function, and it can close over variables from the enclosing scope.  It behaves like a function or other callable, but its body is a single expression with no return statement.

<code>lambda</code> is useful in pyspark because there is a need for a lot of simple callables.
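
As a minimal sketch in plain Python (no Spark needed), the cell below compares a lambda with an equivalent <code>def</code> and shows a lambda closing over a variable. The names <code>square_def</code> and <code>make_scaler</code> are made up for illustration.

```python
# A lambda and an equivalent def: both are callables that return x squared.
square = lambda x: x ** 2

def square_def(x):
    return x ** 2

def make_scaler(factor):
    # The returned lambda captures `factor` from the enclosing scope (a closure).
    return lambda value: value * factor

double = make_scaler(2)

results = [square(3), square_def(3), double(5)]
print(results)
```

Any of these callables could be handed to <code>.map</code> on an RDD in the same way as an inline lambda.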

#### mapping / reducing
In pyspark and other parallel computing frameworks, there is a concept of 'mapping'.

E.g. mapping across a list of tuples and summing the second element of each tuple
```python
x = [(1, 2), (2, 3)]

sum(map(lambda x:x[1], x)) # answer is 5 in python
# or
sc.parallelize(x).map(lambda x:x[1]).sum() # answer is 5 in pyspark
```
When mapping and reducing with pyspark, it can be helpful to unpack the arguments of a lambda used as a mapper.  The tuple-parameter form shown here is Python 2 syntax; Python 3 removed it.

```python
sc.parallelize(x).map(lambda (key, value): value).sum() # 5
```
Here x was distributed as a Resilient Distributed Dataset (RDD), each (key, value) tuple was unpacked, and the values were summed.
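
The tuple-parameter form <code>lambda (key, value): ...</code> only parses under Python 2. As a hedged sketch, here are three ways to pull out the value that work on both Python 2 and Python 3, shown with the built-in <code>map</code> rather than an RDD. The names <code>pairs</code> and <code>value_of</code> are illustrative only.

```python
from operator import itemgetter

pairs = [(1, 2), (2, 3)]

# 1. Index into the tuple explicitly.
total_by_index = sum(map(lambda kv: kv[1], pairs))

# 2. Let operator.itemgetter build the callable.
total_by_getter = sum(map(itemgetter(1), pairs))

# 3. Unpack inside a named function.
def value_of(kv):
    key, value = kv
    return value

total_by_def = sum(map(value_of, pairs))

print(total_by_index, total_by_getter, total_by_def)
```

The same three callables can be passed to <code>.map</code> on an RDD of tuples.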

#### sc.parallelize: create an RDD from a python iterable

In [None]:
import random
def generate_rands(length=100):
    for idx in range(length):
        yield (random.randint(0,1), random.randint(0,1))
rand_rdd = sc.parallelize(generate_rands())

#### rand_rdd is now held in parallel java processes for use with pyspark map reduce
If we want to bring the RDD back into python, we can call <code>.collect()</code>

In [None]:
rand_rdd.collect()

#### But a better way to inspect it is to <code>take</code> a few elements rather than the full RDD
This avoids pulling the entire RDD into the main python process, which would defeat the parallelism.

In [None]:
rand_rdd.take(5)

#### sc.binaryFiles: read binary HDFS files into an RDD, matching a wildcard pattern

In [None]:
img_files = sc.binaryFiles('/img/*')  # wildcard hadoop distributed file system name
img_files.take(1) # Don't call .collect() on this one!

#### sc.pickleFile: read results typically from pyspark that were output with saveAsPickleFile

In [None]:
image_measurements = sc.pickleFile('hdfs:///t1/map_each_image/measures')
image_measurements.take(1)  # Don't call .collect() on this one!

#### .map applies a function to each element of the RDD
This function maps each pair of random numbers (x, y) to their squares and their product

In [None]:
products = rand_rdd.map(lambda (x, y): (x**2, y**2, x*y))
products.take(1)

#### .reduce applies an aggregation to an RDD
.reduce takes a callable which is called with two elements of the RDD (repeatedly)

Here is the sum of the x*y products, one of the terms needed for the covariance

In [None]:
products.map(lambda (x_square, y_square, xy): (xy)).reduce(lambda a, b: a + b)
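
The sum of x*y is only one ingredient of the covariance; the sums of x and y and the count are needed too. The sketch below uses <code>functools.reduce</code> as a plain-Python stand-in for <code>.reduce</code> on an RDD, with made-up sample pairs, and computes the population covariance as mean(xy) - mean(x) * mean(y). The helper <code>add_elementwise</code> is a name invented for this example.

```python
from functools import reduce

# Made-up sample of (x, y) pairs, standing in for the contents of rand_rdd.
pairs = [(0, 1), (1, 1), (1, 0), (1, 1)]

# Map step: emit everything needed in one tuple (x, y, x*y, count).
terms = [(x, y, x * y, 1) for x, y in pairs]

def add_elementwise(a, b):
    # The reducer receives two elements and returns one of the same shape.
    return tuple(p + q for p, q in zip(a, b))

sum_x, sum_y, sum_xy, n = reduce(add_elementwise, terms)
n = float(n)  # avoid integer division on Python 2

covariance = sum_xy / n - (sum_x / n) * (sum_y / n)
print(covariance)
```

Because Spark applies the reducer within and across partitions in no fixed order, the callable given to <code>.reduce</code> should be associative and commutative, as elementwise addition is.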

### Going through some map reduce ideas with the image files

#### Mapping the 'load_image' function to filenames will load each filename's image into a Resilient Distributed Dataset (RDD)

In [None]:
def load_image(image):
    """Load one image, where image = (key, blob)"""
    from StringIO import StringIO
    from PIL import Image
    img = Image.open(StringIO(image[1]))
    return  image[0], np.asarray(img, dtype=np.uint8)

img_files = sc.binaryFiles('/img/malestaff*')  # wildcard hadoop distributed file system name
img_mapped = img_files.map(load_image)  # .map is a method of the RDD, not of sc
img_mapped.cache() # cache this RDD for later use (only a performance helper)
img_mapped.take(1) # take the 1st one and repr it to see what it looks like


#### As shown above, img_mapped is an RDD of tuples (filename, image).  We can in turn map the RDD of images to an analysis function.  Here we are calculating the percentiles of one band of color.

In [None]:
band = 0 # red
red_percentiles = img_mapped.map(
                        lambda (fname, img): np.percentile(img[:,:,band], (5, 25, 50, 75, 95))
                    )
red_percentiles.take(1)


#### With spark, we can also reduce an RDD.  This will take an average of the red percentiles by summing and dividing by the count.

In [None]:
avgs = red_percentiles.reduce(
                            lambda a, b: a + b
                        ) / red_percentiles.count()
print(avgs)

#### Example of groupby operations, with a random grouping

In [None]:
rand_groups = img_mapped.map(
                lambda (fname, img): (random.randint(0,2), (fname, img))
            ).groupByKey(
            )

print("Grouped", rand_groups.take(1))



#### Calculating a mean color within the groups.  Collect brings the results to python as a list.

In [None]:
rand_groups.map(
        # np.mean needs a list here; it cannot average a generator
        lambda (rnd, results): (rnd, np.mean([np.mean(img[:,:,band]) for fname, img in results]))
    ).collect()
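
To make the group-then-average step concrete, here is a plain-Python sketch of what <code>groupByKey</code> followed by a per-group mean computes. The records are hypothetical (group, value) pairs, where each value stands in for one image's mean in a color band.

```python
from collections import defaultdict

# Hypothetical (group, value) records; not real measurements.
records = [(0, 10.0), (1, 20.0), (0, 30.0), (2, 5.0), (1, 40.0)]

# Equivalent of groupByKey: collect all values that share a key.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Equivalent of the map over groups: one mean per key.
group_means = sorted((key, sum(vals) / len(vals)) for key, vals in groups.items())
print(group_means)
```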

#### Next we use some of the outputs of image-analyzer: the measurements taken on each raw image.  They are saved as a pickleFile RDD, so we can load the RDD with pickleFile

In [None]:
import yaml
with open('config.yaml') as f:
    config = yaml.safe_load(f)  # safe_load parses plain YAML without running arbitrary tags
output_hdfs = 'hdfs:///t1/map_each_image/measures_2'


#### map the measurements function to the images, saving at output_hdfs path

In [None]:
input_file_spec = '/imgs/ma*'
map_each_image(sc, config, input_file_spec, output_hdfs)

#### later load the measures from hdfs using unpickling

In [None]:
measures = sc.pickleFile(output_hdfs)
print(measures)

#### To see what an RDD looks like, we call <code>take</code> to get one or more of its elements, then inspect their types to figure out what is going on.

In [None]:
example = measures.take(1)  # take always returns a list, here of length 1
print('example is of type', type(example))
key, value = example[0]
print('value is of type', type(value))
print('and has keys of', tuple(value.keys()))

#### From the image analyzer code we know that the 'histo' key is the 3 band histogram flattened into one array, so we can check its size and know where the median colors are.  For example, if it is 21 elements long, then the first 7 are the red histogram and the middle of those 7 elements is the median red color.

In [None]:
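def get_median_from_histo(histo, band, n_bands=3):
    # histo is n_bands equal-length histograms flattened into one array
    band_len = len(histo) // n_bands
    # middle bin of the requested band, e.g. index 3 of 0..6 for band 0
    return histo[band * band_len + band_len // 2]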
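
As a worked check of the indexing described above, the sketch below assumes the flattened histogram holds 3 bands of equal length and picks the middle bin of each band. The function name <code>middle_bin</code> and the test histogram (whose values equal their own indices) are made up for illustration.

```python
def middle_bin(histo, band, n_bands=3):
    # Length of one band's histogram inside the flattened array.
    band_len = len(histo) // n_bands
    # Offset to the start of the band, plus the middle of that band.
    return histo[band * band_len + band_len // 2]

# A 21-element histogram whose values are their own indices,
# so the result shows directly which position was read.
histo = list(range(21))
mids = [middle_bin(histo, band) for band in range(3)]
print(mids)
```

With 21 elements each band has 7 bins, so the middle bins sit at positions 3, 10 and 17.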

#### As an exercise, sort the images by median red color.  (band = 0)

In [None]:
# sortBy returns a new RDD, so keep the result
measures = measures.sortBy(lambda (key, value): get_median_from_histo(value['histo'], 0))

#### Make a function to print out images in the same order they are found in <code>measures</code>.

In [None]:
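def imshow_measures(measures):
    # Assumes each key in measures is the image's HDFS path, as in binaryFiles.
    for fname in measures.keys().collect():
        _, img = load_image(sc.binaryFiles(fname).first())  # load_image takes (key, blob)
        plt.figure()
        plt.imshow(img)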

In [None]:
imshow_measures(measures)

#### sort by median in the green band

In [None]:
# band=1 belongs inside get_median_from_histo, not as sortBy's second argument
measures = measures.sortBy(lambda (key, value): get_median_from_histo(value['histo'], 1))

#### show the images again

In [None]:
imshow_measures(measures)

#### Joins in PySpark work on RDDs of tuples and equality on the first item in the tuples is used for joining

Here is an example with 2 lists of tuples that are spark RDDs

In [None]:
list_a = sc.parallelize([(1, 2), (2, 3), (3, 5)])
list_b = sc.parallelize([(1, 3), (2, 5)])

In [None]:
list_a.join(list_b).collect()

In [None]:
list_b.join(list_a).collect()

In [None]:
list_a.rightOuterJoin(list_b).collect()

In [None]:
list_a.leftOuterJoin(list_b).collect()
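
To show what these joins return without a cluster, here is a plain-Python sketch over the same two lists. The helpers <code>inner_join</code> and <code>left_outer_join</code> are hypothetical stand-ins written for this example, not PySpark functions; on an RDD the corresponding methods are <code>join</code>, <code>leftOuterJoin</code>, <code>rightOuterJoin</code> and <code>fullOuterJoin</code>. Results are sorted by key here because Spark does not guarantee output order.

```python
list_a = [(1, 2), (2, 3), (3, 5)]
list_b = [(1, 3), (2, 5)]

def inner_join(left, right):
    # Pair up values whose keys (first tuple items) are equal.
    pairs = [(k, (v, w)) for k, v in left for k2, w in right if k == k2]
    return sorted(pairs, key=lambda kv: kv[0])

def left_outer_join(left, right):
    # Keep every key of `left`; fill in None where `right` has no match.
    right_keys = set(k for k, _ in right)
    unmatched = [(k, (v, None)) for k, v in left if k not in right_keys]
    return sorted(inner_join(left, right) + unmatched, key=lambda kv: kv[0])

a_join_b = inner_join(list_a, list_b)
b_join_a = inner_join(list_b, list_a)
a_left_b = left_outer_join(list_a, list_b)
print(a_join_b, b_join_a, a_left_b)
```

Swapping the two sides of an inner join keeps the same keys but swaps the order of the values in each pair. Key 3 appears only in the left outer join, paired with <code>None</code>.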

#### Joins with the image data
In the example images we have /img/ which includes the original and /fuzzy/ which is a fuzzy version of those.  We can map the measurements of fuzzy and original images and join on ward cluster hashes.  In each image's measures dictionary, the ward clusters are saved in a list.  We can use <code>flatMap</code> to flatten that list and create an RDD with tuples of (one_ward_cluster_hash, image_file_name).

In [None]:
candidates = sc.pickleFile('hdfs:///c1/map_each_image/measures')

In [None]:
def flatten_ward(rdd):
    return rdd.flatMap(lambda (key, value): [(wc, key) for wc in value['ward']])

In [None]:
candidates_flat = flatten_ward(candidates)
originals_flat = flatten_ward(measures)

In [None]:
joined = candidates_flat.join(originals_flat)

In [None]:
joined.take(10)
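
The flatten-then-join idea can be traced on toy data in plain Python. Everything below is hypothetical: the file names, the hash strings, and the helper <code>flat_map</code> are invented for illustration, and only the <code>'ward'</code> key mirrors the measures dictionary described above.

```python
def flat_map(func, items):
    # Apply func to each item and concatenate the lists it returns.
    return [out for item in items for out in func(item)]

def flatten_ward_list(records):
    # One (ward_cluster_hash, file_name) tuple per hash in each record.
    return flat_map(lambda kv: [(wc, kv[0]) for wc in kv[1]['ward']], records)

# Toy stand-ins for the two measures RDDs.
originals = [('/img/a.jpg', {'ward': ['h1', 'h2']})]
candidates = [('/fuzzy/a.jpg', {'ward': ['h2', 'h3']})]

originals_flat = flatten_ward_list(originals)
candidates_flat = flatten_ward_list(candidates)

# Inner join on the hash: (hash, (candidate_file, original_file)).
joined_pairs = [(wc, (cand, orig))
                for wc, cand in candidates_flat
                for wc2, orig in originals_flat
                if wc == wc2]
print(joined_pairs)
```

Only the shared hash survives the join, which links the fuzzy image back to its original.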