# From Python to Spark (PySpark)

PySpark is the Python API for Apache Spark.

Getting started with Spark is particularly easy for Python programmers who already use a functional programming style:
`map`, `filter` and `lambda` functions (as opposed to `for` and `while` loops).


# Functional primitives in Python and PySpark

In many programming languages, element-wise transformations are applied to a list or an array using control-flow statements such as `for` and `while`. In functional programming languages like Scala, and consequently in Spark, a set of primitives such as `map`, `filter` and `reduce` is used instead. Some of these are available in Python too.

Some common functional primitives, with their Pythonic list-comprehension equivalents, are:

- `map(f, list) === [ f(x) for x in list ]`: Apply `f` element-wise to `list`.
- `filter(f, list) === [ x for x in list if f(x) ]`: Filter `list` using `f`.
- `flatMap(f, list) === [ y for x in list for y in f(x) ]`: Here `f` is a function that takes elements (of the type contained in `list`) and returns lists; `flatMap` first applies `f` element-wise to the elements of `list` and then _flattens_ or _concatenates_ the resulting lists. It is sometimes also called `concatMap`.
- `reduce(f, list[, initial])`: Here `f` is a function of _two_ variables; `reduce` folds over the list, applying `f` to the "accumulator" and the next value in the list. (In Python 3 it is imported from `functools`.) That is, it performs the following recursion:

$$    a_{-1} = \mathrm{initial} $$
$$    a_i = f(a_{i-1}, \mathrm{list}_i) $$

with the final answer being $a_{\mathrm{len}(\mathrm{list})-1}$. (If `initial` is omitted, just start with $a_0 = \mathrm{list}_0$.) For instance,

    reduce(lambda x,y: x+y, [1,2,3,4]) = ((1+2)+3)+4 = 10
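
The four primitives listed above can be tried directly in plain Python. The following is a minimal sketch, assuming Python 3 (where `reduce` is imported from `functools`); `flat_map` is a hypothetical helper name, since Python has no built-in `flatMap`.

```python
from functools import reduce
from itertools import chain


def flat_map(f, items):
    """Apply f to each item, then concatenate the resulting lists (hypothetical helper)."""
    return list(chain.from_iterable(map(f, items)))


numbers = [1, 2, 3, 4]

doubled = list(map(lambda x: 2 * x, numbers))        # apply a function element-wise
evens = list(filter(lambda x: x % 2 == 0, numbers))  # keep elements passing the test
repeated = flat_map(lambda x: [x, x], numbers)       # map to lists, then flatten
total = reduce(lambda acc, x: acc + x, numbers)      # ((1 + 2) + 3) + 4
total_from_100 = reduce(lambda acc, x: acc + x, numbers, 100)  # start from initial=100

print(doubled, evens, repeated, total, total_from_100)
```

Wrapping `map` and `filter` in `list(...)` is only needed to see the results at once; both return lazy iterators in Python 3.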
    
    
## Remark:
This is where the name "MapReduce" comes from.

# Lambda functions in Python

Python supports the creation of anonymous functions (i.e. functions that are not bound to a name) at runtime, using a construct called "lambda".

Sometimes you need to pass a function as an argument, or you want to do a short but complex operation multiple times. You could define your function the normal way, or you could make a lambda function, a mini-function that returns the result of a single expression. The two definitions below are equivalent:

In [1]:
##traditional named function
def add(a,b): return a+b

##lambda function
add2 = lambda a,b: a+b

The advantage of the `lambda` function is that it is itself an expression and can be used inside another statement. Here's an example using the `map` function, which calls a function on every element in a list and returns the results (as a lazy iterator in Python 3, which is why it is unpacked with `*` when printed):

In [2]:
squares = map(lambda a: a*a, [1,2,3,4,5])
print (*squares)

1 4 9 16 25
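
Because `map` returns an iterator in Python 3, its results can be consumed only once. A minimal sketch of that behaviour, reusing the same lambda as in the cell above; the variable names `first_pass` and `second_pass` are illustrative only.

```python
squares = map(lambda a: a * a, [1, 2, 3, 4, 5])

first_pass = list(squares)   # consumes the iterator
second_pass = list(squares)  # nothing is left to consume

print(first_pass)   # [1, 4, 9, 16, 25]
print(second_pass)  # []
```

If the results are needed more than once, store them in a list first.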


# Exercise 1: mapping the list in Python

Suppose you need to perform a transformation on a list of elements, for instance, to calculate the square of each element of the list. One way to write this in Python would be as follows:

In [3]:
def simple_squares():
    numbers = [1,2,3,4,5]
    squares = []
    for number in numbers:
        squares.append(number*number)
    # Now, squares should have [1,4,9,16,25]
    print ("List of squares: {}".format(squares))

In [4]:
# Exercise 1: mapping the list
simple_squares()

List of squares: [1, 4, 9, 16, 25]


## Functional way

Python provides a few ways to rewrite the same piece of code in a more compact form: list comprehensions and the `map` function. Python programmers who do not use Spark typically prefer list comprehensions to `map`. But using `map` is what lets you adjust to the Spark way of programming most easily:

In [5]:
def square(x):
    return x*x

def python_squares():
    ## Functional way, with a named function
    numbers = [1,2,3,4,5]
    squares = map(square, numbers)
    #Now, squares should have [1,4,9,16,25]
    print ("List of squares calculated in a Functional way: ", *squares)

def python_squares_lambda():
    ## Functional way, with a lambda function
    numbers = [1,2,3,4,5]
    squares = map(lambda x: x*x, numbers)
    #Now, squares should have [1,4,9,16,25]
    print ("List of squares calculated in a Functional way with lambda: ", *squares)

In [6]:
# Mapping the list in a pythonic way
python_squares()

List of squares calculated in a Functional way:  1 4 9 16 25


In [7]:
#Lambda function 
python_squares_lambda()

List of squares calculated in a Functional way with lambda:  1 4 9 16 25
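
For comparison, here is the list-comprehension form mentioned above, next to the `map` form. This is a minimal sketch; the variable names are illustrative only.

```python
numbers = [1, 2, 3, 4, 5]

# List comprehension: builds the list directly
squares_comprehension = [x * x for x in numbers]

# map: returns an iterator, so it is wrapped in list() for comparison
squares_map = list(map(lambda x: x * x, numbers))

print(squares_comprehension == squares_map)  # True
```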


# Exercise 2: filtering the list in Python

What if you're more interested in filtering the list? Say you want to remove every element with a value equal to or greater than 4? (Okay, so the examples aren't very realistic. Whatever...) A Python neophyte might write:

In [8]:
def filter_squares():
    numbers = [1,2,3,4,5]
    numbers_under_4 = []
    for number in numbers:
        if number < 4:
            numbers_under_4.append(number)
    # Now, numbers_under_4 contains [1, 2, 3]
    print ("Numbers under 4 only: ",numbers_under_4)

You could reduce the size of the code with the filter function:

In [9]:
def python_filter_squares_lambda():
    numbers = [1,2,3,4,5]
    numbers_under_4 = filter(lambda x: x < 4,numbers)
    print ("Numbers under 4 only: ",*numbers_under_4)

In [10]:
# Exercise 2: filtering the list
filter_squares()

Numbers under 4 only:  [1, 2, 3]


In [11]:
#filtering the list in a pythonic way with lambda
python_filter_squares_lambda()

Numbers under 4 only:  1 2 3
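
Note that `filter` keeps the elements for which the function returns `True`; it does not remove them. A minimal sketch contrasting it with `itertools.filterfalse` from the standard library, which keeps the complement; `is_under_4` is an illustrative helper name.

```python
from itertools import filterfalse


def is_under_4(x):
    """Predicate used by both calls below (illustrative helper)."""
    return x < 4


numbers = [1, 2, 3, 4, 5]

kept = list(filter(is_under_4, numbers))          # elements where the predicate is True
dropped = list(filterfalse(is_under_4, numbers))  # elements where the predicate is False
kept_comprehension = [x for x in numbers if is_under_4(x)]

print(kept, dropped)  # [1, 2, 3] [4, 5]
```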


## Pandas data structures and functionality

Below, we're going to explore a dataset of mortgage insurance issued by the *Federal Housing Administration (FHA)*. The data is broken down by census tract and tells us how big a player the FHA is in each tract (how many homes, etc.).

In [22]:
import pandas as pd

In [23]:
names =["State_Code", "County_Code", "Census_Tract_Number", "NUM_ALL", "NUM_FHA", "PCT_NUM_FHA", "AMT_ALL", "AMT_FHA", "PCT_AMT_FHA"]
df = pd.read_csv('../0_Preexercise/data/fha_by_tract.csv', names=names)  ## Loading a CSV file, without a header (so we have to provide field names)

df['GEOID'] = df['Census_Tract_Number']*100 + 10**6 * df['County_Code'] \
    + 10**9 * df['State_Code']   
    
df = df.sort_values('State_Code')  
df.head()

| | State_Code | County_Code | Census_Tract_Number | NUM_ALL | NUM_FHA | PCT_NUM_FHA | AMT_ALL | AMT_FHA | PCT_AMT_FHA | GEOID |
|---|---|---|---|---|---|---|---|---|---|---|
| 23999 | 1.0 | 49.0 | 9613.0 | 16 | 4 | 25.0 | 2184 | 799 | 36.5842 | 1049961000.0 |
| 55215 | 1.0 | 3.0 | 102.0 | 8 | 1 | 12.5 | 774 | 76 | 9.81912 | 1003010000.0 |
| 65492 | 1.0 | 27.0 | NaN | 1 | 0 | 0.0 | 82 | 0 | 0.0 | NaN |
| 45193 | 1.0 | 95.0 | 311.0 | 20 | 3 | 15.0 | 1495 | 263 | 17.592 | 1095031000.0 |
| 33750 | 1.0 | 39.0 | 9618.0 | 14 | 3 | 21.4286 | 1243 | 333 | 26.79 | 1039962000.0 |
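
The `GEOID` column packs the three codes into a single number: the state code is multiplied by 10^9, the county code by 10^6, and the tract number (which can have two decimal places) by 100. A minimal sketch of the same arithmetic on plain Python numbers; `make_geoid` is a hypothetical helper that is not part of the course code, and the `round` call is an addition to guard against floating-point noise.

```python
def make_geoid(state_code, county_code, tract_number):
    """Combine state, county and tract codes following the GEOID formula above."""
    # round() is an assumption added here to absorb float noise such as 5.01 * 100
    return state_code * 10**9 + county_code * 10**6 + round(tract_number * 100)


print(make_geoid(1, 49, 9613.0))  # 1049961300
print(make_geoid(1, 89, 5.01))    # 1089000501
```

The values printed in the notebook output appear to be shown at reduced precision, which would explain why they end in zeros.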


## Map and filter in Pandas


Pandas supports functional-style transformations too: `apply` on a column plays the role of `map`, and boolean indexing plays the role of `filter`:

In [24]:
df["State_Code2"] = df["State_Code"].apply(lambda x: x+1)

df["State_Code2"].head()

23999    2.0
55215    2.0
65492    2.0
45193    2.0
33750    2.0
Name: State_Code2, dtype: float64

In [25]:
df = df[df['County_Code'] > 75]

df.head()

| | State_Code | County_Code | Census_Tract_Number | NUM_ALL | NUM_FHA | PCT_NUM_FHA | AMT_ALL | AMT_FHA | PCT_AMT_FHA | GEOID | State_Code2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 45193 | 1.0 | 95.0 | 311.0 | 20 | 3 | 15.0 | 1495 | 263 | 17.592 | 1095031000.0 | 2.0 |
| 23024 | 1.0 | 89.0 | 5.01 | 9 | 3 | 33.3333 | 615 | 232 | 37.7236 | 1089001000.0 | 2.0 |
| 65507 | 1.0 | 99.0 | 756.0 | 1 | 0 | 0.0 | 116 | 0 | 0.0 | 1099076000.0 | 2.0 |
| 39229 | 1.0 | 103.0 | 54.04 | 60 | 14 | 23.3333 | 9263 | 2051 | 22.1419 | 1103005000.0 | 2.0 |
| 65472 | 1.0 | 113.0 | NaN | 2 | 0 | 0.0 | 435 | 0 | 0.0 | NaN | 2.0 |
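
The same two operations can be tried without the CSV file. The sketch below assumes pandas is installed and uses a small made-up frame, so the values are illustrative only and do not come from the FHA dataset.

```python
import pandas as pd

# Small made-up frame standing in for the FHA data (illustrative values only)
df = pd.DataFrame({
    "State_Code": [1.0, 1.0, 2.0],
    "County_Code": [49.0, 95.0, 103.0],
})

# "map": apply a function to every element of a column
df["State_Code2"] = df["State_Code"].map(lambda x: x + 1)

# "filter": keep only the rows for which the condition holds
filtered = df[df["County_Code"] > 75]

print(filtered)
```

Note that the pandas method named `DataFrame.filter` selects rows or columns by label, not by value, so boolean indexing is the closer analogue of the functional `filter`.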


# Exercise 4: from Python to PySpark

Say I want to map and filter a list at the same time. In other words, I'd like to see the square of each element in the list where said element is under 4. Once more, the Python neophyte way:


In [18]:
numbers = [1,2,3,4,5]
squares = []
for number in numbers:
    if number < 4:
        squares.append(number*number)
print (squares)

[1, 4, 9]


Before re-writing it in PySpark, re-write it using map and filter expressions:

In [19]:
numbers = [1,2,3,4,5]
squares = map(lambda x: x*x, filter(lambda x: x < 4, numbers))
print (*squares)

1 4 9


Now do the same with PySpark:

In [20]:
import pyspark
try:
    sc
except NameError:    
    sc = pyspark.SparkContext('local[*]')

In [21]:
# We do not need to create the SparkSession/SparkContext in the notebook, but we would in a standalone application
numbers_rdd = sc.parallelize(numbers)
squares_rdd = numbers_rdd.filter(lambda x: x < 4).map(lambda x: x*x)
print (squares_rdd.collect())

[1, 4, 9]
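
The main visual difference between the two versions is that the pure-Python calls are nested and read inside-out, while the PySpark calls are chained and read left to right. The sketch below mimics the chained style with a hypothetical in-memory class, `LocalPipeline`. It is not part of PySpark and does no distributed work; it only illustrates the calling convention.

```python
class LocalPipeline:
    """Hypothetical in-memory stand-in for an RDD; not part of PySpark."""

    def __init__(self, items):
        self._items = list(items)

    def map(self, f):
        # Return a new pipeline so that calls can be chained
        return LocalPipeline(map(f, self._items))

    def filter(self, f):
        return LocalPipeline(filter(f, self._items))

    def collect(self):
        # Return the accumulated results as a plain list
        return list(self._items)


numbers = [1, 2, 3, 4, 5]

# Nested style: read from the innermost call outwards
nested = list(map(lambda x: x * x, filter(lambda x: x < 4, numbers)))

# Chained style: read from left to right, as in the PySpark cell above
chained = LocalPipeline(numbers).filter(lambda x: x < 4).map(lambda x: x * x).collect()

print(nested, chained)  # [1, 4, 9] [1, 4, 9]
```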


## Submitting Spark jobs via Slurm (this part is done on Adroit cluster)

Open a new terminal window, or use an SSH client, to log in to the Adroit cluster:

```bash
ssh -XC your_username@adroit3.princeton.edu
```

Check out the course exercises on Adroit as well:

```bash
git clone https://github.com/ASvyatkovskiy/BigDataCourse && cd BigDataCourse
```

Change into the working directory (the previous command already left you inside `BigDataCourse`):
```bash
cd 1_TransformationsActions
```

Before starting with `exercise.py`, make sure your scratch space is set up.
Look for your scratch folder:

```bash
ls -l /scratch/network/<your_username>
```

Create it if necessary:
```bash
mkdir /scratch/network/<your_username>
```

Define an environment variable to store its location:

```bash
export SCRATCH_PATH="/scratch/network/<your_username>"
``` 
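
A Python script can then read this variable through `os.environ`. The following is a minimal sketch under stated assumptions: whether `exercise.py` actually reads `SCRATCH_PATH` this way is not shown here, and both the `/tmp` fallback and the `output` subdirectory name are placeholders.

```python
import os

# Assumption: the script looks up SCRATCH_PATH; "/tmp" is a placeholder fallback
scratch_path = os.environ.get("SCRATCH_PATH", "/tmp")

# "output" is a hypothetical subdirectory name used for illustration
output_path = os.path.join(scratch_path, "output")

print(output_path)
```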

The Slurm submission file for the Spark job will look like:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:05:00
#SBATCH --ntasks-per-node 2
#SBATCH --cpus-per-task 3

module load spark/hadoop2.7/2.2.0
spark-start
echo $MASTER

spark-submit --total-executor-cores 6 exercise.py
```
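
The value passed to `--total-executor-cores` appears to match the total number of cores requested from Slurm in the script above. A small worked check of that arithmetic, assuming this is indeed how the number was chosen:

```python
nodes = 1            # from: #SBATCH -N 1
tasks_per_node = 2   # from: #SBATCH --ntasks-per-node 2
cpus_per_task = 3    # from: #SBATCH --cpus-per-task 3

# Total cores allocated to the job across all nodes
total_executor_cores = nodes * tasks_per_node * cpus_per_task

print(total_executor_cores)  # 6
```

Under that assumption, changing any of the three `#SBATCH` values would call for a matching change to `--total-executor-cores`.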

Monitor the progress of your Spark application:

```bash
squeue -u alexeys
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            219838       all slurm_fo  alexeys  R       0:04      1 adroit-06
```       

Read the inline comments in the `exercise.py` file to understand the exercise.