# From Python to Spark (PySpark)

PySpark is the Python API for Apache Spark.

Getting started with Spark is particularly easy for Python programmers who already write in a functional style: maps, filters, and lambda functions.

# Lambda functions in Python

Python supports the creation of anonymous functions (i.e. functions that are not bound to a name) at runtime, using a construct called "lambda".

Sometimes you need to pass a function as an argument, or you want to do a short but complex operation multiple times. You could define your function the normal way, or you could make a lambda function, a mini-function that returns the result of a single expression. The two definitions below are equivalent:

In [None]:
##traditional named function
def add(a,b): return a+b

##lambda function
add2 = lambda a,b: a+b
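
As a minimal sketch of the "pass a function as an argument" case, the cell below first checks that the two definitions agree, and then hands a lambda to the built-in `sorted` as its `key` argument. The `words` list is made up for illustration.

```python
# Traditional named function and its lambda counterpart, as defined above
def add(a, b): return a + b
add2 = lambda a, b: a + b

# Both callables give the same result for the same arguments
same_result = add(2, 3) == add2(2, 3)
print(same_result)

# A lambda passed directly as an argument: sort words by their length
words = ["spark", "py", "lambda", "map"]  # illustrative data
by_length = sorted(words, key=lambda w: len(w))
print(by_length)
```

Because the lambda is an expression, it can be written inline at the call site instead of being defined and named beforehand.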

The advantage of the lambda function is that it is itself an expression, so it can be used inside another statement. Here's an example using the `map` function, which calls a function on every element of a list and returns the results (a list in Python 2, a lazy iterator in Python 3, which is why the result is wrapped in `list` below):

In [None]:
squares = list(map(lambda a: a*a, [1,2,3,4,5]))
print(squares)

# Exercise 1: mapping the list in Python

Suppose you need to transform every element of a list, for instance to calculate the square of each element. One way to write this in Python would be as follows:

In [None]:
def simple_squares():
    numbers = [1,2,3,4,5]
    squares = []
    for number in numbers:
        squares.append(number*number)
    # Now, squares should have [1,4,9,16,25]
    print("List of squares: %s" % squares)

In [None]:
# Exercise 1: mapping the list
simple_squares()

## Pythonic way

Python provides a few ways to rewrite the same piece of code in a more compact form: list comprehensions and the `map` function. Python programmers who do not use Spark typically prefer list comprehensions to `map`, but `map` is the form that carries over most directly to the Spark style of programming:

In [None]:
def square(x):
    return x*x

def python_squares():
    ## Pythonic way
    numbers = [1,2,3,4,5]
    squares = list(map(square, numbers))  # list() so that Python 3 also yields a list
    #Now, squares should have [1,4,9,16,25]
    print("List of squares calculated in a Pythonic way: %s" % squares)

def python_squares_lambda():
    ## Pythonic way
    numbers = [1,2,3,4,5]
    squares = list(map(lambda x: x*x, numbers))
    #Now, squares should have [1,4,9,16,25]
    print("List of squares calculated in a Pythonic way with lambda: %s" % squares)

In [None]:
# Mapping the list in a pythonic way
python_squares()

In [None]:
#Lambda function 
python_squares_lambda()
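
For comparison, here is a minimal sketch of the list-comprehension form mentioned above, next to the `map` form. The `list(...)` call is only needed on Python 3, where `map` returns a lazy iterator; on Python 2 it is harmless.

```python
numbers = [1, 2, 3, 4, 5]

# List comprehension: the form most Python programmers reach for
squares_comprehension = [x * x for x in numbers]

# map with a lambda: the form that carries over to Spark
squares_map = list(map(lambda x: x * x, numbers))

print(squares_comprehension)
print(squares_comprehension == squares_map)
```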

# Exercise 2: filtering the list in Python

What if you're more interested in filtering the list? Say you want to remove every element with a value equal to or greater than 4? (Okay, so the examples aren't very realistic. Whatever...) A Python neophyte might write:

In [None]:
def filter_squares():
    numbers = [1,2,3,4,5]
    numbers_under_4 = []
    for number in numbers:
        if number < 4:
            numbers_under_4.append(number)
    # Now, numbers_under_4 contains [1,2,3]
    print("Numbers under 4 only: %s" % numbers_under_4)

You could reduce the size of the code with the filter function:

In [None]:
def python_filter_squares_lambda():
    numbers = [1,2,3,4,5]
    numbers_under_4 = list(filter(lambda x: x < 4, numbers))  # list() so that Python 3 also yields a list
    print("Numbers under 4 only: %s" % numbers_under_4)

In [None]:
# Exercise 2: filtering the list
filter_squares()

In [None]:
#filtering the list in a pythonic way with lambda
python_filter_squares_lambda()

# Exercise 3: Pandas data structures and functionality

Pandas is Python's answer to R.  It's a good tool for small(ish) data analysis -- i.e., when everything fits into memory. As a part of the pre-exercises, you have received an IPython notebook with a Pandas case study. The basic new "noun" in pandas is the **data frame**.

It's like a table, with rows and columns (e.g., as in SQL).  Except:
  - The rows can be indexed by something interesting (there is special support for labels like categorical and timeseries data).  This is especially useful when you have timeseries data with potentially missing data points.
  - Cells can store Python objects. (Like in SQL, columns are homogeneous.)
  - Instead of "NULL", the name for a non-existent value is "NA".  Unlike R's, pandas data frames only support NAs in columns of some data types (basically: floating point numbers and 'objects') -- but this is mostly a non-issue (because pandas will "up-cast" integers to float64, etc.)
  
Pandas provides a "batteries-included" toolkit for basic data analysis:
  - **Loading data:** `read_csv`, `read_table`, `read_sql`, and `read_html`
  - **Selection, filtering, and aggregation** (i.e., SQL-type operations): There's a special syntax for `SELECT`ing.  There's the `merge` method for `JOIN`ing.  There's also an easy syntax for what in SQL is a mouthful: creating a new column whose value is computed from other columns -- with the bonus that now the computations can use the full power of Python (though it might be faster if it didn't).
  - **"Pivot table" style aggregation**: If you're an Excel cognoscente, you may appreciate this.
  - **NA handling**: Like R's data frames, there is good support for transforming NA values with default values / averaging tricks / etc.
  - **Basic statistics:** e.g. `mean`, `median`, `max`, `min`, and the convenient `describe`.
  - **Plugging into more advanced analytics:** Okay, this isn't batteries included.  But still, it plays reasonably well with `sklearn`.
  - **Visualization:** For instance `plot` and `hist`.
  
  
## Map and filter in Pandas:



In [None]:
import pandas as pd

names =["State_Code", "County_Code", "Census_Tract_Number", "NUM_ALL", "NUM_FHA", "PCT_NUM_FHA", "AMT_ALL", "AMT_FHA", "PCT_AMT_FHA"]
df = pd.read_csv('../preexercise/data/fha_by_tract.csv', names=names)  ## Loading a CSV file, without a header (so we have to provide field names)

df.head()

In [None]:
## "Map": apply a function to every value of a column and store the result in a new column
df["State_Code2"] = df["State_Code"].apply(lambda x: x+1)

df["State_Code2"].head()

In [None]:
## "Filter": keep only the rows whose County_Code is greater than 75
df = df[df['County_Code'] > 75]

df.head()

# Exercise 4: from Python to PySpark

Say I want to map and filter a list at the same time. In other words, I'd like to see the square of each element in the list where said element is under 4. Once more, the Python neophyte way:


In [None]:
numbers = [1,2,3,4,5]
squares = []
for number in numbers:
    if number < 4:
        squares.append(number*number)
print(squares)

Before re-writing it in PySpark, re-write it using map and filter expressions:

In [None]:
numbers = [1,2,3,4,5]
squares = list(map(lambda x: x*x, filter(lambda x: x < 4, numbers)))  # list() so that Python 3 also yields a list
print(squares)
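
The same result can also be written as a single list comprehension, which combines the filter and the map in one expression. This is only a side-by-side sketch; the map/filter form is the one that translates to PySpark.

```python
numbers = [1, 2, 3, 4, 5]

# Filter (x < 4) and map (x*x) in one comprehension
squares_comprehension = [x * x for x in numbers if x < 4]

# The map/filter chain; list() makes the result a list on Python 3 as well
squares_map_filter = list(map(lambda x: x * x, filter(lambda x: x < 4, numbers)))

print(squares_comprehension)
print(squares_comprehension == squares_map_filter)
```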

Now do the same with PySpark:

In [None]:
#We do not need to create the Spark context in the notebook (it is already available as sc),
#but we would in a standalone application:
#from pyspark import SparkContext
#sc = SparkContext(appName="My First App")
numbers_rdd = sc.parallelize(numbers)
squares_rdd = numbers_rdd.filter(lambda x: x < 4).map(lambda x: x*x)
print(squares_rdd.collect())
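
To make the shape of that chain concrete without a running cluster, here is a toy, pure-Python stand-in. `ToyRDD` is a hypothetical class written for this illustration and is not part of PySpark; it only mimics the idea that `filter` and `map` describe work to be done, while `collect` actually runs it and returns a list. A real RDD is also partitioned and distributed, which this sketch ignores.

```python
class ToyRDD(object):
    """Hypothetical, single-machine stand-in for an RDD (illustration only)."""

    def __init__(self, data, steps=None):
        self._data = data
        self._steps = steps or []  # recorded transformations, not yet executed

    def map(self, f):
        # Record the step and return a new ToyRDD; nothing is computed here
        return ToyRDD(self._data, self._steps + [("map", f)])

    def filter(self, f):
        # Same as map: only remember what should happen later
        return ToyRDD(self._data, self._steps + [("filter", f)])

    def collect(self):
        # The "action": run every recorded step in order and return a list
        result = list(self._data)
        for kind, f in self._steps:
            if kind == "map":
                result = [f(x) for x in result]
            else:
                result = [x for x in result if f(x)]
        return result


numbers = [1, 2, 3, 4, 5]
squares_toy = ToyRDD(numbers).filter(lambda x: x < 4).map(lambda x: x * x)
collected = squares_toy.collect()
print(collected)
```

The point of the sketch is the split between describing the computation (`filter`, `map`) and triggering it (`collect`), which is the same split the PySpark cell above relies on.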

## Submitting Spark jobs via Slurm

Change into the working directory:
```bash
cd BigDataCourse/1_TransformationsActions
```

Before starting with `exercise.py`, make sure your scratch space is set up.
Look for your scratch folder:

```bash
ls -l /scratch/network/<your_username>
```

Create it if necessary:
```bash
mkdir /scratch/network/<your_username>
```

Define an environment variable to store its location:

```bash
export SCRATCH_PATH="/scratch/network/<your_username>"
``` 

The Slurm submission file for the Spark job will look like this:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:05:00
#SBATCH --ntasks-per-node 2
#SBATCH --cpus-per-task 3

module load spark/hadoop2.6/1.6.1
spark-start
echo $MASTER

spark-submit --total-executor-cores 6 exercise.py
```

Monitor the progress of your Spark application:

```bash
squeue -u alexeys
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            219838       all slurm_fo  alexeys  R       0:04      1 adroit-06
```             

# Transformations in Python and PySpark (quick look forward)

In Python, it is idiomatic to operate on lists with comprehensions: element-wise operations (maps), filtering, etc. In many other languages, starting with Lisp and extending to many "functional" programming languages, a different style is preferred:

The idea is that if `f` is a function, then one thinks of the application
>          
    list   |---->   [ f(x) for x in list ]

on lists as a function of _two_ arguments: `f` and `list`.  The idea of viewing the function `f` as a parameter is typical in functional programming languages, and can be taken as a definition of the latter term.

Some common idioms in this style, with Pythonic equivalents, are:

- `map(f, list) === [ f(x) for x in list ]`: Apply `f` element-wise to `list`.
- `filter(f, list) === [ x for x in list if f(x) ]`: Filter `list` using `f`.
- `flatMap(f, list) === [ y for x in list for y in f(x) ]`: Here `f` is a function that eats elements (of the type contained in `list`) and spits out lists, and `flatMap` first applies `f` element-wise to the elements of `list` and then _flattens_ or _concatenates_ the resulting lists.  It is sometimes also called `concatMap`.
- `reduce(f, list[, initial])`: Here `f` is a function of _two_ variables, and `reduce` folds over the list, applying `f` to the "accumulator" and the next value in the list.  That is, it performs the following recursion:

$$    a_{-1} = \mathrm{initial} $$
$$    a_i = f(a_{i-1}, \mathrm{list}_i) $$

with the final answer being $a_{\mathrm{len}(\mathrm{list})-1}$.  (If initial is omitted, just start with $a_0 = \mathrm{list}_0$.)  For instance,
>           
    reduce(lambda x,y: x+y, [1,2,3,4]) = ((1+2)+3)+4 = 10
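
The idioms above can be tried out in plain Python. In the sketch below, `flat_map` is a small helper defined here for illustration (Python has no built-in `flatMap`), and `reduce` is imported from `functools`, which works on Python 2.6+ as well as Python 3.

```python
from functools import reduce  # a built-in in Python 2, must be imported in Python 3

def flat_map(f, items):
    # Apply f to every element, then concatenate the resulting lists
    return [y for x in items for y in f(x)]

numbers = [1, 2, 3, 4]

mapped = list(map(lambda x: x * 10, numbers))            # element-wise
filtered = list(filter(lambda x: x % 2 == 0, numbers))   # keep even numbers
flattened = flat_map(lambda x: [x, x], numbers)          # each element twice, flattened
total = reduce(lambda x, y: x + y, numbers)              # ((1+2)+3)+4
total_from_100 = reduce(lambda x, y: x + y, numbers, 100)  # accumulator starts at 100

print(mapped)
print(filtered)
print(flattened)
print(total)
print(total_from_100)
```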
    
    
### Remark:
This is where the name "map reduce" comes from.



# Anatomy of the Spark application


# Spark transformations and actions


# Working with key-value pairs