# DAY 1


# From Python to Spark (PySpark)

PySpark is the Python API for Apache Spark.

Getting started with Spark is particularly easy for Python programmers who already use a certain programming style: 
maps, filters, and lambda functions.

# Lambda functions in Python

Python supports the creation of anonymous functions (i.e. functions that are not bound to a name) at runtime, using a construct called "lambda".

Sometimes you need to pass a function as an argument, or you want to do a short but complex operation multiple times. You could define your function the normal way, or you could make a lambda function, a mini-function that returns the result of a single expression. The two definitions below are equivalent:

In [1]:
##traditional named function
def add(a,b): return a+b

##lambda function
add2 = lambda a,b: a+b

The advantage of the lambda function is that it is in itself an expression, and can be used inside another statement. Here's an example using the map function, which calls a function on every element in a list, and returns a list of the results:

In [3]:
squares = map(lambda a: a*a, [1,2,3,4,5])
print squares

[1, 4, 9, 16, 25]
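
The output above is what Python 2 prints. Under Python 3 (which these notes do not use), `map` returns a lazy iterator rather than a list, so the values only appear once you wrap the call in `list(...)`. A minimal sketch that gives the same result under both versions:

```python
# map() returns a list in Python 2 but a lazy iterator in Python 3;
# wrapping it in list() gives the same result under both.
squares = list(map(lambda a: a * a, [1, 2, 3, 4, 5]))
print(squares)  # [1, 4, 9, 16, 25]
```

This laziness is a small preview of Spark, where transformations are also evaluated only when a result is actually requested.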


# Exercise 1: mapping the list in Python

To get started with the exercise:

```bash
cd from_python_to_spark/
```
and inspect the exercise1.py file therein.

Suppose you need to perform a transformation on a list of elements, for instance, to calculate the square of each element of the list. One way to write this in Python would be as follows: 

In [4]:
numbers = [1,2,3,4,5]
squares = []
for number in numbers:
    squares.append(number*number)
# Now, squares should have [1,4,9,16,25]
print "List of squares: ", squares

List of squares:  [1, 4, 9, 16, 25]


## Pythonic way

Python provides a few ways to re-write the same piece of code in a more compact form: list comprehensions and the `map` function. Python programmers who do not use Spark typically prefer list comprehensions to `map`. But using `map` is the easiest way to adjust to the Spark way of programming:

In [5]:
numbers = [1,2,3,4,5]
squares = map(lambda x: x*x, numbers)
#Now, squares should have [1,4,9,16,25]
print "List of squares: ", squares

List of squares:  [1, 4, 9, 16, 25]


# Exercise 2: filtering the list in Python

What if you're more interested in filtering the list? Say you want to remove every element with a value equal to or greater than 4? (Okay, so the examples aren't very realistic. Whatever...) A Python neophyte might write:

In [6]:
numbers = [1,2,3,4,5]
numbers_under_4 = []
for number in numbers:
    if number < 4:
        numbers_under_4.append(number)
# Now, numbers_under_4 contains [1,2,3]
print "Numbers under 4 only: ",numbers_under_4

Numbers under 4 only:  [1, 2, 3]


You could reduce the size of the code with the filter function:

In [7]:
numbers = [1,2,3,4,5]
numbers_under_4 = filter(lambda x: x < 4, numbers)
# Now, numbers_under_4 contains [1,2,3]
print "Numbers under 4 only: ",numbers_under_4

Numbers under 4 only:  [1, 2, 3]
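
For comparison, here is a short sketch of the list-comprehension forms mentioned earlier. They produce the same lists as the `map` and `filter` calls, using the same `numbers` list as the exercises:

```python
numbers = [1, 2, 3, 4, 5]

# map(lambda x: x*x, numbers) written as a list comprehension
squares = [x * x for x in numbers]

# filter(lambda x: x < 4, numbers) written as a list comprehension
numbers_under_4 = [x for x in numbers if x < 4]

print(squares)          # [1, 4, 9, 16, 25]
print(numbers_under_4)  # [1, 2, 3]
```

The comprehension is often the more readable choice in plain Python, but the `map`/`filter` form is the one that carries over to Spark almost unchanged.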


# Exercise 3: Pandas data structures and functionality

Pandas is Python's answer to R.  It's a good tool for small(ish) data analysis -- i.e., when everything fits into memory. As a part of the pre-exercises, you have received an IPython notebook with a Pandas case study. The basic new "noun" in pandas is the **data frame**.

A data frame is like a table, with rows and columns (e.g., as in SQL).  Except:
  - The rows can be indexed by something interesting (there is special support for labels like categorical and timeseries data).  This is especially useful when you have timeseries data with potentially missing data points.
  - Cells can store Python objects. (Like in SQL, columns are homogeneous.)
  - Instead of "NULL", the name for a non-existent value is "NA".  Unlike R, pandas data frames only support NAs in columns of some data types (basically: floating point numbers and 'objects') -- but this is mostly a non-issue (because it will "up-cast" integers to float64, etc.)
  
Pandas provides a "batteries-included" basic data analysis:
  - **Loading data:** `read_csv`, `read_table`, `read_sql`, and `read_html`
  - **Selection, filtering, and aggregation** (i.e., SQL-type operations): There's a special syntax for `SELECT`ing.  There's the `merge` method for `JOIN`ing.  There's also an easy syntax for what in SQL is a mouthful: creating a new column whose value is computed from other columns -- with the bonus that now the computations can use the full power of Python (though it might be faster if it didn't).
  - **"Pivot table" style aggregation**: If you're an Excel cognoscente, you may appreciate this.
  - **NA handling**: Like R's data frames, there is good support for transforming NA values with default values / averaging tricks / etc.
  - **Basic statistics:** e.g. `mean`, `median`, `max`, `min`, and the convenient `describe`.
  - **Plugging into more advanced analytics:** Okay, this isn't batteries included.  But still, it plays reasonably with `sklearn`.
  - **Visualization:** For instance `plot` and `hist`.
  
  
## Map and filter in Pandas:


In [1]:
import pandas as pd

names =["State_Code", "County_Code", "Census_Tract_Number", "NUM_ALL", "NUM_FHA", "PCT_NUM_FHA", "AMT_ALL", "AMT_FHA", "PCT_AMT_FHA"]
df = pd.read_csv('../preexercise/data/fha_by_tract.csv', names=names)  ## Loading a CSV file, without a header (so we have to provide field names)

df.head()

Unnamed: 0,State_Code,County_Code,Census_Tract_Number,NUM_ALL,NUM_FHA,PCT_NUM_FHA,AMT_ALL,AMT_FHA,PCT_AMT_FHA
0,8,75,,1,1,100,258,258,100
1,28,49,103.01,1,1,100,71,71,100
2,40,3,,1,1,100,215,215,100
3,39,113,603.0,3,3,100,206,206,100
4,12,105,124.04,2,2,100,303,303,100


In [2]:
df["State_Code2"] = df["State_Code"].apply(lambda x: x+1)

df["State_Code2"].head()

0     9
1    29
2    41
3    40
4    13
Name: State_Code2, dtype: float64

In [8]:
df = df[df['County_Code'] > 75]

df.head()

Unnamed: 0,State_Code,County_Code,Census_Tract_Number,NUM_ALL,NUM_FHA,PCT_NUM_FHA,AMT_ALL,AMT_FHA,PCT_AMT_FHA,State_Code2,County_Code2
3,39,113,603.0,3,3,100,206,206,100,40,39
4,12,105,124.04,2,2,100,303,303,100,13,12
5,12,86,9808.0,1,1,100,188,188,100,13,12
7,12,103,207.0,2,2,100,100,100,100,13,12
8,36,119,30.0,1,1,100,354,354,100,37,36


# Exercise 4: from Python to PySpark

Say I want to map and filter a list at the same time. In other words, I'd like to see the square of each element in the list where said element is under 4. Once more, the Python neophyte way:


In [None]:
numbers = [1,2,3,4,5]
squares = []
for number in numbers:
    if number < 4:
        squares.append(number*number)
print squares

Before re-writing it in PySpark, re-write it using map and filter expressions:

In [None]:
numbers = [1,2,3,4,5]
squares = map(lambda x: x*x, filter(lambda x: x < 4, numbers))
print squares

Now do the same with PySpark:

In [11]:
#I do not need to create the Spark Context in the notebook, but you do...
#sc = SparkContext(appName="My First App")
numbers_rdd = sc.parallelize(numbers)
squares_rdd = numbers_rdd.filter(lambda x: x < 4).map(lambda x: x*x)
print squares_rdd.collect()

[1, 4, 9]
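
To see why the chained PySpark style maps so directly onto the plain Python version, here is a toy, single-machine stand-in for an RDD. `ToyRDD` is a hypothetical class written only for this illustration; it is not part of PySpark and does nothing in parallel. It only mimics the idea that `filter` and `map` record work to be done, while `collect` actually runs it:

```python
class ToyRDD(object):
    """Hypothetical, single-machine stand-in for an RDD (not part of PySpark)."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # recorded transformations, not yet applied

    def map(self, f):
        # Transformation: only remember what to do.
        return ToyRDD(self._data, self._steps + (("map", f),))

    def filter(self, f):
        # Transformation: only remember what to do.
        return ToyRDD(self._data, self._steps + (("filter", f),))

    def collect(self):
        # Action: now run every recorded step, in order.
        result = self._data
        for kind, f in self._steps:
            if kind == "map":
                result = [f(x) for x in result]
            else:
                result = [x for x in result if f(x)]
        return result


numbers = [1, 2, 3, 4, 5]
squares_rdd = ToyRDD(numbers).filter(lambda x: x < 4).map(lambda x: x * x)
print(squares_rdd.collect())  # [1, 4, 9]
```

A real RDD additionally splits the data into partitions and runs the recorded steps on the executors, which this sketch leaves out entirely.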


## Submitting Spark jobs via Slurm

Before starting with exercise2.py, make sure your scratch space is set up.
Look for your scratch folder:

```bash
ls -l /scratch/network/<your_username>
```

Create it if necessary:
```bash
mkdir /scratch/network/<your_username>
```

Define an environment variable to store its location:

```bash
export SCRATCH_PATH="/scratch/network/<your_username>"
``` 

The Slurm submission file for a Spark job will look like this:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:05:00
#SBATCH --ntasks-per-node 2
#SBATCH --cpus-per-task 3

module load spark/hadoop2.6/1.4.1
spark-start
echo $MASTER

spark-submit --total-executor-cores 2 exercise2.py
```

Monitor the progress of your Spark application:

```bash
squeue -u alexeys
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            219838       all slurm_fo  alexeys  R       0:04      1 adroit-06
```             

# Transformations in Python and PySpark (quick look forward)

It is Pythonic to operate on lists - elementwise operations (maps), filtering, etc. In many other languages, starting with Lisp but extending to many "functional" programming languages, a different style is preferred:

The idea is that if `f` is a function, then one thinks of the application
>          
    list   |---->   [ f(x) for x in list ]

on lists as a function of _two_ arguments: `f` and `list`.  The idea of viewing the function `f` as a parameter is typical in functional programming languages, and can be taken as a definition of the latter term.

Some common idioms in this style, with Pythonic equivalents, are:

- `map(f, list) === [ f(x) for x in list ]`: Apply `f` element-wise to `list`.
- `filter(f, list) === [ x for x in list if f(x) ]`: Filter `list` using `f`.
- `flatMap(f, list) === [ y for x in list for y in f(x) ]`: Here `f` is a function that eats elements (of the type contained in `list`) and spits out lists, and `flatMap` first applies `f` element-wise to the elements of `list` and then _flattens_ or _concatenates_ the resulting lists.  It is sometimes also called `concatMap`.
- `reduce(f, list[, initial])`: Here `f` is a function of _two_ variables, and folds over the list applying `f` to the "accumulator" and the next value in the list.  That is, it performs the following recursion

$$    a_{-1} = \mathrm{initial} $$
$$    a_i = f(a_{i-1}, \mathrm{list}_i) $$

with the final answer being $a_{\mathrm{len}(\mathrm{list})-1}$.  (If initial is omitted, just start with $a_0 = \mathrm{list}_0$.)  For instance,
>           
    reduce(lambda x,y: x+y, [1,2,3,4]) = ((1+2)+3)+4 = 10
    
    
### Remark:
This is where the name "map reduce" comes from.
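
The `flatMap` and `reduce` idioms above can be sketched in plain Python as follows. `flat_map` is a hypothetical helper defined here for illustration (Python has no built-in of that name); `reduce` is the standard-library function:

```python
# reduce is a builtin in Python 2 and lives in functools in Python 3;
# this import works in both.
from functools import reduce


def flat_map(f, items):
    """Apply f to each element, then concatenate the resulting lists."""
    return [y for x in items for y in f(x)]


lines = ["to be or", "not to be"]
words = flat_map(lambda line: line.split(), lines)
print(words)  # ['to', 'be', 'or', 'not', 'to', 'be']

total = reduce(lambda x, y: x + y, [1, 2, 3, 4])
print(total)  # 10

# With an initial value the accumulator starts there instead of at list[0].
total_from_100 = reduce(lambda x, y: x + y, [1, 2, 3, 4], 100)
print(total_from_100)  # 110
```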

# Spark core fundamentals
(slides)

# Anatomy of the Spark application


# Spark transformations and actions


# Working with key-value pairs

# Loading data into RDD

To start the exercise, change into the loading_data folder:
```bash
cd loading_data
```

In the root folder you will find a set of files starting with `load`, and a few folders. 
Let us inspect the load1_unstructured.py file.

In [13]:
from pyspark import SparkContext
import sys
import time

def main1(args):
    start = time.time()
    #sc = SparkContext(appName="LoadUnstructured")

    #By default the path is assumed to be on HDFS,
    #but prefixing it with "file://" makes Spark search the local file system.
    #You can specify a folder, pass a comma-separated list of folders, or use wildcard characters.
    input_rdd = sc.textFile("../loading_data/unstructured/")

    #inspect it, understand how it is structured (list of strings-lines)
    print input_rdd.take(10)
    print "Input dataset has ", input_rdd.count(), " lines"

    counts = input_rdd.flatMap(lambda line: line.split()) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

    print "\nTaking the 10 most frequent words in the text and corresponding frequencies:"
    print counts.takeOrdered(10, key=lambda x: -x[1])
    end = time.time()
    print "Elapsed time: ", (end-start)

def main2(args):
    start = time.time()
    #sc = SparkContext(appName="LoadUnstructured")

    #Use alternative approach: load the initial file into a pair RDD
    input_pair_rdd = sc.wholeTextFiles("../loading_data/unstructured/")

    #inspect it, understand how it is structured (list of (file name, file content) pairs)
    print input_pair_rdd.take(3)
    print "Input dataset has ", input_pair_rdd.count(), " files"
    counts = input_pair_rdd.flatMap(lambda line: line[1].split()) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
    print "\nTaking the 10 most frequent words in the text and corresponding frequencies:"
    print counts.takeOrdered(10, key=lambda x: -x[1])
    end = time.time()
    print "Elapsed time: ", (end-start)


if __name__ == "__main__":
    # Try the record-per-line-input
    main1(sys.argv)
    #Use alternative approach: load the initial file into a pair RDD
    #main2(sys.argv)

[u'', u'', u'', u'', u'                        THE ADVENTURES OF SHERLOCK HOLMES', u'', u'                               Arthur Conan Doyle', u'', u'', u'']
Input dataset has  53271  lines

Taking the 10 most frequent words in the text and corresponding frequencies:
[(u'the', 22635), (u'of', 11167), (u'and', 11086), (u'to', 10707), (u'a', 10433), (u'I', 10183), (u'in', 7006), (u'that', 6911), (u'was', 6779), (u'his', 4955)]
Elapsed time:  1.72066187859
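
To make the three chained steps of the word count concrete, here is a plain-Python sketch of what `flatMap`, `map`, `reduceByKey`, and `takeOrdered` each contribute. It runs on a tiny made-up list of lines instead of an RDD, and it only imitates the logic, not the distributed execution:

```python
from collections import defaultdict

lines = ["the cat sat", "the dog sat", "the end"]

# flatMap: one list of words per line, flattened into a single list
words = [word for line in lines for word in line.split()]

# map: turn every word into a (key, value) pair
pairs = [(word, 1) for word in words]

# reduceByKey: combine all values that share a key with a + b
counts = defaultdict(int)
for word, n in pairs:
    counts[word] = counts[word] + n

# takeOrdered(2, key=lambda x: -x[1]): the two largest counts
top2 = sorted(counts.items(), key=lambda x: -x[1])[:2]
print(top2)  # [('the', 3), ('sat', 2)]
```

In Spark, the `reduceByKey` step first combines values within each partition and only then across partitions, which is why the combining function has to work on any two partial results.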


## Loading CSV

Next, we are going to learn how to load data in structured CSV format. There are at least two ways to do that:

1) Read the files line by line with the textFile() method and split each line on the delimiter

Just as Python has Pandas dataframes for working with structured data, Spark has its own data structure for this purpose, also called the dataframe (a concept closely linked to Spark SQL). A CSV file can be read directly into a Spark dataframe:

2) Read the files into a dataframe using the spark-csv module from Databricks
https://github.com/databricks/spark-csv

You do not need to install it; I did the work for you by adding the built JARs to the appropriate /lib folder...
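
For the first approach, the per-line parsing function might look like the sketch below. `parse_csv_line` is a hypothetical helper name and the sample line is made up; the point is only that the standard `csv` module copes with quoted fields, which a bare `split(",")` does not:

```python
import csv


def parse_csv_line(line, delimiter=","):
    """Split one line of CSV text into a list of fields.

    Hypothetical helper; csv.reader is used instead of line.split(",")
    so that quoted fields containing the delimiter stay in one piece.
    """
    return next(csv.reader([line], delimiter=delimiter))


sample = '1,"Hanks, Tom",1956'    # made-up sample line
print(parse_csv_line(sample))     # ['1', 'Hanks, Tom', '1956']
print(sample.split(","))          # ['1', '"Hanks', ' Tom"', '1956']
```

With an RDD of lines, such a function would be applied with something like `sc.textFile(path).map(parse_csv_line)`.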

### Load CSV

### Mini-exercise on loading CSV

Use what you have learned in the load2_csv.py exercise to load a set of CSV datasets:

-- Actor

-- Movie

-- Actor playing in movie (relationships)

and find the movies in which **Tom Hanks** played.

Save the answer in the JSON format.

# Machine learning

Let us consider a very simple machine learning example of logistic regression.
Logistic regression is an iterative machine learning algorithm that seeks to find the best hyperplane that separates two sets of points in a multi-dimensional feature space. It can be used to classify messages into spam vs non-spam, for example. Because the algorithm applies the same MapReduce operation repeatedly to the same dataset, it benefits greatly from caching the input in RAM across iterations.
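
To illustrate the "same MapReduce operation repeatedly" structure, here is a plain-Python sketch of gradient descent for logistic regression, with no Spark involved. The toy data, the labels in {-1, +1}, the step size of 0.1, and the iteration count are all assumptions made for this illustration:

```python
import math
from functools import reduce

# Toy data: feature vector [1, x] (bias term plus one feature) and a label
# in {-1, +1}; the +1 class is x < 0.5. These values are made up.
points = [([1.0, i / 20.0], 1.0 if i / 20.0 < 0.5 else -1.0) for i in range(20)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(w):
    # Logistic loss summed over all points
    return sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in points)


def point_gradient(w, x, y):
    # Gradient of the logistic loss for a single point
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


w = [0.0, 0.0]
initial_loss = loss(w)
for _ in range(200):
    # "map": one gradient per point; "reduce": add them up
    gradients = map(lambda p: point_gradient(w, p[0], p[1]), points)
    total = reduce(lambda a, b: [ai + bi for ai, bi in zip(a, b)], gradients)
    w = [wi - 0.1 * gi for wi, gi in zip(w, total)]  # step size 0.1 is an assumption

final_loss = loss(w)
print(initial_loss, final_loss)
```

Every pass of the loop reads all of `points` again. On a cluster, that repeated read is what makes caching the input in memory pay off.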

## Non-MLlib implementation

First, let us consider the non-MLlib implementation and try to evaluate the effect of caching and partitioning on the performance. We're going to try to learn the rule that y(x) = 1 if x < fraction_positive, and 0 otherwise.

Our training sample will be generated as follows:

In [17]:
import numpy as np
N = 10**3
fraction_positive = 0.5

def y(x):
    return 1 if x < fraction_positive else 0

def generate_sample():
    sample_X = np.arange(0, 1, 1.0/N)
    np.random.shuffle( sample_X) # In-place shuffle!
    sample_Y = map(y, sample_X)
    return (sample_X, sample_Y)

(sample_X, sample_Y) = generate_sample()


## By hand.  This is the example code taken from the Spark Examples on the website.
#  This is much slower than the MLlib version below, so I'm not going to even run it (or extract predictions, or test it..)
import time
from pyspark.mllib.regression import LabeledPoint
start = time.time()
def logistic_by_hand(ITERATIONS,nparts):
    points = ( sc.parallelize( zip(sample_X, sample_Y), nparts)
                 .map(lambda (x,y): LabeledPoint(y, [1, x]))
                 .cache() )
    w = np.random.ranf(size = 2) # current separating plane
    print "Original random plane: %s" % w
    for i in xrange(ITERATIONS):
        gradient = points.map(
            lambda pt: (1 / (1 + np.exp(-pt.label*(w.dot(pt.features)))) - 1) * pt.label * pt.features
        ).reduce(lambda a, b: a + b)
        w -= gradient
    print "Final separating plane: %s" % w

logistic_by_hand(20,3)
end = time.time()
print "Elapsed time: ", (end-start)

## MLlib based implementation


In [18]:
import time
start = time.time()
## Using MLlib and its data structures.  This is fairly quick.
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

points = ( sc.parallelize( zip(sample_X, sample_Y), 3)
             .map(lambda (x,y): LabeledPoint(y, [1, x]))
             .cache() )
model = LogisticRegressionWithSGD.train(points)

# Evaluating the model on training data
labelsAndPreds = points.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(points.count())
print "Accuracy on training set: %s" % (1 - trainErr)
end = time.time()
print "Elapsed time: ", (end-start)

Accuracy on training set: 0.896
Elapsed time:  15.8722097874


### Exercises

1. Play with the `fraction_positive` parameter: What happens to the accuracy measure as `fraction_positive` gets below 0.30 or above 0.70? (You should be somewhat disappointed with the results!)  What do you think is happening, and can you improve on it?
1. Play with the "by hand" version (after setting N to, say, 10**4 or so): Figure out what it's actually doing and how to use it to get results.  How much slower than the MLlib version does it seem to be?


# DAY 2

## Spark SQL and DataFrames

http://spark.apache.org/sql/

Let us go back to the load4_json.py example from yesterday. Here, we load a small JSON file, register a temporary table in memory, and run a SQL query on it in a distributed way.

In [11]:
from pyspark import SparkContext
from pyspark.sql import SQLContext

import sys
import os

def main_sqlcontext(args):
    #Note 2 Contexts: SparkContext and SQL context
    #sc = SparkContext(appName="LoadJson")
    sqlContext = SQLContext(sc)

    input = sqlContext.read.json("../loading_data/json/")
    input.registerTempTable("movies")
    answer = sqlContext.sql("SELECT * FROM movies WHERE title = 'Cloud Atlas'")
    answer.show()

if __name__ == "__main__":
    main_sqlcontext(sys.argv)

+---------+--------------------+-----------+
|     name|               roles|      title|
+---------+--------------------+-----------+
|Tom Hanks|Old Salty Dog / M...|Cloud Atlas|
+---------+--------------------+-----------+



Same solution, but using the Pandas dataframe-like syntax:

In [12]:
def main_sqlcontext(args):
    #Note 2 Contexts: SparkContext and SQL context
    #sc = SparkContext(appName="LoadJson")
    sqlContext = SQLContext(sc)

    input = sqlContext.read.json("../loading_data/json/")
    answer = input.where(input.title =="Cloud Atlas")
    answer.show()

if __name__ == "__main__":
    main_sqlcontext(sys.argv)

+---------+--------------------+-----------+
|     name|               roles|      title|
+---------+--------------------+-----------+
|Tom Hanks|Old Salty Dog / M...|Cloud Atlas|
+---------+--------------------+-----------+



## Project1: web mining and text processing 

The project is based on a past Kaggle competition.

The dataset consists of over 300,000 raw HTML files containing text, links, and downloadable images. 

Given the HTML of websites served to users of StumbleUpon, the challenge was to identify the paid content disguised as just another internet gem.

If media companies could better identify poorly designed native ads, they could keep them off your feed and out of your user experience. 

https://www.kaggle.com/c/dato-native


### Analysis workflow

1) Perform web-scraping of data from raw HTML to JSON


2) Extract features for classification (we will focus on text features today)


3) Train a classifier and estimate cross-validation error


This is where the domain knowledge you have acquired during the pre-exercises comes in handy! **Quick look into scraping and NLP notebooks**


#### Sub-task 1

1) Run local web scraper:

```bash
python scrape_html.py
```

2) Modify sparky_scrape1_html_exercise.py to use Spark libraries.
Then run it with Slurm or directly on the headnode like:

```bash
/usr/licensed/spark/spark-1.5.2-bin-hadoop2.6/bin/pyspark sparky_scrape1_html_exercise.py
```

Compare the two files.

At least in the beginning, a lot of the Spark code you write will be ported from existing Python solutions that you have been using for a while.

Inspect the scrape_html.py file. It scrapes text, links, and images from the HTML pages using BeautifulSoup. You are asked to re-write the code in Spark and run it on a cluster.

Surprisingly (or not), the number of changes you need to make will be minimal.

We are going to run this application on a single core (the amount of data will be small) to make sure that the lines in the JSON file come out in the same order. Keeping this in mind, we can compare the two files by simply running diff:

```bash
diff chunk.json /scratch/network/alexeys/BigDataCourse/web_dataset_preprocessed2/part-00000
```

#### Sub-task 2

Having performed the scraping, we are all set to go to steps 2) and 3): feature engineering and classification.
First, we need to label our data. Labels are provided in a separate file, so we need to perform a relational JOIN on keys (the HTML file ID in this case). I keep positive and negative samples in two different RDDs for the sake of convenience.

**The type of learner**: We're going to choose to look at this as a _supervised classification_ problem.  There are also unsupervised approaches, but you have to make choices sometimes.  This means we need some "marked up" data:

**The training dataset**: The result of web-scraping step, stored as a JSON file

We are going to restrict ourselves to text features and the bag-of-words approach with TF-IDF weighting applied.
To prepare the text features, we are going to step through the usual stages:

1) Tokenization, n-grams

2) Stemming, stopword removal

3) TF calculation and text vectorization

4) IDF, TF-IDF calculation
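
The first two stages can be sketched in a few lines of plain Python. The function names and the tiny stopword list below are assumptions made for illustration; stemming is left out because it needs an external library:

```python
import re

# A tiny, made-up stopword list; real projects use a much longer one.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}


def tokenize(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = remove_stopwords(tokenize("The Adventures of Sherlock Holmes"))
print(tokens)             # ['adventures', 'sherlock', 'holmes']
print(ngrams(tokens, 2))  # ['adventures sherlock', 'sherlock holmes']
```

Because each function takes one document and returns a list, functions like these can be passed to an RDD's `map` one document at a time.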



#### TF-IDF: term frequency–inverse document frequency

With single word vocabularies, we can probably do an okay job of coming up with a reasonable (if short) list of words that distinguish between the two documents.  With n-grams, even for $n=2$, it is better to let a computer help us.  

Just using frequencies, as above, is clearly not great.  Both apples the fruit and Apple the company are enjoyed around the world (one of the 2-grams that came up above!).  We would like to find words that are common in one document, but not common in all of them.  This is the goal of the __tf-idf weighting__.  A precise definition is:


  1. If $d$ denotes a document and $t$ denotes a term, then the _raw term frequency_ $\mathrm{tf}^{raw}(t,d)$ is
  $$ \mathrm{tf}^{raw}(t,d) = \text{the number of times the term $t$ occurs in the document $d$} $$
  The vector of all term frequencies can optionally be _normalized_ either by dividing by the maximum of any single word's occurrence count ($L^\infty$) or by the Euclidean length of the vector of word occurrence counts ($L^2$).  Scikit-learn by default does the second one:
  $$ \mathrm{tf}(t,d) = \mathrm{tf}^{L^2}(t,d) = \frac{\mathrm{tf}^{raw}(t,d)}{\sqrt{\sum_t \mathrm{tf}^{raw}(t,d)^2}} $$
  2. If $D$ is the set of possible documents, then the _inverse document frequency_ is
  $$ \mathrm{idf}^{naive}(t,D) = \log \frac{\# D}{\# \{d \in D : t \in d\}} \\
  = \log \frac{\text{count of all documents}}{\text{count of those documents containing the term $t$}} $$
  with a common variant being
  $$ \mathrm{idf}(t, D) = \log \frac{\# D}{1 + \# \{d \in D : t \in d\}} \\
   = \log \frac{\text{count of all documents}}{1 + \text{count of those documents containing the term $t$}} $$
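  (Without this tweak we would omit the $1+$ in the denominator and have to worry about dividing by zero if $t$ is not found in any documents.  Scikit-learn's default is a closely related smoothed form that adds one to both the numerator and denominator counts and then adds one to the logarithm.)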
  3. Finally, the weight that we assign to the term $t$ appearing in document $d$ and depending on the corpus of all documents $D$ is
  $$ \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \mathrm{idf}(t,D) $$
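
The three definitions above translate almost line by line into plain Python. The sketch below uses a made-up corpus of three tokenized documents. The function names are chosen for this illustration, and it implements the formulas exactly as written here, not the scikit-learn or MLlib internals:

```python
import math
from collections import Counter

# Made-up corpus of three already-tokenized documents
docs = [["apple", "pie"], ["apple", "computer"], ["banana", "pie", "pie"]]


def tf_l2(term, doc):
    """Raw count of term in doc, divided by the Euclidean length of all counts."""
    counts = Counter(doc)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return counts[term] / norm


def idf_naive(term, corpus):
    """log(#D / #{d : t in d}); fails if the term occurs in no document."""
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / float(containing))


def idf(term, corpus):
    """The variant with 1 + (document count) in the denominator."""
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / (1.0 + containing))


def tfidf(term, doc, corpus):
    return tf_l2(term, doc) * idf(term, corpus)


print(tf_l2("pie", docs[2]))           # 2 / sqrt(5)
print(idf_naive("banana", docs))       # log(3 / 1)
print(tfidf("banana", docs[2], docs))  # (1 / sqrt(5)) * log(3 / 2)
```

Note that with the $1+$ variant, a term that occurs in two of the three documents (such as "apple" here) gets an idf of $\log(3/3) = 0$.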
  
#### Labeled Point format  
  
After that, we only need to convert our features to the LabeledPoint format. LabeledPoint is a class that represents the features and label of a data point; all MLlib classifiers expect data in that format.


## Different ML classifiers

Finally, we will play with various ML classifiers available on the market.


### Decision Trees

A decision tree is a binary tree.  At each of the internal nodes, it chooses a feature $i$ and a threshold $t$.  Each leaf has a value.  Evaluation of the model is just traversal of the tree from the root.  At each node, for a sample $j$, we go down the left branch if $X_{ji} \le t$ and the right branch otherwise.  The value of the model $f(X_{j\cdot})$ is the value at the terminating leaf of this traversal.  Below, we show a picture of this on a small decision tree trained on the iris data set.  Notice that each internal node has a decision criterion and each leaf has the breakdown of label classes left at this leaf of the tree.  For a geometric picture of a decision tree, take a look at this [blog post](https://shapeofdata.wordpress.com/2013/07/02/decision-trees/).
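
The traversal rule described above can be sketched as follows. The dictionary layout of the nodes and the tree itself are assumptions made for this illustration; the tree is written by hand, not trained on any data:

```python
# A node is either a leaf {"value": v} or an internal node
# {"feature": i, "threshold": t, "left": node, "right": node}.
# The tree below is hand-written for illustration, not trained on real data.
tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"value": "A"},
    "right": {
        "feature": 1, "threshold": 1.0,
        "left": {"value": "B"},
        "right": {"value": "C"},
    },
}


def predict(node, x):
    """Walk from the root to a leaf: left if x[i] <= t, right otherwise."""
    while "value" not in node:
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["value"]


print(predict(tree, [1.0, 5.0]))  # A
print(predict(tree, [3.0, 0.5]))  # B
print(predict(tree, [3.0, 2.0]))  # C
```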


### Random Forests

A random forest is just an ensemble of decision trees.  The predicted value is just the average of the trees (for both regression and classification problems - for classification problems, it is the probabilities that are averaged).  You can adjust `n_estimators` to change the number of trees in the forest.  If each tree is trained on the same subset of data, why aren't they identical?  Two reasons:
1. **Subsampling**: each tree is actually trained on a randomly selected (with replacement) subset (i.e. bootstrap)
1. **Maximum Features**: the optimal split comes from a randomly selected subset of the features.  In scikit-learn, this feature is controlled by `max_features`.
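
The two sources of randomness, and the averaging step, can be sketched in plain Python. All function names below are hypothetical, and the "trees" are stand-in functions rather than fitted models:

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, max_features, rng):
    """Pick max_features distinct feature indices to consider for a split."""
    return sorted(rng.sample(range(n_features), max_features))


def forest_predict(trees, x):
    """Average the predictions of the individual trees."""
    predictions = [tree(x) for tree in trees]
    return sum(predictions) / float(len(predictions))


rng = random.Random(0)  # fixed seed so the run is repeatable
rows = list(range(10))
print(bootstrap_sample(rows, rng))       # some rows repeat, some are missing
print(random_feature_subset(4, 2, rng))  # two of the indices 0..3

# Three stand-in "trees": plain functions returning a class-1 probability
trees = [lambda x: 0.9, lambda x: 0.6, lambda x: 0.3]
print(forest_predict(trees, None))       # about 0.6
```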

### Random Forest Training Algorithm and Tuning Parameters

A Random Forest is pretty straightforward to train once you know how a Decision Tree works.  In fact, their construction can even be parallelized.  

Below, various parameters that affect decision tree and random forest training are discussed. 

The first two parameters we mention are the most important, and tuning them can often improve performance:

**numTrees**: Number of trees in the forest.

Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy.
Training time increases roughly linearly in the number of trees.


**maxDepth**: Maximum depth of each tree in the forest.
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).

The next two parameters generally do not require tuning. However, they can be tuned to speed up training.

**subsamplingRate**: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.

**featureSubsetStrategy**: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.

### Linear SVM

The canonical Support Vector Machine is the linear one.  Assume we have two groups labeled by $y = \pm 1$.  Then we are trying to find the coefficients $\beta$ and $\beta_0$ such that the hyperplane $X \beta + \beta_0 = 0$ maximally separates the points in our two classes:

![SVM Diagram from Hastie et al's The Elements of Statistical Learning](/files/images/svm3.png)

If the two classes can be separated by a linear hyperplane (picture on the left), we want to maximize the **margin** $M$ of the **boundary region**.  A little bit of math can show us that finding the largest separation is actually solved by the minimization problem

$$
\min_{\beta, \beta_0} \|\beta\| \\
\mbox{subject to } y_j (X_{j\cdot} \cdot \beta + \beta_0) \ge 1 \quad \mbox{for } j = 1,\ldots,N
$$

The picture and the equation are equivalent: in the picture we are setting the margin to be $M$ and finding the largest margin possible.  In the equation, we are setting the margin to be $1$ and finding the smallest $\beta$ that will make that true.  So $\beta$ and $M$ are related through $\| \beta \| = \frac{1}{M}$.  If the two classes cannot be separated (picture on the right), we will have to add forgiveness terms $\xi_j$,

$$
\min_{\beta, \beta_0} \|\beta\| \\
\mbox{subject to } \left\{ \begin{array} {cl} 
 y_j (X_{j\cdot} \cdot \beta + \beta_0) \ge (1-\xi_j) & \mbox{for } j = 1,\ldots,N \\
 \xi_j \ge 0 & \mbox{for } j = 1,\ldots,N \\
 \sum_j \xi_j \le C
\end{array}\right.
$$

for some constant $C$.  The constant $C$ is an important tradeoff.  It corresponds to the total "forgiveness budget" (see the last constraint).  The larger $C$, the more forgiveness we have and the wider the margin $M$ can be.  We can rewrite the constrained optimization problem as the primal Lagrangian function with Lagrange multipliers $\alpha_j \ge 0$, $\mu_j \ge 0$, and $\gamma \ge 0$,  for each of our three constraints:

$$ L_P(\gamma) = \min_{\beta, \beta_0, \xi} \max_{\alpha, \mu} \frac{1}{2} \| \beta \|^2 - \sum_j \alpha_j \left[y_j (X_{j \cdot} \cdot \beta + \beta_0) - (1-\xi_j)\right] - \sum_j \mu_j \xi_j  + \gamma \sum_j \xi_j$$

There is a one-to-one correspondence between $\gamma$ and $C$.  By taking first-order conditions, the dual Lagrangian problem can be formulated as

$$
L_D(\gamma) = \max_{\alpha} \sum_j \alpha_j - \frac{1}{2} \sum_{j, j'} \alpha_j \alpha_{j'} y_j y_{j'} X_{j \cdot} \cdot X_{j' \cdot} \,. \\
\mbox{subject to } \left\{ \begin{array} {cl} 
0 = \sum_j \alpha_j y_j \\
0 \le \alpha_j \le \gamma & \mbox{for } j = 1,\ldots,N
\end{array}\right.
$$

This is now a reasonably straightforward quadratic programming problem.  It is solved via [Sequential Minimal Optimization](https://en.wikipedia.org/wiki/Sequential_minimal_optimization).  Once we have solved this problem for $\alpha$, we can easily work out the coefficients from

$$ \beta = \sum_j \alpha_j y_j X_{j \cdot} $$

**Key takeaways**:
1. Critically, only points inside the margin or on the wrong side of the margin ($j$ for which $\xi_j > 0$) affect the SVM (see the picture).  This is intuitively clear from the picture.  In the dual form, this is because $\alpha_j$ is the Lagrange multiplier corresponding to the constraint $y_j (X_{j\cdot} \cdot \beta + \beta_0) \ge (1-\xi_j)$, and complementary slackness tells us that $\alpha_j$ is non-zero only when the constraint is binding ($y_j (X_{j\cdot} \cdot \beta + \beta_0) = (1-\xi_j)$), i.e. we're in the boundary region.  This is the meaning of **Support Vector** in "SVM": only the vectors in the boundary region, the **Support Vectors**, contribute to the solution.
1. $C$ or $\gamma$ give a trade-off between the amount of forgiveness and the size of the margin or boundary region.  Hence, it controls how many points affect the SVM (based on the distance from the boundary).
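
As a small numerical check of these ideas in the separable case (all $\xi_j = 0$), the sketch below takes four made-up points and a hand-picked plane, computes $y_j (X_{j\cdot} \cdot \beta + \beta_0)$ for each point, and reads off which constraints are binding. Nothing here is fitted by an SVM solver; the data and the coefficients are assumptions chosen so the arithmetic is easy to follow:

```python
import math

# Made-up 2-D points with labels +1 / -1, and a hand-picked separating
# plane (beta, beta0); none of this comes from a fitted model.
X = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
y = [-1, -1, 1, 1]
beta, beta0 = [1.0, 0.0], -2.0


def functional_margin(x_j, y_j):
    """y_j * (x_j . beta + beta0); the constraint asks for this to be >= 1."""
    return y_j * (sum(b * v for b, v in zip(beta, x_j)) + beta0)


margins = [functional_margin(x_j, y_j) for x_j, y_j in zip(X, y)]
print(margins)  # [2.0, 1.0, 1.0, 2.0]

# Points whose constraint is binding (margin exactly 1) are the support vectors.
support = [j for j, m in enumerate(margins) if abs(m - 1.0) < 1e-9]
print(support)  # [1, 2]

# Geometric margin M = 1 / ||beta||
M = 1.0 / math.sqrt(sum(b * b for b in beta))
print(M)  # 1.0
```

Moving either of the two outer points further away leaves the plane unchanged, which is the sense in which only the support vectors matter.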

Below, we plot out a simple two-class linear SVM on some synthetic data


### Non-linear SVM

What if we don't believe that our data can be cleanly split by a linear hyperplane?  The common way to incorporate non-linear features is to have a non-linear function $h(X_{j\cdot})$ (possibly to a higher-dimensional feature space with dimension $p'$ where $p' \ge p$) and to train on that space.  One intuition is that there's a higher-dimensional space in which the data has a linear separation and $h$ gives a non-linear mapping into that space.

#### Kernel Trick

The **Kernel Trick** in SVM tells us that rather than directly computing the (potentially very large) vectors $h(X_{j \cdot})$, we can just modify the Kernel.  If we use the transformed data $h(X_{j \cdot})$, the dual Lagrangian would be

$$ \max_{\alpha} \sum_j \alpha_j - \frac{1}{2} \sum_j \sum_{j'} \alpha_j \alpha_{j'} y_j y_{j'} h(X_{j \cdot}) \cdot h(X_{j' \cdot}) $$

We can rewrite

$$h(X_{j \cdot}) \cdot h(X_{j' \cdot})  = K(X_{j \cdot}, X_{j' \cdot})$$ 

for some non-linear Kernel $K$.  Our problem then becomes,

$$ \max_{\alpha} \sum_j \alpha_j - \frac{1}{2} \sum_j \sum_{j'} \alpha_j \alpha_{j'} y_j y_{j'} K(X_{j \cdot}, X_{j' \cdot}) $$

Every valid (positive semi-definite) Kernel function corresponds to some function $h$ (although $h$'s range may be infinite dimensional).  Some common Kernels include

<table>
<tr>
<th>Kernel</th>
<th>$K(x,x')$</th>
<th>Scikit `kernel` parameter</th>
</tr>

<tr>
<td>Linear Kernel</td>
<td>$x \cdot x'$</td>
<td>`kernel='linear'`</td>
</tr>

<tr>
<td>$d$-th Degree Polynomial</td>
<td>$(r + c x \cdot x')^d$</td>
<td>`kernel='poly'`</td>
</tr>

<tr>
<td>Radial Kernel</td>
<td>$ \exp(- c \|x - x' \|^2) $</td>
<td>`kernel='rbf'`</td>
</tr>

<tr>
<td>Neural Network Kernel</td>
<td>$\tanh(c x \cdot x' + r)$</td>
<td>`kernel='sigmoid'`</td>
</tr>
</table>

The benefit of using a Kernel is that we don't have to compute a very high-dimensional (possibly infinite-dimensional) $h$.  All that complexity is just wrapped into the kernel $K$.
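
As a small check of this idea, the sketch below compares a degree-2 polynomial kernel (with $r = c = 1$) against the dot product of an explicit feature map $h$ for 2-dimensional inputs. The function names `poly_kernel` and `h` and the two sample vectors are assumptions made for this illustration.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z, r=1.0, c=1.0, d=2):
    """Polynomial kernel (r + c * x.z)^d, evaluated in the original space."""
    return (r + c * dot(x, z)) ** d

def h(x):
    """Explicit feature map for the degree-2 kernel with r = c = 1 in 2-D."""
    s = math.sqrt(2.0)
    return (1.0, s * x[0], s * x[1], x[0] ** 2, x[1] ** 2, s * x[0] * x[1])

x = (1.0, 2.0)
z = (3.0, -1.0)

kernel_value = poly_kernel(x, z)    # works on 2 coordinates
explicit_value = dot(h(x), h(z))    # works on 6 coordinates

print(kernel_value)
print(explicit_value)
```

Both values equal 4 up to floating-point rounding, but the kernel never builds the 6-dimensional vectors. For higher degrees or the radial kernel the explicit map grows quickly (or becomes infinite dimensional), while the cost of evaluating $K$ stays the same.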

## Running bypassing the scheduler

This approach is not recommended if you are using Princeton computing resources. However, for educational purposes, we will try running a script while bypassing the Slurm scheduler:


```bash
/usr/licensed/spark/spark-1.5.2-bin-hadoop2.6/bin/pyspark <script_name.py>
```


# Building Spark applications with Scala API

Apache Spark is written in Scala. Scala (along with Python and Java) is one of the languages supported by Spark, and new functionality is typically added to the Scala API first in new Spark releases.


## Preparing our first Scala Spark application

Let us start with a PySpark application we prepared in one of the previous steps. Here it is:

In [3]:
from pyspark import SparkContext
import sys
import time

def main1(args):
    start = time.time()
    #sc = SparkContext(appName="LoadUnstructured")

    input_rdd = sc.textFile("../loading_data/unstructured/",10)
    counts = input_rdd.flatMap(lambda line: line.split()) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

    print "\nTaking the 10 most frequent words in the text and corresponding frequencies:"
    print counts.takeOrdered(10, key=lambda x: -x[1])
    end = time.time()
    print "Elapsed time: ", (end-start)

if __name__ == "__main__":
    main1(sys.argv)


Taking the 10 most frequent words in the text and corresponding frequencies:
[(u'the', 22635), (u'of', 11167), (u'and', 11086), (u'to', 10707), (u'a', 10433), (u'I', 10183), (u'in', 7006), (u'that', 6911), (u'was', 6779), (u'his', 4955)]
Elapsed time:  1.07145404816


In [None]:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    // Start the timer before any Spark work is done
    val t0 = System.nanoTime()

    val conf = new SparkConf().setAppName("WordCount")
    // Create the context that the rest of the program uses
    val spark = new SparkContext(conf)

    val textFile = spark.textFile("../loading_data/unstructured/", 10)
    val counts = textFile.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
    println("\nTaking the 10 most frequent words in the text and corresponding frequencies:")
    // Order by descending count; mkString prints the pairs instead of the array reference
    println(counts.takeOrdered(10)(Ordering[Int].reverse.on(x => x._2)).mkString(", "))

    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0) / 1e9)
    spark.stop()
  }
}

## Submitting Scala Spark application Q/A

Q: So you've written some Spark code in Scala. How do you submit it to Spark and run it?  
A: Use `sbt` or `maven` to package it into a Java jar, and submit it to Spark using `spark-submit`

Q: What's a Java jar?  
A: JAR (Java Archive) is a package file format typically used to aggregate many Java class files and associated metadata and resources (text, images, etc.) into one file to distribute application software or libraries on the Java platform.

### Packaging with `sbt`

**What is SBT?**  
SBT is a modern build tool written in and for Scala, though it also works as a general-purpose build tool.  

**Why SBT?**
- Good dependency management
- Full Scala language support for creating tasks
- Launch REPL in project context

Create a root directory for your project and run the following within it:
```bash
mkdir -p src/{main,test}/{resources,scala}
mkdir lib project
```

These commands create the standard `sbt` directory structure, which borrows from the Java `maven` directory structure. They do not create a `build.sbt` file: you need to add one at the top of the directory yourself and fill it out with the appropriate versions and dependencies for your app.

Then we can take our Scala code and put it in the `src` folder (you should have it in the project's root folder, so just move it from there):

```bash
mv WordCount.scala src/main/scala/
```

**Project Layout (Directory structure)**   

`project` – project definition files  
`project/build/` *yourproject* `.scala` – the main project definition file  
`project/build.properties` – project, sbt and scala version definitions  
`src/main` – your app code goes here, in a subdirectory indicating the code’s language (e.g. src/main/scala, src/main/java)  
`src/main/resources` – static files you want added to your jar (e.g. logging config)  
`src/test` – like src/main, but for tests  
`lib_managed` – the jar files your project depends on. Populated by sbt update  
`target` – the destination for generated stuff (e.g. generated thrift code, class files, jars)  

#### `build.sbt`: Dependencies and versioning

Example `build.sbt` (located in the root directory of your project):

```scala
name := "WordCount"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
    // Spark dependency
    "org.apache.spark" % "spark-core_2.10" % "1.4.1" % "provided"
)
```


#### Assembly.sbt to build a fat Jar

Example `assembly.sbt`, located in the `project/` folder of your project:

```scala
resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
```

### Running (submitting a `jar` to Spark)

1. Run `sbt assembly` in your project's home directory. The output to console will tell you the name and location of the resulting jar (under `./target`) 

You should now see the Jar file generated:
```bash
[alexeys@bd scala_spark] ll target/scala-2.10/
total 6968
drwxr-xr-x. 2 alexeys cses    4096 Dec  9 10:43 classes
-rw-r--r--. 1 alexeys cses 7129172 Dec  9 10:43 WordCount-assembly-1.0.jar
```

2. In the Slurm batch script, use `spark-submit` as usual to submit the Spark app, but specify the `--class` option and the path to the jar relative to the current folder, for instance:

```bash
spark-submit --class "WordCount" --total-executor-cores <number_of_cores> target/scala-2.10/WordCount-assembly-1.0.jar
```

# Clustering



k-means is one of the most commonly used clustering algorithms. It groups the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called k-means||. The implementation in MLlib has the following parameters:

1) k is the number of desired clusters.

2) maxIterations is the maximum number of iterations to run.

3) initializationMode specifies either random initialization or initialization via k-means||.

4) runs is the number of times to run the k-means algorithm (k-means is not guaranteed to find a globally optimal solution, and when run multiple times on a given dataset, the algorithm returns the best clustering result).

5) initializationSteps determines the number of steps in the k-means|| algorithm.

6) epsilon determines the distance threshold within which we consider k-means to have converged.

7) initialModel is an optional set of cluster centers used for initialization. If this parameter is supplied, only one run is performed.
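
To show what some of these parameters control, here is a minimal single-machine sketch of the basic k-means iteration in plain Python. It is not the MLlib implementation: the function `kmeans`, its arguments (named after `maxIterations`, `epsilon` and `initialModel` above) and the toy points are assumptions made for this illustration, and the initialization step is replaced by explicitly supplied starting centers.

```python
import math

def kmeans(points, initial_centers, max_iterations=20, epsilon=1e-4):
    """Basic k-means on 2-D points, starting from the given centers."""
    centers = [tuple(c) for c in initial_centers]
    for iteration in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for c in centers]
        for p in points:
            distances = [math.hypot(p[0] - c[0], p[1] - c[1]) for c in centers]
            clusters[distances.index(min(distances))].append(p)

        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                n = float(len(members))
                new_centers.append((sum(m[0] for m in members) / n,
                                    sum(m[1] for m in members) / n))
            else:
                new_centers.append(center)  # leave an empty cluster's center in place

        # Convergence test: stop once no center moved farther than epsilon.
        shift = max(math.hypot(n[0] - c[0], n[1] - c[1])
                    for n, c in zip(new_centers, centers))
        centers = new_centers
        if shift < epsilon:
            break
    return centers

# Two well-separated groups of toy points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(points, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

With these starting centers the loop stops on the second pass, because the centers no longer move. A different starting point can lead to a different (possibly worse) result, which is why the initialization mode and the number of runs matter.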


## NYC taxi data

We are going to be working with the NYC taxi geographic data in the following format:

**vendor_id, pickup_datetime, dropoff_datetime, passenger_count, trip_distance, pickup_longitude, pickup_latitude, rate_code, store_and_fwd_flag, dropoff_longitude, dropoff_latitude, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount**


The goal is to determine the NYC taxi activity.
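
As a first step, each record has to be turned into a numeric point. The sketch below assumes that records are plain comma-separated lines in the column order listed above and that the pickup coordinates are the features of interest; both are assumptions of this example. The `sample` line is made up for illustration and is not taken from the dataset.

```python
# Column names in the order given above.
FIELDS = [name.strip() for name in (
    "vendor_id, pickup_datetime, dropoff_datetime, passenger_count, "
    "trip_distance, pickup_longitude, pickup_latitude, rate_code, "
    "store_and_fwd_flag, dropoff_longitude, dropoff_latitude, payment_type, "
    "fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount"
).split(",")]

def pickup_point(line):
    """Return (pickup_longitude, pickup_latitude) for one CSV record."""
    record = dict(zip(FIELDS, line.split(",")))
    return (float(record["pickup_longitude"]), float(record["pickup_latitude"]))

# Made-up record, used only to exercise the parser.
sample = ("CMT,2014-01-09 20:45:25,2014-01-09 20:52:31,1,0.7,-73.99477,40.736828,"
          "1,N,-73.982227,40.73179,CRD,6.5,0.5,0.5,1.4,0,8.9")

print(pickup_point(sample))
```

In a Spark job, a function like this could be passed to `map` on an RDD of text lines to obtain the points to cluster.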
