<div style="color:red;font-weight:bold;background:yellow;text-align:center;padding:10px;border:solid">
    <h1>RUN IN EMR CLUSTER ONLY</h1>
    If the URL of the current page does not begin with "ec2", then do <b>NOT</b> proceed!
</div>

# SparkML Lab

In this lab, we will do some machine learning using PySpark.

Spark ML is a powerful library with many features for numerical data processing.
As you may have noticed, traditional data storage systems (DBMS, data warehouses, etc.) are not well suited to numerical data processing.

The Spark ML library adds that capability to the Spark environment.

Read more in the [Spark ML guide](https://spark.apache.org/docs/latest/ml-guide.html).
Useful sections include:
 * https://spark.apache.org/docs/2.3.0/ml-statistics.html
 * https://spark.apache.org/docs/2.3.0/ml-pipeline.html
 * https://spark.apache.org/docs/2.3.0/ml-features.html
 * https://spark.apache.org/docs/2.3.0/ml-classification-regression.html
 

### Connecting to PySpark

In [1]:
name = !hostname
if "dsa" in name[0]:
    raise RuntimeError("Only run this notebook in the EMR Cluster!")
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("pyspark-lab")
sc = SparkContext(conf=conf)

### Loading the Dataset

In [2]:
from pyspark.sql import SQLContext

# To use Spark SQL we create a SQLContext from SparkContext
sqlContext = SQLContext(sc)

# Location of the dataset on HDFS
DATASET = '/datasets/titanic.csv'

# Load a table with a CSV format reader
dataset = sqlContext.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true').load(DATASET)

### Create the Machine Learning Model
We will use a decision tree classifier for this experiment.

In [3]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# create a vector assembler - this combines the columns that are
#   considered features into a single vector, stored in a new
#   column named "features"
features = VectorAssembler(
    inputCols=["pclass", "sex", "age", "embarked"],
    outputCol="features")

# split into Train and Test
train_data, test_data = features.transform(dataset).randomSplit([0.7, 0.3])

# create model
decision_tree = DecisionTreeClassifier(labelCol="survived", featuresCol="features")
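
To make the assemble-then-split step concrete, here is a minimal plain-Python sketch of the idea. It is not Spark code: the rows are made-up values, and `assemble` and `random_split` are hypothetical helpers written only for illustration. It assumes the feature columns are already numeric, since `VectorAssembler` accepts numeric, boolean, and vector columns but not raw strings.

```python
import random

# Hypothetical rows standing in for the Titanic table; the values are made up
# and already numeric, which is what an assembler needs.
rows = [
    {"pclass": 1, "sex": 0, "age": 29.0, "embarked": 2, "survived": 1},
    {"pclass": 3, "sex": 1, "age": 40.0, "embarked": 0, "survived": 0},
    {"pclass": 2, "sex": 0, "age": 18.0, "embarked": 1, "survived": 1},
]


def assemble(row, input_cols, output_col="features"):
    """Return a copy of the row with the input columns packed into one list."""
    out = dict(row)
    out[output_col] = [float(row[c]) for c in input_cols]
    return out


def random_split(items, train_fraction, seed):
    """Assign each item to train or test independently.

    Because every item is drawn separately, the split sizes are only
    approximately train_fraction and 1 - train_fraction.
    """
    rng = random.Random(seed)
    train, test = [], []
    for item in items:
        (train if rng.random() < train_fraction else test).append(item)
    return train, test


input_cols = ["pclass", "sex", "age", "embarked"]
assembled = [assemble(r, input_cols) for r in rows]
train, test = random_split(assembled, 0.7, seed=42)

print(assembled[0]["features"])  # the four feature values as one vector
print(len(train), len(test))     # sizes depend on the seed
```

The original columns, including the label, stay in each row; only a new `features` entry is added. In Spark itself, `randomSplit` also takes an optional `seed` argument, which makes the train/test split reproducible between runs.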

### Train the Model

In [4]:
model = decision_tree.fit(train_data)

### Predict and Evaluate Model

In [5]:
# predict on the test data -- the model has not seen it
predictions = model.transform(test_data)

# bring in the evaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# create the evaluator; with no metricName set it reports the F1 score,
#   so pass metricName="accuracy" to get accuracy instead
evaluator = MulticlassClassificationEvaluator(labelCol="survived", predictionCol="prediction")
# compute the score (F1 by default)
score = evaluator.evaluate(predictions)

print("{:.2f}%".format(score*100))


79.15%
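
A note on what this number measures: `MulticlassClassificationEvaluator` reports F1 unless `metricName` is set, and F1 can differ from plain accuracy. The plain-Python sketch below shows the difference on a small, made-up set of labels. It assumes the F1 in question is the per-class F1 averaged with weights equal to each class's share of the true labels; `accuracy` and `weighted_f1` are hypothetical helpers, not Spark functions.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's label share."""
    total = 0.0
    pairs = list(zip(labels, predictions))
    for cls in set(labels):
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * labels.count(cls) / len(labels)
    return total


# Made-up labels and predictions, for illustration only
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy(labels, predictions)
f1 = weighted_f1(labels, predictions)
print("accuracy: {:.2f}%".format(acc * 100))
print("weighted F1: {:.2f}%".format(f1 * 100))
```

On this toy data the two scores are close but not equal, so it matters which one is being reported when comparing models.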


You will see more of the SparkML library in courses such as Data Mining and Information Retrieval and Applied Machine Learning.

# Save your notebook, then `File > Close and Halt`

---