<div style="color:red;font-weight:bold;background:yellow;text-align:center;padding:10px;border:solid">
    <h1>RUN IN EMR CLUSTER ONLY</h1>
    If the URL of the current page does not begin with "ec2", then do <strong>NOT</strong> proceed!
</div>

# SparkML Practice
In this practice, you will use the tools you learned in the readings and labs to perform some machine learning.

## Connecting to PySpark

In [1]:
name = !hostname
if "dsa" in name[0]:
    raise RuntimeError("Only run this notebook in the EMR Cluster!")
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("pyspark-lab")
spark_context = SparkContext(conf=conf)

We will use SparkML to perform a classification task. 

To do this, we will use the Iris dataset, a standard dataset for introductory machine learning. The task is to distinguish among 3 species of Iris flowers.

The dataset contains 4 numeric features (sepal length, sepal width, petal length, and petal width). We will load it from the `sklearn` (scikit-learn) library.

In [2]:
from sklearn import datasets
import pandas as pd
from pyspark.sql import SQLContext

# To use Spark SQL we create a SQLContext from SparkContext
sqlContext = SQLContext(spark_context)

iris = datasets.load_iris()

# the data
iris_data = iris.data
# the labels
iris_labels = iris.target

# create pandas dataframe
pd_df = pd.DataFrame(iris_data)
pd_df["label"] = iris_labels

# create spark dataframe
df = sqlContext.createDataFrame(pd_df)

In [3]:
df.head()

Row(0=5.1, 1=3.5, 2=1.4, 3=0.2, label=0)

## 1
Create the VectorAssembler & Data Partitioning

Now use a `VectorAssembler` to combine the four feature columns of the Spark DataFrame into a single vector column, then partition the data into a training set with 80% of the data and a testing set with 20% of the data.
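
To make the assembling step concrete, here is a minimal plain-Python sketch of the idea. It is not the Spark API: the dict-based row and the `assemble` helper are stand-ins invented for illustration, and only the first row shown by `df.head()` is taken from the notebook.

```python
# Plain-Python illustration of feature assembling (NOT the Spark API).
# The dict below is a stand-in for the first DataFrame row shown by df.head().
rows = [
    {"0": 5.1, "1": 3.5, "2": 1.4, "3": 0.2, "label": 0},
]
input_cols = ["0", "1", "2", "3"]


def assemble(row, cols):
    """Return a copy of the row with a new 'features' list built from cols."""
    out = dict(row)
    out["features"] = [row[c] for c in cols]
    return out


assembled = [assemble(r, input_cols) for r in rows]
print(assembled[0]["features"])
```

The original columns are kept and one extra column is added, which is roughly the shape of the DataFrame that the classifiers below consume through their default `features` and `label` columns.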

In [4]:
from pyspark.ml.feature import VectorAssembler

# create a vector assembler: it adds a new column that packs the listed
# feature columns into a single vector
features = VectorAssembler(
    inputCols = ["0","1","2","3"],
    outputCol = "features")

# split into Train and Test
train_data, test_data = features.transform(df).randomSplit([0.8,0.2])
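
`randomSplit` assigns rows to the splits by random sampling, so the 80/20 proportions are approximate and the split changes between runs unless a seed is passed, for example `randomSplit([0.8, 0.2], seed=42)`. The plain-Python sketch below illustrates that behaviour under the assumption that each row is assigned independently; the `random_split` helper is hypothetical and is not how Spark is implemented internally.

```python
import random


def random_split(items, weights, seed):
    """Assign each item independently to a bucket, with probability
    proportional to the corresponding weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    buckets = [[] for _ in weights]
    for item in items:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # the last bucket catches any floating-point slack
            if u < bound or i == len(bounds) - 1:
                buckets[i].append(item)
                break
    return buckets


# 150 row indices, matching the size of the Iris dataset
train, test = random_split(range(150), [0.8, 0.2], seed=42)
print(len(train), len(test))
```

Because the assignment is per row, the test set will usually hold close to, but not exactly, 30 of the 150 rows.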


## 2
Create the Model

Use the `MultilayerPerceptronClassifier`

Be sure to pass `layers=[4,5,4,3]` to the `MultilayerPerceptronClassifier`'s constructor.
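
The `layers` list describes the network shape: the first entry matches the 4 input features, the last matches the 3 Iris classes, and the entries in between are hidden layers of 5 and 4 units. As a rough sketch, assuming fully connected layers with one bias per unit, the number of trainable parameters can be counted like this (the `count_parameters` helper is illustrative, not part of `pyspark`):

```python
layers = [4, 5, 4, 3]


def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    Each pair of consecutive layers contributes (n_in + 1) * n_out values:
    n_in weights plus one bias for every unit in the next layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += (n_in + 1) * n_out
    return total


print(count_parameters(layers))  # 25 + 24 + 15 = 64
```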

In [5]:
from pyspark.ml.classification import MultilayerPerceptronClassifier
layers = [4,5,4,3]
trainer = MultilayerPerceptronClassifier(layers=layers)

## 3
Train the model:

In [6]:
model = trainer.fit(train_data)

## 4
Predict and Evaluate Model

Use the trained model to predict the classes of the test data, and then print the accuracy of the model.

In [7]:
# predict on the test data -- the model has not seen it
result = model.transform(test_data)

# import the evaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# keep only the columns the evaluator needs, then create the evaluator
predictionAndLabels = result.select("label", "prediction")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

# get the accuracy
accuracy = evaluator.evaluate(predictionAndLabels)


print("{:.2f}%".format(accuracy*100))

96.30%
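
The accuracy metric is the fraction of test rows whose `prediction` equals their `label`. A minimal plain-Python sketch of that computation follows; the label and prediction lists are made up for illustration and are not taken from the notebook's test set.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / float(len(labels))


# hypothetical labels and predictions: one mistake out of nine rows
labels = [0, 0, 1, 1, 2, 2, 2, 1, 0]
predictions = [0, 0, 1, 2, 2, 2, 2, 1, 0]
print("{:.2f}%".format(accuracy(labels, predictions) * 100))
```

With a test set of only a few dozen rows, a single misclassified flower moves the accuracy by several percentage points, so small differences between runs should not be over-interpreted.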


## 5
Now that we have seen the MLP, how does the Naive Bayes classifier perform?

https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#naive-bayes

In the cell below, train and test using the Naive Bayes classifier within `pyspark.ml`. Feel free to experiment with the NB parameters.
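
One parameter worth experimenting with is `smoothing`. The sketch below shows additive (Laplace) smoothing as it is commonly defined for multinomial Naive Bayes; the `smoothed_probabilities` helper and the feature totals are hypothetical, and Spark's internal computation may differ in detail.

```python
def smoothed_probabilities(feature_totals, smoothing=1.0):
    """Turn per-class feature totals into probabilities with additive smoothing.

    Each total gets `smoothing` added to it, so a feature that never occurred
    for a class still receives a small non-zero probability.
    """
    n_features = len(feature_totals)
    denominator = sum(feature_totals) + smoothing * n_features
    return [(t + smoothing) / denominator for t in feature_totals]


# hypothetical summed feature values for one class
totals = [0.0, 3.0, 7.0]
probs = smoothed_probabilities(totals, smoothing=1.0)
print(probs)
```

Larger smoothing values pull the probabilities toward a uniform distribution, while values near zero leave the raw proportions almost unchanged.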

In [11]:
from pyspark.ml.classification import NaiveBayes


# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model (a separate name keeps the MLP model from step 3 available)
nb_model = nb.fit(train_data)

# predictions
predictions = nb_model.transform(test_data)

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")

accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy*100)+"%")



Test set accuracy = 77.77777777777779%


## 6
Use the training data to fit a linear regression, then display the coefficients, intercept, and $r^2$ value of the fitted model.

https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#linear-regression
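
For reference, $r^2$ compares the model's squared error with that of always predicting the mean: $r^2 = 1 - SS_{res} / SS_{tot}$. A minimal plain-Python sketch, using made-up values rather than the notebook's data:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / float(len(actual))
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# hypothetical targets (class labels treated as numbers) and predictions
actual = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
predicted = [0.1, -0.1, 1.2, 0.9, 1.8, 2.1]
print(round(r_squared(actual, predicted), 4))
```

A value of 1 means the predictions match the targets exactly, and a value near 0 means the model does no better than predicting the mean.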

In [13]:
from pyspark.ml.regression import LinearRegression


lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(train_data)

# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))

# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary

print("r2: %f" % trainingSummary.r2)



Coefficients: [0.0,0.0,0.13758625148863807,0.38816100358457084]
Intercept: 0.02439025765463954
r2: 0.822424
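
To read this output, a prediction is the intercept plus the dot product of the coefficients with a row's features. The sketch below applies the printed values to the first row shown earlier by `df.head()`; it is plain Python, not a call into the fitted `lrModel`.

```python
# values copied from the output above
coefficients = [0.0, 0.0, 0.13758625148863807, 0.38816100358457084]
intercept = 0.02439025765463954

# first row of the dataset as shown by df.head(); its label is 0
first_row = [5.1, 3.5, 1.4, 0.2]

prediction = intercept + sum(c * x for c, x in zip(coefficients, first_row))
print(round(prediction, 4))
```

The two zero coefficients mean that columns `0` and `1` do not contribute to the prediction. This is plausibly an effect of the L1 part of the elastic net penalty (`elasticNetParam=0.8`), which tends to shrink some coefficients exactly to zero.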
